Hello all :).

I’ve got a scheduled query that runs every minute and produces a single scalar value. I’ve created an alert based on that query with the following settings (see the sketch after the list for how the stored configuration can be inspected):

  • trigger when the value is < 65000
  • notifications are sent “just once, until back to normal”
  • slack channel set up as alert destination
  • custom description that includes the {{ALERT_STATUS}} and {{QUERY_RESULT_VALUE}} macros (for debugging purposes).
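
In case it’s useful for reproducing this, here’s a minimal sketch of how the stored alert configuration can be pulled straight from the REST API (the URL, API key, and alert id below are placeholders, and the exact response fields may differ by version):

```python
import json

import requests

REDASH_URL = "https://redash.example.com"  # placeholder: your instance URL
API_KEY = "<user api key>"                 # placeholder: a user API key
ALERT_ID = 123                             # placeholder: the alert's numeric id

resp = requests.get(
    f"{REDASH_URL}/api/alerts/{ALERT_ID}",
    headers={"Authorization": f"Key {API_KEY}"},
)
resp.raise_for_status()
alert = resp.json()

# "rearm" is the re-notify interval in seconds (empty for "just once"),
# "options" holds the threshold comparison and the custom template.
# These field names match what I see, but may differ by version.
print(json.dumps(
    {key: alert.get(key) for key in ("name", "state", "rearm", "options")},
    indent=2,
))
```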

I’m seeing two kinds of strange behavior:

  • Often (but, unfortunately for our debugging purposes, not always) the slack channel will receive a message every minute saying “{{ALERT_NAME}} went back to normal”, coupled with an “OK” status and a value that is clearly above the 65000 alert threshold. These messages arrive even when the alert was not recently in a “TRIGGERED” state. Put another way, the slack channel receives multiple “back to normal” messages for a single “triggered” message, if a “triggered” message appears at all.
  • This behavior persists if I tell the alert to send notifications “at most every 5 minutes”. The only difference is that the green notifications are sent every 5 minutes instead of every minute (the rate at which the underlying query refreshes).

Other troubleshooting steps I’ve tried:

  • I’ve debugged the underlying query to the point that I’m pretty confident it’s not returning values below the threshold for brief periods of time. The query computes a rolling average anyway, so that kind of behavior shouldn’t happen given what I know about the underlying timeseries it’s pulling from (see the sketch after this list for one way to check past results directly). Plus, wouldn’t I see a red “triggered” message in slack if the alert entered the “triggered” state?
  • I’ve tried creating a new query and alert with the same query text, refresh rate, and alert settings and the same problem exists.
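
For anyone else hitting the same thing, one way to double-check the “brief dips below the threshold” theory is to scan recent rows for the query in Redash’s own metadata database. This is only a sketch: the connection string, query id, and scalar column name are placeholders, and the query_results layout assumed here is what I see on my instances but may vary between versions:

```python
import json

import psycopg2

PG_DSN = "dbname=redash user=redash host=localhost"  # placeholder: metadata DB
QUERY_ID = 123          # placeholder: the scheduled query's id
VALUE_COLUMN = "value"  # placeholder: the column holding the scalar
THRESHOLD = 65000

conn = psycopg2.connect(PG_DSN)
with conn, conn.cursor() as cur:
    # Past results for the same query text share a query_hash and stick
    # around until the cleanup job removes them.
    cur.execute(
        """
        SELECT qr.retrieved_at, qr.data
        FROM query_results qr
        JOIN queries q ON q.query_hash = qr.query_hash
        WHERE q.id = %s
        ORDER BY qr.retrieved_at DESC
        LIMIT 60
        """,
        (QUERY_ID,),
    )
    for retrieved_at, data in cur.fetchall():
        payload = json.loads(data) if isinstance(data, str) else data
        for row in payload["rows"]:
            if row.get(VALUE_COLUMN, THRESHOLD) < THRESHOLD:
                print(f"below threshold at {retrieved_at}: {row}")
```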

Questions:

  • Any other troubleshooting steps I should try?
  • Any other ways I could configure this alert to stop this behavior?

Relevant platform information:

  • self-hosted redash
  • redash/preview:9.0.0-beta.b49483 image

For clarification, this setting “at most every 5 minutes” only affects the frequency of TRIGGERED messages. You will always receive a notification when the alert is first triggered, and you will always receive one notification when the value goes back to normal. You shouldn’t see multiple back-to-normal notifications, though, so this does look like a bug.
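
Roughly, the intended logic looks like this (a simplified sketch for illustration only, not the actual implementation; the names are invented):

```python
# Sketch of the intended behavior: notify on every state change, and
# re-notify a still-TRIGGERED alert only after the rearm interval elapses.
from datetime import datetime, timedelta


def should_notify(previous_state, new_state, rearm_seconds, last_notified_at, now=None):
    now = now or datetime.utcnow()
    if new_state != previous_state:
        # The first TRIGGERED message and the single "back to normal" message.
        return True
    if new_state == "triggered" and rearm_seconds:
        # "At most every 5 minutes" re-notification while still triggered.
        return now - last_notified_at >= timedelta(seconds=rearm_seconds)
    # Unchanged OK state: no notification at all.
    return False
```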

Does this reproduce using the V10 beta?

Thanks for following up!

For clarification, this setting “at most every 5 minutes” only affects the frequency of TRIGGERED messages.

Good to know. For reference, upon subsequent testing of the “just once, until back to normal” setting, we saw both TRIGGERED and OK messages delivered multiple times, each time the underlying query was evaluated.

Does this reproduce using the V10 beta?

Not sure, but I’m happy to try!


Very interesting. I wonder if the jobs aren’t being updated in redis, so the alert is picked up twice (or more).
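
If you want to poke at that, here’s a rough sketch for listing what the scheduler has sitting in Redis (the Redis URL is a placeholder, and the key names are the RQ / rq-scheduler defaults, so they may not match your deployment exactly):

```python
# Peek at the RQ-related keys in Redis to look for duplicated scheduled jobs.
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")  # placeholder: your Redis URL

# RQ queues and related bookkeeping keys all use the "rq:" prefix by default.
for key in sorted(r.scan_iter("rq:*")):
    print(key.decode())

# rq-scheduler keeps its scheduled jobs in a sorted set; a surprisingly
# large count here could hint at jobs being registered more than once.
print("scheduled jobs:", r.zcard("rq:scheduler:scheduled_jobs"))
```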

Very curious to hear the result of trying this with V10.

We upgraded to V10 and observed the same behavior (“triggered” and “ok” notifications deliver repeatedly every time the alert is evaluated instead of once when the state changes).

One thing I am wondering, though: we are actually running two redash instances. One is on V10 (as of today), and the other is on V8 (specifically redash/redash:8.0.0.b32245). They share the same backing postgres database. Do you think it’s possible that the two services are interfering with each other in some way?

Happy to try any other debugging strategies you think might be helpful.

Yes, absolutely: this would cause that kind of issue.

The postgres database contains all of Redash’s state. I’m actually surprised this works at all with the same backing database, as there are significant database schema differences between those versions :open_mouth:
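
One way to see that concretely (a sketch, assuming direct access to the metadata database; the connection string is a placeholder): check which Alembic migration revision the shared database is currently on. Only one of the two Redash versions can be in sync with it.

```python
# Print the current Alembic migration revision of the shared metadata DB.
import psycopg2

conn = psycopg2.connect("dbname=redash user=redash host=localhost")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("SELECT version_num FROM alembic_version")
    print("current migration revision:", cur.fetchone()[0])
```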

Try disabling the alert on one of the instances and see if the issue resolves.