In our production environment we have two Presto instances, both of them accessible from our Redash servers. However, we would like one of the Presto instances not to be targeted by scheduled queries, only by ad-hoc/on-demand ones.

Aside from user-side preventions, is it possible to at least prevent scheduled query execution against a specific data source while still allowing human-triggered query execution against the same data source?

We have been altering the source code of Redash v3.0.0 and building our own version for quite some time, but I have not discovered a reliable way to do this (disable scheduling for queries that target a specific data source).

But then I heard there’s a pause feature for data sources. How do I trigger it, though? It seems to be API-only. I read that pausing a data source disables both query execution and query scheduling, but I think this can be worked around by commenting out some code in the query handler script.

This is an interesting use case. There are a few options in Redash that could’ve helped you but don’t really fit your use case:

  • As you figured, there is a way to pause a data source (see the API sketch after this list), but this will affect ad-hoc queries as well.
  • There is a way to limit the number of schedule options (like only once a day/week/whatever), but this is a global option – not per data source.
  • You can skip assigning workers to the scheduled queue, but if a query is sitting in the scheduled queue (its schedule came up), then ad-hoc executions won’t work…
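
For reference, pausing and resuming a data source is just a REST call against its pause endpoint. A minimal sketch, assuming a placeholder URL, admin API key and data source id:

```python
import requests

REDASH_URL = "https://redash.example.com"  # placeholder
API_KEY = "your-admin-api-key"             # placeholder: an admin user's API key
DATA_SOURCE_ID = 3                         # placeholder: the Presto data source id
HEADERS = {"Authorization": "Key {}".format(API_KEY)}

# Pause the data source; the optional reason is shown in the UI.
requests.post(
    "{}/api/data_sources/{}/pause".format(REDASH_URL, DATA_SOURCE_ID),
    json={"reason": "no scheduled queries on this Presto"},
    headers=HEADERS,
)

# Resume it later with a DELETE to the same endpoint.
requests.delete(
    "{}/api/data_sources/{}/pause".format(REDASH_URL, DATA_SOURCE_ID),
    headers=HEADERS,
)
```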

If you’re interested, have the dev resources, and are committed to making it work, we can collaborate on adding support for data sources with no scheduled queries.

Another option for you is to have a cron job that periodically checks if there are queries with a schedule and cancels their scheduling. It shouldn’t be that hard actually, assuming you have somewhere to run this cron.
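
A rough sketch of such a job against the REST API (the URL, key and data source id below are placeholders, and the paginated response shape may differ a bit between Redash versions):

```python
#!/usr/bin/env python
"""Cron job sketch: strip the schedule from queries that target a given data source."""
import requests

REDASH_URL = "https://redash.example.com"  # placeholder
API_KEY = "your-admin-api-key"             # placeholder
TARGET_DATA_SOURCE_ID = 3                  # placeholder: the "no schedules" Presto
HEADERS = {"Authorization": "Key {}".format(API_KEY)}


def all_queries():
    """Walk the paginated /api/queries listing."""
    page = 1
    while True:
        resp = requests.get(
            "{}/api/queries".format(REDASH_URL),
            params={"page": page, "page_size": 100},
            headers=HEADERS,
        )
        resp.raise_for_status()
        body = resp.json()
        for query in body["results"]:
            yield query
        if page * body["page_size"] >= body["count"]:
            break
        page += 1


for query in all_queries():
    if query["data_source_id"] == TARGET_DATA_SOURCE_ID and query.get("schedule"):
        # Posting schedule=null clears the schedule for that query.
        requests.post(
            "{}/api/queries/{}".format(REDASH_URL, query["id"]),
            json={"schedule": None},
            headers=HEADERS,
        )
```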

We could work around option #1 by modifying the part of the code that performs the pause check. I have identified it here (again, please note we’re still using Redash v3); the easy way for now will be to comment those lines out.

For option #2, yeah, we can’t do this because some scheduled queries still need Redash’s alerting feature, which requires checks at least every 5 minutes.

Option #3, we’re still exploring this one.
We’re thinking of modifying a particular column in the data_sources table inside Redash’s PostgreSQL metadata database. Could this work?
There’s a column called “scheduled_queue_name”, so I’m thinking of changing it for some particular data sources to a random name so that the Celery workers won’t recognize them. But I don’t know the implications of this. Wouldn’t the task queue get overloaded?
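
Concretely, something like this is what we have in mind (the connection string, data source id and made-up queue name are placeholders, and we haven’t verified the side effects):

```python
import psycopg2

# Placeholders: adjust the DSN, data source id and queue name for your setup.
conn = psycopg2.connect("dbname=redash user=redash host=postgres")
with conn, conn.cursor() as cur:
    # Point the data source's scheduled jobs at a queue no worker consumes.
    cur.execute(
        "UPDATE data_sources SET scheduled_queue_name = %s WHERE id = %s",
        ("scheduled_queries_noop", 3),
    )
conn.close()
```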

By the way, Arik, we also set Redash up to use a real Celery beat process instead of the default Celery worker instance acting as the beat.

I noticed the pause status is stored in Redis. Does that mean that when the whole cluster (including Redis) is restarted, previously paused data sources will no longer be paused?

> Another option for you is to have a cron job that periodically checks if there are queries with a schedule and cancels their scheduling. It shouldn’t be that hard actually, assuming you have somewhere to run this cron.

That should also work; we’re working on that solution too. But instead of stopping scheduled queries, we’re writing a script that changes a query’s target data source ID directly in Redash’s PostgreSQL metadata database (the “queries” table).

The two Presto instances I mentioned above are identical after all. The only difference is that one instance is for automated/scheduled queries (along with Redash we also have Airflow DAGs running queries against Presto), and the other Presto instance is dedicated to manual on-demand queries.

In case a Redash query gets scheduled against the wrong Presto instance, we would just change its target data source ID. I was wondering if the solution could be made server-side, because we have many internal Redash servers running on different domain names.

Why write to the database and not use the API? You can use the API to update the query’s data source, and it will be safer with regard to future changes to the schema.
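
For example, repointing a query to the right data source is a single POST (the URL, key and ids below are placeholders):

```python
import requests

REDASH_URL = "https://redash.example.com"  # placeholder
API_KEY = "your-admin-api-key"             # placeholder
HEADERS = {"Authorization": "Key {}".format(API_KEY)}

QUERY_ID = 42               # placeholder: the query targeting the wrong Presto
CORRECT_DATA_SOURCE_ID = 4  # placeholder: the scheduled-queries Presto

resp = requests.post(
    "{}/api/queries/{}".format(REDASH_URL, QUERY_ID),
    json={"data_source_id": CORRECT_DATA_SOURCE_ID},
    headers=HEADERS,
)
resp.raise_for_status()
```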

But now that I understand the use case better, I think there is a simpler solution here: you can create your own version of the Presto query runner that will pick the correct connection based on whether this is a scheduled query or not. Deploying this dedicated query runner using the standard Docker image we release should be quite easy, and then you will be able to keep up with our releases.
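
A rough, untested sketch of the idea; the class name, the extra scheduled_host field and the way a scheduled run is detected are all assumptions you’d need to verify against the query runner in your Redash version:

```python
# custom_presto.py -- sketch of a "routing" Presto query runner.
from redash.query_runner import register
from redash.query_runner.presto import Presto


class RoutedPresto(Presto):
    """Presto runner that sends scheduled executions to a different host."""

    @classmethod
    def configuration_schema(cls):
        # Reuse the stock Presto schema and add a field for the host that
        # should receive scheduled queries.
        schema = super(RoutedPresto, cls).configuration_schema()
        schema["properties"]["scheduled_host"] = {"type": "string"}
        return schema

    def run_query(self, query, user):
        # ASSUMPTION: in v3 the metadata comment Redash prepends to scheduled
        # executions contains "Username: Scheduled". Verify the marker your
        # version produces before relying on this check.
        if "Username: Scheduled" in query:
            config = self.configuration.to_dict()
            config["host"] = config.get("scheduled_host") or config["host"]
            # Delegate to a stock Presto runner configured for the other host.
            return Presto(config).run_query(query, user)

        return super(RoutedPresto, self).run_query(query, user)


register(RoutedPresto)
```

If I recall correctly, you can load a module like this without forking by mounting it into the image and listing it in the REDASH_ADDITIONAL_QUERY_RUNNERS setting.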

Redash v3 was released 18 months ago and we made significant improvements since, so I would put some effort in being able to upgrade :slight_smile:

This is interesting – why not have a single instance? I would love to chat more about your usage. If you’re interested, send me an email to arik at redash.io and we can schedule a call. Thanks.