Redash pings Snowflake database every 30min

rick · July 29, 2019, 12:30pm

Issue Summary

Snowflake as a cloud data warehouse runs and costs money only when queries are send.
And queries have a “cool-down” phase of at least 1 minute (default config in my case was 30mins).

But now Redash is sending the following query, every 30 minutes, keeping Snowflake busy and expensive. My scheduled queries only run once a day.

SELECT col.table_schema,
               col.table_name,
               col.column_name
        FROM mydatabasename.information_schema.columns col
        WHERE col.table_schema <> 'INFORMATION_SCHEMA'

I am not sure what this “ping” is doing, but I am confident you can disable it without losing features in future versions.
And if you have an easy work-around to disable this ping in the current version, I would be happy to try it.

Technical details:

Redash Version: 7.0.0+b18042 (4a978bad)
Browser/OS: Chrome
How did you install Redash: self-hosted

arikfr · July 29, 2019, 12:35pm

It refreshes the schema cache (list of tables/columns).

As for controlling it, you have several options:

Add a REDASH_SCHEMAS_REFRESH_SCHEDULE environment value to your configuration. Its value is in minutes and this is the frequency Redash will use to refresh the schemas. As you can guess the default is 30.
Add to Redis a list key named data_sources:schema:blacklist with your data source being a member in this list. This will disable any schema refreshes for this data source.

I wonder if there is a way to get updated schema from Snowflake that doesn’t wake up the cluster?

rick · July 29, 2019, 12:40pm

Thanks for the great reply!
You might want to take a look into the following docu:
https://docs.snowflake.net/manuals/sql-reference/ddl-table.html#table-management
Those 3 commands do not wake the cluster:

SHOW TABLES
SHOW COLUMNS
DESCRIBE TABLE

Here is a full list of Show commands:
https://docs.snowflake.net/manuals/sql-reference/sql/show.html

arikfr · July 29, 2019, 12:50pm

Thanks! From a quick look, it seems that SHOW COLUMNS can replace the current query we have with minimal changes.

Nathan · July 31, 2019, 1:13am

Possibly related - I’ve run into issues using Hive/AWS Glue when trying to query information_schema catalogs and the SHOW TABLES/SHOW COLUMNS methods seem to be more reliable for those sources as well.

arikfr · August 8, 2019, 7:20am

Can you elaborate on what kind of issues?

The reason we query information_schema is to avoid having to run multiple queries. But of course if it creates issues, it’s better to run multiple queries.

Nathan · August 8, 2019, 9:37pm

Sure can. To clarify, this is not an issue with Redash specifically. When a table has an invalid definition in a Hive catalog (or Glue catalog) it can prevent queries against information_schema views completing successfully unless the query excludes that particular table or schema (and sometimes it seems queries fail even if this filter is in place). It can be tricky to identify which table definition is causing these issues in a large catalog. Some tools that rely on these views for schema and table metadata can have problems if there’s no way to filter the schemas/tables being queried and the only alternative is to use the SHOW syntax e.g. SHOW TABLES, SHOW COLUMNS (or find and fix the bad table definition). For example I think this is how the AWS Athena console generates it’s schema browser data, rather than using information_schema - i.e. it runs SHOW TABLES to list the available tables, then SHOW COLUMNS if the table is clicked on.

That said, we are currently using Redash with Athena without any issues, but it’s probably something to be aware of.

rodrigo-borges · March 31, 2020, 2:35pm

I’m not sure if I should revive this thread or start a new one, but I’m having a similar problem.

In my case, the data source is a MySQL database, but it’s a production server and it’s huge. The analyses we run on it are light-weight and don’t burden it, but the automatic schema queries are painfully slow because of its size and are affecting performance.

I would like to implement the second solution, adding my production data sources to the blacklist, but I have no idea how to do so. We self-host Redash via docker compose, so Redis is one of its services.

Could you help me with the next steps to add/create the blacklist?

itay_a · April 6, 2020, 3:02pm

@arikfr
can you please update us on the needed steps?
its relevant to us as well, thanks

▼Categories

▼Tags

Redash pings Snowflake database every 30min

Issue Summary

Technical details: