Enforce to add partition key and limit clause

ytannai · November 26, 2020, 5:07am

AWS Athena increases the usage fee depending on the amount of data loaded.
To control this, I’m adding the following patch:
If a table with a partition key set forces you to specify it in the where clause when referring to the table. SQL that violates this will be canceled before being submitted to Athena.

Similarly, I apply a patch to force all SQL to have a LIMIT clause to keep Redash’s memory usage below a certain level.

For example, add a checkbox to enforce the Partition Key in places like the attached image.

kylashpriya · November 29, 2020, 3:04pm

Hello [ytannai],
Thanks t alot for your valuable info. I’m looking similar stuff for Redshift data source. We use Redshift as our primary data source and we expose data via redash to users. Quite often, people ignore limit clause and it blows up our memory. Could you please share some info if you have it!

Thanks in advance!

Thanks and Regards,
Kylasam N A

ytannai · November 30, 2020, 3:47am

Hi @kylashpriya

Our primary data source is AWS Athena and was faced same issue with you.
There were some users that cause memory usage issue for redash because of fetch many rows (over 100k).
Consequently, some patches like below was applied to self hosted redash.

def __check_limit(self, query):
    query = sqlparse.format(query, keyword_case='upper', identifier_case='lower')
    parsed = sqlparse.parse(query)
    select_flag = False
    limit_flag = False
    for t in parsed[0].tokens:
        if str(t) == 'SELECT':
            select_flag = True
        if str(t) == 'LIMIT':
            limit_flag = True
    if select_flag and limit_flag:
        return
    elif not select_flag:
        return
    elif select_flag and not limit_flag:
        raise Exception("required limit")

I posted same question to AWS Support. They saied that Athena and official libraries provide some feature to fix the problem:
For example, RowsToFetchPerBlock and UseResultsetStreaming in Athena JDBC Driver can control row counts.
Athena’s GetQUeryResults API can also control. (doc)
But redash (and related libraries) don’t use it.

I know the default row count is 1000 when LIMIT is absent in current redash, but some user writes a query with the large LIMIT count.
So I need the LIMIT count enforcing feature.

For Alternative, I try to write small proxy server to intercept LIMIT-less query just now.

Sorry for Athena only reply because we don’t use Redshift.

kylashpriya · March 12, 2021, 12:41am

@ytannai
Thanks a lot once again for your reply.

Could you please let me know in which redash module/code/script your custom logic was implemented? I shall try the same. Hope the way redash queries remains the same!

U was trying to add custom limit clause in /app/redash/query_runner/pg.py under the container [ 19b6d771c573 redash/redash:8.0.0.b32245 “/app/bin/docker-ent…” 8 months ago Up 30 minutes 0.0.0.0:5000->5000/tcp redash_server_1] with the below logic,
def run_query(self, query, user):
connection = self._get_connection()
_wait(connection, timeout=10)

    cursor = connection.cursor()

    try:
        if "LIMIT" not in query.upper():
            query_new = query + " LIMIT 10"
            print("New query is==>",query_new)
            cursor.execute(query_new)
            _wait(connection)

but this doesn’t work! I’m still seeing tonnes of records in the query window!
Any small help from you would be a life saver!

Thanks a lot in advance and wish you a fantastic Friday!

Thanks and Regards,
Kylash N A

griffinator76 · March 12, 2021, 3:25am

@kylashpriya Hi, you may want to read up on how the LIMIT clause works in Redshift. Using this clause limits the number of rows returned to the user, but not necessarily the amount of data scanned by the query.
You may want to look at trying to ensure all queries contain a WHERE clause as people running queries without that has been a major issue I’ve seen in previous projects.
In my experience it’s really hard to prevent poorly educated users from constructing badly performing queries or queries that return enormous result sets. Education of users is definitely a worthwhile effort as well!

kylashpriya · March 12, 2021, 7:33am

Hello [ytannai],
Once again, thanks a ton for your help. I was intended to do adding explicit limit even before querying at Redshift. So the query from redash UI needs to be tweaked before firing at Redshift.

Can’t agree more on educating users. But it seems not to be working fine. Since most of them ran query without thinking much, we have been trying educating for last one year which is super hard now. So we have to educate them via strict rules.
In the other hand, if you have some options for configuring some cache cleaning techniques or optimisation, please do share it with me. We shall try that as well.

Last but not least; will continue educating the users.

▼Categories

▼Tags

Enforce to add partition key and limit clause