Thoughts on adding support for queries with parameters in the Query Results data source


#1

Following @ariarijp's pull request on changing the way we find query IDs in Query Results queries to later support queries with parameters, I wanted to start a discussion on how this might work and to share some thoughts I had:

There are two use cases for querying from queries with parameters:

1.

The simple use case: you just want to use a query with parameters and you “know” what the values are:

  • We can detect that the query you’re querying from needs parameters, show the needed parameter input fields, and pass those values along when loading this query (this will require adding logic for detecting query IDs to the client-side code).
  • Allow for some syntax for passing parameters, like @ariarijp suggested (I feel that using JSON is not very user friendly, although it is very pragmatic).
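For the detection part, the server side already finds referenced query IDs with a simple regex (see extract_query_ids in the query results runner); the client could reuse the same pattern to know which queries’ parameter inputs to show. A rough sketch (the helper name referenced_query_ids is just for illustration):

```python
import re

# The pattern the Query Results runner uses server-side to find referenced
# queries; the client could reuse it to decide which parameter inputs to show.
QUERY_REF = re.compile(r'(?:join|from)\s+query_(\d+)', re.IGNORECASE)

def referenced_query_ids(query_text):
    return [int(q) for q in QUERY_REF.findall(query_text)]

print(referenced_query_ids('SELECT * FROM query_7 JOIN query_9 ON 1=1'))
# [7, 9]
```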

2.

There is another, more advanced use case: you want to pass the values returned by one query into another. Here’s an example:

Data source A has a users list, and data source B has user events. You want to return the list of users active in the past day along with their details, so you need to filter the users list in A based on IDs from B.

So query A on data source A might be:

SELECT name FROM users WHERE id IN ({{user_ids}});

And query B on data source B might be:

SELECT distinct user_id FROM events WHERE created_at > '...';

And you want the results from query B to be merged into query A.

Maybe this doesn’t have to be implemented as part of the query results feature, though? Maybe it can be some template syntax, like parameters:

SELECT name FROM users WHERE id IN ({% query_results 456 %});

query_results would be a function that loads query results, and we could have others to insert snippets and maybe other things?
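To make the idea concrete, here is a rough sketch of how such a tag could be expanded (the tag name query_results and the expansion step are hypothetical, not existing Redash behavior; fetch_results stands in for actually running query 456):

```python
import re

# Hypothetical expansion of a {% query_results <id> %} tag: run the
# referenced query and inline its results as a comma-separated list.
def fetch_results(query_id):
    # Stand-in for executing the query; pretend these are query 456's user_ids.
    return [17, 42, 99]

def expand_tags(query_text):
    def repl(m):
        rows = fetch_results(int(m.group(1)))
        return ', '.join(str(r) for r in rows)
    return re.sub(r'\{%\s*query_results\s+(\d+)\s*%\}', repl, query_text)

print(expand_tags('SELECT name FROM users WHERE id IN ({% query_results 456 %})'))
# SELECT name FROM users WHERE id IN (17, 42, 99)
```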


I hope the explanation makes sense. Of course we can implement use case 1 without implementing use case 2.

Any feedback or thoughts will be welcomed :slight_smile:


#2

Hi, Arik

Thank you for sharing your thoughts.
These use cases make sense to me.

For now, I would like to talk about your “use case 1”.

First, here is a PoC of my idea.

For example,

Query 1

SELECT * FROM film WHERE rating = '{{ rating }}'

Query 2

SELECT * FROM query_1('{"rating": "PG-13"}')
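Under the hood, the JSON argument would be parsed and rendered into the child query’s template, roughly like this (Redash uses pystache for the `{{ }}` rendering; a minimal stand-in substitution is used here):

```python
import json
import re

# Sketch of applying the JSON argument to the child query's template.
template = "SELECT * FROM film WHERE rating = '{{ rating }}'"
params = json.loads('{"rating": "PG-13"}')

# Minimal {{ name }} substitution standing in for pystache.render().
rendered = re.sub(r'\{\{\s*(\w+)\s*\}\}', lambda m: str(params[m.group(1)]), template)
print(rendered)
# SELECT * FROM film WHERE rating = 'PG-13'
```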

I prefer to use JSON for the parameters.
The reason is that JSON is a very common format for developers and easy to understand.

I agree that JSON is not friendly for some users.
However, I think SQL itself is very pragmatic yet also not friendly for non-technical users such as marketers, biz-dev people, and so on.

I think Query Results is an advanced feature of Redash.
I guess this feature is wanted by technical people, who don’t require any code completion.
(Of course, I want to use this feature with code completion :grinning:)

Please feel free to ask me about it and any feedback is welcome.


#3

Hi there, and thank you for such a great tool!
I think that I’m trying to do what is described here, in the simple use case (1). I have the “Active Users” query from GA with global parameters, like this one:

{
  "ids": "ga:xxxxxxxxx",
  "start_date": "{{Start Date}}",
  "end_date": "{{End Date}}",
  "metrics": "ga:users"
}

And I also have the “Total Users” number from our PostgreSQL database. Then I want to have a dashboard widget that displays the Total / Active user numbers, something like this:

select total.users || ' / ' || active.ga_users as 'Total / Active'
from query_9 as active
join query_7 as total on active.ga_dimension1 = total.role
order by Role;

This was working OK when the parameters were hardcoded, but it fails now. Is it correct that, for now, it’s not possible to get results like this with global parameters? If so, what could a workaround be?


#4

Hi @ariarijp, I tried your pull request with the current master (5.0.0-beta) but with no success, unfortunately – my existing queries broke, e.g. select * from query_7 as users where users.role = 'staff' complains about a missing users.role column. It works OK on master, however.


#5

@arikfr, I tried @ariarijp’s solution with the current master but got Error running query: 'query_9' is not a function. Has something changed since March that could break the patch? I did git log -- redash/query_runner but can’t see anything suspicious. What else could I do to troubleshoot the issue?


#6

Hi @leshik,

Thank you for your feedback!
I will check my patch on 5.0.0 branch.


#7

@leshik I’ve checked that PoC on master branch commit 47068a6d911716cc2802542cdc5613e14602d561.

It looks like it works for me.

Here are my queries:

Query 1 (It uses Redash metadata as a data source)

SELECT *
FROM EVENTS
WHERE action = '{{ action }}'
  AND object_type = '{{ object_type }}';

Query 2

SELECT *
FROM query_1('{"action":"view", "object_type":"query"}');

#8

@ariarijp do you have a patch against the current master? maybe I messed things up when trying to apply your patch.


#9

@leshik Sure, here you are.

diff --git a/redash/query_runner/query_results.py b/redash/query_runner/query_results.py
index fc0e09e4..db8fcf08 100644
--- a/redash/query_runner/query_results.py
+++ b/redash/query_runner/query_results.py
@@ -1,9 +1,12 @@
+import hashlib
 import json
 import logging
 import numbers
 import re
 import sqlite3
 
+import pystache
+import sqlparse
 from dateutil import parser
 
 from redash import models
@@ -42,9 +45,23 @@ def _guess_type(value):
     return TYPE_STRING
 
 
-def extract_query_ids(query):
-    queries = re.findall(r'(?:join|from)\s+query_(\d+)', query, re.IGNORECASE)
-    return [int(q) for q in queries]
+def extract_child_queries(query):
+    pattern = re.compile(r"^query_(\d+)(?:\('(\{.+\})'\))?", re.IGNORECASE)
+    queries = []
+
+    for token in sqlparse.parse(query)[0].tokens:
+        m = pattern.match(token.value)
+        if not m:
+            continue
+
+        queries.append({
+            'query_id': int(m.group(1)),
+            'params': {} if m.group(2) is None else json.loads(m.group(2)),
+            'table': 'tmp_' + hashlib.md5(token.value).hexdigest(),
+            'token': token.tokens[0].value if isinstance(token.tokens[0], sqlparse.sql.Function) else token.value,
+        })
+
+    return queries
 
 
 def _load_query(user, query_id):
@@ -54,26 +71,34 @@ def _load_query(user, query_id):
         raise PermissionError("Query id {} not found.".format(query.id))
 
     if not has_access(query.data_source.groups, user, not_view_only):
-        raise PermissionError(u"You are not allowed to execute queries on {} data source (used for query id {}).".format(
-            query.data_source.name, query.id))
+        raise PermissionError(
+            u"You are not allowed to execute queries on {} data source (used for query id {}).".format(
+                query.data_source.name, query.id))
 
     return query
 
 
-def create_tables_from_query_ids(user, connection, query_ids):
-    for query_id in set(query_ids):
-        query = _load_query(user, query_id)
+def create_tables_from_child_queries(user, connection, query, child_queries):
+    for child_query in child_queries:
+        _query = _load_query(user, child_query['query_id'])
 
-        results, error = query.data_source.query_runner.run_query(
-            query.query_text, user)
+        params = child_query['params']
+        if not params:
+            params = _get_default_params(_query)
+
+        query_with_params = pystache.render(_query.query_text, params)
+        results, error = _query.data_source.query_runner.run_query(query_with_params, user)
 
         if error:
             raise Exception(
-                "Failed loading results for query id {}.".format(query.id))
+                "Failed loading results for query id {}.".format(_query.id))
 
         results = json.loads(results)
-        table_name = 'query_{query_id}'.format(query_id=query_id)
+        table_name = child_query['table']
         create_table(connection, table_name, results)
+        query = query.replace(child_query['token'], table_name)
+
+    return query
 
 
 def fix_column_name(name):
@@ -101,6 +126,16 @@ def create_table(connection, table_name, query_results):
         connection.execute(insert_template, values)
 
 
+def _get_default_params(query):
+    params = {}
+
+    if 'parameters' in query.options:
+        for p in query.options['parameters']:
+            params[p['name']] = p['value']
+
+    return params
+
+
 class Results(BaseQueryRunner):
     noop_query = 'SELECT 1'
 
@@ -118,13 +153,13 @@ class Results(BaseQueryRunner):
 
     @classmethod
     def name(cls):
-        return "Query Results (Beta)"
+        return "Query Results Ex(Beta)"
 
     def run_query(self, query, user):
         connection = sqlite3.connect(':memory:')
 
-        query_ids = extract_query_ids(query)
-        create_tables_from_query_ids(user, connection, query_ids)
+        child_queries = extract_child_queries(query)
+        query = create_tables_from_child_queries(user, connection, query, child_queries)
 
         cursor = connection.cursor()
 

Good luck!
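To illustrate the overall flow of the patch after parsing, here is a stdlib-only sketch (the data is faked; the real code loads the child query’s results through its data source’s query runner, and the .encode() below is a Python 3 adjustment, since the patch targets Python 2):

```python
import hashlib
import sqlite3

# Faked child-query token and results; the patch gets these via the runner.
child_token = "query_1('{\"rating\": \"PG-13\"}')"
child_rows = [('ALIEN CENTER',), ('AIRPORT POLLOCK',)]

# Stage the results in an in-memory table named after the token's hash...
table = 'tmp_' + hashlib.md5(child_token.encode()).hexdigest()
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE {} (title TEXT)'.format(table))
conn.executemany('INSERT INTO {} VALUES (?)'.format(table), child_rows)

# ...then rewrite the parent query to read from that staged table.
parent = 'SELECT * FROM ' + child_token
rewritten = parent.replace(child_token, table)
print(len(conn.execute(rewritten).fetchall()))
# 2
```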


#10

@ariarijp Sorry, but I can’t make it work even using your queries. I’m able to execute the first query with parameters:

But the second one fails:

What might be the reason?


#11

More logs:

[2018-09-18 03:13:14,069][PID:73][INFO][Worker-21] task_name=redash.tasks.execute_query taks_id=a6c55223-e6d2-4d4c-ab64-8dd6c8176cb0 task=execute_query query_hash=a5010a00aaf69c88c09e36b60d1212c9 data_length=None error=[function query_19(unknown) does not exist
LINE 2: FROM query_19('{"action":"view", "object_type":"query"}');
             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
]
[2018-09-18 03:13:14,074][PID:1][ERROR][MainProcess] Task redash.tasks.execute_query[a6c55223-e6d2-4d4c-ab64-8dd6c8176cb0] raised unexpected: QueryExecutionError('function query_19(unknown) does not exist\nLINE 2: FROM query_19(\'{"action":"view", "object_type":"query"}\');\n             ^\nHINT:  No function matches the given name and argument types. You might need to add explicit type casts.\n',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/app/redash/worker.py", line 71, in __call__
    return TaskBase.__call__(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/redash/tasks/queries.py", line 528, in execute_query
    scheduled_query).run()
  File "/app/redash/tasks/queries.py", line 470, in run
    raise result
QueryExecutionError: function query_19(unknown) does not exist
LINE 2: FROM query_19('{"action":"view", "object_type":"query"}');
             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

#12

@leshik Hmm, which data source type are you using?

HINT:  No function matches the given name and argument types.

It looks like an error from PostgreSQL.


#13

@ariarijp As you suggested, I’m using the re:dash DB as the data source, so I just added its Postgres DB to my data sources.


#14

@leshik I see. Can you run your second query on the Query Results data source?


#15

@ariarijp omg, I’m so sorry for wasting your time, I just forgot to change the data source in the query editor. It works now with your provided example as well as with my case with GA. I will experiment more and will let you know if I hit some edge case. Thank you!


#16

@leshik Great! Happy Redash-ing! :redash:


#17

@ariarijp I figured out what was breaking my queries. After the PoC patch, it’s not possible to use queries like this one anymore:

select total.school
from query_7 as total;

This works on current master but stops working after the patch is applied: the query fails with Error running query: no such column: total.school.

My queries rely on joins that depend on table aliases, so they stopped working.


#18

@leshik Thank you for your feedback!

I understand what’s bothering you.

For now, it is one of the limitations of this PoC.

Of course, I always welcome any patch for solving it :slight_smile:


#19

Is it possible to do something like this?
FROM query_254('{"mindate":'{{ mindate }}', "maxdate":'{{ maxdate }}'}');
This doesn’t work because of the brackets. Am I doing this wrong, or are parameters inside parameters not supported?
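Looking at the PoC patch, I suspect the quoting is at least part of the problem: the argument must be valid JSON, and JSON string values require double quotes, so values wrapped in single quotes fail json.loads. A quick check (the dates are hypothetical placeholders):

```python
import json

# JSON requires double-quoted strings; single-quoted values raise ValueError,
# so an argument like {"mindate":'2018-01-01'} can't be parsed by the PoC.
try:
    json.loads("{\"mindate\":'2018-01-01'}")
    print('parsed')
except ValueError:
    print('parse failed')

print(json.loads('{"mindate": "2018-01-01"}')['mindate'])
# parse failed
# 2018-01-01
```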


#20

Hi @arikfr,

First, let me start by saying that Redash is awesome. Thank you for open-sourcing it and, most of all, thank you for engaging so actively with the community that has sprung up around it!

As for adding support for queries with parameters, I believe that the best way to implement it would be to lean on sqlite’s native support for user-defined table-returning functions.

These can be implemented in Python with the help of either the sqlite-vtfunc module or Peewee’s TableFunction class.

(For the purposes of the rest of this discussion, I’m going to call the queries from other data sources “source queries” and the query that runs on sqlite based on the “source queries” the “sink query”.)

Implementing query results as table functions brings ancillary benefits:

  • No need to invent new syntax. Passing parameters declared on the sink query to a source query is simply a matter of doing “select * from query_175({{parameter}})” whereas passing a correlated value from the sink query into the source query can be achieved with something like:

    select record.title, (select count(*) from query_175(record.id)) from record

  • The source queries wouldn’t need to be executed completely before running the sink query. On the contrary, the sink query actually runs first, and the source queries only need to produce as much data as the sink query consumes, allowing the queries to run in parallel

    • This solves a pet peeve of mine, where the source queries are completely executed before the sink query fails with a syntax error when there is something as simple as a missing comma in the sink query.

Since these functions need to be pre-declared before running the query in sqlite, this would require parsing the query, as is currently done, to locate which existing queries should be declared as table functions.

Alternatively, the data source could require configuration of which source queries should be available to sink queries (and perhaps even their names!), which would also help to serve as a sort of access control. This would allow removing all code that currently parses sink queries to locate the source queries.

Using JSON in the argument passing, as suggested by @ariarijp, is also a possibility, which could help with mapping arguments passed in the sink query to parameters declared in the source queries; otherwise some sort of convention about parameter order would be needed.

These are my thoughts on the matter.

Again, thank you for engaging with the community on this matter!