ReQL query language


#1

I’ve been playing with the idea of having a custom SQL syntax for creating composed queries based on the results of other queries. I’m calling it ReSQL.

The syntax is based on sqlite’s SQL flavour but adds some distinctive productions so it’s obvious at a glance what is standard SQL and what’s not. I imagine most power-users of Re:Dash are familiar with SQL already.

provider "[" params "]" ( "::" mapper )* 

Examples

-- Fetch the number of rows reported by another query
SELECT COUNT(*) FROM query[3]
    
-- Filter a parametrized query resultset
SELECT * FROM query[id:3 p_country:'CAT']

-- Join two queries and aggregate
SELECT countries.name, COUNT(*)
FROM query[3] AS top_users
JOIN query[4] AS countries ON top_users.country = countries.id
GROUP BY countries.name

Additionally the syntax allows to perform adhoc queries using the configured datasources:

-- Fetch a remote CSV and apply a mapping function to parse it
SELECT * FROM http['https://foobar.com/report.csv']::csv

-- Custom HTTP requests with a JSONPath mapping
WITH authors AS http[ 
   url: 'http://foobar.com/report'
   AUTHORIZATION: 'token ${variable}'   -- variable interpolation
   body: [=                             -- initiate heredoc
     {
       "store": "Amazon",
       "report": "books" 
     }
   =]
]::json[' $library.book[*].author ']
SELECT *
FROM authors
WHERE language = 'en'

-- Use a configured data source for an adhoc query
WITH users AS datasource['Staging MySQL', [=
  SELECT * FROM mysql_table  -- would run on MySQL
=]
SELECT COUNT(*) FROM users

--  Free text search modifier
SELECT title, year
FROM http['https://foobar.com/movies.csv']::csv::fts[tokenizer: 'porter']
WHERE title MATCH 'lotr OR lord rings'

I tried to keep the examples short so they don’t properly show the whole potential that the design can achieve via composition.


How would it work?

A new data source (query runner) for ReQL would be available, so it builds on the existing infrastructure. When the server receives the query execution from the browser:

  • Parses the query to verify the syntax
  • Resolves the reql expression (query[...]) and checks they exist and the user is allowed to use them
  • Triggers the execution of each reql expression
  • Creates a temporary sqlite3 database
  • Inserts the results into separate tables in the temporary DB
  • Adapts the query converting the reql expressions into references to the created tables
  • Executes the query against the sqlite database
  • Reports back the result of the sqlite query

I considered also using the sqlite3 native virtual tables feature but it feels like a pain to troubleshoot and the sqlite3 parser is really lacking regarding syntax errors so I wanted to have a more user friendly parser on top.

Using the main postgresql database for it was also discarded because it would be a pain to isolate properly. With sqlite3 there is total isolation and also allows to define memory usage limits easily.

Another benefit of using sqlite3 with a custom parser is that we could offload some of the work to the browser (if it still supports WebSQL) while the user is editing the query.

Note that using sqlite3 is an implementation detail, it allows us to skip a large chunk of the implementation effort by using it for executing the final query. However I can imagine that in the future ReQL could be extended and have a fully native execution engine that avoids the constant marshalling of the data.

Caveats

  • First of all it’s not a simple project to implement although it seems certainly feasible.
  • Users could easily end up abusing the system by relying on ReSQL for everything instead of properly preparing the data elsewhere.
  • Schema browser support would be flaky

Closing words

Overall I think if offers a lot of potential, having a simple declarative way to massage the data would allow less tech inclined users to benefit from the system. If fits a nice middle ground between parametrized queries and a Python/docker data source.


#2

There are some really interesting ideas to discuss here, but the basic idea (querying query results) is actually already implemented :smiley:

Documentation: https://redash.io/help/user-guide/querying/query-results-data-source
Source code: https://github.com/getredash/redash/blob/master/redash/query_runner/query_results.py

It doesn’t support parameters. We have a discussion about it here:


#3

Ooops! I totally missed that query runner, it indeed covers the basic use case.

I think a good approach would be first to put a SQL parser infrastructure on top of that query runner (to avoid the regexp issues), later on the parser can be used to build more functionality incrementally (adhoc queries).

I’m a sucker for parsers so I’ll put something together later on :yum:


#4

Check the discussion with @ariarijp and the link to his implementation – he uses the sqlparse to extract the table names and to provide an infrastructure for future extensions.

As a rule of thumb, I prefer to keep the implementation simple, to make it maintainable. The biggest value of this feature comes from being able to build upon other queries (to further manipulate them or join multiple sources). Having to write the queries you query from as separate queries is not a big trade off, IMO. It also has the benefit of allowing reuse of these queries.

I would invest in tools like autocomplete for query IDs (using query names) and ability to see the “schema” of the queries you query from.


#5

I see your point about keeping things simple and it works well for data sources that expect a blob of text to run (i.e. a SQL query). However it doesn’t scale so well with other data sources that don’t offer a query language natively (i.e. URL).

By introducing an unified syntax (ReSQL) I think we can provide a bit of structure for interacting with a data source and helps with the user experience since the acquired knowledge can be used upon multiple data sources.

So the use cases would be:

  • Aggregate and filter data from multiple sources (currently solved by the query_result runner)
  • Structured parametrization of data sources (avoids current escaping issues with templating)
  • Format conversions for data (i.e. fetched url) or to expose nested documents (json inside DB)
  • Library of value transformations shared across datasources (i.e. ::date(input: epoch))
  • Snippets can be shared across different data sources

Using the Url data source as an example, it could simply expose a reql method like the following:

def reql(self, url, method='get', body=None, **kwargs):
    headers = [(k, v) for (k,v) in kwargs.items() if k.isupper()]
    ...

Then when we find a reference to that data source on ReQL like ...FROM http['https://foo.bar', CONTENT_TYPE:'json'] we simply call the .reql method with those arguments.

By the way, I’ve implemented the parser already, long time I didn’t enjoy a task so much :smile:


#7

I agree that it’s useful for data sources without their own query language, but still we can first load the source and then apply the SQL on it in a separate query.

When loading simple end points, like a CSV one, then having a query like below is great :

SELECT count(0)
FROM csv('https://.../data.csv')

But if you have a more complex endpoint, that you need to apply JSON extraction from, use some custom headers and more, then I’m not sure how better it is.

But considering you already have an implementation, I’m happy to explore this to see how complex it might get.


#8

Just pushed a PR against my fork with the work on the parser. It passes a whole lot of test cases from sqlite and postgresql and also introduces the ReQL syntax options.

Next will be to integrate it into the query_results data source.


#9

Updated the PR with the refactor of query_results to use the new parser. It still uses the same syntax query_N but now by default it retrieve the most recent result if available. A refresh can be forced with query_N_refresh.

Additionally it allows to define a memory limit for the temporary sqlite database.