Pipeline Data Source


#1

The discussion on #3584 gave me an idea for a pipeline query runner :bulb: It’s only indirectly related to #3584. The idea is to have a single query in which you define multiple queries/steps, followed by a final step that uses them and produces the result.

The reason I thought of it while looking at #3584 is that you can already do what #3584 asks for with the json data source, but then you need to split it into two queries:

  1. A query that uses the json data source and loads the data from the status API.
  2. A query that uses the Query Results data source and does something with that data (unless you only need the raw data).

Having two (or more) queries always feels annoying, so this pipeline runner would make it possible to do it all in one query.

The first implementation could be YAML-based:

sources:
  status: # `status` is just an alias you later use in the query
    type: json # `type` is the type of source you're loading from
    url: https://.../status.json
query: |
  SELECT *
  FROM status
  WHERE key = '...'

The type of a source can be json, csv, query, gsheets, etc. Basically, this would unify all the data sources we currently use to load non-database data. The type could also be data_source, in which case the source takes the ID of an existing data source and passes a query to it.
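To make the idea more concrete, here is a rough sketch in Python of how such a runner might work. It assumes each source is loaded into an in-memory SQLite table named after its alias and that the final query then runs against those tables. Everything named here (run_pipeline, LOADERS, and the inline data key) is made up for illustration; a real json source would fetch the url instead, and the dict stands in for the parsed YAML.

```python
import csv
import io
import json
import sqlite3


def load_json(spec):
    # Hypothetical loader: reads rows from an inline "data" string so the
    # sketch runs offline. A real json source would fetch spec["url"].
    return json.loads(spec["data"])


def load_csv(spec):
    # Hypothetical loader: parses inline CSV text into a list of dicts.
    return list(csv.DictReader(io.StringIO(spec["data"])))


# One loader per source type; more types would be registered here.
LOADERS = {"json": load_json, "csv": load_csv}


def run_pipeline(config):
    """Load every source into an in-memory SQLite table, then run the final query."""
    conn = sqlite3.connect(":memory:")
    for alias, spec in config["sources"].items():
        rows = LOADERS[spec["type"]](spec)
        if not rows:
            raise ValueError("source %r returned no rows" % alias)
        columns = list(rows[0].keys())
        column_sql = ", ".join('"{}"'.format(c) for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        # The alias becomes the table name the final query refers to.
        conn.execute('CREATE TABLE "{}" ({})'.format(alias, column_sql))
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(alias, placeholders),
            [tuple(row.get(c) for c in columns) for row in rows],
        )
    cursor = conn.execute(config["query"])
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Stand-in for the parsed YAML, with made-up status rows.
config = {
    "sources": {
        "status": {
            "type": "json",
            "data": '[{"key": "workers", "value": 4}, {"key": "queues", "value": 2}]',
        }
    },
    "query": "SELECT * FROM status WHERE key = 'workers'",
}
result = run_pipeline(config)
```

Keeping the loaders in a dict keyed by type means a data_source type could later be added as just another loader that runs a query against an existing data source and returns its rows.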

Writing queries in YAML can be unappealing, so a future iteration could be a proper UI that lets you define sources, each with its own UI, and then write a query.

This is just an idea at the moment, but I wanted to share it to get feedback.


#2

This sounds awesome. What about schemas? Would be useful to have a unified view.

Also, how about the term ‘composite’ for the name of this data source? (not sure about it though)


#3

+1 from me. Would it be possible to do this with YAML but querying a SQL database as well?

sources:
  table1:
    query: SELECT * FROM users WHERE 1 = 1
  table2:
    query: SELECT * FROM cats WHERE 1 = 1
query: |
  SELECT * FROM table1 LEFT JOIN table2 ON ...