Converting Redash query results to Pandas DataFrame and vice versa


#1

Originally posted by @dersphere on GitHub (#2078):


To improve the usability of the python query runner I created two python functions to convert the result (of execute_query and get_query_result) to a pandas.DataFrame (incl. type conversions) and a DataFrame to result.
Both functions only iterate the columns (not the rows) and convert on pd.Series - so they should be pretty fast. They should also handle all possible redash-types and pandas-dtypes (incl. np.nan, datetime and date).

I am pretty sure that other people might find this useful, so my questions are:

  • Is it worth adding?
  • How is the best way to integrate it with redash?

I have two ideas in mind:

  1. add both methods as staticmethods to the current python query runner, add numpy and pandas to requirements and improve the docs
  2. add a new “Python Pandas” Query runner for this purpose (and alter execute_query and get_query_result to directly return DataFrames)

If you find it worth adding, I will create a PR.

Here are the two functions:

def df_to_redash(df_orig, index_to_col=False):
    import numpy as np
    df = df_orig.copy()
    if index_to_col:
        df.reset_index(inplace=True)
    result = {'columns': [], 'rows': []}
    conversions = [
        {'pandas_type': np.integer, 'redash_type': 'integer',},
        {'pandas_type': np.inexact, 'redash_type': 'float',},
        {'pandas_type': np.datetime64, 'redash_type': 'datetime', 'to_redash': lambda x: x.strftime('%Y-%m-%d %H:%M:%S')},
        {'pandas_type': np.bool_, 'redash_type': 'boolean'},
        {'pandas_type': np.object, 'redash_type': 'string'}
    ]
    labels = []
    for dtype, label in zip(df.dtypes, df.columns):
        for conversion in conversions:
            if issubclass(dtype.type, conversion['pandas_type']):
                result['columns'].append({'name': label, 'friendly_name': label, 'type': conversion['redash_type']})
                labels.append(label)
                func = conversion.get('to_redash')
                if func:
                    df[label] = df[label].apply(func)
                break
    result['rows'] = df[labels].replace({np.nan: None}).to_dict(orient='records')
    return result

def redash_to_df(result, col_to_index=False):
    import pandas as pd
    conversions = [
        {'redash_type': 'datetime', 'to_pandas': lambda x: pd.to_datetime(x, infer_datetime_format=True)},
        {'redash_type': 'date', 'to_pandas': lambda x: pd.to_datetime(x, infer_datetime_format=True)},
    ]
    df = pd.DataFrame.from_dict(result['rows'], orient='columns')
    labels = []
    for column in result['columns']:
        label = column['name']
        labels.append(label)
        for conversion in conversions:
            if conversion['redash_type'] == column['type']:
                func = conversion.get('to_pandas')
                if func:
                    df[label] = df[label].apply(func)
                break
    df = df[labels]
    if col_to_index and labels:
        df.set_index(labels[0], inplace=True)
    return df

Requirements:

  • Python query runner enabled
  • pandas and numpy installed (global or virtualenv) and enabled (in datasource settings)