Originally posted by @dersphere on GitHub (#2078):
To improve the usability of the python query runner I created two python functions to convert the result
(of execute_query
and get_query_result
) to a pandas.DataFrame (incl. type conversions) and a DataFrame to result
.
Both functions only iterate the columns (not the rows) and convert on pd.Series - so they should be pretty fast. They should also handle all possible redash-types and pandas-dtypes (incl. np.nan, datetime and date).
I am pretty sure that other people might find this useful, so my questions are:
- Is it worth adding?
- How is the best way to integrate it with redash?
I have two ideas in mind:
- add both methods as staticmethods to the current python query runner, add numpy and pandas to requirements and improve the docs
- add a new “Python Pandas” Query runner for this purpose (and alter
execute_query
andget_query_result
to directly return DataFrames)
If you find it worth adding, I will create a PR.
Here are the two functions:
def df_to_redash(df_orig, index_to_col=False):
import numpy as np
df = df_orig.copy()
if index_to_col:
df.reset_index(inplace=True)
result = {'columns': [], 'rows': []}
conversions = [
{'pandas_type': np.integer, 'redash_type': 'integer',},
{'pandas_type': np.inexact, 'redash_type': 'float',},
{'pandas_type': np.datetime64, 'redash_type': 'datetime', 'to_redash': lambda x: x.strftime('%Y-%m-%d %H:%M:%S')},
{'pandas_type': np.bool_, 'redash_type': 'boolean'},
{'pandas_type': np.object, 'redash_type': 'string'}
]
labels = []
for dtype, label in zip(df.dtypes, df.columns):
for conversion in conversions:
if issubclass(dtype.type, conversion['pandas_type']):
result['columns'].append({'name': label, 'friendly_name': label, 'type': conversion['redash_type']})
labels.append(label)
func = conversion.get('to_redash')
if func:
df[label] = df[label].apply(func)
break
result['rows'] = df[labels].replace({np.nan: None}).to_dict(orient='records')
return result
def redash_to_df(result, col_to_index=False):
import pandas as pd
conversions = [
{'redash_type': 'datetime', 'to_pandas': lambda x: pd.to_datetime(x, infer_datetime_format=True)},
{'redash_type': 'date', 'to_pandas': lambda x: pd.to_datetime(x, infer_datetime_format=True)},
]
df = pd.DataFrame.from_dict(result['rows'], orient='columns')
labels = []
for column in result['columns']:
label = column['name']
labels.append(label)
for conversion in conversions:
if conversion['redash_type'] == column['type']:
func = conversion.get('to_pandas')
if func:
df[label] = df[label].apply(func)
break
df = df[labels]
if col_to_index and labels:
df.set_index(labels[0], inplace=True)
return df
Requirements:
- Python query runner enabled
- pandas and numpy installed (global or virtualenv) and enabled (in datasource settings)