Following the opening of pull request 3908, and given my past work on adding full-text search (Postgres tsvector based) to the Redash backend, I thought I'd share my concerns with the current design and offer some ways out of it. I'm not sure this should block PR 3908 (I don't think it needs to), but I would certainly encourage pausing for a moment to consider an extended approach.
Looking back, I'm not convinced the tsvector-based search delivered what we hoped for: smooth deployment and improved results. The tsvector backend is limited in many ways and excludes a large part of the market, e.g. users of non-Latin languages such as Chinese and Japanese.
My suggestion
- adding a search backend registry (similar to the query runner registry) that can be extended with additional Python packages; it would only need a CRUD interface, so it could be fairly simple (see below)
- reverting to the "old" (read: pre-tsvector) ILIKE-based search as the default backend (including multibyte search as proposed in the PR)
- implementing more complex backends (pg tsvector, Whoosh, Algolia, Elasticsearch) as well, either by the Redash team/contributors or by 3rd parties
- since the design of the tsvector backend requires it to be part of the individual data models (e.g. the Query.search_vector column), I think we need to keep it for a while, but we should eventually demote and then deprecate it
Code example
```python
# Illustrative imports; the real code would pull these from Redash itself
# and from the Algolia client library.
from sqlalchemy import or_
from algoliasearch.search_client import SearchClient

from redash import settings


class SearchRegistry(object):
    def __init__(self):
        self.backends = {}
        for backend in settings.REDASH_DEFAULT_SEARCH_BACKENDS:
            self.register(backend)

    def register(self, backend_class):
        # ("class" is a reserved word, hence "backend_class")
        self.backends[backend_class.type()] = backend_class
        return backend_class

    def unregister(self, backend_class):
        self.backends.pop(backend_class.type())


search_backends = SearchRegistry()


class SearchBackend(object):
    def search(self, model, term, filters, limit=None, *args, **kwargs):
        raise NotImplementedError

    def add(self, obj):
        raise NotImplementedError

    def remove(self, obj):
        raise NotImplementedError

    def update(self, obj):
        raise NotImplementedError


@search_backends.register
class SimpleSearchBackend(SearchBackend):
    def add(self, obj):
        pass  # no-op, since it searches plain database fields

    remove = update = add

    def search(self, model, term, filters, limit=None, *args, **kwargs):
        pattern = u'%{}%'.format(term)
        return model.all_queries().filter(
            or_(
                model.name.ilike(pattern),
                model.description.ilike(pattern)
            )
        ).order_by(model.id).limit(limit)


@search_backends.register
class TsvectorSearchBackend(SimpleSearchBackend):
    def search(self, model, term, filters, limit=None, *args, **kwargs):
        return model.all_queries().search(term, sort=True).limit(limit)


@search_backends.register
class AlgoliaSearchBackend(SearchBackend):
    def __init__(self):
        self.client = SearchClient.create('YourApplicationID', 'YourAdminAPIKey')
        self.index = self.client.init_index('redash_acme')

    def add(self, obj):
        # object_to_dict is a hypothetical serialization helper
        self.index.save_object(object_to_dict(obj, objectID=obj.id))

    update = add

    def remove(self, obj):
        self.index.delete_object(obj.id)

    def search(self, model, term, filters, limit=None, *args, **kwargs):
        return self.index.search(term, {'filters': filters})['hits']
```
This is obviously pseudo-code, but I hope it conveys the idea. Individual search backends could also grow more APIs, such as reindexing all objects via the command line, or support for other data models like alerts and dashboards.
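To make the reindexing idea concrete, here is a minimal sketch of what such a command could boil down to: a loop that feeds every object of every model back through the backend's CRUD interface. The `InMemorySearchBackend` stand-in and the `reindex` helper are assumptions for illustration, not existing Redash APIs.

```python
class InMemorySearchBackend:
    """Stand-in backend that just records what was indexed."""
    def __init__(self):
        self.indexed = []

    def add(self, obj):
        self.indexed.append(obj)


def reindex(backend, models):
    """Push every object of every model into the backend; returns the count.

    In Redash the inner loop would iterate e.g. Query.query or
    Dashboard.query instead of plain lists.
    """
    count = 0
    for model in models:
        for obj in model:
            backend.add(obj)
            count += 1
    return count
```

A CLI entry point (e.g. via Redash's existing manager commands) would then just look up the configured backend in the registry and call `reindex` on it.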
What do you think?