Search backend proposal

jezdez · July 1, 2019, 2:57pm

Following the opening of pull request 3908 and my past work on adding full text search (Postgres tsvector based) on the Redash backend I thought I share my concerns with the current design of it and offer options for ways out of it. I’m not sure if this should block PR 3908 (I don’t think it needs to), but I would certainly encourage pausing for a moment to consider an extended approach.

Looking back, I’m not super convinced the tsvector based search delivered what we hoped to get – smooth deployment and improved results. Especially given the fact that the tsvector backend is limited in many ways and excludes a huge part of the market, e.g. for non-latin languages like Chinese and Japanese.

My suggestion

adding a search backend system (like the query runner registry) that can be extended with additional Python packages (it would only need to have a CRUD interface, so could be pretty simplistic, see below)
revert to the “old” (read: pre-tsvector) ILIKE based search as the default backend (including multibyte-search as proposed in the PR)
but implement more complex (pg tsvector, Whoosh, Algolia or Elasticsearch) backends as well, either by the Redash team/contributors or 3rd parties
given the design of the tsvector backend which requires it to be part of the individual data models (e.g. the Query.search_vector column) I think we need to stick with it for a bit but eventually should demote and then deprecate it

Code example

class SearchRegistry(object):

    def __init__(self):
        self.backends = {}
        for backend in settings.REDASH_DEFAULT_SEARCH_BACKENDS:
            self.register(backend)

    def register(self, class):
        self.backends[class.type()] = class
        return class

    def unregister(self, class):
        self.backends.pop(class.type())


search_backends = SearchRegistry()


class SearchBackend(object):
    def search(self, term, filters, *args, **kwargs):
        raise NotImplementedError

    def add(self, obj):
        raise NotImplementedError

    def remove(self, obj):
        raise NotImplementedError

    def update(self, obj):
        raise NotImplementedError


@search_backends.register
class SimpleSearchBackend(SearchBackend):

    def add(self, obj):
        pass  # since it uses plain database fields

    remove = update = add

    def search(self, model, term, filters, limit=None, *args, **kwargs):
        pattern = u'%{}%'.format(term)
        return model.all_queries().filter(
            or_(
                model.name.ilike(pattern),
                model.description.ilike(pattern)
            )
        ).order_by(model.id).limit(limit)


@search_backends.register
class TsvectorSearchBackend(SimpleSearchBackend):

    def search(self, model, term, filters, limit=None, *args, **kwargs):
        return model.all_queries().search(term, sort=True).limit(limit)


@search_backends.register
class AlgoliaSearchBackend(SearchBackend):

    def __init__(self):
        self.client = SearchClient.create('YourApplicationID', 'YourAdminAPIKey')
        self.index = client.init_index('redash_acme')

    def create(self, obj):
        self.index.save_object(object_to_dict(obj, objectID=obj.id))

    update = create

    def remove(self, obj):
        self.index.delete_object(obj.id)


    def search(self, model, term, filters, limit=None, *args, **kwargs):
        return requests.get(ALGOLIA_API_URL + "/?object={}&term={}&filters={}".format(model.__class__.__name__, term, filters)).json().get('results')

This is obviously pseudo-code, but I hope you get the idea. Individual search backends could also get more APIs like reindexing all objects via the command line, support for other data models such as alerts and dashboards etc.

What do you think?

justinclift · July 1, 2019, 3:22pm

People do have full text search working with Chinese. It’s not something I’ve personally taken a look at, but (in theory) it “should work”:

Guessing you’ve already been down this path already, and it’s not suitable?

jezdez · July 1, 2019, 3:36pm

Oh absolutely, didn’t mean to imply ft search is impossible with Postgres (which I love). But from an operational standpoint it’s more complicated to deploy and maintain custom Postgres extensions for differently sized deployments like Redash (OSS and SaaS) in my experience. Happy to be convinced otherwise though!

justinclift · July 1, 2019, 3:40pm

No worries at all, that makes sense.

arikfr · July 1, 2019, 6:28pm

100% agree here.

Also I’m on board with the backends idea. I just wonder if there is a good approach to managing permissions in search results. Today, when we use queries on the db tables, we can reuse the permission check logic we use in regular select queries. Once we have a dedicated search store, I wonder what a good approach might be?

jezdez · July 2, 2019, 11:14am

I think each search result would need to map to individual database rows so that we can reuse the database based permissions system. Basically we’d also index the primary key of the object (e.g. a query) and deserialize it after fetching the search results from the backend. Since our current use for the search is really just for listing purposes I don’t think that’ll be a huge limiting problem.

arikfr · July 2, 2019, 1:56pm

So we will get object ids from the search provider, and then filter them using the DB when fetching those ids?

Two issues:

Latency (the need to fetch the data from the DB), but this is not much different than today.
We might be left with less results than we wanted: i.e. we fetch 20 search results, but 17 of those get filtered. This means that we always need to ask the search provider for more results than we want and also can’t provide proper pagination.

▼Categories

▼Tags

Search backend proposal

My suggestion

Code example