Redash stability issues and best practices for ECS deployment - V8

Hello everyone,
I have a somewhat urgent request for support. We run Redash in our production environment on an AWS ECS cluster. Current setup:
Redash Service - 3 nodes
Workers - 3 nodes
Redis - 1 node (We had 2 nodes initially, but recently noticed in the logs that Redis starts in standalone mode and thought that could be causing our problems, as we had been seeing a lot of issues with running queries lately. However, it had been running in that mode for a long time, and things have only become really unstable in the last couple of weeks. We do have more load now, with many more people running ad hoc queries alongside scheduled queries.)
The cluster runs on an ASG to scale up/down as needed.
The issues we have seen most often recently are:

  1. A query is submitted, but the timer keeps ticking and no result is ever returned.
  2. In some cases the logs do not even show that a query was submitted.
  3. In some cases the worker log shows the query finished, but it also contains this line: “[2021-05-05 13:46:12,352][PID:20][INFO][ForkPoolWorker-4] Updated 0 queries with result (2297d16bc48bc855ba5567e67e3d36ef).” I can see that the result was generated and there is a record in Postgres, but the GUI never updated and the timer kept ticking for 2 hours with no results.
  4. We frequently see Redis connection errors and issues with Consul service discovery. ([2021-05-06 02:44:29,584][PID:32][ERROR][ForkPoolWorker-12] Connection to Redis lost: Retry (0/20) now.)
  5. We also see this error frequently: [2021-05-05 18:35:50,370][PID:1][ERROR][MainProcess] Control command error: OperationalError(u"\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key (u'_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
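
To see whether these symptoms cluster in time (for example, "Updated 0 queries" lines appearing right after a Redis reconnect), it can help to tally them from the worker logs. The following is a minimal sketch that assumes the log layout shown in the excerpts above (`[timestamp][PID:n][LEVEL][process] message`) and one log record per line. The `SYMPTOMS` labels and the `summarize` helper are hypothetical names for illustration, not part of Redash.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above:
# [timestamp][PID:n][LEVEL][process] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[PID:(?P<pid>\d+)\]"
    r"\[(?P<level>[A-Z]+)\]"
    r"\[(?P<proc>[^\]]+)\]\s*(?P<msg>.*)$"
)

# Hypothetical labels mapped to substrings copied from the quoted messages.
SYMPTOMS = {
    "redis_connection_lost": "Connection to Redis lost",
    "pidbox_routing_error": "Cannot route message for exchange",
    "result_not_attached": "Updated 0 queries with result",
}


def summarize(lines):
    """Count how often each known symptom appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        message = match.group("msg")
        for label, needle in SYMPTOMS.items():
            if needle in message:
                counts[label] += 1
    return counts
```

Feeding it the exported log lines of each worker container separately should show whether one worker is affected more than the others.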

What I am really looking for from this group is recommendations on setting up Redash on ECS and best practices to follow. We are really struggling with Redash stability, especially in the last couple of weeks. Any help would be highly appreciated. We are on version 8.

Why are you running multiple Redis nodes? Unless you configure them to sync with one another, this can only cause problems. I suspect this is the root of your problem. Unless you have millions of users querying at once, you don’t need multiple Redis nodes anyway. Redis is highly efficient and is frankly overkill for the minimal load Redash applies.

I see that you’ve reduced to one node. Did you bounce your workers afterward?

We are not, at least not for the last 2-3 weeks. As I mentioned above, we noticed that a few weeks back and changed it to 1 node, since Redis runs in standalone mode and the 2 nodes had no knowledge of one another.
Yes, we bounced the workers after the change.

Good to know. What happens to new query execution jobs when you click “Execute”?

If you open your browser network tools before clicking “Execute” you should see a new XHR request to /queries/$query_id/results. Does this request receive a job object in response?

If it does, what do you see in the logs for your workers related to that job_id?
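
As a rough illustration of that check, here is a small sketch that classifies a response body copied from the browser network tab and then filters worker log lines for the job id. It assumes the response is JSON with either a top-level `job` object or a `query_result` object, and that job status codes 1 to 5 mean pending, started, success, failure, and cancelled; please verify both against your Redash version. `classify_response`, `lines_for_job`, and the id `abc-123` in the usage note are hypothetical.

```python
import json

# Assumed meaning of the numeric job status; verify against your Redash version.
JOB_STATUS = {1: "pending", 2: "started", 3: "success", 4: "failure", 5: "cancelled"}


def classify_response(body):
    """Return (kind, id, status) for a JSON response body given as a string."""
    data = json.loads(body)
    if isinstance(data, dict) and "job" in data:
        job = data["job"]
        status = JOB_STATUS.get(job.get("status"), "unknown")
        return ("job", job.get("id"), status)
    if isinstance(data, dict) and "query_result" in data:
        # A result came back directly, so no new job was enqueued.
        return ("cached_result", data["query_result"].get("id"), None)
    return ("unexpected", None, None)


def lines_for_job(log_lines, job_id):
    """Keep only the worker log lines that mention the given job id."""
    return [line for line in log_lines if job_id in line]
```

For example, if `classify_response` reports `("job", "abc-123", "pending")` and `lines_for_job(worker_lines, "abc-123")` comes back empty on every worker, the job was likely enqueued but never picked up.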