I have a somewhat urgent request for support. We run Redash in our production environment on an AWS ECS cluster. Current setup:
Redash Service - 3 nodes
Workers - 3 nodes
Redis - 1 node (We originally ran 2 nodes, but recently noticed in the logs that Redis starts in standalone mode and suspected that was behind the query problems we have been seeing lately. However, it had been running in that mode for a long time, and things have only become really unstable in the last couple of weeks. We do have more load now, with many more people running ad hoc queries alongside the scheduled queries.)
The cluster is supported on an ESG to scale up/down as needed.
The issues we have seen most often recently are:
- A query is submitted, but the timer keeps ticking and no result is ever returned
- In some cases the log does not even show that a query was submitted
- In some cases the worker log shows that the query finished, but it also contains this line: "[2021-05-05 13:46:12,352][PID:20][INFO][ForkPoolWorker-4] Updated 0 queries with result (2297d16bc48bc855ba5567e67e3d36ef)." The result was generated and there is a record for it in Postgres, but the GUI never updated and the timer kept ticking for 2 hours with no results
- We frequently see Redis connection errors and issues with Consul service discovery. ([2021-05-06 02:44:29,584][PID:32][ERROR][ForkPoolWorker-12] Connection to Redis lost: Retry (0/20) now.)
- We also see this error frequently: [2021-05-05 18:35:50,370][PID:1][ERROR][MainProcess] Control command error: OperationalError(u"\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key (u'_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
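To put numbers on how often each of these symptoms occurs, a small script along the following lines could tally them from the worker logs. This is only a minimal sketch and not part of Redash: the regular expression assumes every log line follows the layout shown in the excerpts above, and the function name `tally_symptoms` and the symptom labels are made up for illustration.

```python
import re
from collections import Counter

# Assumed log layout, taken from the excerpts above:
# [timestamp][PID:n][LEVEL][process-name] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[PID:(?P<pid>\d+)\]"
    r"\[(?P<level>[A-Z]+)\]\[(?P<proc>[^\]]+)\]\s*(?P<msg>.*)$"
)

# Hypothetical labels mapped to substrings of the messages quoted above.
SYMPTOMS = {
    "result_not_linked": "Updated 0 queries with result",
    "redis_connection_lost": "Connection to Redis lost",
    "pidbox_binding_missing": "Cannot route message for exchange",
}


def tally_symptoms(lines):
    """Count how many log lines match each known symptom."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the assumed layout
        message = match.group("msg")
        for label, needle in SYMPTOMS.items():
            if needle in message:
                counts[label] += 1
    return counts
```

It could be run against an exported log file with something like `tally_symptoms(open("worker.log"))`, where `worker.log` is a placeholder file name. Comparing the counts per day would show whether the Redis connection losses grew along with the load.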
What I am really looking for from this group is recommendations on setting up Redash on ECS and best practices to follow. We are really struggling with Redash stability, especially in the last couple of weeks. Any help would be highly appreciated. We are on version 8.