Problem with slow UI and worker timeouts


#1

I am setting up an on-premises Redash in a Kubernetes cluster (hosted in AWS) using the latest Docker images.

I tried the docker-compose.production.yaml with Docker and it worked fine.
Then I wrote Kubernetes Service and Deployment YAML configurations for the server, worker and redis, and configured the DB.

The create_db job worked fine, but the Redash UI is very slow: a page takes about a minute to load, although in the end it loads correctly.

I believe I have something misconfigured.

From server logs I see long “query_duration” but no errors:
[2017-05-03 09:02:13,233][PID:32180][INFO][metrics] method=GET path=/api/queries/recent endpoint=recent_queries status=200 content_type=application/json content_length=2 duration=119.76 query_count=5 query_duration=101.95
[2017-05-03 09:02:33,237][PID:32556][DEBUG][metrics] table=organizations query=select duration=2.41

From the server logs I also see that workers time out regularly:
[2017-05-03 11:56:57 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11622)
[2017-05-03 11:56:57 +0000] [11622] [INFO] Worker exiting (pid: 11622)
[2017-05-03 11:56:58 +0000] [11872] [INFO] Booting worker with pid: 11872

According to the DB logs, the query times are less than 1 ms:
2017-05-03 11:28:26.783 UTC [5199]: [20-1] user=redash,db=redash,client=10.24.221.104(43642),app=[unknown] LOG: duration: 0.308 ms statement: SELECT users.groups AS users_groups, users.updated_at AS users_updated_at, users.created_at AS users_created_at, users.id AS users_id, users.org_id AS users_org_id, users.name AS users_name, users.email AS users_email, users.password_hash AS users_password_hash, users.api_key AS users_api_key

From the DB logs I also see that the client regularly closes the connection, which might be related to the worker timeouts:
2017-05-03 02:45:00.327 UTC [31710]: [1-1] user=redash,db=redash,client=10.44.221.236(40460),app=[unknown] LOG: could not receive data from client: Connection reset by peer

Any suggestions on how to get the Redash UI back to normal speed?

Is there a diagram of the Redash architecture that shows how the server and workers communicate?


#2

Each Redash process (worker or web server) communicates directly with the database, so in theory there is no overhead aside from regular stuff.

Gunicorn has a default timeout of 30 seconds per request. If your requests take longer, it might explain the timeouts and the disconnects from Postgres.

I would try to understand what queries take long, and see how long the database reports they take. It might be that the <1ms execution time you saw is for a specific query that happens to run fast, and not representative of the other queries.
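As a rough sketch of how to start on that from the web server side, the snippet below parses a [metrics] log line like the one posted above and subtracts query_duration from duration, to see how much of the request time is not attributed to database queries. The helper names (parse_metrics, time_outside_db) are made up for this example, and it assumes both fields are reported in the same unit.

```python
import re

# Matches key=value pairs such as "duration=119.76" in a [metrics] log line.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_metrics(line):
    """Return the key=value fields of one log line as a dict of strings."""
    return dict(PAIR.findall(line))


def time_outside_db(line):
    """Request duration minus the time attributed to database queries.

    Assumes 'duration' and 'query_duration' use the same unit.
    """
    fields = parse_metrics(line)
    return float(fields["duration"]) - float(fields["query_duration"])


# The sample line is copied from the server log in the first post.
sample = (
    "[2017-05-03 09:02:13,233][PID:32180][INFO][metrics] method=GET "
    "path=/api/queries/recent endpoint=recent_queries status=200 "
    "content_type=application/json content_length=2 duration=119.76 "
    "query_count=5 query_duration=101.95"
)

fields = parse_metrics(sample)
print(fields["path"], fields["query_count"], round(time_outside_db(sample), 2))
```

Running this over the whole log and sorting by duration should show which endpoints are slow and whether the time is spent in the database or elsewhere.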


#3

I’m seeing the same issues with a very similar setup. DB queries finish in milliseconds, though page loads take significantly longer.

@hardi have you had any progress in resolving this?


#4

Hi @grue

My solution might not help you much, because I switched from the “Redash containers in Kubernetes cluster” setup to an “Ubuntu 16.04 + bootstrap.sh script” setup. I was in a bit of a hurry, and it works for me now.

To understand the Redash architecture, I’d recommend looking at that same bootstrap.sh script and the supervisor config files. They helped me understand the Redash components and the setup a little better.

I also suspect that my issue might have been caused by replacing nginx with an AWS ELB without properly carrying over the nginx config.

I hope you manage to find the cause and document the solution here, because one day I might want to switch back to Kubernetes resources :slight_smile:


#5

Check whether it might be data related, as in the thread “Query does not finish in UI and executes in just few ms in Database”.


#6

I’ve recently had the same issue and fixed it by changing the default timeout in the gunicorn config (by editing supervisord.conf). More details are in this gist: https://gist.github.com/pcolazurdo/f02102e54206bf32e7e5ab33db7df51e
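
For anyone who prefers not to edit the command line in supervisord.conf, the same timeout can also be set in a gunicorn config file, which is plain Python. This is only a minimal sketch and not what the gist contains: the file name, bind address, worker count and the 120 second value are assumptions to adapt to your own setup.

```python
# gunicorn.conf.py (hypothetical file name; pass it to gunicorn with -c)

# Address the web server listens on; adjust to match your deployment.
bind = "0.0.0.0:5000"

# Number of worker processes (assumed value, tune for your instance size).
workers = 4

# Seconds a worker may stay silent before the master kills and restarts it.
# Gunicorn's default is 30, which produces the WORKER TIMEOUT log lines
# when requests take longer than that.
timeout = 120
```

Gunicorn would then be started with something like `gunicorn -c gunicorn.conf.py redash.wsgi:app`, where the application path is an assumption about how your Redash version is launched. Note that a longer timeout only hides the symptom; it is still worth finding out why the requests are slow.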