Fault finding guide

bbrowski · June 10, 2022, 10:36am

So my server is doing something weird and I cant find the problem. I wondering if there is a kind of step-by-step guide to fault finding the problem or testing to see where the issue is?

eg. Step 1 check the logs (I would have to be told where or how)
Step 2 etc.

bbrowski · June 10, 2022, 11:07am

I may have multiple problems.

So my server (4GB ram, 2CPU) just goes dead every six weeks or so. I have to reset the entire system. Then redash gives me a 504 error. I fiddle around a bit - no luck and give up. A day to two later its working and I have no idea how or why. I’m not sure if I did something to fix it or it just settles down and then is ok. Then six weeks later it crashes again and I’m stumped.

It looks to me (I know very little) like its related to a high CPU usages and that it is caused by the Nginx being in some sort of loop. But I’m not sure what is usual for a resting redash server. If I shut down all the containers the server is fine (almost no CPU usage). When bring it up all the docker containers start but the CPU usages load average is over 4.

I shut it all down and did a docker-compose up --force-recreate --build. I can then see what it is doing but it looks like it gets stuck in a loop and then I press Ctrl-C and it all shuts down.

statsadmin@safdashboard:/opt/redash$ sudo docker-compose up --force-recreate --build
Recreating redash_postgres_1 ... done
Recreating redash_redis_1    ... done
Recreating redash_scheduled_worker_1 ... done
Recreating redash_scheduler_1        ... done
Recreating redash_worker_1           ... done
Recreating redash_server_1           ... done
Recreating redash_adhoc_worker_1     ... done
Recreating redash_nginx_1            ... done
Attaching to redash_redis_1, redash_postgres_1, redash_scheduler_1, redash_scheduled_worker_1, redash_adhoc_worker_1, redash_server_1, redash_worker_1, redash_nginx_1
redis_1             | 1:C 10 Jun 2022 10:11:38.074 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis_1             | 1:C 10 Jun 2022 10:11:38.083 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=1, just started
redis_1             | 1:C 10 Jun 2022 10:11:38.083 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
redis_1             | 1:M 10 Jun 2022 10:11:38.114 * Running mode=standalone, port=6379.
redis_1             | 1:M 10 Jun 2022 10:11:38.114 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
redis_1             | 1:M 10 Jun 2022 10:11:38.114 # Server initialized
redis_1             | 1:M 10 Jun 2022 10:11:38.115 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
redis_1             | 1:M 10 Jun 2022 10:11:38.125 * DB loaded from disk: 0.010 seconds
redis_1             | 1:M 10 Jun 2022 10:11:38.125 * Ready to accept connections
postgres_1          |
postgres_1          | PostgreSQL Database directory appears to contain a database; Skipping initialization
postgres_1          |
postgres_1          | LOG:  database system was shut down at 2022-06-10 10:03:05 UTC
postgres_1          | LOG:  MultiXact member wraparound protections are now enabled
postgres_1          | LOG:  database system is ready to accept connections
postgres_1          | LOG:  autovacuum launcher started
scheduler_1         | Starting RQ scheduler...
scheduled_worker_1  | Starting RQ worker...
adhoc_worker_1      | Starting RQ worker...
worker_1            | Starting RQ worker...
scheduled_worker_1  | 2022-06-10 10:11:58,933 INFO RPC interface 'supervisor' initialized
worker_1            | 2022-06-10 10:11:58,937 INFO RPC interface 'supervisor' initialized
worker_1            | 2022-06-10 10:11:58,947 CRIT Server 'unix_http_server' running without any HTTP authentication checking
worker_1            | 2022-06-10 10:11:58,950 INFO supervisord started with pid 1
scheduled_worker_1  | 2022-06-10 10:11:58,965 CRIT Server 'unix_http_server' running without any HTTP authentication checking
scheduled_worker_1  | 2022-06-10 10:11:58,968 INFO supervisord started with pid 1
server_1            | [2022-06-10 10:11:58 +0000] [1] [INFO] Starting gunicorn 20.0.4
server_1            | [2022-06-10 10:11:58 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
server_1            | [2022-06-10 10:11:59 +0000] [1] [INFO] Using worker: sync
adhoc_worker_1      | 2022-06-10 10:11:59,025 INFO RPC interface 'supervisor' initialized
adhoc_worker_1      | 2022-06-10 10:11:59,025 CRIT Server 'unix_http_server' running without any HTTP authentication checking
adhoc_worker_1      | 2022-06-10 10:11:59,027 INFO supervisord started with pid 1
server_1            | [2022-06-10 10:11:59 +0000] [8] [INFO] Booting worker with pid: 8
server_1            | [2022-06-10 10:11:59 +0000] [9] [INFO] Booting worker with pid: 9
server_1            | [2022-06-10 10:11:59 +0000] [10] [INFO] Booting worker with pid: 10
server_1            | [2022-06-10 10:11:59 +0000] [11] [INFO] Booting worker with pid: 11
scheduled_worker_1  | 2022-06-10 10:11:59,985 INFO spawned: 'worker_healthcheck' with pid 8
worker_1            | 2022-06-10 10:12:00,003 INFO spawned: 'worker_healthcheck' with pid 8
adhoc_worker_1      | 2022-06-10 10:12:00,077 INFO spawned: 'worker_healthcheck' with pid 8
scheduled_worker_1  | 2022-06-10 10:12:00,105 INFO spawned: 'worker-0' with pid 9
adhoc_worker_1      | 2022-06-10 10:12:00,161 INFO spawned: 'worker-0' with pid 9
worker_1            | 2022-06-10 10:12:00,166 INFO spawned: 'worker-0' with pid 9
adhoc_worker_1      | 2022-06-10 10:12:00,294 INFO spawned: 'worker-1' with pid 10
scheduled_worker_1  | 2022-06-10 10:12:01,164 INFO success: worker_healthcheck entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
worker_1            | 2022-06-10 10:12:01,263 INFO success: worker_healthcheck entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
adhoc_worker_1      | 2022-06-10 10:12:01,388 INFO success: worker_healthcheck entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
nginx_1             | 102.39.50.201 - - [10/Jun/2022:10:12:04 +0000] "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1" 301 185 "-" "Apple-iPhone12C8/1906.77" "-"
server_1            | [2022-06-10 10:12:29 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
server_1            | [2022-06-10 10:12:29 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
server_1            | [2022-06-10 10:12:29 +0000] [8] [INFO] Worker exiting (pid: 8)
server_1            | [2022-06-10 10:12:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
server_1            | [2022-06-10 10:12:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
server_1            | [2022-06-10 10:12:29 +0000] [9] [INFO] Worker exiting (pid: 9)
server_1            | [2022-06-10 10:12:30 +0000] [11] [INFO] Worker exiting (pid: 11)
server_1            | [2022-06-10 10:12:30 +0000] [10] [INFO] Worker exiting (pid: 10)
server_1            | [2022-06-10 10:12:31 +0000] [12] [INFO] Booting worker with pid: 12
server_1            | [2022-06-10 10:12:31 +0000] [13] [INFO] Booting worker with pid: 13
server_1            | [2022-06-10 10:12:31 +0000] [14] [INFO] Booting worker with pid: 14
server_1            | [2022-06-10 10:12:31 +0000] [15] [INFO] Booting worker with pid: 15
server_1            | [2022-06-10 10:13:01 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
server_1            | [2022-06-10 10:13:01 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:13)
server_1            | [2022-06-10 10:13:01 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:14)
server_1            | [2022-06-10 10:13:01 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:15)
server_1            | [2022-06-10 10:13:01 +0000] [14] [INFO] Worker exiting (pid: 14)
server_1            | [2022-06-10 10:13:01 +0000] [15] [INFO] Worker exiting (pid: 15)
server_1            | [2022-06-10 10:13:01 +0000] [12] [INFO] Worker exiting (pid: 12)
server_1            | [2022-06-10 10:13:01 +0000] [13] [INFO] Worker exiting (pid: 13)
server_1            | [2022-06-10 10:13:02 +0000] [16] [INFO] Booting worker with pid: 16
server_1            | [2022-06-10 10:13:03 +0000] [17] [INFO] Booting worker with pid: 17
server_1            | [2022-06-10 10:13:03 +0000] [18] [INFO] Booting worker with pid: 18
server_1            | [2022-06-10 10:13:03 +0000] [19] [INFO] Booting worker with pid: 19
nginx_1             | 2022/06/10 10:13:05 [error] 6#6: *2 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 102.39.50.201, server: safdashboard.dedicated.co.za, request: "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1", upstream: "http://172.18.0.8:5000/Microsoft-Server-ActiveSync", host: "154.0.162.233"
nginx_1             | 102.39.50.201 - - [10/Jun/2022:10:13:05 +0000] "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1" 504 183 "-" "Apple-iPhone12C8/1906.77"
nginx_1             | 2022/06/10 10:13:05 [info] 6#6: *2 client 102.39.50.201 closed keepalive connection
nginx_1             | 102.39.50.201 - - [10/Jun/2022:10:13:05 +0000] "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1" 301 185 "-" "Apple-iPhone12C8/1906.77" "-"
http://soluscloud.co.zahttp://soluscloud.co.zaserver_1            | [2022-06-10 10:13:33 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:16)
server_1            | [2022-06-10 10:13:33 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:17)
server_1            | [2022-06-10 10:13:33 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:18)
server_1            | [2022-06-10 10:13:33 +0000] [16] [INFO] Worker exiting (pid: 16)
server_1            | [2022-06-10 10:13:33 +0000] [18] [INFO] Worker exiting (pid: 18)
server_1            | [2022-06-10 10:13:33 +0000] [17] [INFO] Worker exiting (pid: 17)
server_1            | [2022-06-10 10:13:34 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:19)
server_1            | [2022-06-10 10:13:34 +0000] [19] [INFO] Worker exiting (pid: 19)
server_1            | [2022-06-10 10:13:34 +0000] [20] [INFO] Booting worker with pid: 20
server_1            | [2022-06-10 10:13:34 +0000] [21] [INFO] Booting worker with pid: 21
server_1            | [2022-06-10 10:13:34 +0000] [22] [INFO] Booting worker with pid: 22
server_1            | [2022-06-10 10:13:34 +0000] [23] [INFO] Booting worker with pid: 23
server_1            | [2022-06-10 10:14:04 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:20)
server_1            | [2022-06-10 10:14:04 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:21)
server_1            | [2022-06-10 10:14:04 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:22)
server_1            | [2022-06-10 10:14:04 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:23)
server_1            | [2022-06-10 10:14:04 +0000] [21] [INFO] Worker exiting (pid: 21)
server_1            | [2022-06-10 10:14:04 +0000] [22] [INFO] Worker exiting (pid: 22)
server_1            | [2022-06-10 10:14:04 +0000] [20] [INFO] Worker exiting (pid: 20)
server_1            | [2022-06-10 10:14:04 +0000] [23] [INFO] Worker exiting (pid: 23)
scheduler_1         | [2022-06-10 10:14:05,506][PID:1][INFO][redash.tasks.schedule] Scheduling 305c0cae0a196ae96915fd2b6f81001c435aad65 (refresh_queries) with interval 30.
scheduler_1         | [2022-06-10 10:14:05,525][PID:1][INFO][redash.tasks.schedule] Scheduling 59a84668e68687338646f735965213c58e814b32 (remove_ghost_locks) with interval 60.
scheduler_1         | [2022-06-10 10:14:05,537][PID:1][INFO][redash.tasks.schedule] Scheduling 18281a865ed3d1a60f366aeb8596fe2283aa421f (sync_user_details) with interval 60.
scheduler_1         | [2022-06-10 10:14:05,574][PID:1][INFO][redash.tasks.schedule] Scheduling e27209059575fcc17c527c47d0957cb21756e551 (cleanup_query_results) with interval 300.
scheduler_1         | [2022-06-10 10:14:05,577][PID:1][INFO][rq_scheduler.scheduler] Registering birth
nginx_1             | 2022/06/10 10:14:05 [error] 6#6: *7 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 102.39.50.201, server: safdashboard.dedicated.co.za, request: "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1", upstream: "http://172.18.0.8:5000/Microsoft-Server-ActiveSync", host: "154.0.162.233"
nginx_1             | 102.39.50.201 - - [10/Jun/2022:10:14:05 +0000] "OPTIONS /Microsoft-Server-ActiveSync HTTP/1.1" 504 183 "-" "Apple-iPhone12C8/1906.77"
server_1            | [2022-06-10 10:14:05 +0000] [24] [INFO] Booting worker with pid: 24
nginx_1             | 2022/06/10 10:14:05 [info] 6#6: *7 client 102.39.50.201 closed keepalive connection
server_1            | [2022-06-10 10:14:05 +0000] [26] [INFO] Booting worker with pid: 26
server_1            | [2022-06-10 10:14:06 +0000] [25] [INFO] Booting worker with pid: 25
server_1            | [2022-06-10 10:14:06 +0000] [27] [INFO] Booting worker with pid: 27
nginx_1             | 167.248.133.47 - - [10/Jun/2022:10:14:36 +0000] "GET / HTTP/1.1" 301 185 "-" "-" "-"
server_1            | [2022-06-10 10:14:36 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:24)
server_1            | [2022-06-10 10:14:36 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:25)
server_1            | [2022-06-10 10:14:36 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:26)
server_1            | [2022-06-10 10:14:36 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:27)
server_1            | [2022-06-10 10:14:36 +0000] [24] [INFO] Worker exiting (pid: 24)
server_1            | [2022-06-10 10:14:36 +0000] [25] [INFO] Worker exiting (pid: 25)
server_1            | [2022-06-10 10:14:36 +0000] [26] [INFO] Worker exiting (pid: 26)
server_1            | [2022-06-10 10:14:36 +0000] [27] [INFO] Worker exiting (pid: 27)
nginx_1             | 167.248.133.47 - - [10/Jun/2022:10:14:37 +0000] "GET / HTTP/1.1" 301 185 "-" "Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)" "-"
server_1            | [2022-06-10 10:14:37 +0000] [28] [INFO] Booting worker with pid: 28
nginx_1             | 167.248.133.47 - - [10/Jun/2022:10:14:37 +0000] "PRI * HTTP/2.0" 400 173 "-" "-" "-"
server_1            | [2022-06-10 10:14:37 +0000] [29] [INFO] Booting worker with pid: 29
server_1            | [2022-06-10 10:14:37 +0000] [30] [INFO] Booting worker with pid: 30
server_1            | [2022-06-10 10:14:38 +0000] [31] [INFO] Booting worker with pid: 31
nginx_1             | 2022/06/10 10:14:52 [info] 6#6: *13 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 167.248.133.47, server: safdashboard.dedicated.co.za, request: "GET / HTTP/1.1", upstream: "http://172.18.0.8:5000/", host: "154.0.162.233"
nginx_1             | 167.248.133.47 - - [10/Jun/2022:10:14:52 +0000] "GET / HTTP/1.1" 499 0 "-" "Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)"
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] Starting the health check for worker process Checks config: [(<class 'redash.cli.rq.WorkerHealthcheck'>, {})]
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] Installing signal handlers.
scheduled_worker_1  | READY
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] Received TICK_60 event from supervisor
scheduled_worker_1  | RESULT 2
scheduled_worker_1  | OKREADY
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] No processes in state RUNNING found for process worker
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] Received TICK_60 event from supervisor
scheduled_worker_1  | RESULT 2
scheduled_worker_1  | OKREADY
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] No processes in state RUNNING found for process worker
scheduled_worker_1  | 2022/06/10 10:15:03 [worker_healthcheck] Received TICK_60 event from supervisor
scheduled_worker_1  | RESULT 2
scheduled_worker_1  | OKREADY
scheduled_worker_1  | 2022/06/10 10:15:04 [worker_healthcheck] No processes in state RUNNING found for process worker
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] Starting the health check for worker process Checks config: [(<class 'redash.cli.rq.WorkerHealthcheck'>, {})]
worker_1            | READY
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] Installing signal handlers.
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] Received TICK_60 event from supervisor
worker_1            | RESULT 2
worker_1            | OKREADY
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] No processes in state RUNNING found for process worker
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] Received TICK_60 event from supervisor
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] No processes in state RUNNING found for process worker
worker_1            | RESULT 2
worker_1            | OKREADY
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] Received TICK_60 event from supervisor
worker_1            | 2022/06/10 10:15:06 [worker_healthcheck] No processes in state RUNNING found for process worker
worker_1            | RESULT 2
worker_1            | OKREADY
scheduled_worker_1  | [2022-06-10 10:15:06,865][PID:9][INFO][rq.worker] Worker rq:worker:ae9df11b731841e7b678d7d8f0b2500d: started, version 1.5.0
scheduled_worker_1  | [2022-06-10 10:15:06,920][PID:9][INFO][rq.worker] *** Listening on scheduled_queries, schemas...
scheduled_worker_1  | [2022-06-10 10:15:06,961][PID:9][INFO][rq.worker] Cleaning registries for queue: scheduled_queries
scheduled_worker_1  | [2022-06-10 10:15:06,986][PID:9][INFO][rq.worker] Cleaning registries for queue: schemas
server_1            | [2022-06-10 10:15:07 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:28)
server_1            | [2022-06-10 10:15:07 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:29)
server_1            | [2022-06-10 10:15:07 +0000] [28] [INFO] Worker exiting (pid: 28)
server_1            | [2022-06-10 10:15:07 +0000] [29] [INFO] Worker exiting (pid: 29)
worker_1            | [2022-06-10 10:15:08,243][PID:9][INFO][rq.worker] Worker rq:worker:b06d0d843b8347b285ac3381b5f3e2be: started, version 1.5.0
worker_1            | [2022-06-10 10:15:08,270][PID:9][INFO][rq.worker] *** Listening on periodic, emails, default...
worker_1            | [2022-06-10 10:15:08,295][PID:9][INFO][rq.worker] Cleaning registries for queue: periodic
worker_1            | [2022-06-10 10:15:08,304][PID:9][INFO][rq.worker] Cleaning registries for queue: emails
worker_1            | [2022-06-10 10:15:08,341][PID:9][INFO][rq.worker] Cleaning registries for queue: default
worker_1            | [2022-06-10 10:15:08,411][PID:9][INFO][rq.worker] periodic: 18281a865ed3d1a60f366aeb8596fe2283aa421f
worker_1            | [2022-06-10 10:15:08,804][PID:20][INFO][rq.worker] periodic: Job OK (18281a865ed3d1a60f366aeb8596fe2283aa421f)
worker_1            | [2022-06-10 10:15:08,807][PID:20][INFO][rq.worker] Result is kept for 600 seconds
server_1            | [2022-06-10 10:15:08 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:30)
server_1            | [2022-06-10 10:15:08 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:31)
worker_1            | [2022-06-10 10:15:08,949][PID:9][INFO][rq.worker] periodic: 305c0cae0a196ae96915fd2b6f81001c435aad65
server_1            | [2022-06-10 10:15:08 +0000] [32] [INFO] Booting worker with pid: 32
server_1            | [2022-06-10 10:15:09 +0000] [33] [INFO] Booting worker with pid: 33
worker_1            | [2022-06-10 10:15:09,121][PID:21][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Refreshing queries...
server_1            | [2022-06-10 10:15:09 +0000] [34] [INFO] Booting worker with pid: 34
server_1            | [2022-06-10 10:15:09 +0000] [35] [INFO] Booting worker with pid: 35
worker_1            | [2022-06-10 10:15:10,704][PID:21][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Done refreshing queries: {'outdated_queries_count': 0, 'last_refresh_at': 1654856110.6998672, 'query_ids': '[]'}
worker_1            | [2022-06-10 10:15:10,716][PID:21][INFO][rq.worker] periodic: Job OK (305c0cae0a196ae96915fd2b6f81001c435aad65)
worker_1            | [2022-06-10 10:15:10,716][PID:21][INFO][rq.worker] Result is kept for 600 seconds
postgres_1          | LOG:  unexpected EOF on client connection with an open transaction
worker_1            | [2022-06-10 10:15:10,864][PID:9][INFO][rq.worker] periodic: 59a84668e68687338646f735965213c58e814b32
worker_1            | [2022-06-10 10:15:11,180][PID:22][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.remove_ghost_locks job.id=59a84668e68687338646f735965213c58e814b32 Locks found: 0, Locks removed: 0
worker_1            | [2022-06-10 10:15:11,259][PID:22][INFO][rq.worker] periodic: Job OK (59a84668e68687338646f735965213c58e814b32)
worker_1            | [2022-06-10 10:15:11,260][PID:22][INFO][rq.worker] Result is kept for 600 seconds
worker_1            | [2022-06-10 10:15:11,464][PID:9][INFO][rq.worker] periodic: e27209059575fcc17c527c47d0957cb21756e551
worker_1            | [2022-06-10 10:15:11,545][PID:23][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Running query results clean up (removing maximum of 100 unused results, that are 7 days old or more)
worker_1            | [2022-06-10 10:15:12,004][PID:23][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.cleanup_query_results job.id=e27209059575fcc17c527c47d0957cb21756e551 Deleted 0 unused query results.
worker_1            | [2022-06-10 10:15:12,018][PID:23][INFO][rq.worker] periodic: Job OK (e27209059575fcc17c527c47d0957cb21756e551)
worker_1            | [2022-06-10 10:15:12,030][PID:23][INFO][rq.worker] Result is kept for 600 seconds
worker_1            | [2022-06-10 10:15:12,166][PID:9][INFO][rq.worker] periodic: 305c0cae0a196ae96915fd2b6f81001c435aad65
worker_1            | [2022-06-10 10:15:12,307][PID:24][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Refreshing queries...
worker_1            | [2022-06-10 10:15:12,895][PID:24][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Done refreshing queries: {'outdated_queries_count': 0, 'last_refresh_at': 1654856112.8887806, 'query_ids': '[]'}
worker_1            | [2022-06-10 10:15:12,990][PID:24][INFO][rq.worker] periodic: Job OK (305c0cae0a196ae96915fd2b6f81001c435aad65)
worker_1            | [2022-06-10 10:15:12,994][PID:24][INFO][rq.worker] Result is kept for 600 seconds
postgres_1          | LOG:  unexpected EOF on client connection with an open transaction
worker_1            | [2022-06-10 10:15:13,154][PID:9][INFO][rq.worker] periodic: 18281a865ed3d1a60f366aeb8596fe2283aa421f
worker_1            | [2022-06-10 10:15:13,485][PID:25][INFO][rq.worker] periodic: Job OK (18281a865ed3d1a60f366aeb8596fe2283aa421f)
worker_1            | [2022-06-10 10:15:13,486][PID:25][INFO][rq.worker] Result is kept for 600 seconds
worker_1            | [2022-06-10 10:15:13,611][PID:9][INFO][rq.worker] periodic: 305c0cae0a196ae96915fd2b6f81001c435aad65
worker_1            | [2022-06-10 10:15:13,861][PID:26][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Refreshing queries...
worker_1            | [2022-06-10 10:15:14,410][PID:26][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Done refreshing queries: {'outdated_queries_count': 0, 'last_refresh_at': 1654856114.406925, 'query_ids': '[]'}
worker_1            | [2022-06-10 10:15:14,420][PID:26][INFO][rq.worker] periodic: Job OK (305c0cae0a196ae96915fd2b6f81001c435aad65)
worker_1            | [2022-06-10 10:15:14,421][PID:26][INFO][rq.worker] Result is kept for 600 seconds
postgres_1          | LOG:  unexpected EOF on client connection with an open transaction
worker_1            | [2022-06-10 10:15:14,579][PID:9][INFO][rq.worker] periodic: 59a84668e68687338646f735965213c58e814b32
worker_1            | [2022-06-10 10:15:14,742][PID:27][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.remove_ghost_locks job.id=59a84668e68687338646f735965213c58e814b32 Locks found: 0, Locks removed: 0
worker_1            | [2022-06-10 10:15:14,773][PID:27][INFO][rq.worker] periodic: Job OK (59a84668e68687338646f735965213c58e814b32)
worker_1            | [2022-06-10 10:15:14,773][PID:27][INFO][rq.worker] Result is kept for 600 seconds
worker_1            | [2022-06-10 10:15:35,715][PID:9][INFO][rq.worker] periodic: 305c0cae0a196ae96915fd2b6f81001c435aad65
worker_1            | [2022-06-10 10:15:35,859][PID:28][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Refreshing queries...
worker_1            | [2022-06-10 10:15:36,384][PID:28][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Done refreshing queries: {'outdated_queries_count': 0, 'last_refresh_at': 1654856136.3767474, 'query_ids': '[]'}
worker_1            | [2022-06-10 10:15:36,419][PID:28][INFO][rq.worker] periodic: Job OK (305c0cae0a196ae96915fd2b6f81001c435aad65)
worker_1            | [2022-06-10 10:15:36,421][PID:28][INFO][rq.worker] Result is kept for 600 seconds
postgres_1          | LOG:  unexpected EOF on client connection with an open transaction
server_1            | [2022-06-10 10:15:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:32)
server_1            | [2022-06-10 10:15:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:33)
server_1            | [2022-06-10 10:15:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:34)
server_1            | [2022-06-10 10:15:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:35)
server_1            | [2022-06-10 10:15:39 +0000] [33] [INFO] Worker exiting (pid: 33)
server_1            | [2022-06-10 10:15:39 +0000] [35] [INFO] Worker exiting (pid: 35)
server_1            | [2022-06-10 10:15:39 +0000] [34] [INFO] Worker exiting (pid: 34)
server_1            | [2022-06-10 10:15:40 +0000] [36] [INFO] Booting worker with pid: 36
server_1            | [2022-06-10 10:15:40 +0000] [37] [INFO] Booting worker with pid: 37
server_1            | [2022-06-10 10:15:40 +0000] [38] [INFO] Booting worker with pid: 38
server_1            | [2022-06-10 10:15:41 +0000] [39] [INFO] Booting worker with pid: 39
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] Starting the health check for worker process Checks config: [(<class 'redash.cli.rq.WorkerHealthcheck'>, {})]
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] Installing signal handlers.
adhoc_worker_1      | READY
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] Received TICK_60 event from supervisor
adhoc_worker_1      | RESULT 2
adhoc_worker_1      | OKREADY
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] No processes in state RUNNING found for process worker
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] Received TICK_60 event from supervisor
adhoc_worker_1      | RESULT 2
adhoc_worker_1      | OK2022/06/10 10:15:47 [worker_healthcheck] No processes in state RUNNING found for process worker
adhoc_worker_1      | READY
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] Received TICK_60 event from supervisor
adhoc_worker_1      | RESULT 2
adhoc_worker_1      | OKREADY
adhoc_worker_1      | 2022/06/10 10:15:47 [worker_healthcheck] No processes in state RUNNING found for process worker
adhoc_worker_1      | [2022-06-10 10:15:49,259][PID:10][INFO][rq.worker] Worker rq:worker:5dee4f381a8346fa95e1396058091017: started, version 1.5.0
adhoc_worker_1      | [2022-06-10 10:15:49,266][PID:10][INFO][rq.worker] *** Listening on queries...
adhoc_worker_1      | [2022-06-10 10:15:49,284][PID:10][INFO][rq.worker] Cleaning registries for queue: queries
adhoc_worker_1      | [2022-06-10 10:15:49,603][PID:9][INFO][rq.worker] Worker rq:worker:2029bfe94f62490db896b79e53af6756: started, version 1.5.0
adhoc_worker_1      | [2022-06-10 10:15:49,609][PID:9][INFO][rq.worker] *** Listening on queries...
scheduled_worker_1  | 2022/06/10 10:16:01 [worker_healthcheck] Received TICK_60 event from supervisor
scheduled_worker_1  | 2022/06/10 10:16:01 [worker_healthcheck] No processes in state RUNNING found for process worker
scheduled_worker_1  | RESULT 2
scheduled_worker_1  | OKREADY
worker_1            | 2022/06/10 10:16:01 [worker_healthcheck] Received TICK_60 event from supervisor
worker_1            | 2022/06/10 10:16:01 [worker_healthcheck] No processes in state RUNNING found for process worker
worker_1            | RESULT 2
worker_1            | OKREADY
adhoc_worker_1      | 2022/06/10 10:16:01 [worker_healthcheck] Received TICK_60 event from supervisor
adhoc_worker_1      | 2022/06/10 10:16:01 [worker_healthcheck] No processes in state RUNNING found for process worker
adhoc_worker_1      | RESULT 2
adhoc_worker_1      | OKREADY
worker_1            | [2022-06-10 10:16:05,755][PID:9][INFO][rq.worker] periodic: 18281a865ed3d1a60f366aeb8596fe2283aa421f
worker_1            | [2022-06-10 10:16:06,047][PID:29][INFO][rq.worker] periodic: Job OK (18281a865ed3d1a60f366aeb8596fe2283aa421f)
worker_1            | [2022-06-10 10:16:06,047][PID:29][INFO][rq.worker] Result is kept for 600 seconds
worker_1            | [2022-06-10 10:16:06,207][PID:9][INFO][rq.worker] periodic: 305c0cae0a196ae96915fd2b6f81001c435aad65
worker_1            | [2022-06-10 10:16:06,352][PID:30][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Refreshing queries...
worker_1            | [2022-06-10 10:16:06,838][PID:30][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.refresh_queries job.id=305c0cae0a196ae96915fd2b6f81001c435aad65 Done refreshing queries: {'outdated_queries_count': 0, 'last_refresh_at': 1654856166.8278556, 'query_ids': '[]'}
worker_1            | [2022-06-10 10:16:06,873][PID:30][INFO][rq.worker] periodic: Job OK (305c0cae0a196ae96915fd2b6f81001c435aad65)
worker_1            | [2022-06-10 10:16:06,873][PID:30][INFO][rq.worker] Result is kept for 600 seconds
postgres_1          | LOG:  unexpected EOF on client connection with an open transaction
worker_1            | [2022-06-10 10:16:06,981][PID:9][INFO][rq.worker] periodic: 59a84668e68687338646f735965213c58e814b32
worker_1            | [2022-06-10 10:16:07,186][PID:31][INFO][rq.job.redash.tasks.queries.maintenance] job.func_name=redash.tasks.queries.maintenance.remove_ghost_locks job.id=59a84668e68687338646f735965213c58e814b32 Locks found: 0, Locks removed: 0
worker_1            | [2022-06-10 10:16:07,195][PID:31][INFO][rq.worker] periodic: Job OK (59a84668e68687338646f735965213c58e814b32)
worker_1            | [2022-06-10 10:16:07,208][PID:31][INFO][rq.worker] Result is kept for 600 seconds
server_1            | [2022-06-10 10:16:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:36)
server_1            | [2022-06-10 10:16:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:37)
server_1            | [2022-06-10 10:16:11 +0000] [36] [INFO] Worker exiting (pid: 36)
server_1            | [2022-06-10 10:16:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:38)
server_1            | [2022-06-10 10:16:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:39)

I have a backup machine with exactly the same env and docker-compose.yml (except for the HTTPS stuff and Host name) which works perfectly.

jesse · June 14, 2022, 5:27pm

We don’t have this, but it would be a great addition to our documentation. I think it will help significantly with debugging user issues. This is especially true given the diversity of experience different users bring to the table, familiarity with docker, UNIX etc.

jesse · June 14, 2022, 5:29pm

What verison of Redash are you running? How many query runners do you have enabled? How many concurrent queries?

The behaviour you describe sounds like the system may just be starved for resources past a certain point. The failure cadence (every six weeks) suggests it might be a memory leak. It would be helpful to know what you see when you run docker stats.

As for why the identical instance doesn’t have this failure: does the other instance have any users? If not, it would again indicate that the issue here is resource deprivation versus a bug in code.

bbrowski · June 15, 2022, 8:10am

Thank you for at least pointing the way. Its redash:10.1.0.b50633

The part of dockerr-compose.yml (I assume this is where the number of queries and query runners are set) is:

server:
    <<: *redash-service
    command: server
    ports:
      - "5000:5000"
    environment:
      REDASH_WEB_WORKERS: 4
  scheduler:
    <<: *redash-service
    command: scheduler
  scheduled_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
  adhoc_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 2
  worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "periodic emails default"
      WORKERS_COUNT: 1

When I do docker stats I get:

CONTAINER ID        NAME                        CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
05d2a3683640        redash_nginx_1              0.00%               4.59MiB / 872.1MiB    0.53%               1.55MB / 4.9MB      14MB / 0B           2
d13da4ce08b8        redash_scheduler_1          0.10%               209.4MiB / 872.1MiB   24.02%              10.1MB / 14.9MB     343MB / 0B          2
6f4ba48e3593        redash_adhoc_worker_1       0.09%               596.3MiB / 872.1MiB   68.37%              4.04MB / 5MB        17MB / 8.19kB       7
7c4987677051        redash_server_1             185.07%             188.4MiB / 872.1MiB   21.60%              293kB / 49.9kB      13.1MB / 41kB       5
fd374441a4cc        redash_worker_1             0.08%               405.6MiB / 872.1MiB   46.51%              192MB / 40.8MB      5.32MB / 8.19kB     5
b632f15eada6        redash_scheduled_worker_1   0.06%               405.5MiB / 872.1MiB   46.49%              3.06MB / 3.63MB     2.44MB / 8.19kB     5
1644056511c4        redash_redis_1              0.27%               3.797MiB / 872.1MiB   0.44%               54.9MB / 32.2MB     6.14MB / 5.21MB     4
3acb826a1db2        redash_postgres_1           0.00%               5.336MiB / 872.1MiB   0.61%               9.39MB / 177MB      29.4MB / 69.7MB     6

It looks to me like the issue is with the 185% CPU on the redash_server_1. There are very few users using the production server. No one at the moment since I still cant get past the 504 error.

bbrowski · June 20, 2022, 11:48am

So I’m stumped. I tried everything I can think of a I still get the high CPU load that wont settle even after the weekend. Something is wrong.

So I backed everything up from my production server and even made a copy of the working one I have on a local vm. I reprovisioned the server so its a clean Ubuntu 18.04.4 LTS (2 CPU 4Gb RAM).

I ran the setup.sh and installed redash (it worked - I can point my browser to the ip address and I get the setup screen).

But when I restore from backup I get the same CPU at 190+ % and a 504 error in my browser. It doesn’t matter which server backup I try and restore from (production redash:10.1.0.b50633 or local vm redash:10.0.0.b50363) and how I try to restore (rsync, cp, zip/unzip or pgdump). I must be doing something wrong?

So the last restore I tried the following:
Stopped redash on my working vm (sudo docker-compose stop)
Made a backup of the whole redash directory (sudo zip -r backup.zip /opt/redash)
I copied the zip file to the new server with the working install of redash that used the setup.sh file
Sopped redash on the new server (sudo docker-compose stop)
Unziped the backup to overwrite the /opt/redash directory (sudo unzip backup.zip -d /)
Restarted redash (sudo docker-compose up -d)

And we have the same problems 190+ % CPU usage and 504 error in the browser. I can repeat the error if I shutdown the server and do a clean install and then try and restore from the pgpdump (cat ~/backup_redash_postgres.sql | sudo docker exec -i redash_postgres_1 psql -U postgres) or rsync of the files.

What am I doing wrong?

jesse · June 23, 2022, 3:45am

I appreciate your patience while I recover from surgery before addressing this. Indeed this is quite the puzzler. I wonder if you are experiencing the same issue as this poster: Adding additional ec2 for redash - #6 by jesse

He complains that his server crashes unexpectedly and experiences high CPU usage.

jesse · July 6, 2022, 12:16pm

tl;dr increase your RAM to 4GB and report back.

RAM Usage

Docker alone is using 1800MB of RAM, which leaves just 248MB for the host OS to run all its services. The first thing I would try is to increase the RAM to 4GB (the minimum we recommend for use with V10).

If increasing the RAM is not at all possible, try reducing the REDASH_WEB_WORKERS from 4 to 2.

Also, check the output of docker-compose ps to make sure all your containers have started and show “Running” and not an exit code.

Error Analysis

This error in your logs:

[2022-06-10 10:15:08 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:31)

It comes from gunicorn which is the WSGI server for Flask. This is the process that actually handles network requests.

I learned from some web searches that WORKER TIMEOUT has many different causes. Here for example it was an issue with another dependency not starting, which caused gunicorn to fail. Here the problem was the system was starved for resources, which delayed gunicorn worker startup and because no timeout was configured gunicorn simply killed the worker process.

The common denominator in those situations was the servers did not have enough resources to start timely. This also seems the case on your system.

Other possibilities

There could be a problem with the data in your postgres dump. Since a fresh install of Redash run without failure on the same hardware. But I think this is unlikely unless someone was manually manipulating these files.

What is the size of your pg_dump in megabytes?

jesse · July 6, 2022, 1:14pm

Incidental timing: we received this PR this morning fix: increase the default gunicorn timeout by jacek-jablonski · Pull Request #5783 · getredash/redash · GitHub which could address your issue assuming RAM isn’t the problem. Specifically it solves this case:

bbrowski · July 11, 2022, 11:58am

You are my hero “music plays in background”

Seems to have sorted it out - I do have 4gm Ram on the server so it seems the problem was related to this REDASH_GUNICORN_TIMEOUT. After I stared it up again it settled down after about a minute or two and is running fine.

jesse · July 11, 2022, 11:59am

Yes!

Your original request for a fault-finding guide is a good one too. I’ll be working on this.