Redash v8-v10 upgrade issue on AWS

krishnaku · November 30, 2021, 1:17am

Issue Summary

I am running into an issue upgrading a self-hosted Redash server on AWS EC2 from v8 to v10.

I followed the instructions in the upgrade guide, and the individual steps completed without errors, but once I run docker-compose up -d at the end, the server service keeps crashing and rebooting its workers constantly. There don't seem to be any useful errors to troubleshoot what is going on.

The database migration ran, and the alembic_version in the database is currently 89bc7873a3e0, which I believe is the v10 head.

These are the docker-compose logs for the server.

Attaching to redash_server_1
server_1            | [2021-11-30 00:50:29 +0000] [1] [INFO] Starting gunicorn 20.0.4
server_1            | [2021-11-30 00:50:29 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
server_1            | [2021-11-30 00:50:29 +0000] [1] [INFO] Using worker: sync
server_1            | [2021-11-30 00:50:29 +0000] [9] [INFO] Booting worker with pid: 9
server_1            | [2021-11-30 00:50:29 +0000] [8] [INFO] Booting worker with pid: 8
server_1            | [2021-11-30 00:50:29 +0000] [10] [INFO] Booting worker with pid: 10

server_1            | [2021-11-30 00:50:29 +0000] [11] [INFO] Booting worker with pid: 11
server_1            | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
server_1            | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
server_1            | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
server_1            | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
server_1            | [2021-11-30 00:51:00 +0000] [16] [INFO] Booting worker with pid: 16
server_1            | [2021-11-30 00:51:00 +0000] [17] [INFO] Booting worker with pid: 17
server_1            | [2021-11-30 00:51:00 +0000] [18] [INFO] Booting worker with pid: 18
server_1            | [2021-11-30 00:51:00 +0000] [19] [INFO] Booting worker with pid: 19

This sequence just keeps repeating and the EC2 instance becomes unresponsive and needs to be restarted.
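One detail in the log above is worth measuring: each worker is killed exactly 30 seconds after it boots, which matches gunicorn's default worker timeout. That suggests the workers are hanging during startup rather than failing with a traceback. The following is a minimal sketch for extracting that interval from `docker-compose logs` output. The function name `worker_lifetimes` and the `SAMPLE` text are illustrative and not part of Redash or gunicorn, and the sketch assumes all timestamps share the same UTC offset.

```python
import re
from datetime import datetime

# Matches gunicorn lines such as:
#   [2021-11-30 00:50:29 +0000] [9] [INFO] Booting worker with pid: 9
# A leading "server_1 | " prefix from docker-compose is tolerated.
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"\[\d+\] \[\w+\] (?P<msg>.*)"
)
BOOT_RE = re.compile(r"Booting worker with pid: (\d+)")
TIMEOUT_RE = re.compile(r"WORKER TIMEOUT \(pid:(\d+)\)")


def worker_lifetimes(log_text):
    """Return {pid: seconds from boot to WORKER TIMEOUT} for timed-out workers."""
    booted = {}
    lifetimes = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip blank lines and non-gunicorn output
        # Assumption: every line uses the same UTC offset, so it is ignored.
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        boot = BOOT_RE.search(match.group("msg"))
        timeout = TIMEOUT_RE.search(match.group("msg"))
        if boot:
            booted[int(boot.group(1))] = ts
        elif timeout:
            pid = int(timeout.group(1))
            if pid in booted:
                lifetimes[pid] = (ts - booted[pid]).total_seconds()
    return lifetimes


# Illustrative excerpt copied from the log in this post.
SAMPLE = """\
server_1 | [2021-11-30 00:50:29 +0000] [9] [INFO] Booting worker with pid: 9
server_1 | [2021-11-30 00:50:29 +0000] [8] [INFO] Booting worker with pid: 8
server_1 | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
server_1 | [2021-11-30 00:50:59 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
"""

if __name__ == "__main__":
    for pid, seconds in sorted(worker_lifetimes(SAMPLE).items()):
        print(f"worker {pid} timed out after {seconds:.0f}s")
```

If every lifetime equals the configured timeout, raising the timeout is unlikely to help on its own; the more useful question is what the workers are blocked on while starting.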


Technical details:

Redash Version: v10.1.0
Browser/OS: chrome/macos
How did you install Redash: AWS EC2 image for v8, upgraded via the upgrade process.

Docker-compose (updated for v10)

version: "2"
x-redash-service: &redash-service
  image: redash/redash:10.0.0.b50363
  depends_on:
    - postgres
    - redis
  env_file: /opt/redash/env
  restart: always
services:
  server:
    <<: *redash-service
    command: server
    ports:
      - "5000:5000"
    environment:
      REDASH_WEB_WORKERS: 4
  scheduler:
    <<: *redash-service
    command: scheduler
 
  worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "periodic emails default"
      WORKERS_COUNT: 1

  scheduled_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
  adhoc_worker:
    <<: *redash-service
    command: worker
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 2
  redis:
    image: redis:5.0-alpine
    restart: always
  postgres:
    image: postgres:9.6-alpine
    env_file: /opt/redash/env
    volumes:
      - /opt/redash/postgres-data:/var/lib/postgresql/data
    restart: always
    ports:
      - "5432:5432"

  nginx:
    image: redash/nginx:latest
    ports:
      - "80:80"
      - "443:443"

    depends_on:
      - server
    links:
      - server:redash
    volumes:
      - /opt/redash/nginx/nginx.conf:/etc/nginx/conf.d/default.conf
      - /opt/redash/nginx/certs:/etc/letsencrypt
      - /opt/redash/nginx/certs-data:/data/letsencrypt
    restart: always

jesse · November 30, 2021, 4:43am

Thanks for your question. In case you want to see what this process looks like in real time, there’s a demo on YouTube here: Upgrade from V8 to V10 Walkthrough - YouTube

To your specific situation: what exact steps did you follow to upgrade your instance? There are special instructions for upgrading to V10 from V8 (see here). It seems like you may not have rebuilt your containers (step 7).

Second, is there any reason why you’re upgrading to 10.0 instead of 10.1, which includes the security patches we pushed last week?

krishnaku · November 30, 2021, 3:45pm

Hi Jesse,

Thanks for the quick response!
I went back through the video and followed the steps to migrate from my v8 instance to v10 again, but I am still facing the same issue.

Things I missed the last time around:

I had removed the environment settings in docker-compose only for the scheduler service, but your video shows that you remove them from all the other worker service entries as well (I am not sure whether this is material, since it seems related to replacing Celery with RQ, but I did not think it would hurt).
I hit Ctrl+C to stop the docker-compose up --force-recreate --build command. I had run this command before, but I had not stopped it.
I reran the migrations to ensure that nothing was missed (the database had already been migrated in a previous run).

But I am still seeing the following in the docker logs for the server.

[2021-11-30 15:34:21 +0000] [1] [INFO] Starting gunicorn 20.0.4
[2021-11-30 15:34:21 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2021-11-30 15:34:21 +0000] [1] [INFO] Using worker: sync
[2021-11-30 15:34:21 +0000] [8] [INFO] Booting worker with pid: 8
[2021-11-30 15:34:21 +0000] [9] [INFO] Booting worker with pid: 9
[2021-11-30 15:34:21 +0000] [10] [INFO] Booting worker with pid: 10
[2021-11-30 15:34:21 +0000] [11] [INFO] Booting worker with pid: 11
[2021-11-30 15:34:51 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
[2021-11-30 15:34:51 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2021-11-30 15:34:51 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
[2021-11-30 15:34:51 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
[2021-11-30 15:34:51 +0000] [8] [INFO] Worker exiting (pid: 8)
[2021-11-30 15:34:51 +0000] [9] [INFO] Worker exiting (pid: 9)
[2021-11-30 15:34:51 +0000] [10] [INFO] Worker exiting (pid: 10)
[2021-11-30 15:34:51 +0000] [11] [INFO] Worker exiting (pid: 11)
[2021-11-30 15:34:53 +0000] [17] [INFO] Booting worker with pid: 17
[2021-11-30 15:34:53 +0000] [16] [INFO] Booting worker with pid: 16
[2021-11-30 15:34:53 +0000] [18] [INFO] Booting worker with pid: 18
[2021-11-30 15:34:53 +0000] [19] [INFO] Booting worker with pid: 19

For some reason, the server worker processes started by gunicorn are crashing. Is there a way to turn on debug logs to see what the issue might be?

I was also running the v8 instance on a t3.small EC2 instance. I upgraded to a t3a.medium instance thinking it might be a memory issue, but it does not seem to have had any impact.

Any further thoughts/suggestions on what I might do here?

Krishna

krishnaku · November 30, 2021, 4:03pm

Actually, it was something silly on my end. I had the wrong image tag in the docker-compose file: I was using the 10.0 image instead of the 10.1 image. Changing that fixed the issue, and the v10 instance is up now.
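For anyone who wants to catch this kind of mismatch before starting the containers, here is a minimal sketch that reads the Redash image tag out of a compose file and compares it with the release line you intended to install. The helper names `redash_image_tags` and `tag_matches` are hypothetical, the check is a plain text match rather than a real YAML parse, and the sample text is a trimmed copy of the compose file from the first post.

```python
import re

# Matches lines such as "  image: redash/redash:10.0.0.b50363" and captures the tag.
# Other images (redis, postgres, redash/nginx) are ignored on purpose.
IMAGE_RE = re.compile(r"^\s*image:\s*redash/redash:(\S+)\s*$", re.MULTILINE)


def redash_image_tags(compose_text):
    """Return every redash/redash image tag found in the compose file text."""
    return IMAGE_RE.findall(compose_text)


def tag_matches(tag, expected_release):
    """True if the tag belongs to the expected release line, e.g. '10.1'."""
    return tag.startswith(expected_release + ".")


# Trimmed excerpt of the compose file shown earlier in this thread.
SAMPLE_COMPOSE = """\
version: "2"
x-redash-service: &redash-service
  image: redash/redash:10.0.0.b50363
  restart: always
services:
  redis:
    image: redis:5.0-alpine
  nginx:
    image: redash/nginx:latest
"""

if __name__ == "__main__":
    expected = "10.1"  # the release line you meant to upgrade to
    for tag in redash_image_tags(SAMPLE_COMPOSE):
        status = "ok" if tag_matches(tag, expected) else "MISMATCH"
        print(f"redash/redash:{tag} -> {status} (expected {expected}.x)")
```

In practice you would read the text from /opt/redash/docker-compose.yml (or wherever your compose file lives) instead of using the embedded sample.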

Thanks again for the detailed instructions!

Krishna
