Celery Status: Failed loading status. Please refresh

jdel · July 25, 2019, 2:42pm

Issue Summary

When I open the Celery Status tab in the Admin section, it randomly displays “Failed loading status. Please refresh.” Sometimes it works again after a reload, but once Redash has been left unattended for a couple of days, reloading no longer helps; only restarting the scheduler or Redis makes the view available again.

Technical details:

Redash Version: redash/redash:7.0.0.b18042
Browser/OS: Chrome 75.0.3770.100 / macOS 10.14.5 (18F132)
How did you install Redash:

Migrated a local install of Redash 3.0.0+b3134 on Ubuntu 16.04.4 LTS (Xenial Xerus) to Docker 7.0.0.b18042 on CoreOS stable.
The migration is in fact a parallel run. I restored the 3.0.0 pgsql database backup into AWS RDS, then executed the migrations in the following order using a docker setup:

redash/redash:3.0.0.b3147
redash/redash:4.0.2.b4720
redash/redash:5.0.0.b4754
redash/redash:6.0.0.b8537
redash/redash:7.0.0.b18042

Everything seems successful, and queries (scheduled ones included) run perfectly fine. Occasionally the message pops up and the page at https://redash.my.domain/admin/queries/tasks gets a 500 from /api/admin/queries/tasks.

Server log

[2019-07-25 08:52:59,827][PID:14][INFO][metrics] method=GET path=/admin/queries/tasks endpoint=redash_index status=304 content_type=text/html; charset=utf-8 content_length=926 duration=0.64 query_count=2 query_duration=4.27
[2019-07-25 08:53:00,768][PID:14][INFO][metrics] method=GET path=/api/session endpoint=redash_session status=200 content_type=application/json content_length=1331 duration=3.59 query_count=3 query_duration=6.02
[2019-07-25 08:53:00,880][PID:14][INFO][metrics] method=GET path=/api/organization/status endpoint=redash_organization_status status=200 content_type=application/json content_length=100 duration=33.92 query_count=7 query_duration=16.14
[2019-07-25 08:53:00,892][PID:14][INFO][metrics] method=GET path=/static/images/favicon-32x32.png endpoint=static status=200 content_type=image/png content_length=2005 duration=0.57 query_count=2 query_duration=4.66
[2019-07-25 08:53:01,000][PID:14][INFO][metrics] method=GET path=/api/dashboards/favorites endpoint=dashboard_favorites status=200 content_type=application/json content_length=55 duration=17.37 query_count=4 query_duration=10.72
[2019-07-25 08:53:01,036][PID:14][INFO][metrics] method=GET path=/api/queries/favorites endpoint=query_favorites status=200 content_type=application/json content_length=55 duration=22.94 query_count=4 query_duration=12.73
[2019-07-25 08:53:01,162] ERROR in app: Exception on /api/admin/queries/tasks [GET]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1641, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 271, in error_router
return original_handler(e)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1544, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1639, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1625, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/redash/permissions.py", line 48, in decorated
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/flask_login/utils.py", line 228, in decorated_view
return func(*args, **kwargs)
File "/app/redash/handlers/admin.py", line 51, in queries_tasks
'tasks': celery_tasks(),
File "/app/redash/monitor.py", line 132, in celery_tasks
tasks = parse_tasks(celery.control.inspect().active(), 'active')
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 108, in active
return self._request('active')
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 95, in _request
timeout=self.timeout, reply=True,
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 454, in broadcast
limit, callback, channel=channel,
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 321, in _broadcast
channel=chan)
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 360, in _collect
self.connection.drain_events(timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/virtual/base.py", line 963, in drain_events
get(self._deliver, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 366, in get
ret = self.handle_event(fileno, event)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 348, in handle_event
return self.on_readable(fileno), self
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 344, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 721, in _brpop_read
**options)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 636, in read_response
raise e
ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')
[2019-07-25 08:53:01,162][PID:14][ERROR][redash] Exception on /api/admin/queries/tasks [GET]
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1988, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1641, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python2.7/dist-packages/flask_restful/__init__.py", line 271, in error_router
return original_handler(e)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1544, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1639, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1625, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/redash/permissions.py", line 48, in decorated
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/flask_login/utils.py", line 228, in decorated_view
return func(*args, **kwargs)
File "/app/redash/handlers/admin.py", line 51, in queries_tasks
'tasks': celery_tasks(),
File "/app/redash/monitor.py", line 132, in celery_tasks
tasks = parse_tasks(celery.control.inspect().active(), 'active')
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 108, in active
return self._request('active')
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 95, in _request
timeout=self.timeout, reply=True,
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 454, in broadcast
limit, callback, channel=channel,
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 321, in _broadcast
channel=chan)
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 360, in _collect
self.connection.drain_events(timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/virtual/base.py", line 963, in drain_events
get(self._deliver, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 366, in get
ret = self.handle_event(fileno, event)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 348, in handle_event
return self.on_readable(fileno), self
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 344, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 721, in _brpop_read
**options)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 768, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 636, in read_response
raise e
ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')
[2019-07-25 08:53:01,163][PID:14][INFO][metrics] method=GET path=/api/admin/queries/tasks endpoint=redash_queries_tasks status=500 content_type=? content_length=-1 duration=14.79 query_count=3 query_duration=6.03
[2019-07-25 08:53:02,069][PID:14][INFO][metrics] method=POST path=/api/events endpoint=events status=200 content_type=application/json content_length=4 duration=2.40 query_count=2 query_duration=4.86
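
To see when the failures happen (for example, whether they cluster after long idle periods), the 5xx entries can be pulled out of a log like the one above. This is a minimal sketch, assuming Python 3 and the metrics line layout shown in this log; parse_metrics_line and failed_requests are hypothetical helper names, not part of Redash.

```python
import re

# Matches the "[timestamp][PID:n][INFO][metrics] method=... path=... endpoint=... status=..."
# prefix of a metrics line; the remaining key=value pairs are ignored.
METRICS_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[PID:\d+\]\[INFO\]\[metrics\] "
    r"method=(?P<method>\S+) path=(?P<path>\S+) endpoint=(?P<endpoint>\S+) "
    r"status=(?P<status>\d+)"
)


def parse_metrics_line(line):
    """Return timestamp, method, path, endpoint and status as a dict, or None."""
    match = METRICS_LINE.match(line.strip())
    return match.groupdict() if match else None


def failed_requests(lines):
    """Keep only the metrics records whose HTTP status is 5xx."""
    records = (parse_metrics_line(line) for line in lines)
    return [r for r in records if r is not None and r["status"].startswith("5")]
```

Feeding it the lines of the server log (for instance from an open file object) gives one record per failed request, so the timestamps can be compared with the periods when nobody was using the instance.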

This looks like a Redis connection issue, but the Redis container is up and the Redis debug logs don’t show anything unusual. I even set --tcp-timeout 0 to be sure.
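
One way to check whether idle connections to Redis are being dropped somewhere between the containers is to hold a raw TCP connection open and PING it before and after an idle period. This is only a sketch, assuming Python 3 and a Redis without authentication; redis_ping and probe_idle_connection are hypothetical helper names, not part of Redash or redis-py, and whether idle drops are the actual cause here is an assumption to be tested, not a conclusion.

```python
import socket
import time


def redis_ping(sock):
    """Send an inline PING command and return True if the reply is +PONG."""
    sock.sendall(b"PING\r\n")
    return sock.recv(64).startswith(b"+PONG")


def probe_idle_connection(host, port, idle_seconds, timeout=5.0):
    """PING, stay idle, then PING again on the same socket.

    Returns "ok" if both pings succeed, otherwise a short failure description.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if not redis_ping(sock):
                return "unexpected reply to first PING"
            time.sleep(idle_seconds)  # leave the connection untouched
            if not redis_ping(sock):
                return "no PONG after idling (connection closed?)"
            return "ok"
    except OSError as exc:  # covers connection resets, broken pipes and timeouts
        return "failed: %r" % (exc,)
```

Calling probe_idle_connection("redis", 6379, idle_seconds=1200) from inside the server container would be one way to try it; the host, port and idle time are placeholders. If short idle times return "ok" and long ones fail, the connection is probably being dropped while idle rather than by Redis restarting.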

For reference, I am using a compose file based on the one provided in the setup instructions with some modifications:

Compose file format 3.7, deployed in a single-node Docker swarm
Workers use replicas instead of WORKERS_COUNT processes per container
The entrypoint/command had to be overridden in order to pip install ldap3
The stack network (redashnet) uses the overlay driver, as swarm doesn’t support bridge
The front end is reverse proxied by Traefik with TLS termination, which runs in another stack on the swarmnet network

Compose file

version: '3.7'
x-redash-service: &redash-service
  # This reflects the order DB migrations have been applied from version 3
  #image: redash/redash:3.0.0.b3147
  #image: redash/redash:4.0.2.b4720
  #image: redash/redash:5.0.0.b4754
  #image: redash/redash:6.0.0.b8537
  image: redash/redash:7.0.0.b18042
  depends_on:
    - redis
  env_file: /etc/redash.env

services:
  server:
    <<: *redash-service
    entrypoint: [bash]
    command: [-c, pip install ldap3 && /app/bin/docker-entrypoint server]
    # The below command is to be used in replacement of the above for DB schema upgrades
    #command: [-c, pip install ldap3 && /app/bin/docker-entrypoint manage db upgrade]
    networks:
      - swarmnet
      - redashnet
    ports:
      - 5000:5000
    environment:
      REDASH_WEB_WORKERS: 4
    deploy:
      replicas: 1
      labels:
        - traefik.enable=true
        - traefik.metrics.port=5000
        - traefik.metrics.frontend.rule=Host:${HOSTNAME}
  scheduler:
    <<: *redash-service
    entrypoint: [bash]
    command: [-c, pip install ldap3 && /app/bin/docker-entrypoint scheduler]
    networks:
      - redashnet
    environment:
      QUEUES: "celery"
      WORKERS_COUNT: 1
    deploy:
      replicas: 1
  scheduled-worker:
    <<: *redash-service
    entrypoint: [bash]
    command: [-c, pip install ldap3 && /app/bin/docker-entrypoint worker]
    networks:
      - redashnet
    environment:
      QUEUES: "scheduled_queries,schemas"
      WORKERS_COUNT: 1
    deploy:
      replicas: 2
  adhoc-worker:
    <<: *redash-service
    entrypoint: [bash]
    command: [-c, pip install ldap3 && /app/bin/docker-entrypoint worker]
    networks:
      - redashnet
    environment:
      QUEUES: "queries"
      WORKERS_COUNT: 1
    deploy:
      replicas: 1
  redis:
    image: redis:5.0-alpine
    command: redis-server --tcp-timeout 0 --loglevel verbose
    networks:
      - redashnet

networks:
  swarmnet:
    external: true
    name: swarmnet
  redashnet:
    name: redashnet

Redash environment file

# GENERAL
REDASH_REDIS_URL=redis://redis:6379/0
REDASH_HOST=https://redash.my.domain/
REDASH_LOG_LEVEL=DEBUG
REDASH_DATABASE_URL=postgresql://redash:redacted@redash-db.my.domain/redash
REDASH_COOKIE_SECRET=REDACTED

# LDAP
REDASH_PASSWORD_LOGIN_ENABLED=false
REDASH_LDAP_LOGIN_ENABLED=true
REDASH_LDAP_URL=ldaps://ldap.my.domain
REDASH_LDAP_BIND_DN=uid=redash,cn=sysaccounts,cn=etc,dc=domain,dc=my
REDASH_LDAP_BIND_DN_PASSWORD=REDACTED
REDASH_SEARCH_DN=cn=users,cn=accounts,dc=domain,dc=my
REDASH_LDAP_SEARCH_TEMPLATE=(uid=%(username)s)

How could I troubleshoot this further?

Thanks in advance

arikfr · July 25, 2019, 7:22pm

Thank you for the detailed post!

Can you check the logs to see whether the Redis container was restarted at some point?
Also, next time it happens, can you restart only the server service instead of the other ones and see if that resolves the issue?

jdel · July 26, 2019, 7:57am

Hello @arikfr,
I have in fact already checked both points while troubleshooting.

The Redis container does not restart, and restarting the server container does not help.

In the meantime I have redeployed the whole thing with docker-compose to avoid using the swarm overlay network, but I have already seen the error message a couple of times. I will let it run for a little while to see how it turns out.

jdel · October 25, 2019, 9:30am

This weird behaviour seems to have been fixed with version 8. I haven’t been able to reproduce the error since.
