EC2 Deployment Instance Size Issue

tl;dr minimum EC2 instance size for V10 is t2.medium. Any ideas why?

Background

One of our V10 release follow-ups is to build new cloud images for V10. We build these images by running our setup script through Packer. Until this work is finished, you can use V10 in EC2 still, just deploy a V8 image and upgrade it.

Per our documentation:

Launch the instance with the pre-baked AMI we create (for small deployments t2.small should be enough):

The Problem

A t2.small instance doesn’t have enough RAM for V10. We didn’t expect this. And it’s quite annoying. Because the V8 image will deploy on t2.small and will run normally. But once you upgrade to V10 and restart the containers, the instance becomes non-responsive.

You need to change the instance type to t2.medium before service is restored.

V8 Resource Consumption

Here’s the output from sudo docker stats on a V8 instance that wasn’t upgraded. Notice total available RAM is just under 2gb. About 75% of this is in use.

CONTAINER ID        NAME                        CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
074069f3c875        redash_nginx_1              0.00%               2.492MiB / 1.945GiB   0.13%               8.01MB / 3.52MB     34.8MB / 0B         2
ba89d824f682        redash_adhoc_worker_1       0.05%               291.7MiB / 1.945GiB   14.65%              126MB / 59.3MB      101MB / 0B          3
6aba3ac4a872        redash_scheduler_1          0.17%               274.8MiB / 1.945GiB   13.80%              138MB / 81.8MB      645MB / 12.1MB      3
39d98fe4cf2a        redash_server_1             0.01%               670.3MiB / 1.945GiB   33.65%              1.16MB / 8.62MB     302MB / 0B          5
b2a9ccd1285b        redash_scheduled_worker_1   0.13%               190.2MiB / 1.945GiB   9.55%               126MB / 68MB        90.3MB / 0B         2
1e40cfb160a2        redash_postgres_1           0.01%               13.39MiB / 1.945GiB   0.67%               6.09MB / 5.45MB     382MB / 134MB       12
bc73e56ac9a2        redash_redis_1              0.18%               2.477MiB / 1.945GiB   0.12%               204MB / 385MB       84.6MB / 71.4MB     4

V10 Resource Consumption

Here’s the output from sudo docker stats on a V8 instance upgraded to V10. Total available RAM is just under 4gb (because this is a t2.medium instance) and total RAM consumption is around 2100Mb.

CONTAINER ID        NAME                    CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d013031b8481        redash_nginx_1          0.00%               3.188MiB / 3.842GiB   0.08%               8.7MB / 4.42MB      0B / 0B             2
e44e049a70cf        redash_scheduler_1      0.00%               195.1MiB / 3.842GiB   4.96%               10.5MB / 18.7MB     0B / 0B             2
e0e0d1829567        redash_adhoc_worker_1   0.03%               589MiB / 3.842GiB     14.97%              4.49MB / 6.42MB     627kB / 0B          7
61bcba583f8e        redash_worker_1         0.03%               590.3MiB / 3.842GiB   15.01%              35.3MB / 54.4MB     77.8kB / 0B         7
7eb1a4502a89        redash_server_1         0.01%               772.1MiB / 3.842GiB   19.63%              1.61MB / 9.19MB     1.15MB / 0B         9
ff160d1d6d48        redash_postgres_1       0.00%               11.86MiB / 3.842GiB   0.30%               11.1MB / 11.5MB     762kB / 42.8MB      10
149b61ecedcd        redash_redis_1          0.15%               2.473MiB / 3.842GiB   0.06%               69.6MB / 39.9MB     0B / 2.54MB         4

Analysis

The V10 instance containers are using than the maximum RAM available on the t2.small instance (2gb). This means the system starts swapping, which is slow. This is the cause of degraded performance on the upgraded system.

What isn’t clear is why the containers use more memory in V10 than V8. The only containers that are comparable are the redis, postgres, and nginx containers (which come from images we don’t maintain).

From what I can tell, there’s no reason V10 should require this much RAM for a basic deployment. But I’d like to figure this out before we build the new images. In case there is an easy tweak we can make to the setup script that avoids it.

As an initial sanity check, to help narrow down potential causes… will the V10 application run if you build its image with the same NodeJS version as the V8 images (eg node:10)?

Note - I’m not suggesting to do this for the production build. More suggesting it as something to investigate. :slight_smile:

If it does build and run ok, I’d try that and see if it’s the newer NodeJS version causing the problem.

Ditto, similar type of thought for the Python processes. Maybe see if you can run the new V10 Python code using the older V8 python base (or the old v8 code on the new V10 python base), then compare.

Heh, I should probably ensure I’m not super tired when posting replies to things. :wink:

That being said, the concept from my post above - try running both code bases using the same foundation - is a reasonable approach for trying to figure out runtime operational differences.