Its fast on my machine, and slow on AWS
We built a system consisting of a bunch of processes running on the jvm, a mix of rabbitmq and http for interprocess communication, graylog for log aggregation, and mysql and postgresql as the data stores. We hosted the development, test and production environments on AWS, and we configured the entire stack using Cloudformation with all kinds of neat continuous deployment etc. The curse of it was that, under load, one of the processes was so much slower on the AWS platform that it was on my 2012 Macbook Pro. Basically, on AWS the app ground to a stand-still, started failing EC2 instance health checks, and eventually got pulled down by a number of failed ELB health checks. The next instances that came up would invariably meet the same fate. And locally it ripped through the work.
We tried many things to fix the issue, in a pretty haphazard fashion, until we lost patience with it, and decided to get systematic in getting to the bottom of the issue. Whats presented here is a summary of that process, which I reckon is a neat approach to solving these kind of issues.
Isolate the environment
Because my Macbook Pro is so different to the Ubuntu environment we run on AWS, I decided to make a Docker image containing the entire stack required by the slow system. That means installing RabbitMQ, Graylog, Mysql, Java 8 and the Jar file of the app itself. Also restoring the latest backup of production. Once I had that image defined, I ran through a sizeable job and timed it with a stop watch. Relative to AWS, it was fast – it took 3:50 (3 minutes 50 seconds) to do the work. Just to give you an idea, the AWS setup was taking 25 to 30 minutes, and seldom worked without dead letter messages and a host of other failure cases.
So I ran up a c4.xlarge, which was closest in spec to my laptop, to see if it was a problem with the basic performance of AWS itself. I installed the docker image and fired up the container and ran the same test. It took 2:39 to run the same work, so there was not a problem with AWS itself. This was exciting because it was clear that whatever the problem was, it was in the set of differences between my docker image and my AWS config – or it was due to the use of docker itself.
Introduce the old environment, one step at a time
So all I had to do was start introducing parts of the AWS environment into my docker setup. My first guess was the RDS database. So I edited my config file to use the RDS Mysql instead of the one in the docker image and re-ran my test. It took 3:31, which was a good bit slower but not the smoking gun. Darn it! All that was left to me was to spend the next ten hours systematically going through the following:
- Our AWS setup used Magnetic disks, my docker c4.xlarge was using SSD. Surely this was it. Nope – it ran in 3:40
- Was it the jvm args? In AWS we ran with the G1 garbage collector, and with no jvm args in docker. Nope, made almost no difference – it run in 3:43
- What about external Graylog instead of the one local to docker? Yes, no? Well, no – it ran in 3:45
- There was a difference in the instance sizes. Docker was on c4.xlarge, the app on AWS was on a t2.medium. Surprisingly not much difference – it ran in 3:52
- Ok, next. What about external RabbitMQ instead of the local one? Also no – it ran in 3:58
- Was it due to the fact that in AWS the app was in a private subnet, and my docker image was in a public one? Nope – it ran in the same time.
- Could it be due to the fact that the ELB health check against my AWS app was using a database connection for each call, and docker was not fronted by an ELB at all? Sticking in a static health check page served by “python -m SimpleHTTPServer” made no difference at all.
(Using docker images and AWS AMIs made the change-test-repeat cycle faster and relatively painless.)
At this stage my docker image was getting very close to the setup of my AWS app, and was still kicking its ass. And I was starting to tear my hair out. So I started to switch tack.
Move the old environment towards the new
What differences remained? I wondered it it was something to do with the packer-created AMI my app was running on. Or due maybe to the fact that the stack was created using Cloudformation and for the docker test I just launched a plain Ubuntu 14:04 AMI.
So I logged onto one of the AWS app machines, stubbed out its health check with my static python server, and killed the app process. Now the machine was mine to do with as I pleased. I stuck docker on it, and ran up the app in docker. Then I tore down my static health check page, and allowed the docker process to be the app. Amazingly, the app worked really fast. About 4 minutes, which was in the ballpark. At the stage I started to doubt myself, so I went back to run the test using just the AWS setup, and almost straight way it started grinding to a halt and failing messages and just getting killed by ELB health checks.
So it had something to do with the way the app was launched by my Cloudformation arrangement. Getting close, getting close! And then it hit me. As I was watching the stdout of the struggling AWS app crawling along, I realised that all this output was streaming to /var/log/cloud-init-output.log. And these were magnetic disks and it might be taking ages to flush to disk. So I changed the launch line in the UserData section of my LaunchConfiguration from
nohup java -jar app.jar
nohup java -jar app.jar > /dev/null
I was not concerned about losing log information because everything was being piped to Graylog anyway. Then I re-ran the test, and it came in at the same time as my docker image – around 4 minutes. God darn it, this was the issue. To much logging to slow disk, totally stalling the machine. What? Really, something as dumb as that? Oh dear.
Small systematic steps
So anyway, we’re out of the woods with this issue, and I’m mighty glad. The thing I hope I remember in future when faced with this kind of issue again is to isolate the environment using docker and creep towards the broken environment one step at a time, measuring the effect of each change until the offender reveals itself. No bug can survive small, systematic steps.