Posted in Software Development on 27 May 2020 at 21:39 UTC
This post is a followup to "IPXWrapper testing infrastructure"; consider reading that first if you haven't already.
At the end of the last article (in September 2017, actually), I had a fully automated regression testing system for IPXWrapper. In September 2019, after two years of not touching the system and doing a little work on IPXWrapper itself, I felt it was time to install Windows updates in the VM images... and that's where everything went wrong.
The first issue I had was that the host machine wouldn't start up; after sitting around for only a year or two and moving house a couple of times, the platters had gotten gummed up. Careful repairs (hitting it until it spun up) were necessary.
After that I booted the reference images one at a time, installed updates, rebooted, installed updates, rebooted, left the VM to sit and let Windows do whatever housekeeping/bitcoin mining/spinning it does before moving on to installing more updates.
Eventually all the VMs were up to date, so I ran the test suite and... it failed. Wheeeeeee.
I never bothered getting to the bottom of why, but at some point in those two years, Windows 10 was so heavily optimised that the host (Intel Q6700 quad core w/ 8GiB RAM) could no longer cope with running three parallel instances. Even "idle" instances with no Internet connectivity hammered the system aggressively enough that the tests would repeatedly time out.
Not wanting to invest in quantum computing or building my own datacentre at the time, I looked around to see if this "cloud" thing could solve my problems. To run VirtualBox you need either bare metal hardware, or nested virtualisation support. The former is available from several hosting companies, whereas the latter still isn't considered production-ready (to my knowledge).
In the end I settled on using Amazon. Not because they were the cheapest for bare metal hosting (spoiler: they aren't), but because their toybox allowed me to keep everything in standby for the rare occasion I actually want to spin it up.
The first step in making this cheap was to reduce the size of the VM images. Each Windows version had 3 VMs, which differed only in IP address. Reducing that to one which takes its configuration from a DHCP server and aggressively cleaning up and compacting the disk images brought the total size of the images down from ~150GB to ~50GB.
AWS has two storage backends which are relevant here:
Amazon S3 is an object store, where you upload individual files and then download them again as necessary. Data storage here is cheap enough for my needs and downloads within the same AWS region (data centre?) are free.
Amazon EBS holds network-accessible block devices which are attached to your virtual (or real) machines as primary storage. Storage here costs 4-5 times as much as S3, and you're charged for I/O.
The choice is obvious... use a RAMDISK!
The m5.metal EC2 instance type is a dedicated physical machine with 48 cores (96 threads), 384GiB of RAM and 25Gbps of network bandwidth. That is enough for us to create a 192GiB ramdisk at boot, format it as btrfs, download all ~50GB of images from S3 and still have more than enough computer left over to run six(!) instances of the test suite (each one requiring 4 VMs) in parallel.
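To put rough numbers on that headroom, here is a back-of-the-envelope sketch using only the figures quoted above. The function name is made up for illustration, and the calculation ignores host OS overhead and any growth of the disk images while the VMs run, so treat the results as an upper bound rather than a measurement.

```python
GIB = 1024 ** 3  # binary gibibyte, as used for the RAM and ramdisk sizes
GB = 1000 ** 3   # decimal gigabyte, as used for the image size

def headroom(total_ram_gib=384, ramdisk_gib=192, images_gb=50,
             suites=6, vms_per_suite=4):
    """Estimate what is left once the ramdisk and images are accounted for."""
    vm_count = suites * vms_per_suite
    # RAM not claimed by the ramdisk, shared evenly between all VMs.
    ram_per_vm_gib = (total_ram_gib - ramdisk_gib) / vm_count
    # Ramdisk space left after the images are downloaded.
    ramdisk_free_gib = ramdisk_gib - (images_gb * GB) / GIB
    return vm_count, ram_per_vm_gib, ramdisk_free_gib

vm_count, ram_per_vm_gib, ramdisk_free_gib = headroom()
print(f"{vm_count} VMs, up to {ram_per_vm_gib:.1f} GiB RAM each, "
      f"{ramdisk_free_gib:.1f} GiB of ramdisk spare")
```

With the defaults this works out to 24 VMs sharing the 192GiB of RAM that the ramdisk doesn't claim.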
When in standby, all I have is the test images stored in S3, and a small EBS volume holding a Linux install.
When required, an EC2 instance is created from that EBS volume which provisions itself as described above, before starting the Buildkite agent and waiting for jobs. The longest delay here is actually waiting for the instance to be deployed (which can take almost 10 minutes).
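The boot-time provisioning could look something like the sketch below, which only builds and prints the commands rather than running them. Everything specific here is an assumption rather than the actual setup: the bucket URL and mount point are placeholders, and using the Linux `brd` module to get a RAM-backed block device for btrfs is just one way to do it.

```python
import shlex

def provisioning_commands(ramdisk_gib, bucket_url, mount_point):
    """Return the boot-time steps as argv lists, without executing anything."""
    ramdisk_kib = ramdisk_gib * 1024 * 1024  # brd's rd_size parameter is in KiB
    return [
        # Assumption: a brd RAM block device; other ramdisk setups would also work.
        ["modprobe", "brd", "rd_nr=1", f"rd_size={ramdisk_kib}"],
        ["mkfs.btrfs", "/dev/ram0"],
        ["mkdir", "-p", mount_point],
        ["mount", "/dev/ram0", mount_point],
        # Pull the VM images down from S3 (hypothetical bucket).
        ["aws", "s3", "sync", bucket_url, mount_point],
        # Only start accepting jobs once the images are in place.
        ["buildkite-agent", "start"],
    ]

def render(commands):
    """Format the plan as shell-quoted lines for review."""
    return "\n".join(shlex.join(argv) for argv in commands)

plan = provisioning_commands(192, "s3://example-ipxwrapper-images/", "/mnt/vm-images")
print(render(plan))
```

Keeping the plan as data makes it easy to eyeball before handing each entry to something like `subprocess.run(argv, check=True)` on the real machine.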
Standby cost for all this is only a few dollars a month. The costs climb rapidly if you actually start the system up though - m5.metal instances are currently $4.62/hr (ex VAT), which I found nagging at me whenever I had to boot the instance for prototyping or configuration.
For actual testing, I instead boot a "spot" instance, which is a fraction of the cost ($0.64/hr at this time), but can be terminated by Amazon at any time if they need the capacity back suddenly. I haven't had this happen yet, but it isn't the end of the world if it does, since it can just be retried later.
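For a sense of scale, the two hourly rates quoted above can be compared directly. This is plain arithmetic on those snapshot prices (ex VAT); spot prices fluctuate, and the helper name is made up for illustration.

```python
ON_DEMAND_PER_HOUR = 4.62  # m5.metal on-demand rate quoted above, USD
SPOT_PER_HOUR = 0.64       # m5.metal spot rate quoted above, USD

def run_cost(hours, rate_per_hour):
    """Cost in USD of keeping one instance up for the given number of hours."""
    return hours * rate_per_hour

hours = 1
saving = run_cost(hours, ON_DEMAND_PER_HOUR) - run_cost(hours, SPOT_PER_HOUR)
spot_fraction = SPOT_PER_HOUR / ON_DEMAND_PER_HOUR

print(f"Saving per hour on spot: ${saving:.2f}")
print(f"Spot costs about {spot_fraction:.0%} of the on-demand price")
```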
For my special snowflake unicorn workload, AWS is actually a good fit. If, however, I were using this on a regular basis, it would be cheaper in the long run to just buy and host a suitable computer myself.
I still haven't quite finished this project - instances are only started/stopped manually. So if a Buildkite job is started against IPXWrapper, I have to log into the AWS console, start a spot instance request and then clean it up after the job is finished. AWS is however very scriptable, so it should be pretty easy to start/stop instances as required by the Buildkite jobs whenever I get around to it.
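As a rough idea of what that scripting might look like, here is a minimal sketch built on boto3's EC2 client methods `run_instances` and `terminate_instances`. It assumes a launch template already describes the m5.metal machine, and the template ID and helper names are hypothetical. Hooking this into Buildkite, and waiting for the instance to finish provisioning, is left out.

```python
def start_spot_instance(ec2, launch_template_id):
    """Launch a single one-time spot instance and return its instance ID."""
    response = ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        LaunchTemplate={"LaunchTemplateId": launch_template_id},
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    return response["Instances"][0]["InstanceId"]

def terminate_instance(ec2, instance_id):
    """Terminate the instance so it stops costing money."""
    ec2.terminate_instances(InstanceIds=[instance_id])

def run_with_spot_instance(ec2, launch_template_id, job):
    """Start an instance, run job(instance_id), and always clean up afterwards."""
    instance_id = start_spot_instance(ec2, launch_template_id)
    try:
        return job(instance_id)
    finally:
        terminate_instance(ec2, instance_id)
```

In real use, `ec2` would be `boto3.client("ec2")` and `job` would be whatever waits for the Buildkite job to finish. The `try`/`finally` is there so that a failed job doesn't leave an expensive instance running.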