We are a company focusing on consulting, information and training around new(-ish) technologies like all the Big Data stuff from the Hadoop ecosystem or the ELK stack (as well as others).
This technology is notoriously complex (no matter what marketing folks try to tell you) and under-documented. So, it's often necessary to just test things to see if and how they work. For some components you're lucky enough to have the source code. But even that's not enough when those components interact with closed-source components (e.g. Kerberos stuff in Java talking to Active Directory).
Lots of people use Sandbox VMs for that which is fine for a lot of scenarios, but we find that there's simply no substitute for having a realistic cluster available. After some failed attempts with stacks of Mini-ITX machines on our desks, we turned our attention to the cloud.
This is part one of a two-part series on how we chose our technology stack (and in this part I have a question for our readers) as well as the limitations we encountered when using the cloud.
While we build clusters in the cloud for various technologies (ELK stack, Kubernetes, Kafka and others), I'm mainly going to focus on Hadoop clusters in this post. We wanted a setup that was as realistic as possible. What do I mean by realistic? Most of our customers are (still?) hosting their clusters on-premise, usually not connected to the Internet and only available internally. Secure clusters are usually tied to an Active Directory (we've seen a few IPA installations too, but that's a story for another time).
So our goal for bringing clusters to the cloud was a fully separated cluster (not directly accessible from the Internet), reachable only via a VPN, plus a Windows machine running Active Directory.
This has the huge benefit that I can go to my browser, just enter http://worker-1:7180 and end up on the Cloudera Manager homepage (or Ambari or whatever I'm using). The VPN pushes its DNS settings to my client, so all requests are forwarded to the Windows gateway machine. This makes for a totally isolated cluster that does not have multiple network interfaces or host names and looks, for all intents and purposes, as if it were local.
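The DNS-pushing part comes down to a few lines in the OpenVPN server configuration. This is a sketch; the IP of the Windows/AD machine, the subnet and the domain name are made-up placeholders:

```
# Push the Windows/AD machine as DNS server to connecting clients
push "dhcp-option DNS 10.0.0.4"
push "dhcp-option DOMAIN cluster.local"
# Route the cluster subnet through the tunnel
push "route 10.0.0.0 255.255.255.0"
```

With that in place, a client resolving worker-1 asks the AD machine, which knows all the cluster hosts.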
Having the Windows machine in there allows me to enable Kerberos for more realistic testing.
With this goal in mind I went and looked at all the documentation I could find for the three big cloud providers (Amazon AWS, Google Cloud and Microsoft Azure). I also stumbled across Terraform, which is just fantastic. As with every tool you'll find criticism out there, and I'm sure some of it is valid, but for our use case it mostly works fine (my one wish would be a PostgreSQL-backed remote store).
As for the clouds, I have to admit that I cannot remember all the details, but I can tell you the final result: I chose Microsoft Azure. I did not look at pricing at all; all my investigations were focused on whether a particular cloud could support what I wanted it to do. Azure could. I remember that I had problems with AWS but decided that they could be solved, while Google Cloud had some problem that was a deal breaker for my scenario. The issues were all around networking: I wanted my machines to have no public IPs (and not multiple IPs at all), and I needed DNS resolution inside the cluster. I'm open to discussions and hints, and this was about a year ago, so things may have changed.
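In Terraform (with the azurerm provider) the core of that networking setup can be sketched roughly like this. All names, address ranges and the AD machine's IP are placeholders, and the exact attribute names depend on your provider version:

```hcl
# Virtual network whose DNS server is the Windows/AD machine
resource "azurerm_virtual_network" "cluster" {
  name                = "cluster-vnet"
  resource_group_name = azurerm_resource_group.cluster.name
  location            = azurerm_resource_group.cluster.location
  address_space       = ["10.0.0.0/24"]
  dns_servers         = ["10.0.0.4"] # the AD/DNS machine
}

resource "azurerm_subnet" "cluster" {
  name                 = "cluster-subnet"
  resource_group_name  = azurerm_resource_group.cluster.name
  virtual_network_name = azurerm_virtual_network.cluster.name
  address_prefixes     = ["10.0.0.0/24"]
}

# NIC with a private IP only -- no public_ip_address_id, so the
# machine is not reachable from the Internet
resource "azurerm_network_interface" "worker" {
  name                = "worker-1-nic"
  resource_group_name = azurerm_resource_group.cluster.name
  location            = azurerm_resource_group.cluster.location

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.cluster.id
    private_ip_address_allocation = "Dynamic"
  }
}
```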
Anyway... I chose Azure, built the Terraform scripts, learnt about WinRM, built Ansible playbooks, and everything works fine (apart from a few kinks that I was too lazy to fix but that we're slowly tackling now)! It's awesome, and we've delivered a couple of Hadoop trainings using this solution. It requires our training participants to install an OpenVPN client and connect to the cloud. Sounds easy, but it was surprisingly painful for users with locked-down corporate laptops.
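For reference, getting Ansible to talk to the Windows machine over WinRM mostly comes down to a handful of inventory variables. Hostname and credentials here are placeholders:

```yaml
# group_vars/windows.yml (placeholder credentials)
ansible_user: Administrator
ansible_password: "{{ vault_windows_password }}"
ansible_connection: winrm
ansible_port: 5986
ansible_winrm_transport: ntlm
ansible_winrm_server_cert_validation: ignore  # the VM uses a self-signed cert
```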
So we've tested a new solution based on devices from GL Inet, which are WiFi repeaters that can connect to an OpenVPN server. We had hoped that all our training participants would have to do is connect to our training WiFi to get access to the cluster. While that worked perfectly fine in our small-scale tests, it failed spectacularly at larger scale. It turns out the devices are just not powerful enough: the CPU was maxed out and throughput dropped to almost zero.
This is my question to the readers: do you have any suggestions for other devices that would work here? The Turris Omnia has been suggested to me, but I'm hesitant to shell out the money if I can't be sure it'll fit the bill. If anyone has experience with it, I'd appreciate a heads-up. Ideally the device would also be able to function as an LTE repeater, but that's for bonus points only.
We still hope we'll be able to make our scripts open source soon. If you want to see them, please just mail us for now.
Azure has been working mostly fine for us. The portal is often sluggish but I hope it'll improve over time.
This is something we see frequently though:
    azurerm_virtual_machine.worker-vm.8: Still creating... (19m0s elapsed)
    azurerm_virtual_machine.worker-vm.3: Still creating... (13m10s elapsed)
It often takes 20 minutes or more to provision a single machine, and destroying the machines can take the same amount of time again. They also sometimes don't come back when we restart them, which is worrying and annoying but not critical for us. I would assume others see this differently.
While writing this blog post we had another issue: we tried to stop a 20-node cluster and it silently failed. The machines - and our bill - kept on running. This is not the first time this has happened, and it cost us maybe 50-100€ this time. Not painful, but annoying - the entire Azure cloud experience sometimes still feels a bit "beta", but I'm hopeful.
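Since then we double-check the power state with the Azure CLI instead of trusting the portal. A sketch (the resource group name is a placeholder):

```shell
# Deallocate (not just stop) all VMs in the resource group so billing stops
az vm deallocate --ids $(az vm list -g training-rg --query "[].id" -o tsv)

# Verify: every VM should report "VM deallocated"
az vm list -g training-rg -d --query "[].{name:name, state:powerState}" -o table
```

Note the distinction: a stopped VM keeps its hardware allocated and keeps costing money; only a deallocated VM stops the bill.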
In Part II we'll focus on the non-technical issues we were facing using Microsoft Azure. We hope those will be equally helpful for other companies getting started with Azure.
We gave Microsoft a chance to review this blog post (as well as the next one) prior to publishing but they declined.