FinOps tools for developers
published on 07-04-2022 by Florian Stoeber
As cloud environments grow, many things become more complex and harder to understand. For example, you need to understand the relevant observability topics to troubleshoot problems and to spot the weak elements of your environment; without that insight, you cannot optimize it. Another big topic is governance and cloud-cost optimization. In this blog post, I will show you tools that every developer or IT department can use to pick the low-hanging fruit.
More people produce more costs
Today it is easy to spin up cloud resources. On the one hand, this is a good thing, as developers can develop and test much faster because they do not need to wait for the IT department to create new instances. On the other hand, this leads to a few difficulties:
- Resources are not cleaned up
- Resources are idling all the time
- Developers are not aware of the costs a change will bring
- In shared environments, it is not possible to bill costs to a single product
- Kubernetes introduces another abstraction layer
How to solve these problems?
These difficulties are not complicated to address. Nevertheless, they account for most of the costs companies can save, so they are a good starting point. In the following, I will mention a couple of tools that solve these problems or at least reduce their impact.
Of course, it is impossible to cover every possible solution in a single blog post, so I will show you the most prominent ones. The examples aim to be cloud-agnostic, but sometimes a specific cloud provider is mentioned.
Infracost: Cost analysis for Infrastructure as Code
Developers should be aware of the costs they are responsible for. If I spin up a new instance or move to the next larger instance type, it would be nice to get a summary of the resulting changes. With Terraform, many companies already use Infrastructure as Code. Terraform can print a human-readable plan of all changes that the next “terraform apply” run will make. Infracost creates a similar plan that describes the changes from a cost point of view. It can be integrated into your development workflow; for example, it can post a comment on your pull request telling you how much the monthly costs of your infrastructure will increase (or decrease). The nice thing is that you get this awareness without any additional manual steps. Infracost takes the Terraform plan as input and uses the cloud provider's pricing API to calculate the change in your cost consumption. Setting this up is easy and well described in the official documentation.
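If your repositories live on GitHub, a minimal workflow sketch could look like the following. The action versions, the INFRACOST_API_KEY secret, and the exact flags are assumptions on my side; the official documentation describes the current setup in detail.

name: infracost
on: [pull_request]

permissions:
  contents: read
  pull-requests: write

jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Install the Infracost CLI and register the API key
      - uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      # Generate a cost breakdown for the Terraform code in this repository
      - run: infracost breakdown --path=. --format=json --out-file=/tmp/infracost.json

      # Post (or update) a cost comment on the pull request
      - run: >
          infracost comment github
          --path=/tmp/infracost.json
          --repo=$GITHUB_REPOSITORY
          --pull-request=${{ github.event.pull_request.number }}
          --github-token=${{ secrets.GITHUB_TOKEN }}
          --behavior=update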
Cloud Custodian: Gain control of your cloud environments
In an on-prem world, many companies set up a virtual machine and then forget about it (apart from lifecycle procedures). Often those instances were no longer used, but as long as the on-prem data center had the capacity available, nobody felt responsible.
In a cloud world, you pay for those instances even if they are not used, and you can resize or shut down an instance at any time. AWS states that 70% of non-production workload costs could be saved. The first step is to turn off dev and test instances outside business hours (e.g. on the weekend or at night). Besides that, it is essential to identify and clean up resources that are no longer used. Imagine creating a test instance for a specific purpose and forgetting about it afterward: the instance idles around the clock, and those costs can be saved easily. Besides the unnecessary costs, a forgotten instance can also become a security issue.
Cloud Custodian gives you the possibility to execute policies against the cloud providers' APIs. The policies always follow the same structure: a resource type, filters, and actions. We can define just a few small policies to solve our first two problems:
Resources are not cleaned up after their use
For this policy, we need a common understanding that every instance must have an owner. If we spin up a Kubernetes cluster to test something, we are responsible for cleaning it up. It should not happen that a bunch of zombie instances run in our cloud accounts and nobody feels responsible for cleaning them up. Many Cloud Custodian users create a simple policy that deletes (or at least stops) all instances that are not tagged with an owner tag. For short-term tests, it is fine to omit the owner tag; those instances are simply deleted a few hours later. If we do use the owner label and forget to clean up the instance, we can create additional policies that remind the owner of that instance (a sketch of such a policy follows after the example below).
The following policy will delete all GCP instances without an owner label:
policies:
  - name: action_delete-instance-without-label-owner
    description: |
      Deletes all instances without an owner label
    resource: gcp.instance
    filters:
      - "tag:owner": absent
    actions:
      - type: delete
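The reminder policy mentioned above could be sketched like this. Instead of sending an e-mail (which would additionally require the c7n-mailer add-on), this variant uses Cloud Custodian's mark-for-op mechanism on AWS EC2; the tag name custodian_cleanup and the time spans are just assumptions for illustration:

policies:
  # Mark long-running instances that do have an owner tag: the mark-for-op tag
  # announces that the instance will be stopped in 3 days, giving the owner time to react.
  - name: mark-old-owned-instances-for-stop
    resource: ec2
    filters:
      - "tag:owner": present
      - "tag:custodian_cleanup": absent
      - type: instance-age
        op: ge
        days: 14
    actions:
      - type: mark-for-op
        tag: custodian_cleanup
        op: stop
        days: 3

  # Stop the instances once the announced date has passed.
  - name: stop-marked-instances
    resource: ec2
    filters:
      - type: marked-for-op
        tag: custodian_cleanup
        op: stop
    actions:
      - stop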
Turn off resources in offhours
Another example is turning off development and testing instances when nobody is working: stop them every day at 8 pm and start them again on weekdays at 8 am. This is straightforward to achieve. An example policy that stops EC2 instances at that time is shown here; it only affects instances with a specific tag:
policies:
  - name: action_stop-instance-with-tag-custodianoffhours
    description: |
      Will stop the instance after 8:00 pm if it has a custodianoffhours tag
    resource: ec2
    filters:
      - type: offhour
        tag: custodianoffhours
        default_tz: cet
        offhour: 20
    actions:
      - stop
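The counterpart that starts those instances again on weekday mornings uses the onhour filter. A minimal sketch, assuming the same custodianoffhours tag (by default, off-hours policies leave the instances stopped on weekends):

policies:
  - name: action_start-instance-with-tag-custodianoffhours
    description: |
      Will start the instance at 8:00 am on weekdays if it has a custodianoffhours tag
    resource: ec2
    filters:
      - type: onhour
        tag: custodianoffhours
        default_tz: cet
        onhour: 8
    actions:
      - start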
Just like with these two policies, it is possible to target almost every resource the cloud providers' APIs expose:
- Notify on underutilized RDS instances? No problem.
- Delete orphaned EBS snapshots? Very easy.
- Remove public load balancers and storage buckets? Go for it.
You can find these example policies in the documentation. If you want to set up a policy, just try it out; it is not complicated once you understand that Cloud Custodian simply issues the usual API calls. Reach out to me if you need support on this topic. As these examples show, Cloud Custodian does not only aim at reducing costs: it can also increase security and handle routine maintenance tasks. If you are evaluating it, take a detailed look at the various execution modes. They let you operate Cloud Custodian without a dedicated server, as it integrates with the FaaS services of the cloud providers, e.g. AWS Lambda.
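To illustrate the execution modes: adding a mode block turns the off-hours policy above into a scheduled AWS Lambda function, so no server is needed to run Cloud Custodian. The IAM role ARN below is only a placeholder:

policies:
  - name: action_stop-instance-with-tag-custodianoffhours
    resource: ec2
    # Deploy the policy as a Lambda function triggered on a schedule
    # instead of running it from a dedicated server.
    mode:
      type: periodic
      schedule: "rate(1 hour)"
      role: arn:aws:iam::123456789012:role/custodian-lambda-role
    filters:
      - type: offhour
        tag: custodianoffhours
        default_tz: cet
        offhour: 20
    actions:
      - stop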
Kube-Green: Control your Kubernetes clusters
Being able to scale down Auto Scaling groups and EC2 instances at night and on the weekend is a lovely thing, but how do we handle the workloads running on Kubernetes? Kube-Green is a very young project, but you can already use it.
Kube-Green allows you to scale down Deployments and suspend CronJobs running on Kubernetes. Combined with a properly configured cluster-autoscaler, it can scale down the entire environment, save money, and reduce carbon emissions. An example Kube-Green manifest is shown here:
apiVersion: kube-green.com/v1alpha1
kind: SleepInfo
metadata:
  name: working-hours
spec:
  weekdays: "1-5"
  sleepAt: "20:00"
  wakeUpAt: "08:00"
  timeZone: "Europe/Rome"
  suspendCronJobs: true
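SleepInfo is a namespaced resource, so you apply one manifest per namespace you want to put to sleep. If individual workloads in that namespace must keep running, kube-green can exclude them; the following sketch assumes a Deployment called my-critical-service (check the kube-green documentation for the exact fields supported by your version):

apiVersion: kube-green.com/v1alpha1
kind: SleepInfo
metadata:
  name: working-hours
spec:
  weekdays: "1-5"
  sleepAt: "20:00"
  wakeUpAt: "08:00"
  timeZone: "Europe/Rome"
  suspendCronJobs: true
  # Keep selected workloads running while the rest of the namespace sleeps
  excludeRef:
    - apiVersion: "apps/v1"
      kind: Deployment
      name: my-critical-service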
Observability systems
Having all the tools mentioned above in place will not solve all your problems. Infrastructure is very flexible and can scale automatically depending on the workload. On the one hand, this is very good because we can set up resilient infrastructure. Thinking of flexible costs, this should also be an advantage: if there is less load on a specific application, the autoscalers scale it down and we save some money. It becomes a problem, however, if the autoscalers are not set up well. We need to implement observability systems to tackle this. Observing the reserved capacity and the utilized capacity per application is a first step that requires minimal effort (a sketch follows at the end of this section). In the end, using tools like opencost.io or Apptio's Cloudability could be a significant advantage for the company. They take more effort to integrate, but we can achieve many things with them, e.g.:
- Enable the controlling department to invoice the teams that use my PaaS
- Be aware of any anomalies in my cloud spending
- Optimize the infrastructure
The tools are already very good as they are today, and they will get additional features soon. While opencost.io (formerly the open-source version of Kubecost) focuses more on Kubernetes, Cloudability focuses more on the cloud providers themselves. At the same time, each tool can also monitor the other part of the infrastructure.
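To make the "reserved versus utilized capacity" starting point concrete: if you already run Prometheus with kube-state-metrics and the kubelet's cAdvisor metrics, two recording rules per namespace give you a first overview. The rule names are my own, and depending on your kube-state-metrics version the request metric may be named differently:

groups:
  - name: capacity-vs-usage
    rules:
      # CPU cores requested per namespace (reserved capacity)
      - record: namespace:cpu_requests_cores:sum
        expr: sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      # CPU cores actually consumed per namespace (utilized capacity)
      - record: namespace:cpu_usage_cores:sum
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))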
Conclusion
In general, these tools should give you an idea of how to optimize your teams' workflows and how such an integration saves costs in their daily work. Integrating Infracost in an environment where you already use Terraform is very simple. Integrating Cloud Custodian or Kube-Green with a basic off-hours policy as a PoC is also very easy. Setting up opencost.io or Cloudability takes a bit more effort, but remember the "Crawl, Walk, Run" approach promoted by the FinOps Foundation: it is not important to set up a perfect workflow on day one. Optimize it in every sprint, iterate over it, and look at which things could be done better. That is the most important lesson every technician learns at some point.
References
https://github.com/kubecost/opencost
https://www.apptio.com/products/cloudability/
https://github.com/Liquid-Reply/finops-webinar
Photo by Alexander Mils on Unsplash