The checklist: Monitoring for Economy
With the checklist and Komiser in hand, gain valuable insights and start cutting down those huge numbers that keep you up at night.
Any company beyond a certain scale has a set of dashboards that the CEO and all execs review on a daily basis. By starting each day with a look at the real-time company metrics, decision makers put themselves in the best position to achieve the company's long-term goals. This data visualization method is even bleeding into our personal lives with the advent of the “personal life” dashboard.
Shift the lens's focus
In the world of cloud, we build architectures for resilience, security, and scalability, and we generally know how to monitor them.
While building and maintaining an architecture that is optimized for resilience, security, and scalability is important and necessary, it makes sense to broaden the scope and optimize for cost as well.
In a recent blog post, Shane Baldacchino recommends just that, building architectures that are optimised on another axis, Economy.
There are many resources out there to measure the network traffic, load, and throughput of an application, as well as the security posture and reliability of the environment, be it a combination of open-source tools or your favorite proprietary provider. Fewer tools focus on optimizing cost, though.
Granted, the monitoring solutions built around the architecture types above may, at times, have some cost-specific metrics hidden away somewhere on the dashboard. Mostly, though, the information is scant, and the costs commonly represented are difficult to change since they are the by-product of decisions meant to address concerns other than cost.
It is understandable that designing a new application or environment to optimize for cost from the very beginning might seem like a difficult task. Many questions surely arise: “How do I know that the decisions I’m making now will be flexible enough to accommodate future growth?” or “How do I know that I’m not limiting the performance of my application by prioritizing cost from the beginning?”
Whether you are building a new application from scratch or trying to implement some cost-conscious changes to an existing environment, monitoring for economy should be as simple as prioritizing the following:
💡 The same or better outcome for a lower cost
For each axis, its own tool
There is a myriad of monitoring tools out there, OSS and proprietary alike, all with their strengths and weaknesses. At Tailwarden, we’ve put a lot of effort into making sure that our OSS offering Komiser is up to the task and can be a one-stop shop when it comes to visualizing and taking steps to optimize cloud costs.
In the coming months, we plan on shipping a wide array of features that, beyond showing you what you have, will also tell you what actions you can take to bring costs down meaningfully. To make sure you don’t miss any updates and also let your voice be heard by upvoting your favorite feature requests, check out the public roadmap here.
Until the platform is fully fitted with all of the great features the team and the open source community are building, Komiser can be paired with a checklist of changes and considerations that can be taken in the areas of the (generally) most cost-intensive services.
Ideally, the checklist you work with should be tailored to your application's requirements and your company's cloud environments. But some standards and best practices are applicable to most cloud vendors. So if you don’t have a custom list of your own, feel free to use the one below.
Disclaimer: for consistency's sake, and to avoid loading this particular post with too many screenshots from the different cloud providers, the attachments below reference a Komiser instance linked to an AWS account. The checklist itself is cloud-agnostic and can be applied to the public cloud provider of your choice. If there is anything that we have missed in the checklist, let us know! The best way is by joining the Discord community and reaching out there.
- [ ] Do you have lifecycle policies in place to transition files from warm to cold storage?
- [ ] Are you using zstd over gzip for large file compression?
- [ ] Do you have any unattached disk/memory volumes?
- [ ] Do you have a lifecycle policy for your memory volume snapshots?
- [ ] Does your log retention period need adjusting?
- [ ] Do you have a reserved / spot instance % target?
- [ ] Are any running instances underutilized?
- [ ] Do you have any unattached static IP addresses?
- [ ] Are you using the latest generation of an instance type?
- [ ] Are you using a managed service that could be replaced by a Load Balancer?
- [ ] Do you have a logical networking layer in the front of your Serverless functions that could be removed or replaced?
- [ ] Do you know how your Data transfer out (DTO) cost breaks down?
- [ ] Do you have any unused NAT Gateways?
- [ ] Do you have billing alerts in place?
- [ ] Does your bill forecast line up with your expectations?
Let’s take a closer look at each item to be sure we get the most value out of the checklist. Additionally, for more in-depth AWS cost reduction tips, check out this article.
Do you have lifecycle policies in place to transition files from warm to cold storage?
- Make sure to automate decisions so you don’t have to remember to do them manually. When an object hasn’t been accessed for 90 days, should it be in the most expensive pricing tier? Leverage the data lifecycle features in AWS, Azure and GCP to keep your objects in the most appropriate tier.
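As a rough illustration of automating that decision on AWS, the sketch below builds a lifecycle configuration in the shape that S3 accepts (for example through boto3's `put_bucket_lifecycle_configuration`). The prefixes, the 90-day cutoff, and the helper names are assumptions made for this example. One caveat worth knowing: S3 lifecycle transitions are driven by object age, not by last access, so treat the 90 days here as "days since creation".

```python
def build_tiering_rule(prefix: str, days_until_cold: int = 90,
                       cold_class: str = "GLACIER") -> dict:
    """Build one rule that moves objects under `prefix` to a colder
    storage class once they are `days_until_cold` days old."""
    if days_until_cold < 1:
        raise ValueError("days_until_cold must be at least 1")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}-after-{days_until_cold}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days_until_cold, "StorageClass": cold_class}],
    }


def build_lifecycle_configuration(prefixes: list[str]) -> dict:
    """Wrap one tiering rule per prefix into a lifecycle configuration."""
    return {"Rules": [build_tiering_rule(p) for p in prefixes]}


# Hypothetical prefixes, purely for illustration.
config = build_lifecycle_configuration(["logs/", "exports/"])
for rule in config["Rules"]:
    print(rule["ID"], rule["Transitions"])
```

The resulting dictionary could then be passed as the `LifecycleConfiguration` argument when applying it to a bucket. Azure and GCP have their own lifecycle rule formats, so the shape above is AWS-specific.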
Are you using zstd over gzip for large file compression?
- In some cases, zstd can offer up to a 30% reduction in compressed storage as compared to other compression mechanisms. Learn more about zstd here.
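Rather than taking any ratio on faith, it can be worth measuring on your own data. The sketch below compares gzip (from the standard library) with zstd, assuming the third-party `zstandard` package is installed; if it is not, only the gzip figure is reported. The sample payload is made up and highly repetitive, so the numbers it prints say nothing about real workloads.

```python
import gzip

try:
    import zstandard  # third-party package: pip install zstandard
except ImportError:
    zstandard = None  # fall back to reporting gzip only


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` raw and under each codec."""
    sizes = {"raw": len(data), "gzip": len(gzip.compress(data))}
    if zstandard is not None:
        compressor = zstandard.ZstdCompressor(level=19)
        sizes["zstd"] = len(compressor.compress(data))
    return sizes


# Synthetic log-like payload, for illustration only.
sample = b"timestamp=2023-01-01 level=INFO msg=request served\n" * 20_000
for name, size in compressed_sizes(sample).items():
    print(f"{name}: {size} bytes")
```

Run it against a representative slice of your own files before changing any pipeline, since the gain depends heavily on the data and on the compression level you pick.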
Do you have any unattached disk/memory volumes?
- The lifecycle of a disk or memory volume can sometimes be independent of the VM it was associated with. If your environment has VMs that were terminated while their associated storage persisted, quickly determine whether each volume holds valuable data that should be kept (by taking a snapshot). Otherwise, the volume should more than likely be removed to avoid the related expense.
Below, using the Komiser dashboard we can compare the used and total volumes widgets. If we have a higher total number of volumes compared to the used value, we know we have unattached volumes:
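If you want the actual list of offenders rather than just the count, a small filter over the volume inventory does the job. The sketch below assumes volume records shaped like the entries returned by the EC2 `DescribeVolumes` API (`VolumeId`, `State`, `Size` in GiB, `Attachments`); the sample records and IDs are invented.

```python
def unattached_volumes(volumes: list[dict]) -> list[dict]:
    """Keep volumes that exist but are not attached to any instance."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def total_size_gib(volumes: list[dict]) -> int:
    """Sum the provisioned size (GiB) of the given volumes."""
    return sum(v.get("Size", 0) for v in volumes)


# Invented sample inventory.
inventory = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50, "Attachments": []},
]

orphans = unattached_volumes(inventory)
print([v["VolumeId"] for v in orphans], total_size_gib(orphans), "GiB idle")
```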
Do you have a lifecycle policy for your memory volume snapshots?
- Be sure to align the lifecycle policy of your volume snapshot service of choice with your reliability and backup requirements. Rethink how many snapshots need to be kept. You are charged for the storage of the snapshots, so there is some potential room for optimization there.
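One way to reason about "how many snapshots need to be kept" is to write the policy down as code. The sketch below is one possible policy, not a recommendation: per volume, always keep the newest `keep_latest` snapshots, and among the rest, mark those older than `max_age_days` for deletion. Records are assumed to carry `SnapshotId`, `VolumeId`, and `StartTime`, as EC2 snapshot descriptions do; the data is invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, keep_latest=7, max_age_days=30, now=None):
    """Return IDs of snapshots that fall outside the retention policy."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    by_volume = defaultdict(list)
    for snap in snapshots:
        by_volume[snap["VolumeId"]].append(snap)

    doomed = []
    for snaps in by_volume.values():
        snaps.sort(key=lambda s: s["StartTime"], reverse=True)  # newest first
        for snap in snaps[keep_latest:]:          # never touch the newest N
            if snap["StartTime"] < cutoff:        # only delete if also old
                doomed.append(snap["SnapshotId"])
    return sorted(doomed)


# Invented sample: four snapshots of one volume at different ages.
today = datetime(2023, 3, 1, tzinfo=timezone.utc)
sample = [
    {"SnapshotId": f"snap-{i}", "VolumeId": "vol-aaa",
     "StartTime": today - timedelta(days=age)}
    for i, age in enumerate([1, 10, 40, 60], start=1)
]
print(snapshots_to_delete(sample, keep_latest=2, max_age_days=30, now=today))
```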
Does your data/log retention period need adjusting?
- Most cloud providers have pretty generous log retention and log volume thresholds. If this section of the bill starts to spike, first check the configured log retention, then consult the log volume pricing model (AWS, Azure, GCP), and reduce where you would see the least amount of impact.
In the screenshot above (in red) we can see the configured Log retention days for the account.
And the total volume of logs:
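To decide where to trim first, it helps to rank log groups by stored volume and flag the ones with no retention limit or an overly long one. The sketch below assumes records shaped like CloudWatch Logs `DescribeLogGroups` entries, where a missing `retentionInDays` means the logs never expire. The 90-day ceiling and the sample groups are assumptions for the example.

```python
def log_groups_needing_attention(log_groups, max_retention_days=90):
    """Flag groups that never expire or keep logs longer than the ceiling,
    largest stored volume first."""
    flagged = []
    for group in log_groups:
        retention = group.get("retentionInDays")  # absent means "never expire"
        if retention is None or retention > max_retention_days:
            flagged.append(group)
    return sorted(flagged, key=lambda g: g.get("storedBytes", 0), reverse=True)


# Invented sample log groups.
groups = [
    {"logGroupName": "/app/api", "retentionInDays": 30, "storedBytes": 9_000},
    {"logGroupName": "/app/batch", "storedBytes": 50_000},
    {"logGroupName": "/app/audit", "retentionInDays": 365, "storedBytes": 70_000},
]
for g in log_groups_needing_attention(groups):
    print(g["logGroupName"], g.get("retentionInDays", "never expires"))
```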
Do you have a reserved / spot instance % target?
- Unless you are provisioning instances in a development account, always plan the required instance count, commit or pay upfront where it makes sense, and leverage the savings plans at your disposal (AWS Savings plans, Azure’s Cost Management and GCP’s committed use discounts). Cloud providers sell their excess compute capacity at a discount as spot instances, and discount reserved instances in exchange for a usage commitment, so decide what share of your fleet should run on each.
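A percentage target only helps if you check against it. As a minimal sketch, assuming you already have a count of instances per purchase option (the keys `on_demand`, `reserved`, and `spot` and the 60% target are made up for this example), the share of discounted capacity works out as follows:

```python
def discounted_share(counts: dict) -> float:
    """Percentage of instances running on reserved or spot pricing."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    discounted = counts.get("reserved", 0) + counts.get("spot", 0)
    return 100 * discounted / total


def meets_target(counts: dict, target_percent: float) -> bool:
    """True when the discounted share reaches the target."""
    return discounted_share(counts) >= target_percent


# Invented fleet: 6 on-demand, 10 reserved, 4 spot.
fleet = {"on_demand": 6, "reserved": 10, "spot": 4}
print(f"{discounted_share(fleet):.1f}% discounted, "
      f"target met: {meets_target(fleet, 60)}")
```

Counting instances is the simplest proxy. Weighting by instance-hours or by spend would give a more faithful picture if your fleet mixes very different instance sizes.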
Are any running instances underutilized?
- There are many ways to see if instances are underutilized. If the VMs are part of Kubernetes clusters, you can use open source tools such as the k9s CLI or Lens. Otherwise, use the cloud provider's console to check the memory and compute consumption of the provisioned VMs.
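Whichever tool supplies the numbers, the check itself is simple. The sketch below assumes you have exported CPU utilization samples (in percent) per instance; the 10% threshold, the instance IDs, and the readings are placeholders, and a real review should look at memory and network as well.

```python
from statistics import mean


def underutilized(cpu_samples: dict, cpu_threshold: float = 10.0) -> list:
    """Return instance IDs whose average CPU stays below the threshold.
    Instances with no samples are skipped rather than guessed at."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_samples.items()
        if samples and mean(samples) < cpu_threshold
    )


# Invented utilization readings.
readings = {
    "i-aaa": [3.0, 5.5, 4.1],
    "i-bbb": [55.0, 61.2],
    "i-ccc": [],  # no data collected
}
print(underutilized(readings))
```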
Do you have any unattached static IP addresses?
- The key concept to keep in mind is that a reserved static IP address keeps costing money while it sits unused, and some providers bill it only in that idle state. Make sure that you either use it or release it back into the IP pool for others to use.
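Spotting idle addresses is a one-line filter once you have the inventory. The sketch below assumes records shaped like EC2 `DescribeAddresses` entries, where an address in use carries an `AssociationId`; the sample addresses come from documentation ranges and the IDs are invented.

```python
def unassociated_addresses(addresses: list[dict]) -> list[dict]:
    """Keep static IPs that are allocated but not associated with anything."""
    return [a for a in addresses if not a.get("AssociationId")]


# Invented sample inventory.
allocated = [
    {"PublicIp": "203.0.113.10", "AllocationId": "eipalloc-1",
     "AssociationId": "eipassoc-1"},
    {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-2"},
]
for address in unassociated_addresses(allocated):
    print("idle:", address["PublicIp"], address["AllocationId"])
```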
Are you using the latest generation of an instance type?
- In most cases, you will get better performance from a newer version of a particular generation/series at a lower price point. Use Komiser to check the instance types you have provisioned then head over to the cloud provider's VM pricing page and see if there are any bargains you haven’t picked up yet.
Are you using a managed service that could be replaced by a Load Balancer?
- With managed services come additional costs. You might be using a costly managed service when a Load Balancer could expose an application or a stack of functions for a fraction of the cost. Each cloud provider has its own API management offering, so make sure your application actually needs the features you are paying a premium for when a Load Balancer might suffice.
Do you have a logical networking layer in front of your functions that could be removed or replaced?
- In the previous point, we brought up the use of a load balancer to manage traffic to your application. Specifically, when thinking of serverless applications, do we even need a load balancer or networking logic on top of the functions? Most cloud providers allow us to customize the function endpoints at no additional cost. If it is appropriate for your use case, consider removing the middlemen and using custom endpoints in Azure Functions or AWS Lambda URLs.
Do you know how your Data transfer out (DTO) cost breaks down?
- A commonly overlooked aspect of any architecture is the expense related to data that is transferred out to the internet or to other cloud regions. Take some time to understand how heavy your organization's data transfer out is and whether it warrants reducing. Understand AWS DTO, Azure DTO and GCP DTO.
Do you have any unused NAT Gateways?
- Be sure this isn’t the case, since they can rack up a nasty bill. Also make sure that the NAT Gateways are in the same AZ as the VMs they serve. If not, the cross-AZ data transfer is where they will get you. Alternatively, try to understand the data transfer that is taking place. If data is being transferred to or from certain managed services, perhaps you can leverage cheaper service peering options. For example, for traffic to DynamoDB or S3 (in AWS) you can use VPC Endpoints instead of a NAT Gateway.
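Both checks, unused gateways and cross-AZ routing, can be expressed over three small lookups. The sketch below assumes you have already gathered them from your provider: which AZ each subnet lives in, which NAT Gateway each private subnet routes through, and which AZ each NAT Gateway lives in. All names and IDs are hypothetical.

```python
def unused_nat_gateways(subnet_nat: dict, nat_az: dict) -> list:
    """NAT Gateways that no subnet routes through."""
    in_use = set(subnet_nat.values())
    return sorted(nat for nat in nat_az if nat not in in_use)


def cross_az_routes(subnet_az: dict, subnet_nat: dict, nat_az: dict) -> list:
    """(subnet, NAT) pairs where traffic has to cross an AZ boundary."""
    return sorted(
        (subnet, nat)
        for subnet, nat in subnet_nat.items()
        if subnet_az[subnet] != nat_az[nat]
    )


# Hypothetical topology.
subnet_az = {"subnet-a": "eu-west-1a", "subnet-b": "eu-west-1b"}
subnet_nat = {"subnet-a": "nat-1", "subnet-b": "nat-1"}
nat_az = {"nat-1": "eu-west-1a", "nat-2": "eu-west-1b"}

print("unused:", unused_nat_gateways(subnet_nat, nat_az))
print("cross-AZ:", cross_az_routes(subnet_az, subnet_nat, nat_az))
```

In this made-up topology the fix would be to route `subnet-b` through `nat-2`, which removes both the unused gateway and the cross-AZ hop.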
Does your bill forecast line up with your expectations?
- On the Komiser dashboard check the Bill widget and observe the value in the forecast section. This is a powerful piece of data to have at your disposal especially if you have your previous cloud bills available to compare. If you see a spike in the forecast compared to previous months, you know you have a culprit to find.
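If you keep your previous bills around, the comparison can be reduced to a small calculation. The sketch below compares a forecast against the average of earlier months; the 20% tolerance and the dollar figures are invented, and the functions are a hypothetical helper, not something Komiser exposes.

```python
from statistics import mean


def percent_over_baseline(forecast: float, previous_bills: list) -> float:
    """How far the forecast sits above (or below) the average past bill."""
    baseline = mean(previous_bills)
    if baseline == 0:
        raise ValueError("previous bills average to zero; no baseline")
    return 100 * (forecast - baseline) / baseline


def is_spike(forecast: float, previous_bills: list,
             tolerance_percent: float = 20.0) -> bool:
    """True when the forecast exceeds the baseline by more than the tolerance."""
    return percent_over_baseline(forecast, previous_bills) > tolerance_percent


# Invented figures in USD.
history = [1000.0, 1100.0, 900.0]
print(percent_over_baseline(1300.0, history), is_spike(1300.0, history))
```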
Do you have Billing Alerts in place?
- Don’t get caught off guard: set up billing alerts. Each cloud provider has a way of doing so from their consoles. Through Komiser you can also set up a Slack notification to get a daily cost estimate sent directly to the Slack channel of your choice. To set up an alarm in Komiser check out this tutorial.
But if you want to centralize your alerts all in one place and even set cost thresholds to be alerted on, check out the new Billing alarm feature available on Tailwarden now.
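For readers who prefer to script the provider-side alert on AWS, the sketch below builds the parameters for a CloudWatch alarm on the `EstimatedCharges` billing metric, in the shape accepted by boto3's `put_metric_alarm`. It only constructs the dictionary and does not call AWS. The threshold and the SNS topic ARN are placeholders, and this is separate from the Komiser and Tailwarden features mentioned above. Note that AWS publishes billing metrics in `us-east-1` only, and only once billing alerts are enabled on the account.

```python
def billing_alarm_params(threshold_usd: float, sns_topic_arn: str) -> dict:
    """Parameters for an alarm that fires when estimated charges pass a threshold."""
    return {
        "AlarmName": f"estimated-charges-over-{threshold_usd:g}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder ARN, for illustration only.
params = billing_alarm_params(
    500, "arn:aws:sns:us-east-1:123456789012:billing-alerts"
)
print(params["AlarmName"], params["Threshold"])
```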
Every action has a reaction
When undertaking any cost reduction effort, we should not lose sight of how tightly coupled cost is with other priorities such as reliability and performance. By taking drastic actions in one area, chances are high that we inadvertently impact others, so always be mindful of the tradeoffs.
For example, say you know that a Kubernetes cluster's configured minimum number of nodes is too high (because the nodes are underutilized). If you reduce the minimum and the cluster suddenly has a workload spike requiring a quick scale-out, be cognizant that users might notice the degraded performance during the more frequent scale-out events, which will eat into the SLO margins.
When I think of how we manage our cloud resources, viewing them through cloud consoles and cluttered dashboards, what comes to mind is the analogy of the all-you-can-eat restaurant. Imagine you like to eat roughly the same thing every day, but as days go by your appetite grows, and so does the number of ingredients that you put on your plate. If you had to build the plate from scratch every single day, surely you would forget a side dish, or be unsure of which potatoes you used to get and which combination of dressings you put on the salad.
What I’m trying to say is that we shouldn’t have to start fresh every time, which is what we do when we open up our cloud console every morning. We shouldn’t have to remember the resources that we provisioned and make sure we have a way of keeping track of them. We shouldn’t have to chase our resources, they should come to us.
We should have a clear and custom view into our own environment which dynamically updates as our environments grow or shrink, and that’s why Komiser exists, to deliver your data clearly and insights directly.
Make the most of the custom view that Komiser provides: use it alongside your own personal checklist or the one above and start trimming the fat off your cloud bill.
Whether you are a Developer, DevOps, or Cloud engineer, dealing with the cloud can be tough at times, especially on your own. If you are using Tailwarden or Komiser and want to share your thoughts, doubts, and insights with other cloud practitioners, feel free to join our Tailwarden Discord server, where you will find tips, community calls, and much more.