Why Cloud Costs Are an Architecture Problem ft. Shailaja Beeram | Ep #68

FIA - Shailaja Beeram
===

Shailaja Beeram: [00:00:00] Over time I realized that many inefficiencies actually come from architectural decisions and how systems evolve. That completely changed how I approach cloud efficiency today.

Intro: Welcome to FinOps in Action. I'm your host, Taylor Houck. Each week I'll sit down with FinOps experts to explore the toughest challenges between FinOps and engineering. This show is brought to you by PointFive, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.

Taylor Houck: Hello, and welcome to another episode of FinOps in Action. Today's guest brings a perspective that we have not had on the show before. She is a cloud architect and engineer first, with over 15 years of experience designing and building cloud infrastructure across Azure, AWS, and GCP. She's worked across consulting, software, and financial services.

And she's the published author of a research paper on zero trust identity management. But what makes this conversation special is how her engineering background shapes the way that she thinks about FinOps. She doesn't start with the bill; she starts with the architecture. Welcome to the show.

Shailaja Beeram: Hi, Taylor. Great to connect again.

Taylor Houck: Oh, I'm so excited for this conversation, Shailaja. I wasn't kidding when I said that this is a new one, because it's really exciting for me to have on someone with such a strong cloud architecture and engineering background who moved into FinOps. I think most FinOps practitioners should embrace the mindset of a cloud architect and an engineer.

Can you take me back to the moment when you first realized that cost optimization or managing cloud spend was part of your job?

Shailaja Beeram: So my work has always been closely tied to infrastructure, architecture, cost governance, and decision making. Over time, one thing that kept coming up consistently was cost: why costs were increasing, why there were unexpected spikes, how to optimize the spend. Earlier, one thing I actually got wrong was thinking cloud optimization was mostly about reducing unused resources. But over time I realized that many inefficiencies actually come from architectural decisions and how systems evolve. That completely changed how I approach cloud efficiency today.

Taylor Houck: What are some of the common inefficiencies? Because I think a lot of people start with that same realization: let's go find the idle VM, the dev server that someone never turned off, right? But what are those bigger architectural decisions that lead to inefficient spend?

Shailaja Beeram: So I have been seeing a lot of inefficiencies that are not obvious at first glance, especially in the networking layer. For example, NAT Gateways and private endpoints often remain after workloads have changed over time. These become what I call zombie resources: they don't actively serve a purpose, but they still add cost and complexity. Many enterprises assume networking cost is negligible compared to compute, but in large-scale Azure environments, networking and outbound connectivity patterns often become hidden costs and scalability bottlenecks. We have seen this with the NAT Gateway in our environment: it continues to incur cost regardless of usage, since there is no pause or stop state. In some environments, even when workloads are scaled down or removed, the NAT Gateway remains attached to the subnet and keeps running. I also noticed that monitoring background traffic can make it appear to be in use when it really isn't, which often delays cleanup. That is where the inefficiency starts to build up. And because it sits at the networking layer, it's not always visible at the application level, which makes it easy to overlook.
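The idle-gateway signals Shailaja describes (no attached workloads that need outbound connectivity, only low-level background traffic) could be sketched as a small heuristic. Everything below, including the function name, the inputs, and the threshold, is an illustrative assumption, not an Azure SDK call; in practice the numbers would come from Azure Monitor metrics and a resource inventory.

```python
def is_zombie_nat_gateway(attached_workloads: int,
                          bytes_out_per_day: int,
                          background_threshold: int = 50_000_000) -> bool:
    """Flag a NAT Gateway as a likely 'zombie': nothing in its subnets
    needs outbound connectivity, and the only traffic looks like
    background chatter (health probes, DNS, telemetry)."""
    if attached_workloads > 0:
        return False  # something still depends on this gateway
    # Traffic below the threshold is treated as background noise,
    # not evidence of a real application need.
    return bytes_out_per_day < background_threshold

# A gateway with no workloads and ~5 MB/day of chatter is a cleanup candidate.
print(is_zombie_nat_gateway(0, 5_000_000))    # True
print(is_zombie_nat_gateway(3, 5_000_000))    # False
print(is_zombie_nat_gateway(0, 500_000_000))  # False: real traffic volume
```

The point of separating the two checks is exactly her observation: background traffic alone should never be read as "in use".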

Taylor Houck: It's interesting to me because it's so easy with, say, a compute resource to tie it to an application and measure the compute and memory utilization rates. But oftentimes with networking, it's not so simple. I wanna drill into a couple of the concepts that you touched on. One is NAT Gateways, and the other is how you actually observe or identify an opportunity when there could be network traffic that is perhaps hiding the true idleness of a resource.

Shailaja Beeram: So what I have been seeing with NAT Gateways is that the cost continues regardless of usage, in some environments even when workloads are scaled down or removed. What matters isn't just whether a NAT Gateway exists; it's how it is designed and used. For example, in some cases NAT Gateways are over-provisioned or used as a default pattern, even when the workload doesn't really need that level of outbound setup. But we still use it because it's the default. Another common issue is around SNAT port exhaustion: teams sometimes design for it without fully understanding their actual traffic patterns, which leads to over-compensating in the architecture. And then there is outbound traffic design. If traffic paths are not optimized, you can end up routing more data than necessary through the NAT Gateway, which directly increases cost. So it's really less about the resource itself and more about how the architecture around it is designed.

Taylor Houck: So let's talk about that. How do you figure out that you're sending more traffic through a gateway than you need to?

Shailaja Beeram: That's a great question.

Taylor Houck: It's not on the billing file, is it?

Shailaja Beeram: Yeah, it is not on the billing file. What I usually look at is whether the NAT Gateway is actually supporting any workloads. If the subnet doesn't have active compute resources, like VMs or applications that require outbound connectivity, that's usually a strong indicator. I also try to understand the context. Sometimes there is minimal background traffic, but that doesn't necessarily mean it's serving a real application need. That is how I identify them. Not every workload actually needs a NAT Gateway; in some cases, platform-managed outbound connectivity is sufficient. But teams often deploy a NAT Gateway as a standard pattern without evaluating whether it's necessary, which can lead to unnecessary costs.
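One way to act on "minimal background traffic doesn't mean a real application need" is to classify outbound flows by destination before reading the traffic numbers. The data shape and the background-service labels below are hypothetical; real NSG flow logs would need parsing and destination attribution first.

```python
# Destinations that count as platform background chatter, not app traffic.
# These labels are made up for illustration.
BACKGROUND_DESTS = {"dns", "ntp", "azure-monitor", "windows-update"}

def classify_flows(flows):
    """Split outbound flows into application traffic vs. background
    chatter. Each flow is a (destination_label, bytes) tuple."""
    app = sum(b for dest, b in flows if dest not in BACKGROUND_DESTS)
    background = sum(b for dest, b in flows if dest in BACKGROUND_DESTS)
    return app, background

flows = [("dns", 1_000), ("azure-monitor", 20_000), ("api.example.com", 0)]
app, bg = classify_flows(flows)
print(app, bg)  # 0 21000 -> all bytes are background; gateway likely unused
```

If the application-traffic total is zero while the background total is not, the gateway "looks busy" in raw metrics but supports nothing, which is exactly the case that delays cleanup.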

Taylor Houck: Yeah. The concept of a NAT gateway is not new, but the managed NAT Gateway, where they charge you hourly plus a per-gigabyte fee to flow data through it, that's a newer concept. It saves a lot of operational overhead, but as you pointed out, it's not the solution for every use case.

Before we move on from that, I want to go back, because you mentioned that you've seen situations where minor background traffic could make it look like a NAT Gateway is actually in use when in fact there's nothing connected to it, or no real business value in the NAT Gateway being there. To me, it feels like something that could easily be overlooked, because even if you're smart enough to look at the network traffic, you're going to see some traffic there.

How do you figure out that it's just background traffic, and that nothing actually requires the resource to be available?

Shailaja Beeram: That is something we sometimes miss, and it's an unnecessary cost for us. For example, we have around 5,000 logic apps, as I said, and we also have a NAT Gateway that we use as a shared NAT Gateway in our organization for our function apps. We have quite a bit of outbound traffic on those function apps, so that is where we use the shared NAT Gateway, and the cost is optimized accordingly. Earlier we had a regular NAT Gateway, but now we are moving to a shared NAT Gateway so that we can reduce the cost. These are things we had to think through before deploying our function apps, during the architecture itself: how the traffic comes in, what we are sending, and where the traffic is coming from and going to. We design that beforehand, before designing any of the functions.

Taylor Houck: Yeah, it's so important to start thinking about this stuff early, and I want to get into shift left with you soon. But first I wanna talk about a specific story that you told me during our prep call, about a situation in which you were using Azure Functions, hundreds of them, and there was an issue, or an opportunity, you encountered with NAT Gateways.

Can you tell us about that story: the previous state, and the end state you came to?

Shailaja Beeram: So previously we didn't have any NAT Gateway. We were sending traffic through the default SNAT ports, but only a couple of ports were available for that, and we were seeing port exhaustion, overlapping issues, and traffic pattern problems. That is why we implemented the NAT Gateway: it gives you many more ports for sending traffic. We implemented it only for the functions that really need it for their inbound and outbound traffic, and now we are not facing any issues. All the traffic flows through the NAT Gateway for the couple of functions that really require it. That is how we aligned with the design and aligned with the cost: instead of dedicated NAT Gateway instances, we are using a shared NAT Gateway so that we can reduce the cost.
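The SNAT-port math behind a shared-gateway decision like this can be sketched roughly. The 64,512 ports per public IP is Azure's documented allocation for NAT Gateway; the workload counts, per-app connection figure, and headroom factor below are made up for illustration.

```python
# Azure allocates 64,512 SNAT ports per public IP attached to a NAT Gateway.
PORTS_PER_PUBLIC_IP = 64_512

def public_ips_needed(function_apps: int,
                      peak_conns_per_app: int,
                      headroom: float = 1.25) -> int:
    """Rough sizing for a shared NAT Gateway: how many public IPs cover
    the combined peak concurrent outbound connections, with headroom
    to avoid the SNAT exhaustion described above."""
    needed = int(function_apps * peak_conns_per_app * headroom)
    return -(-needed // PORTS_PER_PUBLIC_IP)  # ceiling division

print(public_ips_needed(200, 100))   # 25,000 ports needed -> 1 public IP
print(public_ips_needed(5000, 100))  # 625,000 ports needed -> 10 public IPs
```

This is the "design it beforehand" step in code form: the traffic-pattern estimate decides the gateway shape before anything is deployed, rather than after exhaustion shows up in production.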

Taylor Houck: That's amazing. You also mentioned private endpoints as another hidden cost area. What's the angle there?

Shailaja Beeram: With private endpoints, I have seen cases where backend services are removed or migrated, but the endpoints remain in the networking layer. Over time, these become zombie resources: not actively used, but still contributing to architectural clutter. I think the inefficiency happens mainly because environments evolve over time. What starts as a simple setup, a single application or a tenant, often grows into something more complex, and the original architecture is not always revisited. Over time, as changes are made, some resources are left behind, and that is where these inefficiencies start to build up.

Taylor Houck: It's such a good point, and one that I was literally just speaking with someone about. It was in Google Cloud, and we were talking about Cloud Logging and the fact that there's a default logging bucket that all the logs are sent to. You can also set up sinks, or routes, to send logs to other services, like GCS, to store your logs elsewhere.

Oftentimes teams will set up these sinks to send their logs to, say, BigQuery for analytics or to GCS for storage, but they won't turn off the default sink, which is storing them all in the logging bucket. So now you're holding the same logs in two separate places.

And it all goes back to what you were just saying: over time things change, and you need to keep going back and revisiting the decisions that were made on day one to ensure that they're still valid.
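A quick back-of-the-envelope shows why the forgotten default sink matters. The per-GiB rates below are placeholders for illustration, not current GCP pricing; the structure of the comparison is the point.

```python
def monthly_log_cost(gib: float, keep_default_sink: bool,
                     logging_rate: float = 0.50,
                     gcs_rate: float = 0.02) -> float:
    """Monthly cost when the same log stream is routed to GCS for
    retention, with or without also landing in the default Cloud
    Logging bucket. Rates are illustrative assumptions."""
    cost = gib * gcs_rate            # the intentional GCS copy
    if keep_default_sink:
        cost += gib * logging_rate   # the forgotten duplicate copy
    return cost

print(monthly_log_cost(1000, keep_default_sink=True))   # 520.0
print(monthly_log_cost(1000, keep_default_sink=False))  # 20.0
```

Same logs, same retention need, and the duplicate copy dominates the bill, which is why revisiting the day-one routing decision pays off.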

Shailaja Beeram: Sometimes we forget to go back and revisit as well. That is another drawback, I would say. When we build POCs, for example, we create a POC, then just forget about it and go build dev and production. We don't go back and check: there was a POC with networking, with public IPs that were publicly exposed, and it is still running. Nobody goes back to see whether the POC is still there or not. That is where another hidden inefficiency starts to build.

Taylor Houck: It's interesting because there are two separate concepts going on here, right? One we touched on earlier: this idea of shifting left and bringing cost-aware architecture thinking earlier into the design process, to avoid inefficiencies before they even occur.

That's on one hand. On the other, we were just talking about the fact that things change: utilization patterns shift, project deliverables and timelines change. So much can happen that even if you make the right decision on day one, things drift, and all of a sudden it's not optimal anymore. How do you think about these two, and how they work together?

Shailaja Beeram: So networking doesn't always get the same level of visibility as compute and storage; it sits deeper in the architecture, so teams don't always notice these resources unless they specifically look for them. But over time, the cost and complexity can add up significantly. That is one thing. As for the investigation approach, I usually start by understanding whether there are active workloads in the subnet that actually require outbound connectivity. Then I look at whether the resource still aligns with the current architecture, or if it's just left behind from an earlier stage. That is how we look into it.

Taylor Houck: Now let's talk a little bit about this concept of shift left that we've now touched on twice. We can talk about networking, but it also applies to all of your decisions in compute, database, storage, everything: this idea that cloud architects should be thinking about the cost implications of their designs before they deploy.

I wanna ask you two things about this. One, do you think it's common that this even happens? And two, how do we make it more common?

Shailaja Beeram: People are now aware of the cost. Nowadays, when they approach any design, they're thinking about the pricing, and I think that's a good thing. On the second question: I think FinOps is going to become more integrated with architectural decisions. Instead of looking at the cost after deployment, teams will start considering efficiency during the design phase, and this helps close lifecycle gaps rather than fixing them later.

Taylor Houck: Yeah, I was just in London for a FinOps Foundation meetup, and it was one of the top topics being discussed: AI was probably number one, and shift left number two. How do you go from being reactive to proactive? What do you think is the biggest barrier to making that happen? I mean, you've worked at large enterprises.

What makes it hard to get everyone thinking about cost during design?

Shailaja Beeram: It really starts with awareness. When engineers understand how their designs impact cost over time, they naturally start making better choices. It's less about control and more about visibility and understanding. By understanding how the architecture evolves over time, you can connect those design decisions to cost impact. That is where FinOps doesn't need to be separate; it should be part of how you think about the architecture.

Taylor Houck: Yeah, FinOps and cloud architecture are coming together. I have another angle I want to present to you, and it is this idea that FinOps recommendations or optimizations oftentimes introduce risk, or at least seem to. For example, if FinOps shows up and recommends a change to a resource configuration, or even a change to the resource itself, making that change, especially in production, is a risky endeavor and one that is not always met with open arms.

As an engineer and architect yourself, how do you think about the balance between cost optimization and risk mitigation?

Shailaja Beeram: So one pattern I have seen is that during early stages, like testing, proof-of-concept resources get created, and as I said, we don't always clean those up later. I think addressing those lifecycle gaps from the beginning will play a bigger role. Another example I can give you is API Management. We don't have API Management in our environment yet; we are planning to introduce it, and there is a cost to it. So what we are trying to do is work out how many APIs we have and how we would design for them. These things come into the picture before we decide whether we need to go with API Management or should look at an alternative option, and what the cost incurred would be. Those are decisions we make beforehand, before the design phase. Basically, we have to know what we are doing. If the client requires you to have API Management, then what are the APIs? How many do you have? What is the use of those APIs, and where are we using them? Are they GETs or POSTs? We create that list, then we implement, and that is where the cost can also be calculated.

The same thing applies to what you asked about. For each resource, I would definitely go through the pricing and make sure: if we are using functions, do we really need a NAT Gateway? If yes, what kind of NAT Gateway do we need? Will a shared NAT Gateway work for multiple functions, yes or no? What would the cost be? Those are the things we definitely look at during the design phase.
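The "price it before the design phase" step Shailaja describes for API Management can be reduced to a simple comparison between a pay-per-call tier and a flat monthly tier. The tier names and prices below are placeholder assumptions for illustration, not an Azure quote.

```python
def cheaper_apim_tier(calls_per_month: int,
                      per_million_rate: float = 3.50,
                      flat_monthly: float = 150.0):
    """Pre-design check: given expected API call volume, pick the
    cheaper of a consumption-style (per-call) tier and a flat tier.
    Both prices are illustrative assumptions."""
    consumption = calls_per_month / 1_000_000 * per_million_rate
    if consumption < flat_monthly:
        return ("consumption", consumption)
    return ("fixed", flat_monthly)

print(cheaper_apim_tier(10_000_000))   # ('consumption', 35.0)
print(cheaper_apim_tier(100_000_000))  # ('fixed', 150.0)
```

The inventory she describes (how many APIs, GET vs. POST, where they're used) is what makes the `calls_per_month` input credible; without it, the comparison is guesswork.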

Taylor Houck: It's so important to lay out all those different considerations in parallel. I'm hoping you can speak to our audience for a moment, specifically the listeners who may be earlier in their FinOps journey, and perhaps FinOps practitioners who don't come from an engineering background.

Imagine they came from finance or operations. What guidance or advice would you give them to start adopting this cloud architecture and engineering mindset in their FinOps work?

Shailaja Beeram: I would advise starting by understanding how your architecture evolves over time, then connecting those design decisions to cost impact. FinOps doesn't need to be separate; it is part of it. Understand the process, understand the resources, and understand where the cost is coming from. Everybody does a FinOps report nowadays, and that is where you can figure out the granularity; generally we see monthly, quarterly, or yearly charges. So yes, I would definitely ask them to start by understanding how the architecture evolved over time, and then connect those points. That is where you start learning FinOps. And going forward, every architect will decide on cost optimization before deploying any design into the production environment.

Taylor Houck: That's excellent advice. Now I want to flip it to the other side. Let's say you're speaking to someone who is currently a cloud architect and really understands the cloud and architecture, but they're new to FinOps and thinking about costs. What advice would you give them to start becoming a cost-aware, efficiency-minded architect?

Shailaja Beeram: It really starts with awareness. When engineers understand how their design decisions impact cost over time, they start making better choices. Say you're an architect and you already know your virtual machine spend is spiking; what if you don't have a yearly or quarterly reservation plan for those? That is one thing. Also, App Service plans can be shared: instead of using individual App Service plans, you can use a shared App Service plan, so check on those. And as I said, you can also use a shared NAT Gateway. We need to see where the cost is flowing and where we can improve. An architect has already been involved with all the multiple NAT Gateways and multiple App Service plans, so now they can figure it out and say, okay, if we use a shared NAT Gateway or shared App Service plans, we can save a couple of bucks.
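The shared-plan saving mentioned here is easy to sanity-check with a sketch. The plan price and the apps-per-plan capacity below are illustrative assumptions; real capacity depends on the workloads' CPU and memory footprints.

```python
def app_service_cost(apps: int, share: bool,
                     plan_monthly: float = 75.0,
                     apps_per_plan: int = 10) -> float:
    """Compare one App Service plan per app against packing apps onto
    shared plans. Price and packing density are illustrative."""
    if share:
        plans = -(-apps // apps_per_plan)  # ceiling division
    else:
        plans = apps                       # one dedicated plan each
    return plans * plan_monthly

print(app_service_cost(40, share=False))  # 3000.0 per month
print(app_service_cost(40, share=True))   # 300.0 per month
```

The same consolidation arithmetic applies to the shared NAT Gateway she describes: fewer billed instances serving the same workloads, provided capacity (here, plan resources; there, SNAT ports) is actually sufficient.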

Taylor Houck: It's just a new performance metric to consider, right alongside all the ones they're already so used to. And if what you and I both believe is coming true, which is that the disciplines of FinOps and cloud architecture and engineering are going to converge, then people from both sides are going to have a lot to learn as they come together.

On that note, looking forward, there's a big topic, as I mentioned earlier, actually the number one topic I hear people talk about right now when I go to events or meetups, and that is AI.

I'm curious to hear your perspective on how AI is shaping the role, and what impact you think AI is going to have on both FinOps and cloud architecture over the next couple of years.

Shailaja Beeram: We are already using AI; we are using Copilot, and we are really excited about how helpful it has been for us in our projects. I think AI will improve visibility and identify patterns faster, but architectural awareness and human decisions will still be very important for anything architectural. That is where I think AI fits.

Taylor Houck: There's no doubt it's going to change things, but we are all still human, Shailaja. And that's one thing I want to touch on as we shift gears and prepare to close this episode. I know that you do a few things outside of work that are very, very human, that AI is not going to replace.

Can you tell us a bit about the volunteering work that you do, and why it's so important to you?

Shailaja Beeram: Yeah, so I'm based in Houston, and I really enjoy the diversity and energy of the city. It's a great place to live and work. Outside of work, I like to stay connected with the community. I have done some volunteering work with Volunteering Houston and with Wesley, which has been a really rewarding experience for me. And I also enjoy exploring new technologies and understanding how they fit into real-world solutions.

Taylor Houck: I'm curious to hear a little bit more about the type of service that you do. I believe it's food distribution, is that right?

Shailaja Beeram: Yes, yes, it's food distribution for the volunteering work. There is also, what do you call it, conducting mock interviews and helping people prepare for their next role and for their interviews. So food distribution is one thing, and another is mentoring individuals in whatever they do and are specialized in.

Taylor Houck: Shailaja, that's such a nice thing that you do to give back to the community, and even this podcast itself is giving back. I'm sure a lot of the listeners have learned a ton from the insights you were able to share today. Thank you so much for coming on the show. If people are interested in reaching out, talking, or going deeper with you, where should they find you?

Shailaja Beeram: I'm available on LinkedIn. I always enjoy connecting with people in the community, so LinkedIn is probably the best place to reach out to me and connect.

Taylor Houck: Amazing. Shailaja, this has been a fantastic episode. Thank you so much for coming on the show.

Shailaja Beeram: Thank you for having me here. I really enjoyed your show.

Taylor Houck: And to our audience, thank you so much for coming back. If you got something out of today's conversation, which I'm sure you did, please share this episode with someone who needs to hear it. This has been another amazing episode of FinOps in Action, and we'll see you next time.


Outro: That wraps up another episode of FinOps in Action. Thank you for joining. For show notes and more, please visit finopsinaction.com. This show is brought to you by PointFive, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.
