Making Cost a First-Class Engineering Concern ft. Jai Padmanabhan | Ep #62

FIA - Jai Padmanabhan
===

Jai Padmanabhan: [00:00:00] Make FinOps a first-class citizen. Just like in the old SDLC days, where we said quality is everybody's responsibility and security is everyone's responsibility, FinOps is likewise everybody's responsibility, right?

Intro: Welcome to FinOps in Action. I'm your host, Taylor Houck. Each week I'll sit down with FinOps experts to explore the toughest challenges between FinOps and engineering. This show is brought to you by PointFive, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.

Taylor Houck: Hello, and welcome to another episode of FinOps in Action. Today's guest is a seasoned technology executive with over two decades of experience building and scaling both cloud and AI-driven platforms in complex and regulated environments. He's known for bridging strategy and execution by leading engineering, data, and infrastructure teams, and [00:01:00] partnering with business leaders to translate technology investments into measurable operational and FinOps outcomes. Welcome to the show, Jai Padmanabhan.

Jai Padmanabhan: Thank you, Taylor. Thanks for having me. Excited to be here.

Taylor Houck: Absolutely. I am so excited to be chatting with you, Jai. You know, it's not that often that I get to speak with actual technology executives on the show, so that's actually where I want to start, because you're not just a FinOps practitioner. You run engineering organizations; you own the platform, the infrastructure, the teams. And I've heard you say that cost sits right up there with security and performance as a first-class engineering concern. When did that shift happen for you? Like, when did cost start to be not just a finance problem, but an engineering one too?

Jai Padmanabhan: Yeah. I would say when things started moving from on-prem to the cloud. I've been with various companies in my career, and I noticed that there [00:02:00] was a significant paradigm shift where the cost economics of running at scale in the cloud became much more attractive, and companies were planning to go from on-prem to a hybrid model and then completely into the cloud. At scale it kind of evens out, and it makes sense to have your deployments in the cloud as long as you're able to get enough sales and you have revenues that would offset the cost you pay in the cloud. But for mid-level players that are not yet in the billions of dollars of revenue, who are still getting there, still in growth mode, sometimes investing way too much in the cloud can impact your bottom line. And that realization sets in much later in the game, when they are too sunk into the cloud, and the optimization efforts begin as an afterthought. They're reactive, when the bill shows up and then the [00:03:00] CFO is knocking on the doors, like, what's going on, engineering? I've seen that happen in multiple companies, and that's what struck me: this is an engineering problem to solve. You can have all the levers that a FinOps person could point to, but ultimately it is engineering that needs to take care of the problem at the root level.

Taylor Houck: Yeah, it's interesting, right? Because we essentially went from a world where finance owned the timing and the amount of the purchases for compute, because it was CapEx; they bought it all up front. Then you move to the cloud world and effectively every engineer within your organization is making purchase decisions every day. How do you manage that from a team level?

Jai Padmanabhan: So it is important that teams are made aware that there is a threshold they should try to stay under. On AWS, for example, you can set budgets for the kinds of services that you use. One of the things we've done in the past at my previous place is to set a budget for a [00:04:00] set number of environments that would be brought up and hibernated over weekends. What that means is you would not need to have your Kubernetes worker nodes up and running when there is no usage. Obviously, in the cloud you pay for usage; if there is no usage, no cost. It's fairly simple math that way. So if you were to set a budget that says every weekend my cost for this should not exceed $10, whereas the cost for a full five-day work week is more than 2,000 bucks, then shutting it down over the weekend should take that weekend cost close to zero. You give and take for some nodes that are mandatory to be there, because of the secret zero problem: you need to be able to bootstrap your environment from somewhere. So you would perhaps have one worker node up and running, and the control plane up, but you would be able to shut everything else down. Having such budgets in place, at a team [00:05:00] level and/or at a Kubernetes environment or namespace level, gives engineering teams data they can leverage as they plan their sprints, making sure that they're able to use their resources optimally.
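
The hibernation and budget arithmetic described here can be sketched as a small policy function. This is a minimal sketch, assuming a fixed business-hours window, node counts, and an hourly node rate; none of these are AWS Budgets' actual API, just illustrative values.

```python
from datetime import datetime

# Hypothetical hibernation policy: scale worker nodes to a minimal
# footprint outside business hours and on weekends, keeping one node
# around for the "secret zero" bootstrap problem.

MIN_NODES = 1                  # one worker stays so the environment can bootstrap
BUSINESS_NODES = 10            # assumed normal weekday capacity
BUSINESS_HOURS = range(8, 19)  # assumed 08:00-18:59 local time

def desired_worker_nodes(now: datetime) -> int:
    """Return how many worker nodes a dev/test environment should run."""
    is_weekend = now.weekday() >= 5   # 5 = Saturday, 6 = Sunday
    in_hours = now.hour in BUSINESS_HOURS
    if is_weekend or not in_hours:
        return MIN_NODES
    return BUSINESS_NODES

# Rough budget check: weekend cost at the hibernated footprint.
HOURLY_NODE_COST = 0.10  # assumed $/node-hour for the instance type
weekend_cost = MIN_NODES * HOURLY_NODE_COST * 48  # 48 weekend hours
print(f"Projected weekend cost: ${weekend_cost:.2f}")
```

At these assumed rates the hibernated weekend footprint costs under $5, comfortably inside the $10 weekend budget from the example, versus the full fleet running all weekend.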

Taylor Houck: How do you think about measuring efficiency within your engineering teams? Like, how do you know that you're maintaining efficient architectures?

Jai Padmanabhan: One thing I just mentioned is the budgets: making sure that there are no cost overruns and that we are within the expected estimate that may have been provided at the time of laying down the architecture, times the number of environments that need to be created for dev teams, test teams, performance, staging, pre-prod, prod, and so on. Then there are blue-green architectures that are transient, meaning you build them up for a very specific test and then you're able to tear them down. So [00:06:00] having the engineering discipline and the automation to bring up and tear down on demand, I think that is super critical. Along with that, the efficiency comes from the point of view of what your usage per team is on a monthly basis. So having guardrails around that, making sure that teams are following the hibernation model, making sure that there is no unwanted usage over weekends. It becomes challenging, I would say, for teams that are working across two time zones and sharing environments, so you've got to be cognizant of when the offshore team comes on and when your team ends the day. But you're always able to schedule that at the Kubernetes level, in terms of when environments come up and shut down. So I think a combination of those is what determines the efficiency.

Taylor Houck: Yeah, enforcing these, or instilling a culture of leveraging the best practices for these different resources and architectures, is so important. I know, [00:07:00] from speaking with you before we started recording, that a big part of your world is Kubernetes and EKS. For practitioners who have some Kubernetes spend, but where it's maybe not the majority, I think they'd love to learn from you: what makes cost optimization in Kubernetes different from traditional cloud cost optimization?

Jai Padmanabhan: Yeah. I mean, one of the main reasons people go to Kubernetes is the elastic nature, where you can deploy pods based on demand and the pods scale out and scale in. Then there is the adjacent technology within Kubernetes called KEDA, the Kubernetes Event-Driven Autoscaler. So when there are events, for example when the number of messages on a topic exceeds a certain threshold, say N, then you want to be able to scale out the pods so they can consume off the topic and process faster in real time. That's one event. Another event could be a combination of a CPU uptick on the database side as well as the number of requests per second coming [00:08:00] in from users, which you detect at the ALB, the application load balancer, level. That combination could be a metric, and you would scale out at that point. So Kubernetes allows you to make those smart decisions: scale in when there is no usage, and scale out again when there is. At one point we had a very specific pattern where teams would start their batch processing over weeknights, and so there was a very definite pattern. We could scale out and absorb the traffic without a delay because we knew the pattern, and we also knew when the traffic would die down, so we were able to scale in again at the end of it. So that's the combination: KEDA would scale pods out and in, and Kubernetes would see that because you are spinning up more pods, you need more EC2 worker nodes, and would act accordingly, as long as you have enough [00:09:00] instance types defined in your node pool to scale out on demand as well.
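
The event-driven scaling decision described here can be sketched as a small function. The formula mirrors the general queue-length autoscaler rule of one replica per threshold's worth of backlog, clamped to a min/max range; the threshold and bounds are illustrative assumptions, not KEDA's actual configuration.

```python
import math

# Sketch of a KEDA-style scaling decision: scale pod replicas out when
# the backlog on a topic exceeds a threshold, and back in when it drains.

def desired_replicas(topic_lag: int, lag_threshold: int = 100,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """One replica per `lag_threshold` pending messages, within bounds."""
    if topic_lag <= 0:
        return min_replicas            # idle: scale in to the floor
    wanted = math.ceil(topic_lag / lag_threshold)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(0))       # idle
print(desired_replicas(450))     # moderate backlog
print(desired_replicas(10_000))  # burst: clamped at max_replicas
```

In a real deployment KEDA would evaluate a trigger like this continuously and drive the pod count, and the cluster autoscaler would then add or remove EC2 worker nodes to fit the pods, as described in the conversation.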

Taylor Houck: For someone that's getting started on their Kubernetes optimization journey, what are the common traps or pitfalls that you think they could fall into when deploying these best practices?

Jai Padmanabhan: Yeah. One thing that has bitten my teams in the past is the node consolidation policy within Kubernetes, which would kick in if you set it to run, say, twice every day. If it happens to run in production at a time when there is a lot of traffic, you don't want that, because there's a chance users start to see errors. What consolidation essentially means is that your smaller pods provisioned on larger nodes get uprooted from that node and then packed onto a node that has more spare capacity. When that transition happens, there could be traffic in flight that customers get errors for. So you want to be careful about not over-[00:10:00]optimizing right away. For enterprise systems you could run this in production on weekends, when there is not much traffic happening. And for customer-facing systems, say an e-commerce or retail website, you've got to pick the times when there are fewer users in the system, and do it then. So that's one of the things that comes to mind.
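
A common guardrail against this pitfall is to restrict disruptive consolidation to low-traffic windows. A minimal sketch, assuming a fixed quiet window; this is illustrative gating logic, not any consolidation controller's actual configuration.

```python
from datetime import datetime

LOW_TRAFFIC_HOURS = range(2, 5)  # assumed 02:00-04:59 quiet window

def consolidation_allowed(now: datetime, is_production: bool) -> bool:
    """Permit node consolidation only when disruption is acceptable."""
    if not is_production:
        return True                       # lower environments: any time
    if now.weekday() >= 5:
        return True                       # weekends: low enterprise traffic
    return now.hour in LOW_TRAFFIC_HOURS  # weekday production: quiet hours only
```

A check like this would wrap whatever triggers the consolidation run, so production bin-packing only happens when in-flight traffic, and therefore the blast radius of an uprooted pod, is smallest.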

Taylor Houck: Jai, this actually brings up such a good point, and something I really want to ask you as a leader and executive. How do you think about risk as it pertains to FinOps or optimization strategies? Because I speak all the time with FinOps practitioners, and they love putting together optimization recommendations and ideas for how we can change our resource configuration or architecture to save on cost without impacting performance. But it's that "without impacting performance" part that's always up for debate. How do you think about the [00:11:00] trade-off between risk and optimization or efficiency?

Jai Padmanabhan: Yeah, I would say SLAs always trump efficiency and FinOps, especially in a production environment. You don't want to have spot instances, for example, running in production. Spot instances are auctioned and obtained at a fraction of the cost of on-demand nodes; they are much cheaper to use and therefore attractive. But the downside is that the cloud provider can also yank them away without any notice, so you would definitely want to use them in lower environments, and especially for stateless applications. For production, I would stay away from them. That's where you don't compromise on performance. Within your Kubernetes setup, you would have node pools that specify high-performance instance types, but you would have a different node pool type in a lower environment where you're okay using an older-generation machine type. For [00:12:00] example, if you have the sixth, seventh, or even eighth generation in the Amazon instance family, you may want to use a fifth-generation one in a lower test environment, because all you're trying to do is functionality testing, not measuring performance. Then you would have a dedicated environment where you are mimicking production. You would have that up, like I said, as a blue-green environment: bring it up, mimic what you're going to deploy in production from an infrastructure standpoint, run your CI/CD bits to get your application deployed there, validate your performance, and then, hopefully with automated end-to-end IaC, infrastructure as code, tear the environment down at the push of a button. So that would be the ideal mix.
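
The per-environment node-pool split described here might be sketched like this. The instance names follow AWS's m5/m6i/m7i naming, but the mapping of generations and capacity classes to environments is an illustrative assumption, not a recommendation from the conversation.

```python
# Hypothetical environment-to-node-pool mapping: newer generations and
# on-demand capacity where SLAs matter, older generations and spot
# capacity for stateless functional testing.

def node_pool_for(environment: str) -> dict:
    """Pick instance types and capacity class for a Kubernetes node pool."""
    if environment in ("prod", "performance"):
        return {"instance_types": ["m7i.2xlarge", "m6i.2xlarge"],
                "capacity": "on-demand"}  # SLAs trump savings here
    if environment == "functional-test":
        return {"instance_types": ["m5.xlarge"],  # older generation is fine
                "capacity": "spot"}               # stateless, reclaimable
    return {"instance_types": ["m5.large"], "capacity": "spot"}

print(node_pool_for("prod")["capacity"])             # on-demand
print(node_pool_for("functional-test")["capacity"])  # spot
```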

Taylor Houck: It's such a cloud-native way of thinking, this ability to strip the performance out and just focus on testing the features, where you don't need as big of machines. A lot of folks I've seen [00:13:00] transitioning from the on-prem world to the cloud want development to match production, right? Have you seen that mindset come to bear within engineering organizations over the years, and how do you think that mindset has shifted?

Jai Padmanabhan: I would say that used to be the case, because they want to be able to certify much faster. You could think of it as shift-left in one way, because I know my results sooner. But in other ways it's suboptimal, because you are spending way too much upfront. So I would say try out your experiments from an infrastructure optimization standpoint in the lower environments. In production, you're better off not trying them unless you have completely tested your workloads at scale, and you would also have some DR tests and negative tests, per se, to ensure you're able to withstand either a load spike or a failure and so on. I don't think it makes sense to [00:14:00] have all of that in a developer environment. There are some environments that are only for functionality, and you don't necessarily even need cross-region failover there, so you may be able to get away with just AZ failover. Just deploy in one region: as long as it's deployed across three AZs, availability zones, you should be okay.

Taylor Houck: That gets right to the point of how FinOps operates at this intersection of so many related disciplines, right? Because it's not just about whether the app will run; it's also about your disaster recovery and what your thresholds are for different levels of risk across different environments. I do want to shift gears slightly and talk a little bit about storage optimization, because I know this has also been a big focus for you,

Jai Padmanabhan: Yep.

Taylor Houck: Under a specific context, because you also, as I alluded to in the intro, have a lot of experience in highly regulated industries. How do you think about storage optimization? And is there anything about being in a regulated industry [00:15:00] that impacts your strategy when it comes to storage in the cloud?

Jai Padmanabhan: Yeah, yeah, absolutely. So you have storage via S3 or EFS, primarily. In the healthcare space you deal with PII and PHI all the time, and clients typically send that data as files that could reside on EFS, Elastic File System, or on S3. The goal here is to make sure that you're always compliant, HIPAA compliant, SEC compliant, and that ideally you never, ever delete the data; if you do, it's only when you're well past the compliance window. I've worked at organizations where they don't want to take any risk: even if the requirement is 12 or 15 years, they keep the data forever. So that's where the FinOps strategy comes in. Even if you do have to retain data for some eventuality, you make sure that you are at least storing it in archival storage, the coldest tier. You can do [00:16:00] that by setting lifecycle policies. Within EFS, you move data from the standard tier to the infrequent access tier and on to the archive tier. S3 has similar variations, where you can take the data from standard to the infrequent access tiers, into Glacier, and then Glacier Deep Archive and so on.
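
The tiering just described can be sketched as the rule structure that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, rule ID, and day counts are illustrative assumptions; note that S3 requires objects to sit in Standard for at least 30 days before a transition to Standard-IA.

```python
# Sketch of an "archive forever, never delete" S3 lifecycle policy for
# regulated data, as described in the conversation.

def archival_lifecycle(prefix: str) -> dict:
    """Lifecycle rules that tier data down to cold storage without deleting."""
    return {"Rules": [{
        "ID": f"archive-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},
            {"Days": 90,  "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        # No Expiration block: retention is indefinite for compliance.
    }]}

rules = archival_lifecycle("clients/")
# Applied (sketch) via:
# s3.put_bucket_lifecycle_configuration(Bucket="my-bucket",
#                                       LifecycleConfiguration=rules)
```

A test-environment variant would add an `Expiration` block so data is purged after its short useful life, which is the split Jai describes next.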

So a combination of those would work well in a production system. For test environments, you would have scripts that delete and purge the data after a certain number of days. For EFS, for example, I would move the data out of standard into infrequent access quickly, within 14 to 30 days max, let it stay maybe another 30 days, and thereafter delete. That's how you keep a constant watch on your storage costs. The other aspect is EBS, block storage. Your PVCs for stateful applications bring up [00:17:00] EBS volumes all the time, but as soon as they're unattached, meaning they're orphaned, those just lie there. And they consume not just the disk space: if you've also set up a lifecycle policy to create snapshots, those cost you as well. So it's important to have some sort of automation that essentially puts a tag and a timer on them, evaluates their status, attached or unattached, and then, after that number of days expires, just deletes them. That automation is kind of difficult to nail; well, I wouldn't say difficult, but it's slightly more convoluted than just setting a lifecycle policy in the EFS console. Within AWS, you need some more automation, a Lambda or a Kubernetes CronJob, that you've got to build. But I think that's the storage optimization bit.
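
The tag-and-timer automation Jai describes could be sketched roughly like this. The volume records are illustrative; in practice a Lambda or CronJob would pull them from `ec2.describe_volumes()`, and since the real API reports attachment state but not a detach timestamp, the detach time would come from the tag the automation stamps on.

```python
from datetime import datetime, timedelta

GRACE_DAYS = 14  # assumed retention window for unattached volumes

def volumes_to_delete(volumes: list[dict], now: datetime) -> list[str]:
    """Return IDs of volumes unattached for longer than the grace period."""
    cutoff = now - timedelta(days=GRACE_DAYS)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available"   # "available" = no instance attached
        and v["DetachedAt"] <= cutoff  # orphaned past the timer
    ]

now = datetime(2024, 6, 1)
fleet = [
    {"VolumeId": "vol-app", "State": "in-use",    "DetachedAt": None},
    {"VolumeId": "vol-old", "State": "available", "DetachedAt": datetime(2024, 5, 1)},
    {"VolumeId": "vol-new", "State": "available", "DetachedAt": datetime(2024, 5, 28)},
]
print(volumes_to_delete(fleet, now))  # only the long-orphaned volume
```

Deleting the flagged volumes also stops the associated snapshot lifecycle from compounding the cost, which is the double charge mentioned above.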

Taylor Houck: I love that your mindset always goes back [00:18:00] to automation and putting in these guardrails to ensure that you're staying on top of all these different potential inefficiencies that can exist within a cloud environment. I think anyone that's operated in the cloud at scale understands that the sheer number of resources, services, and pricing models at play is difficult to manage, and that's why it's important to put in these automations. And I hope the listeners will learn from Jai that this is important to think about from day one. But Jai, on that note, I want to bring you back to your earlier days of managing cloud costs. I'm curious if a story or a circumstance comes to mind when I bring up big savings. Have you been a part of any big savings opportunities in your career? If so, I'd love to hear about it.

Jai Padmanabhan: Yeah. At one of my companies we had an entire EKS cluster in a separate region that was never used. [00:19:00] The moment we pointed that out and deleted it out of the equation, that saved big money, and it definitely added to the bottom line. Savings like those are easy to spot; they're wasteful and very, very visible, and your AWS monthly bill will show them up as an eyesore. Those you can tackle. The tricky ones are the data transfer costs: the ones that show up under a mysterious line item called EC2-Other, the EBS data transfer costs, the things you're unable to explain, spread across a million different categories under that dropdown. Those are kind of difficult to find. Some of them you've got to accept: when you have a distributed architecture across regions, there is going to be inter-AZ talk, and then crosstalk between regions, because if the application is running in two [00:20:00] different regions, they talk as well. That is some of that data transfer cost. And retrieval matters too: if you upload a file to, or retrieve a file from, the coldest tier, that costs you more for retrieval and so on. Those are some of the surprises you can avoid by, again, planning your architecture well. For lower environments: is there a true need to run them highly available? Do you need an active-active scenario, or is it active with a warm standby, or active with a pilot light as the secondary region? How do you think about your RTO and RPO? Those points also matter, and they can contribute to your big savings as well.

Taylor Houck: I think it's important to call out the difference between cost savings and cost avoidance, because of course, if you have an inefficient architecture and you make a change, you're [00:21:00] going to see the bill go down overnight. That's measurable and that's real. When you make an optimal architecture decision on day one, you're not going to see that big drop in the bill, because the cost was in a good place to begin with. With that being said, how do you think about measuring FinOps?

Jai Padmanabhan: Fewer dollars every month, that's the easiest way to see it, right? But in terms of what you asked me earlier about efficiency, it can also contribute to developer velocity, especially if FinOps is baked in as part of sprint goals. Meaning: these are the stories, these are the epics that you tackled, and this was the rough estimate we provided, a [00:22:00] ballpark t-shirt sizing that we would be within this range for this month. If you exceed that, then you know what you need to do from an optimization standpoint. Either your initial assessment was incorrect, or, say, you undertook a performance load test and had to repeat that test multiple times. But that test is now all set and done, you have the numbers, and you will not repeat it in the subsequent months, so it is automatically going to be missing from your bill. Those are things you can sometimes foresee and also avoid in the same place.

Taylor Houck: Super interesting. Now, to shift gears once more, I'm curious to ask you this; it's kind of a scenario for you. Say that you had a family member that was just promoted and is running an engineering team, and they have a big cloud bill, let's say in the tens of millions of dollars per month. They ask you: what should I be thinking about, like, day one? What would you steer them towards?

Jai Padmanabhan: First, tag your resources in AWS; otherwise you don't know where to look, right? And Cost Explorer is only as good as [00:23:00] how you tag things. So make sure that you're using the right practices to develop infrastructure as code, say with Terraform, and make sure that tagging is one of the things you always do. You're then able to look by environment, even by specific namespaces and specific applications: how much is my actual usage? That would be number one. Thereafter, it'll be the line items in your cost bill. If you have these tags, you're able to use them in Cost Explorer. And there's the monthly costs by service view, I think it's a saved report in AWS Cost Explorer, that you can always hit first to see at a high level what the various areas are, broken down by service. Most likely it'll be the EC2 number, because that's the heart of where and how your applications run. Subsequent to that would be any SQL server usage, any RDS there, and then S3, EFS, and so on, [00:24:00] in that order. So those would be some of the quick points I would point them to.
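
The tagging discipline described here can be enforced with a simple pre-apply check, so untagged resources never reach the bill unattributed. This is a minimal sketch: the required tag keys and the resource list are illustrative assumptions, not a particular tool's policy format.

```python
# Hypothetical guardrail: before applying infrastructure changes, flag
# any resource missing the tags that cost allocation depends on.

REQUIRED_TAGS = {"team", "environment", "application"}

def untagged_resources(resources: list[dict]) -> list[str]:
    """Return names of resources missing any required cost-allocation tag."""
    return [
        r["name"]
        for r in resources
        if not REQUIRED_TAGS.issubset(r.get("tags", {}))
    ]

plan = [
    {"name": "web-asg", "tags": {"team": "core", "environment": "prod",
                                 "application": "web"}},
    {"name": "scratch-ebs", "tags": {"team": "core"}},  # missing tags
]
print(untagged_resources(plan))  # the non-compliant resource
```

In practice a check like this would run in CI against the Terraform plan, failing the pipeline, so Cost Explorer's tag-based views stay trustworthy.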

Taylor Houck: It's such a good point that you brought this up, because we haven't chatted about visibility much. Your cost reporting, how well you know your billing data, and how up to date you are in your understanding of it, making sure you're consistently on top of it, because things can happen so quickly. That really is phase one, right? Getting that solved, and then you can get into optimization and start to think about engineering and efficiency best practices. But without the visibility, you have no way to measure it. So I think that was a keen point.

Jai Padmanabhan: Yep. Yep. Absolutely.

Taylor Houck: I'm wondering where your head is at as it relates to AI, both in terms of how to use AI to improve the efficiency of your FinOps practice, and in terms of managing the cost of AI as those costs [00:25:00] steadily start to increase within your teams. Are you paying more attention to one side than the other? Where's your head at in terms of this whole AI thing?

Jai Padmanabhan: Yeah, so we are measuring AI at this point from a FinOps angle. We are using the OpenAI APIs from the Azure side, so we're able to look at the Azure logging there and then look at the cost. That is a guide in terms of how we want to proceed, how many tokens are being used, and so forth. So there is some observability around the use of the specific LLMs there. As far as the infrastructure goes, if you're hosting a foundation model, then what is the pathway to deploy it? Is it using a cloud-native service, in this case Amazon Bedrock, and their foundation models? Or are you better off [00:26:00] provisioning GPUs and then deploying your own models there? You need to have CI/CD, and you would need some sort of observability around the usage of that model: how efficient is it, what's the latency, the number of tokens going back and forth. Again, if you're self-hosted model-wise, you don't necessarily pay per token, but those are some of the parameters you would start measuring. We're not there yet from that maturity standpoint, from an AI observability standpoint, but it's definitely in the works.
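
For the hosted-API case, the per-token cost tracking described here reduces to simple arithmetic. This is a rough sketch: the model names and prices are placeholders, not any provider's actual rates, which would come from your provider's price list.

```python
# Hypothetical per-token cost model for hosted LLM APIs.

PRICE_PER_1K = {  # assumed $ per 1,000 tokens as (input, output)
    "assumed-large-model": (0.0100, 0.0300),
    "assumed-small-model": (0.0005, 0.0015),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one inference call at the assumed rates."""
    price_in, price_out = PRICE_PER_1K[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out

# Projection: 2M calls a month, ~800 input + 200 output tokens each.
monthly = 2_000_000 * call_cost("assumed-large-model", 800, 200)
print(f"Projected monthly spend: ${monthly:,.0f}")
```

Instrumenting every call with this kind of accounting, tagged by team or application, is what turns "token observability" into a FinOps signal, the same way tags do for EC2 spend.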

Taylor Houck: Are you seeing your AI costs start to increase? I'm curious, because over the past couple of years I've heard a lot of talk about FinOps and measuring and managing AI spend, but it's historically been a pretty low percentage of the overall tech budget. I'm curious if that's starting to change, especially as, it feels like even since December, there's been a rapid acceleration in the capabilities of these models.

Jai Padmanabhan: Yeah, so month over month there is an uptick in [00:27:00] the usage. Whatever we use for inferencing, direct access to the APIs, be it OpenAI or Gemini, those calls continue to be on an upward trend, and it's important that we keep looking at optimizing. Again, we're early in the stage, and so we don't have much insight yet into some of the development practices for how not to incur more costs. But the cloud principles are fairly simple: no usage, no cost. So it would be hibernation of environments when they're not in use. I think that would be the mantra for anything AI-related as well.

Taylor Houck: It's going to be really interesting to see how all this plays out. As you mentioned, we're in the super early days. No one has years of experience managing AI in the hyperscalers.

Jai Padmanabhan: Yep.

Taylor Houck: They can't, right. And it'll be super interesting to see how it all plays out. Now, I do want to jump back to a [00:28:00] topic we touched on briefly earlier, when we were speaking about storage optimization: these regulated industries. I'm curious how you think about the cloud in general, and engineering management, in a regulated industry. How is it different from a traditional company that's not subject to these regulations?

Jai Padmanabhan: This may not be related to FinOps, but from a security standpoint, we are extra vigilant, making sure that we have every security-related tool at our disposal to be able to detect any sort of data exposure or any malicious attacker or intruder, DDoS attacks at the WAF, and so on. So, again at the AWS level, you would have GuardDuty, you'd have Security Hub, you'd use a CSPM, you would have a SIEM to make sure that you're [00:29:00] able to track the security events and all that. You generally don't want to compromise, especially in the healthcare world, and more so when you're deployed within GovCloud, which is a pathway to FedRAMP status. You would have specific benchmarks from a security standpoint that scan your environments and continuously give you a report on any vulnerabilities they find. You would start with the standard CIS benchmark, and then in GovCloud you'd want to go up to NIST 800-53, benchmarks that give you the chance to become FedRAMP compliant once you're ready for that. So: no compromise at all on security, PHI data exposure, and so on. But that means it has an adverse effect on costs, because obviously you need to deploy all these tools, and that will [00:30:00] cost you money. So there are some things you cannot compromise on, and these are tools that kind of go against the grain of FinOps.

Taylor Houck: I'm glad that you brought up security, because I'm really keen to speak with you about this. As a technology executive, especially in a highly regulated industry, you've got a lot of experience with security, and I think that if you look at where security was many years ago, you can draw many parallels to FinOps and cloud efficiency management. I mean, you mentioned this concept of CSPM, Cloud Security Posture Management: essentially these companies will go out and proactively scan your environment, looking for vulnerabilities and helping you find them. I think there's an emerging practice of cloud efficiency posture management. You can imagine a world in which everyone has these, essentially, detection algorithms running within their environment, helping them find inefficiencies, and when they find an inefficiency, how to remediate it. [00:31:00] What do you think about that cloud efficiency posture management?

Jai Padmanabhan: I think if it's a continual scanning of your environments that gives you weekly updates, that will be useful. That said, AWS does have Trusted Advisor as a service

Taylor Houck: In place.

Jai Padmanabhan: In place, yeah. I don't know if they use AI on the backend, maybe now, but a few years ago we used Trusted Advisor to look at what EC2 spend we had. And they'll advise you about buying savings plans that'll save you a lot more than the on-demand price you would otherwise pay.

Taylor Houck: Absolutely. Yeah, and that goes back to the topic of automation and guardrails, and putting processes in place to keep an eye on the, uh, the efficiency of your cloud environments, akin to, uh, security in many ways.

Jai Padmanabhan: Yep.

Taylor Houck: So, Jai, just as we [00:32:00] near the close of this episode, it's been amazing. I feel I've learned so much from you already. Um, I'm gonna ask you a similar question to what I asked earlier, um, but phrased a little bit differently, right? Let's say that you have visibility into your cloud costs. You understand what you're spending, you've got your allocation in place, right? What's a piece of advice that you would give to a platform engineering leader who has those building blocks in place, but is really just getting started on their FinOps journey?

Jai Padmanabhan: I would say, um, make FinOps a first-class citizen. Um, just like in the old SDLC days where we said quality is everybody's responsibility and security is everyone's responsibility, FinOps is likewise everybody's responsibility, right? You treat cloud costs as if you were paying the bill from your own credit card; what impact would that have on you if you were to see a, a million-dollar bill on your credit card, right?

So you would [00:33:00] need to infuse that kind of, um, mentality, to save and to make sure that, um, you know, there are no overages, uh, on a weekly or a monthly basis among teams. So that's a practice, uh, that comes with education of the teams, making sure that your Scrum masters, the managers, the implementation engineers, all of them understand that it is everybody's, uh, responsibility to continue to optimize for costs.

So I would say start there, start sooner, before it becomes a problem that needed to be solved yesterday, meaning you see a bill that continuously increases and then one fine day, you know, you get directions from the CEO that you gotta, uh, downsize big time, and you don't have the resources identified because you don't have tags on them.

You don't know how to identify them. So again, you know, I know your question was, if you had everything set up right, then hopefully you don't get into that situation. So you're able to start [00:34:00] earlier and then start small. You know, do the experiments in the lower environments before you bring them up to production, um, and make use of

the savings plans, especially if you're gonna be, um, staying with a specific cloud provider for longer; then that makes sense. Um, and, you know, again, focusing on automation to keep you updated about changes to environments. Budgets are, again, a key aspect to ensure that you don't, um, spend beyond your estimated budget.
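The budget guardrail Jai mentions can be sketched as a simple pacing check: project month-to-date spend to month end and flag teams on track to overshoot. The team names, spend figures, and the linear projection are all illustrative assumptions:

```python
# Minimal budget-guardrail sketch: project month-to-date spend to month end
# and flag teams pacing past their budget. Team names, spend figures, and the
# linear projection are illustrative assumptions, not a real billing feed.

def budget_status(mtd_spend: float, budget: float, day: int,
                  days_in_month: int = 30) -> str:
    """Classify a team's spend: over budget, on pace to exceed it, or fine."""
    projected = mtd_spend / day * days_in_month
    if mtd_spend > budget:
        return "OVERAGE"      # already past the budget
    if projected > budget:
        return "AT_RISK"      # on pace to exceed it by month end
    return "OK"

teams = {"platform": (4200.0, 5000.0), "data": (1800.0, 6000.0)}
for name, (spend, budget) in teams.items():
    print(name, budget_status(spend, budget, day=15))
```

In practice the cloud providers' native budget services do this projection and alerting for you; the value is wiring the alert to a team that will act on it.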

I think those are some of the points that I would, uh, provide.
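The tagging gap described above, discovering in a crunch that you can't tell whose resources are whose, can be caught early with a trivial coverage check. A toy sketch; the resource records and the required tag keys are made-up examples:

```python
# Toy tag-coverage check: list resources that could not be allocated to a team
# in a cost crunch. The resource records and required tag keys are made up.

REQUIRED_TAGS = {"team", "env"}

resources = [
    {"id": "i-0a1", "tags": {"team": "platform", "env": "prod"}},
    {"id": "i-0b2", "tags": {"env": "dev"}},   # missing the owning team
    {"id": "vol-9c3", "tags": {}},             # completely untagged
]

def untagged(resources, required=REQUIRED_TAGS):
    """Return ids of resources missing any required allocation tag."""
    return [r["id"] for r in resources if not required <= r["tags"].keys()]

print(untagged(resources))
```

Running a check like this weekly, before a downsizing mandate arrives, is the "start sooner" advice in executable form.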

Taylor Houck: Such, such excellent advice, Jai. Really appreciate you diving into such detail and providing such analysis and perspective; I'm sure our listeners have gotten a lot out of this. Now, one more piece for those listeners, because, Jai, you've had a long and impressive career. You've been an engineer, an engineering manager, an executive. You studied [00:35:00] electrical and computer engineering in school, then went and got an MBA to get the business background. If you just think back on your career, if you were to give, you know, some advice to folks that are earlier along in their journey, what would you tell them from what you've learned over the years?

Jai Padmanabhan: I think versatility is the key. Uh, I started doing application tech support. Um, I saw that people around me were, um, you know, doing software coding. I was, again, an electrical engineer, so programming was not my, uh, favorite thing to do at that point in time. And I realized that, you know, the roles out there were demanding something different than what I'd learned.

And so I had to teach myself, um, you know, Java and Python back in the day, and I was able to get a software automation role that then transformed into performance engineering. Fast forward to management; the MBA happened. And then I was able to use those performance engineering skills within the [00:36:00] umbrella of SRE, or Site Reliability Engineering, and infrastructure.

And that's how my cloud journey began. And then now with, you know, FinOps and, and DevOps and all those, uh, you know, pieces as well. So I think I've covered the entire gamut of the SDLC process. And within the last two years, AI has taken off, uh, a great deal. And so, again, not to be left behind,

I did some, you know, innovative work, uh, where I was able to enable, um, every employee at my, uh, previous company to use enterprise document search with AI, using a RAG-based assistant that I was able to, uh, come up with. So self-learning, um, and continuous learning, I think, is a key, uh, to success.
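The retrieval step of a RAG-based assistant like the one Jai describes can be sketched in a few lines. This toy version scores documents by word overlap instead of embeddings, and the documents and question are invented for illustration:

```python
# Toy sketch of the retrieval step in a RAG-style assistant: pick the document
# with the most word overlap with the question, then ground the prompt in it.
# Documents and question are invented; a real system would use embeddings.

docs = {
    "expenses": "Submit expense reports through the finance portal by month end.",
    "vpn": "Install the corporate VPN client before accessing internal systems.",
}

def tokens(text: str) -> set:
    """Crude normalization: lowercase, strip basic punctuation, split on spaces."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(question: str) -> str:
    """Return the key of the document sharing the most words with the question."""
    return max(docs, key=lambda k: len(tokens(docs[k]) & tokens(question)))

def build_prompt(question: str) -> str:
    """Ground the model's answer in the retrieved document only."""
    context = docs[retrieve(question)]
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(retrieve("How do I submit an expense report?"))
```

The design point is the "enterprise document search" use case: the assistant answers from company documents it retrieves, rather than from whatever the model happens to remember.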

Taylor Houck: Yeah, it seems like you've just kind of grown and stayed on that cutting edge and continued to take that next step. And I think it's something that, you know, everyone can learn from: that, you know, kind of continuous learning mindset and always being open to what's [00:37:00] next.

Jai Padmanabhan: Absolutely

Taylor Houck: Amazing. Thank you so much again, Jai, for coming on the podcast.

Where can people find you if they want to connect or, or learn more?

Jai Padmanabhan: People can find me on LinkedIn. Again, my name is Jai Padmanabhan. Uh, feel free to reach out.

Taylor Houck: Awesome. Jai, this has been fantastic. Thank you so much for coming on the show,

Jai Padmanabhan: Thank you, Taylor, for inviting me. Thanks.

Taylor Houck: And thank you to our audience. Uh, if you got something out of today's conversation, which I'm sure you did, share this episode with someone who needs to hear it. This has been another awesome episode of FinOps in Action, and we'll see you next time.

Outro: That wraps up another episode of FinOps in Action. Thank you for joining. For show notes and more, please visit finopsinaction.com. This show is brought to you by 0.5, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.
