The Hidden Costs of Engineering ft. Kumar Singirikonda | Ep #72
FIA - Kumar Singirikonda
===
Kumar Singirikonda: [00:00:00] That's why I often say the biggest operational cost problem today is inefficiency rather than infrastructure pricing. Organizations that reduce operational friction unlock far greater long-term value than organizations focused only on reducing compute spend.
Intro: Welcome to FinOps in Action. I'm your host, Taylor Houck. Each week I'll sit down with FinOps experts to explore the toughest challenges between FinOps and engineering. This show is brought to you by PointFive, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.
Taylor Houck: Hello, and welcome to another episode of FinOps in Action. Today's guest sits in a seat that I think more FinOps practitioners should be paying attention to. Not the FinOps team itself, but the engineering [00:01:00] leader that the FinOps team reports into. He is the director of DevOps engineering at Toyota Financial Services, where he spent the last decade building out platform engineering, DevOps, AIOps, and AI-driven application development.
He is a FinOps Foundation board member, an advisory board member at the McCombs School of Business at UT Austin, and the author of the "DevOps Automation Cookbook." His take is that FinOps as we know it is only the first chapter, and that the next chapter will be based on self-healing systems, AI-driven remediation, and a whole class of unknown costs that traditional cost reports never surface.
Welcome to the show, Kumar Singirikonda.
Kumar Singirikonda: Thank you so much, Taylor, for the kind intro. Much appreciated.
Taylor Houck: Absolutely, Kumar, and thank you so much for joining us on the show. I'm really looking forward to diving into this conversation with you. To get [00:02:00] started, can you tell our listeners a little bit about your background and the journey that led you into DevOps, platform engineering, and FinOps?
Kumar Singirikonda: Absolutely. So my career has been centered around DevOps, platform engineering, cloud transformation, and operational excellence. Over the past decade, I have worked extensively in enterprise environments where resilience and operational efficiency are critical to business success. During that journey, I began noticing that many organizations approached cloud optimization primarily from a financial reporting perspective. Most FinOps conversations focused on infrastructure utilization, cloud spend visibility, and budgeting, which are certainly important but only tell part of the story. [00:03:00] As enterprise systems became more distributed and complex, I realized the larger hidden costs were actually operational in nature. Downtime, recurring incidents, delayed deployments, manual remediation, alert overload, and engineering inefficiency were creating massive business impact that traditional cost reports often fail to capture. That realization shifted my thinking toward a broader operational view focused on engineering productivity, intelligent automation, and self-healing systems. Today, I see FinOps evolving into something much bigger. The future is not simply about controlling [00:04:00] infrastructure costs. It is about creating intelligent operational ecosystems where platforms optimize themselves, engineers spend less time firefighting, and organizations can innovate faster with greater resilience.
Taylor Houck: Yeah, and I think that technology and the shifts that we're seeing in AI are only accelerating this journey. And really the ownership of cost control is moving away from pure finance personas and into engineering, where we actually own the resources that are driving the cost. Now, I've heard you talk a lot about moving from cost control into value creation.
Why is that shift so important right now?
Kumar Singirikonda: Traditionally, FinOps has done a very good job helping organizations understand where cloud money is being [00:05:00] spent. But in many enterprises, cost optimization became narrowly focused on reducing infrastructure expenses without fully understanding business impact or value creation. The shift toward value creation means organizations are starting to ask much deeper questions. Instead of simply asking, "Hey, how do we reduce cloud spend?" leaders are now asking, "How do we maximize business value from our engineering and operational investments?" For example, an organization may aggressively optimize infrastructure cost but still lose money because engineers spend hours manually resolving incidents or dealing with unstable deployments. [00:06:00] The operational inefficiency becomes more expensive than the infrastructure itself. Value creation comes from improving engineering productivity and innovation, increasing deployment velocity, improving customer experience, and building resilient platforms that reduce operational friction. In that model, cost optimization becomes an outcome of operational maturity rather than the primary objective itself. That's why I believe the future of FinOps is much broader than financial governance. It's becoming deeply connected to platform engineering, operations, reliability engineering, and organizational efficiency.
Taylor Houck: It's such a good point because the cloud bill that comes in [00:07:00] is not the only cost within an engineering organization, right? The actual headcount or people expense is most of the time much, much higher than that, right? And asking an engineer who's being paid, let's say, a handsome salary to go and clean up, let's say, unattached volumes that are costing, let's call it, tens of dollars per resource, it's a waste of time and energy.
Now, if that engineer or that engineering team can build, as you would describe it, a self-healing system that would eliminate that waste in a repeated way, in a scalable way, and kind of have it be a, "Hey, let's fix this once and now it is fixed forever," that becomes a much more appealing scenario. And I think that, you know, in speaking with you before this call, there are many other examples of, let's say, unknown costs or costs that live outside of the cloud bill itself.
What do you think are some of the unknown costs that organizations often fail to [00:08:00] recognize?
Kumar Singirikonda: So most organizations are very good at measuring direct cloud expenses because those costs appear clearly on your invoice or on dashboards. But many of the largest operational costs never show up in your traditional financial reports. For example, consider a production incident that impacts customer-facing services for thirty minutes. Let's take that as an example, okay? The infrastructure cost during that outage may not change significantly, but the operational impact can be enormous. Multiple engineering teams may join emergency bridge calls. Product releases may get delayed. Customer trust may be affected. Revenue-generating systems may slow down, and support ticket volumes may [00:09:00] increase. Engineers lose productive development time while managing the incident. None of these productivity losses or business impacts typically appear in your cloud cost dashboard, yet they can easily exceed the actual infrastructure expenses. There are also hidden costs associated with operational complexity. Engineering teams often spend time managing repetitive manual tasks, troubleshooting noisy alerts, scaling infrastructure manually, or coordinating incident remediation across fragmented systems. Over time, operational toil reduces innovation velocity and creates burnout [00:10:00] within engineering organizations. That's why I often say the biggest operational cost problem today is inefficiency rather than infrastructure pricing. Organizations that reduce operational friction unlock far greater long-term value than organizations focused only on reducing compute spend.
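As an illustration of the thirty-minute incident example above (the numbers are not from the episode), the gap between what the cloud bill shows and what the incident actually costs can be put into rough arithmetic. The sketch below is a minimal model under stated assumptions: the `Incident` fields, the `incident_cost_breakdown` function, and every dollar figure are hypothetical placeholders to be replaced with an organization's own data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Inputs for a rough incident cost estimate; every value is a hypothetical placeholder."""
    duration_minutes: float       # length of the outage
    engineers_on_bridge: int      # people pulled onto the bridge call
    loaded_hourly_rate: float     # fully loaded cost per engineer-hour
    revenue_per_minute: float     # revenue normally flowing through the service
    revenue_loss_fraction: float  # share of that revenue lost during the outage (0 to 1)
    extra_support_tickets: int    # tickets attributed to the incident
    cost_per_ticket: float        # handling cost per ticket
    extra_infra_cost: float       # additional cloud spend during the incident


def incident_cost_breakdown(incident: Incident) -> dict:
    """Split an incident's cost into what the cloud bill shows and what it does not."""
    hours = incident.duration_minutes / 60
    engineering = hours * incident.engineers_on_bridge * incident.loaded_hourly_rate
    lost_revenue = (incident.duration_minutes
                    * incident.revenue_per_minute
                    * incident.revenue_loss_fraction)
    support = incident.extra_support_tickets * incident.cost_per_ticket
    return {
        "engineering_time": engineering,
        "lost_revenue": lost_revenue,
        "support": support,
        "hidden_total": engineering + lost_revenue + support,
        "visible_on_cloud_bill": incident.extra_infra_cost,
    }


if __name__ == "__main__":
    # Illustrative inputs only: a 30-minute outage with 12 people on the bridge.
    example = Incident(
        duration_minutes=30,
        engineers_on_bridge=12,
        loaded_hourly_rate=120.0,
        revenue_per_minute=500.0,
        revenue_loss_fraction=0.5,
        extra_support_tickets=80,
        cost_per_ticket=15.0,
        extra_infra_cost=25.0,
    )
    for label, amount in incident_cost_breakdown(example).items():
        print(f"{label:>22}: ${amount:,.2f}")
```

The model deliberately leaves out harder-to-quantify effects such as delayed releases and lost customer trust, so it understates the hidden side rather than overstating it.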
Taylor Houck: I'm actually really interested to dive into this with you because you're coming at it from an executive perspective. I think a lot of the listeners on the podcast today are sitting in the seat of the FinOps practitioner, right? And historically, their focus has been squarely on the cloud bill, right?
And the OpEx that comes with it. What you're describing is, hey, that's only part of the picture, and we need to be thinking about engineering efficiency from a different lens. My question to you is: how do you think about measuring this [00:11:00] engineering efficiency more holistically? And whose job is it or whose job should it be to help you measure this?
Kumar Singirikonda: So when it comes to engineering efficiency, I actually think it is becoming one of the most important strategic disciplines in enterprise technology organizations. Traditional FinOps focused heavily on infrastructure accountability and cloud optimization, but engineering efficiency expands that conversation into how organizations maximize innovation capacity while minimizing operational friction. Skilled engineers are among the most valuable assets any organization has. If those engineers spend most of their time troubleshooting incidents, responding to alerts, manually scaling systems, or resolving [00:12:00] repetitive operational failures, organizations may lose innovation velocity and strategic momentum.
So engineering efficiency focuses on reducing operational toil through platform engineering, intelligent automation, self-service infrastructure, AI-driven observability, and autonomous remediation, but also operational standardization. The goal is to allow engineers to spend more time building products and less time maintaining complexity. I see that as the next chapter of operational maturity. Organizations are beginning to realize that improving engineering productivity creates far greater long-term business value than simply optimizing infrastructure utilization. In many ways, engineering efficiency becomes a competitive [00:13:00] advantage because organizations that operate more intelligently can innovate faster, scale more effectively, and deliver better customer value and experience.
Taylor Houck: Thank you so much. That's super helpful. And I now wanna shift gears only slightly and talk about a concept that's already come up a few times in today's conversation, which is this concept of a self-healing agent or a self-healing system. I know that you have described these self-healing systems as the next evolution in operational maturity.
What does this self-healing platform actually look like in practice?
Kumar Singirikonda: That's a great question. A self-healing platform is essentially an intelligent operational ecosystem capable of detecting issues, diagnosing probable root causes, and automatically remediating incidents without requiring human intervention. Traditional [00:14:00] operations are highly reactive: monitoring systems generate alerts, engineers investigate incidents manually, teams collaborate on troubleshooting, and remediation actions are performed by humans. That model worked reasonably well in simpler environments. But modern cloud-native architectures are far too distributed and dynamic for purely manual operation.
Scaling that model effectively is very, very difficult. Self-healing systems fundamentally change that approach. These platforms continuously collect telemetry from applications, infrastructure, APIs, deployment pipelines, user traffic, and operational events. AI-driven observability engines analyze that telemetry in real time to [00:15:00] identify abnormal behavior patterns. When incidents occur, event-driven automation workflows immediately trigger remediation actions. The platform can automatically execute operational playbooks, roll back deployments, restart unhealthy services, scale resources dynamically, reroute traffic, or rebuild failed infrastructure components without waiting for engineers to intervene. What makes these systems transformational is not simply faster remediation. The larger impact comes from reducing operational interruption. Engineering teams no longer spend the majority of their time reacting to operational instability. Instead, they can focus [00:16:00] more on innovation, architecture, AI initiatives, and customer-focused development. That's where self-healing systems become far more than operational tools. They become strategic business enablers.
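The "identify abnormal behavior patterns" step above is described only at a high level, so here is one minimal way it could look, offered as a sketch rather than as the platform Kumar describes. It swaps the AI-driven engine for a plain rolling-baseline check: a sample is flagged when it exceeds the recent mean by a set number of standard deviations. The class name `ErrorRateMonitor` and all thresholds are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class ErrorRateMonitor:
    """Flags error-rate samples that sit far above a rolling baseline.

    A simple statistical stand-in for the observability engine discussed above,
    not a trained model. All default thresholds are hypothetical.
    """

    def __init__(self, window: int = 30, threshold_sigmas: float = 3.0,
                 min_samples: int = 10, min_spread: float = 0.005):
        self.history = deque(maxlen=window)   # recent non-anomalous samples
        self.threshold_sigmas = threshold_sigmas
        self.min_samples = min_samples        # wait for a baseline before flagging
        self.min_spread = min_spread          # floor so a flat baseline ignores tiny noise

    def observe(self, error_rate: float) -> bool:
        """Record one sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            anomalous = error_rate > baseline + self.threshold_sigmas * spread
        if not anomalous:
            # Keep anomalies out of the baseline so a long incident
            # does not gradually become the new "normal".
            self.history.append(error_rate)
        return anomalous


if __name__ == "__main__":
    monitor = ErrorRateMonitor()
    samples = [0.01] * 20 + [0.25, 0.30, 0.012]
    for minute, rate in enumerate(samples):
        if monitor.observe(rate):
            # In an event-driven setup, this is where a remediation event would be emitted.
            print(f"minute {minute}: error rate {rate:.3f} flagged as anomalous")
```

A real platform would likely combine several signals (latency, traffic, deployment events) and a learned model; the point of the sketch is only that detection produces a discrete event that automation can act on.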
Taylor Houck: It's quite a forward-looking approach to engineering and building engineering teams. I'm curious if there are any specific examples of self-healing processes that you've put into place that you'd be willing to share with our audience today?
Kumar Singirikonda: Sure. One example I'll talk about is the 404 error.
Taylor Houck: Okay.
Kumar Singirikonda: It's a great example because almost every enterprise has some version of this scenario. Imagine a production deployment that introduces a routing configuration issue that suddenly causes a spike in HTTP 404 errors across a customer-facing API. [00:17:00] In a traditional operational environment, monitoring systems trigger an alert, engineers begin reviewing the logs, teams join the incident bridge calls, and operational personnel troubleshoot the deployment changes. Depending on organizational complexity and the issue, that process can take anywhere from thirty minutes to several hours. Now imagine the same situation inside a self-healing operational infrastructure. The observability platform continuously monitors traffic patterns, error rates, deployment events, and operational telemetry. As soon as the platform detects an abnormal 404 spike, it correlates the behavior with recent deployment activity. An event is automatically triggered [00:18:00] through Amazon EventBridge, which invokes a Lambda remediation workflow. The Lambda collects deployment metadata, application logs, configuration, and infrastructure telemetry. That information is passed into a machine learning model trained on historical incident patterns. The model identifies a high probability that the issue is related to deployment routing configuration drift. Once the root cause is identified, the system automatically executes a remediation playbook. The playbook could be a Python script or JavaScript. That may involve rolling back the deployment, restarting ingress controllers, validating endpoint health, clearing stale routing [00:19:00] caches, and redirecting traffic back to the instances. One more thing I wanted to bring up here: synthetic monitoring validates service recovery, and engineers receive an automated notification confirming the issue was resolved.
So the entire process may complete in under two minutes without requiring engineers to manually intervene. That's where self-healing systems become transformational. They do not just reduce downtime. They fundamentally redefine engineering productivity because engineers can focus on innovation instead of repetitive operational firefighting.
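The 404 workflow described above (event, context, model prediction, playbook, synthetic validation, notification) can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the implementation discussed in the episode: the event only borrows the top-level `detail` key that EventBridge events carry, every field inside it is hypothetical, `classify_root_cause` is a hand-written rule standing in for the trained model, and the playbook steps just record what they would do. The confidence threshold that hands low-confidence cases to a human is also an added assumption.

```python
from typing import Callable, Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off below which a human is paged instead


def classify_root_cause(detail: dict) -> Tuple[str, float]:
    """Stand-in for the trained model: a hand-written rule, not real ML."""
    recent_deploy = detail["minutes_since_deploy"] <= 15
    spiking = detail["not_found_rate"] >= 0.2
    if recent_deploy and spiking:
        return "deployment_routing_drift", 0.9
    return "unknown", 0.3


# Playbook steps only record what they would do; real ones would call deployment tooling.
def rollback_deployment(detail: dict, actions: List[str]) -> None:
    actions.append(f"rollback {detail['deployment_id']}")


def restart_ingress(detail: dict, actions: List[str]) -> None:
    actions.append(f"restart ingress for {detail['service']}")


def clear_routing_cache(detail: dict, actions: List[str]) -> None:
    actions.append(f"clear routing cache for {detail['service']}")


PLAYBOOKS: Dict[str, List[Callable[[dict, List[str]], None]]] = {
    "deployment_routing_drift": [rollback_deployment, restart_ingress, clear_routing_cache],
}


def remediate(event: dict,
              health_check: Callable[[str], bool],
              notify: Callable[[str], None]) -> dict:
    """Run the matching playbook, validate recovery, and notify engineers."""
    detail = event["detail"]
    cause, confidence = classify_root_cause(detail)
    actions: List[str] = []

    # Guardrail: act autonomously only on a known cause with high confidence.
    if cause not in PLAYBOOKS or confidence < CONFIDENCE_THRESHOLD:
        notify(f"Escalating {detail['service']}: cause={cause}, confidence={confidence}")
        return {"status": "escalated", "cause": cause, "actions": actions}

    for step in PLAYBOOKS[cause]:
        step(detail, actions)

    # Synthetic check: confirm the service actually recovered before declaring success.
    recovered = health_check(detail["service"])
    status = "resolved" if recovered else "escalated"
    notify(f"{detail['service']}: {status} after {len(actions)} automated steps ({cause})")
    return {"status": status, "cause": cause, "actions": actions}


if __name__ == "__main__":
    sample_event = {"detail": {"service": "orders-api", "deployment_id": "deploy-123",
                               "minutes_since_deploy": 4, "not_found_rate": 0.35}}
    print(remediate(sample_event, health_check=lambda service: True, notify=print))
```

Passing `health_check` and `notify` in as functions keeps the decision logic testable without any cloud access; in a Lambda deployment the same `remediate` function could be called from the handler with real implementations.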
Taylor Houck: You know, it's such a good point and it really brings you to a higher level of working, right? And if I could just repeat back what I'm hearing from you, it is essentially, [00:20:00] hey, we can ingest all this data that, let's say, in the old world, we ourselves would look at and take actions on based on how we interpret it.
Rather than doing that every single time, let's build a, let's say, deterministic system that ingests this data and applies these rules, and based on certain criteria, can actually take actions. Now, with AI, you can actually embed, let's say, non-deterministic steps within that deterministic system that could help you to make decisions and route things accordingly.
But, you know, one thing I'm really keen to get to with this new way of working and this way in which you view the world is where AI weaves itself in. Because, I mean, you mentioned in that one example that you have, let's call it, a step within your deterministic workflow that runs some ML capabilities to make a decision and route the actions.
But you could also, I imagine, have AI or gen AI capabilities to help you build these systems [00:21:00] themselves. How does AI fold into this worldview of self-healing agents and systems?
Kumar Singirikonda: There's a lot of hype going on around AI, so I think it is important to separate intelligent operational automation from generic AI discussions. AI becomes valuable when it is embedded into operational workflows that drive real business outcomes. One of the key points I want to make here is that a model by itself does not create operational intelligence. AI without operational agents and execution frameworks is limited because predictions alone do not resolve incidents. The real value emerges when AI is [00:22:00] connected to observability platforms, automation workflows, remediation playbooks, deployment systems, and operational orchestration layers. That combination creates autonomous operational systems that are capable of making intelligent decisions and executing corrective actions in real time. Let's take an example. AI models can analyze deployment behavior, traffic anomalies, historical incidents, infrastructure telemetry, and dependency relationships to predict the likeliest root cause. But the real transformation occurs when operational agents can automatically execute remediation actions based on those predictions. That is where agentic workflows become extremely important. [00:23:00] AI provides the decision intelligence, while the operational agents execute the remediation, optimization, and orchestration autonomously. So the future of enterprise operations will increasingly rely on this combination of AI-driven intelligence and autonomous operational execution.
Taylor Houck: It's such an interesting perspective and one that I think is really important for people to really sink their teeth into. And that is that this is not us simply saying that you should throw AI at every single problem and have it fix things for you. No, you should really be thoughtful about building this, let's say, deterministic experience in how you want to route problems and fix issues, but use AI to your advantage throughout the journey.
Leverage these AI capabilities to actually deploy and, you know, build the fixes and build [00:24:00] these deterministic systems. But it is not as simple as, you know, prompting your generic use case, or just piping your observability data into, you know, an S3 bucket and giving an AI agent access to it.
Like, that's not what we're talking about.
Kumar Singirikonda: True, true. Absolutely, you brought up the right point.
Taylor Houck: Very interesting. Now, I do wanna shift gears slightly because I know we could talk about AI forever, but we have limited time today. I do wanna touch on this concept of engineering efficiency that you've brought up several times already in today's discussion. And we got into this a little bit earlier, but I really wanna be specific and get an answer out of you.
Do you view engineering efficiency as its own discipline, or is it the next evolution of FinOps itself?
Kumar Singirikonda: In my perspective, I think it should be a different discipline. When you have engineering efficiency as a separate discipline, it [00:25:00] can focus more on reducing operational toil. You can call it platform engineering, intelligent automation, or self-service infrastructure, but it should include AI-driven observability, autonomous remediation, and operational standardization.
So I would say it should be a separate, dedicated discipline, with an entire engineering organization focused on engineering efficiency.
Taylor Houck: Man, that is spot on and exactly how I view it as well because it's like cost does not exist over here in its own box, right? Cost is a consequence of your architecture, of your resource configurations, and that architecture and those resource configurations are built so that the platform or so that the application is scalable, so that it's secure, so that it's reliable, right?
And cost is just another measure as a part of that. So when you're thinking [00:26:00] about optimizing your cloud environment, it's not, oh, let's just optimize it for cost. No. It's let's optimize it for performance and cost and security and scalability and all of these different aspects together, where again, this is why I think having you here and representing this real engineering leadership and executive perspective on, hey, cost is not the only thing that we care about, and therefore solving these problems is not squarely on what you would say a FinOps persona or a FinOps title.
And I think that this is where you're really gonna get into the role of FinOps evolving over time. And naturally, you know, things change, and especially with AI, they're changing very quickly, and it's on us as individuals to embrace that change and really grab onto it. Now, I do wanna touch on culture a little bit here because it's very important, especially in these moments of rapid change.
I think that humans intrinsically, we [00:27:00] like the status quo, right? And especially when you're working in big organizations, there are a lot of people and relationships and culture that has existed and manifested over time. From your perspective, how should organizations build this cost-aware engineering culture without slowing down innovation?
Kumar Singirikonda: So I would say that balance is incredibly important because organizations sometimes approach cost governance in ways that unintentionally discourage innovation or create friction for engineering teams. I'm looking at this from my engineering perspective: it should not interfere with your innovation. A successful cost-aware culture starts with transparency and shared accountability rather than rigid controls. Engineers need [00:28:00] visibility into how architectural decisions, behaviors, and deployment patterns impact business outcomes. But that visibility should empower better decisions, not create fear around experimentation or the proofs of concept that we do on a day-to-day basis. The key is integrating cost awareness directly into engineering workflows through observability, automation, platform intelligence, and the self-healing tools that I discussed. When developers can see real-time operational insights tied to performance, scalability, and customer impact, they naturally begin optimizing systems more intelligently.
You don't need to enforce it. Organizations [00:29:00] also need to recognize that innovation itself creates value. Sometimes increasing infrastructure spend temporarily may accelerate product delivery or improve customer experience significantly. The objective should not be minimizing cost at all times. The objective should be maximizing operational and business value. That's why mature engineering cultures focus on intelligent optimization rather than restrictive governance.
Taylor Houck: Incredible insights, Kumar. I feel so fortunate to have had the opportunity to chat with you and dive into all of these topics. Now, before I let you go, I know that you're a very busy man. I do wanna give our listeners the opportunity to learn a bit from you outside of just the very, let's say, practical FinOps learnings.
I know that you've spent about a decade [00:30:00] now at Toyota Financial Services. What are some lessons from your experience that have shaped your leadership approach?
Kumar Singirikonda: One of the biggest lessons I learned is the importance of balancing innovation with operational resilience. Large enterprise environments like Toyota's require systems that are not only scalable and efficient, but also highly reliable and secure. Over the years, I had the opportunity to lead global engineering teams focused on DevOps, platform modernization, automation, operational excellence, and many other initiatives. Those experiences reinforced how important engineering culture, collaboration, and continuous improvement are in driving successful transformation. I also learned that operational maturity is not achieved through technology alone. It requires leadership alignment, [00:31:00] process standardization, cross-functional collaboration, and long-term investment in engineering enablement. Many of the concepts we are discussing today around self-healing systems, engineering efficiency, and autonomous operations evolved from real operational challenges experienced at enterprise scale. Working in these environments provided a valuable perspective on how technology operations directly influence business outcomes.
Taylor Houck: It is really interesting to hear you, upon reflection on your career and your time at Toyota, come to all of these realizations and then recognize how many of those learnings are directly related to the perspectives that you shared on FinOps and the way that teams should be thinking about understanding and managing their cloud spend.
I mean, it just shows the value of experience, and that's why we are so fortunate to get the opportunity [00:32:00] to learn from you, who has earned these, let's say, battle scars and understands the way that organizational dynamics work and what engineering organizations should be thinking about, and applying that to FinOps.
So thank you so much. I do also wanna ask you a little bit about the book that you wrote. I know that you are the author of the "DevOps Automation Cookbook." Can you tell us a little bit about the book and what inspired you to write it?
Kumar Singirikonda: So the inspiration behind writing a DevOps automation book came from my desire to bridge the gap between DevOps theory and real implementation. Over the years, I noticed that while many organizations understood the importance of automation conceptually, they often struggled with execution at scale. So I wanted to create a practical resource that engineers, architects, and technology leaders could immediately apply within their enterprise environments. The book [00:33:00] focuses on actionable automation strategies, pipelines, infrastructure automation, operational workflows, and DevOps implementation patterns that are designed to improve reliability, scalability, and operational efficiency. My goal was to help organizations move beyond manual processes and build modern, resilient engineering practices. Beyond the technical aspect, the book also reflects a deeper personal mission centered around giving back to the community. Throughout my career, I have been fortunate to receive opportunities, mentorship, and support that helped shape my professional journey. I believe it is equally important to create opportunities for others and contribute to causes that empower individuals and families.
Taylor Houck: No, that is so kind [00:34:00] of you. And I think that, you know, when you see leaders who are willing to give back, right, and share their learning, I mean, as you did today, as you did with your book. I also know that you've done it through your involvement in other organizations like the FinOps Foundation, the McCombs School of Business, and even Gift of Adoption.
Why is community involvement so important to you?
Kumar Singirikonda: I believe leadership extends far beyond organizational responsibilities or business outcomes. Technology evolves rapidly, and one of the most meaningful ways professionals can create a lasting impact is by contributing knowledge, mentorship, and support back into the broader community. Community involvement allows me to share my experiences, collaborate with diverse professionals, and help shape future leaders across technology and business. Whether it is working with academic institutions, participating in [00:35:00] industry organizations, mentoring professionals, or contributing through thought leadership articles and speaking engagements, I see it as an opportunity to help others grow while continuing to learn myself. Philanthropy also plays an important role in my life outside of my work. In addition to authoring several articles on DevOps, automation, and building inclusive workplace cultures, I wanted my work to serve a broader purpose beyond technical education. This is one of the reasons I've committed proceeds from my book sales toward charitable causes like Gift of Adoption and the McCombs School of Business. Supporting initiatives focused on adoption and education reflects my belief in empowering families [00:36:00] and children in ways that create meaningful long-term societal impact. Community engagement also provides a valuable perspective. Some of the best ideas and innovations emerge through collaboration across organizations and disciplines. Staying connected to these communities not only strengthens professional growth, but also reinforces the importance of empathy, mentorship, and service-driven leadership.
Taylor Houck: That's incredible, Kumar. Thank you so much for sharing that, and it's so important to dive into those topics as well. If any of the listeners from today are interested in connecting with you or learning more, where's the best place to find you?
Kumar Singirikonda: They can find me on LinkedIn. It's Kumar Singirikonda. They can always connect with me, and I'm happy to continue the discussion on this topic.
Taylor Houck: Amazing. Thank you so much for joining the show, Kumar. Do you have any final thoughts for our [00:37:00] listeners today?
Kumar Singirikonda: So the next generation of operational excellence will not be defined only by optimization or cloud cost management. It will be defined by how intelligently organizations operate. Self-healing systems, AI-driven observability, automated remediation, and engineering efficiency are fundamentally changing how enterprises build and manage technology platforms. The organizations that succeed in the future will be the ones that reduce operational friction, empower engineers to innovate, and build resilient systems capable of adapting continuously in real time. The real opportunity is not simply reducing downtime or lowering infrastructure cost. The larger transformation is creating an operational ecosystem where technology platforms become intelligent enough to optimize themselves [00:38:00] while enabling engineering teams to focus on innovation, customer value, and long-term strategic growth. That is what "beyond FinOps" truly represents. Thank you all.
Taylor Houck: Kumar, this has been a great conversation. Thank you so much for coming on the show.
Kumar Singirikonda: Thank you so much.
Taylor Houck: And thank you to our audience. If you got something out of today's conversation, which I'm sure you did, please share this episode with someone who needs to hear it. This has been another incredible episode of FinOps in Action, and we'll see you next time.
Outro: That wraps up another episode of FinOps in Action. Thank you for joining. For show notes and more, please visit finopsinaction.com. This show is brought to you by PointFive, empowering teams to optimize cloud costs with deep detection and remediation tools that actually drive action.
