EP. 37 – The Future of Platform Engineering with Dani Matzlavi

About This Episode

Matt Pacheco sits down with Dani Matzlavi, VP of Engineering for Platform Engineering and Cloud Infrastructure at Blackhawk Network, for a deep dive into the rapidly evolving world of platform engineering. With nearly 25 years of experience spanning VMware, Nutanix, and fintech, Dani shares how Blackhawk has built a comprehensive platform engineering organization that unifies infrastructure, CI/CD pipelines, observability, and SRE under one umbrella. They discuss cutting-edge insights on AI integration in infrastructure operations, cloud cost optimization strategies that can save 20-30% immediately, and Dani’s bold prediction that platform engineering will evolve into “platform engineering in a box” within the next 3-7 years.

Know the Guests

Dani Matzlavi

VP of Engineering for Platform Engineering and Cloud Infrastructure at Blackhawk Network

Dani Matzlavi is the VP of Engineering for Platform Engineering and Cloud Infrastructure at Blackhawk Network, a leading fintech company specializing in branded payments. With nearly 25 years of experience in the technology industry, Dani has established himself as a visionary in platform engineering, cloud infrastructure, and modern software development practices.

Know Your Host

Matt Pacheco

Sr. Manager, Content Marketing Team at TierPoint

Matt leads the content marketing team at TierPoint, where his keen eye for detail and deep understanding of industry dynamics are instrumental in crafting and executing a robust content strategy. He excels in guiding IT leaders through the complexities of the evolving cloud technology landscape, often distilling intricate topics into accessible insights. Passionate about exploring the convergence of AI and cloud technologies, Matt engages with experts to discuss their impact on cost efficiency, business sustainability, and innovative tech adoption. As a podcast host, he offers invaluable perspectives on preparing leaders to advocate for cloud and AI solutions to their boards, ensuring they stay ahead in a rapidly changing digital world.

Transcript Table of Contents

00:00 - Intro and Career Journey from Developer to Platform Engineering
10:13 - What is Platform Engineering and Blackhawk's Unique Approach
20:21 - Internal Developer Platform (IDP) Architecture
29:20 - Measuring Platform Success and Driving Developer Adoption
34:07 - AI Integration in Platform Engineering Operations
37:52 - Cloud Cost Optimization and FinOps Strategy
46:11 - Resource Lifecycle Management
49:14 - Future of Platform Engineering

Transcript

00:00 - Intro and Career Journey from Developer to Platform Engineering

Matt Pacheco
Hello everyone and welcome to the Cloud Currents podcast, where we navigate the ever-evolving landscape of cloud computing and modern software development. I'm your host, Matt Pacheco from TierPoint, and today I'm thrilled to have with us Dani Matzlavi, VP of Engineering for Platform Engineering and Cloud Infrastructure at Blackhawk Network. With nearly 25 years of experience in the technology industry, Dani brings a wealth of knowledge from his time at companies like VMware, where he was instrumental in the SaaS transformation, and Nutanix, where he led their cloud services engineering efforts. At Blackhawk Network, Dani has been at the forefront of their digital transformation, helping the fintech company modernize its development practices and build a comprehensive platform engineering organization that goes beyond traditional boundaries, which we'll dig into today and I'll ask you a lot of questions about.

Matt Pacheco
Today we'll explore Dani's unique approach to platform engineering, how he's integrating AI potentially into infrastructure operations, his innovative work on cost optimization and developer-centric programs, and his vision for the future of platform engineering. So, Dani, thank you for joining us today on Cloud Currents.

Dani Matzlavi
Thank you, Matt. Thank you for having me on your show.

Matt Pacheco
Cool. So let's talk a little bit about you and your journey. Can you walk us through your career journey from your early days to where you are today at Blackhawk?

Dani Matzlavi
Yeah, sure. I mean, I started my journey as a developer like many others, in the early days mainly in telecom services, and then I shifted into the application domain in companies like nLayers, EMC, and later on VMware. So I mainly created products around ADM, application dependency mapping. Later on I led part of the APM group at VMware, and overall my entry point to the world of what today is called platform engineering was around 11 or 12 years ago, 2013 or 2014, at VMware. It was after the APM realm for VMware, and we started to build the first as-a-service organization in VMware, which tried to understand what it was going to take to run VMware services in the public cloud. Back then, VMware was mainly a private cloud company.

We did not have any presence in the public cloud whatsoever and the company was kind of quite far from getting there. So the platform that I built for VMware was a platform that combined both the infrastructure needs and the other concerns that come with the cloud, like billing, identity for the cloud, catalog, portal, and many other things. And that was the first time I kind of shifted from pure engineering as we all know it to understanding that software engineering, as a matter of fact, is tied almost completely to the infrastructure that we are leveraging and utilizing. So the application cannot really be completely detached from the infrastructure that it is running on.

That's even though, both at VMware and later on in my journey at Nutanix, we always liked to say that infrastructure is invisible, or that we want to make infrastructure invisible. And it's always a good goal. Even today that's the goal, to make infrastructure invisible. But we also said that applications love vSphere because they can leverage and utilize the underlying infrastructure to its best. So that was the beginning of my platform engineering journey, and I learned that combining the infrastructure level with the other concerns of modern applications, automation, building robust pipelines, and a lot of good things that we'll probably talk about during the call today is super important in order to make a robust platform and allow applications to utilize that platform to its best. Time passed. We launched VMware Cloud Services in 2016.

VMC was the first service that utilized that platform, and then we continued with some other management services and many others. I continued my journey with Nutanix, as you mentioned, and helped Nutanix build their hybrid offering, both the private cloud that was utilizing Nutanix technology and the public cloud offering with the same technology, which allowed customers to expand their private cloud into the public cloud with some very unique use cases. And three years ago I joined Blackhawk. Blackhawk is a pioneer and a leader in branded payments and prepaid cards. So I joined Blackhawk to modernize a lot of the software development processes and bridge between the infrastructure of the company and all the other services that we are building in the company.

So we can talk a lot more about the platform engineering organization that I've established at Blackhawk, at BHN, because I believe it's quite different from how platform engineering is understood in the industry. I'll stop here because that was a bit long, but I'll hand it back to you for some additional questions.

Matt Pacheco
Oh no, that was excellent and a great setup for a lot of the things we're going to talk about. You have a wealth of experience at really big companies like VMware and Nutanix. It's really cool. So how has your experience at those big companies influenced and shaped your approach to platform engineering at somewhere like Blackhawk?

Dani Matzlavi
So I think the difference in running organizations in a big company is that you sometimes don't have the luxury that a small startup has of running fast and utilizing, let's say, public resources. Today, companies and small startups, mainly in the AI era, can run really thin and build services really fast. With bigger companies, enterprise companies, you have a lot of different organizations within the company that you need to satisfy. So everything that you are building needs to be compliant with a lot of organizations. Just to mention a few, there's legal and compliance and security, which are at completely different levels in large companies, and mainly in fintech companies as well. And everything that you're building needs to be enterprise ready.

I mean, a lot of small companies, mainly startups, are running with the alpha or version one of their products, and they get smashed when they get into enterprise or bigger accounts. That's not the case with a lot of the bigger companies. So when you are building something in an enterprise, you also need to pay attention to a lot of concerns that are there. For example, a lot of big companies are still running physical data centers. We at BHN had data centers as well. We are on the verge of moving completely out of the data centers and migrating completely to the cloud, but we still had physical data centers and a presence in the public cloud.

And a lot of times you need to bridge between workloads that are running in the data centers and workloads that are running in the cloud. There are some applications that were not built natively for the cloud, whether they are running in the data center or maybe even sometimes running in the cloud. There are many challenges over there. In many big companies, there were a lot of acquisitions in the past. BHN is one example of a company that in the last 20 years acquired many companies. So you end up with isolated companies inside of one big company. And many times in the past, people did not consolidate technologies well, so you are facing challenges of consolidation and many others.

So those are the major differences and challenges that you have in, you know, running these aspects in big companies. And of course you need to deal with legacy, not only legacy code, but sometimes legacy technology and others as well.

10:13 - What is Platform Engineering and Blackhawk's Unique Approach

Matt Pacheco
Excellent, excellent experience from the large businesses to the smaller ones. So let's talk a little bit about platform engineering. Before we talk about your unique approach to it, I'd like to ask you, for our listeners who might be newer to the concept, can you explain what platform engineering is and why it's so important to modern businesses?

Dani Matzlavi
Yeah, definitely. So I think whoever has been in the industry for enough time probably remembers that, you know, 20 years ago we had a clear separation between development and operations, and we even had, you know, several other roles in between. And then, I believe around 10 to 15 years ago, DevOps was the big thing, the buzzword in the industry, because people in our industry understood that there isn't any clear cut between developers and operations. You cannot really build a product and hand it over, or sometimes throw it over the fence, to the operations team to make sure that it's being deployed correctly, that it's being configured correctly, and that all of the environments, whether pre-production or production, are being handled correctly.

And the reason for that was the acceleration to the public cloud in those days, when developers really wanted to have access to the environments in the public cloud and others. So DevOps was kind of established around 10 to 15 years ago as a way to bridge that gap between development and operations and allow developers to take on more activities around building the pipelines for applications, being engaged in configuring production, architecting production environments, and taking more active roles around that. And then over the years something happened that usually happens in the industry: when you define a role or a state of mind, which is what DevOps is, people create teams around that state of mind. And companies started to create DevOps teams.

Now, I never believed that DevOps is a team, the same way that platform engineering is not a team. Platform engineering is a concept, and we'll get to that. But companies started to create DevOps teams. So once again, instead of the situation that we had 10 or 15 years ago, when we had developers versus operations, we got to a place where we have developers versus DevOps. Developers are the ones building the code and they have access to the pipeline, but someone else is building the pipeline, someone else is configuring the resources in AWS, and someone else is building the environments that are required in the public cloud for developers to be productive.

And then a lot of times what happened is that DevOps engineers, whether they are part of the development teams or separate teams, in a lot of cases just perform the work for other teams. And I think that's where platform engineering gets into the mix. Developers, as always, are in the state of mind that when there is another team, we can always do things better. And sometimes that's the case. So why do I need another team to build my pipeline? Why do I need another team to configure my observability pipeline or data pipeline? Why do I need another team to configure my environment? I can do everything myself, I can do it better, and I can do it with the tools. Just give me the access.

So I think platform engineering came into play in the last several years, I think in the last five to seven years. Platform engineering evolved from that concept: instead of building those constructs for the developers, we just need to build services. Once again, platform engineers are at their base software engineers, and we'll talk about it in a second, but we need to build services. These services will allow developers to achieve everything that they need to achieve, like building, deploying, patching, managing common libraries, managing environments, and taking complete ownership of their applications end to end, from development to production, but without investing their time in activities that are not related to business logic, because that's not a great use of their time. So platform engineering evolved from that notion of DevOps.

But platform engineers are not doing the work for a specific team, they are not doing the work for others or embedding themselves in other teams. Instead, they are software engineers who are building services. I think the best example is the IDP, the internal development portal. This is an application, it's an internal application, but it's an application that is usually built by the platform engineering organization. The aim of that application is to serve developers and give them all the functionality they need to take ownership end to end. Secret management is another example of such a service, and identity and access management is another example. And a lot of infrastructure services as well, because the infrastructure is not going away. But instead of operating the infrastructure, we can now build services on top of the infrastructure.

And we can build things like a managed compute service that allows developers to just describe their application and deploy. Or we can build services like database as a service, which kind of sounds weird to say today because most of the databases out there are already offered as a service, but they are not as a service in an enterprise way. It's a good fit for a startup to go and use Aurora from AWS. But when you have hundreds or thousands of these databases that you need to manage on an enterprise basis, you need database as a service as an internal service in the company.

So the transformation is from DevOps doing the work to the platform engineering situation, where we are software engineers who are building services. The services automate and provide a lot of functionality on top of infrastructure, on top of the pipeline, on top of all the areas that DevOps was executing on. But we are offering that as a service with a clear interface, with APIs, sometimes with a UI if it's part of the IDP, maybe with a CLI, maybe as part of the pipeline itself. We are providing all of these capabilities in code for developers to consume. I can also add, and that's why platform engineering at Blackhawk is unique, I believe, that at Blackhawk we managed to pull all of the disciplines together under the one umbrella of platform engineering.

So when people say platform engineering in other companies, they usually think, oh, these are the people who are building the pipeline and the pipeline services, or sometimes the people who are part of the infrastructure services. But at Blackhawk we managed to pull together all the relevant skill sets, including infrastructure, pipeline, common services, observability and the SREs, and even the NOC and operations control center, all under the same umbrella of platform engineering. So you can build services that are fully integrated into the end-to-end process of everything that you need in the company. When you build a service, it's completely integrated with all the ITSM capabilities. So change management, problem management, and incident management are already integrated with everything that you are building.

And compliance is already integrated. And when you build a pipeline, all the security aspects of the pipeline, whether it's static code analysis or vulnerability checks at runtime, are already embedded in the pipeline. Quality is already embedded in the pipeline, and many other aspects are embedded in the services that we are building. Because we have all the disciplines under the same umbrella, we can build services that are not focused on one part of the organization or the company, but that take care of the entire spectrum of concerns for developers. So when developers use our services, they can be confident that they are not missing anything.

20:21 - Internal Developer Platform (IDP) Architecture

Matt Pacheco
That's a great explanation. And we got a glimpse of what you're doing at BHN. That's really cool. Let's talk a little bit about the developer experience and also this internal developer platform you guys have at Blackhawk. So can you explain how you're structuring your IDP, your internal developer platform, at Blackhawk and what services it provides to developers?

Dani Matzlavi
Yeah, so the IDP, I think not specifically for Blackhawk but the IDP in general, is the one-stop shop for developers. When I'm talking with developers, I always say that the first thing you need to do in the morning when you start your workday is to open the IDP, and you should not close that tab, ever, because that's basically the place that you always need to go back to and do something. It's fine, you are coding and you are generating code. And today it's very easy to generate code using GitHub Copilot, Cursor, and many other good AI tools that help you build the code.

But once you have committed the code, you need to go and see what happens with your build, whether it's a local build or the branch build that you initiated, what happens with the review, and whether it initiated an automatic change. All of that is part of the life cycle of your development, and you're doing that with the IDP. So that's the place where you can see your application. You have a service catalog in your internal development portal. So not only do you see your applications, you can also manage all the deployment units of your application, and you can manage them in any environment that you have, whether it's a local environment or development, testing, pre-production, or production. If something failed, you can go and look at what the problem was from that system.

You can fix it. You can download reports: you have the scan reports, quality reports, and security reports. You can do all of that, and you can deploy your application via the same portal. Now, it gives you a lot of confidence in your application lifecycle because everything is in one place, you can see the entire information, and you can control the behavior of the application. In addition to that, you can manage all the relevant resources for applications. So if your application consumes secrets or you have certificates as part of your application, you can manage that. In the same way, you can view your application metrics and performance metrics, whether it's productivity metrics, AI usage metrics, DORA, or any other metrics. You can manage and see that from the same portal, along with many other capabilities.

We can go over all the capabilities that are on the internal development portal, but it basically provides you with all the capabilities that you need in order to maintain the lifecycle of your application, and more. We also have integration of the infrastructure into our IDP for managing components. Whenever a component reaches end of life or end of support, you get a notification and you can handle that as part of the lifecycle. So it's really the one-stop shop for developers, where they can manage the entire life cycle of their application.
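Illustrative sketch (not from the episode): the end-of-support notification Dani describes could, in its simplest form, be a check of each component's support date against a warning window. The component names, the 90-day window, and the function name below are hypothetical and are not a description of Blackhawk's IDP.

```python
from datetime import date, timedelta


def components_needing_attention(components, today, warn_days=90):
    """Return (name, message) notices for components at or near end of support.

    components: dict mapping a component name to its end-of-support date.
    warn_days:  hypothetical warning window; a real portal would make this configurable.
    """
    horizon = today + timedelta(days=warn_days)
    notices = []
    for name, end_of_support in sorted(components.items()):
        if end_of_support <= today:
            notices.append((name, "past end of support"))
        elif end_of_support <= horizon:
            days_left = (end_of_support - today).days
            notices.append((name, f"end of support in {days_left} days"))
    return notices


if __name__ == "__main__":
    # Made-up inventory for demonstration only.
    inventory = {
        "runtime-a": date(2024, 12, 1),
        "db-b": date(2025, 2, 1),
        "lib-c": date(2026, 1, 1),
    }
    for name, message in components_needing_attention(inventory, date(2025, 1, 1)):
        print(f"{name}: {message}")
```

A portal would typically run a check like this on a schedule and route each notice to the owning team found in the service catalog.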

Matt Pacheco
That's a really good explanation of the IDP and what it offers to developers. How do you measure the success of a platform like that and your overall platform engineering initiatives as well?

Dani Matzlavi
I think that's the million dollar question. When you're building a platform that is internal, sometimes people think it's really easy to measure success. But as a matter of fact it's harder, because when you're building a product that is going out and has external customers, measuring success is really easy. You either have a lot of people using your systems and paying money for that or not. But when you're building internally, on the one hand, you have the luxury to build things that maybe not a lot of people will use, but they are needed. On the other hand it's harder, because you're building it for a reason. You're building it because you want developers to become more productive.

You build it because it solves a pain point that is there today. But you need to measure whether it really helps developers become more productive. So I'll give you an example. If we are adding a new service to our pipeline and people are using it, then you need to have a way to justify the investment in that feature, because it needs to provide some level of improvement to the development cycle. So the way that we are trying to measure it, first of all, is with the level of adoption. For everything that we are delivering as part of the platform, we are measuring the level of adoption: how many developers really use that feature. Sometimes they don't need to use it directly, they don't need to do something actively in order to use it, because it's there.

But sometimes they need to actively click a button, call an API, or consume something directly. So we have the tools in place to measure and gather statistics on that usage. If it's a browser or web-related activity, we have statistics and telemetry on access and usage. If it's an internal feature, then we are using other ways to gather the statistics. We are also exposing our features with feature flags and doing experiments on the fly. So even before we release, we know how people are going to use the features that we are releasing. I think it's not completely different from what companies are doing when they release products externally, but we are doing that internally.

In some aspects it's easier because we have friendly customers. But as I said, sometimes it's harder because you sometimes have the privilege to develop things that people are not going to use. And it happens. In addition, we are gathering a lot of metrics, and all the metrics that we are gathering go to a metrics store, and then we are trying to get a lot of insight out of it. I mentioned DORA metrics, and the DORA metrics are the very basics of everything, but we are also trying to gather metrics on productivity in correlation to changes that we've made. Now, usually you do that when you want to correlate changes to an incident or something bad that happened, but we're doing that in order to measure the impact of our changes.

If we introduce a change, then we mark that point in time on the timeline and start to measure all the productivity metrics from that point on, to see whether that change made any improvement in developer productivity. Sometimes we see that it did not improve, or there are even cases where we saw some decrease in productivity, and then we have the ability to remove that feature and roll back.
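Illustrative sketch (not from the episode): the two measurements described above, adoption and before/after impact around a marked change point, can be approximated in a few lines. All names, the seven-day window, and the 5% rollback tolerance are assumptions for illustration; a real analysis would also need to account for noise and other changes shipped in the same period.

```python
from statistics import mean


def adoption_rate(usage_events, feature, total_developers):
    """Share of developers who used a feature at least once.

    usage_events: iterable of (developer_id, feature_name) telemetry records.
    """
    users = {dev for dev, used_feature in usage_events if used_feature == feature}
    return len(users) / total_developers


def change_impact(samples, change_day, window=7):
    """Compare a productivity metric before and after a marked change point.

    samples: iterable of (day_index, metric_value), e.g. deployments per day.
    """
    before = [v for day, v in samples if change_day - window <= day < change_day]
    after = [v for day, v in samples if change_day <= day < change_day + window]
    if not before or not after:
        raise ValueError("need samples on both sides of the change point")
    baseline, current = mean(before), mean(after)
    return {
        "before": baseline,
        "after": current,
        "relative_change": (current - baseline) / baseline,
    }


def should_roll_back(impact, tolerance=0.05):
    """Flag a change whose metric dropped by more than the (hypothetical) tolerance."""
    return impact["relative_change"] < -tolerance
```

Here a higher metric value is assumed to be better; for a metric such as lead time, where lower is better, the sign of the rollback check would flip.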

29:20 - Measuring Platform Success and Driving Developer Adoption

Matt Pacheco
Quick follow-up question to that. You mentioned whether they would use the internal platform. What strategy have you used to, I guess, drive adoption of those tools available to developers and those teams?

Dani Matzlavi
So I'm a true believer in the fact that people, and mainly developers, will use things that bring value to their day to day. I don't believe that pushing or forcing anything on developers will yield success. I spent many, many years hands on keyboard and I'm still developing in my spare time. I don't like using things that are slowing me down. I don't like using things that are not optimal for my needs. But I do like to use services that really help me. Now, of course, in organizations of thousands of developers you can't really produce services that are optimized for everyone. But you need to try. And usually, if you are producing good services, they are bringing value and helping productivity for the majority of the developers, if not all.

So what we are doing is starting our journey really early, in a discovery phase. We know the pain points because the pain points are there. But solving the pain points and implementing the right thing starts with early discovery of both the pain points and the ways to mitigate and resolve the problems that we see. We hold a lot of talks with our stakeholders. In that sense, again, it's easier because our stakeholders are friendly and our stakeholders are there, and we can go and really talk with them, versus product management work, where you really need to approach the customers. And sometimes the customers are not even there, so it's harder.

But our customers are there, and we talk with our customers, we validate with them the pain points that they have, and we brainstorm together about the right way to solve these problems. Now, it's not a full democracy because, as I said, with thousands of developers you can't really let everyone vote for the right way to solve a problem. But once you realize what the real pain point is, you have the feedback and you maintain a continuous feedback cycle with your customers. So it's not a case of, oh, we understood the pain point, this is the way that we're going to solve it, and we're moving on. Not at all. Once we have understood the pain point and we are discussing the problem, we always maintain a backlog: these are the things that we are working on, so they are coming.

These are the things that are coming next, and these are the things that are still in the brainstorming and discovery phase. We keep our stakeholders in the loop as long as the feature is not released. But even after the feature is released, they are still in the loop to give feedback, because we are always making changes and always refining the features and the services that we are building. Now, we can release super fast. We are kind of eating our own dog food, so we can release, you know, multiple times a day, and we can fix and add things really fast. So we're getting that feedback, and we can talk about the different forums that we have, and even with our IDP we have a way to submit feedback directly from the IDP.

We're just jumping on that feedback and making it work. I think in a lot of ways the feedback that we are getting is shaping the roadmap and the backlog that we have, not in a destructive way but in a very constructive way, because it helps us understand what is more important and sometimes what is less important. In most cases, we are doing it right now.

34:07 - AI Integration in Platform Engineering Operations

Matt Pacheco
It sounds like you do equip your developers with a lot of good platforms and tools to get their job done. I'm curious and we're going to jump over to a new topic now. AI, you mentioned it earlier with coding. But specifically how are you integrating any AI capabilities or machine learning capabilities into your platform engineering practice at Blackhawk Network?

Dani Matzlavi
Yeah, so with AI there are a lot of great capabilities that we can get. First of all, as a central productivity organization, we are pushing AI tools to the developers. We're kind of pushing in the way of shifting left to the developer. I mentioned GitHub Copilot, which we're using. But there are a lot of other AI capabilities that we are either evaluating right now or already using, whether built into the IDE or not. We're also looking at how we can enhance our day-to-day work using AI. Now, we've been using AI for years already because we are a fintech company, ML is part of AI, and we're using ML in different parts of the company to measure risk, to identify risk, and others.

But in the world of platform engineering we always look at the places where we can use and leverage AI more. So things like how we do root cause analysis, or even how we manage the post-incident reports, for example. This is something that is clearly a use case for AI. So we can generate a lot of these post-incident reviews with AI. We are also looking at how AI can help us with managing our data pipeline, for example, because once you integrate AI into the data pipeline, then you know the services that are shipping logs into the sinks on the other side and you can identify those activities. You can create alerts in an automated way.

On the other side, on incident response, you can have an SRE agent, and there are many companies today that have already implemented SRE agents that allow you to fetch that alert, do a proper root cause analysis, and minimize the MTTR. Because at the end of the day that's the main goal, right? We have observability and we have those fancy processes, but in the end it's because we want to reduce and minimize the MTTR. So we have that on the observability side, we have that on the incident side, and on the overall ITSM we are using and integrating AI. Also on the activities that we have around provisioning of resources, we are looking at every aspect of our platform engineering: building new environments, optimizing cost, which we haven't talked about at all, FinOps, and how to optimize cost in the cloud.

But we're looking at every aspect of platform engineering, and in many of those areas we're already either using third parties that provide us with AI solutions or building something in-house.
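For readers less familiar with the metric, MTTR as discussed above is usually computed as the average time from detecting an incident to resolving it (the acronym is also expanded as mean time to recover or repair; the conversation does not specify which). The following is a minimal sketch of that arithmetic. The function name and the incident timestamps are made up for illustration and do not come from Blackhawk Network.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average the detection-to-resolution duration over a list of incidents.

    Each incident is a (detected_at, resolved_at) pair of datetimes.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


# Hypothetical incidents lasting 45, 90, and 15 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 45)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 16, 0)),
    (datetime(2024, 1, 20, 8, 15), datetime(2024, 1, 20, 8, 30)),
]

mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number before and after introducing automated root cause analysis is one simple way to check whether the tooling actually shortens incidents.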

37:52 - Cloud Cost Optimization and FinOps Strategy

Matt Pacheco
Excellent, thank you for sharing that. And you did mention something I do want to talk about next: cloud cost optimization and FinOps. So that was a perfect segue. You did my job for me. Thank you. So from what I understand, you've implemented some cost optimization programs at Blackhawk Network. Can you talk a little bit about them and how they connect to the platform team?

Dani Matzlavi
Yes, definitely. So I think I mentioned at the beginning of the call that companies, Blackhawk included, still have physical data centers. Now, when it comes to physical data centers, you don't have an option. You need to provision for peak. You can't buy and install hardware in time to satisfy volumes. You need to be prepared for your highest volume, which means that most of the time your hardware is not utilized. Now, that's not the case with the cloud, because with the cloud you have elasticity, and even though it's not 100% elasticity that you can squeeze to the second, you can definitely take advantage of that elasticity to save cost. You don't need to provision for max, you need to provision for what you need.

And I think in many respects a lot of companies are still provisioning in the cloud the same way that they provisioned in the data center. Now again, I'm not talking about the small startups, I'm talking about the big companies. Some of them still have data centers in place. And when you look at the resources in the public cloud, a lot of the time you see, you know, EC2 instances that are underutilized, you see clusters that are provisioned at a size that's way bigger than needed, a really unoptimized configuration of your resources in the cloud. So a lot of times there are a lot of easy wins. A lot of easy wins.

When you are starting to build that FinOps program and you look at the problem, you say, oh my God, I can save 20, 30% just by utilizing the resources better. Simple things, like I need to resize my VMs, that's it. If I resize my VM to the right size, and instead of provisioning an extra large what I actually need is a medium, then I've already saved the 20%. Or I have many resources that are idle, no one knows what they're doing there, okay? And you can just clean that up.
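The hygiene pass described above (downsize what is oversized, remove what is idle) can be sketched in a few lines. This is only an illustration under stated assumptions: the prices, size names, utilization thresholds, and instance names are all hypothetical, and a real program would read utilization from the cloud provider's monitoring data.

```python
from dataclasses import dataclass

# Hypothetical hourly prices; real prices vary by provider, family, and region.
HOURLY_PRICE = {"medium": 0.05, "large": 0.10, "xlarge": 0.20}
SIZES = ["medium", "large", "xlarge"]  # ordered smallest to largest


@dataclass
class Instance:
    name: str
    size: str
    peak_cpu_percent: float  # highest CPU utilization seen in the review window


def recommend(instance, idle_below=2.0, downsize_below=40.0):
    """Return 'terminate', a smaller size, or the current size."""
    if instance.peak_cpu_percent < idle_below:
        return "terminate"  # effectively idle, candidate for cleanup
    index = SIZES.index(instance.size)
    if instance.peak_cpu_percent < downsize_below and index > 0:
        return SIZES[index - 1]  # step down one size
    return instance.size


def monthly_savings(instances, hours=730):
    """Sum the monthly price difference between current and recommended state."""
    saved = 0.0
    for inst in instances:
        target = recommend(inst)
        new_price = 0.0 if target == "terminate" else HOURLY_PRICE[target]
        saved += (HOURLY_PRICE[inst.size] - new_price) * hours
    return saved


fleet = [
    Instance("api", "xlarge", peak_cpu_percent=25.0),     # oversized
    Instance("batch", "large", peak_cpu_percent=85.0),    # sized correctly
    Instance("forgotten", "medium", peak_cpu_percent=0.5),  # idle
]

for inst in fleet:
    print(inst.name, "->", recommend(inst))
print(f"estimated monthly savings: ${monthly_savings(fleet):.2f}")
```

Using peak rather than average utilization is a deliberately conservative choice here, so that a downsized instance still has headroom at its busiest moment.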

So good hygiene on your cloud resources really gives you somewhere between 10 and 25% sometimes when it comes to savings. But once you've cleaned up all of that and you're in good hygiene, you can still make a lot of good savings and leverage such a FinOps program when you move to the second level and then to the third level. The second level is, okay, I'm provisioning right now, but how can I make sure that I am also optimized on the usage of my resources? Okay, do I have the right configuration in, let's say, my ECS clusters or my EKS clusters? Do I have the right density of workloads on these clusters? Do I provision my resources in the right regions or in the right availability zones?

Because sometimes you can see that with communication of services between regions, maybe a lot of the communication goes to a very expensive region, and I can just move my resources around. So that's the second level of optimizations that you can make. And the third one, which is the hardest one, is refactoring my applications, because a lot of applications are still not taking full advantage of the cloud and its characteristics. And it's not a rare case to see an application that consumes 50, 60 gigabytes of RAM and runs in a Kubernetes cluster. So basically you have your entire cluster just for one application, and it consumes the entire capacity. Usually it's the result of a lift and shift to the cloud without giving proper attention to the application itself.

So refactoring the application is the third, the most expensive, and the most complex thing to do. But it also yields the most results from a cost perspective. Now, the one thing that I want to mention around that FinOps program and cost saving is that, again, this is not something that should be owned by a team like a FinOps team or, in some companies, DevOps teams. That's a developer role. I mean, the first part, the hygiene part, that's a no-brainer. Whoever has access to the cloud, to the portal, can take care of the hygiene. Someone from a central place can take care of it; that's fine as well.

There are many tools today that are doing that as well, tools that show you exactly what is underutilized, what is idle, and what you can clean up, and that's not a problem. But mainly the second level, and of course the third level, is the developer's responsibility. And a lot of times I see developers who say, oh, we want to deploy something, or we are re-architecting our application, and they're rarely asking themselves what the impact on the cost is. Right? If I'm now introducing something new, say I need Redis for my application, okay, but I need to deploy that somewhere, it needs to run somewhere. And it's true that we are trying to make infrastructure invisible for developers, but still the cost is there.

So one of the things that we are trying to provide, and I hope that soon we'll have it in a nice software application, is a way to show, whenever you are making a change to your application, what the impact on your overall cost is going to be. Okay, if you just change your code functionality, then there's not a lot of impact; it's just going to be deployed again. But if you now require more buckets on S3, or you introduced a new database, or you increased your footprint in the cloud significantly, then there is an impact on the cost.
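One way to picture the cost-impact view described above is to compare the resources an application declares before and after a change. The sketch below is not the tool Dani refers to, which is not described in detail. The resource kinds and monthly unit prices are hypothetical placeholders.

```python
# Hypothetical monthly unit prices, for illustration only.
MONTHLY_UNIT_PRICE = {
    "database_instance": 200,
    "redis_node": 50,
    "s3_storage_tb": 23,
}


def monthly_cost(resources):
    """Total monthly cost of a {resource_kind: count} declaration."""
    return sum(MONTHLY_UNIT_PRICE[kind] * count for kind, count in resources.items())


def cost_impact(before, after):
    """Monthly cost difference introduced by a change (positive means more spend)."""
    return monthly_cost(after) - monthly_cost(before)


before = {"database_instance": 1, "s3_storage_tb": 2}

# A pure code change redeploys the same resources.
code_only_change = dict(before)

# A change that introduces a two-node Redis cache.
adds_redis = {**before, "redis_node": 2}

print("code-only change:", cost_impact(before, code_only_change))
print("adding Redis:", cost_impact(before, adds_redis))
```

Surfacing a number like this in a pull request or deployment pipeline is one plausible way to make cost part of a developer's day-to-day work.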

So developers, they need to be very aware of the cost that they are incurring in the public cloud and that needs to be part of their day-to-day work, otherwise that entire program can get out of control. And in many companies, the cloud cost is the one main cost that the company is incurring after operational cost, of course.

46:11 - Resource Lifecycle Management

Matt Pacheco
Definitely. And I like how you talked a little bit about balancing that innovation and those developer activities with the cloud cost optimization piece. So thank you for that. I have a few questions left before we start talking about future trends, but let's talk about resource lifecycle management because I understand that's an important, potentially overlooked aspect of platform engineering. Can you explain why the lifecycle of resources is so important?

Dani Matzlavi
Yeah, so first of all, applications are utilizing a lot of resources. And when we are talking about resources, it might be that I'm leveraging a database, for example, or I'm using a common library or a third-party library, or I'm leveraging some tools that come either internally or from a third party. Now, a lot of these resources are part of my application, but all of these resources have their own life cycle. Some of the resources have new versions, some of the resources are very old. And again, it's not a rare situation to see applications that are using a library that is already five, six, seven, sometimes even more than 10 years old. Sometimes it's end of support for a specific license. Sometimes it's end of life for some libraries.

And besides the fact that you need to maintain continuity in your ability to build, deploy, and maintain your application, a lot of times these resources are no longer supported. You don't have any patching being provided. You don't have the ability to maintain clean hygiene of your application. Some of the resources also come with a deadline. A lot of times public cloud providers give you an ultimatum: you have until, you know, the next quarter to update your resources. If you're not updating, then we are no longer supporting it, or we might go and update it under your feet. And that's it. And you'll face the consequences.
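The deadline tracking described above comes down to keeping an inventory of resources with their end-of-support dates and flagging the ones that are close. A minimal sketch, assuming a hand-maintained inventory: the resource names and dates below are invented for illustration and do not reflect any real product's schedule.

```python
from datetime import date, timedelta

# Hypothetical inventory: resource name -> end-of-support date.
END_OF_SUPPORT = {
    "managed-db-engine-v11": date(2024, 2, 29),
    "cluster-runtime-v1": date(2024, 7, 1),
    "legacy-json-lib-2.x": date(2023, 6, 30),
    "payments-api-secret": date(2025, 1, 15),
}


def classify(inventory, today, warn_within=timedelta(days=90)):
    """Group resources into expired, expiring soon, and ok."""
    report = {"expired": [], "expiring_soon": [], "ok": []}
    for name, deadline in sorted(inventory.items()):
        if deadline < today:
            report["expired"].append(name)
        elif deadline - today <= warn_within:
            report["expiring_soon"].append(name)
        else:
            report["ok"].append(name)
    return report


report = classify(END_OF_SUPPORT, today=date(2024, 1, 10))
for status, names in report.items():
    print(status, names)
```

Running a check like this on a schedule turns a provider's ultimatum into a planned upgrade instead of a surprise.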

So having the ability to manage the lifecycle of your resources, whether it's cloud resources, licenses, or other kinds of resources like secrets, is a super important part of managing your application and application health, and it prevents you from being surprised. And usually these kinds of surprises are not the kind of things that you want to discover at the last moment.

49:14 - Future of Platform Engineering

Matt Pacheco
Excellent. Let's talk about the future for a moment. So we talked about the concept of platform engineering evolving over the last, let's say, five to seven years into what it is today. What are your thoughts on the future, then? Within the next five to 10 years, where do you see the concept going?

Dani Matzlavi
Yeah, so I believe that platform engineering has at least three to seven years to go, but it will disappear as a concept. It will disappear not because platform engineering is no longer needed, but because there will be platform engineering in a box, I'll call it. There are already companies that are trying to offer platform engineering as a product and provide you with the means to get platform engineering capabilities from software, from a service, from a product. And a lot of it is going there. Now, if you remember, I mentioned that our role as platform engineers is to create services, services that allow developers to be productive, to be successful. Now, a lot of the services that we are building are very specific to our developers' needs.

They are not generic enough that tomorrow we can sell them to the next company. Now take, for example, Backstage, coming from Spotify. That's something that they built in a generic way, so that others can now use it for their own IDP implementation. I think the same thing will happen with more and more services. At the beginning, companies will build one or two services in a generic way, and other companies will start consuming these services. But in a matter of, as I said, three to seven years, companies will mature to offer the entire suite of platform engineering as services that other companies can take a dependency on and buy as a solution.

So think of what happened in the DevOps world, when tools like CircleCI and GitLab and others took a lot of the capabilities that used to be built in-house and started selling them as a service, so you don't need so many people to build a pipeline anymore. The same thing will happen with platform engineering. We also talked a little bit about AI. AI will speed up that process for sure. So I believe that platform engineering is not here to stay. But it doesn't mean that the people who are working on platform engineering today need to be worried about five years from now. Because in five years the next thing will evolve on top of platform engineering.

Because at the end of the day we are all software engineers, we are building things, and today we are building things that are services consumed by developers. That is not going away. The nature of the services will evolve. But I don't think platform engineering in five years will look the same way that platform engineering does today.

Matt Pacheco
Really well put. So if there are still three to seven years, there might be some companies that are actually looking to implement some kind of platform engineering program and strategy. What advice would you give to those companies just at the beginning of their journey?

Dani Matzlavi
Yeah, by the way, there are companies that are trying to do that. Even some mature companies, or companies on their way to being mature. So it's not only early stages. But the advice that I would give them is to start from the main pain points. I mean, there are services today that every platform engineering organization is required to build. There are many companies that are providing IDPs today. I mentioned one, and there are others that are not open source software. And IDP is kind of becoming a no-brainer: platform engineers do not need to build IDPs anymore, you can just use one. But I think other services are following. So those companies need to focus on the services that are next and gradually add more and more capabilities to their overall platform until it becomes a bit complex.

Because as I said, a lot of companies have different processes. At BHN I got familiar with a taxonomy very different from the one used in the industry, which made it very hard to implement an IDP with something that is standard. I'm sure that other companies have their own taxonomy or processes or other things that are different. Therefore it's really hard to make something that is fully generic, and if you're making it generic enough, then you are spending a lot on customization, and spending on customization is also not a great idea. So I would advise just to start with the services that are important, aim for standardization in that domain, and there are a lot of places where standardization can be achieved or has already been achieved, and move from there.

Matt Pacheco
Excellent advice and an excellent conversation. I'd like to thank you for being on the show today. I appreciate all your knowledge and your expertise that you shared with us.

Dani Matzlavi
Sure. Thank you. I really enjoyed being on your call today.

Matt Pacheco
Thank you. And for our listeners, thank you for tuning in. Check out this episode and more anywhere you get your podcasts, and we will see you soon. Thank you so much.
