
EP. 35 Infrastructure as Code: Scaling Cloud Operations with Ryan Raub


About This Episode

Today Matt sits down with Ryan Raub, VP of Cloud Infrastructure at SmartRent. From scaling infrastructure that supports millions of IoT devices to navigating the challenges of AWS IoT Core, Ryan shares invaluable insights from his journey from government work to startup leadership. Don’t miss Ryan’s practical advice on implementing Terraform at scale and his perspective on how AI might transform cloud infrastructure management in the coming years.

Know the Guests

Ryan Raub

Vice President of Cloud Infrastructure at SmartRent

Ryan Raub is the Vice President of Cloud Infrastructure at SmartRent, where he’s led platform engineering teams and driven cloud innovation since 2019. With a background spanning software development, systems programming, and IT support at organizations like RetailMeNot and Arizona State University, Ryan brings deep expertise in building and scaling cloud-centric solutions.

Know Your Host

Matt Pacheco

Sr. Manager, Content Marketing Team at TierPoint

Matt heads the content marketing team at TierPoint, where his keen eye for detail and deep understanding of industry dynamics are instrumental in crafting and executing a robust content strategy. He excels in guiding IT leaders through the complexities of the evolving cloud technology landscape, often distilling intricate topics into accessible insights. Passionate about exploring the convergence of AI and cloud technologies, Matt engages with experts to discuss their impact on cost efficiency, business sustainability, and innovative tech adoption. As a podcast host, he offers invaluable perspectives on preparing leaders to advocate for cloud and AI solutions to their boards, ensuring they stay ahead in a rapidly changing digital world.

Transcript

Matt Pacheco
Hello everyone and welcome to Cloud Currents, a podcast that navigates the ever-evolving landscape of cloud and its impact on modern businesses. I'm your host Matt Pacheco, Senior Content Manager at TierPoint. In today's episode we're going to talk to Ryan Raub, Vice President of Cloud Infrastructure at SmartRent, a leading provider of smart home and building technology solutions. Ryan brings over 15 years of experience in the tech industry, starting his career in software development and moving into the private sector. As an early employee at SmartRent, he's helped scale the organization's cloud infrastructure to support millions of IoT devices while tackling issues like complexity, optimizing cost, implementing robust security measures and much more.

In this conversation we'll talk a little bit about things like infrastructure as code, dive into the challenges of managing large-scale IoT on platforms like AWS, discuss a little bit about DevOps, and look at some emerging trends in cloud infrastructure management. So thank you, Ryan, for joining us today.

Ryan Raub
Oh, thank you for having me Matt. It's a pleasure to be here.

01:35 - Ryan's Career Journey

Matt Pacheco
Cool. So let's jump right into it. Can you tell us a little bit about your career journey and what led you to where you are today?

Ryan Raub
Sure. So I, you know, first started off with a degree in computer science, so kind of the now-traditional software engineering background. From there I moved into working for the federal government. So, you know, kind of almost as big scale as you can get.

We had an HPC cluster, and went through all of the fun aspects of large-scale data and acquiring hardware: going through month-long procurement processes just to purchase physical hardware, get it installed and actually operational. It was painful. From there I moved to a smaller scale, a state-government-backed organization, and there we actually had a bit of a journey from physical servers, to a hosted on-prem sort of private cloud, and then towards the end of my time there we completed a transition to the AWS public cloud for them. That was a fun journey. And then, continuing my progression to go even smaller, I joined a small, scrappy startup at the time: SmartRent, under a hundred people when I joined.

We were definitely, you know, figuring out all of the pieces and parts for how do we do this, how do we scale this business and scale the infrastructure alongside it. You know, we've kind of grown that up into a larger company now, and yeah, it's been a fun journey. Going through that, we've figured out a lot of things, some really clever patterns and some fun ways to manage infrastructure at scale, and also to scale teams as well.

Matt Pacheco
How's your experience in government work shaped your approach to your work at SmartRent?

Ryan Raub
So coming from a place where there's high regulation, there are defined processes for everything, and having to consistently navigate that labyrinth of bureaucracy, coming from that to, you know, effectively a small scrappy startup, it's been a fun journey, because a lot of that really speaks to the needs where we had to go fast, we had to figure things out. We, you know, didn't have the luxury of having established procedures or, you know, a cloud advisory review board to review changes before we go forward with them. And we were having to kind of build up those processes and procedures as we scale, to adapt to our size and complexity and our needs. And that's been, you know, a great wealth of knowledge to sort of pull from.

There's a lot of, I think, good standards and processes that exist in industry and in government that are, you know, really established best practices. And so we pull from a lot of those, cherry-picking the ones for the right time and the right scale. You can't go full-bore government-level bureaucracy out of the gate at a startup. You'd never get your product out the door.

Matt Pacheco
That has to be a relief, and in a way that's interesting. So you joined SmartRent as approximately what employee? Like number 60, you said?

Ryan Raub
Yeah, I'm in the 60s.

05:36 - Building and Scaling Cloud Infrastructure

Matt Pacheco
What's it been like to build and scale cloud infrastructure through that growth?

Ryan Raub
I would say, first of all, dynamic. We didn't start on the same sort of platform, with the same sort of constraints, that we are on right now. We were a small shop, we had small needs, and we effectively used providers that catered to those smaller needs early on. And as our systems have grown, our teams have grown. Being able to build those out, have those multiple teams, those multiple services, have those work in concert at scale has been an interesting evolution. You know, we've had to bring in new tools. When it all fit in one person's head, it was easy, and you could, you know, add a couple people to that.

But once you break out of that, okay, well, you've got two, three, four teams that now all do similar things and similar patterns in parallel. How do we make sure they're not all off, you know, one team building in one direction, the other team moving in a completely opposite direction, from a sort of operations standpoint, which is where I end up putting my head most of the day? Those additional complexities, those variations, end up making the operations side really difficult. And it's harder to either move people between teams, or take somebody who needs to span teams and have them translate from one to the other. So we try to keep those variations in step, and there are, of course, exceptions to the rule.

So that way, you know, you're getting the benefit of some of that consistency. You're getting the benefit of those security practices that you've got implemented and those modules, and you're able to actually go faster. You're not having to solve the same problems and rebuild the same wheels in multiple different ways only to end up with the same product at the end of the day. But you don't go too far back to the overarching sort of bureaucracy of, you know, well, there's one team and they're the ones that do all of this, and so if you need a, you know, a cluster or whatever, you go get in the queue there and they get to you next month.

Matt Pacheco
So we talked a little bit about SmartRent. Can you give us a high-level overview of what SmartRent does as a company?

Ryan Raub
Yeah, yeah. So our bread and butter, really, I describe it as sort of the smart home solution, but for multifamily. And so we really look at the resident experience: what is it like to live in, you know, an apartment that has our hardware in it? And also the property managers; there's a lot of benefit they're able to gain, getting access control to common areas or to specific units themselves. We're able to provide control and functionality for residents to, you know, use a smart lock. They can unlock their doors with their phones, or they can control the thermostat with their phone or on the website, however they'd like to do that.

I think those are things that, once you become accustomed to them, really become a creature comfort that a lot of people desire. And so we really try to make that experience desirable for them, but also deliver value to the property management company. And some of those things can include sort of SaaS solutions for how they manage their business. They also love leak detectors. That's one of the things they get a lot of value out of. And, you know, from just an asset-liability perspective, you know that you have a system that will alert you.

If there is, you know, a pipe burst or a leaky faucet or whatever, and you can get ahead of that and get it repaired before it becomes an extensive problem or, you know, flooding, that saves you a ton of money and a ton of risk.

10:15 - Infrastructure as Code and Terraform


Matt Pacheco
So, yeah, really interesting. So as part of the cloud infrastructure team, what are some of the big things you're focusing on to help with your company's product?

Ryan Raub
Yeah, some of the big things that we do that I think deliver a lot of value are around really just management of infrastructure, and doing that in a particular way. We're big believers in infrastructure as code. We're a large Terraform shop. So we're getting as much as we can into Terraform, getting away from as much of the click-ops or the bespoke hand-created cloud resources and configurations as we can. For me, that also adds a lot of visibility to the stack.

You know, if you're a developer on a team working with a service and you're unaware of the intricacies of how the service actually works, and that's often in a different repo or, you know, managed by some Kubernetes YAML file somewhere else that you don't quite understand, we want to bring as much of that as close to the application as possible, because it truly is part of the application itself, and make sure that we've got good software development life cycle practices and procedures in place. I mean, it's called infrastructure as code for a reason. It's code.

And that's one of the things we're really trying hard to make sure we're managing, using software engineering principles effectively: trying to manage the complexities of different solutions, and trying to make sure that we are not over-engineering a solution, or that if we are adding complexity to a solution, that complexity is warranted. The more complex the solution, the harder it is to manage, the harder it is to observe, the harder it is to maintain and update, all those things.

And so we keep a keen eye on, and an understanding of, the complexities of a system. The better we can reduce that and keep it all within a repository, within code, the more effective that team is going to be at understanding how the service works and how we're able to troubleshoot and optimize it at scale.

Matt Pacheco
So speaking of Terraform, what are some of the advantages that it provides over the alternatives?

Ryan Raub
There are many alternatives, and I enjoy a space where there's competition and there are cool ideas out there. I've seen a lot of interesting and novel approaches to this problem. For us, Terraform isn't vendor-specific. I mean, yes, I'm saying Terraform; HCL, maybe, if I step back to the more general language. But we are not locked into AWS, we are not locked into GCP or any of the other providers that have specific solutions for them. Having Terraform, having these infrastructure-as-code practices, we're able to leverage other platforms; they're called providers. With the same approaches, we're able to manage our Datadog configurations, we're able to manage enrollment into group management, or in some cases things like PagerDuty.

You know, we're able to control all of those configurations via code, which I think has really been a huge advantage. Onboarding is, you know, basically a PR or two for some people. And then we're able to basically get them access; they get invites, they're able to click and get in, and same with offboarding. It's all in code, it's all easy to review, it's easy to see, and it provides a lot of consistency and guarantees if somebody is in there changing something. We have a history of what was there before, and also a history of why we had it configured that way. I've got an entire git history. I can go back and look at the various states of this particular monitor's criteria.

You know, why did we move from this threshold to that threshold? And there's an entire, you know, PR ticket discussion around the pros and cons and trade-offs. That history is there, and it's nice to have; not a lot of tools actually provide that. And so being able to leverage those common patterns and practices across different providers, I think, has been a game changer for us, and it's what makes Terraform stand out.

Matt Pacheco
What strategies have you found, I guess, most effective for managing Terraform at scale across multiple teams?

Ryan Raub
That's an interesting question. I think there's a lot of different ways you can go, and I'm going to give an answer that's probably scoped to my scale and my experiences. You know, we're roughly a 500-person company, so there are a lot of other challenges out there. I tend to lean away from a monorepo. We have an internal module registry that we maintain. It's always nice to try things; that's one of the virtues of, you know, having a sort of startup, or a team where we've got room to experiment. And we've tried a couple different options, and we keep coming back to: simple is easy, simple is clear, simple is straightforward. Someone can pick it up and run with it.

It doesn't take a whole lot of explanation. Breaking those modules down into separate repositories, treating them as their own thing, tends to be a great way to manage them. We've even been experimenting a little bit, and there's room for debate here, with using the standard tools like Dependabot, for instance, to help manage updates. Just like with any other software dependency, it can come in and make a recommendation like, hey, I see there's a patch version for this; would you like it? Here's a PR. I can bring in the changelog, I can show you what the potential impacts are, and I can run your tests for you and make sure this isn't going to cause a regression.

And so by doing this and treating these as individual units within our ecosystem, we're able to use those industry-standard tools to help maintain those updates. There are a lot of other options in there for how to help manage these things at scale. And, you know, automation is a must. I mean, you can't be running terraform apply from your desktop for everything. That's just not a scalable solution, and it's not secure. So: making sure the automation is set up, making sure the automation is consistent, you know, getting back to the complexities.

If you've got one app or one service or one area that's inconsistent, it's especially bad if it's an area that you don't touch every day, or you have people who are unfamiliar with it come in. They're going to have either that learning curve, or they're going to make that mistake when they first pick this up, because they have an assumption based on the last five things they've done and how those work, and this one being different. So I think those are some of the big takeaways, I'd say, for how we manage infrastructure as code at scale.
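The Dependabot-style workflow Ryan describes, where a patch bump to an internal module can ride through CI while bigger jumps get human review, boils down to a small piece of decision logic. Here is a hedged Python sketch of that logic only; real tooling like Dependabot or Renovate is configured declaratively, and the function and label names below are invented for illustration.

```python
# Sketch: a Dependabot-style update policy for internal Terraform modules.
# Hypothetical helper names; real tools are configured declaratively, but
# the decision they implement looks roughly like this.

def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def review_policy(current: str, proposed: str) -> str:
    """Decide how much scrutiny a module version bump needs."""
    cur, new = parse_semver(current), parse_semver(proposed)
    if new <= cur:
        return "reject"            # downgrades or no-ops are dropped
    if new[0] > cur[0]:
        return "manual-review"     # major bump: breaking changes expected
    if new[1] > cur[1]:
        return "team-review"       # minor bump: new features, read the changelog
    return "auto-merge-after-ci"   # patch bump: let CI and plan output gate it

# Example: a patch release of an internal module sails through automation,
# a major release does not.
print(review_policy("2.4.1", "2.4.2"))  # auto-merge-after-ci
print(review_policy("2.4.1", "3.0.0"))  # manual-review
```

Treating modules as independently versioned units is what makes a policy like this possible at all: each repository gets its own changelog, its own tests, and its own small, reviewable diffs.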


19:07 - IoT Implementation & Security Challenges

Matt Pacheco
Oh, that's great. Thank you for that answer. Let's talk a little bit about IoT and AWS. From what I understand, SmartRent leverages AWS IoT Core for device connectivity. So what unique challenges have you encountered implementing IoT at scale?

Ryan Raub
Well, we were one of the earlier adopters of AWS IoT Core, which comes with its, maybe not unique, aspects. Anytime you're adopting a new service out of the box, there's a lot of early days, and, you know, both sides of the fence, the producer or host of that service and the client, are both trying to figure out what works well, what doesn't, where improvements can be made, and how it actually works when the rubber meets the road. As we've been going, the APIs have been improving, and the integrations have been getting a lot better. We can do so much more with it now than we could before.

So it's really been a steady increase in what I would consider just effective use of the service as it's become a little more robust. I mean, nothing's perfect, but I think there are a number of managed services across vendors that are very reliable and work really well, and it's pretty rare to run into problems or unexpected behaviors or unpredictable costs. This is one that I think really matured nicely. It solves one big problem for us, one that we've dealt with in other ways in other capacities, and I don't see as much discussion of it: managing long-lived connections.

And so AWS IoT Core uses the MQTT protocol. It's just a message broker protocol, but it's intended for high-latency, sort of intermittent connectivity. It's not meant to be super fast or low-latency, and it can be lossy. When you're maintaining a large number of persistent connections to anything, rolling out change becomes a much bigger orchestration; it's a lot harder. With the traditional sort of web server, if you say, hey, I've got a new version of this I want to test out and deploy, well, it's as easy as adding the new instances into the load balancer. The load balancer slowly transitions over to the new instances, and those requests, which, if this is HTTP, generally don't last more than 30, 60, 90 seconds at the maximum.

But you have that window to know that, okay, any request in flight is completed. If this is a long-lived connection, one that lasts, you know, hours or days, I don't have that luxury. And so I've got to go through an orchestration process where, as I'm introducing new code, a new application, I need to gracefully move those connections over; I can't yank everything that was connected before out from under it. And you'll see some of this with WebSockets. I think that's one of those things where a lot of companies will kind of ignore the problem, so to speak, and as they roll out code, they'll use the traditional models. If you have those servers also serving WebSocket traffic, you're going to run into this exact problem. It just may not be that noticeable for you.

So those are fun solved problems that we get from managed services like AWS IoT Core, and it has been really nice to have them deal with those complexities.
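The graceful-rollover problem Ryan describes, moving hours-long connections onto a new version without a reconnect storm, is usually handled by draining in bounded batches. Below is a purely illustrative Python sketch of the batching arithmetic with invented names; a real MQTT fleet would also coordinate per-client disconnect signals and broker-side session takeover.

```python
# Sketch: draining long-lived connections in bounded batches during a rollout,
# so the new fleet never absorbs more than batch_pct percent of reconnects
# at once. Illustrative only; names and data shapes are hypothetical.
import math

def drain_plan(connection_ids: list[str], batch_pct: float) -> list[list[str]]:
    """Split active connections into batches of at most batch_pct percent
    of the fleet, to be migrated one batch at a time."""
    batch_size = max(1, math.ceil(len(connection_ids) * batch_pct / 100))
    return [connection_ids[i:i + batch_size]
            for i in range(0, len(connection_ids), batch_size)]

conns = [f"device-{n}" for n in range(10)]
batches = drain_plan(conns, 30)      # migrate at most 30% of clients at a time
print(len(batches))                  # 4 batches: 3 + 3 + 3 + 1
```

Contrast this with the stateless HTTP case Ryan mentions, where the load balancer's connection-draining window does all of this for free.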

Matt Pacheco
Speaking of complexity, there are also security considerations. Can you talk a little bit about the unique security considerations for IoT and the smart building technology you have?

Ryan Raub
Yeah, I mean, a lot of the things that we have are not unique. The vast majority of the stuff, I would say, is pretty standard in terms of approach, tools, mitigations. Some of the unique aspects for us really come from the cardinality of the data that we end up receiving. So, you know, we've got, let's say, a million or so devices out there that are all giving me telemetry. Take CPU utilization, that's an easy one. Say I have a million devices all trying to report five-minute CPU averages to me, and, this is kind of a contrived example, I'm trying to create an alert, or I'm trying to monitor and figure out which ones are running at an elevated level compared to others.

Like, that's just a much bigger problem, and you can't really use the traditional tools you normally would to either alert on or analyze that data and bring those insights out. And so we've had to shift a lot of that kind of thing much more towards a data warehousing model, where we can actually process data at larger scale in a much more scalable fashion. There are some unique aspects where we would love to have more data, more specifics for every device, but it really turns into a data processing problem that we're having to work through on the back end. And then we also end up paying all of that cost to get all of that data.
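Ryan's contrived CPU example can be phrased as a simple fleet-wide outlier test. This is a hypothetical Python sketch with an invented data shape; as he notes, at a million devices this kind of aggregation actually runs in a data warehouse rather than in application code, but the logic is the same.

```python
# Sketch: flagging devices whose CPU telemetry sits well above the fleet.
# Hypothetical data shape; at real scale this would be SQL over warehouse
# tables rather than an in-memory dict.
from statistics import mean, stdev

def elevated_devices(cpu_avgs: dict[str, float], z_threshold: float = 2.0) -> list[str]:
    """Return device IDs whose 5-minute CPU average is more than
    z_threshold standard deviations above the fleet mean."""
    values = list(cpu_avgs.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []                       # a perfectly uniform fleet has no outliers
    return [dev for dev, v in cpu_avgs.items() if (v - mu) / sigma > z_threshold]

# A 50-hub fleet idling around 20% CPU, with one hub pegged high:
fleet = {f"hub-{n}": 20.0 for n in range(50)}
fleet["hub-7"] = 95.0
print(elevated_devices(fleet))          # ['hub-7']
```

The hard part at scale isn't this arithmetic; it's the cardinality, a million keys instead of fifty, which is exactly why the work moves to a warehousing model.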

And so we need to be really careful about which specific signals we want to make sure we're sending and receiving. And as it pertains to security, which was your original question, behavior analysis can be one. There are a lot of simple things: just having some regular data, having established patterns, and then being able to intelligently identify deviations and either raise awareness, fire off an alert, or, in a lot of cases, include that in reports and make sure it's surfaced in aggregate fashion as well. We've got a lot of protections in place for these devices and how they present themselves, their authentication methods. We've got a great hardware team that comes up with a lot of really good solutions there.

You know, we lean on a lot of industry standards for how we do these exchanges, and we make sure these communications are properly secured and that the data we're receiving is coming from that trusted source, so we're not having to worry about these devices falling into the wrong hands. But, you know, at the same time, if they start doing other things, they would stand out in our telemetry.


27:21 - Automation, AI, and Observability

Matt Pacheco
What are your thoughts on using automation and AI for cybersecurity?

Ryan Raub
It definitely can help a lot. I think synthesizing large amounts of data and finding correlations can be of value. For me it's been a mixed bag. There have been situations where I've seen huge value, where it was able to pull together a correlation that I wouldn't have thought to go look at, or that just was not obvious to a person. Since it's able to churn through a lot more data than a person can, it's able to highlight those. But I've also seen countless times where it picks up on what it thinks are correlations, but in fact they're unrelated signals, or whatever other misses. So, you know, I don't think it's a silver bullet.

I don't think it replaces a person reviewing, especially security findings. But I for sure see value in some of the synthesis and some of the correlation. Anomaly detection is an interesting one. We use it in a number of different ways, and there are certain situations where I feel it can work well, and then there are many situations where it just does not, and in fact produces probably more noise than the source signal you're actually trying to observe. So it's definitely a tool. It's a tool you can use, it's a tool that helps, but it doesn't really replace a person at the end of the day.

Matt Pacheco
Yeah. And that seems to be something we hear often, that it's definitely an assistant rather than a replacement. So thanks for sharing your thoughts on that. Now we're switching gears. We've spoken previously, and you mentioned observability as an area you're passionate about. Can you tell us a little bit more about how you're approaching observability at SmartRent?

Ryan Raub
So, observability. Well, first off, I'll use a phrase which is uttered a lot: it's not free. And I think there's sometimes a misconception of, well, let's just add some more log lines, or let's just add this metric, or let's just add this observability into our application, and then later on maybe we'll need it or not. Or think about unstructured logs: if you're just adding in additional things but not necessarily getting the value out of them, you end up paying a price, either in processing that telemetry, that log volume, that metric, whatever it is; you end up paying the network costs, the CPU costs. There are a lot of things we have to be really careful with, and we optimize for signal-to-noise.

You know, if we are going to log this particular event with this context, is that a volume we're going to get value out of? I'd love to be able to tell a team, you know, let's run the app in debug mode. Why not? We can pay the network fee, we can pay the costs for indexing all those logs, and we would have a better answer if a problem came up. We'd be able to know exactly what happened, because we'd have the full trace, the full sort of gory details. But the costs that come with that become huge.

You end up finding out that, oh, you now need to run 20, 30, 40% more resources to host the same load, because now the services are doing such a large job either maintaining state between processes or transferring log volume; they're having to ship telemetry up to a centralized system at a much larger rate. And then you have the other part where you're consuming all that, and you have to either scale to accommodate it, or scale your accounts payable department to cover the bill you're going to end up paying.

And, you know, it's always a bit of, I won't say a battle, but a careful balance between what we're logging, what we're monitoring, what we're putting into our observability toolkit, and making sure we're getting value out of it. One interesting area, and it's had a lot of iteration in the last couple of years, is sampling. I think there are a lot of different approaches to it. There are some areas where I think it's a great solution, where your needs for observability may be better served by sampling and guaranteeing, you know, at least 1% uniqueness of whatever patterns you establish coming from your various nodes. That may be enough data to answer your question.

And if you can do that sampling upstream enough, you're not paying that network cost, you're not paying that ingestion cost. You are still paying the resource cost on the node, because it's potentially still logging or collecting that telemetry, depending on where the sampling happens. And that's where I think there have been some interesting developments pushing that sampling back. I mean, early on you could sample at the ingest layer: it takes everything you give it and keeps, you know, 5, 10%, but you're still paying for that whole network transfer. It's since moved back; I've seen it at the agent level.

So on a particular node, that's where it makes the determination: is this a sample that I send up, or one that I discard? And then it's moved even further back, into the instrumentation itself, to understand at the time of instrumentation: am I a sampled request or not? That comes with interesting trade-offs, because if you're sampling some requests and those happen to be the slower ones, but you're not seeing the others, you end up trying to figure out why you've got two sets of performance characteristics for the same calls. Well, that's the overhead you're paying. So you can introduce complexities there as well. But that's an interesting way to go about managing observability data at scale.

The most important part is making sure it's of value, it's useful, and you're going to be able to take action on it. You can't just log everything forever. It gets too expensive and too noisy.
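Pushing the sampling decision back toward instrumentation, as Ryan describes, usually relies on a deterministic keep/drop function, so every service that sees the same trace ID makes the same decision and sampled traces stay complete end to end. Here is a hedged Python sketch of that idea; the hashing scheme and names are illustrative, not any particular vendor's implementation.

```python
# Sketch: head-based, deterministic trace sampling at instrumentation time.
# Hashing the trace ID means every hop agrees on keep vs. drop for a given
# request, so a kept trace is complete across services.
import hashlib

def keep_trace(trace_id: str, sample_rate_pct: float) -> bool:
    """Deterministically keep roughly sample_rate_pct percent of traces."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate_pct * 100

# At a 1% rate, about 1 in 100 traces survives, and the same trace ID
# always gets the same answer:
decisions = [keep_trace(f"trace-{n}", 1.0) for n in range(100_000)]
print(round(sum(decisions) / len(decisions) * 100, 1))  # close to 1.0 percent
```

The trade-off Ryan flags still applies: a uniform hash like this samples blindly, so if the interesting requests are the rare slow ones, an unbiased 1% may miss exactly what you wanted to see.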


34:52 - Cost Optimization Strategies

Matt Pacheco
Speaking of costs, we just talked about them in the context of observability, but let's look at it from a higher level: your cost strategy at SmartRent. How do you approach that? Because, like you said, these costs could get out of control if they're not managed properly. What's your philosophy and approach to cost optimization and cost containment?

Ryan Raub
Yeah, so first and foremost, it's making sure we know what we have. And that comes a little bit back to observability. You need to have a good tagging strategy for how you understand what belongs to what and what severity level certain things are at. And from an organization perspective, you can go ahead and spot right-sizing opportunities. You can make sure those cost centers are properly accounted for, to a certain degree. There's always some of the, you know, network egress cost, things that are hard to attribute; I can't put a tag on that, necessarily.

So: proper attribution, a good understanding of what's there, making sure you have everything accounted for, so you're not finding a large OpenSearch or Redshift cluster that someone turned on for a thing and then forgot about. And that drives all the way back to infrastructure as code. As part of rolling out new resources, as part of making a change, if we're going to increase capacity or change resource types, there's a conversation in a pull request on that, and that's usually a good evaluation: here are the benefits, here's the justification for it. And so we've got that history.
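The tagging discipline Ryan describes can also be enforced mechanically. Here's a toy audit that flags resources missing required cost-allocation tags; the tag keys and inventory are illustrative assumptions, not SmartRent's actual scheme:

```python
REQUIRED_TAGS = {"team", "service", "environment"}  # assumed tag keys

def untagged(resources: list[dict]) -> list[str]:
    """Return IDs of resources missing any required cost-allocation tag."""
    return [
        r["id"]
        for r in resources
        if not REQUIRED_TAGS <= set(r.get("tags", {}))
    ]

# Hand-made inventory standing in for a real cloud API listing.
inventory = [
    {"id": "db-prod-1",
     "tags": {"team": "platform", "service": "api", "environment": "prod"}},
    {"id": "redshift-tmp", "tags": {"team": "data"}},  # forgotten experiment
]
print(untagged(inventory))  # ['redshift-tmp']
```

Run regularly, a check like this is what surfaces the forgotten cluster before the bill does.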

We're able to overlay that and make sure, through regular evaluations, that we're staying efficient, that we're sticking with the tried and true paths. And then of course we have a large AWS footprint, so there are a lot of strategies you can employ there, and I think AWS has been good about this.

They've moved a lot from their really locked-in reserved instance purchases, where you had to buy that specific instance class in that region, on Wednesdays or whatever crazy criteria they came up with, and make sure you didn't step outside of that, and then they'd give you a benefit, a savings, on top of it. Over the years they've been shifting toward more flexible options like savings plans. So we regularly go through and make sure we're utilizing those effectively, and making larger commitments when we're able to. Because that's always been one of the issues with reserved instances.

Cool, I can make all of these commitments, but they're usually in terms of years: one, two, three. Am I confident I'm going to be running this particular database at this particular instance class in a year, in two, in three? That confidence can vary a lot. They're willing to give you deeper discounts for longer-term commitments, and sometimes that's worth it; other times it's a trade-off you may not want to make if you find yourself needing to upgrade. They've been doing a better job expanding this, but historically everything was really rigid. They have a concept called instance size flexibility.

That's where they treat instance sizes as units. So if I were to go up from a large to an extra large, that's like two of the same units, and I would just need to buy one additional RI to keep my coverage there. But they don't offer that in all of their product models. So we're trying to make sure we don't over- or under-commit: keeping the flexibility to grow and scale and pivot while not wasting an RI, and taking advantage of those things where we can. That's been its own fascinating journey as they make changes there.
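The "units" Ryan is referring to are what AWS documents as normalization factors for size-flexible reserved instances: a large counts as 4 units and an xlarge as 8, so one xlarge of usage equals two larges of coverage. A quick arithmetic sketch of his large-to-xlarge example:

```python
# AWS-documented normalization factors for size-flexible RIs.
NORMALIZATION = {"small": 1, "medium": 2, "large": 4, "xlarge": 8, "2xlarge": 16}

def units(size: str, count: int) -> int:
    """Total normalized units for `count` instances of a given size."""
    return NORMALIZATION[size] * count

# Moving one workload from large to xlarge: how many more large-sized
# RIs are needed to keep it fully covered?
owned = units("large", 1)    # 4 units already owned
needed = units("xlarge", 1)  # 8 units required
extra_large_ris = (needed - owned) // NORMALIZATION["large"]
print(extra_large_ris)  # 1 additional large RI, matching Ryan's example
```

As Ryan notes, this size flexibility only applies in some product lines (for example, not to all database engines), which is exactly why over- or under-committing is easy to do.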

 

Matt Pacheco
Yeah, definitely a journey, and I'm sure more changes in the future as well. Speaking of the future, let's talk about some trends over the next few years. What are some emerging cloud or IoT trends that you're most excited about?

Ryan Raub
Yeah, I've seen a lot of growth in the OpenTelemetry standards. Over the past couple of years we've seen that really materialize as a vendor-agnostic way to provide observability data. I'm excited to see that continue and become basically the standard going forward.

I know there are still some holdouts, and still some things that need to be worked out with the standard, but I think it's a good thing for the industry. And to play off that a little bit, we've also been working with large-scale observability problems, and there are a couple of players in this space now who take a data warehousing approach to the problem, rather than the more traditional approach where things are stored in memory and the per-volume cost is pretty prohibitive at scale. Seeing those things move to more of a data warehouse model has been really interesting. It's still early days.

I think we're still figuring out how to properly query, manage, and maintain those pipelines in a data warehouse approach. I think there's a lot of room for growth there.

Matt Pacheco
Yeah. So we talked about this a little earlier, but I want to ask it at a grander scale: what role do you see AI and machine learning playing in cloud infrastructure management in the future, if any?

Ryan Raub
Yeah, the "if any" is a good qualifier there. So one of the things I've seen, and this is back to Terraform for a moment: infrastructure as code, Terraform, is a declarative language. You are declaring, this is what I want. And the pitch is that you don't have to care about how to get there. However, I think anyone who's used it for a while will clearly see you really do have to understand how to get there. I can't go change that Postgres version from 15 to 16 and just expect it to work with no downtime; that's pie in the sky. So I've seen a lot of improvement in tooling around understanding that change.

You're proposing this new declared state: how do I get there safely, effectively, reliably? Either fail in a safe way, or make a smoother transition, or highlight that, hey, this is going to be a disruptive thing. And there's a lot of learning we've had to do there, because there can be some really surprising changes. Unless you've done this before, you may not have realized that changing this description on this attribute requires downtime if you just send it as-is; you have to go through a bit of a two-step operation in order to transition seamlessly.

So I definitely see an opportunity there for AI to help highlight those things, potentially provide solutions, and help train or give more information in the pull request: hey, this is going to be a downtime event, we're going to need to either take this approach or accept that, and here's a pattern we've put in place for it. As a way to help educate the teams on those rough edges, I think there's a lot of interesting opportunity there. And also to review and potentially test: you've asked for this new declared state. Is it going to do what you think it's going to do? Is it going to solve the problem you're after?

I think there are ways to evaluate that using AI, to help find those answers and highlight those areas, so that a human can come in and be provided more context and more information to make decisions. That's definitely an area of opportunity. I've seen some people scratching at it, and I think there's a lot of room there.
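Even without AI, the kind of pull-request flagging Ryan describes can start from Terraform's machine-readable plan: `terraform show -json <planfile>` emits a `resource_changes` list whose `change.actions` include `"delete"` when a resource will be destroyed or replaced. A hedged sketch; the plan snippet is hand-made for illustration, not real Terraform output:

```python
import json

def disruptive_changes(plan_json: str) -> list[str]:
    """List resource addresses whose planned actions include a delete,
    i.e. destroy or destroy-and-recreate: likely downtime events."""
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]

# A minimal hand-written stand-in for `terraform show -json` output.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},  # replacement
        {"address": "aws_security_group.web",
         "change": {"actions": ["update"]}},            # in-place update
    ]
})
print(disruptive_changes(sample_plan))  # ['aws_db_instance.main']
```

A CI job could post this list as a pull-request comment, which is the "hey, this is going to be a downtime event" signal Ryan wants, with AI layered on later to suggest the safer two-step path.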

Matt Pacheco
Yeah, that's really exciting to think about, and it seems to be the common theme, as I mentioned earlier: helping people make decisions. Last question for you, and this one is more advice related. We talked a lot about cloud infrastructure management, infrastructure as code, and Terraform. What is one piece of advice you would give organizations looking to adopt any of this technology to get started?

Ryan Raub
Yeah. If you're looking to get started and you don't have anything, or you don't have these practices, it's an uphill battle. I've gone through a number of exercises trying to go from either a completely different set of infrastructure as code, or a completely click-ops-based architecture, and bring that into a Terraform-controlled process and workflow. I don't think there's one solution to fit everything. I think you need to account for the change that's going to happen while you're going through this; that's an aspect not a lot of people think about. It's easy to go: I need to import everything here. Okay, I started. Two weeks later: all right, I'm done. Oh wait, it's changed.

So being able to work progressively, to stay in step with that infrastructure, with that change. I'm a big proponent of start small, make it work, then make it better. If you can grab some resources, bring them into this flow, and set up the automation for those resources, even if it's a subset, that's the place to start, and then build out from there, because that will let you iterate. Then if you're making changes that include those resources, you're able to include those changes in the infrastructure-as-code pipelines. The transition time is painful, because if you've got multiple approaches to managing that infrastructure, that's just more areas for someone to forget or make a mistake or screw it up. So you don't want to let that linger longer than you need to.

But yeah, it's a tough problem, that's for sure. But I'm a firm believer that the end state is worth it.
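The "start small, stay in step" approach implies regularly checking which live resources are not yet under Terraform's control, so the next incremental `terraform import` has a worklist. A toy sketch of that diff; both sets are illustrative stand-ins for real cloud API and state queries:

```python
def unmanaged(cloud_ids: set[str], state_ids: set[str]) -> set[str]:
    """Resources that exist in the cloud but not in Terraform state:
    candidates for the next incremental import."""
    return cloud_ids - state_ids

# Stand-ins: `live` from a cloud inventory API, `managed` from state.
live = {"vpc-1", "db-prod", "queue-events", "bucket-logs"}
managed = {"vpc-1", "db-prod"}
print(sorted(unmanaged(live, managed)))  # ['bucket-logs', 'queue-events']
```

Re-running a report like this during the migration is what keeps the import effort honest as the infrastructure keeps changing underneath it, the "oh wait, it's changed" problem Ryan describes.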

Matt Pacheco
That's excellent advice, and that's all my questions for you today. I really appreciate having you on, Ryan. This was very informative, especially about infrastructure as code and Terraform. We don't get to talk about that much, so it was really interesting to hear about. Thank you for being on Cloud Currents today.

Ryan Raub
Yeah, well, thank you for having me. Happy to share some of my ideas and approaches to these problems.

Matt Pacheco
Absolutely. We appreciate you. And for our listeners, thank you for listening in. Find us on all your major podcast platforms and we'll see you soon. Thank you.