EP. 30 HPC, DevOps, and Security Trends with Dennis Walker

About This Episode
Cloud Currents host Matt Pacheco sits down with Dennis Walker, Senior Director of HPC Solutions Architecture at HPE, to discuss the world of high-performance computing and cloud infrastructure. From managing massive-scale supercomputers to implementing cutting-edge AI solutions, Dennis shares his expertise on the challenges and innovations in modern computing infrastructure. Learn about the intersection of cloud-native architecture with supercomputing, the critical role of power efficiency in large-scale operations, and the evolution of DevOps in enterprise environments.
Know the Guests
Dennis Walker
Senior Director of HPC Solutions Architecture at HPE
Dennis Walker, Senior Director of HPC Solutions Architecture at HPE, spearheads cloud-native HPC systems management and supercomputing infrastructure initiatives. With over two decades of technology leadership experience, he excels in DevOps, AI/ML, and large-scale infrastructure management, including deployments exceeding 10,000 servers. Walker's expertise extends to developing cloud-native architectures for managing massive computing environments. Before joining HPE, he founded and led several technology consultancies, working on diverse projects from AI-based security solutions to biological telemetry data visualization platforms.
Know Your Host
Matt Pacheco
Sr. Manager, Content Marketing Team at TierPoint
Matt heads the content marketing team at TierPoint, where his keen eye for detail and deep understanding of industry dynamics are instrumental in crafting and executing a robust content strategy. He excels in guiding IT leaders through the complexities of the evolving cloud technology landscape, often distilling intricate topics into accessible insights. Passionate about exploring the convergence of AI and cloud technologies, Matt engages with experts to discuss their impact on cost efficiency, business sustainability, and innovative tech adoption. As a podcast host, he offers invaluable perspectives on preparing leaders to advocate for cloud and AI solutions to their boards, ensuring they stay ahead in a rapidly changing digital world.
Transcript
00:00 - Introduction to Dennis Walker
Matt Pacheco
Hello everyone and welcome to Cloud Currents, a podcast that navigates the ever-evolving landscape of cloud computing and its impact on modern businesses. We talk about lots of topics on this podcast, from AI, machine learning, and cybersecurity to hybrid cloud management and every trend you could think of in the cloud world. I'm your host, Matt Pacheco, and I manage the content marketing strategy at TierPoint, a managed cloud and data center provider. Our guest today is Dennis Walker, Senior Director of HPC Solutions Architecture at HPE, that's Hewlett Packard Enterprise, where he leads initiatives in cloud native supercomputing infrastructure. With over two decades of experience, Dennis has been at the forefront of DevOps, AI and ML, and large scale infrastructure management, working on some of the world's most powerful computing systems.
Matt Pacheco
In today's episode, we'll explore the challenges of managing those massive scale infrastructures, the evolution of DevOps in the cloud, and the future of AI and machine learning in operations. So thank you for joining us today, Dennis. We're excited to have you on.
Dennis Walker
Hi Matt, it's my pleasure to be here.
Matt Pacheco
Awesome. Well, we'll dive right into it, and I'm really curious: can you walk us through your journey in cloud all the way up to your current role at HPE?
01:39 - Evolution of Cloud Computing
Dennis Walker
Sure. Yes. Let me find a way to make this concise. So almost two decades ago, I entered the world of enterprise IT consulting and software development, back when those two things were entirely separate. I worked inside of a data center managing hundreds of different websites and web applications that my teams and I had developed, deployed, and were maintaining. And I think it was 2006 or 2007, somewhere around there, when we saw AWS beginning to take off in the public cloud hosting space. That became a much more efficient way of spinning up infrastructure, getting feedback from our customers, and then launching. And so within a few years, we hopped on the bandwagon of writing code to manage infrastructure and using that to solve real problems.
I think it was 2012 that I was recruited into a Fortune 500 company, now on the New York Stock Exchange, that had taken, I think, a Series A and tried to launch a new product. But it had gone horribly sideways on bare metal infrastructure. Within three months we had transplanted it over onto AWS and made it horizontally scale according to demand. Within the first week of its launch, they advertised on Good Morning America, driving a volume of traffic we had never seen. And sure enough, the elastic infrastructure was able to dynamically scale to demand. That really saved the day for that company, which is still around today, making billions in revenue.
From there, I walked through some of the consecutive steps in the cloud space. Through the last five or even seven years, I think the areas where there's still rapid expansion or need for innovation are the domains of security and artificial intelligence. That was really my focus area from 2017 to 2021. And since about 2020, I've been helping to bring the cloud native ecosystem to high performance computers and supercomputers, just as you mentioned, several of which are in the top 10 in the world.
Matt Pacheco
That is so cool. How has your experience with AI and infrastructure management influenced what you're doing today?
Dennis Walker
Yeah, I think a couple of things. All of the public cloud providers do a pretty good job of publishing their white papers around well-architected paradigms. The nuts and bolts of those, at a high level, are that you organize your thinking across cross-functional or non-functional domains of requirements: security, performance, availability, rate of change, cost leverage. And the north star of all of that, I think, is either velocity or cost per transaction, keeping the minimum infrastructure necessary to serve the ever-evolving demand of customer request volume. Coming to HPC and supercomputing, a domain where the core value proposition is in the hardware footprint, it's serving an ecosystem that runs models at a scale not served cost-efficiently by public cloud providers.
And so the name of the game has been making that infrastructure highly utilized and maximizing its value, so it can process models and data at scale for the lifetime of that system, over the course of seven years.
Matt Pacheco
You mentioned this a little before, but can you explain the concept of cloud native architecture in the context of high performance computing environments for users of supercomputers?
06:17 - High Performance Computing and Cloud Native Architecture
Dennis Walker
They need to be able to submit jobs that run on the specified infrastructure in order to maximize the efficiency of model processing. And the delta if they get it wrong is in the range of weeks or months. So it's crucial that they have a current inventory, along with all of the capabilities of every single node within that footprint, up to and beyond 10,000 nodes. You also have hundreds of users who are potentially submitting jobs simultaneously against that infrastructure. The scheduler functionality has to be able to translate all of those requests into the maximum concurrency possible, so that there are no idle nodes within that footprint for any length of time, especially given that the capital expenditure is $100 million or more.
The reason why Cloud Native Computing Foundation solutions are relevant is that they help translate and abstract the topology of the infrastructure into APIs that you can query about current state and use to leverage the infrastructure as it becomes available. I'll add that there's one more domain where this is relevant, and this is especially true in Europe, where there are power efficiency constraints. For example, there is a customer in Finland where they can't operate all of the nodes at the same time, especially during the middle of the day, during peak power utilization by the public; otherwise there will be rolling brownouts. So another cutting-edge domain is being able to power cap individual components of all of those nodes.
So for example, if a job is running that doesn't necessarily make use of the GPUs within the nodes, you're able to turn those off and on dynamically, lowering the power consumption footprint of the data center.
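As a rough sketch of that power-capping flow, here is what adjusting a node's power limit might look like against a Redfish-style BMC API. The chassis path follows the DMTF Redfish Power schema, but the hostnames, credentials, and wattage policy are illustrative assumptions, not HPE's actual implementation:

```python
# A hedged sketch, not HPE's implementation: cap per-node power through a
# Redfish-style BMC API. The chassis path follows the DMTF Redfish Power
# schema; hostnames, credentials, and wattage policy are assumptions.
import requests

def set_power_cap(bmc_host: str, limit_watts: int, session: requests.Session) -> None:
    """PATCH a power limit onto the chassis' first PowerControl entry."""
    url = f"https://{bmc_host}/redfish/v1/Chassis/1/Power"
    payload = {"PowerControl": [{"PowerLimit": {"LimitInWatts": limit_watts}}]}
    session.patch(url, json=payload, timeout=10).raise_for_status()

def apply_peak_policy(nodes: list[dict], peak_hours: bool) -> None:
    """During peak grid demand, drop GPU-idle nodes to a lower cap."""
    session = requests.Session()
    session.auth = ("admin", "password")  # placeholder credentials
    for node in nodes:
        # A node whose current job never touches its GPUs can run far
        # below maximum draw without slowing the job down.
        cap_watts = 400 if (peak_hours and not node["job_uses_gpu"]) else 1200
        set_power_cap(node["bmc"], cap_watts, session)

apply_peak_policy([{"bmc": "node0001-bmc", "job_uses_gpu": False}], peak_hours=True)
```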
Matt Pacheco
It's a really interesting use case of leveraging tools like that. And you kind of got into a challenge that you potentially face. You mentioned a region where power could be a challenge. What are some of the other unique challenges in that space that you come across often?
Dennis Walker
Sure. Security and multi-tenancy is a burgeoning part of the ecosystem. If you are a major government agency or an educational institution and you've made a dramatic investment into this area, you want to be able to make it available to a group potentially larger than your own. But in order to do so, you need to ensure security guarantees for the people who are pushing their models and their data into those systems concurrently. Along the same lines, multi-tenancy in that domain entails ensuring that you have separation and isolation of the management concerns, the monitoring concerns, and the security concerns. That can even mean, for example, that the PKI infrastructure and CA certs are unique to every individual organization, business unit, or tenant within that environment.
And when a tenant has been allocated a particular portion of the infrastructure, proper partitioning exists within the storage arrays and parameterization has been applied to the network boundaries, so that you're able to guarantee protection from one tenant to another.
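To make the per-tenant PKI idea concrete, here is a minimal sketch that mints an isolated root CA per tenant with Python's cryptography library, so certificate trust never crosses a tenant boundary. The tenant names, key size, and lifetime are assumptions, not CSM's actual mechanism:

```python
# Illustrative sketch of per-tenant PKI isolation: mint a separate root CA
# for each tenant so certificate trust never crosses a tenant boundary.
# Tenant names, key size, and lifetime are assumptions, not CSM's mechanism.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

def make_tenant_ca(tenant: str) -> tuple[rsa.RSAPrivateKey, x509.Certificate]:
    key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, f"{tenant}-root-ca")])
    now = datetime.datetime.now(datetime.timezone.utc)
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed root, one per tenant
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=365))
        .add_extension(x509.BasicConstraints(ca=True, path_length=1), critical=True)
        .sign(key, hashes.SHA256())
    )
    return key, cert

# Leaf certs issued to a tenant's nodes and services chain only to that
# tenant's own root, so one tenant can never impersonate another.
tenant_cas = {t: make_tenant_ca(t) for t in ("university-a", "agency-b")}
```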
Matt Pacheco
Really interesting. I guess we'll get into security in a little bit; I had more questions about high performance computing before we dive in, because security I feel like we could talk about all day. But as a follow-up: what role, if any, does Kubernetes play in modern high performance computing environments?
Dennis Walker
That's a good one. So when I came into HPE, the first product, really the main product, that I was asked to oversee was a cutting-edge system management platform called Cray System Management (CSM), a product developed by Cray, an acquisition of HPE. One use case is that you can use Kubernetes in that ecosystem to bring a lot of similar public cloud functionality into the management plane of that infrastructure. For example, using Istio, and being able to leverage cert-manager within Kubernetes to rotate certificates based off of the customer-provided CA certs or intermediary certs, so that all of the data in flight is properly encrypted within certain protocols using the latest ciphers. That's an example of that.
We also used it to provide APIs for pushing out updated declarative configuration models to any of the groupings, individual nodes, or the entirety of the infrastructure, so that you can dynamically update configuration management and then have a transaction record of what that change was for auditing purposes over time. So that's one domain where Kubernetes is relevant. Another domain where Kubernetes is relevant within HPC is on the end user side. As you probably know, around Kubernetes and Kubeflow there is a burgeoning ecosystem of tools, all suited towards elevating the velocity of data scientists and facilitating the MLOps ecosystem. Many customers want to be able to leverage those user journeys, tooling, and solutions within the compute side, the managed ecosystem itself. And so being able to dynamically provision Kubernetes environments and scale them up to 10,000 nodes is another domain.
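As a loose illustration of that declarative-config-plus-audit pattern, here is what pushing a node group's configuration through the Kubernetes API might look like with the official Python client. The namespace, object names, and annotation key are hypothetical; CSM's real APIs differ:

```python
# A loose sketch of declarative configuration with an audit trail, using
# the official Kubernetes Python client. The namespace, object names, and
# annotation key are hypothetical; CSM's real APIs are different.
from kubernetes import client, config

def push_node_config(group: str, settings: dict, author: str) -> None:
    config.load_kube_config()  # or load_incluster_config() inside a cluster
    api = client.CoreV1Api()
    body = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=f"node-config-{group}",
            annotations={"audit.example/changed-by": author},
        ),
        data={k: str(v) for k, v in settings.items()},
    )
    # Replace assumes the object already exists; the API server assigns a
    # new resourceVersion on every change, which acts as the transaction
    # marker you can audit later.
    api.replace_namespaced_config_map(f"node-config-{group}", "hpc-mgmt", body)

push_node_config("gpu-nodes", {"power_cap_watts": 400}, author="ops@site")
```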
Matt Pacheco
Wow. So we can't talk about high performance computing without at least mentioning the biggest trend in cloud, or at least the thing that people are talking about the most: AI. So I have a few questions about AI. What are some of the key infrastructure requirements for large scale AI model training?
13:24 - AI and Infrastructure Requirements
Dennis Walker
Well, I think the answer to that basically boils down to machine-level metrics, storage, and network requirements. First of all, a key differentiator between a common data center and what would be considered a supercomputer is the RDMA protocol and InfiniBand-type networks, where you're able to transmit at a much higher speed but also have direct memory access from one node to another, bypassing even the operating system kernel. What that means is the model itself is not limited to the confines of the memory capacity of a single server. The size of those models can be the sum total of memory within all the nodes. All the model needs to be able to do is maintain a map to properly partition that and know where to retrieve the various elements of that model.
You can imagine some of the top players in that space, the LLMs like ChatGPT: they're trying to compile what is now in the hundreds of billions of parameters, maybe even trillions, in order to process the total volume of Internet-accessible human knowledge into all of the parameters within a vector database. And the only way to do that is if you can transcend the limits of single-node memory. The storage story is similar on the HPC side as it is on the public compute side. You need to be able to leverage the storage, but also keep track of its topology, so that you're not trying to transmit or process too much data from any given individual section of that data center; it's properly distributed for maximum throughput. And then the management network is similar.
Every time you're rebooting a node, you have to deliver root file systems and/or packages for dynamic mounting inside of each compute node's memory. All of that has to be pretty precisely tuned so that it's properly distributed and the models are aware of all of that distribution.
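A back-of-the-envelope illustration of why models have to span node memory; the parameter count, precision, and per-node memory below are assumed round numbers, not figures from the episode:

```python
# Back-of-the-envelope numbers: the parameter count, precision, and
# per-node memory are assumed round figures, not values from the episode.
PARAMS = 1_000_000_000_000   # a 1-trillion-parameter model
BYTES_PER_PARAM = 2          # fp16 weights
NODE_MEMORY_GIB = 512        # assumed memory per compute node

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"weights alone: {weights_gib:,.0f} GiB")                   # ~1,863 GiB
print(f"minimum nodes just for weights: {weights_gib / NODE_MEMORY_GIB:.1f}")
# Gradients, optimizer state, and activations multiply this several times
# over, which is why aggregate memory over an RDMA fabric matters.
```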
Matt Pacheco
You may have just answered this, but how is HPE addressing these massive computational needs for modern AI?
Dennis Walker
Yeah, there is a lot I could speak to there. I think part of the key value proposition for HPE is on the hardware side, maybe even in the domain of cooling, providing liquid cooling to all of the hardware components. But that's not the first item. Probably the chief value proposition is that they can come to the table with all of the elements, really providing the integrated solution for all of it. There are very few providers who do that at the scale that HPE does, and certainly almost none inside of North America at the same scale that HPE does.
Matt Pacheco
Everyone's interested in how to manage some of these AI workloads, and that's really useful from a sustainability standpoint. So high performance computing, GPUs, all the great stuff that goes along with some of these intensive workloads like AI. How do you address power efficiency for those computing operations?
Dennis Walker
Yeah, it starts with the demand of a submitted job. Say you're a data scientist: you have a model, you have data, and you have an approximate understanding of the size of the computational power you need, maybe even the domain of it, whether it's GPU power or CPU power. When you submit that job, you can submit it with that additional metadata so that the scheduler is aware of where it needs to send that job for processing. In the meantime, it's collecting metadata about the execution of all jobs, sending that to the system management plane, and then interacting with a power capping API so that it's able to dynamically adjust the power on the basis of the sum total of all the jobs that are processing in flight.
It's able to do that at the node level and CPU level, as well as for all of the components. In some cases you may have four or more NICs, network interface cards, per server, and not all of them may need to be in flight at the same time. You can power certain ones off and then power them back on when a heavier demand comes in. That is especially true with the GPUs; I think that's the biggest value right now for dynamic power shifting.
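For a concrete picture, a job submission carrying that resource metadata might look like the following sketch. The #SBATCH flags are real Slurm directives, but the partition name and script contents are placeholders for a generic Slurm site, not HPE's scheduler specifically:

```python
# Sketch: submit a batch job whose metadata tells the scheduler (and,
# downstream, the power-management plane) what hardware it will use. The
# #SBATCH flags are real Slurm directives; the partition name and script
# contents are site-specific placeholders.
import subprocess

def submit_job(gpus: int, cpus: int, hours: int, command: str) -> str:
    lines = [
        "#!/bin/bash",
        "#SBATCH --partition=compute",
        "#SBATCH --nodes=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={hours}:00:00",
    ]
    if gpus:  # only request GPUs when the job actually needs them
        lines.insert(3, f"#SBATCH --gres=gpu:{gpus}")
    lines.append(command)
    # sbatch accepts the job script on stdin and prints the new job id.
    out = subprocess.run(["sbatch"], input="\n".join(lines) + "\n",
                         text=True, capture_output=True, check=True)
    return out.stdout.strip()

# A CPU-only job: the absence of a GPU request is the signal that lets the
# management plane cap or power down the node's GPUs while it runs.
print(submit_job(gpus=0, cpus=64, hours=4, command="srun ./preprocess.sh"))
```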
Matt Pacheco
Yeah, that's very interesting, thank you. And it's definitely an interesting trend that a lot of people are starting to pay closer attention to, so I figured I'd ask. Let's switch gears to security and compliance. We talked about that a little earlier, but I'd love to dig into some of the unique security challenges in those high performance computing environments. What are the biggest challenges you're facing there?
19:17 - Security Challenges in HPC Environments
Dennis Walker
Probably the biggest in the HPC space is cryptography, and managing that all the way from the hardware itself. Starting within the TPM module, a physical chip within the server, you have an area where you can burn a cryptographic key. Then when that server powers on and begins to execute its BIOS and everything down the line, it's able to use that cryptographic key to authenticate and checksum the execution of everything running below it. This is more commonly known as secure boot. But getting all of those artifacts properly signed, and then re-signed during deployment with a customer-provided CA cert, integrating all of that and providing a deployment model for it, has been a challenge.
So secure boot is a key area that I think a lot of people are interested in, and it represents some challenges. Alongside that is node attestation and having a non-person identity and access management story. CSM does provide that, just recently actually tying it into the TPM cryptographic key, so that you're able to better protect the physical security of the data center. For example, without that, someone could come in, plug a laptop into a switch, request root file systems of a management node, and then compromise the entire security model of everything. Another area that's a challenge is the amount of CVE vulnerability churn out there. Anybody who's managed a Kubernetes environment knows that you may be dealing with hundreds of containers at any given point in time.
Every one of those containers has dozens of RPMs, maybe even more, and the evolving landscape of exploit discovery is, I think, only escalating. The story here is that at one point I was managing a product that had something like 5,000 vulnerabilities; we couldn't keep on top of that volume by hand-patching, and the pace was, like I said, escalating. The solution at that point in time was to dynamically rebuild all of the containers, pulling in patches every single night, and then producing those in a way that customers could consume. But that's just for the containers. Then you have root file systems and all of the RPMs that go into them, plus all of the machine learning libraries end users may want to bring in on top of that. So that's a complicated story.
It's also especially complicated because many models, especially for weather prediction, are constituent members of large ETL-based data processing pipelines. Anytime you change a library, even for a minor vulnerability patch, you might inadvertently alter the outcome of one of those components. And if it's part of a big ETL pipeline that takes a month or longer to run, you may not even see that drift until quite a bit later. So managing the amount of churn in patches and exploits is a challenge. I think maybe the final challenge is enforcing authorization at all parts of the system. You have different user personas interacting with the system, and I think the main area most people are concerned about is when end users are remotely connecting into the supercomputer in order to run workloads.
Well, you need to be able to enforce boundaries, even within the nodes they have access to, as to what they can do and why. Solutions like AppArmor and SELinux are of course a good answer for that footprint. But then you need similar solutions in place all the way up and down the chain, so that if you did have an exploit somewhere, for example, and somebody was able to reach a BMC or even deeper into the system management infrastructure, authorization controls exist at each step in that chain, and they're consistently pulling from the same upstream identity provider, if that's the desired configuration.
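As an illustration of managing that CVE churn, here is a hedged sketch of a nightly gate built on the real trivy scanner CLI; the image inventory and rebuild hook are placeholders, and this is not HPE's actual pipeline:

```python
# Hedged sketch of a nightly CVE gate built on the real `trivy` scanner
# CLI; the image inventory and the rebuild hook are placeholders, and this
# is not HPE's actual pipeline.
import json
import subprocess

IMAGES = ["registry.example.com/csm/api:latest"]  # placeholder inventory

def high_severity_cves(image: str) -> list[str]:
    out = subprocess.run(
        ["trivy", "image", "--format", "json",
         "--severity", "HIGH,CRITICAL", image],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(out.stdout)
    return [
        vuln["VulnerabilityID"]
        for result in report.get("Results", [])
        for vuln in result.get("Vulnerabilities") or []
    ]

for image in IMAGES:
    cves = high_severity_cves(image)
    if cves:
        print(f"{image}: {len(cves)} high/critical CVEs -> queue rebuild")
        # kick off the nightly image rebuild here (site-specific)
```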
Matt Pacheco
Yeah, I was curious. We can jump into DevOps a little more now. DevOps and automation. So how has your view of DevOps evolved since your early days in the field?
24:25 - DevOps, MLOps, and Future Trends
Dennis Walker
Yeah. So originally I saw DevOps as really almost just configuration management and some lightweight infrastructure as code. Of course those frameworks evolved, and I think the focus shifted more into the infrastructure as code space, almost away from configuration management. Over time I began to realize that all change needs to be encompassed in source control, and that any change not governed by source control, although it may be a beautiful work of bespoke artisanal engineering, also represents technical debt that's going to get you later. Maybe seven years ago is really where I began making big pushes within the various organizations I was a part of to ensure that we were version controlling everything. Even if you were hand-patching or monkey-patching a change because of some dire emergency in production, that change was then backported into source control.
I guess the other thing is that DevOps is a domain that can, and should, manage everything. I think there are categories not everybody thinks about. For example, version controlling your data, which you use for developing your AI models and your ETL pipelines: all of that needs to be version controlled together. Your training data, your test data, your result data, your fabricated data, all of that needs to be version controlled alongside the building of your models and the pipelines. And, oh by the way, now we have models for model development. I'm trying to remember the name of the one that's in Kubeflow; it'll come to me in a moment.
But basically you can now turn control of all of that over to a machine learning model that is going to conduct as many parallelized, concurrent tests of models with that data and those pipelines as the volume of your infrastructure allows, in order to determine which one has the lowest loss. In those cases you can really get escape velocity from your infrastructure by version controlling all of that and then turning it over to yet another AI model.
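As one concrete way to practice that, here is a sketch that versions data alongside pipeline code using the real DVC and git CLIs; the file names are placeholders, and it assumes a DVC remote is already configured:

```python
# Sketch: put data under the same versioned history as the code that
# consumes it, using the real DVC and git CLIs. `dvc add` stores a content
# hash in a small .dvc file that git tracks, while the bytes go to a
# configured DVC remote. File names are placeholders.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

run("dvc", "init")                        # once per repository
run("dvc", "add", "data/train.parquet")   # hash the data, write the .dvc file
run("git", "add", "data/train.parquet.dvc", "data/.gitignore", "pipeline.py")
run("git", "commit", "-m", "version data alongside the pipeline code")
run("dvc", "push")                        # upload the data to the remote
```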
Matt Pacheco
So you talked about MLOps earlier too. Where do you see DevOps, DevSecOps, and MLOps heading in the next few years?
Dennis Walker
I see DevOps as the foundation from which everything is building up. Some people would argue that's SRE, or even prior to that, QA automation, but I think traditionally now it's understood as DevOps or SRE. I see DevSecOps and MLOps as branches that build on top of that and are increasingly specialized roles with their own language, taxonomy, tools, and ecosystem. So I really do see them almost as diverging engineering classifications, and maybe at some point they'll come back together, where we realize as a society or whatever that it was all the same to begin with anyway. But right now, an MLOps engineer is a rare thing; that is a very specialized role, distinct from DevSecOps.
Matt Pacheco
In which of these areas do you see the need for more improvement? Where can they improve over the next few years?
Dennis Walker
Yeah, that's a good question. So let's just start with DevOps. Right now, the state of the art from my perspective is basically that you build machinery that dynamically pivots based off of the current state of the infrastructure. Obviously Kubernetes was built with this model, where there's a reconciliation loop for the declared state, and that can be extended with things like KubeVirt or Crossplane, where you're connecting it to your cloud infrastructure, really the total infrastructure for your entire platform, so that it dynamically pivots based off of the state of all of the concerns: request volume and, to a prior question, whether or not suspicious activity is happening on some part of that landscape. Really, all of your infrastructure can be pivoting on that. However, because the state in those systems is always changing, always evolving,
the complexity for troubleshooting is sky high, because you really have to go query so many different footprints unless you have built a truly phenomenal single pane of glass in your dashboarding. Being able to query the state of all of that, understand where it's at, and understand what race condition or edge case was hit and why it went sideways, I think that now exceeds the complexity that a small scrum team of DevOps practitioners can fully comprehend for a platform of any size. So I think what the industry really needs is a way to better grok that and quickly get the abstract of what happened and why. Honestly, I see that as almost an existential threat for Kubernetes itself, in that it's just so complicated now, especially if you've implemented a sizable part of that ecosystem.
On the DevSecOps side, I think there aren't good enough standards for what represents a comprehensive security compliance program. You have things like the Security Scorecard, which try to enumerate a bunch of concerns, but really what I see is everybody piecemealing together a bunch of tools into a solution and then writing custom wrapper scripts and reporting tools to try to incorporate that into their own health scorecard report. There are a bunch of reasons for that, in part because any compliance regime may have hundreds or thousands of compliance controls. I mean, I was part of a group that recently tried to implement that for Medicare infrastructure hosted in the public cloud.
And they had 4,000 compliance controls, as you can imagine, with all of that sensitive data coming from every single state in the US up into an integrated ecosystem. Trying to keep track of all of that is untenable if you're trying to do even a portion of it by hand. So I think what DevSecOps needs is an aggregate solution, something that incorporates all of it: a shared standard to indicate all of the concerns that need to be represented in your security tooling and platform. And of course the answer for that is going to be different for every business that interprets it.
But at least there could be a maturity model that is commonly shared and referenced, and maybe even boilerplated or modularized components of a shared solution that you would just integrate in order to better ascertain where you stand from a security standpoint. The last one I think you said was MLOps, is that right? Yes. Okay, so for MLOps, I see it as almost a similar thing to DevSecOps. There's more movement there; I see a little more velocity in the churn of the changing landscape. But not every solution has reached all of the needs of its data scientists. Some are getting there, right?
I think the basics are just being able to version control, like I said before, your models, your data, and your pipelines alongside any APIs you might consume, and then roll that into the entirety of your platform with some sort of comprehensive release management. There are also tools that assist development, data science development tools, and in that space I could see a lot of room for improvement. You've got SageMaker coming from AWS and other notebook providers that are trying to pull in this entire ecosystem of plugins and libraries and frameworks and examples, but that still needs to coalesce and come together, and then also tie into the overall release procedure. So that would be my answer on that one.
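To ground the reconciliation-loop idea from earlier in this answer, here is a minimal sketch of the pattern that Kubernetes controllers and tools like Crossplane implement at scale; the declared state and the observe function below are stand-ins:

```python
# Minimal sketch of the reconciliation-loop pattern: continuously diff
# declared state against observed state and act on the difference. The
# declared state and observe() below are stand-ins.
import time

DECLARED = {"web": 3, "scheduler": 1}  # desired replica counts

def observe() -> dict[str, int]:
    """Stand-in for querying live state (API server, cloud APIs, BMCs)."""
    return {"web": 2, "scheduler": 1}

def reconcile(declared: dict[str, int], observed: dict[str, int]) -> None:
    for name, want in declared.items():
        have = observed.get(name, 0)
        if have != want:
            # A real controller would create or delete resources here and
            # emit an audit event recording *why* -- the troubleshooting
            # breadcrumb that gets lost when state never stops moving.
            print(f"{name}: have {have}, want {want} -> adjusting")

for _ in range(3):  # a real loop never exits; state only ever converges
    reconcile(DECLARED, observe())
    time.sleep(1)
```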
Matt Pacheco
Thank you. So we're going to go to our last few questions as we wrap this up. I'm going to ask you two questions that are the exact opposite of each other, to have a little more fun, even though this has been really fun. So let's talk emerging trends. We'll start with the fun question: what emerging trend in cloud and high performance computing does not excite you at all? What worries you in this space?
Dennis Walker
I think the thing that excites me the least, but for valid reasons, so I can't complain very deeply, is that there are almost just too many tools and solutions to keep track of across the whole space. Every cloud provider has multiple conventions per year, and at every convention there's business pressure to produce additional tools and solutions. But when you look at those landscapes, they're growing so immense. And when you think about being a practitioner that people trust, and trying to pursue, for example, certifications, at some point you're answering questions for products that might only represent an edge case or a small margin of potential users, or they're so far up in the OSI model that they're just not commonly consumed.
So I think there needs to be a better way to grok that landscape and understand which ones really are the foundational tools, and then build on top of there. For example, if you're inside of AWS, you absolutely need to know IAM and EC2 and load balancers and DNS and SSL cert management, and on top of that you can build from there, because kind of everything else builds on those tools and consumes them. If you look at the CNCF landscape right now, I think it's got something like 300 logos, and it's not even tightly integrated in a way that gives you an easy, quick sense of what each of them does. You kind of have to click through all of them and read a whole lot, and it's always changing.
So I guess the tool churn would be a little bit of a gripe. On the flip side, it's a sign of innovation, and you need a certain amount of chaos in order to see change come to fruition and evolve in accordance with user need.
Matt Pacheco
What advice would you give professionals looking to work in large scale infrastructure?
Dennis Walker
I'd say make sure your support staffing for that large scale infrastructure covers the boundaries of its domain, and that's probably an obvious statement. A good rule of thumb is that the most sought-after engineers are T-shaped in their expertise. You want people who are pretty broad, but you also want them to have one or two areas of deep subject matter expertise, so that they can both cover a bunch of normal issues and territory and move solutions along far enough into the release process that it makes a meaningful impact. But when it really comes down to troubleshooting a key problem in a given area, you also need that deep subject matter expertise.
And I think what a lot of companies don't realize, especially ones that have matured and evolved over time, is that it's easy to add features and infrastructure and hosting to the overall platform without also taking into account the additional headcount requirement for the overall cost of ownership. That's really what it comes down to: with more infrastructure comes more cost of ownership. Many companies I've seen and worked with firsthand actually make that mistake, where at some point the platform just is not tenable or supportable by the staffing that exists, or that has developed the years of necessary expertise to manage it.
Matt Pacheco
Excellent, excellent advice. Well, I wanted to thank you for coming on the podcast today. It was great learning from you. I certainly learned a lot, and I know our listeners probably learned a lot too. So really appreciate you coming on here, Dennis.
Dennis Walker
Oh, my pleasure. Thank you, Matt. Thanks for having me.
Matt Pacheco
Thanks. And to our listeners, thank you for listening to Cloud Currents, where we talk about all the trends in cloud computing. We look forward to you watching more episodes with us. You can find us anywhere you get your podcasts, including YouTube. So thank you very much, and stay tuned for more.