
EP 05: AI in the Cloud: Charting a Sustainable Future with Huamin Chen

About This Episode

In this episode of Cloud Currents, Matt Pacheco interviews Huamin Chen, a senior principal software engineer at Red Hat. They dive into the intersection of AI, cloud computing, and sustainability. Drawing from his extensive experience, Huamin discusses innovative projects and technologies shaping the cloud infrastructure. He emphasizes the role of AI in optimizing energy efficiency and addresses the challenges in reducing the carbon footprint of cloud services. With personal anecdotes and a deep dive into the significance of open-source communities, this episode offers a unique perspective on the future of sustainable technology in the AI era.

Know the Guests

Huamin Chen

Office of CTO - Red Hat

Huamin Chen, a seasoned Senior Principal Software Engineer in Red Hat's CTO office, stands as a founding member of Kubernetes SIG Storage. Widely recognized in the technology realm, Chen boasts an impressive 690 citations on Google Scholar. Beyond his contributions, he is an inventor, influencer, and motivator, fostering a robust social media presence. Chen is active on Twitter and LinkedIn, where he shares his thoughts on technology and industry trends. As a respected collaborator, Chen shapes the future of data storage systems, often suggesting pioneering research directions. Catch him as a distinguished speaker at major industry events, including the upcoming Open Source Summit NA 2023.

Know Your Host

Matt Pacheco

As Sr. Manager for Content at TierPoint, Matt Pacheco is a seasoned professional with 10 years of experience. In his current role, Matt serves as the Editor-in-Chief for the TierPoint blog. With an eye for detail and an understanding of industry dynamics, Matt plays a pivotal role in shaping and implementing TierPoint's comprehensive content strategy.

Matt enjoys assisting IT leaders in navigating the ever-evolving cloud technology landscape, often simplifying complex themes to make them more approachable. He enjoys engaging guests in conversations about the intersection between AI and the cloud, the impact of AI and Machine Learning on costs, optimizing business approaches to sustainability, preparing leaders to sell cloud and AI to their boards of directors, and how cloud providers must adjust to the demand for new innovative technologies.

Transcript

(00:12) Intro to Huamin Chen

Matt Pacheco: Hello everyone, welcome to Cloud Currents, where we engage industry experts to talk about trends, tools, and strategies as they relate to cloud and innovation. Today we're thrilled to have our guest with us, Huamin Chen. As a senior principal software engineer at Red Hat's Office of the CTO, he's focused on cloud-native infrastructure projects. Huamin brings over nine years of expertise in open-source technologies like Kubernetes, OpenShift, Ceph, Rook, and Knative. He is a founding member of the Kubernetes SIG Storage group and a proactive contributor to multiple CNCF projects. He also leads sustainability initiatives around energy efficiency and carbon footprint reduction in cloud-native environments, and has been involved in various startup companies working on storage, data security, FPGA acceleration, and hardware-software codesign. Welcome to the show. We're glad to have you with us today. Did that sound about right?

Huamin Chen: Yeah. Matt, it's my pleasure to meet you, and thank you for the introduction. That's really wonderful. It brings a lot of good memories to mind.

Matt Pacheco: That's awesome. So let's get started. Can you walk us through your career journey and how you ended up at Red Hat working on cloud infrastructure?

Huamin: Yeah, my pleasure. So I started my career almost 20 years ago, in 2003. At the time, the Internet bubble had just burst and the industry was under hard pressure. I joined a company called EMC — right now it's called Dell EMC — a storage company with lots of advanced technologies. I started there working on a hybrid file system that used a lot of technologies behind the scenes, so I was working on a number of platforms, including Linux and Unix. I found that the Linux infrastructure was so attractive to engineers because we could go into the kernel, ask questions in the community, and get immediate feedback.

I was kind of surprised when I asked questions of Alan Cox, who was pretty much at the head of Linux kernel development, and I got answers right away. It was amazing. After EMC, I spent a few years at other companies, and I finally joined Red Hat — finally my dream came true. I was able to meet so many wonderful people in open source, and their openness to questions and to challenges is unparalleled. I really like Red Hat so much. Back in those days at Red Hat, there were a number of interesting developments in the industry. As you already know, in 2014, virtual machines and hardware virtualization were kind of the state of the art, but there was an emerging trend toward containers, and that took the whole industry by surprise.

And I was so fortunate to be in that wave of development from the very beginning, when I started working on the project called Kubernetes. It was a very little-known name back then; now it's almost a household name in many developers' homes. But back then I was fortunate to be one of the few people working on Kubernetes, building the storage stack from a very early stage. I was able to make lots of contributions to the project and met so many wonderful and exciting community members, and that's why I fell in love with the project. Eventually, Kubernetes moved into an organization called the Cloud Native Computing Foundation, CNCF.

And I was so proud to be a part of that organization and community, contributing to a number of projects — first starting with Kubernetes, then working on Rook when Rook was admitted as an incubation project in the CNCF. I also got the chance to work on a number of projects before they joined the CNCF, like Knative, which is a serverless platform. Over the years, I got the sense that the CNCF has so much enthusiasm for embracing the challenges our industry is facing. It has taken the lead in many innovations — for example, sustainability, which is the topic I hope we can keep discussing for the rest of the hour. The CNCF is one of the earliest communities to embrace environmental sustainability.

There's a technical advisory group, the TAG on environmental sustainability, that started a year ago and attracted so many participants in the community. From that point we were thinking: can we do something for the industry, for the community, and actually for our society? What we do in terms of technology innovation comes on three fronts. The first we call observability: the more we can see how much energy is being used in our data centers and in our software, the better chance we have to take action and steer things in the right direction. We can use a lot of analogies; probably the best one is the phone bill. Nowadays it's not a big deal anymore, but back in the day, when everyone first had smartphones, what we cared a lot about was the phone bill, right?

If your cellular usage goes over your plan, you pay more — especially when you have a family plan with multiple phones. If you're wondering why you should pay this much for the month, you look at your phone bill, look at the cellular usage, and immediately you know somebody's phone is out of control. The next month you take action to reduce the cellular usage, and your phone bill goes down. That's the same analogy we're using here: if we know how much energy is used by the data center, by different tenants, by different software, then we can take action to bring the energy bill down and reduce the carbon footprint — by shutting down this machine, by optimizing that software.

So that's the motivation for starting with a sustainable computing environment. Sustainable means two things: being efficient while you continue your practice, not shutting you down entirely — it makes your goals sustainable. So that's what we do. The second front is that we want the data center to be smart: we want to identify opportunities to optimize energy usage and carbon reduction in the data center through things like different scheduling algorithms and scaling algorithms. And lastly, we want to make the software greener. At the end of the day, you don't just power servers to run idle; you want to run software on them. The software is the critical piece that makes the data center greener.

So how do we make that happen? We want to bring certain metrics and tool chains to software developers, so they can use these tool chains to benchmark their software, identify the bottlenecks, and find out how they can use less energy to achieve the same thing. So those are the three directions we are exploring, to hopefully make our industry and our society more sustainable.

Matt Pacheco: I love it — very noble. And I know sustainability is a very important thing a lot of businesses, and a lot of people in general, are thinking about. It's not just within business; everyone's thinking about sustainability. So those are really interesting points you made about the focus areas. But before we dive more into the software, I'm curious what motivated you to start focusing more on sustainability personally.

Huamin: Yeah. So I think sustainability, put in a personal way, means frugality. I've lived a very frugal life ever since I was a child. I was the second child in my family, and when I was little, resources were very scarce. As the youngest son in the family, you had to struggle a lot — fighting for the better room, fighting at the dinner table to get more food, that kind of thing. So I realized resources are scarce, and if you don't do something about it, there are going to be a lot of consequences in your life. Then I became older and had my own family, my own children.

So I'm just wondering: do we have enough resources for everybody? Especially when the children grow up, will they have enough resources in the future to live peacefully with their peers? That's the mentality that comes to my mind on a day-to-day basis, and that's why I strive to make sure that at least I can do something to sustain our society and the next generation's future, so they have better, brighter prospects on Earth. So they can live peacefully and resourcefully, with a lot of resources still at their disposal, without having to worry that the world will end. In order to do that, we have to make sure we don't waste too much. But that doesn't mean we cut off everything they're used to.

They can still do what they want, but do it in a more efficient way — an energy-efficient way, an environmentally efficient way — so they know what they do is consciously for the greater good, without having to make a great sacrifice in their quality of life. That's why we're trying to bring a lot of intelligence into our industry: to make sure that what we do continues, but in a better way than before.

(11:18) Sustainability Initiatives at Red Hat

Matt: That's great. I really love how you put it: keep doing what you're doing, but do it more efficiently, without sacrificing it, and think sustainably as you do. So let's talk a little bit about the initiatives you're working on. Can you explain some of the goals of the sustainability initiatives that you're leading at Red Hat?

Huamin: Okay. This is going to be very interesting. We currently have a roadmap of different projects: deliver one project, then continue to the next one building on the previous, so we have a coherent story we can keep delivering into the future. The first project is called Kepler. It's a word by itself, but it's also an acronym: Kubernetes Efficient Power Level Exporter. The Kepler project really has the goal of making energy usage observable, meaning we can report how much energy is used by the infrastructure, by your software, and by your tenants.

So if you are running a data center for your organization, or if you are using cloud infrastructure, you may not use it for just a single application or a single tenant. Chances are you're going to use the same Kubernetes clusters for multiple tenants in your organization — finance, IT, logistics, marketing, things like that — and each department may be held accountable for what it uses, in terms of energy as well as financial resources. By bringing these metrics into the picture, end users will know: wow, this is how much I use, and this is what our carbon footprint looks like. That's very important for understanding what people can get out of our project.
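As an illustration, a minimal sketch of per-tenant energy reporting on top of Kepler might look like the following, assuming a reachable Prometheus endpoint and Kepler's container-level metric and label names (check both against your deployment):

```python
# A minimal sketch: sum Kepler's container energy counters by namespace,
# treating each namespace as a tenant. The endpoint is a placeholder and
# the metric/label names are assumptions to verify against your Kepler version.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # placeholder endpoint

# Joules consumed per namespace over the last 24 hours.
QUERY = 'sum by (container_namespace) (increase(kepler_container_joules_total[24h]))'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    namespace = series["metric"].get("container_namespace", "unknown")
    joules = float(series["value"][1])
    kwh = joules / 3.6e6  # 1 kWh = 3.6 million joules
    print(f"{namespace}: {kwh:.3f} kWh over the last 24h")
```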

Once Kepler makes this available, the next step — if people want to take action — is to give them opportunities to optimize the workloads so that they use less energy. We do that in two ways: one is called scheduling, the other is called scaling. Scheduling decisions mean that when we have new workloads or new applications to be started in the cluster — in the cloud environment, in the Kubernetes environment — we ask how we can place a workload on the node, the server, that can serve it most energy-efficiently. Believe it or not, even though this looks counterintuitive, researchers have found that energy usage does not grow linearly as you pack in applications: at certain CPU utilizations, the energy cost of added load may be higher than at lower utilizations.

For example, if you are running the CPU at 70% utilization and you add 10% more, going from 70% to 80% will probably use more energy than going from 60% to 70% did. That means we have to consider the characteristics of the servers — especially the CPU and the cooling environment — to anticipate this nonlinearity and make a smart decision to place the workload on the server that will serve it most efficiently. That is essentially how the scheduling works. The scaling works in a different scenario. It asks: if you already have your application — for example, a database or a web server — running in the environment, how can we make sure the energy used by the application is minimized without sacrificing the performance it delivers?
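To make the nonlinearity concrete, here is a toy sketch of energy-aware placement; the power-curve shape and coefficients are invented for illustration, not taken from Kepler or any real scheduler:

```python
# Toy model: node power as a convex function of CPU utilization, and
# placement of a new pod on the node with the cheapest marginal power.
# The curve's coefficients are illustrative, not measured.

def node_power_watts(util: float) -> float:
    """Hypothetical power curve: an idle floor plus a superlinear dynamic term."""
    idle, dynamic, exponent = 100.0, 200.0, 1.6
    return idle + dynamic * (util ** exponent)

def marginal_power(util: float, delta: float) -> float:
    """Extra watts needed to raise utilization by `delta`."""
    return node_power_watts(util + delta) - node_power_watts(util)

# The same +10% step costs more watts at high utilization than at low.
print(marginal_power(0.60, 0.10))  # cheaper step
print(marginal_power(0.70, 0.10))  # more expensive step

# Energy-aware placement: pick the node where the step is cheapest.
nodes = {"node-a": 0.70, "node-b": 0.35, "node-c": 0.55}  # current utilization
pod_cpu = 0.10
best = min(nodes, key=lambda n: marginal_power(nodes[n], pod_cpu))
print(f"schedule on {best}")
```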

One of the quick ways to do that builds on the fact that not all applications serve requests at a constant rate. In the daytime, you're probably using the database server more than at night. So let's say in the daytime we turn the CPU frequency up to the maximum, so you get the best performance but also use a lot of energy; at night, we cut the frequency to half or maybe a third of the maximum, so you use much less energy. Since you don't have much workload at night, you're not sacrificing performance — your demand is lower, so you use less energy. That helps a lot for database administrators to be more efficient without sacrificing availability. So those are the kinds of things we are doing.
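The day/night policy described here can be sketched with the standard Linux cpufreq sysfs interface; whether a write takes effect depends on your governor and driver, the frequency values are assumptions for one machine, and a real implementation would react to load rather than the clock:

```python
# A minimal sketch of a time-of-day CPU frequency cap (run as root).
# Paths follow the standard cpufreq sysfs layout; values are illustrative.
import glob
from datetime import datetime

def set_max_freq_khz(freq_khz: int) -> None:
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq"):
        with open(path, "w") as f:
            f.write(str(freq_khz))

BASE_KHZ = 3_000_000  # assumed 3.0 GHz maximum for this machine

hour = datetime.now().hour
if 8 <= hour < 20:
    set_max_freq_khz(BASE_KHZ)       # daytime: full performance
else:
    set_max_freq_khz(BASE_KHZ // 2)  # night: cap frequency at half
```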

Numbers two and three really go beyond just compute. We are looking at different scenarios. What about storage environments? Data centers don't just serve compute; they also serve storage — just like Amazon, which has huge storage services: S3, EBS, things like that. And we also have accelerators — nowadays AI becomes a big thing — AI, storage, and other big things coming up. So we want to solve these use cases one at a time, to make sure this accountability and sustainability can continue in each scenario. That's why we're trying to have a longer roadmap to accommodate all these scenarios.

(17:10) Cloud Infrastructure Energy Usage: Importance of Metrics and Accountability

Matt: Excellent. And we'll definitely get into AI in a few minutes — I have a few questions for you, and we'll talk a lot about AI. But earlier you mentioned metrics and accountability. Why is having better metrics and accountability for energy usage so critical for businesses?

Huamin: Yeah, that's a very good one. Metrics and accountability really have two bearings in my mind: one is accuracy, the other is transparency. It's the same way we look at financial statements. If you are looking at a company's financial statements prepared with the same accounting methodology, you can reproduce them anywhere, without having to rely on certain people's expertise or any black-box methodology to understand what's going on. It's completely transparent. That's why we believe open source and open standards are the best way to communicate with end users about how we measure things and how we come up with these numbers. So the metrics we bring have two properties. One is that we use real energy output — people are familiar with the watt, right?

That's a unit people are familiar with. And we also provide fine granularity, down to the workload level: if you are running a database, we tell you how much energy is used by the database; if you are running a web app, the number comes up for that application. This gives people insight into the individual applications they are using. That's why we believe Kepler is one of the front-runners in this space. We give people the granularity they like, and they can use it at different aggregation levels: if you want to see as low as the process level or container level, Kepler provides that granularity; if you go up and want to see the tenant level, we give you that aggregation granularity too. So that is one of the things.

Transparency is quite critical, because everything we offer is open source — not just the source code. The way we train the machine learning models is also transparent: we don't just give you a model, we give you the whole pipeline for how the model gets trained, as well as the model itself, in open source, so you can use it and also validate it. And we have very good accuracy in terms of the models' training metrics — we believe the error range is less than 2%. We use different metrics; one of them is MAE, the mean absolute error, and it depends on which model you're using. Some of the models are off by as little as one or two watts on a 50-to-80-watt basis — that's within a couple of percent, right?

And the newer models have an even lower error range, so we are really proud of our accuracy. With that in mind — transparency and accuracy — I believe we are really equipping end users, giving them the benefit of using our infrastructure in their environments.
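As a flavor of the kind of validation described here, a sketch of fitting a simple utilization-to-watts model and reporting its mean absolute error; the sample readings are invented, and a real pipeline would use measured node telemetry:

```python
# Fit a linear power model (utilization -> watts) and report MAE in watts.
import numpy as np

# (cpu_utilization, measured_watts) pairs, e.g. from node power telemetry
util = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.85])
watts = np.array([62.0, 71.5, 80.2, 91.0, 104.3, 121.8])

# Linear regression baseline; richer models (e.g. gradient boosting) follow
# the same train-then-validate pattern.
slope, intercept = np.polyfit(util, watts, 1)
predicted = slope * util + intercept

mae = np.mean(np.abs(predicted - watts))
print(f"MAE: {mae:.2f} W on readings in the {watts.min():.0f}-{watts.max():.0f} W range")
```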

(20:37) Optimizing Carbon Footprint and Energy Usage Across Hybrid Cloud Environments

Matt: Thank you for that. I have another question. A lot of this is — correct me if I'm wrong — taking place in the public cloud. So how are you thinking about optimizing carbon footprint across other cloud types? Let's say someone has a hybrid cloud environment; how do you optimize carbon footprint across those environments?

Huamin: Yeah, actually this is a realistic question in production, and people have been wondering about it for a long time, because the premise of hybrid cloud is really that you can run a workload anywhere without having to worry about infrastructure differences. And actually, that means a lot of challenges: you have to hide the differences between the different clouds, the private data centers, and the hyperscalers. The differences are really tremendous, and the infrastructure to bring in the commonalities has to be built in a very deliberate way. Kepler is no exception. We are entering this hybrid cloud environment with a lot of challenges in front of us, but we do have a solution. Our solution is that we actually go into these clouds the same way other software does.

We go into the public cloud and take measurements of the power consumption of each of the instance types, the machine types. People use the models we build to understand the energy consumption of their instances, and then, using these models from different clouds, we can build up a holistic picture when people use a hybrid environment. And we're not doing this in a hidden way; we publish our models for the different cloud instances at the same time. So if you are running on Amazon's cloud, for example — which is still a work in progress, by the way, it's not done yet — we are building the pipeline to train the models on the Amazon cloud. We train on the cloud at our own expense, and we get the models.

So when you are using Amazon, this is the model you use. And if you are using a different private cloud, with machines that differ from Amazon's machines, we'll give you a different model. If no model exactly matches your machine, we also have the training pipeline, so you can build the models in your own environment on your own. We give people the best options they can use without having to reveal their configurations, their secret ingredients underneath, and they can do this without having to worry about privacy. We are also providing open automation — not just a lot of hard work where you have to type this command, run it to get a result, and then copy that result into different inputs.

We do it with a lot of automation, to save people's time and reduce the entry barrier, so people can adopt our models and our training pipelines.

(23:57) AI’s Impact on Energy Usage and Carbon Footprint

Matt: It's great to hear that you're considering all these things and accounting for these different cloud types. Because when I think about it, wow, there are a lot of companies that aren't just using one public cloud: sometimes it's multiple public clouds in a multi-cloud setup, sometimes combined with, like you mentioned, on-prem, or even off-prem in a colocation setting. How do you manage all those things? So thank you for answering that — it's especially useful for listeners who are dealing with infrastructure like that; it's pretty common in our world as well. So, shifting a little to the AI side — what everybody's really excited to hear about and talk about. As we were chatting before the show, we got into AI workloads and large language models a little bit.

So as we discuss those things — the growth of AI workloads like large language models — adoption and energy usage are exploding right now. How big of an impact do you think AI innovation will have on cloud infrastructure, both positive and negative?

Huamin: Yeah, so I do see this as a very interesting opening to a new era that we have never experienced before. Just as I described in my career, I got into containerization very early on; I was fortunate to enter that landscape and work on very exciting projects. But AI, especially generative AI, is quite different from all the computing norms we've been used to for so many years. Speaking of the unique opportunities here — and also the unique challenges — generative AI is able to remove a lot of the overhead we were used to before. For example, I'm using GitHub Copilot, and it gives me a big productivity boost.

I don't have to memorize a lot of details like I used to, and I can still get things done. That means a big productivity boost and a big reduction in time — but it also means you have to think about how much energy is used when it generates this content. I read an article a few days ago: there's research revealing that if you ask ChatGPT, from OpenAI, something like 20 to 50 questions, the amount of energy and the amount of water used for cooling is tremendous — for 20 to 50 questions, the water consumption is about 500 mL. That's quite a lot. Consider billions of people using ChatGPT on a daily basis — how much water does that use? And the energy is the same story.

According to other research from the beginning of the year, people estimated that ChatGPT uses as much energy — I don't know on what basis, maybe daily, maybe monthly — as the equivalent of 175,000 Danish families' annual energy consumption. And that was still early this year; since then, ChatGPT has gotten more and more users, and I frequently get network errors on my own ChatGPT sessions. So the energy usage associated with the popularity of large language models is just tremendous. We are facing this unique situation of enjoying the benefits of generative AI.

But as an industry practitioner, especially one focused on sustainability, I get really worried about the consequences — how much energy ChatGPT has consumed and, at the same time, the water it uses. That is something where we have to take action and look at how things work internally, to make sure we can provide solutions that guide our future in a better way.

Matt: Yeah, definitely. And ChatGPT is just one model; there are so many others. We're talking about Google with Bard and introducing Gemini — we saw some demos just a few days ago. So there are a lot of them. As these grow, and adoption grows, and companies adopt these AI workloads, do you think most companies properly measure and optimize the energy consumption of their AI workloads today? And if not, what do you think would need to change?

Huamin: Yeah, so that is something we have been considering for a long time. I think other companies and products are also trying to come up with certain metrics, but at the moment it's mostly just in financial terms: how many tokens you have consumed, because OpenAI charges by tokens. But there's no metric for how much energy is associated with each token. Providing that metric is important. The same way Kepler tells you how much energy is used by an application, you could potentially provide a similar metric for how much energy is associated with each token, if you understand how these tokens are produced. By looking at the energy metrics, there are ways to hold the application accountable for the tokens it generates.

And that will be one of the ways we can hopefully tame the energy usage associated with large language models — and it's not just language models anymore; there are vision models and multimodal models that will be a tremendous explosion in the future. So the earlier we can take action and make these metrics available, the better chance we have to reduce energy usage and carbon footprints in the future.
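The "energy per token" idea reduces to a simple ratio once both quantities are measured over the same window; the numbers below are hypothetical placeholders, not measurements from any real deployment:

```python
# Sketch: joules attributed to a model-serving workload (e.g., via Kepler)
# divided by the tokens it generated over the same window.

def joules_per_token(joules: float, tokens_generated: int) -> float:
    if tokens_generated == 0:
        return 0.0
    return joules / tokens_generated

# Suppose the serving container drew 1.8 MJ over an hour (a 500 W average)
# while generating 1.2 million tokens.
window_joules = 1_800_000.0
window_tokens = 1_200_000

jpt = joules_per_token(window_joules, window_tokens)
print(f"{jpt:.2f} J/token")  # a comparable efficiency metric per workload
```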

Matt: That's great. So do solutions like Kepler leverage machine learning and AI to estimate things like energy usage and carbon footprint?

Huamin: Yes. You actually touched on both sides of Kepler — that's amazing, because Kepler itself does use machine learning at the heart of the project. We have to build machine learning models, different models, to give estimates of how much energy is used across different deployments — virtual machines or bare-metal machines. So that's one side. Making Kepler's metrics available for large language models, so people can take actions based on them, is a direction we're also working toward. Oftentimes, when we say we give you metrics and you can take actions based on them, you have to develop certain software. Developing software is not easy, and making that software versatile and ubiquitous — making a lot of things work — is even harder.

So we are considering: can we use large language models to drive the automation? For example, the same way you have Siri — you can talk to Siri to set a timer, just talk to it; you don't have to write a program for it. In a very similar fashion, if you give, say, a ChatGPT-style prompt — "reduce the energy for this workload, tune the configuration for this workload if Kepler reports metrics above a certain threshold" — chances are we can make that automation happen without doing a lot of hard work in terms of programming.

Once we figure out what the tuning knobs are and how to connect Kepler with large language models — potentially ChatGPT or GPT-4, for example — we may be able to do these automations for organizations in a very simple way, just like asking questions on ChatGPT. They could ask, in the same way, to reduce the energy consumption of their software on their infrastructure.
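A hedged sketch of that Kepler-plus-LLM loop: feed a workload's energy reading into a prompt and accept only suggestions from a known-safe action set. The `ask_llm` function is a hypothetical stand-in for whatever chat-completion client you use, and the threshold and actions are illustrative:

```python
# Guard-railed automation sketch: never execute free-form model output;
# map the LLM's reply onto a fixed set of tuning actions.

THRESHOLD_WATTS = 400.0
ALLOWED_ACTIONS = {"lower_cpu_frequency", "scale_down_replicas", "no_action"}

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion call; canned reply here."""
    return "lower_cpu_frequency"

def tune_workload(name: str, avg_watts: float) -> str:
    if avg_watts <= THRESHOLD_WATTS:
        return "no_action"
    prompt = (
        f"Workload '{name}' averages {avg_watts:.0f} W, above the "
        f"{THRESHOLD_WATTS:.0f} W budget. Reply with exactly one of: "
        f"{sorted(ALLOWED_ACTIONS)}."
    )
    suggestion = ask_llm(prompt).strip()
    # Only known actions are ever executed.
    return suggestion if suggestion in ALLOWED_ACTIONS else "no_action"

print(tune_workload("postgres", 520.0))  # -> "lower_cpu_frequency"
```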

Matt: Excellent. And it's interesting to think about using machine learning and AI to help optimize for companies running large language models and AI tools and applications in their workloads. So it's like AI helping AI. It's kind of funny when you think about it that way.

Huamin: Yeah, philosophically that is a very intriguing story — AI helping AI.

(33:42) Other Opportunities for Optimizing Sustainability with AI & Machine Learning

Matt: So what other opportunities exist to leverage AI for optimizing sustainability in your mind?

Huamin: Yeah, there are other things. As we've been working, there are a lot of things we can do that we've unconsciously forgotten about. AI can bring these behind-the-scenes scenarios onto the front stage and help us build up a lot of optimizations, as long as we can create the heuristics and instruct the AI to help us on this front. In a similar way to what we just mentioned — using Kepler metrics to drive automation — you can probably also have the AI build up a lot of other things. The productivity boost means we don't have to read the software manuals or do Internet searches to find answers; language models can probably give us very quick answers when we type in our question. That can also reduce energy usage in many ways.

For example, when we engineers, especially programmers, have questions, we go to different sources, read, ask questions, analyze the results, and then, based on those results, write programs. We can probably do this more efficiently, in an automated way, using large language models directly to produce what we need. So even though generating this content requires large language models to use energy, in aggregate — counting the time for searching, researching, and then developing the software — I think using language models may have even better efficiency than doing it the manual way. To make this more streamlined and efficient, I think we as an industry potentially need more specialized models.

Specialized models — specialized, for example, in software development and coding, or in managing data centers, deployments, monitoring, and bug fixing, things like that. By using these specialized models, we can improve productivity and potentially reduce energy usage at the same time. This is still debatable, but I believe a specialized model is almost certainly smaller than a large language model with trillions of parameters; we can probably use a few billion parameters for a specialized model that is more efficient to use. It's like how everybody goes to a specialist for medical advice; if you go to a general practitioner, you may not get the same insight. Things like that have a similar impact. So that's potentially the direction I'm looking at, and hopefully we can create more sustainable usage of AI models in the future.

Matt: That's a great answer. Some more questions for you around that. Earlier you mentioned scheduling and scaling. Are there opportunities to leverage automation and machine learning for scheduling and scaling as they relate to energy efficiency?

Huamin: Yeah, so this is very technical, but I believe it has a lot of interesting implications for our audience. On scheduling, we are really looking at at least two machine learning algorithms or techniques. Why is this? Because scheduling has a lot to do with time-series data: how jobs arrive and what the projected job arrivals are in the future. So there's ARIMA — I forget the exact terms — which is basically time-series forecasting based on previous short-term and long-term history, on the patterns we're looking for. You can project how much traffic is coming, in terms of the number of requests or job arrivals. That's one thing.

The other thing is based on estimates of the energy consumption or capacity available in the existing computing environment: projected against those arrival rates, what would be the best combination? Things like that will go a long way. On the scaling side, there's a thing called reinforcement learning, because with scaling you probably don't know the right tuning in the first place — how much CPU, how much memory for a job, the right configuration for the application. You want to tune the CPU utilization a little bit, see the feedback and the consequences, and use that as feedback for the next round of tuning. That's so-called reinforcement learning.

If the tuning is good, that gives us an incentive to go in the same direction and make the next tuning step the same way; if we then see even better results, we continue on that path. This kind of reinforcement learning has been used by AI models in many ways — just like AlphaGo, which Google DeepMind used to beat the human Go champion a few years ago. It's based on a comparable idea: the machine Go player has played the game many times and learned from its mistakes each time, trying to improve itself. We are using the same kind of algorithms in scaling decisions.
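In that spirit, here is a toy feedback loop (simple hill climbing standing in for full reinforcement learning): nudge a CPU limit, observe a reward that trades energy against a latency SLO, and keep moving in the direction that improved it. The telemetry functions are hypothetical stand-ins, simulated here:

```python
# Toy tune-observe-adjust loop for a CPU limit, in the spirit of the
# reinforcement-style scaling described above. Telemetry is simulated.
import random

def measure_watts(cpu_limit: float) -> float:
    return 80 + 120 * cpu_limit + random.uniform(-3, 3)   # more CPU -> more power

def measure_latency_ms(cpu_limit: float) -> float:
    return 20 + 60 / max(cpu_limit, 0.1) + random.uniform(-2, 2)  # less CPU -> slower

def reward(cpu_limit: float, latency_slo_ms: float = 120.0) -> float:
    watts = measure_watts(cpu_limit)
    latency = measure_latency_ms(cpu_limit)
    penalty = 1000.0 if latency > latency_slo_ms else 0.0  # protect the SLO
    return -watts - penalty  # maximizing = least energy within the SLO

cpu, step = 1.0, -0.1        # start at 1 core; try shrinking first
best = reward(cpu)
for _ in range(20):
    candidate = min(max(cpu + step, 0.1), 2.0)
    r = reward(candidate)
    if r > best:             # improvement: keep moving this way
        cpu, best = candidate, r
    else:                    # worse: reverse direction, take smaller steps
        step = -step / 2
print(f"settled on a CPU limit of about {cpu:.2f} cores")
```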

(39:53) Where Sustainability, Artificial Intelligence and Cloud Cost Optimization Intersect

Matt: So as we're talking about AI and sustainability — those things go together, and you've explained very well where the opportunities exist. A lot of our listeners, and a lot of businesses in general, are also looking at the cost side: cost optimization. What are some opportunities, or benefits, related to sustainability and AI that will also have an impact on cost? Because we know cloud cost optimization is a big topic right now, and sustainability is very important too. Where's the intersection? Can you tell us a little bit about that?

Huamin: Yeah, I think that's a very practical question, and I see these questions coming up with greater frequency than before, because sustainability means two things. One is the energy cost — that's a direct cost to you. The other is the carbon cost, which is more visible in certain places. Depending on your geographic location in the world, you'll see these things more prominently in some places than in others. For example, if you live in the European region, there's the carbon credits market: there's pricing and there's a quota, right? In order to get your quota, you have to go to the carbon credits market and buy it, and that means a financial cost.

And then, once you have this quota, you have to use it consciously, lest you waste it; it gets included in your cost model — how the company operates, how the quota is distributed among different parts of the organization. Certain operations need more quota than others, and financial resources need to be adjusted on that basis. The third angle is valuation. In the financial markets, investors are looking at ESG investing: environmental, social, and governance. ESG has been up and down, but I see a continuing trend of investors looking at your ESG scores when they invest in your business. If you're doing well on the social, environmental, and governance sides, you'll probably get more investors interested in you.

That will potentially boost your valuation in the market. So I see the free capital market as an incentive for organizations and companies to drive up their social and environmental ratings. In conjunction with all these factors — energy, where we're actually paying more nowadays than before, as you can see from the utility charts; over the last three years, the energy price in the US has increased more than it did over the past two decades, and the same is true in the European market — the energy cost, the carbon cost in terms of quotas and financial credits, and the company's valuation are three factors that I think will weigh heavily in our future planning.

(43:08) Challenges and Opportunities for Cloud Sustainability

Matt: Yeah, definitely. And it feels like this could be an episode of its own — we could probably talk about it all day, because there's still the aspect of government regulation that's probably going to come along with some of this. You mentioned Europe a little bit, and who knows if that comes to the US or other regions. It's a very interesting thing to consider. In the spirit of looking ahead: what do you see as some major challenges that still need solutions around cloud infrastructure sustainability?
Huamin: Yeah, I think the biggest challenge is regulation. Right now there are no guidelines from our energy authorities on the metrics for measuring the carbon footprint of software systems — there's no defined way to measure it. If you look at guidelines from the Department of Energy, the EIA, or other government agencies, you'll find references for how much carbon footprint is associated with different fossil fuels. There are very distinct, very straightforward guidelines there, but you don't find guidelines for software. That's one thing. As a practitioner, I see different people in the industry inventing methodologies, trying to come up with certain standards, and this doesn't serve our industry very well. Some methodologies make sense based on their authors' own understanding, but they may not be applicable to other applications or other scenarios.

So if there were a transparent, open-standards way for people to measure unequivocally — "this is how much I emit over here" — that could be reproduced by other practitioners in different environments with the same numbers, that's the way we should be adopting. Once we have adopted ways of measuring and evaluating, we can sit down side by side to report our carbon emissions and see how we benchmark against the industry. Right now, because we're lacking this standard and this methodology, we're not able to do that. If I report my carbon footprint — hypothetically, let's say 100 — and other people use a different methodology and get a different number, like 500, is 100 better than 500? It's a question mark.

So we do need a standard, and hopefully our industry, our government, or our society in general can come up with such standards that people can use. I think that is the number one challenge. Number two, in my opinion, is the availability of such tools. I don't have the numbers in front of me, but I suspect that in general the public clouds, the hyperscalers, probably use the most energy, and even as they migrate toward net-zero carbon emissions, they still have a lot of environmental impacts — water usage, recycling. These are the things they can hopefully surface to end users. As end users, we are using their services.
We can take their inputs and make sure we're doing something conscious to reduce our own footprint while using their services. So hopefully the public clouds can provide the tools and the metrics that we can use — and in a standard way, not Microsoft providing one metric and Amazon providing a different, incomparable one. It's a kind of chain reaction: without a common understanding and common standards, the public clouds probably won't be able to provide these metrics; but on the other hand, the public clouds can drive the formation of these standards. There are actually initiatives and organizations — for example, the Green Software Foundation — with projects trying to drive this collaboration among the public clouds:

to come up with a standard, so we can have transparent, coherent, and consistent ways of viewing our carbon footprints as end users. So those are the top challenges I see us facing nowadays.

Matt: That's very interesting — those challenges and the organizations you mentioned. Where would you like to see those organizations collaborating more on these issues, and what do you see them doing to get around them? I know you mentioned the lack of standards, and the public cloud providers understanding and reducing their usage. So where do you see organizations like the ones you've mentioned collaborating more on these issues?

Huamin: Yeah, right now the common theme among all these public clouds is that they use a lot of open source software. To that end, if they also embrace open standards, transparency, and open discussion, that will open up a lot of opportunities. If they can use, develop, or adopt open source projects for the metrics, that's even better. On that front I have a biased view, because hopefully Kepler can be adopted in that way, or the methodology developed for Kepler can provide some value in that direction. So we do hope open source, open standards, and transparency are the drivers for the public clouds to come up with a solution for our society and for our end users.

Matt: Nice. And you've got a lot of energy when you talk about these things — thank you. But I've got another question: what gets you most excited about the future of sustainable cloud initiatives and the technology you're working on?

Huamin: As an engineer, I think my excitement comes from being able to use the latest technology in developing our technology stack, our software stack. If you look back at the methodologies, what we're using is quite new. Academia and the research institutes have been working on energy management of software systems for many years, but they haven't been using the best technologies available today. We are using the latest technologies: Kepler uses eBPF, which is a new thing in the Linux ecosystem and excites the community quite a lot. Once you say you're using eBPF, people say, "this is a project I can jump into." And we also use machine learning — and not just simple models; we support quite complicated models, to be honest.

Like I said, we started with linear regression, which is a classic model in itself, but we are moving on to more sophisticated ones like XGBoost models. And that makes people go, "wow, if I learn XGBoost machine learning, I can use it in Kepler, right?" So that's the technology excitement: using the latest technology is both an accomplishment for us and a way to motivate people to adopt our projects. That's why we like it. The second thing that makes us very enthusiastic is seeing new questions on GitHub day to day. People ask why something isn't working, and why this is helping them solve their problems. That kind of confirmation and reinforcement shows we are actually developing something people use — when they report problems, it means they're using our product.

So you're not working on something with no end users and no adoption. Problems are actually our friends — bugs are our friends, and issues are our friends. Without problems, without bugs, we'd be working on a dead project. The more bugs and issues we see, the more excited we are, I guess. And the third thing, which we discussed earlier: we're working on a number of new cloud environments and server environments. Getting hands-on with these new environments also makes our life as engineers very exciting. Understanding a little more about how each cloud works and how new hardware works — that's new knowledge for us, and as you keep refreshing your knowledge, you feel rewarded. The list is almost unlimited; if I listed ten things, I could probably count on all my fingers.

Maybe I could use my toes to add more. But I feel like, as engineers, day by day we get a lot of fresh energy from developing the products and learning new things.

(52:44) Outro - Takeaways for Cloud Sustainability

Matt: Yeah, I love the passion and the excitement around it — I can tell you really enjoy what you do. It's great stuff. So I'm going to wrap it up with one last question; we've had a great conversation so far. If there was one thing that you hoped people listening to this episode today understood, what would it be? Just one thing to take away.

Huamin: I probably want to stress open source, open discussion, open standards, transparency — openness. Openness is probably the most important. You might have the best idea, but if you don't open that idea up, you may not find your audience, the people who support you — and you may not find criticism either. One genius may not have the best ideas to cover every scenario. So open it up: let people hear you, and let people start discussions. That way you can have a greater impact and a better future. So if people take one thing from what we've discussed today, let it be: be open. That's probably the most important.

Matt: Excellent. Well, I want to thank you for being here today to speak with us about sustainability, cloud infrastructure, and all the efforts going on in your role at Red Hat. It was a great conversation, and it left me — and hopefully our audience — with a lot to think about. So thank you for joining us today.

Huamin: Matt, it's always a pleasure, and I'm grateful for the opportunity to share our thoughts and hopefully give our audience inspiration and passion to explore sustainable ways for our future generations.

Matt: And the area of sustainability as it relates to the cloud is only going to get more important from here, with the rise of things like artificial intelligence adoption. So, listeners, if you want to follow Huamin — I believe you're on X and LinkedIn. Anywhere else? Any other channels you want us to mention?

Huamin: Yeah, we also have the GitHub channels, and we have a Kepler Slack channel. We're also present on the CNCF Slack — there's a Kepler project channel over there. So you can always ask questions, share your thoughts, and, most important, join us in developing a better product for the future.

Matt: Listeners, subscribe to Cloud Currents — wherever you get your podcasts — to catch all our conversations on the latest innovations in cloud and artificial intelligence, and where they intersect and connect. We look forward to seeing you again soon, and thank you for listening!