Data Centre Operations, Machine Learning and a long flight…
It is great to be back home… Nic Bruce and I delivered our keynote at @DCDSydney last Tuesday, and we are now back in Europe. Thank you @Nick and @Simon for your incredible support and hospitality, and for the opportunity to meet influential leaders in the data centre industry. The best thing about a 32-hour flight is… time: time to think. And so, as the aircraft ascended and I contemplated the panoramic view of the city, I began to gather my thoughts on the last several days.
Within a relatively short period, Machine Learning has migrated from the laboratory to the forefront of operational systems. It is clear that the big names in the Internet space use Machine Learning every day to improve customer experience, recommend purchases and facilitate personal connections. What is less clear, at least to the public, is that Machine Learning can also support the operation and security of the large infrastructures behind data centres.
Data centre operators can become overwhelmed by real-time monitoring information, which is essential for identifying and preventing performance and security incidents. Human brains are not designed to correlate and analyse thousands of events per second from a significant number of data sources. Operators therefore tend to focus on a limited subset of alerts, missing important events, and the links between them, that might represent a risk. At the core of the problem is that the systems analysts use to detect, explore and respond to incidents are good at processing high volumes of data but are unable to learn from it. In contrast, analysts possess a tremendous capacity to diagnose a problem and select the appropriate recovery action, but struggle to analyse data at the rate at which it is generated.
The next generation of SIEM (security information and event management) development faces the challenge of creating a system that can present different sources of information at multiple levels of abstraction, while learning from experience and selecting the right corrective action. The objective is that one person will be able to operate large data centres. As you might imagine, the challenges on the way there are countless. However, the gradual adoption of intelligent components promises to be an exciting journey.
When I address a non-academic audience, I usually start by conducting a simple survey. It breaks the ice and helps me connect with the audience, but it also helps me put my message into context. The survey produces almost identical results, regardless of where in the world the conference is taking place. I start by asking how many have heard about Machine Learning. Most of the audience raise their hands. I suspect that I would not have been able to say the same 10 years ago. When I follow up with, "how many of you have a precise definition of what Machine Learning is?", the number of raised hands drops significantly. For those who do have a definition, it is doubtful that they share the same concept, or that their concepts are even compatible.
The conclusion I draw is as simple as the survey. We, the people with a Machine Learning research background, have spent too much time talking about the "how", and not enough time talking about the "what". Perhaps that was never our function, but it is key to commercializing ML-based solutions. The "what" is different for each vertical, and for each problem within a single vertical that is to be solved using algorithms. There is considerable uncertainty in the industry about how this relatively new technology can affect business, beyond hiring a group of data scientists and waiting 6 to 9 months for, hopefully, applicable results.
Turning specifically to data centre operations: because application downtime can cost millions of dollars, good performance and high availability are primary goals. As such, data centre operators are often required to respond to alerts within minutes, even if the event takes place at night or at the weekend. To maintain good performance and high availability in such a complex and constantly changing environment, data centre operators face three main challenges: understanding application workload, understanding application performance under different workloads and system configurations, and quickly detecting, identifying, and resolving performance problems.
How can Machine Learning help to achieve that?
First, operators need to understand typical workload patterns and long-term trends, and make sure the application can handle unexpected workload spikes. Most Web applications have highly predictable daily, weekly and yearly patterns, which operators use for short-term and long-term resource provisioning. However, tens or hundreds of millions of users can create unexpected workload spikes and data hotspots that can overload the application. Data centre operators must understand the properties of these spikes, so they can stress-test the application and provision resources appropriately.
Second, because of large-scale deployment, complex workloads, and software dependencies, it is extremely difficult to understand the impact of changes in application code, resource utilization, or hardware on end-to-end request latency. However, understanding what affects application performance is crucial for optimizing request latency, or for dynamically provisioning more resources in response to changes in user workload. Because different applications share the same hardware in modern data centres, it is also important to know the resource requirements of individual applications in order to maximize the utilization of the available computational resources.
Finally, data centre operators need to quickly detect, identify, and resolve performance crises that occur in the data centre. Most crises are detected automatically through increased request latency or decreased throughput. However, some problems are difficult to detect, because they are caused by small but important alterations in application behaviour. The difficulty of crisis resolution varies according to the complexity of the crisis and whether it has been encountered before. Many of the less severe crises are resolved automatically by the application through various methods, such as redundancy, but more severe crises require human intervention, because no known automatic resolution is available. While crisis resolution often takes only a few minutes, in some cases it might take a few hours and require a large group of operators to analyse the system.
Machine Learning (ML) provides a methodology for quickly processing large quantities of monitoring data generated by these applications, finding repeating patterns in their behaviour, and building accurate models of their performance. For example, regression methods allow us to create models of application performance as a function of user workload, and software and hardware configuration. Anomaly detection methods can be used to automatically spot deviations from normal system behaviour, which could correspond to application failures. Feature selection allows the operator to automatically find performance metrics correlated with occurrences of application failures, which could lead to faster diagnosis of these problems.
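To make the regression and anomaly-detection ideas above a little more concrete, here is a minimal sketch in Python. It is only an illustration under simple assumptions, not a description of any production system: the function names (`fit_line`, `residual_spread`, `find_anomalies`), the numbers, and the three-standard-deviation threshold are all hypothetical. It fits a straight line to request latency as a function of workload using baseline data, and then flags new observations whose latency is far from what the model predicts.

```python
from statistics import mean, stdev


def fit_line(loads, latencies):
    """Ordinary least-squares fit of latency = slope * load + intercept."""
    x_bar, y_bar = mean(loads), mean(latencies)
    sxx = sum((x - x_bar) ** 2 for x in loads)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(loads, latencies))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_spread(loads, latencies, slope, intercept):
    """Sample standard deviation of the baseline prediction errors."""
    residuals = [y - (slope * x + intercept) for x, y in zip(loads, latencies)]
    return stdev(residuals)


def find_anomalies(observations, slope, intercept, sigma, threshold=3.0):
    """Indices of (load, latency) pairs further than threshold * sigma from the model."""
    return [
        i
        for i, (load, latency) in enumerate(observations)
        if abs(latency - (slope * load + intercept)) > threshold * sigma
    ]


# Hypothetical baseline: requests per second vs. mean latency in milliseconds.
baseline_load = [100, 200, 300, 400, 500, 600]
baseline_latency = [21.0, 29.5, 41.0, 50.5, 59.0, 71.0]

slope, intercept = fit_line(baseline_load, baseline_latency)
sigma = residual_spread(baseline_load, baseline_latency, slope, intercept)

# Hypothetical new measurements; the second is far slower than the load explains.
new_observations = [(250, 35.0), (450, 95.0), (550, 65.5)]
anomalies = find_anomalies(new_observations, slope, intercept, sigma)
print(anomalies)
```

In practice the model would take many more inputs (software and hardware configuration, time of day, and so on) and the threshold would need tuning, but the shape is the same: learn what normal looks like, then surface only the deviations to the operator.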
As the aircraft headed to Dubai, flying over the Indian Ocean, I contemplated the long trip I had ahead. As an analogy, I imagined the journey of intelligent autonomous solutions from the lab to the real world.