Let's walk through an Accelerating AIOps use case: Predicting & Detecting Anomalies in IT Server Performance with Recurrent Neural Networks (RNNs), by Chandra Gundlapalli

AIOps Machine Learning Algorithms - Recurrent Neural Networks

Let's walk through an example of using Recurrent Neural Networks (RNNs) for a common AIOps use case: Proactive Anomaly Detection (Predicting and Detecting anomalies) in IT system performance. In this example, we will use RNNs to analyze time-series data from system performance metrics, such as CPU usage, memory consumption, and network latency.

Typical steps (a flow diagram will be added later; please check back):

  1. Data Collection: First, we need to collect the time-series data for system performance metrics. This could include metrics like CPU usage, memory consumption, disk I/O, and network latency, all collected over a period of time at regular intervals (e.g., every minute).

  2. Data Preprocessing: Preprocessing includes cleaning the data, handling missing values, and normalizing the data to ensure that all metrics are on the same scale. This step is crucial for the RNN to learn effectively from the data. [We developed a data cleansing utility that will improve this process step by 30-50%, more on this soon.]

  3. Feature Engineering: We can create additional features that may help predict anomalies, such as rolling averages or moving standard deviations. This can help the RNN capture trends and patterns in the data more effectively.

  4. Splitting the Data: Divide the dataset into a training set and a validation set. The training set will be used to train the RNN, while the validation set will be used to evaluate its performance.

  5. Designing the RNN Architecture: Choose an appropriate RNN architecture for the problem. In this case, we can use a simple RNN or an LSTM (Long Short-Term Memory) network, which can capture longer-range dependencies in the time-series data. The input layer will have as many units as the number of input features (e.g., system performance metrics), and the output layer will have a single unit that predicts the next value in the sequence.

  6. Training the RNN: Train the RNN on the prepared training dataset, adjusting hyperparameters such as learning rate, number of hidden layers, and batch size to optimize performance. Training an RNN can be computationally expensive and prone to exploding gradients, so gradient clipping is commonly used to keep weight updates stable, while early stopping halts training once the validation loss stops improving, which limits overfitting and avoids wasted epochs.

  7. Model Evaluation: Evaluate the performance of the RNN on the validation set by comparing its predictions to the actual values. Common evaluation metrics for this task include mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).

  8. Anomaly Detection: Set a threshold for detecting anomalies based on the model's prediction errors. If the prediction error for a data point exceeds this threshold, it can be flagged as an anomaly, indicating a potential issue in the IT system.

  9. Alerting and Root Cause Analysis: Integrate the RNN-based anomaly detection system into your IT operations monitoring platform. When an anomaly is detected, it can trigger alerts for further investigation by IT professionals, who can then perform root cause analysis to determine the underlying issue and resolve it.
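The feature-engineering step (step 3) can be sketched as follows. This is a minimal illustration, not the article's pipeline: the helper name `add_rolling_features`, the window size, and the toy values are assumptions made for the example. It uses pandas `rolling` to add a rolling mean and a rolling standard deviation for every metric column.

```python
import pandas as pd

def add_rolling_features(df, window=5):
    """Return a copy of df with rolling mean/std columns for each metric.

    Hypothetical helper for illustration; the window size is an assumption.
    """
    out = df.copy()
    for col in df.columns:
        roll = df[col].rolling(window=window, min_periods=1)
        out[f"{col}_roll_mean"] = roll.mean()
        # The std of a single observation is undefined (NaN); treat it as 0
        out[f"{col}_roll_std"] = roll.std().fillna(0.0)
    return out

# Toy data standing in for real system metrics
metrics = pd.DataFrame({
    "cpu_usage": [10.0, 12.0, 11.0, 50.0, 12.0],
    "memory_usage": [40.0, 41.0, 42.0, 43.0, 44.0],
})
features = add_rolling_features(metrics, window=3)
print(features.round(2))
```

The extra columns can be fed to the network alongside the raw metrics; whether they help depends on the data, so it is worth comparing validation error with and without them.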

LSTMs (Long Short-Term Memory) are a special type of RNN that can learn longer-range dependencies, which can be crucial in detecting complex patterns and trends in IT system data. Let’s use the LSTM network to predict and detect anomalies in a time-series dataset of IT system performance metrics. We will use Python and Keras, a high-level deep-learning library built on TensorFlow. Note that this example is simplified and does not include all the necessary steps for a production-ready implementation.

Import the necessary libraries:

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

Load and preprocess the dataset (assuming a CSV file with columns for timestamps and system performance metrics, e.g., 'timestamp', 'cpu_usage', 'memory_usage'):

data = pd.read_csv('system_metrics.csv', index_col='timestamp', parse_dates=True)
data = data.ffill()  # Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)  # NumPy array of shape (n_samples, n_features)

Create a function to prepare the data for the LSTM:

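def create_dataset(dataset, look_back=1):
    """Build (samples, look_back, features) input windows and next-step targets."""
    dataX, dataY = [], []
    # len(dataset) - look_back windows are possible; each needs one following row as target
    for i in range(len(dataset) - look_back):
        dataX.append(dataset[i:(i + look_back), :])  # look_back rows, all metrics
        dataY.append(dataset[i + look_back, 0])      # next value of the first metric (column 0)
    return np.array(dataX), np.array(dataY)

look_back = 10
X, y = create_dataset(scaled_data, look_back)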
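To make the window shapes concrete, here is a small vectorized sketch of the same idea using NumPy's `sliding_window_view`. The function name `make_windows` and the toy array are assumptions for illustration. It produces one window per row that has a following row to predict, with the first column as the target.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(dataset, look_back):
    """Vectorized windowing sketch (hypothetical helper, not the article's function)."""
    dataset = np.asarray(dataset, dtype=float)
    # sliding_window_view appends the window axis last: (n_windows, features, look_back)
    windows = sliding_window_view(dataset, window_shape=look_back, axis=0)
    # Drop the final window (no following row to predict), then reorder the axes
    # to the (samples, look_back, features) layout an LSTM layer expects
    X = np.ascontiguousarray(windows[:-1].transpose(0, 2, 1))
    y = dataset[look_back:, 0]  # next value of column 0 for each window
    return X, y

# Toy data: 6 time steps, 2 metrics
toy = np.arange(12, dtype=float).reshape(6, 2)
X, y = make_windows(toy, look_back=3)
print(X.shape, y.shape)
```

With 6 time steps and `look_back=3`, the result is 3 samples, each holding 3 consecutive rows of both metrics, paired with the first metric's value at the following time step.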

Split the data into training and validation sets:

train_size = int(len(X) * 0.80)
X_train, X_val = X[:train_size], X[train_size:]
y_train, y_val = y[:train_size], y[train_size:]
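One refinement worth considering: the example fits the scaler on the full dataset before splitting, so the validation period influences the scaling. A common alternative is to split chronologically first and compute the min/max from the training rows only. The sketch below shows this with plain NumPy; the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def chronological_split_and_scale(values, train_fraction=0.8):
    """Split in time order, then min-max scale using training statistics only.

    Hypothetical helper for illustration.
    """
    values = np.asarray(values, dtype=float)
    split = int(len(values) * train_fraction)
    train, val = values[:split], values[split:]
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    train_scaled = (train - col_min) / col_range
    val_scaled = (val - col_min) / col_range  # may fall outside [0, 1]
    return train_scaled, val_scaled, col_min, col_range

# Toy data: two steadily increasing metrics over 10 time steps
raw = np.column_stack([np.arange(10.0), np.arange(10.0) * 2])
train_scaled, val_scaled, col_min, col_range = chronological_split_and_scale(raw)
print(train_scaled.max(), val_scaled[:, 0])
```

Validation values can land slightly outside the 0 to 1 range when the metric drifts, which is expected and mirrors what the model would see in production.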

Define and compile the LSTM model:

model = Sequential()
model.add(LSTM(50, input_shape=(look_back, X.shape[2])))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

Train the LSTM model:

model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_val, y_val), verbose=1, shuffle=False)
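The training call above does not use the gradient clipping or early stopping mentioned in step 6. In Keras these are available through the optimizer's `clipnorm` argument and the `EarlyStopping` callback. The framework-independent sketch below only illustrates what the two techniques do; the function names and the sample loss values are assumptions for the example, not Keras internals.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm."""
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which patience-based early stopping would halt."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # validation loss improved: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran through every epoch

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5], patience=3)
print(clipped, stop_epoch)
```

In the sample loss history, the best value is reached at epoch 2 and training would stop three non-improving epochs later, so the later drop to 0.5 is never seen. That trade-off is why the patience value deserves some tuning.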

Make predictions and evaluate the model:

y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

# Invert the scaling to get actual values. The scaler was fit on every metric,
# so scaler.inverse_transform() raises an error on a single column when the CSV
# has more than one metric. Undo the scaling of the target (column 0) by hand.
target_min, target_range = scaler.data_min_[0], scaler.data_range_[0]

def invert_target(values):
    return values.reshape(-1, 1) * target_range + target_min

y_train_pred = invert_target(y_train_pred)
y_val_pred = invert_target(y_val_pred)
y_train_actual = invert_target(y_train)
y_val_actual = invert_target(y_val)

# Calculate the RMSE (root mean squared error)
train_rmse = np.sqrt(mean_squared_error(y_train_actual, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val_actual, y_val_pred))
print(f'Train RMSE: {train_rmse:.4f}, Validation RMSE: {val_rmse:.4f}')
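Step 7 also lists MAE and MSE, which the snippet above does not compute. A minimal NumPy sketch of all three metrics follows; the function name `regression_errors` and the sample values are assumptions for illustration.

```python
import numpy as np

def regression_errors(actual, predicted):
    """Compute MAE, MSE and RMSE between two equally sized arrays."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    residuals = actual - predicted
    mse = np.mean(residuals ** 2)
    return {
        "mae": np.mean(np.abs(residuals)),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }

# Toy values standing in for actual and predicted CPU usage
scores = regression_errors([50.0, 52.0, 54.0], [51.0, 52.0, 50.0])
print(scores)
```

MAE is less sensitive to a few large misses than RMSE, so reporting both gives a quick sense of whether the error is spread evenly or dominated by a handful of points.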

Set a threshold for detecting anomalies:

threshold = train_rmse * 1.5 # This is an example value; you may need to adjust it based on your specific use case.

Identify and flag anomalies:

val_errors = np.abs(y_val_actual - y_val_pred)
anomalies = np.where(val_errors > threshold)[0]
print(f'Anomaly indices: {anomalies}')
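The 1.5 x RMSE threshold is only a starting point. One alternative, sketched here as an assumption rather than the article's method, derives the threshold from the distribution of training errors (mean plus k standard deviations) and flags any new error above it. The function names, the value of k, and the simulated errors are illustrative.

```python
import numpy as np

def residual_threshold(train_errors, k=3.0):
    """Threshold = mean + k * std of the absolute training errors."""
    errors = np.asarray(train_errors, dtype=float)
    return errors.mean() + k * errors.std()

def flag_anomalies(errors, threshold):
    """Return the indices of errors that exceed the threshold."""
    return np.flatnonzero(np.asarray(errors, dtype=float) > threshold)

# Simulated absolute prediction errors (stand-ins for real model residuals)
rng = np.random.default_rng(0)
train_errors = np.abs(rng.normal(0.0, 1.0, size=500))
new_errors = np.array([0.2, 0.5, 9.0, 0.1])

threshold = residual_threshold(train_errors, k=3.0)
flagged = flag_anomalies(new_errors, threshold)
print(threshold, flagged)
```

A larger k reduces false alarms at the cost of missing milder anomalies; a percentile of the training errors (for example `np.percentile(train_errors, 99)`) is another option when the errors are far from normally distributed.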

This example demonstrates a basic workflow for using an LSTM model to predict and detect anomalies in IT system performance metrics.

While you wait for test data (approval from business units, encrypting the data for privacy, etc.), here's a Python script to generate a synthetic time-series dataset of IT system performance metrics for anomaly detection. This script creates a CSV file with columns for timestamps, CPU usage, memory usage, and network latency. We'll introduce some artificial anomalies in the data to simulate real-world scenarios.

import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

def generate_metric_data(start_value, end_value, n_points, noise_amplitude):
    x = np.linspace(start_value, end_value, n_points)
    noise = np.random.normal(0, noise_amplitude, n_points)
    return x + noise

def add_anomalies(data, n_anomalies, min_magnitude, max_magnitude):
    anomaly_indices = random.sample(range(len(data)), n_anomalies)
    for index in anomaly_indices:
        data[index] += random.uniform(min_magnitude, max_magnitude) * (1 if random.random() > 0.5 else -1)
    return data

n_points = 1000
start_time = datetime(2022, 1, 1)
time_interval = timedelta(minutes=1)
timestamps = [start_time + i * time_interval for i in range(n_points)]

cpu_usage = generate_metric_data(30, 70, n_points, 2)
memory_usage = generate_metric_data(20, 80, n_points, 3)
network_latency = generate_metric_data(10, 50, n_points, 1)

# Add anomalies to the data
cpu_usage = add_anomalies(cpu_usage, 10, 15, 25)
memory_usage = add_anomalies(memory_usage, 10, 15, 25)
network_latency = add_anomalies(network_latency, 10, 5, 15)

# Create a DataFrame and save it as a CSV file
data = pd.DataFrame({'timestamp': timestamps, 'cpu_usage': cpu_usage, 'memory_usage': memory_usage, 'network_latency': network_latency})
data.to_csv('synthetic_system_metrics.csv', index=False)

This script generates a synthetic dataset with 1000 data points and introduces 10 artificial anomalies in each of the performance metrics. You can adjust the parameters, such as the number of data points, anomaly count, and noise amplitude, to create different datasets for testing.
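Because the anomalies here are injected deliberately, their positions can serve as ground truth for checking the detector. The script above does not return the injected indices, so the sketch below assumes it has been adapted to record them. The function name `precision_recall` and the sample indices are illustrative assumptions.

```python
def precision_recall(true_indices, flagged_indices):
    """Compare injected anomaly positions with the positions the model flagged."""
    true_set, flagged_set = set(true_indices), set(flagged_indices)
    hits = len(true_set & flagged_set)
    precision = hits / len(flagged_set) if flagged_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall

# Hypothetical indices: four injected anomalies, three points flagged by the model
precision, recall = precision_recall([5, 40, 77, 90], [5, 40, 41])
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

When comparing against model output, remember that the validation indices are offset from the original row numbers by the training size and the `look_back` window, so the two index sets need to be aligned first.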

You can also test the model with publicly available datasets and sources for anomaly detection tasks:

  • Numenta Anomaly Benchmark (NAB): NAB is a benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. It includes over 50 labeled real-world and artificial time-series datasets, some of which are related to IT system performance, such as server metrics and AWS CloudWatch data. You can access the datasets and learn more about NAB on their GitHub repository: https://github.com/numenta/NAB

  • UC Irvine Machine Learning Repository: The UCI Repository is a popular source for various types of datasets, including time-series data. While there isn't a specific IT system performance dataset, you can explore their collection to find time-series datasets that might be relevant or adaptable to your use case: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=ts&sort=nameUp&view=table

  • Kaggle: Kaggle is a platform for data science and machine learning competitions, and it hosts many datasets uploaded by users. You can search for time-series datasets related to IT system performance or other relevant domains: https://www.kaggle.com/datasets?search=time+series

30-Second Takeaway: An RNN-based anomaly detection system helps IT operations teams proactively identify performance issues and potential failures in their infrastructure. This leads to faster resolution times and improved system reliability.

Next: I will write a follow-up post on solving the same use case on the AWS platform (Bring Your Own Model on SageMaker).

PS: Please check back; I will be updating this article over the coming weekends.