How To Operationalize AIOps Faster with Synthetic Data


Setting up a fully operational AIOps system in a large telco or enterprise is a time-consuming process: it can take months before the system is ready for production. A significant share of that time goes to data ingestion, model training, and model tuning. Ingestion typically relies on real data, which is expensive to generate, infrequently available, and slows down training. Today, synthetic data is rarely used in large telcos, and that gap is a major reason AIOps deployments take as long as they do.

Fortunately, new tools for generating synthetic data have emerged. They support data validation, comparison with real data, and metrics for evaluating privacy and fidelity. With synthetic data, AIOps can be tested and operationalized much faster, cutting overall deployment time.

AIOps Functionality

In its Market Guide for AIOps Platforms, Gartner defines the essential functions of an AIOps platform.

Core Functionality of AIOps

The initial phase of the process is ingestion, which covers both historical and real-time data. That data is then normalized and potentially enriched to train the model and fine-tune the system. This article focuses on the ingestion aspect, with particular emphasis on making it faster and more cost-effective.

Ingestion starts with setting up a lab environment containing the various vendor devices to be monitored and, for network devices such as routers and switches, configuring the network that connects them. Any other infrastructure systems that require monitoring must also be set up. Once the infrastructure is in place, the relevant data has to be generated: logs, alerts, events, SNMP traps, telemetry, and syslogs (a minimal sketch of such a data-generation script follows the list below). However, this method of generating ingestion data is inefficient for the following reasons:

  • Expensive: Generating real data requires setting up a lab with the appropriate combination of devices, network connections, and device configurations, plus writing scripts to generate the data. Both the lab setup and the time needed for data generation drive up the project's overall cost.
  • Data Generation Delay: The volume of real data generated is typically insufficient to train machine learning (ML) algorithms within hours or days. It often takes weeks to accumulate enough training data for ML algorithms to produce models suitable for inferencing.
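
To make the cost and delay concrete, below is a minimal Python sketch of the kind of script a lab team might write to emit syslog-style events toward a collector. The collector address, device names, and event strings are illustrative assumptions, not any vendor's actual output; real lab scripts are tied to specific gear and accumulate useful volume far more slowly.

```python
# Minimal sketch of a lab data-generation script (illustrative only).
# The collector address, device names, and event strings are assumptions.
import logging
import logging.handlers
import random
import time

# Point the handler at the collector the AIOps platform ingests from
# (placeholder address; UDP syslog on port 514).
handler = logging.handlers.SysLogHandler(address=("collector.lab.example", 514))
logger = logging.getLogger("lab-syslog-generator")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

DEVICES = ["edge-router-01", "core-switch-02", "firewall-03"]
EVENTS = ["LINK-3-UPDOWN", "BGP-5-ADJCHANGE", "SYS-5-CONFIG_I"]

# Emit a slow trickle of syslog-style events. Accumulating enough volume
# for ML training at this rate is what stretches ingestion into weeks.
for _ in range(100):
    device = random.choice(DEVICES)
    event = random.choice(EVENTS)
    logger.info("%s: %%%s: simulated lab event", device, event)
    time.sleep(1)
```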

How Can We Reduce the Data Ingestion Time?

The answer lies in using synthetic data. Synthetic data is artificially generated data that replicates real-world data patterns. It is created using AI/ML algorithms and statistical models to mimic the characteristics of genuine data. It serves as a privacy-preserving alternative for testing algorithms, training models, conducting simulations, or sharing data without exposing sensitive information.
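
As a toy illustration of the statistical-model approach, the sketch below fits a lognormal distribution to a small sample of alert inter-arrival times and then draws a much larger synthetic sample from it. The "real" observations here are simulated stand-ins; assume they would come from production monitoring.

```python
# Minimal sketch: fit a statistical model to real observations, then sample
# synthetic data that mimics them. The "real" data below is a stand-in.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Stand-in for observed seconds between alerts in a real environment.
real_interarrival = rng.lognormal(mean=1.5, sigma=0.6, size=500)

# Fit a lognormal distribution to the observations...
shape, loc, scale = stats.lognorm.fit(real_interarrival, floc=0)

# ...then sample as many synthetic inter-arrival times as training needs.
synthetic_interarrival = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=100_000)

print(f"real mean: {real_interarrival.mean():.2f}s, "
      f"synthetic mean: {synthetic_interarrival.mean():.2f}s")
```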

Advances in synthetic data technology make it possible to generate large volumes of data quickly, with further time savings possible when using NVIDIA's TAO Toolkit. Key aspects to consider in synthetic data generation include:

  1. Data Validation: Ensuring that the generated synthetic data meets quality standards and undergoes validation checks for integrity, accuracy, and consistency.
  2. Data Conformance: Ensuring that the synthetic data adheres to predefined standards, formats, or regulations so that it remains compatible with existing systems and processes.
  3. Data Comparison: Comparing the synthetic data with real-world data to identify similarities, differences, or patterns, facilitating validation and assessment of the synthetic data's fidelity (a minimal comparison sketch follows this list).
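
One simple, library-agnostic way to perform the comparison step is a per-column two-sample Kolmogorov-Smirnov test between the real and synthetic tables. The helper below is an illustrative sketch, not part of any particular product; the numeric-column selection and the 0.05 threshold are assumptions.

```python
# Illustrative comparison of real vs. synthetic data using a per-column
# two-sample Kolmogorov-Smirnov test (scipy). The threshold is an assumption.
import pandas as pd
from scipy import stats

def compare_columns(real: pd.DataFrame, synthetic: pd.DataFrame, alpha: float = 0.05):
    """Return per-column KS statistics and whether each column passes."""
    results = {}
    for column in real.select_dtypes(include="number").columns:
        statistic, p_value = stats.ks_2samp(real[column], synthetic[column])
        results[column] = {
            "ks_statistic": statistic,
            "p_value": p_value,
            "passes": p_value >= alpha,  # failure means the distributions diverge
        }
    return results

# Usage with hypothetical metric tables from a monitoring feed:
# report = compare_columns(real_metrics_df, synthetic_metrics_df)
```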

Synthetic Data Generation Applications

There are numerous software applications available, including both open-source and commercial versions. Let's evaluate two of these applications.

sdv.dev: The Synthetic Data Vault (SDV) grew out of research at MIT. It is open source and provides an ecosystem that includes the following (a usage sketch follows the list):

  • Modeling: Models and generates data for a variety of scenarios using statistical models and deep learning.
  • Benchmarking: Assesses the performance of synthetic data generators, including SDV models, by comparing their generated data against a given dataset.
  • Metrics: Provides model-agnostic tools for evaluating synthetic data, with metrics covering statistical similarity, efficiency, and privacy.
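
The sketch below shows how SDV's modeling and metrics pieces fit together, assuming the SDV 1.x single-table API (pip install sdv). The CSV path and the contents of the event table are placeholders.

```python
# Minimal SDV 1.x single-table workflow (sketch; path and data are placeholders).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality

real_events = pd.read_csv("historical_events.csv")  # placeholder path

# Describe the table so SDV knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_events)

# Fit a statistical model to the real data, then sample synthetic rows.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_events)
synthetic_events = synthesizer.sample(num_rows=50_000)

# SDV's evaluation utilities score how closely the synthetic data matches
# the real data's column shapes and pairwise relationships.
quality_report = evaluate_quality(
    real_data=real_events,
    synthetic_data=synthetic_events,
    metadata=metadata,
)
print(quality_report.get_score())
```

A synthetic table generated this way can stand in for weeks of lab-generated ingestion data when training and tuning AIOps models, provided the quality and privacy scores are acceptable.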

Mostly.ai: Provides AI-generated synthetic data used across segments including telecommunications, healthcare, insurance, and banking. Its main features are:

  • It offers a secure platform for sharing synthetic data, enabling collaboration and data sharing across organizations.
  • Synthetic data generated by the platform maintains referential integrity and can be synthesized directly from databases.
  • The platform provides automated privacy mechanisms and generates privacy and quality reports for each synthetic dataset, ensuring privacy and data quality.

Summary

Gartner projects that by 2024, 60% of the data used for AI and analytics development will be synthetically generated. As synthetic data technology continues to mature, its reliability will improve, driving adoption across more fields and applications. In particular, AIOps systems will use synthetic data to speed up deployment and implementation.