Training AI for cybersecurity
In cybersecurity, artificial intelligence (AI) is used to uncover attacks and threats that are hard to detect with traditional rule-based engines such as security information and event management (SIEM) systems. Various AI and machine learning (ML) algorithms, such as deep learning neural networks, clustering, classification and pattern matching, are used to build models that detect attacks and anomalies indicating a potential attack.
These AI/ML algorithms learn through a process called training, in which they examine past data to learn which outcomes to produce. Other models use clustering techniques to analyze data when historical reference data does not exist. In both cases, model accuracy improves with greater data volume and variety. Once a model is developed, it must be tested by data scientists to tune it further, another area where access to large datasets is important. Finally, when a model is deployed in a production setup, it needs to process large volumes of data to return the expected results. False positives undermine the efficacy of the model, though not as much as missed detections. Both can be reduced by training on a larger variety and volume of data.
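Training on past data can be illustrated with a deliberately minimal sketch: learning a statistical threshold from historical failed-login counts and flagging values that exceed it. The counts and the 3-sigma rule here are illustrative assumptions, not a production detection model.

```python
import statistics

def train_threshold(history, k=3.0):
    # "Training": learn a threshold (mean + k * stdev) from past observations.
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

def is_anomalous(value, threshold):
    # "Detection": flag any new observation above the learned threshold.
    return value > threshold

# Hourly failed-login counts observed in the past (made-up numbers).
history = [4, 6, 5, 7, 5, 6, 4, 5]
threshold = train_threshold(history)

print(is_anomalous(5, threshold))    # a typical hour
print(is_anomalous(120, threshold))  # a burst of failures
```

Real models are far richer than a single threshold, but the shape is the same: the more (and more varied) the history, the better the learned boundary separates normal from anomalous behavior.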
In short, for AI/ML-based detection to be successful, huge amounts of data must be processed, often in the range of hundreds of gigabytes. Managing data at this scale is where big data technology becomes relevant.
How big data enables AI
AI and ML need a way to efficiently ingest, clean, store and process large data volumes. Big data provides an efficient framework to work with these large volumes of data. There are various components in the big data stack that enable this.
Data used for analytical processing needs extract, transform, load (ETL) logic to convert it into a usable form. In cybersecurity analytics, raw logs arrive in many formats and need normalization and enrichment before models can run on them. Tools like Apache NiFi help build robust data flow pipelines to ingest and parse data for further processing; NiFi also offers high availability, clustering, and a rich variety of built-in processors.
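The normalization step can be sketched in plain Python: device-specific log lines are parsed and mapped onto one common event schema so that downstream models see uniform fields. The log formats, field names and regular expressions below are hypothetical, standing in for whatever a NiFi pipeline would be configured to parse.

```python
import re

# Hypothetical raw lines from two different device types, each in its own format.
RAW_LOGS = [
    'Jun 01 10:15:02 fw01 DENY src=10.0.0.5 dst=8.8.8.8',
    '2024-06-01T10:15:03Z ids02 alert source_ip=10.0.0.5 target_ip=1.2.3.4',
]

FIREWALL = re.compile(r'(\w+ \d+ [\d:]+) (\S+) (ALLOW|DENY) src=(\S+) dst=(\S+)')
IDS = re.compile(r'(\S+) (\S+) alert source_ip=(\S+) target_ip=(\S+)')

def normalize(line):
    """Map device-specific fields onto one common event schema."""
    m = FIREWALL.match(line)
    if m:
        ts, host, action, src, dst = m.groups()
        return {'timestamp': ts, 'device': host, 'action': action,
                'src_ip': src, 'dst_ip': dst}
    m = IDS.match(line)
    if m:
        ts, host, src, dst = m.groups()
        return {'timestamp': ts, 'device': host, 'action': 'alert',
                'src_ip': src, 'dst_ip': dst}
    return None  # unparseable line; a real pipeline would route it for review

events = [normalize(line) for line in RAW_LOGS]
```

Enrichment (adding geolocation, asset criticality, threat-intelligence tags and so on) would follow the same pattern, attaching extra keys to each normalized event.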
Storage design needs to ensure redundancy (e.g. using replication) and efficient retrieval (e.g. using appropriate partitioning schemes). The Hadoop Distributed File System (HDFS) can be used for efficient storage across distributed nodes with replication and high availability. The multi-node nature of the file system enables horizontal scaling. The data can still be accessed centrally by the consuming application without it being aware of which node the data is stored on, since Hadoop distributes the data while writing to and retrieving from the various nodes.
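One common partitioning scheme is to lay events out by date, so queries over a time window touch only the relevant directories. As a sketch (the base path is a hypothetical choice), a writer might derive the target directory from each event's timestamp:

```python
from datetime import datetime, timezone

def partition_path(event_time, base='/data/security_logs'):
    # Hive-style date partitioning: one directory per day of data,
    # so a query for one day never scans the rest of the dataset.
    return (f"{base}/year={event_time.year}"
            f"/month={event_time.month:02d}/day={event_time.day:02d}")

ts = datetime(2024, 6, 1, 10, 15, tzinfo=timezone.utc)
print(partition_path(ts))  # /data/security_logs/year=2024/month=06/day=01
```

The `key=value` directory convention is what lets SQL engines over HDFS prune partitions automatically when a query filters on the partition columns.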
Tools like Apache Hive provide SQL query capabilities on this data, enabling easier analytics on the stored data.
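As an illustration of the kind of analytics this enables, a Hive query over date-partitioned security logs might look like the following (table, column and partition names are hypothetical, held here as a Python string as it might be submitted through a Hive client):

```python
# Hypothetical HiveQL: top sources of denied connections for one day.
# The filters on year/month/day let Hive prune to a single partition.
QUERY = """
SELECT src_ip, COUNT(*) AS deny_count
FROM security_logs
WHERE action = 'DENY'
  AND year = 2024 AND month = 6 AND day = 1
GROUP BY src_ip
ORDER BY deny_count DESC
LIMIT 10
"""
```

Analysts familiar with SQL can run such queries directly, without writing distributed code themselves.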
Apache Spark provides an engine for developing analytical applications that process data at scale in a distributed computing framework. Spark delivers high performance for both streaming and batch workloads, making it a good choice for real-time analytics as well as deep analysis of data collected over time and processed in batches for training and detection. It can be used from programming languages such as Java, Scala and Python, making application development easier.
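The style of batch aggregation Spark parallelizes can be sketched in plain Python as a map step followed by a reduce step; in PySpark the same logic would typically be a `filter` plus `groupBy(...).count()` over a DataFrame. The event records and field names below are hypothetical, standing in for the normalized output of the ETL stage.

```python
from collections import Counter
from functools import reduce

# Hypothetical normalized events, as produced by the ETL stage.
events = [
    {'src_ip': '10.0.0.5', 'action': 'DENY'},
    {'src_ip': '10.0.0.5', 'action': 'DENY'},
    {'src_ip': '10.0.0.9', 'action': 'ALLOW'},
]

# Map step: emit one (key, 1) pair per denied connection.
pairs = [(e['src_ip'], 1) for e in events if e['action'] == 'DENY']

# Reduce step: sum counts per key, as Spark's reduceByKey does per partition.
def merge(counts, pair):
    key, n = pair
    counts[key] += n
    return counts

deny_counts = reduce(merge, pairs, Counter())
print(deny_counts['10.0.0.5'])  # → 2
```

Spark's value is that it runs this same map/reduce shape across many nodes and many gigabytes at once, for both one-off training jobs and continuously running detection streams.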
In summary, big data provides a robust framework for AI/ML models to efficiently process large datasets. It helps create better, more accurate AI models that can translate raw log and alert data from multiple security devices into meaningful, actionable, intelligent information.
About the author
VP Product Engineering, Atos
Sonali Gupta has been into software architecture and development for 20 years, specializing in products in the Cybersecurity domain. Sonali was a key member of the Paladion team that conceptualized, developed and rolled out a high-speed Internet traffic monitoring system in 2004. A seasoned practitioner of Agile methodology with deep understanding of Cybersecurity requirements and challenges, she is an expert in high-performance distributed software architecture and design using Big Data technologies and various language frameworks. Sonali currently heads Engineering for AIsaac, an AI-based MDR and Cyber Analytics platform. She has been involved in the architecture and development of the AIsaac platform since its inception.