CN107623655B

CN107623655B - System for real-time detection attack based on artificial intelligence and MapReduce

Info

Publication number: CN107623655B
Application number: CN201610546632.7A
Authority: CN
Inventors: 李木金; 凌飞
Original assignee: Nanjing Liancheng Technology Development Co ltd
Current assignee: Nanjing Liancheng Technology Development Co ltd
Priority date: 2016-07-13
Filing date: 2016-07-13
Publication date: 2020-10-27
Anticipated expiration: 2036-07-13
Also published as: CN107623655A

Abstract

The invention discloses a system for detecting security attacks in real time based on artificial intelligence and MapReduce. The system comprises a preprocessing stage, a MAP stage and a Reduce stage, together with the software modules contained in each stage. With this system, the construction level of an enterprise security operation and maintenance service platform can be improved and its construction cost reduced.

Description

System for real-time detection attack based on artificial intelligence and MapReduce

Technical Field

The invention relates to the technical field of artificial intelligence, big data and information security application, in particular to a system for detecting security attack in real time.

Background

The English abbreviations used in the invention are as follows:

RF: Random Forest

CLF: Common Log Format

JSON: JavaScript Object Notation

SOC: Security Operation Center

IDS: Intrusion Detection System

SNMP: Simple Network Management Protocol

HDFS: Hadoop Distributed File System

Production safety underpins the orderly progress of all work and is also a disqualifying criterion in the assessment of leaders and cadres at every level. The network and information security operation and maintenance system is an important component of production safety work in every enterprise. Keeping the network and information systems running efficiently and stably is the basis for all market operations and for the normal functioning of the enterprise.

At present, enterprise IT systems host many different business systems and security devices. These systems raise labor productivity and lower operating costs, and they have become an indispensable support for efficient enterprise operation and for key production links. On the one hand, if a security event or fault in a business system is not discovered, handled and recovered from promptly, every service carried on that system is directly affected and the normal operating order of the enterprise is disrupted. For user-facing systems, this leads directly to user complaints, lower satisfaction and damage to the enterprise's image, which makes security assurance of the enterprise network particularly important. On the other hand, network attack techniques are becoming more advanced and more widespread, so enterprise network systems are at risk of attack at any time, often suffer intrusion and damage of varying severity, and see their normal operation seriously disrupted. The growing security threat forces enterprises to strengthen the protection of their network systems, to pursue multi-level, in-depth security defense systems, and to build security operation and maintenance service centers that track system events in real time, detect security attacks in real time, take control actions promptly, eliminate or reduce the losses caused by attacks, and protect the normal operation of enterprise business systems.

However, as enterprise IT systems keep growing, the variety and number of devices, databases, middleware, operating systems, Web servers and similar components covered by security operation and maintenance services are increasing enormously, so log storage, log analysis and problem tracking become more and more difficult. This massive growth in log volume forces security operation and maintenance service providers to adopt a Hadoop/Spark big data architecture for log storage, log processing and log analysis, for real-time tracking of system events, and for real-time detection of security attacks.

Existing security management and analysis tools are no longer adequate for the security operation and maintenance services that enterprises need today. A completely new approach to real-time analysis and management of massive log information is therefore urgently needed. A log file is typically a flat file that contains at least a timestamp field, an event identifier field and an event description field. The growth in log volume also reflects one of the three characteristic attributes of big data.

Therefore, an important problem to be solved in the design of information security operation and maintenance management is how to use information technology to improve the operating efficiency of enterprises and to optimize enterprise information systems, so that they can provide professional, cost-effective information security operation and maintenance services to all kinds of enterprises.

Disclosure of Invention

After analyzing the defects and shortcomings of various enterprise information security operation and maintenance management platforms, the invention provides a system for detecting security attacks in real time based on artificial intelligence and MapReduce.

The core idea of the invention is as follows: a system for real-time detection of security attacks is constructed. Working from logs, the system uses artificial intelligence technology to track events and detect security attacks in real time, and it is built on a Hadoop/Spark big data architecture.

Further, the system comprises a preprocessing stage, a MAP stage and a Reduce stage.

The preprocessing stage comprises a log real-time acquisition module and a log real-time analysis module.

The MAP stage comprises a real-time event tracking module and a real-time attack detection module.

The Reduce stage comprises a real-time attack statistics module.

Preferably, the log real-time acquisition module and the log real-time analysis module convert the original log into JSON format through preprocessing written in the Python language.

Preferably, the real-time event tracking module and the real-time attack detection module implement an artificial intelligence algorithm to track system events, their dependencies and their scenarios in real time; they can learn the normal behavior of the system in real time and detect security attacks in real time.

Preferably, the real-time attack statistics module counts, in real time, each attack and its number of occurrences or frequency.

With this system, the construction level of an enterprise security operation and maintenance service platform can be improved and its construction cost reduced.

Drawings

FIG. 1 is a schematic diagram of the conversion of original log format to JSON according to the present invention;

FIG. 2 illustrates the main stages of the artificial intelligence based analysis technique of the present invention;

FIG. 3 illustrates the main stages of the big data based architecture according to the present invention;

FIG. 4 is a schematic diagram of a security operation and maintenance management platform system according to the present invention;

Detailed Description

The invention is described in further detail below with reference to the figures and examples:

The system provided by this patent begins by normalizing unstructured log files. Once the unstructured log data has been retrieved, log storage and log processing can follow. Extracting data from logs has long been a laborious technical task, because log data arrives in many heterogeneous formats. To extract log data properly, this patent uses the Python programming language, chosen for its flexibility, its efficiency and the relative ease with which it handles analysis tasks. In the Python program, a class library is used that allows the parser to be constructed directly in Python code.

In this patent, the result of the log preprocessing phase is a JSON (JavaScript Object Notation) file that contains variables corresponding to the log fields, as shown in FIG. 1. JSON is a lightweight data exchange format that is easy for computers to parse and use. Compared with other structured data exchange formats (such as XML), JSON performs noticeably better, with parsing reported to be as much as one hundred times faster. The analysis is based on the RF (Random Forest) method, an artificial intelligence technique for discovering and detecting attack-related events in a log.
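The conversion of a raw log line into JSON can be illustrated with a minimal sketch. It assumes the input is in Common Log Format (CLF) and uses the standard `re` and `json` modules in place of the unnamed parsing library mentioned above; the field names (`host`, `timestamp`, `request`, and so on) and the function name `clf_to_json` are illustrative assumptions, not names taken from the patent.

```python
import json
import re

# Common Log Format: host ident authuser [date] "request" status bytes
# (field names below are assumptions chosen for this sketch)
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def clf_to_json(line):
    """Return a JSON string for one CLF line, or None if it does not parse."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return json.dumps(match.groupdict())

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = json.loads(clf_to_json(sample))
print(record["host"], record["status"])
```

Lines that do not match the pattern return `None`, so a caller can discard or quarantine them instead of passing malformed records to the later stages.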

Fig. 2 shows the three main stages of the artificial intelligence based analysis. To make the discussion clearer, the binary-based data structure and algorithm are described on top of the MapReduce big data architecture.

In order to analyze how often the security attacks detected in the log occur, this patent provides an artificial intelligence technique based on big data. The method processes the JSON data and creates two data structures: one stores the name of each security attack (attName), and the other stores the number/frequency of occurrences of each detected attack and of each combination of attacks (attFreq).

Fig. 3 shows the three phases of the big data architecture: the preprocessing stage, the MAP stage and the Reduce stage:

1. Preprocessing stage (first stage): at this stage, the two data structures attName and attFreq are created. The size of the attFreq array depends on the number of attacks n that have been detected. For example, if n = 5, the attFreq array must cover every combination of the 5 possible attacks; under the binary indexing described next, the combination indices run up to $2^{5} - 1 = 31$.

Assume that attacks A, B and C are stored in the array attName at positions 1, 2 and 3, respectively. If both A and C are found in the log, the index of this combination in the array attFreq is 5, which is obtained by binary translation: the combination of A and C is 101 in binary, which is decimal 5. The index of attFreq is therefore determined by:

$2^{1-1} + 2^{3-1} = 2^{0} + 2^{2} = 5$
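The binary indexing described above can be checked with a short sketch. The helper name `combination_index` and the choice to allocate $2^n$ slots (so that index 0 can hold records with no attack) are assumptions made for this illustration; only attName, attFreq and the A/B/C example come from the text.

```python
# attName holds the attack names; the text numbers positions from 1.
att_name = ["A", "B", "C"]

def combination_index(detected, names):
    """Map a set of detected attack names to its attFreq index.

    The attack at 1-based position i contributes 2**(i - 1).
    """
    loc = 0
    for position, name in enumerate(names, start=1):
        if name in detected:
            loc += 2 ** (position - 1)
    return loc

# Assumption: one slot per bit pattern, with slot 0 meaning "no attack".
att_freq = [0] * (2 ** len(att_name))

idx = combination_index({"A", "C"}, att_name)
att_freq[idx] += 1
print(idx)  # 5, i.e. binary 101
```

Because each attack owns one bit, every combination maps to a distinct index and no lookup table of combinations is needed.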

2. MAP stage (second stage): in this stage, the artificial intelligence algorithm starts executing by scanning the input JSON variables. Security attacks are detected in real time by comparing the JSON variables with a set of special regular expressions (e.g., the rules in logcorrelator.conf of this patent), which are a series of features used to identify different attack patterns.

For each attack detected in the log, the corresponding ID can be found in attName. These IDs determine the corresponding attFreq index through the following formula, named 'Loc', where i is the index in attName of each detected attack:

$Loc = \sum_{i \in \text{detected}} 2^{i-1}$

The following algorithm describes the overall process of the MAP stage, where i is the index of the current attack stored in the array attName. The output of the MAP stage is a key-value pair consisting of the attFreq index and a frequency; this key-value pair is the input to the Reduce stage:

Begin
    loc ← 0
    For each i in attName
        If i is detected in log record
            loc ← loc + 2^(i-1)
        End if
    End for
    Output [loc, 1]
End
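A minimal Python sketch of this MAP step follows. The attack names and regular expressions are hypothetical stand-ins for the rules that the patent keeps in logcorrelator.conf, and `map_record` is an illustrative name; the sketch only shows how one JSON record could be turned into the `[loc, 1]` pair.

```python
import json
import re

# Hypothetical attack names and signatures, assumed for this sketch only.
ATT_NAME = ["ssh_failed_login", "sql_injection", "path_traversal"]
SIGNATURES = {
    "ssh_failed_login": re.compile(r"Failed password"),
    "sql_injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./"),
}

def map_record(json_line):
    """Scan one JSON log record and emit the (loc, 1) key-value pair."""
    record = json.loads(json_line)
    # Join all field values so each signature is tested against the whole record.
    text = " ".join(str(value) for value in record.values())
    loc = 0
    for position, name in enumerate(ATT_NAME, start=1):
        if SIGNATURES[name].search(text):
            loc += 2 ** (position - 1)
    return (loc, 1)

line = json.dumps({"host": "10.0.0.7",
                   "request": "GET /../../etc/passwd?q=1 UNION SELECT 1"})
print(map_record(line))  # (6, 1): positions 2 and 3 detected, 2 + 4
```

A record in which nothing matches yields `(0, 1)`, so the Reduce stage can also count clean records if that is wanted.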

3. Reduce stage (third stage): at this stage, the Hadoop/Spark worker nodes redistribute the data based on the output of the MAP stage. The Reduce method then sums, in parallel, the data output by each MAP. After the Reduce method has executed, the array attFreq stores the result; the frequencies can then be sorted, and the indices in the array can be ordered from high to low.
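The summation performed by the Reduce stage can be sketched in a single process as follows. This is an illustration only: `reduce_counts` and the sample pairs are assumptions, and on an actual Hadoop/Spark cluster the same per-key addition would be distributed across worker nodes (for example with Spark's `reduceByKey`) rather than done in one loop.

```python
from collections import Counter

def reduce_counts(pairs):
    """Sum the counts emitted by the MAP stage for each attFreq index."""
    att_freq = Counter()
    for loc, count in pairs:
        att_freq[loc] += count
    return att_freq

# Sample MAP output, invented for this sketch.
map_output = [(5, 1), (1, 1), (5, 1), (0, 1), (5, 1), (1, 1)]
att_freq = reduce_counts(map_output)

# Order combinations from most to least frequent.
ranked = att_freq.most_common()
print(ranked)  # [(5, 3), (1, 2), (0, 1)]
```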

Fig. 4 is a framework of the secure operation and maintenance management platform according to the present invention:

1. Preprocessing stage

This part of the program is written in Python. Massive logs are collected in real time from different security devices, network devices, databases, operating systems, middleware and so on. To preprocess these heterogeneous logs, a rule-based (regular-expression-based) approach is used, which can eliminate redundant or useless log data. Each rule has a fixed format containing several fields: the type field indicates the type of the rule, the pattern field identifies an input event, and the ptype field indicates the type of the pattern field. The desc field is a description of the rule. The action field indicates the manner of alarm (e.g., short message, alarm box, Email) used when the event occurs.

After preprocessing, the log is changed to JSON format.
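The rule fields described above can be pictured with a small sketch. The `Rule` class, the `matches` method, the `"regexp"` value for ptype and the sample events are assumptions made for illustration; the field names type, ptype, pattern, desc and action are the ones named in the text.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    type: str     # kind of rule, e.g. "single"
    ptype: str    # how `pattern` is interpreted (assumed: "regexp" or substring)
    pattern: str  # what identifies an input event
    desc: str     # human-readable description
    action: str   # manner of alarm, e.g. "email"

    def matches(self, event):
        if self.ptype == "regexp":
            return re.search(self.pattern, event) is not None
        return self.pattern in event  # fall back to a plain substring test

rule = Rule(type="single", ptype="regexp",
            pattern=r"Failed password for \S+",
            desc="SSH login failure", action="email")

events = [
    "sshd[811]: Failed password for root from 10.0.0.9 port 22 ssh2",
    "cron[42]: session opened for user backup",
]
# Keep only events that match the rule; the rest are treated as redundant.
alerts = [(rule.action, event) for event in events if rule.matches(event)]
print(alerts)
```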

2. MAP stage

Tracking events in real time makes it possible to discover relationships between different events, and it is a common way of deriving higher-level knowledge from log information. Because the number of events occurring on the network is large, the system must decide, among these thousands of events, which to consider and which to skip, in order to avoid unnecessary processing.

3. Reduce stage

The detected attacks are counted in real time.

The system provided by this patent is mainly implemented by three files, namely main.py, logcorrelator.conf and logWatcher.conf. They are briefly introduced below:

1、main.py

Execution starts from main.py. First, main() reads the configuration file logcorrelator.conf and loads the rules into memory. After the configuration file has been read, events that match the rules are searched for. When an event is found to match a rule, the action for that event (e.g., the manner of alarm) is looked up.

2、logcorrelator.conf

The function initFromConf() is called in main() to initialize the system by reading the configuration file.

The initFromConf() procedure is as follows:

    import configparser

    matchers = {}      # rule (section) name -> match pattern
    actions = {}       # rule (section) name -> alarm action
    failed_limit = 0   # threshold taken from the "windows" option of Rule4

    def initFromConf():
        """Load the rules from logcorrelator.conf into memory."""
        global failed_limit
        config = configparser.ConfigParser()
        config.read("logcorrelator.conf")
        for section in config.sections():
            for option in config.options(section):
                if option == "match":
                    matchers[section] = config.get(section, option)
                if option == "windows":
                    if section == "Rule4":
                        failed_limit = int(config.get(section, option))
                        print("failed:" + str(failed_limit))
                if option == "action":
                    actions[section] = config.get(section, option)
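To see what such a configuration might look like, the following sketch parses a small in-memory rule file with the same option names (match, windows, action). The rule contents in `SAMPLE_CONF` are invented for illustration and are not the patent's actual logcorrelator.conf; only the option names and the section name Rule4 come from the code above.

```python
import configparser

# Hypothetical configuration text, assumed for this sketch.
SAMPLE_CONF = """
[Rule1]
match = Accepted password
action = log

[Rule4]
match = Failed password
windows = 5
action = email
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONF)

# Build the same lookup tables that initFromConf() fills in.
matchers = {s: config.get(s, "match") for s in config.sections()}
actions = {s: config.get(s, "action") for s in config.sections()}
failed_limit = config.getint("Rule4", "windows")

print(matchers)
print(actions)
print(failed_limit)
```

Each section becomes one rule, keyed by its section name, which is why the lookup tables are indexed by `section` in the procedure above.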

The working process of a rule is roughly as follows:

The rule considered here has type single; it checks for the character string of an accepted password. The Continue field specifies where processing continues after a pattern has matched: once an event has matched a rule, the configuration file is immediately searched for the next rule (the rule named in the Continue field). When an event matches the rule, the corresponding action is executed immediately (this particular rule has no action), and successful password logins over SSH connections are searched for.

3、logWatcher.conf

This is an auxiliary file for real-time analysis. Once main() has read the current log file, it hands control to this file, which then polls for new log data in real time and applies the same rules and actions to the new logs. In this way the three files main.py, logcorrelator.conf and logWatcher.conf are interrelated and run simultaneously to accomplish the task of detecting security attacks.
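The polling behaviour described above can be sketched as reading only what has been appended to a log file since the previous check. The function `read_new_lines`, the offset bookkeeping and the temporary `auth.log` file are assumptions for illustration; the patent does not show how logWatcher is implemented.

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (lines appended since `offset`, new offset)."""
    with open(path, "r", encoding="utf-8") as handle:
        handle.seek(offset)
        lines = handle.read().splitlines()
        return lines, handle.tell()

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "auth.log")  # hypothetical log file
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("first event\n")
    first, offset = read_new_lines(log_path, 0)

    # Simulate a device appending a new log line between two polls.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("second event\n")
    second, offset = read_new_lines(log_path, offset)

print(first, second)
```

A long-running watcher would call `read_new_lines` in a loop with a short sleep between polls and pass each returned line through the same rules and actions that main() applied to the existing log.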

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention; all equivalent changes and modifications made according to the present invention are considered to be covered by the scope of the present invention.

Claims

1. A system for detecting security attacks in real time based on artificial intelligence and MapReduce, characterized in that a binary-based data structure and algorithm use the MapReduce big data architecture, and three files, namely main.py, logcorrelator.conf and logWatcher.conf, are interrelated and run simultaneously to accomplish the task of detecting security attacks;

the system also comprises a preprocessing stage, a MAP stage and a Reduce stage;

in the preprocessing stage, two data structures attName and attFreq are created, wherein attName is used for storing the names of the various security attacks and attFreq is used for storing the number/frequency of occurrences of each detected attack and of each combination of attacks; the size of the attFreq array depends on the number n of attacks that have been detected and corresponds to the number of possible attack combinations; assuming that attacks A, B and C are stored in the array attName at positions 1, 2 and 3, respectively, if both A and C attacks are found in the log, the index of this combination in the array attFreq is 5, which is determined by binary translation: in this case A and C are 101 in binary, which is the binary value of decimal 5, so the index of attFreq is determined by $2^{0} + 2^{2} = 5$; the preprocessing stage comprises a log real-time acquisition module and a log real-time analysis module;

the MAP stage begins execution of an artificial intelligence algorithm by scanning the input JSON variable, which contains variables corresponding to log fields, against a series of features used to identify different attack patterns; for each attack detected in the log, the corresponding ID is found in attName and is used to determine the corresponding attFreq index through the formula named 'Loc', $Loc = \sum 2^{i-1}$ taken over the detected attacks, where i is the attack index in attName; the MAP stage comprises a real-time event tracking module and a real-time attack detection module;

in the Reduce stage, an addition operation is performed in parallel on the data output by each MAP, the array attFreq stores the result after the Reduce method has executed, the frequencies can be sorted and the indices in the array can be ordered from high to low, and the Reduce stage comprises a real-time attack statistics module;

the log real-time acquisition module and the log real-time analysis module are used for converting the original log into JSON format through preprocessing written in the Python language;

the real-time event tracking module and the real-time attack detection module track system events, their dependencies and their scenarios in real time by implementing an artificial intelligence algorithm, learn the normal behavior of the system in real time and detect security attacks in real time;

the real-time attack statistics module is used for counting, in real time, the various attacks and the number of times each occurs;

the attacks and the times of occurrence thereof are stored in binary data structures attName and attFreq, respectively.