CN112783852A - Network security analysis system based on big data - Google Patents

Network security analysis system based on big data

Info

Publication number
CN112783852A
CN112783852A (Application CN202110043769.1A)
Authority
CN
China
Prior art keywords
data
module
analysis
value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110043769.1A
Other languages
Chinese (zh)
Inventor
蒋丹阳
钱承山
孙宁
毛伟民
茹清晨
王彭辉
赵贤
宗文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110043769.1A priority Critical patent/CN112783852A/en
Publication of CN112783852A publication Critical patent/CN112783852A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416 Event detection, e.g. attack signature detection

Abstract

The invention discloses a network security analysis system based on big data, belonging to the technical field of network security. The system can collect massive data of different types, meets the real-time requirements of business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, real-time querying with Cloudera Impala is added, so that data can be queried directly from HDFS or HBase using SELECT, JOIN and statistical functions, greatly reducing latency. Compared with the original MapReduce-based Hive SQL queries, the query speed is increased by 3 to 90 times. Apache Kylin is used as the big data analysis engine; its query speed is superior to Hive, latency is reduced, and the working efficiency of the system is improved. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system.

Description

Network security analysis system based on big data
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a network security analysis system based on big data.
Background
At present, with the progress of science and technology, the Internet has become an important auxiliary tool in people's life and work. While it has profoundly changed the way people live, it has also brought the problem of network security.
In the big data era, enterprises attach increasing importance to cooperative business and their business scale gradually expands, so business communication with other enterprises depends on computer network systems. In this process, if corresponding defensive measures are not taken, the system is easily attacked by viruses, and data may be stolen or even damaged.
Faced with various network security problems, a network analysis system can detect, analyze and diagnose all data transmitted in the network, helping users eliminate network incidents, avoid security risks, improve network performance and increase network availability. However, with the growth of network data, traditional data processing technologies can no longer efficiently handle the increasing volume of heterogeneous data.
Disclosure of Invention
Purpose of the invention: in order to improve the processing efficiency of the massive and irregular information in current networks, the invention aims to provide a network security analysis system based on big data.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
The network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module and a model fusion module which are sequentially connected through a network; the data acquisition module and the data preprocessing module communicate data through HTTP (HyperText Transfer Protocol);
the data acquisition module collects log information using Chukwa + Scribe, Spark and GBase processing modes; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result; Cloudera Impala is used to realize real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module; this module adopts the HDFS distributed file system and provides the underlying support for file operations and distributed storage; the NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between Blocks and DataNodes; in an HDFS cluster, the DataNodes are mainly responsible for data storage and management: inside HDFS, data is split into several Blocks, which are stored on multiple DataNodes;
the data analysis module adopts online analytical processing; Apache Kylin is a big data analysis engine that supports second-level OLAP queries on very large data sets;
the data analysis module realizes statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data division, computing task scheduling and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
Further, the system also comprises a data visualization module which represents the data in the form of graphs and images.
Furthermore, the ResourceManager of the data analysis module is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs; a neural network is used to analyze the collected data; neural networks have the capabilities of large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning, and are suitable for imprecise and fuzzy information-processing problems in which many factors and conditions must be considered simultaneously.
Furthermore, the data analysis module selects a multilayer neural network for training, using the error back-propagation algorithm; the examples in the training set are processed iteratively, and the error between the predicted value and the true value (target value) after passing through the neural network is compared; the weight of each connection is updated in the backward direction (from the output layer, through the hidden layers, to the input layer) so as to minimize the error;
inputting: d: data set,/(learning rate), a multi-layer forward neural network
And (3) outputting: a trained neural network
Initialization weights (weights) and biases (bias) with random initialization between-1 and 1, or-0.5 and 0.5, with a bias per cell;
for each training instance X, the following steps are performed:
Forward propagation from the input layer, wherein O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n), w_ij is the weight, θ_j is the bias randomly initialized between -1 and 1 or between -0.5 and 0.5, and I_j is the input of the next-layer unit:
I_j = Σ_i w_ij·O_i + θ_j
Nonlinear transformation equation, wherein O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
Backward propagation of the error:
For the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
For a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
Weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
The new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
Bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
The new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
Termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached; each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
Furthermore, the model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system; for a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels; the training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T; for any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i (1 ≤ i ≤ l-1) assigns test sample x to the j-th class (1 ≤ j ≤ k),
Figure BDA0002896342650000041
Furthermore, given a test sample x in the model fusion module, the following (l-1)×k matrix DM is called the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x));
the i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class;
given set of classifiers D ═ D1,D2,...Dl-1P (D) is the power set of D; the blur measure g on D is defined as a function g satisfying two conditions, P (D) → [0,1 [)];
(1)g(φ)=0,g(D)=1;
(2)
Figure BDA0002896342650000043
If it is
Figure BDA0002896342650000044
Then g (A) is less than or equal to g (B);
If for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, then g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i)
In the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density; it has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition; there are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure BDA0002896342650000047
(3)
Figure BDA0002896342650000051
In the above formulas, p_i is the verification accuracy of training model D_i on the verification set; although the three methods give rather different values of the fuzzy density, the influence on the final result is small; the third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models;
Given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
where 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}; the ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)), which ensures that the integral value is not negative; the fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest; finally, the test cases are processed by the trained model to obtain the final analysis result.
Principle of the invention: applying big data technology to the construction of a network security analysis system can effectively improve the system's data acquisition and analysis capability. The application of big data technology shifts network security analysis from structured databases to distributed databases, optimizes the overall performance of the system architecture, reduces cost, and effectively solves the unstable operation of traditional network security analysis systems. Valuable and meaningful information can be mined from massive data, the accuracy, authenticity, timeliness and effectiveness of information processing are ensured, insecure factors in the network can be better identified, and the levels of network security monitoring, defense and management are improved.
In this system, the data acquisition module adopts Chukwa, Spark and GBase processing modes to better collect log information, traffic data and data related to fixed-format services. On top of YARN batch processing, real-time online querying with Cloudera Impala is added, which greatly reduces latency. The Apache Kylin big data analysis engine is used in the data analysis layer, reducing the latency of queries over billions of rows of data in a Hadoop environment. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system. The system solves the problem that the prior art cannot effectively process multi-type massive data, increases the types of data that can be processed, improves the processing efficiency and accuracy of the network security analysis system, has low hardware requirements, and greatly reduces cost.
Beneficial effects: compared with the prior art, the network security analysis system based on big data of the invention can collect massive data of different types, meets the real-time requirements of business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, real-time querying with Cloudera Impala is added, so that data can be queried directly from HDFS or HBase using SELECT, JOIN and statistical functions, greatly reducing latency. Compared with the original MapReduce-based Hive SQL queries, the query speed is increased by 3 to 90 times. Apache Kylin is used as the big data analysis engine; its query speed is superior to Hive, latency is reduced, and the working efficiency of the system is improved. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system.
Drawings
FIG. 1 is a big data based network security analysis system architecture diagram;
FIG. 2 is a schematic diagram of a data preprocessing module.
Detailed Description
The present invention will be further described with reference to the following embodiments.
The network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module and a data visualization module.
The data acquisition module is connected with the data preprocessing module; the data preprocessing module is connected with the data storage module;
the data acquisition module adopts Chukwa + Scribe, Spark and Gbase processing modes to better acquire log information including search engine crawler data, current flow data and data information related to fixed-format services. And a script distributed log system is adopted to perform data distributed standby so as to improve the data acquisition efficiency and quality.
The data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result. Cloudera Impala is used to realize real-time online analysis of the data.
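As an illustration of this real-time online analysis step, the following minimal sketch (not part of the original disclosure) issues a SELECT/JOIN query with a statistical function against Impala from Python using the impyla client; the daemon address and the table names security_logs and ip_blacklist are assumptions made for the example.

# Hypothetical sketch: querying Cloudera Impala for preprocessed security data.
# Host, port and table names are illustrative assumptions.
from impala.dbapi import connect

def query_suspicious_traffic(host="impala-daemon.example", port=21050):
    conn = connect(host=host, port=port)   # connect to an impalad
    cur = conn.cursor()
    # SELECT/JOIN with a statistical function over HDFS/HBase-backed tables
    cur.execute(
        "SELECT l.src_ip, COUNT(*) AS hits "
        "FROM security_logs l JOIN ip_blacklist b ON l.src_ip = b.ip "
        "GROUP BY l.src_ip ORDER BY hits DESC LIMIT 20"
    )
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows

if __name__ == "__main__":
    for src_ip, hits in query_suspicious_traffic():
        print(src_ip, hits)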
The data storage module stores the data processed by the data preprocessing module. This module adopts the HDFS distributed file system and provides the underlying support for file operations and distributed storage. The NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between Blocks and DataNodes. In an HDFS cluster, the DataNodes are mainly responsible for storing and managing data: inside HDFS, data is split into several Blocks, which are stored on multiple DataNodes. HDFS is a highly fault-tolerant system, suitable for deployment on inexpensive machines and very well suited to applications on large-scale data sets.
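The following minimal sketch illustrates how preprocessed records could be written to and read back from HDFS through the WebHDFS interface using the third-party Python hdfs package; the NameNode URL, user name and file path are assumptions made for the example.

# Hypothetical sketch: storing preprocessed data in HDFS via WebHDFS.
# NameNode URL, user and path are illustrative assumptions.
from hdfs import InsecureClient

def store_and_read(records, namenode_url="http://namenode.example:9870"):
    client = InsecureClient(namenode_url, user="hadoop")
    path = "/security/preprocessed/part-0000.csv"
    # HDFS itself splits the file into Blocks and places them on DataNodes;
    # the client only sees a single logical file.
    client.write(path, data="\n".join(records), encoding="utf-8", overwrite=True)
    with client.read(path, encoding="utf-8") as reader:
        return reader.read()

if __name__ == "__main__":
    print(store_and_read(["1611500000,10.0.0.5,443,ALLOW",
                          "1611500001,10.0.0.9,22,DENY"]))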
The data analysis module adopts online analytical processing. Apache Kylin is a big data analysis engine that supports second-level OLAP queries on very large data sets.
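To illustrate how such a second-level OLAP query could be issued, the sketch below posts a SQL statement to Apache Kylin's REST query endpoint; the server address, the default ADMIN/KYLIN credentials, the project name and the table name are assumptions made for the example.

# Hypothetical sketch: a sub-second OLAP aggregation through Apache Kylin's REST API.
# Server address, credentials, project and table names are illustrative assumptions.
import requests

def kylin_query(sql, server="http://kylin.example:7070", project="network_security"):
    resp = requests.post(
        f"{server}/kylin/api/query",
        json={"sql": sql, "project": project, "limit": 1000},
        auth=("ADMIN", "KYLIN"),   # default Kylin credentials; replace in practice
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    rows = kylin_query("SELECT event_type, COUNT(*) FROM security_events GROUP BY event_type")
    print(rows)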
The data analysis module realizes statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data division, computing task scheduling and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
The ResourceManager is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs. A neural network is used to analyze the collected data. Neural networks have the capabilities of large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning, and are suitable for imprecise and fuzzy information-processing problems in which many factors and conditions must be considered simultaneously.
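As a small illustration of how the jobs handled by the ResourceManager and ApplicationMasters could be monitored, the sketch below lists running applications through the YARN ResourceManager REST API; the ResourceManager address is an assumption made for the example.

# Hypothetical sketch: listing analysis jobs via the YARN ResourceManager REST API.
# The ResourceManager address is an illustrative assumption.
import requests

def list_running_apps(rm="http://resourcemanager.example:8088"):
    resp = requests.get(f"{rm}/ws/v1/cluster/apps", params={"states": "RUNNING"}, timeout=10)
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    # Each entry describes one job that an ApplicationMaster is scheduling and tracking.
    return [(a["id"], a["name"], a["progress"]) for a in apps]

if __name__ == "__main__":
    for app_id, name, progress in list_running_apps():
        print(app_id, name, f"{progress:.1f}%")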
In order to improve the learning ability of the neural network, the invention selects a multilayer neural network for training, using the error back-propagation algorithm. The examples in the training set are processed iteratively, and the error between the predicted value and the true value (target value) after passing through the neural network is compared. The weight of each connection is updated in the backward direction (from the output layer, through the hidden layers, to the input layer) so as to minimize the error.
Inputting: d: data set,/(learning rate), a multi-layer forward neural network
And (3) outputting: a trained neural network
Initialization weights (weights) and bias (bias) random initialization is between-1 and 1, or-0.5 and 0.5, with a bias per cell.
For each training instance X, the following steps are performed:
Forward propagation from the input layer:
I_j = Σ_i w_ij·O_i + θ_j
In the formula, O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n); w_ij is the weight and θ_j the bias, randomly initialized between -1 and 1 or between -0.5 and 0.5; I_j is the input of the next-layer unit.
Nonlinear transformation equation: O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
Backward propagation of the error:
For the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
For a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
Weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
The new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
Bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
The new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
Termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached. Each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
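A minimal NumPy sketch of one training pass with the update rules described above (sigmoid transformation, output- and hidden-layer errors, and weight and bias updates scaled by the learning rate l) is given below; the layer sizes, learning rate and sample values are assumptions made for the example.

# Hypothetical sketch: error back-propagation updates as described above.
# Layer sizes, learning rate and sample values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 4, 5, 3, 0.1

# Random initialization between -0.5 and 0.5, one bias per unit.
w_ih = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
w_ho = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
theta_h = rng.uniform(-0.5, 0.5, n_hidden)
theta_o = rng.uniform(-0.5, 0.5, n_out)

def sigmoid(i):
    return 1.0 / (1.0 + np.exp(-i))          # O_j = 1 / (1 + e^(-I_j))

def train_step(x, t):
    global w_ih, w_ho, theta_h, theta_o
    # Forward: I_j = sum_i w_ij * O_i + theta_j, then the nonlinear transformation.
    o_h = sigmoid(x @ w_ih + theta_h)
    o_o = sigmoid(o_h @ w_ho + theta_o)
    # Backward: output-layer error Err_j = O_j (1 - O_j)(T_j - O_j)
    err_o = o_o * (1.0 - o_o) * (t - o_o)
    # Hidden-layer error Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk
    err_h = o_h * (1.0 - o_h) * (w_ho @ err_o)
    # Updates: delta_w_ij = l * Err_j * O_i and delta_theta_j = l * Err_j
    w_ho += lr * np.outer(o_h, err_o)
    w_ih += lr * np.outer(x, err_h)
    theta_o += lr * err_o
    theta_h += lr * err_h
    return o_o

if __name__ == "__main__":
    x = np.array([0.2, 0.7, 0.1, 0.9])        # one training instance X
    t = np.array([1.0, 0.0, 0.0])             # its true (target) value
    for _ in range(200):                       # stop after a preset number of iterations
        out = train_step(x, t)
    print(np.round(out, 3))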
The model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system. For a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels. The training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T. For any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i (1 ≤ i ≤ l-1) assigns test sample x to the j-th class (1 ≤ j ≤ k),
Figure BDA0002896342650000091
Given a test sample x, the following (l-1)×k matrix DM is referred to as the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x)).
The i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class.
Given the set of classifiers D = {D_1, D_2, ..., D_{l-1}}, P(D) is the power set of D. A fuzzy measure g on D is defined as a function g: P(D) → [0,1] satisfying the following two conditions:
(1) g(∅) = 0, g(D) = 1;
(2) for all A, B ⊆ D, if A ⊆ B, then g(A) ≤ g(B).
If for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i)
In the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density. It has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition. There are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure BDA0002896342650000105
(3)
Figure BDA0002896342650000106
In the above formulas, p_i is the verification accuracy of training model D_i on the verification set. Although the three methods give rather different values of the fuzzy density, the influence on the final result is small. The third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models.
Given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
where 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}. The ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)); this ensures that the integral value is not negative. The fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest. Finally, the test cases are processed by the trained model to obtain the final analysis result.
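A minimal sketch of this fusion step is given below: the fuzzy densities g_i are taken as the verification accuracies p_i, λ is solved numerically from λ + 1 = Π(1 + λ·g_i), the measures of the nested sets A_i are built with the λ-rule, and the sample is assigned to the class with the largest Choquet integral; the example densities and decision matrix are assumptions made for the example.

# Hypothetical sketch: Choquet fuzzy-integral fusion of l-1 trained classifiers.
# The fuzzy densities and decision matrix below are illustrative assumptions.
import numpy as np
from scipy.optimize import brentq

def solve_lambda(g):
    # Solve lambda + 1 = prod(1 + lambda * g_i), with lambda > -1 and lambda != 0.
    f = lambda lam: np.prod(1.0 + lam * np.asarray(g)) - (1.0 + lam)
    if abs(sum(g) - 1.0) < 1e-12:              # sum g_i == 1 -> additive measure, lambda = 0
        return 0.0
    if sum(g) > 1.0:                            # root lies in (-1, 0)
        return brentq(f, -1.0 + 1e-9, -1e-9)
    return brentq(f, 1e-9, 1e9)                 # otherwise the root is positive

def choquet_integral(h, g, lam):
    # Choquet integral of h (one column of DM) with respect to the lambda-fuzzy measure.
    order = np.argsort(h)                       # sort so that h(D_1) <= ... <= h(D_{l-1})
    h_sorted, g_sorted = np.asarray(h)[order], np.asarray(g)[order]
    # g(A_i) for nested sets A_i = {D_i, ..., D_{l-1}}, built backwards with the lambda-rule
    g_A = np.empty(len(h_sorted))
    g_A[-1] = g_sorted[-1]
    for i in range(len(h_sorted) - 2, -1, -1):
        g_A[i] = g_sorted[i] + g_A[i + 1] + lam * g_sorted[i] * g_A[i + 1]
    h_prev = np.concatenate(([0.0], h_sorted[:-1]))   # h(D_0) = 0
    return float(np.sum((h_sorted - h_prev) * g_A))

# Example: 3 trained models, with verification accuracies p_i used as fuzzy densities g_i.
g = [0.85, 0.80, 0.75]
lam = solve_lambda(g)
# Decision matrix DM for one test sample x: row i = membership degrees from model D_i.
DM = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1],
               [0.4, 0.4, 0.2]])
scores = [choquet_integral(DM[:, j], g, lam) for j in range(DM.shape[1])]
print("class =", int(np.argmax(scores)), "fuzzy integrals =", np.round(scores, 3))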
The data visualization module represents the data in the form of graphs and images and helps people explore and understand complex data. It is an important way for users to understand complex data and carry out in-depth analysis.
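As a simple illustration of the visualization module, the sketch below renders per-class analysis results as a bar chart with matplotlib; the class labels and counts are assumptions made for the example.

# Hypothetical sketch: presenting fused analysis results as a bar chart.
# Class labels and counts are illustrative assumptions.
import matplotlib.pyplot as plt

def plot_detection_summary(counts):
    labels, values = zip(*sorted(counts.items(), key=lambda kv: -kv[1]))
    plt.figure(figsize=(6, 3))
    plt.bar(labels, values)
    plt.ylabel("detected events")
    plt.title("Network security analysis results")
    plt.tight_layout()
    plt.savefig("detection_summary.png")    # or plt.show() in an interactive setting

if __name__ == "__main__":
    plot_detection_summary({"normal": 9500, "virus": 120, "malware": 45, "intrusion": 18})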
Examples
As shown in fig. 1, the present invention provides a big data based network security analysis system, which includes a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module, and a data visualization module.
First, the system collects massive network data, and each processing link in the design can adopt parallel processing. The processing modes of the acquisition module, such as Chukwa, Spark and GBase, are used to collect log information, traffic data and data related to fixed-format services, respectively.
As shown in fig. 2, the acquisition module sends the collected data to the data preprocessing module, where Informatica PowerCenter performs data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result.
Cloudera Impala performs real-time online analysis on the data. The data preprocessing module sends the processed data to the data storage module. The DataNodes of this module are responsible for storing and managing the data: inside HDFS, the data is split into several Blocks, which are stored on multiple DataNodes. HBase stores semi-structured data. The data analysis layer performs online analytical processing on the collected data. Apache Kylin serves as the big data analysis engine, and the distributed computing framework YARN divides the data, schedules the computing tasks and performs distributed computing. The ResourceManager provides the computing resources needed by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs. The data of each Block is trained with a neural network, and the BP learning process consists of two phases: forward propagation of the signal and backward propagation of the error.
In forward propagation, a sample is fed in at the input layer, processed layer by layer by the hidden layers, and passed to the output layer. If the actual output of the output layer does not match the expected output, the process moves to the error back-propagation stage. In error back propagation, the output error is propagated backward, layer by layer through the hidden layers to the input layer in a certain form, and the error is apportioned to all units of each layer, so that an error signal is obtained for each layer and used as the basis for correcting the weights of the units. This cycle of forward signal propagation and backward error propagation, with the weights adjusted continuously, is repeated until the network output error is reduced to an acceptable level or a preset number of learning iterations is reached, at which point training is finished. The trained neural network model is finally obtained from the training set.
On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them and enhance the fault tolerance of the whole system. The system analyzes data on known network viruses and malware, uses these data to find the weight relation between input and output, performs simulation with this relation, and finally outputs the simulation result, which is sent to the data visualization module.
Visualization quickly and effectively simplifies and refines the data flow through interactive visual representations, and the large amount of data filtered through user interaction presents complex, massive data analysis results to users clearly. When an attack within the detected range occurs, the intrusion detection system can quickly identify the attack and react.
The above description is only a preferred embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and these modifications and variations should also be construed as the scope of the present invention.

Claims (7)

1. A big data-based network security analysis system, characterized in that: the system comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module and a model fusion module which are sequentially connected through a network; the data acquisition module and the data preprocessing module communicate data through HTTP (HyperText Transfer Protocol);
the data acquisition module collects log information using Chukwa + Scribe, Spark and GBase processing modes; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result; Cloudera Impala is used to realize real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module;
the data analysis module adopts an online analysis processing mode; apache Kylin is a big data analysis engine which supports second-level OLAP query on a super-large data set;
the data analysis module realizes statistical analysis and mining analysis of data, and a distributed computing framework YARN is adopted for data division, computing task scheduling and distributed computing.
2. The big-data based network security analysis system of claim 1, wherein: the system also comprises a data visualization module which represents the data in the form of a graphic image.
3. The big-data based network security analysis system of claim 1, wherein: the ResourceManager of the data analysis module is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs; a neural network is used to analyze the collected data.
4. The big-data based network security analysis system of claim 1, wherein: the data analysis module selects a multilayer neural network for training, using the error back-propagation algorithm; the examples in the training set are processed iteratively, and the error between the predicted value and the true value after passing through the neural network is compared; the weight of each connection is updated in the backward direction so as to minimize the error;
Input: D: data set, l (learning rate), a multilayer feed-forward neural network
Output: a trained neural network;
initialize the weights and biases: random initialization is between -1 and 1, or between -0.5 and 0.5, and each unit has a bias;
for each training instance X, the following steps are performed:
forward propagation from the input layer, wherein O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n), w_ij is the weight, θ_j is the bias randomly initialized between -1 and 1 or between -0.5 and 0.5, and I_j is the input of the next-layer unit:
I_j = Σ_i w_ij·O_i + θ_j
nonlinear transformation equation, wherein O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
backward propagation of the error:
for the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
for a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
the new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
the new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached; each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
5. The big-data based network security analysis system of claim 4, wherein: the model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system; for a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels; the training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T; for any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i, 1 ≤ i ≤ l-1, assigns test sample x to the j-th class, 1 ≤ j ≤ k,
Figure FDA0002896342640000031
6. The big-data based network security analysis system of claim 5, wherein: given a test sample x in the model fusion module, the following (l-1)×k matrix DM is called the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x));
the i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class;
given the set of classifiers D = {D_1, D_2, ..., D_{l-1}}, P(D) is the power set of D; a fuzzy measure g on D is defined as a function g: P(D) → [0,1] satisfying the following two conditions:
(1) g(∅) = 0, g(D) = 1;
(2) for all A, B ⊆ D, if A ⊆ B, then g(A) ≤ g(B);
if for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, then g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i);
in the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density; it has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition; there are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure FDA0002896342640000041
(3)
Figure FDA0002896342640000042
in the above formulas, p_i is the verification accuracy of training model D_i on the verification set; although the three methods give rather different values of the fuzzy density, the influence on the final result is small; the third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models;
given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
wherein 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}; the ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)), which ensures that the integral value is not negative; the fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest; finally, the test cases are processed by the trained model to obtain the final analysis result.
7. The big-data based network security analysis system of claim 1, wherein: the data storage module adopts an HDFS distributed file system and provides a bottom layer support for file operation and distributed storage for the system; the NameNode is used as a main server in the HDFS and manages all metadata information of the HDFS file system and mapping relation information of Block blocks and data nodes; in an HDFS cluster, a DataNode is mainly responsible for storing and managing data, the data is divided into several blocks inside the HDFS, and the blocks are stored on a plurality of data nodes DataNode.
CN202110043769.1A 2021-01-13 2021-01-13 Network security analysis system based on big data Pending CN112783852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110043769.1A CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110043769.1A CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Publications (1)

Publication Number Publication Date
CN112783852A true CN112783852A (en) 2021-05-11

Family

ID=75755740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110043769.1A Pending CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Country Status (1)

Country Link
CN (1) CN112783852A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581188A (en) * 2013-11-05 2014-02-12 中国科学院计算技术研究所 Network security situation forecasting method and system
CN109684352A (en) * 2018-12-29 2019-04-26 江苏满运软件科技有限公司 Data analysis system, method, storage medium and electronic equipment
CN109885562A (en) * 2019-01-17 2019-06-14 安徽谛听信息科技有限公司 A kind of big data intelligent analysis system based on cyberspace safety

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VIKASH KUMAR GARG et al.: "Dynamic System for Performance Analysis of Information Interchange", Journal of Advancements in Robotics, vol. 6, no. 1, pages 8-14 *
王婷婷: "Research on Big Data Classification Based on MapReduce and Restricted Boltzmann Machines", China Master's Theses Full-text Database, Information Science and Technology, no. 01, pages 17-20 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076810A (en) * 2023-10-12 2023-11-17 睿至科技集团有限公司 Internet big data processing system and method based on artificial intelligence


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination