CN112783852A - Network security analysis system based on big data - Google Patents

Network security analysis system based on big data

Info

Publication number
CN112783852A
CN112783852A (Application CN202110043769.1A)
Authority
CN
China
Prior art keywords
data
module
analysis
value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110043769.1A
Other languages
Chinese (zh)
Inventor
蒋丹阳
钱承山
孙宁
毛伟民
茹清晨
王彭辉
赵贤
宗文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110043769.1A priority Critical patent/CN112783852A/en
Publication of CN112783852A publication Critical patent/CN112783852A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416 Event detection, e.g. attack signature detection

Abstract

The invention discloses a network security analysis system based on big data, belonging to the technical field of network security. The system can collect massive data of different types, meets the real-time requirements of business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, real-time querying with Cloudera Impala is added, so that data can be queried directly from HDFS or HBase using SELECT, JOIN and statistical functions, greatly reducing latency. Compared with the original MapReduce-based Hive SQL queries, the query speed is increased by 3 to 90 times. Apache Kylin is used as the big data analysis engine; its query speed is superior to Hive, latency is reduced, and the working efficiency of the system is improved. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system.

Description

Network security analysis system based on big data
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a network security analysis system based on big data.
Background
At present, with the progress of science and technology, the Internet has become an important auxiliary tool in people's life and work. While it has profoundly changed the way people live, it has also brought the problem of network security.
In the big data era, enterprises attach increasing importance to cooperative business and their business scale gradually expands, so business communication with other enterprises depends on computer network systems. In this process, if corresponding defensive measures are not taken, the system is easily attacked by viruses, and data may be stolen or even damaged.
Faced with various network security problems, a network analysis system can detect, analyze and diagnose all data transmitted in the network, helping users eliminate network incidents, avoid security risks, improve network performance and increase network availability. However, with the growth of network data, traditional data processing technologies can no longer efficiently handle the increasing volume of heterogeneous data.
Disclosure of Invention
Purpose of the invention: in order to improve the processing efficiency of the massive and irregular information in current networks, the invention aims to provide a network security analysis system based on big data.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
The network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module and a model fusion module which are sequentially connected through a network; the data acquisition module and the data preprocessing module communicate data through HTTP (HyperText Transfer Protocol);
the data acquisition module collects log information using Chukwa + Scribe, Spark and GBase processing modes; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result; Cloudera Impala is used to realize real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module; this module adopts the HDFS distributed file system and provides the underlying support for file operations and distributed storage; the NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between Blocks and DataNodes; in an HDFS cluster, the DataNodes are mainly responsible for data storage and management: inside HDFS, data is split into several Blocks, which are stored on multiple DataNodes;
the data analysis module adopts online analytical processing; Apache Kylin is a big data analysis engine that supports second-level OLAP queries on very large data sets;
the data analysis module realizes statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data division, computing task scheduling and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
Further, the system also comprises a data visualization module which represents the data in the form of graphs and images.
Furthermore, the ResourceManager of the data analysis module is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs; a neural network is used to analyze the collected data; neural networks have the capabilities of large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning, and are suitable for imprecise and fuzzy information-processing problems in which many factors and conditions must be considered simultaneously.
Furthermore, the data analysis module selects a multilayer neural network for training, using the error back-propagation algorithm; the examples in the training set are processed iteratively, and the error between the predicted value and the true value (target value) after passing through the neural network is compared; the weight of each connection is updated in the backward direction (from the output layer, through the hidden layers, to the input layer) so as to minimize the error;
inputting: d: data set,/(learning rate), a multi-layer forward neural network
And (3) outputting: a trained neural network
Initialization weights (weights) and biases (bias) with random initialization between-1 and 1, or-0.5 and 0.5, with a bias per cell;
for each training instance X, the following steps are performed:
Forward propagation from the input layer, wherein O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n), w_ij is the weight, θ_j is the bias randomly initialized between -1 and 1 or between -0.5 and 0.5, and I_j is the input of the next-layer unit:
I_j = Σ_i w_ij·O_i + θ_j
Nonlinear transformation equation, wherein O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
Backward propagation of the error:
For the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
For a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
Weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
The new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
Bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
The new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
Termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached; each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
Furthermore, the model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system; for a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels; the training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T; for any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i (1 ≤ i ≤ l-1) assigns test sample x to the j-th class (1 ≤ j ≤ k),
Figure BDA0002896342650000041
Furthermore, given a test sample x in the model fusion module, the following (l-1)×k matrix DM is called the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x));
the i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class;
given set of classifiers D ═ D1,D2,...Dl-1P (D) is the power set of D; the blur measure g on D is defined as a function g satisfying two conditions, P (D) → [0,1 [)];
(1)g(φ)=0,g(D)=1;
(2)
Figure BDA0002896342650000043
If it is
Figure BDA0002896342650000044
Then g (A) is less than or equal to g (B);
If for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, then g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i)
In the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density; it has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition; there are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure BDA0002896342650000047
(3)
Figure BDA0002896342650000051
In the above formulas, p_i is the verification accuracy of training model D_i on the verification set; although the three methods give rather different values of the fuzzy density, the influence on the final result is small; the third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models;
Given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
where 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}; the ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)), which ensures that the integral value is not negative; the fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest; finally, the test cases are processed by the trained model to obtain the final analysis result.
Principle of the invention: applying big data technology to the construction of a network security analysis system can effectively improve the system's data acquisition and analysis capability. The application of big data technology shifts network security analysis from structured databases to distributed databases, optimizes the overall performance of the system architecture, reduces cost, and effectively solves the unstable operation of traditional network security analysis systems. Valuable and meaningful information can be mined from massive data, the accuracy, authenticity, timeliness and effectiveness of information processing are ensured, insecure factors in the network can be better identified, and the levels of network security monitoring, defense and management are improved.
In this system, the data acquisition module adopts Chukwa, Spark and GBase processing modes to better collect log information, traffic data and data related to fixed-format services. On top of YARN batch processing, real-time online querying with Cloudera Impala is added, which greatly reduces latency. The Apache Kylin big data analysis engine is used in the data analysis layer, reducing the latency of queries over billions of rows of data in a Hadoop environment. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system. The system solves the problem that the prior art cannot effectively process multi-type massive data, increases the types of data that can be processed, improves the processing efficiency and accuracy of the network security analysis system, has low hardware requirements, and greatly reduces cost.
Beneficial effects: compared with the prior art, the network security analysis system based on big data of the invention can collect massive data of different types, meets the real-time requirements of business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, real-time querying with Cloudera Impala is added, so that data can be queried directly from HDFS or HBase using SELECT, JOIN and statistical functions, greatly reducing latency. Compared with the original MapReduce-based Hive SQL queries, the query speed is increased by 3 to 90 times. Apache Kylin is used as the big data analysis engine; its query speed is superior to Hive, latency is reduced, and the working efficiency of the system is improved. On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which improves the data analysis effect and enhances the fault tolerance of the whole system.
Drawings
FIG. 1 is a big data based network security analysis system architecture diagram;
FIG. 2 is a schematic diagram of a data preprocessing module.
Detailed Description
The present invention will be further described with reference to the following embodiments.
The network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module and a data visualization module.
The data acquisition module is connected with the data preprocessing module; the data preprocessing module is connected with the data storage module;
the data acquisition module adopts Chukwa + Scribe, Spark and Gbase processing modes to better acquire log information including search engine crawler data, current flow data and data information related to fixed-format services. And a script distributed log system is adopted to perform data distributed standby so as to improve the data acquisition efficiency and quality.
The data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result. Cloudera Impala is used to realize real-time online analysis of the data.
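As an illustration of this real-time online analysis step, the following minimal sketch (not part of the original disclosure) issues a SELECT/JOIN query with a statistical function against Impala from Python using the impyla client; the daemon address and the table names security_logs and ip_blacklist are assumptions made for the example.

# Hypothetical sketch: querying Cloudera Impala for preprocessed security data.
# Host, port and table names are illustrative assumptions.
from impala.dbapi import connect

def query_suspicious_traffic(host="impala-daemon.example", port=21050):
    conn = connect(host=host, port=port)   # connect to an impalad
    cur = conn.cursor()
    # SELECT/JOIN with a statistical function over HDFS/HBase-backed tables
    cur.execute(
        "SELECT l.src_ip, COUNT(*) AS hits "
        "FROM security_logs l JOIN ip_blacklist b ON l.src_ip = b.ip "
        "GROUP BY l.src_ip ORDER BY hits DESC LIMIT 20"
    )
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows

if __name__ == "__main__":
    for src_ip, hits in query_suspicious_traffic():
        print(src_ip, hits)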
The data storage module stores the data processed by the data preprocessing module. This module adopts the HDFS distributed file system and provides the underlying support for file operations and distributed storage. The NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between Blocks and DataNodes. In an HDFS cluster, the DataNodes are mainly responsible for storing and managing data: inside HDFS, data is split into several Blocks, which are stored on multiple DataNodes. HDFS is a highly fault-tolerant system, suitable for deployment on inexpensive machines and very well suited to applications on large-scale data sets.
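The following minimal sketch illustrates how preprocessed records could be written to and read back from HDFS through the WebHDFS interface using the third-party Python hdfs package; the NameNode URL, user name and file path are assumptions made for the example.

# Hypothetical sketch: storing preprocessed data in HDFS via WebHDFS.
# NameNode URL, user and path are illustrative assumptions.
from hdfs import InsecureClient

def store_and_read(records, namenode_url="http://namenode.example:9870"):
    client = InsecureClient(namenode_url, user="hadoop")
    path = "/security/preprocessed/part-0000.csv"
    # HDFS itself splits the file into Blocks and places them on DataNodes;
    # the client only sees a single logical file.
    client.write(path, data="\n".join(records), encoding="utf-8", overwrite=True)
    with client.read(path, encoding="utf-8") as reader:
        return reader.read()

if __name__ == "__main__":
    print(store_and_read(["1611500000,10.0.0.5,443,ALLOW",
                          "1611500001,10.0.0.9,22,DENY"]))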
The data analysis module adopts online analytical processing. Apache Kylin is a big data analysis engine that supports second-level OLAP queries on very large data sets.
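To illustrate how such a second-level OLAP query could be issued, the sketch below posts a SQL statement to Apache Kylin's REST query endpoint; the server address, the default ADMIN/KYLIN credentials, the project name and the table name are assumptions made for the example.

# Hypothetical sketch: a sub-second OLAP aggregation through Apache Kylin's REST API.
# Server address, credentials, project and table names are illustrative assumptions.
import requests

def kylin_query(sql, server="http://kylin.example:7070", project="network_security"):
    resp = requests.post(
        f"{server}/kylin/api/query",
        json={"sql": sql, "project": project, "limit": 1000},
        auth=("ADMIN", "KYLIN"),   # default Kylin credentials; replace in practice
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    rows = kylin_query("SELECT event_type, COUNT(*) FROM security_events GROUP BY event_type")
    print(rows)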
The data analysis module realizes statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data division, computing task scheduling and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
The ResourceManager is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs. A neural network is used to analyze the collected data. Neural networks have the capabilities of large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning, and are suitable for imprecise and fuzzy information-processing problems in which many factors and conditions must be considered simultaneously.
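As a small illustration of how the jobs handled by the ResourceManager and ApplicationMasters could be monitored, the sketch below lists running applications through the YARN ResourceManager REST API; the ResourceManager address is an assumption made for the example.

# Hypothetical sketch: listing analysis jobs via the YARN ResourceManager REST API.
# The ResourceManager address is an illustrative assumption.
import requests

def list_running_apps(rm="http://resourcemanager.example:8088"):
    resp = requests.get(f"{rm}/ws/v1/cluster/apps", params={"states": "RUNNING"}, timeout=10)
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    # Each entry describes one job that an ApplicationMaster is scheduling and tracking.
    return [(a["id"], a["name"], a["progress"]) for a in apps]

if __name__ == "__main__":
    for app_id, name, progress in list_running_apps():
        print(app_id, name, f"{progress:.1f}%")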
In order to improve the learning ability of the neural network, the invention selects a multilayer neural network for training, using the error back-propagation algorithm. The examples in the training set are processed iteratively, and the error between the predicted value and the true value (target value) after passing through the neural network is compared. The weight of each connection is updated in the backward direction (from the output layer, through the hidden layers, to the input layer) so as to minimize the error.
Inputting: d: data set,/(learning rate), a multi-layer forward neural network
And (3) outputting: a trained neural network
Initialization weights (weights) and bias (bias) random initialization is between-1 and 1, or-0.5 and 0.5, with a bias per cell.
For each training instance X, the following steps are performed:
Forward propagation from the input layer:
I_j = Σ_i w_ij·O_i + θ_j
In the formula, O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n); w_ij is the weight and θ_j the bias, randomly initialized between -1 and 1 or between -0.5 and 0.5; I_j is the input of the next-layer unit.
Nonlinear transformation equation: O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
Backward propagation of the error:
For the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
For a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
Weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
The new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
Bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
The new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
Termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached. Each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
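A minimal NumPy sketch of one training pass with the update rules described above (sigmoid transformation, output- and hidden-layer errors, and weight and bias updates scaled by the learning rate l) is given below; the layer sizes, learning rate and sample values are assumptions made for the example.

# Hypothetical sketch: error back-propagation updates as described above.
# Layer sizes, learning rate and sample values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 4, 5, 3, 0.1

# Random initialization between -0.5 and 0.5, one bias per unit.
w_ih = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
w_ho = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
theta_h = rng.uniform(-0.5, 0.5, n_hidden)
theta_o = rng.uniform(-0.5, 0.5, n_out)

def sigmoid(i):
    return 1.0 / (1.0 + np.exp(-i))          # O_j = 1 / (1 + e^(-I_j))

def train_step(x, t):
    global w_ih, w_ho, theta_h, theta_o
    # Forward: I_j = sum_i w_ij * O_i + theta_j, then the nonlinear transformation.
    o_h = sigmoid(x @ w_ih + theta_h)
    o_o = sigmoid(o_h @ w_ho + theta_o)
    # Backward: output-layer error Err_j = O_j (1 - O_j)(T_j - O_j)
    err_o = o_o * (1.0 - o_o) * (t - o_o)
    # Hidden-layer error Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk
    err_h = o_h * (1.0 - o_h) * (w_ho @ err_o)
    # Updates: delta_w_ij = l * Err_j * O_i and delta_theta_j = l * Err_j
    w_ho += lr * np.outer(o_h, err_o)
    w_ih += lr * np.outer(x, err_h)
    theta_o += lr * err_o
    theta_h += lr * err_h
    return o_o

if __name__ == "__main__":
    x = np.array([0.2, 0.7, 0.1, 0.9])        # one training instance X
    t = np.array([1.0, 0.0, 0.0])             # its true (target) value
    for _ in range(200):                       # stop after a preset number of iterations
        out = train_step(x, t)
    print(np.round(out, 3))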
The model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system. For a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels. The training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T. For any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i (1 ≤ i ≤ l-1) assigns test sample x to the j-th class (1 ≤ j ≤ k),
Figure BDA0002896342650000091
Given a test sample x, the following (l-1)×k matrix DM is referred to as the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x)).
The i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class.
Given the set of classifiers D = {D_1, D_2, ..., D_{l-1}}, P(D) is the power set of D. A fuzzy measure g on D is defined as a function g: P(D) → [0,1] satisfying the following two conditions:
(1) g(∅) = 0, g(D) = 1;
(2) for all A, B ⊆ D, if A ⊆ B, then g(A) ≤ g(B).
If for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i)
In the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density. It has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition. There are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure BDA0002896342650000105
(3)
Figure BDA0002896342650000106
In the above formulas, p_i is the verification accuracy of training model D_i on the verification set. Although the three methods give rather different values of the fuzzy density, the influence on the final result is small. The third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models.
Given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
where 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}. The ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)); this ensures that the integral value is not negative. The fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest. Finally, the test cases are processed by the trained model to obtain the final analysis result.
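A minimal sketch of this fusion step is given below: the fuzzy densities g_i are taken as the verification accuracies p_i, λ is solved numerically from λ + 1 = Π(1 + λ·g_i), the measures of the nested sets A_i are built with the λ-rule, and the sample is assigned to the class with the largest Choquet integral; the example densities and decision matrix are assumptions made for the example.

# Hypothetical sketch: Choquet fuzzy-integral fusion of l-1 trained classifiers.
# The fuzzy densities and decision matrix below are illustrative assumptions.
import numpy as np
from scipy.optimize import brentq

def solve_lambda(g):
    # Solve lambda + 1 = prod(1 + lambda * g_i), with lambda > -1 and lambda != 0.
    f = lambda lam: np.prod(1.0 + lam * np.asarray(g)) - (1.0 + lam)
    if abs(sum(g) - 1.0) < 1e-12:              # sum g_i == 1 -> additive measure, lambda = 0
        return 0.0
    if sum(g) > 1.0:                            # root lies in (-1, 0)
        return brentq(f, -1.0 + 1e-9, -1e-9)
    return brentq(f, 1e-9, 1e9)                 # otherwise the root is positive

def choquet_integral(h, g, lam):
    # Choquet integral of h (one column of DM) with respect to the lambda-fuzzy measure.
    order = np.argsort(h)                       # sort so that h(D_1) <= ... <= h(D_{l-1})
    h_sorted, g_sorted = np.asarray(h)[order], np.asarray(g)[order]
    # g(A_i) for nested sets A_i = {D_i, ..., D_{l-1}}, built backwards with the lambda-rule
    g_A = np.empty(len(h_sorted))
    g_A[-1] = g_sorted[-1]
    for i in range(len(h_sorted) - 2, -1, -1):
        g_A[i] = g_sorted[i] + g_A[i + 1] + lam * g_sorted[i] * g_A[i + 1]
    h_prev = np.concatenate(([0.0], h_sorted[:-1]))   # h(D_0) = 0
    return float(np.sum((h_sorted - h_prev) * g_A))

# Example: 3 trained models, with verification accuracies p_i used as fuzzy densities g_i.
g = [0.85, 0.80, 0.75]
lam = solve_lambda(g)
# Decision matrix DM for one test sample x: row i = membership degrees from model D_i.
DM = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1],
               [0.4, 0.4, 0.2]])
scores = [choquet_integral(DM[:, j], g, lam) for j in range(DM.shape[1])]
print("class =", int(np.argmax(scores)), "fuzzy integrals =", np.round(scores, 3))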
The data visualization module represents the data in the form of graphs and images and helps people explore and understand complex data. It is an important way for users to understand complex data and carry out in-depth analysis.
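As a simple illustration of the visualization module, the sketch below renders per-class analysis results as a bar chart with matplotlib; the class labels and counts are assumptions made for the example.

# Hypothetical sketch: presenting fused analysis results as a bar chart.
# Class labels and counts are illustrative assumptions.
import matplotlib.pyplot as plt

def plot_detection_summary(counts):
    labels, values = zip(*sorted(counts.items(), key=lambda kv: -kv[1]))
    plt.figure(figsize=(6, 3))
    plt.bar(labels, values)
    plt.ylabel("detected events")
    plt.title("Network security analysis results")
    plt.tight_layout()
    plt.savefig("detection_summary.png")    # or plt.show() in an interactive setting

if __name__ == "__main__":
    plot_detection_summary({"normal": 9500, "virus": 120, "malware": 45, "intrusion": 18})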
Examples
As shown in fig. 1, the present invention provides a big data based network security analysis system, which includes a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module, and a data visualization module.
First, the system collects massive network data, and each processing link in the design can adopt parallel processing. The processing modes of the acquisition module, such as Chukwa, Spark and GBase, are used to collect log information, traffic data and data related to fixed-format services, respectively.
As shown in fig. 2, the acquisition module sends the collected data to the data preprocessing module, where Informatica PowerCenter performs data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result.
Cloudera Impala performs real-time online analysis on the data. The data preprocessing module sends the processed data to the data storage module. The DataNodes of this module are responsible for storing and managing the data: inside HDFS, the data is split into several Blocks, which are stored on multiple DataNodes. HBase stores semi-structured data. The data analysis layer performs online analytical processing on the collected data. Apache Kylin serves as the big data analysis engine, and the distributed computing framework YARN divides the data, schedules the computing tasks and performs distributed computing. The ResourceManager provides the computing resources needed by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs. The data of each Block is trained with a neural network, and the BP learning process consists of two phases: forward propagation of the signal and backward propagation of the error.
In forward propagation, a sample is fed in at the input layer, processed layer by layer by the hidden layers, and passed to the output layer. If the actual output of the output layer does not match the expected output, the process moves to the error back-propagation stage. In error back propagation, the output error is propagated backward, layer by layer through the hidden layers to the input layer in a certain form, and the error is apportioned to all units of each layer, so that an error signal is obtained for each layer and used as the basis for correcting the weights of the units. This cycle of forward signal propagation and backward error propagation, with the weights adjusted continuously, is repeated until the network output error is reduced to an acceptable level or a preset number of learning iterations is reached, at which point training is finished. The trained neural network model is finally obtained from the training set.
On the basis of the neural network training models, the Choquet fuzzy integral fusion algorithm is applied to integrate them and enhance the fault tolerance of the whole system. The system analyzes data on known network viruses and malware, uses these data to find the weight relation between input and output, performs simulation with this relation, and finally outputs the simulation result, which is sent to the data visualization module.
Visualization quickly and effectively simplifies and refines the data flow through interactive visual representations, and the large amount of data filtered through user interaction presents complex, massive data analysis results to users clearly. When an attack within the detected range occurs, the intrusion detection system can quickly identify the attack and react.
The above description is only a preferred embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and these modifications and variations should also be construed as the scope of the present invention.

Claims (7)

1. A big data-based network security analysis system, characterized in that: the system comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module and a model fusion module which are sequentially connected through a network; the data acquisition module and the data preprocessing module communicate data through HTTP (HyperText Transfer Protocol);
the data acquisition module collects log information using Chukwa + Scribe, Spark and GBase processing modes; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation and data reduction on the collected raw data to obtain the processing result; Cloudera Impala is used to realize real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module;
the data analysis module adopts an online analysis processing mode; apache Kylin is a big data analysis engine which supports second-level OLAP query on a super-large data set;
the data analysis module realizes statistical analysis and mining analysis of data, and a distributed computing framework YARN is adopted for data division, computing task scheduling and distributed computing.
2. The big-data based network security analysis system of claim 1, wherein: the system also comprises a data visualization module which represents the data in the form of a graphic image.
3. The big-data based network security analysis system of claim 1, wherein: the ResourceManager of the data analysis module is responsible for the computing resources required by the application programs, and the ApplicationMaster is responsible for scheduling, tracking and monitoring jobs; a neural network is used to analyze the collected data.
4. The big-data based network security analysis system of claim 1, wherein: the data analysis module selects a multilayer neural network for training, using the error back-propagation algorithm; the examples in the training set are processed iteratively, and the error between the predicted value and the true value after passing through the neural network is compared; the weight of each connection is updated in the backward direction so as to minimize the error;
Input: D: data set, l (learning rate), a multilayer feed-forward neural network
Output: a trained neural network;
initialize the weights and biases: random initialization is between -1 and 1, or between -0.5 and 0.5, and each unit has a bias;
for each training instance X, the following steps are performed:
forward propagation from the input layer, wherein O_i is the output value of each unit in the previous layer (for the input layer, the O_i are the inputs x_1, x_2, ..., x_n), w_ij is the weight, θ_j is the bias randomly initialized between -1 and 1 or between -0.5 and 0.5, and I_j is the input of the next-layer unit:
I_j = Σ_i w_ij·O_i + θ_j
nonlinear transformation equation, wherein O_j, the nonlinear transformation of the input of the next-layer unit, serves as the input of the next layer:
O_j = 1 / (1 + e^(-I_j))
backward propagation of the error:
for the output layer, with T_j the true value and Err_j the output-layer error:
Err_j = O_j(1 - O_j)(T_j - O_j)
for a hidden layer, with Err_k the error propagated back from the following layer and w_jk the corresponding weight:
Err_j = O_j(1 - O_j)·Σ_k Err_k·w_jk
weight update, with Δw_ij the weight update amount and l the learning rate taking a value in 0-1:
Δw_ij = (l)·Err_j·O_i
the new weight w_ij(new) equals the sum of the previous weight and the weight update amount:
w_ij(new) = w_ij + Δw_ij
bias update, with Δθ_j the bias update amount:
Δθ_j = (l)·Err_j
the new bias θ_j(new) equals the sum of the previous bias and the bias update amount:
θ_j(new) = θ_j + Δθ_j
termination conditions: the weight updates fall below a certain threshold, the predicted error rate falls below a certain threshold, or a preset number of iterations is reached; each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
5. The big-data based network security analysis system of claim 4, wherein: the model fusion module uses the Choquet fuzzy integral fusion operator to integrate the neural network training models, which improves the data analysis effect and enhances the fault tolerance of the whole system; for a given training set T, Ω = {ω_1, ω_2, ..., ω_k} is the set of class labels; the training set is divided into l parts, and D = {D_1, D_2, ..., D_{l-1}} is the set of l-1 classifiers, i.e. models, trained from T; for any test sample x, D_i(x) = (μ_i1(x), μ_i2(x), ..., μ_ik(x)), where μ_ij(x) ∈ [0,1] is the degree of membership with which classifier D_i, 1 ≤ i ≤ l-1, assigns test sample x to the j-th class, 1 ≤ j ≤ k,
Figure FDA0002896342640000031
6. The big-data based network security analysis system of claim 5, wherein: given a test sample x in the model fusion module, the following (l-1)×k matrix DM is called the decision matrix of x:
DM(x) = [μ_ij(x)], i = 1, ..., l-1, j = 1, ..., k, i.e. the matrix whose i-th row is (μ_i1(x), ..., μ_ik(x));
the i-th row of DM gives the degrees of membership with which classifier D_i assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class;
given the set of classifiers D = {D_1, D_2, ..., D_{l-1}}, P(D) is the power set of D; a fuzzy measure g on D is defined as a function g: P(D) → [0,1] satisfying the following two conditions:
(1) g(∅) = 0, g(D) = 1;
(2) for all A, B ⊆ D, if A ⊆ B, then g(A) ≤ g(B);
if for all A, B ⊆ D with A ∩ B = ∅ the following formula holds, then g is called a λ-fuzzy measure:
g(A∪B) = g(A) + g(B) + λg(A)g(B)
where λ > -1 and λ ≠ 0, and its value is determined by:
λ + 1 = ∏_{i=1}^{l-1} (1 + λ·g_i);
in the formula, g_i, the fuzzy measure on a single training model, is called the fuzzy density; it has been theoretically proved that no matter how many models are integrated, i.e. whatever the value of l-1, there is only one solution satisfying the condition; there are generally the following three methods of determining g_i:
(1) g_i = p_i
(2)
Figure FDA0002896342640000041
(3)
Figure FDA0002896342640000042
in the above formulas, p_i is the verification accuracy of training model D_i on the verification set; although the three methods give rather different values of the fuzzy density, the influence on the final result is small; the third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training models;
given the set of training models D = {D_1, D_2, ..., D_{l-1}} and a fuzzy measure g on D, the Choquet integral of a function h: D → R+ with respect to g is:
(C)∫ h dg = Σ_{i=1}^{l-1} [h(D_i) - h(D_{i-1})]·g(A_i)
wherein 0 ≤ h(D_1) ≤ h(D_2) ≤ ... ≤ h(D_{l-1}) ≤ 1, h(D_0) = 0, and A_i = {D_i, D_{i+1}, ..., D_{l-1}}; the ordering in the equation may also go from large to small, in which case the integrand correspondingly becomes (h(D_{i-1}) - h(D_i)), which ensures that the integral value is not negative; the fuzzy integral is calculated using the l-th part of the data as test samples, and a sample is classified into the class whose corresponding fuzzy integral value is the largest; finally, the test cases are processed by the trained model to obtain the final analysis result.
7. The big-data based network security analysis system of claim 1, wherein: the data storage module adopts an HDFS distributed file system and provides a bottom layer support for file operation and distributed storage for the system; the NameNode is used as a main server in the HDFS and manages all metadata information of the HDFS file system and mapping relation information of Block blocks and data nodes; in an HDFS cluster, a DataNode is mainly responsible for storing and managing data, the data is divided into several blocks inside the HDFS, and the blocks are stored on a plurality of data nodes DataNode.
CN202110043769.1A 2021-01-13 2021-01-13 Network security analysis system based on big data Pending CN112783852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110043769.1A CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110043769.1A CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Publications (1)

Publication Number Publication Date
CN112783852A true CN112783852A (en) 2021-05-11

Family

ID=75755740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110043769.1A Pending CN112783852A (en) 2021-01-13 2021-01-13 Network security analysis system based on big data

Country Status (1)

Country Link
CN (1) CN112783852A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581188A (en) * 2013-11-05 2014-02-12 中国科学院计算技术研究所 Network security situation forecasting method and system
CN109684352A (en) * 2018-12-29 2019-04-26 江苏满运软件科技有限公司 Data analysis system, method, storage medium and electronic equipment
CN109885562A (en) * 2019-01-17 2019-06-14 安徽谛听信息科技有限公司 A kind of big data intelligent analysis system based on cyberspace safety

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VIKASH KUMAR GARG et al.: "Dynamic System for Performance Analysis of Information Interchange", Journal of Advancements in Robotics, vol. 6, no. 1, pages 8-14 *
王婷婷: "Research on Big Data Classification Based on MapReduce and Restricted Boltzmann Machines", China Master's Theses Full-text Database, Information Science and Technology, no. 01, pages 17-20 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076810A (en) * 2023-10-12 2023-11-17 睿至科技集团有限公司 Internet big data processing system and method based on artificial intelligence


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination