CN112783852A - Network security analysis system based on big data - Google Patents
Network security analysis system based on big data
- Publication number
- CN112783852A (application number CN202110043769.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- analysis
- value
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Abstract
The invention discloses a network security analysis system based on big data, belonging to the technical field of network security. The system can acquire massive data of different types, meets the real-time requirements of the business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, Cloudera Impala is added for real-time queries: data can be queried directly from HDFS or HBase using SELECT, JOIN, and statistical functions, which greatly reduces latency. Compared with the original MapReduce-based Hive SQL, query speed is increased by a factor of 3 to 90. Apache Kylin is used as the big data analysis engine; its query speed is superior to that of Hive, which reduces latency and improves the working efficiency of the system. On the basis of the trained neural network models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which can improve the data analysis results and enhance the fault tolerance of the whole system.
Description
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a network security analysis system based on big data.
Background
At present, with the progress of science and technology, the internet has become an important tool in people's life and work. While it has profoundly changed how people live, it has also brought network security problems.
In the big data era, enterprises attach more and more importance to cooperative business, and the scale of business keeps growing. Business communication with other enterprises depends on computer network systems; if corresponding defensive measures are not taken, these systems are easily attacked by viruses, and data may be stolen or even destroyed.
A network analysis system can detect, analyze, and diagnose all data transmitted in the network across a variety of network security problems, helping users eliminate network incidents, avoid security risks, improve network performance, and increase network availability. As the volume of network data grows, traditional data transfer technology cannot efficiently process the increasing amount of data of different types.
Disclosure of Invention
Purpose of the invention: to improve the processing efficiency of the massive, irregular information in current networks, the invention provides a network security analysis system based on big data.
Technical solution: to achieve this purpose, the invention adopts the following technical solution:
the network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, and a model fusion module, which are connected in sequence through a network; the data acquisition module and the data preprocessing module communicate via HTTP (Hypertext Transfer Protocol);
the data acquisition module acquires log information using Chukwa + Scribe, Spark, and Gbase; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation, and data reduction on the acquired raw data to obtain the final processing result; Cloudera Impala is used for real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module; the module uses the HDFS distributed file system, which provides the underlying support for file operations and distributed storage; the NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between blocks and data nodes; in an HDFS cluster, the DataNodes are mainly responsible for storing and managing data: inside HDFS, data is divided into several blocks, and the blocks are stored on multiple DataNodes;
the data analysis module uses online analytical processing (OLAP); Apache Kylin is a big data analysis engine that supports OLAP queries with latencies on the order of seconds on very large data sets;
the data analysis module performs statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data partitioning, computing task scheduling, and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
Further, the system also comprises a data visualization module, which represents the data in graphical form.
Furthermore, the ResourceManager of the data analysis module is responsible for allocating the computing resources required by applications, and the ApplicationMaster is responsible for scheduling, tracking, and monitoring jobs; the acquired data is analyzed using a neural network; the neural network offers large-scale parallelism, distributed storage and processing, self-organization, self-adaptation, and self-learning, and is suited to imprecise and fuzzy information processing problems in which many factors and conditions must be considered simultaneously.
Furthermore, the data analysis module selects a multilayer neural network for training and uses the error backpropagation algorithm; the examples in the training set are processed iteratively, and for each example the error between the value predicted by the neural network and the true value (target value) is computed; the weight of each connection is then updated in the reverse direction (output layer → hidden layer → input layer) so as to minimize the error;
Input: D, the data set; l, the learning rate; a multilayer feed-forward neural network
Output: a trained neural network
Initialization: the weights and biases are randomly initialized between -1 and 1, or between -0.5 and 0.5, with one bias per unit;
for each training instance X, the following steps are performed:
Forward propagation from the input layer:
$I_j = \sum_i w_{ij} O_i + \theta_j$
where $O_i$ is the value of unit $i$ in the current layer (for the input layer, the values $O_i$ are $x_1, x_2, \ldots, x_n$); $w_{ij}$ is the weight of the connection from unit $i$ to unit $j$ and $\theta_j$ is the bias of unit $j$, both randomly initialized between -1 and 1, or between -0.5 and 0.5; $I_j$ is the predicted value (net input) of unit $j$ in the next layer;
Nonlinear transformation equation:
$O_j = \frac{1}{1 + e^{-I_j}}$
where $O_j$, the nonlinear transformation of the predicted value of the next-layer unit, serves as the input to the following layer;
Backward propagation of the error:
For the output layer, where $T_j$ is the true value and $Err_j$ is the output-layer error:
$Err_j = O_j (1 - O_j)(T_j - O_j)$
For the hidden layer, where $Err_k$ is the error passed back from the layer handled just before it in the backward pass and $w_{jk}$ is the weight corresponding to that error:
$Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update, where $\Delta w_{ij}$ is the weight update amount and $l$ is the learning rate, with a value between 0 and 1:
$\Delta w_{ij} = (l) Err_j O_i$
The new weight $w_{ijn}$ equals the sum of the previous weight and the weight update amount:
$w_{ijn} = w_{ij} + \Delta w_{ij}$
Bias update, where $\Delta\theta_j$ is the bias update amount:
$\Delta\theta_j = (l) Err_j$
The new bias $\theta_{jn}$ equals the sum of the previous bias and the bias update amount:
$\theta_{jn} = \theta_j + \Delta\theta_j$
Termination conditions: training stops when the weight updates fall below a certain threshold, when the prediction error rate falls below a certain threshold, or when a preset number of iterations is reached; each data block is used to train a neural network, yielding one model per block, and the models are then integrated to complete the learning task jointly.
Furthermore, the model fusion module uses the Choquet fuzzy integral as a fusion operator to integrate the trained neural network models, which can improve the data analysis results and enhance the fault tolerance of the whole system; for a given training set T, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of class labels; the training set is divided into $l$ parts, and $D = \{D_1, D_2, \ldots, D_{l-1}\}$ is the set of $l-1$ classifiers (i.e., models) trained from T; for any test sample x, $D_i(x) = (\mu_{i1}(x), \mu_{i2}(x), \ldots, \mu_{ik}(x))$, where $\mu_{ij}(x) \in [0,1]$ denotes the degree of membership with which classifier $D_i$ ($1 \le i \le l-1$) assigns test sample x to the $j$-th class ($1 \le j \le k$);
furthermore, in the model fusion module, for a given test sample x, the following $(l-1) \times k$ matrix DM is called the decision matrix of x:
$DM(x) = \begin{pmatrix} \mu_{11}(x) & \cdots & \mu_{1k}(x) \\ \vdots & & \vdots \\ \mu_{(l-1)1}(x) & \cdots & \mu_{(l-1)k}(x) \end{pmatrix}$
the $i$-th row of DM gives the degrees of membership with which classifier $D_i$ assigns x to each class; the $j$-th column of DM gives the degrees of membership with which the different classifiers assign x to the $j$-th class;
given the set of classifiers $D = \{D_1, D_2, \ldots, D_{l-1}\}$, let $P(D)$ be the power set of D; a fuzzy measure g on D is a function $g: P(D) \to [0,1]$ satisfying two conditions:
(1) $g(\emptyset) = 0$, $g(D) = 1$;
(2) if $A \subseteq B \subseteq D$, then $g(A) \le g(B)$;
for a $\lambda$-fuzzy measure, any disjoint $A, B \subseteq D$ additionally satisfy
$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$
where $\lambda > -1$ is determined by $\lambda + 1 = \prod_{i=1}^{l-1} (1 + \lambda g_i)$;
in the formula, $g_i = g(\{D_i\})$ is the fuzzy measure of a single trained model and is called the fuzzy density; it has been proven theoretically that, no matter how many models are integrated, i.e., whatever the value of $l-1$, exactly one solution satisfies the condition; the methods for determining $g_i$ generally fall into the following three types:
(1) $g_i = p_i$
in the above formula, $p_i$ is the validation accuracy of trained model $D_i$ on the validation set; although the fuzzy densities produced by the three methods differ considerably, their effect on the final result is small; the third method is used most often: the larger the value of $\delta$, the more prominent the role of a single trained model; the smaller the value of $\delta$, the more prominent the role of the integrated model;
given the set of trained models $D = \{D_1, D_2, \ldots, D_{l-1}\}$ and a fuzzy measure g on D, the Choquet integral of a function $h: D \to R^+$ with respect to g is:
$(C)\int h \, dg = \sum_{i=1}^{l-1} \left( h(D_i) - h(D_{i-1}) \right) g(A_i)$
where $0 \le h(D_1) \le h(D_2) \le \ldots \le h(D_{l-1}) \le 1$, $h(D_0) = 0$, and $A_i = \{D_i, D_{i+1}, \ldots, D_{l-1}\}$; the values in the formula may also be ordered from largest to smallest, in which case the difference term becomes $(h(D_{i-1}) - h(D_i))$, which ensures that the integral value is not negative; the $l$-th part of the data is used as the test sample, the fuzzy integral is calculated for each class, and the sample is assigned to the class whose fuzzy integral value for that sample is the largest; finally, the test samples are passed through the fused model to obtain the final analysis result.
Principle of the invention: applying big data technology to the construction of a network security analysis system can effectively improve the system's data acquisition and analysis capabilities. It moves network security analysis from a structured database to a distributed database, optimizes the overall performance of the system architecture, and reduces cost. This effectively addresses the unstable operation of traditional network security analysis systems, allows valuable and meaningful information to be mined from massive data, and ensures the accuracy, authenticity, timeliness, and effectiveness of information processing, so that insecure factors in the network can be better identified and the level of network security monitoring, defense, and management is improved.
In the data acquisition module, the system uses Chukwa, Spark, and Gbase to better acquire log information, traffic data, and data related to fixed-format services. On top of YARN batch processing, Cloudera Impala is added for real-time online queries, which greatly reduces latency. In the data analysis layer, the Apache Kylin big data analysis engine reduces the latency of queries over more than a billion records in a Hadoop environment. On the basis of the trained neural network models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which can improve the data analysis results and enhance the fault tolerance of the whole system. The system solves the problem that the prior art cannot effectively process massive data of multiple types, increases the range of data types that can be processed, improves the processing efficiency and accuracy of the network security analysis system, places low demands on hardware, and greatly reduces cost.
Beneficial effects: compared with the prior art, the network security analysis system based on big data can acquire massive data of different types, meets the real-time requirements of the business, and provides Cloudera Impala to support online data processing. On top of YARN batch processing, Cloudera Impala is added for real-time queries: data can be queried directly from HDFS or HBase using SELECT, JOIN, and statistical functions, which greatly reduces latency. Compared with the original MapReduce-based Hive SQL, query speed is increased by a factor of 3 to 90. Apache Kylin is used as the big data analysis engine; its query speed is superior to that of Hive, which reduces latency and improves the working efficiency of the system. On the basis of the trained neural network models, the Choquet fuzzy integral fusion algorithm is applied to integrate them, which can improve the data analysis results and enhance the fault tolerance of the whole system.
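The kind of query described above (SELECT, JOIN, and statistical functions evaluated directly over stored data) can be sketched as follows. This is a minimal illustration only: Python's built-in sqlite3 module stands in for Impala so that the example runs locally, and the tables conn_log and hosts, their columns, and the sample rows are hypothetical and not taken from the invention.

```python
import sqlite3

# sqlite3 is a local stand-in for Impala; the schema and rows are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hosts (ip TEXT PRIMARY KEY, department TEXT);
    CREATE TABLE conn_log (src_ip TEXT, dst_port INTEGER, bytes_sent INTEGER);
    INSERT INTO hosts VALUES ('10.0.0.1', 'finance'), ('10.0.0.2', 'research');
    INSERT INTO conn_log VALUES
        ('10.0.0.1', 443, 1200),
        ('10.0.0.1', 22, 300),
        ('10.0.0.2', 443, 5000);
""")

# SELECT + JOIN + statistical functions (COUNT, SUM) in a single query.
QUERY = """
    SELECT h.department,
           COUNT(*) AS connections,
           SUM(c.bytes_sent) AS total_bytes
    FROM conn_log AS c
    JOIN hosts AS h ON h.ip = c.src_ip
    GROUP BY h.department
    ORDER BY h.department
"""
rows = db.execute(QUERY).fetchall()
for department, connections, total_bytes in rows:
    print(department, connections, total_bytes)
```

The same SQL text could in principle be submitted to an Impala endpoint; only the connection layer would differ.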
Drawings
FIG. 1 is a big data based network security analysis system architecture diagram;
FIG. 2 is a schematic diagram of a data preprocessing module.
Detailed Description
The present invention will be further described with reference to the following embodiments.
The network security analysis system based on big data comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module and a data visualization module.
The data acquisition module is connected with the data preprocessing module; the data preprocessing module is connected with the data storage module;
The data acquisition module uses Chukwa + Scribe, Spark, and Gbase to better acquire log information, including search engine crawler data, current traffic data, and data related to fixed-format services. The Scribe distributed log system is used for distributed data backup, which improves the efficiency and quality of data acquisition.
The data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation, and data reduction on the acquired raw data to obtain the final processing result. Cloudera Impala is used for real-time online analysis of the data.
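The four preprocessing steps can be illustrated with a small sketch. It is written in plain Python rather than Informatica PowerCenter, and the record fields (ip, bytes), the asset table, and the helper name preprocess are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def preprocess(raw_logs, asset_owner):
    """Toy cleaning -> integration -> transformation -> reduction pipeline."""
    # Cleaning: drop records with missing or malformed fields.
    cleaned = [
        r for r in raw_logs
        if r.get("ip") and isinstance(r.get("bytes"), int) and r["bytes"] >= 0
    ]
    # Integration: enrich each record from a second source (asset inventory).
    integrated = [
        dict(r, owner=asset_owner.get(r["ip"], "unknown")) for r in cleaned
    ]
    # Transformation: convert bytes to kilobytes so values share one unit.
    transformed = [dict(r, kbytes=r["bytes"] / 1024) for r in integrated]
    # Reduction: aggregate to a single row per (ip, owner) pair.
    totals = defaultdict(float)
    for r in transformed:
        totals[(r["ip"], r["owner"])] += r["kbytes"]
    return [
        {"ip": ip, "owner": owner, "kbytes": kb}
        for (ip, owner), kb in sorted(totals.items())
    ]


raw = [
    {"ip": "10.0.0.1", "bytes": 2048},
    {"ip": "10.0.0.1", "bytes": 1024},
    {"ip": None, "bytes": 10},           # removed by cleaning
    {"ip": "10.0.0.2", "bytes": "n/a"},  # removed by cleaning
    {"ip": "10.0.0.3", "bytes": 512},
]
result = preprocess(raw, {"10.0.0.1": "finance"})
print(result)
```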
The data storage module stores the data processed by the data preprocessing module. The module uses the HDFS distributed file system, which provides the underlying support for file operations and distributed storage. The NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between blocks and data nodes. In an HDFS cluster, the DataNodes are mainly responsible for storing and managing data: inside HDFS, data is divided into several blocks, and the blocks are stored on multiple DataNodes. HDFS is a highly fault-tolerant system that is suitable for deployment on inexpensive machines and well suited to applications with large data sets.
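The division of data into blocks and the block-to-DataNode mapping kept by the NameNode can be pictured with a toy model. This sketch does not use any HDFS API; the function names, the tiny block size, and the round-robin placement are assumptions chosen only to keep the example short.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes, replication: int):
    """Return NameNode-style metadata: block index -> list of DataNode names.

    Round-robin placement is an illustrative assumption; real HDFS uses a
    rack-aware placement policy.
    """
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print([len(b) for b in blocks])
print(block_map)
```

Because every block is held by more than one node in this model, losing a single node still leaves a copy of each block available, which is the intuition behind the fault tolerance mentioned above.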
The data analysis module uses online analytical processing (OLAP). Apache Kylin is a big data analysis engine that supports OLAP queries with latencies on the order of seconds on very large data sets.
The data analysis module performs statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data partitioning, computing task scheduling, and distributed computing, so that a large-scale problem is divided into several smaller-scale problems.
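The idea of dividing a large problem into smaller ones, scheduling them, and combining the partial results can be sketched as follows. A local thread pool stands in for cluster resources here; this is not YARN code, and the event records and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_sources(partition):
    """Sub-problem: count events per source IP within one partition."""
    return Counter(event["src"] for event in partition)


def distributed_count(events, num_partitions):
    # Divide: split the event list into roughly equal partitions.
    partitions = [events[i::num_partitions] for i in range(num_partitions)]
    # Schedule: a thread pool stands in for cluster containers.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_sources, partitions))
    # Combine: merge the partial counts into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


events = [
    {"src": "10.0.0.1"}, {"src": "10.0.0.2"}, {"src": "10.0.0.1"},
    {"src": "10.0.0.3"}, {"src": "10.0.0.1"},
]
counts = distributed_count(events, num_partitions=2)
print(counts.most_common(1))
```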
The ResourceManager is responsible for allocating the computing resources required by applications, and the ApplicationMaster is responsible for scheduling, tracking, and monitoring jobs. The acquired data is analyzed using a neural network. The neural network offers large-scale parallelism, distributed storage and processing, self-organization, self-adaptation, and self-learning, and is suited to imprecise and fuzzy information processing problems in which many factors and conditions must be considered simultaneously.
To improve the learning ability of the neural network, the invention selects a multilayer neural network for training and uses the error backpropagation algorithm. The examples in the training set are processed iteratively, and for each example the error between the value predicted by the neural network and the true value (target value) is computed. The weight of each connection is then updated in the reverse direction (output layer → hidden layer → input layer) so as to minimize the error.
Input: D, the data set; l, the learning rate; a multilayer feed-forward neural network
Output: a trained neural network
Initialization: the weights and biases are randomly initialized between -1 and 1, or between -0.5 and 0.5, with one bias per unit.
For each training instance X, the following steps are performed:
Forward propagation from the input layer:
$I_j = \sum_i w_{ij} O_i + \theta_j$
where $O_i$ is the value of unit $i$ in the current layer (for the input layer, the values $O_i$ are $x_1, x_2, \ldots, x_n$). $w_{ij}$ is the weight of the connection from unit $i$ to unit $j$ and $\theta_j$ is the bias of unit $j$, both randomly initialized between -1 and 1, or between -0.5 and 0.5. $I_j$ is the predicted value (net input) of unit $j$ in the next layer.
Nonlinear transformation equation:
$O_j = \frac{1}{1 + e^{-I_j}}$
where $O_j$, the nonlinear transformation of the predicted value of the next-layer unit, serves as the input to the following layer.
Backward propagation of the error:
For the output layer, where $T_j$ is the true value and $Err_j$ is the output-layer error:
$Err_j = O_j (1 - O_j)(T_j - O_j)$
For the hidden layer, where $Err_k$ is the error passed back from the layer handled just before it in the backward pass and $w_{jk}$ is the weight corresponding to that error:
$Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update, where $\Delta w_{ij}$ is the weight update amount and $l$ is the learning rate, with a value between 0 and 1:
$\Delta w_{ij} = (l) Err_j O_i$
The new weight $w_{ijn}$ equals the sum of the previous weight and the weight update amount:
$w_{ijn} = w_{ij} + \Delta w_{ij}$
Bias update, where $\Delta\theta_j$ is the bias update amount:
$\Delta\theta_j = (l) Err_j$
The new bias $\theta_{jn}$ equals the sum of the previous bias and the bias update amount:
$\theta_{jn} = \theta_j + \Delta\theta_j$
Termination conditions: training stops when the weight updates fall below a certain threshold, when the prediction error rate falls below a certain threshold, or when a preset number of iterations is reached. Each data block is used to train a neural network, yielding one model per block, and the models are then integrated to complete the learning task jointly.
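The training procedure described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a single hidden layer, the logistic sigmoid as the nonlinear transformation (the form implied by the output-layer error term $O_j(1 - O_j)(T_j - O_j)$), a fixed number of epochs as the termination condition, and a hypothetical toy data set. The class and function names are invented for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Network:
    """One-hidden-layer feed-forward network trained by error backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def init():
            return rng.uniform(-0.5, 0.5)  # random initialization in [-0.5, 0.5]

        self.w_ih = [[init() for _ in range(n_hidden)] for _ in range(n_in)]
        self.w_ho = [[init() for _ in range(n_out)] for _ in range(n_hidden)]
        self.theta_h = [init() for _ in range(n_hidden)]  # one bias per unit
        self.theta_o = [init() for _ in range(n_out)]

    def forward(self, x):
        # I_j = sum_i w_ij * O_i + theta_j, then O_j = sigmoid(I_j)
        hidden = [
            sigmoid(sum(x[i] * self.w_ih[i][j] for i in range(len(x))) + self.theta_h[j])
            for j in range(len(self.theta_h))
        ]
        out = [
            sigmoid(sum(hidden[j] * self.w_ho[j][k] for j in range(len(hidden))) + self.theta_o[k])
            for k in range(len(self.theta_o))
        ]
        return hidden, out

    def train_step(self, x, target, rate):
        hidden, out = self.forward(x)
        # Output layer: Err_k = O_k (1 - O_k) (T_k - O_k)
        err_o = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
        # Hidden layer: Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk
        err_h = [
            hidden[j] * (1 - hidden[j])
            * sum(err_o[k] * self.w_ho[j][k] for k in range(len(out)))
            for j in range(len(hidden))
        ]
        # Weight update: w += l * Err * O; bias update: theta += l * Err
        for j in range(len(hidden)):
            for k in range(len(out)):
                self.w_ho[j][k] += rate * err_o[k] * hidden[j]
        for k in range(len(out)):
            self.theta_o[k] += rate * err_o[k]
        for i in range(len(x)):
            for j in range(len(hidden)):
                self.w_ih[i][j] += rate * err_h[j] * x[i]
        for j in range(len(hidden)):
            self.theta_h[j] += rate * err_h[j]


def squared_error(net, data):
    return sum(
        (t - o) ** 2
        for x, target in data
        for t, o in zip(target, net.forward(x)[1])
    )


# Hypothetical toy data: two input features, one binary target per sample.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
net = Network(n_in=2, n_hidden=3, n_out=1)
error_before = squared_error(net, data)
for epoch in range(2000):  # termination: preset number of iterations
    for x, target in data:
        net.train_step(x, target, rate=0.5)
error_after = squared_error(net, data)
print(error_before, error_after)
```

Note that the hidden-layer errors are computed before any weight is changed, so each update uses the weights that produced the forward pass.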
The model fusion module uses the Choquet fuzzy integral as a fusion operator to integrate the trained neural network models, which can improve the data analysis results and enhance the fault tolerance of the whole system. For a given training set T, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of class labels. The training set is divided into $l$ parts, and $D = \{D_1, D_2, \ldots, D_{l-1}\}$ is the set of $l-1$ classifiers (i.e., models) trained from T. For any test sample x, $D_i(x) = (\mu_{i1}(x), \mu_{i2}(x), \ldots, \mu_{ik}(x))$, where $\mu_{ij}(x) \in [0,1]$ denotes the degree of membership with which classifier $D_i$ ($1 \le i \le l-1$) assigns test sample x to the $j$-th class ($1 \le j \le k$).
Given a test sample x, the following $(l-1) \times k$ matrix DM is called the decision matrix of x:
$DM(x) = \begin{pmatrix} \mu_{11}(x) & \cdots & \mu_{1k}(x) \\ \vdots & & \vdots \\ \mu_{(l-1)1}(x) & \cdots & \mu_{(l-1)k}(x) \end{pmatrix}$
The $i$-th row of DM gives the degrees of membership with which classifier $D_i$ assigns x to each class; the $j$-th column of DM gives the degrees of membership with which the different classifiers assign x to the $j$-th class.
Given the set of classifiers $D = \{D_1, D_2, \ldots, D_{l-1}\}$, let $P(D)$ be the power set of D. A fuzzy measure g on D is a function $g: P(D) \to [0,1]$ satisfying two conditions:
(1) $g(\emptyset) = 0$, $g(D) = 1$;
(2) if $A \subseteq B \subseteq D$, then $g(A) \le g(B)$.
For a $\lambda$-fuzzy measure, any disjoint $A, B \subseteq D$ additionally satisfy
$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$
where $\lambda > -1$ is determined by $\lambda + 1 = \prod_{i=1}^{l-1} (1 + \lambda g_i)$.
In the formula, $g_i = g(\{D_i\})$ is the fuzzy measure of a single trained model and is called the fuzzy density. It has been proven theoretically that, no matter how many models are integrated, i.e., whatever the value of $l-1$, exactly one solution satisfies the condition. The methods for determining $g_i$ generally fall into the following three types:
(1) $g_i = p_i$
In the above formula, $p_i$ is the validation accuracy of trained model $D_i$ on the validation set. Although the fuzzy densities produced by the three methods differ considerably, their effect on the final result is small. The third method is used most often: the larger the value of $\delta$, the more prominent the role of a single trained model; the smaller the value of $\delta$, the more prominent the role of the integrated model.
Given the set of trained models $D = \{D_1, D_2, \ldots, D_{l-1}\}$ and a fuzzy measure g on D, the Choquet integral of a function $h: D \to R^+$ with respect to g is:
$(C)\int h \, dg = \sum_{i=1}^{l-1} \left( h(D_i) - h(D_{i-1}) \right) g(A_i)$
where $0 \le h(D_1) \le h(D_2) \le \ldots \le h(D_{l-1}) \le 1$, $h(D_0) = 0$, and $A_i = \{D_i, D_{i+1}, \ldots, D_{l-1}\}$. The values in the formula may also be ordered from largest to smallest, in which case the difference term correspondingly becomes $(h(D_{i-1}) - h(D_i))$, which ensures that the integral value is not negative. The $l$-th part of the data is used as the test sample, the fuzzy integral is calculated for each class, and the sample is assigned to the class whose fuzzy integral value for that sample is the largest. Finally, the test samples are passed through the fused model to obtain the final analysis result.
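The fusion step can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the fuzzy densities are taken as validation accuracies (method (1), $g_i = p_i$), $\lambda$ is obtained by bisection from the standard normalization condition of a $\lambda$-fuzzy measure, $\lambda + 1 = \prod_i (1 + \lambda g_i)$, and the decision matrix and accuracy values are hypothetical. The function names are invented, and the sketch assumes at least two classifiers with densities strictly between 0 and 1.

```python
def solve_lambda(densities, iterations=200):
    """Root of lambda + 1 = prod(1 + lambda * g_i) with lambda > -1, lambda != 0.

    Assumes at least two densities, each strictly between 0 and 1.
    """
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities are already additive

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (lam + 1.0)

    if total > 1.0:
        lo, hi = -1.0, 0.0  # f > 0 left of the root, f < 0 right of it
        for _ in range(iterations):
            mid = (lo + hi) / 2.0
            if f(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    else:
        lo, hi = 0.0, 1.0   # f < 0 left of the root, f > 0 right of it
        while f(hi) < 0.0:
            hi *= 2.0
        for _ in range(iterations):
            mid = (lo + hi) / 2.0
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
    return (lo + hi) / 2.0


def choquet_integral(h, densities, lam):
    """Choquet integral of h (one value per model) for a lambda-fuzzy measure."""
    order = sorted(range(len(h)), key=lambda i: h[i])  # ascending h
    # g(A_i) for A_i = {D_(i), ..., D_(n)}, built from the largest h downwards
    g_tail = [0.0] * len(h)
    g = 0.0
    for pos in range(len(order) - 1, -1, -1):
        gi = densities[order[pos]]
        g = g + gi + lam * g * gi
        g_tail[pos] = g
    total, prev = 0.0, 0.0
    for pos, idx in enumerate(order):
        total += (h[idx] - prev) * g_tail[pos]
        prev = h[idx]
    return total


def fuse(decision_matrix, densities):
    """Return (index of the winning class, Choquet score of every class)."""
    lam = solve_lambda(densities)
    k = len(decision_matrix[0])
    scores = [
        choquet_integral([row[j] for row in decision_matrix], densities, lam)
        for j in range(k)
    ]
    return max(range(k), key=lambda j: scores[j]), scores


# Hypothetical example: three classifiers (rows), two classes (columns).
decision_matrix = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
accuracies = [0.9, 0.8, 0.7]  # fuzzy densities g_i = p_i
predicted, scores = fuse(decision_matrix, accuracies)
print(predicted, scores)
```

When the densities sum to 1, $\lambda$ is 0 and the integral reduces to an ordinary weighted average; otherwise the measure rewards or discounts agreement among classifiers, which is what distinguishes this fusion from simple voting.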
The data visualization module represents the data in graphical form, helping people explore and understand complex data. It is an important way for users to understand complex data and carry out in-depth analysis.
Examples
As shown in fig. 1, the present invention provides a big data based network security analysis system, which includes a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, a model fusion module, and a data visualization module.
First, the system collects massive data from the network, and every processing stage in the design can run in parallel. The acquisition module uses processing modes such as Chukwa, Spark, and Gbase to acquire log information, traffic data, and data related to fixed-format services, respectively.
As shown in FIG. 2, the acquisition module sends the acquired data to the data preprocessing module, where Informatica PowerCenter performs data cleaning, data integration, data transformation, and data reduction on the acquired raw data to obtain the final processing result.
Cloudera Impala performs real-time online analysis of the data. The data preprocessing module sends the processed data to the data storage module. The DataNodes of the storage module are responsible for storing and managing data; the data is divided into several blocks inside HDFS, and the blocks are stored on multiple DataNodes. HBase stores semi-structured data. The data analysis layer performs online analytical processing on the acquired data. Apache Kylin serves as the big data analysis engine, and the distributed computing framework YARN partitions the data, schedules computing tasks, and performs distributed computing. The ResourceManager provides the computing resources required by applications, and the ApplicationMaster is responsible for scheduling, tracking, and monitoring jobs. The data of each block is used to train a neural network, where BP learning consists of two processes: forward propagation of the signal and backward propagation of the error.
In forward propagation, a sample enters at the input layer, is processed layer by layer by the hidden layers, and is passed to the output layer. If the actual output of the output layer does not match the expected output, the process enters the error backpropagation stage. Error backpropagation passes the output error back through the hidden layers, layer by layer, toward the input layer in a certain form and distributes the error to all units of each layer, yielding an error signal for each layer that serves as the basis for correcting the weights of the units. This cycle of forward signal propagation and backward error propagation, with its weight adjustment at every layer, is repeated, and the weights are adjusted continuously until the error of the network output falls to an acceptable level or a preset number of learning iterations is reached, at which point network training ends. The trained neural network model is thus obtained from the training set.
On the basis of the trained neural network models, the Choquet fuzzy integral fusion algorithm is applied to integrate them and enhance the fault tolerance of the whole system. The system analyzes data on known network viruses and malicious software, uses this data to find the weight relationship between input and output, simulates with this weight relationship, and finally outputs the simulation result. The analysis results are then sent to the data visualization module.
Visualization uses interactive visual representations to simplify and refine the data flow quickly and effectively and, by letting users interactively filter large volumes of data, presents the results of complex and massive data analysis clearly to users. When the system detects such a range of attacks, the intrusion detection system can quickly identify the attack and react.
The above description is only a preferred embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and these modifications and variations should also be regarded as falling within the scope of protection of the present invention.
Claims (7)
1. A network security analysis system based on big data, characterized in that: the system comprises a data acquisition module, a data preprocessing module, a real-time online analysis module, a data storage module, a data analysis module, and a model fusion module, which are connected in sequence through a network; the data acquisition module and the data preprocessing module communicate via HTTP (Hypertext Transfer Protocol);
the data acquisition module acquires log information using Chukwa + Scribe, Spark, and Gbase; the Scribe distributed log system is used for distributed data backup;
the data preprocessing module uses Informatica PowerCenter to perform data cleaning, data integration, data transformation, and data reduction on the acquired raw data to obtain the final processing result; Cloudera Impala is used for real-time online analysis of the data;
the data storage module stores the data processed by the data preprocessing module;
the data analysis module uses online analytical processing (OLAP); Apache Kylin is a big data analysis engine that supports OLAP queries with latencies on the order of seconds on very large data sets;
the data analysis module performs statistical analysis and mining analysis of the data, and the distributed computing framework YARN is used for data partitioning, computing task scheduling, and distributed computing.
2. The big-data based network security analysis system of claim 1, wherein: the system also comprises a data visualization module, which represents the data in graphical form.
3. The big-data based network security analysis system of claim 1, wherein: the ResourceManager of the data analysis module is responsible for allocating the computing resources required by applications, and the ApplicationMaster is responsible for scheduling, tracking, and monitoring jobs; the acquired data is analyzed using a neural network.
4. The big-data based network security analysis system of claim 1, wherein: the data analysis module selects a multilayer neural network for training, using the error backpropagation algorithm; the instances in the training set are processed iteratively, and after each instance passes from the input layer through the neural network, the error between the predicted value and the true value is compared; the weight of each connection is then updated in the reverse direction so as to minimize the error;
Input: D: the data set; l: the learning rate; a multilayer feed-forward neural network;
Output: a trained neural network;
Initialize the weights and biases: each is randomly initialized between -1 and 1, or between -0.5 and 0.5, and each unit has a bias;
for each training instance X, the following steps are performed:
Forward propagation from the input layer:
$I_j = \sum_i w_{ij} O_i + \theta_j$
where $O_i$ is the value of each unit in the current layer (for the input layer, the values $O_i$ are $x_1, x_2, \ldots, x_n$); $w_{ij}$ is the weight; $\theta_j$ is the bias, randomly initialized between -1 and 1 or between -0.5 and 0.5; $I_j$ is the predicted value of the next-layer unit;
Nonlinear transformation:
$O_j = \frac{1}{1 + e^{-I_j}}$
where $O_j$ is the nonlinear transformation of the predicted value of the next-layer unit and serves as the input to the following layer;
Backward propagation according to the error:
For the output layer: $T_j$ is the true value and $Err_j$ is the output-layer error:
$Err_j = O_j (1 - O_j)(T_j - O_j)$
For a hidden layer: $Err_k$ is the error of the layer from which the error is propagated back, and $w_{jk}$ is the weight corresponding to that error:
$Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update: $\Delta w_{ij}$ is the weight update amount, and $l$ is the learning rate, with a value between 0 and 1:
$\Delta w_{ij} = (l)\, Err_j\, O_i$
The new weight $w_{ijn}$ equals the sum of the previous weight and the weight update amount:
$w_{ijn} = w_{ij} + \Delta w_{ij}$
Bias update: $\Delta \theta_j$ is the bias update amount:
$\Delta \theta_j = (l)\, Err_j$
The new bias $\theta_{jn}$ equals the sum of the previous bias and the bias update amount:
$\theta_{jn} = \theta_j + \Delta \theta_j$
Termination conditions: training stops when the weight updates fall below a certain threshold, the prediction error rate falls below a certain threshold, or a preset number of cycles is reached; each data block is trained through a neural network to obtain a model, and the models are then integrated to jointly complete the learning task.
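The training procedure of claim 4 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the sigmoid activation is inferred from the factor $O_j(1 - O_j)$ in the error formulas, weights are updated after every instance, and the network size, learning rate, stopping threshold and the toy OR data set are all hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_network(n_in, n_hidden, n_out, rng):
    """Weights and biases are drawn uniformly from [-0.5, 0.5]."""
    def layer(n_from, n_to):
        return {
            "w": [[rng.uniform(-0.5, 0.5) for _ in range(n_to)] for _ in range(n_from)],
            "theta": [rng.uniform(-0.5, 0.5) for _ in range(n_to)],
        }
    return [layer(n_in, n_hidden), layer(n_hidden, n_out)]


def forward(net, x):
    """Return the outputs O of every layer, starting with the input layer."""
    outputs = [list(x)]
    for layer in net:
        prev = outputs[-1]
        nxt = []
        for j in range(len(layer["theta"])):
            i_j = sum(layer["w"][i][j] * prev[i] for i in range(len(prev))) + layer["theta"][j]
            nxt.append(sigmoid(i_j))  # O_j = 1 / (1 + exp(-I_j))
        outputs.append(nxt)
    return outputs


def train_instance(net, x, target, rate):
    """Forward pass, backward error propagation, then weight and bias updates."""
    outputs = forward(net, x)
    # Output layer: Err_j = O_j (1 - O_j) (T_j - O_j)
    errs = [o * (1 - o) * (t - o) for o, t in zip(outputs[-1], target)]
    all_errs = [errs]
    # Hidden layers: Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
    for idx in range(len(net) - 1, 0, -1):
        layer, o_hid = net[idx], outputs[idx]
        errs = [
            o_hid[j] * (1 - o_hid[j]) * sum(errs[k] * layer["w"][j][k] for k in range(len(errs)))
            for j in range(len(o_hid))
        ]
        all_errs.insert(0, errs)
    # w_ij += l * Err_j * O_i and theta_j += l * Err_j
    for layer, o_prev, err in zip(net, outputs[:-1], all_errs):
        for j, e in enumerate(err):
            for i, o in enumerate(o_prev):
                layer["w"][i][j] += rate * e * o
            layer["theta"][j] += rate * e


def train(net, data, rate=0.5, max_epochs=5000, tol=0.01):
    """Stop when the summed squared error drops below tol or after max_epochs."""
    for _ in range(max_epochs):
        for x, t in data:
            train_instance(net, x, t, rate)
        sse = sum((tj - oj) ** 2 for x, t in data for tj, oj in zip(t, forward(net, x)[-1]))
        if sse < tol:
            break
    return net


# Toy data set (logical OR), used only to exercise the code.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
net = train(init_network(2, 2, 1, random.Random(0)), data)
predictions = [round(forward(net, x)[-1][0]) for x, _ in data]
print(predictions)
```

All errors are computed from the weights as they stood before the update, which matches the order of the steps in the claim.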
5. The big-data based network security analysis system of claim 4, wherein: the model fusion module uses the Choquet fuzzy integral as a fusion operator to integrate the trained neural network models, which can improve the data analysis effect and enhance the fault tolerance of the whole system; for a given training set T, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$ is the set of class labels; the training set is divided into l parts, and $D = \{D_1, D_2, \ldots, D_{l-1}\}$ is the set of l-1 classifiers, i.e. models, trained from T; for any test sample x, $D_i(x) = (\mu_{i1}(x), \mu_{i2}(x), \ldots, \mu_{ik}(x))$, where $\mu_{ij}(x) \in [0,1]$ denotes the degree of membership with which classifier $D_i$ ($1 \le i \le l-1$) assigns the test sample x to the j-th class ($1 \le j \le k$).
6. The big-data based network security analysis system of claim 5, wherein: given a test sample x in the model fusion module, the following (l-1) × k matrix DM is called the decision matrix of x:
$DM(x) = \begin{pmatrix} \mu_{11}(x) & \cdots & \mu_{1k}(x) \\ \vdots & \ddots & \vdots \\ \mu_{(l-1)1}(x) & \cdots & \mu_{(l-1)k}(x) \end{pmatrix}$
the i-th row of DM gives the degrees of membership with which classifier $D_i$ assigns x to each class; the j-th column of DM gives the degrees of membership with which the different classifiers assign x to the j-th class;
given the set of classifiers $D = \{D_1, D_2, \ldots, D_{l-1}\}$, let P(D) be the power set of D; a fuzzy measure g on D is defined as a function $g: P(D) \to [0,1]$ satisfying two conditions;
(1) $g(\emptyset) = 0$, $g(D) = 1$;
$g(A \cup B) = g(A) + g(B) + \lambda g(A) g(B)$
in the formula, $g_i$ is the fuzzy measure on a single training model, called the fuzzy density; it has been shown theoretically that, regardless of how many models are integrated, i.e. whatever the value of l-1, exactly one solution satisfies the condition; there are generally three methods for determining $g_i$:
(1) $g_i = p_i$
in the above formula, $p_i$ is the accuracy of training model $D_i$ on the validation set; although the fuzzy density values obtained by the three methods differ considerably, the influence on the final result is small; the third method is used more often: the larger the value of δ, the more prominent the role of a single training model; the smaller the value of δ, the more prominent the role of the integrated training model;
given a set of training models $D = \{D_1, D_2, \ldots, D_{l-1}\}$ and a fuzzy measure g on D, the Choquet integral of a function $h: D \to R^+$ with respect to g is:
where $0 \le h(D_1) \le h(D_2) \le \ldots \le h(D_{l-1}) \le 1$, $h(D_0) = 0$, $g(A_0)$; the ordering in the formula may also run from large to small, in which case the integrand becomes $(h(D_{i-1}) - h(D_i))$, which ensures that the integral value is not negative; the first part of the data is used as the test sample and its fuzzy integral is calculated; the sample is assigned to the model for which the fuzzy integral calculated for the sample is largest; finally, the test sample is processed by that model to obtain the final analysis result.
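The fusion step of claims 5 and 6 can be illustrated with a short Python sketch. Several parts are assumptions because the corresponding formulas did not survive in the text above: λ is taken as the root greater than -1 of $\prod_i (1 + \lambda g_i) = 1 + \lambda$ (the usual Sugeno λ-measure construction), the discrete Choquet integral is computed as $\sum_i (h(D_i) - h(D_{i-1}))\, g(A_i)$ with h sorted in ascending order and $A_i = \{D_i, \ldots, D_{l-1}\}$, the densities follow method (1), $g_i = p_i$, and the sample is assigned to the class whose column of the decision matrix gives the largest integral. The numbers are hypothetical.

```python
def solve_lambda(densities, iterations=200):
    """Bisection for the root lam > -1, lam != 0 of prod(1 + lam*g_i) = 1 + lam."""
    if len(densities) < 2:
        raise ValueError("need at least two fuzzy densities")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to 1: the measure is additive

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-9, -1e-9  # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0           # root lies in (0, +inf)
        while f(hi) < 0:
            hi *= 2.0
    f_lo_positive = f(lo) > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def measure(densities, lam):
    """g(A) built up with g(A u B) = g(A) + g(B) + lam * g(A) * g(B)."""
    g = 0.0
    for gi in densities:
        g = g + gi + lam * g * gi
    return g


def choquet(h, densities, lam):
    """Discrete Choquet integral of h with respect to the lambda-fuzzy measure."""
    order = sorted(range(len(h)), key=lambda i: h[i])  # ascending h
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        a_i = [densities[k] for k in order[pos:]]  # models with h >= h[i]
        total += (h[i] - prev) * measure(a_i, lam)
        prev = h[i]
    return total


def fuse(decision_matrix, densities):
    """Return (index of the winning class, per-class Choquet integrals)."""
    lam = solve_lambda(densities)
    n_classes = len(decision_matrix[0])
    scores = [
        choquet([row[j] for row in decision_matrix], densities, lam)
        for j in range(n_classes)
    ]
    return scores.index(max(scores)), scores


# Hypothetical decision matrix: 3 models (rows) x 2 classes (columns).
DM = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
P = [0.5, 0.3, 0.4]  # hypothetical validation accuracies used as g_i = p_i
lam = solve_lambda(P)
label, scores = fuse(DM, P)
print(lam, label, scores)
```

Because g(D) evaluates to 1 for the solved λ, the integral of a constant membership value returns that value, which is a quick sanity check on the measure.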
7. The big-data based network security analysis system of claim 1, wherein: the data storage module adopts the HDFS distributed file system, which provides the system with underlying support for file operations and distributed storage; the NameNode serves as the master server in HDFS and manages all metadata of the HDFS file system as well as the mapping between data blocks and DataNodes; in an HDFS cluster, the DataNodes are mainly responsible for storing and managing data; within HDFS the data are divided into several blocks, and the blocks are stored on multiple DataNodes.
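The division of labour between NameNode and DataNodes described in claim 7 can be pictured with a toy in-memory model. This is not the HDFS API: the `ToyNameNode` class, the round-robin placement, the block size and the replication factor are hypothetical simplifications chosen for illustration.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold which block of which file."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # (path, block index) -> list of DataNode names
        # Stands in for the DataNodes' local storage in this toy model.
        self.storage = {name: {} for name in self.datanodes}

    def put(self, path, data, block_size):
        for idx, block in enumerate(split_into_blocks(data, block_size)):
            # Round-robin placement (an assumption, not the real HDFS policy).
            targets = [
                self.datanodes[(idx + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                self.storage[node][(path, idx)] = block
            self.block_map[(path, idx)] = targets

    def get(self, path):
        """Reassemble a file by reading the first replica of every block."""
        keys = sorted(k for k in self.block_map if k[0] == path)
        return b"".join(self.storage[self.block_map[k][0]][k] for k in keys)


namenode = ToyNameNode(["dn1", "dn2", "dn3"])
payload = b"0123456789"
namenode.put("/logs/a.log", payload, block_size=4)
print(namenode.block_map)
```

The metadata map is the only thing the NameNode object needs to answer a read; the block contents live with the DataNodes, which is the separation the claim describes.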
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110043769.1A CN112783852A (en) | 2021-01-13 | 2021-01-13 | Network security analysis system based on big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110043769.1A CN112783852A (en) | 2021-01-13 | 2021-01-13 | Network security analysis system based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112783852A true CN112783852A (en) | 2021-05-11 |
Family
ID=75755740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110043769.1A (Pending), CN112783852A (en) | Network security analysis system based on big data | 2021-01-13 | 2021-01-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112783852A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117076810A (en) * | 2023-10-12 | 2023-11-17 | 睿至科技集团有限公司 | Internet big data processing system and method based on artificial intelligence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103581188A (en) * | 2013-11-05 | 2014-02-12 | 中国科学院计算技术研究所 | Network security situation forecasting method and system |
CN109684352A (en) * | 2018-12-29 | 2019-04-26 | 江苏满运软件科技有限公司 | Data analysis system, method, storage medium and electronic equipment |
CN109885562A (en) * | 2019-01-17 | 2019-06-14 | 安徽谛听信息科技有限公司 | A kind of big data intelligent analysis system based on cyberspace safety |
Non-Patent Citations (2)
Title |
---|
VIKASH KUMAR GARG et al.: "Dynamic System for Performance Analysis of Information Interchange", Journal of Advancements in Robotics, vol. 6, no. 1, pages 8 - 14 * |
王婷婷: "Research on big data classification based on MapReduce and restricted Boltzmann machines", China Master's Theses Full-text Database, Information Science and Technology, no. 01, pages 17 - 20 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Topology-aware neural model for highly accurate QoS prediction | |
Olmezogullari et al. | Representation of click-stream datasequences for learning user navigational behavior by using embeddings | |
CN111127246A (en) | Intelligent prediction method for transmission line engineering cost | |
CN113743675B (en) | Construction method and system of cloud service QoS deep learning prediction model | |
Wang et al. | A multitask learning-based network traffic prediction approach for SDN-enabled industrial internet of things | |
CN111461286A (en) | Spark parameter automatic optimization system and method based on evolutionary neural network | |
CN115309647A (en) | Federal learning-based software defect prediction privacy protection method | |
Chen et al. | An intelligent approval system for city construction based on cloud computing and big data | |
Sundarakumar et al. | A heuristic approach to improve the data processing in big data using enhanced Salp Swarm algorithm (ESSA) and MK-means algorithm | |
Samaan et al. | Feature-based real-time distributed denial of service detection in SDN using machine learning and Spark | |
Tao et al. | Gated recurrent unit-based parallel network traffic anomaly detection using subagging ensembles | |
CN112783852A (en) | Network security analysis system based on big data | |
Yan et al. | TL-CNN-IDS: transfer learning-based intrusion detection system using convolutional neural network | |
CN116501444B (en) | Abnormal cloud edge collaborative monitoring and recovering system and method for virtual machine of intelligent network-connected automobile domain controller | |
US20230205823A1 (en) | Intelligent clustering systems and methods useful for domain protection | |
Shih et al. | Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning | |
Yang et al. | Evolutionary Neural Architecture Search for Transformer in Knowledge Tracing | |
Wang et al. | ifig: Individually fair multi-view graph clustering | |
Samaan et al. | Architecting a machine learning pipeline for online traffic classification in software defined networking using spark | |
Fang et al. | Active exploration: simultaneous sampling and labeling for large graphs | |
Xie et al. | A prediction model of cloud security situation based on evolutionary functional network | |
Alkafagi | Build Network Intrusion Detection System based on combination of Fractal Density Peak Clustering and Artificial Neural Network | |
Huang et al. | Using Microservice Architecture as a Load Prediction Strategy for Management System of University Public Service | |
Xu et al. | Heterogeneous data-driven failure diagnosis for microservice-based industrial clouds towards consumer digital ecosystems | |
Muruganandam et al. | An effective utilization of optimal deep learning model based student performance prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||