Blockchain application traffic identification method based on DPI and CNN
Technical Field
The invention belongs to the technical field of blockchains, and particularly relates to a blockchain application traffic identification technique based on deep packet inspection (DPI) and a convolutional neural network (CNN).
Background
As an emerging technology, blockchain is characterized by tamper resistance, decentralization, convenient traceability, and collective maintenance. Its core technologies mainly involve cryptography, peer-to-peer network design, the implementation of distributed algorithms, and data storage. Compared with a centralized architecture, the decentralized blockchain architecture has more significant advantages in privacy and security. The traditional traffic identification methods used in other fields are mainly methods based on the support vector machine (SVM), methods based on Bayesian algorithms, and methods based on decision trees. However, there are currently few methods for identifying blockchain application protocol traffic, and no formally published literature on the subject has appeared.
The prior art is mainly applied to traffic identification in traditional centralized peer-to-peer networks and decentralized P2P networks, and it has significant drawbacks with respect to large-scale training or recognition error. For example, the SVM-based recognition method is a novel learning method suited to training a model on a small sample. Its algorithm is simple and its performance is stable. However, the algorithm and its improved variants are difficult to apply when training on large-scale samples or solving multi-class problems. The decision-tree-based identification method is efficient and easy to understand and implement, but it produces large errors when processing data whose features are strongly correlated. The above methods therefore have significant drawbacks when identifying traffic for blockchain applications, where the data volume is large and large-scale training samples are required.
As the technology matures, the number of blockchain applications on the market continues to rise, and how network administrators can monitor blockchain traffic more efficiently has become a serious problem. However, blockchain applications, which are characterized by decentralization, differ significantly from traditional centralized applications in their network model design, and no literature on identifying blockchain traffic has appeared so far.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a blockchain application traffic identification method based on DPI and CNN, which identifies blockchain applications in a network more accurately and improves the ability to identify blockchain application traffic.
To solve the above technical problems, the invention adopts the following technical solution:
a blockchain application traffic identification method based on DPI and CNN, comprising the following steps:
step S1, the traffic collection module captures network data traffic and delivers it through a rule-delivery mechanism;
step S2, the DPI recognition module applies a pattern matching algorithm to the traffic data in order to recognize the application traffic; when the match succeeds, the corresponding traffic is marked and the process ends; when no match is found, the traffic is marked as uncertain traffic and the process proceeds to step S3;
step S3, the convolutional neural network (CNN) model classifies and identifies the uncertain traffic; when identification succeeds, the traffic is marked as the corresponding blockchain traffic type and the process ends; when the traffic cannot be identified, it is marked as non-blockchain traffic and the process ends.
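The control flow of steps S2 and S3 can be sketched as follows. This is only an illustration under stated assumptions: the function names `identify_flow`, `toy_dpi` and `toy_cnn`, the label strings, and the toy matching rules are hypothetical and do not come from the invention; the two stages are passed in as callables so that any DPI matcher and any trained classifier could be substituted.

```python
from typing import Callable, Optional

# A stage returns a label when it recognizes the traffic, or None otherwise.
Stage = Callable[[bytes], Optional[str]]


def identify_flow(flow: bytes, dpi_match: Stage, cnn_classify: Stage) -> str:
    """Two-stage identification: DPI first (S2), CNN for uncertain traffic (S3)."""
    label = dpi_match(flow)        # S2: pattern matching against the feature library
    if label is not None:
        return label               # matched: mark the traffic and finish
    label = cnn_classify(flow)     # S3: uncertain traffic is handed to the CNN
    if label is not None:
        return label               # recognized as a blockchain traffic type
    return "non-blockchain"        # neither stage recognized the traffic


# Hypothetical stand-ins for the two modules, for illustration only.
def toy_dpi(flow: bytes) -> Optional[str]:
    return "known:app-a" if flow.startswith(b"\x01APPA") else None


def toy_cnn(flow: bytes) -> Optional[str]:
    return "blockchain:unknown-app" if len(flow) > 8 else None
```

Because the CNN stage is consulted only when DPI fails, traffic with a known signature never incurs the cost of feature extraction and model inference.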
Further, when the DPI module performs pattern matching on the traffic data in step S2, it first parses and decodes the application layer protocol of the traffic data, then uses a search algorithm engine to extract the payload features from the traffic data packets and matches them against the feature library in the DPI module; if the match succeeds, the traffic data is identified as known blockchain application data.
Further, the specific process by which the convolutional neural network (CNN) model identifies the data traffic in step S3 is as follows:
(1) the feature extraction module collects blockchain traffic as training traffic data, extracts statistical features of the training traffic data, and establishes a traffic feature vector set;
(2) the machine learning training module performs deep learning on the traffic feature vectors, and a training model for identifying blockchain application traffic is obtained through training;
(3) the feature extraction module collects and delivers real-time network traffic and extracts a real-time feature vector set from it;
(4) the real-time feature vectors are identified and judged by the trained model.
Further, the traffic feature vector set in step (1) includes the data traffic features of both blockchain and non-blockchain applications.
Further, the feature parameters of the traffic feature vector set include a multi-port feature, a multi-connectivity feature, a remote address port uniformity feature, the number of alternations between large and small data packets, the standard deviation of the payload lengths of the data packets, and the number of data packets whose payload length is greater than zero.
Further, the model structure of the convolutional neural network (CNN) model comprises an input layer, a convolutional layer, a pooling layer and a fully connected layer, wherein the output $y_j$ of the jth convolutional neuron in the convolutional layer is obtained by the following formulas:
$$S_j = \sum_{i=1}^{n} w_{ji} x_i + b_j = W_j X + b_j$$
$$y_j = f(S_j)$$
in the formulas, $S_j$ is the net input value of the jth neuron; $x_1, x_2, \ldots, x_i, \ldots, x_n$ are the input values from neurons 1, 2, ..., i, ..., n; $w_{j1}, w_{j2}, \ldots, w_{ji}, \ldots, w_{jn}$ are the connection strengths between neurons 1, 2, ..., i, ..., n and the jth neuron, namely the weights; $b_j$ is the threshold; $f(\cdot)$ is the transfer function; $X$ is the blockchain application feature vector, written as the transpose of a row vector; and $W_j$ is the connection strength weight vector.
Further, the transfer function $f(S_j)$ is a bounded, monotonically increasing function.
Advantageous effects: compared with the prior art, the invention has the following innovations:
(1) The method combines a DPI algorithm with a machine learning algorithm based on a convolutional neural network model to identify blockchain applications in a real-time network.
(2) Known blockchain applications are accurately identified by the DPI algorithm according to a feature library of the unique identifiers of blockchain data packets.
(3) In the machine learning identification module, a convolutional neural network algorithm is adopted as the learning model to identify unknown blockchain traffic. A "data set" covering blockchain applications and non-blockchain applications is created. The data set includes a training set and a test set: the training set is used to train the model, and the test set is used to test the model trained in the preceding stage.
(4) The training model is optimized by cross-validation to obtain the optimal training model.
Drawings
FIG. 1 is a logic flow diagram of the blockchain application traffic identification method based on DPI and CNN according to the invention;
FIG. 2 is a structure diagram of the real-time blockchain traffic monitoring system based on the convolutional neural network (CNN) algorithm of the invention;
FIG. 3 is a model of the jth convolutional neuron of the convolutional neural network (CNN) algorithm of the invention;
FIG. 4 is a graph comparing the recognition rates of the different methods at different sampling times according to the embodiment of the invention.
Detailed Description
The invention will be further elucidated with reference to the following description of an embodiment in conjunction with the accompanying drawing. It is to be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.
In order to identify blockchain applications in the network more accurately and to improve the ability to identify blockchain application traffic, the invention combines deep packet inspection (DPI) with a machine learning model based on a convolutional neural network (CNN). This combination makes up for the inability of traditional methods to identify blockchain applications accurately and improves the ability to identify them. The specific process of the invention is described in detail below:
FIG. 1 is a flow chart of the hybrid identification module based on DPI and CNN according to the invention. The traffic collection module captures and delivers the data traffic in the network in real time through a rule-delivery mechanism. First, the DPI identification module judges whether the traffic is the application traffic of a known blockchain. If so, the application traffic is marked as known traffic. Otherwise, the traffic is passed to the convolutional neural network module, which uses the model built by machine learning to judge whether it is blockchain traffic; if so, it is marked as blockchain traffic.
Deep packet inspection model
The DPI module uses a pattern matching algorithm to match the unique identifiers of application data packets in the real-time network traffic, thereby identifying the application traffic. The key technical points of the DPI module are the accuracy with which the unique identifiers are selected and the design of the feature library model.
When data traffic delivered from the real-time network enters the DPI detection module, the system parses and decodes the application layer protocol in real time, uses a search algorithm engine to extract the features of the payload in the traffic data packets, and matches them against the feature library to judge whether the traffic is known blockchain traffic. If the match succeeds, the application traffic is marked as a known blockchain application. In the DPI traffic identification model, the identification accuracy depends on how accurately the features are selected and how broadly they cover the applications. However, because new applications emerge in an endless stream and unknown applications are insufficiently collected, no comprehensive feature library exists, and the feature library must be updated continuously as protocols are updated. Application traffic that the DPI module does not recognize therefore requires further inspection by machine learning.
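The payload matching step described above can be illustrated with a minimal sketch. It assumes that each entry in the feature library is a fixed byte signature that may occur anywhere in the decoded payload; the library contents, the application labels, and the function name `dpi_match` are hypothetical, and a production DPI engine would typically use a multi-pattern search algorithm rather than this linear scan.

```python
from typing import Dict, Optional

# Hypothetical feature library: application label -> unique byte signature.
FEATURE_LIBRARY: Dict[str, bytes] = {
    "example-chain-a": b"\xde\xad\xbe\xef",
    "example-chain-b": b"EXCHAINB/1.0",
}


def dpi_match(payload: bytes,
              library: Dict[str, bytes] = FEATURE_LIBRARY) -> Optional[str]:
    """Return the label of the first signature found in the payload, else None."""
    for label, signature in library.items():
        if signature in payload:   # substring search over the raw payload bytes
            return label
    return None                    # no match: the traffic is treated as uncertain
```

Returning `None` rather than a default label keeps the "uncertain" outcome explicit, so that the caller can hand such traffic to the machine learning stage.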
Convolutional neural network model
A convolutional neural network is employed in this patent as the machine learning model. The method is divided into a training phase and a recognition phase, and the system structure is shown in FIG. 2. In the training phase, a data set for training is collected, features are extracted from it to form a feature-vector training set of samples, the training set is trained with a machine learning algorithm, and the cross-validation facility of a machine learning model selection library is used to obtain the training model. In the recognition phase, a feature vector is extracted from the real-time network traffic once per time window and passed to the recognition module to obtain a recognition result. The machine learning framework is TensorFlow, which is used for dataflow programming and model deployment. The specific process by which the CNN identifies blockchain application traffic is as follows:
As shown in FIG. 2, the feature extraction module collects the blockchain traffic in the network and builds a traffic "data set" from the extracted statistical features of the application traffic. The machine learning module performs deep learning on the traffic data set with a machine learning algorithm to generate a training model for identifying blockchain application traffic. The data set includes the data traffic features of a large number of blockchain and non-blockchain applications. The real-time feature extraction module collects and delivers the traffic in the real-time network, establishes a real-time feature vector set, and sends it to the recognition module. The recognition module then uses the trained model to predict on the real-time data traffic and obtain a prediction result.
The model structure includes an input layer, a convolutional layer, a pooling layer and a fully connected layer. In the preprocessing stage, traffic from blockchain applications and from some non-blockchain applications is preprocessed, and the features of the data traffic are extracted. Feature selection is then performed and redundant features are removed, so that the system obtains an optimal feature subset. Finally, the feature subset is used as the machine learning data set, which comprises a training set and a test set.
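The cross-validation used to select the training model can be sketched in plain Python as follows. This is an illustration only: it implements ordinary k-fold cross-validation without shuffling, and the stand-in model `fit_majority` (which always predicts the most frequent training label) merely takes the place of the CNN so that the example stays self-contained. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def cross_validate(X: Sequence, y: Sequence, fit: Callable, k: int = 5) -> float:
    """Mean validation accuracy over k folds; fit(X, y) must return a predictor."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)


def fit_majority(X_train: Sequence, y_train: Sequence) -> Callable:
    """Stand-in model (not the CNN): always predict the most common label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda x: majority
```

In practice the samples would be shuffled (or stratified by class) before splitting, and the mean score would be compared across candidate hyper-parameter settings to pick the best model.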
In the CNN model, the features in the training set are represented as a matrix of vectors and used as the input to the input layer. Suitable hyper-parameters are then selected, and the training set is imported to train the weights and biases of the model to obtain an optimal training model. When real-time network data traffic enters the detection module, the system identifies blockchain applications using the trained model. Preferably, the feature parameters in the "data set" include: a multi-port feature, a multi-connectivity feature, a remote address port uniformity feature, the number of alternations between large and small data packets, the standard deviation of the payload lengths of the data packets, and the number of data packets whose payload length is greater than zero.
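Three of the listed feature parameters can be computed directly from the sequence of payload lengths observed in one time window, as the following sketch shows. The source does not define how "large" and "small" packets are separated or which form of standard deviation is used, so the `large_threshold` of 500 bytes, the use of the population standard deviation, and the key names `stddev`, `swf` and `pcnt` for these three quantities are all assumptions made for illustration.

```python
import statistics
from typing import Dict, Sequence


def packet_features(payload_lengths: Sequence[int],
                    large_threshold: int = 500) -> Dict[str, float]:
    """Compute three per-window statistical features from payload lengths."""
    # Packets that actually carry a payload.
    nonzero = [n for n in payload_lengths if n > 0]
    pcnt = len(nonzero)

    # Population standard deviation of all payload lengths (0.0 for an empty window).
    stddev = statistics.pstdev(payload_lengths) if payload_lengths else 0.0

    # Count switches between "large" and "small" among consecutive payload packets.
    is_large = [n >= large_threshold for n in nonzero]
    swf = sum(1 for a, b in zip(is_large, is_large[1:]) if a != b)

    return {"stddev": stddev, "swf": swf, "pcnt": pcnt}
```

For the window `[0, 100, 1000, 100, 1000, 0]`, four packets carry a payload and the size class switches three times. The remaining features (multi-port, multi-connectivity, remote address port uniformity) depend on connection-level information and are not covered by this sketch.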
As shown in FIG. 3, the jth basic convolutional neuron of the invention operates as follows. The net input value $S_j$ of the jth neuron of the convolutional neural network detection module is:
$$S_j = \sum_{i=1}^{n} w_{ji} x_i + b_j = W_j X + b_j$$
where $x_1, x_2, \ldots, x_i, \ldots, x_n$ represent the inputs from neurons 1, 2, ..., i, ..., n, respectively; $w_{j1}, w_{j2}, \ldots, w_{ji}, \ldots, w_{jn}$ represent the connection strengths between neurons 1, 2, ..., i, ..., n and the jth neuron, namely the weights; $b_j$ is the threshold; $f(\cdot)$ is the transfer function; and $y_j$ is the output of the jth neuron. The blockchain application feature vector, written as the transpose of a row vector, and the weight vector are:
$$X = [\mathrm{mip}, \mathrm{mport}, \mathrm{raup}, \mathrm{stddev}, \mathrm{swf}, \mathrm{pcnt}]^T$$
$$W_j = [w_{j1}\ w_{j2}\ \cdots\ w_{ji}\ \cdots\ w_{jn}]$$
After the net input $S_j$ passes through the transfer function $f(\cdot)$, the output $y_j$ of the jth neuron is obtained:
$$y_j = f(S_j) = f(W_j X + b_j)$$
where $f(\cdot)$ is a monotonically increasing function and must be bounded.
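The computation performed by a single neuron, a weighted sum of the inputs plus the threshold followed by the transfer function, can be written out as a short sketch. The description only requires the transfer function to be bounded and monotonically increasing; the logistic sigmoid used here is one common function with those properties and is an assumption, as are the function names.

```python
import math
from typing import Callable, Sequence


def sigmoid(s: float) -> float:
    """Logistic function: bounded in (0, 1) and monotonically increasing."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)                # avoids overflow for large negative s
    return e / (1.0 + e)


def neuron_output(x: Sequence[float], w: Sequence[float], b: float,
                  f: Callable[[float], float] = sigmoid) -> float:
    """Compute y_j = f(S_j), where S_j = sum_i w_ji * x_i + b_j."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    s = sum(wi * xi for wi, xi in zip(w, x)) + b   # net input S_j
    return f(s)                                    # output y_j
```

With the six-dimensional feature vector described above, `x` would hold the six feature values and `w` the six corresponding weights of neuron j.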
The blockchain application traffic recognition rate of the DPI and CNN based method of the invention was compared with that of the common approaches in the prior art. Several common blockchain applications were selected and analyzed with three methods: DPI alone, the convolutional neural network alone, and hybrid detection combining DPI with the convolutional neural network. The statistical results are shown in Table 1.
TABLE 1 comparison of recognition rates by different methods
As can be seen from Table 1, the recognition rate of the DPI method when identifying blockchain application traffic is about 85%, and the recognition rate of the convolutional neural network method is also about 85%. For the hybrid detection model that combines DPI with the convolutional neural network, the recognition rate is clearly higher, at between 93% and 97%. The average recognition rates of the above methods were compared over the results of multiple detections, and the results are shown in FIG. 4.
As can be seen from FIG. 4, when blockchain application traffic is identified over a number of sampling times, the recognition rate of the hybrid DPI and convolutional neural network identification method of the invention is higher than that of either the DPI method or the convolutional neural network method alone.
The key points of the invention are as follows. First, a DPI detection algorithm is used to accurately identify known blockchain applications. Second, a CNN-based machine learning method is used to identify the traffic of unknown blockchain applications. Third, a combined DPI and CNN model is adopted in the machine learning module for training and real-time traffic detection. Fourth, the features for the DPI and CNN models are selected and labeled.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.