CN115250199B - Data stream detection method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN115250199B
CN115250199B (application CN202210839784.1A)
Authority
CN
China
Prior art keywords: data, self, network model, dimensional feature, feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210839784.1A
Other languages
Chinese (zh)
Other versions
CN115250199A (en)
Inventor
丰竹勃
安韬
王智民
王高杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Original Assignee
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing 6Cloud Technology Co Ltd, Beijing 6Cloud Information Technology Co Ltd filed Critical Beijing 6Cloud Technology Co Ltd
Priority claimed from CN202210839784.1A
Publication of CN115250199A
Application granted
Publication of CN115250199B
Legal status: Active

Classifications

    • H04L63/1408: Network security; detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • G06N3/088: Neural networks; non-supervised learning, e.g. competitive learning
    • Y02D30/50: Reducing energy consumption in wire-line communication networks


Abstract

The application discloses a data stream detection method, an apparatus, a terminal device and a storage medium. The data stream detection method comprises the following steps: acquiring traffic data to be detected; performing feature extraction on the traffic data to obtain feature data; and inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder and is obtained through unsupervised algorithm training based on the encoder and the decoder. The method and apparatus achieve unsupervised anomaly detection of encrypted traffic data and reduce the difficulty and cost of detecting encrypted traffic data.

Description

Data stream detection method and device, terminal equipment and storage medium
Technical Field
The present invention relates to the field of data detection, and in particular, to a data stream detection method, apparatus, terminal device, and storage medium.
Background
With the rapid spread and application of the internet, many network security problems have been exposed. For example, malware hidden in encrypted data streams has become increasingly common; such malware can observe and steal communication information carried over TLS (Transport Layer Security), intrude into the user's computer, and violate the user's personal privacy. Before TLS begins encrypted communication, the client and server perform a handshake in which the best mutually accepted signature algorithm, compression scheme, hash algorithm, and so on are not yet determined. Because this handshake is carried out in plaintext, the communication information exchanged during it can easily be observed and stolen by malware.
Conventionally, a random forest model is trained to identify abnormal data streams within encrypted traffic, but such supervised training relies on a large set of labeled abnormal traffic samples. Moreover, when judging plausible-looking abnormal encrypted flows, the model easily misclassifies them as normal encrypted flows, which increases the difficulty and cost of detecting encrypted traffic data.
Disclosure of Invention
The invention mainly aims to provide a data stream detection method, apparatus, terminal device and storage medium, with the goal of achieving anomaly detection of traffic data through unsupervised training and reducing the difficulty and cost of detecting traffic data.
In order to achieve the above object, the present invention provides a data stream detection method, including:
acquiring flow data to be detected;
performing feature extraction on the flow data to obtain feature data;
inputting the characteristic data into a pre-established self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder, and the self-coding network model is obtained by unsupervised algorithm training based on the encoder and the decoder.
Optionally, the step of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result further includes:
and obtaining the self-coding network model based on the encoder and the decoder and through unsupervised algorithm training.
Optionally, the step of deriving the self-coding network model based on the encoder and the decoder and trained by an unsupervised algorithm includes:
acquiring TLS encrypted traffic communication data collected in advance as training data to construct a training set;
performing feature extraction on the training data through a feature dictionary built from the TLS parameter list, constructing a high-dimensional feature vector based on the extracted feature values, and recording the high-dimensional feature vector as a first high-dimensional feature vector;
inputting the first high-dimensional feature vector into a pre-created deep self-coding neural network, and performing the following processing in the deep self-coding neural network:
reconstructing the first high-dimensional feature vector according to the weight initialized by the random strategy to obtain a first reconstructed vector;
calculating first error data of the first reconstruction vector and the first high-dimensional feature vector, and updating the weight according to the first error data and an Adam algorithm to obtain a first updated weight;
and returning the first updated weight to the deep self-coding neural network, reconstructing the high-dimensional feature vector with the first updated weight, and returning to the execution step: calculating first error data between the first reconstruction vector and the first high-dimensional feature vector and updating the weight according to the first error data and the Adam algorithm to obtain a new first updated weight; this loop repeats until the calculated error data falls below a preset error threshold, at which point training terminates and the trained self-coding network model is obtained.
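The training loop described in the steps above (reconstruct, measure the error, update the weight with Adam, repeat until the error falls below a preset threshold) can be sketched on a toy one-weight "autoencoder". This is a minimal sketch only: the real model operates on high-dimensional feature vectors, and the learning rate, threshold, and initial weight below are illustrative assumptions rather than values from the patent.

```python
import math

def train_toy_autoencoder(x, lr=0.05, threshold=1e-6, max_steps=5000):
    """Toy 1-weight model: the reconstruction is w * x and the error is the
    squared difference. Mirrors the loop in the text: reconstruct, compute
    error data, update the weight with the Adam algorithm, and terminate
    once the error drops below the preset threshold."""
    w = 0.0                      # weight "initialized by a random strategy" (fixed here for determinism)
    m = v = 0.0                  # Adam first/second moment estimates
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    err = float("inf")
    for t in range(1, max_steps + 1):
        recon = w * x            # "first reconstructed vector"
        err = (recon - x) ** 2   # "first error data"
        if err < threshold:      # training terminates below the error threshold
            return w, err, t
        g = 2.0 * (recon - x) * x            # gradient of the squared error w.r.t. w
        m = beta1 * m + (1 - beta1) * g      # Adam moment updates
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, err, max_steps

w, err, steps = train_toy_autoencoder(1.0)
```

On this toy problem the weight is driven toward 1.0, at which point the reconstruction matches the input and the loop terminates.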
Optionally, the step of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result includes:
inputting the characteristic data into the self-coding network model, and carrying out the following processing:
constructing a second high-dimensional feature vector based on the feature data;
constructing a second reconstruction vector based on the second high-dimensional feature vector;
calculating second error data of the second high-dimensional feature vector and a second reconstructed vector, and obtaining a second error data set according to the second error data;
and obtaining a target probability value according to the distribution of the second error data in the second error data set, and taking the target probability value as a detection result.
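One plausible reading of "obtaining a target probability value according to the distribution of the second error data in the second error data set" is an empirical-percentile score: a flow whose reconstruction error sits in the upper tail of the error distribution scores close to 1.0. This scoring rule and the toy numbers are assumptions for illustration, not taken verbatim from the patent.

```python
def anomaly_probability(error, error_set):
    """Empirical-percentile reading of the 'target probability value': the
    fraction of reconstruction errors in the set that are smaller than the
    error of the flow under test. An illustrative assumption, not the
    patent's exact formula."""
    if not error_set:
        raise ValueError("error set must be non-empty")
    below = sum(1 for e in error_set if e < error)
    return below / len(error_set)

# Toy reconstruction errors of recently seen flows; the last one is an outlier.
errors = [0.01, 0.02, 0.015, 0.012, 0.9]
score = anomaly_probability(0.9, errors)
```

A flow with error 0.9 scores 0.8 here (four of the five errors are smaller), while a flow with a tiny error scores 0.0.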
Optionally, after the step of calculating second error data of the second high-dimensional feature vector and the second reconstruction vector and obtaining a second error data set according to the second error data, the method further includes:
comparing the second error data with the target error data to obtain a comparison result;
adjusting the first updated weight according to the comparison result and the Adam algorithm to obtain a second updated weight;
and returning the second updated weight to the self-coding network model to replace the first updated weight, thereby obtaining an updated self-coding network model.
Optionally, the step of acquiring data of the flow to be detected includes:
the method comprises the steps of obtaining data information of TLS communication data quadruplets as data of flow to be detected, wherein the TLS communication data quadruplets comprise a source IP address, a source port, a destination IP address and a destination port.
Optionally, the step of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result further includes:
and verifying the effectiveness of the self-coding network model through a data quality evaluation algorithm and a pre-collected test set.
In addition, an embodiment of the present application further provides a data stream detection device, where the data stream detection device includes:
the acquisition module is used for acquiring the flow data to be detected;
the extraction module is used for extracting the characteristics of the flow data to obtain characteristic data;
and the detection module is used for inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder and is obtained through unsupervised algorithm training based on the encoder and the decoder.
In addition, an embodiment of the present application further provides a terminal device, where the terminal device includes a memory, a processor, and a data stream detection program that is stored on the memory and is executable on the processor, and when the data stream detection program is executed by the processor, the steps of the data stream detection method described above are implemented.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a data flow detection program is stored, and when executed by a processor, the data flow detection program implements the steps of the data flow detection method described above.
According to the data stream detection method, apparatus, terminal device and storage medium described above, traffic data to be detected is acquired, feature extraction is performed on the traffic data to obtain feature data, and the feature data is input into a pre-created self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder and is obtained through unsupervised algorithm training based on them. Detecting the traffic to be detected with the trained self-coding network model achieves unsupervised anomaly detection of encrypted traffic data and reduces the difficulty and cost of detecting encrypted traffic data. Considering that labeling a large number of samples requires a great deal of preprocessing time, and that abnormal encrypted traffic appears in a server's traffic data only when the server is attacked by malware, the scheme of the application provides a model training method with little dependence on training data, and the model obtained by this method is more widely applicable than other models.
Drawings
Fig. 1 is a schematic diagram of functional modules of a terminal device to which a data stream detection apparatus of the present application belongs;
FIG. 2 is a schematic flow chart diagram illustrating an exemplary embodiment of a data stream detection method of the present application;
FIG. 3 is a diagram illustrating an overall data flow of a self-coding network model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating another exemplary embodiment of a data stream detection method of the present application;
FIG. 5 is a schematic flow chart diagram illustrating another exemplary embodiment of a data stream detection method of the present application;
FIG. 6 is a schematic flow chart diagram illustrating another exemplary embodiment of a data stream detection method of the present application;
FIG. 7 is a schematic flow chart diagram illustrating another exemplary embodiment of a data stream detection method of the present application;
fig. 8 is a flowchart illustrating another exemplary embodiment of the data stream detection method according to the present application.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The main solution of the embodiment of the application is as follows: an encoder and a decoder are configured in a self-coding network model; a pre-collected training set is obtained, the training set comprising unlabeled training data from a plurality of unit-time windows; feature data is extracted from the training data; the feature data is encoded by the encoder to obtain encoded data; the encoded data is decoded by the decoder to obtain decoded data; a comparison result is obtained by comparing the feature data with the decoded data; and the self-coding network model is trained according to the comparison result and preset target error data. When traffic to be detected is examined with the trained self-coding network model, the feature data in the traffic data is extracted and input into the pre-created self-coding network model for detection to obtain a detection result. This provides a method for training a detection model on TLS traffic without labeled samples; the trained model is further optimized based on error data, improving its practicality and accuracy, achieving unsupervised anomaly detection of encrypted traffic data, and reducing the difficulty and cost of detecting encrypted traffic data.
The technical terms related to the embodiments of the present application are:
TLS (Transport Layer Security) ensures the security of data transmission between two communicating parties; it is equivalent to establishing a security layer between them and is the basis for implementing HTTPS. The handshake process in TLS includes: the client sends a Client Hello message to the server carrying its supported cipher combinations and random number 1; after receiving it, the server replies with a Server Hello message carrying the chosen cipher combination, its certificate, and random number 2; random numbers 1 and 2 are then combined by an algorithm to generate a key; finally both parties send Finished messages.
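As an illustrative simplification of the key-generation step just described, the two handshake random numbers can be pictured as being combined into a shared key. Real TLS derives keys with a PRF/HKDF over a premaster secret rather than a bare hash; the bare hash below is a deliberate assumption made only to show that both parties, holding the same two random numbers, derive the same key.

```python
import hashlib

def toy_session_key(client_random: bytes, server_random: bytes) -> bytes:
    """Illustrative only: combine random number 1 (from Client Hello) and
    random number 2 (from Server Hello) into a key. Real TLS uses a
    PRF/HKDF together with a premaster secret; hashing the two randoms
    directly is a deliberate simplification."""
    return hashlib.sha256(client_random + server_random).digest()

key = toy_session_key(b"\x01" * 32, b"\x02" * 32)
```

Both endpoints computing this function over the same pair of randoms obtain identical 32-byte keys, which is the property the handshake relies on.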
Unsupervised learning refers to a machine-learning mode in which the input data carries no labels and no predetermined outcome: the classes of the samples are unknown, so the sample set must be grouped according to the similarity between samples, minimizing intra-class differences and maximizing inter-class differences. In practical applications the labels of the samples cannot be known in advance, i.e. no class is attached to the training samples, so the classifier can only be learned from a sample set without labels.
The embodiment of the application considers that training a model normally requires a large amount of labeled sample data, and that labeling at such a scale takes a great deal of preprocessing time. It also exploits the fact that the traffic data in a server exhibits abnormal characteristics only when the server is attacked. It therefore provides a model training method with little dependence on training data; the model obtained by this method is more widely applicable than other models, unsupervised anomaly detection of encrypted traffic data is achieved, the trained model is optimized to improve its practicality and accuracy, and the difficulty and cost of detecting encrypted traffic data are reduced.
Specifically, referring to fig. 1, fig. 1 is a schematic diagram of functional modules of a terminal device to which the data stream detection apparatus of the present application belongs. The data flow detection device may be a device which is independent of the terminal device and capable of performing data flow detection and network model training, and may be carried on the terminal device in a form of hardware or software. The terminal device can be an intelligent mobile terminal with a data processing function, such as a mobile phone and a tablet personal computer, and can also be a fixed terminal device or a server with a data processing function.
In this embodiment, the terminal device to which the data stream detection apparatus belongs at least includes an output module 110, a processor 120, a memory 130, and a communication module 140.
The memory 130 stores an operating system and a data stream detection program. The data stream detection device encodes, with the encoder, the feature data extracted from the traffic to be detected and from the pre-collected, unlabeled training set, obtaining encoded data; the decoder decodes the encoded data to obtain decoded data; and information such as the error data obtained by comparing the feature data with the decoded data is stored in the memory 130. The output module 110 may be a display screen or the like. The communication module 140 may include a WIFI module, a mobile communication module, a Bluetooth module, and the like, and communicates with an external device or a server through the communication module 140.
Wherein the data flow detection program in the memory 130 when executed by the processor implements the steps of:
acquiring flow data to be detected;
performing feature extraction on the flow data to obtain feature data;
inputting the characteristic data into a pre-established self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder, and is obtained based on the encoder and the decoder and through unsupervised algorithm training.
Further, the data flow detection program in the memory 130 when executed by the processor further implements the steps of:
acquiring TLS encrypted traffic communication data collected in advance as training data to construct a training set;
performing feature extraction on the training data through a feature dictionary built from the TLS parameter list, constructing a high-dimensional feature vector based on the extracted feature values, and recording the high-dimensional feature vector as a first high-dimensional feature vector;
inputting the first high-dimensional feature vector into a pre-created deep self-coding neural network, and performing the following processing in the deep self-coding neural network:
reconstructing the first high-dimensional feature vector according to the weight initialized by the random strategy to obtain a first reconstructed vector;
calculating first error data of the first reconstruction vector and the first high-dimensional feature vector, and updating the weight according to the first error data and an Adam algorithm to obtain a first updated weight;
and returning the first updated weight to the deep self-coding neural network, reconstructing the high-dimensional feature vector with the first updated weight, and returning to the execution step: calculating first error data between the first reconstruction vector and the first high-dimensional feature vector and updating the weight according to the first error data and the Adam algorithm to obtain a new first updated weight; the weight update loops until the calculated first error data falls below a preset error threshold, at which point training terminates and the trained self-coding network model is obtained.
Further, the data flow detection program in the memory 130 when executed by the processor further implements the steps of:
inputting the characteristic data into the self-coding network model, and carrying out the following processing:
constructing a second high-dimensional feature vector based on the feature data;
constructing a second reconstructed vector based on the second high-dimensional feature vector;
calculating second error data of the second high-dimensional feature vector and a second reconstruction vector, and obtaining a second error data set according to the second error data;
and obtaining a target probability value according to the distribution of the second error data in the second error data set, and taking the target probability value as a detection result.
Further, the data flow detection program in the memory 130 when executed by the processor further implements the steps of:
comparing the second error data with the target error data to obtain a comparison result;
adjusting the first updating weight according to the comparison result and the Adam algorithm to obtain a second updating weight;
and returning the second updating weight to the self-coding network model, and updating the first updating weight to obtain an updated self-coding network model.
Further, the data flow detection program in the memory 130 when executed by the processor further implements the steps of:
the method comprises the steps of obtaining data information of TLS communication data quadruplets as data of flow to be detected, wherein the TLS communication data quadruplets comprise a source IP address, a source port, a destination IP address and a destination port.
Further, the data flow detection program in the memory 130 when executed by the processor further implements the steps of:
and verifying the effectiveness of the self-coding network model through a data quality evaluation algorithm and a pre-collected test set.
According to the above scheme, traffic data to be detected is acquired, feature extraction is performed on the traffic data to obtain feature data, and the feature data is input into a pre-created self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder and is obtained through unsupervised algorithm training based on them. Because an unsupervised algorithm is adopted, i.e. the self-coding network model is trained with data that needs no labeling, a training method with little dependence on training data is provided, unsupervised anomaly detection of encrypted traffic data is achieved, and detection applicability is improved. The self-coding network model is trained cyclically by adjusting its parameters, which improves its accuracy and practicality and ultimately reduces the difficulty and cost of detecting encrypted traffic data.
Based on the above terminal device architecture, but not limited to the above architecture, the embodiments of the method of the present application are proposed.
Referring to fig. 2, fig. 2 is a schematic flowchart of an exemplary embodiment of the data stream detection method of the present application. The data stream detection method comprises the following steps:
step S1001, acquiring data of flow to be detected;
specifically, the acquisition may be performed by automatically setting a capture task through data capture software, and the object to be acquired may be obtained by capturing communication data in the encryption suite.
For example, the traffic to be detected in this embodiment may be a Client Hello or Server Hello packet in TLS encrypted traffic communication data, an elliptic-curve signature algorithm, or an elliptic-curve-format cipher.
Specifically, the TLS version, cipher suite, compression option, and the decimal byte values of the extension list in the Client Hello or Server Hello packet, the field information of the signature algorithm, or the public-key and/or private-key byte values of the elliptic-curve-format cipher may be taken as the traffic data to be detected.
In some malware sample data the server always responds to the client in the same way, and the Client Hello / Server Hello handshake is generally present in encrypted traffic data. This embodiment therefore adopts several data combinations from the Client Hello and Server Hello packets as the traffic data to be detected: these data are reliable on any given port and provide a finer granularity for independent evaluation than the cipher suite alone. In other words, compared with other cipher-suite-based inputs they are more plentiful, yield higher accuracy, and carry no port restriction, so their range of use is wider, the reliability of the training data is improved, and the practicality and applicability of the detected data stream are improved.
Step S1002, extracting the characteristics of the flow data to obtain characteristic data;
after the data of the flow to be detected is obtained, preprocessing is carried out on the data to extract features in the flow data to serve as distinguishing features to obtain feature data, namely the obtained data are converted into machine language, for example, chinese character data in a signature algorithm are subjected to feature extraction according to the stroke sequence of a signature pen to obtain feature data, or feature extraction is carried out on the data of the flow to be detected through a TLS parameter list resume feature dictionary, and a high-dimensional feature vector is constructed based on extracted feature values. The feature data obtained after preprocessing is beneficial to machine identification and subsequent calculation.
Step S1003, inputting the characteristic data into a pre-established self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder, and the self-coding network model is obtained based on the encoder and the decoder through unsupervised algorithm training.
Specifically, the feature data extracted in step S1002 passes through the encoder and then the decoder to obtain a detection result, where the detection result is a probability value representing the degree of abnormality of the data traffic corresponding to the feature data.
The encoder is used for performing convolution, pooling, full-connection and dropout processing on the feature data to obtain encoded data, which is provided to the decoder;
the decoder is used for performing dropout, full-connection, upsampling and full-connection processing on the encoded data until it is fully restored to the dimension of the feature data, obtaining decoded data; a detection result is then obtained from the comparison of the decoded data with the feature data.
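The encoder/decoder pipeline just described (convolution, pooling, full connection, dropout on the way in; dropout, full connection, upsampling, full connection back to the original dimension on the way out) can be sketched by tracking the feature dimension S through each stage. The layer sizes are illustrative assumptions, and dropout is dimension-preserving, so it appears only in the stage names.

```python
def autoencoder_dims(S, bottleneck=32):
    """Track the feature-vector dimension through the encoder/decoder stages
    named in the text. Sizes are illustrative assumptions; dropout keeps the
    dimension unchanged."""
    dims = [("input", S)]
    d = S                                     # convolution with 'same' padding keeps the length
    dims.append(("conv", d))
    d = d // 2                                # pooling halves the length
    dims.append(("pool", d))
    dims.append(("fc+dropout", bottleneck))   # full connection down to the code
    # decoder: dropout -> full connection -> upsampling -> full connection
    d = S // 2
    dims.append(("dropout+fc", d))
    d = d * 2                                 # upsampling doubles the length
    dims.append(("upsample", d))
    dims.append(("fc", S))                    # final full connection restores dimension S
    return dims

stages = autoencoder_dims(128)
```

The key invariant is that the last stage returns to the input dimension S, so the reconstruction can be compared element-wise with the feature data.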
The execution main body of the method of this embodiment may be a data stream detection device, or may also be a data stream detection terminal device or server, and this embodiment is exemplified by the data stream detection device, and the data stream detection device may be integrated on a terminal device such as a smart phone, a tablet computer, and the like having a data processing function.
In this embodiment, a self-coding network model is used to detect the traffic to be detected. The framework of the self-coding network model comprises an encoder and a decoder, and the model is trained in this encoder-decoder form; the data flow of the whole network is shown in figure 3, where S denotes the dimension of the feature data.
According to this technical scheme, the traffic data to be detected is obtained, feature extraction is performed on the traffic data to obtain feature data, and the feature data is input into a pre-created self-coding network model for detection to obtain a detection result.
Referring to fig. 4, fig. 4 is a schematic flowchart of another exemplary embodiment of the data stream detection method of the present application. Based on the foregoing embodiment shown in fig. 2, in this embodiment, before the step S1003 of inputting the feature data into a pre-created self-coding network model for detection, and obtaining a detection result, the data stream detection method further includes:
and S1000, obtaining the self-coding network model based on the encoder and the decoder through unsupervised algorithm training. In this embodiment, step S1000 is implemented before step S1001, and in other embodiments, step S1000 may be implemented between step S1001 and step S1002.
Compared with the embodiment shown in fig. 2, the present embodiment further includes a scheme for training a data stream detection model.
Specifically, the step of deriving the self-coding network model based on the encoder and decoder may comprise:
step S101, acquiring TLS encrypted traffic communication data collected in advance as training data to construct a training set;
specifically, in this embodiment, data information of TLS communication data quadruples is collected in advance, where the quadruple comprises a source IP address, a source port, a destination IP address, and a destination port. Taking the source IP address as an example, 500 IP addresses are randomly selected from the TLS communication data obtained in each batch to serve as training data, and a training set is constructed for subsequent feature information extraction.
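The sampling step above can be sketched in a few lines. The field layout of the quadruples and the sample size of 500 follow the text; the records themselves and the helper name are synthetic placeholders:

```python
# Illustrative sketch of building a training set from pre-collected TLS
# four-tuples: randomly select 500 records from one batch.
import random

def build_training_set(tls_records, n_samples=500, seed=42):
    """Randomly select n_samples four-tuples from one batch of TLS data."""
    rng = random.Random(seed)
    k = min(n_samples, len(tls_records))
    return rng.sample(tls_records, k)

# Synthetic batch of four-tuples: (source IP, source port, dest IP, dest port).
batch = [("10.0.0.%d" % (i % 250), 40000 + i, "203.0.113.7", 443)
         for i in range(1000)]
training_set = build_training_set(batch)
print(len(training_set))
```

Each selected quadruple then feeds the feature extraction of step S102.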
Step S102: performing feature extraction on the training data through a feature dictionary built from the TLS parameter list, and constructing a high-dimensional feature vector based on the extracted feature values, recorded as a first high-dimensional feature vector;
specifically, the feature dictionary built from the TLS parameter list is used to perform feature extraction on the training set obtained in step S101: the training set is processed according to the extraction rules in the feature dictionary to obtain the corresponding feature values, and the corresponding high-dimensional feature vectors are then constructed from these feature values and recorded as first high-dimensional feature vectors.
Step S103: reconstructing the first high-dimensional feature vector according to weights initialized by a random strategy to obtain a first reconstructed vector;
specifically, the first high-dimensional feature vector is input into a pre-created deep self-coding neural network, which comprises an input layer, an output layer, and a number of hidden layers between them; each hidden layer comprises a convolution layer, a pooling layer, a fully connected layer, and a dropout layer, and each hidden layer contains a number of neuron nodes. The set of hidden layers and neurons in the deep self-coding neural network is represented as H = {(h1, n1), (h2, n2), ..., (hn, nn)}, where H denotes the set of hidden layers and neuron nodes, hn denotes the n-th hidden layer, and nn denotes the number of neuron nodes in hidden layer hn. Each neuron node has a corresponding weight; in this embodiment, the initial weights are generated by a random strategy and are recorded as the initial weights.
First, h1 may be a convolutional layer, which performs difference-amplifying processing on the first high-dimensional feature vectors obtained in step S102, i.e., extracts the features in the first high-dimensional feature vectors. Each first high-dimensional feature vector can be multiplied by the first weights corresponding to the n1 neuron nodes, obtaining a plurality of multiplied first high-dimensional feature vectors, which are recorded as second high-dimensional feature vectors.
h2 may be a pooling layer, whose role is to compress the features, i.e., to reduce the dimensionality of the second high-dimensional feature vectors, since not every second high-dimensional feature vector is needed. A simple example: if the second high-dimensional feature vector comprises a plurality of vectors, these vectors are divided into several portions of equal size, each portion being a convolution kernel. Each first high-dimensional feature vector can be multiplied by the weights corresponding to the n2 neuron nodes; by comparing the inner products of the multiplied first high-dimensional feature vectors, the vector with the largest inner product is selected as the representative of the corresponding portion of the second high-dimensional feature vector, recorded as a first representative vector, and a first representative vector set is obtained.
h3 may be an upsampling layer. In contrast to the pooling layer, the upsampling layer expands the first representative vectors, i.e., restores them to the dimension of the first high-dimensional feature vector. Specifically, a reconstructed vector is obtained from the first representative vectors and the initial weights corresponding to the n3 neuron nodes; the dimensionality of the reconstructed vector is the same as that of the first high-dimensional feature vector.
The purpose of the fully connected layer and the dropout layer is to prevent overfitting, i.e., to prevent the weights corresponding to the nn neuron nodes from becoming so large that the calculated result deviates greatly from the actual result. A specific implementation may be that, during one training pass, a randomly selected portion of the neurons does not participate in the calculation, and during the next pass a different portion is reselected not to participate, finally yielding a more reasonable result, i.e., a reasonable reconstructed vector.
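The dropout mechanism just described can be sketched as a per-pass mask over the neurons; the drop rate of 0.5 is an assumption for illustration:

```python
# Illustrative dropout sketch: on each training pass, a random subset of
# neurons is excluded from the calculation; a different subset is chosen
# on the next pass.
import random

def dropout_mask(n_neurons, drop_rate=0.5, seed=None):
    """Return a 0/1 mask: neurons marked 0 do not participate in this pass."""
    rng = random.Random(seed)
    return [0 if rng.random() < drop_rate else 1 for _ in range(n_neurons)]

mask_a = dropout_mask(10, seed=1)
mask_b = dropout_mask(10, seed=2)   # a different portion is reselected
print(mask_a, mask_b)
```

Multiplying a layer's outputs by such a mask is what removes the selected neurons from the current calculation.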
Step S104: calculating first error data of the first reconstructed vector and the first high-dimensional feature vector, and updating the weights according to the first error data and the Adam algorithm to obtain first updated weights;
the error data comprises one or a combination of the mean, standard deviation, variance, mean square error, and the like. The error data represents the degree of similarity between a first high-dimensional feature vector passed through the self-coding network model and its reconstructed vector: the smaller the error data, the closer the reconstructed vector is to the first high-dimensional feature vector, and the more normal the traffic to be detected corresponding to that first high-dimensional feature vector.
Specifically, the reconstructed vector has the same dimensionality as the first high-dimensional feature vector. The mean square error of each dimension is calculated and recorded as the first mean square error. During back-propagation from the output layer to the hidden layers, when updating the weights corresponding to the nn neuron nodes in hn, the initial weights are adjusted according to the obtained first mean square error in combination with the Adam optimization algorithm; the adjusted weights are the first updated weights.
Illustratively, suppose a deep self-coding neural network [h1, h2, h3] has three layers, where h1 is the input layer and h2, h3 are hidden layers, each layer corresponding to a neuron, with initial weights [X, 1, 5]. Assuming the input X is 1, the target result is 6 and the current calculated result is 5, the difference error is 1, and there are four schemes for adjusting the initial weights according to this error: [X, 1, 6], [X, 2, 3], [X, 6, 1], or [X, 3, 2]. Following the Adam optimization algorithm, the scheme with the smallest adjustment amplitude relative to the initial weights, [X, 2, 3], is selected as the first updated weights, because the adjustment of the weight in h2 (from 1 to 2) and of the weight in h3 (from 5 to 3) is the smallest overall; the Adam optimization algorithm thus guides the direction of the weight optimization.
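For concreteness, a generic Adam update step is sketched below in NumPy. The hyperparameters are the common defaults and the gradient value is illustrative; this is a textbook Adam step, not the patent's exact implementation:

```python
# Minimal Adam update: adaptive step size from first- and second-moment
# estimates of the gradient.
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: returns new weights and updated moment estimates."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 5.0])                  # hidden-layer weights, as in the example
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([-1.0, -1.0])             # illustrative gradient of the error
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)                                  # weights nudged to reduce the error
```

Because the effective step size shrinks as the moment estimates stabilize, later updates adjust the weights ever more finely, which matches the learning-rate behavior described later in this document.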
Step S105: returning the first updated weights to the deep self-coding neural network, reconstructing the high-dimensional feature vector using the first updated weights, and returning to the execution step: calculating first error data of the first reconstructed vector and the first high-dimensional feature vector, and updating the weights according to the first error data and the Adam algorithm to obtain first updated weights; the weights are updated in this loop until the calculated first error data falls below a preset error threshold, at which point training terminates and the trained self-coding network model is obtained.
Specifically, the first updated weights obtained in step S104 are returned to the deep self-coding neural network for reconstructing the next set of high-dimensional feature vectors, and the following steps are performed: calculating first error data of the first reconstructed vector and the first high-dimensional feature vector, and updating the weights according to the first error data and the Adam algorithm to obtain first updated weights; this repeats until the calculated first error data falls below the preset error threshold, training terminates, and the trained self-coding network model is finally obtained.
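The loop of steps S103-S105 (reconstruct, compute the mean square error, update the weights, stop below a threshold) can be condensed as follows. This is a sketch only: a single linear layer stands in for the deep self-coding neural network, and a plain gradient step stands in for Adam; the data, threshold, and learning rate are assumptions:

```python
# Compact sketch of the training loop: reconstruct, compute MSE,
# update weights, terminate once the error is below the threshold.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # stand-in for first high-dimensional vectors
W = rng.normal(scale=0.1, size=(8, 8))   # weights initialized by a random strategy

threshold, lr = 0.05, 0.01
for step in range(10000):
    recon = X @ W                        # reconstructed vectors
    err = recon - X
    mse = float((err ** 2).mean())       # first error data (mean square error)
    if mse < threshold:                  # terminate training below the threshold
        break
    grad = 2 * X.T @ err / len(X)        # gradient of the MSE w.r.t. W
    W -= lr * grad                       # plain gradient step stands in for Adam

print(round(mse, 4), step)
```

For this well-conditioned toy problem the loop converges in on the order of a hundred steps; in the patent's scheme the Adam optimizer performs the same role of driving the first error data below the preset threshold.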
In this embodiment, the self-coding network model is built on the TensorFlow framework, or on other learning frameworks such as Keras, PyTorch, Caffe, or Theano, and is used to detect the traffic data to be detected.
In this embodiment, the above scheme works as follows. TLS encrypted traffic data collected in advance is obtained as training data and a training set is constructed. Feature extraction is performed on the training data through a feature dictionary built from the TLS parameter list, and a high-dimensional feature vector is constructed from the extracted feature values, recorded as a first high-dimensional feature vector. The first high-dimensional feature vector is reconstructed according to weights initialized by a random strategy to obtain a first reconstructed vector; first error data between the first reconstructed vector and the first high-dimensional feature vector is calculated, and the weights are updated according to the first error data and the Adam algorithm to obtain first updated weights. The first updated weights are returned to the deep self-coding neural network, the high-dimensional feature vector is reconstructed using them, and execution returns to the step of calculating the error data and updating the weights; the loop continues until the calculated error data falls below a preset error threshold, at which point training terminates and the trained self-coding network model is obtained. By cyclically adjusting the network's weight parameters using the error data in combination with the Adam optimization algorithm, the trained self-coding network model is finally obtained; unsupervised anomaly detection of encrypted traffic data is then realized by this model, reducing the difficulty and cost of detecting encrypted traffic data.
Referring to fig. 5, fig. 5 is a schematic flowchart of another exemplary embodiment of the data stream detection method of the present application. The step S1003 of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result includes:
step S10031, constructing a second high-dimensional feature vector based on the feature data;
specifically, corresponding feature data are extracted according to the acquired data of the flow to be detected, and a second high-dimensional feature vector is constructed based on the feature data, wherein the manner of acquiring the data of the flow to be detected, extracting the feature data, and constructing the second high-dimensional feature vector is the same as that of the above embodiment.
Step S10040, constructing a second reconstruction vector based on the second high-dimensional feature vector;
specifically, in the trained self-coding network model, a second high-dimensional feature vector is input to obtain a second reconstruction vector, and the construction method is the same as that of the embodiment.
Step S10041, calculating second error data of the second high-dimensional feature vector and the second reconstruction vector, and obtaining a second error data set according to the second error data;
specifically, the error data comprises one or a combination of the mean, standard deviation, variance, mean square error, and the like; here the mean square error value is taken as the error data, and mean square error calculation is performed on the second high-dimensional feature vector and the second reconstruction vector to obtain a second mean square error value and a second set of mean square error values.
Step S10042, obtaining a target probability value according to the distribution of the second error data in the second error data set, and taking the target probability value as a detection result.
Specifically, for example, in a two-dimensional XY coordinate system, the X axis represents each actually calculated error value and the Y axis represents the percentage of that error value in the error value set, so the distribution of the actual error values can be seen. According to the distribution and probability of each actual error value, the sum of the probabilities of the actual error values that differ from the target error value is the target probability value, where the target error value is a preset error value range; the target probability value thus represents the sum of the probabilities of the actual error values falling outside the preset error value range.
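Step S10042 can be sketched as counting the share of error values that fall outside the preset target range. The concrete error values and range bounds below are assumptions for illustration:

```python
# Sketch of the target probability value: the fraction of actual error
# values lying outside the preset target error range.
def target_probability(errors, low=0.0, high=1.0):
    """Share of error values outside [low, high] (the preset target range)."""
    outside = [e for e in errors if not (low <= e <= high)]
    return len(outside) / len(errors)

errors = [0.2, 0.4, 0.9, 1.5, 2.3, 0.1, 0.7, 3.0]   # second error data set
p = target_probability(errors, low=0.0, high=1.0)
print(p)  # 3 of the 8 errors exceed the target range
```

The resulting value p is the probability value reported as the detection result and later compared against the alarm threshold.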
The target probability value is then stored in a database as a basis for studying and judging the traffic to be detected; a detection-result interface is written to call the target probability value, and alarm information is sent when the target probability value reaches a preset threshold.
In this embodiment, with the above scheme, a second high-dimensional feature vector is constructed based on the feature data, a second reconstruction vector is constructed based on the second high-dimensional feature vector, second error data between them is calculated, a second error data set is obtained from the second error data, and a target probability value is obtained according to the distribution of the second error data in the second error data set and taken as the detection result. An interface for querying the degree of abnormality of the traffic to be detected is provided, and the data detection situation is displayed visually and conveniently in the interface, optimizing the user experience. Unsupervised anomaly detection of encrypted traffic data is thus realized, reducing the difficulty and cost of detecting encrypted traffic data.
Referring to fig. 6, fig. 6 is a schematic flowchart of another exemplary embodiment of the data stream detection method of the present application.
Step S10041, calculating second error data of the second high-dimensional feature vector and the second reconstruction vector, and obtaining a second error data set according to the second error data includes:
step S10043, comparing the second error data with the target error data to obtain a comparison result;
specifically, when the trained self-coding network model is used to detect the traffic to be detected, after the self-coding network model obtains the second high-dimensional feature vector, the mean square error of the second high-dimensional feature vector and the second reconstruction vector is calculated through step S10041 to obtain a second mean square error value and a second set of mean square error values.
Step S10044, adjusting the first update weight according to the comparison result and the Adam algorithm to obtain a second update weight.
Specifically, after the second mean square error value, i.e., the second set of mean square error values, is obtained in step S10043, during back-propagation from the output layer to the hidden layers, when updating the weights corresponding to the nn neuron nodes in hn, the first updated weights are adjusted according to the obtained second mean square error in combination with the Adam optimization algorithm; the adjusted weights are the second updated weights.
Step S10045, returning the second update weight to the self-coding network model, and updating the first update weight to obtain an updated self-coding network model.
Specifically, the second updated weights obtained in step S10044 are returned to the self-coding network model for weight updating. The purpose of this step is that, when a new data stream to be detected is detected by the trained self-coding network model, the first updated weights are adjusted according to the comparison result and the Adam algorithm to obtain second updated weights, so that the new reconstructed vector fits the high-dimensional feature vector. Combined with the characteristic that abnormal data appears only when the data stream is attacked, the new reconstructed vector fits the high-dimensional feature vector corresponding to the IP feature vector under normal traffic, and a better self-coding network model is obtained by continuously updating the weights in the trained self-coding network.
With this scheme, the second error data is compared with the target error data to obtain a comparison result, the first updated weights are adjusted according to the comparison result and the Adam algorithm to obtain second updated weights, and the second updated weights are returned to the self-coding network model to update the first updated weights, yielding an updated self-coding network model. The finally obtained self-coding network model, which better fits common TLS communication data, is used to detect the traffic to be detected and obtain a detection result, realizing unsupervised anomaly detection of encrypted traffic data and reducing the difficulty and cost of detecting encrypted traffic data.
Referring to fig. 7, fig. 7 is a flowchart illustrating another exemplary embodiment of the data stream detection method of the present application. Step S1001, the step of acquiring data of a flow to be detected includes:
step S10011, obtain data information of a TLS communication data quadruple, where the TLS communication data quadruple includes a source IP address, a source port, a destination IP address, and a destination port.
In this embodiment, the communication elements may be obtained from the source IP address, source port, destination IP address, and destination port, and these elements are used as training data of the traffic to be detected to construct a training set for subsequent extraction of feature information; illustratively, the IP address is extracted as an IP feature vector to serve as the high-dimensional feature vector.
Compared with other approaches to acquiring data from the traffic to be detected, since in some malware samples the server generally responds to the client in the same manner, using combined data such as the Client Hello and Server Hello packets associated with the TLS communication data quadruple is not only reliable on any specific port but also provides larger granularity than evaluating the cipher suite alone; that is, it is more informative and more accurate than extracting a reference object using only the cipher suite as the feature.
In this embodiment, with the above scheme, data information of the TLS communication data quadruple is obtained, where the quadruple comprises a source IP address, a source port, a destination IP address, and a destination port, and these are used as sample data to extract the IP feature vector, i.e., the high-dimensional feature vector. Because this information does not change when the server responds to the client, it is more stable than sample data from other sources; training the self-coding network model with this sample data therefore improves the accuracy and practicability of the model.
Referring to fig. 8, fig. 8 is a schematic flowchart of another exemplary embodiment of the data stream detection method of the present application. Step S1003, inputting the feature data into a pre-created self-coding network model for detection, and after the step of obtaining a detection result, the method includes:
and S106, verifying the effectiveness of the self-coding network model through a data quality evaluation algorithm and a pre-collected test set.
The data stream comprises a plurality of data samples, from which a training set is constructed. The sample data in the training set is input into the self-coding network model, and each sample causes a tiny change in the model, yielding decoded data. The decoded data is compared with the feature data corresponding to the data sample; specifically, the error data between the decoded data and the corresponding feature data is compared with the target error data, thereby verifying the validity of the self-coding network model.
In this embodiment, a data quality evaluation algorithm is used to evaluate and score the 500 groups of data before and after detection in batches. The Adam algorithm is an adaptive gradient descent algorithm whose learning rate gradually decreases over time, i.e., the closer the output value is to the minimum, the more finely the parameters are adjusted and updated. The comprehensive score may be calculated as follows: different weights are set for different parameters according to their positions; for example, if a certain segment of IP parameters appears frequently in normal traffic within the group of data, a larger weight is set. Finally, the score of each group of data is evaluated according to the set weights and the error data, yielding the data quality evaluation result.
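The weighted scoring described above can be sketched as follows. The scoring formula, weights, and error values are illustrative assumptions, not the patent's actual evaluation algorithm; only the idea (position-dependent weights combined with per-parameter error data) follows the text:

```python
# Hypothetical weighted quality score for one group of data: parameters
# frequent in normal traffic carry larger weights, and lower weighted
# error yields a higher score.
def score_group(errors, weights):
    """Score in [0, 1]-ish range: 1 minus the weight-normalized error."""
    assert len(errors) == len(weights)
    total = sum(weights)
    return 1.0 - sum(w * e for w, e in zip(weights, errors)) / total

errors = [0.1, 0.05, 0.3]     # per-parameter error data for one group
weights = [3.0, 1.0, 1.0]     # IP segment seen often in normal flow: weight 3
print(round(score_group(errors, weights), 3))
```

Scoring each of the 500 groups this way and comparing scores before and after detection would yield the batch evaluation result described above.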
With this scheme, the validity of the self-coding network model is verified through a data quality evaluation algorithm and a pre-collected test set, with the training set reused as the test set. The specific values in the error data achieve the effect that abnormal traffic receives a higher abnormality evaluation score and normal traffic a lower one, improving the practicability and accuracy of the verified self-coding network model.
In addition, an embodiment of the present application further provides a data stream detection apparatus, where the data stream detection apparatus includes:
the acquisition module is used for acquiring the flow data to be detected;
the extraction module is used for extracting the characteristics of the flow data to obtain characteristic data;
and the detection module is used for inputting the characteristic data into a pre-established self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder, and the self-coding network model is obtained by the unsupervised algorithm training based on the encoder and the decoder.
Further, the data flow detection apparatus further includes:
and the model training module is used for obtaining the self-coding network model based on the encoder and the decoder through unsupervised algorithm training.
For the principle and implementation process of implementing data stream detection in this embodiment, please refer to the above embodiments, which are not described herein again.
In addition, an embodiment of the present application further provides a terminal device, where the terminal device includes a memory, a processor, and a data stream detection program that is stored on the memory and is executable on the processor, and when the data stream detection program is executed by the processor, the steps of the data stream detection method described above are implemented.
Since the data stream detection program is executed by the processor, all technical solutions of all the foregoing embodiments are adopted, so that at least all the beneficial effects brought by all the technical solutions of all the foregoing embodiments are achieved, and details are not repeated herein.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a data flow detection program is stored, and when executed by a processor, the data flow detection program implements the steps of the data flow detection method described above.
Since the data stream detection program is executed by the processor, all technical solutions of all the foregoing embodiments are adopted, so that at least all the beneficial effects brought by all the technical solutions of all the foregoing embodiments are achieved, and details are not repeated herein.
Compared with the prior art, the data stream detection method, apparatus, terminal device, and storage medium provided by the embodiments of the present application obtain the traffic data to be detected, perform feature extraction on it to obtain feature data, and input the feature data into the pre-created self-coding network model for detection to obtain a detection result, where the self-coding network model comprises an encoder and a decoder and is obtained through unsupervised algorithm training based on them; the traffic to be detected is detected by the trained self-coding network model. Considering that supervised approaches require a large number of labels to be prepared at great cost in time, and combining the characteristic that abnormal conditions occur only under attack, an unsupervised model training method with little dependence on labeled training data is provided. Unsupervised learning uses no labeled training set: only one group of data is provided, and the rules are sought within that data set. By comparing the error data of the reconstructed vector and the high-dimensional feature vector and cyclically updating the weight parameter values in the self-coding network model in combination with the Adam optimization algorithm, each detection produces one update, so the updated self-coding network model performs the data stream detection task better. Unsupervised anomaly detection of encrypted traffic data is finally realized, reducing the difficulty and cost of detecting encrypted traffic data.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system comprising that element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a controlled terminal, or a network device) to execute the method of each embodiment of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (8)

1. A data stream detection method, characterized in that the data stream detection method comprises the steps of:
acquiring flow data to be detected;
performing feature extraction on the flow data to obtain feature data;
acquiring TLS encrypted traffic communication data collected in advance as training data to construct a training set;
performing feature extraction on the training data through a feature dictionary built from the TLS parameter list, and constructing a high-dimensional feature vector based on the extracted feature values, recorded as a first high-dimensional feature vector;
inputting the first high-dimensional feature vector into a pre-established depth self-coding neural network, and performing the following processing in the depth self-coding neural network:
reconstructing the first high-dimensional feature vector according to the weight initialized by the random strategy to obtain a first reconstructed vector;
calculating first error data of the first reconstruction vector and the first high-dimensional feature vector, and updating the weight according to the first error data and an Adam algorithm to obtain a first updated weight;
and returning the first updating weight to the deep self-coding neural network, reconstructing the high-dimensional feature vector by adopting the first updating weight, and returning to the execution step: calculating first error data of the first reconstruction vector and the first high-dimensional feature vector, updating the weights according to the first error data and an Adam algorithm to obtain first updated weights, and performing weight updating in a circulating manner until the calculated first error data is lower than a preset error threshold, terminating training and obtaining a trained self-coding network model;
inputting the characteristic data into a pre-established self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder, and is obtained based on the encoder and the decoder and through unsupervised algorithm training.
2. The data stream detection method according to claim 1, wherein the step of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result comprises:
inputting the feature data into the self-coding network model, and performing the following processing:
constructing a second high-dimensional feature vector based on the feature data;
constructing a second reconstruction vector based on the second high-dimensional feature vector;
calculating second error data between the second high-dimensional feature vector and the second reconstruction vector, and obtaining a second error data set according to the second error data;
and obtaining a target probability value according to the distribution of the second error data in the second error data set, and taking the target probability value as a detection result.
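A minimal sketch of the detection step in claim 2, under the assumption that the "target probability value" is the empirical position of a flow's reconstruction error within the second error data set (i.e., the fraction of previously observed errors it exceeds); the error values below are fabricated stand-ins, not model output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in second error data set: reconstruction errors the trained model
# would produce on known-benign TLS flows (fabricated for illustration).
error_set = rng.normal(loc=0.05, scale=0.01, size=1000)

def target_probability(second_error: float, errors: np.ndarray) -> float:
    """Target probability value from the distribution of the second error
    data in the second error data set: the empirical fraction of observed
    errors that the new flow's reconstruction error meets or exceeds."""
    return float(np.mean(errors <= second_error))

p_typical = target_probability(0.05, error_set)  # well-reconstructed flow
p_outlier = target_probability(0.12, error_set)  # poorly reconstructed flow
print(p_typical, p_outlier)
```

A flow the autoencoder reconstructs poorly sits far out in the error distribution, so its probability value approaches 1, which is what flags it as anomalous.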
3. The data stream detection method according to claim 2, wherein the step of calculating second error data between the second high-dimensional feature vector and the second reconstruction vector, and obtaining a second error data set according to the second error data, comprises:
comparing the second error data with the target error data to obtain a comparison result;
adjusting the first updated weights according to the comparison result and the Adam algorithm to obtain second updated weights;
and returning the second updated weights to the self-coding network model to update the first updated weights, thereby obtaining an updated self-coding network model.
4. The data stream detection method according to claim 1, wherein the step of acquiring traffic data to be detected comprises:
acquiring data information of a TLS communication data four-tuple as the traffic data to be detected, wherein the TLS communication data four-tuple comprises a source IP address, a source port, a destination IP address and a destination port.
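The four-tuple of claim 4 is what identifies a TLS flow, so grouping captured packets by it yields the per-flow units that feature extraction consumes. A sketch follows; the packet records and field names are hypothetical, not from the patent.

```python
# Hypothetical captured-packet records; a real system would obtain these
# from a traffic capture. Field names are illustrative assumptions.
packets = [
    {"src_ip": "10.0.0.2", "src_port": 51334, "dst_ip": "203.0.113.7", "dst_port": 443},
    {"src_ip": "10.0.0.2", "src_port": 51334, "dst_ip": "203.0.113.7", "dst_port": 443},
    {"src_ip": "10.0.0.3", "src_port": 40000, "dst_ip": "198.51.100.9", "dst_port": 443},
]

def four_tuple(pkt):
    # TLS communication data four-tuple from claim 4:
    # (source IP address, source port, destination IP address, destination port)
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])

# Group packets into flows keyed by the four-tuple, so each flow's data
# can be fed to feature extraction as one unit.
flows = {}
for p in packets:
    flows.setdefault(four_tuple(p), []).append(p)

print(len(flows))  # two distinct flows
```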
5. The data stream detection method according to any one of claims 1 to 4, wherein the step of inputting the feature data into a pre-created self-coding network model for detection to obtain a detection result further comprises:
and verifying the effectiveness of the self-coding network model through a data quality evaluation algorithm and a pre-collected test set.
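One plausible reading of the validity check in claim 5 — an assumption of this sketch, since the patent does not spell out the data quality evaluation algorithm — is to threshold the model's reconstruction errors on a labeled, pre-collected test set and score the resulting predictions with precision and recall. The error values and labels below are fabricated.

```python
import numpy as np

# Hypothetical test-set reconstruction errors with ground-truth labels
# (1 = malicious flow); a real evaluation would use the trained model.
errors = np.array([0.04, 0.05, 0.06, 0.05, 0.13, 0.11, 0.04, 0.12])
labels = np.array([0, 0, 0, 0, 1, 1, 0, 1])

threshold = 0.09                      # decision threshold on error
pred = (errors > threshold).astype(int)

# Precision/recall as a simple data-quality-style effectiveness score.
tp = int(np.sum((pred == 1) & (labels == 1)))
fp = int(np.sum((pred == 1) & (labels == 0)))
fn = int(np.sum((pred == 0) & (labels == 1)))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
```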
6. A data stream detection apparatus, characterized in that the data stream detection apparatus comprises:
an acquisition module, configured to acquire traffic data to be detected;
an extraction module, configured to perform feature extraction on the traffic data to obtain feature data;
a training module, configured to acquire pre-collected TLS encrypted traffic communication data as training data to construct a training set; perform feature extraction on the training data through a feature dictionary established from a TLS parameter list, construct a high-dimensional feature vector based on the extracted feature values, and record the high-dimensional feature vector as a first high-dimensional feature vector; input the first high-dimensional feature vector into a pre-created deep self-coding neural network, and perform the following processing in the deep self-coding neural network: reconstruct the first high-dimensional feature vector according to weights initialized by a random strategy to obtain a first reconstruction vector; calculate first error data between the first reconstruction vector and the first high-dimensional feature vector, and update the weights according to the first error data and the Adam algorithm to obtain first updated weights; and return the first updated weights to the deep self-coding neural network, reconstruct the high-dimensional feature vector using the first updated weights, and return to the step of calculating first error data and updating the weights, cyclically updating the weights in this manner until the calculated first error data falls below a preset error threshold, whereupon training is terminated and a trained self-coding network model is obtained; and
a detection module, configured to input the feature data into a pre-created self-coding network model for detection to obtain a detection result, wherein the self-coding network model comprises an encoder and a decoder and is obtained by unsupervised-algorithm training based on the encoder and the decoder.
7. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a data stream detection program stored in the memory and executable on the processor, wherein the data stream detection program, when executed by the processor, implements the steps of the data stream detection method according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has a data stream detection program stored thereon, wherein the data stream detection program, when executed by a processor, implements the steps of the data stream detection method according to any one of claims 1 to 5.
CN202210839784.1A 2022-07-15 2022-07-15 Data stream detection method and device, terminal equipment and storage medium Active CN115250199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210839784.1A CN115250199B (en) 2022-07-15 2022-07-15 Data stream detection method and device, terminal equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115250199A CN115250199A (en) 2022-10-28
CN115250199B true CN115250199B (en) 2023-04-07

Family

ID=83700085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210839784.1A Active CN115250199B (en) 2022-07-15 2022-07-15 Data stream detection method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115250199B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109495920B (en) * 2017-09-13 2022-03-29 中国移动通信集团设计院有限公司 Wireless communication network feature imaging method, equipment and computer program product
CN110910982A (en) * 2019-11-04 2020-03-24 广州金域医学检验中心有限公司 Self-coding model training method, device, equipment and storage medium
CN115606162A (en) * 2020-06-24 2023-01-13 深圳市欢太科技有限公司(Cn) Abnormal flow detection method and system, and computer storage medium
CN113395276B (en) * 2021-06-10 2022-07-26 广东为辰信息科技有限公司 Network intrusion detection method based on self-encoder energy detection


Similar Documents

Publication Publication Date Title
Kanimozhi et al. Artificial intelligence based network intrusion detection with hyper-parameter optimization tuning on the realistic cyber dataset CSE-CIC-IDS2018 using cloud computing
Rezvy et al. An efficient deep learning model for intrusion classification and prediction in 5G and IoT networks
US10374789B2 (en) Encrypting and decrypting information
US11399037B2 (en) Anomaly behavior detection in interactive networks
EP3614645B1 (en) Embedded dga representations for botnet analysis
Wressnegger et al. Zoe: Content-based anomaly detection for industrial control systems
US9491186B2 (en) Method and apparatus for providing hierarchical pattern recognition of communication network data
CN112688928A (en) Network attack flow data enhancement method and system combining self-encoder and WGAN
Cuzzocrea et al. A novel structural-entropy-based classification technique for supporting android ransomware detection and analysis
Ketepalli et al. Data Preparation and Pre-processing of Intrusion Detection Datasets using Machine Learning
Aminanto et al. Multi-class intrusion detection using two-channel color mapping in IEEE 802.11 wireless Network
Sayed et al. Augmenting IoT intrusion detection system performance using deep neural network
Li et al. FlowGANAnomaly: Flow-Based Anomaly Network Intrusion Detection with Adversarial Learning
CN115250199B (en) Data stream detection method and device, terminal equipment and storage medium
CN114205816B (en) Electric power mobile internet of things information security architecture and application method thereof
Iftikhar et al. A supervised feature selection method for malicious intrusions detection in IoT based on genetic algorithm
CN114422207A (en) Multi-mode-based C & C communication flow detection method and device
Shorfuzzaman Detection of cyber attacks in IoT using tree-based ensemble and feedforward neural network
CN111159588B (en) Malicious URL detection method based on URL imaging technology
CN114499980A (en) Phishing mail detection method, device, equipment and storage medium
CN116018590A (en) Dynamic privacy protection application authentication
CN111639718A (en) Classifier application method and device
Herlina et al. Machine learning model to improve classification performance in the process of detecting phishing URLs in QR codes
Tossou et al. Mobile Threat Detection System: A Deep Learning Approach
Meraoumia et al. Biometric cryptosystem to secure smart object communications in the internet of things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant