CN116405261A

CN116405261A - Malicious flow detection method, system and storage medium based on deep learning

Info

Publication number: CN116405261A
Application number: CN202310252370.3A
Authority: CN
Inventors: 路松峰; 熊婧; 周显敬; 刘虎; 吴俊军; 朱建新
Original assignee: Wuhan Zhuoer Information Technology Co ltd; Huazhong University of Science and Technology
Current assignee: Wuhan Zhuoer Information Technology Co ltd; Huazhong University of Science and Technology
Priority date: 2023-03-15
Filing date: 2023-03-15
Publication date: 2023-07-07

Abstract

The invention relates to the technical field of network security, and discloses a malicious traffic detection method, system and storage medium based on deep learning.

Description

Malicious flow detection method, system and storage medium based on deep learning

Technical Field

The invention relates to the technical field of network security, in particular to a malicious traffic detection method, a malicious traffic detection system and a malicious traffic detection storage medium based on deep learning.

Background

At present, industrial digitization, networking and intelligentization are accelerating, the network environment is increasingly complex, and network attacks occur frequently, so industrial control network security work has become more important and urgent under the new circumstances.

The current detection methods for malicious traffic in a network include misuse detection methods and anomaly detection methods. However, the misuse detection method relies on database comparison, and the current network flow needs to be matched and judged with the information in the database one by one; the anomaly detection method mostly adopts the traditional machine learning algorithm, and a large amount of time is consumed in training, so that when a large amount of data exists in a network, the detection efficiency of the two methods on malicious traffic is low.

The foregoing is provided merely for the purpose of facilitating understanding of the technical scheme of the present invention and is not intended to represent an admission that the foregoing is related art.

Disclosure of Invention

The invention mainly aims to provide a malicious traffic detection method, a malicious traffic detection system and a malicious traffic detection storage medium based on deep learning, and aims to solve the technical problems of low detection efficiency and low accuracy of malicious traffic in a network.

In order to achieve the above purpose, the present invention provides a malicious traffic detection method based on deep learning, the method comprising the following steps:

collecting multi-level network data and preprocessing the network data;

extracting time sequence characteristics of the network data, slicing the preprocessed network data according to the time sequence characteristics, and obtaining corresponding data subsequences;

respectively extracting the characteristics of each data subsequence to obtain characteristic vectors of each data subsequence;

splicing the characteristic vectors of the data subsequences to obtain the characteristic vector of the network data;

training a malicious traffic identification model based on LSTM by taking the feature vector of the network data as input data to obtain a preset malicious traffic identification model;

detecting the network traffic to be detected through the preset malicious traffic identification model, and judging whether the network traffic to be detected belongs to malicious traffic or not.

Optionally, the step of collecting multi-level network data and preprocessing the collected network data includes:

acquiring multi-level network data through a NetFlow collector, and performing data format conversion on the acquired multi-level network data to obtain a network data set with a uniform format;

preprocessing the network data set with the uniform format.

Optionally, the multi-level network data includes: data source and destination addresses, data port number, protocol type, packet size, timestamp, and QoS information.

Optionally, training the malicious traffic recognition model based on LSTM by using the feature vector of the network data as input data to obtain a preset malicious traffic recognition model, including:

taking the feature vector of the training set as input data, and constructing a malicious traffic recognition original model based on LSTM;

taking the feature vector of the test set as input data, and testing the malicious traffic identification original model constructed based on the LSTM;

and updating the malicious traffic identification original model according to the test result to obtain a preset malicious traffic identification model.

Optionally, the preprocessing the network data in the unified format includes:

when the network data set with the uniform format is obtained, invalid data and repeated data in the network data set are removed, and an updated network data set is obtained;

labeling the flow types of the updated network data sets, sampling and checking the network data sets labeled by the flow types based on preset standards, and performing label correction to obtain network data sets under standard labels;

the network data set under the standard label is divided into a training set and a testing set.

Optionally, before the step of extracting the features of each data subsequence to obtain the feature vector of each data subsequence, the method includes:

determining a network feature extraction mechanism according to the time sequence characteristics, the protocol types, the flow content and the statistical characteristics of the network data;

correspondingly, the feature extraction is performed on each data subsequence to obtain feature vectors of each data subsequence, including:

and respectively carrying out feature extraction on each data subsequence based on the network feature extraction mechanism to obtain feature vectors of each data subsequence.

Optionally, after the step of training the malicious traffic identification model based on the LSTM by using the feature vector of the network data as input data to obtain the preset malicious traffic identification model, the method includes:

optimizing the preset malicious traffic identification model through a hyperparameter search and a back-propagation algorithm;

correspondingly, the detecting the network traffic to be detected through the preset malicious traffic identification model, and judging whether the network traffic belongs to malicious traffic, includes:

and detecting the network traffic to be detected through the optimized malicious traffic identification model, and judging whether the network traffic belongs to malicious traffic.

Optionally, the optimizing the preset malicious traffic identification model through the hyperparameter search and back-propagation algorithm includes:

selecting AUC and/or ROC as evaluation indexes, and scoring the preset malicious traffic identification model;

when the score of the preset malicious traffic identification model is lower than a preset value, adopting grid search combined with stochastic gradient descent to optimize the malicious traffic identification model whose score is lower than the preset value.

In addition, in order to achieve the above object, the present invention further provides a malicious traffic detection system based on deep learning, the system comprising: a data acquisition module, a data processing module, a feature engineering module, a model construction module and a traffic detection module;

the data acquisition module is used for acquiring multi-level network data and preprocessing the network data;

the data processing module is used for extracting the time sequence characteristics of the network data, slicing the preprocessed network data according to the time sequence characteristics, and obtaining corresponding data subsequences;

the feature engineering module is used for performing feature extraction on each data subsequence to obtain a feature vector of each data subsequence, and for splicing the feature vectors of the data subsequences to obtain the feature vector of the network data;

the model construction module is used for training a malicious traffic identification model based on LSTM by taking the feature vector of the network data as input data to obtain a preset malicious traffic identification model;

the traffic detection module is used for detecting the network traffic to be detected through the preset malicious traffic identification model and judging whether the network traffic to be detected belongs to malicious traffic or not.

In addition, in order to achieve the above object, the present invention further provides a storage medium having stored thereon a malicious traffic detection program based on deep learning, which when executed by a processor, implements the steps of the malicious traffic detection method based on deep learning as described above.

In the invention, multi-level network data are first collected and preprocessed. The time sequence characteristics of the network data are then extracted, and the preprocessed network data are sliced according to these characteristics to obtain corresponding data subsequences. Feature extraction is performed on each data subsequence to obtain its feature vector, and the feature vectors of the data subsequences are spliced to obtain the feature vector of the network data. The feature vector of the network data is used as input data to train a malicious traffic identification model based on LSTM (long short-term memory), yielding a preset malicious traffic identification model. Finally, the network traffic to be detected is detected through the preset malicious traffic identification model to judge whether it belongs to malicious traffic. Because the invention collects multi-level network data, different types of data in the network are taken into account and the detection range is expanded. Because the time sequence characteristics of the network data are considered, and the feature vectors of the sliced subsequences are extracted and then spliced into the feature vector of the network data, the generated feature vector is more complete and accurate. Because the identification model is constructed based on LSTM, the vanishing-gradient and forgetting problems of traditional model training can be alleviated. As a result, detecting the network traffic to be detected through the trained malicious traffic identification model makes it possible to judge quickly and accurately whether that traffic is malicious.

Drawings

FIG. 1 is a schematic flow chart of a first embodiment of a malicious traffic detection method based on deep learning according to the present invention;

FIG. 2 is a flow chart of a second embodiment of a malicious traffic detection method based on deep learning according to the present invention;

FIG. 3 is a schematic flow chart of a third embodiment of a malicious traffic detection method based on deep learning according to the present invention;

FIG. 4 is a block diagram of a first embodiment of a malicious traffic detection system based on deep learning according to the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The embodiment of the invention provides a malicious traffic detection method based on deep learning, and referring to fig. 1, fig. 1 is a flow diagram of a first embodiment of the malicious traffic detection method based on deep learning.

In this embodiment, the malicious traffic detection method based on deep learning includes the following steps:

Step S10: collecting multi-level network data and preprocessing the network data.

It should be noted that, the execution body of the method of the embodiment may be a computer service device with functions of network communication, data acquisition and processing, model training, and program running, such as a tablet computer, a personal computer, a mainframe, etc., or may be other electronic devices capable of implementing the same or similar functions and connecting to a network and performing data analysis on the acquired network traffic, which is not limited in this embodiment. Embodiments of the deep learning-based malicious traffic detection method of the present invention will be described herein by taking a personal computer as an example.

It can be appreciated that the multi-level network data may be various heterogeneous data from different sources with different data types or different distribution modes, where the multi-level network data may include multi-dimensional network traffic, and the network traffic has high randomness and complexity, and the collected multi-level network data may represent the type of network traffic data obtained by a general computer service device when performing network communication.

It should be noted that data may be collected from all currently accessed websites by means of a web crawler or a website's public API, or the multi-level network data may be collected with a tool or data format that directly captures network traffic interaction information, such as PCAP, binetflow, sFlow or NetFlow.

Further, in order to further fuse the collected multi-level network data to better extract the common feature of the data, the step S10 includes:

Step S11: acquiring multi-level network data through a NetFlow collector, and performing data format conversion on the acquired multi-level network data to obtain a network data set with a uniform format.

It can be understood that NetFlow is a technology that provides data about network activity by sampling traffic between the ports of a switch or the interfaces of a router. It can sample each data packet at a specific network location, capture each data flow, and record complete traffic volume information, so a NetFlow collector can collect more accurate and comprehensive multi-level network data.

Further, in order to make the collected multi-level network data more comprehensive, the multi-level network data includes: data source and destination addresses, data port number, protocol type, packet size, timestamp, and QoS information.

It should be noted that the data source address and destination address are the source IP address and destination IP address in the network data captured by the NetFlow collector; these addresses determine the source and destination of the network communication, from which potential security threats can be identified. The data port number indicates the network services running on the device currently performing network communication, which helps discover potential attacks or abnormal behavior. The protocol type is the protocol used for the network communication, such as TCP or UDP; it determines the type of the current network communication and assists in identifying potential security threats. The packet size indicates the amount of data transmitted, which helps detect data leakage or other abnormal behavior that is reflected in the data volume. The timestamp is the time point at which the network communication takes place; it helps determine the duration of the communication and, at the time level, whether the current network contains a potential security threat that affects normal communication time. The QoS information is network quality-of-service information, such as latency and bandwidth utilization; it helps identify network performance problems and thereby determine whether the current network data contains a security threat that affects network quality of service.

It should be noted that, due to the multi-source heterogeneous characteristics of the multi-level network data, for different network data, in order to further analyze the commonality and the difference characteristics of the different network data, format conversion needs to be performed on the collected data to obtain a network data set with a uniform format.

It can be understood that the unified format may be selected based on the proportion of each data format in the collected network data: the data format with the highest proportion may be selected as the unified format, or the data format with the highest flexibility, that is, the format most easily obtained by conversion, may be selected as the unified format.

It should be understood that the data format conversion may be performed with a common big data computing framework such as Hadoop, Spark or Flink, on the basis of which the collected network data can be classified and its format standardized. This embodiment does not limit the computing framework on which the data format conversion is based; the Hadoop framework is used to describe this embodiment and the embodiments below.
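The format unification described above can be illustrated with a small plain-Python sketch. It is not the Hadoop-based conversion of this embodiment; the `FlowRecord` class, the `to_unified` function and every key name in `FIELD_ALIASES` are hypothetical names chosen only to show how records from differently formatted sources might be mapped onto one schema containing the fields listed above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class FlowRecord:
    """Unified record holding the fields listed in this embodiment."""
    src_addr: str
    dst_addr: str
    port: int
    protocol: str
    packet_size: int
    timestamp: float
    qos: int


# Hypothetical source-specific key names accepted for each unified field.
FIELD_ALIASES: Dict[str, Tuple[str, ...]] = {
    "src_addr": ("src_addr", "srcaddr"),
    "dst_addr": ("dst_addr", "dstaddr"),
    "port": ("port", "dstport"),
    "protocol": ("protocol", "prot"),
    "packet_size": ("packet_size", "bytes"),
    "timestamp": ("timestamp", "ts"),
    "qos": ("qos", "tos"),
}


def to_unified(raw: Dict[str, Any]) -> FlowRecord:
    """Map one raw record onto the unified schema; raise KeyError if a field is absent."""
    values: Dict[str, Any] = {}
    for field, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in raw:
                values[field] = raw[alias]
                break
        else:
            raise KeyError(f"missing field: {field}")
    return FlowRecord(
        src_addr=str(values["src_addr"]),
        dst_addr=str(values["dst_addr"]),
        port=int(values["port"]),
        protocol=str(values["protocol"]).upper(),
        packet_size=int(values["packet_size"]),
        timestamp=float(values["timestamp"]),
        qos=int(values["qos"]),
    )
```

Records that use either naming convention produce identical `FlowRecord` values, which is the property the unified data set relies on.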

Step S12: preprocessing the network data set with the uniform format.

It can be understood that when a network data set in a unified format is acquired, because a mapping from features to labels needs to be constructed on the network data in the data set, the network data needs to be labeled, and labels corresponding to different network data respectively are determined.

In a specific implementation, multi-level network data are acquired through NetFlow, so the randomness and complexity of the network data are taken into account and more comprehensive and accurate network data are obtained. A Hadoop-based computing framework is selected, which improves scalability while providing high efficiency and high fault tolerance, so the different types of collected network data can be accurately and promptly normalized into a network data set with a unified format. A mapping from features to labels is then constructed for the unified-format network data in the data set, so as to determine the labels corresponding to the multi-source heterogeneous network data in the data set.

Step S20: extracting the time sequence characteristics of the network data, and slicing the preprocessed network data according to the time sequence characteristics to obtain corresponding data subsequences.

It can be understood that, because the collected multi-level network data can be continuously acquired network traffic data, the network data has a time sequence characteristic, the time sequence characteristic can be specifically expressed as a time interval when the network data is transmitted, and the preprocessed network data can be segmented by adopting a sliding window method to obtain a data subsequence with a preset length.

It should be understood that, since the network data sets are derived from multi-level network data and the time sequence characteristics of network data with different sources and different structures are different, the setting of the width and the sliding step length of the sliding window when the sliding window method is adopted can be performed based on the network data with different sources, or the same width and step length of the segmentation can be performed on all the network data in the data sets by setting an average value based on experience, which is not limited in this embodiment.

It can be understood that the data subsequences are corresponding data obtained by splitting different types of network data in the network data set through a sliding window, each subsequence contains data points with a certain length, and all network data in the original network data set can be obtained after the data subsequences are spliced.
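The sliding window slicing described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `sliding_windows` and its parameters `width` and `step` are hypothetical, and the choice to emit a shorter final window (so that no record is lost when the step does not exceed the width) is one possible design, not something the embodiment specifies.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(records: Sequence[T], width: int, step: int) -> List[List[T]]:
    """Slice time-ordered records into subsequences of at most `width` items.

    Consecutive windows start `step` items apart. The last window may be
    shorter than `width` so that the tail of the data is still covered.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    windows: List[List[T]] = []
    start = 0
    while start < len(records):
        windows.append(list(records[start:start + width]))
        if start + width >= len(records):
            break  # this window already reaches the end of the data
        start += step
    return windows
```

With `step` equal to `width` the windows do not overlap, so concatenating them reproduces the original sequence; with a smaller `step` adjacent windows share records.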

Step S30: performing feature extraction on each data subsequence to obtain the feature vector of each data subsequence.

Further, in order to better extract the feature vector of the multi-source heterogeneous data from the multi-level network, before the step S30, the method further includes:

Step S30': determining a network feature extraction mechanism according to the time sequence characteristics, the protocol types, the traffic content and the statistical characteristics of the network data.

It should be noted that the network data may be time series data and therefore have time sequence characteristics, which may be embodied as the time window length and the fluctuation characteristics of the time series, for example: the number of data packets in the time window, and the average value, standard deviation and maximum value of the data packet sizes, which are not limited in this embodiment.

The protocol type is a protocol adopted in network communication, and network data of different protocol types can select different characteristics, for example: the HTTP traffic can select various information in the URL length, HTTP status code and HTTP request header; FTP may select file size, file name, etc.

Traffic content is specific content contained in network data, for example: for email traffic, the traffic content may be embodied in the email topic, sender, email body, attachment size.

The statistical characteristics may be expressed as various statistics of the traffic data; these statistics describe the distribution of the traffic data and help further distinguish normal traffic from malicious traffic.

Correspondingly, the step S30 specifically includes:

Step S30': performing feature extraction on each data subsequence based on the network feature extraction mechanism to obtain the feature vector of each data subsequence.

In a specific implementation, when selecting the features of the network data, the actual scenario in which the network data are currently collected may be analyzed together with the specific types of the collected multi-level network data. By selectively combining the time sequence characteristics of the network data, the communication protocol used, the specific content contained and the statistics of the traffic data, features are extracted from each data subsequence obtained after slicing. This takes both the interpretability and the effectiveness of the features into account, yields the feature vector of each data subsequence, and provides more sufficient information for subsequent model training.
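A possible per-subsequence feature extractor is sketched below. The embodiment leaves the exact features open, so the five features chosen here (packet count, mean, population standard deviation and maximum of packet size, and mean inter-arrival time) are assumptions, as are the function name `extract_features` and the `(timestamp, packet_size)` tuple layout.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def extract_features(window: Sequence[Tuple[float, float]]) -> List[float]:
    """Turn one subsequence of (timestamp, packet_size) pairs into a feature vector.

    Returned order: [packet count, mean size, size standard deviation,
    maximum size, mean inter-arrival time].
    """
    if not window:
        return [0.0, 0.0, 0.0, 0.0, 0.0]
    sizes = [float(size) for _, size in window]
    times = sorted(float(ts) for ts, _ in window)
    # Gaps between consecutive packets capture the time sequence characteristic.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return [
        float(len(window)),
        fmean(sizes),
        pstdev(sizes),
        max(sizes),
        fmean(gaps) if gaps else 0.0,
    ]
```

Protocol-specific or content-based features (for example URL length for HTTP traffic) could be appended to the same vector in the same way.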

Step S40: splicing the feature vectors of the data subsequences to obtain the feature vector of the network data.

In a specific implementation, the sub-sequence feature vectors are spliced to obtain network features of network data from different sources and different types, so that the features reflected by the feature vectors of the network data serving as the input subsequently are more complete.
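The splicing step can be read as a simple concatenation of the per-subsequence vectors in time order. The sketch below assumes that reading; the function name `splice_feature_vectors` and the check that all vectors share one length are illustrative choices, not requirements stated in this embodiment.

```python
from itertools import chain
from typing import List, Sequence


def splice_feature_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-subsequence feature vectors, preserving their order."""
    lengths = {len(v) for v in vectors}
    if len(lengths) > 1:
        raise ValueError("all subsequence feature vectors must have the same length")
    return list(chain.from_iterable(vectors))
```

Because every subsequence contributes the same number of features, the spliced vector can later be reshaped back into one row per time window if the model expects a sequence input.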

Step S50: training the LSTM-based malicious traffic identification model by taking the feature vector of the network data as input data to obtain a preset malicious traffic identification model.

Note that LSTM (Long Short-Term Memory) is a special recurrent neural network that can analyze time-series inputs. An LSTM comprises a forget gate, an input gate and an output gate, and constructing the traffic identification model based on LSTM can alleviate the vanishing-gradient and forgetting problems of existing traffic detection models.
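To make the role of the three gates concrete, the following sketch implements one time step of a standard textbook LSTM cell in plain Python. It is not the model of this embodiment: the parameter dictionary keys (`Wf`, `Wi`, `Wo`, `Wg`, `bf`, `bi`, `bo`, `bg`) and the function names are assumptions, and a practical implementation would normally use a deep learning framework instead.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(weights: List[Vector], x: Vector, bias: Vector) -> Vector:
    """Compute weights @ x + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """Run one LSTM time step and return the new hidden and cell states."""
    z = x + h_prev  # concatenate the input with the previous hidden state
    f = [sigmoid(v) for v in affine(params["Wf"], z, params["bf"])]    # forget gate
    i = [sigmoid(v) for v in affine(params["Wi"], z, params["bi"])]    # input gate
    o = [sigmoid(v) for v in affine(params["Wo"], z, params["bo"])]    # output gate
    g = [math.tanh(v) for v in affine(params["Wg"], z, params["bg"])]  # candidate state
    # The additive cell update is what lets gradients survive across many steps.
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

The forget gate scales how much of the previous cell state is kept, the input gate scales how much of the new candidate is written, and the output gate controls what part of the cell state is exposed as the hidden state.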

In a specific implementation, the feature vectors of the network data obtained by splicing the subsequence features are used as training data input to the model, and the malicious traffic identification model constructed based on LSTM is trained. For the events contained in the network data, the model predicts a probability distribution over event classes ranging from security events to malicious events; the probabilities are output and ranked, and the event with the highest probability is selected as the prediction result.
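The final selection step, turning model outputs into a ranked probability distribution and picking the most probable event, might look like the sketch below. The use of softmax is an assumption (the embodiment does not name the output function), and `softmax`, `rank_events` and the example labels are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_events(scores: Sequence[float],
                labels: Sequence[str]) -> Tuple[str, List[Tuple[str, float]]]:
    """Return the most probable label and all (label, probability) pairs, ranked."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    ranked = sorted(zip(labels, softmax(scores)), key=lambda pair: pair[1], reverse=True)
    return ranked[0][0], ranked
```

For example, `rank_events([0.0, 2.0], ["security event", "malicious event"])` selects "malicious event" because it receives the larger probability.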

Further, in order to maintain prediction accuracy, the correctness of the predictions can be tracked through prediction performance monitoring reports, and the model is automatically retrained when the prediction accuracy falls below a preset threshold.
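The retraining trigger can be sketched as a simple accuracy check. The function names `accuracy` and `needs_retraining` are hypothetical, and how the ground-truth outcomes are obtained for monitoring is left open, as it is in the embodiment.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that match the ground truth."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def needs_retraining(predicted: Sequence[int], actual: Sequence[int],
                     threshold: float) -> bool:
    """Report whether accuracy has fallen below the preset threshold."""
    return accuracy(predicted, actual) < threshold
```

A monitoring job could call `needs_retraining` on each reporting period and start a new training run whenever it returns `True`.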

Step S60: detecting the network traffic to be detected through the preset malicious traffic identification model, and judging whether the network traffic to be detected belongs to malicious traffic or not.

In a specific implementation, the network traffic to be detected is input into the trained preset malicious traffic identification model, and whether the event contained in the current network data to be detected is a security event or a malicious event can be identified through the model, namely whether the network traffic to be detected belongs to malicious traffic carrying the malicious event is judged.

In this embodiment, a NetFlow collector is selected to collect multi-level network data comprising data source and destination addresses, data port number, protocol type, packet size, timestamp and QoS information, so that more accurate and comprehensive multi-level network data can be obtained. The collected multi-level network data are then converted into a network data set with a unified format, and the unified-format network data set is preprocessed. The preprocessed network data in the data set are sliced by a sliding window method to obtain corresponding data subsequences, features of the subsequences are extracted based on a feature extraction mechanism established from factors such as the time sequence characteristics, protocol type and traffic content of the network data, and the resulting feature vectors are spliced into the feature vector of the network data, so that the extracted features are more complete. The identification model is constructed based on LSTM, which alleviates the vanishing-gradient and forgetting problems of traditional model training. Finally, the network traffic to be detected is detected through the trained malicious traffic identification model.

Referring to fig. 2, fig. 2 is a flow chart of a second embodiment of the malicious traffic detection method based on deep learning according to the present invention.

Based on the above-mentioned first embodiment, in order to further process the collected multi-level network data, in this embodiment, step S12 includes:

Step S121: when the network data set with the uniform format is acquired, removing invalid data and repeated data in the network data set to obtain an updated network data set.

It should be noted that the acquired network data include invalid data and duplicate data, because blank information inevitably arises when data are acquired from multiple sources during network communication, and because data overlap arises when data are acquired based on the time sequence characteristics of the network data with data consistency in mind.

In a specific implementation, when a network data set with a uniform format is acquired, invalid data and repeated data in the data set need to be removed. For example, VBA, PyCharm or an open-source data processing tool available online may be used to clean the data set, so as to obtain an updated network data set.
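A minimal cleaning pass is sketched below in plain Python. Treating a record as invalid when a required field is missing or blank, and treating two records as duplicates when all required fields match, are assumptions made for illustration; the name `clean_records` and the `required_fields` parameter are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def clean_records(records: Sequence[Dict[str, Any]],
                  required_fields: Sequence[str]) -> List[Dict[str, Any]]:
    """Drop records with blank required fields, then drop exact repeats.

    The first occurrence of each distinct record is kept, in original order.
    Field values are assumed to be hashable scalars.
    """
    seen = set()
    cleaned: List[Dict[str, Any]] = []
    for record in records:
        if any(record.get(name) in (None, "") for name in required_fields):
            continue  # invalid: a required value is missing or blank
        key = tuple(record[name] for name in required_fields)
        if key in seen:
            continue  # duplicate of a record that was already kept
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

The result corresponds to the updated network data set that is passed on to the labeling step.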

Step S122: labeling the traffic types of the updated network data set, sampling and checking the labeled network data set based on a preset standard, and performing label correction to obtain the network data set under the standard label.

It should be noted that, the updated network data set is labeled with the traffic type to obtain the label types of different types of network data in the network data set. The tag type may reflect whether the network data in the network data set includes a security event or a malicious event, and different tag types may be set based on a previous experience value, or may be set based on a personalization setting, which is not limited in this embodiment.

It can be understood that, after the primary labeling, the network data set subjected to the primary labeling can be sampled and detected based on a preset standard, and label correction is performed on the network data with labeling errors, so as to further ensure labeling quality and obtain the network data set under the standard label.

Step S123: the network data set under the standard label is divided into a training set and a testing set.

It should be noted that the training set is a data set that can be used for training, generating a model or algorithm, and the test set is a data set that can be used for further testing and optimizing the training of the model that is initially trained by the training set.

It will be appreciated that the data set may be divided into a training set and a test set according to a predetermined ratio, for example 8:2, 7:3, 6:4 or 5:5; the data set may also be divided directly according to a randomly chosen ratio based on personalized settings, provided that the proportion of the training set is greater than or equal to that of the test set.
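The division into training and test sets can be sketched as follows. The function name `split_dataset`, the shuffling before the split and the fixed `seed` are assumptions; the only constraint taken from the text is that the training share is at least as large as the test share.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(records: Sequence[T], train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the records and split them into (training set, test set)."""
    if not 0.5 <= train_ratio < 1.0:
        raise ValueError("train_ratio must be in [0.5, 1.0)")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_dataset(data, 0.8)` corresponds to the 8:2 ratio mentioned above, and `split_dataset(data, 0.5)` to the 5:5 ratio.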

Further, to better construct a feature-to-tag mapping to enhance training effect of the model, step S50 includes:

Step S51: taking the feature vector of the training set as input data, and constructing a malicious traffic identification original model based on LSTM.

It can be understood that the feature vector of the training set is used as input data, and the malicious traffic recognition original model constructed based on the LSTM is trained by calculating the gradient and updating the weight.

Further, since the model configuration needs to be verified during model construction, and it needs to be checked whether the model is overfitted or underfitted, the training set can be further divided into a training set for training and a validation set for verification.

In the specific implementation, feature vectors of a training set are used as input, an LSTM-based identification model is obtained through training of the training set, validity of the model is verified through a verification set, and a malicious traffic identification original model constructed based on the LSTM is obtained through preliminary optimization.

Step S52: taking the feature vector of the test set as input data, and testing the malicious traffic identification original model constructed based on LSTM.

In a specific implementation, the feature vector of the test set is used as input data, and the identification accuracy of the malicious traffic identification original model on different network data is tested.

Step S53: updating the malicious traffic identification original model according to the test result to obtain a preset malicious traffic identification model.

In a specific implementation, the testing effect of the malicious flow identification original model applied to the testing set is obtained, the testing accuracy of the original model is evaluated, and the malicious flow identification model is optimized by adopting methods of retraining, parameter adjustment and the like based on the testing accuracy, so that a preset malicious flow identification model is obtained.

In this embodiment, once the network data set in the uniform format is obtained, invalid data and repeated data are removed from it to obtain an updated network data set. The updated data set is labeled with traffic types, the labeled data set is spot-checked against a preset standard, and labels are corrected to obtain a network data set under standard labels, which is then divided into a training set and a test set. The feature vectors of the training set are used as input data to construct an original LSTM-based malicious traffic identification model, the feature vectors of the test set are used as input data to test that model, and the original model is updated according to the test result to obtain the preset malicious traffic identification model. Finally, the network traffic to be detected is checked by the preset model to determine whether it is malicious traffic.
Because the collected network data may contain a large amount of repeated data, as well as irrelevant or blank data that has no bearing on judging the security of network traffic, this embodiment cleans and updates the network data set in the uniform format, which reduces the processing of redundant data and saves energy. The updated data set is labeled and then checked a second time, which ensures labeling quality and yields a more complete feature set. Dividing the network data set into a training set and a test set, each used as input data, makes it possible to obtain a better preset malicious traffic identification model from fewer samples, further improving the speed and accuracy of judging whether the network traffic to be detected is malicious.
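The cleaning and splitting described above might look like the following sketch. The field names, the rule that a record is invalid when a required field is blank, and the 25% test ratio are all assumptions made for illustration. The labeling and label-correction steps are omitted.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]

# Hypothetical field names for a record in the uniform format.
REQUIRED = ("src", "dst", "port", "protocol", "size", "timestamp")


def clean(records: List[Record]) -> List[Record]:
    """Drop records with blank required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(rec.get(k) in (None, "") for k in REQUIRED):
            continue  # invalid or blank record
        key = tuple(rec[k] for k in REQUIRED)
        if key in seen:
            continue  # repeated record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def split(records: List[Record], test_ratio: float = 0.25,
          seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle reproducibly and return (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def rec(src, port, protocol, size, ts):
    """Build one hypothetical record with a fixed destination address."""
    return {"src": src, "dst": "10.0.0.2", "port": port,
            "protocol": protocol, "size": size, "timestamp": ts}


raw = [
    rec("10.0.0.1", 80, "TCP", 512, 1.0),
    rec("10.0.0.1", 80, "TCP", 512, 1.0),   # duplicate
    rec("10.0.0.3", 53, "UDP", 64, 1.5),
    rec("", 80, "TCP", 512, 1.8),           # blank source address
    rec("10.0.0.4", 22, "TCP", 128, 2.0),
    rec("10.0.0.5", 443, "TCP", None, 2.5), # missing packet size
    rec("10.0.0.6", 8080, "TCP", 256, 3.0),
]

cleaned = clean(raw)
train_set, test_set = split(cleaned)
print(len(cleaned), len(train_set), len(test_set))
```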

Referring to fig. 3, fig. 3 is a flowchart of a third embodiment of a malicious traffic detection method based on deep learning according to the present invention.

Based on the above embodiment, in order to further optimize the obtained malicious traffic identification model, after step S50 the method further includes:

step S51: optimizing the preset malicious traffic identification model through hyperparameter search and a back propagation algorithm.

It should be noted that the tuning strategy used in the hyperparameter search may be an algorithm such as grid search, random search, or Bayesian optimization, while the model parameters themselves are learned through the back propagation algorithm. This embodiment places no limitation on the choice.
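Two of the search strategies named above, grid search and random search, can be sketched as follows. The search space, the hyperparameter names, and `toy_score` are hypothetical. In practice the score would come from training the LSTM model by back propagation with each candidate and evaluating it on held-out data.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence

Params = Dict[str, float]


def grid_candidates(space: Dict[str, Sequence[float]]) -> List[Params]:
    """Every combination of the listed values (exhaustive grid)."""
    names = sorted(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(space[n] for n in names))]


def random_candidates(space: Dict[str, Sequence[float]], n: int,
                      seed: int = 0) -> List[Params]:
    """n candidates, each value drawn independently at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(list(values)) for name, values in space.items()}
            for _ in range(n)]


def search(candidates: List[Params],
           score: Callable[[Params], float]) -> Params:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


def toy_score(p: Params) -> float:
    """Hypothetical validation score, peaking at lr=0.01 and 64 hidden units."""
    return -abs(p["learning_rate"] - 0.01) - abs(p["hidden_units"] - 64) / 1000


space = {"learning_rate": [0.1, 0.01, 0.001], "hidden_units": [32, 64, 128]}

best_grid = search(grid_candidates(space), toy_score)
best_random = search(random_candidates(space, n=4), toy_score)
print(best_grid, best_random)
```

Grid search evaluates all nine combinations here, while random search evaluates only the four it samples, which trades coverage for cost.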

Further, in order to further evaluate the model effect, step S51 includes:

step S511: selecting AUC and/or ROC as evaluation indexes, and scoring the preset malicious traffic identification model.

AUC (Area Under the Curve) is the area under the ROC curve. Its value lies between 0 and 1, where 0.5 corresponds to random guessing, so as a single number it gives an intuitive measure of classifier quality. ROC stands for receiver operating characteristic. Each point on the ROC curve reflects the response to the same signal stimulus under a particular decision threshold, plotting the true positive rate against the false positive rate. The ROC curve and AUC are performance metrics for classification problems across threshold settings: ROC is a probability curve, and AUC represents the degree of separability. For example, the higher the AUC, the better the model is at predicting 0 as 0 and 1 as 1.
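As an illustration, the ROC curve and AUC can be computed from labels and model scores as sketched below. The labels and scores are made-up values, and the trapezoidal rule is one common way to integrate the curve, not something the source specifies.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) at each score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = malicious) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

model_auc = auc(roc_points(labels, scores))
print(round(model_auc, 3))  # 0.889
```

Eight of the nine (malicious, benign) pairs are ranked in the right order in this example, which is why the AUC is 8/9.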

In a specific implementation, AUC and/or ROC is selected to score the current preset malicious traffic identification model. The higher the score, the better the model detects malicious traffic.

Step S512: when the score of the preset malicious traffic identification model is lower than a preset value, optimizing that model by grid search combined with stochastic gradient descent.

It should be noted that stochastic gradient descent randomly draws a group of samples, updates the model parameters once according to the gradient computed on that group, and then draws another group and updates again. This simplifies the training process, so that a model whose loss value is within an acceptable range can be obtained without training on all the samples.
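The update rule of stochastic gradient descent (the "random gradient descent" described above) can be sketched on a small logistic-regression classifier. This is a stand-in chosen to keep the example short. The source applies the method to an LSTM model, and the data, learning rate, batch size, and step count here are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(samples: Sequence[Sequence[float]], labels: Sequence[int],
                 lr: float = 0.5, batch_size: int = 2, steps: int = 300,
                 seed: int = 0) -> Tuple[List[float], float]:
    """Fit weights w and bias b with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    indices = list(range(len(samples)))
    for _ in range(steps):
        # Randomly draw a group of samples for this update.
        batch = rng.sample(indices, batch_size)
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for i in batch:
            z = sum(wj * xj for wj, xj in zip(w, samples[i])) + b
            err = sigmoid(z) - labels[i]  # gradient of log loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += err * samples[i][j]
            grad_b += err
        # Update the parameters once from the batch-averaged gradient.
        w = [wj - lr * g / batch_size for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / batch_size
    return w, b


# Hypothetical one-feature data: negative values benign, positive malicious.
samples = [[-1.0], [-0.8], [0.8], [1.0]]
labels = [0, 0, 1, 1]

w, b = sgd_logistic(samples, labels)
print(w, b)
```

Each step touches only two of the four samples, yet every update moves the weight in the direction that separates the two classes.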

It can be understood that the score given by AUC and/or ROC quantifies the recognition performance of the preset malicious traffic identification model. When that score falls below the preset value, the model is further optimized by grid search combined with stochastic gradient descent.

Accordingly, step S60 includes:

step S60': and detecting the network traffic to be detected through the optimized malicious traffic identification model, and judging whether the network traffic belongs to malicious traffic.

In this embodiment, the preset malicious traffic identification model is optimized through hyperparameter search and the back propagation algorithm. AUC and/or ROC is then selected as the evaluation index to score the model, and when the score is lower than a preset value, the model is optimized by grid search combined with stochastic gradient descent. The network traffic to be detected is then checked by the optimized model to determine whether it is malicious traffic. In this way a malicious traffic identification model with high accuracy, a lower false alarm rate, and stronger scalability can be obtained, which further improves the speed and accuracy of judging whether the network traffic to be detected is malicious.

In addition, an embodiment of the invention also provides a storage medium on which a deep learning-based malicious traffic detection program is stored. When executed by a processor, the program implements the steps of the deep learning-based malicious traffic detection method described above.

Referring to fig. 4, fig. 4 is a block diagram of a first embodiment of a malicious traffic detection system based on deep learning according to the present invention.

As shown in fig. 4, the malicious traffic detection system based on deep learning of the present invention includes: a data acquisition module 401, a data processing module 402, a feature engineering module 403, a model construction module 404 and a flow detection module 405;

the data acquisition module 401 is configured to acquire multi-level network data, and perform preprocessing on the network data;

the data processing module 402 is configured to extract a time sequence characteristic of the network data, slice the preprocessed network data according to the time sequence characteristic, and obtain a corresponding data subsequence;

the feature engineering module 403 is configured to perform feature extraction on each data subsequence, so as to obtain feature vectors of each data subsequence; splicing the characteristic vectors of the data subsequences to obtain the characteristic vector of the network data;

the model building module 404 is configured to train an LSTM-based malicious traffic recognition model with the feature vector of the network data as input data, to obtain a preset malicious traffic recognition model;

the flow detection module 405 is configured to detect the network traffic to be detected through the preset malicious traffic identification model, and determine whether the network traffic to be detected is malicious traffic.
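The work of the data processing module and the feature engineering module (slicing by time, per-slice feature extraction, and splicing) might be sketched as follows. The fixed one-second windows and the three per-slice features (packet count, total bytes, mean packet size) are assumptions chosen for illustration. The source does not specify the slicing rule or the features.

```python
from typing import List, Sequence, Tuple

Packet = Tuple[float, int]  # (timestamp in seconds, packet size in bytes)


def slice_by_time(packets: Sequence[Packet], window: float,
                  n_windows: int) -> List[List[Packet]]:
    """Split packets into consecutive fixed-length time windows.

    Windows start at the earliest timestamp. Packets beyond the last
    window are dropped.
    """
    slices: List[List[Packet]] = [[] for _ in range(n_windows)]
    if not packets:
        return slices
    ordered = sorted(packets)
    start = ordered[0][0]
    for ts, size in ordered:
        idx = int((ts - start) // window)
        if idx < n_windows:
            slices[idx].append((ts, size))
    return slices


def slice_features(sub: Sequence[Packet]) -> List[float]:
    """Feature vector of one subsequence: [count, total bytes, mean size]."""
    if not sub:
        return [0.0, 0.0, 0.0]
    sizes = [size for _, size in sub]
    total = float(sum(sizes))
    return [float(len(sizes)), total, total / len(sizes)]


def flow_vector(packets: Sequence[Packet], window: float = 1.0,
                n_windows: int = 3) -> List[float]:
    """Splice the per-slice feature vectors into one vector."""
    vector: List[float] = []
    for sub in slice_by_time(packets, window, n_windows):
        vector.extend(slice_features(sub))
    return vector


# Hypothetical packets from one flow.
packets = [(0.0, 100), (0.4, 300), (1.2, 50), (2.5, 60), (2.9, 40)]
vector = flow_vector(packets)
print(vector)
```

The resulting nine-element vector keeps the per-window ordering, so the temporal structure of the flow is still visible to the model that consumes it.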

In this embodiment, multi-level network data is first collected and preprocessed. The time sequence characteristics of the network data are then extracted, and the preprocessed network data is sliced according to those characteristics to obtain the corresponding data subsequences. Feature extraction is performed on each data subsequence to obtain its feature vector, and the feature vectors of the data subsequences are spliced to obtain the feature vector of the network data. With this feature vector as input data, a malicious traffic identification model based on LSTM (long short-term memory) is trained to obtain the preset malicious traffic identification model. Finally, the network traffic to be detected is checked by the preset model to determine whether it is malicious traffic. Because the embodiment takes different types of data in the network into account, the detection range is expanded. Because it takes the time sequence characteristics of the network data into account, extracting feature vectors from the sliced subsequences and then splicing them, the generated feature vectors are more complete and accurate. Because the identification model is built on LSTM, the vanishing gradient and forgetting problems of traditional model training are alleviated. As a result, the trained malicious traffic identification model can judge quickly and accurately whether the network traffic to be detected is malicious.
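For reference, a common textbook formulation of the LSTM cell is given below. It is not necessarily the exact variant used in this embodiment, and all symbols are introduced here: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic function, and $\odot$ elementwise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The direct path from $c_{t-1}$ to $c_t$ is an elementwise multiplication by $f_t$ followed by an addition. When the forget gate stays close to 1, gradients flowing along this path are not repeatedly squashed, which is the usual explanation for why LSTM eases the vanishing gradient and forgetting problems.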

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

1. A malicious traffic detection method based on deep learning, the method comprising:

collecting multi-level network data and preprocessing the network data;

2. The deep learning-based malicious traffic detection method as set forth in claim 1, wherein the step of collecting multi-level network data and preprocessing the collected network data comprises:

preprocessing the network data set with the uniform format.

3. The deep learning-based malicious traffic detection method of claim 2, wherein the multi-level network data comprises: data source and destination addresses, data port number, protocol type, packet size, timestamp, and QoS information.

4. The deep learning-based malicious traffic detection method of claim 3, wherein training the LSTM-based malicious traffic recognition model with the feature vector of the network data as input data to obtain the preset malicious traffic recognition model comprises:

5. The deep learning-based malicious traffic detection method of claim 4, wherein the preprocessing the uniformly formatted network data comprises:

6. The method for detecting malicious traffic based on deep learning as set forth in claim 1, wherein before the step of extracting features of each data subsequence to obtain feature vectors of each data subsequence, the method comprises:

7. The deep learning-based malicious traffic detection method according to any one of claims 1 to 6, wherein the training the LSTM-based malicious traffic recognition model using the feature vector of the network data as input data, after the step of obtaining the preset malicious traffic recognition model, includes:

8. The deep learning-based malicious traffic detection method as set forth in claim 7, wherein the optimizing the preset malicious traffic recognition model by the hyperparameter search and back-propagation algorithm includes:

9. A deep learning-based malicious traffic detection system, the system comprising: the system comprises a data acquisition module, a data processing module, a characteristic engineering module, a model construction module and a flow detection module,

10. A storage medium, wherein a deep learning-based malicious traffic detection program is stored on the storage medium, which when executed by a processor implements the steps of the deep learning-based malicious traffic detection method according to any one of claims 1 to 8.