CN117349618A

CN117349618A - Method and medium for constructing malicious encryption traffic detection model of network information system

Info

Publication number: CN117349618A
Application number: CN202311312645.4A
Authority: CN
Inventors: 成凯; 刘涛; 李路明; 邱日轩; 吴湛; 王强; 梁良; 任牧; 秦超楠
Original assignee: Central China Grid Co Ltd
Current assignee: Central China Grid Co Ltd
Priority date: 2023-10-11
Filing date: 2023-10-11
Publication date: 2024-01-05

Abstract

The invention belongs to the technical field of cyberspace security and discloses a method for constructing a malicious encrypted traffic detection model in a network information system scenario. The method comprises the following steps: target positioning, namely defining the specific detection purpose; data collection, namely acquiring clean encrypted traffic for training and real-time traffic data for the detection stage; data processing, namely cleaning, integrating, transforming, and mining the originally collected data to form a data set that meets the requirements of deep learning training and testing; model construction, namely constructing a detection model based on a deep learning algorithm; training and evaluation, namely training the deep-learning-based detection model and evaluating its detection performance; and application improvement, namely applying the constructed model to an actual network and continuously refining it. The method targets encrypted malicious traffic detection and generalizes the detection procedure into six steps. It can describe a variety of different detection models, and it remains applicable to ordinary traffic identification, so it is general.

Description

Method and medium for constructing malicious encryption traffic detection model of network information system

Technical Field

The invention belongs to the technical field of cyberspace security, and particularly relates to a method for constructing a malicious encrypted traffic detection model in a network information system scenario.

Background

As power information systems grow more complex, information networks face increasingly prominent security problems. At the same time, attack and defense are intensifying, and attackers use encrypted traffic to evade detection. Conventional security detection inspects plaintext traffic, so maintaining visibility across the entire network becomes increasingly difficult.

Functionally, malicious traffic takes various threat forms, including command-and-control (C&C) communication, backdoors, and data leakage, all of which may be carried over encryption. In actual intrusion detection, encrypted malicious traffic differs from unencrypted malicious traffic mainly in the following 4 aspects: 1) Feature differences. The traffic features of the two differ markedly, and conventional identification methods that require decoding, such as deep packet inspection (DPI), are difficult to apply to encrypted traffic. 2) Complexity differences. There are many encryption protocols (such as SSL/TLS and SSH) and no general identification method, so a protocol-specific identification method is usually required, or several strategies are combined for comprehensive identification. 3) Technical differences. Encrypted malicious traffic often uses techniques such as protocol obfuscation and protocol variants to disguise malicious traffic features as normal ones and thereby avoid detection. 4) Granularity differences. Current research on identifying encrypted malicious traffic focuses mainly on binary classification or a small number of attack classes, and further research is needed to achieve fine-grained identification.

Detection models based on deep learning algorithms fall into two main types. The first detects encrypted malicious traffic from a feature data set: feature engineering extracts features from the raw data, and the extracted, labeled features are used for detection. The second detects encrypted malicious traffic from a slice data set: the method only truncates the raw data to a certain number of bytes and relies on the feature learning capability of deep learning methods such as CNN and RNN to learn the hidden features of the raw traffic automatically.

The essence of encrypted malicious traffic detection is to learn data features and classify traffic data correctly. Rezaei et al. propose a general framework for traffic identification that divides the task into 7 steps. While this framework is applicable to most algorithmic models, it fails to cover newer traffic identification methods. For example, Wang et al. propose a one-dimensional CNN classification model that performs no feature extraction; it only truncates the traffic data and feeds the result to a 1D-CNN, which learns the features itself and classifies the traffic.

Through the above analysis, the problems and defects existing in the prior art are as follows:

1. Lack of identification capability for novel traffic: existing encrypted malicious traffic detection methods mostly rely on feature data sets or slice data sets, and have limited ability to identify novel malicious traffic types or attack modes.

2. Feature engineering is time-consuming and not general: methods that detect encrypted malicious traffic from a feature data set require feature engineering, which is time-consuming, lacks generality, and adapts poorly to the feature extraction requirements of different traffic.

3. High data preprocessing cost: methods that detect encrypted malicious traffic from a slice data set must preprocess the data, including slicing and cleaning, at high cost.

4. Lack of adaptive learning and transfer learning capabilities: most existing encrypted malicious traffic detection methods learn from a single data set and lack adaptive learning and transfer learning capabilities.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention provides a method for constructing a malicious encryption traffic detection model in a network information system scene.

The invention is realized in such a way that the method for constructing the malicious encryption traffic detection model in the network information system scene comprises the following steps:

step one, target positioning, namely defining the specific detection purpose;

step two, data collection, namely acquiring clean encrypted traffic for training and real-time traffic data for the detection stage;

step three, data processing, namely cleaning, integrating, transforming, and mining the originally collected data so that it becomes a data set meeting the requirements of deep learning training and testing;

step four, model construction, namely constructing a detection model based on a deep learning algorithm;

step five, training and evaluation, namely training the deep-learning-based detection model and evaluating its detection performance;

and step six, application improvement, namely applying the constructed model to an actual network and continuously refining it.

Further, the target positioning includes two aspects: one is the application scenario, namely the network in which the detection system is used, such as a mobile cellular network, the Internet of Things, the Internet of Vehicles, an industrial control network, or an SDN; the other is the detection object, such as detecting botnets, detecting DDoS attacks, detecting malware, detecting multiple classes of attacks, or identifying malicious traffic.

Further, in the data collection, malicious traffic is obtained with a sandbox: malicious software is run in the sandbox, the traffic generated while it runs is saved, the traffic between sandboxes and the system white (known-benign) traffic are filtered out, and the remaining traffic is taken as malicious traffic.

Further, the data processing includes:

(1) Data cleaning, in which invalid or erroneous values in the data are usually deleted, by methods including whole-case (listwise) deletion, variable deletion, and pairwise deletion, and missing values are handled by methods including mean imputation, same-class mean imputation, and high-dimensional mapping;

(2) Data integration, which integrates data collected from a plurality of data sources and mainly performs schema (pattern) matching, data redundancy handling, and conflicting-value handling;

(3) Data reduction, including dimensionality reduction, data compression, and numerosity reduction, wherein dimensionality reduction mainly reduces the number of independent variables, by methods including principal component analysis (PCA) and feature subset selection (FSS), and numerosity reduction replaces the original data with a smaller amount of data, by methods including log-linear regression, clustering, and sampling;

(4) Data transformation, which puts the data into a standard form and mainly comprises numericalization, centering, and normalization, wherein numericalization converts non-numeric information, such as network protocol information, into numbers, which can be simple numeric codes; centering subtracts the mean or some specified value from the data; and normalization scales the data into the [0,1] interval to facilitate experiments, commonly by min-max normalization.

Further, in the model construction, the dimensions of the inputs to each layer must match; when a deep learning network is built by cascading several different algorithms, parameters such as the input dimension and time step of the current stage are determined by the output of the previous stage; for the non-convergence problem, the choice of learning rate and whether the data set has been preprocessed are analyzed;

for the vanishing-gradient or exploding-gradient problem, fine-tuning the layer structure and changing the activation function used by the network are considered; for the over-fitting or under-fitting problem, the richness of the data samples and the complexity of the network structure are analyzed.

Further, in the training and evaluation, the deep learning encrypted malicious traffic detection model is trained by one of two methods. In the first, the data set is divided into a training set, a validation set, and a test set; training is carried out on the training set; after each round of training, the validation set is used to check the effect of that round and the model is optimized; after all rounds of training are finished and the optimized model is obtained, the test set is used to simulate a real environment and check the detection performance of the model. In the second, the data set is divided into a training set and a test set, and N-fold cross-validation is used during training: the training set is divided into N parts, one part is taken as the validation set each time and training is carried out on the remaining N-1 parts, the average of the N results is taken as the training accuracy, and after training is finished the detection performance of the model is checked on the test set.

Further, in the training and evaluation, the performance of the encrypted malicious traffic detection model is evaluated with indexes that mainly comprise accuracy indexes and a real-time index:

(1) The accuracy indexes are mainly derived from the confusion matrix and include accuracy, recall, precision, false positive rate, false negative rate, F-score, and the like;

(2) The real-time index reflects the ability of the encrypted malicious traffic detection algorithm to identify encrypted malicious traffic online and quickly, so that the performance of the core network is not affected while malicious traffic detection is carried out; it is mainly reflected in accurate detection based on the first N packets of a flow.

Further, the application improvement means that the constructed model is applied to an actual network to detect the encrypted malicious traffic of the actual network, the effectiveness and the robustness of the algorithm model are checked through network operation, and the model is updated regularly to continuously perfect the model so as to obtain higher detection precision and efficiency.

Another object of the present invention is to provide a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the computer program when executed by the processor causes the processor to execute the steps of the method for constructing a malicious encrypted traffic detection model in the network information system scenario.

Another object of the present invention is to provide a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to execute the steps of the method for constructing a malicious encrypted traffic detection model in the network information system scenario.

In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:

First, the method targets encrypted malicious traffic detection and generalizes the detection procedure into six steps. It can describe a variety of different detection models, has a wider scope of application, and covers existing research well. The invention remains applicable to the ordinary traffic identification problem and is therefore general.

Second, the main advantages and positive effects of each step are as follows:

Step one, target positioning

Advantage: the detection target is made explicit, which helps in collecting high-quality labeled data and constructing the detection model.

Effect: high detection performance and strong model generalization capability.

Step two, data collection

Advantage: traffic data are obtained from a real network environment, so the detection model better fits the actual scenario. Labeled data make manual feature extraction unnecessary.

Effect: the model performs better in an actual network and is better able to detect unknown malicious traffic.

Step three, data processing

Advantage: cleaning and integration yield a complete, high-quality data set. Transformation and mining highlight important features and simplify the subsequent modeling process.

Effect: training of the deep learning model is facilitated and model performance is higher. Data dependence is reduced and the model is more robust.

Step four, model construction

Advantage: the deep learning model has strong self-learning and generalization capability.

Effect: high detection performance and the ability to detect new types of malicious traffic.

Step five, training and evaluation

Advantage: the model keeps learning and evolving on a large amount of data, so the final model performance is optimized. The evaluation stage exposes model defects, which supports further improvement.

Effect: the detection performance of the model improves continuously and eventually reaches a higher level.

Step six, application improvement

Advantage: the model can keep learning and be optimized with new data in a new environment.

Effect: the model does not overfit the training data, adapts to changes in the network environment, and maintains stable detection performance.

In summary, each step of the flow detection model construction method has important advantages and positive effects, and is beneficial to obtaining a high-performance and stable deep learning detection model. There is also a synergistic relationship between these steps that together improve the quality of the final model.

Thirdly, the technical scheme is regarded as a whole or from the perspective of products, and the technical scheme to be protected has the following technical effects and advantages:

1) The detection model based on deep learning can automatically learn complex modes and features in encrypted traffic data, has stronger self-adaptive capacity and generalization capacity, and can detect new unknown malicious encrypted traffic. This is a significant advantage over existing rule and signature based detection methods.

2) Deep learning models can handle large amounts of high-dimensional, unstructured data, and training and prediction are fast. This makes it well suited for real-time detection of network traffic. Deep learning is advantageous in this regard over many existing machine learning algorithms.

3) An end-to-end detection model built with deep learning can be learned directly from raw traffic data without manual feature extraction. This simplifies the construction of the detection model and reduces the manual workload. This is an important advantage of deep learning.

4) The deep learning model learns the detection rules through a large amount of training data, can detect new unknown attacks, and realizes self-updating and improvement. Compared with static rules, the dynamic learning detection mechanism is more intelligent and flexible.

5) In fact, the vast amount of data and computing resources that are continuously collected support the training and optimization of deep learning models. With the increase of the data size and the increase of the computing power, the performance of the deep learning detection model is continuously improved. This is an important advantage over other machine learning methods.

In summary, compared with the prior art, the malicious encryption traffic detection model based on deep learning does have the positive effects and remarkable advantages. Deep learning also faces challenges such as poor interpretation, strong data dependence, and security. In general, deep learning is a machine learning method with great potential in the field of network traffic detection.

Drawings

Fig. 1 is a framework diagram of a method for constructing a malicious encryption traffic detection model in a network information system scene provided by an embodiment of the present invention;

Fig. 2 is a data processing flow chart of a construction method provided by an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, the method for constructing a malicious encrypted traffic detection model in a network information system scene provided by the embodiment of the invention includes the following steps:

S101, target positioning;

S102, data collection;

S103, data processing;

S104, model construction;

S105, training and evaluation;

S106, application improvement.

The signal and data processing procedure in each step is described in detail:

Step one, target positioning: determine the type of malicious encrypted traffic to be detected, such as DDoS attacks or SQL injection attacks, and formulate a data labeling scheme according to the detection target.

Step two, data collection: acquire a large amount of encrypted traffic data by deploying traffic probes or using public data sets, and label the data according to the labeling scheme. The acquired traffic data must include both normal encrypted traffic and malicious encrypted traffic.

Step three, data processing:

1) Cleaning: filter out irrelevant and redundant fields, handle missing values, and the like.

2) Integration: integrate traffic data from different sources to form a complete training and testing data set.

3) Transformation: apply suitable transformations to the traffic data, such as time-window slicing and calculation of flow indicators, so as to highlight important features.

4) Mining: use statistics and data mining techniques to find patterns and associations in the traffic data, and mine the key features that characterize normal and malicious traffic.
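The time-window slicing and flow-indicator calculation named in item 3) can be sketched as follows. This is only an illustration under stated assumptions: each packet is represented as a `(timestamp_seconds, size_bytes)` tuple, the window length defaults to 1 second, and the function name `window_stats` and the chosen indicators (packet count, byte count, mean packet size) are hypothetical, since the embodiment does not specify them.

```python
from collections import defaultdict


def window_stats(packets, window=1.0):
    """Group (timestamp, size) packets into fixed-length time windows and
    compute simple per-window flow indicators (assumed indicators)."""
    buckets = defaultdict(list)
    for ts, size in packets:
        # Integer index of the window this packet falls into.
        buckets[int(ts // window)].append(size)

    stats = []
    for idx in sorted(buckets):
        sizes = buckets[idx]
        stats.append({
            "window": idx,
            "packets": len(sizes),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
        })
    return stats


# Hypothetical packets: (seconds since capture start, packet size in bytes).
packets = [(0.1, 100), (0.4, 300), (1.2, 50), (2.7, 1500), (2.9, 500)]
result = window_stats(packets)
```

Each resulting dictionary could serve as one row of a feature data set; windows with no packets are simply absent in this sketch.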

Step four, model construction: select a deep learning model structure, such as a CNN, an RNN, or an autoencoder, and determine hyperparameters such as the number of hidden-layer nodes, the activation function, the regularization method, and the optimization algorithm.

Step five, training and evaluation: train the deep learning model on the training data set, evaluate model performance (for example AUC, accuracy, and recall) on the test data set, and adjust the model to improve performance.

Step six, application improvement: deploy the trained deep learning model in an actual network to detect malicious encrypted traffic, and continually tune and revise the model with new data to adapt to environmental changes.

The signal and data processing method involved in the steps is comprehensive and comprises data cleaning, labeling, integration, transformation, mining and the like. By these processes, a high quality data set can be obtained to train a deep learning detection model. New data is also introduced in the model training and evaluation links to continuously improve model performance. This is a dynamic data and model process.

The target positioning provided by the embodiment of the invention means defining the specific detection purpose and comprises two aspects. One is the application scenario, namely the network in which the detection system is used, such as a mobile cellular network, the Internet of Things, the Internet of Vehicles, an industrial control network, or an SDN; the other is the detection object, such as detecting botnets, detecting DDoS attacks, detecting malware, detecting multiple classes of attacks, or identifying malicious traffic. Objectively, given the diverse and complex network environments in use, no universal detection algorithm exists that can detect all attacks quickly and accurately. Thus, in a specific implementation, the detection target should be positioned according to the intended purpose.

To obtain clean encrypted traffic for training and real-time traffic data for the detection stage, a traffic capture model is constructed. Normal traffic is obtained either by running a tool such as Wireshark on a monitoring computer to capture the traffic generated by visiting normal encrypted websites or running normal software, or by monitoring the traffic of a relatively clean network environment and applying whitelist filtering, so that the sessions on the whitelist are taken as normal traffic. Malicious traffic is obtained with a sandbox: malicious software is run in the sandbox, the traffic generated while it runs is saved, the traffic between sandboxes and the system white (known-benign) traffic are filtered out, and the remaining traffic is taken as malicious traffic.
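The sandbox filtering described above can be illustrated with a short sketch. It assumes each captured flow is a dictionary with `src` and `dst` address fields, and that the set of sandbox addresses and a whitelist of known-benign system hosts are available; these structures, the function name `extract_malicious_flows`, and the example addresses (taken from documentation-only ranges) are all hypothetical.

```python
def extract_malicious_flows(flows, sandbox_ips, whitelist_hosts):
    """Keep flows captured during a sandbox run that are neither
    sandbox-to-sandbox traffic nor whitelisted system traffic."""
    kept = []
    for flow in flows:
        src, dst = flow["src"], flow["dst"]
        if src in sandbox_ips and dst in sandbox_ips:
            continue  # communication between sandboxes
        if src in whitelist_hosts or dst in whitelist_hosts:
            continue  # system "white" traffic, e.g. OS update servers
        kept.append(flow)
    return kept


# Hypothetical capture from one sandbox run.
sandbox_ips = {"192.0.2.10", "192.0.2.11"}
whitelist_hosts = {"198.51.100.5"}
flows = [
    {"src": "192.0.2.10", "dst": "192.0.2.11", "bytes": 120},   # sandbox to sandbox
    {"src": "192.0.2.10", "dst": "198.51.100.5", "bytes": 900},  # whitelisted host
    {"src": "192.0.2.10", "dst": "203.0.113.77", "bytes": 4300}, # remaining traffic
]
malicious = extract_malicious_flows(flows, sandbox_ips, whitelist_hosts)
```

Only the third flow survives both filters and would be labeled as malicious traffic.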

The original data may suffer from redundancy, errors, imbalance, mismatch, and other problems, and must be processed further before it can be used. The data processing provided by the embodiment of the invention cleans, integrates, transforms, and mines the originally collected data so that it becomes a data set meeting the requirements of deep learning training and testing. The general processing flow for encrypted malicious traffic data is shown in Fig. 2.

Data cleaning is the process of re-examining and verifying the data, with the aim of deleting duplicate information and correcting existing errors. Invalid or erroneous values in the data are usually deleted, by methods including whole-case (listwise) deletion, variable deletion, and pairwise deletion; missing values are handled by methods including mean imputation, same-class mean imputation, and high-dimensional mapping.
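Deleting incomplete records and the two mean-based imputation methods mentioned above can be sketched as follows. The sketch assumes records are dictionaries in which a missing value is represented by `None` and each record carries a `label` field; the function names and the `duration` field are hypothetical.

```python
def listwise_delete(rows):
    """Drop every record that contains at least one missing value (None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{field: mean if r[field] is None else r[field]})
            for r in rows]


def class_mean_impute(rows, field, label="label"):
    """Replace missing values of `field` with the mean over records of the
    same class. Assumes every class has at least one observed value."""
    sums, counts = {}, {}
    for r in rows:
        if r[field] is not None:
            sums[r[label]] = sums.get(r[label], 0.0) + r[field]
            counts[r[label]] = counts.get(r[label], 0) + 1
    return [
        dict(r, **{field: sums[r[label]] / counts[r[label]]})
        if r[field] is None else dict(r)
        for r in rows
    ]


# Hypothetical flow records with a missing duration in each class.
rows = [
    {"duration": 2.0, "label": "benign"},
    {"duration": None, "label": "benign"},
    {"duration": 4.0, "label": "benign"},
    {"duration": 10.0, "label": "malicious"},
    {"duration": None, "label": "malicious"},
]
complete = listwise_delete(rows)
global_filled = mean_impute(rows, "duration")
class_filled = class_mean_impute(rows, "duration")
```

With these records, global mean imputation fills both gaps with the same value, whereas same-class imputation fills the benign gap with 3.0 and the malicious gap with 10.0, which preserves the difference between the classes.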

Data integration combines data collected from a plurality of data sources. The main difficulty is that the sources are heterogeneous: they are not fully consistent, the formats and lengths of the collected data differ, and redundancy and incompatibility exist among the sources. Therefore, data integration mainly performs schema (pattern) matching, data redundancy handling, and conflicting-value handling.
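A minimal sketch of these three integration tasks is given below. It assumes two hypothetical probes that describe the same flows with different field names, a flow key of `(src, dst, start)`, and a "first source wins" policy for conflicting values; none of these choices is prescribed by the embodiment.

```python
def integrate(sources, field_maps, key=("src", "dst", "start")):
    """Merge records from heterogeneous sources into one record per flow.

    field_maps[i] renames the fields of source i to a common schema
    (schema matching). Records sharing the same key are merged (redundancy
    handling). When two sources disagree on a value, the source listed
    first wins (conflict handling, an assumed policy).
    """
    merged = {}
    for records, mapping in zip(sources, field_maps):
        for rec in records:
            unified = {mapping.get(name, name): value
                       for name, value in rec.items()}
            flow_key = tuple(unified[f] for f in key)
            if flow_key not in merged:
                merged[flow_key] = unified
            else:
                for name, value in unified.items():
                    # Only fill fields that earlier sources did not provide.
                    merged[flow_key].setdefault(name, value)
    return list(merged.values())


# Hypothetical records from two probes with different field names.
probe_a = [{"src": "10.0.0.1", "dst": "10.0.0.9", "start": 100, "bytes": 1200}]
probe_b = [
    {"source_ip": "10.0.0.1", "dest_ip": "10.0.0.9", "start": 100,
     "bytes": 1250, "tls_version": "1.2"},
    {"source_ip": "10.0.0.2", "dest_ip": "10.0.0.9", "start": 105,
     "bytes": 300, "tls_version": "1.3"},
]
field_maps = [{}, {"source_ip": "src", "dest_ip": "dst"}]
dataset = integrate([probe_a, probe_b], field_maps)
```

Here the duplicated flow appears once in the result, keeps the byte count reported by the first probe, and gains the `tls_version` field that only the second probe recorded.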

Data reduction means reducing the data volume as far as possible while preserving the original character of the data. It includes dimensionality reduction, data compression, and numerosity reduction. Dimensionality reduction mainly reduces the number of independent variables, by methods including principal component analysis (PCA) and feature subset selection (FSS); numerosity reduction replaces the original data with a smaller amount of data, by methods such as log-linear regression, clustering, and sampling.

Data transformation puts the data into a standard form to facilitate subsequent information mining, and mainly comprises numericalization, centering, and normalization. Numericalization converts non-numeric information, such as network protocol information, into numbers, which can be simple numeric codes. Centering subtracts the mean or some specified value from the data. Normalization scales the data into the [0,1] interval to facilitate experiments, commonly by min-max normalization.
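The three transformations can be sketched in a few lines. Min-max normalization maps a value x to (x - min) / (max - min). The function names, the integer coding scheme for protocol names, and the handling of a constant column (mapped to 0.0) are assumptions made for illustration.

```python
def numericalize(values):
    """Map each distinct non-numeric value to an integer code,
    assigned in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def center(values, reference=None):
    """Subtract the mean (default) or a specified reference value."""
    ref = sum(values) / len(values) if reference is None else reference
    return [v - ref for v in values]


def min_max(values):
    """Scale values into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical protocol field and packet-size field of four flows.
protocols, mapping = numericalize(["TLS", "SSH", "TLS", "QUIC"])
centered = center([1.0, 2.0, 3.0])
scaled = min_max([40, 60, 100])
```

In practice the minimum and maximum would be computed on the training set only and reused for the validation and test sets, so that no information from those sets leaks into training.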

When constructing a detection model based on a deep learning algorithm, the following problems need attention. The first is the dimension matching problem: the dimensions of the inputs to each layer must match, otherwise the deep learning network cannot run. This is particularly important when a deep learning network is built by cascading several different algorithms; the relation between the output of the previous stage and the input of the current stage must be considered, because parameters such as the input dimension and time step of the current stage are determined by the output of the previous stage.
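As an illustration of dimension matching in a cascade, the sketch below propagates the sequence length and channel count through a stack of one-dimensional convolution and pooling layers, so that the time step and input size of a following recurrent stage can be read off. It uses the standard output-length formula floor((L + 2p - k) / s) + 1 without dilation. The layer configuration and the 784-byte input length are hypothetical, not taken from the embodiment.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution or pooling layer (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def cascade_shapes(input_len, layers):
    """Propagate (length, channels) through conv/pool layers.

    Each layer is (kind, kernel, stride, out_channels); pooling layers keep
    the channel count, so their out_channels entry is ignored.
    """
    length, channels = input_len, 1
    shapes = [(length, channels)]
    for kind, kernel, stride, out_channels in layers:
        length = conv1d_out_len(length, kernel, stride)
        if length <= 0:
            raise ValueError("kernel larger than remaining sequence length")
        if kind == "conv":
            channels = out_channels
        shapes.append((length, channels))
    return shapes


# Hypothetical CNN front end applied to a 784-byte flow slice.
layers = [
    ("conv", 25, 1, 32),
    ("pool", 3, 3, None),
    ("conv", 25, 1, 64),
    ("pool", 3, 3, None),
]
shapes = cascade_shapes(784, layers)
time_steps, rnn_input_size = shapes[-1]
```

With this configuration the final shape is 76 positions by 64 channels, so a recurrent stage placed after the CNN would need 76 time steps with an input size of 64; any other setting would produce a dimension mismatch.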

The second is the non-convergence problem. Non-convergence means that the error does not decrease during neural network training and the gradient descent optimization cannot reach an extremum, so no optimal solution is obtained. There are two main causes: 1) The learning rate. If the selected learning rate is too large, gradient descent may fail to converge to the extremum; if it is too small, the deep learning network converges slowly and training takes too long. 2) The data set. The data set may not have been preprocessed (normalization, regularization, and the like), or it may contain bad samples because no data cleaning was performed.
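The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w^2, whose gradient is 2w, with plain gradient descent; each update multiplies w by (1 - 2 * lr), so the iteration diverges when lr exceeds 1 and barely moves when lr is very small. The objective and the three learning rates are chosen for illustration only.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w -= lr * grad      # equivalent to w *= (1 - 2 * lr)
    return w


too_large = gradient_descent(lr=1.1)     # |1 - 2*lr| = 1.2 > 1: diverges
suitable = gradient_descent(lr=0.1)      # factor 0.8: converges quickly
too_small = gradient_descent(lr=0.001)   # factor 0.998: converges very slowly
```

After 50 steps the first run has moved far away from the minimum at 0, the second is within about 1e-5 of it, and the third is still above 0.9.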

The third is the vanishing-gradient or exploding-gradient problem. Both occur during back propagation. By the chain rule, if the derivative of the activation function is smaller than 1, the gradient at layers far from the output becomes smaller and smaller as the network deepens, so the gradient vanishes and parameter updates become very slow; conversely, if the derivative of the activation function is greater than 1, the gradient may explode and make the parameters unstable. These problems have two main causes: 1) the network is too deep, which causes problems in the gradient computation during back propagation; 2) an unsuitable activation function is used. The solution is to fine-tune the layer structure and to change the activation function used by the network.
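The chain-rule argument can be made concrete with a simplified calculation. The sketch assumes that every layer contributes the same local derivative to the back-propagated gradient and ignores the weight matrices, which in a real network also enter the product; the depth of 20 layers is hypothetical.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_factor(local_derivative, depth):
    """Factor by which a gradient is scaled after passing through `depth`
    layers that each contribute the same local derivative (simplified)."""
    return local_derivative ** depth


best_case_sigmoid = sigmoid_grad(0.0)
vanishing = backprop_factor(best_case_sigmoid, depth=20)
exploding = backprop_factor(1.5, depth=20)
```

Even in the best case for the sigmoid, 20 layers scale the gradient by 0.25 to the power 20, which is below 1e-11, whereas a local derivative of 1.5 scales it by more than 3000. This is one reason activation functions such as ReLU, whose derivative is 1 on the active side, are often preferred in deep networks.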

The fourth is the over-fitting or under-fitting problem. Over-fitting means that the deep learning algorithm fits the training data set almost perfectly but performs poorly on the test data set, while under-fitting means that good detection results are obtained on neither the training data set nor the test data set. Over-fitting has two main causes: 1) the data samples are too uniform and not rich enough, so the deep learning network fits only part of the useful information; 2) the network structure is too complex and has too many trainable parameters, so its learning capacity is excessive: it matches the existing data set completely but performs poorly on unknown data. Under-fitting arises mainly because the network structure is too simple and its learning capacity is insufficient, so it cannot effectively capture the feature information of the data and represents the original data inaccurately.

After the model is constructed, it must be trained and evaluated. Two methods are generally used to train a deep learning encrypted malicious traffic detection model. In the first, the data set is divided into a training set, a validation set, and a test set, and training is carried out on the training set; after each round of training, the validation set is used to check the effect of that round and the model is optimized; after all rounds of training are completed and the optimized model is obtained, the test set is used to simulate a real environment and check the detection performance of the model. In the second, the data set is divided into a training set and a test set only. During training, N-fold cross-validation is used: the training set is divided into N parts, one part is taken as the validation set each time, training is carried out on the remaining N-1 parts, and the average of the N results is taken as the training accuracy. After training, the detection performance of the model is checked on the test set. In practice, 5-fold or 10-fold cross-validation is often used.
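The N-fold procedure described above can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, and the callback `train_and_score` standing in for actual model training, are hypothetical. The sketch splits indices in order; in practice the training set would normally be shuffled (or stratified by class) before splitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of the N folds.
    Fold sizes differ by at most one sample."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, n_folds, train_and_score):
    """Train on N-1 parts, validate on the remaining part, and return the
    mean of the N validation scores."""
    scores = [train_and_score(train, validation)
              for train, validation in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=10, n_folds=5))

# Stand-in for real training: the "score" is just the validation fraction.
mean_score = cross_validate(10, 5, lambda train, val: len(val) / 10)
```

Every sample is used for validation exactly once, and the held-out test set is never touched during these N rounds.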

To truly evaluate the quality of an encrypted malicious traffic detection model, the model must be applied in an actual network environment and its detection performance observed. However, the development cost, security risk, and the like of doing so are too great, so the model's performance on the test set is usually used to assess the detection model. The evaluation indexes mainly comprise accuracy indexes and a real-time index.

The accuracy indexes are derived mainly from the confusion matrix. They include accuracy, recall, precision, false positive rate, false negative rate, F-score, and the like.
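These indexes follow directly from the four cells of a binary confusion matrix, with malicious traffic taken as the positive class. The sketch below uses the standard definitions and reports the F1 variant of the F-score; the function name and the example counts are hypothetical, and undefined ratios are returned as 0.0 by assumption.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy indexes from confusion-matrix counts.

    tp: malicious flows flagged as malicious
    fp: normal flows flagged as malicious (false alarms)
    tn: normal flows passed as normal
    fn: malicious flows passed as normal (missed detections)
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "f1": f1,
    }


# Hypothetical test set: 100 malicious flows and 900 normal flows.
metrics = detection_metrics(tp=90, fp=10, tn=890, fn=10)
```

On an imbalanced test set such as this one, accuracy alone (0.98) looks better than the recall of 0.9 suggests, which is why the false negative rate and the F-score are reported alongside it.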

The real-time index reflects the capability of the encrypted malicious traffic detection algorithm to identify encrypted malicious traffic online and rapidly, while ensuring that the performance of the core network is not affected during detection. Real-time performance is mainly reflected in whether a flow can be accurately detected from its first N packets.
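One way to study detection on the first N packets is to truncate every flow to its first N packets before it is passed to the model. The sketch below assumes, for illustration only, that a flow is represented as a list of packet lengths and that short flows are padded with zeros; neither assumption comes from the invention.

```python
def first_n_packets(packet_lengths, n, pad_value=0):
    """Keep only the first n packets of a flow; pad short flows to length n.

    packet_lengths is a hypothetical per-flow feature (one length per packet).
    """
    head = list(packet_lengths[:n])
    return head + [pad_value] * (n - len(head))


if __name__ == "__main__":
    print(first_n_packets([100, 200, 300, 400], 3))
    print(first_n_packets([100], 3))
```

A smaller N allows an earlier decision on each flow but gives the model less information, so N is a trade-off between real-time performance and accuracy.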

The application improvement provided by the embodiment of the invention means that the constructed model is applied to an actual network to detect encrypted malicious traffic in that network; the effectiveness and robustness of the algorithm model are checked through network operation, and the model is updated regularly, so that it is continuously improved to achieve higher detection accuracy and efficiency.

The method for constructing the malicious encryption traffic detection model in the network information system scene is applied to computer equipment, wherein the computer equipment comprises a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the method for constructing the malicious encryption traffic detection model in the network information system scene.

The method for constructing the malicious encryption traffic detection model in the network information system scene is applied to an information data processing terminal, and the information data processing terminal is used for realizing the method for constructing the malicious encryption traffic detection model in the network information system scene.

It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.

Example 1: intra-enterprise network information system

Step one: target positioning

The aim of the embodiment is to detect malicious encrypted traffic in an enterprise internal network information system, so as to identify potential network attacks and threats in time.

Step two: data collection

Network monitoring tools, such as network sniffers and firewalls, are deployed in the enterprise internal network information system to collect network traffic data in real time. Meanwhile, publicly available pure encrypted traffic data is collected as training data.

Step three: data processing

The collected raw data is preprocessed, including data cleaning, integration, transformation, feature extraction and the like. The processed data is divided into a training set and a test set to meet the training and testing requirements of the deep learning model.
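A minimal sketch of such a preprocessing pipeline is given below. It assumes, purely for illustration, that each flow is a dictionary with the hypothetical fields `duration`, `bytes` and `label`; cleaning drops incomplete and duplicate records, transformation is shown as min-max scaling, and the split ratio is an arbitrary example value.

```python
import random


def clean(records, required=("duration", "bytes", "label")):
    """Drop records with missing required fields and exact duplicates."""
    seen, out = set(), []
    for r in records:
        if any(r.get(k) is None for k in required):
            continue
        key = tuple(r[k] for k in required)
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


def min_max_scale(records, field):
    """Rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in records]


def split(records, test_ratio=0.2, seed=0):
    """Shuffle and divide the records into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

A typical call sequence would be `train, test = split(min_max_scale(clean(raw), "bytes"))`, where `raw` is the list of collected flow records.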

Step four: model construction

A detection model based on a deep learning algorithm, such as a convolutional neural network (CNN) or a long short-term memory network (LSTM), is constructed. An appropriate network structure and parameters are designed to suit the enterprise internal network information system scenario.
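When several different algorithms are cascaded, the input dimension of each stage must match the output of the preceding stage. The sketch below checks this constraint for a cascade described as a list of `(name, in_dim, out_dim)` tuples; this description format and the function name `check_cascade` are assumptions made for illustration and do not depend on any particular deep learning framework.

```python
def check_cascade(stages, input_dim):
    """Verify that each stage's input dimension equals the previous output.

    stages: list of (name, in_dim, out_dim) tuples (hypothetical format).
    Returns the output dimension of the final stage.
    """
    current = input_dim
    for name, in_dim, out_dim in stages:
        if in_dim != current:
            raise ValueError(
                f"stage '{name}' expects input dimension {in_dim}, "
                f"but the previous stage outputs {current}")
        current = out_dim
    return current


if __name__ == "__main__":
    cascade = [("cnn", 784, 128), ("lstm", 128, 64), ("dense", 64, 2)]
    print(check_cascade(cascade, 784))
```

Running such a check before training can surface a structural mismatch early, before any time is spent on the training data.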

Step five: training and assessment

The deep learning model is trained using the training set, and its performance is assessed using the test set. The detection performance of the model is evaluated using indexes such as accuracy, precision, recall and F1 score.

Step six: application improvements

The trained model is deployed into the enterprise internal network information system to detect malicious encrypted traffic in real time. The model performance is continuously refined and optimized according to actual application conditions.

Example 2: internet service provider network information system

Step one: target positioning

The aim of this embodiment is to detect malicious encrypted traffic in an internet service provider network information system in order to identify potential network attacks and threats in time.

Step two: data collection

Network monitoring tools, such as network sniffers and firewalls, are deployed in the network information system of the internet service provider to collect network traffic data in real time. Meanwhile, publicly available pure encrypted traffic data is collected as training data.

Step three: data processing

Step four: model construction

A detection model based on a deep learning algorithm, such as a convolutional neural network (CNN) or a long short-term memory network (LSTM), is constructed. An appropriate network structure and parameters are designed to suit the internet service provider network information system scenario.

Step five: training and assessment

Step six: application improvements

The trained model is deployed into the internet service provider network information system to detect malicious encrypted traffic in real time. The model performance is continuously refined and optimized according to actual application conditions.

The two embodiments respectively construct a malicious encryption flow detection model based on a deep learning algorithm aiming at the scene of the enterprise internal network information system and the scene of the Internet service provider network information system. These embodiments may help businesses and internet service providers more effectively protect against potential network attacks and threats, increasing the level of network security.

The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto; any modification, equivalent substitution or improvement made by those skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. The method for constructing the malicious encryption traffic detection model in the network information system scene is characterized by comprising the following steps:

step one, target positioning, namely defining the specific detection purpose;

2. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the target positioning comprises two aspects: one is the application scenario, namely the network in which the detection system is used, such as a mobile network, the Internet of Things, the Internet of Vehicles, an industrial control network, an SDN and the like; the other is the detected object, such as detecting botnets, detecting DDoS attacks, detecting malware, detecting multiple classes of attacks, or identifying malicious traffic.

3. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein, in the data collection, malicious traffic is acquired in a sandbox manner: malicious software is run in a sandbox, the traffic generated during operation is saved, communication traffic among the sandboxes and whitelisted system traffic are filtered out, and the remaining traffic is used as malicious traffic.

4. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the data processing comprises:

5. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein, in the model construction, the input dimensions of each level must be matched, and in the process of constructing a deep learning network in which multiple different algorithms are cascaded, parameters such as the input dimension and time step of the current stage are determined by the output of the previous stage; for the non-convergence problem, the choice of learning rate and whether the data set has been preprocessed are analyzed;

6. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein, in the training and evaluation, one of two methods is adopted for training the deep learning encrypted malicious traffic detection model: in the first, the data set is divided into a training set, a validation set and a test set, training is carried out on the training set, after each round of training the validation set is used to check the training effect of that round and to optimize the model, and after all rounds of training are finished the test set is used to simulate a real environment and check the detection performance of the model; in the second, the data set is divided into a training set and a test set, N-fold cross-validation is adopted during training, namely the training set is divided into N parts, one part is taken as the validation set each time, training is carried out on the remaining N-1 parts, the average of the results of the N training runs is finally taken as the training accuracy, and after training is finished the detection performance of the model is checked on the test set.

7. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein, in the training and evaluation, the performance of the encrypted malicious traffic detection model is evaluated, and the evaluation indexes mainly include an accuracy index and a real-time index:

8. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the application improvement means that the constructed model is applied to an actual network to detect encrypted malicious traffic in the actual network, the effectiveness and robustness of the algorithm model are checked through network operation, and the model is updated periodically, so that the model is continuously improved to achieve higher detection accuracy and efficiency.

9. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the method specifically comprises:

step one, target positioning, namely defining the type of malicious encrypted traffic to be detected, and formulating a data labeling scheme according to the detection target;

step two, data collection, namely acquiring a large amount of encrypted traffic data by deploying a traffic probe or using a public data set, and labeling the data according to the labeling scheme; the acquired traffic data includes normal encrypted traffic and malicious encrypted traffic;

step three, data processing:

1) Cleaning, namely filtering irrelevant and redundant fields and processing missing values;

2) Integrating, namely combining traffic data from different sources to form a complete training and testing data set;

3) Transforming, namely performing proper transformation on the flow data to highlight important characteristics;

4) Mining, namely using statistics and data mining technology to find patterns and correlations in the traffic data, and mining key features which can represent normal/malicious traffic;

step four, model construction, namely selecting a deep learning model structure, such as a CNN, an RNN or an autoencoder; and determining hyper-parameters such as the hidden layer nodes, the activation function, the regularization method and the optimization algorithm;

step five, training and evaluating, namely training the deep learning model by using the training data set, evaluating the performance of the model by using the test data set, and adjusting the model to improve the performance;

step six, application improvement, namely deploying the trained deep learning model into an actual network to detect malicious encrypted traffic; the model is continually tuned and revised with new data to adapt to environmental changes.

10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of constructing a malicious encrypted traffic detection model in a network information system scenario according to any one of claims 1-8.