CN117349618A - Method and medium for constructing malicious encryption traffic detection model of network information system - Google Patents
- Publication number
- CN117349618A (CN202311312645.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- training
- traffic
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/40—Network security protocols
Abstract
The invention belongs to the technical field of cyberspace security and discloses a method for constructing a malicious encrypted traffic detection model in a network information system scenario, comprising the following steps: target positioning, in which the specific detection purpose is defined; data collection, in which clean encrypted traffic for training and real-time traffic data for the detection stage are acquired; data processing, in which the originally collected data are cleaned, integrated, transformed, and mined to form a dataset that meets the requirements of deep learning training and testing; model construction, in which a detection model is built on a deep learning algorithm; training and evaluation, in which the detection model is trained and its detection performance is evaluated; and application improvement, in which the constructed model is applied to an actual network and continuously refined. The method targets encrypted malicious traffic detection and generalizes the detection procedure into a six-step method that can describe a wide variety of detection models; it also remains applicable to ordinary traffic identification and is therefore general.
Description
Technical Field
The invention belongs to the technical field of cyberspace security, and particularly relates to a method for constructing a malicious encrypted traffic detection model in a network information system scenario.
Background
As power information systems grow more complex, information networks face increasingly prominent security problems. At the same time, the contest between attack and defense is intensifying, and attackers also use encrypted traffic to evade detection. Conventional security detection is based on inspecting plaintext traffic, so maintaining visibility into the entire network becomes increasingly difficult.
Functionally, malicious traffic poses threats in various ways, including command and control (C&C) communication, backdoors, and data exfiltration, all of which may be carried out over encrypted channels. In actual intrusion detection, encrypted malicious traffic differs from unencrypted malicious traffic mainly in the following 4 aspects: 1) Feature differences. The traffic characteristics of the two differ markedly, and conventional identification methods that need to read the payload, such as deep packet inspection (DPI), are difficult to apply to encrypted traffic. 2) Complexity differences. Encryption protocols are diverse (such as SSL/TLS and SSH) and no general identification method exists, so a specific identification method usually has to be adopted for each encryption protocol, or several strategies have to be combined for comprehensive identification. 3) Technical differences. Encrypted malicious traffic often uses various techniques (e.g., protocol obfuscation and protocol variants) to disguise its features as those of normal traffic, thereby avoiding detection. 4) Granularity differences. Current research on identifying encrypted malicious traffic focuses mainly on binary classification or on a few classes of attacks, and further research is needed to achieve fine-grained identification of encrypted malicious traffic.
Detection models based on deep learning algorithms fall into two main types. The first is encrypted malicious traffic detection based on a feature dataset: feature engineering is applied to extract features from the raw data, and the extracted, labeled features are used to detect encrypted malicious traffic. The second is encrypted malicious traffic detection based on a sliced dataset: this method only needs to truncate the raw data to a certain number of bytes, and directly uses the feature learning capability of deep learning methods such as CNNs and RNNs to automatically learn the hidden features in the raw traffic data and detect malicious traffic.
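As a non-limiting illustration of the sliced-dataset approach described above, the following minimal Python sketch truncates the raw bytes of a flow to a fixed length, zero-pads shorter flows, and scales each byte to [0, 1]. The function name `slice_flow_bytes` and the default length of 784 bytes are assumptions chosen for illustration; the source does not specify a truncation length or a padding rule.

```python
def slice_flow_bytes(payload: bytes, length: int = 784) -> list[float]:
    """Keep the first `length` bytes of a flow and scale them to [0, 1].

    `length` is a hypothetical truncation size; zero-padding of short
    flows is likewise an assumption made for this sketch.
    """
    clipped = payload[:length]                       # truncate long flows
    padded = clipped + bytes(length - len(clipped))  # zero-pad short flows
    return [b / 255.0 for b in padded]               # one value per byte


# Usage: every flow becomes a fixed-length vector that a 1D-CNN could consume.
vector = slice_flow_bytes(b"\x16\x03\x01\x02\x00", length=8)
```

A fixed-length vector is used here because the downstream network needs a constant input dimension regardless of how long each captured flow is.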
The essence of encrypted malicious traffic detection is to learn data features and classify traffic data correctly. Rezaei et al. proposed a general framework for traffic identification that divides the task into 7 steps. Although this framework applies to most algorithmic models, it fails to cover newer traffic identification methods. For example, Wang et al. proposed a one-dimensional CNN (1D-CNN) classification model that performs no feature extraction; it only truncates the traffic data and then feeds the result into the 1D-CNN, which learns features on its own and performs the classification.
Through the above analysis, the problems and defects of the prior art are as follows:
1. Lack of identification capability for novel traffic: most existing encrypted malicious traffic detection methods rely on feature datasets or sliced datasets for identification and have limited ability to identify novel malicious traffic types or attack modes.
2. Feature engineering is time-consuming and not general: detection methods based on a feature dataset require feature engineering, which is time-consuming, lacks generality, and is difficult to adapt to the feature extraction requirements of different kinds of traffic.
3. The data preprocessing cost is high: detection methods based on a sliced dataset need to preprocess the data, including data slicing and data cleaning, at high cost.
4. Lack of adaptive learning and transfer learning capabilities: most existing encrypted malicious traffic detection methods learn from a single dataset and lack adaptive learning and transfer learning capabilities.
Disclosure of Invention
In view of the problems existing in the prior art, the invention provides a method for constructing a malicious encrypted traffic detection model in a network information system scenario.
The invention is realized as follows: the method for constructing a malicious encrypted traffic detection model in a network information system scenario comprises the following steps:
step one, target positioning, in which the specific detection purpose is defined;
step two, data collection, in which clean encrypted traffic for training and real-time traffic data for the detection stage are acquired;
step three, data processing, in which the originally collected data are cleaned, integrated, transformed, and mined so that they form a dataset meeting the requirements of deep learning training and testing;
step four, model construction, in which a detection model is constructed based on a deep learning algorithm;
step five, training and evaluation, in which the detection model based on the deep learning algorithm is trained and its detection performance is evaluated;
and step six, application improvement, in which the constructed model is applied to an actual network and continuously refined.
Further, the target positioning covers two aspects: one is the application scenario, namely the network in which the detection system is used, such as a mobile network, the Internet of Things, the Internet of Vehicles, an industrial control network, or an SDN; the other is the detection object, such as detecting botnets, detecting DDoS attacks, detecting malware, detecting multiple classes of attacks, or identifying malicious traffic.
Further, in the data collection, malicious traffic is obtained by means of a sandbox: malicious software is run in the sandbox, the traffic generated during the run is saved, communication traffic between sandboxes and system whitelisted traffic are filtered out, and the remaining traffic is taken as malicious traffic.
Further, the data processing includes:
(1) Data cleaning, which usually deletes invalid or erroneous values present in the data, using methods such as listwise (whole-record) deletion, variable deletion, and pairwise deletion; methods for handling missing values include mean imputation, same-class mean imputation, and high-dimensional mapping;
(2) Data integration, which integrates the data collected from a plurality of data sources and mainly performs schema matching, data redundancy processing, and conflicting-value processing;
(3) Data reduction, which includes dimensionality reduction, data compression, and numerosity reduction, wherein dimensionality reduction mainly reduces the number of independent variables, using methods such as principal component analysis (PCA) and feature subset selection (FSS), and numerosity reduction replaces the original data with a smaller amount of data, using methods such as log-linear regression, clustering, and sampling;
(4) Data transformation, which standardizes the data and mainly comprises numericalization, centering, and normalization, wherein numericalization converts non-numeric information, such as network protocol information, into numeric data that can be represented by simple values; centering subtracts the mean or a designated value from the data; and normalization maps the data into the interval [0,1] to facilitate experiments, commonly by min-max normalization.
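As a non-limiting illustration of the data transformation in item (4), the following Python sketch shows numericalization, centering, and min-max normalization on plain lists. The function names and the integer coding scheme for protocol names are assumptions made for illustration, not part of the claimed method.

```python
def numericalize(values):
    """Map each distinct non-numeric value (e.g. a protocol name) to an integer code."""
    codes = {}
    # New values receive the next unused integer, in order of first appearance.
    return [codes.setdefault(v, len(codes)) for v in values]


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def min_max_normalize(values):
    """Map values into [0, 1] as (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Usage on a hypothetical protocol column and packet-length column.
protocol_codes = numericalize(["TLS", "SSH", "TLS"])
scaled_lengths = min_max_normalize([2, 4, 6])
```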
Further, in the model construction, the input dimensions of each layer must match: when a deep learning network is built by cascading several different algorithms, parameters of a given stage, such as its input dimension and time step, are determined by the output of the preceding stage; for the non-convergence problem, the choice of learning rate and whether the dataset has been preprocessed are analyzed;
for the problem of vanishing or exploding gradients, fine-tuning the layer structure and changing the activation function used by the network are considered; for the overfitting or underfitting problem, the richness of the data samples and the complexity of the network structure are analyzed.
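As a non-limiting illustration of the dimension-matching requirement for cascaded stages, the following Python sketch checks that the output dimension of each stage equals the input dimension of the next. The `Stage` record, the stage names, and the dimensions are hypothetical and are used only to show the check; no particular deep learning framework is assumed.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """A hypothetical description of one stage in a cascaded network."""
    name: str
    in_dim: int
    out_dim: int


def check_cascade(stages):
    """Return (upstream, downstream) name pairs whose dimensions do not match."""
    mismatches = []
    for prev, nxt in zip(stages, stages[1:]):
        if prev.out_dim != nxt.in_dim:
            mismatches.append((prev.name, nxt.name))
    return mismatches


# Usage: the dense stage expects 32 inputs but the LSTM stage emits 64.
pipeline = [Stage("cnn", 784, 128), Stage("lstm", 128, 64), Stage("dense", 32, 2)]
problems = check_cascade(pipeline)
```

Running such a check before training surfaces shape errors early, instead of at the first forward pass.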
Further, the training and evaluation can be carried out in two ways. In the first, the dataset for the deep learning encrypted malicious traffic detection model is divided into a training set, a validation set, and a test set: the model is trained on the training set; after each training round, the validation set is used to check the training effect of that round and to optimize the model; and after all rounds are finished and the optimized model is obtained, the test set is used to simulate a real environment and check the detection performance of the model. In the second, the dataset is divided into a training set and a test set, and N-fold cross-validation is adopted during training: the training set is divided into N parts, one part is taken as the validation set each time and training is carried out on the remaining N-1 parts, the average of the N results is taken as the training accuracy, and after training is finished the detection performance of the model is checked on the test set.
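As a non-limiting illustration of the two partitioning schemes above, the following Python sketch produces a training/validation/test split and the N train/validation pairs used in N-fold cross-validation. The split ratios, the fixed random seed, the strided fold assignment, and the `fit_and_score` callback are assumptions made for illustration.

```python
import random


def holdout_split(samples, train=0.7, val=0.15, seed=0):
    """Shuffle and split into training, validation, and test sets (assumed ratios)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def n_fold_splits(train_set, n):
    """Yield (train_part, validation_part) pairs; each fold is the validation set once."""
    items = list(train_set)
    folds = [items[i::n] for i in range(n)]  # strided assignment, no shuffling
    for i in range(n):
        train_part = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_part, folds[i]


def cross_validate(train_set, n, fit_and_score):
    """Average the score returned by the caller-supplied `fit_and_score(train, val)`."""
    scores = [fit_and_score(tr, va) for tr, va in n_fold_splits(train_set, n)]
    return sum(scores) / len(scores)


# Usage with a placeholder scorer that just reports the validation-fold size.
train_set, val_set, test_set = holdout_split(range(20))
mean_score = cross_validate(range(10), 5, lambda tr, va: len(va))
```

In both schemes the test set is held back until training is finished, so the reported detection performance is measured on data the model has not seen.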
Further, the training and evaluation assess the performance of the encrypted malicious traffic detection model, and the evaluation indexes mainly comprise an accuracy index and a real-time index:
(1) The accuracy index is mainly derived from the confusion matrix and comprises accuracy, recall, precision, false alarm rate, miss rate, F-score, and the like;
(2) The real-time index reflects the ability of the detection algorithm to identify encrypted malicious traffic online and quickly, ensuring that the performance of the core network is not affected while malicious traffic detection is carried out; it is mainly reflected in accurate detection based on the first N packets of a flow.
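As a non-limiting illustration of the accuracy index in item (1), the following Python sketch derives the listed metrics from a binary confusion matrix. The label convention (1 for malicious, 0 for normal), the function names, and the default of beta = 1 for the F-score are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed: 1 = malicious, 0 = normal)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn


def detection_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute the confusion-matrix metrics; a zero denominator yields 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    b2 = beta * beta
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_alarm_rate": ratio(fp, fp + tn),  # false positive rate
        "miss_rate": ratio(fn, fn + tp),         # false negative rate
        "f_score": ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Usage on a small hypothetical set of labels and predictions.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
metrics = detection_metrics(*counts)
```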
Further, the application improvement means that the constructed model is applied to an actual network to detect the encrypted malicious traffic of that network, the effectiveness and robustness of the algorithm model are verified through network operation, and the model is updated regularly and continuously refined so as to obtain higher detection accuracy and efficiency.
Another object of the present invention is to provide a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the computer program when executed by the processor causes the processor to execute the steps of the method for constructing a malicious encrypted traffic detection model in the network information system scenario.
Another object of the present invention is to provide a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to execute the steps of the method for constructing a malicious encrypted traffic detection model in the network information system scenario.
In combination with the above technical solution and the technical problems to be solved, the claimed technical solution has the following advantages and positive effects:
First, the method targets encrypted malicious traffic detection and generalizes the detection procedure into a six-step method that can describe a wide variety of detection models, has a wider application range, and covers existing research well. The invention also remains applicable to the ordinary traffic identification problem and is therefore general.
Second, the main advantages and positive effects of each step are as follows:
Step one, target positioning
Advantage: the detection target is made explicit, which helps in collecting high-quality labeled data and in constructing the detection model.
Positive effect: high detection performance and strong model generalization capability.
Step two, data collection
Advantage: traffic data from a real network environment is obtained, so the detection model is better suited to actual scenarios. Labeled data makes it possible to avoid manual feature extraction.
Positive effect: the model performs better in an actual network and is better able to detect unknown malicious traffic.
Step three, data processing
Advantage: cleaning and integration yield a complete, high-quality dataset. Transformation and mining highlight important features and simplify the subsequent modeling process.
Positive effect: training of the deep learning model is facilitated and model performance is higher. Data dependency is reduced and the model is more robust.
Step four, model construction
Advantage: the deep learning model has strong self-learning and generalization capability.
Positive effect: high detection performance and the ability to detect new types of malicious traffic.
Step five, training and evaluation
Advantage: the model learns and evolves continuously from a large amount of data, so that the final model performance is optimized. The evaluation stage can reveal model defects, which helps further improvement.
Positive effect: the detection performance of the model improves continuously and eventually reaches a high level.
Step six, application improvement
Advantage: the model can continue to learn and be optimized with new data in a new environment.
Positive effect: the model does not overfit the training data, adapts to changes in the network environment, and maintains stable detection performance.
In summary, each step of the traffic detection model construction method has important advantages and positive effects and helps in obtaining a high-performance, stable deep learning detection model. The steps also work together, jointly improving the quality of the final model.
Third, considered as a whole or from the perspective of a product, the claimed technical solution has the following technical effects and advantages:
1) A detection model based on deep learning can automatically learn complex patterns and features in encrypted traffic data, has stronger adaptive and generalization capability, and can detect new, unknown malicious encrypted traffic. This is a significant advantage over existing rule-based and signature-based detection methods.
2) Deep learning models can handle large amounts of high-dimensional, unstructured data, and training and prediction are fast. This makes them well suited to real-time detection of network traffic. In this respect deep learning has an advantage over many existing machine learning algorithms.
3) An end-to-end detection model built with deep learning can be learned directly from raw traffic data without manual feature extraction. This simplifies construction of the detection model and reduces manual workload. This is an important advantage of deep learning.
4) The deep learning model learns detection rules from a large amount of training data, can detect new, unknown attacks, and can update and improve itself. Compared with static rules, this dynamic learning-based detection mechanism is more intelligent and flexible.
5) The large volume of continuously collected data and the available computing resources support the training and optimization of deep learning models. As data scale and computing power grow, the performance of the deep learning detection model continues to improve. This is an important advantage over other machine learning methods.
In summary, compared with the prior art, the malicious encrypted traffic detection model based on deep learning has the positive effects and notable advantages described above. Deep learning also faces challenges, such as poor interpretability, strong dependence on data, and security concerns. Overall, deep learning is a machine learning method with great potential in the field of network traffic detection.
Drawings
Fig. 1 is a framework diagram of a method for constructing a malicious encrypted traffic detection model in a network information system scenario provided by an embodiment of the present invention;
Fig. 2 is a data processing flow chart of the construction method provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in Fig. 1, the method for constructing a malicious encrypted traffic detection model in a network information system scenario provided by the embodiment of the invention includes the following steps:
S101, target positioning;
S102, data collection;
S103, data processing;
S104, model construction;
S105, training and evaluation;
S106, application improvement.
The signal and data processing procedure in each step is described in detail below:
Step one, target positioning: determine the type of malicious encrypted traffic to be detected, such as DDoS attacks or SQL injection attacks, and formulate a data labeling scheme according to the detection target.
Step two, data collection: acquire a large amount of encrypted traffic data by deploying traffic probes or using public datasets, and label the data according to the labeling scheme. The acquired traffic data must include both normal encrypted traffic and malicious encrypted traffic.
Step three, data processing:
1) Cleaning: filter out irrelevant and redundant fields, handle missing values, and the like.
2) Integration: combine traffic data from different sources to form complete training and test datasets.
3) Transformation: apply suitable transformations to the traffic data, such as time-window slicing and calculation of traffic indexes, so as to highlight important features.
4) Mining: use statistics and data mining techniques to find patterns and associations in the traffic data, and mine key features that can characterize normal and malicious traffic.
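As a non-limiting illustration of the time-window slicing and traffic index calculation mentioned in item 3), the following Python sketch groups packets into fixed-length time windows and computes simple per-window indexes. The packet representation as (timestamp, size) pairs, the one-second window, and the chosen indexes are assumptions made for illustration.

```python
from collections import defaultdict


def window_stats(packets, window=1.0):
    """Group (timestamp_seconds, size_bytes) pairs into windows and summarize each."""
    buckets = defaultdict(list)
    for ts, size in packets:
        buckets[int(ts // window)].append(size)  # window index of this packet
    return {
        idx: {
            "packets": len(sizes),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
        }
        for idx, sizes in sorted(buckets.items())
    }


# Usage on three hypothetical packets spanning two one-second windows.
stats = window_stats([(0.1, 100), (0.9, 300), (1.2, 50)])
```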
Step four, model construction: select a deep learning model structure, such as a CNN, an RNN, or an autoencoder, and determine hyperparameters such as the number of hidden-layer nodes, the activation function, the regularization method, and the optimization algorithm.
Step five, training and evaluation: train the deep learning model on the training dataset, evaluate model performance on the test dataset using indexes such as AUC, accuracy, and recall, and adjust the model to improve performance.
Step six, application improvement: deploy the trained deep learning model in an actual network to detect malicious encrypted traffic, and continually tune and revise the model with new data to adapt to environmental changes.
The signal and data processing methods involved in the above steps are comprehensive and include data cleaning, labeling, integration, transformation, mining, and the like. Through this processing, a high-quality dataset can be obtained for training the deep learning detection model. New data is also introduced in the model training and evaluation stages to continuously improve model performance, so both the data and the model evolve dynamically.
The target positioning provided by the embodiment of the invention refers to a specific detection purpose and comprises two aspects. One is the application scene, namely, in which network the detection system is used, such as a mobile phone mobile network, an internet of things, a car networking, an industrial control network, SDN and the like; the other is a detected object, such as detecting botnet, detecting DDoS attacks, detecting malware, detecting multi-class attacks, or identifying malicious traffic, etc. In objective terms, in the face of network environments in diversified complex scenes, a universal detection algorithm cannot exist, and all attacks can be detected quickly and accurately. Thus, in a specific implementation, the detection target should be specifically located according to the intended purpose.
In order to obtain pure encrypted traffic for training and real-time traffic data for the detection stage, a traffic capture model is constructed. Normal traffic is obtained in one of two ways: by running a tool such as Wireshark on a monitoring computer to capture the traffic generated when accessing normal encrypted websites or running normal software, or by monitoring a relatively clean network environment and applying whitelist filtering, so that sessions on the whitelist are kept as normal traffic. Malicious traffic is obtained with a sandbox: malware is run in the sandbox, the traffic generated during execution is saved, communication traffic between sandboxes and system white traffic are filtered out, and the remaining traffic is taken as malicious traffic.
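The two labeling paths above can be sketched as simple filters over captured sessions. This is a minimal illustration, not the patent's implementation: the `Session` record, its `server_name` field, and the idea of matching the whitelist on server name are all assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    # Hypothetical session record; field names are assumptions for illustration.
    src_ip: str
    dst_ip: str
    server_name: str

def select_normal(sessions, whitelist):
    """Keep only sessions whose server name appears on the whitelist."""
    return [s for s in sessions if s.server_name in whitelist]

def select_malicious(sandbox_sessions, sandbox_ips, system_whitelist):
    """Drop sandbox-to-sandbox traffic and system white traffic; keep the rest."""
    kept = []
    for s in sandbox_sessions:
        if s.src_ip in sandbox_ips and s.dst_ip in sandbox_ips:
            continue  # communication between sandboxes
        if s.server_name in system_whitelist:
            continue  # system white traffic (e.g. OS update servers)
        kept.append(s)
    return kept
```

Matching on a server name is only one possible whitelist key; an address or certificate fingerprint could serve the same role.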
The original data may suffer from redundancy, errors, imbalance, mismatch, and other problems, and must be processed further before it can be used. The data processing provided by the embodiment of the invention cleans, integrates, transforms, and mines the originally collected data so that it becomes a data set meeting the requirements of deep learning training and testing. The general processing flow for encrypted malicious traffic data is shown in Fig. 2.
Data cleaning is the process of re-examining and verifying data, with the aim of deleting duplicate information and correcting existing errors. Invalid or erroneous values in the data are usually deleted, using methods such as listwise (whole-case) deletion, variable deletion, and pairwise deletion; missing values are handled by methods such as mean imputation, within-class mean imputation, and high-dimensional mapping.
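Three of the cleaning operations named above (duplicate removal, listwise deletion, and mean imputation) can be sketched in plain Python. The row layout (lists of numbers with `None` marking a missing value) is an assumption made for illustration, and `mean_impute` assumes every column has at least one observed value.

```python
from statistics import mean

def drop_duplicates(rows):
    """Remove exact duplicate rows while preserving first-seen order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def listwise_delete(rows):
    """Listwise deletion: drop every row that has any missing value."""
    return [row for row in rows if None not in row]

def mean_impute(rows):
    """Mean imputation: replace each missing value with its column mean."""
    n_cols = len(rows[0])
    col_means = [
        mean(row[j] for row in rows if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]
```

Listwise deletion discards information but never invents values, while mean imputation keeps every row at the cost of shrinking variance; which trade-off is acceptable depends on how much data is missing.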
Data integration merges the data collected from multiple data sources. The main difficulty is that the sources are heterogeneous: they are not fully consistent, the formats and lengths of the collected data differ, and redundancy and incompatibility exist among them. Data integration therefore mainly performs schema matching, data redundancy handling, and conflicting-value handling.
Data reduction means shrinking the data volume as far as possible while preserving the original character of the data. It includes dimensionality reduction, data compression, and numerosity reduction. Dimensionality reduction mainly reduces the number of independent variables, using methods such as principal component analysis (PCA) and feature subset selection (FSS); numerosity reduction replaces the original data with a smaller representation, using methods such as log-linear regression, clustering, and sampling.
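As one example of numerosity reduction by sampling, the sketch below draws a stratified random sample so that each traffic class keeps roughly its original share. Stratification is a choice made here for illustration (the text only says "sampling"), and the function name, the `label_of` callback, and the fixed seed are assumptions. It expects `0 < fraction <= 1`.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Sample about `fraction` of the records from each class separately."""
    rng = random.Random(seed)  # fixed seed so the reduction is reproducible
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling per class rather than over the whole set avoids losing a rare malicious class entirely when the data is imbalanced.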
Data transformation normalizes the data to facilitate subsequent information mining, and mainly comprises numericalization, centering, and normalization. Numericalization converts non-numeric information, such as network protocol information, into simple numeric values. Centering subtracts the mean or some specified value from the data. Normalization scales the data into the [0,1] interval to facilitate experiments; min-max normalization is commonly used.
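The three transformations can be sketched as follows. The integer-code scheme used for numericalization is an assumption (the text only requires "simple numeric values"), and the function names are illustrative.

```python
def numericalize(values):
    """Map each distinct non-numeric value to an integer code, in first-seen order."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

def center(xs, ref=None):
    """Subtract the mean (or a specified reference value) from every element."""
    if ref is None:
        ref = sum(xs) / len(xs)
    return [x - ref for x in xs]

def min_max_normalize(xs):
    """Scale values into [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # constant feature: avoid division by zero
    return [(x - lo) / (hi - lo) for x in xs]
```

For example, `numericalize(["TCP", "UDP", "TCP"])` yields the codes `[0, 1, 0]`. Integer codes impose an artificial ordering on protocols, so a one-hot encoding may be preferable for some models.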
In constructing a detection model based on a deep learning algorithm, the following problems need attention. The first is dimension matching: the dimensions of the input to each layer must match, otherwise the deep learning network cannot run. This matters especially when building a deep learning network that cascades several different algorithms: the relationship between the output of the previous stage and the input of the current stage must be considered, so that parameters of the current stage such as the input dimension and time step are determined by the output of the previous stage.
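For a CNN-then-LSTM cascade, the dependency described above can be made concrete by computing the convolutional output shape and deriving the LSTM's time steps and input size from it. This sketch assumes undilated 1-D convolutions and a single-channel input; the layer tuple format and function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of an undilated 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1

def lstm_input_from_cnn(seq_len, conv_layers):
    """Derive LSTM input parameters from the CNN stage that precedes it.

    conv_layers is a list of (out_channels, kernel, stride, padding) tuples.
    """
    length, channels = seq_len, 1
    for out_channels, kernel, stride, padding in conv_layers:
        length = conv1d_out_len(length, kernel, stride, padding)
        if length <= 0:
            raise ValueError("kernel is larger than the remaining sequence")
        channels = out_channels
    # The LSTM sees one time step per remaining position,
    # with one feature per output channel of the last convolution.
    return {"time_steps": length, "input_size": channels}
```

Computing the shape this way, rather than hard-coding the LSTM input size, keeps the two stages consistent when a kernel size or stride is changed.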
The second is non-convergence. Non-convergence means that the error does not decrease during neural network training and the gradient descent optimization cannot reach an extreme point, so the optimal solution cannot be obtained. There are two main causes: 1) the learning rate: a learning rate that is too large may prevent gradient descent from converging to the extreme point, while one that is too small slows convergence and makes training take too long; 2) the data set: the data set may not have been preprocessed (for example, normalized or regularized), or it may contain bad samples because no data cleaning was performed.
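The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w^2 (a stand-in objective chosen here for illustration, not a detection model) with plain gradient descent; each update multiplies w by (1 - 2·lr), so the behavior depends entirely on the learning rate.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w -= lr * grad
    return w

too_large = gradient_descent(lr=1.1)     # |1 - 2*lr| > 1: the iterate grows each step
reasonable = gradient_descent(lr=0.1)    # shrinks by a factor of 0.8 per step
too_small = gradient_descent(lr=0.0001)  # barely moves away from w0 in 50 steps
```

With the large rate the iterate oscillates with growing magnitude, with the small rate it has hardly moved after 50 steps, and only the intermediate rate approaches the minimum at 0.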
The third is gradient vanishing or gradient explosion. Both occur during back propagation. By the chain rule, if the derivative of the activation function is smaller than 1, the gradient at layers far from the output end becomes smaller and smaller as the network deepens, so the gradient vanishes and parameter updates become very slow; conversely, if the derivative of the activation function is greater than 1, the gradient may explode and make the parameters unstable. These problems have two main causes: 1) the network is too deep, which causes problems in the gradient computation during back propagation; 2) an unsuitable activation function is used. The remedy is to fine-tune the layer structure and to change the activation function used by the network.
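A rough numeric illustration of the chain-rule argument: if every layer contributes the same derivative factor, the gradient reaching the early layers scales like that factor raised to the depth. This is a deliberate simplification that ignores the weight matrices, which also enter the product in a real network; the function names are illustrative.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chained_factor(per_layer_derivative, depth):
    """Product of identical per-layer derivative factors across `depth` layers."""
    return per_layer_derivative ** depth

vanishing = chained_factor(sigmoid_grad(0.0), depth=10)  # 0.25**10, below 1e-6
exploding = chained_factor(1.5, depth=20)                # 1.5**20, above 3000
```

Because the sigmoid derivative never exceeds 0.25, even its best case shrinks the gradient quickly with depth, which is one reason switching the activation function or reducing depth helps.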
The fourth is over-fitting or under-fitting. Over-fitting means that the deep learning algorithm fits the training data set almost perfectly but performs poorly on the test data set, while under-fitting means that it achieves good detection results on neither the training data set nor the test data set. Over-fitting has two main causes: 1) the data samples are too uniform and insufficiently rich, so the deep learning network fits only part of the effective information; 2) the network structure is too complex, with too many trainable parameters and too much learning capacity, so it matches the existing data set completely but performs poorly on unseen data. Under-fitting arises mainly because the network structure is too simple and its learning capacity is insufficient, so it cannot effectively capture the feature information of the learned object and characterizes the original data inaccurately.
After the model is constructed, it must be trained and evaluated. Two methods are generally used to train a deep learning detection model for encrypted malicious traffic. The first divides the data set into a training set, a validation set, and a test set. Training is performed on the training set; after each round of training, the validation set is used to check the effect of that round and to optimize the model; after all rounds are complete and the optimized model is obtained, the test set is used to simulate a real environment and check the model's detection performance. The second divides the data set into a training set and a test set only. During training, N-fold cross-validation is used: the training set is divided into N parts, one part is taken as the validation set each time, training is performed on the remaining N-1 parts, and the average of the N results is taken as the training accuracy. After training, the model's detection performance is checked on the test set. In practice, 5-fold or 10-fold cross-validation is often used.
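The N-fold procedure can be sketched with index arithmetic alone. Here `train_and_score` is a hypothetical callback standing in for whatever model training and validation scoring is used; the contiguous (unshuffled) split is a simplification, and in practice the training set would normally be shuffled first.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of the N folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder evenly
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, n_folds, train_and_score):
    """Average the validation score over the N folds."""
    scores = [
        train_and_score(train, validation)
        for train, validation in k_fold_indices(n_samples, n_folds)
    ]
    return sum(scores) / len(scores)
```

Every sample serves as validation data exactly once, so the averaged score uses the whole training set without ever scoring a model on data it was trained on.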
To truly evaluate the quality of an encrypted malicious traffic detection model, the model would need to be applied in an actual network environment and its detection performance observed. However, because the costs involved, such as chip development and security risk, are too great, the model's performance on the test set is usually used to evaluate the strengths and weaknesses of the detection model. The evaluation indexes mainly comprise accuracy indexes and real-time indexes.
The accuracy indexes are derived mainly from the confusion matrix and include accuracy, recall, precision, false positive rate, false negative rate, F-score, and the like.
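The listed indexes follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions for binary detection, treating malicious traffic as the positive class (label 1, an assumption made here); ratios with a zero denominator are reported as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary detector."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def _ratio(num, den):
    return num / den if den else 0.0

def detection_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute accuracy-type indexes from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    b2 = beta * beta
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": _ratio(fp, fp + tn),
        "false_negative_rate": _ratio(fn, fn + tp),
        "f_score": _ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }
```

On imbalanced traffic, accuracy alone can look high while recall is poor, which is why the false negative rate and F-score are reported alongside it.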
The real-time index reflects the ability of the encrypted malicious traffic detection algorithm to identify encrypted malicious traffic online and quickly, so that core network performance is not affected while malicious traffic detection is carried out. Real-time performance is reflected mainly in accurate detection based on the first N packets of a flow.
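One way to make a detector operate on only the first N packets of a flow is to truncate or pad each flow to a fixed-length input before classification. The representation below (a list of signed packet sizes, with zero padding) is an assumption introduced for illustration; the text does not specify the per-packet features.

```python
def first_n_packets(packet_sizes, n, pad_value=0):
    """Return a fixed-length vector built from the first n packets of a flow.

    Flows shorter than n packets are padded with pad_value so that every
    flow produces an input of the same length.
    """
    head = list(packet_sizes[:n])
    return head + [pad_value] * (n - len(head))
```

A smaller N lets the detector decide earlier in the flow, usually at some cost in accuracy, so N is a tuning parameter that trades real-time performance against the accuracy indexes.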
The application improvement provided by the embodiment of the invention means applying the constructed model to an actual network to detect encrypted malicious traffic there, checking the effectiveness and robustness of the algorithm model through network operation, and updating the model regularly, so that it is continuously refined to achieve higher detection accuracy and efficiency.
The method for constructing the malicious encryption traffic detection model in the network information system scene is applied to computer equipment, wherein the computer equipment comprises a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the method for constructing the malicious encryption traffic detection model in the network information system scene.
The method for constructing the malicious encryption traffic detection model in the network information system scene is applied to an information data processing terminal, and the information data processing terminal is used for realizing the method for constructing the malicious encryption traffic detection model in the network information system scene.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
Example 1: intra-enterprise network information system
Step one: target positioning
The aim of the embodiment is to detect malicious encrypted traffic in an enterprise internal network information system, so as to identify potential network attacks and threats in time.
Step two: data collection
Network monitoring tools, such as network sniffers and firewalls, are deployed in the enterprise internal network information system to collect network traffic data in real time. In addition, publicly available pure encrypted traffic data is collected as training data.
Step three: data processing
The collected raw data is preprocessed, including data cleaning, integration, transformation, and feature extraction. The processed data is divided into a training set and a test set to meet the requirements of deep learning model training and testing.
Step four: model construction
A detection model based on a deep learning algorithm, such as a convolutional neural network (CNN) or a long short-term memory network (LSTM), is constructed. An appropriate network structure and parameters are designed to suit the enterprise internal network information system scenario.
Step five: training and assessment
The deep learning model is trained on the training set, and its performance is assessed on the test set. Metrics such as accuracy, precision, recall, and F1 score are used to evaluate the detection performance of the model.
Step six: application improvements
The trained model is deployed in the enterprise internal network information system to detect malicious encrypted traffic in real time. The model is continuously refined and optimized according to how it performs in actual use.
Example 2: internet service provider network information system
Step one: target positioning
The aim of this embodiment is to detect malicious encrypted traffic in an internet service provider network information system in order to identify potential network attacks and threats in time.
Step two: data collection
Network monitoring tools, such as network sniffers and firewalls, are deployed in the Internet service provider network information system to collect network traffic data in real time. In addition, publicly available pure encrypted traffic data is collected as training data.
Step three: data processing
The collected raw data is preprocessed, including data cleaning, integration, transformation, and feature extraction. The processed data is divided into a training set and a test set to meet the requirements of deep learning model training and testing.
Step four: model construction
A detection model based on a deep learning algorithm, such as a convolutional neural network (CNN) or a long short-term memory network (LSTM), is constructed. An appropriate network structure and parameters are designed to suit the Internet service provider network information system scenario.
Step five: training and assessment
The deep learning model is trained on the training set, and its performance is assessed on the test set. Metrics such as accuracy, precision, recall, and F1 score are used to evaluate the detection performance of the model.
Step six: application improvements
The trained model is deployed in the Internet service provider network information system to detect malicious encrypted traffic in real time. The model is continuously refined and optimized according to how it performs in actual use.
The two embodiments respectively construct a malicious encryption flow detection model based on a deep learning algorithm aiming at the scene of the enterprise internal network information system and the scene of the Internet service provider network information system. These embodiments may help businesses and internet service providers more effectively protect against potential network attacks and threats, increasing the level of network security.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto; any modifications, equivalents, improvements, and alternatives made by those skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. The method for constructing the malicious encryption traffic detection model in the network information system scene is characterized by comprising the following steps:
step one, target positioning and specific detection purposes are defined;
step two, data collection is carried out to obtain pure encrypted flow for training and real-time flow data in a detection stage;
step three, data processing, namely cleaning, integrating, transforming, mining and the like the originally collected data to enable the data to be a data set meeting the requirements of deep learning training and testing;
step four, model construction, namely constructing a detection model based on a deep learning algorithm;
step five, training and evaluating, namely training the detection model based on the deep learning algorithm and evaluating the detection performance of the model;
and step six, applying improvement, namely applying the constructed model to an actual network to continuously perfect the model.
2. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the target positioning comprises two aspects: one is the application scene, namely, in which network the detection system is used, such as a mobile phone mobile network, an internet of things, a car networking, an industrial control network, SDN and the like; the other is a detected object, such as detecting botnet, detecting DDoS attacks, detecting malware, detecting multi-class attacks, or identifying malicious traffic, etc.
3. The method for constructing a malicious encryption traffic detection model in a network information system scene according to claim 1, wherein, in the data collection, malicious traffic is acquired in a sandbox mode: malicious software is run in a sandbox, the traffic generated during operation is saved, communication traffic between sandboxes and system white traffic are filtered out, and the remaining traffic is used as malicious traffic.
4. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the data processing comprises:
(1) Data cleaning, in which invalid values or error values existing in the data are usually deleted, using methods including listwise (whole-case) deletion, variable deletion, and pairwise deletion, and missing values are processed by methods including mean imputation, within-class mean imputation, and high-dimensional mapping;
(2) Data integration, which integrates data collected by a plurality of data sources, and mainly performs schema matching, data redundancy processing, and conflict value processing;
(3) Data reduction, including dimensionality reduction, data compression, and numerosity reduction, wherein the dimensionality reduction mainly reduces the number of independent variables, using methods including principal component analysis (PCA) and feature subset selection (FSS), and the numerosity reduction replaces the original data with a smaller data volume, using methods including log-linear regression, clustering, and sampling;
(4) Data transformation, which normalizes the data and mainly comprises numericalization, centering, and normalization, wherein the numericalization converts non-numeric information, such as network protocol information, into simple numeric values, the centering subtracts the mean or a specified value from the data, and the normalization scales the data into the [0,1] interval to facilitate experiments, min-max normalization being commonly used.
5. The method for constructing a malicious encryption traffic detection model in a network information system scene according to claim 1, wherein, in the model construction, the dimensions of the input to each layer must match, and in constructing a deep learning network that cascades several different algorithms, parameters of the current stage such as the input dimension and time step are determined by the output of the previous stage; for the non-convergence problem, the choice of learning rate and whether the data set has been preprocessed are analyzed;
for the gradient vanishing or gradient explosion problem, fine-tuning the layer structure and changing the activation function used by the network are considered; for the over-fitting or under-fitting problem, the richness of the data samples and the complexity of the network structure are analyzed.
6. The method for constructing the malicious encryption traffic detection model in the network information system scene according to claim 1, wherein, in the training and evaluation, the deep learning encrypted malicious traffic detection model is trained by one of two methods: in the first, the data set is divided into a training set, a validation set, and a test set, training is carried out on the training set, after each round of training the validation set is used to check the effect of that round and the model is optimized, and after all rounds of training are finished the test set is used to simulate a real environment and check the detection performance of the model; in the second, the data set is divided into a training set and a test set, N-fold cross-validation is adopted during training, namely the training set is divided into N parts, one part is taken as the validation set each time, training is carried out on the remaining N-1 parts, the average of the N training results is taken as the training accuracy, and after training is finished the detection performance of the model is checked on the test set.
7. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the training and evaluation evaluates the performance of the encrypted malicious traffic detection model, and the evaluation indexes mainly comprise accuracy indexes and real-time indexes:
(1) The accuracy indexes are derived mainly from the confusion matrix and include accuracy, recall, precision, false positive rate, false negative rate, F-score, and the like;
(2) The real-time index reflects the ability of the encrypted malicious traffic detection algorithm to identify encrypted malicious traffic online and quickly, so that core network performance is not affected while malicious traffic detection is carried out, and is reflected mainly in accurate detection based on the first N packets of a flow.
8. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the application improvement means that the constructed model is applied to an actual network to detect encrypted malicious traffic in that network, the effectiveness and robustness of the algorithm model are checked through network operation, and the model is updated periodically, so that the model is continuously refined to obtain higher detection accuracy and efficiency.
9. The method for constructing a malicious encrypted traffic detection model in a network information system scenario according to claim 1, wherein the method specifically comprises:
step one, target positioning, namely defining the type of malicious encrypted traffic to be detected, and formulating a data labeling scheme according to the detection target;
Step two, data collection, namely acquiring a large amount of encrypted flow data by deploying a flow probe or using a public data set, and labeling the data according to a labeling scheme; the acquired traffic data includes normal encrypted traffic and malicious encrypted traffic;
step three, data processing:
1) Filtering irrelevant and redundant fields and processing missing values;
2) Integrating flow data from different sources to form a complete training and testing data set;
3) Transforming, namely performing proper transformation on the flow data to highlight important characteristics;
4) Mining, namely using statistics and data mining technology to find patterns and correlations in the traffic data, and mining key features which can represent normal/malicious traffic;
step four, model construction, namely selecting a deep learning model structure, such as a CNN, RNN, or autoencoder, and determining hyperparameters including the number of hidden-layer nodes, the activation function, the regularization method, and the optimization algorithm;
step five, training and evaluating, namely training the deep learning model by using a training data set, evaluating the performance of the model by using a test data set, and adjusting the model to improve the performance;
step six, applying improvement, namely deploying the deep learning model obtained by training into an actual network to detect malicious encrypted traffic, and continually tuning and revising the model with new data to adapt to environmental changes.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of constructing a malicious encrypted traffic detection model in a network information system scenario according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311312645.4A CN117349618A (en) | 2023-10-11 | 2023-10-11 | Method and medium for constructing malicious encryption traffic detection model of network information system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311312645.4A CN117349618A (en) | 2023-10-11 | 2023-10-11 | Method and medium for constructing malicious encryption traffic detection model of network information system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117349618A (en) | 2024-01-05 |
Family
ID=89362435
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311312645.4A Pending CN117349618A (en) | 2023-10-11 | 2023-10-11 | Method and medium for constructing malicious encryption traffic detection model of network information system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117349618A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117792933A (en) * | 2024-02-27 | 2024-03-29 | 南京市微驰数字科技有限公司 | Network flow optimization method and system based on deep learning |
CN118118271A (en) * | 2024-04-03 | 2024-05-31 | 苏州领跑智能科技有限公司 | Network data security management system based on artificial intelligence |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117792933A (en) * | 2024-02-27 | 2024-03-29 | 南京市微驰数字科技有限公司 | Network flow optimization method and system based on deep learning |
CN117792933B (en) * | 2024-02-27 | 2024-05-03 | 南京市微驰数字科技有限公司 | Network flow optimization method and system based on deep learning |
CN118118271A (en) * | 2024-04-03 | 2024-05-31 | 苏州领跑智能科技有限公司 | Network data security management system based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||