CN110598774B - Encrypted flow detection method and device, computer readable storage medium and electronic equipment - Google Patents

Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Info

Publication number
CN110598774B
Authority
CN
China
Prior art keywords
data
encrypted
algorithm
training sample
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910827194.5A
Other languages
Chinese (zh)
Other versions
CN110598774A (en)
Inventor
罗赟骞
邬江
戴方岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Power Great Wall Internetworking Safety Technology Research Institute Beijing Co ltd
Original Assignee
China Power Great Wall Internetworking Safety Technology Research Institute Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Power Great Wall Internetworking Safety Technology Research Institute Beijing Co ltd
Priority to CN201910827194.5A
Publication of CN110598774A
Application granted
Publication of CN110598774B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an encrypted traffic detection method and device, a computer-readable storage medium, and electronic equipment. The method comprises: extracting features of network sessions from a target file as training samples and constructing a training sample set, wherein the data in the training samples comprise data of at least two data types; preprocessing the training sample set by setting the data type of predetermined training samples to a data type that a predetermined algorithm can recognize, wherein the predetermined training samples comprise features of a network session that are extracted from the target file and whose data type before extraction was one the predetermined algorithm can recognize, and the predetermined algorithm can recognize features of at least two data types; constructing an encrypted traffic detection model using the predetermined algorithm; and detecting an object to be detected using the constructed encrypted traffic detection model. The device executes the encrypted traffic detection method. The invention constructs more comprehensive detection features, saves computing resources, and improves detection accuracy.

Description

Encrypted flow detection method and device, computer readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of network security, and in particular, to an encrypted traffic detection method, an encrypted traffic detection apparatus for performing the encrypted traffic detection method, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of the internet of things, big data, cloud computing and high-speed mobile communication networks, the information confidentiality problem becomes more and more important, various security protocols for ensuring the network communication security are widely applied, and more internet traffic is encrypted. The encryption technology ensures the communication security of internet users, ensures that information cannot be intercepted and read by a third party, and simultaneously makes a traditional security detection mechanism face failure.
The wide application of artificial intelligence technology provides an important means of discovering malicious traffic attacks. At present, research on malicious encrypted traffic detection falls mainly into session-based, session-statistics-based and certificate-based detection. Session-based detection extracts features from network flows and applies methods such as random forest; session-statistics-based detection extracts statistical features from the statistics of network flows and applies methods such as eXtreme Gradient Boosting (XGBoost) and LightGBM (Light Gradient Boosting Machine); certificate-based detection extracts features from certificates and builds a detection model with methods such as a Support Vector Machine (SVM).
However, the existing detection model has incomplete features, occupies a large memory space, and has yet to be further improved in detection accuracy.
Disclosure of Invention
To solve at least one aspect of the above problems of the prior art, it is an object of the present invention to provide an encrypted traffic detection method, an encrypted traffic detection apparatus that performs the encrypted traffic detection method, a computer-readable storage medium, and an electronic device. The method aims to reduce the memory space occupied by the encryption flow detection model and further improve the accuracy of encryption flow detection.
To achieve the above object, as a first aspect of the present invention, there is provided an encrypted traffic detection method including:
extracting features of network sessions from a target file to serve as training samples, and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
preprocessing training samples in the training sample set to set the data types of preset training samples as the data types which can be identified by a preset algorithm, and obtaining the preprocessed training sample set, wherein the preset training samples comprise the features of network sessions, which are extracted from a target file and have the data types which can be identified by the preset algorithm, and the preset algorithm can identify the features of at least two data types;
constructing an encrypted flow detection model by using the pre-processed training sample set and adopting the predetermined algorithm;
and detecting the object to be detected by using the constructed encrypted flow detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session include at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, constructing the encrypted traffic detection model includes:
searching the optimal hyper-parameter of the preset algorithm by utilizing the preprocessed training sample set;
and training the preset algorithm on the pre-processed training sample set with the optimal hyper-parameter to obtain the encrypted flow detection model.
Optionally, the detecting the object to be detected by using the constructed encrypted traffic detection model includes:
extracting the characteristics of an object to be detected;
preprocessing the extracted features of the object to be detected, wherein, for each extracted feature whose data type before extraction was a data type which can be identified by the preset algorithm, the data type is set back to that data type;
inputting the preprocessed extracted characteristics of the object to be detected into the encrypted flow detection model for identification.
As a second aspect of the present invention, there is provided an encrypted traffic detection device including:
a feature extraction module, which is used for extracting features of a network session from a target file to serve as training samples and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
the characteristic data processing module is used for preprocessing the training samples in the training sample set so as to set the data types of the preset training samples as the data types which can be identified by a preset algorithm and obtain the preprocessed training sample set, wherein the preset training samples comprise the characteristics of network sessions which are extracted from a target file and have the data types which can be identified by the preset algorithm, and the preset algorithm can identify the characteristics of at least two data types;
the model construction module is used for constructing an encrypted flow detection model by using the pre-processed training sample set and adopting the predetermined algorithm;
and the encrypted flow detection module is used for detecting the object to be detected by using the constructed encrypted flow detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, the model building module comprises:
the optimal hyper-parameter selection module is used for searching the optimal hyper-parameter of the preset algorithm by utilizing the preprocessed training sample set;
and the model training module is used for training the preset algorithm on the preprocessed training sample set with the optimal hyper-parameter to obtain the encrypted flow detection model.
Optionally, the feature extraction module is further configured to extract features of the object to be detected.
The characteristic data processing module is further used for preprocessing the extracted features of the object to be detected, wherein, for each extracted feature whose data type before extraction was a data type which can be identified by the preset algorithm, the data type is set back to that data type.
And the encrypted flow detection module is also used for inputting the preprocessed extracted characteristics of the object to be detected into the encrypted flow detection model for identification.
As a third aspect of the present invention, there is provided a computer-readable storage medium for storing an executable program capable of executing the above-described encrypted traffic detection method of the present invention.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of the present invention described above.
According to the characteristics of malicious encrypted traffic, the encrypted traffic detection model is constructed by using an algorithm capable of directly identifying and processing numerical data and non-numerical data, and the non-numerical data is not required to be converted into the numerical data, so that the occupied storage space of the model is reduced, and the detection accuracy is improved; meanwhile, non-numerical characteristic data is extracted, perfect detection characteristics are constructed, and malicious encrypted flow can be described more comprehensively, so that the detection accuracy is further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method of detecting encrypted traffic;
FIG. 2 is a flow chart of the construction of an encrypted traffic detection model using the predetermined algorithm;
FIG. 3 is a flow chart of detecting an object to be detected by using the constructed encryption traffic detection model;
FIG. 4 is a block diagram of the encrypted traffic detection apparatus.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As a first aspect of the present invention, there is provided an encrypted traffic detection method. Fig. 1 is a flow chart of a method of detecting encrypted traffic. As shown in fig. 1, the encrypted traffic detection method according to this embodiment includes:
In step S110, features of the network session are extracted from the target file as training samples, and a training sample set is constructed, where data in the training samples includes data of at least two data types.
In step S120, preprocessing is performed on the training samples in the training sample set to set the data type of a predetermined training sample as a data type that can be recognized by a predetermined algorithm, and a preprocessed training sample set is obtained, where the predetermined training sample includes features of a network session that are extracted from a target file and whose data type before extraction could be recognized by the predetermined algorithm, and the predetermined algorithm is capable of recognizing features of at least two data types.
In step S130, the preprocessed training sample set is used to construct an encrypted traffic detection model by using the predetermined algorithm.
In step S140, the constructed encrypted traffic detection model is used to detect the object to be detected.
The inventors found that existing models can only identify and process numerical data. As a result, when feature data of encrypted traffic is extracted, either only numerical features are extracted, in which case malicious encrypted traffic cannot be completely described and detection is not accurate enough; or non-numerical feature data is extracted and must then be converted into numerical data, which occupies a large amount of memory, reduces detection timeliness, and further limits detection accuracy.
In view of the above, in order to overcome the problem that the existing models can only recognize and process numerical data, and in order to process non-numerical data, the invention adopts an algorithm capable of directly recognizing and processing data of at least two data types, so that malicious encrypted traffic can be described more comprehensively, and waste of memory resources caused by converting non-numerical characteristic data into numerical data can be avoided, thereby effectively improving the accuracy of encrypted traffic detection.
In addition, research finds that session connection features reflect how malicious encrypted traffic behaves at the connection level; Transport Layer Security (TLS)/Secure Sockets Layer (SSL) session features and X509 certificate features reflect how malicious traffic behaves with respect to encryption attributes; and Domain Name System (DNS) features indicate whether the domain name used in a session is problematic, for example whether it may be a domain name produced by a Domain Generation Algorithm (DGA). These features include non-numerical features, which describe the distinctive behavior of malicious encrypted traffic on specific attributes and play an important role in describing it comprehensively. When features of the network session are extracted, feature data of at least two data types are extracted together as training samples, so that relatively complete detection features are constructed and the accuracy of encrypted traffic detection can be further improved.
It should be noted that the invention extracts feature data of at least two data types. Depending on how the system processes the data, the data type of extracted feature data that is not numerical may be changed by the system. To enable the predetermined algorithm to identify this feature data, step S120 sets its data type back to the data type it had before extraction.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
As described above, among the features of the network session, non-numerical features play an important role in fully describing malicious encrypted traffic, and the non-numerical feature data is mainly classified data.
Existing encrypted traffic detection models cannot directly identify and process classification (categorical) features; they can only handle them after one-hot encoding, which makes the classification data sparse. If there are too many categories, the data becomes too sparse after one-hot encoding, which greatly increases the size of the training set and wastes computing resources. To avoid this waste, the invention adopts an algorithm that can directly identify and process both classification features and numerical features. This also allows numerical and classification features to be selected together as training samples, so that malicious encrypted traffic can be described comprehensively and detection accuracy is improved.
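The growth in training-set size described above can be illustrated with a minimal sketch. The cipher-suite values below are made up for illustration and are not taken from the patent; the point is only that one-hot encoding turns one column into as many columns as there are distinct categories, almost all of whose cells are zero.

```python
def one_hot_encode(column):
    """One-hot encode a list of category labels into rows of 0/1 values."""
    categories = sorted(set(column))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical values of one classification feature (illustrative only).
cipher_column = [
    "TLS_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_RC4_128_SHA",
    "TLS_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
]

categories, encoded = one_hot_encode(cipher_column)

cells_categorical = len(cipher_column)            # one column kept as-is
cells_one_hot = len(encoded) * len(categories)    # rows x distinct categories
nonzero_cells = sum(sum(row) for row in encoded)  # exactly one 1 per row

print(cells_categorical, cells_one_hot, nonzero_cells)
```

With thousands of distinct values, which is plausible for certificate-related fields, the encoded width grows accordingly while the number of non-zero cells stays equal to the number of rows.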
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
The LightGBM algorithm and the Catboost algorithm can directly identify and process the classification features, so that the encrypted traffic detection model can be constructed by using the algorithms.
The LightGBM algorithm is a Gradient Boosting Decision Tree (GBDT) algorithm that is widely applied to tasks such as classification, regression and ranking. Its main advantages are: 1. it includes Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which meet the requirements of efficiency and scalability on high-dimensional, large-scale data; 2. it uses a histogram-based algorithm to accelerate training and reduce memory consumption; 3. it uses a leaf-wise tree growth strategy, which improves the generalization performance of the algorithm; 4. it can process classification features directly, avoiding the excessive sparsity and wasted computing resources caused by one-hot encoding.
The Catboost algorithm is a Boosting ensemble learning algorithm that mainly addresses the learning of classification features and can directly process and learn string-typed classification features. Its main advantages are: 1. it supports Graphics Processing Units (GPUs), which makes computation more efficient; 2. it provides visualization of the training process; 3. it supports modeling in multiple languages such as Python and R.
Experiments by the inventors show that the encrypted traffic detection model constructed with the Catboost algorithm differs by about 0.05% from the model constructed with the LightGBM algorithm in accuracy, F1 score (F-measure), recall and Area Under the Curve (AUC).
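For reference, the first three indexes named above can be computed from predicted and true labels as in the following sketch. The labels are invented for illustration (1 for malicious, 0 for normal) and do not reproduce the inventors' experiments; AUC is omitted because it requires predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```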
Based on this comparison, the LightGBM algorithm is selected to construct the encrypted traffic detection model in this embodiment. The LightGBM algorithm can directly identify and process classification features of the "category" type. However, depending on how the system processes the data, an extracted network session feature that was originally of the "category" type may become a character type or an "object" type. To enable the LightGBM algorithm to identify such feature data, the data type of every extracted feature that was originally of the "category" type needs to be set back to "category".
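The "object" and "category" type names suggest a pandas-based pipeline, although the patent does not name the library. Under that assumption, the preprocessing of step S120 might look like the sketch below. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical extracted features; string columns do not load as "category".
features = pd.DataFrame({
    "duration": [0.42, 12.7, 3.1],
    "ssl_version": ["TLSv12", "TLSv13", "TLSv12"],
    "cert_sig_alg": ["sha256WithRSAEncryption", "ecdsa-with-SHA256",
                     "sha256WithRSAEncryption"],
})

# Assumption: the list of classification features is known in advance.
CATEGORICAL_COLUMNS = ["ssl_version", "cert_sig_alg"]


def restore_category_dtype(frame, columns):
    """Return a copy in which the given columns have the 'category' dtype."""
    out = frame.copy()
    for column in columns:
        out[column] = out[column].astype("category")
    return out


prepared = restore_category_dtype(features, CATEGORICAL_COLUMNS)
print(prepared.dtypes)
```

A frame prepared this way can be passed to LightGBM's Python interface, which treats pandas "category" columns as categorical features by default; numerical columns are left untouched.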
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
The inventor of the invention finds that the existing encrypted traffic detection model based on session statistics cannot detect malicious encrypted traffic in real time. In the invention, the characteristic data of the network session can be extracted from the PCAP packet, the real-time network interface or other network flow files, thereby realizing the real-time detection of the encrypted flow.
In the present embodiment, the feature data of the network session is extracted from the static PCAP packet and/or the real-time network traffic, and further, the feature data of the network session required by the present invention may be extracted using the open source software Zeek.
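As a rough illustration of working with Zeek output, the sketch below parses a log in Zeek's default tab-separated ASCII format, where a `#fields` header line names the columns and `-` marks an unset value. The sample log content and the choice of columns are made up, and the patent does not describe how the Zeek output is consumed, so this is only one possible approach.

```python
# A made-up excerpt in the shape of a Zeek ASCII log.
ZEEK_LOG = "\n".join([
    "#separator \\x09",
    "#fields\tuid\tversion\tcipher\testablished\tresumed",
    "#types\tstring\tstring\tstring\tbool\tbool",
    "C1\tTLSv12\tTLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256\tT\tF",
    "C2\tTLSv12\t-\tF\tF",
])


def parse_zeek_log(text):
    """Parse tab-separated Zeek log text into a list of dict records."""
    fields = []
    records = []
    for line in text.splitlines():
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]      # column names follow the tag
        elif line.startswith("#") or not line:
            continue                           # other header lines, blanks
        else:
            values = line.split("\t")
            records.append({name: (None if value == "-" else value)
                            for name, value in zip(fields, values)})
    return records


records = parse_zeek_log(ZEEK_LOG)
print(records[0]["cipher"], records[1]["cipher"])
```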
Optionally, the characteristics of the network session include at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
As mentioned above, session connection features reflect how malicious encrypted traffic behaves at the connection level; TLS/SSL session features and X509 certificate features reflect how malicious traffic behaves with respect to encryption attributes; and DNS features indicate whether the domain name used in the session is problematic, for example whether it may be a DGA domain name. To fully describe malicious encrypted traffic, the features used to construct the encrypted traffic detection model can be selected according to how malicious traffic behaves on these different attributes. HTTP features may also be used, but the inventors believe that HTTP will fall out of use in the future, so they are not included in this embodiment.
As an embodiment of the present invention, the feature of the network session may be selected as follows to construct the encrypted traffic detection feature:
62 features related to building a malicious encrypted traffic detection model are extracted from the network session, covering session connection features, TLS/SSL session features, X509 certificate features and DNS features. The extracted features include numerical features and "category"-type features. Specifically:
session connection characteristics refer to communication session characteristics associated with encrypted traffic communications. In the present embodiment, 5 features such as "session duration" are selected, as shown in table 1.
TABLE 1
(Table 1 is provided as an image in the original publication: Figure BDA0002189473920000091.)
TLS/SSL session characteristics refer to TLS/SSL handshake characteristic data generated in the process of carrying out encryption communication by using TLS/SSL protocol. The present embodiment selects 11 of the features, as shown in table 2.
TABLE 2
(Table 2 is provided as an image in the original publication: Figure BDA0002189473920000092.)
X509 certificate features refer to certificate data transmitted by the server during encrypted communication using the TLS/SSL protocol. The present embodiment selects 33 of these features, as shown in table 3.
TABLE 3
(Table 3 is provided as images in the original publication: Figures BDA0002189473920000093 and BDA0002189473920000101.)
The DNS feature refers to the feature contained in the DNS requested before the session starts, and the DNS feature is selected mainly in consideration of the fact that the DNS domain name used by some malicious encrypted traffic is greatly different from a common normal domain name. 13 of these features were selected in this embodiment as shown in table 4.
TABLE 4
(Table 4 is provided as images in the original publication: Figures BDA0002189473920000102 and BDA0002189473920000111.)
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
The present embodiment targets network sessions whose TLS/SSL session is established for the first time and has completed establishment. Such a session contains important information such as the TLS/SSL handshake and the certificate, whereas a TLS/SSL session resumed from previous session information does not. Therefore, to extract effective detection features, the TLS/SSL session of the network session must include the TLS/SSL handshake and certificate information; that is, the TLS/SSL session must be newly established and must have completed establishment.
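The selection rule above can be sketched as a simple filter. The sketch assumes records shaped like Zeek `ssl.log` entries, which carry boolean `established` and `resumed` fields written as `T` or `F`; the patent does not state which fields are checked, so the field names and sample records here are assumptions.

```python
def has_full_handshake(record):
    """True for a session that completed establishment and was not resumed."""
    return record.get("established") == "T" and record.get("resumed") == "F"


# Made-up session records for illustration.
sessions = [
    {"uid": "C1", "established": "T", "resumed": "F"},  # new, established
    {"uid": "C2", "established": "T", "resumed": "T"},  # resumed session
    {"uid": "C3", "established": "F", "resumed": "F"},  # handshake failed
]

usable = [s["uid"] for s in sessions if has_full_handshake(s)]
print(usable)
```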
Optionally, in order to use the extracted features of the network session for model training to obtain the encrypted traffic detection model, constructing a training sample set further includes: classifying the training samples as "malicious" or "normal" according to the nature of the network session, and constructing a training sample set $\{(x_i, y_i)\}$, where $x_i$ represents the feature data and $y_i$ represents the corresponding label data. In the present embodiment, a label of 1 denotes malicious and 0 denotes normal; other user-defined label values may also be used.
Optionally, as an error-proofing process, in this embodiment, the preprocessing the training samples in the training sample set may include: the feature number of the training sample is checked, and if the training sample does not meet the specified feature number (in the present embodiment, the specified feature number is 62, wherein, the session connection feature is 5, the TLS/SSL session feature is 11, the X509 certificate feature is 33, and the DNS feature is 13), the training sample is discarded as a problem sample.
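A minimal sketch of this error-proofing check, combined with the 1/0 labelling convention of the embodiment, is shown below. The per-group counts come from the text; the sample structure (a dict with `features` and `malicious` keys) is a hypothetical layout chosen for illustration.

```python
# Feature counts per group, as specified in this embodiment.
EXPECTED_COUNTS = {"connection": 5, "tls_ssl": 11, "x509": 33, "dns": 13}
EXPECTED_TOTAL = sum(EXPECTED_COUNTS.values())


def build_training_set(samples):
    """Keep only samples with the expected feature count, paired with labels.

    Each sample is assumed to be a dict with a 'features' list and a
    boolean 'malicious' flag (hypothetical layout).
    """
    training_set = []
    for sample in samples:
        if len(sample["features"]) != EXPECTED_TOTAL:
            continue  # problem sample: discard
        label = 1 if sample["malicious"] else 0
        training_set.append((sample["features"], label))
    return training_set


samples = [
    {"features": [0] * 62, "malicious": True},
    {"features": [0] * 60, "malicious": False},  # two features missing
    {"features": [0] * 62, "malicious": False},
]
training_set = build_training_set(samples)
print(len(training_set), [label for _, label in training_set])
```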
Optionally, fig. 2 is a flowchart for constructing an encrypted traffic detection model by using the predetermined algorithm. As shown in fig. 2, the constructing the encrypted traffic detection model by using the predetermined algorithm includes:
In step S131, the preprocessed training sample set is used to find the optimal hyper-parameter of the predetermined algorithm.
In general, the hyper-parameters have an important influence on the prediction accuracy. The hyper-parameters in the LightGBM algorithm determine the accuracy of the model, the speed of building the model and whether the model is over-fitted, so the number and the variation range of the hyper-parameters need to be determined, and the optimal hyper-parameters of the model are further obtained to build the optimal encrypted traffic detection model. In this embodiment, the parameters that the LightGBM algorithm needs to optimize are shown in table 5.
TABLE 5
Parameter name      Interpretation of parameter
num_leaves          Number of leaves per tree; determines model accuracy
learning_rate       Controls the speed of iteration; determines model accuracy
max_depth           Maximum depth of a tree; determines whether the model overfits
min_data_in_leaf    Minimum number of records a leaf may contain; determines whether the model overfits
feature_fraction    Proportion of features randomly selected in each tree-building iteration; determines model construction speed
bagging_fraction    Proportion of data used in each iteration; typically used to speed up training and avoid overfitting
max_bin             Maximum number of bins into which feature values are bucketed; determines model construction speed
bagging_freq        Frequency of bagging; determines whether the model overfits
n_estimators        Number of boosting iterations; determines model accuracy
Optionally, in this embodiment, all training samples in the training sample set that is preprocessed in step S120 are used to find the optimal hyper-parameter of the encrypted traffic detection model.
Optionally, in this embodiment, any one of a grid search method, a random search method, or a heuristic method is used to find the optimal hyper-parameter of the model; and when the optimal hyper-parameter is searched, an N-fold cross validation method is adopted.
The grid search method is an exhaustive search over the specified parameter values: the candidate values of each parameter are combined, every possible combination is enumerated to form a grid, and each combination is evaluated by cross validation to obtain the optimal hyper-parameters.
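The enumeration step of the grid search can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper name `grid_candidates` and the candidate values are assumptions, since the description does not give search ranges.

```python
from itertools import product


def grid_candidates(grid):
    """Yield every combination of the listed parameter values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Illustrative values only; the description does not specify search ranges.
example_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 8],
}

candidates = list(grid_candidates(example_grid))
print(len(candidates))  # 3 * 2 * 2 = 12 combinations
print(candidates[0])
```

Each dict in `candidates` would then be scored by cross validation, and the best-scoring one kept. The number of combinations is the product of the list lengths, which is why the grid grows quickly as parameters are added.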
The random search method does not exhaust all parameter values, but extracts a fixed number of parameter values according to a specified distribution to find the optimal hyper-parameter.
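A random search draws a fixed number of parameter sets instead of enumerating all of them. The sketch below assumes hypothetical distributions for four of the parameters in table 5; the helper name `random_candidates`, the ranges, and the seed are illustrative and are not taken from the description.

```python
import random


def random_candidates(space, n_iter, seed=0):
    """Draw n_iter parameter sets; each entry of `space` is a sampler taking an RNG."""
    rng = random.Random(seed)
    return [
        {name: sample(rng) for name, sample in space.items()}
        for _ in range(n_iter)
    ]


# Hypothetical distributions, chosen only to illustrate the idea.
example_space = {
    "num_leaves": lambda rng: rng.randint(8, 128),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),  # log-uniform draw
    "feature_fraction": lambda rng: rng.uniform(0.5, 1.0),
    "bagging_fraction": lambda rng: rng.uniform(0.5, 1.0),
}

trials = random_candidates(example_space, n_iter=20)
```

The cost is fixed by `n_iter` regardless of how many parameters are searched, which is the main practical difference from the grid.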
The heuristic method usually uses optimization algorithms such as particle swarm optimization and differential evolution to find the optimal hyper-parameters.
The inventors found that, in theory, the grid search method is the least efficient, the random search method is the next least efficient, and the heuristic method is the most efficient; in terms of implementation, the grid search and random search methods are simpler, and the heuristic method is more complex.
The basic idea of cross validation is to split the original data into groups, with one part used as the training set and the other as the validation set. The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result serves as the performance index for evaluating the classifier. The purpose of cross validation is to obtain a reliable and stable model.
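The N-fold procedure described above can be sketched without any library. The functions `k_fold_indices` and `cross_val_score`, and the toy majority-label learner, are illustrative stand-ins; a real run would plug in the predetermined algorithm and would usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


def cross_val_score(samples, labels, fit, score, n_folds=5):
    """Average validation score of the fit/score pair over n_folds splits."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(samples), n_folds):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score(model,
                            [samples[i] for i in valid_idx],
                            [labels[i] for i in valid_idx]))
    return sum(scores) / len(scores)


# Toy stand-ins for a real learner: always predict the majority training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)


def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)


labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
samples = list(range(len(labels)))
mean_acc = cross_val_score(samples, labels, fit_majority, accuracy, n_folds=5)
```

During the hyper-parameter search, `cross_val_score` would be called once per candidate parameter set, and the candidate with the best mean score would be kept.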
In step S132, the predetermined algorithm is trained on the preprocessed training sample set with the optimal hyper-parameters, so as to obtain the encrypted traffic detection model.
In this embodiment, the optimal hyper-parameter obtained in step S131 and all training samples in the training sample set preprocessed in step S120 are used to train with the LightGBM algorithm, and the detection model is obtained.
Optionally, fig. 3 is a flowchart for detecting an object to be detected by using the constructed encrypted traffic detection model. As shown in fig. 3, the detecting the object to be detected by using the constructed encrypted traffic detection model includes:
In step S141, the features of the object to be detected are extracted.
Optionally, the object to be detected may be a static PCAP data packet file or a dynamic real-time network traffic file.
Optionally, in this embodiment, the extracted features of the object to be detected include 62 session connection features (as shown in table 1), TLS/SSL session features (as shown in table 2), X509 certificate features (as shown in table 3), and DNS features (as shown in table 4) of the network session to be detected.
In step S142, the extracted features of the object to be detected are preprocessed: each feature whose data type before extraction was a data type that can be recognized by the predetermined algorithm is set back to that data type.
Optionally, in this embodiment, the LightGBM algorithm can directly identify and process features of the "category" type. Depending on the data processing system, however, extracted features of the object to be detected that were originally of the "category" type may become a character type or an "object" type. So that the LightGBM algorithm can recognize these features, their data type needs to be set to "category".
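In a pandas-based pipeline this conversion is typically a single `astype("category")` call on the affected columns. The library-free sketch below only illustrates the underlying idea: string values are mapped to category codes using the category list fixed at training time, so that training and detection use the same encoding. The function names, the TLS version feature, and its values are hypothetical.

```python
def fit_categories(values):
    """Record the distinct category values seen in training, in sorted order."""
    return sorted(set(values))


def to_category_codes(values, categories):
    """Map each value to its category index; values unseen in training become -1."""
    index = {cat: code for code, cat in enumerate(categories)}
    return [index.get(v, -1) for v in values]


# Hypothetical categorical feature that arrived as plain strings.
train_tls_version = ["TLSv12", "TLSv13", "TLSv12", "TLSv10"]
categories = fit_categories(train_tls_version)

# Features of an object to be detected must reuse the training-time categories.
test_tls_version = ["TLSv13", "SSLv3", "TLSv12"]
codes = to_category_codes(test_tls_version, categories)
```

Reusing `categories` at detection time matters: re-deriving the category list from the object to be detected could assign different codes to the same value.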
In step S143, the obtained characteristics of the object to be detected are input into the encrypted traffic detection model for identification.
Optionally, in this embodiment, inputting the obtained features of the object to be detected into the encrypted traffic detection model for identification further includes: obtaining, from the encrypted traffic detection model, the anomaly probability value p of the object to be detected, and comparing p with a set threshold epsilon; if p is greater than epsilon, the object to be detected is judged to be malicious traffic, and otherwise it is judged to be normal traffic.
An algorithm with a high false positive rate generates many false alarms, so security analysts cannot obtain effective alerts and the results of the algorithm lose their value. Therefore, the threshold epsilon can be set dynamically: an appropriate threshold is chosen according to the false positive rate produced by the algorithm, which reduces the false positive rate and improves the accuracy of encrypted traffic detection.
Optionally, in this embodiment, the threshold is selected during training such that the false positive rate obtained by N-fold cross validation is one in ten thousand.
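One way to pick such a threshold is sketched below. It assumes that the anomaly probabilities of benign samples from the held-out folds have been collected into a list; the function names and the synthetic scores are illustrative only, and the description does not state how the threshold is actually computed.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold eps such that at most target_fpr of benign scores exceed eps."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ordered = sorted(benign_scores)
    n = len(ordered)
    # Number of benign samples that may be flagged without exceeding the target.
    allowed = min(math.floor(target_fpr * n), n - 1)
    # At most `allowed` scores are strictly greater than this element.
    return ordered[n - allowed - 1]


def false_positive_rate(benign_scores, eps):
    """Fraction of benign samples flagged as malicious (p > eps)."""
    return sum(1 for p in benign_scores if p > eps) / len(benign_scores)


def is_malicious(p, eps):
    """Decision rule from the description: malicious if p is greater than eps."""
    return p > eps


# Synthetic held-out benign scores, for illustration only.
benign_scores = [i / 10000 for i in range(10000)]
eps = threshold_for_fpr(benign_scores, target_fpr=1e-4)
```

With only a few thousand benign validation samples, a target of one in ten thousand permits no flagged benign sample at all, so the returned threshold is simply the largest benign score; a large benign validation set is needed for the target to be meaningful.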
As a second aspect of the present invention, an encrypted traffic detection apparatus is provided, and fig. 4 is a block diagram of the encrypted traffic detection apparatus. As shown in fig. 4, the apparatus includes a feature extraction module 110, a feature data processing module 120, an encrypted traffic detection model building module 130, and an encrypted traffic detection module 140.
The feature extraction module 110 is configured to perform step S110; specifically, the feature extraction module 110 is configured to extract features of network sessions from the target file as training samples and to construct a training sample set, where data in the training samples includes data of at least two data types.
The feature data processing module 120 is configured to perform step S120; specifically, the feature data processing module 120 is configured to preprocess the training samples in the training sample set, so as to set the data type of predetermined training samples to a data type that can be recognized by a predetermined algorithm, and to obtain a preprocessed training sample set, where the predetermined training samples include features of network sessions extracted from the target file whose previous data type was a data type that can be recognized by the predetermined algorithm, and the predetermined algorithm can recognize features of at least two data types.
The encrypted traffic detection model building module 130 is configured to execute step S130; specifically, the encrypted traffic detection model building module 130 is configured to build the encrypted traffic detection model from the preprocessed training sample set using the predetermined algorithm.
The encrypted traffic detection module 140 is configured to execute step S140; specifically, the encrypted traffic detection module 140 is configured to detect the object to be detected by using the constructed encrypted traffic detection model.
Optionally, the data in the training samples comprises numerical data and classification data, the predetermined algorithm being capable of identifying and processing the numerical data and the classification data.
Optionally, the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
Optionally, the target file includes a static packet file and/or a real-time network traffic file.
Optionally, the characteristics of the network session comprise at least one of session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics, and DNS characteristics.
Optionally, a TLS/SSL session of the network session includes TLS/SSL handshake and certificate information.
Optionally, the encrypted traffic detection model building module 130 includes an optimal hyper-parameter selection module 150 and a model training module 160.
An optimal hyper-parameter selection module 150, configured to execute step S131, specifically, the optimal hyper-parameter selection module 150 is configured to find an optimal hyper-parameter of the predetermined algorithm by using the preprocessed training sample set.
The model training module 160 is configured to execute step S132; specifically, the model training module 160 is configured to train the predetermined algorithm on the preprocessed training sample set with the optimal hyper-parameters, so as to obtain the encrypted traffic detection model.
Optionally, the feature extraction module 110 is further configured to execute step S141, that is, to extract features of the object to be detected according to the features of the network session determined during model building.
Correspondingly, the feature data processing module 120 is further configured to execute step S142, that is, to preprocess the extracted features of the object to be detected and to set each extracted feature whose data type before extraction was a data type that can be recognized by the predetermined algorithm back to that data type.
Correspondingly, the encrypted traffic detection module 140 is further configured to perform step S143, that is, to input the preprocessed extracted features of the object to be detected into the encrypted traffic detection model for identification.
The working principle and the beneficial effects of the encrypted traffic detection method have been described in detail above and are not repeated here.
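The division of work among the modules can be pictured with the following sketch. It is a hedged illustration only: the class `EncryptedTrafficDetector`, its field names, and the toy length-based model are assumptions made for the example, with modules 110, 120 and 130 represented as plain callables and module 140 as the `detect` method.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class EncryptedTrafficDetector:
    extract: Callable[[Any], dict]      # module 110: features of one network session
    preprocess: Callable[[dict], dict]  # module 120: restore feature data types
    build_model: Callable[[List[dict], List[int]], Callable[[dict], float]]  # module 130
    threshold: float = 0.5              # epsilon used by module 140
    model: Any = None

    def train(self, sessions, labels):
        rows = [self.preprocess(self.extract(s)) for s in sessions]  # S110, S120
        self.model = self.build_model(rows, labels)                  # S130
        return self

    def detect(self, session):
        row = self.preprocess(self.extract(session))                 # S141, S142
        return self.model(row) > self.threshold                      # S143


# Toy stand-in for module 130: scores 1.0 when a session is longer than average.
def build_length_model(rows, labels):
    cutoff = sum(r["n_bytes"] for r in rows) / len(rows)
    return lambda row: 1.0 if row["n_bytes"] > cutoff else 0.0


detector = EncryptedTrafficDetector(
    extract=lambda session: {"n_bytes": len(session)},
    preprocess=lambda row: row,
    build_model=build_length_model,
).train(["a", "bbbbbbbb"], [0, 1])
```

The point of the layout is that the same `extract` and `preprocess` callables serve both training and detection, mirroring how modules 110 and 120 are reused for steps S141 and S142.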
As a third aspect of the present invention, there is provided a computer-readable storage medium for storing an executable program capable of executing the above-described encrypted traffic detection method of the present invention.
Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage media, or any other medium which can be used to store the desired information and which can be accessed by a computer.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of the present invention described above.
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (14)

1. An encrypted traffic detection method, characterized in that the encrypted traffic detection method comprises:
extracting features of network sessions from a target file to serve as training samples, and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
preprocessing the training samples in the training sample set so as to set the data type of predetermined training samples to a data type that can be identified by a predetermined algorithm, and obtaining the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions that are extracted from the target file and whose data type is one that can be identified by the predetermined algorithm, and the predetermined algorithm can identify features of at least two data types;
constructing an encrypted traffic detection model by using the preprocessed training sample set and adopting the predetermined algorithm;
detecting an object to be detected by using the constructed encrypted traffic detection model;
wherein the detecting the object to be detected by using the constructed encrypted traffic detection model comprises:
extracting features of the object to be detected;
preprocessing the extracted features of the object to be detected, and setting the data type of those extracted features whose data type before extraction was a data type that can be identified by the predetermined algorithm to the data type that can be identified by the predetermined algorithm;
inputting the preprocessed extracted features of the object to be detected into the encrypted traffic detection model for identification, so as to determine whether the object to be detected is malicious traffic;
wherein the characteristics of the network session comprise session connection characteristics, TLS/SSL session characteristics, X509 certificate characteristics and DNS characteristics.
2. The encrypted traffic detection method according to claim 1, wherein the data in the training samples includes numerical data and classification data, and the predetermined algorithm is capable of recognizing and processing the numerical data and the classification data.
3. The encrypted traffic detection method according to claim 2, wherein the predetermined algorithm includes a LightGBM algorithm or a Catboost algorithm.
4. The encrypted traffic detection method according to claim 1, wherein the target file includes a static packet file and/or a real-time network traffic file.
5. The encrypted traffic detection method of claim 1, wherein a TLS/SSL session of the network session contains TLS/SSL handshake and certificate information.
6. The encrypted traffic detection method according to any one of claims 1 to 5, wherein constructing the encrypted traffic detection model includes:
searching for the optimal hyper-parameters of the predetermined algorithm by using the preprocessed training sample set;
and training with the predetermined algorithm on the preprocessed training sample set using the optimal hyper-parameters, to obtain the encrypted traffic detection model.
7. An encrypted traffic detection device, characterized by comprising:
a feature extraction module, used for extracting features of network sessions from a target file to serve as training samples and constructing a training sample set, wherein data in the training samples comprise data of at least two data types;
a feature data processing module, used for preprocessing the training samples in the training sample set so as to set the data type of predetermined training samples to a data type that can be identified by a predetermined algorithm, and obtaining the preprocessed training sample set, wherein the predetermined training samples comprise features of network sessions that are extracted from the target file and whose data type is one that can be identified by the predetermined algorithm, and the predetermined algorithm can identify features of at least two data types;
a model construction module, used for constructing an encrypted traffic detection model by using the preprocessed training sample set and adopting the predetermined algorithm;
an encrypted traffic detection module, used for detecting an object to be detected by using the constructed encrypted traffic detection model;
the feature extraction module is further used for extracting features of the object to be detected;
the feature data processing module is further used for preprocessing the extracted features of the object to be detected, and setting the data type of those extracted features whose data type before extraction was a data type that can be identified by the predetermined algorithm to the data type that can be identified by the predetermined algorithm;
the encrypted traffic detection module is further used for inputting the preprocessed extracted features of the object to be detected into the encrypted traffic detection model for identification, so as to determine whether the object to be detected is malicious traffic;
wherein the characteristics of the network session include a session connection characteristic, a TLS/SSL session characteristic, an X509 certificate characteristic, and a DNS characteristic.
8. The encrypted traffic detection device of claim 7, wherein the data in the training samples includes numerical data and classification data, and the predetermined algorithm is capable of identifying and processing the numerical data and the classification data.
9. The encrypted traffic detection device of claim 8, wherein the predetermined algorithm comprises a LightGBM algorithm or a Catboost algorithm.
10. The encrypted traffic detection device of claim 7, wherein the target file comprises a static packet file and/or a real-time network traffic file.
11. The encrypted traffic detection device of claim 7, wherein a TLS/SSL session of the network session contains TLS/SSL handshake and certificate information.
12. The encrypted traffic detection device according to any one of claims 7 to 11, wherein the model construction module includes:
an optimal hyper-parameter selection module, used for searching for the optimal hyper-parameters of the predetermined algorithm by using the preprocessed training sample set;
and a model training module, used for training with the predetermined algorithm on the preprocessed training sample set using the optimal hyper-parameters, to obtain the encrypted traffic detection model.
13. A computer-readable storage medium for storing an executable program capable of executing the encrypted traffic detection method according to any one of claims 1 to 6.
14. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the encrypted traffic detection method of any one of claims 1 to 6.
CN201910827194.5A 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment Active CN110598774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910827194.5A CN110598774B (en) 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910827194.5A CN110598774B (en) 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110598774A (en) 2019-12-20
CN110598774B (en) 2023-04-07

Family

ID=68857386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910827194.5A Active CN110598774B (en) 2019-09-03 2019-09-03 Encrypted flow detection method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110598774B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111277578B (en) * 2020-01-14 2022-02-22 西安电子科技大学 Encrypted flow analysis feature extraction method, system, storage medium and security device
CN113595967A (en) * 2020-04-30 2021-11-02 深信服科技股份有限公司 Data identification method, equipment, storage medium and device
CN112165487B (en) * 2020-09-27 2022-07-15 上海万向区块链股份公司 Zeek-based distributed network security and performance detection method and system
CN112101485B (en) * 2020-11-12 2021-02-05 北京云真信科技有限公司 Target device identification method, electronic device, and medium
CN112714079B (en) * 2020-12-14 2022-07-12 成都安思科技有限公司 Target service identification method under VPN environment
CN113364792B (en) * 2021-06-11 2022-07-12 奇安信科技集团股份有限公司 Training method of flow detection model, flow detection method, device and equipment
CN113676348B (en) * 2021-08-04 2023-12-29 南京赋乐科技有限公司 Network channel cracking method, device, server and storage medium
CN113765911A (en) * 2021-09-02 2021-12-07 恒安嘉新(北京)科技股份公司 Method, device, equipment and storage medium for detecting webshell encrypted flow
CN116346452B (en) * 2023-03-17 2023-12-01 中国电子产业工程有限公司 Multi-feature fusion malicious encryption traffic identification method and device based on stacking

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790019A (en) * 2016-12-14 2017-05-31 北京天融信网络安全技术有限公司 The encryption method for recognizing flux and device of feature based self study
CN109347872A (en) * 2018-11-29 2019-02-15 电子科技大学 A kind of network inbreak detection method based on fuzziness and integrated study

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106685962B (en) * 2016-12-29 2020-06-23 广东睿江云计算股份有限公司 Defense system and method for reflective DDOS attack flow
CN107294993B (en) * 2017-07-05 2021-02-09 重庆邮电大学 WEB abnormal traffic monitoring method based on ensemble learning
RU2693324C2 (en) * 2017-11-24 2019-07-02 Общество С Ограниченной Ответственностью "Яндекс" Method and a server for converting a categorical factor value into its numerical representation
CN110113349A (en) * 2019-05-15 2019-08-09 北京工业大学 A kind of malice encryption traffic characteristics analysis method
CN110177123B (en) * 2019-06-20 2020-09-18 电子科技大学 Botnet detection method based on DNS mapping association graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于LightGBM改进的GBDT短期负荷预测研究";王华勇等;《自动化仪表》;20180930;全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant