CN110493208B

CN110493208B - Multi-feature DNS (Domain name System) combined HTTPS (Hypertext transfer protocol secure) malicious encrypted traffic identification method

Info

Publication number: CN110493208B
Application number: CN201910734488.3A
Authority: CN
Inventors: 陈虎; 唐开达
Original assignee: Nanjing Juming Network Technology Co ltd
Current assignee: Nanjing Juming Network Technology Co ltd
Priority date: 2019-08-09
Filing date: 2019-08-09
Publication date: 2021-12-14
Anticipated expiration: 2039-08-09
Also published as: CN110493208A

Abstract

The invention relates to a multi-feature method for identifying malicious encrypted traffic by combining DNS and HTTPS, comprising the following steps. Step one: extract all sample DNS communication protocol data in the learning network and analyze it. Step two: extract the handshake parts (unencrypted content) of all malicious/non-malicious HTTPS communications in the learning network and analyze them. Step three: extract session-related feature information of malicious/non-malicious HTTPS sessions. Step four: correlate the related content of the DNS protocol and the HTTPS protocol. Step five: extract the features of normal encrypted communication data by learning from normal encrypted traffic. Step six: classify the data using a regression method. Step seven: store the weight data obtained from training on a persistent medium for subsequent use. Step eight: using the solved weights, perform feature extraction and substitution on the relevant encrypted traffic data in the live network.

Description

Multi-feature DNS (Domain name System) combined HTTPS (Hypertext transfer protocol secure) malicious encrypted traffic identification method

Technical Field

The invention relates to an identification method, in particular to a multi-feature malicious encrypted traffic identification method combining DNS and HTTPS, and belongs to the technical field of software encryption identification.

Background

With the continuous development of encryption technology and the escalation of computer security attack and defense techniques, less and less content is transmitted over networks in plaintext and the proportion of encrypted traffic keeps rising. According to statistics, more than 60% of Internet content is currently transmitted in encrypted form, with HTTPS accounting for the largest share. At the same time, hackers often encrypt the control commands and data they transmit in order to evade detection and removal tools, so malicious network traffic goes undetected and important information is missed.

As mentioned above, hackers generally prefer HTTPS as their encryption protocol because HTTPS passes easily through firewall settings: firewalls generally do not set policies that block network access on ports 80 or 443, so a hacker's control commands and returned data can be transmitted through the network without restriction.

In view of this, detecting malicious communication data within encrypted traffic has become a difficult problem. Moreover, the HTTPS protocol generally uses the Diffie-Hellman algorithm for dynamic key negotiation, so attempting to crack the session key is practically infeasible. A different approach is therefore needed to detect such network traffic, especially malicious traffic, and the approach taken here is based mainly on machine learning.

Disclosure of Invention

To address the problems in the prior art, the invention provides a multi-feature method for identifying malicious encrypted traffic by combining DNS with HTTPS. The technical scheme makes full use of the DNS protocol: before HTTPS communication takes place, a domain name request is generally needed (to conceal the hacker's callback address or to avoid hard-coding it in the code). The related malicious traffic is therefore pre-processed using DNS features, which form part of the dimensions of the feature vector.

The relevant explanation in this scheme is as follows:

DNS: the domain name resolution protocol provides an addressing method under the Internet environment by utilizing the mapping relation between a domain name and an IP address, so that a user can conveniently memorize related websites;

HTTPS: the encrypted hypertext transfer protocol, HTTP over TLS/SSL.

In order to achieve the above object, the technical solution of the invention is as follows: a multi-feature method for identifying malicious encrypted traffic by combining DNS with HTTPS, wherein the method comprises the following steps:

step one: extracting all sample DNS communication protocol data in the learning network, and analyzing the DNS communication protocol:

step two: extracting all malicious/non-malicious HTTPS communication protocol handshake parts (non-encrypted contents) in the learning network, and analyzing the HTTPS communication protocol handshake parts:

step three: extracting session related characteristic information of malicious/non-malicious HTTPS protocol sessions, wherein the information comprises the following main aspects:

step four: correlating the related content of the DNS protocol and the HTTPS protocol by matching the IP address returned by the DNS query with the destination address of the HTTPS connection;

step five: extracting the features of normal encrypted communication data by learning from normal encrypted traffic, where the data can come from well-known HTTPS websites such as Baidu and Sina and is labeled as positive; learning from abnormal encrypted traffic generated by malicious software, extracting the relevant features, and labeling them as negative;

step six: classifying the data by using a regression method, wherein a Lasso regression method is used in the method in consideration of relevant application scenes and calculation speed;

step seven: storing the weight data to a persistent medium through the training result for subsequent use;

step eight: using the solved weights, performing feature extraction and substitution on the relevant encrypted traffic data in the live network; if the result tends toward a positive number, the traffic is considered normal encrypted traffic, otherwise it is considered malicious encrypted traffic, and the absolute value of the result can be given as a confidence level, serving as a reference or measure of the accuracy.
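The correlation in step four can be sketched as a lookup table from resolved IP addresses back to the domain names that produced them. This is a minimal illustration only: the class and method names are hypothetical, and a real capture pipeline would also need record expiry and handling of shared hosting addresses.

```python
from collections import defaultdict


class DnsHttpsCorrelator:
    """Remembers which domain names resolved to which IP addresses (hypothetical helper)."""

    def __init__(self):
        # ip -> set of domain names whose DNS answers contained that ip
        self._names_by_ip = defaultdict(set)

    def observe_dns_answer(self, fqdn, answer_ips):
        """Record one DNS response: the queried name and the addresses it returned."""
        name = fqdn.lower().rstrip(".")
        for ip in answer_ips:
            self._names_by_ip[ip].add(name)

    def names_for_destination(self, dst_ip):
        """Domain names previously resolved to the destination of an HTTPS connection."""
        return sorted(self._names_by_ip.get(dst_ip, ()))


correlator = DnsHttpsCorrelator()
correlator.observe_dns_answer("Example.org.", ["203.0.113.7"])
```

With such a table, the DNS-derived feature dimensions of a domain can be attached to any HTTPS session whose destination address appears in it.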

As an improvement of the invention, in step one, all sample DNS communication protocol data in the learning network are extracted and analyzed, specifically as follows:

the domain name (FQDN) requested via DNS and the actual domain name returned are each checked against the ten million most common domain names, i.e. whether each ranks within the top ten million common domain names; each check takes the value 1 or 0 according to whether the domain name appears, forming dimensions of the feature vector;

acquiring the TTL (Time to Live, i.e. the survival time of the domain name record) value from the domain name query response, generally expressed in seconds (such as 100 seconds, 200 seconds or 300 seconds), forming a dimension of the feature vector; in DNS requests initiated by the collected malicious traffic, the TTL value is generally an unusual one;

acquiring the average number of resolved addresses per domain name query response, which generally differs from that of normal DNS requests, forming a dimension of the feature vector;

the fast-flux ("flash") characteristics of the requested domain name, i.e. the relation between the domain name and the number of addresses it resolves to within a unit time period (generally one hour), forming a dimension of the feature vector;

obtaining the countries of the returned addresses from the domain name query response (looked up using a specific IPGeo library), forming a dimension of the feature vector;

reverse-resolving the obtained addresses to domain names (i.e. counting the number of domain names corresponding to a given IP address), a number that is generally large in malicious encrypted traffic, forming a dimension of the feature vector;

checking the spelling of the queried domain name, including the ratio of letters to digits, the 2-gram spelling ratio (i.e. the frequency of two consecutive letters, derived from the spelling ratios of ordinary domain names; the lower and rarer the ratio, the more likely there is a problem), the 3-gram (three consecutive letters) spelling transition probability, and the 3-gram consonant ratio; each of these spelling checks forms a dimension of the feature vector; the 2-gram and 3-gram mentioned above refer to the number of consecutive characters; the domain names requested by malware are commonly DGA (domain generation algorithm) domain names, so checking the composition of the domain name is very important;
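The spelling checks above can be illustrated with a short sketch. The function names, the add-one smoothing and the treatment of "y" as a consonant are assumptions made for the example; the source does not specify how the ratios are computed.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def ngrams(label, n):
    """All runs of n consecutive characters in a domain label."""
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def digit_share(label):
    """Fraction of alphanumeric characters that are digits."""
    alnum = [c for c in label if c.isalnum()]
    return sum(c.isdigit() for c in alnum) / len(alnum) if alnum else 0.0


def consonant_trigram_share(label):
    """Fraction of 3-grams made up entirely of consonants."""
    grams = ngrams(label.lower(), 3)
    if not grams:
        return 0.0
    hits = sum(all(c.isalpha() and c not in VOWELS for c in g) for g in grams)
    return hits / len(grams)


def train_bigram_counts(benign_labels):
    """Count 2-grams over a list of ordinary (benign) domain labels."""
    counts = Counter()
    for label in benign_labels:
        counts.update(ngrams(label.lower(), 2))
    return counts


def mean_bigram_log_prob(label, counts):
    """Average add-one-smoothed log probability of the label's 2-grams."""
    grams = ngrams(label.lower(), 2)
    if not grams:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen 2-grams
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)
```

A label whose 2-grams rarely occur in ordinary domain names receives a lower average log probability, which is the kind of signal the DGA check relies on.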

As an improvement of the invention, in step two, the handshake parts (unencrypted content) of all malicious/non-malicious HTTPS communications in the learning network are extracted and analyzed, specifically as follows:

checking the version of the communication protocol, i.e. whether the encrypted communication protocol is TLS 1.0, TLS 1.1, TLS 1.2 or TLS 1.3; each version is mapped to a numerical value to form a dimension of the feature vector; sample analysis shows that most malicious encrypted traffic uses a lower TLS version;

acquiring the certificate issuer information from the handshake protocol and looking up the issuer's ranking among known certificate issuers, forming a dimension of the feature vector;

acquiring the cipher suite type from the handshake protocol and mapping its components to different values, i.e. mapping the public key algorithm, the key exchange algorithm, the symmetric encryption algorithm for data communication (such as AES, 3DES, RC2, RC4 and RC5) and the message digest algorithm (such as MD5, SHA-1 and SHA-256) to different values that form different dimensions of the feature vector; according to sample observation, malicious encrypted traffic often uses lower-strength cipher suites for technical reasons, for example largely abandoned symmetric algorithms such as RC4 and RC2, and less often uses higher-strength algorithms such as AES;

and acquiring the address information of the communication server (Server Name) from the handshake protocol and checking its ranking among common domain names, forming a dimension of the feature vector.
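The mapping of handshake fields to numerical values might look like the sketch below. The version codes 0x0301 to 0x0304 are the standard TLS wire values, and the suite names follow the IANA naming pattern; the numerical scores, the weak-cipher list and the function names are illustrative assumptions, not values given in the source.

```python
# Standard TLS version codes mapped to hypothetical scores (higher = more normal).
TLS_VERSION_SCORE = {0x0301: 0.0, 0x0302: 0.25, 0x0303: 0.75, 0x0304: 1.0}
WEAK_CIPHERS = ("RC4", "RC2", "DES", "3DES", "NULL")  # assumed weak-cipher prefixes
DIGEST_SCORE = {"MD5": 0.0, "SHA": 0.5}  # "SHA" in suite names means SHA-1


def parse_cipher_suite(name):
    """Split an IANA-style suite name into key exchange, cipher and digest parts."""
    body = name[4:] if name.startswith("TLS_") else name
    key_exchange, sep, rest = body.partition("_WITH_")
    if not sep:  # TLS 1.3 names carry no key-exchange part
        key_exchange, rest = "", body
    cipher, _, digest = rest.rpartition("_")
    return {"key_exchange": key_exchange, "cipher": cipher, "digest": digest}


def handshake_features(version, suite_name):
    """Return [version score, cipher score, digest score] for one handshake."""
    suite = parse_cipher_suite(suite_name)
    return [
        TLS_VERSION_SCORE.get(version, 0.0),
        0.0 if suite["cipher"].startswith(WEAK_CIPHERS) else 1.0,
        DIGEST_SCORE.get(suite["digest"], 1.0),
    ]
```

Suite names that end in an AEAD mode without a digest token (for example names ending in `_CCM`) would need special handling that this sketch omits.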

As an improvement of the invention, in step three, session-related feature information of malicious/non-malicious HTTPS sessions is extracted, mainly covering the following aspects: the average payload packet length of the protocol session, i.e. the average packet length within the HTTPS protocol, where only the layer-7 payload length is counted and layer 4 and below are ignored (i.e. some handshake, acknowledgement and termination packets are ignored), as one dimension of the feature vector;

the average payload packet length of the client and of the server in the protocol session, computed separately in the same way as above (only the layer-7 payload length is counted; layer 4 and below are ignored, i.e. some handshake, acknowledgement and termination packets are ignored), with the client-side and server-side averages each forming one dimension of the feature vector;

the average packet count of the protocol session, i.e. the average number of packets per malicious encrypted traffic session across all learning samples, as one dimension of the feature vector;

the ratio of outgoing to incoming packets of the protocol session, i.e. the number of outgoing packets divided by the number of incoming packets, as one dimension of the feature vector; the ratio of outgoing to incoming bytes of the protocol session, i.e. the number of outgoing bytes divided by the number of incoming bytes, as one dimension of the feature vector;

the average packet count of the client and of the server in the protocol session, i.e. the average number of packets per malicious encrypted traffic session across all learning samples, computed per side, each as one dimension of the feature vector;

the average entropy of all encrypted packets, where entropy is computed from the distribution of byte values 0-255, as one dimension of the feature vector;

and the average entropy of all client-side and of all server-side encrypted packets, computed separately from the distribution of byte values 0-255, with the two entropy values each forming one dimension of the feature vector.
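The session statistics above can be sketched as follows. The packet representation (a direction string plus the layer-7 payload bytes) and the dictionary keys are assumptions made for the example; in this sketch the packet ratio counts only packets that carry a payload, which is one possible reading of the text.

```python
import math
from collections import Counter


def byte_entropy(payload):
    """Shannon entropy in bits of the byte-value (0-255) distribution."""
    if not payload:
        return 0.0
    n = len(payload)
    return -sum(c / n * math.log2(c / n) for c in Counter(payload).values())


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def session_features(packets):
    """packets: list of (direction, payload); direction is 'out' or 'in'."""
    # Packets without a layer-7 payload (pure layer-4 segments) are ignored.
    data = [(d, p) for d, p in packets if p]
    out = [p for d, p in data if d == "out"]
    inc = [p for d, p in data if d == "in"]
    in_bytes = sum(len(p) for p in inc)
    return {
        "avg_len": _mean([len(p) for _, p in data]),
        "avg_len_out": _mean([len(p) for p in out]),
        "avg_len_in": _mean([len(p) for p in inc]),
        "pkt_ratio": len(out) / len(inc) if inc else 0.0,
        "byte_ratio": sum(len(p) for p in out) / in_bytes if in_bytes else 0.0,
        "avg_entropy": _mean([byte_entropy(p) for _, p in data]),
    }
```

Well-encrypted payloads tend toward the maximum of 8 bits per byte, so a noticeably lower average entropy is one possible anomaly indicator.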

As an improvement of the invention, in step six, a regression method is used to classify the data; considering the relevant application scenarios and computation speed, this patent uses the Lasso regression method, specifically as follows:

where X is the set of sample vectors, one per sample, and the dimension of each vector depends on the number of features extracted, which may be a subset or the full set of those mentioned above; Y (or f(x_k)) is a scalar that takes only two values in this patent, {1, -1}, with positive samples labeled 1 and negative samples labeled -1. What follows is the main part of the regression algorithm, whose most important task is to learn the weight values that minimize the objective function:

objective function (with regularization portion):
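The formula is not reproduced in this text. A sum form consistent with the symbols used in this section (f(x_k), x_k, w^T, N, m, λ) and with the matrix form given afterwards would be the following; the absence of scaling constants such as 1/N is an assumption:

```latex
\min_{w}\; \sum_{k=1}^{N}\bigl(f(x_k)-w^{T}x_k\bigr)^{2} \;+\; \lambda\sum_{j=1}^{m}\lvert w_j\rvert
```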

In the above formula, f(x_k) is the function value, which takes the values 1 and -1; x_k is a sample; w^T is the transpose of the coefficient vector of the linear equation, which is the quantity to be solved for; N is the number of samples and m is the dimension of the vector; λ in the latter part of the formula is the regularization coefficient. However, because the formula contains the absolute-value function, it is not differentiable everywhere, so auxiliary methods are needed to find the extremum;

matrix form of the objective function (convert the above equation to matrix form as follows):

$$\min_{w}\ \lVert w^{T}X-Y\rVert^{2}+\lambda\lVert w\rVert_{1}$$

The difficulty in solving the above expression is that the L1 norm is not differentiable at zero (because of the absolute value), so unlike ordinary regression it has no closed-form solution; instead the FIST (Fast Iterative Shrinkage-Thresholding, generally known as FISTA) method is required, which can solve objective functions of the following form;

FIST method:

$$\min_{x} F(x)=\min_{x}\,\bigl[f(x)+g(x)\bigr]$$

where g(x) is a continuous convex function that may be non-smooth, and f(x) is a smooth function whose derivative must satisfy the Lipschitz continuity condition, which is stronger than ordinary uniform continuity: there exists a constant L (greater than zero) such that the following holds for any two different real numbers x and z in the domain D (this can be extended to other suitable spaces, not necessarily the reals):

The smallest L that satisfies this condition is called the Lipschitz constant; if L < 1, f is called a contraction mapping.

Let ∇f(x) denote the gradient of f(x); the following formula can then be obtained:

In the above formula, <·,·> denotes the inner product; the right-hand side is a Taylor-like expansion of the function f(x) at the point z.
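The two formulas referred to in this passage are not reproduced in the text. In their standard form, which matches the descriptions given (a constant L > 0, points x and z in D, an inner product, and an expansion of f at z), the Lipschitz condition on the gradient and the resulting quadratic upper bound read as follows; this is a reconstruction under that assumption, not the patent's own typesetting:

```latex
\begin{aligned}
\lVert \nabla f(x)-\nabla f(z)\rVert &\le L\,\lVert x-z\rVert \qquad \forall\, x,z\in D \\
f(x) &\le f(z)+\langle \nabla f(z),\, x-z\rangle+\frac{L}{2}\lVert x-z\rVert^{2}
\end{aligned}
```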

Let f(w) denote the smooth squared-error term of the objective function and let g(w) = λ||w||_1; expanding f(w) at w^(t) (where t denotes the t-th iteration of w), the formula above gives:

However, the resulting expression is still not differentiable, so it is transformed into the following form by using the upper-bound function of FIST:

Finally, this form is solved using the soft-thresholding (shrinkage) operator, which yields the weights.
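The iteration described above can be sketched in pure Python. Each step moves along the negative gradient of the squared-error term with step size 1/L and then applies the soft-thresholding operator with threshold λ/L. The sketch makes several assumptions not stated in the source: rows of X are samples (so predictions are Xw), L is taken from the safe bound 2·||X||_F², the momentum sequence is the standard FISTA one, and the function names are hypothetical.

```python
import math


def soft_threshold(v, tau):
    """Shrink v toward zero by tau (closed-form solution of the L1 sub-problem)."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def lasso_fista(X, y, lam, iters=500):
    """Minimise ||Xw - y||^2 + lam * ||w||_1 over w; X is a list of sample rows."""
    m = len(X[0])
    # Upper bound on the Lipschitz constant of the gradient: 2 * ||X||_F^2.
    L = 2.0 * sum(v * v for row in X for v in row) or 1.0
    w = [0.0] * m
    z = w[:]          # extrapolated point used for the gradient step
    t = 1.0           # momentum parameter
    for _ in range(iters):
        resid = [sum(a * b for a, b in zip(row, z)) - yi for row, yi in zip(X, y)]
        grad = [2.0 * sum(r * row[j] for r, row in zip(resid, X)) for j in range(m)]
        w_next = [soft_threshold(zj - gj / L, lam / L) for zj, gj in zip(z, grad)]
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = [wn + (t - 1.0) / t_next * (wn - wo) for wn, wo in zip(w_next, w)]
        w, t = w_next, t_next
    return w


# Tiny check with an identity design, where the minimiser is known in closed form.
weights = lasso_fista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0)
```

For the identity design the minimiser is the soft-thresholded target (2.5 and 0), and the exact zero on the second weight shows how the L1 penalty discards weak features.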

As an improvement of the invention, in step seven, the weight data obtained from training are stored on a persistent medium for subsequent use, and the specific process is as follows: analyzing the DNS sample data; extracting domain name ranking features; extracting domain name TTL features; extracting domain name address resolution features; extracting domain name fast-flux ("flash") features; extracting country distribution features; extracting domain name spelling and pronunciation features; analyzing the encrypted traffic sample data; extracting encryption version features; extracting certificate ranking features; extracting cipher suite features; extracting communication server ranking features; extracting encrypted traffic packet features; and training on the data with the Lasso regression method to form and store the training result.
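Persisting the trained weights could be as simple as the sketch below. JSON is chosen here only for illustration; the source says only "a persistent medium", and the file name, feature names and function names are hypothetical.

```python
import json
import os
import tempfile


def save_weights(path, feature_names, weights):
    """Write feature names and their trained weights to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"features": list(feature_names), "weights": list(weights)}, fh)


def load_weights(path):
    """Read the file back as a {feature name: weight} mapping."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return dict(zip(data["features"], data["weights"]))


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "weights.json")
    save_weights(weights_path, ["domain_rank", "tls_version"], [0.8, 0.5])
    restored = load_weights(weights_path)
```

Storing the feature names alongside the values keeps the weight order unambiguous when the detector loads them at initialization.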

As an improvement of the invention, in step eight, feature extraction and substitution are performed on the relevant encrypted traffic data in the live network using the solved weights; if the result tends toward a positive number, the traffic is considered normal encrypted traffic, otherwise it is considered malicious encrypted traffic, and the absolute value of the result can be given as a confidence level, serving as a reference or measure of the accuracy; the weights formed from the training data are loaded during initialization;

the system enters a network card packet receiving process;

judging whether the received data packet is a DNS protocol, if so, extracting relevant characteristic information and continuing to receive the packet, and if not, turning to the next step;

judging whether the received data packet is an encrypted flow protocol, if so, extracting relevant characteristic information and continuing to receive the packet, and if not, continuing to receive the packet;

and checking whether the malicious encrypted traffic characteristics are met, if so, alarming and continuing to process, and otherwise, continuing to process.
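The final check can be sketched as a dot product between the stored weights and the extracted feature vector, with the sign giving the verdict and the absolute value the confidence. The function name and the example numbers are hypothetical, and treating a score of exactly zero as malicious is an assumption based on the "otherwise" wording.

```python
def classify(weights, features):
    """Return (verdict, confidence) for one correlated DNS + HTTPS feature vector."""
    score = sum(w * x for w, x in zip(weights, features))
    verdict = "normal" if score > 0 else "malicious"
    return verdict, abs(score)


verdict, confidence = classify([0.5, -1.0], [1.0, 1.0])
```

In a running system this function would be called once both the DNS features and the encrypted-session features for a connection have been collected, and a "malicious" verdict would trigger the alert.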

Compared with the prior art, the invention has the following technical effects:

1) for network traffic that is increasingly encrypted, the patent provides a means of inspecting it; for the encrypted case it provides a method for detecting unknown threats, and the combination with DNS-related detection further strengthens these capabilities.

2) The technical scheme makes full use of relevant features of the DNS protocol (before HTTPS communication takes place, a domain name request is generally needed to conceal the hacker's callback address or to avoid hard-coding it in the code) to pre-process the related malicious traffic and form part of the dimensions of the feature vector, and these features are very important;

3) the technical scheme fully utilizes the non-encryption part of the HTTPS protocol, such as a plurality of data packets of HTTPS handshake, and checks the characteristics of the data packets; the important part comprises the encryption algorithm suite type (including contents such as a digest algorithm, a key agreement algorithm, a session key algorithm and the like), a server name, a certificate owner, an issuer and the like, and forms part of dimensionalities of the feature vector;

4) the technical scheme fully learns the features of the encrypted communication data streams of known malware and, by comparing observed traffic against them, finds possible commonalities so as to determine whether a stream is malicious encrypted communication; these features form part of the dimensions of the feature vector;

5) according to the technical scheme, a certain machine learning algorithm is utilized to model the characteristics of the malicious encrypted traffic so as to distinguish normal communication traffic and the malicious encrypted traffic.

Drawings

FIG. 1 is a schematic diagram of the processing flow of step seven;

FIG. 2 is a schematic diagram of the verification processing flow of step eight.

The specific implementation mode is as follows:

for the purpose of enhancing an understanding of the present invention, the present embodiment will be described in detail below with reference to the accompanying drawings.

Example 1: referring to fig. 1, a multi-feature DNS in combination with HTTPS malicious encrypted traffic identification method includes the following steps:

Step one: extracting all sample DNS communication protocol data in the learning network and analyzing it, wherein the domain name (FQDN) requested via DNS and the actual domain name returned are each checked against the ten million most common domain names, i.e. whether each ranks within the top ten million common domain names, and each check takes the value 1 or 0 according to whether the domain name appears, forming dimensions of the feature vector;

Step two: extracting the handshake parts (unencrypted content) of all malicious/non-malicious HTTPS communications in the learning network and analyzing them, specifically as follows:

Step three: extracting session-related feature information of malicious/non-malicious HTTPS sessions, mainly covering the following aspects: the average payload packet length of the protocol session, i.e. the average packet length within the HTTPS protocol, where only the layer-7 payload length is counted and layer 4 and below are ignored (i.e. some handshake, acknowledgement and termination packets are ignored), as one dimension of the feature vector;

Step six: the regression method is used for classifying the data, and in consideration of relevant application scenes and calculation speed, the Lasso regression method is used in the patent, and the method specifically comprises the following steps:

where X is the set of sample vectors, one per sample, and the dimension of each vector depends on the number of features extracted, which may be a subset or the full set of those mentioned above; Y (or f(x_k)) is a scalar that takes only two values in this patent, {1, -1}, with positive samples labeled 1 and negative samples labeled -1. What follows is the main part of the regression algorithm, whose most important task is to learn the weight values that minimize the objective function:

objective function (with regularization portion):

$$\min_{w}\ \lVert w^{T}X-Y\rVert^{2}+\lambda\lVert w\rVert_{1}$$

FIST method:

$$\min_{x} F(x)=\min_{x}\,\bigl[f(x)+g(x)\bigr]$$

Let ∇f(x) denote the gradient of f(x); the following formula can then be obtained:

Let f(w) denote the smooth squared-error term of the objective function and let g(w) = λ||w||_1; expanding f(w) at w^(t) (where t denotes the t-th iteration of w), the formula above gives:

Step seven: the weight data obtained from training are stored on a persistent medium for subsequent use; referring to fig. 1, the specific flow is as follows: analyzing the DNS sample data; extracting domain name ranking features; extracting domain name TTL features; extracting domain name address resolution features; extracting domain name fast-flux ("flash") features; extracting country distribution features; extracting domain name spelling and pronunciation features; analyzing the encrypted traffic sample data; extracting encryption version features; extracting certificate ranking features; extracting cipher suite features; extracting communication server ranking features; extracting encrypted traffic packet features; and training on the data with the Lasso regression method to form and store the training result.

Step eight: using the solved weights, performing feature extraction and substitution on the relevant encrypted traffic data in the live network; if the result tends toward a positive number, the traffic is considered normal encrypted traffic, otherwise it is considered malicious encrypted traffic, and the absolute value of the result can be given as a confidence level, serving as a reference or measure of the accuracy;

loading a weight formed by training data during initialization;

the system enters a network card packet receiving process;

and checking whether the malicious encrypted traffic characteristics are met, if so, alarming and continuing to process, and otherwise, continuing to process, specifically referring to fig. 2.

The following is an example of feature extraction for a variant of the Trojan "Kryptik" (this variant was first discovered in June 2019 and spreads by e-mail; infected users risk having important documents encrypted for ransom). For simplicity, the method labels the value of each dimension with a floating-point number in [0,1], where values closer to 1 are more normal and values closer to 0 are more anomalous:

1) according to DNS request analysis, the Trojan requests the domain name batty.duckdnsadsrf.org, which is not among the ten million common domain names, so this feature is 0;

2) according to DNS request analysis, the spelling of the requested domain name is abnormal because its 3-gram transition probability is too low and it contains a run of consecutive consonants ('dsrf'); both features therefore differ from the spelling of normal domain names and are each marked with the lowest transition probability;

3) according to DNS request analysis, the domain name has only one returned address (192.3.205.98), which is normal, i.e. there is no abnormal behavior in the fast-flux ("flash") feature, so it is marked as 1;

4) according to DNS request analysis, the returned address is an overseas IP (United States), so the domestic/overseas feature is abnormal and is marked as 0;

5) the Trojan communicates with the zombie host over HTTPS using TLS 1.0; because this version is abnormally low, the feature is marked as 0;

6) in the cipher suite, the symmetric encryption algorithm is RC4, which is abnormal, so the feature is marked as 0;

7) the communication certificate is self-signed and is not within the range of common TLS/SSL certificates, so the feature is marked as 0;

8) the HTTPS communication server does not rank within the top ten million common domain names, so the feature is marked as 0;

9) the average data packet byte of HTTPS communication is 264, which is much lower than the average byte number of normal communication session (normally 500-;

10) the extracted features are substituted into the trained regression formula for verification, and the traffic is determined to be malicious encrypted traffic.

It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and all equivalent modifications and substitutions based on the above-mentioned technical solutions are within the scope of the present invention as defined in the claims.

Claims

1. A multi-feature DNS combined with HTTPS malicious encrypted traffic identification method is characterized by comprising the following steps:

step two: extracting all non-encrypted contents of the malicious/non-malicious HTTPS communication protocol handshake part in the learning network, and analyzing the HTTPS communication protocol handshake part:

step three: extracting session-related characteristic information of malicious/non-malicious HTTPS protocol sessions,

step six: classifying the data by using a regression method, and considering relevant application scenes and calculation speed, using a Lasso regression method;

step eight: performing feature extraction and substitution solving on the related encrypted traffic data in the current network by using the weight data obtained in the step seven, if the result is biased to be a positive number, considering the result to be normal encrypted traffic, otherwise, considering the result to be malicious encrypted traffic, and giving an absolute value of the result as a confidence coefficient to be used as a reference or measurement of the related accuracy;

step one: all sample DNS communication protocol data in the learning network are extracted and analyzed, specifically as follows,

checking, for the domain name (FQDN) requested in the DNS query and for the actual domain name returned, whether each appears among the ten million common domain names, i.e., whether its ranking is within the first ten million of the common domain names; each of the two checks takes the value 1 or 0 and forms a dimension in the features;

acquiring the TTL (Time to Live) of the domain name, i.e., its survival time value, from the domain name query response, the unit generally being seconds, and forming a dimension in the features;

acquiring the average number of resolved addresses from a single domain name query response, this number of addresses differing to some extent from that of a normal DNS request, and forming a dimension in the features;

extracting the domain name flash characteristics of the request, i.e., the relation between the requested domain names and the number of addresses within a unit time period, and forming a dimension in the features;

obtaining the countries of the returned addresses from the domain name query response by lookup in a specific IPGeo library, and forming a dimension in the features per unit time;

performing a reverse lookup of domain names for the obtained addresses, i.e., counting the domain names corresponding to a given IP address, and forming a dimension in the features;

checking the spelling pattern of the queried domain name, including the proportions of letters and digits, the 2-gram spelling ratio (two consecutive letters), the 3-gram spelling ratio (three consecutive letters), the letter transition probability, and the proportion of consonants in 3-gram letters; each of these spelling checks forms a dimension in the features.
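The spelling checks above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claim does not define the 2-gram and 3-gram ratios precisely, so the hypothetical helper `window_ratio` interprets them as the share of sliding windows made up entirely of letters (or entirely of consonants), and the letter transition probability is left out because it needs a reference corpus that the text does not specify.

```python
import string

LETTERS = set(string.ascii_lowercase)
VOWELS = set("aeiou")
CONSONANTS = LETTERS - VOWELS


def window_ratio(text, n, allowed):
    """Share of length-n sliding windows whose characters all lie in `allowed`."""
    windows = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not windows:
        return 0.0
    hits = sum(all(ch in allowed for ch in w) for w in windows)
    return hits / len(windows)


def spelling_features(label):
    """Spelling features for one domain label (assumed input, e.g. 'example')."""
    text = label.lower()
    length = len(text) or 1  # avoid division by zero on an empty label
    return {
        "letter_ratio": sum(ch in LETTERS for ch in text) / length,
        "digit_ratio": sum(ch.isdigit() for ch in text) / length,
        # Assumed reading of the 2-gram / 3-gram ratios: all-letter windows.
        "bigram_letter_ratio": window_ratio(text, 2, LETTERS),
        "trigram_letter_ratio": window_ratio(text, 3, LETTERS),
        # Assumed reading of the 3-gram consonant ratio: all-consonant windows.
        "trigram_consonant_ratio": window_ratio(text, 3, CONSONANTS),
    }
```

Each returned value would occupy one dimension of the feature vector; labels with long consonant runs or many digits, as often produced by generated domain names, score differently from ordinary words.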

2. The multi-feature DNS combined HTTPS malicious encrypted traffic identification method according to claim 1, wherein in step two, all non-encrypted contents of the handshake part of the malicious/non-malicious HTTPS communication protocol in the learning network are extracted and the handshake part is analyzed, specifically as follows:

acquiring the cipher suite type from the handshake protocol and mapping its components to different values, i.e., mapping the public key algorithm, the communication key exchange algorithm, the symmetric encryption algorithm for data communication and the data digest algorithm to different values, each forming a different dimension in the features;

and acquiring the Server Name of the communication server from the handshake protocol, and checking the ranking of that Server Name among the common domain names to form one dimension in the feature vector.
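As a rough sketch of the cipher suite mapping described above, the following Python splits a cipher suite name into its four components and assigns each a numeric code. The assumptions are mine: suites are given by their IANA-style names (for example `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`), and the codes are simply assigned in order of first appearance by the hypothetical `CategoryEncoder`; the claim does not specify the actual value mapping.

```python
def split_suite(name):
    """Split a cipher suite name into (key exchange, public key, cipher, digest)."""
    body = name[4:] if name.startswith("TLS_") else name
    if "_WITH_" in body:
        left, right = body.split("_WITH_", 1)
        parts = left.split("_")
        key_exchange = parts[0]
        # A suite such as TLS_RSA_WITH_... uses RSA for both roles.
        public_key = parts[1] if len(parts) > 1 else parts[0]
    else:
        # TLS 1.3 style names carry no key exchange / public key part.
        key_exchange = public_key = "UNSPECIFIED"
        right = body
    cipher, _, digest = right.rpartition("_")
    return key_exchange, public_key, cipher, digest


class CategoryEncoder:
    """Hypothetical helper: maps each distinct string to 1, 2, 3, ..."""

    def __init__(self):
        self.codes = {}

    def encode(self, value):
        return self.codes.setdefault(value, len(self.codes) + 1)


def suite_features(name, encoders):
    """Return four numeric feature dimensions, one per suite component."""
    return [enc.encode(part) for enc, part in zip(encoders, split_suite(name))]
```

The same four encoders would be kept for training and detection so that a given algorithm always maps to the same value.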

3. The multi-feature DNS combined HTTPS malicious encrypted traffic identification method according to claim 2, wherein in step three, session-related feature information of the malicious/non-malicious HTTPS protocol session is extracted as follows: calculating the average payload packet length of the protocol session, i.e., the average length of data packets related to the HTTPS protocol, where only the layer-7 payload length is counted and layer 4 and below are ignored, as one dimension of the feature vector;

respectively calculating the average payload packet length of the client and of the server of the protocol session, i.e., the average length of data packets related to the HTTPS protocol; as in the previous item, only the layer-7 payload length is counted and layer 4 and below are ignored, so that part of the handshake packets, response packets and end packets are ignored; the average payload packet lengths of the client and server communication flows are each used as one dimension of the feature vector;

acquiring the average number of packets of the protocol session, i.e., the average number of data packets of the malicious encrypted traffic sessions in all learning samples, as one dimension of the feature vector;

acquiring the ratio of outgoing to incoming packets of the protocol session, i.e., the number of outgoing packets divided by the number of incoming packets, as one dimension of the feature vector;

acquiring the ratio of outgoing to incoming bytes of the protocol session, i.e., the number of outgoing bytes divided by the number of incoming bytes, as one dimension of the feature vector;

acquiring the average number of packets of the client and of the server of the protocol session, i.e., the average number of data packets of the malicious encrypted traffic sessions in all learning samples, as one dimension of the feature vector;

calculating the average entropy of all encrypted data packets, where the entropy is calculated from the distribution of byte values 0-255, as one dimension of the feature vector; and respectively calculating the average entropy of all client and all server encrypted data packets, where the entropy is calculated from the distribution of byte values 0-255, each of the two entropy values being taken as one dimension of the feature vector.
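The session statistics above can be illustrated with a short Python sketch. It assumes each packet is already reduced to a `(direction, payload)` pair, with direction `"out"` for client to server and `"in"` for server to client; this representation and the function names are illustrative only. Packets with an empty layer-7 payload are dropped, and, as a simplifying assumption, the packet and byte ratios are computed over the remaining packets.

```python
import math
from collections import Counter


def byte_entropy(payload):
    """Shannon entropy, in bits, of the byte-value (0-255) distribution."""
    if not payload:
        return 0.0
    total = len(payload)
    return -sum((n / total) * math.log2(n / total) for n in Counter(payload).values())


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def _ratio(a, b):
    return a / b if b else 0.0


def session_features(packets):
    """packets: iterable of (direction, payload bytes); direction is 'out' or 'in'."""
    # Keep only packets that carry a layer-7 payload.
    data = [(d, p) for d, p in packets if p]
    out = [p for d, p in data if d == "out"]
    inc = [p for d, p in data if d == "in"]
    return {
        "avg_len": _mean([len(p) for _, p in data]),
        "avg_len_out": _mean([len(p) for p in out]),
        "avg_len_in": _mean([len(p) for p in inc]),
        "packet_ratio": _ratio(len(out), len(inc)),
        "byte_ratio": _ratio(sum(map(len, out)), sum(map(len, inc))),
        "entropy": _mean([byte_entropy(p) for _, p in data]),
        "entropy_out": _mean([byte_entropy(p) for p in out]),
        "entropy_in": _mean([byte_entropy(p) for p in inc]),
    }
```

Well-encrypted payloads approach 8 bits of entropy per byte, so an unusually low average entropy is one signal that a session differs from ordinary encrypted traffic.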

4. The multi-feature DNS combined HTTPS malicious encrypted traffic identification method according to claim 3, wherein in step six, the data are classified using a regression method, namely the Lasso regression method, with the following specific steps:

where X is the set of sample vectors, one vector per sample, the dimension of each vector depending on the number of features extracted, which may be a subset or the full set of those mentioned above; Y, or f(x_k), is a scalar that takes only two values in this scheme, namely {1, -1}, with positive samples labeled 1 and negative samples labeled -1; the following concerns the main part of the regression algorithm, whose most important task is to obtain the relevant weight values through learning so as to minimize the objective function:

the objective function, including a regularization term:

in matrix form, the above objective function becomes:

$$\min_{w}\ \|w^{T}X - Y\|^{2} + \lambda\|w\|_{1}$$

the difficulty in solving the above problem is that the 1-norm (the absolute value) is not differentiable at zero, so, unlike an ordinary regression equation, it has no closed-form solution; instead, the Fast Iterative Shrinkage-Thresholding Algorithm, i.e. the FISTA method, is used, which can solve objective functions of the form given below;

FISTA method:

$$\min_{x} F(x) = \min_{x}\ f(x) + g(x)$$

where g(x) is a continuous convex function, which may be non-smooth, and f(x) is a smooth function whose derivative must satisfy Lipschitz continuity, a requirement stronger than ordinary uniform continuity; that is, there exists a constant L greater than zero such that the following holds for any two different real numbers x and z in the domain D (which may be extended to other suitable spaces, not necessarily the real numbers):
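The inequality itself is not reproduced in this text. Assuming the standard definition of Lipschitz continuity applied to the gradient of f, which is what the surrounding wording describes, it would read:

$$\|\nabla f(x) - \nabla f(z)\| \le L\,\|x - z\| \quad \text{for all } x, z \in D$$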

the smallest L satisfying this condition is called the Lipschitz constant; if L < 1, f is called a contraction mapping,

where ∇f(x) denotes the gradient of f(x); the following formula can be obtained:

in the above formula, <·,·> denotes the inner product; the right-hand side of the formula is a Taylor-like expansion of the function f(x) at the point z,

let

Let g(w) = λ||w||, and then expand f(w) at w^(t), where w^(t) denotes w after t iterations; the following formula can be obtained:
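The formulas of this derivation are not reproduced in the text, so the following Python is only a sketch of how the standard FISTA iteration would solve the matrix-form objective, taking f(w) as the sum of squared residuals and g(w) = λ||w||₁. Assumptions of mine: samples are stored as rows of `X`, the step size uses the bound 2·||X||_F², which is a safe over-estimate of the Lipschitz constant of ∇f, and the function names and iteration count are illustrative.

```python
import math


def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def lasso_fista(X, y, lam, iters=500):
    """Minimise sum_k (w.x_k - y_k)^2 + lam * ||w||_1 with FISTA.

    X is a list of sample rows, y the list of labels (1 or -1).
    """
    n_features = len(X[0])
    # 2 * ||X||_F^2 bounds the Lipschitz constant of grad f from above.
    lipschitz = 2.0 * sum(v * v for row in X for v in row) or 1.0
    w = [0.0] * n_features
    z = list(w)  # extrapolated point
    t = 1.0
    for _ in range(iters):
        residuals = [sum(zj * xj for zj, xj in zip(z, row)) - yk
                     for row, yk in zip(X, y)]
        grad = [2.0 * sum(r * row[j] for r, row in zip(residuals, X))
                for j in range(n_features)]
        # Gradient step on f followed by the shrinkage (proximal) step on g.
        w_next = [soft_threshold(zj - gj / lipschitz, lam / lipschitz)
                  for zj, gj in zip(z, grad)]
        # Momentum update that distinguishes FISTA from plain ISTA.
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = [wn + ((t - 1.0) / t_next) * (wn - wo)
             for wn, wo in zip(w_next, w)]
        w, t = w_next, t_next
    return w
```

The shrinkage step sets small weights exactly to zero, which is why the L1 penalty tends to leave only the more informative feature dimensions with non-zero weights.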

5. The multi-feature DNS combined HTTPS malicious encrypted traffic identification method according to claim 3, wherein in step seven, the weight data are stored to a persistent medium from the training result for subsequent use, the specific process comprising: analyzing the DNS sample data, extracting domain name ranking features, domain name TTL features, domain name address resolution features, domain name flash features, country distribution features, and domain name spelling and pronunciation features; analyzing the encrypted traffic sample data, extracting encryption version features, certificate ranking features, cipher suite features, communication server ranking features, and encrypted traffic packet features; and training on the data using the Lasso regression method to form a training result and storing the training result.
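A minimal sketch of the persistence step, assuming the trained weights are written as JSON together with the feature names they belong to. The file format and the function names `save_weights` and `load_weights` are illustrative choices; the claim only requires storage on a persistent medium.

```python
import json


def save_weights(path, feature_names, weights):
    """Write the trained weights, keyed by position, to a JSON file."""
    payload = {"features": list(feature_names), "weights": list(weights)}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_weights(path):
    """Read the file back as a {feature name: weight} mapping."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    return dict(zip(payload["features"], payload["weights"]))
```

Storing the names next to the values guards against a detector that extracts features in a different order from the trainer.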

6. The multi-feature DNS combined HTTPS malicious encrypted traffic identification method according to claim 5, wherein in step eight, feature extraction and substitution solving are performed on the relevant encrypted traffic data in the live network by using the solution result; if the result is positive, the traffic is considered normal encrypted traffic, otherwise it is considered malicious encrypted traffic, and the absolute value of the result can be given as a confidence coefficient serving as a reference or measure of the related accuracy; the weights formed from the training data are loaded during initialization;

the system enters the network card packet receiving process; it judges whether the received data packet is a DNS protocol packet, and if so, extracts the relevant feature information and continues receiving packets, otherwise it proceeds to the next step; it judges whether the received data packet is an encrypted traffic protocol packet, and if so, extracts the relevant feature information and continues receiving packets, otherwise it continues receiving packets; and it checks whether the malicious encrypted traffic characteristics are met, and if so, raises an alarm and continues processing, otherwise it continues processing.
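The detection flow above might be sketched as follows. Everything about the packet representation is an assumption made for illustration: packets are dictionaries with hypothetical `protocol`, `name`, `server_name` and `id` keys, the feature extractors are passed in as callables, and DNS features are joined to a later encrypted session by matching the queried name against the Server Name.

```python
def score_session(weights, features):
    """Return (label, confidence): positive score is normal, otherwise malicious."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    label = "normal" if score > 0 else "malicious"
    return label, abs(score)


def process_packets(packets, weights, extract_dns, extract_tls):
    """Walk a packet stream and collect (packet id, confidence) alerts."""
    dns_by_name = {}  # DNS features remembered per queried domain name
    alerts = []
    for packet in packets:
        if packet["protocol"] == "dns":
            dns_by_name[packet["name"]] = extract_dns(packet)
            continue
        if packet["protocol"] != "tls":
            continue  # neither DNS nor encrypted traffic: keep receiving
        # Correlate the encrypted session with the earlier DNS answer, if any.
        features = dict(dns_by_name.get(packet["server_name"], {}))
        features.update(extract_tls(packet))
        label, confidence = score_session(weights, features)
        if label == "malicious":
            alerts.append((packet["id"], confidence))
    return alerts
```

A usage example with toy extractors and weights (all values invented for the illustration):

```python
weights = {"ttl": 0.01, "entropy": -0.25}
packets = [
    {"id": 1, "protocol": "dns", "name": "a.example", "ttl": 300},
    {"id": 2, "protocol": "tls", "server_name": "a.example", "entropy": 7.9},
    {"id": 3, "protocol": "http"},
    {"id": 4, "protocol": "tls", "server_name": "b.example", "entropy": 2.0},
]
alerts = process_packets(
    packets, weights,
    extract_dns=lambda p: {"ttl": p["ttl"]},
    extract_tls=lambda p: {"entropy": p["entropy"]},
)
```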