CN112270351A

CN112270351A - A Semi-Supervised Encrypted Traffic Identification Method Based on Auxiliary Classification Generative Adversarial Networks

Info

Publication number: CN112270351A
Application number: CN202011150439.4A
Authority: CN
Inventors: 张明明; 冒佳明; 夏飞; 赵俊峰; 夏元轶; 曾锃; 许良杰; 蒲强; 马媛媛; 陈璐
Original assignee: Global Energy Interconnection Research Institute; Anhui Jiyuan Software Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Global Energy Interconnection Research Institute; Anhui Jiyuan Software Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2020-10-24
Filing date: 2020-10-24
Publication date: 2021-01-26

Abstract

The invention discloses a semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network. The method modifies the original auxiliary classification generative adversarial network: the generator receives random noise, hidden variables and data labels, fuses them into a single vector, and generates data that carries the characteristics of real traffic; the discriminator receives both the unlabeled samples and the labeled samples of the real data, and three MLP networks stacked on it respectively judge whether traffic is real or generated, classify the traffic, and extract the hidden variables. The method also modifies the loss function of the original auxiliary classification generative adversarial network so that unlabeled data can be used for semi-supervised learning, which improves identification accuracy, reduces the cost of collecting and labeling network traffic, and improves the level of network management and security monitoring.

Description

Semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network

Technical Field

The invention relates to a semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network, and belongs to the technical field of encrypted traffic identification.

Background

Traffic classification and identification are the basis for improving network management, security monitoring and quality of service, and are also a prerequisite for activities such as network design and planning. As users become more aware of privacy protection and security, technologies such as SSL, SSH and VPN are used more and more widely, so encrypted traffic accounts for a growing proportion of network transmission.

Because of application-layer encryption, traditional port matching and DPI can no longer identify application traffic accurately. Compared with traditional machine learning, deep learning can express the essential characteristics of the data well, but it relies on a large number of labeled samples during training, and the accuracy of those samples directly determines the recognition rate of the trained model. However, collecting and labeling the traffic of encrypted applications is very difficult, so the number of samples needed to train a good model is hard to obtain directly and the cost is high.

Existing deep-learning-based traffic identification methods are mostly supervised and depend on a large amount of labeled data. Labeled data is difficult and costly to obtain, whereas unlabeled data is readily available. Combining a large amount of unlabeled traffic data with a small amount of labeled traffic data to complete the classification task in a semi-supervised manner would therefore greatly reduce the dependence on large labeled data sets, and is very meaningful.

Starting from a small amount of encrypted traffic data from real applications, the method described in the invention can simply and quickly generate encrypted traffic with good application characteristics and, by using the more easily obtained unlabeled real traffic, can achieve a better identification result than a supervised learning method under the same conditions, thereby greatly reducing the cost of traffic collection and labeling.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network.

In order to achieve the above object, the present invention provides a semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network, comprising the steps of:

1) monitoring a network card of the network equipment with a capture program based on libpcap, capturing traffic that contains the traffic to be identified, and storing it as a pcap;

2) labeling the known traffic in the captured pcap, so as to form a labeled pcap and an unlabeled pcap;

3) extracting three logs from the labeled pcap and the unlabeled pcap respectively, using the open-source traffic analysis tool zeek: connection logs, SSL logs and certificate logs;

4) extracting the defined features from the three logs obtained in step 3) to obtain a flow feature matrix containing those features, which forms the real data set for training;

5) constructing the auxiliary classification generative adversarial network structure: the generator receives a composite vector synthesized from random noise, hidden variables and class labels c and generates traffic; the discriminator receives the flow feature matrix obtained in step 4) together with the traffic generated by the generator and is trained to recognize them; the discriminator outputs to three fully-connected neural networks, which respectively recognize real and generated traffic, classify the traffic, and extract the hidden parameters;

6) defining the real data in the real data set as x and dividing it into real unlabeled data and real labeled data; defining P_data-unlabel as the probability distribution of the real unlabeled data, P_data-label as the probability distribution of the real labeled data, and P_g as the probability distribution of the generated data, where the generated data is produced by inputting the composite vector to the generator; the adversarial loss of the discriminator uses the Wasserstein distance instead of the original log-likelihood function and is expressed as the Earth-Mover distance from P_data-unlabel and P_data-label to P_g:

$$L_S = \sup_{\|f\|_L \le 1}\left( \mathbb{E}_{x \sim P_{data\text{-}label}}[f(x)] + \mathbb{E}_{x \sim P_{data\text{-}unlabel}}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \right)$$

wherein sup is the least upper bound, ||f||_L ≤ 1 is the Lipschitz constraint, the first term is the expectation over x drawn from P_data-label, the second term is the expectation over x drawn from P_data-unlabel, the third term is the expectation over x drawn from P_g, and P(x) is the probability distribution of the random variable x;

7) the classification loss is the cross-entropy loss: letting y denote the output of MLP_C with components y_i, y' denote the label corresponding to the input traffic with components y'_i, and m denote the total number of classes, the classification loss is:

$$L_c = -\sum_{i=1}^{m} y'_i \log y_i$$

8) letting the hidden parameter vector input to the generator be h = (h₁, h₂, …, h_n) and the hidden parameter vector output by MLP_H be ĥ = (ĥ₁, ĥ₂, …, ĥ_n), the reduction loss for the hidden parameters is defined as

$$L_h = \sum_{i=1}^{n} \lVert h_i - \hat{h}_i \rVert$$

that is, the reduction loss is the sum of the distances between each component of the hidden parameter vector input to the generator and the corresponding component of the hidden parameter vector output by MLP_H;

9) training with a mini-batch algorithm: sampling a batch of random noise vectors z, a batch of labels c taken from the labeled samples, and a batch of hidden variables h, fusing them into composite vectors, and sending the composite vectors to the generator for training so that it generates traffic containing the characteristics of real traffic; and sending a batch of labeled and unlabeled real traffic to the discriminator for training;

10) alternately training the discriminator and the generator, with the following parameter update rules when training the semi-supervised auxiliary classification generative adversarial network: optimizing the adversarial loss L_S updates the network parameters of the generator, the discriminator and MLP_S; optimizing the classification loss L_c updates the network parameters of the generator, the discriminator and MLP_C; optimizing the reduction loss L_h updates the network parameters of the generator, the discriminator and MLP_H;

11) repeating step 10) until the generator and the discriminator reach a Nash equilibrium through adversarial learning;

and performing traffic identification and classification on the traffic of the live network, based on the data preprocessing of steps 1) to 4), the trained discriminator and MLP_C.
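The composite vector of step 9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the source only says the three inputs are "fused", so plain concatenation, the one-hot label encoding, the sampling distributions and the dimensions (`noise_dim`, `hidden_dim`, `num_classes`) are all hypothetical choices.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list (assumed encoding)."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def make_generator_input(noise_dim, hidden_dim, label, num_classes, rng):
    """Build one composite vector [z | h | c] and return it with h.

    Concatenation is an assumption; the source does not specify the fusion.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]      # random noise
    h = [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]  # hidden variables
    c = one_hot(label, num_classes)                          # class label
    return z + h + c, h


rng = random.Random(0)
labels = [0, 2, 1, 2]  # labels taken from a hypothetical labeled mini-batch
batch = [make_generator_input(16, 2, label, 3, rng) for label in labels]
```

Returning `h` alongside the composite vector keeps the sampled hidden variables available for the reduction loss, which compares them with the output of MLP_H.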

Further, in step 4), the flow features in the pcap are extracted with the open-source tool zeek according to the defined features, forming 28 flow features; the flow feature matrix of these 28 features forms the real data set for training.

Furthermore, the structure of the original auxiliary classification generative adversarial network is modified by introducing hidden variables, so that the output of the generator can be controlled according to particular hidden characteristics.
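The discriminator with its three output networks can be illustrated with a small forward pass. The sketch below is an assumption-laden simplification: each of MLP_S, MLP_C and MLP_H is reduced to a single dense layer, the weights are random and untrained, and the sizes (`n_shared`, `n_classes`, `n_hidden_vars`) are hypothetical; only the 28-feature input and the three head names come from the text.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) of a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Apply a dense layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


class Discriminator:
    """Shared trunk followed by three heads (sizes are hypothetical)."""

    def __init__(self, n_features=28, n_shared=32, n_classes=3,
                 n_hidden_vars=2, seed=0):
        rng = random.Random(seed)
        self.shared = make_layer(n_features, n_shared, rng)
        self.mlp_s = make_layer(n_shared, 1, rng)              # real vs generated
        self.mlp_c = make_layer(n_shared, n_classes, rng)      # traffic class
        self.mlp_h = make_layer(n_shared, n_hidden_vars, rng)  # hidden variables

    def forward(self, x):
        feat = relu(dense(self.shared, x))
        score = dense(self.mlp_s, feat)[0]
        class_probs = softmax(dense(self.mlp_c, feat))
        hidden = dense(self.mlp_h, feat)
        return score, class_probs, hidden


disc = Discriminator()
score, class_probs, hidden = disc.forward([0.5] * 28)
```

The realness head returns an unbounded score rather than a probability, which matches the use of a Wasserstein-style adversarial loss.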

Further, the loss function of the original auxiliary classification generative adversarial network is modified: the Wasserstein distance replaces the original log-likelihood function, and a reduction loss is defined.
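The three losses can be sketched on plain Python lists. This is illustrative only and rests on assumptions: the critic loss is the usual empirical Wasserstein estimate with labeled and unlabeled real scores pooled in `real_scores`, the reduction loss uses the absolute difference as its distance, and how the Lipschitz constraint is enforced is not shown because the source does not state it.

```python
import math


def critic_loss(real_scores, fake_scores):
    """Loss the discriminator minimises: mean generated score minus mean real score.

    real_scores pools the scores of labeled and unlabeled real samples.
    """
    return (sum(fake_scores) / len(fake_scores)
            - sum(real_scores) / len(real_scores))


def cross_entropy(y_pred, y_true, eps=1e-12):
    """Classification loss between MLP_C output y_pred and one-hot label y_true."""
    return -sum(t * math.log(p + eps) for p, t in zip(y_pred, y_true))


def reduction_loss(h_in, h_out):
    """Sum of component-wise distances between generator input h and MLP_H output."""
    return sum(abs(a - b) for a, b in zip(h_in, h_out))


loss_s = critic_loss([1.0, 3.0], [0.0, 1.0])
loss_c = cross_entropy([0.5, 0.5], [1.0, 0.0])
loss_h = reduction_loss([1.0, 2.0], [0.5, 2.5])
```

Only labeled real samples and generated samples carry a class label, so in a semi-supervised setting `cross_entropy` would be evaluated on those, while `critic_loss` can also use the unlabeled real samples.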

Further, in step 3), the logs comprise conn.log, the log of connection records; ssl.log, the log of SSL connection records; and x509.log, the log of certificate records.

conn.log: each line describes a connection between two endpoints; it records information including the IP addresses, ports and protocol of the packet sender and receiver, the connection state, the number of packets, and the label;

ssl.log: each line describes an SSL/TLS handshake and encryption setup, including the SSL/TLS version, the cipher used, the server name, the certificate path, the subject, and the issuer;

x509.log: each line is a certificate record describing certificate information, including the certificate serial number, common name, validity period, subject, signature algorithm, and key length in bits.
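Reading these logs can be sketched as follows, assuming zeek's default tab-separated output, in which metadata lines start with `#` and a `#fields` line names the columns. The helper `parse_zeek_log` and the two sample rows are hypothetical; real logs contain more columns than the reduced subset shown here.

```python
def parse_zeek_log(lines):
    """Parse tab-separated zeek log lines into a list of dicts keyed by field name."""
    fields = []
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # The first token is the "#fields" tag; the rest are column names.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            records.append(dict(zip(fields, line.split("\t"))))
    return records


# Made-up sample with a reduced set of conn.log columns.
sample = [
    "#separator \\x09",
    "#fields\tts\tuid\tid.orig_h\tid.resp_h\tid.resp_p\tconn_state\torig_bytes\tresp_bytes",
    "1603500000.0\tC1\t10.0.0.5\t93.184.216.34\t443\tSF\t517\t4096",
    "1603500005.0\tC2\t10.0.0.5\t93.184.216.34\t443\tS0\t0\t0",
]
records = parse_zeek_log(sample)
```

All values are returned as strings, so numeric columns such as `orig_bytes` would need converting before any feature is computed.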

Further, the flow features include a connection feature, an SSL feature, and a certificate feature.

Further, the connection features include:

(1) number of SSL aggregations and connection records: each connection contains a certain number of SSL aggregations and connection records; this first feature is simply the sum of the number of SSL aggregations and the number of connection records;

(2) mean duration: each connection record contains a duration in seconds; for each incoming connection record, the duration value is stored in a list, from which the mean is finally calculated; letting the list of duration values be t = {t₁, t₂, t₃, …, t_n}, the mean of t is:

$$\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$$

(3) standard deviation of duration: the standard deviation of the duration list is:

$$\sigma_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2}$$

(4) standard deviation range of duration: the percentage of all duration values that fall outside a range whose upper limit is the mean plus the standard deviation and whose lower limit is the mean minus the standard deviation;

(5) payload bytes from originator: the number of payload bytes sent by the originator over all connections, taken from conn.log;

(6) payload bytes from responder: the number of payload bytes sent by the responder over all connections, taken from conn.log;

(7) ratio of responder bytes to all bytes: all bytes are the bytes from the originator plus the bytes from the responder; the ratio of responder bytes to all bytes is:

$$R = \frac{r}{r + o}$$

where r is the number of bytes from the responder and o is the number of bytes from the originator;

(8) ratio of established connection states: each connection record contains a connection state; there are 13 connection states, which are divided into established states and non-established states; the two groups describe whether a TCP handshake was completed or only attempted; the established states are [SF, S1, S2, S3, RSTO, RSTR], which include a successful TCP handshake; the non-established states are [OTH, S0, REJ, SH, SHR, RSTOS0, RSTRH], which include an unsuccessful handshake; the specific meaning of each connection state is given in the zeek documentation; the ratio is computed from e, the number of established states, and n, the number of non-established states;

(9) number of inbound packets: the number of upstream packets contained in the connection log;

(10) number of outbound packets: the number of downstream packets contained in the connection log;

(11) periodicity mean: each connection record has a capture time, so the periodicity of a group of connection records can be measured; the first step is to calculate the time differences between consecutive connection records; the second step is to calculate the second time difference as the absolute value of the difference between consecutive first time differences, a second time difference of zero meaning that the connection records concerned are periodic; the third step is to store the second time differences in a list and calculate the mean of that list; letting D = {d₁, d₂, …, d_n} be the list of second time differences (for example d₁ = 0, d₂ = 0 and d₃ = 15), the mean is:

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$$

(12) periodicity standard deviation: using the list D of second time differences obtained in the previous step, the standard deviation is:

$$\sigma_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}$$
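The duration and byte-ratio features above can be sketched as follows. This is a rough illustration under assumptions: the population standard deviation is used (the text does not say whether population or sample), the out-of-range share is returned as a fraction rather than a percentage, and the zero-byte guard is an added convention.

```python
import statistics


def duration_features(durations):
    """Return (mean, standard deviation, fraction of values outside mean +/- std)."""
    mean = statistics.mean(durations)
    std = statistics.pstdev(durations)  # population form: an assumption
    lower, upper = mean - std, mean + std
    outside = sum(1 for d in durations if d < lower or d > upper)
    return mean, std, outside / len(durations)


def responder_byte_ratio(orig_bytes, resp_bytes):
    """Share of responder bytes in all bytes; 0.0 when no bytes were sent."""
    total = orig_bytes + resp_bytes
    return resp_bytes / total if total else 0.0


# Made-up durations in seconds for four connection records.
mean_t, std_t, out_of_range = duration_features([1.0, 2.0, 3.0, 10.0])
ratio = responder_byte_ratio(100, 300)
```

With these sample values only the 10-second record lies outside the range, so one quarter of the durations are out of range.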

Further, the SSL features include:

(13) ratio of connection records and SSL aggregations: describes the ratio between non-SSL connection records and SSL connection records; the ratio R is computed from f_n, the number of connection records without SSL, and f_s, the number of connection records using SSL;

(14) ratio of TLS and SSL versions: every SSL connection record has a TLS or SSL protocol version used for encryption, the possible versions being SSL 1.0, SSL 2.0, SSL 3.0, TLS 1.0, TLS 1.1, TLS 1.2 and TLS 1.3; this feature describes how many SSL connection records use a TLS protocol; the ratio R is computed from TLS, the number of SSL connection records with a TLS protocol, and SSL, the number of SSL connection records with an SSL protocol;

(15) SNI ratio: SNI is the server name in the SSL connection record; this feature describes how many SSL connection records contain an SNI, since the SSL connection records of malware have an empty SNI more often than those of normal software; the SNI ratio is:

$$R = \frac{F_s}{F_a}$$

wherein F_s is the number of SSL connection records with an SNI and F_a is the number of all SSL connection records;

(16) SNI-as-IP flag: sometimes an SSL connection record has an IP address as its SNI; in this case the SNI should be the same as the destination IP address; if any SSL connection record in the connection log has an IP address as its SNI and the SNI differs from DstIP, this feature is -1; if the SSL connection records that have an IP address as their SNI all have an SNI equal to DstIP, it is 0; if no SSL connection record has an IP address as its SNI, it is 1;

(17) mean of certificate path records: let C = {C₁, C₂, …, C_n} be the list of the number of certificates in the certificate path of each SSL connection record, where C₁ is the number of certificates in the certificate path of the first SSL connection record, C₂ that of the second, and C_n that of the nth; the mean is calculated from the list:

$$\bar{C} = \frac{1}{n}\sum_{i=1}^{n} C_i$$

(18) self-signed certificate ratio: zeek can identify whether an end-user certificate is self-signed and stores this information in the SSL connection record; the feature is the ratio of the number of self-signed certificates to the number of all end-user certificates in the log.
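Features (15) and (16) can be sketched as below. This is an illustration under assumptions: records are dicts keyed by the zeek field names `server_name` and `id.resp_h`, an unset SNI appears as `-` or an empty value, and addresses are compared as plain strings; the helper names are hypothetical.

```python
import ipaddress


def is_ip(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def sni_ratio(ssl_records):
    """Fraction of SSL connection records that carry a non-empty SNI."""
    if not ssl_records:
        return 0.0
    with_sni = sum(1 for r in ssl_records
                   if r.get("server_name") not in (None, "", "-"))
    return with_sni / len(ssl_records)


def sni_is_ip_flag(ssl_records):
    """1: no SNI is an IP; 0: every IP SNI equals DstIP; -1: some IP SNI differs."""
    ip_snis = [r for r in ssl_records if is_ip(r.get("server_name") or "")]
    if not ip_snis:
        return 1
    if any(r["server_name"] != r["id.resp_h"] for r in ip_snis):
        return -1
    return 0


# Made-up SSL connection records.
ssl_records = [
    {"server_name": "example.com", "id.resp_h": "93.184.216.34"},
    {"server_name": "-", "id.resp_h": "93.184.216.34"},
]
ratio_sni = sni_ratio(ssl_records)
flag = sni_is_ip_flag(ssl_records)
```

Comparing address strings directly would treat differently formatted but equal IPv6 addresses as different; parsing both sides with `ipaddress.ip_address` before comparing would avoid that.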

Further, the certificate features include:

(19) public key mean: each certificate record contains the public key of the certificate, and the public key length in each certificate record is added to a list; letting J = {j₁, j₂, …, j_n} be that list, where j₁ is the value for the first certificate record, j₂ for the second, and j_n for the nth, the mean is calculated from the list:

$$\bar{j} = \frac{1}{n}\sum_{i=1}^{n} j_i$$

(20) mean certificate validity period: each certificate has a validity period whose boundary dates are stored in the certificate record as unix times; the validity period of each certificate, in seconds, is stored in a list; letting G = {g₁, g₂, …, g_n} be that list, where g₁ is the validity period of the certificate in the first certificate record, g₂ in the second, and g_n in the nth, the mean is calculated from the list:

$$\bar{g} = \frac{1}{n}\sum_{i=1}^{n} g_i$$

(21) standard deviation of certificate validity period: using the same list G of certificate validity periods in seconds, the standard deviation is:

$$\sigma_G = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(g_i - \bar{g}\right)^2}$$

(22) certificate validity during capture: the capture time and the validity period of a certificate determine whether the certificate was valid when it was captured; a capture time within the validity period is normal; this feature is the number of certificates that were outside their validity period when the traffic was captured, since malware often uses invalid certificates instead of ordinary ones;

(23) mean certificate age: this feature is based on the ratio of two lengths of time; the first length is the validity period of the certificate, K = {k₁, k₂, …, k_n}, where k₁ is the validity period of the first certificate record, k₂ of the second, and k_n of the nth; the second length is the period from the start of the validity period to the capture, P = {p₁, p₂, …, p_n}, where p₁ is that period for the first certificate record, p₂ for the second, and p_n for the nth; the ratio therefore expresses how old the certificate is; for each certificate the ratio of these periods is calculated and stored in a list, and the mean Z is then calculated from the list:

$$Z = \frac{1}{n}\sum_{i=1}^{n} \frac{p_i}{k_i}$$

(24) number of certificates: a connection log usually contains one certificate, but sometimes more; this feature is simply the number of certificates for one connection;

(25) mean number of domains in the certificate SAN DNS: SAN is the subject alternative name, which describes which domains belong to the certificate; for each new incoming certificate the number of DNS entries in the SAN is stored in a list, and the mean is then calculated from the list;

(26) ratio of certificate records to SSL connection records: describes the number of SSL connection records that have a certificate path, since a certificate record is added to the SSL aggregation when it appears as the first certificate in the certificate path; the feature is the ratio of the number of certificate records to the number of SSL connection records of one connection log;

(27) whether the SNI is in the SAN DNS: the SNI is the server name indication contained in the SSL connection record, and the SAN DNS entries are the domains in the certificate record that belong to the certificate; normally the SNI is one of the SAN DNS entries; if, in any SSL aggregation, the certificate record does not contain the SNI of the SSL connection record, the value of this feature is 0; if, for every SSL aggregation in the connection log, the certificate record contains the SNI in its SAN DNS entries, the value of this feature is 1;

(28) whether the CN is in the SAN DNS: CN is the common name, which is normally one of the SAN DNS entries; if none of the certificates contains the CN in its SAN DNS, the value of this feature is 0; if all certificates contain the CN in their SAN DNS, the value of this feature is 1.
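The time-based certificate features (20), (22) and (23) can be sketched as follows. The dictionary keys `not_before` and `not_after` are hypothetical names for the unix-time validity bounds, every validity period is assumed to be positive, and the age ratio is taken as time since the start of validity divided by the validity period, which is one reading of the text.

```python
def certificate_time_features(certs, capture_time):
    """Return (mean validity in seconds, certificates invalid at capture, mean age ratio)."""
    validity = [c["not_after"] - c["not_before"] for c in certs]
    mean_validity = sum(validity) / len(validity)
    # A certificate is invalid at capture when the capture time is outside its bounds.
    invalid = sum(1 for c in certs
                  if not (c["not_before"] <= capture_time <= c["not_after"]))
    # Age ratio: elapsed time since start of validity over the whole validity period.
    age_ratios = [(capture_time - c["not_before"]) / v
                  for c, v in zip(certs, validity)]
    return mean_validity, invalid, sum(age_ratios) / len(age_ratios)


# Made-up certificates with tiny unix times to keep the arithmetic readable.
certs = [
    {"not_before": 0, "not_after": 100},
    {"not_before": 50, "not_after": 60},
]
mean_validity, invalid_count, mean_age = certificate_time_features(certs, capture_time=80)
```

An age ratio above 1, as for the second sample certificate, means the certificate had already expired when the traffic was captured.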

The invention achieves the following beneficial effects:

The method of the invention modifies the loss function of the original auxiliary classification generative adversarial network so that unlabeled data can be used for semi-supervised learning, which improves identification accuracy, reduces the cost of collecting and labeling network traffic, and improves the level of network management and security monitoring.

Drawings

Fig. 1 is a flow diagram of the auxiliary classification generative adversarial network structure.

Detailed Description

The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

It should be noted that, if there is a directional indication (such as up, down, left, right, front, and back) in the embodiment of the present invention, it is only used to explain the relative position relationship between the components, the motion situation, and the like in a certain posture, and if the certain posture is changed, the directional indication is changed accordingly.

In addition, if the description of "first", "second", etc. is referred to in the present invention, it is used for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

1. Connection characteristics:

(1) SSL aggregation and connection record number: each connection record contains a certain number of SSL aggregations and connection records. The first feature is simply the sum of the number of SSL aggregations and the number of connection records.

(2) Mean duration: each connection record contains a duration in seconds. For each incoming connection record, the duration value is stored in a list, from which the mean is finally calculated. Let the list of duration values be t = {t₁, t₂, t₃, …, t_n}; the mean of t is:

$$\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$$

(3) Standard deviation of duration: the standard deviation of the duration list.

(4) Standard deviation range of duration: the percentage of all duration values that fall outside the range. The range has two limits: the upper limit is the mean of t plus the standard deviation, and the lower limit is the mean of t minus the standard deviation.

(5) Payload bytes from originator: the number of payload bytes sent by the originator over all connections, taken from conn.log.

(6) Payload bytes from responder: the number of payload bytes sent by the responder over all connections, taken from conn.log.

(7) Ratio of responder bytes to all bytes: all bytes are the bytes from the originator plus the bytes from the responder. The ratio is calculated as:

$$R = \frac{r}{r + o}$$

where r is the number of bytes from the responder and o is the number of bytes from the originator.

(8) Ratio of established connection states: each connection record contains a connection state. There are 13 such states, which are divided into established states and non-established states. These two groups describe whether a TCP handshake was completed or only attempted. The established states are [SF, S1, S2, S3, RSTO, RSTR], which include a successful TCP handshake; the non-established states are [OTH, S0, REJ, SH, SHR, RSTOS0, RSTRH], which include an unsuccessful handshake. The specific meaning of each state is given in the zeek documentation. The ratio is computed from e, the number of established states, and n, the number of non-established states.

(9) Number of inbound packets: the number of upstream packets contained in the connection log.

(10) Number of outbound packets: the number of downstream packets contained in the connection log.

(11) Periodicity mean: each connection record has a capture time, so the periodicity of a group of connection records can be measured. Consider five fictitious connection records. The first step is to calculate the time differences between consecutive connection records. The next step is to calculate the second time difference as the absolute value of the difference between consecutive first time differences. If this value is zero, the connection records concerned are periodic. Finally, the second time differences are stored in a list D = {d₁, d₂, …, d_n}; in the example, d₁ = 0, d₂ = 0 and d₃ = 15. The mean is calculated from the list:

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$$

(12) Periodicity standard deviation: using the list D of second time differences obtained in the previous step, the standard deviation is:

$$\sigma_D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}$$
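The periodicity computation can be sketched as below. The capture times are made up so that the second time differences come out as 0, 0 and 15, the values given in the text; the population standard deviation and the handling of groups with fewer than three records are assumptions.

```python
import statistics


def periodicity_features(capture_times):
    """Return (mean, standard deviation) of the second time differences."""
    times = sorted(capture_times)
    # First time differences between consecutive connection records.
    first = [b - a for a, b in zip(times, times[1:])]
    # Second time differences: absolute change between consecutive first differences.
    second = [abs(b - a) for a, b in zip(first, first[1:])]
    if not second:
        return 0.0, 0.0  # fewer than three records: assumed convention
    return statistics.mean(second), statistics.pstdev(second)


# Five fictitious capture times (seconds): first differences 5, 5, 5, 20.
mean_d, std_d = periodicity_features([0.0, 5.0, 10.0, 15.0, 35.0])
```

A perfectly periodic group yields a mean and a standard deviation of zero; here the late fifth record raises the mean to 5.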

2. SSL characteristics:

(13) Ratio of connection records and SSL aggregations: this feature describes the ratio between non-SSL connection records and SSL connection records. The ratio R is computed from f_n, the number of connection records without SSL, and f_s, the number of connection records with SSL.

(14) Ratio of TLS and SSL versions: every SSL connection record has either a TLS or an SSL protocol version used for encryption. The possible versions are SSL 1.0, SSL 2.0, SSL 3.0, TLS 1.0, TLS 1.1, TLS 1.2 and TLS 1.3; the SSL protocols are older than TLS, and almost all ordinary traffic uses TLS. This feature describes how many SSL connection records use a TLS protocol. The ratio R is computed from TLS, the number of SSL connection records with a TLS protocol, and SSL, the number of SSL connection records with an SSL protocol.

(15) SNI ratio: SNI is the server name in the SSL connection record. This feature describes how many SSL connection records contain an SNI; the SSL connection records of malware have an empty SNI more often than normal SSL connection records. The ratio R is calculated as:

$$R = \frac{F_s}{F_a}$$

where F_s is the number of SSL connection records with an SNI and F_a is the number of all SSL connection records.

(16) SNI-as-IP flag: sometimes an SSL connection record has an IP address as its SNI. In this case the SNI should be the same as the destination IP address. This feature is -1 if any SSL connection record in the connection log has an IP address as its SNI and the SNI differs from DstIP; it is 0 if the SSL connection records that have an IP address as their SNI all have an SNI equal to DstIP; it is 1 if no SSL connection record has an IP address as its SNI.

(17) Mean of certificate path records: let C = {C₁, C₂, …, C_n} be the list of the number of certificates in the certificate path of each SSL connection record, where C₁ is the number of certificates in the certificate path of the first SSL connection record, C₂ that of the second, and C_n that of the nth. The mean is calculated from the list:

$$\bar{C} = \frac{1}{n}\sum_{i=1}^{n} C_i$$

(18) Self-signed certificate ratio: zeek is able to identify whether an end-user certificate is self-signed, and this information is in the SSL connection record. This feature is the ratio of self-signed certificates to all end-user certificates in the log. The ratio R is:

$$R = \frac{s}{c}$$

where s is the number of self-signed certificates and c is the number of all certificates.

3. Certificate features:

(19) Public key mean: each certificate record contains the public key of the certificate, and the public key length in each certificate record is added to a list. Let J = {j₁, j₂, …, j_n} be that list, where j₁ is the value for the first certificate record, j₂ for the second, and j_n for the nth. The mean is calculated from the list:

$$\bar{j} = \frac{1}{n}\sum_{i=1}^{n} j_i$$

(20) Mean certificate validity period: each certificate has a validity period; for example, a certificate valid from 1/2010 to 1/2020 has a validity period of 10 years. These boundary dates are stored in the certificate record as unix times. The validity period of each certificate, in seconds, is stored in a list, from which the mean is then calculated.

(21) Standard deviation of certificate validity period: the standard deviation is calculated from the same list of certificate validity periods, in seconds, as in the previous feature.

(22) Certificate validity during capture: from the capture time and the validity period of a certificate, it can be determined whether the certificate was valid when it was captured. A capture time within the validity period is normal. This feature is the number of certificates that were outside their validity period when the traffic was captured. Malware often uses invalid certificates instead of ordinary ones.

(23) Mean value of certificate validity start time: this feature is based on the ratio of two time lengths. The first length is the certificate validity period, and the second is the time from the start of the certificate validity period to the capture; the ratio therefore indicates how old the certificate is. The ratio is calculated for each certificate and stored in a list, from which the average is then calculated.

(24) Number of certificates: a connection log usually contains one certificate, but sometimes more. This feature is simply the number of certificates for one connection.

(25) Average number of domains in certificate SAN DNS: the SAN (Subject Alternative Name) lists the domains that belong to the certificate. For each new incoming certificate, the number of DNS entries in its SAN is stored in a list, from which the average is then calculated. An example of part of the SAN DNS of a Google certificate: ["google.com", "google.co.com", "google-analytical.com", "google.ca", "google.cl", "google.co.in", "google.co.jp", "google.co.uk", "google.de"]

(26) Ratio of certificate records to SSL connection records: this feature reflects the number of SSL connection records that have a certificate path, because a certificate record is added to the SSL aggregation if it appears as the first certificate in the certificate path. The ratio R is:

$$R = \frac{c}{s}$$

where c is the number of certificate records and s is the number of SSL connection records for one connection log.

(27) Whether the SNI is in the SAN DNS: the SNI is the server name indication contained in the SSL connection record, and the SAN DNS is the set of domains in the certificate record that belong to the certificate. Typically, the SNI is part of the SAN DNS. If, in any SSL aggregation, the certificate record does not contain the SNI of the SSL connection record in its SAN DNS, the value of this feature is 0; if the SAN DNS contains the SNI for every SSL aggregation in the connection log, the value of this feature is 1.

(28) Whether the CN is in the SAN DNS: the CN (common name) is part of the certificate record and should be part of the SAN DNS. If no certificate contains its CN in the SAN DNS, the value of this feature is 0; if all certificates contain their CN in the SAN DNS, the value is 1.
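Several of the certificate features above reduce to simple list statistics, which the following Python sketch illustrates. It rests on assumptions that are not in the text: the record keys (`cn`, `san_dns`, `not_before`, `not_after`, `sni`, `cert`) are hypothetical, the standard deviation is taken as the population form, the ratio in feature (23) is taken as age divided by validity period, names are matched by exact string comparison, and features (27) and (28) return 0 whenever at least one record fails the check.

```python
from statistics import mean, pstdev

# Hypothetical, simplified records. Times are Unix timestamps in seconds.
certs = [
    {"cn": "example.com", "san_dns": ["example.com", "www.example.com"],
     "not_before": 1_000, "not_after": 2_000},
    {"cn": "example.org", "san_dns": ["example.org"],
     "not_before": 1_500, "not_after": 4_500},
]
# Each SSL connection record refers to its first certificate, or None.
ssl_records = [
    {"sni": "www.example.com", "cert": certs[0]},
    {"sni": "example.org", "cert": certs[1]},
    {"sni": "example.net", "cert": None},
]
capture_time = 2_500


def validity_periods(records):
    """List of validity periods in seconds, shared by features (20) and (21)."""
    return [c["not_after"] - c["not_before"] for c in records]


def invalid_at_capture(records, when):
    """Feature (22): number of certificates not valid at capture time."""
    return sum(1 for c in records
               if not (c["not_before"] <= when <= c["not_after"]))


def age_ratio_mean(records, when):
    """Feature (23), assuming ratio = time since validity start / validity period."""
    ratios = [(when - c["not_before"]) / (c["not_after"] - c["not_before"])
              for c in records]
    return mean(ratios)


def san_dns_mean(records):
    """Feature (25): average number of DNS entries in the SAN."""
    return mean([len(c["san_dns"]) for c in records])


def cert_to_ssl_ratio(records, ssl):
    """Feature (26): R = c / s."""
    return len(records) / len(ssl)


def sni_in_san(ssl):
    """Feature (27): 1 if every SNI appears in its certificate's SAN DNS, else 0."""
    paired = [r for r in ssl if r["cert"] is not None]
    return int(all(r["sni"] in r["cert"]["san_dns"] for r in paired))


def cn_in_san(records):
    """Feature (28): 1 if every CN appears in its own SAN DNS, else 0."""
    return int(all(c["cn"] in c["san_dns"] for c in records))


validity_mean = mean(validity_periods(certs))   # feature (20)
validity_std = pstdev(validity_periods(certs))  # feature (21), population form
```

Real certificates often carry wildcard SAN entries, which exact string matching would not cover; handling them is left out of this sketch.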

Model training:

The original AC-GAN objective function consists of two parts: the loss L_S for discriminating real samples from generated samples, and the loss L_C for classifying the sample class. The original AC-GAN objective function uses only labeled data and makes no use of unlabeled data. In actual production, labeled data are often difficult to obtain while unlabeled samples are easy to obtain, so the loss function of the AC-GAN discriminator is modified to enable semi-supervised learning. At the same time, the Wasserstein distance replaces the original log-likelihood function and represents the EM (Earth-Mover) distance from the real data sets P_data-unlabel and P_data-label to the generated data set P_g.

The loss of the discriminator is therefore:

For the classification loss, cross-entropy loss is adopted. Let y denote the output of MLP_C, with y_i the value of each component of y; let y' denote the label corresponding to the input flow, with y'_i the value of each of its components; and let m be the total number of classes. The classification loss is then:

$$L_C = -\sum_{i=1}^{m} y'_i \log y_i$$
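As a small worked illustration of the classification loss, the sketch below evaluates the standard categorical cross-entropy for one flow. It assumes that y is a probability vector over the m classes and y' is a one-hot label; the function name and the clipping constant `eps` are choices made for this example, not part of the method.

```python
import math


def cross_entropy(y, y_label, eps=1e-12):
    """Categorical cross-entropy: -sum over i of y'_i * log(y_i)."""
    if len(y) != len(y_label):
        raise ValueError("prediction and label must have the same length m")
    # Clip predictions away from zero so that log() stays finite.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y, y_label))


y = [0.7, 0.2, 0.1]        # predicted class probabilities, m = 3
y_label = [1.0, 0.0, 0.0]  # one-hot label: the flow belongs to class 0
loss = cross_entropy(y, y_label)
```

With a one-hot label only the component of the true class contributes, so the loss here equals -log(0.7).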

At the same time, a hidden variable h is introduced. When the hidden variable and the random noise are independent of each other, the flow G(z') from the generator is passed through an encoder M to obtain an output z' = M(G(z')), from which each component of h is separated. If the hidden parameter vector input to the generator is h = (h₁, h₂, ..., h_n), the output of MLP_H is

The restoration loss of the hidden parameter is defined as:

The following parameter update rules are used when training the entire model:

when optimizing the loss L_S, the network parameters of G, D and MLP_S are updated;

when optimizing the loss L_C, the network parameters of G, D and MLP_C are updated;

when optimizing the loss L_h, the network parameters of G, D and MLP_H are updated.
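The update rules above can be read as a mapping from each loss to the parameter groups it is allowed to change. The following sketch shows only that bookkeeping, under loud assumptions: parameters are plain lists of floats, the gradients are placeholders supplied by the caller (with whatever sign the adversarial objective requires), and a plain gradient step stands in for whatever optimizer is actually used.

```python
# Which parameter groups each loss updates, as listed in the text.
UPDATE_RULES = {
    "L_S": ("G", "D", "MLP_S"),
    "L_C": ("G", "D", "MLP_C"),
    "L_h": ("G", "D", "MLP_H"),
}


def apply_loss_update(params, loss_name, grads, lr=0.1):
    """Return a copy of params in which only the groups tied to loss_name moved."""
    updated = {name: list(values) for name, values in params.items()}
    for name in UPDATE_RULES[loss_name]:
        updated[name] = [p - lr * g for p, g in zip(params[name], grads[name])]
    return updated


# Toy parameters and placeholder gradients (illustrative values only).
params = {name: [1.0, 1.0] for name in ("G", "D", "MLP_S", "MLP_C", "MLP_H")}
grads = {name: [0.5, -0.5] for name in params}

after_ls = apply_loss_update(params, "L_S", grads)
```

After the L_S step, the entries for MLP_C and MLP_H are unchanged, while G, D and MLP_S have moved.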

libpcap is an existing network packet capture library for Unix/Linux platforms, and pcap is a common packet storage file format. The SSL record protocol works by dividing a data stream into a series of fragments, each of which is independently protected and transmitted. The network devices may be PCs, switches and servers.

Labeling known flows in the captured pcap to obtain labeled and unlabeled pcaps: pcap is a common packet storage file format. Traffic that is known and can be separated from the captured traffic is labeled by application and stored as labeled pcaps; captured traffic that contains known traffic which cannot be cleanly separated is stored as unlabeled pcaps. This is prior art.

In step (5), the identification of real and generated flows, the classification of categories and the extraction of hidden parameters are completed as follows: the flow feature matrix obtained in step (4) is input into the three networks, and the weight calculations of the neural networks output three vectors, which represent whether the flow is real or fake, the class probabilities, and the hidden parameter vector, respectively. The output process is prior art.

MLP_S refers to the MLP fully connected neural network used to classify the flow, MLP_C refers to the fully connected neural network in the example diagram used to judge whether a flow is real or generated, and MLP_H refers to the fully connected neural network in the example diagram used to output the hidden variables.

The existing network refers to a network which needs to perform traffic identification and monitoring tasks and is currently providing production services, such as a public telecommunication network, a home local area network, an internal company network and the like.

In the method, the whole network is composed of several neural networks. The generator is one neural network. The discriminator, the fully connected network and MLP_C form a stacked network that performs real/fake recognition of the flow; the discriminator, the fully connected network and MLP_S form a stacked network that performs classification of the flow, which is the classification task carried out on the collected traffic; and the discriminator, the fully connected network and MLP_H form a stacked network that outputs the hidden variables, so that the generated flow can be controlled by adjusting the hidden variables. These networks must be trained simultaneously.
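The stacked arrangement described above, one shared part followed by three output heads, can be sketched as a forward pass in plain Python. Everything numeric here is an assumption for illustration: the layer sizes (other than the 28 input features), the random weights, the ReLU activation, and the use of a raw score for the real/fake head. The heads are named by role rather than by MLP name.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
N_FEATURES, N_SHARED, N_CLASSES, N_HIDDEN_VARS = 28, 16, 5, 3  # sizes assumed

shared_layer = init_layer(rng, N_SHARED, N_FEATURES)     # shared discriminator part
head_real_fake = init_layer(rng, 1, N_SHARED)            # real vs generated flow
head_class = init_layer(rng, N_CLASSES, N_SHARED)        # traffic class
head_hidden = init_layer(rng, N_HIDDEN_VARS, N_SHARED)   # hidden variables


def forward(flow_features):
    """Run one flow through the shared part, then through each of the three heads."""
    shared = relu(dense(flow_features, *shared_layer))
    score = dense(shared, *head_real_fake)[0]
    class_probs = softmax(dense(shared, *head_class))
    hidden = dense(shared, *head_hidden)
    return score, class_probs, hidden


flow = [rng.random() for _ in range(N_FEATURES)]
score, class_probs, hidden = forward(flow)
```

Because all three heads read the same shared representation, a gradient step on any one of the losses also changes what the other two heads see, which is one way to understand why the networks are trained together.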

The special hidden features are a term of art that was first used in the field of image generation. Their meaning is not known before model training is completed; after training, it is found that the image produced by the generator changes when a particular value of the hidden variable is adjusted. For example, changing the first feature of the hidden variable may change the hair color, and adjusting the second feature may change the hairstyle. SSL is an existing encryption technique. In the method of the present invention, the certificates are SSL certificates.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network, characterized by comprising the following steps:

8) the hidden parameter vector input to the generator is h = (h₁, h₂, ..., h_n), and the hidden parameter vector output by MLP_H is

The restoration loss of the hidden parameter is defined as

2. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 1, wherein in step 4), the inter-flow features in the pcap are extracted with the open-source tool Zeek according to the defined features to form 28 flow features, a flow feature matrix of the 28 features is obtained, and a real data set for training is formed.

3. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 1, wherein the structure of the original auxiliary classification generative adversarial network is modified and hidden variables are introduced, so that the output of the generator can be controlled according to certain special hidden features.

4. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 1, wherein the loss function of the original auxiliary classification generative adversarial network is modified, the Wasserstein distance is used to replace the original log-likelihood function, and a restoration loss is defined.

5. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 1, wherein in step 3), the logs comprise the connection record log conn.log, the SSL connection record log ssl.log and the certificate record log x509.log.

6. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 5, wherein the log conn.log describes the connection between the two endpoints of a network connection; the log records information including the IP addresses, ports and protocol of the packet sender and receiver, the connection state of the two parties of the network connection, the connection state of the packet sender and receiver, the number of data packets, and the label;

7. The semi-supervised encrypted traffic identification method based on an auxiliary classification generative adversarial network as claimed in claim 5, wherein the flow features comprise connection features, SSL features and certificate features.

8. The method of claim 7, wherein the connection feature comprises:

(2) mean duration: each connection record among the SSL aggregations and connection records contains a duration in seconds; for each incoming SSL aggregation and connection record, this duration value is stored in a list, from which the average is finally calculated. Let t be the list of duration values, t = {t₁, t₂, t₃, …, t_n}; the average value of t is:

$$\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$$

(4) standard deviation range of duration: the percentage of all duration values that fall outside a range bounded by an upper limit and a lower limit, where the upper limit is the mean of t plus the standard deviation and the lower limit is the mean of t minus the standard deviation;

(8) ratio of established connection states: each connection record contains a connection state. The 13 types of connection state are divided into established and non-established states, which describe whether a TCP handshake took place or was only attempted. The established states are [SF, S1, S2, S3, RSTO, RSTR], which include a successful TCP handshake; the non-established states are [OTH, S0, REJ, SH, SHR, RSTOS0, RSTRH], which include an unsuccessful handshake. The specific meaning of each connection state is given in the Zeek documentation. The ratio of established connection states is:

(11) periodicity average: each connection record has a capture time, and the periodicity of a group of connection records is measured as follows. First, the time differences between consecutive connection records are calculated. Second, the second time differences are calculated as the absolute differences between consecutive first time differences; a second time difference of zero means that the associated connection records are periodic. Third, the second time differences are stored in a list, from which the average is calculated. Let D be the list of second time difference values, where d_n is derived from the difference between the capture times of the nth and (n + 1)th connection records: d₁ from the first and second connection records, d₂ from the second and third, and d₃ from the third and fourth. In the example, the first three values of the second time difference are d₁ = 0, d₂ = 0 and d₃ = 15:

(12) periodicity standard deviation: the standard deviation of the second time difference values D is calculated from the list of time difference values obtained in the previous step:

9. The method of claim 7, wherein the SSL features comprise:

(18) self-signed certificate ratio: Zeek can identify whether an end-user certificate is self-signed, and this information is stored in the SSL connection record; the self-signed certificate ratio is the ratio of the number of self-signed certificates to the number of all end-user certificates in the log.

10. The method of claim 7, wherein the certificate feature comprises:

(19) public key mean: each certificate record contains the public key of the certificate, and the public key in each certificate record is added to a list. Let J be the list formed by the number of public keys in each certificate record, where j₁ is the number of public keys in the first certificate record, j₂ that in the second certificate record, and so on up to j_n for the nth certificate record. The average is calculated from the list:

$$\bar{J} = \frac{1}{n}\sum_{i=1}^{n} j_i$$

(23) mean value of certificate validity start time: the ratio of two lengths of time. The first length is the certificate validity period, denoted K, where k₁ is the certificate validity period of the first certificate record, k₂ that of the second certificate record, and k_n that of the nth certificate record. The second length is the time from the start of the certificate validity period to the capture, denoted P, where p₁ is that time for the first certificate record, p₂ for the second certificate record, and so on up to p_n for the nth certificate record. This indicates how old the certificate is. For each certificate, the ratio of these two periods is calculated and the result is stored in a list, from which the average Z is then calculated: