CN113923042A

CN113923042A - Malicious software abuse DoH detection and identification system and method

Info

Publication number: CN113923042A
Application number: CN202111245911.7A
Authority: CN
Inventors: 陈伟; 张文月
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-01-11
Anticipated expiration: 2041-10-26
Also published as: CN113923042B

Abstract

The invention discloses a system and method for detecting and identifying malware abuse of DNS over HTTPS (DoH), in the technical field of deep learning and network security. The method comprises the following steps: acquiring pcap traffic packets in a network; extracting the time-series features from the pcap traffic packets and then establishing clusters of packets; generating a cluster sequence based on the clusters of all packets; extracting a final feature set from the cluster sequence; inputting the final feature set into a Transformer model for calculation to obtain a prediction label; and identifying malware-abused DoH traffic based on the type of the prediction label. By mining the more relevant temporal features in the sequence through a multi-head attention mechanism, the method reduces the overall analysis required, thereby improving the model's accuracy in detecting DoH traffic generated by malware and improving its classification performance.

Description

Malicious software abuse DoH detection and identification system and method

Technical Field

The invention relates to a system and method for detecting and identifying malware abuse of DNS over HTTPS (DoH), and belongs to the technical field of deep learning and network security.

Background

The Domain Name System (DNS) is one of the core basic services of today's internet. It mainly translates domain names that are easy for humans to remember into IP addresses that are easy for machines to process, and a large number of network services depend on it. DNS is also one of the earliest network protocols found to be vulnerable, and DNS abuse has long been a field of great interest to network security researchers. To overcome DNS vulnerabilities related to privacy and data manipulation, the Internet Engineering Task Force introduced DNS over HTTPS (DoH) in RFC 8484. Carrying DNS over Hypertext Transfer Protocol (HTTP) traffic protected by Secure Sockets Layer (SSL) or Transport Layer Security (TLS) has been largely successful in preventing DNS attacks, and DoH improves user privacy and security by preventing eavesdropping and manipulation of DNS data. Encrypting traffic provides better privacy, but it also reduces the visibility of network traffic to security tools, which can lower the security level of the network.

Malware includes computer viruses, worms, Trojans, zombie programs (bots), and other programs with malicious intent that are designed to disrupt the operation of a computer system, steal proprietary information, or gain access control rights. When malware abuses the DNS protocol, communication between the infected host and the command and control server is typically accomplished using IP-Flux or Domain-Flux techniques. In recent years, the first known malware family that uses encryption to hide its DNS activity in a DoH tunnel has appeared. One example is the malware named Godlua, which uses HTTPS requests to retrieve the DNS text records of a domain name, where the URLs of subsequent command and control servers are stored; Godlua then connects to those servers to obtain further instructions. Retrieving second- or third-stage command and control server URLs from DNS text records is not new. The novelty is that a DoH request is used instead of a traditional DNS request, so that the malware hides the frequency of its DNS resolutions. The loss of network visibility forces administrators to block the use of DoH in their networks, typically by blocking the specific IP addresses of authoritative DoH resolvers. This solution is imperfect, because any malware that wants to hide its DNS traffic can easily set up its own DoH resolver on a non-standard address and port.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a system and method for detecting and identifying malware abuse of DoH that improve detection accuracy.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

in a first aspect, the present invention provides a method for detecting and identifying malware abuse DoH, including:

acquiring a pcap flow packet in a network;

after extracting the time sequence characteristics in the pcap flow packet, establishing a packet cluster;

generating a cluster sequence based on the clusters of all packets;

extracting a final characteristic set in the cluster sequence;

inputting the final characteristic set into a Transformer model for calculation to obtain a prediction label;

and judging the malicious software abuse DoH flow based on the prediction label type.

Further, the cluster of packets is:

C = {size, pktCount, direction, duration, interarrivalTime}

where C is the cluster of packets, size is the size of the cluster, pktCount is the number of packets in the cluster, direction is the direction of all packets in the cluster, duration is the duration of the cluster, and interarrivalTime is the inter-arrival time.

Further, the final feature set is:

F_l = {(C_i, ..., C_{i+l}) | 1 ≤ i < n - l}

S = (C_1, ..., C_n)

where F_l is the final feature set, S is the cluster sequence, C_i is the i-th cluster in the cluster sequence, i is the cluster index, C_n is the n-th cluster in the cluster sequence, n is the number of clusters in the cluster sequence, and l is the length of each sequence in the final feature set.

Further, the Transformer model comprises an encoder and a decoder, wherein the encoder extracts a sequence matrix based on the time-series characteristics of the final characteristic set, and the decoder generates a position vector matrix through the extracted sequence matrix.

Further, inputting the final feature set into a Transformer model for calculation to obtain a prediction label, including:

inputting the final feature set into the Transformer model to obtain a sequence matrix, expressed as:

Q = {E_1, ..., E_{i-1}, E_i, E_{i+1}, ..., E_l}

where Q is the sequence matrix, E_i is the vectorized representation of the i-th cluster, and l is the length of each sequence in the final feature set;

obtaining the position information of the clusters based on the sequence matrix Q, expressed as:

PE(pos, 2j) = sin(pos / 10000^{2j/d})

PE(pos, 2j+1) = cos(pos / 10000^{2j/d})

where PE is the calculated position vector, pos is the position index of the cluster in the sequence, j ∈ (0, d) indexes the values in the cluster vector, 2j denotes an even position, 2j+1 denotes an odd position, and d is the embedding dimension of the cluster;

and adding the sequence matrix Q and the position vector matrix PE to obtain a final coding matrix.
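The position encoding and the addition to Q can be illustrated with a minimal sketch. It assumes that positions are counted from 0 and that the index j is recovered from each column index k as k // 2; the function names `positional_encoding` and `add_position` are hypothetical and do not come from the patent.

```python
import math
from typing import List

Matrix = List[List[float]]


def positional_encoding(length: int, d: int) -> Matrix:
    """Build the (length, d) position matrix PE from the sine/cosine formulas."""
    pe = [[0.0] * d for _ in range(length)]
    for pos in range(length):          # assumption: positions start at 0
        for k in range(d):
            j = k // 2                 # columns 2j and 2j+1 share one frequency
            angle = pos / (10000 ** (2 * j / d))
            # Even columns use sine, odd columns use cosine.
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe


def add_position(q: Matrix) -> Matrix:
    """Add PE element-wise to a sequence matrix Q of shape (l, d)."""
    pe = positional_encoding(len(q), len(q[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(q, pe)]


# Toy sequence matrix: l = 3 clusters, embedding dimension d = 4.
q = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
coded = add_position(q)
print(coded[0])  # position 0 encodes as [0.0, 1.0, 0.0, 1.0]
```

Because Q and PE have the same shape, the addition is a plain element-wise sum, and the result keeps the (l, d) shape expected by the attention layers.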

Further, inputting the final feature set into the Transformer model for calculation to obtain a prediction label further includes:

performing linear mapping on the final coding matrix multiple times to obtain subsequence codes in different subspaces;

performing self-attention calculation on the subsequence code in each subspace to obtain the subsequence code A_i weighted by the dependency weights;

concatenating the A_i and then applying a linear transformation to obtain a feature matrix, expressed as:

α = concat(A_1, ..., A_{i-1}, A_i, A_{i+1}, ..., A_h) W

where α ∈ R^{l×l} is the feature matrix, concat is the concatenation function, h is the number of subsequence codes, and W ∈ R^{hd×1} is a parameter matrix;

down-sampling the feature matrix in the global average pooling layer;

inputting the down-sampled feature matrix into a fully connected layer for dimensionality reduction;

inputting the dimension-reduced feature matrix into a Softmax layer for classification to obtain the prediction label;

wherein the prediction label includes: non-DoH, benign DoH, and malicious DoH.

In a second aspect, the present invention provides a detection and identification system for a malware abuse DoH, including:

an acquisition module: used for acquiring pcap traffic packets in a network;

a cluster creation module: used for establishing clusters of packets after extracting the time-series features from the pcap traffic packets;

a cluster sequence generation module: used for generating a cluster sequence based on the clusters of all packets;

a feature set extraction module: used for extracting a final feature set from the cluster sequence;

a prediction label output module: used for inputting the final feature set into a Transformer model for calculation to obtain a prediction label;

a judging module: used for identifying malware-abused DoH traffic based on the type of the prediction label.
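As a rough illustration of how the six modules might be chained, the sketch below wires them together as plain callables. The class name `DoHDetectionSystem`, its field names, the file name `capture.pcap`, and the toy stand-in functions are all hypothetical; the patent does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DoHDetectionSystem:
    """Chains the six modules; every field is a hypothetical callable."""
    acquire: Callable[[str], Any]            # acquisition module
    create_clusters: Callable[[Any], Any]    # cluster creation module
    make_sequence: Callable[[Any], Any]      # cluster sequence generation module
    extract_features: Callable[[Any], Any]   # feature set extraction module
    predict_label: Callable[[Any], str]      # prediction label output module
    judge: Callable[[str], bool]             # judging module

    def run(self, source: str) -> bool:
        """Pass the data through each module in order."""
        packets = self.acquire(source)
        clusters = self.create_clusters(packets)
        sequence = self.make_sequence(clusters)
        features = self.extract_features(sequence)
        label = self.predict_label(features)
        return self.judge(label)


# Toy stand-ins so the sketch runs end to end; none of them is a real detector.
system = DoHDetectionSystem(
    acquire=lambda source: [1, 2, 3],
    create_clusters=lambda packets: [[p] for p in packets],
    make_sequence=lambda clusters: tuple(clusters),
    extract_features=lambda sequence: [sequence],
    predict_label=lambda features: "malicious DoH",
    judge=lambda label: label == "malicious DoH",
)
print(system.run("capture.pcap"))  # prints True
```

Keeping each module behind a callable makes it possible to replace one stage, for example the prediction model, without touching the others.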

In a third aspect, a device for detecting and identifying malware abuse of DoH comprises a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any of the above.

In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.

Compared with the prior art, the invention has the following beneficial effects:

The method captures pcap traffic packets in real time to detect whether DoH traffic in the network environment is malicious DoH abused by malware. Time-series features are extracted from the DoH traffic packets, and a Transformer self-attention mechanism is then applied, which models the global dependencies between input and output relying entirely on attention. A multi-head attention mechanism mines the more relevant temporal features in the sequence and reduces the overall analysis required, thereby improving the model's accuracy in detecting DoH traffic generated by malware and improving its classification performance.

Drawings

Fig. 1 is a schematic diagram illustrating a detection and identification process of a malware abuse DoH according to an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Example one:

A method for detecting and identifying malware abuse of DoH, based on time-series features and a self-attention mechanism. FIG. 1 shows the specific flow of the method, which comprises the following steps:

Step 1: Capture pcap traffic packets in the network, extract the time-series features of the data packets, and create a cluster sequence of packets, stored in JSON format, to reduce the dimensionality of the data. Each generated packet cluster has five parameters that characterize it: the size of the cluster (the sum of the packet sizes in bytes), the number of packets in the cluster, the direction of all packets in the cluster (incoming or outgoing), the duration of the cluster (the time difference between the first and last packets of the cluster), and the inter-arrival time (the time difference between the current and previous clusters). The details are as follows:

1) A packet cluster is a sequence of one or more consecutive packets travelling in the same direction (having the same source and destination) within a network flow. The rationale for creating packet clusters is that combining these packets recovers the application-level traffic that TLS fragmentation and IP fragmentation spread across several packets. A timeout threshold t is also applied to each cluster, so that two packets separated by a large time gap do not appear in the same cluster.

2) Traffic shape parameters such as packet size, packet direction, and the time difference between packets are used to infer information about the underlying traffic. Each generated packet cluster is extracted and represented by a five-tuple of features:

C = {size, pktCount, direction, duration, interarrivalTime}

where C is the cluster of packets, size is the cluster size (the sum of the packet sizes in bytes), pktCount is the number of packets in the cluster, direction is the direction (incoming or outgoing) of all packets in the cluster, duration is the cluster duration (the time difference between the first and last packets of the cluster), and interarrivalTime is the inter-arrival time (the time difference between the current and previous clusters).
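The clustering rule described above can be sketched as follows. This is only an illustration under stated assumptions: packets are given as hypothetical `(timestamp, size, direction)` tuples rather than parsed from a pcap file, the default timeout of 0.5 seconds is arbitrary, the inter-arrival time is measured from the end of the previous cluster to the start of the current one, and the names `Cluster` and `build_clusters` are not from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical packet record: (timestamp in seconds, size in bytes, direction),
# with direction +1 for outgoing and -1 for incoming.
Packet = Tuple[float, int, int]


@dataclass
class Cluster:
    size: int                 # sum of packet sizes in bytes
    pkt_count: int            # number of packets in the cluster
    direction: int            # shared direction of all packets
    duration: float           # last packet time minus first packet time
    interarrival_time: float  # assumed: gap since the previous cluster ended


def build_clusters(packets: List[Packet], timeout: float = 0.5) -> List[Cluster]:
    """Group consecutive same-direction packets, splitting on gaps > timeout."""
    clusters: List[Cluster] = []
    group: List[Packet] = []
    prev_end: Optional[float] = None  # end time of the previous cluster

    def flush() -> None:
        nonlocal prev_end
        if not group:
            return
        start, end = group[0][0], group[-1][0]
        clusters.append(Cluster(
            size=sum(p[1] for p in group),
            pkt_count=len(group),
            direction=group[0][2],
            duration=end - start,
            interarrival_time=0.0 if prev_end is None else start - prev_end,
        ))
        prev_end = end
        group.clear()

    for pkt in sorted(packets, key=lambda p: p[0]):
        # Close the current cluster on a direction change or a long pause.
        if group and (pkt[2] != group[-1][2] or pkt[0] - group[-1][0] > timeout):
            flush()
        group.append(pkt)
    flush()
    return clusters


flow = [(0.00, 100, 1), (0.01, 200, 1), (0.05, 1500, -1),
        (0.06, 500, -1), (2.00, 80, 1)]
for c in build_clusters(flow):
    print(c)
```

In the toy flow, the two outgoing packets form the first cluster, the two incoming packets form the second, and the final outgoing packet starts a third cluster.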

Step 2: Generate the cluster sequences of packets stored in JSON format. The size of a sequence depends on the network traffic inside the flow, and a custom sliding window is used to generate the cluster sequences, so that one flow consists of several cluster sequences. The details are as follows:

Through the clustering process, any network flow can be represented as a cluster sequence S:

S = (C_1, ..., C_n)

where C_n is the n-th cluster in the cluster sequence, and the sequence size n depends on the network traffic inside the flow. A sliding window of length l is used to generate the cluster sequences, and cluster sequences shorter than l are padded with empty clusters. With l as a hyperparameter giving the number of clusters per sequence, the final feature set F_l extracted from the cluster sequence S is expressed as:

F_l = {(C_i, ..., C_{i+l}) | 1 ≤ i < n - l}

where F_l is the final feature set, S is the cluster sequence, C_i is the i-th cluster in the cluster sequence, i is the cluster index, C_n is the n-th cluster in the cluster sequence, n is the number of clusters in the cluster sequence, and l is the length of each sequence in the final feature set. The value of l must be tuned to find the optimum that gives the best detection performance; finding the best value of l is a trade-off between detection accuracy and response time.
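The sliding-window step can be sketched as below. The sketch follows the prose description, so every window holds exactly l clusters, and it reads the index bounds of the formula loosely. The all-zero `EMPTY_CLUSTER` padding value, the `stride` parameter, and the function name `sliding_windows` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# A cluster as the five-tuple (size, pktCount, direction, duration, interarrivalTime).
Cluster = Tuple[float, float, float, float, float]

# Assumed representation of an "empty cluster" used for padding.
EMPTY_CLUSTER: Cluster = (0.0, 0.0, 0.0, 0.0, 0.0)


def sliding_windows(seq: Sequence[Cluster], l: int, stride: int = 1) -> List[List[Cluster]]:
    """Cut a cluster sequence S into windows of l clusters each."""
    if l <= 0 or stride <= 0:
        raise ValueError("l and stride must be positive")
    clusters = list(seq)
    if not clusters:
        return []
    if len(clusters) < l:
        # Sequences shorter than l are padded with empty clusters.
        return [clusters + [EMPTY_CLUSTER] * (l - len(clusters))]
    return [clusters[i:i + l] for i in range(0, len(clusters) - l + 1, stride)]


# Toy sequence of n = 5 clusters, distinguished only by their size field.
s = [(float(i), 1.0, 1.0, 0.0, 0.0) for i in range(1, 6)]
windows = sliding_windows(s, l=3)
print(len(windows))  # prints 3
```

A larger l gives the model more context per sample but delays the first decision until more clusters have been observed, which is the accuracy versus response time trade-off mentioned above.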

Step 3: Build the Transformer model. The Transformer adopts an encoder-decoder structure, and both substructures model the extracted time-series features mainly through a multi-head attention mechanism. The input to the model needs to pass through both substructures simultaneously. The encoder models the temporal relationships between clusters in the source sequence, and the decoder generates new information from the information vectors extracted by the encoder. Both the encoder and the decoder use a multi-head attention mechanism: a position embedding layer represents the temporal order within the sequence, and a multi-head self-attention layer extracts information about the clusters in the sequence. The details are as follows:

1) The input layer, which is the input to the model, accepts through the encoder and decoder a matrix of shape (l, 5), where 5 is the number of parameters contained in each cluster. This yields the vectorized sequence representation:

Q = {E_1, ..., E_{i-1}, E_i, E_{i+1}, ..., E_l}

2) In both substructures, a position encoding operation is applied to the input matrix. The model does not use a structure such as a recurrent neural network, so it cannot capture sequence order directly. Because the order information is very important and represents the global structure, the relative or absolute position of the clusters in the sequence must be supplied. The position information is calculated as follows:

PE(pos, 2j) = sin(pos / 10000^{2j/d})

PE(pos, 2j+1) = cos(pos / 10000^{2j/d})

where PE is the calculated position vector, pos is the position index of the cluster in the sequence, j ∈ (0, d) indexes the values in the cluster vector C_i, even positions 2j are encoded with sine, odd positions 2j+1 are encoded with cosine, and d is the embedding dimension of the cluster.

3) The sequence matrix Q has the same dimensions as the position vector matrix PE, and the two matrices are added to obtain the final coding matrix.

4) In the multi-head attention calculation, each head is one linear mapping. The final coding matrix is linearly mapped multiple times to obtain subsequence codes in different subspaces. Self-attention is computed on the subsequence code in each subspace, yielding the subsequence code A_i weighted by the dependency weights. The extracted A_i are concatenated and linearly transformed to obtain the feature matrix α:

α = concat(A_1, ..., A_{i-1}, A_i, A_{i+1}, ..., A_h) W
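The multi-head computation in item 4) can be illustrated with a small pure-Python sketch. It assumes scaled dot-product self-attention inside each head, which the text does not spell out, and it uses small random, untrained weights purely to show the data flow. The output projection W is given shape (h·d_k, d) so that α keeps the embedding dimension, which is a choice of this sketch rather than a dimension taken from the patent. All function names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n, k) and b (k, m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def self_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """One head: map x into a subspace, then weight values by dependency scores."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(wq[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(x: Matrix, heads: int, d_k: int, seed: int = 0) -> Matrix:
    """alpha = concat(A_1, ..., A_h) W for an input x of shape (l, d)."""
    rng = random.Random(seed)
    d = len(x[0])
    outputs = []
    for _ in range(heads):
        wq, wk, wv = (random_matrix(d, d_k, rng) for _ in range(3))
        outputs.append(self_attention(x, wq, wk, wv))
    # Concatenate the per-head outputs row by row, then project with W.
    concat = [sum((a[i] for a in outputs), []) for i in range(len(x))]
    w = random_matrix(heads * d_k, d, rng)
    return matmul(concat, w)


# Toy final coding matrix: l = 4 clusters, embedding dimension d = 8.
coding = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
alpha = multi_head_attention(coding, heads=2, d_k=4)
print(len(alpha), len(alpha[0]))  # prints 4 8
```

Each head sees the whole sequence, so the weight a cluster assigns to any other cluster does not depend on how far apart they are, which is what lets the model pick up long-range timing patterns without recurrence.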

Step 4: The second layer of the detection model is a global average pooling layer. After the feature matrix α is obtained, a window is slid over the feature map (similar to the window sliding of a convolution) and the average value within the window is taken as the result, so that a tensor α of shape W×H×D becomes a tensor g of shape 1×D. Down-sampling the feature matrix in the global average pooling layer reduces overfitting. Here α is the original feature map and D is the number of sequence files; the number of feature maps equals the number of sequence files, and the average of each feature map is calculated as:

g_i = avg(α_i)

where g_i is the result of averaging the i-th feature map.
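A minimal sketch of the averaging step follows. It assumes that the feature matrix is stored as a list of rows and that each column is one feature map α_i, so pooling reduces an (l, D) matrix to a vector of D averages. The function name `global_average_pool` is hypothetical.

```python
from typing import List


def global_average_pool(alpha: List[List[float]]) -> List[float]:
    """Return g, where g[i] is the average of column i (feature map i) of alpha."""
    if not alpha:
        raise ValueError("alpha must contain at least one row")
    n = len(alpha)
    return [sum(column) / n for column in zip(*alpha)]


# Toy feature matrix with 3 rows and D = 2 feature maps.
g = global_average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(g)  # prints [3.0, 4.0]
```

Since the pooling layer has no trainable parameters, it shrinks the input to the fully connected layer without adding capacity, which is consistent with the stated aim of reducing overfitting.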

Step 5: After g_i is input into the fully connected layer for dimensionality reduction, the result is input into a Softmax layer for classification, and the prediction label (whether malicious DoH is present) is obtained by classifying the cluster sequence.

The final output dimension of the fully connected (Dense) layer is 3, representing three categories: non-DoH, benign DoH, and malicious DoH.

The class with the maximum probability among the class probabilities is taken as the classification result, so that malicious DoH can be detected.
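The classification head can be sketched as a dense layer with three outputs followed by Softmax and an arg-max. The weights and bias below are made-up numbers chosen only to make the example deterministic, not trained values, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

LABELS = ("non-DoH", "benign DoH", "malicious DoH")


def dense(g: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one weight row and one bias value per output class."""
    return [sum(x * w for x, w in zip(g, row)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to class probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(g: List[float], weights: List[List[float]], bias: List[float]) -> Tuple[str, List[float]]:
    """Return the label with the highest probability and all class probabilities."""
    probs = softmax(dense(g, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Illustrative parameters: pooled vector with D = 2 and a 3-class dense layer.
pooled = [1.0, 0.0]
weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
bias = [0.0, 0.0, 0.0]
label, probs = predict(pooled, weights, bias)
print(label)  # prints malicious DoH
```

The judging step then reduces to checking whether the returned label is the malicious DoH class.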

Example two:

a detection and identification system for malware abuse DoH, comprising:

Example three:

the embodiment of the invention also provides a device for detecting and identifying the malicious software abuse DoH, which comprises a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method of:

acquiring a pcap flow packet in a network;

generating a cluster sequence based on the clusters of all packets;

extracting a final characteristic set in the cluster sequence;

Example four:

an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method steps:

acquiring a pcap flow packet in a network;

generating a cluster sequence based on the clusters of all packets;

extracting a final characteristic set in the cluster sequence;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A method for detecting and identifying malware abuse of DoH, characterized by comprising the following steps:

acquiring a pcap flow packet in a network;

generating a cluster sequence based on the clusters of all packets;

extracting a final characteristic set in the cluster sequence;

2. The method for detecting and identifying DoH of malicious software according to claim 1,

the cluster of packets is:

C＝{size，pktCount，direction，duration，interarrivalTime}

3. The method for detecting and identifying DoH of malicious software according to claim 2,

the final feature set is:

F_l = {(C_i, ..., C_{i+l}) | 1 ≤ i < n - l}

S = (C_1, ..., C_n)

4. The method for detecting and identifying malicious software abuse (DoH) according to claim 1, wherein the Transformer model comprises an encoder and a decoder, the encoder extracts a sequence matrix based on the time-series characteristics of the final characteristic set, and the decoder generates a position vector matrix through the extracted sequence matrix.

5. The method for detecting and identifying DoH of malicious software according to claim 1,

inputting the final characteristic set into a Transformer model for calculation to obtain a prediction label, wherein the prediction label comprises the following steps:

Q = {E_1, ..., E_{i-1}, E_i, E_{i+1}, ..., E_l}

wherein Q is the sequence matrix, E_i is the vectorized representation of the i-th cluster, and l is the length of each sequence in the final feature set;

PE(pos, 2j) = sin(pos / 10000^{2j/d})

PE(pos, 2j+1) = cos(pos / 10000^{2j/d})

6. The method for detecting and identifying DoH of malicious software according to claim 5,

inputting the final feature set into a Transformer model for calculation to obtain a prediction label, and further comprising:

α = concat(A_1, ..., A_{i-1}, A_i, A_{i+1}, ..., A_h) W

7. The method for detecting and identifying DoH of malicious software according to claim 6,

the predictive tag includes: non-DoH, benign DoH and malicious DoH.

8. A detection and identification system for malware abuse of DoH, comprising:

9. A device for detecting and identifying malware abuse of DoH, characterized by comprising a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.