CN112953914A - DGA domain name detection and classification method and device - Google Patents

DGA domain name detection and classification method and device

Info

Publication number
CN112953914A
Authority
CN
China
Prior art keywords
domain name
classification
dga domain
word
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110124333.5A
Other languages
Chinese (zh)
Inventor
林兰芬
周少芳
袁俊坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202110124333.5A
Publication of CN112953914A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2463/00 Additional details relating to network architectures or network communication protocols for network security covered by H04L63/00
    • H04L2463/144 Detection or countermeasures against botnets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a DGA domain name detection and classification method and device. The method comprises the following steps: obtaining a DGA domain name to be classified; segmenting the DGA domain name into elements of character granularity or word granularity using a mixed-granularity domain name segmentation method; performing a hash calculation on each element to obtain a corresponding integer code; and inputting the integer codes into a DGA domain name detection and classification network based on multi-layer dilated causal convolution to obtain a classification result. Because the mixed-granularity segmentation method divides the domain name into elements of character granularity or word granularity, both the character-combination and word-combination features of the domain name can be exploited effectively, different kinds of DGA domain names can be detected in a uniform way, and the ability to detect word-list-based DGA domain names is improved. The classification result is computed by the DGA domain name detection and classification network based on multi-layer dilated causal convolution, which is computationally efficient and balances performance against efficiency.

Description

DGA domain name detection and classification method and device
Technical Field
The present application relates to the field of network security technologies, and in particular, to a DGA domain name detection method and apparatus.
Background
A botnet is a serious form of malicious attack in which a controller (botmaster) propagates bot programs for malicious purposes, infects a large number of hosts to turn them into zombie hosts (bots), and establishes one-to-many command and control (C&C) channels between the controller's servers and the zombie hosts. Once a botnet is formed, the attacker gains powerful distributed computing capacity and rich information resources, which makes it easier to launch various network attacks.
To better hide themselves and evade detection, many botnets embed a domain generation algorithm (DGA) in the bot program. The algorithm dynamically generates a large number of domain names (called DGA domain names), of which only a few are registered and put into use. A zombie host can always eventually connect to the server by trying the DGA domain names one by one, but defenders cannot determine the real address of the server and must block all of the DGA domain names in order to cut off the channel.
Many researchers in the field of computer security have worked on DGA domain name detection. Among the available approaches, deep learning has become the mainstream solution because it offers high detection accuracy, requires no hand-crafted features, and supports real-time detection.
Deep-learning-based DGA domain name detection generally converts a domain name into a multi-dimensional numerical vector, inputs the vector into a deep neural network with a certain structure, extracts features through the network, and predicts the probability that the domain name belongs to a given class.
Depending on the structure of the deep neural network used, existing detection algorithms can be classified into: (1) recurrent-neural-network-based algorithms: for example, Woodbridge et al. proposed using a Long Short-Term Memory (LSTM) network to detect DGA domain names, and Yu et al. later proposed a bidirectional LSTM (BiLSTM) network as the classification network; (2) convolutional-neural-network-based algorithms, such as the stacked convolutional neural network (S-CNN) proposed by Yu et al. and the parallel convolutional neural network (P-CNN) proposed by Saxe et al.; (3) hybrid-network-based algorithms, i.e., networks that use both convolutional layers and recurrent structures, such as the algorithm proposed by Yu et al. that uses a CNN-LSTM network as the classifier; (4) capsule-network-based algorithms: for example, Berman explored a DGA domain name detection algorithm based on a one-dimensional capsule network (CapsNet for short).
However, the above detection algorithms have limitations, mainly the following two:
First, when converting a domain name into a multi-dimensional numerical vector, they usually adopt character-granularity segmentation: each character in the domain name is treated as a separate element, and the characters are then encoded and semantically embedded individually. The limitation of this approach is that the subsequent deep neural network can only mine character-level features and ignores any word structure present in the domain name. As a result, although these algorithms can detect DGA domain names composed of random characters, they have difficulty recognizing DGA domain names that are composed of common words and whose character frequency distribution closely resembles that of normal domain names.
Second, the deep neural networks used by existing methods each have strengths and weaknesses, and it is difficult to achieve both accuracy and efficiency. Recurrent neural networks are well suited to processing sequence information, but take a long time to train; ordinary convolutional neural networks can extract local features, but have difficulty extracting global information; hybrid networks can use both local and global features, but are prone to information loss during feature extraction; capsule networks achieve high classification accuracy, but are computationally complex and therefore inefficient.
Disclosure of Invention
The embodiments of the present application provide a DGA domain name detection and classification method and device, so as to solve the problems in the related art that word-list-based DGA domain names are difficult to identify and that accuracy and efficiency are difficult to achieve at the same time.
According to a first aspect of the embodiments of the present application, a DGA domain name detection and classification method is provided, including:
obtaining a DGA domain name to be classified;
segmenting the DGA domain name into elements of character granularity or word granularity using a mixed-granularity domain name segmentation method;
performing a hash calculation on each element to obtain a corresponding integer code;
and inputting the integer codes into a DGA domain name detection and classification network based on multi-layer dilated causal convolution to obtain a classification result.
According to a second aspect of the embodiments of the present application, a DGA domain name detection and classification device is provided, including:
an acquisition module, used for obtaining the DGA domain name to be classified;
a segmentation module, used for segmenting the DGA domain name into elements of character granularity or word granularity using a mixed-granularity domain name segmentation method;
a computing module, used for performing a hash calculation on each element to obtain a corresponding integer code;
and a classification module, used for inputting the integer codes into a DGA domain name detection and classification network based on multi-layer dilated causal convolution to obtain a classification result.
The technical solutions provided by the embodiments of the present application can have the following beneficial effects:
According to the embodiments, the mixed-granularity segmentation method divides the domain name into elements of character granularity or word granularity, so that both the character-combination and word-combination features of the domain name can be exploited effectively, different types of DGA domain names can be detected in a uniform way, and the ability to detect word-list-based DGA domain names is improved. The classification result is computed by a DGA domain name detection and classification network based on multi-layer dilated causal convolution. The core of this network is a set of dilated causal convolution layers with different dilation factors, which mine features of different scales from the domain name; these multi-scale features are fused to classify the DGA domain name, giving high accuracy. In addition, the network is computationally efficient, so classification performance and efficiency can be balanced. The problems in the related art, namely that word-list-based DGA domain names are difficult to identify and that accuracy and efficiency are difficult to achieve at the same time, are thereby solved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a DGA domain name detection classification method according to an exemplary embodiment.
Fig. 2 is a flow diagram illustrating a mixed-granularity domain name splitting method in accordance with an example embodiment.
Fig. 3 is a process diagram illustrating mixed-granularity domain name segmentation according to an exemplary embodiment, using the domain name "nwvlan.com" as an example.
FIG. 4 is a flow diagram illustrating a calculation of integer coding according to an example embodiment.
Fig. 5 is a schematic diagram illustrating the structure of a DGA domain name detection and classification network based on multi-layer dilated causal convolution according to an exemplary embodiment.
Fig. 6 is a diagram illustrating a three-layer dilated causal convolution according to an exemplary embodiment.
Fig. 7 is a block diagram illustrating a structure of a DGA domain name detection and classification apparatus according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating the structure of a segmentation module in accordance with an exemplary embodiment.
FIG. 9 is a block diagram illustrating the structure of a computing module in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Fig. 1 is a flowchart illustrating a DGA domain name detection and classification method according to an exemplary embodiment. Referring to Fig. 1, the method may include the following steps:
Step S101, obtaining a DGA domain name to be classified;
Step S102, segmenting the DGA domain name into elements of character granularity or word granularity using a mixed-granularity domain name segmentation method;
Step S103, performing a hash calculation on each element to obtain a corresponding integer code;
Step S104, inputting the integer codes into a DGA domain name detection and classification network based on multi-layer dilated causal convolution to obtain a classification result.
According to the embodiments, the mixed-granularity segmentation method divides the domain name into elements of character granularity or word granularity, so that both the character-combination and word-combination features of the domain name can be exploited effectively, different types of DGA domain names can be detected in a uniform way, and the ability to detect word-list-based DGA domain names is improved. The DGA domain name detection and classification network based on multi-layer dilated causal convolution mines features of different scales from the domain name and fuses these multi-scale features to classify the DGA domain name, giving high accuracy. In addition, the network is computationally efficient, so classification performance and efficiency can be balanced. The problems in the related art, namely that word-list-based DGA domain names are difficult to identify and that accuracy and efficiency are difficult to achieve at the same time, are thereby solved.
In step S102 above, a mixed-granularity domain name segmentation method is used to segment the DGA domain name into elements of character granularity or word granularity. Fig. 2 is a flowchart of the mixed-granularity domain name segmentation method according to an exemplary embodiment, and Fig. 3 is a schematic process diagram of mixed-granularity domain name segmentation according to an exemplary embodiment, taking the domain name "nwvlan.com" as an example. Referring to Figs. 2 and 3, the mixed-granularity domain name segmentation method may specifically include the following sub-steps:
Step S1021, dividing the DGA domain name into a plurality of domain name labels at the separator ".";
Specifically, a complete domain name contains multiple labels separated by ".", so "." is a natural delimiter in a domain name. For example, "nwvlan.com" is divided into the second-level domain name label "nwvlan" and the top-level domain name label "com".
Step S1022, selecting, from the unigram data of a corpus, the character strings whose number of occurrences exceeds a set threshold to form a word frequency table;
Specifically, the unigram data of the corpus records the number of occurrences of various combinations of English letters. Empirically, a character string that occurs more often is more likely to be a common English word. Selecting only the character strings whose number of occurrences exceeds the set threshold also reduces the amount of computation.
The corpus is one of Google Web 1T 5-gram, the British National Corpus, and the Corpus of Contemporary American English.
Step S1023, segmenting each domain name label into a plurality of word-granularity elements according to the word frequency table, using a word segmentation method based on word frequency statistics;
Specifically, the goal is to extract common words from the domain name, so that the word-combination structure of the domain name is exploited effectively and the classification accuracy for word-list-based DGA domain names is improved. To this end, a word segmentation method based on word frequency statistics is adopted: a dynamic programming algorithm finds, among all possible segmentations of the domain name label, the one with the highest segmentation probability, and takes it as the word segmentation result. The substrings obtained by this segmentation are the word-granularity elements. Taking the second-level domain name label "nwplmns" as an example, it is segmented into the word-granularity elements "nw" and "places".
The segmentation probability is defined as follows. Let $s$ be a string of length $n$, and suppose $s$ is divided into $m$ substrings $w_1, w_2, \ldots, w_m$. The probability of this segmentation, i.e., the joint probability of these substrings, equals the product of the occurrence probabilities of all the substrings, as shown in formula (1).
$$P(c) = P(w_{1:m}) = \prod_{i=1}^{m} P(w_i) \tag{1}$$
where $P(c)$ denotes the probability of segmentation $c$, $P(w_{1:m})$ denotes the joint probability of the substrings $w_1, w_2, \ldots, w_m$, and $P(w_i)$ denotes the occurrence probability of substring $w_i$.
The occurrence probability of a character string is calculated as follows. If a string $w$ is contained in the word frequency table and its number of occurrences is $N(w)$, its occurrence probability is approximated by formula (2): the probability equals the string's number of occurrences divided by the sum of the numbers of occurrences of all strings in the word frequency table.
$$P(w) \approx \frac{N(w)}{\sum_{j} N(w_j)} \tag{2}$$
If the string $w$ is not contained in the word frequency table, its occurrence probability is approximated by formula (3), where $T$ is the total number of all strings in the unigram data of the corpus and $\mathrm{len}(w)$ is the length of string $w$. Under formula (3), each additional character lowers the estimated probability of an unknown string by a factor of 10.
$$P(w) \approx \frac{10}{T \times 10^{\mathrm{len}(w)}} \tag{3}$$
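The word-frequency-based segmentation can be sketched as follows. This is a minimal illustration, not the patent's implementation: the toy table `WORD_FREQ`, the value of `T`, the cap `MAX_WORD_LEN`, and the example labels are all assumptions made for demonstration. Probabilities are handled in log space, so maximizing the product in formula (1) becomes maximizing a sum.

```python
import math
from functools import lru_cache

# Hypothetical toy word frequency table (string -> count). A real table would
# hold the corpus unigrams whose counts exceed the chosen threshold.
WORD_FREQ = {"places": 5000, "place": 8000, "s": 100, "new": 9000, "york": 3000}
TABLE_TOTAL = sum(WORD_FREQ.values())  # denominator of formula (2)
T = 1_000_000        # assumed size of the corpus unigram data, formula (3)
MAX_WORD_LEN = 20    # assumed cap on candidate substring length


def log_prob(w):
    """log10 P(w): formula (2) for table entries, formula (3) otherwise."""
    if w in WORD_FREQ:
        return math.log10(WORD_FREQ[w] / TABLE_TOTAL)
    # log10(10 / (T * 10**len(w))) = log10(10 / T) - len(w)
    return math.log10(10.0 / T) - len(w)


def segment(label):
    """Return the segmentation of `label` with the highest joint probability."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (log-probability, words) for the suffix label[start:].
        if start == len(label):
            return 0.0, ()
        candidates = []
        last = min(len(label), start + MAX_WORD_LEN)
        for end in range(start + 1, last + 1):
            word = label[start:end]
            rest_score, rest_words = best(end)
            candidates.append((log_prob(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(segment("nwplaces"))  # ['nw', 'places'] with the toy table above
```

With this toy table, the unknown prefix "nw" stays together as one element at this stage, because formula (3) penalizes two one-character unknowns more heavily than one two-character unknown.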
Step S1024, further dividing those word-granularity elements that are not contained in the word frequency table into character-granularity elements.
Specifically, the word-granularity elements are traversed. If an element is contained in the word frequency table, it occurs frequently and can empirically be judged to be a common word, so no further operation is needed. If an element is not contained in the word frequency table, it is not a common word and should be further divided into individual characters, i.e., character-granularity elements. For example, the word-granularity element "nw" is not contained in the word frequency table and is further divided into the character-granularity elements "n" and "w". This allows the method to also handle DGA domain names generated from random character combinations.
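Sub-steps S1021 to S1024 can be put together in a short sketch. The helper `toy_segment` is a hypothetical greedy longest-match stand-in for the dynamic-programming segmentation of step S1023, used only to keep the example self-contained. The word table and the domain are illustrative, and dropping the "." separators from the output is an assumption.

```python
def toy_segment(label, word_table):
    """Stand-in for step S1023: greedy longest match against the word table.

    Characters not covered by any table word are grouped into one unknown chunk.
    """
    pieces, unknown, i = [], "", 0
    while i < len(label):
        match = next(
            (label[i:j] for j in range(len(label), i, -1) if label[i:j] in word_table),
            None,
        )
        if match:
            if unknown:
                pieces.append(unknown)
                unknown = ""
            pieces.append(match)
            i += len(match)
        else:
            unknown += label[i]
            i += 1
    if unknown:
        pieces.append(unknown)
    return pieces


def mixed_granularity_split(domain, word_table):
    """Split a domain into word-granularity and character-granularity elements."""
    elements = []
    for label in domain.lower().split("."):            # step S1021: split at "."
        for piece in toy_segment(label, word_table):   # step S1023 (stand-in)
            if piece in word_table:                    # step S1024: keep known words
                elements.append(piece)
            else:
                elements.extend(piece)                 # unknown chunk -> characters
    return elements


print(mixed_granularity_split("nwplaces.com", {"places", "com"}))
# ['n', 'w', 'places', 'com']
```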
In step S103, a hash calculation is performed on each element to obtain a corresponding integer code. Fig. 4 is a flowchart illustrating the calculation of the integer codes according to an exemplary embodiment. Referring to Fig. 4, step S103 may specifically include the following sub-steps:
Step S1031, performing a hash calculation on each element to obtain a corresponding integer code;
Specifically, an element may be an English letter, a digit, a punctuation mark, or a character string contained in the word frequency table. Every element that may appear is assigned an index as its code, and a hash table records the correspondence between each element and its code. The elements are then traversed, and the integer code of each element is obtained by looking it up in the hash table. Because the DGA domain name detection and classification network based on multi-layer dilated causal convolution can only process numerical values, this step converts the text into a numerical representation for the network's subsequent computation.
Step S1032, padding the end of the integer code sequence with zeros until it reaches a predetermined length.
Specifically, the DGA domain name detection and classification network only accepts inputs of the same length, and this step converts variable-length data into fixed-length data.
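A minimal sketch of the encoding and padding steps is shown below, using a Python `dict` as the hash table. The symbol set, the reservation of index 0 for padding, and the truncation of over-long inputs are assumptions that the text does not specify.

```python
import string


def build_vocab(word_table):
    """Assign every possible element an index starting at 1 (0 is kept for padding)."""
    symbols = list(string.ascii_lowercase + string.digits + "-_.") + sorted(word_table)
    vocab = {}
    for sym in symbols:
        vocab.setdefault(sym, len(vocab) + 1)  # first assignment wins on duplicates
    return vocab


def encode(elements, vocab, max_len):
    """Look up each element's integer code, then zero-pad to `max_len`."""
    codes = [vocab[e] for e in elements][:max_len]  # truncation is an assumption
    return codes + [0] * (max_len - len(codes))


vocab = build_vocab({"places", "com"})
print(encode(["n", "w", "places", "com"], vocab, 8))
# [14, 23, 41, 40, 0, 0, 0, 0]
```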
In this embodiment, the DGA domain name detection and classification network based on multi-layer dilated causal convolution includes an embedding layer, dilated causal convolution layers, layer normalization layers, and a fully connected layer, connected in sequence. The embedding layer maps the integer codes into a word vector representation; the dilated causal convolution layers extract multi-scale features from the word vector representation; the layer normalization layers normalize the features so that they conform to a normal distribution; and the fully connected layer computes the classification result from the normalized features. When the network is used for binary classification, the classification result is either a DGA domain name or a non-DGA domain name; when the network is used for multi-class classification, the classification result is the label of a specific DGA domain name family.
Fig. 5 is a schematic diagram illustrating the structure of the DGA domain name detection and classification network based on multi-layer dilated causal convolution according to an exemplary embodiment. Referring to Fig. 5, the network is connected as follows: the integer codes are input to the embedding layer; the output of the embedding layer is the input of the first dilated causal convolution layer; the output of the first dilated causal convolution layer is the input of the first layer normalization; the output of the first layer normalization is the input of the second dilated causal convolution layer; the output of the second dilated causal convolution layer is the input of the second layer normalization; the output of the second layer normalization is the input of the third dilated causal convolution layer; and the output of the third dilated causal convolution layer is the input of the third layer normalization. The outputs of the first, second, and third layer normalizations are added together linearly, and the sum is the input of the fully connected layer, whose output is the classification result.
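The wiring described for Fig. 5 can be sketched with placeholder layers. The function below only shows the data flow, in particular that the three normalized outputs are summed before the fully connected layer. All layers are passed in as callables, and features are flat lists of floats for simplicity, which is an assumption made to keep the sketch short.

```python
def forward(codes, embed, conv_layers, norm_layers, fully_connected):
    """Data flow: embed -> (conv -> norm) x 3 -> sum of normalized outputs -> FC."""
    x = embed(codes)
    normalized_outputs = []
    for conv, norm in zip(conv_layers, norm_layers):
        x = norm(conv(x))             # each conv layer consumes the previous norm output
        normalized_outputs.append(x)
    # Element-wise (linear) addition of the three normalized outputs.
    fused = [sum(values) for values in zip(*normalized_outputs)]
    return fully_connected(fused)


# Usage with trivial stand-in layers (hypothetical, for illustration only).
result = forward(
    [1, 2],
    embed=lambda codes: [float(c) for c in codes],
    conv_layers=[lambda x: [2 * v for v in x]] * 3,
    norm_layers=[lambda x: x] * 3,
    fully_connected=sum,
)
print(result)  # 42.0
```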
Specifically, the embedding layer maps the integer codes into a word vector representation. The embedding layer contains an embedding matrix of dimension n × k, where n is the total number of possible elements and k is the dimension of the word vectors. For an element whose integer code is i, the i-th row of the embedding matrix is its word embedding. Like the other trainable parameters of the DGA domain name detection and classification network, the embedding matrix is initialized with random weights and is continuously updated by the back-propagation algorithm during training.
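As a small illustration of the lookup, the sketch below builds a randomly initialized n × k matrix and selects rows by integer code. The initialization range and the inclusion of the padding index 0 in the matrix are assumptions, and no training is shown.

```python
import random


def make_embedding(n, k, seed=0):
    """Create an n x k embedding matrix with small random initial weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(k)] for _ in range(n)]


def embed(codes, matrix):
    """Map each integer code i to row i of the embedding matrix."""
    return [matrix[i] for i in codes]


matrix = make_embedding(n=5, k=3)
vectors = embed([2, 0, 2], matrix)
print(len(vectors), len(vectors[0]))  # 3 3
```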
The dilated causal convolution layer extracts multi-scale features from the word vector representation. Let $f$ be a one-dimensional input and $k$ the convolution kernel. The dilated causal convolution operation $F_{dc}$ at position $t$ of $f$ is then calculated according to formula (4), where $d$ is the dilation factor and $|k|$ is the size of the convolution kernel.
$$F_{dc}(t) = \sum_{i=0}^{|k|-1} k(i)\, f(t - d \cdot i) \tag{4}$$
Compared with ordinary convolution, dilated causal convolution has two characteristics: (1) when the output at position t is computed, only the data at position t or earlier takes part in the computation, which prevents future information from leaking into the past; (2) a dilation factor d is introduced, and one data point is selected every d time steps to take part in the computation, which enlarges the receptive field without increasing the number of convolution kernel parameters.
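The two properties of dilated ("hole") causal convolution described above can be illustrated with a single-channel sketch. Treating positions before the start of the sequence as zero (left zero-padding) is an assumption, and a real layer would operate on multi-channel word vectors with learned kernels.

```python
def dilated_causal_conv1d(f, k, d):
    """out[t] = sum_i k[i] * f[t - d*i], with positions before 0 treated as zero."""
    out = []
    for t in range(len(f)):
        acc = 0.0
        for i, weight in enumerate(k):
            src = t - d * i          # only position t or earlier (causal)
            if src >= 0:
                acc += weight * f[src]
        out.append(acc)
    return out


signal = [1, 2, 3, 4, 5]
print(dilated_causal_conv1d(signal, [1, 1], d=1))  # [1.0, 3.0, 5.0, 7.0, 9.0]
print(dilated_causal_conv1d(signal, [1, 1], d=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

With the same two-tap kernel, d = 2 combines each position with the one two steps earlier, so the layer sees a wider span without extra parameters.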
The DGA domain name detection and classification network based on multi-layer dilated causal convolution comprises three dilated causal convolution layers whose dilation factors double from layer to layer, namely the first, second, and third dilated causal convolution layers. The convolution kernel size of each of the three layers is set to 2, and the dilation factors are 1, 2, and 4, respectively. Fig. 6 is a diagram illustrating a three-layer dilated causal convolution according to an exemplary embodiment. Referring to Fig. 6, the first dilated causal convolution layer has a dilation factor d of 1 and extracts local features from the domain name; the second dilated causal convolution layer has a dilation factor d of 2, takes the output of the first layer (after layer normalization) as input, and integrates the sequence information of four elements; the third dilated causal convolution layer has a dilation factor d of 4, which further enlarges the receptive field and mines more abstract patterns. Together, the three layers mine features of different scales and different degrees of abstraction from the domain name.
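The growth of the receptive field across the stacked layers can be checked with a short calculation. The helper below is illustrative and uses the standard rule that each layer adds (kernel size − 1) × d positions.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions that influence one output of the stacked layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


for depth in range(1, 4):
    print(depth, receptive_field(2, [1, 2, 4][:depth]))
# 1 2
# 2 4
# 3 8
```

With kernel size 2 and dilation factors 1, 2, and 4, the second layer covers four elements, matching the description above, and the third layer covers eight.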
Layer normalization adjusts the features so that they conform to a normal distribution. Assuming that the output of the dilated causal convolution layer contains a total of $H$ hidden neurons, their mean $\mu$ and standard deviation $\sigma$ are calculated by formula (5) and formula (6), respectively, where $h_i$ denotes the output of the $i$-th hidden neuron.
$$\mu = \frac{1}{H} \sum_{i=1}^{H} h_i \tag{5}$$
$$\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(h_i - \mu\right)^2} \tag{6}$$
Next, $h_i$ is normalized so that the data follow a normal distribution, as shown in formula (7).
$$h'_i = \frac{h_i - \mu}{\sigma} \tag{7}$$
However, the normalized distribution may not reflect the characteristics of the original data distribution. To ensure that the original information is not destroyed, the vector $h'$ composed of the normalized hidden neurons is further transformed as shown in formula (8).
$$h'' = g \odot h' + b \tag{8}$$
In the formula, $g$ and $b$ are trainable parameters that are continuously updated and learned during network training.
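The normalization and the trainable rescaling can be sketched in a few lines. The small constant `eps`, added for numerical stability, is an assumption that does not appear in the text, and `g` and `b` are given fixed values here instead of being learned.

```python
import math


def layer_norm(h, g, b, eps=1e-5):
    """Normalize h to zero mean and unit variance, then rescale with g and shift with b."""
    H = len(h)
    mu = sum(h) / H                                        # mean of the H neurons
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / H)   # standard deviation
    normalized = [(x - mu) / (sigma + eps) for x in h]     # eps avoids division by zero
    return [gi * x + bi for gi, x, bi in zip(g, normalized, b)]


out = layer_norm([1.0, 2.0, 3.0, 4.0], g=[1.0] * 4, b=[0.0] * 4)
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```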
The fully connected layer computes the classification result from the normalized features. When the DGA domain name detection and classification network based on multi-layer dilated causal convolution is used for binary classification, the fully connected layer has one output neuron and uses the Sigmoid activation function. It computes a real number $p$ between 0 and 1 that represents the probability that the input is a DGA domain name; if $p$ is greater than a preset threshold, the classification result is a DGA domain name, otherwise it is a non-DGA domain name. When the network is used for multi-class classification, the number of output neurons of the fully connected layer equals the number of classes $c$ and the activation function is Softmax. The fully connected layer computes a $c$-dimensional vector $p = [p_1, p_2, \ldots, p_c]$, where $p_i$ ($i = 1, 2, \ldots, c$) represents the probability that the domain name belongs to class $i$, and the classification result is the label of the DGA domain name family corresponding to the class with the highest probability.
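The two output modes can be illustrated as follows. The sketch starts from the pre-activation outputs (logits) of the fully connected layer; the threshold of 0.5 and the family labels are hypothetical placeholders.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def classify_binary(logit, threshold=0.5):
    """Binary mode: one output neuron with Sigmoid, compared with a preset threshold."""
    p = sigmoid(logit)
    return ("DGA" if p > threshold else "non-DGA"), p


def classify_family(logits, family_labels):
    """Multi-class mode: Softmax over c neurons, pick the most probable family."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return family_labels[best], probs


print(classify_binary(2.0)[0])                                          # DGA
print(classify_family([0.1, 2.0, -1.0], ["fam_a", "fam_b", "fam_c"])[0])  # fam_b
```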
To construct the dataset, 1,000,000 benign domain names were obtained from Tranco and 60, 7,073, 965 DGA domain names were obtained from DGArchive; 80% of the dataset was used for training and 20% for testing. For the binary classification problem, precision, recall, F1 value and AUC value were selected as evaluation indices. For the multi-classification problem, the arithmetic mean and the weighted mean of precision, recall and F1 value were selected as evaluation indices.
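The difference between the arithmetic mean and the weighted mean of a per-class index can be shown with a small plain-Python sketch. It assumes that the weighted mean weights each class by its number of true samples, which is the usual convention but is not spelled out in the text; the function names are hypothetical.

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    stats = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[c] = (precision, recall, f1)
    return stats

def averaged_f1(y_true, y_pred):
    """Return (arithmetic mean, support-weighted mean) of the per-class F1 values."""
    stats = per_class_prf(y_true, y_pred)
    support = Counter(y_true)  # number of true samples per class
    macro = sum(f1 for _, _, f1 in stats.values()) / len(stats)
    weighted = sum(support[c] * stats[c][2] for c in stats) / len(y_true)
    return macro, weighted
```

The arithmetic mean treats every DGA family equally, so small families influence it as much as large ones, while the weighted mean is dominated by the families with the most samples. Reporting both makes it visible when a model performs well only on the large families.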
Table 1 shows the results of the binary classification experiments on domain names. As can be seen from the table, the precision of the present invention is slightly lower than that of the CapsNet method, but its recall, F1 value and AUC value exceed those of the other existing state-of-the-art methods, including CapsNet. The table also shows the training time of each model: S-CNN is the shortest, P-CNN is second and the present invention is third, but overall the training times of these three convolutional neural networks do not differ greatly. The CapsNet method, whose classification performance is closest to that of the present invention, takes about 8 times as long to train.
TABLE 1 Results of the binary classification experiments on domain names
Figure BDA0002923424090000101
Table 2 shows the overall results of the multi-classification experiments on domain names. As can be seen from the table, the arithmetic mean of the precision of the present invention is slightly lower than that of the CapsNet model, but the arithmetic means of its recall and F1 value and the weighted means of all three indices are the highest. Overall, the multi-classification performance of the present invention is slightly better than that of CapsNet, the best-performing existing method, while its training time is only one seventh of that of CapsNet. Compared with S-CNN, the method with the shortest training time, the present invention classifies better: the arithmetic mean and the weighted mean of the F1 value are improved by about 13% and 8%, respectively.
TABLE 2 Overall results of the multi-classification experiments on domain names
Figure BDA0002923424090000111
In addition, the F1 value for detecting each DGA domain name family was computed, and the present invention was found to have the highest classification accuracy for all 42 of the classes. Table 3 shows a portion of representative data, from which it can be seen that: (1) for both random-character-based DGA domain name families, such as Bamital and Cryptolocker, and word-list-based DGA domain name families, such as Gozi, Matsnu and Suppobox, the present invention achieves the best classification results, which indicates that the method is suitable for different types of DGA domain names; (2) for new DGA domain names designed by researchers to target the weak points of detection algorithms, such as CharBot and Deception, the classification accuracy of the method is superior to that of the other models, which indicates that the method can extract more salient features from domain names built to bypass detection algorithms; (3) for pairs of similar DGA domain name families, such as Nymaim and its variant Nymaim2, the present invention distinguishes them better, probably because its feature extraction capability is stronger than that of the other methods.
TABLE 3 F1 value statistics of some DGA domain name families in the multi-classification experiment
Figure BDA0002923424090000112
Figure BDA0002923424090000121
In conclusion, the method balances performance and efficiency: its detection capability is superior to that of the latest existing methods, and its algorithmic efficiency is high.
Corresponding to the embodiment of the DGA domain name detection and classification method, the application also provides an embodiment of a DGA domain name detection and classification device.
Fig. 6 is a block diagram illustrating a structure of a DGA domain name detection and classification apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes:
an obtaining module 21, configured to obtain a DGA domain name to be classified;
a segmentation module 22, configured to segment the DGA domain name into elements of a character granularity or a word granularity by using a domain name segmentation method with mixed granularity;
a calculating module 23, configured to perform hash calculation on each element to obtain a corresponding integer code;
and the classification module 24 is configured to input the integer codes into a DGA domain name detection classification network based on multi-layer void causal convolution to obtain classification results.
Fig. 7 is a block diagram illustrating the structure of a segmentation module according to an exemplary embodiment, wherein the segmentation module 22 includes the following sub-modules:
a first segmentation sub-module 221, configured to segment the DGA domain name into a plurality of domain name labels according to the separator "." symbol;
a selecting submodule 222, configured to select, from the unary model data of the corpus, a character string whose occurrence frequency is higher than a set threshold frequency to form a word frequency table;
a second segmentation submodule 223, configured to segment each domain name label into elements of multiple word granularities according to the word frequency table by using a word segmentation method based on word frequency statistics;
a third segmentation submodule 224, configured to segment those word-granularity elements that are not included in the word frequency table into elements of character granularity.
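The four sub-modules above can be sketched as follows. This is an illustration under stated assumptions, not the segmentation algorithm of the application: the word segmentation step is implemented here as a dynamic program that maximizes the sum of log word frequencies, unknown pieces are scored with an assumed per-character penalty, domain names are lower-cased first, and the toy counts in the usage example are invented.

```python
import math

def build_word_table(unigram_counts, min_count):
    """Keep only strings whose corpus frequency is higher than the threshold."""
    return {w: c for w, c in unigram_counts.items() if c > min_count}

def segment_label(label, word_freq, max_word_len=20):
    """Split one domain name label into pieces by maximizing the summed log frequency."""
    total = sum(word_freq.values())

    def score(piece):
        if piece in word_freq:
            return math.log(word_freq[piece] / total)
        # Assumed penalty: unknown pieces cost more than any known word of the same span.
        return len(piece) * math.log(1.0 / (10 * total))

    n = len(label)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n  # (best score, start of last piece)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            cand = best[start][0] + score(label[start:end])
            if cand > best[end][0]:
                best[end] = (cand, start)
    pieces, end = [], n
    while end > 0:  # walk the back-pointers from the end of the label
        start = best[end][1]
        pieces.append(label[start:end])
        end = start
    pieces.reverse()
    return pieces

def mixed_granularity_split(domain, word_freq):
    """Return word-granularity elements, falling back to characters for unknown pieces."""
    elements = []
    for label in domain.lower().split("."):
        for piece in segment_label(label, word_freq):
            if piece in word_freq:
                elements.append(piece)
            else:
                elements.extend(piece)  # character granularity
    return elements
```

For example, with the toy table `build_word_table({"free": 50, "movie": 40, "com": 100, "xq": 1}, 5)`, the call `mixed_granularity_split("FreeMovieXQ.com", table)` returns `["free", "movie", "x", "q", "com"]`: the dictionary words stay whole and the random-looking tail is broken into characters.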
Fig. 8 is a block diagram illustrating a structure of a computing module according to an exemplary embodiment, where the computing module 23 may include the following sub-modules:
the calculating submodule 231 is configured to perform hash calculation on each element to obtain a corresponding integer code;
a padding sub-module 232, configured to pad a number of integers 0 at the end of the integer code, so that the integer code reaches a certain predetermined length.
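The hash calculation and padding steps can be sketched as below. The choice of SHA-256, the bucket count, the fixed length of 16 and the truncation of overlong inputs are assumptions made for illustration; the text only specifies that each element is hashed to an integer and that zeros are appended until a predetermined length is reached.

```python
import hashlib

def hash_encode(elements, num_buckets=100_000, max_len=16):
    """Map each element to an integer in [1, num_buckets - 1], then pad with 0."""
    codes = []
    for el in elements[:max_len]:  # assumed: truncate inputs longer than max_len
        digest = hashlib.sha256(el.encode("utf-8")).digest()
        # Reserve 0 for padding by shifting every hash bucket up by one.
        codes.append(int.from_bytes(digest[:8], "big") % (num_buckets - 1) + 1)
    return codes + [0] * (max_len - len(codes))
```

Hashing removes the need to store a vocabulary that maps every word and character to an index, at the cost of occasional collisions between elements that fall into the same bucket. A stable hash such as SHA-256 is used here instead of Python's built-in `hash`, whose string results vary between interpreter runs.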
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the DGA domain name detection and classification method described above.
Accordingly, the present application also provides a computer readable storage medium having stored thereon computer instructions, wherein the instructions, when executed by a processor, implement a DGA domain name detection classification method as described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A DGA domain name detection and classification method is characterized by comprising the following steps:
obtaining a DGA domain name to be classified;
adopting a domain name segmentation method with mixed granularity to segment the DGA domain name into elements with character granularity or word granularity;
performing hash calculation on each element to obtain a corresponding integer code;
and inputting the integer code into a DGA domain name detection classification network based on multilayer void causal convolution to obtain a classification result.
2. The DGA domain name detection and classification method according to claim 1, wherein the DGA domain name is segmented into elements of character granularity or word granularity by adopting a mixed-granularity domain name segmentation method, comprising:
segmenting the DGA domain name into a plurality of domain name labels according to the separator "." symbol;
selecting character strings with the occurrence frequency higher than a set threshold frequency from the unitary model data of the corpus to form a word frequency table;
dividing each domain name label into a plurality of word granularity elements according to the word frequency table by adopting a word division method based on word frequency statistics;
and dividing elements which are not contained in the word frequency table in the word granularity elements into elements with character granularity.
3. The DGA domain name detection and classification method according to claim 2, wherein the corpus is selected from one of Google Web 1T 5-Grams, the British National Corpus, and the Corpus of Contemporary American English.
4. The DGA domain name detection and classification method according to claim 1, wherein performing hash calculation for each element to obtain a corresponding integer code, comprises:
performing hash calculation on each element to obtain a corresponding integer code;
and filling a plurality of integers 0 at the end of the integer code to enable the integer code to reach a certain preset length.
5. The DGA domain name detection and classification method according to claim 1, wherein the DGA domain name detection and classification network based on the multi-layer hole causal convolution comprises an embedding layer, a hole causal convolution layer, a layer normalization layer and a fully connected layer which are connected in sequence; wherein
the embedding layer is used for mapping the integer codes into word vector representation forms;
the hole causal convolution layer is used for extracting multi-scale features from the word vector representation form;
the layer normalization layer is used for adjusting the features so that they conform to a normal distribution;
and the fully connected layer is used for calculating a classification result according to the features conforming to the normal distribution.
6. The DGA domain name detection and classification method according to claim 1, wherein when the DGA domain name detection and classification network based on multi-layer hole causal convolution is used for binary classification, the classification result is a DGA domain name or a non-DGA domain name; when the DGA domain name detection classification network based on the multi-layer hole causal convolution is used for multi-classification, the classification result is a label of a specific DGA domain name family.
7. A DGA domain name detection and classification device is characterized by comprising:
the acquisition module is used for acquiring the DGA domain names to be classified;
the segmentation module is used for segmenting the DGA domain name into elements of character granularity or word granularity by adopting a domain name segmentation method of mixed granularity;
the computing module is used for carrying out Hash computation on each element to obtain a corresponding integer code;
and the classification module is used for inputting the integer codes into a DGA domain name detection classification network based on multilayer void causal convolution to obtain classification results.
8. The DGA domain name detection and classification device according to claim 7, wherein the segmentation module comprises:
a first segmentation submodule, configured to segment the DGA domain name into a plurality of domain name labels according to the separator "." symbol;
the selection submodule is used for selecting character strings with the occurrence frequency higher than the set threshold frequency from the unitary model data of the corpus to form a word frequency table;
the second segmentation submodule is used for segmenting each domain name label into elements of a plurality of word granularities according to the word frequency table by adopting a word segmentation method based on word frequency statistics;
and the third segmentation submodule is used for segmenting the elements which are not contained in the word frequency table in the word granularity elements into the elements with character granularity.
9. The DGA domain name detection and classification device according to claim 7, wherein the calculation module comprises:
the first calculation submodule is used for carrying out Hash calculation on each element to obtain a corresponding integer code;
and the filling submodule is used for filling a plurality of integers 0 at the end of the integer code so as to enable the integer code to reach a certain preset length.
10. The DGA domain name detection and classification device according to claim 7, wherein the DGA domain name detection and classification network based on the multi-layer hole causal convolution comprises an embedding layer, a hole causal convolution layer, a layer normalization layer and a fully connected layer which are connected in sequence; wherein
the embedding layer is used for mapping the integer codes into word vector representation forms;
the hole causal convolution layer is used for extracting multi-scale features from the word vector representation form;
the layer normalization layer is used for adjusting the features so that they conform to a normal distribution;
and the fully connected layer is used for calculating a classification result according to the features conforming to the normal distribution.
CN202110124333.5A 2021-01-29 2021-01-29 DGA domain name detection and classification method and device Pending CN112953914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110124333.5A CN112953914A (en) 2021-01-29 2021-01-29 DGA domain name detection and classification method and device

Publications (1)

Publication Number Publication Date
CN112953914A true CN112953914A (en) 2021-06-11

Family

ID=76239307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110124333.5A Pending CN112953914A (en) 2021-01-29 2021-01-29 DGA domain name detection and classification method and device

Country Status (1)

Country Link
CN (1) CN112953914A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116318845A (en) * 2023-02-09 2023-06-23 国家计算机网络与信息安全管理中心甘肃分中心 DGA domain name detection method under unbalanced proportion condition of positive and negative samples

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450845A (en) * 2018-09-18 2019-03-08 浙江大学 A kind of algorithm generation malice domain name detection method based on deep neural network
CN110807098A (en) * 2019-09-24 2020-02-18 武汉智美互联科技有限公司 DGA domain name detection method based on BiRNN deep learning
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN
CN112073550A (en) * 2020-08-26 2020-12-11 重庆理工大学 DGA domain name detection method fusing character-level sliding window and depth residual error network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAOFANG ZHOU,ETC: "CNN-based DGA Detection with High Coverage", 《2019 IEEE INTERNATIONAL CONFERENCE ON INTELLIGENCE AND SECURITY INFORMATICS (ISI)》 *
韩建胜,等: "基于双向时间深度卷积网络的中文文本情感分类", 《计算机应用与软件》 *

Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN111628970B (en) DGA type botnet detection method, medium and electronic equipment
Yang et al. Detecting stealthy domain generation algorithms using heterogeneous deep neural network framework
KR102069621B1 (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN112948578B (en) DGA domain name open set classification method, device, electronic equipment and medium
Hegde et al. Aspect based feature extraction and sentiment classification of review data sets using Incremental machine learning algorithm
Huang et al. Large-scale heterogeneous feature embedding
CN107145516A (en) A kind of Text Clustering Method and system
US11914641B2 (en) Text to color palette generator
CN112651025A (en) Webshell detection method based on character-level embedded code
Sharma et al. A new hardware Trojan detection technique using class weighted XGBoost classifier
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
CN113496123A (en) Rumor detection method, rumor detection device, electronic equipment and storage medium
CN114444476B (en) Information processing method, apparatus, and computer-readable storage medium
US10467276B2 (en) Systems and methods for merging electronic data collections
CN115392357A (en) Classification model training and labeled data sample spot inspection method, medium and electronic equipment
CN118250169A (en) Network asset class recommendation method, device and storage medium
CN112953914A (en) DGA domain name detection and classification method and device
Xing et al. Mining semantic information in rumor detection via a deep visual perception based recurrent neural networks
Santacruz et al. Learning the sub-optimal graph edit distance edit costs based on an embedded model
Liu et al. Redundancy reduction based node classification with attribute augmentation
CN111930883A (en) Text clustering method and device, electronic equipment and computer storage medium
CN111738226A (en) Text recognition method and device based on CNN (convolutional neural network) and RCNN (recursive neural network) models
CN115422000A (en) Abnormal log processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210611