CN113746952B - DGA domain name detection method and device, electronic equipment and computer storage medium - Google Patents

DGA domain name detection method and device, electronic equipment and computer storage medium

Info

Publication number
CN113746952B
Authority
CN
China
Prior art keywords
domain name
detected
hamming distance
simhash
average hamming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111074752.9A
Other languages
Chinese (zh)
Other versions
CN113746952A (en)
Inventor
张羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202111074752.9A priority Critical patent/CN113746952B/en
Publication of CN113746952A publication Critical patent/CN113746952A/en
Application granted granted Critical
Publication of CN113746952B publication Critical patent/CN113746952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a DGA domain name detection method and device, electronic equipment, and a computer storage medium. A network data packet in pcap format is acquired, and the DNS request packet in the network data packet and the request domain name in the DNS request packet are extracted. The request domain name is filtered, and the resulting unknown domain name is taken as the domain name to be detected. The domain name to be detected is checked with a Simhash algorithm to distinguish legal domain names from malicious domain names. The malicious domain names are then clustered by family to obtain the DGA domain names of each family. Because the request domain names are filtered first and only the remaining unknown domain names are passed to Simhash detection and family clustering, DGA domain name detection efficiency is improved and the false alarm rate of DGA domain name detection is reduced.

Description

DGA domain name detection method and device, electronic equipment and computer storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a DGA domain name detection method, a DGA domain name detection device, an electronic device, and a computer storage medium.
Background
DGA stands for domain name generation algorithm, a method that generates a series of pseudo-random malicious domain names from specific seed characters combined with an encryption algorithm. Pseudo-random means that the string sequence appears random but, because its structure is predetermined, it can be generated repeatedly and reproduced. Malware can use DGA domain names to make an infected host establish a network connection with the attacker's command and control (C2) server, so that the host receives commands from the C2 server and launches DNS attacks on victim hosts.
In the prior art, hidden attack behavior can be discovered through DGA domain name detection, and most DGA domain name detection methods currently in use are based on machine learning and deep learning. However, existing machine-learning-based methods depend on the number of training samples and require a large number of samples for training. Because the number of domain names differs greatly between DGA families, the DGA sample data is unevenly distributed, which causes the detection rate to vary widely across DGA families. In addition, existing DGA detection methods have difficulty identifying domain names with high randomness and have a high false alarm rate.
In summary, existing DGA domain name detection methods suffer from low detection efficiency and a high false alarm rate.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a DGA domain name detection method, apparatus, electronic device, and computer storage medium, so as to solve the problems of low detection efficiency and high false alarm rate in existing DGA domain name detection.
In order to solve the above problems, the embodiment of the present invention provides the following technical solutions:
the first aspect of the embodiment of the invention discloses a DGA domain name detection method, which comprises the following steps:
acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet;
filtering the request domain name, and taking the obtained unknown domain name as a domain name to be detected;
detecting the domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected;
and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family.
Optionally, the filtering the request domain name, taking the obtained unknown domain name as the domain name to be detected includes:
performing de-duplication treatment on the request domain name to obtain a de-duplicated request domain name;
Removing the request domain name containing special characters from the de-duplicated request domain name to obtain an effective domain name;
and filtering the effective domain name by using a black-and-white list, filtering known domain names in the black-and-white list to obtain unknown domain names, and taking the unknown domain names as domain names to be detected.
Optionally, the detecting the domain name to be detected by using Simhash algorithm, distinguishing a legal domain name and a malicious domain name in the domain name to be detected, includes:
processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value;
carrying out average Hamming distance calculation on each Simhash value and a blacklist Simhash sample library and a whitelist Simhash sample library respectively to obtain blacklist average Hamming distance and whitelist average Hamming distance corresponding to each domain name to be detected, wherein domain names in the blacklist Simhash sample library and the whitelist Simhash sample library are processed in advance by using a Simhash algorithm and converted into corresponding Simhash values;
comparing the average Hamming distance of the blacklist and the average Hamming distance of the whitelist for each domain name to be detected;
if the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, determining that the domain name to be detected is a legal domain name;
if the whitelist average Hamming distance is larger than the blacklist average Hamming distance, determining that the domain name to be detected is a malicious domain name;
if the whitelist average Hamming distance is equal to the blacklist average Hamming distance, judging whether that distance is smaller than a preset threshold;
if the distance is smaller than the threshold, determining that the domain name to be detected is a legal domain name;
and if the distance is greater than or equal to the threshold, determining that the domain name to be detected is a malicious domain name.
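The comparison rule above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `classify`, the string labels, and the way the threshold is passed in are hypothetical and do not appear in the patent text.

```python
def classify(white_avg: float, black_avg: float, threshold: float) -> str:
    """Apply the whitelist/blacklist average Hamming distance comparison.

    Hypothetical helper: returns "legal" or "malicious" for one domain name.
    """
    if white_avg < black_avg:
        return "legal"      # closer to the whitelist samples
    if white_avg > black_avg:
        return "malicious"  # closer to the blacklist samples
    # Tie: both averages are equal, so compare that value with the threshold.
    return "legal" if white_avg < threshold else "malicious"


print(classify(20.0, 30.0, 25.0))
print(classify(25.0, 25.0, 25.0))
```

The tie branch only needs one of the two averages because they are equal at that point.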
Optionally, the method further comprises:
performing level division on the domain names in the black list and the white list according to the domain name levels to obtain domain names with different levels, wherein each level is provided with a corresponding index;
processing the domain name in each level by using a Simhash algorithm to obtain a domain name converted into a corresponding Simhash value;
the step of calculating the average Hamming distance between each Simhash value and the blacklist Simhash sample library and the whitelist Simhash sample library to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected, comprises the following steps:
determining the level of the domain name to be detected corresponding to each Simhash value, and carrying out average Hamming distance calculation on the Simhash value and the Simhash value with the same level in the blacklist Simhash sample library and the whitelist Simhash sample library based on indexes to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected.
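One possible way to organize such a level-indexed sample library is sketched below. It assumes that the "level" of a domain name is its number of dot-separated labels and that the Simhash routine is supplied by the caller; the names `domain_level` and `build_level_index` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def domain_level(domain: str) -> int:
    """Assumed definition of level: the count of dot-separated labels."""
    return len(domain.strip(".").split("."))


def build_level_index(
    domains: Iterable[str], fingerprint: Callable[[str], int]
) -> Dict[int, List[int]]:
    """Group fingerprints by domain level so lookups only touch one level."""
    index: Dict[int, List[int]] = defaultdict(list)
    for domain in domains:
        index[domain_level(domain)].append(fingerprint(domain))
    return dict(index)


# Usage with a stand-in fingerprint (string length) instead of a real Simhash.
index = build_level_index(["baidu.com", "www.baidu.com", "jd.com"], len)
print(index)
```

With this layout, a domain name to be detected is compared only against the list stored under its own level, which keeps the average Hamming distance calculation away from samples of other levels.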
Optionally, the performing family clustering on the malicious domain name to obtain DGA domain names of each family includes:
extracting and fusing characteristics of each malicious domain name to generate a multidimensional characteristic vector;
performing dimension reduction processing on each multidimensional feature vector to obtain dimension reduction feature vectors;
and clustering all the dimension reduction feature vectors by using DBSCAN to obtain the DGA domain name of each family.
The second aspect of the embodiment of the invention discloses a DGA domain name detection device, which comprises:
the extraction module is used for acquiring a network data packet in a pcap format and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet;
the filtering module is used for filtering the request domain name and taking the obtained unknown domain name as a domain name to be detected;
the Simhash module is used for detecting the domain name to be detected by using a Simhash algorithm and distinguishing a legal domain name and a malicious domain name in the domain name to be detected;
and the clustering module is used for carrying out family clustering on the malicious domain names to obtain DGA domain names of each family.
Optionally, the Simhash module includes:
the Simhash calculation unit is used for processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value;
The distance calculation unit is used for carrying out average Hamming distance calculation on each Simhash value and a blacklist Simhash sample library and a whitelist Simhash sample library respectively to obtain blacklist average Hamming distance and whitelist average Hamming distance corresponding to each domain name to be detected, and domain names in the blacklist Simhash sample library and the whitelist Simhash sample library are processed in advance by using a Simhash algorithm and converted into corresponding Simhash values;
the determining unit is used for comparing the blacklist average Hamming distance and the whitelist average Hamming distance for each domain name to be detected; if the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, determining that the domain name to be detected is a legal domain name; if the whitelist average Hamming distance is larger than the blacklist average Hamming distance, determining that the domain name to be detected is a malicious domain name; if the whitelist average Hamming distance is equal to the blacklist average Hamming distance, judging whether that distance is smaller than a preset threshold; if the distance is smaller than the threshold, determining that the domain name to be detected is a legal domain name; and if the distance is greater than or equal to the threshold, determining that the domain name to be detected is a malicious domain name.
Optionally, the clustering module includes:
the feature processing unit is used for carrying out feature extraction and fusion on each malicious domain name to generate a multidimensional feature vector;
the dimension reduction unit is used for carrying out dimension reduction processing on each multi-dimension feature vector to obtain dimension reduction feature vectors;
and the clustering unit is used for clustering all the dimension reduction feature vectors by using DBSCAN to obtain the DGA domain name of each family.
The third aspect of the embodiment of the invention discloses an electronic device, which comprises a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to implement the DGA domain name detection method disclosed in the first aspect of the embodiment of the present invention when invoking and executing the computer program stored in the memory.
The fourth aspect of the embodiment of the invention discloses a computer storage medium, in which computer executable instructions are stored, and when the computer executable instructions are loaded and executed by a processor, the DGA domain name detection method disclosed in the first aspect of the embodiment of the invention is implemented.
Based on the method, the device, the electronic equipment and the computer storage medium for detecting the DGA domain name provided by the embodiment of the invention, the method comprises the following steps: acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet; filtering the request domain name, and taking the obtained unknown domain name as a domain name to be detected; detecting the domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected; and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family. In the scheme, the request domain name is filtered firstly, the obtained unknown domain name is used as a domain name to be detected, the domain name to be detected is detected through a Simhash algorithm, and family clustering is carried out on the distinguished malicious domain names in the domain name to be detected, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a DGA domain name detection method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of data preprocessing for a request domain name according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart for distinguishing a legal domain name from a malicious domain name in a domain name to be detected according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart of distinguishing legal domain names from malicious domain names in the domain names to be detected according to another embodiment of the present invention;
Fig. 5 is a schematic flow chart of DGA domain name detection based on the Simhash algorithm according to an embodiment of the present invention;
Fig. 6 is a schematic flow chart of family clustering according to an embodiment of the present invention;
Fig. 7 is a schematic flow chart of DGA family clustering based on the DBSCAN algorithm according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a DGA domain name detection device according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In order to facilitate understanding of the technical scheme of the present invention, technical terms appearing in the present invention are described:
pcap traffic packet: the pcap is a common datagram storage format, and the pcap traffic packet is a network packet in the pcap format.
DNS: the domain name system is a service of the internet, and can be used as a distributed database for mapping domain names and IP addresses with each other, so that a user can access the internet more conveniently.
Simhash: a locality-sensitive hashing algorithm, first proposed by Charikar et al. in 2002 and originally aimed at de-duplicating web pages at the scale of hundreds of millions; it consists of five steps: word segmentation, hashing, weighting, merging, and dimension reduction.
DBSCAN: a density-based clustering algorithm that defines a cluster as a maximal set of density-connected points; it can divide regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a noisy spatial database.
DGA: the domain name generation algorithm is a method for generating a series of pseudo-random malicious domain names by combining a specific seed character with an encryption algorithm.
n-gram: is an algorithm based on a statistical language model. The basic idea is to perform a sliding window operation of size n on the content in the text according to bytes, forming a sequence of byte fragments of length n.
Hamming distance: the number of bit positions at which the corresponding bits of two codewords differ.
Classification: is a branch of supervised learning whose purpose is to predict classification labels for new samples based on past observations.
Clustering: a data set is partitioned into different classes or clusters according to a certain criterion, such as distance, so that the similarity of data objects within the same cluster is as large as possible, while the variability of data objects not in the same cluster is as large as possible.
Entropy: information measure of random event uncertainty.
t-SNE: a dimension reduction method belonging to manifold learning, regarded here as the best dimension reduction method currently available.
Dimension reduction: the high-dimensional features are reduced to low-dimensional features.
min-max normalization: the value of the feature is scaled to the [0,1] interval using the maximum and minimum values of the feature.
z-score normalization: the data is rescaled using the mean and standard deviation of the feature, so that the result has zero mean and unit standard deviation.
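Several of the terms defined above (n-gram, entropy, min-max normalization, z-score normalization) are simple enough to illustrate directly. The following is a minimal sketch, not the feature extraction used by the invention: the function names are hypothetical, and n-grams are taken over characters, which coincides with bytes only for ASCII domain names.

```python
import math
from collections import Counter
from typing import List, Sequence


def char_ngrams(text: str, n: int) -> List[str]:
    """Slide a window of size n over the text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


print(char_ngrams("baidu", 2))
print(shannon_entropy("aabb"))
print(min_max([2, 4, 6]))
```

A string of evenly mixed characters has higher entropy than a string dominated by one character, which is why entropy is a common indicator of randomness in generated domain names.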
As known from the background art, the existing DGA domain name detection method can cause the problems of low DGA domain name detection efficiency and high false alarm rate.
Therefore, in the scheme, the method, the device, the electronic equipment and the computer storage medium for detecting the DGA domain name filter the request domain name, take the obtained unknown domain name as the domain name to be detected, detect the domain name to be detected through a Simhash algorithm, and perform family clustering on the malicious domain name in the distinguished domain name to be detected, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
As shown in fig. 1, a flow chart of a DGA domain name detection method according to an embodiment of the present invention is shown, and the method mainly includes the following steps:
step S101: and acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet.
In the specific implementation process of step S101, a network data packet in a pcap format is obtained, that is, a pcap traffic packet is obtained, a DNS request packet is extracted from the pcap traffic packet, and then a request domain name is extracted from the DNS request packet.
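As an illustration of the last part of this step, the sketch below pulls the request domain name out of a single DNS message using only the Python standard library. It assumes that some packet reader has already taken the DNS payload (the bytes after the UDP header) out of the pcap file; reading the pcap container itself is not shown, and the function name `extract_query_name` is hypothetical.

```python
import struct
from typing import Optional


def extract_query_name(dns_payload: bytes) -> Optional[str]:
    """Return the first question name of a DNS request, or None.

    dns_payload is assumed to be the raw DNS message without UDP/IP headers.
    """
    if len(dns_payload) < 12:
        return None  # shorter than the fixed 12-byte DNS header
    _ident, flags, qdcount = struct.unpack("!HHH", dns_payload[:6])
    if flags & 0x8000:
        return None  # QR bit set: a response, not a request
    if qdcount == 0:
        return None  # no question section
    labels = []
    pos = 12
    while pos < len(dns_payload):
        length = dns_payload[pos]
        if length == 0:
            return ".".join(labels).lower()  # zero-length label ends the name
        if length & 0xC0:
            return None  # compression pointer: not handled in this sketch
        pos += 1
        label = dns_payload[pos:pos + length]
        if len(label) < length:
            return None  # truncated message
        labels.append(label.decode("ascii", errors="replace"))
        pos += length
    return None


# Build a request for www.baidu.com (type A, class IN) and parse it back.
query = (
    struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    + b"\x03www\x05baidu\x03com\x00"
    + struct.pack("!HH", 1, 1)
)
print(extract_query_name(query))
```

Checking the QR bit keeps only request packets, which matches the step's focus on DNS requests rather than responses.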
Step S102: filtering the request domain name, and taking the obtained unknown domain name as the domain name to be detected.
In the specific implementation process of step S102, the data preprocessing is performed on the request domain name, specifically: and carrying out de-duplication processing, special character processing and filtering processing on the request domain name to obtain an unknown domain name, and taking the unknown domain name as the domain name to be detected.
It should be noted that de-duplication means that, of several identical data items, only one is retained.
Special characters are all characters other than the 26 English letters, the digits 0 to 9, and the "-" and "." characters.
The process of removing special characters refers to eliminating the request domain names that contain special characters.
The filtering process refers to filtering the request domain names using a blacklist and a whitelist.
The whitelist comes from the Alexa top website domain names, and the blacklist comes from the 360 public threat intelligence DGA domain names.
Step S103: detecting the domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected.
In the specific implementation process of step S103, the domain names to be detected are input into the Simhash algorithm for domain name detection, which separates them into legal domain names and malicious domain names.
Step S104: and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family.
Optionally, the malicious domain names may be clustered based on the DBSCAN algorithm, but the clustering is not limited thereto.
In the specific implementation process of step S104, family clustering is performed on the malicious domain names by using the DBSCAN algorithm, so as to obtain DGA domain names of each family.
According to the DGA domain name detection method provided by the embodiment of the invention, the DNS request packet in the network data packet and the request domain name in the DNS request packet are extracted by acquiring the network data packet in the pcap format; filtering the request domain name, and taking the obtained unknown domain name as the domain name to be detected; detecting a domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected; and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family. In the scheme, the request domain name is filtered firstly, the obtained unknown domain name is used as a domain name to be detected, the domain name to be detected is detected through a Simhash algorithm, and family clustering is carried out on the distinguished malicious domain names in the domain name to be detected, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
Based on the DGA domain name detection method provided by the embodiment of the present invention, step S102 is executed to filter the requested domain name, and the obtained unknown domain name is used as the domain name to be detected. As shown in fig. 2, a flow chart of data preprocessing for a request domain name according to an embodiment of the present invention mainly includes the following steps:
step S201: and carrying out de-duplication treatment on the request domain name to obtain the de-duplicated request domain name.
In the specific implementation process of step S201, the extracted request domain name is subjected to de-duplication processing, so as to obtain a de-duplicated request domain name.
Step S202: and removing the request domain name containing special characters from the de-duplicated request domain name to obtain an effective domain name.
In the specific implementation step S202, the request domain name containing the special character in the request domain name after duplication removal is removed, that is, the invalid domain name containing the special character in the request domain name after duplication removal is removed, so as to obtain the valid domain name.
Step S203: and filtering the effective domain name by using the black-and-white list, filtering known domain names in the black-and-white list to obtain unknown domain names, and taking the unknown domain names as the domain names to be detected.
In the specific implementation step S203, the obtained effective domain names are filtered by using the blacklist and the whitelist, known domain names existing in the blacklist and the whitelist are filtered, the remaining unknown domain names are obtained, and the unknown domain names are used as the domain names to be detected.
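Steps S201 to S203 can be sketched together as one pass over the request domain names. This is a hedged illustration only: it assumes that ordinary characters are letters, digits, "-" and ".", that names are compared case-insensitively, and that both lists are already stored in lower case. The function name `select_unknown_domains` is hypothetical.

```python
import re
from typing import Iterable, List, Set

# Assumption: anything outside letters, digits, "-" and "." is a special character.
_ORDINARY = re.compile(r"[A-Za-z0-9.-]+")


def select_unknown_domains(
    requested: Iterable[str], whitelist: Set[str], blacklist: Set[str]
) -> List[str]:
    """De-duplicate, drop names with special characters, drop known names."""
    seen = set()
    unknown = []
    for name in requested:
        name = name.lower().rstrip(".")
        if name in seen:
            continue  # S201: keep only one copy of identical names
        seen.add(name)
        if not _ORDINARY.fullmatch(name):
            continue  # S202: contains a special character (or is empty)
        if name in whitelist or name in blacklist:
            continue  # S203: already known, no detection needed
        unknown.append(name)
    return unknown


requests = ["www.baidu.com", "WWW.BAIDU.COM", "bad_name!.com",
            "xkqjzvpw.net", "evil-dga.biz", "xkqjzvpw.net"]
print(select_unknown_domains(requests, {"www.baidu.com"}, {"evil-dga.biz"}))
```

Only names that survive all three checks are passed on as domain names to be detected, so the later Simhash stage never sees names whose status is already known.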
According to the DGA domain name detection method provided by the embodiment of the invention, the extracted request domain name is subjected to de-duplication processing, special character removing processing and filtering processing, the obtained unknown domain name is used as the domain name to be detected, and a basis is provided for the subsequent domain name detection, so that the DGA domain name detection efficiency is improved and the false alarm rate of the DGA domain name detection is reduced.
Based on the DGA domain name detection method provided by the embodiment of the present invention, step S103 is executed to detect the domain name to be detected by using Simhash algorithm, and distinguish the legal domain name from the malicious domain name in the domain name to be detected. As shown in fig. 3, a flow chart for distinguishing a legal domain name from a malicious domain name in a domain name to be detected according to an embodiment of the present invention mainly includes the following steps:
step S301: and processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value.
In the specific implementation process of step S301, inputting each domain name to be detected into a Simhash algorithm for processing, namely performing word segmentation, hashing, weighting, merging and dimension reduction processing, to obtain each domain name to be detected represented by a Simhash value, where the specific processing process includes the following steps:
Step S11: performing word segmentation on each domain name to be detected, that is, splitting each domain name to be detected at the "." separator to obtain the word segments of each domain name to be detected.
For example: if a domain name to be detected is www.baidu.com, splitting it at "." yields three word segments: www, baidu, and com.
Step S12: and carrying out weighting treatment on each word of each domain name to be detected, namely setting different weights for each word of each domain name to be detected.
It should be noted that the weights of the word segments of the same domain name to be detected sum to 1, and that the weight of the top-level domain segment is greater than that of the second-level domain segment, which in turn is greater than that of the third-level domain segment.
Step S13: calculating the hash value of each word of each domain name to be detected by using a common hash function to obtain a bit sequence with the length of L, which corresponds to each word of each domain name to be detected, and taking 64 bits from the bit sequence with the length of L for subsequent processing.
Wherein the value of each bit is 0 or 1.
Step S14: and carrying out weighting treatment on each bit of each word of each domain name to be detected to obtain a final weighting vector of each word.
When the bit value is 0, the bit is represented as a negative weight, and when the bit value is 1, the bit is represented as a positive weight.
For example: if the hash value of a certain word segment is 101000 and its weight is 0.1, the weighted vector of that word segment is [0.1, -0.1, 0.1, -0.1, -0.1, -0.1].
Step S15: accumulating the weighted vectors of the words of each domain name to be detected, position by position, to obtain a 64-bit sequence corresponding to each domain name to be detected.
For example: if a domain name to be detected has two words, the weighted vector of word 1 being [0.1, -0.1, 0.1, -0.1, -0.1, -0.1] and the weighted vector of word 2 being [0.9, 0.9, 0.9, -0.9, -0.9, -0.9], the sequence obtained by accumulating word 1 and word 2 is [1, 0.8, 1, -1, -1, -1].
Step S16: performing dimension reduction on the obtained 64-bit sequence, specifically: when the value at any position of the 64-bit sequence is greater than 0, that position is replaced with 1, and when the value is less than or equal to 0, that position is replaced with 0, so as to obtain the 64-bit Simhash value of each domain name to be detected, that is, each domain name to be detected represented by a Simhash value.
Taking the example in step S15: the sequence obtained by accumulating the words of the domain name to be detected is [1, 0.8, 1, -1, -1, -1]; applying the dimension reduction rule to this sequence gives the Simhash value 111000 for the domain name to be detected.
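Steps S11 to S16 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the text only refers to "a common hash function", so using MD5 and keeping its first 64 bits is an assumption, as is the linear weighting in `segment_weights`, which merely satisfies the two stated constraints (the weights sum to 1, and the top-level domain word weighs more than the 2-level word, which weighs more than the 3-level word). All function names are hypothetical.

```python
import hashlib

def segment_weights(n):
    """Hypothetical weights: sum to 1 and increase toward the top-level domain (last word)."""
    total = n * (n + 1) / 2
    return [(i + 1) / total for i in range(n)]

def hash_bits(word, nbits=64):
    """First `nbits` bits of the MD5 digest of `word` as a list of 0/1 (MD5 is an assumption)."""
    value = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest(), "big") >> (128 - nbits)
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]

def weighted_vector(bits, weight):
    """Step S14: a 1 bit contributes +weight, a 0 bit contributes -weight."""
    return [weight if b else -weight for b in bits]

def accumulate(vectors):
    """Step S15: position-wise sum of the weighted vectors."""
    return [sum(column) for column in zip(*vectors)]

def reduce_bits(acc):
    """Step S16: positive positions become 1, all others become 0."""
    return [1 if a > 0 else 0 for a in acc]

def simhash(domain, nbits=64):
    # Step S11: split at "."; lower-casing is an added assumption (DNS names are case-insensitive).
    words = [w for w in domain.lower().split(".") if w]
    weights = segment_weights(len(words))                       # step S12
    vectors = [weighted_vector(hash_bits(w, nbits), wt)         # steps S13 and S14
               for w, wt in zip(words, weights)]
    bits = reduce_bits(accumulate(vectors)) if vectors else [0] * nbits
    return int("".join(map(str, bits)), 2)                      # Simhash value as an integer
```

Applied to the six-bit example above, `reduce_bits(accumulate([...]))` over the two weighted vectors gives `[1, 1, 1, 0, 0, 0]`, that is, 111000.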
Step S302: and respectively carrying out average Hamming distance calculation on each Simhash value and the blacklist Simhash sample library and the whitelist Simhash sample library to obtain blacklist average Hamming distance and whitelist average Hamming distance corresponding to each domain name to be detected.
In step S302, the blacklist Simhash sample library and domain names in the whitelist Simhash sample library are processed in advance by using a Simhash algorithm, and converted into corresponding Simhash values.
In the specific implementation process of step S302, simhash values of domain names in a blacklist and a whitelist are calculated respectively to obtain a blacklist Simhash sample library corresponding to the blacklist and a whitelist Simhash sample library corresponding to the whitelist, and then the Simhash value of each domain name to be detected is calculated with the blacklist Simhash sample library and the whitelist Simhash sample library respectively to obtain a blacklist average hamming distance and a whitelist average hamming distance corresponding to each domain name to be detected.
It should be noted that, the smaller the hamming distance, the greater the similarity between the domain name to be detected and the domain name in the black list or the white list.
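The average Hamming distance of step S302 can be sketched as below, assuming that Simhash values are held as Python integers and that a sample library is simply a list of such integers; both the representation and the function names are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two Simhash values differ."""
    return bin(a ^ b).count("1")

def average_hamming_distance(value, sample_library):
    """Mean Hamming distance between one Simhash value and every value in a sample library."""
    if not sample_library:
        raise ValueError("sample library must not be empty")
    return sum(hamming_distance(value, s) for s in sample_library) / len(sample_library)
```

Calling `average_hamming_distance` once with the blacklist library and once with the whitelist library yields the two distances compared in the following steps.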
Step S303: and comparing the average Hamming distance of the blacklist and the average Hamming distance of the whitelist for each domain name to be detected.
In the specific implementation of step S303, for each domain name to be detected, the corresponding blacklist average Hamming distance and whitelist average Hamming distance are compared to determine which of the two is smaller, or whether they are equal.
Step S304: whether the white list average hamming distance is equal to the black list average hamming distance is determined, if yes, step S305 is executed, and if no, step S308 is executed.
In the specific implementation of step S304, it is determined whether the whitelist average Hamming distance is equal to the blacklist average Hamming distance; if so, step S305 is executed; if not, meaning that the whitelist average Hamming distance is smaller or larger than the blacklist average Hamming distance, step S308 is executed.
Step S305: and judging whether the white list average Hamming distance or the black list average Hamming distance is smaller than a preset threshold, if so, executing the step S306, and if not, executing the step S307.
It should be noted that step S305 is reached only when the two average Hamming distances are equal, so step S306 is performed as long as the whitelist average Hamming distance, or equivalently the blacklist average Hamming distance, is smaller than the preset threshold.
Step S307 is performed as long as the whitelist average Hamming distance, or equivalently the blacklist average Hamming distance, is greater than or equal to the preset threshold.
In the specific implementation of step S305, it is determined whether the whitelist average Hamming distance or the blacklist average Hamming distance is smaller than the preset threshold; if so, step S306 is executed; if not, meaning that the whitelist average Hamming distance or the blacklist average Hamming distance is greater than or equal to the preset threshold, step S307 is executed.
Step S306: determining the domain name to be detected as a legal domain name.
In the specific implementation of step S306, when the two average Hamming distances are equal and smaller than the preset threshold, or when the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, the domain name to be detected is determined to be a legal domain name.
Step S307: determining the domain name to be detected as a malicious domain name.
In the specific implementation of step S307, when the two average Hamming distances are equal and greater than or equal to the preset threshold, or when the whitelist average Hamming distance is greater than the blacklist average Hamming distance, the domain name to be detected is determined to be a malicious domain name.
Step S308: and judging whether the average Hamming distance of the white list is smaller than the average Hamming distance of the black list, if so, executing the step S306, and if not, executing the step S307.
In the specific implementation step S308, it is determined whether the whitelist average hamming distance is smaller than the blacklist average hamming distance, if yes, step S306 is executed, if no, step S307 is executed, wherein the whitelist average hamming distance is larger than the blacklist average hamming distance.
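The branching of steps S304 to S308 can be condensed into one function. The sketch below assumes the two average distances have already been computed; the function name, the string labels and the threshold value passed by the caller are illustrative, since the text does not give a concrete preset threshold.

```python
def classify(white_avg, black_avg, threshold):
    """Return "legal" or "malicious" following steps S304 to S308."""
    if white_avg == black_avg:                      # step S304: distances are equal
        # Step S305: both distances are the same, so comparing one of them suffices.
        return "legal" if white_avg < threshold else "malicious"
    # Step S308: the list with the smaller average distance decides.
    return "legal" if white_avg < black_avg else "malicious"
```

The threshold is consulted only in the tie case; otherwise the domain name follows whichever sample library it is closer to.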
According to the DGA domain name detection method provided by the embodiment of the invention, the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected are calculated, the blacklist average Hamming distance and the whitelist average Hamming distance are compared, and corresponding operations are performed, so that a basis is provided for subsequent operations, the DGA domain name detection efficiency is improved, and the false alarm rate of DGA domain name detection is reduced.
Based on the DGA domain name detection method provided by the embodiment of the present invention, the embodiment of the present invention provides another way of executing step S103, in which the domain names to be detected are detected by using the Simhash algorithm to distinguish legal domain names from malicious domain names among them. As shown in fig. 4, the method mainly comprises the following steps:
Step S401: and processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value.
The execution principle and process of the above step S401 are the same as those of the step S301 disclosed in fig. 3, and will not be described here again.
Step S402: and respectively carrying out level division on the domain names in the black list and the white list according to the domain name levels to obtain domain names with different levels.
In step S402, each level is provided with a corresponding index.
The domain names of different levels may be a primary domain name, a secondary domain name, a tertiary domain name, but are not limited thereto.
In the specific implementation process of step S402, the domain names in the blacklist and the domain names in the whitelist are respectively classified according to the domain name levels, so as to obtain the domain names with different classification levels.
Step S403: and processing the domain names in each level by using a Simhash algorithm to obtain domain names converted into corresponding Simhash values.
In the specific implementation process of step S403, for the domain name in each level, the domain name in each level is processed by using a Simhash algorithm, that is, the Simhash value of the domain name in each level is calculated, so as to obtain the domain name converted into the corresponding Simhash value.
Step S404: determining the level of the domain name to be detected corresponding to each Simhash value, and respectively carrying out average Hamming distance calculation on the Simhash value and the Simhash value with the same level in the blacklist Simhash sample library and the whitelist Simhash sample library based on indexes to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected.
In the specific implementation process of step S404, determining a level of a domain name to be detected corresponding to each Simhash value, and based on an index corresponding to each level, performing average hamming distance calculation on the Simhash value of each domain name to be detected and Simhash values with the same level in the blacklist Simhash sample library and the whitelist Simhash sample library respectively to obtain a blacklist average hamming distance and a whitelist average hamming distance corresponding to each domain name to be detected.
Step S405: and comparing the average Hamming distance of the blacklist and the average Hamming distance of the whitelist for each domain name to be detected.
Step S406: whether the white list average hamming distance is equal to the black list average hamming distance is determined, if yes, step S407 is performed, and if not, step S410 is performed.
Step S407: and judging whether the average Hamming distance of the white list or the average Hamming distance of the black list is smaller than a preset threshold, if so, executing the step S408, and if not, executing the step S409.
Step S408: and determining the domain name to be detected as a legal domain name.
Step S409: and determining the domain name to be detected as a malicious domain name.
Step S410: and judging whether the average Hamming distance of the white list is smaller than the average Hamming distance of the black list, if so, executing the step S408, and if not, executing the step S409.
The execution principle and process of the above steps S405 to S410 are the same as those of the steps S303 to S308 disclosed in fig. 3, and will be referred to herein without further description.
According to the DGA domain name detection method provided by the embodiment of the invention, the levels of the domain names in the black list and the white list are divided, the level of the domain name to be detected corresponding to each Simhash value is determined, the average Hamming distance of the black list and the average Hamming distance of the white list corresponding to each domain name to be detected are calculated according to the index corresponding to each level, the average Hamming distance of the black list and the average Hamming distance of the white list are compared, and corresponding operations are carried out, so that the domain name to be detected does not need to be compared with all black-white list sample libraries, but the Hamming distance of the domain name corresponding to the level is compared through the index, so that the calculation cost is reduced, the DGA domain name detection efficiency is improved, and the false report rate of the DGA domain name detection is reduced.
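The level index described in steps S402 to S404 can be sketched as a dictionary keyed by domain level. This is only an illustration: the text does not specify how the index is stored, and taking the level to be the number of "."-separated labels is an assumption, as are all names below.

```python
from collections import defaultdict

def domain_level(domain):
    """Assumed definition of level: the number of non-empty labels in the domain name."""
    return len([label for label in domain.split(".") if label])

def build_level_index(simhash_by_domain):
    """Group the Simhash values of a blacklist or whitelist by domain level."""
    index = defaultdict(list)
    for domain, value in simhash_by_domain.items():
        index[domain_level(domain)].append(value)
    return index

def average_distance_same_level(domain, value, index):
    """Average Hamming distance to the samples of the same level, or None if there are none."""
    samples = index.get(domain_level(domain), [])
    if not samples:
        return None
    return sum(bin(value ^ s).count("1") for s in samples) / len(samples)
```

Only the bucket matching the level of the domain name to be detected is scanned, which is where the reduction in computation comes from.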
In order to better understand the DGA domain name detection method provided by the embodiment of the present invention, as shown in fig. 5, a flow diagram of DGA domain name detection based on Simhash algorithm is provided in the embodiment of the present invention.
In fig. 5, first, a pcap traffic packet is acquired, and data in the pcap traffic packet is preprocessed, specifically: extracting a DNS request packet from the pcap flow packet, extracting a request domain name from the DNS request packet, performing duplication removal processing and special character removal processing on the request domain name, filtering the request domain name by using a black-white list to obtain an unknown domain name, and taking the unknown domain name as a domain name to be detected.
Then, detecting the domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected, wherein the method specifically comprises the following steps: and performing word segmentation, hashing, weighting, merging and dimension reduction on each domain name to be detected by using a Simhash algorithm, and calculating the Simhash corresponding to each domain name to be detected to obtain each domain name to be detected represented by a Simhash value.
The average Hamming distance between each Simhash value and the blacklist Simhash sample library, and between each Simhash value and the whitelist Simhash sample library, is calculated to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected.
It is then determined whether the whitelist average Hamming distance is equal to the blacklist average Hamming distance.
If yes, it is judged whether the whitelist average Hamming distance or the blacklist average Hamming distance is smaller than a preset threshold; if so, the domain name to be detected is determined to be a legal domain name, and if not, it is determined to be a malicious domain name.
If not, it is judged whether the whitelist average Hamming distance is smaller than the blacklist average Hamming distance; if so, the domain name to be detected is determined to be a legal domain name, and if not, it is determined to be a malicious domain name.
According to the DGA domain name detection method provided by the embodiment of the invention, the request domain names are filtered, the resulting unknown domain names are taken as the domain names to be detected, and the domain names to be detected are detected by the Simhash algorithm to distinguish the legal domain names from the malicious domain names among them, so that the DGA domain name detection efficiency is improved and the false alarm rate of DGA domain name detection is reduced.
Based on the DGA domain name detection method provided by the embodiment of the present invention, step S104 is executed to perform family clustering on malicious domain names, so as to obtain DGA domain names of each family. As shown in fig. 6, a flow chart for performing family clustering according to an embodiment of the present invention mainly includes the following steps:
Step S601: and extracting and fusing the characteristics of each malicious domain name to generate a multidimensional characteristic vector.
In the specific implementation of step S601, feature extraction and feature fusion are performed on each malicious domain name: the n-gram features, character entropy features, digital features and meaningful-character percentage features of the malicious domain name are extracted, and the extracted features are fused, that is, normalized and standardized, to generate a multidimensional feature vector.
During feature fusion, because the extracted feature values vary over widely different ranges, they need to be scaled to the same interval before being fused.
In the embodiment of the present invention, the normalization may be performed using a min-max function and the standardization may be performed using a z-score, but the embodiment is not limited thereto.
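The two scaling options named above can be sketched as follows for a single feature column. The function names are hypothetical, and returning zeros for a constant column is an assumed convention rather than something stated in the text.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Min-max normalization: map the values of one feature linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score standardization: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Either function is applied per feature across all malicious domain names, so that features such as length and entropy end up on comparable scales before they are concatenated.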
The specific characteristic extraction process comprises the following steps:
step S21: extracting the n-gram characteristics of the malicious domain name.
In the specific implementation of step S21, as noted in the explanation of technical terms above, an n-gram is a sequence of byte fragments of length n; when n=2 the n-gram is called a bigram, and when n=3 it is called a trigram. The bigram and trigram features are used to distinguish how readable a domain name is.
Because the bigrams and trigrams of domain names generated by a DGA algorithm are scattered, and domain names generated by different DGA algorithms differ in readability, bigrams and trigrams are used as features: for the 2-level domain and the 3-level domain of the malicious domain name, the mean, variance and median of the number of occurrences in a corpus of their adjacent two-character and three-character sequences are calculated respectively.
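A minimal sketch of the n-gram statistics of step S21 follows. The corpus is represented here as a plain list of words, which is an assumption (the text does not say which corpus is used), and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pvariance

def ngrams(text, n):
    """All runs of n adjacent characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_corpus_counts(words, n):
    """Count how often each n-gram occurs across the corpus words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts

def ngram_stats(label, corpus_counts, n):
    """Mean, variance and median of the corpus occurrence counts of the label's n-grams."""
    occurrences = [corpus_counts.get(g, 0) for g in ngrams(label, n)]
    if not occurrences:
        return (0.0, 0.0, 0.0)
    return (mean(occurrences), pvariance(occurrences), median(occurrences))
```

Calling `ngram_stats` with n=2 and n=3 on the 2-level and 3-level labels gives the bigram and trigram features; labels made of rare character sequences score low on all three statistics.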
Step S22: and extracting character entropy value characteristics of the malicious domain name.
It should be noted that most domain names generated by DGA algorithms consist of random characters and are therefore disordered, and entropy measures the disorder of a character string: the more disordered the string, the higher its randomness and the larger its entropy; conversely, the more ordered the string, the lower its randomness and the smaller its entropy. Domain names of different families differ in disorder and therefore in entropy. Thus, the entropy feature can quantify the degree of randomness of the domain name strings of different families.
In the specific implementation process of step S22, the level of the malicious domain name is determined, and the character entropy values of the level 2 domain and the level 3 domain of the malicious domain name are respectively obtained.
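The character entropy of step S22 can be illustrated with Shannon entropy over the character frequencies of a label. The text does not name the entropy formula, so Shannon entropy in bits is an assumption, as is the `level_label` helper, which counts levels from the right so that level 1 is the top-level domain.

```python
import math
from collections import Counter

def level_label(domain, level):
    """Label at the given level counted from the right; empty string if the level is absent."""
    labels = [label for label in domain.split(".") if label]
    return labels[-level] if level <= len(labels) else ""

def char_entropy(text):
    """Shannon entropy (in bits) of the character distribution of `text`."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())
```

For a domain name such as www.baidu.com, `char_entropy(level_label(domain, 2))` and `char_entropy(level_label(domain, 3))` give the entropy features of the 2-level and 3-level domains.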
Step S23: extracting the digital characteristics of the malicious domain name.
It should be noted that a domain name generated by a DGA algorithm generally consists either of letters only or of a random string mixing letters and digits, and domain names generated by the same DGA algorithm are highly similar to one another.
In the specific implementation of step S23, based on the high similarity of domain names generated by the same DGA algorithm, the following are extracted respectively: the total length of the malicious domain name, the lengths of its 2-level domain and 3-level domain, the number of domain levels n, and the percentage of digit characters in the 2-level domain and the 3-level domain.
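The digital features of step S23 can be collected as in the sketch below. The dictionary keys and the function name are hypothetical, levels are counted from the right as in the earlier examples, and a missing level is assumed to contribute a length and digit ratio of 0.

```python
def numeric_features(domain):
    """Length-based and digit-based features of one domain name."""
    labels = [label for label in domain.split(".") if label]
    second = labels[-2] if len(labels) >= 2 else ""
    third = labels[-3] if len(labels) >= 3 else ""

    def digit_ratio(text):
        # Fraction of characters that are digits; 0.0 for an empty label.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "total_length": len(domain),
        "len_level2": len(second),
        "len_level3": len(third),
        "num_levels": len(labels),
        "digit_ratio_level2": digit_ratio(second),
        "digit_ratio_level3": digit_ratio(third),
    }
```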
Step S24: and extracting the percentage characteristics of the meaningful characters of the malicious domain name.
It should be noted that, in order to better evade detection, some advanced DGA algorithms generate domain names that resemble manually created ones and are more readable than those produced by other DGA algorithms.
In the specific implementation process of step S24, based on the above situation, the percentages of the significant characters in the 2-level domain and 3-level domain character strings of the malicious domain name are respectively counted, and the percentage of the maximum value of the significant characters is selected.
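The text does not define how "meaningful characters" are identified. One possible reading, sketched below purely as an assumption, is to treat a character as meaningful when it lies inside a substring found in a word list; the `vocabulary` argument, the minimum word length and the function names are all hypothetical.

```python
def meaningful_ratio(label, vocabulary, min_len=3):
    """Fraction of characters of `label` covered by vocabulary words of at least `min_len` characters."""
    if not label:
        return 0.0
    covered = [False] * len(label)
    for start in range(len(label)):
        # Try the longest candidate substring first and stop at the first match.
        for end in range(len(label), start + min_len - 1, -1):
            if label[start:end] in vocabulary:
                for i in range(start, end):
                    covered[i] = True
                break
    return sum(covered) / len(label)

def max_meaningful_ratio(level2, level3, vocabulary):
    """Larger of the meaningful-character percentages of the 2-level and 3-level labels."""
    return max(meaningful_ratio(level2, vocabulary), meaningful_ratio(level3, vocabulary))
```

Under this reading, a dictionary-based DGA name scores close to 1, while a purely random string scores close to 0.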
Step S602: and performing dimension reduction processing on the obtained multidimensional feature vector to obtain a dimension reduction feature vector.
In the embodiment of the invention, the obtained multidimensional feature vector can be subjected to dimension reduction by using the t-SNE algorithm, but the embodiment is not limited thereto.
Step S603: and clustering the dimension reduction feature vectors by using DBSCAN to obtain the DGA domain name of each family.
In the specific implementation of step S603, all the dimension-reduced feature vectors are clustered by using DBSCAN, specifically: the clustering parameters are tuned on the dimension-reduced feature vectors, the clustering effect is judged comprehensively from the output plot, and suitable post-processing is applied to obtain the final clustering result, from which the DGA domain names of each family are generated.
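To make the clustering step concrete, the following is a small pure-Python DBSCAN over already dimension-reduced points. It is a teaching sketch with quadratic neighbour search, not the embodiment's code; in practice a library implementation such as scikit-learn's `DBSCAN` would normally be used. The parameter names `eps` and `min_samples` and the use of Euclidean distance are assumptions.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)           # None marks a point not yet visited

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:        # not a core point: provisionally noise
            labels[i] = NOISE
            continue
        cluster += 1                        # start a new cluster (family)
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:          # noise reachable from a core point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_samples:    # j is itself a core point: keep expanding
                queue.extend(more)
    return labels
```

Each non-negative label corresponds to one candidate DGA family; points labelled -1 belong to no family and are the natural input to the post-processing mentioned above.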
In order to better understand the above description, as shown in fig. 7, a flowchart of DGA family clustering based on a DBSCAN algorithm is provided in an embodiment of the present invention.
In fig. 7, first, feature extraction is performed on a detected malicious domain name, that is, n-gram features, character entropy features, numerical features, and percentage features of meaningful characters of the malicious domain name are extracted.
Then, feature fusion is performed on the extracted features, that is, the extracted features are normalized and standardized, so as to generate multidimensional feature vectors.
And then, carrying out dimension reduction processing on each multidimensional feature vector to obtain a dimension reduction feature vector.
And finally, clustering all the dimension reduction feature vectors by using the DBSCAN to obtain the DGA domain name of each family.
According to the DGA domain name detection method provided by the embodiment of the invention, family clustering is carried out on malicious domain names in the domain names to be detected by utilizing the DBSCAN, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
Corresponding to the DGA domain name detection method shown in the above embodiment of the present invention, the embodiment of the present invention further correspondingly provides a DGA domain name detection device, as shown in fig. 8, where the DGA domain name detection device includes: an extraction module 81, a filtering module 82, a Simhash module 83 and a clustering module 84.
The extracting module 81 is configured to obtain a network data packet in a pcap format, and extract a DNS request packet in the network data packet and a request domain name in the DNS request packet.
And the filtering module 82 is configured to filter the requested domain name, and take the obtained unknown domain name as the domain name to be detected.
The Simhash module 83 is configured to detect a domain name to be detected by using a Simhash algorithm, and distinguish a legal domain name from a malicious domain name in the domain name to be detected.
The clustering module 84 is configured to perform family clustering on the malicious domain names to obtain DGA domain names of each family.
It should be noted that, the specific principle and execution process of each module or each unit in the DGA domain name detection device disclosed in the above embodiment of the present invention are the same as those of the DGA domain name detection method implemented in the above embodiment of the present invention, and may refer to the corresponding parts in the DGA domain name detection method disclosed in the above embodiment of the present invention, and will not be described herein again.
According to the DGA domain name detection device provided by the embodiment of the invention, a DNS request packet in a network data packet and a request domain name in the DNS request packet are extracted by acquiring the network data packet in a pcap format; filtering the request domain name, and taking the obtained unknown domain name as the domain name to be detected; detecting a domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected; and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family. In the scheme, the request domain name is filtered firstly, the obtained unknown domain name is used as a domain name to be detected, the domain name to be detected is detected through a Simhash algorithm, and family clustering is carried out on the distinguished malicious domain names in the domain name to be detected, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
Optionally, based on the filtering module 82 shown in fig. 8, the filtering module 82 further includes: a de-duplication unit, a special character removal unit and a filtering unit.
The de-duplication unit is used for de-duplicating the request domain names to obtain de-duplicated request domain names.
The special character removal unit is used for removing, from the de-duplicated request domain names, the request domain names that contain special characters, so as to obtain valid domain names.
The filtering unit is used for filtering the valid domain names by using the black-and-white list, removing the known domain names contained in the black-and-white list to obtain unknown domain names, and taking the unknown domain names as the domain names to be detected.
According to the DGA domain name detection device provided by the embodiment of the invention, the extracted request domain name is subjected to de-duplication processing, special character removing processing and filtering processing, the obtained unknown domain name is used as the domain name to be detected, and a basis is provided for the subsequent domain name detection, so that the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
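The three units of the filtering module can be sketched as a single pass over the extracted request domain names. The definition of a "special character" is not given in the text, so the regular expression below (anything other than letters, digits, hyphens and dots is special) is an assumption, as are the function name and the lower-casing of names.

```python
import re

# Assumed notion of a valid name: at least two labels made of letters, digits or hyphens.
_VALID_NAME = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)+")

def select_domains_to_detect(requested, blacklist, whitelist):
    """De-duplicate, drop names with special characters, and drop names already in either list."""
    seen = set()
    unknown = []
    for name in requested:
        name = name.strip().lower().rstrip(".")
        if name in seen:                              # de-duplication unit
            continue
        seen.add(name)
        if not _VALID_NAME.fullmatch(name):           # special character removal unit
            continue
        if name in blacklist or name in whitelist:    # filtering unit: skip known names
            continue
        unknown.append(name)
    return unknown
```

The returned list preserves the order of first appearance and contains only the unknown domain names that go on to Simhash detection.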
Optionally, based on the Simhash module 83 shown in fig. 8, the Simhash module 83 further includes: the device comprises a Simhash calculation unit, a distance calculation unit and a determination unit.
And the Simhash calculation unit is used for processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value.
The distance calculation unit is used for carrying out average Hamming distance calculation on each Simhash value and the blacklist Simhash sample library and the whitelist Simhash sample library respectively to obtain blacklist average Hamming distance and whitelist average Hamming distance corresponding to each domain name to be detected, and domain names in the blacklist Simhash sample library and the whitelist Simhash sample library are processed in advance by using a Simhash algorithm and converted into corresponding Simhash values.
The determining unit is used for comparing, for each domain name to be detected, the blacklist average Hamming distance with the whitelist average Hamming distance; if the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, determining that the domain name to be detected is a legal domain name; if the whitelist average Hamming distance is larger than the blacklist average Hamming distance, determining that the domain name to be detected is a malicious domain name; if the whitelist average Hamming distance is equal to the blacklist average Hamming distance, judging whether the whitelist average Hamming distance or the blacklist average Hamming distance is smaller than a preset threshold; if it is smaller than the preset threshold, determining that the domain name to be detected is a legal domain name; and if it is greater than or equal to the preset threshold, determining that the domain name to be detected is a malicious domain name.
According to the DGA domain name detection device provided by the embodiment of the invention, the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected are calculated, the blacklist average Hamming distance and the whitelist average Hamming distance are compared, and corresponding operations are performed, so that a basis is provided for subsequent operations, the DGA domain name detection efficiency is improved, and the false alarm rate of DGA domain name detection is reduced.
Optionally, based on the Simhash module 83 shown in fig. 8, the Simhash module 83 further includes: a dividing unit and a processing unit.
The dividing unit is used for respectively dividing the levels of the domain names in the blacklist and the white list according to the levels of the domain names to obtain the domain names with different levels, and each level is provided with a corresponding index.
And the processing unit is used for processing the domain name in each level by utilizing a Simhash algorithm to obtain the domain name converted into the corresponding Simhash value.
The distance calculating unit is specifically configured to:
determining the level of the domain name to be detected corresponding to each Simhash value, and respectively carrying out average Hamming distance calculation on the Simhash value and the Simhash value with the same level in the blacklist Simhash sample library and the whitelist Simhash sample library based on indexes to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected.
According to the DGA domain name detection device provided by the embodiment of the invention, the levels of the domain names in the blacklist and the whitelist are divided, the level of the domain name to be detected corresponding to each Simhash value is determined, the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected are calculated according to the index corresponding to each level, the sizes of the blacklist average Hamming distance and the whitelist average Hamming distance are compared, corresponding operations are carried out, and a basis is provided for subsequent operations, so that the DGA domain name detection efficiency is improved, and the false alarm rate of DGA domain name detection is reduced.
Optionally, based on the clustering module 84 shown in fig. 8, the clustering module 84 further includes: the device comprises a feature processing unit, a dimension reduction unit and a clustering unit.
And the feature processing unit is used for carrying out feature extraction and fusion on each malicious domain name to generate a multidimensional feature vector.
The dimension reduction unit is used for carrying out dimension reduction processing on each multidimensional feature vector to obtain dimension reduction feature vectors.
And the clustering unit is used for clustering all the dimension reduction feature vectors by using the DBSCAN to obtain the DGA domain name of each family.
According to the DGA domain name detection device provided by the embodiment of the invention, family clustering is carried out on malicious domain names in the domain names to be detected by utilizing the DBSCAN, so that the DGA domain name of each family is obtained, the DGA domain name detection efficiency is improved, and the false alarm rate of the DGA domain name detection is reduced.
Based on the DGA domain name detection device disclosed in the embodiments of the present disclosure, each of the above modules may be implemented by a hardware device configured by a processor and a memory. Specifically, the above modules are stored in a memory as program units, and the processor executes the program units stored in the memory to implement DGA domain name detection.
The processor comprises a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and DGA domain name detection is realized by adjusting kernel parameters.
The embodiment of the invention provides a computer storage medium, which comprises a DGA domain name detection program, wherein the program is executed by a processor to realize the DGA domain name detection method disclosed by the embodiment of the invention.
The embodiment of the disclosure provides a processor, which is used for running a program, wherein the program runs to execute the DGA domain name detection method disclosed in the embodiment.
The embodiment of the disclosure provides an electronic device, as shown in fig. 9, which is a schematic structural diagram of the electronic device provided in the embodiment of the disclosure.
The electronic device 90 in the embodiments of the present disclosure may be a server, a PC, a PAD, a mobile phone, or the like.
The electronic device 90 comprises at least one processor 901, at least one memory 902 coupled to the processor, and a bus 903.
The processor 901 and the memory 902 communicate with each other via the bus 903.
The processor 901 is used for executing the program stored in the memory.
The memory 902 is used for storing a program, the program being at least used for: acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet; filtering the request domain name, and taking the obtained unknown domain name as the domain name to be detected; detecting the domain name to be detected by using a Simhash algorithm, and distinguishing legal domain names from malicious domain names among the domain names to be detected; and carrying out family clustering on the malicious domain names to obtain the DGA domain names of each family.
The present application also provides a computer program product which, when executed on an electronic device, is adapted to execute a program initialized with the following method steps:
acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet; filtering the request domain name, and taking the obtained unknown domain name as the domain name to be detected; detecting a domain name to be detected by using a Simhash algorithm, and distinguishing a legal domain name and a malicious domain name in the domain name to be detected; and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family.
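The first step above, extracting the request domain name from a DNS request packet, can be sketched as follows. The sketch assumes that the pcap file has already been read and that the Ethernet, IP and UDP headers have been stripped, so that only the raw DNS message remains; the function name `extract_query_name` is hypothetical. It checks the QR bit in the DNS header to keep only requests and then decodes the length-prefixed labels of the first question.

```python
import struct


def extract_query_name(dns_payload: bytes):
    """Return the queried domain name from a DNS request payload, or None."""
    if len(dns_payload) < 12:            # a DNS header is always 12 bytes
        return None
    _ident, flags, qdcount = struct.unpack("!HHH", dns_payload[:6])
    if flags & 0x8000 or qdcount == 0:   # QR bit set means this is a response
        return None
    labels, pos = [], 12
    while pos < len(dns_payload):
        length = dns_payload[pos]
        if length == 0:                  # zero-length label terminates the name
            return ".".join(labels) if labels else None
        if length & 0xC0:                # compression pointer: not handled in this sketch
            return None
        pos += 1
        label = dns_payload[pos:pos + length]
        if len(label) < length:          # truncated packet
            return None
        labels.append(label.decode("ascii", errors="replace"))
        pos += length
    return None
```

Responses, truncated packets and names that use compression pointers are simply skipped here; a production parser would need to handle them.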
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, the device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may take forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
Those skilled in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A DGA domain name detection method, the method comprising:
acquiring a network data packet in a pcap format, and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet;
filtering the request domain name, and taking the obtained unknown domain name as a domain name to be detected;
processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value;
carrying out average Hamming distance calculation on each Simhash value and a blacklist Simhash sample library and a whitelist Simhash sample library respectively to obtain blacklist average Hamming distance and whitelist average Hamming distance corresponding to each domain name to be detected, wherein domain names in the blacklist Simhash sample library and the whitelist Simhash sample library are processed in advance by using a Simhash algorithm and converted into corresponding Simhash values;
comparing the blacklist average Hamming distance and the whitelist average Hamming distance for each domain name to be detected;
if the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, determining that the domain name to be detected is a legal domain name;
if the whitelist average Hamming distance is larger than the blacklist average Hamming distance, determining that the domain name to be detected is a malicious domain name;
if the whitelist average Hamming distance is equal to the blacklist average Hamming distance, judging whether the whitelist average Hamming distance or the blacklist average Hamming distance is smaller than a preset threshold;
if it is smaller than the preset threshold, determining that the domain name to be detected is a legal domain name;
if it is greater than or equal to the preset threshold, determining that the domain name to be detected is a malicious domain name;
and carrying out family clustering on the malicious domain names to obtain DGA domain names of each family.
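The Simhash comparison and decision rule recited in claim 1 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the tokenization (character bigrams), the token hash (BLAKE2b), the 64-bit fingerprint width, and the default threshold of 32 are all assumptions, and the function names are hypothetical. The sample libraries are lists of Simhash values computed in advance.

```python
import hashlib
from typing import Sequence


def simhash(domain: str, bits: int = 64, ngram: int = 2) -> int:
    """Simhash fingerprint over character n-grams (tokenization is assumed)."""
    text = domain.lower()
    tokens = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def average_hamming(value: int, library: Sequence[int]) -> float:
    """Mean Hamming distance from one fingerprint to every sample in a library."""
    return sum(hamming(value, s) for s in library) / len(library)


def classify(domain: str, black_lib: Sequence[int], white_lib: Sequence[int],
             threshold: float = 32.0) -> str:
    """Apply the three-way comparison; the threshold is used only on a tie."""
    v = simhash(domain)
    d_black = average_hamming(v, black_lib)
    d_white = average_hamming(v, white_lib)
    if d_white < d_black:
        return "legal"
    if d_white > d_black:
        return "malicious"
    return "legal" if d_white < threshold else "malicious"
```

In the tie branch the two averages are equal, so comparing either one against the threshold gives the same result.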
2. The method according to claim 1, wherein filtering the requested domain name, using the obtained unknown domain name as the domain name to be detected, comprises:
performing de-duplication on the request domain names to obtain de-duplicated request domain names;
removing the request domain names containing special characters from the de-duplicated request domain names to obtain valid domain names;
and filtering the valid domain names by using the blacklist and the whitelist, filtering out the known domain names in the blacklist and the whitelist to obtain unknown domain names, and taking the unknown domain names as the domain names to be detected.
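The three filtering steps of claim 2 can be sketched as a single pass over the request domain names. The definition of "special characters" is an assumption here: a name is treated as valid only if it consists of letters, digits, hyphens and dots. The function name `filter_domains` and the normalization (lowercasing, stripping a trailing dot) are likewise illustrative choices.

```python
import re

# Assumption: anything other than letters, digits, hyphens and dots is "special"
VALID_DOMAIN = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


def filter_domains(requested, blacklist, whitelist):
    """Return unknown domain names: de-duplicated, valid, and in neither list."""
    seen = set()
    unknown = []
    for name in requested:
        name = name.strip().rstrip(".").lower()
        if name in seen:                      # step 1: de-duplication
            continue
        seen.add(name)
        if not VALID_DOMAIN.match(name):      # step 2: drop names with special characters
            continue
        if name in blacklist or name in whitelist:   # step 3: drop known names
            continue
        unknown.append(name)
    return unknown
```

Passing the blacklist and whitelist as sets keeps each membership check constant-time, which matters when the lists are large.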
3. The method as recited in claim 1, further comprising:
performing level division on the domain names in the black list and the white list according to the domain name levels to obtain domain names with different levels, wherein each level is provided with a corresponding index;
processing the domain name in each level by using a Simhash algorithm to obtain a domain name converted into a corresponding Simhash value;
the step of calculating the average Hamming distance between each Simhash value and the blacklist Simhash sample library and the whitelist Simhash sample library to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected, comprises the following steps:
determining the level of the domain name to be detected corresponding to each Simhash value, and carrying out average Hamming distance calculation on the Simhash value and the Simhash value with the same level in the blacklist Simhash sample library and the whitelist Simhash sample library based on indexes to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected.
4. The method of claim 1, wherein the performing family clustering on the malicious domain names to obtain DGA domain names of each family comprises:
extracting and fusing characteristics of each malicious domain name to generate a multidimensional characteristic vector;
performing dimension reduction processing on each multidimensional feature vector to obtain dimension reduction feature vectors;
and clustering all the dimension reduction feature vectors by using DBSCAN to obtain the DGA domain name of each family.
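The clustering step of claim 4 can be illustrated with a compact, dependency-free DBSCAN. This is a sketch for small inputs, assuming Euclidean distance between the dimension-reduced feature vectors; the parameters `eps` and `min_pts` are the standard DBSCAN parameters, and their values are not given in this passage. Each resulting cluster label stands for one candidate DGA family, and the label -1 marks noise.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Brute-force neighbourhood query; includes the point itself
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                    # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise reached from a core point: border point
            if labels[j] is not None:
                continue                      # already assigned, do not expand again
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:          # j is a core point, so expand through it
                queue.extend(nbrs)
    return labels
```

The brute-force neighbourhood query is quadratic in the number of points; a library implementation with a spatial index would be the usual choice for large sets of malicious domain names.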
5. A DGA domain name detection device, the device comprising:
the extraction module is used for acquiring a network data packet in a pcap format and extracting a DNS request packet in the network data packet and a request domain name in the DNS request packet;
the filtering module is used for filtering the request domain name and taking the obtained unknown domain name as a domain name to be detected;
the Simhash module is used for processing each domain name to be detected by using a Simhash algorithm to obtain each domain name to be detected represented by a Simhash value; carrying out average Hamming distance calculation on each Simhash value and a blacklist Simhash sample library and a whitelist Simhash sample library respectively to obtain the blacklist average Hamming distance and the whitelist average Hamming distance corresponding to each domain name to be detected, wherein the domain names in the blacklist Simhash sample library and the whitelist Simhash sample library are processed in advance by using the Simhash algorithm and converted into corresponding Simhash values; comparing the blacklist average Hamming distance and the whitelist average Hamming distance for each domain name to be detected; if the whitelist average Hamming distance is smaller than the blacklist average Hamming distance, determining that the domain name to be detected is a legal domain name; if the whitelist average Hamming distance is larger than the blacklist average Hamming distance, determining that the domain name to be detected is a malicious domain name; if the whitelist average Hamming distance is equal to the blacklist average Hamming distance, judging whether the whitelist average Hamming distance or the blacklist average Hamming distance is smaller than a preset threshold; if it is smaller than the preset threshold, determining that the domain name to be detected is a legal domain name; if it is greater than or equal to the preset threshold, determining that the domain name to be detected is a malicious domain name;
and the clustering module is used for carrying out family clustering on the malicious domain names to obtain the DGA domain names of each family.
6. The apparatus of claim 5, wherein the clustering module comprises:
the feature processing unit is used for carrying out feature extraction and fusion on each malicious domain name to generate a multidimensional feature vector;
the dimension reduction unit is used for carrying out dimension reduction processing on each multi-dimension feature vector to obtain dimension reduction feature vectors;
and the clustering unit is used for clustering all the dimension reduction feature vectors by using DBSCAN to obtain the DGA domain name of each family.
7. An electronic device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to implement the DGA domain name detection method according to any one of claims 1 to 4 when invoking and executing the computer program stored in the memory.
8. A computer storage medium having stored therein computer executable instructions which when loaded and executed by a processor implement the DGA domain name detection method according to any one of claims 1 to 4.
CN202111074752.9A 2021-09-14 2021-09-14 DGA domain name detection method and device, electronic equipment and computer storage medium Active CN113746952B (en)

Publications (2)

Publication Number Publication Date
CN113746952A CN113746952A (en) 2021-12-03
CN113746952B true CN113746952B (en) 2024-04-16

Family

ID=78738679

Country Status (1)

Country Link
CN (1) CN113746952B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant