CN113645173A - Malicious domain name identification method, system and equipment

Info

Publication number: CN113645173A
Application number: CN202010344021.0A
Authority: CN
Inventors: 苏香艳; 梁兴强
Original assignee: Beijing Guancheng Technology Co ltd
Current assignee: Beijing Guancheng Technology Co ltd
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2021-11-12

Abstract

The invention discloses a method, a system and equipment for identifying a malicious domain name. The method comprises the following steps: dividing a domain name to be identified into a domain name prefix, a main domain name and a domain name suffix; performing malicious domain name detection on the domain name prefix and/or the main domain name through a pre-constructed domain name recognition model, and performing a domain name suffix reputation query on the domain name suffix; and judging whether the domain name to be identified is a malicious domain name according to the detection and query conclusions. The method identifies the domain name in segments: the domain name to be detected is divided into three parts (a domain name prefix, a main domain name and a domain name suffix), and a different detection method is applied to each part. Dedicated models are established to detect the domain name prefix and/or the main domain name, the domain name suffix is checked by comparing its reputation, and the results are combined into a comprehensive score, so that malicious domain names can be identified accurately and efficiently.

Description

Malicious domain name identification method, system and equipment

Technical Field

The invention relates to the technical field of network security, in particular to a method, a system and equipment for identifying a malicious domain name.

Background

The identification and classification of TLS encrypted traffic is a current research hotspot. Some research results already exist on how to identify malicious TLS encrypted traffic, but the false alarm rate of such identification remains a significant pain point.

A malicious domain name is the domain name of a website carrying malicious links. An attacker can lure users to the website through the domain name in order to obtain their private information, take control of their devices, and so on. Malicious domain names are used in many ways: some are generated by an attacker through a domain generation algorithm (DGA) and are often used for communication between a bot and the command-and-control (C&C) server of its controller; some are registered to match the typing errors a user may make when entering a known domain name; and some are constructed by adding other keywords to the name of a known website and registering the result as a new domain name for malicious use.

Domain name information such as SNI, DNS and CN often appears in TLS encrypted traffic. If these domain names can be directly identified as malicious or not, they provide a clear basis for identifying and classifying the TLS encrypted traffic.

The identification of malicious domain names is a difficult problem and a bottleneck in network security detection; solving it can greatly improve the reliability of network security.

Disclosure of Invention

The invention aims to provide a method, a system and equipment capable of accurately identifying a malicious domain name so as to solve the problem of identification of the malicious domain name.

In order to solve the technical problem, the invention provides a method for identifying a malicious domain name, which comprises the following steps:

dividing a domain name to be identified into a domain name prefix, a main domain name and a domain name suffix;

the domain name suffix is determined according to the internationally used top-level domains (TLDs), and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix;

performing malicious domain name detection on the domain name prefix and/or the main domain name through a pre-constructed domain name recognition model, and performing a domain name suffix reputation query on the domain name suffix;

and judging whether the domain name to be identified is a malicious domain name according to the detection and query conclusions.

Optionally, the pre-constructed domain name recognition model includes:

a domain name recognition model for performing DGA domain name recognition, and/or C&C domain name recognition, and/or counterfeit known domain name recognition on the domain name prefix and/or the main domain name;

wherein a C&C domain name refers to a domain name involved in malicious encrypted traffic that uses a C&C server.

Optionally, the domain name recognition model for performing DGA domain name recognition includes:

a DGA domain name recognition model constructed based on an LSTM neural network, wherein the domain name prefix and/or the main domain name is input into the DGA domain name recognition model to determine whether the domain name prefix and/or the main domain name is a DGA domain name.

Optionally, the domain name recognition model for performing C&C domain name recognition includes:

a C&C domain name recognition model constructed based on a random forest algorithm, wherein features of the domain name prefix and/or the main domain name are extracted, the features comprising the proportion of switches between letter characters and digits and/or whether special character strings are contained; and the features are input into the C&C domain name recognition model to determine whether the domain name prefix and/or the main domain name is a C&C domain name.

Optionally, the domain name recognition model for performing counterfeit known domain name recognition includes:

a counterfeit domain name recognition model, wherein the domain name prefix and/or the main domain name is input into the counterfeit domain name recognition model, and the edit distance between the domain name prefix and/or the main domain name and each known domain name in a known domain name library preset in the model is calculated to determine whether the domain name prefix and/or the main domain name counterfeits a known domain name.

Optionally, after performing malicious domain name detection on the domain name prefix and/or the main domain name through the pre-constructed domain name recognition model, and performing the domain name suffix reputation query on the domain name suffix, the method further includes:

performing impersonated known website matching on the domain name prefix: a known website list is constructed in advance, the domain name prefix is matched against the known website list, and whether the domain name prefix impersonates a known domain name is determined according to the matching result.

Optionally, after the domain name suffix is determined according to the internationally used TLDs and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix, the method further includes:

taking the first level before the domain name suffix as the main domain name, and taking all levels before the main domain name as the domain name prefix.

Optionally, the process of constructing the DGA domain name recognition model based on the LSTM neural network includes:

constructing a DGA domain name training set, wherein the samples in the DGA domain name training set comprise a published DGA domain name list and/or DGA domain names generated by published DGA domain name generation algorithms, and wherein the samples are processed to obtain a valid character encoding set and are converted into numerical form through the valid character encoding set;

constructing the DGA domain name recognition model based on an LSTM neural network, and training the DGA domain name recognition model with the DGA domain name training set.

Optionally, the process of constructing the C&C domain name recognition model based on the random forest algorithm includes:

constructing a C&C domain name training set: collecting published C&C domain name lists to obtain a set of C&C domain names, and selecting, through feature engineering, the features of the domain name prefixes and/or main domain names to be analyzed in the set, so as to establish the C&C domain name training set;

and constructing the C&C domain name recognition model, and training it with a random forest algorithm based on the C&C domain name training set.

Optionally, performing the domain name suffix reputation query on the domain name suffix includes:

constructing a domain name suffix reputation table in advance, and querying the reputation ranking of the domain name suffix in the pre-constructed domain name suffix reputation table.

Optionally, the dividing the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix includes:

dividing the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix, wherein the domain name to be identified comprises an SNI and/or a DNS and/or a CN domain name.

Optionally, the determining, according to the detection and query conclusion, whether the domain name to be identified is a malicious domain name includes:

scoring the domain name prefix and the main domain name according to the detection conclusion, and scoring the domain name suffix according to the query conclusion;

and calculating the probability of the domain name to be identified as the malicious domain name according to the scores of the domain name prefix, the main domain name and the domain name suffix.

The invention also provides a system for identifying a malicious domain name, which comprises:

a domain name segmentation module, used for dividing a domain name to be identified into a domain name prefix, a main domain name and a domain name suffix, wherein the domain name suffix is determined according to the internationally used TLDs, and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix;

a domain name detection module, used for performing malicious domain name detection on the domain name prefix and/or the main domain name through a pre-constructed domain name recognition model, and performing a domain name suffix reputation query on the domain name suffix;

and a domain name identification module, used for judging whether the domain name to be identified is a malicious domain name according to the detection and query conclusions.

The invention also provides computer equipment comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements any of the malicious domain name identification methods described above when executing the computer program.

The invention provides a method, a system and equipment for identifying a malicious domain name, in which the domain name is identified in segments. The domain name to be detected is divided into a domain name prefix, a main domain name and a domain name suffix, and a different detection method is applied to each part: dedicated models are established to detect the domain name prefix and/or the main domain name, and the domain name suffix is checked by comparing its reputation. Through comprehensive scoring of these results, malicious domain names can be identified accurately and efficiently.

Drawings

In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

Fig. 1 is a flowchart of a malicious domain name identification method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of the segmented detection of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 3 is a detailed domain name detection flowchart of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 4 is a diagram of the Levenshtein distance calculation formula of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 5 is a flowchart of detecting an impersonated known domain name in the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 6 is a domain name segmentation flowchart of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 7 is a flowchart of constructing the DGA domain name recognition model of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 8 is a flowchart of constructing the C&C domain name recognition model of the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 9 is a flowchart of selecting a domain name to be detected in the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 10 is a flowchart of calculating scores and probabilities in the malicious domain name identification method according to the first embodiment of the present invention;

Fig. 11 is a block diagram illustrating the configuration of a malicious domain name recognition apparatus according to a second embodiment of the present invention;

Fig. 12 is a block diagram of a computer apparatus according to a third embodiment of the present invention;

Fig. 13 is a diagram of an embodiment of the present invention.

Detailed Description

The core of the invention is to provide a method, a system and equipment capable of accurately identifying a malicious domain name so as to solve the problem of identifying the malicious domain name.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

The embodiments of the invention can be used in domain name security detection scenarios, in particular for identifying malicious domain names. The domain name involved in the embodiments of the present invention may be obtained from TLS traffic in network communication, or may be provided directly as a domain name; this is not limited in the embodiments of the present invention.

An embodiment of the present invention provides a method for identifying a malicious domain name, as shown in fig. 1, including the following steps:

S100: the domain name to be identified is divided into a domain name prefix, a main domain name and a domain name suffix.

The domain name suffix is determined according to the internationally used TLDs, and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix.

Specifically, the domain name to be identified is detected in segments. Malicious domain names use different attack modes, or place different emphasis when counterfeiting, for the domain name prefix, the main domain name and the domain name suffix, so the domain name to be identified is segmented and each segment is then detected separately.

More specifically, the domain name suffix is determined first, using the internationally used TLDs, and the domain name prefix and the main domain name are then determined according to their distance from the domain name suffix: the main domain name is adjacent to the domain name suffix, and the domain name prefix precedes the main domain name. Preferably, the first level before the domain name suffix is taken as the main domain name, and all levels before the main domain name are taken as the domain name prefix.
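The segmentation rule above can be illustrated with a minimal Python sketch. The name `split_domain` and the tiny `KNOWN_SUFFIXES` set are hypothetical and exist only for illustration; a real system would presumably load a complete TLD or public suffix list. The sketch assumes the longest matching suffix wins.

```python
# Hypothetical, deliberately tiny suffix list (an assumption for illustration).
KNOWN_SUFFIXES = {"com", "net", "org", "cn", "com.cn", "co.uk"}


def split_domain(domain, suffixes=KNOWN_SUFFIXES):
    """Split a domain name into (prefix, main, suffix).

    The suffix is the longest known suffix matching the right-hand labels,
    the label immediately before it is the main domain name, and all labels
    further left form the prefix (possibly empty).
    """
    labels = domain.strip(".").lower().split(".")
    # Start at 1 so that at least one label remains for the main domain name;
    # smaller start values correspond to longer candidate suffixes.
    for start in range(1, len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            main = labels[start - 1]
            prefix = ".".join(labels[:start - 1])
            return prefix, main, candidate
    raise ValueError(f"no known suffix found in {domain!r}")
```

For example, under these assumptions `split_domain("www.example.com.cn")` yields the prefix `www`, the main domain name `example` and the suffix `com.cn`.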

S300: malicious domain name detection is performed on the domain name prefix and/or the main domain name through a pre-constructed domain name recognition model, and a domain name suffix reputation query is performed on the domain name suffix.

Specifically, because the domain name prefix and the main domain name are subject to several attack modes, pre-constructed domain name recognition models are used to perform multi-stage malicious domain name detection on the domain name prefix and/or the main domain name. For the domain name suffix, analysis and summarization of a large number of malicious domain names in this embodiment showed that a domain name suffix reputation query can judge the credibility of the suffix efficiently; the reputation query is therefore performed on the domain name suffix, which makes its credibility intuitive and easy to judge.
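As a rough illustration of the suffix reputation query, the following Python sketch looks a suffix up in a preset table. The table contents, the rank values and the function name `suffix_reputation_rank` are all hypothetical placeholders, not data from the source; the handling of unknown suffixes (ranking them worse than every listed suffix) is also an assumption.

```python
# Hypothetical reputation table: suffix -> rank, where 1 is the most trusted.
# The ranks are made-up placeholders, not measured data.
SUFFIX_REPUTATION = {"com": 1, "org": 2, "net": 3, "top": 40, "xyz": 45}


def suffix_reputation_rank(suffix, table=SUFFIX_REPUTATION):
    """Return the reputation rank of a suffix.

    Unknown suffixes are assumed to rank one place below the worst
    suffix in the table.
    """
    suffix = suffix.lower().lstrip(".")
    if suffix in table:
        return table[suffix]
    return max(table.values()) + 1
```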

S500: whether the domain name to be identified is a malicious domain name is judged according to the detection and query conclusions.

Specifically, whether the domain name to be identified is a malicious domain name is judged comprehensively from the malicious domain name detection results for the domain name prefix and the main domain name and from the domain name suffix reputation query result for the domain name suffix.
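One simple way to picture this comprehensive judgment is a weighted combination of per-segment scores. The sketch below is only an assumption made for illustration: the weighted average, the weight values and the name `malicious_probability` do not come from the source, which describes its own scoring with reference to Fig. 10.

```python
def malicious_probability(prefix_score, main_score, suffix_score,
                          weights=(0.3, 0.5, 0.2)):
    """Combine per-segment scores in [0, 1] into one probability-like value.

    The weighted average and the default weights are illustrative
    assumptions, not the scoring rule of the source document.
    """
    scores = (prefix_score, main_score, suffix_score)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

A decision threshold on the returned value (for example 0.5) would then separate malicious from benign domain names; that threshold is likewise a free parameter of this sketch.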

The method for identifying a malicious domain name provided by the invention identifies the domain name in segments. The domain name to be detected is divided into a domain name prefix, a main domain name and a domain name suffix, and a different detection method is applied to each part: dedicated models are established to detect the domain name prefix and/or the main domain name, and the domain name suffix is checked by comparing its reputation. Through comprehensive scoring of these results, malicious domain names can be identified accurately and efficiently.

Optionally, as shown in Fig. 2, the domain name recognition model pre-constructed in S300 includes:

S310: a domain name recognition model for performing DGA domain name recognition, and/or C&C domain name recognition, and/or counterfeit known domain name recognition on the domain name prefix and/or the main domain name.

Specifically, malicious domain names attack in many ways. For the domain name prefix or the main domain name, an attacker may generate the name with a DGA, such names often being used for communication between a bot and the C&C server of its controller, or may register a name that closely resembles or is related to a known website. Detection of the domain name prefix and/or the main domain name can therefore be achieved by performing DGA domain name recognition, and/or C&C domain name recognition, and/or counterfeit known domain name recognition.

By detecting each segment of the domain name separately, the malicious domain name identification method of this embodiment is more specific and accurate.

Optionally, as shown in Fig. 3, the domain name recognition model for performing DGA domain name recognition in S310 includes:

S311: a DGA domain name recognition model constructed based on an LSTM neural network, wherein the domain name prefix and/or the main domain name is input into the DGA domain name recognition model to determine whether the domain name prefix and/or the main domain name is a DGA domain name.

Specifically, the DGA domain name recognition model is obtained by training an LSTM neural network. The advantage of an LSTM neural network is that it can remove information from or add information to the cell state through gating units, a gate being a mechanism that lets information pass selectively. Through training, the DGA domain name recognition model automatically learns the correlations between DGA domain name character strings.
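To make the gating idea concrete, here is a minimal sketch of one step of a standard LSTM cell, reduced to scalar input and scalar state so that it runs with the Python standard library alone. The function names and the parameter layout are assumptions chosen for illustration; a practical model would use vector-valued states and a deep learning framework.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell with scalar input and scalar state.

    `params` maps each gate name ('f', 'i', 'g', 'o') to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("f"))    # forget gate: share of old cell state kept
    i = sigmoid(pre_activation("i"))    # input gate: share of new candidate added
    g = math.tanh(pre_activation("g"))  # candidate value for the cell state
    o = sigmoid(pre_activation("o"))    # output gate: share of cell state exposed
    c = f * c_prev + i * g              # information is removed, then added
    h = o * math.tanh(c)
    return h, c
```

With a strongly positive forget-gate bias and a strongly negative input-gate bias, the cell state passes through almost unchanged, which is the selective passing of information described above.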

By spelling out the DGA domain name recognition process for the domain name prefix and the main domain name, this embodiment makes the malicious domain name identification method more specific and accurate.

Optionally, as shown in Fig. 3, the domain name recognition model for performing C&C domain name recognition in S310 includes:

S312: a C&C domain name recognition model constructed based on a random forest algorithm, wherein features of the domain name prefix and/or the main domain name are extracted, the features comprising the proportion of switches between letter characters and digits and/or whether special character strings are contained; and the features are input into the C&C domain name recognition model to determine whether the domain name prefix and/or the main domain name is a C&C domain name.

Specifically, the C&C domain name recognition model is constructed based on a random forest algorithm. A random forest is a classifier comprising a plurality of decision trees, and its output category is the mode of the categories output by the individual trees. At each split of a tree in the random forest, not all candidate features are considered; instead, a subset of features is randomly selected from all candidate features, and the optimal feature is then chosen from that subset. As a result, the decision trees in the forest differ from one another, which increases the diversity of the ensemble and improves classification performance. The C&C domain name recognition model constructed based on the random forest algorithm can perform C&C domain name recognition from the features extracted from the domain name prefix or the main domain name, and performs well.
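The two features named above can be sketched in Python as follows. This is one plausible reading rather than the source's exact definition: the switching proportion is taken as the fraction of adjacent alphanumeric character pairs that change between a letter and a digit, and `SPECIAL_STRINGS` is a hypothetical list, since the source does not enumerate the special character strings.

```python
# Hypothetical list of "special" substrings (an assumption for illustration).
SPECIAL_STRINGS = ("login", "secure", "update")


def switch_ratio(label):
    """Fraction of adjacent alphanumeric pairs that switch between letter and digit."""
    chars = [c for c in label if c.isalnum()]
    if len(chars) < 2:
        return 0.0
    switches = sum(
        1 for a, b in zip(chars, chars[1:]) if a.isdigit() != b.isdigit()
    )
    return switches / (len(chars) - 1)


def extract_features(label, special_strings=SPECIAL_STRINGS):
    """Return [switch ratio, special-string flag] for a prefix or main domain name."""
    label = label.lower()
    has_special = any(s in label for s in special_strings)
    return [switch_ratio(label), 1.0 if has_special else 0.0]
```

Feature vectors of this kind could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`, for training and prediction.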

By spelling out the C&C domain name recognition process for the domain name prefix and the main domain name, this embodiment makes the malicious domain name identification method more specific and accurate.

Optionally, as shown in Fig. 3, the domain name recognition model for performing counterfeit known domain name recognition in S310 includes:

S313: a counterfeit domain name recognition model, wherein the domain name prefix and/or the main domain name is input into the counterfeit domain name recognition model, and the edit distance between the domain name prefix and/or the main domain name and each known domain name in a known domain name library preset in the model is calculated to determine whether the domain name prefix and/or the main domain name counterfeits a known domain name.

Specifically, when registering a domain name, many lawbreakers counterfeit the domain names of known or legitimate websites, modifying one or more characters so that their own domain name looks like a normal one. The edit distance is calculated to measure the similarity between the domain name to be detected and the known domain names.

More specifically, this embodiment uses an edit distance calculation. The edit distance is the minimum number of edit operations (replacement, insertion and deletion) required to convert one character string into another. It quantifies the degree of difference between two character strings: the smaller the edit distance, the more similar the strings. Depending on how the character strings are processed, there are various distances, such as the Levenshtein distance, the Damerau-Levenshtein distance, the longest common subsequence (LCS) distance, the Jaro distance and the Hamming distance. The present invention uses the Levenshtein distance, with the calculation formula shown in Fig. 4; in addition to the insertion, deletion and replacement operations defined for the Levenshtein distance, the transposition of adjacent characters is also counted as an operation.
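A distance of this kind can be sketched with the usual dynamic programming table. The Python function below implements the optimal string alignment variant, that is, the Levenshtein recurrence plus a unit-cost transposition of adjacent characters. It follows the textual description only; the formula in Fig. 4 may differ in detail, and the function name is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance extended with adjacent-character transposition
    (the optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between the first i chars of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]
```

A candidate such as `paypla` is at distance 1 from `paypal` under this measure, whereas the plain Levenshtein distance would count the swapped letters as two edits.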

By spelling out the counterfeit known domain name recognition process for the domain name prefix and the main domain name, this embodiment makes the malicious domain name identification method more specific and accurate.

Optionally, as shown in Fig. 5, after S300, in which malicious domain name detection is performed on the domain name prefix and/or the main domain name through the pre-constructed domain name recognition model and the domain name suffix reputation query is performed on the domain name suffix, the method further includes:

S400: performing impersonated known website matching on the domain name prefix: a known website list is constructed in advance, the domain name prefix is matched against the known website list, and whether the domain name prefix impersonates a known domain name is determined according to the matching result.

Specifically, besides looking similar to a known website, some malicious domain names confuse users by using a known website directly as the domain name prefix. Because the domain name prefix precedes the main domain name, a user who sees only the prefix may mistake the malicious domain name for the intended known website. The domain name prefix can therefore be matched against a pre-constructed known website list, and whether the prefix impersonates a known domain name is determined according to the matching result. If this matching were applied to the main domain name, legitimate domain names would be falsely flagged, so impersonated known website matching is performed only on the domain name prefix.
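The prefix matching step might look like the following Python sketch. The list `KNOWN_SITES`, the function name and the choice to match whole labels (rather than arbitrary substrings) are assumptions for illustration; the source only states that the prefix is matched against a pre-constructed known website list.

```python
# Hypothetical known website list (an assumption for illustration).
KNOWN_SITES = {"paypal", "google", "apple", "paypal.com"}


def prefix_impersonates_known_site(prefix, known_sites=KNOWN_SITES):
    """Return True if any contiguous run of labels in the prefix is a known site."""
    labels = prefix.lower().split(".") if prefix else []
    for start in range(len(labels)):
        for end in range(start + 1, len(labels) + 1):
            if ".".join(labels[start:end]) in known_sites:
                return True
    return False
```

For a domain name such as `www.paypal.com.evil-site.net`, the prefix `www.paypal.com` would match, while the main domain name `evil-site` is deliberately not checked in this step.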

By spelling out the impersonated known website matching process for the domain name prefix, this embodiment makes the malicious domain name identification method more specific and accurate.

Optionally, as shown in Fig. 6, after the domain name suffix in S100 is determined according to the internationally used TLDs and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix, the method further includes: taking the first level before the domain name suffix as the main domain name, and taking all levels before the main domain name as the domain name prefix.

Specifically, the domain name suffix of the domain name to be identified is determined according to the internationally used TLDs, and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix: the main domain name is adjacent to the domain name suffix, and the domain name prefix precedes the main domain name. In this embodiment, the first level before the domain name suffix is used as the main domain name, and all levels before the main domain name are used as the domain name prefix.

By spelling out the segmentation process for the domain name to be identified, this embodiment makes the malicious domain name identification method more specific and accurate.

Optionally, as shown in Fig. 7, the process of constructing the DGA domain name recognition model based on the LSTM neural network in S311 includes:

S311a: constructing a DGA domain name training set, wherein the samples in the DGA domain name training set comprise a published DGA domain name list and/or DGA domain names generated by published DGA domain name generation algorithms, and wherein the samples are processed to obtain a valid character encoding set and are converted into numerical form through the valid character encoding set.

Specifically, published DGA domain names and material on published DGA algorithms are collected, and domain name samples generated by the DGA algorithms are obtained. A set of domain names is selected for each DGA algorithm to be analyzed, and together these form the DGA domain name training set. The training set is processed to obtain a valid character encoding set, through which the samples are converted into numerical form.
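The conversion of samples into numerical form can be sketched as a character-to-integer mapping. In the Python sketch below, the function names, the reservation of 0 for padding, the left padding to a fixed length and the dropping of characters outside the encoding set are all assumptions for illustration, since the source does not give these details.

```python
def build_char_encoding(samples):
    """Map every character seen in the samples to a positive integer.

    Index 0 is reserved for padding (an assumption of this sketch).
    """
    charset = sorted({c for s in samples for c in s.lower()})
    return {c: idx for idx, c in enumerate(charset, start=1)}


def encode(sample, encoding, max_len):
    """Convert a string into a fixed-length list of integers.

    Characters outside the encoding set are dropped, long inputs are
    truncated, and short inputs are left-padded with 0.
    """
    ids = [encoding[c] for c in sample.lower() if c in encoding][:max_len]
    return [0] * (max_len - len(ids)) + ids
```

The resulting integer sequences are the kind of input a word embedding layer expects.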

S311b: constructing the DGA domain name recognition model based on an LSTM neural network, and training the DGA domain name recognition model with the DGA domain name training set.

Specifically, the DGA domain name recognition model includes a word embedding layer, an LSTM layer, a Dropout layer, a fully connected layer and an output layer.

More specifically, the word embedding layer of this embodiment is preferably 128-dimensional; its main purpose is to convert the input data into vectors suitable for neural network operations. The LSTM layer of this embodiment is preferably 128-dimensional. This structure combines previous feature information with current feature information in order to discriminate the input data more accurately. The LSTM layer can remove information from or add information to the cell state through its gating units, a gate being a mechanism that lets information pass selectively; through training of the LSTM layer, the neural network autonomously learns the correlations between DGA domain name character strings. During training, the Dropout layer sets part of the weights or outputs of the LSTM layer to zero according to a certain proportion, which reduces the interdependence between nodes and the structural risk of the network, thereby effectively avoiding overfitting; the Dropout proportion of this embodiment is 0.5. The fully connected layer converts the output of the LSTM recurrent neural network into the input of the sigmoid layer. The output layer mainly performs classification, and uses a sigmoid function.

By specifying how the DGA domain name recognition model is constructed, the malicious domain name identification method of this embodiment is made more specific and accurate.

Optionally, as shown in fig. 8, the process of constructing the C&C domain name recognition model based on the random forest algorithm in S312 includes:

S312a: Construct a C&C domain name training set. A published C&C domain name list is collected to obtain a set of C&C domain names, and the features of the domain name prefix and/or the main domain name to be analyzed are selected from the set through feature engineering to build the C&C domain name training set.

Specifically, the C&C domain name detection model is trained with a random forest (Random Forest). A set of C&C domain names is collected, and the domain name prefix or main domain name features to be analyzed, such as the ratio of switches between letter strings and digits and whether a special character string is contained, are selected through feature engineering to build the C&C domain name training set.

More specifically, in this embodiment the set of C&C domain names is obtained from a published C&C domain name list; the collected data may include existing C&C domain names, the top 100,000 domain names in the Alexa ranking, domain names of well-known domestic websites, and the like. After feature extraction, a C&C domain name recognition model for domain name prefixes and a C&C domain name recognition model for main domain names are trained separately.
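The two example features named above can be sketched as follows. The exact definitions are not given in the source, so this sketch makes assumptions: the switch ratio is taken as the fraction of adjacent alphanumeric character pairs that change between letter and digit, and `SPECIAL_STRINGS` is a hypothetical list of special substrings.

```python
from typing import Dict, Union

# Hypothetical list of "special" substrings; the source does not enumerate them.
SPECIAL_STRINGS = ("--", "_")


def switch_ratio(label: str) -> float:
    """Fraction of adjacent alphanumeric pairs that switch letter <-> digit."""
    chars = [ch for ch in label if ch.isalnum()]
    if len(chars) < 2:
        return 0.0
    switches = sum(
        1 for a, b in zip(chars, chars[1:]) if a.isdigit() != b.isdigit()
    )
    return switches / (len(chars) - 1)


def extract_features(label: str) -> Dict[str, Union[float, bool]]:
    """Build a small feature dictionary for one prefix or main domain name."""
    lowered = label.lower()
    return {
        "switch_ratio": switch_ratio(lowered),
        "has_special": any(s in lowered for s in SPECIAL_STRINGS),
    }
```

Feature vectors built this way could then be passed to any random forest implementation, for instance scikit-learn's `RandomForestClassifier`, to train separate models for prefixes and main domain names.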

S312b: Construct a C&C domain name recognition model, and train it with a random forest algorithm on the C&C domain name training set.

Specifically, the C&C domain name recognition model is trained with a random forest algorithm, yielding a model with stable and good classification performance.

By specifying how the C&C domain name recognition model is constructed, the malicious domain name identification method of this embodiment is made more specific and accurate.

Optionally, as shown in fig. 2, performing the domain name suffix reputation query on the domain name suffix in S300 includes:

S320: Construct a domain name suffix reputation table in advance, and query the reputation ranking of the domain name suffix in the pre-constructed domain name suffix reputation table.

Specifically, a malicious website may also imitate a legitimate one through its domain name suffix. A website cannot register its name under every top-level domain, and the domains of some countries and regions are difficult to register, so a malicious website can occupy the top-level domains that the legitimate website has left unregistered. Counterfeiting through the domain name suffix is hard to detect directly, but malicious counterfeit domain names can be screened out by querying the reputation of the domain name suffix. The invention ranks and scores the domain name suffixes of malicious domain names to form a pre-constructed domain name suffix reputation table, and determines the reputation of the suffix of the domain name to be detected by matching it against the table, thereby providing a basis and reference for judging the domain name as a whole.

By specifying the reputation query process for the domain name suffix, the malicious domain name identification method of this embodiment is made more specific and accurate.
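A suffix reputation lookup of the kind described in S320 might look like the sketch below. The table contents, the score range of 0 to 1 (higher meaning less reputable) and the default score for unknown suffixes are all assumptions made for illustration; the placeholder suffix names are not real TLDs.

```python
from typing import Mapping

# Hypothetical reputation table: suffix -> score in [0, 1], where a higher
# value means a worse reputation. Names and values are made up for this sketch.
SUFFIX_REPUTATION = {"safe-example": 0.1, "risky-example": 0.9}


def suffix_score(
    suffix: str,
    table: Mapping[str, float] = SUFFIX_REPUTATION,
    default: float = 0.5,
) -> float:
    """Look up the reputation score of a suffix.

    Suffixes missing from the table receive a neutral default score, which
    is an assumption of this sketch rather than something the source states.
    """
    key = suffix.lower().strip(".")
    return table.get(key, default)
```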

Optionally, as shown in fig. 9, S100, dividing the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix, includes:

S110: Divide the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix, wherein the domain name to be identified comprises an SNI, DNS and/or CN domain name.

Specifically, TLS encrypted traffic mainly involves three kinds of domain names: SNI, DNS and CN. This embodiment identifies malicious domain names through comprehensive detection of the SNI, DNS and/or CN domain names; determining whether these domain names are malicious provides an explicit basis for identifying and classifying TLS encrypted traffic.

By specifying which domain names are selected for detection, the malicious domain name identification method of this embodiment is made more specific and accurate.

Optionally, as shown in fig. 10, S500, judging whether the domain name to be identified is a malicious domain name according to the detection and query conclusions, includes:

S510: Score the domain name prefix and the main domain name according to the detection conclusion, and score the domain name suffix according to the query conclusion.

S520: Calculate the probability that the domain name to be identified is a malicious domain name according to the scores of the domain name prefix, the main domain name and the domain name suffix.

Specifically, the domain name to be detected is processed in segments, and a different analysis method is applied to each part. According to the detection and query conclusions, the domain name prefix, the main domain name and the domain name suffix are scored separately, and the final score of the domain name to be detected, that is, the probability that it is a malicious domain name, is obtained as the weighted sum of the three scores.

More specifically, the probability that the domain name to be identified is a malicious domain name can be calculated from the scores of the domain name prefix, the main domain name and the domain name suffix. For example, suppose the preset standard value is 1 and the domain name prefix, the main domain name and the domain name suffix carry the same proportion in the probability calculation. The full score of each part is then 1, and the first ratio of the domain name prefix, the second ratio of the main domain name and the third ratio of the domain name suffix are all 1 (equal weighting). Assume that the domain name prefix, the main domain name and the domain name suffix each score 0.9. If the scores measure how standard the domain name to be identified is, the probability that it is a malicious domain name is 1 - (standard probability of the domain name prefix, the main domain name and the domain name suffix) = 1 - [(0.9 + 0.9 + 0.9)/3] = 0.1. If the scores measure how non-standard the domain name is, the probability that it is a malicious domain name is the non-standard probability of the domain name prefix, the main domain name and the domain name suffix, that is, (0.9 + 0.9 + 0.9)/3 = 0.9. The preset standard value can be 1, 10, 100, and so on, and the proportions of the domain name prefix, the main domain name and the domain name suffix in the probability calculation can be adjusted according to the importance of each segment in the actual calculation.

By specifying the scoring and probability calculation for the domain name to be detected, the malicious domain name identification method of this embodiment is made more specific and accurate.
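The worked example in S520 can be reproduced with the short sketch below. The function name and parameter names are hypothetical; the sketch assumes the weighted sum is normalized by the total weight and by the preset standard value so that the result lies between 0 and 1.

```python
from typing import Sequence


def malicious_probability(
    scores: Sequence[float],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
    standard_value: float = 1.0,
    scores_measure_legitimacy: bool = True,
) -> float:
    """Combine prefix, main domain and suffix scores into one probability.

    scores: scores of (prefix, main domain, suffix), each in [0, standard_value].
    weights: relative proportion of each part in the calculation.
    scores_measure_legitimacy: True if a high score means "standard" (benign),
        False if a high score means "non-standard" (suspicious).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0 or standard_value <= 0:
        raise ValueError("weights and standard value must be positive")
    # Weighted average of the scores, rescaled to the range [0, 1].
    weighted = sum(w * s for w, s in zip(weights, scores)) / total_weight
    normalized = weighted / standard_value
    return 1.0 - normalized if scores_measure_legitimacy else normalized
```

With three scores of 0.9 and equal weights, the function returns approximately 0.1 when the scores measure legitimacy and approximately 0.9 when they measure non-standardness, matching the example above.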

This embodiment provides a method for identifying a malicious domain name. Malicious domain names are usually generated by a DGA algorithm, are C&C domain names, or imitate known websites. Targeting these characteristics, the method segments the domain name and applies different detection methods to the domain name prefix, the main domain name and the domain name suffix: the domain name prefix and the main domain name are each checked for being a DGA domain name, a C&C domain name or a counterfeit domain name, and the domain name suffix is checked for counterfeiting by comparing suffix reputations. A comprehensive judgment is then made, so that malicious domain names can be identified accurately and efficiently.

An embodiment of the present invention provides a system for identifying a malicious domain name, as shown in fig. 11, including:

The domain name segmentation module 10 is configured to divide a domain name to be identified into a domain name prefix, a main domain name, and a domain name suffix, where the domain name suffix is determined according to the international TLDs, and the domain name prefix and the main domain name are determined according to their distance from the domain name suffix.

The domain name detection module 20 is configured to perform malicious domain name detection on the domain name prefix and/or the main domain name through a pre-constructed domain name recognition model, and to perform a domain name suffix reputation query on the domain name suffix.

The domain name identification module 30 is configured to determine whether the domain name to be identified is a malicious domain name according to the detection and query conclusions.

Optionally, the domain name detection module 20 includes:

The domain name recognition model sub-module comprises domain name recognition models for performing DGA domain name recognition, C&C domain name recognition and/or counterfeit known domain name recognition on the domain name prefix and/or the main domain name, wherein a C&C domain name is a domain name involved in malicious encrypted traffic that uses a C&C server.

Optionally, the domain name recognition model sub-module includes:

The DGA domain name recognition model unit is configured to, based on a DGA domain name recognition model constructed with an LSTM neural network, input the domain name prefix and/or the main domain name into the model to determine whether the domain name prefix and/or the main domain name is a DGA domain name.

Optionally, the domain name recognition model sub-module includes:

The C&C domain name recognition model unit is configured to, based on a C&C domain name recognition model constructed with a random forest algorithm, extract features of the domain name prefix and/or the main domain name, the features comprising the ratio of switches between letter strings and digits and/or whether a special character string is contained, and to input the features into the C&C domain name recognition model to determine whether the domain name prefix and/or the main domain name is a C&C domain name.

Optionally, the domain name recognition model sub-module includes:

The counterfeit domain name recognition model unit is configured to input the domain name prefix and/or the main domain name into the counterfeit domain name recognition model, and to calculate the edit distance between the domain name prefix and/or the main domain name and the known domain names in a known domain name library preset in the model, in order to determine whether the domain name prefix and/or the main domain name imitates a known domain name.
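The edit distance comparison used by this unit can be sketched as follows. The Levenshtein distance is a standard choice of edit distance; the threshold `max_distance`, the function names and the rule that an exact match does not count as an imitation are assumptions made for this example.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a character from a
                    current[j - 1] + 1,      # insert a character into a
                    previous[j - 1] + cost,  # substitute (or keep) a character
                )
            )
        previous = current
    return previous[-1]


def imitated_name(
    label: str, known_names: Iterable[str], max_distance: int = 1
) -> Optional[str]:
    """Return the known name that the label appears to imitate, or None.

    An exact match is the known name itself, so it is not treated as an
    imitation. The default threshold of 1 is a hypothetical value.
    """
    names = list(known_names)
    if label in names:
        return None
    for name in names:
        if edit_distance(label, name) <= max_distance:
            return name
    return None
```

For instance, `imitated_name("goog1e", ["baidu", "google"])` returns `"google"`, because the two strings differ by a single substituted character.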

Optionally, the system for identifying a malicious domain name further includes:

The domain name prefix impersonation module is configured to check whether the domain name prefix impersonates a known website: a known website list is constructed in advance, the domain name prefix is matched against the list, and whether the domain name prefix impersonates a known domain name is determined from the matching result.

Optionally, the domain name segmentation module 10 further includes:

The domain name segmentation unit is configured to take the first level before the domain name suffix as the main domain name, and all levels before the main domain name as the domain name prefix.

Optionally, the DGA domain name recognition model unit includes:

The DGA domain name training set subunit is configured to construct a DGA domain name training set. The samples in the DGA domain name training set comprise a published DGA domain name list and/or DGA domain names generated by published DGA domain name generation algorithms. The samples are processed to obtain an effective character string coding set, and the samples are converted into numerical form through this coding set.

The DGA domain name recognition model subunit is configured to construct a DGA domain name recognition model based on an LSTM neural network, and to train the model with the DGA domain name training set.

Optionally, the C&C domain name recognition model unit includes:

The C&C domain name training set subunit is configured to construct a C&C domain name training set: a published C&C domain name list is collected to obtain a set of C&C domain names, and the features of the domain name prefix and/or the main domain name to be analyzed are selected from the set through feature engineering to build the C&C domain name training set.

The C&C domain name recognition model subunit is configured to construct a C&C domain name recognition model, and to train it with a random forest algorithm on the C&C domain name training set.

Optionally, the domain name detection module 20 includes:

The domain name suffix query submodule is configured to construct a domain name suffix reputation table in advance, and to query the reputation ranking of the domain name suffix in the pre-constructed domain name suffix reputation table.

Optionally, the domain name segmentation module 10 includes:

The domain name selection submodule is configured to divide the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix, wherein the domain name to be identified comprises an SNI (Server Name Indication), DNS (Domain Name System) and/or CN (certificate Common Name) domain name.

Optionally, the domain name identifying module 30 includes:

The domain name scoring submodule is configured to score the domain name prefix and the main domain name according to the detection conclusion, and to score the domain name suffix according to the query conclusion.

The probability calculation submodule is configured to calculate the probability that the domain name to be identified is a malicious domain name according to the scores of the domain name prefix, the main domain name and the domain name suffix.

The identification system provided by the application identifies domain names by segmented processing. It divides the domain name to be detected into a domain name prefix, a main domain name and a domain name suffix and applies a different detection method to each part: dedicated models are built to detect the domain name prefix and/or the main domain name, and the domain name suffix is checked by comparing suffix reputations. The results are combined into a comprehensive score, so that malicious domain names can be identified accurately and efficiently.

A third embodiment of the present invention further provides a computer device, as shown in fig. 12, including a memory 1 and a processor 2, where the memory 1 stores a computer program, and the processor 2 implements any one of the above methods for identifying a malicious domain name when executing the computer program.

The memory 1 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 1 may in some embodiments be an internal storage unit of the identification system of malicious domain names, for example a hard disk. The memory 1 may also be an external storage device of the malicious domain name recognition system in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 1 may also include both an internal storage unit of the identification system of the malicious domain name and an external storage device. The memory 1 may be used to store not only application software installed in the malicious domain name recognition system and various types of data, such as codes of a malicious domain name recognition program, but also temporarily store data that has been output or is to be output.

The processor 2 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip in some embodiments, and is used for running program codes stored in the memory 1 or Processing data, such as executing a malicious domain name recognition program.

The computer device provided by the application identifies domain names by segmented processing. It divides the domain name to be detected into a domain name prefix, a main domain name and a domain name suffix and applies a different detection method to each part: dedicated models are built to detect the domain name prefix and/or the main domain name, and the domain name suffix is checked by comparing suffix reputations. The results are combined into a comprehensive score, so that malicious domain names can be identified accurately and efficiently.

A fourth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for identifying a malicious domain name according to any one of the above embodiments is implemented.

The malicious domain name detection system, the computer device and the computer readable storage medium provided by the application correspond to the method. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus, and the computer-readable storage medium described above may refer to the corresponding processes in the foregoing first method embodiment, and are not described herein again.

The method analyzes the domain name information in encrypted traffic and comprehensively examines the domain name characteristics of that traffic, which plays an important role in identifying and classifying TLS encrypted traffic. The invention identifies domain names by segmented processing, handles the different parts in different ways and scores them comprehensively, thereby strengthening the detection of domain names.

The invention provides a comprehensive and systematic method for identifying malicious domain names that detects them from multiple dimensions. Its core is to comprehensively score the different parts of a domain name from different dimensions, according to the different sources and the different types of malicious behavior.

The method for identifying the malicious domain name provided by the invention has the following working process:

Objective: to comprehensively score the domain names in TLS encrypted traffic based on domain name information;

Input: the domain name strings in TLS encrypted traffic;

Output: the probability that a domain name in TLS encrypted traffic is a malicious domain name.

The specific implementation process of the present invention, as shown in fig. 13, is as follows:

1. Extract the SNI, DNS and CN domain names from TLS encrypted traffic and segment each of them into a domain name prefix, a main domain name and a domain name suffix.

2. The domain name prefix is examined in four ways: whether it is a DGA domain name, whether it is a C&C domain name, whether it imitates a known domain name (that is, whether it has been made to look like a known website by replacing or modifying characters), and whether it directly uses a known domain name. After detection, the domain name prefix is given a comprehensive score.

3. The main domain name is examined in three ways: whether it is a DGA domain name, whether it is a C&C domain name, and whether it imitates a known domain name. After detection, the main domain name is given a comprehensive score.

4. The reputation of the domain name suffix is queried by matching it against the domain name suffix reputation table, and the domain name suffix is scored after the query.

5. The scores of the domain name prefix, the main domain name and the domain name suffix are weighted and summed to obtain the final score of the domain name, that is, the probability that the domain name is a malicious domain name.

By separately scoring the SNI, DNS and CN domain names in TLS encrypted traffic that contains them, the invention can determine whether these domain names are malicious, which assists the identification and classification of TLS encrypted traffic.

The embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The method, the device and the equipment for identifying the malicious domain name provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method for identifying a malicious domain name, comprising:

2. The method for identifying a malicious domain name according to claim 1, wherein the pre-constructed domain name identification model comprises:

3. The method for identifying a malicious domain name according to claim 2, wherein the domain name recognition model for performing DGA domain name recognition comprises:

4. The method for identifying a malicious domain name according to claim 2, wherein the domain name identification model for performing C&C domain name identification comprises:

5. The method for identifying a malicious domain name according to claim 2, wherein the domain name identification model for performing the counterfeit known domain name identification comprises:

6. The method for identifying a malicious domain name according to claim 1, wherein after performing malicious domain name detection on a domain name prefix and/or a main domain name through a pre-constructed domain name identification model and performing domain name suffix reputation query on the domain name suffix, the method further comprises:

7. The method for identifying a malicious domain name according to claim 1, wherein the domain name suffix is determined according to international TLD, and after the domain name prefix and the main domain name are determined according to a distance from the domain name suffix, the method further comprises:

8. The method for identifying malicious domain names according to claim 3, wherein the process of constructing the DGA domain name identification model based on the LSTM neural network comprises:

9. The malicious domain name identification method according to claim 4, wherein the process of constructing the C&C domain name identification model based on the random forest algorithm comprises:

10. The method of identifying malicious domain names according to claim 1, wherein the querying the domain name suffix for reputation comprises:

11. The method for identifying a malicious domain name according to any one of claims 1 to 10, wherein the dividing of the domain name to be identified into a domain name prefix, a main domain name and a domain name suffix comprises:

12. The method for identifying a malicious domain name according to claim 1, wherein the determining whether the domain name to be identified is the malicious domain name according to the detection and query conclusion comprises:

13. A system for identifying malicious domain names, comprising:

14. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the method for identifying a malicious domain name according to any one of claims 1 to 12 when executing the computer program.