CN114826681A - DGA domain name detection method, system, medium, equipment and terminal

Info

Publication number: CN114826681A
Application number: CN202210322471.9A
Authority: CN
Inventors: 付玉龙; 弓弛; 李智华
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-07-29

Abstract

The invention belongs to the technical field of computer networks and discloses a DGA domain name detection method, system, medium, equipment and terminal. The scheme comprises: a sample pairing rule for pairing domain names of the various categories; a twin (Siamese) architecture model Sim-BLA for fitting the domain name feature space, comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer; a network BLA, obtained by splitting the structure of the twin architecture model, for extracting domain name features; a module Weighted-v & d for performing a dedicated similarity measurement on two domain names; a method for calculating a reference vector that represents the feature profile of each category; and an efficient twin-architecture algorithm for multi-classification and unknown-class identification. The technical scheme requires neither feature engineering nor large-scale labeled data; the identification accuracy reaches more than 98%, and the accuracy for some categories is even 100%; the classification accuracy in a small-sample environment is improved, and the time a twin architecture needs for multi-class prediction is shortened.

Description

DGA domain name detection method, system, medium, equipment and terminal

Technical Field

The invention belongs to the technical field of computer networks, and particularly relates to a DGA domain name detection method, a DGA domain name detection system, a DGA domain name detection medium, DGA domain name detection equipment and a DGA domain name detection terminal.

Background

At present, attackers in cyberspace often attack or take control of users' computers, smartphones and other devices through malicious programs such as Trojan horses and worms. Once controlled, a user's device becomes part of an attacker-controlled "botnet". The attacker then sends instructions over the Internet to steal private data from the device, or remotely directs the device to take part in a denial-of-service attack on a particular target server. To evade detection and countermeasures while keeping botnet communication open, attackers use Domain Generation Algorithms (DGAs). With a DGA, the attacker does not need to write fixed domain names or IP addresses into the malicious program: the malicious program deployed on the zombie machine generates a large number of pseudo domain names and tries to connect to all or some of them, and the attacker only needs to register one or two of these domain names at random in advance to restore communication with the controlled equipment.

Initially, botnets were countered by preemptively registering DGA pseudo domain names or adding them to blacklists. However, this requires predicting the DGA pseudo domain names in advance, which in turn requires reverse engineering the DGA algorithm and is highly complex; moreover, the number of domain names generated by DGAs and the speed at which they are generated have increased greatly, so preemptive registration and blacklisting are no longer feasible. According to their detection principle, the current mainstream DGA domain name detection methods fall roughly into three types. The first type is based on analysis and statistics: for example, because a DGA attack must issue DNS queries continuously, the victim host receives a large number of non-existent domain (NXDomain) responses, and hosts in the same botnet generate DNS traffic with the same characteristics; malicious traffic can also be distinguished from normal traffic according to the length ratio of DNS requests to DNS responses, combined with the detection of query record types that clients do not commonly use. The second type combines traditional machine learning algorithms with feature engineering: normal domain names and DGA domain names are distinguished using features of legitimate and DGA domain names, including readability features such as entropy, n-gram values, word roots and affixes, pinyin and abbreviation features, and vowel distribution, together with mainstream machine learning algorithms such as K-means, SVM, random forest and XGBoost. The third type is based on deep learning models: network models such as CNN, RNN and LSTM are introduced, and DGA detection is performed by training a deep learning classifier.
However, current attack techniques are increasingly combined with new technologies such as big data and artificial intelligence and are continuously updated and iterated, and the three types of methods above cannot cope with variant DGA families or with DGA domain names whose characteristics closely resemble those of legitimate domain names. In a real, complex network environment, the three types of methods also cannot handle the extreme imbalance between normal legitimate domain names, which are plentiful and widely distributed in feature space, and the domain names of the various DGA families, for which samples are difficult to obtain, nor can they meet the resulting small-sample learning requirement.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) Current attack techniques are increasingly combined with new technologies such as big data and artificial intelligence and are continuously updated and iterated; existing methods cannot cope with variant DGA families or with DGA domain names whose characteristics closely resemble those of legitimate domain names.

(2) In a real, complex network environment, existing methods cannot handle the extreme imbalance between normal legitimate domain names, which are plentiful and widely distributed in feature space, and the domain names of the various DGA families, for which samples are difficult to obtain, nor can they meet the resulting small-sample learning requirement.

Disclosure of Invention

The invention provides a DGA domain name detection method, system, medium, equipment and terminal, and in particular relates to a DGA domain name detection method, system, medium, equipment and terminal based on a twin architecture.

The invention is realized as follows. The DGA domain name detection method comprises: labeling and sorting the collected domain name data, and pairing the normal domain names with the domain names of each DGA family according to a sample pairing rule; establishing a twin-architecture-based learning model Sim-BLA and inputting the paired data into the learning model Sim-BLA pair by pair; training the twin-architecture-based classification learning model and splitting it to obtain a feature extraction network BLA and a similarity measurement function Weighted-v & d; generating reference vectors for all categories according to a reference vector generation rule; inputting a captured domain name to be detected into the feature extraction network BLA to obtain the feature vector of the domain name to be detected; and constructing a twin-architecture multi-classification and unknown-class identification algorithm, and classifying and identifying the feature vector of the domain name to be detected according to the multi-classification prediction algorithm.

Further, the DGA domain name detection method comprises the following steps:

step one, pairing the collected samples, which contain normal domain names and domain names of the various DGA classes, by using a twin-architecture sample pairing rule;

step two, establishing a twin-architecture-based learning model Sim-BLA for training and fitting the complex domain name feature space, wherein the learning model Sim-BLA comprises a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer;

step three, inputting the paired domain name two-tuples into the learning model Sim-BLA pair by pair for model training, and then splitting the model to obtain a feature extraction network BLA for extracting domain name features and a similarity measurement module Weighted-v & d for performing similarity measurement on two domain name samples;

step four, before the model is applied, calculating a corresponding reference vector for each domain name category used in the model training process;

step five, judging the domain name to be detected captured from the current network by using the multi-classification prediction algorithm, wherein the judgment result comprises whether the domain name is a DGA domain name, the DGA class to which the domain name belongs, and whether the domain name belongs to an unknown class.

Further, in step one, the ratio of the number of paired samples to the total number of samples of the same class is introduced as a pairing coefficient, and domain names are paired so as to keep the training of similar and dissimilar sample pairs balanced under the dual-input training mechanism of the twin architecture and to overcome the extreme imbalance between DGA domain names and normal domain names.
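The pairing rule above can be illustrated with a minimal Python sketch. The text does not spell out how the pairing coefficient is applied, so the sketch assumes that each sample is paired with `ceil(coefficient * class_size)` same-class partners and with an equal number of different-class partners; the function name `make_pairs`, the sample domain names and the seed are hypothetical.

```python
import math
import random


def make_pairs(samples_by_class, coefficient, seed=0):
    """Return (domain_a, domain_b, label) triples; label 1 = same class, 0 = different."""
    rng = random.Random(seed)
    classes = sorted(samples_by_class)
    pairs = []
    for cls in classes:
        members = samples_by_class[cls]
        others = [d for c in classes if c != cls for d in samples_by_class[c]]
        # Assumed rule: partners per sample = pairing coefficient * class size.
        k = max(1, math.ceil(coefficient * len(members)))
        for anchor in members:
            same_pool = [d for d in members if d != anchor]
            k_same = min(k, len(same_pool), len(others))
            for partner in rng.sample(same_pool, k_same):
                pairs.append((anchor, partner, 1))
            # Draw the same number of cross-class partners so both labels stay balanced.
            for partner in rng.sample(others, k_same):
                pairs.append((anchor, partner, 0))
    return pairs


# Hypothetical toy data: one benign class and one DGA family.
data = {
    "benign": ["google", "wikipedia", "github", "python"],
    "dga_a": ["xkqzvw", "qpwzkx", "zzkqpv", "vbnqwx"],
}
pairs = make_pairs(data, coefficient=0.5)
```

Because every anchor contributes as many dissimilar pairs as similar ones, a small DGA family is not swamped by the much larger normal class during pair-by-pair training.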

Furthermore, the twin-architecture-based learning model Sim-BLA in step two is a two-branch parallel weight-sharing structure: the left and right inputs pass through two parallel feature extraction networks that share weights, and the model comprises a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer.

The preprocessing layer pads and truncates each input domain name to a uniform length; the embedding layer vectorizes the domain name character string, building word vectors from the domain name character sequence by combining one-hot encoding with word embedding; the feature extraction layer uses a BiLSTM structure to extract features of the input domain name word vectors in both the forward and backward directions, uses an attention mechanism to evaluate the importance of each part of the features, and finally outputs the feature vector of the domain name; the similarity calculation layer performs a comprehensive similarity measurement on the two input feature vectors and outputs a value. Specifically:

the input domain name character string is unified to a fixed size through padding and truncation, and a mapping from domain name characters to numbers is formed that covers legal domain name characters, the padding character and illegal characters, so that the domain name character string is preprocessed into a one-dimensional vector of uniform length;

the one-dimensional vector is converted into a two-dimensional non-sparse (dense) vector by combining one-hot encoding with word embedding;

a network structure combining a BiLSTM with an attention mechanism is used: the features of the forward and backward time sequences of the domain name are fused, and a weighted summation over the final time-sequence output of the BiLSTM serves as the attention distribution value, so that the word-level sample features of the domain name are enhanced into more accurate, better-generalizing sentence-level sample features.
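The preprocessing and embedding steps above can be sketched in plain Python as follows. The alphabet, the index assignment (0 for padding, 1 for illegal characters), right-side padding, the length 16 and the toy embedding table are all assumptions made for illustration; the source does not specify them.

```python
import string

# Assumed legal alphabet: lowercase letters, digits, hyphen and dot.
LEGAL = string.ascii_lowercase + string.digits + "-."
PAD, ILLEGAL = 0, 1  # assumed reserved indices
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(LEGAL)}
VOCAB_SIZE = len(LEGAL) + 2


def preprocess(domain, max_len=16):
    """Truncate to max_len, map characters to integers, pad on the right."""
    ids = [CHAR_TO_ID.get(ch, ILLEGAL) for ch in domain.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def one_hot(index, size=VOCAB_SIZE):
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(ids, table):
    """Multiply each one-hot row by the embedding table.

    The product of a one-hot vector and a matrix selects one row, so the
    result equals a direct lookup table[idx]; the sparse one-hot input
    becomes a dense (len(ids) x dim) matrix.
    """
    dim = len(table[0])
    out = []
    for idx in ids:
        hot = one_hot(idx)
        out.append([sum(hot[v] * table[v][d] for v in range(VOCAB_SIZE))
                    for d in range(dim)])
    return out


# Toy 2-dimensional embedding table (in a real model these weights are learned).
table = [[float(v), 2.0 * v] for v in range(VOCAB_SIZE)]
ids = preprocess("ab_c", max_len=6)
dense = embed(ids, table)
```

In a trained model the table would be a learned parameter and the lookup would be done directly; the explicit one-hot product is shown only to make the combination of one-hot encoding and word embedding visible.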

Furthermore, in step three, a similarity measurement module Weighted-v & d is proposed that combines the twin-architecture loss function and jointly considers the relations among the domain name feature vectors, several distance measurement functions and the original character sets of the domain names: the per-dimension values of the two input feature vectors, their Manhattan distance, weighted Euclidean distance and included-angle cosine, and the Jaccard distance between the character sets of the original domain names are spliced into one vector, and a fully connected network maps this vector to a single numerical value, which is used as the similarity metric of the two inputs.
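The vector splicing performed by Weighted-v & d can be sketched as below. Several details are assumptions rather than statements of the source: the two feature vectors are simply concatenated, the Euclidean weights default to 1, and a single linear unit with a sigmoid (`dense_score`, with hypothetical weights `w` and bias `b`) stands in for the trained fully connected network.

```python
import math


def metric_inputs(u, v, name_a, name_b, weights=None):
    """Build the spliced vector that is fed to the fully connected layer."""
    weights = weights or [1.0] * len(u)
    manhattan = sum(abs(a - b) for a, b in zip(u, v))
    weighted_euclid = math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0
    # Jaccard distance between the character sets of the original domain names.
    set_a, set_b = set(name_a), set(name_b)
    union = set_a | set_b
    jaccard_dist = 1.0 - len(set_a & set_b) / len(union) if union else 0.0
    return list(u) + list(v) + [manhattan, weighted_euclid, cosine, jaccard_dist]


def dense_score(x, w, b):
    """One linear unit plus sigmoid; a stand-in for the trained network."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


x = metric_inputs([1.0, 0.0], [0.0, 1.0], "abc", "bcd")
score = dense_score(x, [0.0] * len(x), 0.0)
```

Feeding both the raw vector values and several hand-computed distances lets the learned layer weight each measure instead of committing to a single fixed distance function.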

Further, in step five, the feature extraction network BLA is used in advance to calculate, for each known class, a reference vector that represents the features of that class and takes part in the multi-class prediction process; during multi-class prediction, the similarity measurement module Weighted-v & d calculates the similarity between the domain name to be detected and the reference vector of each class, and the domain name to be detected is classified, or judged to be unknown, according to the similarity and the unknown-class boundary value.

Another objective of the present invention is to provide a DGA domain name detection system using the DGA domain name detection method, wherein the DGA domain name detection system comprises:

the domain name matching module is used for matching the collected samples containing the normal domain name and the domain names of various classes of the DGA by using a twin framework sample matching rule;

the learning model building module is used for establishing a twin-architecture-based learning model Sim-BLA for training and fitting the complex domain name feature space, the model comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer;

the model training module is used for inputting the paired domain name binary groups into a learning model Sim-BLA in pairs for model training, and then splitting the model to obtain a feature extraction network BLA for extracting domain name features and a similarity measurement module Weighted-v & d for performing similarity measurement on two domain name samples;

the reference vector calculation module is used for calculating corresponding reference vectors for all domain name categories in the model training process before the model is applied;

and the domain name judgment module to be detected is used for judging the domain name to be detected captured by the current network by using a multi-classification prediction algorithm, and the judgment result comprises whether the domain name is a DGA domain name, the DGA class to which the domain name belongs and whether the domain name is an unknown domain name.

It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:

labeling and sorting the collected domain name data, and pairing the normal domain names with the domain names of each DGA family according to a sample pairing rule; establishing a twin-architecture-based learning model Sim-BLA and inputting the paired data into the learning model Sim-BLA pair by pair; training the twin-architecture-based classification learning model and splitting it to obtain a feature extraction network BLA and a similarity measurement function Weighted-v & d; generating reference vectors for all categories according to a reference vector generation rule; inputting a captured domain name to be detected into the feature extraction network BLA to obtain the feature vector of the domain name to be detected; and constructing a twin-architecture multi-classification and unknown-class identification algorithm, and classifying and identifying the feature vector of the domain name to be detected according to the multi-classification prediction algorithm.

It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the same steps as set out above for the computer device.

Another object of the present invention is to provide an information data processing terminal for implementing the DGA domain name detection system.

In combination with the above technical solutions and the technical problems to be solved, the advantages and positive effects of the technical solutions claimed by the present invention are analysed from the following aspects:

First, with regard to the technical problems in the prior art and the difficulty of solving them, the technical problems solved by the technical scheme of the present invention are analysed in close connection with the results and data obtained during research and development, together with the creative technical effects brought about by solving them. The specific description is as follows:

The twin-architecture-based DGA domain name detection system provided by the invention comprises: a sample pairing rule for pairing domain names of each class; a twin architecture model Sim-BLA for fitting the domain name feature space, comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer; a network BLA, obtained by splitting the structure of the twin architecture model Sim-BLA, for extracting domain name features; a module Weighted-v & d for performing a dedicated similarity measurement on two domain names; a method for calculating a reference vector that represents the feature profile of each category; and an efficient twin-architecture multi-classification and unknown-class identification algorithm. It requires neither feature engineering nor large-scale labeled data; the identification accuracy of most categories reaches more than 98%, and that of some categories is even 100%. The whole process classifies domain names in the current network quickly and accurately, including effective identification of unknown classes, greatly improves classification accuracy in a small-sample environment, and greatly reduces the time a twin architecture needs for multi-class prediction.

According to the twin-architecture-based DGA domain name detection method provided by the invention, the sample pairing algorithm, feature extraction network, loss function and multi-class prediction algorithm of the twin architecture are modified and redesigned. No feature engineering is performed on the domain names and no large-scale training data is needed: through pair-by-pair training, the twin architecture continuously learns to fit the complex feature distributions of the various classes in the domain name feature space, and multi-class prediction of domain names, including identification of unknown classes, is finally achieved by measuring the similarity between samples. With only 1000 training samples per class, the recognition rate of the algorithm of the invention for most classes reaches more than 98%, the recognition rate for some families with relatively simple features is even 100%, and unknown DGA families can be effectively recognized. The invention greatly improves the efficiency and accuracy of DGA detection and has excellent small-sample learning and unknown-class identification capabilities.

According to the method, the twin-architecture dual-input model is used only for training the base model; the feature extraction network and the similarity distance module in the base model are separated out and used independently for predicting actual domain names, and operations that can be performed in advance are performed in advance, which greatly simplifies the prediction process of the twin architecture and significantly reduces domain name prediction time.

According to the method, the twin-architecture-based learning model Sim-BLA is structurally split to obtain the feature extraction network BLA and the similarity measurement module Weighted-v & d, which are used for feature vector extraction and similarity measurement respectively; this removes the redundant structure of the twin architecture when it is used for classification and greatly shortens the time the twin architecture needs for sample classification. The invention evaluates how well a sample can represent its category by analysing, for each sample in a category, the distribution of its distances to the other samples of the same category, combining the distance expectation and the distance variance, and selects the feature vector of the sample with the highest representation capability as the reference vector that represents the overall feature profile of the category.
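The reference vector selection described above can be sketched as follows. The source states only that the distance expectation and the distance variance are combined; scoring each candidate by `mean + variance_weight * variance` and picking the lowest score is an assumption, as are the Euclidean stand-in distance and the names `pick_reference` and `variance_weight`. In the described system the distance would come from the trained similarity module.

```python
import math
from statistics import mean, pvariance


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_reference(vectors, distance=euclidean, variance_weight=1.0):
    """Return the index of the most representative vector of one class.

    Each candidate is scored by the mean of its distances to the other
    class members plus a weighted variance of those distances; the lowest
    score wins. Needs at least two vectors.
    """
    best_index, best_score = None, float("inf")
    for i, candidate in enumerate(vectors):
        dists = [distance(candidate, other)
                 for j, other in enumerate(vectors) if j != i]
        score = mean(dists) + variance_weight * pvariance(dists)
        if score < best_score:
            best_index, best_score = i, score
    return best_index


# Toy class: four corner points and their center.
cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
reference_index = pick_reference(cluster)
```

The reference vectors are computed once after training, so prediction later compares a new domain name against one vector per class instead of against many training samples.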

When the model is applied online for domain name prediction, the domain name to be detected is input into the feature extraction network BLA to obtain its feature vector; the similarity measurement module Weighted-v & d then calculates the similarity value between the domain name to be detected and the precomputed reference vector of every known class, and the class with the minimum similarity value is selected as the class to which the domain name to be detected most probably belongs (the value behaves as a distance, so a smaller value indicates greater similarity). In particular, if the minimum similarity value is still larger than the experimentally determined unknown-class limit value, the method judges that the domain name to be detected belongs to a new, unknown class; otherwise, the domain name to be detected is predicted to belong to the class with the minimum similarity value.
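The prediction rule in the preceding paragraph amounts to a nearest-reference search with a rejection threshold. The sketch below assumes a Manhattan distance as a stand-in for the trained Weighted-v & d module; the class names, reference vectors and threshold value are hypothetical.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def classify(feature, references, distance, unknown_threshold):
    """Return (label, value) for the closest class reference.

    `references` maps class name -> precomputed reference vector. If even
    the smallest value exceeds the threshold, the sample is reported as
    belonging to an unknown class.
    """
    best_class, best_value = None, float("inf")
    for name, ref in references.items():
        value = distance(feature, ref)
        if value < best_value:
            best_class, best_value = name, value
    if best_value > unknown_threshold:
        return "unknown", best_value
    return best_class, best_value


refs = {"benign": [0.0, 0.0], "dga_a": [5.0, 5.0]}
label_near, _ = classify([0.5, 0.2], refs, manhattan, unknown_threshold=2.0)
label_far, _ = classify([20.0, 20.0], refs, manhattan, unknown_threshold=2.0)
```

Each prediction costs one feature extraction and one comparison per known class, which is where the reduction in prediction time claimed for the split model comes from.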

The twin-architecture-based DGA detection method, together with the sample pairing rule and the similarity measurement module designed for the DGA detection scenario, greatly enhances small-sample learning capability, so that large-scale training data is not needed: with 1000 samples per class, the classification performance matches or even exceeds that of other existing algorithms trained with tens of thousands of samples per class. The lightweight feature extraction network combining a BiLSTM with an attention mechanism extracts the bidirectional feature dependence of the domain name and dynamically adjusts the importance of domain name features while preserving the time efficiency of the algorithm, which greatly improves the network's ability to fit the domain name feature space. By purposefully splitting the twin-architecture-based learning model Sim-BLA and using the multi-classification prediction algorithm, the time the twin architecture needs for classification is greatly shortened, multi-classification accuracy is improved, and unknown classes are effectively identified.

Secondly, considering the technical scheme as a whole or from the perspective of products, the technical effect and advantages of the technical scheme to be protected by the invention are specifically described as follows:

The technical scheme of the invention requires neither feature engineering nor large-scale data labeling; the recognition accuracy of most categories reaches more than 98%, and the accuracy of some categories is even 100%. The whole process classifies domain names in the current network quickly and accurately, including effective identification of unknown classes, greatly improves classification accuracy in a small-sample environment, and greatly reduces the time a twin architecture needs for multi-class prediction.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a DGA domain name detection method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a DGA domain name detection method according to an embodiment of the present invention;

Fig. 3 is a block diagram of a DGA domain name detection system according to an embodiment of the present invention;

Fig. 4 is an overall structure diagram of the DGA domain name classifier Sim-BLA used for training, according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the network structure of the feature extraction layer according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the similarity measurement module Weighted-v & d according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the process for calculating a domain name reference vector according to an embodiment of the present invention;

Fig. 8 is a flowchart of multi-class prediction of domain name samples by Sim-BLA according to an embodiment of the present invention;

In the figures: 1. domain name pairing module; 2. learning model building module; 3. model training module; 4. reference vector calculation module; 5. to-be-detected domain name judgment module.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In view of the problems in the prior art, the present invention provides a DGA domain name detection method, system, medium, device and terminal, which are described in detail below with reference to the accompanying drawings.

First, an explanatory embodiment. To enable those skilled in the art to fully understand how the present invention is implemented, this section provides an explanatory embodiment that expands on the claims.

Example 1

The twin-architecture-based DGA domain name detection method provided by the embodiment of the invention comprises: a sample pairing rule for pairing domain names of the various classes; a twin architecture model Sim-BLA for fitting the domain name feature space, comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer; a network BLA, obtained by splitting the structure of the twin architecture model Sim-BLA, for extracting domain name features; a module Weighted-v & d for performing a dedicated similarity measurement on two domain names; a method for calculating a reference vector that represents the feature profile of each category; and an efficient twin-architecture multi-classification and unknown-class identification algorithm.

As shown in Fig. 1, the DGA domain name detection method provided in the embodiment of the present invention includes the following steps:

S101, pairing the collected samples, which contain normal domain names and domain names of the various DGA classes, by using a twin-architecture sample pairing rule;

S102, establishing a twin-architecture-based learning model Sim-BLA for training and fitting the complex domain name feature space, wherein the learning model Sim-BLA comprises a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer;

S103, inputting the paired domain name two-tuples into the learning model Sim-BLA pair by pair for model training, and then splitting the model to obtain a feature extraction network BLA for extracting domain name features and a similarity measurement module Weighted-v & d for performing similarity measurement on two domain name samples;

S104, before the model is applied, calculating a corresponding reference vector for each domain name category used in the model training process;

S105, judging the domain name to be detected captured from the current network by using the multi-classification prediction algorithm, wherein the judgment result comprises whether the domain name is a DGA domain name, the DGA class to which the domain name belongs, and whether the domain name belongs to an unknown class.

Preferably, the sample pairing rule for pairing domain name samples is designed for the dual-input training mechanism of the twin architecture, taking into account the balance between training on same-class and different-class sample pairs and the imbalance between DGA domain names and normal domain names.

Preferably, the twin-architecture-based learning model Sim-BLA is a two-branch parallel weight-sharing structure, wherein: the preprocessing layer pads and truncates each input domain name to a uniform length; the embedding layer vectorizes the domain name character string, building word vectors from the domain name character sequence by combining one-hot encoding with word embedding; the feature extraction layer uses a BiLSTM structure to extract features of the input domain name word vectors in both the forward and backward directions, uses an attention mechanism to evaluate the importance of each part of the features, and finally outputs the feature vector of the domain name; the similarity calculation layer performs a comprehensive similarity measurement on the two input feature vectors and outputs the resulting value.
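The attention step of the feature extraction layer can be illustrated with a small sketch of attention pooling over per-character hidden states. The source does not give the scoring function, so dot-product scoring against a learned vector (`query`, hypothetical) followed by a softmax is an assumption; `hidden_states` stands for the BiLSTM outputs, one vector per character position.

```python
import math


def attention_pool(hidden_states, query):
    """Weight each time step by softmax(h_t . query) and sum the states.

    Returns the pooled feature vector and the attention weights.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights


# Two toy time steps; a zero query gives uniform attention.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The pooled vector has the same size regardless of how the importance is distributed over positions, which is what lets a single fixed-length feature vector represent the whole domain name.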

Preferably, the learning model Sim-BLA based on the twin framework is subjected to structure splitting to obtain a feature extraction network BLA and a similarity measurement module Weighted-v & d which are respectively used for feature vector extraction and similarity measurement, so that the redundant structure of the twin framework when the twin framework is used for classification is simplified, and the time of the twin framework when the twin framework is used for sample classification is greatly reduced.

Preferably, the twin architecture dual input model is used only for training of the base model. The feature extraction network and the similarity distance module in the basic model are independently separated and used for predicting the actual domain name, the operation which can be processed in advance is processed in advance, the prediction process of a twin framework is greatly simplified, and the domain name prediction time is obviously reduced.

Preferably, through analyzing the distribution of the distance values between the samples in each category and the samples of the same category, the ability of the sample to represent a certain category is evaluated in a manner of combining the distance expectation and the distance variance, and accordingly, the feature vector of the sample with the highest representation ability is selected as the reference vector for representing the overall feature condition of the category.

Preferably, when the model is applied online for domain name prediction, the domain name to be detected is input into the feature extraction network BLA to obtain its feature vector, the similarity measurement module Weighted-v & d then calculates the similarity value between the domain name to be detected and each precomputed known-class reference vector, and the class with the minimum similarity value is selected as the class to which the domain name most probably belongs. In particular, if the minimum similarity value is still greater than the experimentally defined limit value of the unknown class, the invention considers that the domain name to be detected belongs to a new unknown class; otherwise the domain name is predicted as the class with the minimum similarity value.

Preferably, the feature extraction network BLA is used in advance to calculate, for each known class, a reference vector that represents the features of that class and takes part in the multi-class prediction process; during multi-class prediction, the similarity measurement module Weighted-v & d calculates the similarity between the domain name to be detected and the reference vector of each class, and the domain name is classified or judged according to the similarity and the boundary value of the unknown class.

As shown in fig. 3, the DGA domain name detection system provided in the embodiment of the present invention includes:

the domain name pairing module 1 is used for pairing the collected samples, which contain normal domain names and DGA domain names of various classes, by using a twin-architecture sample pairing rule;

the learning model building module 2 is used for building a twin-architecture-based learning model Sim-BLA for training and fitting a domain name complex feature space, and comprises a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer;

the model training module 3 is used for inputting the paired domain name binary groups into a learning model Sim-BLA in pairs for model training, and then splitting the model to obtain a feature extraction network BLA for extracting domain name features and a similarity measurement module Weighted-v & d for performing similarity measurement on two domain name samples;

the reference vector calculation module 4 is used for calculating corresponding reference vectors for each domain name category in the model training process before the model is applied;

and the domain name judgment module 5 is used for judging the domain name to be detected captured by the current network by using a multi-classification prediction algorithm, and judging results comprise whether the domain name is a DGA domain name, the DGA class to which the domain name belongs and whether the domain name is an unknown domain name.

Example 2

As shown in fig. 2, the DGA domain name detection method based on a twin architecture provided in an embodiment of the present invention comprises: a twin-architecture sample pairing rule, used to pair the collected samples, which contain normal domain names and DGA domain names of various classes, so as to balance the training of same-class and different-class samples under the twin-architecture dual-input training mechanism and to overcome the extreme class imbalance found in complex real-world network environments; a twin-architecture-based learning model Sim-BLA for training and fitting the complex domain name feature space, comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer; a feature extraction network BLA for extracting domain name features and a similarity measurement module Weighted-v & d for measuring the similarity of two domain name samples, both obtained by splitting the structure of the trained model Sim-BLA; a reference vector extraction rule for extracting the reference vector of each class using the feature extraction network BLA; and a multi-classification prediction algorithm that uses the above modules to comprehensively judge the class of a domain name to be detected captured in the current network, including whether it belongs to an unknown class.

In the above technical solution, preferably, the sample pairing rule for pairing domain name samples needs to balance the training of same-class and different-class samples under the twin-architecture dual-input training mechanism and to overcome the extreme imbalance between DGA domain names and normal domain names.

Specifically, a pairing coefficient p is introduced, which represents the ratio of the number of pairs formed between one sample and other samples of the same class to the total number of samples of that class, and satisfies 1/m ≤ p ≤ 1. Given n classes of black samples, there are m samples in each class. For each black sample x, m × p samples are selected in total from the n - 1 classes other than the class to which x belongs and are each paired with x; in addition, m × p samples (not including x itself) are randomly selected from the class of x and each paired with x. The method further randomly selects n × m samples from the normal domain name data; for each white sample, 3 × p × m samples are randomly selected from each class of black samples and paired with it, and 3 × p × m × n other white samples are randomly selected and paired with it, so that 8 × p × m × n domain name binary groups for training are generated in total.
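
The black-sample part of the pairing rule can be illustrated with a minimal Python sketch. It only covers the pairing of DGA ("black") samples with each other; the function name `pair_black_samples`, the rounding of m × p to an integer, and the capping of the count at the number of available candidates are assumptions made for illustration and are not specified in the text.

```python
import random

def pair_black_samples(classes, p, seed=0):
    """Pair each black sample with k = round(m * p) samples drawn from the
    other classes and k samples drawn from its own class.

    classes: dict mapping class label -> list of domain names (m per class).
    Returns a list of (domain_a, domain_b, same_class) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for label, samples in classes.items():
        m = len(samples)
        k = max(1, round(m * p))  # assumption: m * p is rounded to an integer
        # Pool of all samples belonging to the other n - 1 classes.
        others = [d for lab, ds in classes.items() if lab != label for d in ds]
        for i, x in enumerate(samples):
            same = samples[:i] + samples[i + 1:]  # own class, excluding x
            # Assumption: cap k at the number of candidates available.
            for y in rng.sample(others, min(k, len(others))):
                pairs.append((x, y, False))
            for y in rng.sample(same, min(k, len(same))):
                pairs.append((x, y, True))
    return pairs

# Three hypothetical DGA families with m = 4 samples each.
demo = {
    "family_a": ["a1", "a2", "a3", "a4"],
    "family_b": ["b1", "b2", "b3", "b4"],
    "family_c": ["c1", "c2", "c3", "c4"],
}
pairs = pair_black_samples(demo, p=0.5)
```

With p = 0.5 and m = 4, every black sample contributes two different-class pairs and two same-class pairs, so the two kinds of pairs are balanced by construction.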

As shown in fig. 4, in the above technical solution, preferably, the twin-architecture-based learning model Sim-BLA is a two-way parallel weight-sharing structure: its left and right inputs feed two parallel feature extraction networks that share weights, each comprising the same preprocessing layer, embedding layer and feature extraction layer. Finally, a similarity calculation layer measures the similarity of the feature vectors extracted from the two inputs and outputs the value.

Specifically, the preprocessing layer unifies the character length of the input domain name to a fixed size through padding and truncation. Then, a character-to-number mapping is built over the 40 characters that can occur in total, namely the legal domain name characters 'abcdefghijklmnopqrstuvwxyz0123456789-' together with the padding character and illegal characters, and the domain name is converted into a one-dimensional vector of length 64. To make this vector convenient for the subsequent feature extraction network, and to prevent overly sparse input vectors from degrading feature extraction, the embedding layer maps the one-dimensional domain name vector into a high-dimensional vector space by combining it with word embedding: the preprocessed length-64 one-dimensional domain name vector is one-hot encoded into a two-dimensional sparse matrix of size 64 × 40, which is multiplied by the embedding layer parameter matrix to produce a two-dimensional dense matrix of size 64 × 128 that serves as the input of the feature extraction layer.
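
The preprocessing and embedding steps can be sketched in plain Python as follows. The sizes 64, 40 and 128 come from the description; the allocation of indices (0 for padding, 1 for any illegal character, 2 onward for legal characters), the function names, and the random parameter matrix standing in for trained embedding weights are assumptions for illustration only.

```python
import random

LEGAL = "abcdefghijklmnopqrstuvwxyz0123456789-"  # 37 legal characters
PAD, ILLEGAL = 0, 1   # assumed indices for padding and illegal characters
VOCAB_SIZE = 40       # vocabulary size stated in the description
MAX_LEN = 64          # fixed domain name length stated in the description
EMBED_DIM = 128       # embedding width stated in the description
CHAR_TO_ID = {c: i + 2 for i, c in enumerate(LEGAL)}

def encode(domain):
    """Truncate or pad a domain name to MAX_LEN integer indices."""
    ids = [CHAR_TO_ID.get(c, ILLEGAL) for c in domain.lower()[:MAX_LEN]]
    return ids + [PAD] * (MAX_LEN - len(ids))

def one_hot(ids):
    """Expand the index vector into a MAX_LEN x VOCAB_SIZE sparse matrix."""
    return [[1 if j == i else 0 for j in range(VOCAB_SIZE)] for i in ids]

def embed(onehot, weights):
    """Multiply the one-hot matrix by a VOCAB_SIZE x EMBED_DIM matrix."""
    dim = len(weights[0])
    return [
        [sum(row[v] * weights[v][d] for v in range(VOCAB_SIZE)) for d in range(dim)]
        for row in onehot
    ]

rng = random.Random(0)
# Random stand-in for the trained embedding layer parameter matrix.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]

ids = encode("example.com")
sparse = one_hot(ids)
dense = embed(sparse, weights)
```

Because each one-hot row contains a single 1, the matrix product simply selects one row of the parameter matrix per character, which is why frameworks usually implement this step as an index lookup.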

As shown in fig. 5, in the feature extraction layer, a two-layer BiLSTM structure first combines the feature dependencies of the forward and backward time-sequence relations of the domain name. A weight is then calculated for the BiLSTM output at each time step to generate a weight vector, and the weighted sum of all time-step vectors is taken as the attention distribution value, so that the "word-level" sample features of each iteration are strengthened into more accurate, better-generalizing "sentence-level" sample features. Specifically, the two-dimensional matrix of size 64 × 128 output by the embedding layer is turned by the feature extraction layer into a feature vector of length 48.
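
The attention step, a weight per time step followed by a weighted sum, can be sketched as below. The description does not state how the per-step weights are computed; this sketch assumes a dot product with a hypothetical learned `score_vector` followed by a softmax, which is one common choice and not necessarily the one used by the invention. The BiLSTM itself is omitted and its outputs are replaced by toy vectors.

```python
import math

def attention_pool(timestep_outputs, score_vector):
    """Softmax-weighted sum of per-time-step vectors.

    timestep_outputs: list of T vectors (stand-ins for BiLSTM outputs).
    score_vector: hypothetical learned vector; its dot product with each
    time-step output gives that step's unnormalized importance.
    """
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, score_vector))
              for h in timestep_outputs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(timestep_outputs[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, timestep_outputs))
              for d in range(dim)]
    return pooled, weights

# Three toy time steps of width 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(outputs, score_vector=[0.5, 0.5])
```

The pooled vector has the width of a single time-step output regardless of sequence length, which is what turns per-character features into one feature vector per domain name.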

As shown in fig. 6, in the above technical solution, preferably, the invention combines a twin-architecture loss function, comprehensively considers the relationship among the domain name feature vectors, various distance metric functions and the original character set of the domain name, and proposes a similarity measurement module Weighted-v & d, which not only measures the similarity between two inputs more comprehensively, but also corrects the result of the feature extraction layer to a certain extent and accelerates the convergence of the model. Specifically, the dimension values of the two input feature vectors are concatenated with their Manhattan distance, weighted Euclidean distance and cosine of the included angle, and with the Jaccard distance between the character sets of the original domain names; the resulting 100-dimensional vector is then mapped to a single numerical value by a fully connected network, which serves as the similarity measurement value of the two inputs.
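
The concatenation that feeds the fully connected network can be sketched as follows: two length-48 feature vectors plus four distance values give the 100-dimensional input. The function name and the default of uniform weights for the weighted Euclidean distance are assumptions (in the described module the weights would presumably be learned), and the fully connected mapping to a single value is omitted.

```python
import math

def similarity_features(vec_a, vec_b, domain_a, domain_b, weights=None):
    """Concatenate both feature vectors with four distance values."""
    # Assumption: uniform weights unless a weight vector is supplied.
    weights = weights or [1.0] * len(vec_a)
    manhattan = sum(abs(a - b) for a, b in zip(vec_a, vec_b))
    w_euclid = math.sqrt(sum(w * (a - b) ** 2
                             for w, a, b in zip(weights, vec_a, vec_b)))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    # Jaccard distance between the character sets of the original names.
    set_a, set_b = set(domain_a), set(domain_b)
    union = set_a | set_b
    jaccard = 1.0 - len(set_a & set_b) / len(union) if union else 0.0
    return list(vec_a) + list(vec_b) + [manhattan, w_euclid, cosine, jaccard]

# Toy length-48 feature vectors and toy domain strings.
features = similarity_features([0.1] * 48, [0.2] * 48, "abc", "abd")
```

Including the raw feature values alongside the distances lets the downstream network weigh individual dimensions, while the Jaccard term brings in information from the original characters that the feature vectors may have lost.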

As shown in fig. 7, in the above technical solution, preferably, the reference vector of each category of the training sample set is calculated in advance, before actual prediction is performed; during actual prediction, similarity is measured between these reference vectors and the feature vector of the domain name to be tested, which simplifies the twin-architecture prediction flow. Specifically, the distance values between each sample in a class training set and the other samples of the same class are calculated, and all distance values of each sample are statistically analyzed to obtain its mathematical expectation and standard deviation (the sample with the smallest expectation and standard deviation is considered the most suitable for supplying the reference vector of its class). After the expectation e and the standard deviation d are normalized with min-max normalization, the negative of their sum is used as the abstract score s, which evaluates how well the sample represents its class, and the feature vector of the sample with the highest abstract score is selected as the reference vector representing the features of that class.
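
The selection of a reference vector can be sketched as below: per-sample mean and standard deviation of the distances to same-class samples, min-max normalization of both, and the score s = -(e + d). The function names are hypothetical, and the Manhattan distance used in the demonstration is only a stand-in for the distance produced by the trained model.

```python
def min_max(values):
    """Min-max normalize a list to [0, 1]; constant lists map to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def select_reference(vectors, distance):
    """Return the index of the most representative sample and all scores."""
    n = len(vectors)
    means, stds = [], []
    for i in range(n):
        dists = [distance(vectors[i], vectors[j]) for j in range(n) if j != i]
        mean = sum(dists) / len(dists)
        var = sum((d - mean) ** 2 for d in dists) / len(dists)
        means.append(mean)
        stds.append(var ** 0.5)
    e_norm, d_norm = min_max(means), min_max(stds)
    # Abstract score: negative of the sum of normalized expectation and deviation.
    scores = [-(e + d) for e, d in zip(e_norm, d_norm)]
    best = max(range(n), key=scores.__getitem__)
    return best, scores

def manhattan(a, b):
    # Stand-in distance; the described system would use the trained model.
    return sum(abs(x - y) for x, y in zip(a, b))

# One central sample surrounded by four others.
cluster = [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]]
best, scores = select_reference(cluster, manhattan)
```

In this toy cluster the central sample is equally close to all others, so it has both the smallest mean distance and the smallest deviation and is chosen as the reference.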

As shown in fig. 8, in the above embodiment, when the model is applied online for domain name prediction, the domain name to be detected is input into the feature extraction network BLA to obtain its feature vector, the similarity measurement module Weighted-v & d then calculates the similarity value between the domain name to be detected and the precomputed reference vector of each known class, and the class with the minimum similarity value is selected as the class to which the domain name most likely belongs. In particular, if the minimum similarity value is still greater than the experimentally defined limit value of the unknown class, the invention considers that the domain name to be detected belongs to a new unknown class; otherwise the domain name is predicted as the class with the minimum similarity value. The whole multi-classification prediction process takes the domain name to be detected efficiently from raw input to final class prediction, greatly shortens the time the twin architecture needs for multi-classification prediction, improves multi-classification accuracy, and achieves effective identification of unknown classes.
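
The prediction rule described above, a minimum over the per-class similarity values followed by a comparison with the unknown-class limit, can be sketched as follows. The function name, the class labels, the limit value of 3.0 and the Manhattan distance standing in for Weighted-v & d are all hypothetical.

```python
def classify(feature, references, similarity, unknown_limit):
    """Return (label, value) for the closest class, or "unknown" if even the
    smallest similarity value exceeds the limit."""
    scores = {label: similarity(feature, ref) for label, ref in references.items()}
    best = min(scores, key=scores.get)
    if scores[best] > unknown_limit:
        return "unknown", scores[best]
    return best, scores[best]

def manhattan(a, b):
    # Stand-in for the trained Weighted-v & d similarity module.
    return sum(abs(x - y) for x, y in zip(a, b))

# Precomputed reference vectors for two hypothetical known classes.
references = {"normal": [0.0, 0.0], "family_a": [5.0, 5.0]}

near_normal = classify([0.2, 0.1], references, manhattan, unknown_limit=3.0)
near_family = classify([5.0, 4.0], references, manhattan, unknown_limit=3.0)
far_away = classify([20.0, 20.0], references, manhattan, unknown_limit=3.0)
```

Only one feature extraction and one similarity computation per known class are needed at prediction time, because the reference vectors are computed once in advance.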

II. Application embodiment. To demonstrate the inventiveness and technical value of the technical solution of the invention, this part gives application examples of the claimed technical solution on specific products or related technologies.

The computer device provided by the embodiment of the invention comprises a memory and a processor, wherein the memory stores a computer program, and the computer program causes the processor to execute the following steps when executed by the processor:

The computer-readable storage medium provided by the embodiment of the invention stores a computer program, and when the computer program is executed by a processor, the processor is caused to execute the following steps:

The information data processing terminal provided by the embodiment of the invention is used for realizing the DGA domain name detection system.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a disk, CD- or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating specific embodiments of the present invention and is not to be construed as limiting its scope of protection; all modifications, equivalents and improvements made within the spirit and principle of the invention are intended to fall within the scope of the invention as defined by the appended claims.

Claims

1. A DGA domain name detection method is characterized by comprising the following steps:

2. The DGA domain name detection method of claim 1, wherein the DGA domain name detection method comprises the steps of:

and step five, judging the domain name to be detected captured by the current network by using the multi-classification prediction algorithm, wherein the judgment result comprises whether the domain name is a DGA domain name, the DGA class to which the domain name belongs, and whether the domain name is an unknown domain name.

3. The DGA domain name detection method of claim 2, wherein in the first step, the ratio of the number of pairs formed between a sample and other samples of the same class to the total number of samples of the class is introduced as a pairing coefficient, and domain name pairing is performed so as to balance the training of same-class and different-class samples under a twin-architecture dual-input training mechanism and to overcome the extreme imbalance between DGA domain names and normal domain names.

4. The DGA domain name detection method of claim 2, wherein the twin-architecture-based learning model Sim-BLA in the second step is a two-way parallel weight-sharing structure whose left and right inputs feed two parallel, weight-sharing feature extraction networks, the model comprising a preprocessing layer, an embedding layer, a feature extraction layer and a similarity calculation layer;

and a network structure combining BiLSTM with an attention mechanism is used, the features of the forward and backward time sequences of the domain name are fused, and a weighted summation is carried out over the time-step outputs of the final BiLSTM layer to serve as the attention distribution value, so that the word-level sample features of the domain name are enhanced into sentence-level sample features.

5. The DGA domain name detection method of claim 2, wherein in the third step, a twin-architecture loss function is combined, the relationship among the domain name feature vectors, various distance metric functions and the original character set of the domain name is comprehensively considered, and a similarity measurement module Weighted-v & d is proposed; the dimension values of the two input feature vectors are concatenated with their Manhattan distance value, weighted Euclidean distance value and included-angle cosine value, and with the Jaccard distance value of the original domain name character sets, and the result of mapping the concatenated vector to a numerical value with a fully connected network is used as the similarity measurement value of the two inputs.

6. The DGA domain name detection method of claim 2, wherein in the fifth step, the feature extraction network BLA is used in advance to calculate, for each known class, a reference vector representing the features of that class, which takes part in the multi-class prediction process; during multi-class prediction, the similarity between the domain name to be detected and the reference vector of each class is calculated by the similarity measurement module Weighted-v & d, and the domain name to be detected is classified or judged according to the similarity and the boundary value of the unknown class.

7. A DGA domain name detection system for implementing the DGA domain name detection method of any one of claims 1 to 6, wherein the DGA domain name detection system comprises:

8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

10. An information data processing terminal characterized by being configured to implement the DGA domain name detection system according to claim 7.