CN118138382B

CN118138382B - Malicious domain name generation method, device, equipment and medium

Info

Publication number: CN118138382B
Application number: CN202410577323.0A
Authority: CN
Inventors: 董国忠; 张伟哲; 张宇; 黄树佳; 贾陆洋; 涂唯坚
Original assignee: Peng Cheng Laboratory
Current assignee: Peng Cheng Laboratory
Priority date: 2024-05-10
Filing date: 2024-05-10
Publication date: 2024-07-09
Anticipated expiration: 2044-05-10
Also published as: CN118138382A

Abstract

The embodiment of the disclosure provides a method, a device, equipment and a medium for generating a malicious domain name, belonging to the technical field of network security. The method comprises the following steps: acquiring an initial malicious domain name, and coding each character in the initial malicious domain name to obtain initial domain name characteristics; carrying out semantic extraction on the initial domain name characteristics based on a pre-constructed attention mechanism to obtain semantic characteristics; determining the distribution probability that the initial malicious domain name belongs to different Gaussian mixture categories according to a preset clustering center of the plurality of Gaussian mixture categories, and determining the target category to which the initial malicious domain name belongs from the plurality of Gaussian mixture categories according to the magnitude relation of each distribution probability; inputting the semantic features into a domain name generation network, extracting the clustering features of the target categories, and guiding the domain name generation network to generate the potential malicious domain name by using the clustering features. The embodiment of the disclosure can improve the quality of the mined potential malicious domain name.

Description

Malicious domain name generation method, device, equipment and medium

Technical Field

The disclosure relates to the technical field of network security, and in particular relates to a method, a device, equipment and a medium for generating a malicious domain name.

Background

The malicious domain name is the domain name of the malicious website, and detection of the malicious domain name has important significance for network security. The mining of malicious domain names is a means for mining more potential malicious domain names on the basis of the existing malicious domain names. Potential threats can be found in advance by mining more potential malicious domain names, so that defensive measures can be taken in time.

In the related art, the mining of malicious domain names often depends on domain name characteristics and historical domain name data. As a result, the mined potential malicious domain names have poor timeliness and weak relevance to the original malicious domain names, and domain names that are in fact benign are easily generated. The quality of the potential malicious domain names that are ultimately mined is therefore poor.

Disclosure of Invention

The main purpose of the disclosed embodiments is to provide a method, a device, equipment and a medium for generating a malicious domain name, which can improve the quality of the mined potential malicious domain name.

To achieve the above object, a first aspect of an embodiment of the present disclosure provides a method for generating a malicious domain name, including:

acquiring an initial malicious domain name, and coding each character in the initial malicious domain name to obtain initial domain name characteristics;

Semantic extraction is carried out on the initial domain name features based on a pre-constructed attention mechanism, so that semantic features are obtained;

Determining the distribution probability that the initial malicious domain name belongs to different Gaussian mixture categories according to a preset clustering center of a plurality of Gaussian mixture categories, and determining the target category to which the initial malicious domain name belongs from the plurality of Gaussian mixture categories according to the magnitude relation of the distribution probabilities;

inputting the semantic features into a domain name generation network, extracting cluster features of the target category, and guiding the domain name generation network to generate a potential malicious domain name by utilizing the cluster features;

The Gaussian mixture class is obtained by updating the initial Gaussian mixture class according to the updated Gaussian distribution parameters; the updated Gaussian distribution parameters are obtained by firstly determining a plurality of initial Gaussian mixture categories of a plurality of sample domain name features, then calculating posterior probability of each sample domain name feature belonging to each initial Gaussian mixture category, and updating the current Gaussian distribution parameters according to the posterior probability.

In some embodiments, the Gaussian mixture class is determined by:

Acquiring a plurality of sample domain name features, and performing Gaussian mixture clustering on the plurality of sample domain name features to obtain a plurality of initial Gaussian mixture categories;

Determining current Gaussian distribution parameters of all initial Gaussian mixture categories, and determining probability density functions of each sample domain name feature under all initial Gaussian mixture categories, wherein the Gaussian distribution parameters comprise mean vectors, covariance matrixes and mixing coefficients;

Calculating posterior probability of each sample domain name feature belonging to each initial Gaussian mixture category based on the current mean vector, the covariance matrix, the mixture coefficient and the probability density function;

Updating the current Gaussian distribution parameters according to the posterior probability, and updating the initial Gaussian mixture category according to the updated Gaussian distribution parameters to obtain the updated Gaussian mixture category.

In some embodiments, calculating the posterior probability that each sample domain name feature belongs to each initial Gaussian mixture class based on the current mean vector, the covariance matrix, the mixing coefficient and the probability density function comprises:

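Calculating the posterior probability $\gamma(z_{nk})$ that each sample domain name feature $x_n$ belongs to the $k$-th initial Gaussian mixture class according to the following formula:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \quad (1);$$

Wherein, in formula (1), $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ is the probability density function of the sample domain name feature $x_n$ under the $k$-th initial Gaussian mixture class, $\pi_k$ is the mixing coefficient of the current $k$-th initial Gaussian mixture class, $\mu_k$ is the mean vector of the current $k$-th initial Gaussian mixture class, $\Sigma_k$ is the covariance matrix of the current $k$-th initial Gaussian mixture class, and $K$ is the number of initial Gaussian mixture classes;

Updating the current Gaussian distribution parameters according to the posterior probability comprises the following steps:

The current Gaussian distribution parameters are updated according to the posterior probability $\gamma(z_{nk})$ by the following formulas:

$$\mu_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \quad (2);$$

$$\Sigma_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{T} \quad (3);$$

$$\pi_k^{\mathrm{new}} = \frac{N_k}{N} \quad (4);$$

Wherein, in formulas (2) to (4), $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ is the total (posterior-weighted) number of sample domain name features belonging to the $k$-th initial Gaussian mixture class, and $N$ is the total number of sample domain name features.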
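As an illustration only, the posterior probability calculation and the parameter update described above can be sketched in plain Python as one expectation step followed by one maximization step. The sketch assumes a diagonal covariance matrix stored as a list of per-dimension variances; the function names, the variance floor and the toy data are hypothetical and are not taken from the disclosure.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian whose covariance is diagonal (var holds the diagonal)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def e_step(xs, pis, means, variances):
    """Posterior probability gamma[n][k] of sample n under category k."""
    gammas = []
    for x in xs:
        weighted = [pi * diag_gaussian_pdf(x, m, v)
                    for pi, m, v in zip(pis, means, variances)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    return gammas

def m_step(xs, gammas, var_floor=1e-6):
    """Update mean, diagonal covariance and mixing coefficient from the posteriors."""
    n, dim, k_count = len(xs), len(xs[0]), len(gammas[0])
    pis, means, variances = [], [], []
    for k in range(k_count):
        n_k = sum(g[k] for g in gammas)  # posterior-weighted sample count
        mean = [sum(g[k] * x[d] for g, x in zip(gammas, xs)) / n_k
                for d in range(dim)]
        var = [max(sum(g[k] * (x[d] - mean[d]) ** 2
                       for g, x in zip(gammas, xs)) / n_k, var_floor)
               for d in range(dim)]  # floor avoids a zero variance (assumption)
        pis.append(n_k / n)
        means.append(mean)
        variances.append(var)
    return pis, means, variances

# Toy one-dimensional features forming two well separated groups.
xs = [[0.0], [0.2], [-0.2], [5.0], [5.2], [4.8]]
pis, means, variances = [0.5, 0.5], [[1.0], [4.0]], [[1.0], [1.0]]
for _ in range(20):
    gammas = e_step(xs, pis, means, variances)
    pis, means, variances = m_step(xs, gammas)
```

On this toy data the two mean vectors move toward the centers of the two groups and the mixing coefficients remain a valid probability distribution.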

In some embodiments, the obtaining a plurality of sample domain name features comprises:

Acquiring a plurality of sample malicious domain names;

Extracting first characters from the top-level domain name part and second characters from the second-level domain name part of each sample malicious domain name;

extracting first character features from the first characters, and extracting second character features from the second characters;

And combining the corresponding first character features and the corresponding second character features under each sample malicious domain name to obtain a plurality of sample domain name features.

In some embodiments, the extracting the first character feature from the first character and the extracting the second character feature from the second character includes:

Performing one-hot encoding on the first characters to obtain the first character features;

Counting the substrings of a preset length in the second characters, and encoding the second characters according to the number of occurrences of each substring across the plurality of sample malicious domain names, to obtain the second character features.
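A minimal sketch of this feature extraction, under stated assumptions, is given below. It assumes that the top-level part is the last label of the domain name, that the preset substring length is 2, and that a substring present in a domain name is encoded by its occurrence count over all samples; the function names and the sample domain names are hypothetical.

```python
from collections import Counter

def split_domain(domain):
    """Return (second-level part, top-level part); assumes a single-label top-level part."""
    labels = domain.lower().strip(".").split(".")
    return labels[-2], labels[-1]

def substrings(text, n):
    """All contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_statistics(domains, n=2):
    """Collect the top-level parts and count substrings over all samples."""
    tlds = sorted({split_domain(d)[1] for d in domains})
    counts = Counter()
    for d in domains:
        counts.update(substrings(split_domain(d)[0], n))
    return tlds, counts

def encode_sample_feature(domain, tlds, counts, n=2):
    """One-hot top-level feature concatenated with the substring count feature."""
    sld, tld = split_domain(domain)
    first = [1 if tld == t else 0 for t in tlds]
    present = set(substrings(sld, n))
    second = [counts[g] if g in present else 0 for g in sorted(counts)]
    return first + second

samples = ["abab.com", "abc.net"]  # hypothetical sample domain names
tlds, counts = build_statistics(samples)
feature = encode_sample_feature("abc.net", tlds, counts)
```

Here the substring "ab" occurs three times across the samples, so it contributes the value 3 to the feature of every domain name that contains it.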

In some embodiments, performing Gaussian mixture clustering on the plurality of sample domain name features to obtain a plurality of initial Gaussian mixture categories includes:

determining initial category numbers, and carrying out Gaussian mixture clustering on a plurality of sample domain name features under the category numbers to obtain a plurality of corresponding sub-Gaussian mixture categories;

Calculating error values between the sample domain name features belonging to each sub-Gaussian mixture category and the corresponding clustering centers;

gradually increasing the category number, carrying out Gaussian mixture clustering again under the corresponding category number, and calculating the error value of each sub-Gaussian mixture category under the corresponding category number;

Determining a minimum target error value from the error values under different category numbers, and determining the category number corresponding to the target error value as a target category number;

And carrying out Gaussian mixture clustering on the plurality of sample domain name features under the number of the target categories to obtain a plurality of initial Gaussian mixture categories.
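The search over category numbers can be sketched as follows. The disclosure does not define the error value or the upper bound of the search, so this sketch assumes a sum of squared distances to the assigned clustering center and a caller-supplied maximum; `chunk_fit` is a deliberately trivial stand-in for Gaussian mixture clustering, used only so that the example is self-contained, and all names are hypothetical.

```python
def category_error(features, labels, centers):
    """Sum of squared distances between each feature and its clustering center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[k]))
        for x, k in zip(features, labels)
    )

def select_category_count(features, fit, k_start, k_max):
    """Cluster under k_start..k_max categories and keep the count with the smallest error."""
    errors = {}
    for k in range(k_start, k_max + 1):
        centers, labels = fit(features, k)
        errors[k] = category_error(features, labels, centers)
    best_k = min(errors, key=errors.get)
    return best_k, errors

def chunk_fit(features, k):
    """Toy stand-in for clustering: split the features, in input order, into k groups."""
    n, dim = len(features), len(features[0])
    labels = [min(i * k // n, k - 1) for i in range(n)]
    centers = []
    for c in range(k):
        members = [x for x, label in zip(features, labels) if label == c]
        centers.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return centers, labels

features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
best_k, errors = select_category_count(features, chunk_fit, k_start=1, k_max=3)
```

With a real clustering routine the error usually keeps shrinking as the category number grows, so the bound on the search range matters; how that bound is chosen is an assumption of this sketch.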

In some embodiments, the type of covariance matrix is determined by:

Based on the Gaussian distribution parameters calculated by different covariance matrix types, carrying out Gaussian mixture clustering on a plurality of sample domain name features to obtain a plurality of test Gaussian mixture categories;

and determining the clustering effect of the corresponding test Gaussian mixture category under different covariance matrix types, and determining the covariance matrix as a diagonal covariance matrix according to the clustering effect.

In some embodiments, the semantic extraction of the initial domain name feature based on the pre-constructed attention mechanism, to obtain a semantic feature, includes:

inputting the initial domain name characteristics into a multi-layer attention network which is connected in sequence;

In each layer of the attention network, current input data is bidirectionally encoded through an attention mechanism to obtain a corresponding attention weight matrix, and a hidden state is obtained and output based on the attention weight matrix and the current input data;

and obtaining semantic features through the output of the attention network of the last layer.
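For illustration, a stack of attention layers of this kind can be sketched with unmasked scaled dot-product self-attention, in which every position attends to positions on both sides. This is a simplification and not the network of the disclosure: the learned query, key and value projections are omitted (queries, keys and values are all the layer input), and the function names and input vectors are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """One unmasked attention layer: returns (attention weight matrix, hidden states)."""
    d = len(xs[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs])
        for q in xs
    ]
    hidden = [
        [sum(w * v[j] for w, v in zip(row, xs)) for j in range(d)]
        for row in weights
    ]
    return weights, hidden

def extract_semantics(xs, num_layers=2):
    """Feed the output of each layer into the next and return the last layer's output."""
    hidden = xs
    for _ in range(num_layers):
        _, hidden = self_attention(hidden)
    return hidden

# Hypothetical embedded domain name characters (three positions, two dimensions).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, hidden = self_attention(xs)
semantic = extract_semantics(xs, num_layers=3)
```

Each row of the attention weight matrix sums to 1, and the output keeps the same sequence length and feature dimension as the input.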

In some embodiments, the inputting the semantic feature into a domain name generation network, extracting a cluster feature of the target class, and guiding the domain name generation network to generate a potentially malicious domain name by using the cluster feature includes:

Extracting clustering features of the target category;

Inputting the initial domain name features, the cluster features and the semantic features into a domain name generation network, and performing diffusion processing on the initial domain name features by using an attention mechanism constructed based on the cluster features and the semantic features to generate a potential malicious domain name.

In some embodiments, after the step of using the clustering feature to guide the domain name generation network to generate the potentially malicious domain name, the method further includes:

acquiring a domain name to be detected;

and carrying out malicious detection on the domain name to be detected based on the potential malicious domain name to obtain a corresponding domain name detection result.

To achieve the above object, a second aspect of an embodiment of the present disclosure provides a malicious domain name generating device, including:

The coding module is used for acquiring an initial malicious domain name, and coding each character in the initial malicious domain name to obtain initial domain name characteristics;

The semantic extraction module is used for carrying out semantic extraction on the initial domain name characteristics based on a pre-constructed attention mechanism to obtain semantic characteristics;

The clustering module is used for determining the distribution probability that the initial malicious domain name belongs to different Gaussian mixture categories according to the preset clustering centers of the Gaussian mixture categories, and determining the target category to which the initial malicious domain name belongs from the Gaussian mixture categories according to the magnitude relation of the distribution probabilities;

The domain name generation module is used for inputting the semantic features into a domain name generation network, extracting cluster features of the target category, and guiding the domain name generation network to generate a potential malicious domain name by utilizing the cluster features.

To achieve the above object, a third aspect of the embodiments of the present disclosure provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor implements the method for generating a malicious domain name according to the embodiment of the first aspect when executing the computer program.

To achieve the above object, a fourth aspect of the embodiments of the present disclosure proposes a storage medium, which is a computer-readable storage medium storing a computer program, which when executed by a processor, implements the method for generating a malicious domain name according to the embodiment of the first aspect.

The method, device, equipment and medium for generating a malicious domain name provided by the embodiments of the disclosure can be applied in a malicious domain name generating device. When the method is executed, each character in the initial malicious domain name is first encoded to obtain initial domain name features, and semantic extraction is performed on the initial domain name features based on a pre-constructed attention mechanism to obtain semantic features, which can significantly improve the accuracy and efficiency of semantic extraction and strengthen the ability to identify and respond to malicious domain names. Then, according to preset clustering centers of a plurality of Gaussian mixture categories, the distribution probabilities that the initial malicious domain name belongs to the different Gaussian mixture categories are determined, and the target category to which the initial malicious domain name belongs is determined from the plurality of Gaussian mixture categories according to the magnitude relation of the distribution probabilities. The semantic features are input into a domain name generation network, and the clustering features of the target category are extracted. The Gaussian mixture categories are obtained by updating initial Gaussian mixture categories according to updated Gaussian distribution parameters; the updated Gaussian distribution parameters are obtained by first determining a plurality of initial Gaussian mixture categories of a plurality of sample domain name features, then calculating the posterior probability that each sample domain name feature belongs to each initial Gaussian mixture category, and updating the current Gaussian distribution parameters according to the posterior probability. Therefore, each Gaussian mixture category can accurately cluster similar domain names, and the clustering features of the target category express the data characteristics of its Gaussian distribution well. The clustering features are used to guide the domain name generation network to generate similar potential malicious domain names within the range indicated by the target category, so the resulting potential malicious domain names are strongly related to the initial malicious domain name. Because the potential malicious domain names are generated from the currently input initial malicious domain name, without relying on historical data, timeliness is better and the quality of the finally mined potential malicious domain names is higher.

Drawings

Fig. 1 is a flowchart of a method for generating a malicious domain name according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a Gaussian mixture category determination process provided by an embodiment of the present disclosure;

Fig. 3 is a flowchart of the sub-steps included in step S201 in Fig. 2;

Fig. 4 is a flowchart of the sub-steps included in step S303 in Fig. 3;

Fig. 5 is another flowchart of the sub-steps included in step S201 in Fig. 2;

Fig. 6 is a flowchart of a covariance matrix type determination process provided by an embodiment of the present disclosure;

Fig. 7 is a flowchart of the sub-steps included in step S102 in Fig. 1;

Fig. 8 is a schematic diagram of an encoder provided by an embodiment of the present disclosure;

Fig. 9 is a flowchart of the sub-steps included in step S104 in Fig. 1;

Fig. 10 is a schematic diagram of a domain name generation network training process provided by an embodiment of the present disclosure;

Fig. 11 is a flowchart of the steps further included after step S104 in Fig. 1;

Fig. 12 is a functional block diagram of a malicious domain name generating device according to an embodiment of the present disclosure;

Fig. 13 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the present disclosure only and is not intended to be limiting of the present disclosure.

First, several terms used in this disclosure are explained:

Artificial intelligence (Artificial Intelligence, AI): a new technical science that researches and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.

The Gaussian mixture model (Gaussian Mixture Model, GMM) is a clustering method based on probability density functions. It assumes that the data are generated by a mixture of multiple Gaussian distributions, with each cluster corresponding to one Gaussian component. The goal of a GMM is to estimate, by maximizing the likelihood function, the model parameters (the mean, variance and mixing coefficient of each Gaussian distribution) and the probability that each data point belongs to each cluster.

In the related art, the mining of malicious domain names often depends on domain name characteristics and historical domain name data. As a result, the mined potential malicious domain names have poor timeliness, newly generated unknown malicious domain names are not handled quickly enough, the detection range is too small to cover and mine a large number of potential malicious domain names, the generated potential malicious domain names have weak relevance to the original malicious domain names, and domain names that are in fact benign are easily generated. The quality of the potential malicious domain names that are ultimately mined is therefore poor.

In addition, the related art distinguishes legitimate domain names from malicious domain names using rules based on the character features of domain names. However, detection performs poorly on domain names formed by concatenating English words, and new families of malicious website domain names continue to emerge, which leads to insufficient training data and a low recognition rate.

Based on this, the embodiment of the disclosure provides a method, a device, equipment and a medium for generating a malicious domain name, which can improve the quality of the mined potential malicious domain name.

The method for generating the malicious domain name in the embodiment of the disclosure may be illustrated by the following embodiment.

Embodiments of the present disclosure may acquire and process related data based on artificial intelligence techniques.

The malicious domain name generation method provided by the embodiments of the disclosure can be applied to a terminal, to a server, or to software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook computer, desktop computer, etc.; the server may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application or the like that implements the method for generating a malicious domain name, but is not limited to the above forms.

The disclosure is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The disclosure may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In the various embodiments of the present disclosure, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first. For example, when acquiring information related to a domain name, permission or consent of the user is acquired first. Moreover, the collection, use, processing, etc. of such data would comply with relevant laws and regulations. In addition, when the embodiment of the present disclosure needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is obtained through a popup window or a jump to a confirmation page, and after the independent permission or independent consent of the user is explicitly obtained, the necessary user-related data for enabling the embodiment of the present disclosure to function normally is acquired.

Fig. 1 is an optional flowchart of a method for generating a malicious domain name according to an embodiment of the present disclosure, where the method in fig. 1 may include, but is not limited to, steps S101 to S104.

Step S101, acquiring an initial malicious domain name, and coding each character in the initial malicious domain name to obtain the characteristics of the initial domain name;

Step S102, carrying out semantic extraction on the initial domain name characteristics based on a pre-constructed attention mechanism to obtain semantic characteristics;

step S103, determining the distribution probability that the initial malicious domain name belongs to different Gaussian mixture categories according to the preset clustering centers of the plurality of Gaussian mixture categories, and determining the target category to which the initial malicious domain name belongs from the plurality of Gaussian mixture categories according to the magnitude relation of the distribution probabilities;

Step S104, inputting semantic features into a domain name generation network, extracting clustering features of target categories, and guiding the domain name generation network to generate potential malicious domain names by using the clustering features.

For the above step S101, the initial malicious domain name is the domain name of a malicious website, that is, a website that exhibits malicious behavior, violates laws and regulations, and affects network security; the domain name of such a website is therefore defined as malicious. The initial malicious domain name is the currently input data used to mine potential malicious domain names. Further, the initial malicious domain name may come from an initial domain name dataset containing a plurality of malicious domain names, each of which may be regarded as a data point; at least one of these malicious domain names may be selected as the initial malicious domain name for mining potential malicious domain names, i.e., the initial malicious domain name may be one data point in the dataset. Thus, embodiments of the present disclosure may also mine based on a plurality of initial malicious domain names, which is not specifically limited here.

It should be noted that embodiments of the present disclosure may collect data from relevant network traffic using packet capture technologies such as the Data Plane Development Kit (DPDK). Embodiments of the disclosure may focus on traffic carrying website information in certain specific harmful fields and construct, through in-depth analysis, a dataset focused on that traffic. During data acquisition, all relevant regulations and ethical guidelines need to be complied with, so that user privacy and data security are ensured. The resulting dataset is directly associated with the relevant network activities and is therefore directly applicable in practice; this highlights the importance of fine-grained monitoring of network traffic and provides a reliable data basis for in-depth network security research and the identification of potential threats.

After the initial malicious domain name is obtained, each character in it needs to be encoded to obtain the initial domain name features. Illustratively, embodiments of the present disclosure first build a character encoding table from the dataset containing the initial malicious domain names; the table contains all top-level domain names that appear, the single characters obtained by splitting the lower-level domain names (e.g., second-level domain names), and a null character, and each entry in the table is numbered. Given the specificity and inseparability of the top-level domain names of most malicious website domain names, the top-level domain name is treated as a whole when encoded. Then, the characters in each domain name are mapped one-to-one to the numbers in the table, yielding domain name vectors of different initial lengths. To unify the length of all domain name vectors, the null character may be used as padding, so that every vector has the same length and consists of character codes. Encoding the domain names in the dataset in this way unifies the features, retains key information, simplifies processing, facilitates comparison and matching, reduces data sparsity and is easy to extend. Finally, encoding the current initial malicious domain name yields the initial domain name features.
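The encoding described above can be sketched as follows. The sketch assumes that the null character receives number 0, that the top-level domain name is the last label and is stored in the table with its leading dot, and that the target length is at least as long as the longest domain name; the function names and sample domain names are hypothetical.

```python
PAD = ""  # null character used for padding; numbered 0 in this sketch

def tokenize(domain):
    """Split a domain name into single characters plus the top-level domain name as one token."""
    labels = domain.lower().strip(".").split(".")
    rest, tld = ".".join(labels[:-1]), labels[-1]
    return list(rest) + ["." + tld]

def build_table(domains):
    """Number the null character, every character and every top-level domain name."""
    table = {PAD: 0}
    for domain in domains:
        for token in tokenize(domain):
            table.setdefault(token, len(table))
    return table

def encode_domain(domain, table, length):
    """Map tokens to their numbers and pad with the null character up to `length`."""
    ids = [table[token] for token in tokenize(domain)]  # KeyError on unseen tokens
    return ids + [table[PAD]] * (length - len(ids))

dataset = ["ab.com", "abc.net"]  # hypothetical domain names
table = build_table(dataset)
vectors = [encode_domain(d, table, length=5) for d in dataset]
```

Both vectors end up with the same length, and each top-level domain name occupies a single position instead of being split into characters.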

For the above step S102, the attention mechanism is an important concept in deep learning, which mimics the attention mechanism in the human visual system. In processing large amounts of information, the human visual system may selectively focus on certain important information while ignoring other information. Also, in deep learning, the attention mechanism enables the model to automatically learn and focus on the most task-related portions of the input data. In the context of malicious domain name detection, the initial domain name features are obtained by encoding each character in the domain name, and although the features contain basic information of the domain name, the semantic meaning or potential maliciousness of the domain name may not be directly reflected. Thus, there is a need to extract more meaningful semantic features through further processing.

Based on this, in the embodiment of the disclosure, semantic extraction is performed based on a pre-constructed attention mechanism, and essentially, a weight allocation policy is learned based on the initial domain name characteristics, and this policy may allocate different weights to different characters or character combinations according to their importance in the domain name, for example, the important characters or character combinations may get higher weights, so as to play a greater role in subsequent processing. The finally obtained semantic features not only contain basic information of the domain name, but also integrate semantic relations among characters or character combinations and contribution degrees of the characters or character combinations to the maliciousness of the domain name, provide richer and meaningful information for subsequent processing and analysis, and are beneficial to improving the identification and coping capacity of the malicious domain name.

Further, the attention mechanism may calculate weights through one or more neural network layers and apply these weights to the initial domain name features, and embodiments of the present disclosure are not particularly limited. In this way, it is possible to learn which characters or character combinations are more relevant to the malicious domain name, thereby extracting more representative semantic features.
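As a rough illustration of how attention weights can be computed and applied to character features, the following sketch implements scaled dot-product self-attention in plain Python. It assumes that queries, keys and values are the raw feature vectors themselves; a trained network would apply learned projections first, which are omitted here. All names are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of attention scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Return (weights, outputs) of scaled dot-product self-attention.

    Assumption: no learned projections; each feature acts as query, key and value.
    """
    dim = len(features[0])
    weights = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights.append(softmax(scores))
    # Each output is the weighted sum of all feature vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, features)) for j in range(dim)]
        for row in weights
    ]
    return weights, outputs


char_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(char_features)
```

Each row of `weights` sums to 1, and positions whose features are more similar to the query receive a larger weight.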

For the above step S103, in the embodiment of the present disclosure, the preset plurality of gaussian mixture categories are different clusters of malicious domain name features. The clusters may be preset based on historical data or expert knowledge, or may be obtained by clustering different malicious domain names in the dataset, so that each cluster center represents a specific malicious behavior pattern or feature set.

When an initial malicious domain name is given, the feature vector obtained by encoding and semantic extraction is input into a Gaussian mixture model, which calculates the probability that the feature vector belongs to each Gaussian mixture category. For example, the final distribution probability may be obtained by calculating the distance (e.g., the Mahalanobis distance) or the likelihood of the feature vector with respect to each Gaussian distribution and then combining the result with the weight of each distribution; the specific manner of calculating the distribution probability is not limited herein.

Based on the magnitude of these distribution probabilities, it is then possible to determine the gaussian mixture class, i.e. the target class, to which the initial malicious domain name most likely belongs. Therefore, the embodiment of the disclosure can cluster malicious domain names with similar characteristics or behavior patterns together, and extract representative clustering characteristics for each category, wherein the clustering characteristics not only help to understand the commonalities and differences of the malicious domain names, but also provide guidance for subsequent generation of potential malicious domain names.
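One possible way to compute the distribution probabilities and pick the target category is sketched below. It assumes diagonal covariance matrices (so each category is described by a mean vector and a vector of variances) and uses hypothetical parameter values; the disclosure does not fix a particular calculation.

```python
import math
from typing import List, Sequence


def diag_log_pdf(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))  # squared Mahalanobis distance
    log_det = sum(math.log(v) for v in var)
    return -0.5 * (len(x) * math.log(2 * math.pi) + log_det + maha)


def class_probabilities(x, weights, means, variances) -> List[float]:
    """Probability that x belongs to each Gaussian mixture category."""
    logs = [
        math.log(w) + diag_log_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical parameters of two categories, for illustration only.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

probs = class_probabilities([4.5, 5.2], weights, means, variances)
target_class = max(range(len(probs)), key=probs.__getitem__)
```

The target category is simply the index with the largest distribution probability.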

It should be noted that the Gaussian mixture model assumes that all data points are generated from a mixture of a finite number of Gaussian (i.e., normal) distributions. In the embodiments of the present disclosure, the Gaussian mixture categories are different clusters of domain name features, each representing a specific malicious behavior pattern or feature set, so the clustering needs to be refined gradually through iterative optimization.

In the process of improving the clustering effect, some initial Gaussian mixture categories are preset firstly based on the existing sample domain name characteristics. Then, for each sample domain name feature, the probability that it belongs to each initial gaussian mixture class needs to be calculated, these probabilities being called posterior probabilities, reflecting the likelihood that the sample feature belongs to a certain class. Based on the posterior probability, parameters of each gaussian distribution, including mean vector, covariance matrix, etc., may be updated, for example, the mean of each distribution may be updated to be a weighted average of the sample features belonging to the distribution, and the covariance matrix may be adjusted accordingly according to the distribution of the sample features. With the update of the Gaussian distribution parameters, the original Gaussian mixture category is changed. Some of the otherwise similar categories may merge, while some of the otherwise more disparate categories may differentiate further. Therefore, after multiple iterations, the Gaussian mixture categories gradually tend to be stable, and each category can more accurately reflect the distribution and clustering condition of the domain name characteristics.

Through the above process, the Gaussian mixture categories are constructed and updated. The model not only considers the statistical distribution of the domain name features but also gradually improves the accuracy and efficiency of clustering through iterative optimization. During clustering, Gaussian mixture clustering computes, for each data point, the probability of belonging to every cluster and assigns the point to the cluster with the highest probability, rather than hard-assigning it to a single cluster as K-Means does. Guiding the domain name generation network with the Gaussian mixture categories therefore makes it possible to generate new domain names that are similar to the initial malicious domain name and potentially malicious, which improves the accuracy and efficiency of malicious domain name detection.

For the above step S104, the domain name generation network is a deep learning model that learns the rules and patterns of domain name generation through training; during this process, the network attempts to generate new domain names from the input semantic features. The semantic features are therefore input into the domain name generation network. In addition, the embodiments of the present disclosure also extract the clustering features of the target category, which are obtained based on the Gaussian mixture model and represent the common features and patterns of the domain names in the target category.

By using the clustering feature, the domain name generation network can be guided to generate in the range indicated by the target category. In particular, the clustering feature may be used as a condition or constraint for a domain name generation network to ensure that the generated domain name matches the characteristics of the target class, such that the network is able to generate a new domain name that is similar to the original malicious domain name and potentially malicious from the clustering feature of the target class. Because the potential malicious domain names are generated based on the initial malicious domain names which are input currently and guided by utilizing the clustering characteristics of the target categories, the potential malicious domain names have stronger relevance to the initial malicious domain names. Meanwhile, due to the fact that deep learning and cluster analysis technology are combined, the generation mode does not depend on historical data, and is generated in real time according to the currently input characteristics, so that the method has better timeliness, and the accuracy and the efficiency of malicious domain name generation can be improved.

In summary, through steps S101 to S104, the embodiments of the present disclosure encode each character in the initial malicious domain name to obtain the initial domain name features and perform semantic extraction on those features based on a pre-constructed attention mechanism to obtain the semantic features, which can improve the accuracy and efficiency of semantic extraction and enhance the ability to identify and respond to malicious domain names. Then, according to the preset clustering centers of a plurality of Gaussian mixture categories, the distribution probabilities that the initial malicious domain name belongs to the different Gaussian mixture categories are determined, and the target category to which the initial malicious domain name belongs is determined from the plurality of Gaussian mixture categories according to the magnitude relation of the distribution probabilities. The semantic features are input into the domain name generation network, and the clustering features of the target category are extracted. The Gaussian mixture categories are obtained by updating initial Gaussian mixture categories according to updated Gaussian distribution parameters; the updated parameters are obtained by first determining a plurality of initial Gaussian mixture categories for a plurality of sample domain name features, then calculating the posterior probability that each sample domain name feature belongs to each initial Gaussian mixture category, and updating the current Gaussian distribution parameters according to the posterior probabilities. Each Gaussian mixture category can therefore cluster similar domain names accurately, and the clustering features of the target category express the data characteristics of its Gaussian distribution well.
The clustering features are used to guide the domain name generation network to generate similar potential malicious domain names within the range indicated by the target category. The resulting potential malicious domain names are strongly related to the initial malicious domain name and, because they are generated from the currently input initial malicious domain name without relying on historical data, they are more timely, so the quality of the potential malicious domain names finally mined is higher.

Referring to fig. 2, in some embodiments, the gaussian mixture class is determined by the following steps, which may include steps S201 to S204:

Step S201, obtaining a plurality of sample domain name features, and carrying out Gaussian mixture clustering on the plurality of sample domain name features to obtain a plurality of initial Gaussian mixture categories;

Step S202, determining current Gaussian distribution parameters of each initial Gaussian mixture category, and determining probability density functions of domain name features of each sample under each initial Gaussian mixture category, wherein the Gaussian distribution parameters comprise mean vectors, covariance matrixes and mixing coefficients;

Step S203, calculating posterior probability of each sample domain name feature belonging to each initial Gaussian mixture category based on the current mean vector, covariance matrix, mixture coefficient and probability density function;

Step S204, updating the current Gaussian distribution parameters according to the posterior probability, and updating the initial Gaussian mixture category according to the updated Gaussian distribution parameters to obtain the updated Gaussian mixture category.

Steps S201 to S204 are the key steps by which the Gaussian mixture model determines and updates the Gaussian mixture categories. Their aim is to obtain, by clustering the sample domain name features, a Gaussian mixture model that accurately reflects the data distribution, for use in the subsequent malicious domain name generation process.

Specifically, in the embodiment of the present disclosure, a plurality of sample domain name features are collected first, where the plurality of sample domain name features may be used as data points of a cluster, and the sample domain name features may be obtained by encoding other domain names in an initial domain name data set where an initial malicious domain name is located. Then, the plurality of sample domain name features are subjected to Gaussian mixture clustering, and similar sample domain name features can be gathered together to form an initial Gaussian mixture class.

The gaussian distribution parameters then include a mean vector, a covariance matrix, and a mixing coefficient. The mean vector represents the center position of the gaussian distribution, the covariance matrix describes the degree of dispersion of the data, and the mixing coefficient determines the weight of each gaussian distribution in the mixing model. Thus, embodiments of the present disclosure need to determine these parameters for each initial gaussian mixture class and calculate a probability density function for each sample domain name feature under these classes that describes the probability that the sample feature belongs to a certain gaussian distribution.

The posterior probability of each sample domain name feature belonging to the respective initial gaussian mixture class is then calculated based on the current mean vector, covariance matrix, mixture coefficients, and probability density function. The posterior probability refers to the probability that a certain sample feature belongs to a certain Gaussian mixture class under the condition that the sample feature and other information are known, and the posterior probability that the sample feature belongs to each initial Gaussian mixture class is obtained by calculating the probability density of the sample feature under each Gaussian distribution and combining the mixing coefficients.

And finally, updating the current Gaussian distribution parameters according to the posterior probability, and updating the initial Gaussian mixture category according to the updated Gaussian distribution parameters to obtain the updated Gaussian mixture category. In this step, the gaussian distribution parameters are updated with the calculated posterior probability. In particular, the mean vector, covariance matrix, and mixing coefficients are updated by maximizing the likelihood function of the data or minimizing some loss function. As the parameters are updated, the initial gaussian mixture class will also change, gradually approaching the true data distribution. After repeated iterative updating, the obtained Gaussian mixture category can reflect the clustering condition of the sample domain name characteristics more accurately.

In summary, through this series of steps, the gaussian mixture class is determined and updated, providing a basis for subsequent applications in malicious domain name generation. The Gaussian mixture model can capture complex distribution of domain name characteristics, and accuracy and efficiency of malicious domain name detection and generation are improved.

Further, unlike the K-means algorithm, which assigns each data point to its nearest cluster center, the GMM algorithm discovers clusters according to the probability density of the data points; the number of Gaussian distributions is still selected beforehand, as described for steps S501 to S505 below. It assumes that the data is a mixture of multiple Gaussian distributions and uses the Expectation-Maximization (EM) algorithm to estimate the parameters of these distributions. The specific flow is as follows:

Initialization: an initial set of Gaussian distribution parameters, including a mean vector (μ) and a covariance matrix (Σ), is first selected at random. Then the probability that each data point belongs to each distribution is calculated using the probability density function (PDF) of the Gaussian distribution, which in some embodiments may serve as the distribution probability in the above embodiments. The formula is as follows:

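$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\mathsf{T}}\,\Sigma^{-1}(x-\mu)\right)$$ (5);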

Where x in equation (5) represents the feature vector of a data point, such as a sample domain name feature, μ represents the cluster center, Σ is the covariance matrix of the Gaussian mixture category, d is the dimension of the data point, and |Σ| is the determinant of the covariance matrix. The probability of each data point under each distribution, calculated in this way, is used in the subsequent expectation step.
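The density of equation (5) can be evaluated as in the following minimal sketch, which assumes the standard multivariate Gaussian density and a symmetric positive-definite covariance matrix. The Cholesky factorization is an implementation choice made here to obtain the determinant and the quadratic form without an explicit matrix inverse; the function names are illustrative.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Lower-triangular factor L with a = L L^T (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gaussian_pdf(x: Sequence[float], mean: Sequence[float], cov: Matrix) -> float:
    """Multivariate Gaussian density with mean vector `mean` and covariance `cov`."""
    d = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution: solve low @ y = diff, so that y.y = diff^T cov^{-1} diff.
    y: List[float] = []
    for i in range(d):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((diff[i] - s) / low[i][i])
    quad = sum(v * v for v in y)
    sqrt_det = math.prod(low[i][i] for i in range(d))  # equals |cov| ** 0.5
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (d / 2) * sqrt_det)


density = gaussian_pdf([1.0, 0.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```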

Expectation step (E-step): in some embodiments, calculating the posterior probability that each sample domain name feature belongs to the respective initial Gaussian mixture category based on the current mean vector, covariance matrix, mixing coefficient, and probability density function comprises:

The posterior probability γ(z_nk) that each sample domain name feature x_n belongs to the k-th initial Gaussian mixture category is calculated according to the following formula:

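$$\gamma(z_{nk}) = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(x_n \mid \mu_j, \Sigma_j)}$$ (1);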

Where N(x_n∣μ_k, Σ_k) in equation (1) is the probability density function of the sample domain name feature x_n under the k-th initial Gaussian mixture category, π_k is the current mixing coefficient of the k-th initial Gaussian mixture category, μ_k is its current mean vector, Σ_k is its current covariance matrix, and the denominator is the sum, over all Gaussian distributions, of the mixing-coefficient-weighted probability densities of the data point. Calculating the posterior probabilities γ(z_nk) yields a matrix whose rows correspond to data points and whose columns correspond to Gaussian distributions; each element is the posterior probability that the data point belongs to the corresponding Gaussian distribution.
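The expectation step can be sketched as follows. For brevity the sketch assumes diagonal covariance matrices, represented as vectors of per-dimension variances, and uses hypothetical parameter values; it returns the matrix of posterior probabilities with one row per data point and one column per Gaussian distribution.

```python
import math
from typing import List, Sequence


def diag_pdf(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Gaussian density with diagonal covariance (variances on the diagonal)."""
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    norm = (2 * math.pi) ** (len(x) / 2) * math.sqrt(math.prod(var))
    return math.exp(-0.5 * quad) / norm


def e_step(samples, pis, means, variances) -> List[List[float]]:
    """Posterior probability of every sample under every category."""
    gamma = []
    for x in samples:
        # Numerator terms: mixing coefficient times density, one per category.
        weighted = [pi * diag_pdf(x, m, v) for pi, m, v in zip(pis, means, variances)]
        total = sum(weighted)  # denominator: sum over all categories
        gamma.append([w / total for w in weighted])
    return gamma


# Hypothetical samples and parameters, for illustration only.
samples = [[0.1, -0.2], [4.8, 5.1], [2.5, 2.5]]
gamma = e_step(
    samples,
    pis=[0.5, 0.5],
    means=[[0.0, 0.0], [5.0, 5.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```

A point lying exactly midway between two otherwise identical categories, such as the third sample here, receives a posterior probability of 0.5 for each.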

Maximization step (M-step): the Gaussian distribution parameters are updated according to the posterior probabilities calculated in the expectation step. Specifically, the updated parameters include the mean μ_k, the covariance matrix Σ_k, and the mixing coefficient π_k.

In some embodiments, updating the current gaussian distribution parameter according to the posterior probability comprises:

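$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$ (2);

$$\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\mathsf{T}}$$ (3);

$$\pi_k = \frac{N_k}{N}$$ (4);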

Where N_k in equations (2) through (4) is the effective number of sample domain name features belonging to the k-th initial Gaussian mixture category, i.e., the sum of the posterior probabilities γ(z_nk) over all sample domain name features, and N is the total number of sample domain name features. Each Gaussian distribution parameter is updated as an average over the data points weighted by the posterior probabilities γ(z_nk), so as to maximize the log-likelihood function.
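A minimal sketch of the maximization step is given below. It assumes the standard EM updates for a Gaussian mixture and, for brevity, keeps only the diagonal of each covariance matrix; the input posterior matrix `gamma` is hypothetical and uses hard 0/1 values so the result is easy to check by hand.

```python
from typing import List, Tuple


def m_step(
    samples: List[List[float]], gamma: List[List[float]]
) -> Tuple[List[float], List[List[float]], List[List[float]]]:
    """Update mixing coefficients, means and diagonal variances from posteriors."""
    n = len(samples)
    dim = len(samples[0])
    pis, means, variances = [], [], []
    for k in range(len(gamma[0])):
        n_k = sum(row[k] for row in gamma)  # effective number of samples in category k
        mean = [
            sum(g[k] * x[j] for g, x in zip(gamma, samples)) / n_k
            for j in range(dim)
        ]
        var = [
            sum(g[k] * (x[j] - mean[j]) ** 2 for g, x in zip(gamma, samples)) / n_k
            for j in range(dim)
        ]
        pis.append(n_k / n)
        means.append(mean)
        variances.append(var)
    return pis, means, variances


samples = [[0.0, 0.0], [2.0, 2.0], [10.0, 4.0], [12.0, 6.0]]
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pis, means, variances = m_step(samples, gamma)
```

With soft posteriors the same code applies unchanged; every sample then contributes to every category in proportion to its posterior probability.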

In the expectation-maximization process, the expectation step calculates the probability of each data point under each distribution and normalizes these values to obtain the posterior probability that the data point belongs to each distribution. The maximization step updates the model parameters, including the mean vector, covariance matrix, and mixing coefficient of each distribution, according to the posterior probabilities calculated in the expectation step, so as to maximize the log-likelihood function. The expectation and maximization steps are repeated until the parameters converge or the maximum number of iterations is reached, which yields the final parameter estimates of the model and hence the clustering of the data. Because this method assigns data points to each category with a certain probability, it is a soft clustering method, with the advantage of handling data that is ambiguous or belongs to multiple categories.

In addition, other improvements to the algorithm flow are possible. For example, distributed computing and parallelized algorithms can be introduced to make full use of computing resources and accelerate clustering. This is particularly effective for large-scale datasets, allowing the model to analyze and process large amounts of malicious domain name data more quickly and supporting its clustering task more efficiently. The robustness and accuracy of the clustering algorithm can also be improved by strengthening the model's handling of outliers: conventional parameter estimation methods are sensitive to outliers, which may cause the model parameters to deviate from the actual situation. By adopting a more robust parameter estimation method, M-estimation, the embodiments of the present disclosure can identify and handle outliers more effectively, so that the model is more robust across data conditions and better suited to complex real-world data scenarios, improving practicability and reliability.

Referring to fig. 3, in some embodiments, the step S201 of obtaining a plurality of sample domain name features may include steps S301 to S304:

Step S301, obtaining a plurality of sample malicious domain names;

Step S302, extracting a first character of a top-level domain name part and a second character of a secondary domain name part in the sample malicious domain name aiming at each sample malicious domain name;

Step S303, extracting first character features from the first characters and extracting second character features from the second characters;

Step S304, corresponding first character features and second character features under each sample malicious domain name are combined to obtain a plurality of sample domain name features.

The domain name data features are optimized in the embodiments of the present disclosure. In the process of obtaining the sample domain name features, i.e. before clustering, an improved initialization method is adopted, which is more suitable for processing domain name data.

In the above steps S301 to S304, a domain name may include a top-level domain name part and a secondary domain name part; the secondary domain name part may consist of the second-level domain name alone or of the second-level domain name together with further lower-level domain names. The top-level domain name part generally represents the category or organizational nature of the domain name, while the secondary domain name part reflects a specific website name or function. The embodiments of the present disclosure acquire a plurality of sample malicious domain names for targeted feature extraction: for each sample malicious domain name, the first character of the top-level domain name part and the second character of the secondary domain name part are extracted. Extracting features from the two parts separately better captures the semantic information of the different parts of the domain name, so that malicious domain names can be identified more accurately.

After the characters related to the top-level domain name and the secondary domain name are extracted, a characterization process is required. Specifically, the extraction process may convert the character into a numerical value, encode, calculate statistical information (e.g., frequency, length, etc.) of the character, or perform other forms of feature extraction, where the obtained features are used in subsequent processing to capture potential malicious patterns in the domain name.

Finally, combining corresponding first character features and second character features under each sample malicious domain name to obtain a plurality of sample domain name features, wherein the combination process can be used for fusing the first character features and the second character features, and distributing corresponding feature weights for the first character features and the second character features in the fusion process, and weighting and fusing the first character features and the second character features based on the distributed feature weights to obtain the sample domain name features. Thus, after the characteristics of each sample malicious domain name are finally extracted, the corresponding sample domain name characteristics can be obtained, so that the malicious domain name can be accurately identified later.

Referring to fig. 4, in some embodiments, the extracting the first character feature from the first character and the extracting the second character feature from the second character in step S303 may include steps S401 to S402:

Step S401, performing one-hot encoding on the first character to obtain a first character feature;

Step S402, counting substrings meeting the preset length in the second character, and encoding the second character according to the number of occurrences of each substring across the plurality of sample malicious domain names to obtain a second character feature.

In the above steps S401 to S402, for the feature representation of the top-level domain name part, the embodiments of the present disclosure adopt one-hot encoding and encode the first character extracted from each top-level domain name part into a binary vector, thereby obtaining the first character feature. For example, the vector position corresponding to the top-level domain name that appears is set to 1, all other positions are set to 0, and the length of the vector equals the total number of distinct top-level domain names.

For the extraction of the secondary domain name part features, an n-gram feature representation is adopted. This method considers all substrings of the second character that satisfy the preset length n, counts the occurrences of each length-n substring, removes the low-frequency substrings, and keeps only the remaining substrings to construct a substring list. A vector whose length equals that of the substring list is then created, in which the component at each position is the number of occurrences of the corresponding substring; this forms the n-gram feature vector of the secondary domain name part, namely the second character feature.

Finally, the extracted top-level domain name feature vector and the second-level domain name feature vector are combined to form a complete domain name feature vector, namely the sample domain name feature.
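The feature construction of steps S401 to S402 might look like the following sketch. It assumes that the top-level domain is the text after the last dot, that low-frequency substrings are those occurring fewer than `min_count` times across all samples, and that the two feature vectors are combined by simple concatenation; these choices, the parameter names and the toy domains are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def split_domain(domain: str) -> Tuple[str, str]:
    """Return (secondary part, top-level domain), splitting at the last dot."""
    body, _, tld = domain.lower().rpartition(".")
    return body, tld


def ngrams(text: str, n: int) -> List[str]:
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocab(domains: List[str], n: int, min_count: int) -> Tuple[List[str], List[str]]:
    """Collect the distinct TLDs and the n-grams that are not low-frequency."""
    tlds = sorted({split_domain(d)[1] for d in domains})
    counts = Counter(g for d in domains for g in ngrams(split_domain(d)[0], n))
    grams = sorted(g for g, c in counts.items() if c >= min_count)
    return tlds, grams


def featurize(domain: str, tlds: List[str], grams: List[str], n: int) -> List[int]:
    """One-hot TLD feature concatenated with n-gram occurrence counts."""
    body, tld = split_domain(domain)
    one_hot = [1 if t == tld else 0 for t in tlds]
    counts = Counter(ngrams(body, n))
    return one_hot + [counts[g] for g in grams]


domains = ["abab.com", "abcd.net", "xbab.com"]  # toy sample domains
tlds, grams = build_vocab(domains, n=2, min_count=2)
features = [featurize(d, tlds, grams, n=2) for d in domains]
```

In this toy run only the substrings `ab` and `ba` survive the frequency filter, so each sample domain name feature has two one-hot positions followed by two count positions.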

Referring to fig. 5, in some embodiments, the step S201 of performing gaussian mixture clustering on the plurality of sample domain name features to obtain a plurality of initial gaussian mixture categories may include steps S501 to S505:

Step S501, determining initial category number, and carrying out Gaussian mixture clustering on a plurality of sample domain name features under the category number to obtain a plurality of corresponding sub-Gaussian mixture categories;

Step S502, calculating error values between the domain name features of each sample belonging to each sub Gaussian mixture category and the corresponding clustering center;

Step S503, gradually increasing the category number, and carrying out Gaussian mixture clustering under the corresponding category number again, and calculating the error value of each sub Gaussian mixture category under the corresponding category number;

Step S504, determining the minimum target error value from the error values under different category numbers, and determining the category number corresponding to the target error value as the target category number;

Step S505, performing Gaussian mixture clustering on the plurality of sample domain name features under the number of target categories to obtain a plurality of initial Gaussian mixture categories.

In the above steps S501 to S505, in order to secure the effect of gaussian mixture clustering, the number of gaussian distributions needs to be determined first. Firstly, determining the initial category number, for example, the number is 1 or 2, and carrying out Gaussian mixture clustering on the domain name features of a plurality of samples under the category number to obtain a plurality of corresponding sub-Gaussian mixture categories, wherein the process can also set the upper limit of the category number.

Then, calculating the error value between each sample domain name feature of each sub-Gaussian mixture category and the corresponding clustering center, namely calculating an evaluation index of the clustering effect, such as the sum of squares of the error, for the clustering result under each category number, and taking the evaluation index as the error value to represent the sum of squares of the distance between each data point and the clustering center to which each data point belongs. And then gradually increasing the category number, and carrying out Gaussian mixture clustering under the corresponding category number again, and calculating the error value of each sub-Gaussian mixture category under the corresponding category number.

After the error values under different category numbers are obtained, a curve of the error value against the category number can be drawn, with the category number on the X axis and the corresponding error value on the Y axis. On this curve, the point at which the decline of the error value slows markedly is taken as the target point: to the left of the target point, the error value falls rapidly as the category number increases; to the right, it falls much more slowly as the category number increases further. In other words, the improvement in clustering effect first drops rapidly and then levels off as the number of clusters grows, and the turning point is the optimal number of clusters. The category number corresponding to the target point is therefore the optimal number of Gaussian distributions; the error value at this point is taken as the target error value, and the corresponding category number is determined as the target category number.
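The selection of the target category number can be sketched as follows. The error value is computed as a sum of squared distances to the assigned cluster center, and the turning point is located with a largest-second-difference heuristic, which is one common way to find the elbow and is an assumption here rather than the method fixed by the disclosure. The error values in the example are made up for illustration.

```python
from typing import List, Sequence


def sum_squared_error(
    points: Sequence[Sequence[float]],
    labels: Sequence[int],
    centers: Sequence[Sequence[float]],
) -> float:
    """Sum of squared distances between each point and its assigned cluster center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


def elbow_point(category_counts: List[int], errors: List[float]) -> int:
    """Category count at which the error curve bends most sharply."""
    if len(errors) < 3:
        raise ValueError("need at least three category counts to locate an elbow")
    best_index = max(
        range(1, len(errors) - 1),
        # drop before the point minus drop after the point
        key=lambda i: (errors[i - 1] - errors[i]) - (errors[i] - errors[i + 1]),
    )
    return category_counts[best_index]


# Illustrative (made-up) error values for category numbers 1 to 6.
category_counts = [1, 2, 3, 4, 5, 6]
errors = [1000.0, 520.0, 180.0, 150.0, 135.0, 128.0]
target_count = elbow_point(category_counts, errors)
```

In practice each entry of `errors` would come from running Gaussian mixture clustering with the corresponding category number and evaluating `sum_squared_error` on the result.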

Referring to fig. 6, in some embodiments, the type of covariance matrix is determined by the following steps, which may include steps S601 to S602:

Step S601, gaussian mixture clustering is carried out on a plurality of sample domain name features based on Gaussian distribution parameters calculated by different covariance matrix types, and a plurality of test Gaussian mixture categories are obtained;

Step S602, determining clustering effects of corresponding test Gaussian mixture categories under different covariance matrix types, and determining the covariance matrix as a diagonal covariance matrix according to the clustering effects.

In the above steps S601 to S602, the embodiments of the present disclosure need to tune the type of covariance matrix in the gaussian mixture clustering algorithm, and try different types of covariance matrices, including diagonal covariance matrices, to better reflect the relationship between data, in consideration of the characteristics of domain name data.

Specifically, embodiments of the present disclosure attempt to use different covariance matrix types for gaussian mixture clustering. Covariance matrices are mathematical tools describing the correlation between multidimensional random variables, which determine the shape and direction of each gaussian distribution in a gaussian mixture model. Different covariance matrix types include diagonal covariance matrix, full covariance matrix, etc. The diagonal covariance matrix means that the features are independent of each other, while the full covariance matrix considers the correlation between the features. Therefore, based on each covariance matrix type, corresponding gaussian distribution parameters (including mean vectors, covariance matrices and mixing coefficients) are calculated, and then, a plurality of sample domain name features are subjected to gaussian mixture clustering by using the parameters, so that a plurality of test gaussian mixture categories are obtained, wherein each test gaussian mixture category is a clustering result obtained based on a specific covariance matrix type.

Subsequently, the embodiments of the present disclosure evaluate the clustering effect of the test Gaussian mixture categories obtained under the different covariance matrix types, for example by means of clustering evaluation indexes such as the silhouette coefficient or the Calinski-Harabasz index. These indexes help quantify the quality of the clustering results. By comparing the clustering effects under the different covariance matrix types, the best-performing type can be selected. In practical application, the diagonal covariance matrix showed a good clustering effect in testing, taking the characteristics of domain name data into account and reflecting the relationships between data well. The diagonal covariance matrix is therefore selected as the final covariance matrix type.
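As an illustration of how candidate clustering results could be compared, the sketch below computes the mean silhouette coefficient in plain Python and selects the candidate with the higher score. The two label assignments are hypothetical stand-ins for the results obtained under two covariance matrix types; they are not outputs of the disclosed method.

```python
import math
from typing import Dict, List, Sequence


def silhouette(points: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient; requires at least two clusters."""
    n = len(points)
    clusters = set(labels)
    scores = []
    for i in range(n):
        own = [
            math.dist(points[i], points[j])
            for j in range(n)
            if j != i and labels[j] == labels[i]
        ]
        if not own:
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(points[i], points[j]) for j in range(n) if labels[j] == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
# Hypothetical cluster assignments from two candidate covariance matrix types.
candidates: Dict[str, List[int]] = {
    "candidate_1": [0, 0, 1, 1],
    "candidate_2": [0, 1, 0, 1],
}
scores = {name: silhouette(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
```

A silhouette coefficient close to 1 indicates compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.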

After the covariance matrix type is determined, the covariance matrix type can be used for further adjusting and optimizing the Gaussian mixture model, so that the identification and coping capacity of malicious domain names are improved, and the network security is effectively protected.

Referring to fig. 7, in some embodiments, the semantic extraction of the initial domain name feature based on the pre-constructed attention mechanism in the step S102 to obtain the semantic feature may include steps S701 to S703:

step S701, inputting the initial domain name characteristics into a multi-layer attention network which is connected in sequence;

step S702, in each layer of attention network, current input data is bidirectionally encoded through an attention mechanism to obtain a corresponding attention weight matrix, and a hidden state is obtained and output based on the attention weight matrix and the current input data;

in step S703, the semantic features are obtained through the output of the last layer of attention network.

In the above steps S701 to S703, the attention mechanism constructed in the embodiments of the present disclosure includes Bidirectional Encoder Representations from Transformers (BERT), a bidirectional encoder based on the Transformer, and the BERT includes a multi-layer attention network connected in sequence. The initial domain name features are re-encoded using BERT to extract key features of the domain name. As shown in fig. 8, each BERT encoder (Trm) includes a multi-head self-attention mechanism and a feedforward neural network, each followed by a residual connection and a layer normalization module; the input of each layer of attention network is EN, and the output of each layer of attention network is TN. The residual connection helps gradients propagate, thereby alleviating the vanishing gradient problem, and the layer normalization helps reduce internal covariate shift and speeds up training.

After the initial domain name features are input into the multi-layer attention network connected in sequence, in each layer of attention network, the current input data is bidirectionally encoded through the attention mechanism to obtain a corresponding attention weight matrix, and a hidden state is obtained and output based on the attention weight matrix and the current input data. Specifically, the input is first passed through the BERT model, which includes a multi-layer Transformer encoder. Through the self-attention mechanism, the BERT model attends to different parts of the domain name sequence, allowing the encoder to take the context of the entire sequence into account when encoding a particular character. Then, the semantic features are obtained from the output (such as TN) of the last layer of attention network, that is, the output of the last layer of the BERT model; the output at the last time step is selected as the feature vector of the domain name.
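
As a non-limiting illustration, the per-layer computation described above (attention weight matrix, hidden state, residual connection and layer normalization) can be sketched as follows. This is a deliberately simplified single-head version under stated assumptions: the queries, keys and values are taken to be the input itself, with no learned projections, no multi-head split and no feedforward sub-layer, so it is not the BERT implementation itself. The token vectors are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def self_attention_layer(x):
    """x: list of token vectors. Returns (hidden_states, attention_weights)."""
    d = len(x[0])
    # Every position attends to every position, left and right (bidirectional).
    weights = [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x])
        for q in x
    ]
    hidden = []
    for i, w in enumerate(weights):
        context = [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        residual = [xi + ci for xi, ci in zip(x[i], context)]  # residual connection
        hidden.append(layer_norm(residual))  # layer normalization
    return hidden, weights


def encode(x, num_layers=2):
    """Stack the layers in sequence; the last layer's output is the result."""
    weights = None
    for _ in range(num_layers):
        x, weights = self_attention_layer(x)
    return x, weights


# Hypothetical encoded characters of a domain name (3 positions, 4 dimensions).
tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1], [0.5, 0.5, 0.0, 1.0]]
hidden_states, attention = encode(tokens)
domain_vector = hidden_states[-1]  # output at the last time step
```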

Thus, embodiments of the present disclosure maintain semantic consistency with domain name generation networks while obtaining a global understanding and abstract high-level feature representation of the entire domain name sequence. In this way, more meaningful prior knowledge is provided for the domain name generation network, so that semantic consistency in the generation process is facilitated, and the quality of domain name generation is improved.

Referring to fig. 9, in some embodiments, the step S104 of inputting semantic features into the domain name generating network and extracting cluster features of the target class, and guiding the domain name generating network to generate the potentially malicious domain name by using the cluster features may include steps S801 to S802:

step S801, extracting clustering features of target categories;

Step S802, inputting the initial domain name features, the cluster features and the semantic features into a domain name generation network, and performing diffusion processing on the initial domain name features by using an attention mechanism constructed based on the cluster features and the semantic features to generate the potential malicious domain name.

In the above steps S801 to S802, the domain name generation network may be obtained by pre-training. After training is completed, the domain name generation network receives the initial domain name features, the cluster features and the semantic features as input, and performs diffusion processing on the initial domain name features using an attention mechanism constructed based on the cluster features and the semantic features. In this manner, the domain name generation network combines the cluster features and the semantic features to generate the potentially malicious domain name, so that the obtained potentially malicious domain name has a stronger association with the initial malicious domain name.

It should be noted that the domain name generation network may be a diffusion model; diffusion models can also be used for text generation tasks. The diffusion model is constructed with a corresponding multi-layer structure, and an attention mechanism is constructed based on the clustering features and the semantic features. When generating the potential malicious domain name, this attention mechanism helps the model focus on the clustering features and the semantic features and assign more attention, or weight, to the more important parts. The initial domain name features can thus be subjected to diffusion processing based on the attention mechanism, so that the model finally outputs the potential malicious domain name.

The domain name generation network may also be SeqGAN, as shown in fig. 10, which includes a generator capable of receiving the BERT-encoded feature vector and a discriminator. It may be obtained in advance through adversarial training, with the trained generator being used to generate potentially malicious domain names. Specifically:

The domain name generation network comprises a generator capable of receiving the BERT-encoded feature vector and a discriminator, wherein the generator comprises an attention mechanism and an LSTM layer. In the task of generating sequence data (e.g., text), the attention mechanism allows the domain name generation network to weight the different parts of the input sequence as each element is generated, thereby using the information of the input sequence more efficiently. The mechanism helps the domain name generation network better understand the long-term dependencies and the global structure of the input sequence, thereby improving the accuracy and fluency of generation. In the domain name generation network, the attention mechanism helps the generator better capture the dependencies and semantic relevance between characters, thereby generating more accurate, similar and diversified domain names. The Long Short-Term Memory (LSTM) layer is a variant of the Recurrent Neural Network (RNN) specifically designed to process sequence data and to better capture long-term dependencies in the sequence. Compared with a conventional RNN, the LSTM incorporates three gating units, namely the input gate, the forget gate and the output gate, together with a memory cell; these cooperate to enable the network to better save and update information when processing long sequences. In the text generation task, the LSTM layer can better capture long-distance dependencies in the text and mitigates the vanishing or exploding gradient problem of the traditional RNN, so it is widely applied to text generation. In the present domain name generation network, the LSTM layer helps the generator better model the long-term dependencies of domain names, thereby generating domain names with more semantic consistency and diversity.

In addition, the domain name generation network introduces a self-attention mechanism, which better captures the dependencies and semantic relevance between characters, strengthens the modeling of long-range dependencies, and improves global context awareness. Through a multi-head attention mechanism, the expressive capability and flexibility of the domain name generation network are further improved, so that the generator can consider longer context information when generating a domain name, thereby generating more accurate, similar and diversified domain names.
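
As a non-limiting illustration of how the three gates and the memory cell cooperate, the following sketch implements one step of a scalar LSTM cell using the standard textbook equations. The parameter names and values are hypothetical and chosen only to show the behavior: the forget gate is held nearly open and the input gate nearly closed, so the memory cell retains its content across many steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds hypothetical parameters."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # memory cell: keep old content, add new content
    h = o * math.tanh(c)     # hidden state exposed to the next layer
    return h, c


# Forget gate biased open (+10), input gate biased closed (-10).
params = {
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_o": 1.0, "u_o": 0.0, "b_o": 0.0,
    "w_g": 1.0, "u_g": 0.0, "b_g": 0.0,
}
h, c = 0.0, 0.8
for x in [0.3, -0.7, 0.9, 0.1] * 5:  # 20 arbitrary inputs
    h, c = lstm_step(x, h, c, params)
```

After the 20 steps the cell state stays close to its initial value of 0.8, which is the mechanism by which the LSTM layer preserves information over long sequences.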

To ensure the accuracy of the discriminator, the domain name generation network employs a two-layer convolutional neural network and adds an attention mechanism. After the generator generates a potential malicious domain name, the domain name is converted into a domain name vector, which is input to the discriminator. After passing through the embedding layer, the domain name vector data passes in sequence through the convolution pooling combination layer, the end convolution layer, the two-layer highway network layer and the fully connected layer, and the classification result of the domain name is finally output. The end convolution layer contains the context layer of the attention layer. The specific network architecture is described in detail below.

The generator is designed around an LSTM layer and consists of three main parts, where the LSTM layer introduces a self-attention mechanism. The inputs to the domain name generation network include the semantic features, that is, the output extracted from the last self-attention layer after BERT encoding, and a pseudorandom seed. The pseudorandom seed is an initial random vector used to introduce randomness, thereby increasing the diversity of the generated text. In the domain name generation network, the role of the pseudorandom seed is to ensure that the network produces a different output at each generation rather than relying strictly on the input data. The task of the generator is to generate a text sequence similar to a real domain name, and its output dimension matches the size of the domain name character list. The main objective of the generator is to generate domain name text with a certain variability by introducing pseudo-randomness while preserving the semantic features of the input domain name. Throughout the generation process, the pseudorandom seed plays a key role by introducing initial uncertainty and variation that guide the generation process. This makes it harder for the discriminator to distinguish real domain name vectors from generated ones, thereby improving the robustness and adversarial capability of the generator.

The main task of the generator is to generate a sequence as similar as possible to the real original malicious domain name vectors, so that it is difficult for the discriminator to tell whether an input domain name vector is real or generated. In this process, the generator makes full use of the information obtained from the clusters to guide the generation of the potential malicious domain names, ensuring that the generated potential malicious domain names are similar, to a certain extent, to the real domain names in the specific cluster. First, the generator starts with a pseudo-random initial character number, which is converted by the embedding layer into a meaningful embedding vector. This step introduces semantic information for the generated domain name sequence so that the generated domain name has a certain semantic consistency. The embedding vector provides a good starting point from which the generator gradually adjusts its parameters during training to generate more reasonable domain names. Then, the embedding vector is concatenated with the semantic features obtained by BERT encoding to form a feature sequence of fixed size. These features include not only structural and grammatical information but also higher-level semantic information, providing a richer context for the generator. The Transformer encoding plays a key role in handling long-distance dependencies: it effectively captures complex associations between characters in the domain name and improves the overall consistency of the generated domain name sequence. In the LSTM layer, the cell structure of the long short-term memory network allows the network to better capture long-term dependencies in the sequence.

Through the gating mechanism, the LSTM network can selectively memorize or forget information when processing an input sequence, which helps the generator better understand and maintain semantic and structural information in the domain name sequence. The fully connected layer then receives the output of the LSTM layer, which contains predictive information for the next character number, and converts it into a vector whose size equals that of the domain name alphabet. This design accommodates the dynamic process of domain name generation. By applying the LogSoftmax function, the output is converted into a logarithmic probability distribution, ultimately yielding domain name vectors with a high degree of interpretability. The whole generation process is refined through iterations of continuous learning and parameter adjustment, so that the generator gradually improves the quality of the generated domain names and simulates the structure and characteristics of real domain names more accurately. This iterative loop is critical to the generator's self-improvement and aims to better fool the discriminator by generating more realistic potentially malicious domain names.
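
As a non-limiting illustration of the final output step, the following sketch applies a log-softmax to a vector of scores with one entry per character of the alphabet and then samples the next character from the resulting distribution. The alphabet, the random scores standing in for the fully connected layer output, and the seed values are assumptions made for this example only.

```python
import math
import random

# Hypothetical domain name alphabet (letters, digits and hyphen).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def log_softmax(logits):
    """Numerically stable log-softmax: x_k - log(sum_j exp(x_j))."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sample_char(logits, rng):
    """Sample the next character from the distribution implied by the logits."""
    weights = [math.exp(lp) for lp in log_softmax(logits)]
    return rng.choices(ALPHABET, weights=weights, k=1)[0]


# Stand-in for the fully connected layer output: one score per character.
score_rng = random.Random(42)
logits = [score_rng.uniform(-2.0, 2.0) for _ in ALPHABET]

log_probs = log_softmax(logits)
# The pseudorandom seed makes the sampled character reproducible.
next_char = sample_char(logits, random.Random(7))
```

Using the same seed reproduces the same character, while a different seed can yield a different character from the same scores, which mirrors the role of the pseudorandom seed described above.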

The task of the discriminator is to judge whether the input domain name vector is real or generated. Its structure comprises an embedding layer, a convolution pooling combination layer (consisting of 30 2×2 filters and 15 3×3 filters), an end convolution layer, a custom attention module, a two-layer highway network layer and a fully connected layer. The combination of these layers enables the discriminator to efficiently extract features from the input domain name vector and make a real-or-fake determination. First, a batch of domain name vectors is passed through the embedding layer to obtain an embedded tensor whose shape is batch size × maximum domain name length × convolution layer dimension. Then, the embedded tensor passes through the convolution pooling combination layer, which extracts the character features of the domain name and outputs the corresponding feature tensor.

To enhance the discriminator's ability to extract input features, the domain name generation network adds an end convolution layer after the convolution pooling combination layer. An attention module is implemented through a custom attention_3d_block function: the feature tensor, which originally has no hidden state, is converted into a hidden state tensor carrying sequence information, and the hidden state at the last moment in the hidden state tensor is extracted through a custom lambda function. Then, the dot function performs a dot product operation on the attention score vectors over the specific dimensions [2,1] to obtain a tensor representing the attention weights. In the subsequent Activation module, the attention weights are normalized so that the discriminator focuses on the key features of the domain name vector: the domain name generation network converts all attention weights into a probability distribution through the softmax function, so that each weight lies in [0,1] and the weights sum to 1, which expresses the distribution of importance over the input features. Finally, over the specific dimensions [1,1], a dot product operation is performed again through the dot function, computing a weighted sum of the attention weights and the input hidden states; this generates a context vector that integrates the importance of the different positions and provides the discriminator with a more comprehensive semantic understanding.
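
As a non-limiting illustration, one common reading of such an attention block can be sketched as follows: each hidden state is scored against the hidden state at the last moment, the scores are normalized with softmax, and the context vector is the weighted sum of the hidden states. This is an assumption made for the example, not the disclosed attention_3d_block itself; in particular, any learned projection applied before scoring is omitted, and the hidden state values are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def last_state_attention(hidden_states):
    """hidden_states: list of T vectors. Returns (weights, context_vector)."""
    h_last = hidden_states[-1]  # hidden state at the last moment
    # First dot product: score every position against the last hidden state.
    scores = [sum(a * b for a, b in zip(h, h_last)) for h in hidden_states]
    # Normalize so that the weights lie in [0, 1] and sum to 1.
    weights = softmax(scores)
    # Second dot product: weighted sum of the hidden states.
    dim = len(h_last)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context


# Hypothetical hidden state tensor for one domain name (3 positions, 3 dims).
hidden_states = [[0.1, 0.0, 0.2], [0.9, 0.8, 0.1], [1.0, 0.9, 0.0]]
weights, context = last_state_attention(hidden_states)
```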

To address the vanishing or exploding gradient problems that may result from the increased depth of the domain name generation network, two highway network layers are introduced. These highway network layers adaptively select how much information is passed and along which path, thereby avoiding the adverse impact of gradient problems on performance. Finally, the probability of the domain name vector is output through the fully connected layer. When the probability is greater than a preset threshold (such as 0.5), the input domain name vector is judged to be the vector of a potential malicious domain name; when the probability is smaller than the preset threshold, the potential malicious domain name generated by the generator is judged to be unsatisfactory.
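
As a non-limiting illustration, a highway layer mixes a transformed signal H(x) with the unchanged input x through a transform gate T(x), that is, y = T(x)·H(x) + (1 − T(x))·x. The following sketch uses per-dimension scalar weights for brevity, whereas a practical layer would use weight matrices; the parameter values and the helper names are hypothetical. The threshold helper follows the rule stated above, and treating a probability exactly equal to the threshold as not accepted is an assumption, since the text does not specify that case.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Elementwise highway layer: y = T * H(x) + (1 - T) * x."""
    out = []
    for xi, wh, bh, wt, bt in zip(x, w_h, b_h, w_t, b_t):
        h = math.tanh(wh * xi + bh)   # transformed signal H(x)
        t = sigmoid(wt * xi + bt)     # transform gate T(x) in (0, 1)
        out.append(t * h + (1.0 - t) * xi)
    return out


def is_accepted(probability, threshold=0.5):
    """True when the probability is greater than the preset threshold."""
    return probability > threshold


x = [0.5, -1.2, 2.0]
ones, zeros = [1.0] * 3, [0.0] * 3
# Gate closed: the layer carries the input through almost unchanged.
carried = highway_layer(x, ones, zeros, zeros, [-20.0] * 3)
# Gate open: the layer outputs the transformed signal.
transformed = highway_layer(x, ones, zeros, zeros, [20.0] * 3)
```

The carry path is what lets gradients flow through deep stacks: when the gate is closed the layer behaves like an identity mapping.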

Referring to fig. 11, in some embodiments, after the step S104 of guiding the domain name generation network to generate the potentially malicious domain name by using the clustering feature, the method may further include steps S901 to S902:

Step S901, acquiring a domain name to be detected;

Step S902, carrying out malicious detection on the domain name to be detected based on the potential malicious domain name to obtain a corresponding domain name detection result.

In the above steps S901 to S902, the domain name to be detected may be a suspected malicious domain name submitted by the user, captured in the network traffic monitoring, or detected by other security systems. Since the security of the domain name to be detected is not clear, the security detection of the domain name to be detected is required in the follow-up.

After the potential malicious domain names are obtained in the above embodiment, there may be multiple potential malicious domain names, because they are generated based on the initial malicious domain names in the data set, that is, from any data point in the data set. The potential malicious domain names can be applied in several ways. For example, they may be used directly to perform malicious detection on the domain name to be detected: the domain name to be detected is compared with the potential malicious domain names to check whether similarities or common features exist, and a corresponding domain name detection result is obtained. If the domain name to be detected is highly similar to the potential malicious domain names in structure, character combination, semantics and the like, the domain name to be detected is likely to be a malicious domain name. Alternatively, a machine learning or deep learning model can be used to classify the domain name to be detected and judge whether it is a malicious domain name. Specifically, a malicious domain name recognition model is trained on the generated plurality of potential malicious domain names; the trained recognition model can accurately recognize malicious behavior in domain names, so that a corresponding domain name detection result can be accurately obtained after the domain name to be detected is input into the recognition model.
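
As a non-limiting illustration of the direct comparison described above, the following sketch scores a domain name to be detected against a list of generated names using the Jaccard similarity of character bigrams. The similarity measure, the 0.5 threshold and the example domain names are all assumptions made for this example; the disclosure does not prescribe a particular similarity measure.

```python
def char_ngrams(domain, n=2):
    """Set of character n-grams of a lowercased domain name."""
    s = domain.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two domain names."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def detect(candidate, generated, threshold=0.5):
    """Compare a candidate with generated names and return a detection result."""
    if not generated:
        raise ValueError("no generated domain names to compare against")
    closest = max(generated, key=lambda g: jaccard(candidate, g))
    score = jaccard(candidate, closest)
    return {
        "domain": candidate,
        "closest": closest,
        "score": score,
        "malicious": score >= threshold,
    }


# Made-up strings standing in for generated potential malicious domain names.
generated = ["xkqzvbn.com", "qwpzxrt.net"]
suspicious = detect("xkqzvbm.com", generated)
benign = detect("example.org", generated)
```

The returned dictionary carries both a binary decision and a score, so the same result can feed either a simple allow/deny rule or a later, more detailed security analysis.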

Further, the domain name detection result may be a binary classification result, i.e., whether the domain name is malicious or not; a multi-class result, i.e., which type of malicious behavior the domain name belongs to; or a result containing more detailed scoring or probability information for reference in subsequent security analysis or response measures. Therefore, the embodiment of the disclosure applies the generated potential malicious domain names to actual malicious domain name detection tasks, which strengthens network security protection and allows potential network threats to be discovered and dealt with in time.

Referring to fig. 12, an embodiment of the present disclosure further provides a malicious domain name generating device, which may implement the method for generating a malicious domain name, where the malicious domain name generating device includes:

the coding module 1201 is configured to obtain an initial malicious domain name, and code each character in the initial malicious domain name to obtain an initial domain name feature;

The semantic extraction module 1202 is configured to perform semantic extraction on the initial domain name feature based on a pre-constructed attention mechanism, so as to obtain a semantic feature;

The clustering module 1203 is configured to determine, according to preset clustering centers of the plurality of gaussian mixture categories, distribution probabilities that the initial malicious domain name belongs to different gaussian mixture categories, and determine, according to a magnitude relation of each distribution probability, a target category to which the initial malicious domain name belongs from the plurality of gaussian mixture categories;

The domain name generation module 1204 is used for inputting semantic features into a domain name generation network, extracting clustering features of target categories, and guiding the domain name generation network to generate potential malicious domain names by using the clustering features.

In summary, by executing the malicious domain name generation method, the malicious domain name generating device encodes each character in the initial malicious domain name to obtain the initial domain name features, and performs semantic extraction on the initial domain name features based on a pre-constructed attention mechanism to obtain the semantic features, which can remarkably improve the accuracy and efficiency of semantic extraction and enhance the ability to identify and respond to malicious domain names. Then, according to the preset clustering centers of the plurality of Gaussian mixture categories, the device determines the distribution probabilities that the initial malicious domain name belongs to the different Gaussian mixture categories, and determines the target category to which the initial malicious domain name belongs from the plurality of Gaussian mixture categories according to the magnitude relation of the distribution probabilities. The semantic features are input into the domain name generation network, and the clustering features of the target category are extracted. The Gaussian mixture categories are obtained by updating initial Gaussian mixture categories according to updated Gaussian distribution parameters; the updated Gaussian distribution parameters are obtained by first determining a plurality of initial Gaussian mixture categories of a plurality of sample domain name features, then calculating the posterior probability that each sample domain name feature belongs to each initial Gaussian mixture category, and updating the current Gaussian distribution parameters according to the posterior probabilities. Therefore, each Gaussian mixture category can accurately cluster similar domain names, and the clustering features of the target category can well express the data characteristics of its Gaussian distribution.

The clustering features are used to guide the domain name generation network to generate similar potential malicious domain names within the range indicated by the target category, so that the obtained potential malicious domain names have a stronger relevance to the initial malicious domain name. Because the potential malicious domain names are generated based on the currently input initial malicious domain name and do not depend on historical data, timeliness is better, and the quality of the finally mined potential malicious domain names is higher.

The specific implementation of the malicious domain name generating device is basically the same as the specific embodiment of the malicious domain name generating method, and will not be described herein. On the premise of meeting the requirements of the embodiment of the disclosure, the malicious domain name generating device can also be provided with other functional modules so as to realize the malicious domain name generating method in the embodiment.

The embodiment of the disclosure also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the generation method of the malicious domain name when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.

Referring to fig. 13, fig. 13 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:

The processor 1301 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present disclosure;

The memory 1302 may be implemented in the form of read-only memory (ROM), static storage, dynamic storage, or random access memory (RAM). The memory 1302 may store an operating system and other application programs. When the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program code is stored in the memory 1302 and invoked by the processor 1301 to execute the method for generating a malicious domain name of the embodiments of the present disclosure;

An input/output interface 1303 for implementing information input and output;

The communication interface 1304 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth, etc.);

A bus 1305 to transfer information between the various components of the device (e.g., the processor 1301, memory 1302, input/output interfaces 1303, and communication interfaces 1304);

Wherein the processor 1301, the memory 1302, the input/output interface 1303 and the communication interface 1304 enable a communication connection between each other inside the device via a bus 1305.

The embodiment of the disclosure also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the generation method of the malicious domain name when being executed by a processor.

The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The embodiments described in the present disclosure are intended to describe the technical solutions of the embodiments of the present disclosure more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure. Those skilled in the art will appreciate that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.

It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present disclosure, and may include more or fewer steps than shown, or may combine certain steps, or different steps.

The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps, apparatus, functional modules/units in the devices, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.

It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" or a similar expression means any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may each be single or plural.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing a program.

Preferred embodiments of the present disclosure are described above with reference to the accompanying drawings, and this description does not thereby limit the scope of the claims of the embodiments of the present disclosure. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present disclosure shall fall within the scope of the claims of the embodiments of the present disclosure.

Claims

1. A method for generating a malicious domain name, characterized by comprising the following steps:

2. The method for generating a malicious domain name according to claim 2's parent claim 1, wherein the Gaussian mixture category is determined by:

3. The method for generating a malicious domain name according to claim 2, wherein the calculating of a posterior probability of each of the sample domain name features belonging to the respective initial Gaussian mixture category based on the current mean vector, the covariance matrix, the mixture coefficient, and the probability density function comprises:

[Formulas (1) to (4) of this claim are not reproduced in this text.]
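As a non-limiting illustration of the posterior probability calculation named in claim 3, the following minimal sketch computes, for one domain name feature, the posterior probability of each Gaussian mixture category as the mixture coefficient times the probability density, normalized over all categories, and then selects the category with the largest posterior. It follows the standard Gaussian mixture formulation and is not asserted to match the formulas of the claim. The diagonal covariance, the function names (`diag_gaussian_pdf`, `posteriors`, `target_category`), and all numeric values are assumptions made for illustration only.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (an assumption of this sketch)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def posteriors(x, weights, means, variances):
    """Posterior probability that feature x belongs to each mixture category."""
    # Joint term per category: mixture coefficient times probability density.
    joint = [w * diag_gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # assumed non-zero; very distant points may underflow
    return [j / total for j in joint]


def target_category(x, weights, means, variances):
    """Index of the category with the largest posterior probability."""
    post = posteriors(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical two-category mixture over 2-D domain name features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
feature = [3.5, 4.2]

post = posteriors(feature, weights, means, variances)
print(post, target_category(feature, weights, means, variances))
```

A full covariance matrix would replace the per-dimension variance with a matrix inverse and determinant; the diagonal form is used here only to keep the sketch free of external libraries.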

4. The method for generating a malicious domain name according to claim 2, wherein the obtaining a plurality of sample domain name features comprises:

acquiring a plurality of sample malicious domain names;

5. The method for generating a malicious domain name according to claim 4, wherein the extracting the first character feature from the first character and the extracting the second character feature from the second character includes:

6. The method for generating a malicious domain name according to claim 2, wherein the performing of Gaussian mixture clustering on the plurality of sample domain name features to obtain a plurality of initial Gaussian mixture categories includes:

7. The method for generating a malicious domain name according to claim 2, wherein the type of the covariance matrix is determined by:

8. The method for generating a malicious domain name according to claim 1, wherein the semantic extraction of the initial domain name feature based on a pre-constructed attention mechanism to obtain a semantic feature comprises:

9. The method for generating a malicious domain name according to claim 1, wherein the inputting the semantic feature into a domain name generating network, extracting a cluster feature of the target class, and guiding the domain name generating network to generate a potentially malicious domain name by using the cluster feature comprises:

extracting clustering features of the target category;

10. The method for generating a malicious domain name according to claim 1, wherein after the domain name generating network is guided to generate a potentially malicious domain name by using the clustering feature, the method further comprises:

acquiring a domain name to be detected;

11. A malicious domain name generation device, comprising:

12. An electronic device comprising a memory storing a computer program and a processor implementing the method of generating a malicious domain name according to any one of claims 1 to 10 when the computer program is executed by the processor.

13. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of generating a malicious domain name according to any one of claims 1 to 10.