CN112217787A

CN112217787A - Method and system for generating mock domain name training data based on ED-GAN

Info

Publication number: CN112217787A
Application number: CN202010895375.4A
Authority: CN
Inventors: 朱怡; 宁振虎
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2021-01-12
Anticipated expiration: 2040-08-31
Also published as: CN112217787B

Abstract

The invention discloses a method and a system for generating counterfeit (mock) domain name training data based on ED-GAN. First, a domain name encoder and a domain name decoder are designed. Second, the encoder and decoder are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names. Finally, the validity of the generated counterfeit domain name samples is checked by comparing the performance of multiple classifiers. To make full use of the GAN's ability to sample and learn directly from samples, the data are fed directly into the original GAN model for training, without complex processing or transformation (such as convolution or pooling layers), so that the real characteristics of the data are preserved. The domain name encoder and decoder are simple and stay close to the original data, which further preserves the true characteristics of the data.

Description

Method and system for generating mock domain name training data based on ED-GAN

Technical Field

The invention belongs to the fields of deep learning and information security. In particular, it relates to a method and a system for generating counterfeit domain name training data based on ED-GAN, and belongs to counterfeit domain name protection technology.

Background

With the rapid development of internet applications, the interests carried by the internet keep growing, and attacks against the domain name system have become more frequent, with a great impact on network security. Among these, counterfeit domain name attacks have become one of the major threats to the safe operation of the internet because of their low cost, wide range of damage and diverse ways of making a profit. By registering a domain name similar to a legitimate one, the attacker induces users to visit the counterfeit domain name (such as facebook 0k.com, gooqle.com and the like) instead of the domain name of the target website, in order to publish false advertisements, sell fake goods, or even steal user information and commit identity theft. Counterfeit domain name attacks are now a key concern of security communities and related organizations at home and abroad, and preventing them in advance is important for guaranteeing the safe operation of the internet.

Academia has studied counterfeit domain name detection from several angles, such as statistics, host behavior and network behavior. This research shows that counterfeit domain names are in essence a complex social engineering problem that spans many fields and application scenarios, and that the core question is how to detect and judge counterfeit domain name attacks automatically. Current detection technologies for counterfeit domain names fall mainly into the following three categories:

1) Blacklisting techniques based on manual judgment and quality evaluation. This type of technique uses a maintained blacklist to prevent users from accessing phishing domain name websites that have been discovered and are still alive. The blacklist is built mainly through manual reporting and review, or through website quality ratings by a user community. For example, Cloudmark maintains its blacklist through website ratings from a large number of users, and browsers such as IE and Firefox warn users about the safety of the pages they visit by adding user-reported phishing pages to a blacklist that is updated in real time. The results are reliable and accurate, but the method lacks timeliness, does not work against counterfeit domain name websites that are not yet on the blacklist, and consumes a large amount of resources on manual review.

2) Rule-based heuristic detection techniques. This type of technique judges the authenticity of a website automatically from a series of characteristics of counterfeit domain name websites. For example, SpoofGuard, the earliest tool to analyze the heuristic characteristics of counterfeit domain name websites comprehensively, makes its judgment by detecting phishing characteristics such as the host domain name, counterfeit images in the webpage and counterfeit page links. As another example, the identity of a website can be defined from the text content of its webpages and then checked against search engine results to judge its authenticity. Because counterfeit domain name websites also tend to imitate the target website visually, researchers have used the EMD algorithm to compute the visual similarity of two webpages and thereby judge whether phishing is taking place. This type of technique can detect most unreported counterfeit domain name websites in real time, and its accuracy is good because the rules are set with a high degree of manual intervention, but it lacks robustness and is still prone to missed detections.

3) Pattern classification techniques based on statistical machine learning. These methods extract domain name features algorithmically, build a classification model, and convert counterfeit domain name detection into a binary classification problem: given an unknown domain name, decide whether it is a normal domain name or a counterfeit one. However, this type of technique still faces many difficulties in a DNS big data environment. The training data needed to build the detection model are hard to obtain, and most features are complex and hard to collect in time, so the detection accuracy on massive data cannot be guaranteed.

In addition, these technologies share a common limitation: it is difficult to obtain enough up-to-date counterfeit domain name training samples in a timely and effective way, so the detection model is updated too slowly and detection is neither effective nor fast enough.

Because of the massive volume of internet data and the growing number of relevant feature dimensions, counterfeit domain name detection has gradually moved from early blacklist matching and rule-based heuristic detection to machine learning, and most researchers now use a classification model to identify and detect counterfeit domain names. However, existing machine learning detection algorithms have two problems. The first is the imbalance between attack data and normal data during model training: an imbalanced data set leads to a biased detection model that cannot detect counterfeit domain names correctly. The second is the detection of unknown attack data: new attack data are not published on the network immediately and so cannot be used for model training, which means the trained model cannot identify newly created counterfeit domain names and cannot be updated in real time.

On this basis, the invention provides a method and a system for generating counterfeit domain name training data based on ED-GAN. A generative adversarial network (GAN) learns the character features of counterfeit domain names directly, with no need to cluster the domain names or extract their features in advance; generated domain names similar to real counterfeit samples can be constructed simply by encoding and decoding the domain names.

Disclosure of Invention

The main object of the invention is to provide an ED-GAN-based counterfeit domain name training data generation system, which comprises a real counterfeit domain name set encoding module, a GAN generation network construction module, a GAN discrimination network construction module and a decoding module for generating the similar counterfeit domain name set. The output of the real counterfeit domain name set encoding module and the output of the GAN generation network construction module are both connected to the input of the GAN discrimination network construction module and are used to train and optimize the parameters of the GAN discrimination network. The output of the GAN discrimination network construction module is connected to the input of the GAN generation network construction module and is used to keep optimizing the parameters of the GAN generation network, which generates new data to be fed into the GAN discrimination network for identification. The output of the GAN generation network construction module is connected to the input of the GAN discrimination network construction module and to the input of the decoding module for generating the similar counterfeit domain name set, respectively, thereby completing the training of the GAN discrimination network and the restoration of character-level domain names.

For counterfeit domain name detection, the invention introduces an ED-GAN character-level domain name generation model to generate usable attack data, in order to solve the problems of imbalanced data sets and hard-to-identify new attack samples. First, the characteristics of counterfeit domain names are analyzed. Common counterfeit domain names are generally built from combinations of random letters and digits, so their length shows a certain regularity, and a domain name encoder and a domain name decoder are designed on this basis. Second, the domain name Encoder and Decoder are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names. Finally, the validity of the generated counterfeit domain name samples is checked by comparing the performance of multiple classifiers.

The core architecture of the character-level generation model is the GAN neural network, on top of which the encoder and decoder for domain name characters are designed. The method for generating counterfeit domain name training data based on ED-GAN comprises the following steps:

S1 Encoding of the real counterfeit domain name set

The main function of the domain name Encoder is to encode a character-level domain name into a corresponding domain name vector that represents the domain name character data and serves as input to the discrimination network of the generative adversarial network. The domain name is first preprocessed: the top-level domain and any second-level domain, third-level domain and so on are removed, and only the key part of the domain name is extracted (for example, only baidu is extracted from www.baidu.com). The domain name characters are encoded as follows. Let the character-level domain name be d, and let the vector formed by its characters in order be the domain name character vector, namely

d = (d_1, d_2, ..., d_n)

where n is the domain name length. The conversion function between characters and numerical values is:

f(x)＝Q(d_i) (1)

where d_i (i = 1, 2, ..., n) is a domain name character, n is the domain name length, and Q(d_i) is the character value. Only 38 characters in total are converted, namely '0'-'9', 'a'-'z', '-' (hyphen) and '.' (dot), since only these 38 characters are allowed in a domain name string and case is not distinguished. The character value conversion function assigns values to the 38 characters in order: Q('0') = 1, Q('1') = 2, ..., Q('a') = 11, ..., Q('z') = 36, Q('-') = 37, Q('.') = 38. The domain name character vector d = (d_1, d_2, ..., d_n) is thus converted into the domain name character value vector

Q(d) = (Q(d_1), Q(d_2), ..., Q(d_n))

To improve the learning efficiency of the GAN, the domain name character value vector Q(d) is normalized and mapped to the interval [0,1]. For i = 1, 2, ..., n, where n is the domain name length, the mapping is given by formula (2), whose inverse is the decoder formula (3):

Q'(d_i)＝[Q(d_i)-minQ(d)]/[maxQ(d)-minQ(d)+1] (2)

where Q'(d_i) is the normalized character value, Q(d_i) is the character value, minQ(d) = 1 is the lower limit of the character conversion value and maxQ(d) = 38 is the upper limit of the character conversion value. The domain of the encoder mapping function is [1,38] and its values fall in [0,1]. The domain name value vector Q(d) = (Q(d_1), Q(d_2), ..., Q(d_n)) is mapped into the domain name vector

Q'(d) = (Q'(d_1), Q'(d_2), ..., Q'(d_n))

For example, for the domain name baidu, the domain name character vector is

d = (b, a, i, d, u)

The character value vector of the domain name is

Q(d) = (12, 11, 19, 14, 31)

After encoding, applying formula (2) to each element gives the domain name vector

Q'(d) = (Q'(b), Q'(a), Q'(i), Q'(d), Q'(u))

The domain name vectors of other domain names are obtained in the same way. Because domain names differ in length, the dimension of the domain name vector is set to 15 according to the characteristics of counterfeit domain names, and vectors that are too short are padded with trailing zeros, so that every domain name character vector becomes a 15-dimensional domain name vector after passing through the encoder. After encoding, the character-level domain name vectors become the training data of the GAN and serve as input to the GAN generation network of S2; finally, they are converted through TensorFlow into tensors for deep neural network computation.
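The encoding steps of S1 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the names `ALPHABET`, `Q` and `encode` are hypothetical, the normalization is written as the inverse of decoder formula (3), and truncating names longer than 15 characters is an assumption, since the text only describes padding.

```python
import string

# Character order follows the text: Q('0') = 1, ..., Q('9') = 10,
# Q('a') = 11, ..., Q('z') = 36, Q('-') = 37, Q('.') = 38.
ALPHABET = string.digits + string.ascii_lowercase + "-."
Q = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MIN_Q, MAX_Q = 1, 38
VECTOR_DIM = 15  # fixed domain name vector dimension given in the text


def encode(label):
    """Encode the key part of a domain name into a 15-dimensional vector."""
    # Case is not distinguished; truncation to VECTOR_DIM is an assumption.
    # A character outside the 38-character alphabet raises KeyError.
    values = [Q[ch] for ch in label.lower()[:VECTOR_DIM]]
    # Normalization assumed to be the inverse of decoder formula (3).
    normalized = [(v - MIN_Q) / (MAX_Q - MIN_Q + 1) for v in values]
    # Pad short names with trailing zeros up to the fixed dimension.
    return normalized + [0.0] * (VECTOR_DIM - len(normalized))


vector = encode("baidu")
print([round(x, 3) for x in vector[:5]])
```

Under these assumptions, the character values for baidu are 12, 11, 19, 14 and 31, and the remaining ten positions of the vector are zero padding.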

S2 GAN generation network construction

The network structure consists of four layers: an input layer, two hidden layers and an output layer. The input layer data are generated randomly from a Gaussian distribution model with dimension n = 100. The input layer uses the ReLU activation function for performance reasons: activation functions that involve exponential operations are expensive to compute, whereas ReLU reduces the amount of computation over the whole process and helps prevent vanishing gradients when training a deep network. There are two hidden layers, with h₁ = 150 and h₂ = 300 nodes respectively, and their nodes also use the ReLU activation function. The output layer has 15 nodes. Because the input to the decoder must consist of elements in the interval [0,1], the sigmoid function is used as the activation function of the output layer only, which keeps the computation of the generation network low. The data generated in S2 are passed to the discrimination network of S3 and to the domain name decoder of S4, respectively.
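To make the layer sizes concrete, the forward pass of such a generation network can be sketched in plain Python. This is only an illustration under stated assumptions: the four layers are read as three weight matrices (100 to 150, 150 to 300, 300 to 15), the weight scale of 0.1 is arbitrary, and all function names are hypothetical. The source trains the network in TensorFlow; no training is shown here.

```python
import math
import random

LAYER_SIZES = [100, 150, 300, 15]  # noise input, h1, h2, output


def init_layer(n_in, n_out, rng):
    # Small Gaussian weights and zero biases (illustrative values only).
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    # y_j = sum_i x_i * w_ij + b_j
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def generator_forward(z, layers):
    h = z
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))   # ReLU on all but the last layer
    weights, biases = layers[-1]
    return sigmoid(dense(h, weights, biases))  # sigmoid keeps outputs in [0, 1]


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
z = [rng.gauss(0.0, 1.0) for _ in range(LAYER_SIZES[0])]  # Gaussian noise, n = 100
fake_vector = generator_forward(z, layers)
print(len(fake_vector))
```

The sigmoid on the last layer is what guarantees that every element of the 15-dimensional output is a valid decoder input.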

S3 GAN discriminative network construction

The discrimination network is similar to the generation network and is also a four-layer neural network with an input layer, two hidden layers and an output layer. The input layer has two data sources: one part is the real data, and the other part is the generated data output by the generation network. That is, the real data encoded in S1 and the output of S2 together form the input of S3. The counterfeit domain name vector size is set to reset_size = 15, so the input data dimension is 2 × reset_size. The two hidden layers have h₂ = 300 and h₁ = 150 nodes respectively and use the ReLU activation function. The output layer uses the sigmoid activation function. Before the activation function is applied, the data are split into the first reset_size dimensions and the last reset_size dimensions, which are processed separately so that the training error of the network is minimized; the discard rates of the real data and of the generated data are output respectively, which prevents the network from overfitting.

S4 Decoding to generate the similar counterfeit domain name set

The function of the domain name Decoder is to decode the domain name vectors produced by the GAN generation network into the corresponding character-level domain names, that is, to restore the generated data to characters as the output of the generative adversarial network. In essence it is the mirror image of the encoder. The inverse mapping of the domain name decoder is given by formula (3):

Q(d_i)＝Q'(d_i)*[maxQ(d)-minQ(d)+1]+minQ(d) (3)

where i = 1, 2, ..., n; n = 15 is the domain name length; minQ(d) = 1 is the lower limit of the character conversion value; maxQ(d) = 38 is the upper limit of the character conversion value; and Q'(d_i) is an element of the domain name vector generated by the generation network. The domain name vector generated by the generation network,

Q'(d) = (Q'(d_1), Q'(d_2), ..., Q'(d_n))

yields after decoding the domain name character value vector

Q(d) = (Q(d_1), Q(d_2), ..., Q(d_n))

Let the inverse of the character-to-value conversion function be:

g(x)＝Q^-1(x) (4)

After g(x) is applied, the domain name value vector Q(d) is mapped into the domain name character vector

d = (d_1, d_2, ..., d_n)

The elements d_i of this vector are combined in order to form the domain name string d_1 d_2 ... d_n, so that a domain name vector passed through the decoder is decoded into a readable domain name.
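Formulas (3) and (4) can be illustrated with a short Python sketch. The names `ALPHABET` and `decode` are hypothetical. Rounding to the nearest integer and clamping to [1, 38] are assumptions: the generation network outputs continuous values, and the text does not say how they are snapped to character values.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + "-."  # Q^-1 lookup table
MIN_Q, MAX_Q = 1, 38


def decode(vector):
    """Decode a list of normalized values into a character-level domain name."""
    chars = []
    for q_norm in vector:
        # Formula (3): map the normalized value back to the character value range.
        value = q_norm * (MAX_Q - MIN_Q + 1) + MIN_Q
        # Assumption: round to the nearest value and clamp to [MIN_Q, MAX_Q].
        index = min(max(int(round(value)), MIN_Q), MAX_Q)
        # Formula (4): g(x) = Q^-1(x), value back to character.
        chars.append(ALPHABET[index - 1])
    return "".join(chars)


print(decode([11 / 38, 10 / 38, 18 / 38, 13 / 38, 30 / 38]))
```

Note that a padding value of 0 decodes to the character '0' under formula (3), so this sketch is applied only to the non-padded positions; the text does not state how padding is told apart from a genuine '0'.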

Compared with the prior art, the invention has the following technical advantages: 1) the GAN is trained on counterfeit domain names to generate data, so both the training data and the generated data are more targeted; 2) to make full use of the GAN's ability to sample and learn directly from samples, the data are fed directly into the original GAN model for training, without complex processing or transformation (such as convolution or pooling layers), so that the real characteristics of the data are preserved; 3) the domain name encoder and decoder are simple and stay close to the original data, so the true characteristics of the data are preserved as far as possible.

Drawings

FIG. 1 is a diagram of the ED-GAN-based counterfeit domain name generation model.

FIG. 2 is a comparison of real samples with samples generated after different numbers of adversarial rounds.

FIG. 3 is a diagram of the basic framework for counterfeit domain name detection.

Detailed Description

The invention is explained and illustrated below with reference to the accompanying drawings:

To make the objects, technical solutions and features of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings. The basic framework of the ED-GAN-based generation of counterfeit domain name training data is shown in FIG. 1. The individual modules are described below:

(I) Real counterfeit domain name set encoding module

The real counterfeit domain name data set is obtained by web crawling. The encoder encodes each character-level counterfeit domain name into a domain name vector, which re-represents the domain name character data and serves as input to the discrimination network of the generative adversarial network. The encoder converts domain name characters into numerical values and then normalizes them; the length of the domain name character vector is set uniformly as required, so that all character-level domain names are converted into domain name vectors of equal length and used as input to the GAN discrimination network.

(II) GAN generation network construction module

A generative adversarial network (GAN) consists mainly of two parts: the generation network G (Generator) and the discrimination network D (Discriminator). The generation network G takes random samples from the latent space as input, and its output must imitate the real samples in the training set, that is, the real counterfeit domain names, as closely as possible. As shown in FIG. 1, the input z of the generation network G obeys a prior probability distribution P_z(z); G generates the data G(z), the generated counterfeit domain names, and provides them to the discrimination network D.

(III) GAN discriminative network construction module

The input of the discrimination network D consists of two parts: the real samples drawn from P_data(x), that is, the real counterfeit domain names, and the output G(z) of the generator, that is, the generated similar counterfeit domain names. The network must distinguish as well as possible whether the current input is real data or data G(z) produced by the generator. During model training, the discrimination network D improves its discrimination ability through continuous learning, and the generation network G improves its ability to disguise its output through continuous learning. The two networks form a dynamic adversarial process and are optimized continuously over the iterations. When D can finally no longer distinguish real data from generated data, that is, when D accepts the generated data G(z) as real, the model is considered optimal: G is considered to have fully learned the distribution of the real data, and the generated data can be used as counterfeit domain names.
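The adversarial process described above is commonly expressed through two cross-entropy losses. The sketch below uses the standard formulation of the original GAN with the widely used non-saturating generator loss; the source does not give its loss functions, so this choice and the function names are assumptions for illustration.

```python
import math

EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real, d_fake):
    """Loss for D: real samples should score near 1, generated samples near 0."""
    real_term = sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Non-saturating loss for G: G is rewarded when D scores fakes near 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)


# D(x) on a batch of real vectors and D(G(z)) on a batch of generated vectors.
print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(generator_loss([0.2, 0.1]))
```

When D outputs 0.5 for every input, which corresponds to the state in which it can no longer tell real data from generated data, the discriminator loss equals 2 ln 2, about 1.386.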

(IV) Decoding module for generating the similar counterfeit domain name set

The similar counterfeit domain names produced by the GAN generation network are in the form of domain name vectors, so the decoder must reverse the encoder's operation to make them readable. The decoder restores the generated data to characters: its input is a domain name vector and its output is an ordinary character-level domain name.

Test environment:

The invention verifies the practical effect of the ED-GAN-based counterfeit domain name training data generation method through experiments. The test environment comprises the Ubuntu 16.04 operating system, 8 GB of memory, a 1 TB hard disk, an Intel Core i5-3210 2.5 GHz CPU, the TensorFlow deep learning framework and the WEKA machine learning platform. The experimental data consist of the top-ranked domain names in Alexa as legitimate target domain names, 100 actual counterfeit domain names from the network, and similar counterfeit domain names generated by the ED-GAN counterfeit domain name generation model.

In the invention, the GAN is combined with the designed encoder and decoder to generate counterfeit domain name training data. The experiment is designed as follows:

1) Preprocess the million-level counterfeit domain name data set: split each domain name with Python's split function, which breaks a string into a list, keep the key part of the domain name, and remove the top-level domain and any second-level domain, third-level domain and so on.

2) After the million-level counterfeit domain names have been preprocessed, the domain name characters are encoded in advance in order to shorten training time and reduce memory consumption and time complexity during GAN training. The encoder converts them into domain name vectors, which are then converted into input tensors of the GAN neural network through the standard data reading format in TensorFlow.

3) Generate similar counterfeit domain names. The processed million-level domain names are fed into the domain name character generation model, which is trained to generate similar counterfeit domain name samples. Within each training epoch (1 epoch equals one pass of training over all samples in the training set), the generation network produces generated data after each training step, batch_size (the batch size) rows of data each time.
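The preprocessing in step 1) and the batching in step 3) can be sketched as follows. The helper names `extract_key_part` and `batches` and the parameter `n_suffix_labels` are hypothetical. Counting suffix labels by hand is a simplifying assumption, since the text does not say how multi-label suffixes are recognized.

```python
def extract_key_part(domain, n_suffix_labels=1):
    """Return the key label of a domain name, e.g. 'baidu' for 'www.baidu.com'."""
    labels = domain.lower().strip(".").split(".")
    # Drop a leading 'www' label if something else remains.
    if labels[0] == "www" and len(labels) > 1:
        labels = labels[1:]
    # Drop the assumed number of suffix labels (top-level domain and so on).
    if len(labels) > n_suffix_labels:
        labels = labels[:len(labels) - n_suffix_labels]
    return labels[-1]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items (one epoch)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["www.baidu.com", "gooqle.com", "www.example.com.cn"]
keys = [extract_key_part(names[0]), extract_key_part(names[1]),
        extract_key_part(names[2], n_suffix_labels=2)]
print(keys)
print(list(batches(keys, 2)))
```

A production pipeline would more likely consult a public suffix list than a fixed label count, but the fixed count keeps the sketch self-contained.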

Parameter configuration of the generative adversarial network:

The parameter configuration of the generative adversarial network comprises that of the generation network and that of the discrimination network, shown in Table 1 and Table 2 respectively. For the generation network, Gaussian random data are trained by the adversarial training algorithm into similar data that have the characteristics of the real data. The weights of the G network and the D network are initialized with the truncated Gaussian (normal) distribution provided by TensorFlow, tf.truncated_normal(), and the biases are initialized to zero with tf.zeros().
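For readers without TensorFlow at hand, the effect of this initialization can be sketched in plain Python. The sketch mimics the documented behavior of tf.truncated_normal(), which re-draws any sample lying more than two standard deviations from the mean. The helper names and the standard deviation of 0.1 are hypothetical, because the values of Table 1 and Table 2 are not reproduced in the text.

```python
import random


def truncated_normal(shape, stddev=0.1, rng=None):
    """Return a rows x cols matrix of truncated Gaussian samples with mean 0."""
    rng = rng or random.Random()
    rows, cols = shape

    def draw():
        # Re-draw until the sample lies within two standard deviations.
        while True:
            x = rng.gauss(0.0, stddev)
            if abs(x) <= 2.0 * stddev:
                return x

    return [[draw() for _ in range(cols)] for _ in range(rows)]


def zeros(n):
    """Return a bias vector of n zeros."""
    return [0.0] * n


# First generator layer: 100 noise inputs to 150 hidden nodes.
w1 = truncated_normal((100, 150), rng=random.Random(0))
b1 = zeros(150)
print(len(w1), len(w1[0]), len(b1))
```

Truncation keeps the initial weights free of rare large values, which would otherwise push the sigmoid output layer into its flat regions at the start of training.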

Table 1 Parameter configuration of the generation network

Table 2 Parameter configuration of the discrimination network

During the adversarial learning of the generation network and the discrimination network, the characteristics of the generated data move steadily closer to those of the real data as the number of adversarial rounds increases. To illustrate this learning behavior, the invention tracks and outputs the data generated by the character-level domain name generation model after different numbers of adversarial rounds and compares them with a selection of real counterfeit domain names, as shown in FIG. 2. The data in the upper left ellipse are real counterfeit domain name samples, the data in the upper right ellipse are samples generated in adversarial rounds 0-10, and the data in the lower ellipse are samples generated after 250 adversarial rounds. The data from rounds 0-10 are produced by the GAN at the start of adversarial training; they differ greatly from the real data, and most of them cannot be used as domain name strings. As the number of adversarial rounds increases, the generated data come to resemble the real data more and more; after about 250 adversarial rounds, most of the generated data are similar to the real data and can be used as counterfeit domain names.

To further verify that the data produced by the ED-GAN-based counterfeit domain name training data generation model have the characteristics of real domain name data, the counterfeit domain names are identified and detected with several classifiers, namely J48, naive Bayes, random tree and random forest, in order to verify the validity of the generated data. The basic framework of counterfeit domain name detection is shown in FIG. 3.

Three groups of comparison experiments are designed. The first classifies a positive sample set of 5000 Alexa legitimate target domain names against a negative sample set of 5000 real counterfeit domain names. The second classifies a positive sample set of 5000 Alexa legitimate target domain names against a negative sample set of 5000 similar counterfeit domain names produced by the generation model. The third classifies a positive sample set of 10000 Alexa legitimate target domain names against a negative sample set of 10000 domain names formed by combining 5000 real counterfeit domain names with 5000 similar counterfeit domain names produced by the generation model. The results of the three groups of experiments are shown in Table 3:

Table 3 Sample classification results of the three experiments

The results for the first group in Table 3 (Alexa samples versus real counterfeit domain names) show that naive Bayes and random forest classify better than the other two classifiers. The detection results of the first group serve as the reference values for the second group (Alexa samples versus similar counterfeit domain names) and the third group (Alexa samples versus the combined real and similar counterfeit domain names). Comparing the classification results of the second and third groups in Table 3 with the reference values of the first group shows that both the F value and the ROC area remain at the same level of performance as the reference. This indicates that the generated similar counterfeit domain name samples have the characteristics of real counterfeit domain name samples and can be used as real counterfeit domain name data samples, which verifies the validity of the generated data.
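For reference, the F value used in this comparison can be computed from a confusion matrix as sketched below. The sketch assumes that the F value is the balanced F1 measure, which the text does not state explicitly, and the counts in the example are made up for illustration; they are not the results of Table 3.

```python
def f_measure(tp, fp, fn):
    """Balanced F measure (F1) from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 counterfeit names detected, 10 false alarms, 10 missed.
print(f_measure(tp=90, fp=10, fn=10))
```

Comparing this value across the three experiment groups, classifier by classifier, is the check the text describes: similar values indicate that generated samples behave like real counterfeit samples.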

In conclusion, by combining the domain name encoder and decoder with the GAN, the ED-GAN-based counterfeit domain name training data generation method learns the features of real counterfeit domain names and generates similar counterfeit domain name data.

Claims

1. An ED-GAN-based counterfeit domain name training data generation system, characterized in that: the system comprises a real counterfeit domain name set encoding module, a GAN generation network construction module, a GAN discrimination network construction module and a decoding module for generating the similar counterfeit domain name set; the output of the real counterfeit domain name set encoding module and the output of the GAN generation network construction module are both connected to the input of the GAN discrimination network construction module and are used to train and optimize the parameters of the GAN discrimination network; the output of the GAN discrimination network construction module is connected to the input of the GAN generation network construction module and is used to keep optimizing the parameters of the GAN generation network, which generates new data to be fed into the GAN discrimination network for identification; the output of the GAN generation network construction module is connected to the input of the GAN discrimination network construction module and to the input of the decoding module for generating the similar counterfeit domain name set, respectively, thereby completing the training of the GAN discrimination network and the restoration of character-level domain names.

2. The ED-GAN-based counterfeit domain name training data generation system according to claim 1, characterized in that: an ED-GAN character-level domain name generation model is introduced to generate usable attack data; an encoder and a decoder of the domain name are designed; the encoder and decoder of the domain name are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names; and finally, the validity of the generated counterfeit domain name sample data is checked by comparing the performance of multiple classifiers.

3. The ED-GAN-based counterfeit domain name training data generation system according to claim 2, characterized in that: the core architecture of the character-level generation model is the GAN neural network, on top of which the encoder and decoder for domain name characters are designed.

4. An ED-GAN-based counterfeit domain name training data generation method, characterized in that the method comprises the following steps:

S1: Real counterfeit domain name set encoding

The domain name encoder encodes a character-level domain name into a corresponding domain name vector, which represents the domain name character data and serves as input to the discrimination network of the generative adversarial network; first, the domain name is preprocessed: the top-level domain and any second-level or third-level domain are removed, and only the key part of the domain name is extracted; the domain name characters are then encoded as follows: let the character-level domain name be d, and let the vector formed by splitting d into its characters in order be (d_1, d_2, ..., d_n), where n is the domain name length; the conversion function between characters and numerical values is as follows:

f(x)＝Q(d_i) (1)

where Q(d_i) is the numerical value of the character d_i; only 38 characters in total are converted, namely '0'-'9', 'a'-'z', the hyphen '-' and the dot '.', because these are the only characters permitted in a domain name string and case is not distinguished; applying the character value conversion function assigns the 38 characters consecutive values, i.e., Q('0') = 1, Q('1') = 2, ..., Q('a') = 11, ..., Q('z') = 36, Q('-') = 37, Q('.') = 38; the domain name character vector (d_1, d_2, ..., d_n) is thereby converted into the domain name character numerical value vector (Q(d_1), Q(d_2), ..., Q(d_n)); in order to improve the learning efficiency of the GAN, the domain name character numerical value vector is normalized so that each element is mapped to the interval [0, 1]; for i = 1, 2, ..., n, where n is the domain name length, the mapping is given by formula (2):

Q'(d_i) = [Q(d_i) - min Q(d)] / [max Q(d) - min Q(d) + 1] (2)

where Q'(d_i) is the normalized character value; the vector (Q(d_1), Q(d_2), ..., Q(d_n)) is thus mapped into (Q'(d_1), Q'(d_2), ..., Q'(d_n));

Because domain names differ in length, the dimension of the domain name vector is set to 15 according to the characteristics of counterfeit domain names, and positions beyond the domain name length are padded with 0 at the end, so that every domain name character vector is converted by the encoder into a domain name vector of dimension 15; after encoding, the character-level domain name vectors become training data for the GAN, which serve as input to the GAN generation network of S2 and are finally converted by TensorFlow into tensors for deep neural network computation;
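As a non-limiting illustration, the encoding of S1 may be sketched in Python as follows. The sketch assumes that the key part of the domain name has already been extracted, that the normalization is the exact inverse of formula (3), i.e. Q'(d_i) = [Q(d_i) - min Q(d)] / [max Q(d) - min Q(d) + 1], and that names longer than 15 characters are rejected; the names ALPHABET, Q and encode are hypothetical and do not appear in the original.

```python
import string
from typing import List

# Alphabet in the order given in the claim: '0'-'9', 'a'-'z', '-', '.'
ALPHABET = string.digits + string.ascii_lowercase + "-."
Q = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # Q('0') = 1, ..., Q('.') = 38
Q_MIN, Q_MAX = 1, 38
VECTOR_DIM = 15  # fixed dimension of the domain name vector


def encode(key_part: str) -> List[float]:
    """Encode the key part of a domain name as a 15-dimensional vector."""
    name = key_part.lower()  # case is not distinguished
    if len(name) > VECTOR_DIM:
        # Assumption: the source does not say how longer names are handled.
        raise ValueError("key part longer than %d characters" % VECTOR_DIM)
    values = [Q[ch] for ch in name]  # KeyError for characters outside the alphabet
    scale = Q_MAX - Q_MIN + 1  # assumed inverse of formula (3)
    normalized = [(v - Q_MIN) / scale for v in values]
    # Pad with 0 at the end up to the fixed dimension.
    return normalized + [0.0] * (VECTOR_DIM - len(normalized))


vector = encode("paypa1")  # 6 encoded characters followed by 9 zeros
```

Under this reading the character '0' and the padding value both map to 0, so the original domain name length is not recoverable from the vector alone.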

S2: GAN generation network construction

The generation network learns the probability distribution model of the real data under the guidance of the discrimination network; its structure is a four-layer neural network comprising an input layer, two hidden layers and an output layer; the input layer data are n = 100-dimensional random data drawn from a Gaussian distribution model, and the activation function of the input layer is the ReLU function; ReLU is chosen with system performance in mind, because computing the activation function would otherwise involve exponential operations, so adopting ReLU reduces the computation of the whole process, and ReLU also helps prevent vanishing gradients during deep network training; the two hidden layers have h_1 = 150 and h_2 = 300 nodes respectively, and their activation function is likewise the ReLU function; the output layer has 15 nodes, and because the input data of the decoder are elements of the interval [0, 1], the sigmoid function is used only in the activation function of the output layer, which limits the computation of the generation network; the data generated in S2 are passed both to the discrimination network of S3 and to the domain name decoder of S4;
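As a non-limiting illustration, the forward pass of the generation network of S2 may be sketched in plain Python. The sketch reads the four layers as the transitions 100 → 150 (ReLU), 150 → 300 (ReLU) and 300 → 15 (sigmoid); this reading, the weight initialization, and the names init_layer, build_generator and generate are assumptions, and no training step is shown.

```python
import math
import random
from typing import List, Tuple

NOISE_DIM, H1, H2, OUT_DIM = 100, 150, 300, 15
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small Gaussian weights and zero biases (initialization is an assumption)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def sigmoid(v: List[float]) -> List[float]:
    # Numerically stable form for both signs of the argument.
    return [1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))
            for a in v]


def build_generator(seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    return [init_layer(NOISE_DIM, H1, rng),
            init_layer(H1, H2, rng),
            init_layer(H2, OUT_DIM, rng)]


def generate(layers: List[Layer], rng: random.Random) -> List[float]:
    """Map 100-dimensional Gaussian noise to a 15-dimensional vector in (0, 1)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    h = relu(dense(z, layers[0]))
    h = relu(dense(h, layers[1]))
    return sigmoid(dense(h, layers[2]))


sample = generate(build_generator(seed=1), random.Random(2))
```

The sigmoid output keeps every element inside the interval expected by the decoder of S4, which is why no further rescaling is applied in this sketch.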

S3: GAN discrimination network construction

The discrimination network is similar to the generation network and is likewise a four-layer neural network comprising an input layer, two hidden layers and an output layer; the input layer has two data sources: one part comes from the real data and the other from the data generated by the generation network, that is, the output of S1 for the real data and the output of S2 together form the input of S3; the length of a counterfeit domain name is set to reset_size = 15, so the dimension of the input data is 2 × reset_size; the two hidden layers have h_2 = 300 and h_1 = 150 nodes respectively, with the ReLU function as the activation function; the output layer activation function is the sigmoid function, and before the activation function is applied, the data are split into the first reset_size dimensions and the last reset_size dimensions, which are processed separately; the network is trained to minimize its error, the discard rates of the real data and the generated data are output respectively, and overfitting of the network is prevented;
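As a non-limiting illustration, one possible reading of the discrimination network of S3 is sketched below in plain Python. It assumes that the 2 × reset_size input holds one real vector followed by one generated vector, that both halves pass through the same 15 → 300 → 150 → 1 layers, and that the pre-activation outputs are separated into a real part and a generated part before the sigmoid is applied. The single output node, the weight initialization, and the names build_discriminator and discriminate are hypothetical; the discard rates and the training loop are omitted.

```python
import math
import random
from typing import List, Tuple

RESET_SIZE, H2, H1 = 15, 300, 150
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small Gaussian weights and zero biases (initialization is an assumption)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))


def build_discriminator(seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    return [init_layer(RESET_SIZE, H2, rng),
            init_layer(H2, H1, rng),
            init_layer(H1, 1, rng)]


def discriminate(layers: List[Layer], combined: List[float]) -> Tuple[float, float]:
    """Return (score for the real half, score for the generated half)."""
    if len(combined) != 2 * RESET_SIZE:
        raise ValueError("expected %d input values" % (2 * RESET_SIZE))
    halves = [combined[:RESET_SIZE], combined[RESET_SIZE:]]
    logits = []
    for x in halves:  # shared weights for both halves
        h = relu(dense(x, layers[0]))
        h = relu(dense(h, layers[1]))
        logits.append(dense(h, layers[2])[0])
    real_logit, generated_logit = logits  # split before the activation
    return sigmoid(real_logit), sigmoid(generated_logit)


p_real, p_generated = discriminate(build_discriminator(seed=3), [0.5] * 30)
```

During training, the two scores would feed the discrimination loss (real data scored high, generated data scored low); that step is left out because the source does not specify the loss function.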

S4: Generated counterfeit domain name set decoder

The domain name decoder (Decoder) decodes the domain name vectors generated by the GAN generation network into the corresponding character-level domain names; it restores the generated data to characters and serves as the output of the discrimination network of the generative adversarial network; it is essentially the mirror image of the encoder, and the inverse mapping of the domain name decoder is given by formula (3):

Q(d_i)＝Q'(d_i)*[max Q(d)-min Q(d)+1]+min Q(d) (3)

where i = 1, 2, ..., n; n = 15 is the domain name length; min Q(d) = 1 is the lower limit of the character conversion value; max Q(d) = 38 is the upper limit of the character conversion value; and Q'(d_i) is an element of the domain name vector generated by the generation network; formula (3) maps the domain name vector generated by the generation network, (Q'(d_1), Q'(d_2), ..., Q'(d_n)), to the domain name numerical value vector (Q(d_1), Q(d_2), ..., Q(d_n)); let the inverse of the character-to-value conversion function be:

g(x)＝Q^-1(x) (4)

after g(x) is applied, the domain name numerical value vector (Q(d_1), Q(d_2), ..., Q(d_n)) is mapped into the character vector (d_1, d_2, ..., d_n); combining the elements d_i in order forms the domain name string d = d_1 d_2 ... d_n, so that the domain name vector passed through the decoder is decoded into a readable domain name.
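As a non-limiting illustration, the decoding of S4 may be sketched in Python as follows. Formula (3) is applied to each element and the result is passed through the inverse conversion g(x) of formula (4). Because the generation network outputs real numbers, the sketch rounds each value to the nearest integer and clamps it to the range 1 to 38; the rounding, the clamping, the optional length parameter and the name decode are assumptions not stated in the original.

```python
import string
from typing import List, Optional

# Alphabet in the order given in the claim: '0'-'9', 'a'-'z', '-', '.'
ALPHABET = string.digits + string.ascii_lowercase + "-."
Q_MIN, Q_MAX = 1, len(ALPHABET)  # 1 and 38


def decode(vector: List[float], length: Optional[int] = None) -> str:
    """Decode a domain name vector into a character-level domain name.

    `length` is a hypothetical parameter: if given, only the first
    `length` elements are decoded and the padding positions are ignored.
    """
    scale = Q_MAX - Q_MIN + 1
    chars = []
    for q_norm in vector[:length]:
        q = q_norm * scale + Q_MIN               # formula (3)
        q = min(Q_MAX, max(Q_MIN, round(q)))     # assumption: round, then clamp
        chars.append(ALPHABET[q - 1])            # g(x) = Q^-1(x), formula (4)
    return "".join(chars)


# Vector for "paypa1" built with the inverse of formula (3), padded to 15.
example = [ALPHABET.index(c) / 38 for c in "paypa1"] + [0.0] * 9
print(decode(example, length=6))  # prints "paypa1"
```

Without a length, all 15 positions are decoded, and a padding value of 0 yields the character '0' under formula (3).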