CN112217787A - Method and system for generating mock domain name training data based on ED-GAN - Google Patents

Info

Publication number
CN112217787A
Authority
CN
China
Prior art keywords
domain name
gan
data
network
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010895375.4A
Other languages
Chinese (zh)
Other versions
CN112217787B (en)
Inventor
朱怡
宁振虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202010895375.4A
Publication of CN112217787A
Application granted
Publication of CN112217787B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Machine Translation (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method and a system for generating counterfeit (mock) domain name training data based on ED-GAN. First, a domain name encoder and a domain name decoder are designed. Second, the encoder and decoder are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names. Finally, the validity of the generated counterfeit domain name samples is checked by comparing the performance of multiple classifiers. To make full use of the GAN's ability to sample and learn directly from examples, the data are fed directly into the original GAN model for training without complex processing or transformation (such as convolution or pooling layers), which preserves the real characteristics of the data. The domain name encoder and decoder are simple and stay close to the original data, so the real characteristics of the data are preserved as far as possible.

Description

Method and system for generating mock domain name training data based on ED-GAN
Technical Field
The invention belongs to the fields of deep learning and information security, and in particular relates to a method and a system for generating counterfeit domain name training data based on ED-GAN. It is a counterfeit domain name protection technology.
Background
With the rapid development of Internet applications, the value carried by the Internet keeps growing, attacks against the domain name system have become more frequent, and they have a serious impact on network security. Among these, counterfeit domain name attacks have become one of the major threats to the safe operation of the Internet because of their low cost, wide reach, and varied ways of making a profit. An attacker registers a domain name similar to a legitimate one (such as facebook 0k.com or gooqle.com) and lures users into visiting it instead of the target website, in order to publish false advertisements, sell fake goods, or even steal user information and commit identity theft. Counterfeit domain name attacks are now a key concern of security communities and related organizations at home and abroad, and preventing them in advance is important for keeping the Internet running safely.
Academic research has examined counterfeit domain name detection from several angles, such as statistics, host behavior, and network behavior. This research shows that counterfeit domain names are in essence a complex social engineering problem that spans many fields and application scenarios, and that the core question is how to detect and judge counterfeit domain name attacks automatically. Current detection techniques for counterfeit domain names fall mainly into the following three categories:
1) Blacklist techniques based on manual judgment and quality evaluation. These techniques use a maintained blacklist to prevent users from accessing phishing domain name websites that have been discovered and are still live. The blacklist is built mainly through manual reporting and review, or through website quality ratings from a user community. For example, Cloudmark maintains its blacklist from website ratings submitted by a large number of users, and browsers such as IE and Firefox warn users about unsafe pages by updating, in real time, a blacklist of phishing pages reported by users. The results are reliable and accurate, but the approach is not timely, does not work for counterfeit domain name websites that are not yet on the blacklist, and consumes considerable resources on manual review.
2) Rule-based heuristic detection techniques. These techniques judge the authenticity of a website automatically from a set of characteristics of counterfeit domain name websites. For example, SpoofGuard, the earliest tool to analyze the heuristic characteristics of counterfeit domain name websites comprehensively, bases its judgment on phishing characteristics such as the host domain name, counterfeit images on the page, and counterfeit page links. In another approach, the identity of a website is defined by the text content of its pages, and search engine results are then consulted to judge its authenticity. Because counterfeit domain name websites also tend to imitate the target website visually, researchers have used the EMD algorithm to compute the visual similarity of two web pages and so decide whether phishing is taking place. These techniques can detect most unreported counterfeit domain name websites in real time, and accuracy is good because the rules are set with a high degree of manual intervention, but they lack robustness and still miss some cases.
3) Pattern classification techniques based on statistical machine learning. These methods extract domain name features algorithmically, build a classification model, and turn the detection of counterfeit domain name attacks into a binary classification problem: given an unknown domain name, decide whether it is a normal domain name or a counterfeit one. However, applying these techniques in a DNS big data environment is still difficult. The training data needed to build the detection model are hard to obtain, and most features are complex and hard to collect in time, so detection accuracy on massive data cannot be guaranteed.
In addition, these techniques share a common limitation: it is difficult to obtain enough up-to-date counterfeit domain name training samples in a timely and effective way, so the detection model is updated too slowly, and detection is neither effective nor fast enough.
Because of the massive volume of Internet data and the growing number of relevant feature dimensions, counterfeit domain name detection has gradually moved from early blacklist matching and rule-based heuristic detection to machine learning, and most researchers now use a classification model to identify and detect counterfeit domain names. However, existing machine learning detection algorithms have two problems. The first is the imbalance between attack data and normal data during model training: an imbalanced data set biases the detection model, so counterfeit domain names are not detected correctly. The second is the detection of unknown attack data: new attack data are not published on the network immediately and cannot be used for training, so the trained model cannot identify newly created counterfeit domain names and cannot be updated in real time.
On this basis, the invention provides a method and a system for generating counterfeit domain name training data based on ED-GAN. It uses a generative adversarial network (GAN) to learn the character features of counterfeit domain names directly, without clustering the domain names or extracting features in advance, and it can construct generated domain names similar to real counterfeit samples simply by encoding and decoding the domain names.
Disclosure of Invention
The main purpose of the invention is to provide an ED-GAN-based counterfeit domain name training data generation system comprising a real counterfeit domain name set encoding module, a GAN generation network construction module, a GAN discrimination network construction module, and a decoding module for the generated similar counterfeit domain name set. The outputs of the encoding module and of the GAN generation network construction module are both connected to the input of the GAN discrimination network construction module and are used to train and optimize the discrimination network parameters. The output of the GAN discrimination network construction module is connected to the input of the GAN generation network construction module and is used to keep optimizing the generation network parameters, so that new data are generated and fed to the discrimination network for identification. The output of the GAN generation network construction module is connected to the inputs of both the GAN discrimination network construction module and the decoding module, which completes the training of the GAN discrimination network and the restoration of character-level domain names.
For counterfeit domain name detection, the invention introduces the ED-GAN character-level domain name generation model to generate usable attack data, which addresses the problems of imbalanced data sets and hard-to-identify new attack samples. First, the characteristics of counterfeit domain names are analyzed: common counterfeit domain names are generally built from combinations of random letters and digits, and their lengths show a certain regularity. A domain name encoder and a domain name decoder are designed on this basis. Second, the domain name Encoder and Decoder are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names. Finally, the validity of the generated counterfeit domain name samples is checked by comparing the performance of multiple classifiers.
The core architecture of the character-level generation model is the GAN neural network, on top of which the encoder and decoder for domain name characters are designed. The method for generating counterfeit domain name training data based on ED-GAN comprises the following steps:
S1 Real counterfeit domain name set encoding
The main function of the domain name Encoder is to encode a character-level domain name into a corresponding domain name vector, which represents the domain name character data and serves as input to the discrimination network of the generative adversarial network. The domain name is first preprocessed: the top-level domain and any second-level or third-level domains are removed, and only the key part of the domain name is kept (for example, only baidu is extracted from www.baidu.com). The domain name characters are encoded as follows. Let the character-level domain name be d, and let the vector formed by its characters in order be $\vec{d}$, that is,
$\vec{d} = (d_1, d_2, \ldots, d_n)$
where n is the domain name length. The conversion function between characters and numerical values is:
$f(x) = Q(d_i)$ (1)
where $d_i$ ($i = 1, 2, \ldots, n$) is a domain name character, n is the domain name length, and $Q(d_i)$ is the character value. Only 38 characters in total are converted, namely '0-9', 'a-z', '-' (hyphen) and '.' (dot), because only these 38 characters are allowed in a domain name string and case is not distinguished. The character value conversion function assigns values to the 38 characters in turn: Q('0') = 1, Q('1') = 2, ..., Q('a') = 11, .... The domain name character vector $\vec{d} = (d_1, d_2, \ldots, d_n)$ is thereby converted into the domain name character value vector $\vec{Q} = (Q(d_1), Q(d_2), \ldots, Q(d_n))$. To improve the learning efficiency of the GAN, the domain name character value vector $\vec{Q}$ is normalized and mapped to the interval [0, 1]. For $i = 1, 2, \ldots, n$, where n is the domain name length, the mapping is given by formula (2), written here as the inverse of formula (3):
$Q'(d_i) = \dfrac{Q(d_i) - \min Q(d)}{\max Q(d) - \min Q(d) + 1}$ (2)
where $Q'(d_i)$ is the normalized character value, $Q(d_i)$ is the character value, $\min Q(d) = 1$ is the lower limit of the character conversion value, and $\max Q(d) = 38$ is the upper limit of the character conversion value. The domain of the encoder mapping function is [1, 38] and its range lies in [0, 1]. After the mapping, the vector $\vec{Q}$ becomes $\vec{Q'} = (Q'(d_1), Q'(d_2), \ldots, Q'(d_n))$.
For example, for the domain name baidu, the character vector is $\vec{d} = (b, a, i, d, u)$ and the character value vector is $\vec{Q} = (12, 11, 19, 14, 31)$; encoding then yields the domain name vector $\vec{Q'}$. The domain name vectors of other domain names are obtained in the same way. Because domain names differ in length, the dimension of the domain name vector is set to 15 according to the characteristics of counterfeit domain names, and vectors of lower dimension are padded with zeros at the end, so that every domain name character vector becomes a 15-dimensional domain name vector after passing through the encoder. After encoding, the character-level domain name vectors serve as training data for the GAN built in S2 and S3, and are finally converted through TensorFlow into tensors for deep neural network computation.
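The encoding in S1 can be sketched in plain Python as follows. This is a minimal illustration and not the patent's implementation: the ordering of the hyphen and dot at values 37 and 38, the truncation of labels longer than 15 characters, and the normalization written as the inverse of formula (3) are all assumptions, and the names `ALPHABET`, `Q`, and `encode` are hypothetical.

```python
import string

# Assumed ordering: digits, letters, then '-' and '.', so Q('0') = 1 and Q('a') = 11.
ALPHABET = string.digits + string.ascii_lowercase + "-."
Q = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # character -> value in 1..38
Q_MIN, Q_MAX = 1, 38
VECTOR_DIM = 15  # fixed domain name vector dimension stated in the text


def encode(label):
    """Encode the key part of a domain name as a zero-padded vector in [0, 1]."""
    label = label.lower()[:VECTOR_DIM]  # truncating long labels is an assumption
    values = [Q[ch] for ch in label]    # raises KeyError for disallowed characters
    # Normalization assumed to be the inverse of the decoder's formula (3).
    scaled = [(v - Q_MIN) / (Q_MAX - Q_MIN + 1) for v in values]
    return scaled + [0.0] * (VECTOR_DIM - len(scaled))


vec = encode("baidu")
print([Q[ch] for ch in "baidu"])  # [12, 11, 19, 14, 31]
print(len(vec))                   # 15
```

Under these assumptions the padding value 0 coincides with the encoding of the character '0', so a real pipeline would need some way to tell padding from content.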
S2 GAN generation network construction
The generation network consists mainly of a four-layer neural network: an input layer, two hidden layers, and an output layer. The input data are drawn at random from a Gaussian distribution and have dimension n = 100. The input layer uses the ReLU activation function. Compared with activation functions that involve exponentiation, ReLU reduces the amount of computation over the whole process, and it also helps prevent vanishing gradients when training deep networks. The two hidden layers have h1 = 150 and h2 = 300 nodes, and they also use ReLU. The output layer has 15 nodes. Because the input to the decoder must consist of elements in the interval [0, 1], the sigmoid function is used as the activation function of the output layer, and only there, which keeps the computation of the generation network low. The data generated in S2 are passed both to the discrimination network of S3 and to the domain name decoder of S4.
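A forward pass through a network of this shape can be sketched without any deep learning framework. The layer sizes (100, 150, 300, 15) and the choice of ReLU and sigmoid come from the text; the weight standard deviation of 0.1, the fixed random seed, and all function names are assumptions made for illustration, and the sketch performs no training.

```python
import math
import random

LAYER_SIZES = [100, 150, 300, 15]  # noise input, two hidden layers, output


def init_layer(n_in, n_out, rng, std=0.1):
    """Return (weights, biases); the 0.1 standard deviation is an assumption."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Compute x @ W + b for a single input vector."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def generate(z, layers):
    """Forward pass: ReLU on the hidden layers, sigmoid on the output layer."""
    h = z
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(h, weights, biases))


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
z = [rng.gauss(0.0, 1.0) for _ in range(LAYER_SIZES[0])]  # Gaussian noise input
fake = generate(z, layers)  # 15 values in [0, 1], ready for the decoder
```

The sigmoid on the last layer is what guarantees that every output element falls in the interval the decoder expects.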
S3 GAN discriminative network construction
The discrimination network is similar to the generation network: it is also a four-layer neural network with an input layer, two hidden layers, and an output layer. The input layer has two data sources: real data, and generated data produced by the generation network. That is, the real data encoded in S1 and the output of S2 together form the input of S3. The counterfeit domain name size is set to reset_size = 15, so the input data dimension is 2 × reset_size. The two hidden layers have h2 = 300 and h1 = 150 nodes and use the ReLU activation function. The output layer uses the sigmoid activation function. Before the activation function is applied, the data are split into the first reset_size dimensions and the last reset_size dimensions, which are processed separately so that the training error of the network is minimized. The results for the real data and for the generated data are output separately, and a discard rate is applied to prevent the network from overfitting.
S4 Decoder for the generated counterfeit domain name set
The function of the domain name Decoder is to decode the domain name vectors produced by the GAN generation network into the corresponding character-level domain names: it restores the generated data to characters and produces the output of the generative adversarial network. In essence it is the mirror image of the encoder. The inverse mapping formula of the domain name decoder is formula (3):
$Q(d_i) = Q'(d_i) \cdot [\max Q(d) - \min Q(d) + 1] + \min Q(d)$ (3)
where $i = 1, 2, \ldots, n$; n = 15 is the domain name length; $\min Q(d) = 1$ is the lower limit of the character conversion value; $\max Q(d) = 38$ is the upper limit of the character conversion value; and $Q'(d_i)$ is an element of the domain name vector generated by the generation network. Decoding the generated domain name vector $\vec{Q'} = (Q'(d_1), Q'(d_2), \ldots, Q'(d_n))$ yields the domain name character value vector $\vec{Q} = (Q(d_1), Q(d_2), \ldots, Q(d_n))$.
Let the inverse of the character-to-value conversion function be:
$g(x) = Q^{-1}(x)$ (4)
After g(x) is applied, the domain name value vector $\vec{Q}$ is mapped to the domain name character vector $\vec{d} = (d_1, d_2, \ldots, d_n)$. Combining the elements $d_i$ of $\vec{d}$ in order forms the domain name string $d_1 d_2 \ldots d_n$, so a domain name vector passed through the decoder is decoded into a readable domain name.
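The decoding step can be sketched as below, applying formula (3) and then the inverse lookup of formula (4). The character ordering (hyphen at 37, dot at 38), the rounding of the real-valued result to the nearest integer, and the clamping to the range 1..38 are assumptions, since the text does not say how non-integer values are handled; the `length` parameter is hypothetical.

```python
import string

# Assumed ordering: digits, letters, then '-' and '.', giving values 1..38.
ALPHABET = string.digits + string.ascii_lowercase + "-."
Q_MIN, Q_MAX = 1, 38


def decode(vector, length=None):
    """Map a vector of values in [0, 1] back to a domain name string."""
    chars = []
    for q_norm in vector[:length]:
        value = q_norm * (Q_MAX - Q_MIN + 1) + Q_MIN   # formula (3)
        index = min(max(round(value), Q_MIN), Q_MAX)   # rounding and clamping are assumptions
        chars.append(ALPHABET[index - 1])              # formula (4): g(x) = Q^{-1}(x)
    return "".join(chars)


encoded = [(v - 1) / 38 for v in (12, 11, 19, 14, 31)]
print(decode(encoded))  # baidu
```

Note that a zero element decodes to the character '0' under these assumptions, so trailing padding has to be removed by some other rule, for which the optional `length` argument is only a placeholder.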
Compared with the prior art, the invention has the following technical advantages: 1) the GAN is trained on counterfeit domain names to generate data, so both the training data and the generated data are more targeted; 2) to make full use of the GAN's ability to sample and learn directly from examples, the data are fed directly into the original GAN model for training without complex processing or transformation (such as convolution or pooling layers), which preserves the real characteristics of the data; 3) the domain name encoder and decoder are simple and stay close to the original data, so the real characteristics of the data are preserved as far as possible.
Drawings
FIG. 1 is a diagram of the ED-GAN-based counterfeit domain name generation model.
FIG. 2 compares real samples with samples generated after different numbers of adversarial rounds.
FIG. 3 is a diagram of the basic framework for counterfeit domain name detection.
Detailed Description
The invention is explained below with reference to the accompanying drawings:
To make the objects, technical solutions, and features of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings. The basic framework of the ED-GAN-based counterfeit domain name training data generation of the invention is shown in FIG. 1. The individual modules are described below:
(I) Real counterfeit domain name set encoding module
The real counterfeit domain name data set is obtained by web crawling. The encoder encodes each character-level counterfeit domain name into a domain name vector, which re-represents the domain name character data and serves as input to the discrimination network of the generative adversarial network. The encoder converts domain name characters to numerical values and then normalizes them. The length of the domain name character vectors is set uniformly as required, so that the encoder converts all character-level domain names into domain name vectors of equal length for input to the GAN discrimination network.
(II) GAN generation network construction module
A generative adversarial network (GAN) consists mainly of two parts: the generation network G (generator network) and the discrimination network D (discriminator network). The generation network G takes random samples from a latent space as input, and its output must imitate the real samples in the training set, that is, the real counterfeit domain names, as closely as possible. As shown in FIG. 1, the input z of the generation network G obeys a prior probability distribution $P_z(z)$; G generates data G(z) and supplies it to the discrimination network D, where G(z) is the generated counterfeit domain name.
(III) GAN discriminative network construction module
The input of the discrimination network D consists of two parts: real samples drawn from $P_{data}(x)$, that is, real counterfeit domain names, and the generator output G(z), that is, the generated similar counterfeit domain names. D must distinguish as well as possible whether its current input is real data or data G(z) produced by the generator. During model training, the discrimination network D keeps improving its ability to discriminate, while the generation network G keeps improving its ability to disguise its output. The two networks thus form a dynamic adversarial process and are optimized continually over the iterations. When D can finally no longer distinguish real data from generated data, that is, when D takes the generated data G(z) for real data, the model is considered optimal: G is considered to have fully learned the distribution of the real data, and the generated data can be used as counterfeit domain names.
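Assuming the original GAN formulation is used unchanged, as the reference to the "GAN original model" suggests, the adversarial process described above corresponds to the standard minimax objective below. Here $P_{data}(x)$ is the distribution of real counterfeit domain name vectors and $P_z(z)$ is the prior over the noise input; the value function name $V$ is the conventional one and does not appear in the source.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim P_{data}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim P_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

In this formulation, the state in which D can no longer separate real from generated data corresponds to $D(x) = 1/2$ for every input.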
(IV) Decoding module for the generated similar counterfeit domain name set
The similar counterfeit domain names produced by the GAN generation network are in the form of domain name vectors, so the decoder must reverse the encoder's operation to make them readable. The decoder restores the generated data to characters: its input is a domain name vector and its output is an ordinary character-level domain name.
Test environment:
The invention verifies the practical effect of the ED-GAN-based counterfeit domain name training data generation method through experiments. The test environment consists of the Ubuntu 16.04 operating system, 8 GB of memory, a 1 TB hard disk, an Intel Core i5-3210 2.5 GHz CPU, the TensorFlow deep learning framework, and the WEKA machine learning platform. The experimental data consist of the highest-ranked domain names in Alexa, used as legitimate target domain names, 100 real counterfeit domain names collected from the network, and the similar counterfeit domain names generated by the ED-GAN counterfeit domain name generation model.
The invention combines a GAN with the designed encoder and decoder to generate counterfeit domain name training data. The experiment is designed as follows:
1) The million-scale counterfeit domain name data set is preprocessed: each domain name is split with Python's split function, the key part of the domain name is kept, and the top-level domain and any second-level or third-level domains are removed;
2) After the million-scale counterfeit domain names have been preprocessed, the domain name characters are encoded in advance in order to shorten training time and reduce memory consumption and time complexity during GAN training. The encoder converts them into domain name vectors, which are then converted into input tensors for the GAN neural network through TensorFlow's standard data reading format;
3) Similar counterfeit domain names are generated. The processed million-scale domain names are fed into the domain name character generation model, which is trained to generate similar counterfeit domain name samples. In each epoch of network training (1 epoch is one pass over all samples in the training set), the generation network outputs generated data after each training step, producing batch_size (batch size) entries each time.
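The preprocessing in step 1) can be sketched with Python's `str.split`. The function name `key_part` and its `suffix_labels` parameter are hypothetical; the source only says that the top-level domain and any further suffix levels are removed, so how the suffix length is determined is left to the caller here.

```python
def key_part(domain, suffix_labels=1):
    """Return the key label of a domain, dropping 'www' and the trailing suffix.

    suffix_labels is a hypothetical parameter: the number of trailing labels
    (top-level domain plus any second-level suffix such as 'com.cn') to remove.
    """
    labels = domain.lower().strip(".").split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Keep everything before the suffix, then take the label closest to it.
    core = labels[:-suffix_labels] if len(labels) > suffix_labels else labels
    return core[-1]


print(key_part("www.baidu.com"))           # baidu
print(key_part("mail.example.com.cn", 2))  # example
```

A production pipeline would more likely consult a public suffix list to decide how many trailing labels to drop, rather than passing the count by hand.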
Parameter configuration of the generative adversarial network:
The parameter configuration of the generative adversarial network comprises the parameter configuration of the generation network and that of the discrimination network, shown in Table 1 and Table 2 respectively. For the generation network, the adversarial training algorithm trains randomly generated Gaussian data into similar data that has the characteristics of real data. The weights of the G network and the D network are initialized with the truncated Gaussian (normal) distribution tf.truncated_normal() provided by TensorFlow, and the biases are initialized to zero with tf.zeros().
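The initialization scheme can be illustrated in plain Python. The sketch mimics the documented behavior of TensorFlow's truncated normal initializer, which re-draws values lying more than two standard deviations from the mean, without calling TensorFlow itself; the standard deviation of 0.1, the seed, and the helper names are assumptions, since the actual values are in Tables 1 and 2.

```python
import random


def truncated_normal(rng, mean=0.0, stddev=1.0):
    """Draw from a normal distribution, re-drawing values beyond two standard deviations."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


def init_layer(n_in, n_out, rng, stddev=0.1):
    """Weights from a truncated normal, biases set to zero; stddev=0.1 is an assumption."""
    weights = [[truncated_normal(rng, 0.0, stddev) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(42)
weights, biases = init_layer(100, 150, rng)  # first layer of the generation network
```

Truncation keeps every initial weight within a bounded range, which avoids the occasional very large starting weight that an untruncated Gaussian would produce.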
Table 1 Parameter configuration of the generation network
Figure BDA0002658272000000081
Table 2 Parameter configuration of the discrimination network
Figure BDA0002658272000000082
Figure BDA0002658272000000091
In the counterstudy process of the generation network and the judgment network, the characteristics of the generated data are continuously close to those of real data along with the increase of the number of counterturns. In order to embody the learning characteristic of the generated network, the invention tracks and outputs the data generated by different learning countermeasure rounds of the character-level domain name generation model in the learning process, and selects and compares the real data of partial counterfeit domain names, as shown in fig. 2. The data in the upper left ellipse is a real counterfeit domain name sample, the data in the upper right ellipse is a sample generated by 0-10 counterattack rounds, and the data in the lower ellipse is a sample generated by 250 future counterattack rounds. It can be seen that the data of the parts 0-10 of the confrontation rounds are the data generated by the GAN during the initial confrontation training, and the data generated at this time is very different from the real data, and most of the data cannot be used as the characters of the domain name. As the number of the fight rounds is increased and the characteristics of the fight rounds are continuously close to those of real data, after the GAN learns about 250 fight rounds, most of the generated data are similar to the real data, and most of the generated data can be used as the counterfeit domain name.
To further verify that the data generated by the ED-GAN-based counterfeit domain name training data generation model has the characteristics of real domain name data, the invention performs counterfeit domain name detection with four classifiers (J48, naive Bayes, random tree, and random forest) to verify the validity of the generated data. The basic framework of counterfeit domain name detection is shown in Fig. 3.
Three groups of comparison experiments are designed. The first classifies a positive sample set of 5000 legitimate Alexa target domain names against a negative sample set of 5000 real counterfeit domain names. The second classifies a positive sample set of 5000 legitimate Alexa target domain names against a negative sample set of 5000 similar counterfeit domain names produced by the generation model. The third classifies a positive sample set of 10000 legitimate Alexa target domain names against a negative sample set of 10000 domain names that combines the 5000 real counterfeit domain names with the 5000 similar counterfeit domain names produced by the generation model. The results of the three groups of experiments are shown in Table 3:
Table 3 Sample classification results of the three groups of experiments
Figure BDA0002658272000000092
The results for the Alexa samples and real counterfeit domain names in the first group of Table 3 show that naive Bayes and random forest classify better than the other two classifiers. The detection results of the first group serve as reference values for the other two comparisons: Alexa samples versus similar counterfeit domain names, and Alexa samples versus the combined set of real and similar counterfeit domain names. Comparing the classification results of the second and third groups in Table 3 with the first-group reference values shows that both the F value and the ROC area remain at the same level as the reference. This indicates that the generated similar counterfeit domain name samples have the characteristics of real counterfeit domain name samples and can be used as real counterfeit domain name data samples, which verifies the validity of the generated data.
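For readers who want to reproduce the two indicators compared above, the following sketch computes the F value and the ROC area from classifier outputs in plain Python. The function names and the toy labels are illustrative and not taken from the patent; label 1 marks the class of interest and label 0 the other class, and the ROC area is computed as the fraction of (positive, negative) pairs that the classifier ranks correctly.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the class labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_area(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with four samples (illustrative values only).
labels = [1, 1, 0, 0]
f_value = f_measure(labels, [1, 0, 0, 0])
area = roc_area(labels, [0.9, 0.4, 0.5, 0.1])
```

Running the same two functions on each of the three experiment groups and comparing against the first group is one way to repeat the check described in the text.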
In conclusion, by combining a domain name encoder and a domain name decoder with a GAN, the ED-GAN-based counterfeit domain name training data generation method learns the characteristics of real counterfeit domain names and generates similar counterfeit domain name data.

Claims (4)

1. An ED-GAN-based counterfeit domain name training data generation system, characterized in that: the system comprises a real counterfeit domain name set encoding module, a GAN generation network construction module, a GAN discrimination network construction module, and a similar counterfeit domain name set generation and decoding module; the output of the real counterfeit domain name set encoding module and the output of the GAN generation network construction module are both connected to the input of the GAN discrimination network construction module and are used for the optimization training of the GAN discrimination network parameters; the output of the GAN discrimination network construction module is connected to the input of the GAN generation network construction module and is used to continuously optimize the parameters of the GAN generation network, which generates new data that is input to the GAN discrimination network for identification; the output of the GAN generation network construction module is connected to the input of the GAN discrimination network construction module and to the input of the similar counterfeit domain name set generation and decoding module, respectively, thereby completing the training of the GAN discrimination network and the restoration of character-level domain names.
2. The ED-GAN-based counterfeit domain name training data generation system according to claim 1, characterized in that: an ED-GAN character-level domain name generation model is introduced to generate usable attack data; an encoder and a decoder for domain names are designed; the encoder and decoder are combined with a GAN neural network to design a character-level domain name generative adversarial network model that generates similar counterfeit domain name samples, enabling the prediction and detection of counterfeit domain names; and finally, the validity of the generated counterfeit domain name sample data is checked by comparing the performance parameters of multiple classifiers.
3. The ED-GAN-based counterfeit domain name training data generation system according to claim 2, characterized in that: the core architecture of the character-level generation model is the GAN neural network, and the encoder and decoder for domain name characters are designed on top of the GAN architecture.
4. A method for generating counterfeit domain name training data based on ED-GAN, characterized by comprising the following steps:
S1, real counterfeit domain name set encoding
The domain name encoder encodes a character-level domain name into the corresponding domain name vector, which represents the domain name character data and serves as input to the discrimination network of the generative adversarial network. The domain name is first preprocessed: the top-level domain and any second-level and third-level domains are removed, and only the key part of the domain name is extracted. Domain name characters are encoded as follows: let the character-level domain name be d, and let the vector formed by splitting the character-level domain name into its characters in sequence be
Figure FDA0002658271990000011
that is,
Figure FDA0002658271990000012
where n is the domain name length. The conversion function between characters and numerical values is:
$f(x) = Q(d_i)$ (1)
where $d_i$ ($i = 1, 2, \ldots, n$) is a domain name character, n is the domain name length, and
Figure FDA0002658271990000013
is the numerical value of the character. Only 38 characters are considered for numerical conversion, namely '0'-'9', 'a'-'z', the hyphen, and the dot, because only these 38 characters are allowed in a domain name string and case is not distinguished. Applying the character value conversion function gives the values of the 38 characters in order: Q('0') = 1, Q('1') = 2, ..., Q('a') = 11, ..., Q('z') = 36, Q('-') = 37, Q('.') = 38. The domain name character vector
Figure FDA0002658271990000021
is converted into a domain name character numerical value vector of the form
Figure FDA0002658271990000022
To improve the learning efficiency of the GAN, data normalization is applied so that the domain name character numerical value vector
Figure FDA0002658271990000023
is mapped to the interval [0, 1]. For i = 1, 2, ..., n, where n is the domain name length, the mapping is given by formula (2):
Figure FDA0002658271990000024
where
Figure FDA0002658271990000025
is the normalized character value,
Figure FDA0002658271990000026
is the character value, min Q(d) = 1 is the lower limit of the character conversion value, and max Q(d) = 38 is the upper limit of the character conversion value. The domain of the encoder mapping function is [1, 38] and its range is [0, 1]. After mapping, the domain name vector
Figure FDA0002658271990000027
is mapped into
Figure FDA0002658271990000028
Because domain names differ in character length, the dimension of the domain name vector is set to 15 according to the characteristics of counterfeit domain names, and vectors with fewer dimensions are padded with trailing zeros, so that every domain name character vector is converted by the encoder into a domain name vector of dimension 15. After encoding, the character-level domain name vector becomes training data for the GAN, is used as input to the GAN generation network of S2, and is finally converted by TensorFlow into a tensor for deep neural network computation;
S2, GAN generation network construction
The generation network learns the probability distribution model of the real data under the guidance of the discrimination network. Its structure is a four-layer neural network comprising an input layer, two hidden layers, and an output layer. The input layer data is n = 100-dimensional data generated randomly from a Gaussian distribution model, and the activation function of the input layer is the ReLU function. ReLU is chosen with system performance in mind: computing an activation function otherwise involves exponential operations, whereas ReLU reduces the computation of the whole process and also prevents the vanishing gradient problem during deep network training. The two hidden layers have h1 = 150 and h2 = 300 nodes, and their activation functions are also ReLU. The output layer has 15 nodes; because the input data of the decoder consists of elements in the interval [0, 1], the sigmoid function is used only in the activation function of the output layer, which reduces the computation of the generation network. The data generated in S2 is passed to the discrimination network of S3 and to the domain name decoder of S4, respectively;
S3, GAN discrimination network construction
The discrimination network is similar to the generation network and is also a four-layer neural network comprising an input layer, two hidden layers, and an output layer. The input layer has two data sources: one part comes from the real data and the other from the data produced by the generation network, that is, the output of S1 for the real data and the output of S2 together form the input of S3. The length of the counterfeit domain name is set to reset_size = 15, so the dimension of the input data is 2 × reset_size. The two hidden layers have h2 = 300 and h1 = 150 nodes, and their activation function is the ReLU function. The activation function of the output layer is the sigmoid function; before the activation function is applied, the data is split into the first reset_size dimensions and the last reset_size dimensions, which are processed separately. The network is trained so that its error is minimal, the discard rates of the real data and the generated data are output respectively, and overfitting of the network is prevented;
S4, similar counterfeit domain name set generation and decoding
The domain name decoder (Decoder) decodes the domain name vectors produced by the GAN generation network into the corresponding character-level domain names; it restores the generated data to characters and is applied to the output of the discrimination network of the generative adversarial network. In essence it is the mirror image of the encoder. The inverse mapping formula of the domain name decoder is formula (3):
$Q(d_i) = Q'(d_i) \cdot [\max Q(d) - \min Q(d) + 1] + \min Q(d)$ (3)
where i = 1, 2, ..., n; n = 15 is the domain name length; min Q(d) = 1 is the lower limit of the character conversion value; max Q(d) = 38 is the upper limit of the character conversion value; and $Q'(d_i)$ is an element of the domain name vector produced by the generation network. The domain name vector produced by the generation network,
Figure FDA0002658271990000031
yields, after decoding, the domain name character numerical value vector
Figure FDA0002658271990000032
Let the inverse of the character-to-value conversion function be:
$g(x) = Q^{-1}(x)$ (4)
After g(x) is applied, the domain name numerical value vector
Figure FDA0002658271990000033
is mapped into
Figure FDA0002658271990000034
The elements $d_i$ of
Figure FDA0002658271990000035
are combined in sequence to form the domain name string $d_1, d_2, \ldots, d_n$, so that the domain name vector passed through the decoder is decoded into a readable domain name.
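The encoder of S1 and the decoder of S4 can be illustrated with a short round-trip sketch. Several details are assumptions rather than the patent's stated method: formula (2) is available only as an image, so the min-max form used in encode is inferred from the stated domain [1, 38] and range [0, 1]; decode applies formula (3) and then truncates to an integer and clamps to [1, 38], which the source does not spell out; the hyphen and dot are assigned the values 37 and 38 in the order they are listed; and the function names, the lowercase conversion, and the truncation of long labels to 15 characters are illustrative.

```python
from typing import List, Optional

# Character table: index + 1 is the value Q(ch), so Q('0') = 1 ... Q('.') = 38.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz-."
Q_MIN, Q_MAX, DIM = 1, 38, 15


def encode(label: str) -> List[float]:
    """Map a domain label to a 15-dimensional vector with elements in [0, 1]."""
    label = label.lower()[:DIM]  # assumed handling of case and long labels
    values = [ALPHABET.index(ch) + 1 for ch in label]  # formula (1)
    # Assumed min-max form of formula (2).
    normalized = [(q - Q_MIN) / (Q_MAX - Q_MIN) for q in values]
    return normalized + [0.0] * (DIM - len(normalized))  # zero padding


def decode(vector: List[float], length: Optional[int] = None) -> str:
    """Map a vector with elements in [0, 1] back to a character string."""
    chars = []
    for x in vector[:length]:
        q = int(x * (Q_MAX - Q_MIN + 1)) + Q_MIN  # formula (3), truncated
        q = max(Q_MIN, min(Q_MAX, q))             # assumed clamp to [1, 38]
        chars.append(ALPHABET[q - 1])             # formula (4)
    return "".join(chars)


# Round trip for a hypothetical counterfeit label.
vector = encode("paypa1-login")
restored = decode(vector, length=len("paypa1-login"))
```

Because zero padding and the character '0' both encode to 0.0 in this sketch, decode takes the original length as an argument; without it, padding positions decode to '0'. How the patent separates padding from content is not stated in the text.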
CN202010895375.4A 2020-08-31 2020-08-31 Method and system for generating mock domain name training data based on ED-GAN Active CN112217787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010895375.4A CN112217787B (en) 2020-08-31 2020-08-31 Method and system for generating mock domain name training data based on ED-GAN

Publications (2)

Publication Number Publication Date
CN112217787A true CN112217787A (en) 2021-01-12
CN112217787B CN112217787B (en) 2022-11-04

Family

ID=74059263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010895375.4A Active CN112217787B (en) 2020-08-31 2020-08-31 Method and system for generating mock domain name training data based on ED-GAN

Country Status (1)

Country Link
CN (1) CN112217787B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019096099A1 (en) * 2017-11-15 2019-05-23 瀚思安信(北京)软件技术有限公司 Real-time detection method and apparatus for dga domain name
CN110781876A (en) * 2019-10-15 2020-02-11 北京工业大学 Visual feature-based counterfeit domain name lightweight detection method and system
CN110830490A (en) * 2019-11-14 2020-02-21 苏州大学 Malicious domain name detection method and system based on area confrontation training deep network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
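Zhu Yi et al.: "Lightweight detection technology for counterfeit domain names based on visual features", Journal of Computer Applications *
Bai Lingling et al.: "Application of hidden Markov models in malicious domain name detection", Computer Engineering *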

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190846A (en) * 2021-02-02 2021-07-30 广东工业大学 Malicious domain name training data generation method based on generation countermeasure network model
CN114978558A (en) * 2021-02-20 2022-08-30 中国电信股份有限公司 Domain name recognition method and device, computer device and storage medium
CN113420492A (en) * 2021-04-30 2021-09-21 华北电力大学 Modeling method for frequency response model of wind-solar-fire coupling system based on GAN and GRU neural network
CN114118640A (en) * 2022-01-29 2022-03-01 中国长江三峡集团有限公司 Long-term precipitation prediction model construction method, long-term precipitation prediction method and device
CN115277211A (en) * 2022-07-29 2022-11-01 哈尔滨工业大学(威海) Multi-mode pornography and gambling domain name automatic detection method based on text and images
CN115277211B (en) * 2022-07-29 2023-07-28 哈尔滨工业大学(威海) Text and image-based multi-mode pornography and gambling domain name automatic detection method
CN117579397A (en) * 2024-01-16 2024-02-20 杭州海康威视数字技术股份有限公司 Internet of things privacy leakage detection method and device based on small sample ensemble learning
CN117579397B (en) * 2024-01-16 2024-03-26 杭州海康威视数字技术股份有限公司 Internet of things privacy leakage detection method and device based on small sample ensemble learning

Also Published As

Publication number Publication date
CN112217787B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN112217787B (en) Method and system for generating mock domain name training data based on ED-GAN
Asiri et al. A survey of intelligent detection designs of HTML URL phishing attacks
CN112287997A (en) Depth map convolution model defense method based on generative confrontation network
CN112231562A (en) Network rumor identification method and system
CN113961922A (en) Malicious software behavior detection and classification system based on deep learning
CN110830489B (en) Method and system for detecting counterattack type fraud website based on content abstract representation
CN113132410B (en) Method for detecting phishing website
CN112073551B (en) DGA domain name detection system based on character-level sliding window and depth residual error network
CN113032525A (en) False news detection method and device, electronic equipment and storage medium
CN113269228B (en) Method, device and system for training graph network classification model and electronic equipment
CN109977118A (en) A kind of abnormal domain name detection method of word-based embedded technology and LSTM
CN113420294A (en) Malicious code detection method based on multi-scale convolutional neural network
CN110958244A (en) Method and device for detecting counterfeit domain name based on deep learning
Feng et al. A phishing webpage detection method based on stacked autoencoder and correlation coefficients
CN116722992A (en) Fraud website identification method and device based on multi-mode fusion
CN113965377A (en) Attack behavior detection method and device
CN117729003A (en) Threat information credibility analysis system and method based on machine learning
Valiyaveedu et al. Survey and analysis on AI based phishing detection techniques
CN117633811A (en) Code vulnerability detection method based on multi-view feature fusion
Jesmithaa et al. Detecting phishing attacks using Convolutional Neural Network and LSTM
CN112860976B (en) Fraud website detection method based on multi-mode hierarchical attention mechanism
Zhao et al. D3-SACNN: DGA domain detection with self-Attention convolutional network
CN112950222A (en) Resource processing abnormity detection method and device, electronic equipment and storage medium
Dilhara Phishing URL detection: A novel hybrid approach using long short-term memory and gated recurrent units
Cao et al. Adversarial DGA domain examples generation and detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant