WO2019223587A9 - Domain name identification - Google Patents

Publication number
WO2019223587A9
WO2019223587A9 PCT/CN2019/087076 CN2019087076W WO2019223587A9 WO 2019223587 A9 WO2019223587 A9 WO 2019223587A9 CN 2019087076 W CN2019087076 W CN 2019087076W WO 2019223587 A9 WO2019223587 A9 WO 2019223587A9
Authority
WO
WIPO (PCT)
Prior art keywords
domain name
character vector
character
vector
input
Prior art date
Application number
PCT/CN2019/087076
Other languages
English (en)
French (fr)
Other versions
WO2019223587A1 (zh)
Inventor
顾成杰
Original Assignee
新华三信息安全技术有限公司
Priority date
Filing date
Publication date
Application filed by 新华三信息安全技术有限公司
Priority to US17/050,026 (US20210097399A1)
Priority to JP2021510515A (JP7069410B2)
Priority to EP19808429.5A (EP3799398A4)
Publication of WO2019223587A1
Publication of WO2019223587A9

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Definitions

  • When a terminal accesses a network, the terminal can obtain the Internet Protocol (IP) address of a target terminal through a Domain Name System (DNS) server. Then, based on the IP address of the target terminal, the terminal establishes a communication link with the target terminal and exchanges data with it.
  • After receiving a domain name resolution request sent by the terminal, the DNS server extracts characters from the domain name included in the request. By comparing the characters with a stored character feature database, the DNS server can determine the legality of the domain name.
  • FIG. 1 is a flowchart of a method for identifying a domain name according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a method for determining a sequence matrix according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a method for calculating a feature vector according to an embodiment of the present disclosure.
  • FIG. 4 is a logic structural diagram of an input-output gate according to an embodiment of the present disclosure.
  • FIG. 5 is a logic structural diagram of a feedback gate according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a training method for a domain name feature analysis model and a domain name classification model according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a device for identifying a domain name according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a network device according to an embodiment of the present disclosure.
  • An embodiment of the present disclosure provides a method for identifying a domain name.
  • The method can be applied to a network device having a domain name resolution function, for example a DNS server.
  • When a terminal (which can be called a source device) needs to send a message to a server (which can be called a destination device), the source device first obtains the domain name of the destination device. For example, when users want to visit a website, they first enter the domain name of the website.
  • The source device sends a domain name resolution request to the network device, and the domain name resolution request includes the domain name of the destination device.
  • The network device determines the IP address corresponding to the domain name according to the pre-stored correspondence between domain names and IP addresses. The network device then sends the IP address to the source device, so that the source device can send a packet to the destination device through the IP address.
  • The network device is configured with a domain name feature analysis model and a domain name classification model, wherein the domain name feature analysis model includes an input-output gate.
  • When the network device receives a domain name resolution request sent by a terminal, it uses the domain name feature analysis model and the domain name classification model to identify whether the domain name included in the request is a legitimate domain name.
  • If the domain name is an illegal domain name, the network device does not send a response message to the terminal, which prevents the source device from sending data packets to a malicious terminal. If the domain name is a legitimate domain name, the network device sends a response message to the terminal, and the response message carries the IP address corresponding to the domain name, so that the terminal can access the device corresponding to the IP address.
  • The method for identifying a domain name provided by the embodiments of the present disclosure can improve the accuracy of identifying a domain name, thereby improving the security of user data.
  • The method includes the following steps.
  • Step 101: The network device receives a domain name resolution request sent by the terminal.
  • The domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
  • The source device (i.e., the terminal) sends a domain name resolution request that carries the domain name of the destination device (that is, the domain name to be identified).
  • The network device parses the domain name resolution request to obtain the domain name to be identified.
  • The destination device is a network device with a domain name, and may be a host, a server, or a virtual machine.
  • Step 102: The network device determines a sequence matrix corresponding to the domain name.
  • The network device calculates the sequence matrix corresponding to the domain name.
  • The sequence matrix includes at least one character vector, and each character vector corresponds to one character of the domain name. The calculation method of the sequence matrix is described in detail later.
  • Step 103: The network device sequentially inputs each character vector in the at least one character vector to an input-output gate.
  • The domain name feature analysis model includes an input-output gate.
  • The input-output gate includes a logic operation rule between multiple activation functions.
  • The activation function can be a tanh activation function. After the network device determines the sequence matrix, each character vector contained in the sequence matrix is input to the input-output gate in turn.
  • The formula for the tanh activation function is: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$  (1)
  • The network device inputs each character vector in the sequence matrix to the input-output gate in order to calculate a feature vector corresponding to the sequence matrix.
  • Step 104: The network device performs logical operation processing on each character vector in the at least one character vector through the logic operation rule between the multiple activation functions to obtain a feature vector corresponding to the sequence matrix.
  • The network device performs logical operation processing on the character vector currently input to the input-output gate by using the logic operation rule between the multiple activation functions, to obtain the feature vector corresponding to the sequence matrix.
  • The logical operation processing includes arithmetic and logical operations. The specific process of performing logical operation processing on character vectors through the input-output gate is described in detail later.
  • Step 105: The network device inputs the feature vector corresponding to the sequence matrix into a domain name classification model, and determines whether the domain name is a legitimate domain name.
  • The domain name classification model may be a fully connected layer with a number of neurons.
  • The network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model, and the domain name classification model outputs a classification result corresponding to the feature vector.
  • The classification result represents the probability that the domain name is an illegal domain name.
  • The network device determines whether the domain name is a legitimate domain name according to the classification result.
  • For example, the classification result corresponding to a legitimate domain name is 0, the classification result corresponding to an illegal domain name is 1, and the preset threshold is 0.6. If the classification result corresponding to the domain name is 0.8 > 0.6, the network device determines that the domain name is an illegal domain name; if the classification result corresponding to the domain name is 0.2 < 0.6, the network device determines that the domain name is a legitimate domain name.
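The threshold comparison in the example above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `is_illegal` is hypothetical, and treating a score exactly equal to the threshold as legitimate is an assumption, since the text only covers the cases above and below 0.6.

```python
def is_illegal(score: float, threshold: float = 0.6) -> bool:
    """Return True when the classification result exceeds the preset threshold.

    `score` is the classifier output, read as the probability that the
    domain name is illegal. A score equal to the threshold is treated as
    legitimate here (an assumption; the source does not specify this case).
    """
    return score > threshold


if __name__ == "__main__":
    for score in (0.8, 0.2):
        verdict = "illegal" if is_illegal(score) else "legitimate"
        print(f"classification result {score}: {verdict}")
```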
  • The network device receives a domain name resolution request sent by the terminal.
  • The domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
  • The network device determines a sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector, and each character vector corresponds to one character of the domain name.
  • The network device sequentially inputs each character vector in the at least one character vector to the input-output gate.
  • The input-output gate includes logic operation rules between multiple activation functions.
  • The network device performs logical operation processing on each character vector through the logic operation rule between the multiple activation functions to obtain a feature vector corresponding to the sequence matrix.
  • The network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legitimate domain name.
  • A technician is not required to set up a character feature database, which improves the accuracy of identifying a domain name.
  • If the network device determines that the domain name is a legitimate domain name, it sends a response message to the terminal, where the response message carries the IP address corresponding to the domain name.
  • The network device determines the IP address corresponding to the domain name according to the pre-stored correspondence between domain names and IP addresses. The network device then sends the determined IP address to the terminal, so that the terminal can send a message to the destination device through the IP address. If the domain name is an illegal domain name, the network device either does not send a response message to the terminal or sends prompt information to the terminal.
  • The prompt information indicates that the domain name requested by the terminal for resolution is an illegal domain name.
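The response handling described above might look like the following sketch. All names here (`handle_request`, `records`, the `is_legitimate` callback) are hypothetical, the address uses a documentation-only IP range, and the classifier is a stand-in for the domain name feature analysis and classification models.

```python
from typing import Callable, Dict, Optional


def handle_request(domain: str,
                   is_legitimate: Callable[[str], bool],
                   records: Dict[str, str]) -> Optional[str]:
    """Return the IP address for a legitimate domain name, or None.

    None models "no response message is sent"; a real device could
    instead send prompt information to the terminal.
    """
    if not is_legitimate(domain):
        return None
    # Pre-stored correspondence between domain names and IP addresses.
    return records.get(domain)


if __name__ == "__main__":
    records = {"www.example.com": "192.0.2.10"}
    print(handle_request("www.example.com", lambda d: True, records))
    print(handle_request("qzxkvjw.example", lambda d: False, records))
```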
  • An embodiment of the present disclosure also provides a method for determining a sequence matrix corresponding to a domain name. As shown in FIG. 2, the method includes the following steps.
  • Step 201: The network device obtains the valid characters in the domain name.
  • The valid characters are the characters in the domain name other than the stored prefix characters and the stored suffix characters.
  • The network device stores prefix characters and suffix characters commonly used in domain names.
  • The prefix characters are network names, such as "www.", "ftp.", "smtp.", etc.
  • The suffix characters are top-level domain names, such as ".com", ".net", ".edu", ".gov", etc.
  • The network device recognizes the prefix and suffix characters contained in the domain name, and then extracts the characters other than the prefix and suffix characters.
  • The extracted characters are the valid characters. For example, if the domain name is www.google.com, the network device extracts the string other than "www." and ".com" to get google.
  • Step 202: According to the stored mapping rule between characters and index values, the network device determines an index value corresponding to each of the valid characters, and obtains a first index sequence corresponding to the valid characters.
  • The characters that may appear in a domain name are stored in the network device, and an index value is assigned to each character, thereby generating the character-to-index-value mapping rule.
  • For example, the characters that may appear in a domain name are stored in the network device, and each character is then numbered sequentially from 1.
  • The number corresponding to each character is the index value corresponding to that character.
  • For example, the characters that appear are a, b, c, and d.
  • The network device determines that the number corresponding to a is 1, the number corresponding to b is 2, the number corresponding to c is 3, and the number corresponding to d is 4.
  • The index value of a is 1, the index value of b is 2, the index value of c is 3, and the index value of d is 4.
  • Table 1 is an example of a mapping rule between characters and index values provided by an embodiment of the present disclosure.
  • After the network device obtains the valid characters, it determines the index value corresponding to each of the valid characters according to the stored mapping rule, and obtains the first index sequence corresponding to the valid characters.
  • For example, the domain name is www.google.com.
  • The valid characters are google.
  • The first index sequence is 1, 2, 2, 1, 5, 6.
  • Step 203: When the first index sequence does not reach the standard length, the network device pads the first index sequence to obtain a second index sequence.
  • The second index sequence has the standard length.
  • After the network device obtains the first index sequence corresponding to the valid characters, it determines whether the first index sequence reaches the standard length.
  • The standard length can be set by a technician according to experience, and the standard length is greater than the upper limit of the length of the first index sequence.
  • For example, the standard length can be set to 60 characters.
  • When the first index sequence does not reach the standard length, the network device pads it into a second index sequence that has the standard length. Representing every set of valid characters by an index sequence of the same standard length makes programmatic processing easier.
  • The network device may pad the first index sequence with a preset character.
  • For example, the preset character is 0.
  • The network device may add the preset characters before the first character of the first index sequence, or after the last character of the first index sequence.
  • For example, the second index sequence is 1, 2, 2, 1, 5, 6, 0, 0, ..., 0; that is, 54 zeros are added after the index value 6.
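Steps 201 to 203 can be sketched as below. Table 1 is not reproduced in this text, so the mapping here is an assumption: characters are numbered from 1 in the order a to z, 0 to 9, "-", ".", which follows the "a is 1, b is 2" illustration but yields different index values for google than the source's 1, 2, 2, 1, 5, 6. All function and constant names are hypothetical.

```python
import string

# Stored prefix and suffix characters, taken from the examples in the text.
PREFIXES = ("www.", "ftp.", "smtp.")
SUFFIXES = (".com", ".net", ".edu", ".gov")

# Assumed mapping rule: number each possible character sequentially from 1.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET, start=1)}

STANDARD_LENGTH = 60  # example value from the text
PAD = 0               # preset padding character from the text


def valid_characters(domain: str) -> str:
    """Step 201: drop one stored prefix and one stored suffix, if present."""
    name = domain.lower()
    for prefix in PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    return name


def first_index_sequence(valid: str) -> list:
    """Step 202: map each valid character to its index value."""
    return [CHAR_TO_INDEX[ch] for ch in valid]


def second_index_sequence(first: list, length: int = STANDARD_LENGTH) -> list:
    """Step 203: pad with the preset character after the last index value."""
    return first + [PAD] * (length - len(first))


if __name__ == "__main__":
    valid = valid_characters("www.google.com")
    first = first_index_sequence(valid)
    second = second_index_sequence(first)
    print(valid, first, len(second))
```

Padding after the last index value is one of the two placements the text allows; padding before the first value would work equally well as long as training and inference agree.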
  • Step 204: The network device calculates a character vector corresponding to each index value in the second index sequence.
  • An embedding layer neural network is stored in the network device.
  • The embedding layer neural network may be used to convert an arbitrary character into a character vector.
  • The network device may input the second index sequence to the embedding layer neural network to calculate the character vector corresponding to each index value in the second index sequence.
  • The calculated character vector may be a vector of 128 dimensions.
  • Step 205: The network device determines a sequence matrix by using the character vector corresponding to each index value in the second index sequence.
  • The network device calculates the character vector corresponding to each index value in the second index sequence, and then determines the sequence matrix based on these character vectors.
  • For example, the second index sequence is 1, 2, 2, 1, 5, 6, 0, 0, ..., 0.
  • The length of the second index sequence is 60 characters; that is, 54 zeros are added after the index value 6.
  • The network device inputs the second index sequence to the embedding layer neural network, which outputs the character vectors $a_1, a_2, a_3, \ldots, a_{60}$.
  • Each $a_i$ is a vector of 128 dimensions; that is, 60 character vectors of 128 dimensions are obtained.
  • The network device uses the 60 character vectors of 128 dimensions to determine a 60 × 128 sequence matrix.
  • In this way, the network device treats domain name identification as a natural language processing problem: it sets an index value for each character and then converts the index values into a vectorized representation, which is easier to process programmatically.
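Steps 204 and 205 amount to a table lookup: each index value selects one row of an embedding table, and the selected rows stacked in order form the sequence matrix. The sketch below assumes a randomly initialised table (a trained embedding layer would supply learned values) and a hypothetical vocabulary size of 40; only the 60 × 128 shape and the example index sequence come from the text.

```python
import random

VOCAB_SIZE = 40   # hypothetical: number of index values, including padding 0
EMBED_DIM = 128   # character vector dimension from the text

# Stand-in for a trained embedding layer: one 128-dimensional row per index.
_rng = random.Random(0)
EMBEDDING = [[_rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]


def sequence_matrix(index_sequence):
    """Look up the character vector for each index value, in order."""
    return [EMBEDDING[index] for index in index_sequence]


if __name__ == "__main__":
    # Second index sequence from the example: 1, 2, 2, 1, 5, 6 plus 54 zeros.
    second = [1, 2, 2, 1, 5, 6] + [0] * 54
    matrix = sequence_matrix(second)
    print(len(matrix), "x", len(matrix[0]))
```

Equal index values always map to the same character vector, so the two occurrences of index 2 in the example produce identical rows.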
  • An embodiment of the present disclosure also provides another implementation for determining a sequence matrix.
  • In this implementation, the network device may skip the extraction of valid characters.
  • The network device determines an index value corresponding to each character contained in the domain name, and obtains an index sequence corresponding to the domain name (for ease of distinction, it may be referred to as a third index sequence).
  • The network device determines whether the third index sequence reaches the standard length. When the third index sequence does not reach the standard length, the network device pads it into a fourth index sequence that has the standard length. Then, the network device calculates a character vector corresponding to each index value in the fourth index sequence and uses these character vectors to determine the sequence matrix.
  • In the method for determining a sequence matrix corresponding to a domain name provided by an embodiment of the present disclosure, the network device first extracts the characters that carry identifying meaning (the valid characters) from the domain name, and then determines the sequence matrix corresponding to the domain name according to the valid characters. It is not necessary to process all characters contained in the domain name, which improves the efficiency of determining the sequence matrix.
  • An embodiment of the present disclosure also provides a method for calculating a feature vector. As shown in FIG. 3, the method includes the following steps.
  • Step 301: The network device obtains a first character vector currently input to the input-output gate, an output value of a second character vector input to the input-output gate last time, and a feedback value of the second character vector.
  • The network device sequentially inputs each character vector included in the sequence matrix to the input-output gate.
  • For ease of description, the character vector currently input to the input-output gate is referred to as the first character vector, and the character vector that was input to the input-output gate last time is referred to as the second character vector.
  • The network device obtains the first character vector, the output value of the second character vector, and the feedback value of the second character vector.
  • When the first character vector is the first one input to the input-output gate, the output value and the feedback value of the character vector input last time are both zero.
  • The calculation method of the feedback value is described in detail later.
  • For example, the network device first inputs the character vector $a_1$ to the input-output gate to obtain the output value of $a_1$, and inputs $a_1$ to the feedback gate to obtain the feedback value of $a_1$. Then, the network device inputs the character vector $a_2$ to the input-output gate. The network device obtains $a_2$, the output value of $a_1$, and the feedback value of $a_1$ for subsequent operations.
  • Step 302: The network device performs a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector to obtain the output value of the first character vector.
  • FIG. 4 is a logic structure diagram of an input-output gate provided by an embodiment of the present disclosure.
  • Based on the input-output gate shown in FIG. 4, the network device performs the first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector to obtain the output value of the first character vector.
  • The specific calculation process includes the following steps.
  • Step 1: According to the first weight matrix, the network device performs a first weighting calculation on the first character vector and the feedback value of the second character vector to obtain a first weighted result.
  • Step 2: According to the second weight matrix, the network device performs a second weighting calculation on the first character vector and the feedback value of the second character vector to obtain a second weighted result.
  • The first weight matrix and the second weight matrix may be the same or different.
  • Step 3: The network device inputs the first weighted result and the first bias parameter into the first activation function to obtain a first operation result.
  • The corresponding calculation formula can be as follows: $i_t = \tanh(w_i \cdot [h_{t-1}, x_t] + b_i)$  (2)
  • where the first activation function is the tanh activation function,
  • $h_{t-1}$ is the feedback value of the character vector input to the input-output gate last time,
  • $x_t$ is the character vector currently input to the input-output gate,
  • $w_i$ is the first weight matrix,
  • $b_i$ is the first bias parameter, and
  • $i_t$ is the first operation result.
  • Step 4: The network device inputs the second weighted result and the second bias parameter into the second activation function to obtain a second operation result.
  • The corresponding calculation formula can be as follows: $\tilde{C}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c)$  (3)
  • where the second activation function is the tanh activation function,
  • $w_c$ is the second weight matrix,
  • $b_c$ is the second bias parameter, and $\tilde{C}_t$ is the second operation result.
  • Both $i_t$ and $\tilde{C}_t$ are determined from the feedback value of the second character vector and the first character vector. $i_t$ represents the final input data of this calculation; $\tilde{C}_t$ represents the data that needs to be retained in the calculated feedback value.
  • The first bias parameter and the second bias parameter may be the same or different.
  • Step 5: The network device multiplies the first operation result and the second operation result, and adds the product to the output value of the second character vector to obtain the output value corresponding to the first character vector.
  • The corresponding calculation formula can be as follows: $C_t = C_{t-1} + i_t * \tilde{C}_t$  (4)
  • where $C_t$ is the output value corresponding to the first character vector,
  • $C_{t-1}$ is the output value of the second character vector,
  • $i_t$ is the first operation result, and $\tilde{C}_t$ is the second operation result.
  • The network device stores the output value of the first character vector for subsequent logical operation processing.
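Steps 1 to 5 of the input-output gate can be sketched in plain Python as follows. The exact formulas are missing from this text, so two points are assumptions, flagged again in the code: the weighting calculation is taken as applying the same square weight matrix to the feedback value and to the character vector and summing the results, and the multiplication in Step 5 is taken elementwise. The function names and the 2-dimensional demo values are hypothetical; the text uses 128 dimensions.

```python
import math


def matvec(w, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def tanh_gate(w, b, h_prev, x):
    """Weighting calculation plus bias, passed through tanh.

    Assumption: the weighting is w*h_prev + w*x with one square matrix w.
    """
    wh = matvec(w, h_prev)
    wx = matvec(w, x)
    return [math.tanh(a + c + d) for a, c, d in zip(wh, wx, b)]


def input_output_gate(x, h_prev, c_prev, w_i, b_i, w_c, b_c):
    """Return the output value of the current character vector x."""
    i_t = tanh_gate(w_i, b_i, h_prev, x)      # first operation result
    c_tilde = tanh_gate(w_c, b_c, h_prev, x)  # second operation result
    # Assumption: the product of the two results is elementwise.
    return [c + i * ct for c, i, ct in zip(c_prev, i_t, c_tilde)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [0.0, 0.0]
    # First character vector: previous output and feedback values are zero.
    c_t = input_output_gate([1.0, 0.0], zeros, zeros,
                            identity, zeros, identity, zeros)
    print(c_t)
```

Compared with a standard LSTM cell, this gate has no forget term and uses tanh where an LSTM would use a sigmoid, which matches the text's claim that the recurrent computation is simplified.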
  • Step 303: The network device determines the feature vector corresponding to the sequence matrix by using the output value of the at least one character vector.
  • Based on the above processing, for any character vector, after the network device inputs the character vector to the input-output gate, it obtains the output value of that character vector. In this way, the network device can obtain the output value of each character vector contained in the sequence matrix. The network device determines the feature vector corresponding to the sequence matrix based on the output value of each character vector.
  • For example, the sequence matrix includes $a_1$, $a_2$, and $a_3$, where the output value of $a_1$ is x, the output value of $a_2$ is y, and the output value of $a_3$ is z; then the feature vector corresponding to the sequence matrix is (x, y, z).
  • The logical operation of the input-output gate in the existing recurrent neural network is simplified, the processing load of the network device is reduced, and the accuracy of domain name recognition is improved.
  • The domain name feature analysis model further includes a feedback gate. After the network device obtains the output value of the character vector currently input to the input-output gate, it also calculates the feedback value of that character vector.
  • The specific process may be: the network device performs a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the first character vector.
  • The network device uses the feedback value to calculate the output value of the character vector input to the input-output gate next time.
  • The network device inputs the character vector to the input-output gate of the domain name feature analysis model.
  • The network device also inputs the character vector into the feedback gate of the domain name feature analysis model. That is, the character vector currently input to the input-output gate is the same as the character vector currently input to the feedback gate. Similarly, the character vector input to the input-output gate last time is the same as the character vector input to the feedback gate last time.
  • FIG. 5 is a logic structure diagram of a feedback gate provided by an embodiment of the present disclosure.
  • Based on the feedback gate shown in FIG. 5, the network device performs the second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the character vector currently input to the feedback gate.
  • The specific calculation process includes the following steps.
  • Step 1: According to the third weight matrix, the network device performs a third weighting calculation on the first character vector and the feedback value of the second character vector to obtain a third weighted result.
  • Step 2: The network device inputs the third weighted result and the third bias parameter into the third activation function to obtain a third operation result.
  • The corresponding calculation formula can be as follows: $O_t = \tanh(w_o \cdot [h_{t-1}, x_t] + b_o)$  (5)
  • where the third activation function is the tanh activation function, $h_{t-1}$ is the feedback value of the second character vector, $x_t$ is the first character vector, $w_o$ is the third weight matrix, $b_o$ is the third bias parameter, and $O_t$ is the third operation result.
  • $O_t$ represents the data, selected from the feedback value of the second character vector and the first character vector, that is to be fed back to the next calculation, as well as the data to be memorized in the current output (i.e., $C_t$).
  • Step 3: The network device inputs the output value of the first character vector into a fourth activation function to obtain a fourth operation result.
  • Step 4: The network device multiplies the third operation result and the fourth operation result to obtain the feedback value of the first character vector.
  • The corresponding calculation formula can be as follows: $h_t = O_t * \tanh(C_t)$  (6)
  • where the fourth activation function is the tanh activation function,
  • $C_t$ is the output value of the first character vector,
  • $O_t$ is the third operation result, and
  • $h_t$ is the feedback value of the first character vector.
  • The network device stores the feedback value of the first character vector for subsequent logical operation processing.
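The four feedback gate steps can be sketched in the same style. As with the input-output gate, the formulas are not reproduced in this text, so the form of the weighting calculation (one square matrix applied to both the previous feedback value and the current character vector, then summed) and the elementwise product in Step 4 are assumptions. The names and the small demo dimension are hypothetical.

```python
import math


def matvec(w, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def feedback_gate(x, h_prev, c_t, w_o, b_o):
    """Return the feedback value of the current character vector x.

    x      : current (first) character vector
    h_prev : feedback value of the previous (second) character vector
    c_t    : output value of x from the input-output gate
    """
    # Steps 1-2: third weighting calculation plus bias, through tanh.
    # Assumption: the weighting is w_o*h_prev + w_o*x.
    wh = matvec(w_o, h_prev)
    wx = matvec(w_o, x)
    o_t = [math.tanh(a + c + d) for a, c, d in zip(wh, wx, b_o)]
    # Step 3: fourth activation function applied to the output value.
    squashed = [math.tanh(c) for c in c_t]
    # Step 4: product of the third and fourth results (elementwise, assumed).
    return [o * s for o, s in zip(o_t, squashed)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [0.0, 0.0]
    h_t = feedback_gate([1.0, 0.0], zeros, [0.5, 0.0], identity, zeros)
    print(h_t)
```

The returned value plays the role of the previous feedback value when the next character vector is processed, which is how information from earlier characters reaches later steps.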
  • the network device converts each character into a character vector of 128 dimensions through the embedding layer.
  • In formula (2), $w_i \cdot [h_{t-1}, x_t]$ is shorthand for $w_1 * h_{t-1} + w_2 * x_t$. Accordingly, $h_{t-1}$ and $x_t$ are both 128-dimensional vectors; $b_i$ is a 128-dimensional vector; $w_i$ is a 128 * 128 matrix, where $w_1 * h_{t-1}$ is 128 * 1 and $w_2 * x_t$ is also 128 * 1; adding $b_i$ finally outputs a 128-dimensional vector.
  • Similarly, in formula (3), $w_c$ is a 128 * 128 matrix and $b_c$ is a 128-dimensional vector; in formula (5), $w_o$ is a 128 * 128 matrix and $b_o$ is a 128-dimensional vector.
  • In formula (4), $C_t$ is a 128 * 128 matrix.
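The shorthand used in formula (2) can be checked numerically: applying one block matrix to the concatenation of the two vectors gives the same result as applying two matrices separately and adding. The sketch below uses 2-dimensional toy values instead of 128 dimensions; the values and the helper `matvec` are illustrative assumptions, not part of the source.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

w1 = [[1.0, 2.0], [3.0, 4.0]]   # acts on h_{t-1}
w2 = [[0.5, 0.0], [0.0, 0.5]]   # acts on x_t
h_prev = [1.0, -1.0]
x = [2.0, 4.0]

# Split form: w1 * h_{t-1} + w2 * x_t
split_form = [a + b for a, b in zip(matvec(w1, h_prev), matvec(w2, x))]

# Concatenated form: the block matrix [w1 | w2] applied to [h_{t-1}, x_t]
w_concat = [r1 + r2 for r1, r2 in zip(w1, w2)]
concat_form = matvec(w_concat, h_prev + x)
```

Both forms produce the same vector, so an implementation is free to store one wide matrix or two square ones.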
  • the logical operation of the feedback gate in the existing recurrent neural network is simplified, the processing amount of the network device is reduced, and the accuracy of domain name recognition is improved.
  • an embodiment of the present disclosure further provides a training method for a domain name feature analysis model and a domain name classification model.
  • the method may be performed by a network device, where the network device may be a network device with a data processing function. As shown in FIG. 6, it specifically includes the following steps:
  • Step 601 The network device obtains a stored training sample set.
  • the training sample set includes multiple positive samples and multiple negative samples.
  • Each positive sample is a sequence matrix corresponding to a legitimate domain name; each negative sample is a sequence matrix corresponding to an illegal domain name.
  • a technician inputs a plurality of legal domain names in advance in a network device.
  • the network device determines the sequence matrix corresponding to each legal domain name and obtains a positive sample.
  • the network device may also obtain an illegal domain name.
  • the illegal domain name may be obtained by the network device crawling from the network.
  • Alternatively, the illegal domain name may be generated by the network device using a domain name generation algorithm (DGA).
  • the network device determines the sequence matrix corresponding to each illegal domain name and obtains a negative sample.
  • the network device can obtain the training sample set.
  • For the specific process of determining the sequence matrix, refer to the related description of step 102; details are not repeated here.
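As a rough illustration of step 601, the sketch below assembles a labeled sample set from legal and illegal domain names. Several things here are assumptions made for the example: `toy_dga` is a hypothetical stand-in for a real domain generation algorithm, the labels follow a 0 = legal, 1 = illegal convention, and the samples are kept as raw domain strings, whereas the source stores the corresponding sequence matrices.

```python
import random
import string

def toy_dga(seed, count, length=12):
    """Hypothetical stand-in for a DGA: pseudo-random lowercase labels under .com."""
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(length)) + ".com"
            for _ in range(count)]

def build_training_set(legal_domains, illegal_domains):
    """Pair each domain with a label: 0 for legal, 1 for illegal."""
    return [(d, 0) for d in legal_domains] + [(d, 1) for d in illegal_domains]

samples = build_training_set(["www.example.com"], toy_dga(seed=1, count=3))
```

Seeding the generator keeps the illegal samples reproducible between training runs.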
  • Step 602 The network device sequentially inputs each character vector included in each sequence matrix into the first initial training model, and obtains an output value corresponding to each character vector.
  • the first initial training model may be a recurrent neural network, and the recurrent neural network includes an input-output gate and a feedback gate.
  • Step 603 The network device determines a feature vector corresponding to each sequence matrix by using an output value of each character vector in each sequence matrix.
  • Step 604: The network device inputs the feature vector corresponding to each sequence matrix to a second initial training model, and obtains a domain name recognition result corresponding to each sequence matrix.
  • the second initial training model is a fully connected layer.
  • For the processing of this step, refer to the related description of step 105; details are not repeated here.
  • Step 605: Using the back-propagation algorithm and the domain name recognition result corresponding to each sequence matrix, the network device adjusts the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter, and the third bias parameter included in the first initial training model to obtain the domain name feature analysis model.
  • The back-propagation algorithm may be the back-propagation through time (BPTT) algorithm.
  • Based on the domain name recognition result corresponding to each sequence matrix (i.e., each sample), the actual classification of the domain name corresponding to that sequence matrix (legal or illegal), and the BPTT algorithm, the network device adjusts the first weight matrix, the second weight matrix, the first bias parameter, and the second bias parameter of the input-output gate, and the third weight matrix and the third bias parameter of the feedback gate, to obtain the domain name feature analysis model.
  • Step 606: Using the back-propagation algorithm and the domain name recognition result corresponding to each sequence matrix, the network device adjusts the second initial training model to obtain the domain name classification model.
  • The fully connected layer includes a weight vector, which is a 128-dimensional vector.
  • Based on the domain name recognition result corresponding to each sequence matrix (i.e., each sample), the actual classification of the domain name corresponding to that sequence matrix (legal or illegal), and the back-propagation algorithm, the network device adjusts the values of the weight vector contained in the fully connected layer to obtain the domain name classification model.
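To make step 606 concrete, here is a minimal sketch of adjusting the weight vector of a single-neuron fully connected layer by gradient descent. The source does not give the loss function, output activation, or learning rate; the sigmoid output, cross-entropy gradient, bias term, and `lr=0.5` used here are all assumptions of this example, and the BPTT adjustment of the recurrent gates in step 605 is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, samples, lr=0.5):
    """One gradient-descent pass over (feature, label) pairs; label 1 = illegal."""
    for feature, label in samples:
        pred = sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)
        error = pred - label  # cross-entropy gradient w.r.t. the pre-activation
        weights = [w - lr * error * f for w, f in zip(weights, feature)]
        bias -= lr * error
    return weights, bias

# Toy 2-dimensional features instead of 128-dimensional feature vectors.
toy_samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
weights, bias = [0.0, 0.0], 0.0
for _ in range(200):
    weights, bias = train_step(weights, bias, toy_samples)
```

After training on this separable toy set, the layer scores the first feature high and the second low.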
  • FIG. 7 shows an apparatus for identifying a domain name according to an embodiment of the present disclosure.
  • the apparatus is applied to network equipment.
  • the network equipment has been configured with a domain name feature analysis model and a domain name classification model.
  • the domain name feature analysis model includes an input-output gate.
  • the device includes a receiving module 710, a first determining module 720, a first input module 730, a processing module 740, and a second determining module 750. The description of each module is shown below.
  • the receiving module 710 is configured to receive a domain name resolution request sent by a terminal, where the domain name resolution request includes a domain name to be identified, and the domain name includes at least one character;
  • a first determining module 720 configured to determine a sequence matrix corresponding to a domain name, the sequence matrix includes at least one character vector, and each character vector in the at least one character vector corresponds to each character of the at least one character;
  • a first input module 730 configured to sequentially input each character vector in the at least one character vector to the input-output gate, where the input-output gate includes a logic operation rule between multiple activation functions;
  • a processing module 740 configured to perform a logical operation process on each character vector in the at least one character vector by using a logic operation rule between the multiple activation functions to obtain a feature vector corresponding to the sequence matrix;
  • a second determining module 750 is configured to input a feature vector corresponding to the sequence matrix into a domain name classification model, and determine whether the domain name is a legal domain name.
  • the first determination module 720 may include a first acquisition submodule, a first determination submodule, a filling submodule, a first calculation submodule, and a second determination submodule.
  • a first obtaining sub-module configured to obtain valid characters from the domain name, the valid characters are composed of characters other than the stored prefix characters and stored suffix characters in the domain name;
  • a first determining submodule configured to determine an index value corresponding to each character in a valid character according to a stored character and index value mapping rule, and obtain a first index sequence corresponding to the valid character;
  • a filling sub-module configured to pad the first index sequence into a second index sequence when the first index sequence does not reach a standard length, the second index sequence having the standard length;
  • a first calculation submodule configured to calculate a character vector corresponding to each index value in the second index sequence
  • the second determining submodule is configured to determine a sequence matrix by using a character vector corresponding to each index value in the second index sequence.
  • the processing module 740 may include a second acquisition submodule, an operation submodule, and a third determination submodule.
  • a second acquisition submodule configured to acquire a first character vector currently input to the input-output gate, an output value of a second character vector last input to the input-output gate, and a feedback value of the second character vector last input to the input-output gate;
  • an operation submodule configured to perform a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector to obtain the output value of the first character vector;
  • the third determining submodule is configured to determine a feature vector corresponding to the sequence matrix by using the obtained output value of the at least one character vector.
  • the operation sub-module 742 may include a first calculation unit, a second calculation unit, a first input unit, a second input unit, and a third calculation unit.
  • a first calculation unit configured to perform a first weighting calculation on the first character vector and the feedback value of the second character vector according to a first weight matrix to obtain a first weighting result
  • a second calculation unit configured to perform a second weighting calculation on the first character vector and the feedback value of the second character vector according to a second weight matrix to obtain a second weighting result
  • a first input unit configured to input a first weighted result and a first bias parameter into a first activation function to obtain a first operation result
  • a second input unit configured to input a second weighted result and a second bias parameter into a second activation function to obtain a second operation result
  • a third calculation unit configured to multiply the first operation result by the second operation result, and add the product to the output value of the second character vector to obtain the output value corresponding to the first character vector.
  • the domain name feature analysis model further includes a feedback gate
  • the device further includes an operation module
  • The operation module is configured to perform a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the first character vector.
  • the operation module may include a second calculation sub-module, a first input sub-module, a second input sub-module, and a multiplication sub-module.
  • a second calculation submodule configured to perform a third weighting calculation on the first character vector and the feedback value of the second character vector according to a third weight matrix to obtain a third weighting result
  • a first input sub-module configured to input a third weighted result and a third bias parameter into a third activation function to obtain a third operation result
  • a second input submodule configured to input an output value of the first character vector into a fourth activation function to obtain a fourth operation result
  • the multiplication submodule is configured to multiply the third operation result and the fourth operation result to obtain a feedback value of the first character vector.
  • the device further includes an acquisition module, a second input module, a third determination module, a third input module, a first adjustment module, and a second adjustment module.
  • an acquisition module configured to acquire a stored training sample set, where the training sample set includes multiple positive samples and multiple negative samples, each positive sample is a sequence matrix corresponding to a legal domain name, and each negative sample is a sequence matrix corresponding to an illegal domain name;
  • a second input module configured to sequentially input each character vector included in each sequence matrix into the first initial training model to obtain an output value corresponding to each character vector
  • a third determining module configured to determine a feature vector corresponding to each sequence matrix by using an output value of each character vector in each sequence matrix
  • a third input module configured to input a feature vector corresponding to each sequence matrix to the second initial training model, and obtain a domain name recognition result corresponding to each sequence matrix
  • a first adjustment module configured to adjust, by using the back-propagation algorithm and the domain name recognition result corresponding to each sequence matrix, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter, and the third bias parameter included in the first initial training model to obtain the domain name feature analysis model;
  • a second adjustment module configured to adjust the second initial training model by using the back-propagation algorithm and the domain name recognition result corresponding to each sequence matrix to obtain the domain name classification model.
  • the apparatus for identifying a domain name provided by the embodiment of the present disclosure is applied to network equipment.
  • the network equipment has been configured with a domain name feature analysis model and a domain name classification model.
  • the domain name feature analysis model includes an input-output gate.
  • the network device receives a domain name resolution request sent by the terminal.
  • the domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
  • the network device determines a sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector, and each character vector in the at least one character vector corresponds to each character of the at least one character.
  • the network device inputs each character vector in the at least one character vector to the input-output gate in turn.
  • the input-output gate includes logic operation rules between multiple activation functions.
  • the network device performs logical operation processing on each character vector in at least one character vector through a logic operation rule between multiple activation functions to obtain a feature vector corresponding to the sequence matrix.
  • the network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legitimate domain name.
  • a technician is not required to set a character feature database, which improves the accuracy of identifying a domain name.
  • FIG. 8 is a structural block diagram of a network device according to an embodiment of the present disclosure.
  • the network device includes a processor 801, a transceiver 802, and a machine-readable storage medium 803 storing machine-executable instructions.
  • the network device has been configured with a domain name feature analysis model and a domain name classification model.
  • the domain name feature analysis model includes an input-output gate.
  • The domain name feature analysis model and the domain name classification model can be implemented by software function modules. It can be understood that these software function modules may be loaded into a memory (for example, flash) and invoked by the processor 801, or they may be provided inside the processor and accessed by the processor 801.
  • The transceiver 802 is configured to receive a domain name resolution request sent by a terminal and transmit the domain name resolution request to the processor 801.
  • The domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
  • By reading and executing the machine-executable instructions, the processor 801 is caused to:
  • determine a sequence matrix corresponding to the domain name, the sequence matrix including at least one character vector, each character vector in the at least one character vector corresponding to a respective character of the at least one character;
  • input each character vector in the at least one character vector into the input-output gate in sequence, the input-output gate including a logic operation rule between multiple activation functions;
  • perform logical operation processing on each character vector in the at least one character vector through the logic operation rule between the multiple activation functions to obtain a feature vector corresponding to the sequence matrix;
  • input the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legal domain name.
  • machine-executable instructions specifically cause the processor 801 to:
  • the sequence matrix is determined by a character vector corresponding to each index value in the second index sequence.
  • machine-executable instructions specifically cause the processor 801 to:
  • a feature vector corresponding to the sequence matrix is determined by using the obtained output value of the at least one character vector.
  • machine-executable instructions specifically cause the processor 801 to:
  • the domain name feature analysis model further includes a feedback gate
  • the machine-executable instructions further cause the processor 801 to: perform a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the first character vector;
  • the feedback value is used to calculate an output value of a character vector input to the input-output gate next time.
  • the machine-executable instructions specifically cause the processor 801 to: perform a third weighting calculation on the first character vector and the feedback value of the second character vector according to a third weight matrix to obtain a third weighting result;
  • machine-executable instructions further cause the processor 801 to:
  • each positive sample is a sequence matrix corresponding to a legal domain name
  • each negative sample is a sequence matrix corresponding to an illegal domain name
  • the second initial training model is adjusted by a back propagation algorithm using the domain name recognition result corresponding to each sequence matrix to obtain the domain name classification model.
  • machine-executable instructions further cause the processor 801 to:
  • send a response message to the terminal, the response message including the Internet Protocol (IP) address corresponding to the domain name.
  • the network device may further include a communication bus 804.
  • the communication bus 804 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the communication bus 804 can be divided into an address bus, a data bus, a control bus, and the like.
  • the machine-readable storage medium 803 may include a random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk memory.
  • the machine-readable storage medium 803 may also be at least one storage device located far from the foregoing processor.
  • the processor 801 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a domain name feature analysis model and a domain name classification model have been configured, and the domain name feature analysis model includes an input-output gate.
  • the network device receives a domain name resolution request sent by the terminal.
  • the domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
  • the network device determines a sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector, and each character vector in the at least one character vector corresponds to each character of the at least one character.
  • the network device inputs each character vector in the at least one character vector to the input-output gate in turn.
  • the input-output gate includes logic operation rules between multiple activation functions.
  • the network device performs logical operation processing on each character vector in at least one character vector through a logic operation rule between multiple activation functions to obtain a feature vector corresponding to the sequence matrix.
  • the network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legitimate domain name.
  • a technician is not required to set a character feature database, which improves the accuracy of identifying a domain name.
  • Relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.
  • The terms "including", "comprising", or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restriction, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

Abstract

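A domain name resolution request sent by a terminal is received; the request includes a domain name to be identified, and the domain name includes at least one character. A sequence matrix corresponding to the domain name is determined; the sequence matrix includes at least one character vector, and each character vector corresponds one-to-one to a character of the at least one character. Each character vector is input in sequence to an input-output gate, which includes logical operation rules between multiple activation functions. Through these rules, logical operation processing is performed on each character vector to obtain a feature vector corresponding to the sequence matrix. The feature vector is input into a domain name classification model to determine whether the domain name is a legal domain name.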

Description

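Domain Name Identification
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 201810489709.0, entitled "Method and Apparatus for Identifying a Domain Name" and filed with the Chinese Patent Office on May 21, 2018, the entire contents of which are incorporated herein by reference.
Background
At present, when a terminal accesses a network, it can obtain the Internet Protocol (IP) address of a target terminal through a Domain Name System (DNS) server. Using that IP address, the terminal establishes a communication link with the target terminal and exchanges data with it. In actual networking, a terminal may be infected by a virus program; an infected terminal may transmit data to a malicious terminal set up by an attacker, which poses a considerable security risk to the network.
To block data transmission between an infected terminal and a malicious terminal, the DNS server, after receiving a domain name resolution request from a terminal, extracts characters from the domain name included in the request. The legality of the domain name is then judged by comparing those characters with a stored character feature database.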
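Brief Description of the Drawings
FIG. 1 is a flowchart of a method for identifying a domain name according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for determining a sequence matrix according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for calculating a feature vector according to an embodiment of the present disclosure;
FIG. 4 is a logical structure diagram of an input-output gate according to an embodiment of the present disclosure;
FIG. 5 is a logical structure diagram of a feedback gate according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for training a domain name feature analysis model and a domain name classification model according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an apparatus for identifying a domain name according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a network device according to an embodiment of the present disclosure.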
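Detailed Description
To make the above objects, features, and advantages of the present disclosure clearer, the disclosure is described in further detail below with reference to the drawings and specific embodiments. An embodiment of the present disclosure provides a method for identifying a domain name, which can be applied to a network device with a domain name resolution function, for example a DNS server.
When a terminal (the source device) needs to send a packet to a server (the destination device), the source device first obtains the domain name of the destination device. For example, a user who wants to visit a website can first enter that website's domain name. The source device sends the network device a domain name resolution request that includes the domain name of the destination device. The network device determines the IP address corresponding to the domain name according to a pre-stored correspondence between domain names and IP addresses, and then sends that IP address to the source device so that the source device can send packets to the destination device.
In embodiments of the present disclosure, the network device has been configured with a domain name feature analysis model and a domain name classification model, where the domain name feature analysis model includes an input-output gate. When the network device receives a domain name resolution request sent by a terminal, it uses the two models to identify whether the domain name contained in the request is a legal domain name.
If the domain name is illegal, the network device does not send a response message to the terminal, which prevents the source device from sending data packets to a malicious terminal. If the domain name is legal, the network device sends the terminal a response message carrying the IP address corresponding to the domain name, so that the terminal can access the terminal corresponding to that IP address.
The method for identifying a domain name provided by the embodiments of the present disclosure can improve the accuracy of domain name identification and thereby the security of user data. As shown in FIG. 1, the processing of the method includes the following steps.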
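Step 101: The network device receives a domain name resolution request sent by a terminal.
The domain name resolution request includes a domain name to be identified, and the domain name includes at least one character.
In embodiments of the present disclosure, when the source device (i.e., the terminal) needs to send a packet to the destination device, it first sends a domain name resolution request to the network device. The request carries the domain name of the destination device (i.e., the domain name to be identified). After receiving the request, the network device parses it to obtain the domain name to be identified.
The destination device is a network device that has a domain name, and may be a host, a server, a virtual machine, or the like.
Step 102: The network device determines the sequence matrix corresponding to the domain name.
In embodiments of the present disclosure, after obtaining the domain name to be identified, the network device computes the corresponding sequence matrix. The sequence matrix includes at least one character vector, and each character vector corresponds one-to-one to a character of the at least one character. The method for computing the sequence matrix is described in detail later.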
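Step 103: The network device inputs each character vector of the at least one character vector to the input-output gate in sequence.
In embodiments of the present disclosure, the domain name feature analysis model includes an input-output gate, which includes logical operation rules between multiple activation functions. The activation function may be the tanh activation function. After determining the sequence matrix, the network device inputs each character vector contained in the sequence matrix to the input-output gate in sequence. The formula of the tanh activation function is:
$$F(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{1}$$
where $e$ is the base of the natural logarithm.
The network device inputs each character vector in the sequence matrix to the input-output gate in sequence in order to compute the feature vector corresponding to the sequence matrix.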
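Formula (1) can be written out directly and compared against the standard library implementation. The sketch below is only a numerical check; the function name `tanh` is chosen here for illustration.

```python
import math

def tanh(x):
    """Formula (1): F(x) = (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# The hand-written form agrees with math.tanh for moderate inputs.
for value in (-2.0, 0.0, 0.5, 3.0):
    assert abs(tanh(value) - math.tanh(value)) < 1e-12
```

For very large inputs `math.exp` overflows, so a practical implementation would call `math.tanh` directly.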
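Step 104: Through the logical operation rules between the multiple activation functions, the network device performs logical operation processing on each character vector of the at least one character vector to obtain the feature vector corresponding to the sequence matrix.
In embodiments of the present disclosure, for each character vector input to the input-output gate, the network device applies these logical operation rules to the character vector currently input to the gate, and thereby obtains the feature vector corresponding to the sequence matrix. The logical operation processing includes arithmetic and logical operations. The specific process by which the input-output gate processes a character vector is described in detail later.
Step 105: The network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legal domain name.
In embodiments of the present disclosure, the domain name classification model may be a fully connected layer (Full Connection layer) with one neuron. The network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model, which outputs the classification result for that feature vector. The classification result represents the probability that the domain name is illegal. The network device determines from the classification result whether the domain name is legal.
For example, the classification result for a legal domain name is 0, the classification result for an illegal domain name is 1, and the preset threshold is 0.6. If the classification result for the domain name is 0.8 > 0.6, the network device determines that the domain name is illegal; if the classification result is 0.2 < 0.6, the network device determines that the domain name is legal.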
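The threshold decision in this example can be sketched as follows. The source only says that the single-neuron fully connected layer outputs a probability; the sigmoid used in `score` to map the weighted sum into the range 0 to 1, the bias term, and the handling of a result exactly equal to the threshold (treated as legal here) are assumptions of this sketch.

```python
import math

THRESHOLD = 0.6  # preset threshold from the example above

def score(feature, weights, bias=0.0):
    """Single-neuron fully connected layer with an assumed sigmoid output."""
    z = sum(w * f for w, f in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def is_illegal(classification_result, threshold=THRESHOLD):
    """A result above the threshold marks the domain name as illegal."""
    return classification_result > threshold
```

For instance, `is_illegal(0.8)` is true and `is_illegal(0.2)` is false, matching the two cases in the text.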
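In embodiments of the present disclosure, the network device receives a domain name resolution request sent by a terminal; the request includes a domain name to be identified, and the domain name includes at least one character. The network device determines the sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector and each character vector corresponds one-to-one to a character of the at least one character.
The network device inputs each character vector to the input-output gate in sequence. The input-output gate includes logical operation rules between multiple activation functions. Through these rules, the network device performs logical operation processing on each character vector to obtain the feature vector corresponding to the sequence matrix. The network device inputs the feature vector into the domain name classification model to determine whether the domain name is a legal domain name.
With the embodiments of the present disclosure, a technician does not need to set up a character feature database, and the accuracy of domain name identification is improved.
In one example, if the network device determines that the domain name is legal, it sends the terminal a response message carrying the IP address corresponding to the domain name.
In embodiments of the present disclosure, if the domain name is legal, the network device determines the IP address corresponding to the domain name according to the pre-stored correspondence between domain names and IP addresses, and sends that IP address to the terminal so that the terminal can send packets to the destination device. If the domain name is illegal, the network device either sends no response message to the terminal or sends the terminal a prompt message indicating that the domain name it asked to resolve is illegal.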
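The response behavior described above can be sketched as a small handler. The table `DNS_TABLE`, the address `192.0.2.10`, the function name `handle_request`, and the shape of the response dictionary are all hypothetical; the classifier is passed in as a callable so that the sketch does not depend on any particular model.

```python
from typing import Callable, Optional

# Hypothetical pre-stored correspondence between domain names and IP addresses.
DNS_TABLE = {"www.example.com": "192.0.2.10"}

def handle_request(domain: str, is_legal: Callable[[str], bool]) -> Optional[dict]:
    """Return a response message for a legal domain name, or None for no response."""
    if not is_legal(domain):
        # The source also allows sending a prompt message instead of staying silent.
        return None
    return {"domain": domain, "ip": DNS_TABLE.get(domain)}
```

Keeping the legality check separate from the lookup means the same handler works whichever classifier is configured.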
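In the method for identifying a domain name provided by the embodiments of the present disclosure, the network device receives a domain name resolution request sent by a terminal; the request includes a domain name to be identified, and the domain name includes at least one character. The network device determines the sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector and each character vector corresponds one-to-one to a character of the at least one character.
The network device then inputs each character vector to the input-output gate in sequence. The input-output gate includes logical operation rules between multiple activation functions. Through these rules, the network device performs logical operation processing on each character vector to obtain the feature vector corresponding to the sequence matrix.
The network device inputs the feature vector corresponding to the sequence matrix into the domain name classification model to determine whether the domain name is a legal domain name.
With the embodiments of the present disclosure, a technician does not need to set up a character feature database, and the accuracy of domain name identification is improved.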
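An embodiment of the present disclosure further provides a method for determining the sequence matrix corresponding to a domain name. As shown in FIG. 2, the specific processing of the method includes the following steps.
Step 201: The network device obtains valid characters from the domain name. The valid characters consist of the characters in the domain name other than the stored prefix characters and the stored suffix characters.
In embodiments of the present disclosure, the network device stores the prefix characters and suffix characters commonly used in domain names. Prefix characters are network names, such as "www.", "ftp.", and "smtp."; suffix characters are top-level domains, such as ".com", ".net", ".edu", and ".gov". The network device identifies the prefix and suffix characters contained in the domain name and extracts the remaining characters, which are the valid characters. For example, for the domain name www.google.com, removing "www." and ".com" leaves google.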
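A minimal sketch of step 201 follows, using the prefixes and suffixes listed above as the stored sets. The function name `valid_characters` and the choice to strip at most one prefix and one suffix are assumptions of this example.

```python
PREFIXES = ("www.", "ftp.", "smtp.")          # stored prefix characters
SUFFIXES = (".com", ".net", ".edu", ".gov")   # stored suffix characters

def valid_characters(domain):
    """Strip at most one stored prefix and one stored suffix from the domain name."""
    for prefix in PREFIXES:
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
            break
    for suffix in SUFFIXES:
        if domain.endswith(suffix):
            domain = domain[:-len(suffix)]
            break
    return domain
```

For the domain name in the text, `valid_characters("www.google.com")` returns `"google"`.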
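Step 202: According to a stored character-to-index-value mapping rule, the network device determines the index value corresponding to each character in the valid characters to obtain the first index sequence corresponding to the valid characters.
In embodiments of the present disclosure, the network device stores the characters that may appear in domain names and assigns an index value to each character, thereby generating the character-to-index-value mapping rule.
In one possible implementation, the network device stores the characters that may appear in domain names and numbers them in turn starting from 1. The number of each character is its index value. For example, suppose the characters that appear are a, b, c, and d. The network device assigns a the number 1, b the number 2, c the number 3, and d the number 4, so the index value of a is 1, that of b is 2, that of c is 3, and that of d is 4. Table 1 shows an example of a character-to-index-value mapping rule provided by an embodiment of the present disclosure.
Table 1
Character   Index value   Character   Index value
g           1             a           4
o           2             l           5
f           3             e           6
After obtaining the valid characters, the network device determines the index value corresponding to each character according to the stored mapping rule and obtains the first index sequence. For example, if the domain name is www.google.com, the valid characters are google, and with the mapping rule in Table 1 the first index sequence is 1, 2, 2, 1, 5, 6.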
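The lookup in step 202 amounts to a dictionary mapping. The sketch below encodes the rule from Table 1; the names `CHAR_TO_INDEX` and `to_index_sequence` are illustrative, and characters outside the table raise a `KeyError` because the source does not say how unknown characters are handled.

```python
# Character-to-index-value mapping rule taken from Table 1.
CHAR_TO_INDEX = {"g": 1, "o": 2, "f": 3, "a": 4, "l": 5, "e": 6}

def to_index_sequence(valid_chars):
    """Map each valid character to its index value."""
    return [CHAR_TO_INDEX[ch] for ch in valid_chars]

first_index_sequence = to_index_sequence("google")
```

This reproduces the first index sequence 1, 2, 2, 1, 5, 6 given in the text.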
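Step 203: When the first index sequence does not reach the standard length, the network device pads the first index sequence into a second index sequence.
The second index sequence has the standard length.
In embodiments of the present disclosure, after obtaining the first index sequence corresponding to the valid characters, the network device determines whether the first index sequence reaches the standard length. The standard length can be set by a technician based on experience and is greater than the upper limit of the length of the first index sequence. For example, the standard length can be set to 60 characters.
If the first index sequence does not reach the standard length, the network device pads it into a second index sequence of standard length. It can be understood that representing every set of valid characters by an index sequence of standard length makes programmatic processing easier.
In one possible implementation, the network device pads the first index sequence with a preset character, for example 0. The network device may insert the preset character before the first character of the first index sequence or after its last character.
For example, if the first index sequence is 1, 2, 2, 1, 5, 6 and the standard length is 60 characters, the second index sequence is 1, 2, 2, 1, 5, 6, 0, 0, ..., 0; that is, 54 zeros are padded after the character 6.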
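Padding to the standard length can be sketched as below. The function name `pad` and the `before` flag are illustrative; sequences already at or beyond the standard length are returned unchanged, since the source states that the standard length exceeds the longest first index sequence and does not describe truncation.

```python
STANDARD_LENGTH = 60
PAD_INDEX = 0  # preset padding character

def pad(seq, length=STANDARD_LENGTH, pad_index=PAD_INDEX, before=False):
    """Pad an index sequence with pad_index up to the standard length."""
    missing = length - len(seq)
    if missing <= 0:
        return list(seq)
    filler = [pad_index] * missing
    return filler + list(seq) if before else list(seq) + filler

second_index_sequence = pad([1, 2, 2, 1, 5, 6])
```

Passing `before=True` places the zeros in front of the first character instead, which is the other option the text allows.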
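Step 204: The network device computes the character vector corresponding to each index value in the second index sequence.
In embodiments of the present disclosure, the network device stores an embedding layer (Embedding layer) neural network, which can convert any character into a character vector. The network device can input the second index sequence into the embedding layer neural network to compute the character vector corresponding to each index value. The computed character vectors may be 128-dimensional.
Computing character vectors with an embedding layer neural network is prior art and is not described further here. Computing character vectors this way makes it possible to learn the similarity between characters and their relationships in context effectively.
Step 205: The network device determines the sequence matrix from the character vectors corresponding to the index values in the second index sequence.
In embodiments of the present disclosure, the network device computes the character vector corresponding to each index value in the second index sequence and then uses these character vectors to determine the sequence matrix.
For example, the second index sequence is 1, 2, 2, 1, 5, 6, 0, 0, ..., 0. Its length is 60 characters; that is, 54 zeros are padded after the character 6. The network device inputs the second index sequence into the embedding layer neural network, which outputs the character vectors $a_1, a_2, a_3, \ldots, a_{60}$. Each $a_i$ is a 128-dimensional vector, so 60 character vectors of 128 dimensions are obtained. The network device uses these 60 vectors to determine a 60 * 128 sequence matrix.
In embodiments of the present disclosure, the network device turns domain name identification into a natural language processing problem: it assigns an index value to each character and then represents the index values as character vectors, which makes programming easier.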
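The lookup performed by an embedding layer can be illustrated with a plain table of vectors. In the sketch below the table `EMBEDDING` is filled with random values as a stand-in for trained weights; the vocabulary size of 7 (padding index 0 plus the six characters of Table 1) and the value range are assumptions of this example.

```python
import random

EMBED_DIM = 128
VOCAB_SIZE = 7  # index 0 for padding plus the six characters of Table 1

rng = random.Random(0)
# Stand-in for a trained embedding layer: one 128-dimensional row per index value.
EMBEDDING = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]

def sequence_matrix(index_sequence):
    """Look up one character vector per index value."""
    return [EMBEDDING[i] for i in index_sequence]

matrix = sequence_matrix([1, 2, 2, 1, 5, 6] + [0] * 54)
```

The result has 60 rows of 128 values each, and repeated characters such as the two letters o share the same character vector.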
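The embodiments of the present disclosure also provide another implementation for determining the sequence matrix, in which the network device does not extract valid characters.
Specifically, after obtaining the domain name to be identified, the network device determines the index value corresponding to each character in the domain name to obtain the index sequence corresponding to the domain name (called the third index sequence for distinction). The network device determines whether the third index sequence reaches the standard length. When it does not, the network device pads the third index sequence into a fourth index sequence of standard length. The network device then computes the character vector corresponding to each index value in the fourth index sequence and uses these character vectors to determine the sequence matrix.
In the method for determining the sequence matrix provided by the embodiments of the present disclosure, the network device first extracts the characters with identifying meaning (the valid characters) from the domain name and then determines the sequence matrix from the valid characters. Because not all characters of the domain name need to be processed, determining the sequence matrix is more efficient.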
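An embodiment of the present disclosure further provides a method for calculating the feature vector. As shown in FIG. 3, the specific processing of the method includes the following steps.
Step 301: The network device obtains the first character vector currently input to the input-output gate, the output value of the second character vector last input to the input-output gate, and the feedback value of the second character vector last input to the input-output gate.
In embodiments of the present disclosure, the network device inputs the character vectors contained in the sequence matrix to the input-output gate in sequence. For ease of description, the character vector currently input to the input-output gate is called the first character vector, and the character vector last input to the input-output gate is called the second character vector.
The network device obtains the first character vector, the output value of the second character vector, and the feedback value of the second character vector. For the very first character vector input to the input-output gate, the output value and the feedback value of the previously input character vector are both 0. The calculation of the feedback value is described in detail later.
For example, the network device first inputs the character vector $a_1$ to the input-output gate to output the output value of $a_1$, and inputs $a_1$ to the feedback gate to output the feedback value of $a_1$. The network device then inputs the character vector $a_2$ to the input-output gate to output the output value of $a_2$. To do so, it obtains $a_2$, the output value of $a_1$, and the feedback value of $a_1$ for the subsequent operations.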
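Step 302: The network device performs a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector to obtain the output value of the first character vector.
FIG. 4 shows the logical structure of the input-output gate provided by an embodiment of the present disclosure. Based on the input-output gate shown in FIG. 4, the network device performs the first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector to obtain the output value of the first character vector. The specific calculation process includes the following steps.
Step 1: According to the first weight matrix, the network device performs a first weighting calculation on the first character vector and the feedback value of the second character vector to obtain a first weighting result.
Step 2: According to the second weight matrix, the network device performs a second weighting calculation on the first character vector and the feedback value of the second character vector to obtain a second weighting result.
The first weight matrix and the second weight matrix may be the same or different.
Step 3: The network device inputs the first weighting result and the first bias parameter into the first activation function to obtain a first operation result. The corresponding calculation formula can be as follows:
$$i_t = \tanh(w_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$
Here the first activation function is the tanh activation function, $h_{t-1}$ is the feedback value of the character vector last input to the input-output gate, $x_t$ is the character vector currently input to the input-output gate, $w_i$ is the first weight matrix, $b_i$ is the first bias parameter, and $i_t$ is the first operation result.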
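Step 4: The network device inputs the second weighting result and the second bias parameter into the second activation function to obtain a second operation result. The corresponding calculation formula can be as follows:
$$\tilde{C}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c) \tag{3}$$
Here the second activation function is the tanh activation function, $w_c$ is the second weight matrix, $b_c$ is the second bias parameter, $\tilde{C}_t$ is the second operation result, and $h_{t-1}$ and $x_t$ are the same as in formula (2).
Both $i_t$ and $\tilde{C}_t$ are determined from the feedback value of the second character vector and the first character vector. $i_t$ represents the data finally input in this calculation; $\tilde{C}_t$ represents the data to be retained in the feedback value of this calculation.
In one example, the first bias parameter and the second bias parameter may be the same or different.
Step 5: The network device multiplies the first operation result by the second operation result and adds the product to the output value of the second character vector to obtain the output value corresponding to the first character vector. The corresponding calculation formula can be as follows:
$$C_t = C_{t-1} + i_t * \tilde{C}_t \tag{4}$$
Here $C_t$ is the output value corresponding to the first character vector, $C_{t-1}$ is the output value of the second character vector, $i_t$ is the first operation result, and $\tilde{C}_t$ is the second operation result.
In one example, the network device stores the output value of the first character vector for subsequent logical operation processing.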
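The five steps of the input-output gate can be sketched in plain Python as follows. The sketch makes several assumptions that are not stated in the source: vectors are Python lists, each weight matrix is stored as rows over the concatenation of the previous feedback value and the current input, the multiplication in step 5 is element-wise, and the output value is therefore a vector of the same length as the input. The names `affine_tanh` and `input_output_gate` are hypothetical.

```python
import math

def affine_tanh(w, h_prev, x, b):
    """tanh(w . [h_prev, x] + b), with w stored as rows over the concatenation."""
    concat = h_prev + x
    return [math.tanh(sum(wij * cj for wij, cj in zip(row, concat)) + bi)
            for row, bi in zip(w, b)]

def input_output_gate(x, c_prev, h_prev, w_i, b_i, w_c, b_c):
    """Return the output value of the current character vector (steps 1 to 5)."""
    i_t = affine_tanh(w_i, h_prev, x, b_i)       # steps 1 and 3: first operation result
    c_tilde = affine_tanh(w_c, h_prev, x, b_c)   # steps 2 and 4: second operation result
    # Step 5: add the product to the previous output value (element-wise here).
    return [c + i * ct for c, i, ct in zip(c_prev, i_t, c_tilde)]
```

Because the previous output value is carried forward by plain addition, an all-zero gate leaves it unchanged, which is a convenient sanity check.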
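Step 303: The network device determines the feature vector corresponding to the sequence matrix from the output values of the at least one character vector.
In embodiments of the present disclosure, based on the above processing, after inputting any character vector to the input-output gate, the network device obtains the output value of that character vector. In this way the network device obtains the output value of every character vector in the sequence matrix and uses these output values to determine the feature vector corresponding to the sequence matrix.
For example, if the sequence matrix includes $a_1, a_2, a_3$, where the output value of $a_1$ is x, that of $a_2$ is y, and that of $a_3$ is z, then the feature vector corresponding to the sequence matrix is (x, y, z).
In embodiments of the present disclosure, the logical operations of the input-output gate in existing recurrent neural networks are simplified, which reduces the processing load on the network device and improves the accuracy of domain name identification.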
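Putting steps 301 to 303 together, the sketch below feeds a short sequence of character vectors through both gates and collects the output values as the feature vector. It is an illustration only: the parameter dictionary `params`, the helper `gate`, the element-wise products, and the one-dimensional toy values are assumptions, and the initial output and feedback values are set to 0 as the text describes for the first input.

```python
import math

def gate(w, b, h_prev, x):
    """tanh(w . [h_prev, x] + b) with w stored as rows over the concatenation."""
    concat = h_prev + x
    return [math.tanh(sum(a * v for a, v in zip(row, concat)) + bias)
            for row, bias in zip(w, b)]

def run_sequence(vectors, params):
    """Feed character vectors in order; return the list of their output values."""
    dim = len(vectors[0])
    c_prev = [0.0] * dim  # output value before the first input is 0
    h_prev = [0.0] * dim  # feedback value before the first input is 0
    outputs = []
    for x in vectors:
        i_t = gate(params["w_i"], params["b_i"], h_prev, x)
        c_tilde = gate(params["w_c"], params["b_c"], h_prev, x)
        c_t = [c + i * ct for c, i, ct in zip(c_prev, i_t, c_tilde)]
        o_t = gate(params["w_o"], params["b_o"], h_prev, x)
        h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
        outputs.append(c_t)
        c_prev, h_prev = c_t, h_t
    return outputs

toy_params = {"w_i": [[0.0, 0.0]], "b_i": [1.0],
              "w_c": [[0.0, 0.0]], "b_c": [1.0],
              "w_o": [[0.0, 0.0]], "b_o": [0.0]}
feature = run_sequence([[0.3], [0.1], [-0.2]], toy_params)
```

With these toy parameters every step adds the same amount to the running output value, so the three collected outputs grow linearly.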
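Further, in embodiments of the present disclosure, the domain name feature analysis model also includes a feedback gate. After obtaining the output value of the character vector currently input to the input-output gate, the network device also computes the feedback value of that character vector.
The specific processing can be as follows: the network device performs a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the first character vector.
The feedback value is used by the network device to compute the output value of the character vector that is next input to the input-output gate.
In embodiments of the present disclosure, the network device inputs any given character vector to the input-output gate of the domain name feature analysis model and, at the same time, to the feedback gate of the model. That is, the character vector currently input to the input-output gate and the character vector currently input to the feedback gate are the same character vector. Likewise, the character vector last input to the input-output gate and the character vector last input to the feedback gate are the same character vector.
It can be understood that, for the first character vector input to the feedback gate, the feedback value of the character vector previously input to the feedback gate is 0.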
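FIG. 5 shows the logical structure of the feedback gate provided by an embodiment of the present disclosure. Based on the feedback gate shown in FIG. 5, the network device performs the second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector to obtain the feedback value of the character vector currently input to the feedback gate. The specific calculation process includes the following steps.
Step 1: According to the third weight matrix, the network device performs a third weighting calculation on the first character vector and the feedback value of the second character vector to obtain a third weighting result.
Step 2: The network device inputs the third weighting result and the third bias parameter into the third activation function to obtain a third operation result. The corresponding calculation formula can be as follows:
$$O_t = \tanh(w_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$
Here the third activation function is the tanh activation function, $h_{t-1}$ is the feedback value of the second character vector, $x_t$ is the first character vector, $w_o$ is the third weight matrix, $b_o$ is the third bias parameter, and $O_t$ is the third operation result. $O_t$ represents the data, selected from the feedback value of the second character vector and the first character vector, that is to be fed back to the next calculation and that is to be memorized in the current output (i.e., $C_t$).
Step 3: The network device inputs the output value of the first character vector into the fourth activation function to obtain a fourth operation result.
Step 4: The network device multiplies the third operation result by the fourth operation result to obtain the feedback value of the first character vector. The corresponding calculation formula can be as follows:
$$h_t = O_t * \tanh(C_t) \tag{6}$$
Here the fourth activation function is the tanh activation function, $C_t$ is the output value of the first character vector, $O_t$ is the third operation result, and $h_t$ is the feedback value of the first character vector.
In one example, the network device stores the feedback value of the first character vector for subsequent logical operation processing.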
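In formula (2) above, $w_i \cdot [h_{t-1}, x_t]$ is shorthand for the sum of two products: $w_i \cdot [h_{t-1}, x_t] = w_1 * h_{t-1} + w_2 * x_t$. In the embodiments of the present disclosure, the network device converts each character into a 128-dimensional character vector through an embedding layer. Accordingly, $h_{t-1}$ and $x_t$ are both 128-dimensional vectors, $b_i$ is a 128-dimensional vector, and $w_i$ is a 128*128 matrix. $w_1 * h_{t-1}$ is 128*1 and $w_2 * x_t$ is also 128*1; adding $b_i$ finally yields a 128-dimensional vector.
Similarly, in formula (3) above, $w_c$ is a 128*128 matrix and $b_c$ is a 128-dimensional vector; in formula (5) above, $w_o$ is a 128*128 matrix and $b_o$ is a 128-dimensional vector. In formula (4) above, $C_t$ is a 128*128 matrix.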
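The shorthand can be checked numerically. The sketch below assumes that $w_1$ and $w_2$ are the two n*n blocks that act on the previous feedback value and on the current character vector respectively, and shows that applying them separately gives the same result as applying their side-by-side combination to the concatenated vector. It uses n = 3 with integer entries instead of n = 128, and all function names are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def weighted_split(w1, w2, h_prev, x_t):
    """w1 * h_{t-1} + w2 * x_t, each product being an n-dimensional vector."""
    return [a + b for a, b in zip(matvec(w1, h_prev), matvec(w2, x_t))]


def weighted_concat(w1, w2, h_prev, x_t):
    """The same quantity written as one matrix applied to [h_{t-1}, x_t]."""
    combined = [row1 + row2 for row1, row2 in zip(w1, w2)]  # n rows, 2n columns
    return matvec(combined, h_prev + x_t)


w1 = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
w2 = [[0, 1, 0], [2, 0, 1], [1, 1, 1]]
h_prev = [1, 2, 3]
x_t = [4, 5, 6]
print(weighted_split(w1, w2, h_prev, x_t))
print(weighted_concat(w1, w2, h_prev, x_t))
```

Both calls print the same n-dimensional vector, to which the bias vector would then be added.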
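The embodiments of the present disclosure simplify the logical operations of the feedback gate in existing recurrent neural networks, which reduces the processing load on the network device and improves the accuracy of domain name identification.
In one example, the embodiments of the present disclosure further provide a method for training the domain name feature analysis model and the domain name classification model. The method may be performed by a network device, which may be any network device with data processing capability. As shown in FIG. 6, the method includes the following steps.
Step 601: the network device obtains a stored training sample set.
The training sample set includes a plurality of positive samples and a plurality of negative samples. Each positive sample is the sequence matrix corresponding to a legitimate domain name; each negative sample is the sequence matrix corresponding to an illegitimate domain name.
In the embodiments of the present disclosure, a technician enters a plurality of legitimate domain names into the network device in advance. The network device determines the sequence matrix corresponding to each legitimate domain name to obtain the positive samples. The network device may also obtain illegitimate domain names, either by crawling them from the network or by generating them with a domain name generation algorithm (DGA). The network device determines the sequence matrix corresponding to each illegitimate domain name to obtain the negative samples.
In this way, the network device obtains the training sample set. For the specific process of determining a sequence matrix, refer to the description of step 102; it is not repeated here.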
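Assembling such a labelled sample set might look like the sketch below. The generator `toy_dga` is a deliberately simple stand-in that draws random characters; it is not a real DGA family and is not taken from the disclosure. For brevity the samples are kept as domain strings, whereas the description converts each domain name to a sequence matrix first.

```python
import random
import string


def toy_dga(seed, count, length=12):
    """Hypothetical stand-in for a DGA: seeded random labels under .com."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    return ["".join(rng.choice(alphabet) for _ in range(length)) + ".com"
            for _ in range(count)]


def build_training_set(legitimate_domains, illegitimate_domains):
    """Label legitimate domains 1 (positive) and illegitimate domains 0 (negative)."""
    positives = [(domain, 1) for domain in legitimate_domains]
    negatives = [(domain, 0) for domain in illegitimate_domains]
    return positives + negatives


legitimate = ["example.com", "example.org"]  # entered in advance by a technician
samples = build_training_set(legitimate, toy_dga(seed=7, count=3))
for domain, label in samples:
    print(label, domain)
```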
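Step 602: the network device inputs each character vector included in each sequence matrix, in sequence, to a first initial training model to obtain the output value corresponding to each character vector.
The first initial training model may be a recurrent neural network that includes an input-output gate and a feedback gate.
Step 603: the network device determines the feature vector corresponding to each sequence matrix from the output value of each character vector in that sequence matrix.
In the embodiments of the present disclosure, for the specific processing of steps 602 and 603, refer to the description of steps 301 to 303; it is not repeated here.
Step 604: the network device inputs the feature vector corresponding to each sequence matrix to a second initial training model to obtain the domain name identification result corresponding to each sequence matrix.
The second initial training model is a fully connected layer.
For the processing of this step, refer to the description of step 105; it is not repeated here.
Step 605: using the domain name identification result corresponding to each sequence matrix, the network device adjusts, through a back propagation algorithm, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter and the third bias parameter included in the first initial training model to obtain the domain name feature analysis model.
The back propagation algorithm may be the Back Propagation Through Time (BPTT) algorithm.
In the embodiments of the present disclosure, the network device adjusts the first weight matrix, the second weight matrix, the first bias parameter and the second bias parameter of the input-output gate, as well as the third weight matrix and the third bias parameter of the feedback gate, according to the domain name identification result corresponding to each sequence matrix (that is, each sample), the actual classification of the domain name corresponding to that sequence matrix (for example, legitimate or illegitimate), and the BPTT algorithm, to obtain the domain name feature analysis model.
Adjusting a recurrent neural network with the BPTT algorithm is prior art and is not described further in the embodiments of the present disclosure.
Step 606: using the domain name identification result corresponding to each sequence matrix, the network device adjusts the second initial training model through a back propagation algorithm to obtain the domain name classification model.
In the embodiments of the present disclosure, the fully connected layer contains a weight vector, which is a 128-dimensional vector. The network device adjusts the values of the weight vector of the fully connected layer according to the domain name identification result corresponding to each sequence matrix (that is, each sample), the actual classification of the domain name corresponding to that sequence matrix (for example, legitimate or illegitimate), and the back propagation algorithm, to obtain the domain name classification model.
Adjusting a fully connected layer with the back propagation algorithm is prior art and is not described further in the embodiments of the present disclosure.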
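One adjustment of the fully connected layer's weight vector can be sketched as below. The description only states that the layer holds a 128-dimensional weight vector adjusted by back propagation; the sigmoid output, the cross-entropy loss, the learning rate and the function names are all assumptions made for this illustration.

```python
import math


def predict(weights, feature):
    """Score a feature vector: sigmoid of the weighted sum (assumed output form)."""
    z = sum(w * f for w, f in zip(weights, feature))
    return 1.0 / (1.0 + math.exp(-z))


def adjust_weights(weights, feature, label, learning_rate=0.1):
    """One gradient step for cross-entropy loss: w <- w - lr * (p - label) * f."""
    error = predict(weights, feature) - label
    return [w - learning_rate * error * f for w, f in zip(weights, feature)]


weights = [0.0] * 128   # 128-dimensional weight vector, as in the description
feature = [0.1] * 128   # toy feature vector for one sample
label = 1               # actual classification: 1 = legitimate (assumed encoding)

before = predict(weights, feature)
weights = adjust_weights(weights, feature, label)
after = predict(weights, feature)
print(before, after)    # the score moves toward the label after the update
```

In a full training run this step would be repeated over every sample, alongside the BPTT adjustment of the recurrent part.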
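For simplicity of description, the method embodiments of the present disclosure are expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments are not limited by the described order of actions, because according to the embodiments some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.
Referring to FIG. 7, FIG. 7 shows an apparatus for identifying a domain name provided by the embodiments of the present disclosure. The apparatus is applied to a network device that has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate. The apparatus includes a receiving module 710, a first determining module 720, a first input module 730, a processing module 740 and a second determining module 750. The modules are described below.
The receiving module 710, configured to receive a domain name resolution request sent by a terminal, the domain name resolution request including a domain name to be identified, and the domain name including at least one character;
the first determining module 720, configured to determine a sequence matrix corresponding to the domain name, the sequence matrix including at least one character vector, each of the at least one character vector corresponding one-to-one to each of the at least one character;
the first input module 730, configured to input each of the at least one character vector to the input-output gate in sequence, the input-output gate including logical operation rules among a plurality of activation functions;
the processing module 740, configured to perform logical operation processing on each of the at least one character vector according to the logical operation rules among the plurality of activation functions, to obtain a feature vector corresponding to the sequence matrix;
the second determining module 750, configured to input the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
In one embodiment of the present disclosure, the first determining module 720 may include a first obtaining submodule, a first determining submodule, a padding submodule, a first calculating submodule and a second determining submodule.
The first obtaining submodule, configured to obtain effective characters from the domain name, the effective characters consisting of the characters of the domain name other than stored prefix characters and stored suffix characters;
the first determining submodule, configured to determine, according to a stored mapping rule between characters and index values, the index value corresponding to each of the effective characters, to obtain a first index sequence corresponding to the effective characters;
the padding submodule, configured to pad the first index sequence into a second index sequence when the first index sequence does not reach a standard length, the second index sequence having the standard length;
the first calculating submodule, configured to calculate the character vector corresponding to each index value in the second index sequence;
the second determining submodule, configured to determine the sequence matrix from the character vectors corresponding to the index values in the second index sequence.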
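The work of these submodules can be sketched end to end. Everything concrete below is a hypothetical choice made for illustration: the stored prefix and suffix lists, the character-to-index mapping, the standard length of 8, the use of index 0 as padding appended at the end, and the tiny embedding table.

```python
PREFIXES = ("www.",)                 # assumed stored prefix characters
SUFFIXES = (".com", ".net", ".org")  # assumed stored suffix characters
CHAR_TO_INDEX = {ch: i + 1 for i, ch in
                 enumerate("abcdefghijklmnopqrstuvwxyz0123456789-.")}
STANDARD_LENGTH = 8                  # assumed standard length
PAD_INDEX = 0                        # assumed padding index


def effective_characters(domain):
    """Strip one stored prefix and one stored suffix, if present."""
    for prefix in PREFIXES:
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
            break
    for suffix in SUFFIXES:
        if domain.endswith(suffix):
            domain = domain[:-len(suffix)]
            break
    return domain


def index_sequence(domain):
    """Map effective characters to indices, then pad up to the standard length."""
    first = [CHAR_TO_INDEX[ch] for ch in effective_characters(domain.lower())]
    padding = [PAD_INDEX] * max(0, STANDARD_LENGTH - len(first))
    return first + padding           # sequences already long enough are left as-is


def sequence_matrix(indices, embedding_table):
    """Look up the character vector for each index value."""
    return [embedding_table[i] for i in indices]


indices = index_sequence("www.abc.com")
print(indices)
table = {i: [float(i), 0.0] for i in range(len(CHAR_TO_INDEX) + 1)}  # toy 2-d vectors
print(sequence_matrix(indices, table)[:3])
```

In the described system each character vector has 128 dimensions and is produced by an embedding layer rather than a fixed lookup table.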
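In one embodiment of the present disclosure, the processing module 740 may include a second obtaining submodule, an operation submodule and a third determining submodule.
The second obtaining submodule, configured to obtain a first character vector currently input to the input-output gate, the output value of a second character vector previously input to the input-output gate, and the feedback value of the second character vector previously input to the input-output gate;
the operation submodule, configured to perform a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector, to obtain the output value of the first character vector;
the third determining submodule, configured to determine the feature vector corresponding to the sequence matrix from the obtained output value of the at least one character vector.
In one embodiment of the present disclosure, the operation submodule 742 may include a first calculating unit, a second calculating unit, a first input unit, a second input unit and a third calculating unit.
The first calculating unit, configured to perform, according to a first weight matrix, a first weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a first weighted result;
the second calculating unit, configured to perform, according to a second weight matrix, a second weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a second weighted result;
the first input unit, configured to input the first weighted result and a first bias parameter to a first activation function to obtain a first operation result;
the second input unit, configured to input the second weighted result and a second bias parameter to a second activation function to obtain a second operation result;
the third calculating unit, configured to multiply the first operation result by the second operation result and add the product to the output value of the second character vector, to obtain the output value corresponding to the first character vector.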
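The five units can be combined into one function, sketched below. This passage does not name the first and second activation functions, so the sketch assumes sigmoid and tanh respectively, as in a conventional LSTM input gate and candidate value; it also assumes that each weight matrix acts on the concatenation of the previous feedback value and the current character vector and that the product is element-wise. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def input_output_gate(w_1st, b_1st, w_2nd, b_2nd, x_t, h_prev, c_prev):
    """Return the output value of the current character vector."""
    concat = h_prev + x_t
    first_weighted = matvec(w_1st, concat)    # first calculating unit
    second_weighted = matvec(w_2nd, concat)   # second calculating unit
    # First input unit (sigmoid is an assumption).
    first_result = [sigmoid(s + b) for s, b in zip(first_weighted, b_1st)]
    # Second input unit (tanh is an assumption).
    second_result = [math.tanh(s + b) for s, b in zip(second_weighted, b_2nd)]
    # Third calculating unit: multiply, then add the previous output value.
    return [c + a * g for c, a, g in zip(c_prev, first_result, second_result)]


# Toy 1-dimensional example.
c_t = input_output_gate(w_1st=[[0.0, 0.0]], b_1st=[0.0],
                        w_2nd=[[0.0, 1.0]], b_2nd=[0.0],
                        x_t=[1.0], h_prev=[0.0], c_prev=[2.0])
print(c_t)
```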
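In one embodiment of the present disclosure, the domain name feature analysis model further includes a feedback gate, and the apparatus further includes an operation module.
The operation module, configured to perform a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector, to obtain the feedback value of the first character vector.
In one embodiment of the present disclosure, the operation module may include a second calculating submodule, a first input submodule, a second input submodule and a multiplying submodule.
The second calculating submodule, configured to perform, according to a third weight matrix, a third weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a third weighted result;
the first input submodule, configured to input the third weighted result and a third bias parameter to a third activation function to obtain a third operation result;
the second input submodule, configured to input the output value of the first character vector to a fourth activation function to obtain a fourth operation result;
the multiplying submodule, configured to multiply the third operation result by the fourth operation result to obtain the feedback value of the first character vector.
In one embodiment of the present disclosure, the apparatus further includes an obtaining module, a second input module, a third determining module, a third input module, a first adjusting module and a second adjusting module.
The obtaining module, configured to obtain a stored training sample set, the training sample set including a plurality of positive samples and a plurality of negative samples, each positive sample being the sequence matrix corresponding to a legitimate domain name and each negative sample being the sequence matrix corresponding to an illegitimate domain name;
the second input module, configured to input each character vector included in each sequence matrix, in sequence, to a first initial training model to obtain the output value corresponding to each character vector;
the third determining module, configured to determine the feature vector corresponding to each sequence matrix from the output value of each character vector in that sequence matrix;
the third input module, configured to input the feature vector corresponding to each sequence matrix to a second initial training model to obtain the domain name identification result corresponding to each sequence matrix;
the first adjusting module, configured to adjust, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter and the third bias parameter included in the first initial training model, to obtain the domain name feature analysis model;
the second adjusting module, configured to adjust the second initial training model, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, to obtain the domain name classification model.
The apparatus for identifying a domain name provided by the embodiments of the present disclosure is applied to a network device that has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate. The network device receives a domain name resolution request sent by a terminal; the request includes a domain name to be identified, and the domain name includes at least one character. The network device determines the sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector and each of the at least one character vector corresponds one-to-one to each of the at least one character.
The network device then inputs each of the at least one character vector to the input-output gate in sequence. The input-output gate includes logical operation rules among a plurality of activation functions. According to these rules, the network device performs logical operation processing on each of the at least one character vector to obtain the feature vector corresponding to the sequence matrix.
The network device inputs the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
With the embodiments of the present disclosure, technicians do not need to set up a character feature library, and the accuracy of domain name identification is improved.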
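Corresponding to the above method embodiments for identifying a domain name, the embodiments of the present disclosure further provide a network device. Referring to FIG. 8, FIG. 8 shows a structural block diagram of a network device provided by the embodiments of the present disclosure.
The network device includes a processor 801, a transceiver 802, and a machine-readable storage medium 803 storing machine-executable instructions.
The network device has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate. The two models may be implemented as software function modules. It can be understood that these software function modules may have been loaded into a memory (for example, flash), in which case the processor 801 implements them by invoking them; alternatively, they may have been built into the processor, in which case the processor 801 implements them by accessing them.
The transceiver 802 is configured to receive a domain name resolution request sent by a terminal and transmit the domain name resolution request to the processor 801, the domain name resolution request including a domain name to be identified, and the domain name including at least one character;
by reading and executing the machine-executable instructions, the processor 801 is caused to:
determine a sequence matrix corresponding to the domain name, the sequence matrix including at least one character vector, each of the at least one character vector corresponding one-to-one to each of the at least one character;
input each of the at least one character vector to the input-output gate in sequence, the input-output gate including logical operation rules among a plurality of activation functions;
perform logical operation processing on each of the at least one character vector according to the logical operation rules among the plurality of activation functions, to obtain a feature vector corresponding to the sequence matrix;
input the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
In one embodiment of the present disclosure, the machine-executable instructions specifically cause the processor 801 to:
obtain effective characters from the domain name, the effective characters consisting of the characters of the domain name other than stored prefix characters and stored suffix characters;
determine, according to a stored mapping rule between characters and index values, the index value corresponding to each of the effective characters, to obtain a first index sequence corresponding to the effective characters;
when the first index sequence does not reach a standard length, pad the first index sequence into a second index sequence, the second index sequence having the standard length;
calculate the character vector corresponding to each index value in the second index sequence;
determine the sequence matrix from the character vectors corresponding to the index values in the second index sequence.
In one embodiment of the present disclosure, the machine-executable instructions specifically cause the processor 801 to:
obtain a first character vector currently input to the input-output gate, the output value of a second character vector previously input to the input-output gate, and the feedback value of the second character vector previously input to the input-output gate;
perform a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector, to obtain the output value of the first character vector;
determine the feature vector corresponding to the sequence matrix from the obtained output value of the at least one character vector.
In one embodiment of the present disclosure, the machine-executable instructions specifically cause the processor 801 to:
perform, according to a first weight matrix, a first weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a first weighted result;
perform, according to a second weight matrix, a second weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a second weighted result;
input the first weighted result and a first bias parameter to a first activation function to obtain a first operation result;
input the second weighted result and a second bias parameter to a second activation function to obtain a second operation result;
multiply the first operation result by the second operation result and add the product to the output value of the second character vector, to obtain the output value corresponding to the first character vector.
In one embodiment of the present disclosure, the domain name feature analysis model further includes a feedback gate, and the machine-executable instructions further cause the processor 801 to: perform a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector, to obtain the feedback value of the first character vector;
wherein the feedback value is used to calculate the output value of the next character vector input to the input-output gate.
In one embodiment of the present disclosure, the machine-executable instructions specifically cause the processor 801 to: perform, according to a third weight matrix, a third weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a third weighted result;
input the third weighted result and a third bias parameter to a third activation function to obtain a third operation result;
input the output value of the first character vector to a fourth activation function to obtain a fourth operation result;
multiply the third operation result by the fourth operation result to obtain the feedback value of the first character vector.
In one embodiment of the present disclosure, the machine-executable instructions further cause the processor 801 to:
obtain a stored training sample set, the training sample set including a plurality of positive samples and a plurality of negative samples, each positive sample being the sequence matrix corresponding to a legitimate domain name and each negative sample being the sequence matrix corresponding to an illegitimate domain name;
input each character vector included in each sequence matrix, in sequence, to a first initial training model to obtain the output value corresponding to each character vector;
determine the feature vector corresponding to each sequence matrix from the output value of each character vector in that sequence matrix;
input the feature vector corresponding to each sequence matrix to a second initial training model to obtain the domain name identification result corresponding to each sequence matrix;
adjust, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter and the third bias parameter included in the first initial training model, to obtain the domain name feature analysis model;
adjust the second initial training model, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, to obtain the domain name classification model.
In one embodiment of the present disclosure, the machine-executable instructions further cause the processor 801 to:
if the domain name is determined to be a legitimate domain name, send a response message to the terminal, the response message including the Internet Protocol (IP) address corresponding to the domain name.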
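The response behaviour can be sketched as follows. The classifier callback `is_legitimate`, the record table `dns_records` and the dictionary-shaped response are hypothetical placeholders; this passage does not say what happens for an illegitimate domain name, so the sketch simply returns `None` in that case.

```python
def handle_resolution_request(domain, is_legitimate, dns_records):
    """Build a response carrying the IP address only for a legitimate domain name."""
    if is_legitimate(domain):
        return {"domain": domain, "ip": dns_records.get(domain)}
    return None  # behaviour for illegitimate domain names is not specified here


# Stub classifier and records, for illustration only.
records = {"example.com": "192.0.2.1"}  # address from a documentation range


def stub_classifier(name):
    return name in records


print(handle_resolution_request("example.com", stub_classifier, records))
print(handle_resolution_request("q7x1zk0p2v9d.com", stub_classifier, records))
```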
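As shown in FIG. 8, the network device may further include a communication bus 804, through which the processor 801 and the machine-readable storage medium 803 communicate with each other. The communication bus 804 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus and so on.
The machine-readable storage medium 803 may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk memory. In addition, the machine-readable storage medium 803 may be at least one storage device located remotely from the processor.
The processor 801 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The network device provided by the embodiments of the present disclosure has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate. The network device receives a domain name resolution request sent by a terminal; the request includes a domain name to be identified, and the domain name includes at least one character. The network device determines the sequence matrix corresponding to the domain name, where the sequence matrix includes at least one character vector and each of the at least one character vector corresponds one-to-one to each of the at least one character.
The network device then inputs each of the at least one character vector to the input-output gate in sequence. The input-output gate includes logical operation rules among a plurality of activation functions. According to these rules, the network device performs logical operation processing on each of the at least one character vector to obtain the feature vector corresponding to the sequence matrix.
Finally, the network device inputs the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
With the embodiments of the present disclosure, technicians do not need to set up a character feature library, and the accuracy of domain name identification is improved.
In the embodiments of the present disclosure, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
The above are merely preferred embodiments of the present disclosure and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure falls within its scope of protection.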

Claims (14)

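  1. A method for identifying a domain name, applied to a network device that has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate, the method comprising:
    receiving a domain name resolution request sent by a terminal, the domain name resolution request including a domain name to be identified, and the domain name including at least one character;
    determining a sequence matrix corresponding to the domain name, the sequence matrix including at least one character vector, each of the at least one character vector corresponding one-to-one to each of the at least one character;
    inputting each of the at least one character vector to the input-output gate in sequence, the input-output gate including logical operation rules among a plurality of activation functions;
    performing logical operation processing on each of the at least one character vector according to the logical operation rules among the plurality of activation functions, to obtain a feature vector corresponding to the sequence matrix;
    inputting the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
  2. The method according to claim 1, wherein determining the sequence matrix corresponding to the domain name comprises:
    obtaining effective characters from the domain name, the effective characters consisting of the characters of the domain name other than stored prefix characters and stored suffix characters;
    determining, according to a stored mapping rule between characters and index values, the index value corresponding to each of the effective characters, to obtain a first index sequence corresponding to the effective characters;
    when the first index sequence does not reach a standard length, padding the first index sequence into a second index sequence, the second index sequence having the standard length;
    calculating the character vector corresponding to each index value in the second index sequence;
    determining the sequence matrix from the character vectors corresponding to the index values in the second index sequence.
  3. The method according to claim 1, wherein performing logical operation processing on each of the at least one character vector according to the logical operation rules among the plurality of activation functions, to obtain the feature vector corresponding to the sequence matrix, comprises:
    obtaining a first character vector currently input to the input-output gate, the output value of a second character vector previously input to the input-output gate, and the feedback value of the second character vector previously input to the input-output gate;
    performing a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector, to obtain the output value of the first character vector;
    determining the feature vector corresponding to the sequence matrix from the obtained output value of the at least one character vector.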
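  4. The method according to claim 3, wherein performing the first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector, to obtain the output value of the first character vector, comprises:
    performing, according to a first weight matrix, a first weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a first weighted result;
    performing, according to a second weight matrix, a second weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a second weighted result;
    inputting the first weighted result and a first bias parameter to a first activation function to obtain a first operation result;
    inputting the second weighted result and a second bias parameter to a second activation function to obtain a second operation result;
    multiplying the first operation result by the second operation result and adding the product to the output value of the second character vector, to obtain the output value corresponding to the first character vector.
  5. The method according to claim 4, wherein the domain name feature analysis model further includes a feedback gate;
    after the output value of the first character vector is obtained, the method further comprises:
    performing a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector, to obtain the feedback value of the first character vector;
    wherein the feedback value is used to calculate the output value of the next character vector input to the input-output gate.
  6. The method according to claim 5, wherein performing the second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector, to obtain the feedback value of the first character vector, comprises:
    performing, according to a third weight matrix, a third weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a third weighted result;
    inputting the third weighted result and a third bias parameter to a third activation function to obtain a third operation result;
    inputting the output value of the first character vector to a fourth activation function to obtain a fourth operation result;
    multiplying the third operation result by the fourth operation result to obtain the feedback value of the first character vector.
  7. The method according to claim 6, further comprising:
    obtaining a stored training sample set, the training sample set including a plurality of positive samples and a plurality of negative samples, each positive sample being the sequence matrix corresponding to a legitimate domain name and each negative sample being the sequence matrix corresponding to an illegitimate domain name;
    inputting each character vector included in each sequence matrix, in sequence, to a first initial training model to obtain the output value corresponding to each character vector;
    determining the feature vector corresponding to each sequence matrix from the output value of each character vector in that sequence matrix;
    inputting the feature vector corresponding to each sequence matrix to a second initial training model to obtain the domain name identification result corresponding to each sequence matrix;
    adjusting, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter and the third bias parameter included in the first initial training model, to obtain the domain name feature analysis model;
    adjusting the second initial training model, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, to obtain the domain name classification model.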
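  8. A network device that has been configured with a domain name feature analysis model and a domain name classification model, the domain name feature analysis model including an input-output gate, the network device comprising a processor, a transceiver, and a machine-readable storage medium storing machine-executable instructions;
    wherein the transceiver is configured to receive a domain name resolution request sent by a terminal and transmit the domain name resolution request to the processor, the domain name resolution request including a domain name to be identified, and the domain name including at least one character;
    by reading and executing the machine-executable instructions, the processor is caused to:
    determine a sequence matrix corresponding to the domain name, the sequence matrix including at least one character vector, each of the at least one character vector corresponding one-to-one to each of the at least one character;
    input each of the at least one character vector to the input-output gate in sequence, the input-output gate including logical operation rules among a plurality of activation functions;
    perform logical operation processing on each of the at least one character vector according to the logical operation rules among the plurality of activation functions, to obtain a feature vector corresponding to the sequence matrix;
    input the feature vector corresponding to the sequence matrix to the domain name classification model to determine whether the domain name is a legitimate domain name.
  9. The network device according to claim 8, wherein the machine-executable instructions specifically cause the processor to:
    obtain effective characters from the domain name, the effective characters consisting of the characters of the domain name other than stored prefix characters and stored suffix characters;
    determine, according to a stored mapping rule between characters and index values, the index value corresponding to each of the effective characters, to obtain a first index sequence corresponding to the effective characters;
    when the first index sequence does not reach a standard length, pad the first index sequence into a second index sequence, the second index sequence having the standard length;
    calculate the character vector corresponding to each index value in the second index sequence;
    determine the sequence matrix from the character vectors corresponding to the index values in the second index sequence.
  10. The network device according to claim 8, wherein the machine-executable instructions specifically cause the processor to:
    obtain a first character vector currently input to the input-output gate, the output value of a second character vector previously input to the input-output gate, and the feedback value of the second character vector previously input to the input-output gate;
    perform a first logical operation on the first character vector, the output value of the second character vector, and the feedback value of the second character vector, to obtain the output value of the first character vector;
    determine the feature vector corresponding to the sequence matrix from the obtained output value of the at least one character vector.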
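  11. The network device according to claim 10, wherein the machine-executable instructions specifically cause the processor to:
    perform, according to a first weight matrix, a first weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a first weighted result;
    perform, according to a second weight matrix, a second weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a second weighted result;
    input the first weighted result and a first bias parameter to a first activation function to obtain a first operation result;
    input the second weighted result and a second bias parameter to a second activation function to obtain a second operation result;
    multiply the first operation result by the second operation result and add the product to the output value of the second character vector, to obtain the output value corresponding to the first character vector.
  12. The network device according to claim 11, wherein the domain name feature analysis model further includes a feedback gate, and the machine-executable instructions further cause the processor to:
    perform a second logical operation on the output value of the first character vector, the first character vector, and the feedback value of the second character vector, to obtain the feedback value of the first character vector;
    wherein the feedback value is used to calculate the output value of the next character vector input to the input-output gate.
  13. The network device according to claim 12, wherein the machine-executable instructions specifically cause the processor to: perform, according to a third weight matrix, a third weighted calculation on the first character vector and the feedback value of the second character vector, to obtain a third weighted result;
    input the third weighted result and a third bias parameter to a third activation function to obtain a third operation result;
    input the output value of the first character vector to a fourth activation function to obtain a fourth operation result;
    multiply the third operation result by the fourth operation result to obtain the feedback value of the first character vector.
  14. The network device according to claim 13, wherein the machine-executable instructions further cause the processor to:
    obtain a stored training sample set, the training sample set including a plurality of positive samples and a plurality of negative samples, each positive sample being the sequence matrix corresponding to a legitimate domain name and each negative sample being the sequence matrix corresponding to an illegitimate domain name;
    input each character vector included in each sequence matrix, in sequence, to a first initial training model to obtain the output value corresponding to each character vector;
    determine the feature vector corresponding to each sequence matrix from the output value of each character vector in that sequence matrix;
    input the feature vector corresponding to each sequence matrix to a second initial training model to obtain the domain name identification result corresponding to each sequence matrix;
    adjust, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, the first weight matrix, the second weight matrix, the third weight matrix, the first bias parameter, the second bias parameter and the third bias parameter included in the first initial training model, to obtain the domain name feature analysis model;
    adjust the second initial training model, through a back propagation algorithm and using the domain name identification result corresponding to each sequence matrix, to obtain the domain name classification model.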
PCT/CN2019/087076 2018-05-21 2019-05-15 域名识别 WO2019223587A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/050,026 US20210097399A1 (en) 2018-05-21 2019-05-15 Domain name identification
JP2021510515A JP7069410B2 (ja) 2018-05-21 2019-05-15 ドメイン名の識別
EP19808429.5A EP3799398A4 (en) 2018-05-21 2019-05-15 DOMAIN NAME IDENTIFICATION

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810489709.0 2018-05-21
CN201810489709.0A CN109889616B (zh) 2018-05-21 2018-05-21 一种识别域名的方法及装置

Publications (2)

Publication Number Publication Date
WO2019223587A1 WO2019223587A1 (zh) 2019-11-28
WO2019223587A9 true WO2019223587A9 (zh) 2020-01-30

Family

ID=66924764

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087076 WO2019223587A1 (zh) 2018-05-21 2019-05-15 域名识别

Country Status (5)

Country Link
US (1) US20210097399A1 (zh)
EP (1) EP3799398A4 (zh)
JP (1) JP7069410B2 (zh)
CN (1) CN109889616B (zh)
WO (1) WO2019223587A1 (zh)

Also Published As

Publication number Publication date
CN109889616B (zh) 2020-06-05
CN109889616A (zh) 2019-06-14
WO2019223587A1 (zh) 2019-11-28
EP3799398A1 (en) 2021-03-31
EP3799398A4 (en) 2021-06-30
US20210097399A1 (en) 2021-04-01
JP7069410B2 (ja) 2022-05-17
JP2021520019A (ja) 2021-08-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19808429

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021510515

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019808429

Country of ref document: EP

Effective date: 20201221