CN110798481A - Malicious domain name detection method and device based on deep learning

Info

Publication number: CN110798481A
Application number: CN201911084930.9A
Authority: CN
Inventors: 仝哲; 范渊
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2020-02-14

Abstract

The invention provides a malicious domain name detection method and device based on deep learning, relating to the technical field of network security. The method comprises: acquiring a domain name to be detected; analyzing the domain name to be detected to obtain message information of the domain name to be detected; processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected; and inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer. The invention alleviates the technical problem that existing domain name detection methods have low accuracy in detecting whether a domain name to be detected is a malicious domain name.

Description

Malicious domain name detection method and device based on deep learning

Technical Field

The invention relates to the technical field of network security, in particular to a malicious domain name detection method and device based on deep learning.

Background

With the development of the Internet, thousands of domain names are registered every day, and detecting malicious domain names among this massive volume has become an important part of network attack detection and defense. However, the detection techniques commonly used at present rely mainly on regular expressions and whitelists, and suffer from a high false alarm rate.

No effective solution has been proposed to the above problems.

Disclosure of Invention

In view of this, the present invention provides a malicious domain name detection method and apparatus based on deep learning, so as to alleviate the technical problem that the accuracy rate of detecting whether a domain name to be detected is a malicious domain name is low in the existing domain name detection method.

In a first aspect, an embodiment of the present invention provides a malicious domain name detection method based on deep learning, including: acquiring a domain name to be detected; analyzing the domain name to be detected to obtain message information of the domain name to be detected; processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected; and inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

Further, processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected, including: segmenting the domain name to be detected to obtain a triple of the domain name to be detected; processing the triples based on the natural language processing algorithm to obtain target triples; and processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected.

Further, the feature information includes: domain name lexical feature information and domain name network feature information; processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected includes: processing the target triple based on a domain name lexical feature extraction algorithm to obtain the domain name lexical feature information; and processing the target triple based on a domain name network feature extraction algorithm to obtain the domain name network feature information.

Further, the message information includes: DNS query message information and response message information.

Further, the method further comprises constructing the deep learning model by: obtaining a plurality of sample domain names, wherein the sample domain names comprise legal domain names and malicious domain names; analyzing each sample domain name to obtain message information of each sample domain name; processing the message information of each sample domain name based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of each sample domain name; and inputting the feature information of the plurality of sample domain names into an initial deep learning model, and training the initial deep learning model to obtain the deep learning model.

In a second aspect, an embodiment of the present invention further provides a malicious domain name detection apparatus based on deep learning, including: an acquisition unit, an analysis unit, an extraction unit and a detection unit, wherein the acquisition unit is used for acquiring a domain name to be detected; the analysis unit is used for analyzing the domain name to be detected to obtain message information of the domain name to be detected; the extraction unit is used for processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain the feature information of the domain name to be detected; and the detection unit is used for inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

Further, the extraction unit is further configured to: segmenting the domain name to be detected to obtain a triple of the domain name to be detected; processing the triples based on the natural language processing algorithm to obtain target triples; and processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected.

Further, the feature information includes: domain name lexical feature information and domain name network feature information; the extraction unit is also used for processing the target triple based on a domain name lexical feature extraction algorithm to obtain the domain name lexical feature information, and for processing the target triple based on a domain name network feature extraction algorithm to obtain the domain name network feature information.

Further, the apparatus further comprises a training unit used for: obtaining a plurality of sample domain names, wherein the sample domain names comprise legal domain names and malicious domain names; analyzing each sample domain name to obtain message information of each sample domain name; processing the message information of each sample domain name based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of each sample domain name; and inputting the feature information of the plurality of sample domain names into an initial deep learning model, and training the initial deep learning model to obtain the deep learning model.

In the embodiment of the invention, a domain name to be detected is first acquired; the domain name to be detected is then analyzed to obtain its message information; the message information is then processed based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected; finally, the feature information is input into a learning model constructed based on a convolutional neural network and a fully connected layer to obtain a detection result. Because a learning model constructed in this way detects domain names with higher accuracy, detecting the domain name through this model alleviates the technical problem that existing domain name detection methods have low accuracy in detecting whether a domain name to be detected is a malicious domain name, and achieves the technical effect of improving that accuracy.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a malicious domain name detection method based on deep learning according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a deep learning model training method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a malicious domain name detection apparatus based on deep learning according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a server according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Embodiment one:

According to an embodiment of the present invention, an embodiment of a malicious domain name detection method based on deep learning is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be executed in a different order.

Fig. 1 is a flowchart of a malicious domain name detection method based on deep learning according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S102, acquiring a domain name to be detected;

Step S104, analyzing the domain name to be detected to obtain message information of the domain name to be detected;

Step S106, processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected;

Step S108, inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

It should be noted that the message information includes: DNS query message information and response message information.
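As a rough illustration of the DNS query and response message information mentioned above, the following Python sketch builds a standard query message and reads the header of a message, following the wire format defined in RFC 1035. The function names `build_query` and `parse_header` and the keys of the returned dictionary are assumptions made for this example; the embodiment does not specify how the messages are produced or parsed.

```python
import struct


def build_query(domain: str, txid: int = 0x1234) -> bytes:
    """Build a DNS query message for an A record (hypothetical helper)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet)
    return header + qname + struct.pack("!HH", 1, 1)


def parse_header(message: bytes) -> dict:
    """Read the fixed 12-byte DNS header of a query or response message."""
    txid, flags, qdcount, ancount, _nscount, _arcount = struct.unpack(
        "!HHHHHH", message[:12]
    )
    return {
        "id": txid,
        "is_response": bool(flags >> 15),  # QR bit: 0 = query, 1 = response
        "rcode": flags & 0x000F,           # response code, 0 = no error
        "question_count": qdcount,
        "answer_count": ancount,
    }
```

In practice the query bytes would be sent to a resolver and the answer section of the response would be parsed as well; both steps are omitted here to keep the sketch short.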

In this embodiment of the present invention, step S106 further includes the following steps:

Step S11, segmenting the domain name to be detected to obtain a triple of the domain name to be detected;

Step S12, processing the triples based on the natural language processing algorithm to obtain target triples;

Step S13, processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected.

In the embodiment of the present invention, it should be noted that the feature information includes: domain name lexical feature information and domain name network feature information.

After the domain name to be detected is obtained, it is segmented into triples, that is, overlapping sequences of three consecutive characters.

For example, "google.com" may be converted to <'goo', 'oog', 'ogl', 'gle', 'le.', 'e.c', '.co', 'com'>, and the triples are then vectorized using a word embedding algorithm from natural language processing.
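The segmentation into triples described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the helper names `char_trigrams`, `build_vocab` and `encode` are hypothetical, the input "google.com" is only an illustrative domain, and mapping each triple to an integer index is one common way to prepare input for a word embedding layer, not necessarily the one used in the embodiment.

```python
def char_trigrams(domain: str) -> list:
    """Split a domain name into overlapping three-character pieces."""
    return [domain[i:i + 3] for i in range(len(domain) - 2)]


def build_vocab(trigram_lists) -> dict:
    """Assign an integer index to every distinct trigram (0 is reserved for unknown)."""
    vocab = {}
    for trigrams in trigram_lists:
        for trigram in trigrams:
            vocab.setdefault(trigram, len(vocab) + 1)
    return vocab


def encode(trigrams, vocab) -> list:
    """Map trigrams to indices; an embedding table would turn these into vectors."""
    return [vocab.get(trigram, 0) for trigram in trigrams]


trigrams = char_trigrams("google.com")
vocab = build_vocab([trigrams])
indices = encode(trigrams, vocab)
```

A domain shorter than three characters yields an empty list, so a real pipeline would need padding or a fallback for such inputs.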

Then, features are extracted from the DNS query message and the response message obtained by analysis using a text feature extraction technique: a lexical feature algorithm and a network attribute feature algorithm are constructed to extract the feature information of the domain name to be detected. The domain name lexical feature information includes: the length of the domain name to be detected, the number of separators in the domain name, the proportion of digits to the total length of the domain name, the number of special characters in the domain name, the maximum length of the parts between separators, and the like. The domain name network feature information includes: the TTL (Time To Live) average value, the response type, the number of response values, and the like.
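The lexical and network features listed above could be computed along the following lines. This Python sketch makes several assumptions that the text does not state: the separator is taken to be the dot, a special character is any character that is neither alphanumeric nor a dot, and each DNS answer is represented as a dictionary with hypothetical keys `ttl`, `type` and `value`.

```python
def lexical_features(domain: str) -> dict:
    """Lexical features of a domain name (separator assumed to be '.')."""
    labels = domain.split(".")
    digits = sum(ch.isdigit() for ch in domain)
    # Assumption: "special" means neither a letter, a digit, nor the separator
    special = sum(not ch.isalnum() and ch != "." for ch in domain)
    return {
        "length": len(domain),
        "separator_count": domain.count("."),
        "digit_ratio": digits / len(domain) if domain else 0.0,
        "special_char_count": special,
        "max_label_length": max(len(label) for label in labels),
    }


def network_features(answers: list) -> dict:
    """Network features from already-parsed DNS answers (hypothetical record layout)."""
    ttls = [answer["ttl"] for answer in answers]
    return {
        "ttl_mean": sum(ttls) / len(ttls) if ttls else 0.0,
        "response_types": sorted({answer["type"] for answer in answers}),
        "response_count": len(answers),
    }


example_answers = [
    {"ttl": 300, "type": "A", "value": "192.0.2.1"},
    {"ttl": 100, "type": "A", "value": "192.0.2.2"},
]
lex = lexical_features("a1-b.example.com")
net = network_features(example_answers)
```

The addresses in `example_answers` come from the documentation range 192.0.2.0/24 and are placeholders, not real measurements.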

By extracting multiple kinds of feature information of the domain name to be detected, processing the domain name with natural language processing techniques, and using a deep learning model for detection, the accuracy of detection is improved, and the method is highly practical.
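To make the idea of a model built from a convolutional neural network and a fully connected layer more concrete, the following pure-Python sketch runs a forward pass through one 1-D convolution, a global max pooling step, and one fully connected output unit with a sigmoid. Everything specific here is an assumption: the layer sizes, the concatenation of pooled convolution outputs with the hand-crafted features, the function and parameter names, and the random untrained weights. The embodiment does not disclose the actual architecture, and a real implementation would use a deep learning framework and trained parameters.

```python
import math
import random


def conv1d_relu(sequence, filters, biases):
    """Slide each filter over the sequence of embedding vectors (no padding), then ReLU."""
    width = len(filters[0])
    feature_maps = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        row = []
        for filt, bias in zip(filters, biases):
            total = bias + sum(
                w * x
                for w_vec, x_vec in zip(filt, window)
                for w, x in zip(w_vec, x_vec)
            )
            row.append(max(0.0, total))
        feature_maps.append(row)
    return feature_maps


def global_max_pool(feature_maps):
    """Keep the strongest activation of each filter across all positions."""
    return [max(column) for column in zip(*feature_maps)]


def dense_sigmoid(inputs, weights, bias):
    """Fully connected output unit producing a score between 0 and 1."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def predict(embedded_trigrams, extra_features, params):
    """Assumed design: pooled CNN outputs are concatenated with hand-crafted features."""
    pooled = global_max_pool(
        conv1d_relu(embedded_trigrams, params["filters"], params["filter_biases"])
    )
    return dense_sigmoid(
        pooled + list(extra_features), params["dense_weights"], params["dense_bias"]
    )


def random_params(embed_dim, n_filters, width, n_extra, seed=0):
    """Untrained placeholder weights, for illustration only."""
    rng = random.Random(seed)
    return {
        "filters": [
            [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)] for _ in range(width)]
            for _ in range(n_filters)
        ],
        "filter_biases": [0.0] * n_filters,
        "dense_weights": [rng.uniform(-0.1, 0.1) for _ in range(n_filters + n_extra)],
        "dense_bias": 0.0,
    }
```

The input sequence must contain at least as many vectors as the filter width; how the resulting score is thresholded into a malicious or legal verdict is left open here.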

In the embodiment of the present invention, as shown in Fig. 2, the deep learning model is constructed by the following steps:

Step S202, obtaining a plurality of sample domain names, wherein the sample domain names comprise legal domain names and malicious domain names;

Step S204, analyzing each sample domain name to obtain message information of each sample domain name;

Step S206, processing the message information of each sample domain name based on a natural language processing algorithm and a text feature extraction algorithm to obtain the feature information of each sample domain name;

Step S208, inputting the feature information of the plurality of sample domain names into an initial deep learning model, and training the initial deep learning model to obtain the deep learning model.

In the embodiment of the invention, a sufficient number of legal domain names and malicious domain names are obtained through open source channels and screened; together they form the sample domain names, with the legal domain names used as positive samples and the malicious domain names used as negative samples.

In addition, the plurality of sample domain names can be divided into two parts: training samples and test samples. After the initial deep learning model has been trained on the training samples, the detection accuracy of the resulting deep learning model is evaluated on the test samples. If the accuracy is low, training samples are obtained again and the deep learning model is retrained until the accuracy meets the expected target, thereby improving the accuracy of detecting whether the domain name to be detected is a malicious domain name.
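The split into training and test samples and the retrain-until-accurate loop described above can be sketched as follows. This is only an outline under assumptions: `train_fn` and `get_samples` are hypothetical callables standing in for the real training routine and sample source, samples are `(domain, label)` pairs with the labels "legal" and "malicious", and the 20% test fraction, 0.95 target and round limit are arbitrary illustrative values.

```python
import random


def split_samples(samples, test_fraction=0.2, seed=0):
    """Shuffle labelled samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict_fn, labelled):
    """Fraction of (domain, label) pairs for which the prediction matches the label."""
    if not labelled:
        return 0.0
    correct = sum(predict_fn(domain) == label for domain, label in labelled)
    return correct / len(labelled)


def train_until(train_fn, get_samples, target=0.95, max_rounds=5):
    """Retrain on freshly obtained samples until test accuracy reaches the target."""
    predict_fn, acc = None, 0.0
    for round_no in range(1, max_rounds + 1):
        train, test = split_samples(get_samples(), seed=round_no)
        predict_fn = train_fn(train)   # returns a function: domain -> label
        acc = accuracy(predict_fn, test)
        if acc >= target:
            break
    return predict_fn, acc
```

The round limit is there so the loop terminates even if the target is never reached; the text itself does not state a stopping rule other than meeting the expected accuracy.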

After the plurality of sample domain names are obtained, each sample domain name is analyzed to obtain the message information of each sample domain name.

By acquiring a large number of sample domain names and analyzing each sample domain name, the DNS (Domain Name System) query message information and response message information of each sample domain name are obtained.

The initial deep learning model is trained by utilizing massive DNS query message information and response message information, so that the detection accuracy of the initial deep learning model can be effectively improved, and the accuracy of detecting whether the domain name to be detected is a malicious domain name is improved.

Finally, the feature information of the plurality of sample domain names is input into the initial deep learning model, and the initial deep learning model is trained to obtain the deep learning model.

Embodiment two:

The invention further provides an embodiment of a malicious domain name detection apparatus based on deep learning, which is used for executing the malicious domain name detection method based on deep learning provided by the foregoing embodiment of the invention.

As shown in Fig. 3, the malicious domain name detection apparatus based on deep learning includes: an acquisition unit 10, an analysis unit 20, an extraction unit 30 and a detection unit 40.

The acquisition unit 10 is configured to acquire a domain name to be detected;

the analysis unit 20 is configured to analyze the domain name to be detected to obtain message information of the domain name to be detected;

the extraction unit 30 is configured to process the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm, so as to obtain feature information of the domain name to be detected;

the detection unit 40 is configured to input the feature information into a deep learning model to obtain a detection result, where the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

Preferably, the extraction unit is further configured to: segmenting the domain name to be detected to obtain a triple of the domain name to be detected; processing the triples based on the natural language processing algorithm to obtain target triples; and processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected.

Preferably, the feature information includes: domain name lexical feature information and domain name network feature information; the extraction unit is also used for processing the target triple based on a domain name lexical feature extraction algorithm to obtain the domain name lexical feature information, and for processing the target triple based on a domain name network feature extraction algorithm to obtain the domain name network feature information.

Preferably, the message information includes: DNS query message information and response message information.

Preferably, the apparatus further comprises a training unit used for: obtaining a plurality of sample domain names, wherein the sample domain names comprise legal domain names and malicious domain names; analyzing each sample domain name to obtain message information of each sample domain name; processing the message information of each sample domain name based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of each sample domain name; and inputting the feature information of the plurality of sample domain names into an initial deep learning model, and training the initial deep learning model to obtain the deep learning model.

Referring to Fig. 4, an embodiment of the present invention further provides a server 100, including: a processor 50, a memory 51, a bus 52 and a communication interface 53, wherein the processor 50, the communication interface 53 and the memory 51 are connected through the bus 52; the processor 50 is configured to execute executable modules, such as computer programs, stored in the memory 51.

The memory 51 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 53 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like may be used.

The bus 52 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in Fig. 4, but this does not mean that there is only one bus or one type of bus.

The memory 51 is used for storing a program, and the processor 50 executes the program after receiving an execution instruction. The method defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 50, or implemented by the processor 50.

The processor 50 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 50. The processor 50 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 51, and the processor 50 reads the information in the memory 51 and completes the steps of the method in combination with its hardware.

In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included in its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A malicious domain name detection method based on deep learning is characterized by comprising the following steps:

acquiring a domain name to be detected;

analyzing the domain name to be detected to obtain message information of the domain name to be detected;

processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of the domain name to be detected;

and inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

2. The method according to claim 1, wherein processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain the feature information of the domain name to be detected comprises:

segmenting the domain name to be detected to obtain a triple of the domain name to be detected;

processing the triples based on the natural language processing algorithm to obtain target triples;

and processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected.

3. The method of claim 2, wherein the feature information comprises: domain name lexical feature information and domain name network feature information;

processing the target triple based on a text feature extraction algorithm to obtain the feature information of the domain name to be detected comprises:

processing the target triple based on a domain name lexical feature extraction algorithm to obtain domain name lexical feature information;

and processing the target triple based on a domain name network feature extraction algorithm to obtain domain name network feature information.

4. The method of claim 1, wherein the message information comprises: DNS query message information and response message information.

5. The method of claim 4, further comprising constructing the deep learning model by:

obtaining a plurality of sample domain names, wherein the sample domain names comprise legal domain names and malicious domain names;

analyzing each sample domain name to obtain message information of each sample domain name;

processing the message information of each sample domain name based on a natural language processing algorithm and a text feature extraction algorithm to obtain feature information of each sample domain name;

inputting the characteristic information of the plurality of sample domain names into an initial deep learning model, and training the initial deep learning model to obtain the deep learning model.

6. A malicious domain name detection device based on deep learning is characterized by comprising: an acquisition unit, an analysis unit, an extraction unit and a detection unit, wherein,

the acquisition unit is used for acquiring a domain name to be detected;

the analysis unit is used for analyzing the domain name to be detected to obtain message information of the domain name to be detected;

the extraction unit is used for processing the message information of the domain name to be detected based on a natural language processing algorithm and a text feature extraction algorithm to obtain the feature information of the domain name to be detected;

the detection unit is used for inputting the feature information into a deep learning model to obtain a detection result, wherein the detection result indicates whether the domain name to be detected is a malicious domain name, and the deep learning model is a learning model constructed based on a convolutional neural network and a fully connected layer.

7. The apparatus of claim 6, wherein the extraction unit is further configured to:

8. The apparatus of claim 7, wherein the feature information comprises: domain name lexical feature information and domain name network feature information;

the extraction unit is also used for processing the target triple based on a domain name lexical feature extraction algorithm to obtain domain name lexical feature information;

9. The apparatus of claim 6, wherein the message information comprises: DNS query message information and response message information.

10. The apparatus of claim 6, further comprising: a training unit to: