CN109302418B - Malicious domain name detection method and device based on deep learning

Info

Publication number: CN109302418B
Application number: CN201811361303.0A
Authority: CN
Inventors: 黄小鹏; 赵子渊; 刘欣春; 陈丽红
Original assignee: Eastcompeace Technology Co Ltd
Current assignee: Eastcompeace Technology Co Ltd
Priority date: 2018-11-15
Filing date: 2018-11-15
Publication date: 2021-11-12
Anticipated expiration: 2038-11-15
Also published as: CN109302418A

Abstract

An embodiment of the invention discloses a malicious domain name detection method and device based on deep learning. A first weak correlation characteristic is extracted for each domain name in a domain name training set, and a malicious domain name detection model is trained using the known detection results of the domain names together with their first weak correlation characteristics. The trained model can therefore still determine whether a domain name is normal or malicious from its weak correlation characteristics, even when no characteristic directly related to that result is available. This solves the technical problem that traditional malicious domain name detection generally relies on threat intelligence databases or manual analysis, cannot analyze domain names automatically, and is therefore inefficient.

Description

Malicious domain name detection method and device based on deep learning

Technical Field

The invention relates to the technical field of domain name detection, in particular to a malicious domain name detection method and device based on deep learning.

Background

The DNS (Domain Name System) is an important part of the Internet's infrastructure and is mainly responsible for translating between domain names and IP addresses. Because the DNS is open, an attacker can use malicious domain names to carry out network attacks or to control compromised hosts (bots), so detecting malicious domain names has become an important measure for network security protection.

Traditional malicious domain name detection generally relies on threat intelligence databases or on the results of manual analysis and cannot analyze domain names automatically, so it suffers from low efficiency.

Disclosure of Invention

The invention provides a malicious domain name detection method and device based on deep learning, which solve the technical problem that traditional malicious domain name detection generally relies on threat intelligence databases or manual analysis, cannot analyze domain names automatically, and is therefore inefficient.

The invention provides a malicious domain name detection method based on deep learning, which comprises the following steps:

acquiring a domain name training sample set;

acquiring a first weak correlation characteristic of each domain name in the domain name training sample set;

performing malicious domain name detection training based on deep learning by using the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set to generate a malicious domain name detection model;

and detecting whether the unknown domain name is a malicious domain name or not through the malicious domain name detection model.

Optionally, the obtaining the first weak correlation feature of each domain name in the domain name training sample set specifically includes:

obtaining an A record, an AAAA record, an MX record and an NS record of each domain name in the domain name training sample set through DNS query;

obtaining the domain name registration time, the domain name registrant, the domain name registration mailbox and the domain name registration mechanism of each domain name in the domain name training sample set through WHOIS query;

and acquiring domain name ranking information, the search engine listing number of the domain names, a WEB home page corresponding to the domain names, WEB HTTPS certificate information corresponding to the domain names, IP geographic positions corresponding to the domain names and domain name IP resolution history of each domain name in the domain name training sample set through a WEB request tool.

Optionally, after the first weak correlation feature of each domain name in the domain name training sample set is acquired, and before the malicious domain name detection training based on deep learning is performed by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set to generate the malicious domain name detection model, the method further includes:

and normalizing the first weak correlation characteristic of each domain name in the domain name training sample set, and converting the first weak correlation characteristic of each domain name in the domain name training sample set into floating point numbers in a range of [0, 1).

Optionally, the domain name training sample set includes a positive sample set and a negative sample set;

the domain names in the positive sample set are normal domain names, and the domain names in the negative sample set are malicious domain names.

Optionally, the performing malicious domain name detection training based on deep learning by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set specifically includes:

performing model training by using the domain name training sample set and the first weak correlation characteristics of each domain name in the domain name training sample set through at least one feedforward neural network model to generate at least one malicious domain name detection model;

when two or more malicious domain name detection models are generated, a domain name test sample set is obtained;

acquiring a second weak correlation characteristic of each domain name in the domain name test sample set;

respectively testing two or more generated malicious domain name detection models by utilizing the domain name test sample set and the second weak correlation characteristics of each domain name in the domain name test sample set;

respectively counting test results of the generated two or more malicious domain name detection models, wherein the test results comprise accuracy and recall rate;

determining an optimal malicious domain name detection model in two or more malicious domain name detection models according to the test result;

correspondingly, the detecting, by the malicious domain name detection model, whether the unknown domain name is a malicious domain name specifically includes:

and detecting whether the unknown domain name is a malicious domain name or not through the optimal malicious domain name detection model.

The invention provides a malicious domain name detection device based on deep learning, which comprises:

the first acquisition unit is used for acquiring a domain name training sample set;

the second acquisition unit is used for acquiring a first weak correlation characteristic of each domain name in the domain name training sample set;

the training unit is used for carrying out malicious domain name detection training based on deep learning by utilizing the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set to generate a malicious domain name detection model;

and the detection unit is used for detecting whether the unknown domain name is a malicious domain name or not through the malicious domain name detection model.

Optionally, the second obtaining unit includes:

the first acquisition subunit is used for acquiring an A record, an AAAA record, an MX record and an NS record of each domain name in the domain name training sample set through DNS query;

the second acquisition subunit is used for acquiring the domain name registration time, the domain name registrant, the domain name registration mailbox and the domain name registration mechanism of each domain name in the domain name training sample set through WHOIS query;

and the third acquiring subunit is configured to acquire, by using a WEB request tool, domain name ranking information, search engine entry number of the domain names, a WEB home page corresponding to the domain names, WEB HTTPS certificate information corresponding to the domain names, IP geographical positions corresponding to the domain names, and domain name IP resolution history of each domain name in the domain name training sample set.

Optionally, the device further comprises:

a preprocessing unit, configured to perform normalization processing on the first weak correlation feature of each domain name in the domain name training sample set, and convert the first weak correlation feature of each domain name in the domain name training sample set into a floating point number in a [0,1) range.

Optionally, the training unit comprises:

the training subunit is used for performing model training by adopting the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set through at least one feedforward neural network model to generate at least one malicious domain name detection model;

the fourth acquisition subunit is used for acquiring a domain name test sample set when two or more malicious domain name detection models are generated;

a fifth obtaining subunit, configured to obtain a second weak correlation feature of each domain name in the domain name test sample set;

the testing subunit is used for respectively testing the two or more generated malicious domain name detection models by utilizing the domain name testing sample set and the second weak correlation characteristics of each domain name in the domain name testing sample set;

the statistical subunit is used for respectively counting the test results of the generated two or more malicious domain name detection models, and the test results comprise accuracy and recall rate;

the determining subunit is used for determining an optimal malicious domain name detection model in the two or more malicious domain name detection models according to the test result;

correspondingly, the detection unit is further configured to detect whether the unknown domain name is a malicious domain name through the optimal malicious domain name detection model.

According to the technical scheme, the invention has the following advantages:

In the method, the first weak correlation characteristic of each domain name in the domain name training set is extracted, and the malicious domain name detection model is trained using the known detection results of the domain names together with their first weak correlation characteristics. The trained model can therefore still determine whether a domain name is normal or malicious from its first weak correlation characteristic, even when no characteristic directly related to that result is available. This solves the technical problem that traditional malicious domain name detection generally relies on threat intelligence databases or manual analysis, cannot analyze domain names automatically, and is therefore inefficient.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of an embodiment of a malicious domain name detection method based on deep learning according to the present invention;

Fig. 2 is a schematic flowchart of another embodiment of a malicious domain name detection method based on deep learning according to the present invention;

Fig. 3 is a schematic structural diagram of an embodiment of a malicious domain name detection apparatus based on deep learning according to the present invention;

Fig. 4 is a schematic structural diagram of another embodiment of a malicious domain name detection apparatus based on deep learning according to the present invention.

Detailed Description

The embodiments of the invention provide a malicious domain name detection method and device based on deep learning, which solve the technical problem that traditional malicious domain name detection generally relies on threat intelligence databases or manual analysis, cannot analyze domain names automatically, and is therefore inefficient.

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, an embodiment of a malicious domain name detection method based on deep learning according to the present invention includes:

101. Acquiring a domain name training sample set;

it should be noted that, first, a domain name training sample set is required to be obtained, where the domain name training sample set includes various feature information of a domain name with known normal or malicious results, the malicious domain name is used as a negative sample in the training sample set, and the normal domain name is used as a positive sample in the training sample set.

102. Acquiring a first weak correlation characteristic of each domain name in a domain name training sample set;

it should be noted that a weakly correlated feature is one that does not by itself clearly determine the result, i.e. it has no direct causal link to whether the domain name is normal or malicious. Specifically, weakly correlated features include, but are not limited to: the domain name registration time, the domain name registrant, the domain name registration mailbox and the registration mechanism; the IP address in the A record obtained by resolving the domain name, the history of domain name resolution, the certificate of the website corresponding to the domain name, the record corresponding to the domain name, the MX record of the domain name, the web home page of the domain name, the suffix of the domain name, the global ranking of the domain name, the number of entries for the domain name in search engines, the geographic location of the IP corresponding to the domain name, and the like.

After the domain name training sample set is obtained, the first weak correlation characteristic of each domain name in the domain name training sample set is collected.

103. Carrying out malicious domain name detection training based on deep learning by utilizing the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set to generate a malicious domain name detection model;

it should be noted that the malicious domain name detection training based on deep learning is performed by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set, so as to generate a malicious domain name detection model, wherein the training model is a deep learning model.

104. Detecting whether the unknown domain name is a malicious domain name or not through a malicious domain name detection model;

it should be noted that, after the malicious domain name detection model is generated through training, the unknown domain name is input into the malicious domain name detection model as an input quantity, and a detection result of the unknown domain name is obtained.

In this embodiment of the invention, the first weak correlation characteristic of each domain name in the domain name training set is extracted, and the malicious domain name detection model is trained using the known detection results of the domain names together with their first weak correlation characteristics. The trained model can therefore still determine whether a domain name is normal or malicious from its first weak correlation characteristic, even when no characteristic directly related to that result is available. This solves the technical problem that traditional malicious domain name detection generally relies on threat intelligence databases or manual analysis, cannot analyze domain names automatically, and is therefore inefficient.
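The flow of steps 101 to 104 can be sketched as a small Python skeleton. This is only an illustration under stated assumptions: `LabeledDomain`, `run_pipeline`, the `extract` and `train` callables and the 0.5 decision threshold are hypothetical names and values introduced here, and are not specified by the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Features = List[float]


@dataclass
class LabeledDomain:
    """A training sample with a known result (hypothetical structure)."""
    name: str
    malicious: bool  # True = negative sample (malicious), False = positive sample (normal)


def run_pipeline(
    training_set: Sequence[LabeledDomain],
    extract: Callable[[str], Features],
    train: Callable[[List[Features], List[int]], Callable[[Features], float]],
    unknown_domain: str,
    threshold: float = 0.5,  # assumed cut-off, not given in the text
) -> bool:
    """Return True if the unknown domain is judged malicious."""
    # Step 102: collect the weakly correlated features of every training domain.
    xs = [extract(d.name) for d in training_set]
    ys = [1 if d.malicious else 0 for d in training_set]
    # Step 103: train a model; `train` returns a function that scores a feature vector.
    score = train(xs, ys)
    # Step 104: extract the same features for the unknown domain and score it.
    return score(extract(unknown_domain)) >= threshold
```

Keeping feature extraction and training as injected callables means the same skeleton works whichever lookup tools and model type are chosen.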

The above is a description of an embodiment of the malicious domain name detection method based on deep learning provided by the present invention, and another embodiment of the malicious domain name detection method based on deep learning provided by the present invention will be described below.

Referring to fig. 2, another embodiment of a malicious domain name detection method based on deep learning according to the present invention includes:

201. Acquiring a domain name training sample set;

202. Obtaining an A record, an AAAA record, an MX record and an NS record of each domain name in a domain name training sample set through DNS query;

it should be noted that, first, the A record, the AAAA record, the MX record, and the NS record of each domain name in the domain name training sample set can be obtained through DNS query.

203. Obtaining the domain name registration time, domain name registrant, domain name registration mailbox and domain name registration mechanism of each domain name in a domain name training sample set through WHOIS query;

it should be noted that, in the second aspect, the domain name registration time, the domain name registrant, the domain name registration mailbox, and the domain name registration mechanism of each domain name in the domain name training sample set may be obtained through WHOIS query.

204. Through a WEB request tool, acquiring domain name ranking information, search engine listing number of domain names, WEB home pages corresponding to the domain names, WEB HTTPS certificate information corresponding to the domain names, IP geographic positions corresponding to the domain names and domain name IP resolution history of each domain name in a domain name training sample set;

in the third aspect, the domain name ranking information, the search engine listing number of the domain names, the WEB home page corresponding to the domain names, the WEB HTTPS certificate information corresponding to the domain names, the IP geographical positions corresponding to the domain names, and the domain name IP resolution history of each domain name in the domain name training sample set may also be obtained by using a WEB request tool.
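Once the DNS, WHOIS and web lookups of steps 202 to 204 have returned, their results need to be turned into numbers before normalization. The sketch below assumes the lookups have already been performed by some external tool and only shows one possible way to flatten them. The `RawDomainInfo` container, its field names, the `-1` marker for missing values and the CRC32 bucketing of the registrant string are all assumptions made for illustration, not details given in the text.

```python
import zlib
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RawDomainInfo:
    """Hypothetical container for lookup results; all field names are illustrative."""
    a_records: List[str] = field(default_factory=list)
    aaaa_records: List[str] = field(default_factory=list)
    mx_records: List[str] = field(default_factory=list)
    ns_records: List[str] = field(default_factory=list)
    registered_on: Optional[date] = None   # from WHOIS
    registrant: str = ""                   # from WHOIS
    rank: Optional[int] = None             # domain ranking, if listed
    search_engine_entries: int = 0
    has_https_certificate: bool = False
    historical_ips: List[str] = field(default_factory=list)


def bucket(text: str, buckets: int = 1000) -> int:
    """Map a categorical string to a stable integer bucket using CRC32."""
    return zlib.crc32(text.strip().lower().encode("utf-8")) % buckets


def to_raw_features(info: RawDomainInfo, today: date) -> List[float]:
    """Flatten the lookup results into numbers; -1 marks a missing value."""
    age_days = (today - info.registered_on).days if info.registered_on else -1
    return [
        float(len(info.a_records)),
        float(len(info.aaaa_records)),
        float(len(info.mx_records)),
        float(len(info.ns_records)),
        float(age_days),
        float(bucket(info.registrant)),
        float(info.rank if info.rank is not None else -1),
        float(info.search_engine_entries),
        1.0 if info.has_https_certificate else 0.0,
        float(len(set(info.historical_ips))),  # distinct IPs in the resolution history
    ]
```

CRC32 is used rather than Python's built-in `hash` because the latter is salted per process for strings, which would make feature values differ between training and detection runs.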

205. Normalizing the first weak correlation characteristic of each domain name in the domain name training sample set, and converting the first weak correlation characteristic of each domain name in the domain name training sample set into a floating point number in a range of [0, 1);

it should be noted that after the first weak correlation feature of each domain name in the domain name training sample set is obtained, it needs to be normalized and converted into a floating point number in the range [0, 1). In practice this is done as follows:

for first weak correlation characteristics whose values vary within a small range, such as the A records, AAAA records, MX records, NS records, domain name registrars and the like of the domain name, normalization can be performed by linear scaling;

for first weak correlation characteristics whose values vary widely, such as the domain name registration time, the domain name ranking information and the number of search engine entries for the domain name, normalization can be performed by Z-score scaling.
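The two normalization modes of step 205 can be sketched as follows. Linear scaling is implemented here as min-max scaling with the span widened very slightly so that the maximum stays below 1. A Z-score on its own is unbounded, and the text does not say how it is brought into [0, 1); the logistic squash used below is an assumption of this sketch, as are the function names and the clamp at ±30.

```python
import math
from typing import List, Sequence


def linear_scale(values: Sequence[float]) -> List[float]:
    """Min-max scaling into [0, 1); the largest value maps just below 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    span = (hi - lo) * (1.0 + 1e-9)  # widen slightly so the maximum stays below 1
    return [(v - lo) / span for v in values]


def z_score_scale(values: Sequence[float]) -> List[float]:
    """Z-score standardisation followed by a logistic squash into (0, 1)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.5 for _ in values]
    scaled = []
    for v in values:
        z = max(-30.0, min(30.0, (v - mean) / std))  # clamp to avoid overflow in exp
        scaled.append(1.0 / (1.0 + math.exp(-z)))
    return scaled
```

For example, `linear_scale` suits record counts that stay within a handful of values, while `z_score_scale` keeps a few very large rankings or entry counts from compressing all other values toward zero.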

206. Performing model training by adopting a domain name training sample set and a first weak correlation characteristic of each domain name in the domain name training sample set through at least one feedforward neural network model to generate at least one malicious domain name detection model;

it should be noted that, the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set are used to perform malicious domain name detection training based on deep learning, so as to generate at least one malicious domain name detection model, where the training model may be a KNN model, a convolutional neural network model, or the like.
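As an illustration of step 206, the following is a minimal feedforward network written with the Python standard library only. It is a sketch rather than the model used in the embodiment: the single hidden layer, the sigmoid activations, the log-loss objective, plain stochastic gradient descent and every hyperparameter (hidden size, learning rate, epochs, seed) are assumptions chosen to keep the example short. The label convention `1 = malicious` is also an assumption.

```python
import math
import random
from typing import List, Sequence


def _sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


class TinyFeedforwardNet:
    """One hidden layer, sigmoid activations, trained by plain SGD on log loss."""

    def __init__(self, n_inputs: int, n_hidden: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x: Sequence[float]) -> List[float]:
        return [_sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict_proba(self, x: Sequence[float]) -> float:
        """Estimated probability that the feature vector belongs to a malicious domain."""
        h = self._hidden(x)
        return _sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)

    def fit(self, xs: Sequence[Sequence[float]], ys: Sequence[int],
            epochs: int = 2000, lr: float = 0.5) -> None:
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h = self._hidden(x)
                p = _sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)
                d_out = p - y  # gradient of log loss w.r.t. the output pre-activation
                for j, hj in enumerate(h):
                    # Back-propagate through hidden unit j before changing w2[j].
                    d_hid = d_out * self.w2[j] * hj * (1.0 - hj)
                    self.w2[j] -= lr * d_out * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hid * xi
                    self.b1[j] -= lr * d_hid
                self.b2 -= lr * d_out
```

Training several instances with different hidden sizes or seeds is one simple way to obtain the "two or more" candidate models that the later steps compare.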

207. When two or more malicious domain name detection models are generated, a domain name test sample set is obtained;

it should be noted that when two or more malicious domain name detection models are generated, their efficiency and success rate need to be compared. A domain name test sample set is therefore obtained, and the two or more malicious domain name detection models are tested with it.

It is understood that the domain name test sample set may be a subset of samples extracted from the domain name training sample set, or may be a separate batch of samples including positive samples and negative samples.
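One way to extract a test subset from the labelled samples is a stratified holdout, sketched below. The text only says that the test set may be drawn from the training set; removing the drawn samples from training so that models are not evaluated on data they have seen, splitting per label to keep both positive and negative samples in the test set, and the 20% fraction are all assumptions of this sketch. `stratified_holdout` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def stratified_holdout(
    samples: Sequence[Tuple[T, int]],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Tuple[T, int]], List[Tuple[T, int]]]:
    """Split (item, label) pairs into train and test lists, label by label."""
    rng = random.Random(seed)
    train: List[Tuple[T, int]] = []
    test: List[Tuple[T, int]] = []
    for label in sorted({y for _, y in samples}):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        # Hold out at least one sample per label when the group has more than one.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```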

208. Acquiring a second weak correlation characteristic of each domain name in a domain name test sample set;

it should be noted that, similarly, after the domain name test sample set is obtained, the second weakly correlated feature of each domain name in the domain name test sample set is obtained.

209. Respectively testing the generated two or more malicious domain name detection models by utilizing the domain name test sample set and the second weak correlation characteristics of each domain name in the domain name test sample set;

it should be noted that, the generated two or more malicious domain name detection models are respectively tested by using the domain name test sample set and the second weak correlation characteristic of each domain name in the domain name test sample set, so as to obtain a test result.

210. Respectively counting test results of the generated two or more malicious domain name detection models, wherein the test results comprise accuracy and recall rate;

it should be noted that after the test results of each domain name for two or more malicious domain name detection models are obtained, the accuracy and the recall rate of the two or more malicious domain name detection models are respectively counted.

211. Determining an optimal malicious domain name detection model in two or more malicious domain name detection models according to the test result;

it should be noted that the malicious domain name detection model with the highest accuracy and recall rate in the two or more malicious domain name detection models is selected as the optimal malicious domain name detection model.
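Steps 210 and 211 can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: recall is computed for the malicious class (label `1` here), and when no single model is best on both metrics the models are ranked by the mean of accuracy and recall, with recall as the tie-breaker. A model that is highest on both metrics is still selected under this rule.

```python
from typing import Dict, Sequence, Tuple


def accuracy_and_recall(
    y_true: Sequence[int], y_pred: Sequence[int], target: int = 1
) -> Tuple[float, float]:
    """Return (accuracy, recall of the `target` class) for one model's predictions."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    actual = sum(1 for t in y_true if t == target)
    accuracy = correct / len(pairs) if pairs else 0.0
    recall = true_pos / actual if actual else 0.0
    return accuracy, recall


def pick_best(results: Dict[str, Tuple[float, float]]) -> str:
    """Pick the model name with the highest mean of accuracy and recall."""
    return max(results, key=lambda name: (sum(results[name]) / 2.0, results[name][1]))
```

For instance, with `results = {"a": (0.75, 0.33), "b": (0.875, 1.0)}`, model `"b"` is higher on both metrics and is the one returned.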

212. Detecting whether the unknown domain name is a malicious domain name or not through an optimal malicious domain name detection model;

it should be noted that, after the optimal malicious domain name detection model is determined, the unknown domain name is input into the optimal malicious domain name detection model as an input quantity, and a detection result of the unknown domain name is obtained.

The above is a description of another embodiment of the malicious domain name detection method based on deep learning provided by the present invention, and an embodiment of the malicious domain name detection device based on deep learning provided by the present invention will be described below.

Referring to fig. 3, an embodiment of a malicious domain name detection apparatus based on deep learning according to the present invention includes:

a first obtaining unit 301, configured to obtain a domain name training sample set;

a second obtaining unit 302, configured to obtain a first weak correlation feature of each domain name in a domain name training sample set;

the training unit 303 is configured to perform malicious domain name detection training based on deep learning by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set, and generate a malicious domain name detection model;

a detecting unit 304, configured to detect whether the unknown domain name is a malicious domain name through a malicious domain name detection model.

The above is a description of an embodiment of a malicious domain name detection device based on deep learning provided by the present invention, and another embodiment of a malicious domain name detection device based on deep learning provided by the present invention will be described below.

Referring to fig. 4, another embodiment of a malicious domain name detection apparatus based on deep learning according to the present invention includes:

a first obtaining unit 401, configured to obtain a domain name training sample set;

a second obtaining unit 402, configured to obtain a first weak correlation feature of each domain name in a domain name training sample set;

the second acquisition unit 402 includes:

the first obtaining subunit 4021 is configured to obtain an A record, an AAAA record, an MX record, and an NS record of each domain name in the domain name training sample set through DNS query;

the second obtaining subunit 4022 is configured to obtain, through WHOIS query, domain name registration time, a domain name registrant, a domain name registration mailbox, and a domain name registration mechanism for each domain name in the domain name training sample set;

a third obtaining subunit 4023, configured to obtain, by using a WEB request tool, domain name ranking information, search engine entry number of domain names, a WEB home page corresponding to a domain name, WEB HTTPS certificate information corresponding to a domain name, a geographic location of an IP corresponding to a domain name, and a domain name IP resolution history of each domain name in a domain name training sample set;

a preprocessing unit 403, configured to perform normalization processing on the first weak correlation feature of each domain name in the domain name training sample set, and convert the first weak correlation feature of each domain name in the domain name training sample set into a floating point number in a range of [0, 1);

a training unit 404, configured to perform malicious domain name detection training based on deep learning by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set, and generate a malicious domain name detection model;

the training unit 404 includes:

the training subunit 4041 is configured to perform model training by using a domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set through at least one feedforward neural network model, and generate at least one malicious domain name detection model;

a fourth obtaining subunit 4042, configured to obtain a domain name test sample set when two or more malicious domain name detection models are generated;

a fifth obtaining subunit 4043, configured to obtain a second weak correlation feature of each domain name in the domain name test sample set;

the testing subunit 4044 is configured to respectively test the two or more generated malicious domain name detection models by using the domain name testing sample set and the second weak correlation feature of each domain name in the domain name testing sample set;

a statistics subunit 4045, configured to separately count test results of the generated two or more malicious domain name detection models, where the test results include accuracy and recall rate;

the determining subunit 4046 is configured to determine, according to the test result, an optimal malicious domain name detection model in the two or more malicious domain name detection models;

the detecting unit 405 is configured to detect whether the unknown domain name is a malicious domain name through an optimal malicious domain name detection model.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A malicious domain name detection method based on deep learning is characterized by comprising the following steps:

acquiring a domain name training sample set;

detecting whether the unknown domain name is a malicious domain name or not through the malicious domain name detection model;

the malicious domain name detection training based on deep learning is performed by using the domain name training sample set and the weak correlation characteristics of each domain name in the domain name training sample set, and the generation of the malicious domain name detection model specifically comprises the following steps:

performing model training by adopting the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set through at least one feedforward neural network model to generate at least two malicious domain name detection models;

2. The malicious domain name detection method based on deep learning according to claim 1, wherein the obtaining of the first weakly-correlated feature of each domain name in the domain name training sample set specifically comprises:

3. The method according to claim 2, wherein after the first weak correlation feature of each domain name in the domain name training sample set is obtained, the malicious domain name detection training based on deep learning is performed by using the domain name training sample set and the first weak correlation feature of each domain name in the domain name training sample set, and before the malicious domain name detection model is generated, the method further comprises:

4. The malicious domain name detection method based on deep learning of claim 1, wherein the domain name training sample set comprises a positive sample set and a negative sample set;

5. A malicious domain name detection device based on deep learning is characterized by comprising:

the detection unit is used for detecting whether the unknown domain name is a malicious domain name or not through the malicious domain name detection model;

the training unit includes:

the training subunit is used for performing model training by adopting the domain name training sample set and the first weak correlation characteristic of each domain name in the domain name training sample set through at least one feedforward neural network model to generate at least two malicious domain name detection models;

6. The apparatus according to claim 5, wherein the second obtaining unit comprises:

7. The apparatus for detecting malicious domain name based on deep learning according to claim 6, further comprising:

8. The deep learning based malicious domain name detection device according to claim 5, wherein the domain name training sample set comprises a positive sample set and a negative sample set;