CN112966713A - DGA domain name detection method and device based on deep learning and computer equipment - Google Patents

DGA domain name detection method and device based on deep learning and computer equipment

Info

Publication number
CN112966713A
Authority
CN
China
Prior art keywords
domain name
name data
training
detected
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110142074.9A
Other languages
Chinese (zh)
Other versions
CN112966713B (en)
Inventor
刘晶
范渊
黄进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dbappsecurity Technology Co Ltd filed Critical Hangzhou Dbappsecurity Technology Co Ltd
Priority to CN202110142074.9A priority Critical patent/CN112966713B/en
Publication of CN112966713A publication Critical patent/CN112966713A/en
Application granted granted Critical
Publication of CN112966713B publication Critical patent/CN112966713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00Network arrangements, protocols or services for addressing or naming
    • H04L61/45Network directories; Name-to-address mapping
    • H04L61/4505Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a DGA domain name detection method based on deep learning. The method comprises the following steps: acquiring domain name data to be detected; preprocessing the domain name data to be detected to obtain discretized domain name data to be detected, wherein the preprocessing at least comprises discrete vectorization of the domain name data to be detected; and inputting the discretized domain name data to be detected into a trained neural network model and outputting a classification result. Because the domain name data are preprocessed before being input into the neural network model, the method addresses the low accuracy and high false alarm rate of existing DGA domain name detection: it reduces the false alarm rate and improves the detection accuracy.

Description

DGA domain name detection method and device based on deep learning and computer equipment
Technical Field
The present application relates to the field of domain name detection, and in particular, to a method, an apparatus, and a computer device for detecting a DGA domain name based on deep learning.
Background
At present, DGA (domain generation algorithm) domain names are widely used in botnets, and DGA-based network attacks are increasingly common. A host infected with DGA malicious code can periodically generate a large number of domain names, which makes traditional blacklist interception ineffective. A botnet infects a large number of hosts with bot programs, thereby forming a one-to-many control network between the controller and the infected hosts. Hackers can use a botnet to launch large-scale network attacks such as distributed denial of service (DDoS) attacks and massive spam campaigns, or to steal confidential information and personal private data, which causes great harm to the information security of organizations and individuals. Many schemes for DGA domain name detection have been proposed, but the detection models are still imperfect, so detection accuracy is low and the false alarm rate is high.
No effective solution has yet been proposed for the problems of low domain name detection accuracy and high false alarm rate.
Disclosure of Invention
The embodiment of the application provides a DGA domain name detection method and device based on deep learning and computer equipment, and aims to at least solve the problems of low domain name detection accuracy and high false alarm rate in the related technology.
In a first aspect, an embodiment of the present application provides a DGA domain name detection method based on deep learning, including: acquiring domain name data to be detected; preprocessing the domain name data to be detected to obtain discretized domain name data to be detected, wherein the preprocessing at least comprises discrete vectorization of the domain name data to be detected; and inputting the discretized domain name data to be detected into a trained neural network model and outputting a classification result.
In some embodiments, preprocessing the domain name data to be detected to obtain discretized domain name data to be detected includes: acquiring a preset vectorization format; and converting the domain name data to be detected into discretized domain name data to be detected based on the vectorization format.
In some embodiments, obtaining the preset vectorization format includes: acquiring first training domain name data; converting the first training domain name data into a first training domain name discrete vector based on the character features of the first training domain name data, wherein the character features at least comprise letters, numbers and special characters, and the first training domain name discrete vector is composed of the letters and the numbers; acquiring a mapping relation based on the first training domain name data and the first training domain name discrete vector; and acquiring the vectorization format based on the mapping relation.
In some embodiments, before the discretized domain name data to be detected is input into the trained neural network model, the method further includes: acquiring second training domain name data and a training classification result of each piece of second training domain name data, wherein the second training domain name data is a known domain name data set comprising normal domain names and DGA domain names, and the training classification result is the classification result of the second training domain name data; preprocessing the second training domain name data to obtain discretized second training vector domain name data; establishing a training set from the discretized second training vector domain name data and the classification result of the second training vector domain name data; and training the neural network model based on the training set to obtain the trained neural network model.
In some embodiments, training the neural network model based on the training set to obtain the trained neural network model includes: acquiring third training domain name data and a training classification result of each piece of third training domain name data, wherein the third training domain name data is a known domain name data set comprising normal domain names and DGA domain names, and the training classification result is the classification result of the third training domain name data; preprocessing the third training domain name data to obtain discretized third training vector domain name data; establishing a verification set from the discretized third training vector domain name data and the classification result of the third training vector domain name data; and verifying the trained neural network model based on the verification set to determine whether the trained neural network model meets a preset condition.
In some embodiments, preprocessing the domain name data to be detected to obtain discretized domain name data to be detected further includes: padding the discretized domain name data to be detected to a preset vector dimension by using a padding function.
In some embodiments, after determining that the domain name data is the normal domain name or the DGA domain name according to the classification result, the method further includes: and storing the judgment result into a memory.
In a second aspect, an embodiment of the present application provides a domain name detection apparatus, including: the domain name acquisition module is used for acquiring domain name data to be detected; the preprocessing module is used for preprocessing the domain name data to be detected into discrete vector data, wherein the preprocessing at least comprises the discrete vectorization of the domain name data; and the domain name judging module is used for inputting the discrete domain name data to be detected into the trained neural network model and outputting a classification result.
In a third aspect, embodiments of the present application provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the deep learning based DGA domain name detection method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the deep learning based DGA domain name detection method according to the first aspect.
Compared with the related art, the DGA domain name detection method based on deep learning provided by the embodiment of the application has the advantages that the domain name data to be detected are subjected to discrete vectorization firstly, and then the data subjected to the discrete vectorization are input into the trained neural network model, so that the classification result of the domain name detection is output, the false alarm rate of the domain name detection can be reduced, and the domain name detection accuracy rate is improved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of a DGA domain name detection method based on deep learning in one embodiment of the present application;
FIG. 2 is a schematic flow chart of a neural network model training method based on a deep learning DGA domain name detection method in an embodiment of the present application;
FIG. 3 is a block diagram of a DGA domain name detection device based on deep learning according to an embodiment of the present application;
FIG. 4 is a schematic hardware configuration diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words used throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof as used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like used in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first," "second," "third," and the like used herein merely distinguish similar objects and do not denote a particular ordering of the objects.
Currently disclosed DGA domain name detection methods mainly include a deep learning model combining word2vec and LSTM, a model combining n-gram and linguistic statistical features, a deep learning model combining a dictionary and LSTM, and other models. However, these methods have a high false alarm rate when used for detection in a real environment, and their actual performance falls far short of the metrics, such as accuracy and recall above 99%, reported in the model training and test stages.
In sequence-to-sequence (seq2seq) problems in natural language processing, when no attention mechanism is used, the last hidden state of the encoder is responsible for representing the whole sentence. This architecture therefore has an information bottleneck: all information about the sentence must be captured by the vector of the last hidden state, because that vector is the only input to the decoder. If some information from the original sentence is missing from that vector, the decoder cannot translate the sentence correctly. The attention mechanism addresses this bottleneck. Its main idea is to relate each hidden state of the decoder to the hidden states of the encoder: attention scores are computed by dot product, an attention distribution is obtained by applying the softmax activation function to the scores, and the attention output is the weighted sum of the encoder hidden states under that distribution. The attention mechanism has been a major breakthrough in deep learning in recent years, and its basic principle derives from fundamental research on human cognition. It has the following main advantages: 1) it significantly improves model performance; 2) it resolves the information bottleneck problem; 3) it helps mitigate the vanishing gradient problem; 4) it provides some model interpretability. The DGA domain name detection method and device based on the deep learning attention mechanism aim to greatly improve DGA domain name detection accuracy and reduce the false alarm rate.
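The three attention quantities described above (score, distribution, output) can be illustrated with a minimal NumPy sketch. The function name, the toy vectors and the two-dimensional hidden size are assumptions made for illustration only; the application does not give concrete code.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (attention output, attention distribution).

    decoder_state:  shape (d,)   one decoder hidden state (the query)
    encoder_states: shape (T, d) all encoder hidden states
    """
    # Attention scores: dot product of the query with every encoder state.
    scores = encoder_states @ decoder_state
    # Attention distribution: numerically stable softmax over the scores.
    shifted = scores - scores.max()
    weights = np.exp(shifted) / np.exp(shifted).sum()
    # Attention output: weighted sum of the encoder hidden states.
    output = weights @ encoder_states
    return output, weights

# Toy example (hypothetical values).
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
output, weights = dot_product_attention(decoder_state, encoder_states)
```

In this toy input the first and third encoder states have the same dot product with the query, so they receive equal weight, and the second state receives the least.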
Referring to FIG. 1, a schematic flowchart of a DGA domain name detection method based on deep learning according to an embodiment of the present application is shown.
In this embodiment, the DGA domain name detection method based on deep learning includes:
S101, acquiring domain name data to be detected.
The domain name data to be detected is a set of domain names that need to be detected and classified, and at least comprises normal domain name data and DGA domain name data. The purpose of the method is to distinguish normal domain name data from DGA domain name data.
S102, preprocessing the domain name data to be detected to obtain discretized domain name data to be detected, wherein the preprocessing at least comprises discrete vectorization of the domain name data to be detected.
It can be understood that, in this embodiment, the domain name data must be converted into discretized domain name data before detection, and this discretized data is obtained through the preprocessing.
S103, inputting the discretized domain name data to be detected into the trained neural network model, and outputting a classification result.
In this embodiment, the neural network model is a domain name detection model based on a deep learning attention mechanism; inputting the discretized domain name data to be detected into the model yields the domain name classification result.
In the DGA domain name detection method based on deep learning described above, domain name data to be detected is first acquired; the domain name data to be detected is then preprocessed to obtain discretized domain name data to be detected, wherein the preprocessing at least comprises discrete vectorization of the domain name data to be detected; and the discretized domain name data to be detected is input into the trained neural network model, which outputs a classification result. This method effectively improves DGA domain name detection accuracy and reduces the false alarm rate.
In another embodiment, the preprocessing the domain name data to be detected to obtain discretized domain name data to be detected includes: acquiring a preset vectorization format; converting the domain name data to be detected into discretized domain name data to be detected based on the vectorization format. In this embodiment, preprocessing the domain name data to be detected requires obtaining a vectorization format, where the vectorization format can convert the domain name data to be detected into discretized domain name data to be detected.
In another embodiment, obtaining the preset vectorization format includes: acquiring first training domain name data; converting the first training domain name data into a first training domain name discrete vector based on character features of the first training domain name data, wherein the character features at least comprise letters, numbers and special characters, and the first training domain name discrete vector is composed of the letters and the numbers; acquiring a mapping relation based on the first training domain name data and the first training domain name discrete vector; and acquiring a vectorization format based on the mapping relation.
In this embodiment, a vectorization format is obtained from the first training domain name data, and domain name data is converted into discrete vectors through this format. To obtain the vectorization format, a function is first written that maps each letter, number and special symbol appearing in the first training domain name data, forming a dictionary that contains only letters and numbers and corresponds to the character features of the first training domain name data; this dictionary is the vectorization format. In other words, the function derives a first training domain name discrete vector from the first training domain name data, the mapping relationship between the first training domain name data and the first training domain name discrete vector is obtained, and the vectorization format is obtained from that mapping relationship. The vectorization format is stored so that it can be used for discrete vectorization of other domain name data.
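One possible reading of this dictionary-based vectorization is a character-to-integer mapping built from the training domain names. The sketch below follows that reading; the function names, the sorted ordering, and the use of index 0 for characters not seen during training are assumptions for illustration, not details stated in the application.

```python
def build_char_dictionary(domains):
    """Map every distinct character in the training domains to an integer.

    Indices start at 1 so that 0 stays free (assumption: 0 is used here
    for unknown characters and could also serve as a padding value).
    """
    chars = sorted({ch for domain in domains for ch in domain})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}

def encode_domain(domain, char_to_index, unknown_index=0):
    """Convert one domain name into a discrete integer vector."""
    return [char_to_index.get(ch, unknown_index) for ch in domain]

# Hypothetical training domains, for illustration only.
training_domains = ["example.com", "a1b2-c3.net"]
char_to_index = build_char_dictionary(training_domains)
encoded = encode_domain("abc.com", char_to_index)
```

Storing `char_to_index` (for example as JSON) and reusing it at detection time keeps the encoding of training and detection data identical, which is the purpose the text gives for saving the vectorization format.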
In one embodiment, before the discretized domain name data to be detected is input into the trained neural network model, the method includes: acquiring second training domain name data and a training classification result of each piece of second training domain name data, wherein the second training domain name data is a domain name data set with known classification results and comprises normal domain names and DGA domain names, and the training classification result is the classification result of the second training domain name data; preprocessing the second training domain name data to obtain discretized second training vector domain name data; establishing a training set from the discretized second training vector domain name data and the classification result of the second training vector domain name data; and training the neural network model based on the training set to obtain the trained neural network model.
It is easy to understand that the constructed neural network model needs to be trained before it is put into use. First, the second training domain name data and the corresponding training classification results, i.e., whether each domain name is a normal domain name or a DGA domain name, are obtained. Then, the second training domain name data is preprocessed into discretized second training vector domain name data. The preprocessing ensures that the data format of the domain name data input during training is the same as that of the domain name data during actual detection, which makes the detection result more accurate. Next, the discretized second training vector domain name data and its training classification results are assembled into a training set, the discretized data is input into the neural network model, and the parameters of the model are adjusted based on the training classification results, finally yielding the trained neural network model. With the method of this embodiment, more high-dimensional vector data can be input at one time on the same hardware, and when the model is tested in a real environment its false alarm rate drops from roughly 20% to 40% down to within 3%.
In another embodiment, training the neural network model based on the training set to obtain the trained neural network model includes: acquiring third training domain name data and a training classification result of each piece of third training domain name data, wherein the third training domain name data is a known domain name data set comprising normal domain names and DGA domain names, and the training classification result is the classification result of the third training domain name data; preprocessing the third training domain name data to obtain discretized third training vector domain name data; establishing a verification set from the discretized third training vector domain name data and the classification result of the third training vector domain name data; and verifying the trained neural network model based on the verification set to determine whether the trained neural network model meets the preset condition.
It can be understood that, after the trained neural network model is obtained, it needs to be verified in order to judge the training result. To this end, the third training domain name data is first obtained and preprocessed into discretized third training vector domain name data. The preprocessing ensures that the data format of the domain name data input during verification is the same as that of the domain name data during actual detection, which makes the detection result more accurate. Then the classification result of the third training vector domain name data is obtained, and a verification set is established from the discretized third training vector domain name data and its classification result. The discretized third training vector domain name data is input into the trained neural network model to obtain a detection result, and the detection result is compared with the previously obtained classification result to obtain the judgment accuracy of the neural network model on the domain name data. When the judgment accuracy meets the preset condition, the neural network model meets the preset standard, which can be set by the user. If the judgment accuracy does not meet the preset condition, the neural network model is trained and verified again.
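The comparison of detection results against known classification results can be sketched as follows. Labels are assumed to be 0 for normal and 1 for DGA; the threshold values and the additional false-alarm-rate check are hypothetical choices, since the application only states that the preset condition is set by the user.

```python
def evaluate(predicted_labels, true_labels):
    """Return (accuracy, false_alarm_rate) for binary labels (1 = DGA)."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predicted_labels, true_labels):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1  # normal domain flagged as DGA: a false alarm
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_alarm_rate

def meets_preset_condition(accuracy, false_alarm_rate,
                           min_accuracy=0.99, max_false_alarm_rate=0.03):
    """Hypothetical preset condition; both thresholds are assumptions."""
    return accuracy >= min_accuracy and false_alarm_rate <= max_false_alarm_rate

# Toy verification set (hypothetical values).
true_labels = [0, 0, 0, 1, 1, 1, 0, 1]
predicted_labels = [0, 0, 1, 1, 1, 0, 0, 1]
accuracy, false_alarm_rate = evaluate(predicted_labels, true_labels)
passed = meets_preset_condition(accuracy, false_alarm_rate)
```

If `passed` is false, the model would be retrained and verified again, as the paragraph above describes.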
In another embodiment, preprocessing the domain name data to be detected to obtain discretized domain name data to be detected further comprises: padding the discretized domain name data to be detected to a preset vector dimension by using a padding function.
In this embodiment, the domain name to be detected is preprocessed into discretized domain name data to be detected, and the discretized data must be unified to a single vector dimension before it is imported into the neural network model. A vector dimension is therefore preset, and all the domain name data are brought to that preset dimension by using the pad_sequences function of Keras. Domain name data with the same vector dimension can form a discrete vector domain name data set matrix, and inputting this matrix into the neural network model makes it more convenient to obtain the domain name detection classification result.
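To show what padding to a preset vector dimension does, the sketch below is a dependency-free stand-in that mimics the default behavior of Keras `pad_sequences` (zeros added at the front, over-long sequences truncated from the front). The function name `pad_to_dimension` and the example dimension of 5 are assumptions for illustration; a real pipeline would call the Keras function directly.

```python
def pad_to_dimension(sequences, maxlen, value=0):
    """Pad or truncate integer sequences to exactly `maxlen` entries.

    Mirrors the Keras pad_sequences defaults: padding and truncation
    are both applied at the start of each sequence.
    """
    if maxlen <= 0:
        raise ValueError("maxlen must be a positive integer")
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # keep the last `maxlen` items
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

# Hypothetical encoded domain names of different lengths.
encoded_domains = [[6, 7, 8], [1, 2, 3, 4, 5, 6]]
matrix = pad_to_dimension(encoded_domains, maxlen=5)
```

Every row of `matrix` now has the same length, so the rows can be stacked into the data set matrix that is fed to the model.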
In some embodiments, after determining that the domain name data is the normal domain name or the DGA domain name according to the classification result, the method further includes: and storing the judgment result into a memory.
In one embodiment, after the discretized domain name data to be detected is input into the trained neural network model, the following processing is performed: first, the discretely vectorized domain name data set passes through an Embedding layer to obtain continuous vectors; the continuous vectors are then input into a Bidirectional LSTM layer and subsequently into an attention mechanism layer; finally, a Dense layer with a sigmoid activation function outputs the classification result.
Embedding is a method of converting discrete vectors or variables into continuous vector representations. In deep learning, Embedding typically reduces the spatial dimension of discrete vectors or variables by converting them into continuous vectors. Compared with the conventional one-hot method, Embedding not only greatly reduces the vector dimension but also makes it possible to compute similarity between different words, and the Embedding itself is learnable.
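The relationship between one-hot encoding and Embedding can be made concrete with a small NumPy sketch: an embedding lookup is equivalent to multiplying a one-hot matrix by the embedding matrix, but it yields far fewer dimensions per character. The vocabulary size of 40, the embedding dimension of 8 and the random matrix (standing in for learned weights) are assumptions for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 40, 8  # hypothetical sizes

# Stand-in for a learned embedding matrix: one row per character index.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

def embed(encoded_domain, matrix):
    """Look up one continuous vector per discrete character index."""
    return matrix[np.asarray(encoded_domain)]

encoded_domain = [0, 0, 6, 7, 8]                  # padded integer vector
embedded = embed(encoded_domain, embedding_matrix)  # shape (5, 8)

# The same result via one-hot vectors, which need 40 dimensions each.
one_hot = np.eye(vocab_size)[encoded_domain]      # shape (5, 40)
via_one_hot = one_hot @ embedding_matrix
```

Here each character is represented by 8 values instead of 40, and during training the rows of the matrix would be updated like any other weights.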
LSTM stands for Long Short-Term Memory. As a nonlinear model, an LSTM can serve as a complex nonlinear unit for building larger deep neural networks. Bidirectional LSTM is an extension of LSTM: in natural language processing (NLP), an LSTM can only process the forward context of a text, whereas a Bidirectional LSTM processes both the forward and the backward context.
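The difference between a one-directional and a bidirectional LSTM can be sketched in NumPy: one LSTM reads the sequence left to right, a second reads it right to left, and their hidden states are concatenated per position. The weights below are random stand-ins for trained parameters, and the sizes and helper names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b):
    """Run one LSTM over inputs of shape (T, D); return states (T, H)."""
    hidden_size = U.shape[0]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    states = []
    for x in inputs:
        gates = x @ W + h @ U + b            # shape (4H,)
        i, f, o, g = np.split(gates, 4)      # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bidirectional_lstm(inputs, forward_params, backward_params):
    """Concatenate forward states with re-aligned backward states."""
    forward_states = lstm_forward(inputs, *forward_params)
    backward_states = lstm_forward(inputs[::-1], *backward_params)[::-1]
    return np.concatenate([forward_states, backward_states], axis=1)

def init_params(rng, input_dim, hidden_size):
    """Random stand-in weights (a trained model would learn these)."""
    W = rng.normal(scale=0.1, size=(input_dim, 4 * hidden_size))
    U = rng.normal(scale=0.1, size=(hidden_size, 4 * hidden_size))
    b = np.zeros(4 * hidden_size)
    return W, U, b

rng = np.random.default_rng(0)
input_dim, hidden_size, seq_len = 8, 3, 5    # hypothetical sizes
forward_params = init_params(rng, input_dim, hidden_size)
backward_params = init_params(rng, input_dim, hidden_size)
inputs = rng.normal(size=(seq_len, input_dim))
states = bidirectional_lstm(inputs, forward_params, backward_params)
```

In `states`, the first `hidden_size` columns at position t depend only on characters up to t, while the remaining columns depend on characters from t to the end, so each position sees context from both directions.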
The attention mechanism is a resource allocation scheme that uses limited computing resources to process the more important information, and it is a main means of addressing information overload. It mainly comprises the attention score, the attention distribution and the attention output. There are many variants of the attention mechanism, which differ mainly in how the attention score is computed.
In another embodiment, as shown in fig. 2, fig. 2 is a schematic flowchart of a process of training a neural network model of a DGA domain name detection method based on deep learning, including:
S201, importing normal domain names and DGA domain names.
S202, performing data processing operations on the imported domain names, such as domain name extraction, label creation, data frame merging and shuffling, so as to shuffle the imported domain names, assign them classification labels, and hold out 10% of the data.
S203, performing discrete integer vectorization on the domain name data using a dictionary built from the training domain name data, the dictionary being one form of the vectorization format, and performing a padding operation so that all domain names have the same vector dimension.
S204, dividing the domain name data after discrete integer vectorization into a training set and a verification set.
S205, constructing a domain name detection model based on a deep learning attention mechanism from neural network layers such as Embedding, Bidirectional LSTM and the attention mechanism.
S206, training and verifying the neural network model with the training set and the verification set, and saving the model when its evaluation meets the preset requirement.
S207, performing discrete integer vectorization on the data held out earlier, feeding it to the model for a final test, and obtaining the trained neural network model when the model meets the preset requirement.
S208, deploying the model for use.
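The data preparation in steps S202 to S204 can be sketched as below. This is a minimal standard-library stand-in, not the described implementation: the function names, the choice to map unknown characters to 0, the left-padding direction, the fixed random seed and the ten made-up example domain names are all assumptions for illustration.

```python
import random


def build_dictionary(domains):
    """Map every character seen in the domains to an integer >= 1.

    Index 0 is reserved for padding (an assumption of this sketch).
    """
    chars = sorted({ch for d in domains for ch in d})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def vectorize(domain, dictionary, max_len):
    """Encode characters as integers, truncate, then left-pad with zeros."""
    ids = [dictionary.get(ch, 0) for ch in domain][:max_len]
    return [0] * (max_len - len(ids)) + ids


def prepare(normal, dga, max_len=40, holdout=0.1, val=0.2, seed=0):
    """Label, shuffle, hold out a test share, vectorize, and split."""
    rows = [(d, 0) for d in normal] + [(d, 1) for d in dga]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * holdout)
    test_rows, rest = rows[:n_test], rows[n_test:]
    # The dictionary is built only from data the model is allowed to see.
    dictionary = build_dictionary(d for d, _ in rest)
    encoded = [(vectorize(d, dictionary, max_len), y) for d, y in rest]
    n_val = int(len(encoded) * val)
    return dictionary, encoded[n_val:], encoded[:n_val], test_rows


# Made-up example names, for illustration only.
normal = ["google.com", "wikipedia.org", "example.net", "github.com", "python.org"]
dga = ["xkqjvbzw.com", "qpwoeiru.net", "zmxncbvl.org", "lkjhgfds.info", "asdfqwer.biz"]
dictionary, train_set, val_set, test_rows = prepare(normal, dga, max_len=16)
```

Holding the test rows out before the dictionary is built keeps them unseen by every later stage, which matches the purpose of the reserved data in S202 and S207.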
In another embodiment, the step of obtaining a trained neural network model comprises:
S301, importing 1 million normal domain names and 1.2 million DGA domain names.
S302, extracting 900,000 normal domain names and 900,000 DGA domain names respectively, setting the label of normal domain names to 0 and the label of DGA domain names to 1, and calling the concat function of pandas and the shuffle function of sklearn to merge and then shuffle the data frames. The remaining normal and DGA domain names are reserved for model testing; the model never sees them during training, which prevents data leakage.
S303, writing a function that maps each letter, digit and special character appearing in the existing domain names to an integer, forming a dictionary of 39 key-value pairs; the dictionary serves as the vectorization format and is written to a txt file for use in model testing and model deployment. The maximum length for discrete integer vectorization of a domain name is set to 40, and all 1.4 million domain name records are vectorized using the dictionary. Because domain names differ in length, the pad_sequences function of Keras (an artificial neural network library) is used to pad them automatically, so that the discrete integer vectors of all domain names have the same dimension.
S304, using the train_test_split function of sklearn (a machine learning library) to divide the domain name data after discrete integer vectorization, together with the corresponding labels, into a training set and a verification set.
S305, first writing the attention mechanism layer and setting the embedding size to 128; then writing the detection model using the Keras functional API, setting up in turn the embedding layer and a Dropout layer whose parameter is 0.7; the Bidirectional LSTM uses the relu activation function, with recurrent_dropout set to 0.5 and dropout set to 0.8; the attention mechanism layer parameter is set to 100; and the Dense layer uses the sigmoid activation function. The model is then compiled: the optimizer is Adam, the learning rate is 1e-7, the loss function is binary_crossentropy, the metric is accuracy, and an early stopping strategy is used.
S306, training the model with the verification set proportion set to 20%, the batch size set to 512, epochs set to 8 and shuffle set to True; on the verification set the model reaches an accuracy of 97.75%, a precision of 98.99% and a recall of 96.47%, and the model is then saved.
S307, in a new environment, importing the previously saved mapping dictionary and model together with 100,000 normal domain names and DGA domain names that the model has never seen; performing discrete integer vectorization on these domain names according to the dictionary used during training, then performing the padding operation, and then testing with the model; the accuracy of predicting the unseen domain names reaches 97.65%.
S308, installing the Python, TensorFlow, Pandas and NumPy packages on the server with version numbers matching those of the development environment, so that the deployment environment is consistent with the development environment, and then deploying the model.
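The model construction and training settings of S305 and S306 might look roughly like the following Keras sketch. Several values are assumptions because the description does not state them: the number of LSTM units, the vocabulary size of 40 (39 dictionary entries plus a padding index), the early stopping patience and monitored quantity, and above all the attention layer, which is replaced here by a simple score, softmax and weighted-sum pooling built from standard Keras layers rather than the custom layer (with its parameter of 100) mentioned in the text. TensorFlow is imported inside the functions so that the settings can be read without it installed.

```python
HYPERPARAMS = {
    "max_len": 40,
    "vocab_size": 40,               # assumption: 39 entries plus padding index 0
    "embedding_dim": 128,
    "embedding_dropout": 0.7,
    "lstm_units": 64,               # assumption: not stated in the description
    "lstm_dropout": 0.8,
    "lstm_recurrent_dropout": 0.5,
    "learning_rate": 1e-7,
    "batch_size": 512,
    "epochs": 8,
    "validation_split": 0.2,
}


def build_model(hp=HYPERPARAMS):
    """Assemble Embedding -> Dropout -> BiLSTM -> attention pooling -> Dense."""
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(hp["max_len"],))
    x = layers.Embedding(hp["vocab_size"], hp["embedding_dim"])(inputs)
    x = layers.Dropout(hp["embedding_dropout"])(x)
    x = layers.Bidirectional(
        layers.LSTM(
            hp["lstm_units"],
            activation="relu",
            dropout=hp["lstm_dropout"],
            recurrent_dropout=hp["lstm_recurrent_dropout"],
            return_sequences=True,  # keep every time step for attention
        )
    )(x)
    # Stand-in attention: one score per time step, softmax over time,
    # then a weighted sum of the BiLSTM outputs.
    scores = layers.Dense(1)(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Flatten()(layers.Dot(axes=1)([weights, x]))
    outputs = layers.Dense(1, activation="sigmoid")(context)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=hp["learning_rate"]),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


def train(model, x, y, hp=HYPERPARAMS):
    """Fit with early stopping; patience and monitor are assumed values."""
    from tensorflow import keras

    stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
    return model.fit(
        x,
        y,
        validation_split=hp["validation_split"],
        batch_size=hp["batch_size"],
        epochs=hp["epochs"],
        shuffle=True,
        callbacks=[stop],
    )
```

Here `x` would be the padded integer matrix of shape (number of domain names, 40) and `y` the corresponding 0/1 labels.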
This embodiment further provides a domain name detection apparatus for implementing the foregoing embodiments and preferred embodiments; what has already been described is not repeated here. As used hereinafter, the terms "module", "unit", "subunit" and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of a domain name detection apparatus according to an embodiment of the present application, and as shown in fig. 3, the apparatus includes:
The domain name obtaining module 10 is configured to obtain domain name data to be detected.
The preprocessing module 20 is configured to preprocess the domain name data to be detected into discrete vector data, where the preprocessing at least includes performing discrete vectorization on the domain name data.
The preprocessing module 20 is further configured to obtain a preset vectorization format; converting the domain name data to be detected into discretized domain name data to be detected based on the vectorization format.
The preprocessing module 20 is further configured to obtain first training domain name data; converting the first training domain name data into a first training domain name discrete vector based on character features of the first training domain name data, wherein the character features at least comprise letters, numbers and special characters, and the first training domain name discrete vector is composed of the letters and the numbers; acquiring a mapping relation based on the first training domain name data and the first training domain name discrete vector; and acquiring a vectorization format based on the mapping relation.
The preprocessing module 20 is further configured to utilize a filling function to fill the discretized domain name data to be detected into a preset vector dimension.
The domain name judging module 30 is configured to input the discretized domain name data to be detected into the trained neural network model and output a classification result.
The domain name detection device further comprises:
The neural network model training module is configured to acquire second training domain name data and a training classification result of each piece of second training domain name data, the second training domain name data being a known domain name data set that comprises normal domain names and DGA domain names, and the training classification result being the classification result of the second training domain name data; preprocess the second training domain name data to obtain discretized second training vector domain name data; establish a training set according to the discretized second training vector domain name data and the classification result of the second training vector domain name data; and train the neural network model based on the training set to obtain the optimized neural network model.
The neural network model detection module is configured to acquire third training domain name data and a training classification result of each piece of third training domain name data, the third training domain name data being a known domain name data set that comprises normal domain names and DGA domain names, and the training classification result being the classification result of the third training domain name data; preprocess the third training domain name data to obtain discretized third training vector domain name data; establish a verification set according to the discretized third training vector domain name data and the classification result of the third training vector domain name data; and verify the optimized neural network model based on the verification set to determine whether it meets the preset conditions.
The storage module is configured to store the judgment result into the memory.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
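As a software illustration of how the obtaining, preprocessing, judging and storage modules could be chained, consider the sketch below. All class names, the 0.5 decision threshold, the assumed 39-character alphabet and especially `placeholder_score` are hypothetical: the scoring function merely stands in for the trained neural network model so that the example runs without one.

```python
class PreprocessingModule:
    """Discrete vectorization with a preset dictionary, then padding."""

    def __init__(self, dictionary, max_len):
        self.dictionary = dictionary
        self.max_len = max_len

    def run(self, domain):
        ids = [self.dictionary.get(ch, 0) for ch in domain][: self.max_len]
        return [0] * (self.max_len - len(ids)) + ids


class JudgingModule:
    """Wraps a scoring function that plays the role of the trained model."""

    def __init__(self, score_fn, threshold=0.5):
        self.score_fn = score_fn
        self.threshold = threshold  # assumed cut-off on the sigmoid output

    def run(self, vector):
        return "DGA" if self.score_fn(vector) >= self.threshold else "normal"


class StorageModule:
    """Keeps judgment results in memory."""

    def __init__(self):
        self.records = []

    def run(self, domain, verdict):
        self.records.append((domain, verdict))


class DomainDetector:
    """Chains the modules: preprocess, judge, store."""

    def __init__(self, preprocess, judge, store):
        self.preprocess, self.judge, self.store = preprocess, judge, store

    def detect(self, domain):
        verdict = self.judge.run(self.preprocess.run(domain))
        self.store.run(domain, verdict)
        return verdict


def placeholder_score(vector):
    """Not a real model: flags long names so the pipeline can be exercised."""
    return 0.9 if sum(1 for i in vector if i != 0) > 12 else 0.1


# Assumed alphabet: 26 letters, 10 digits and 3 special characters.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789.-_"
dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
store = StorageModule()
detector = DomainDetector(
    PreprocessingModule(dictionary, max_len=40),
    JudgingModule(placeholder_score),
    store,
)
first = detector.detect("abc.com")
second = detector.detect("qwertyuiopasdfgh.com")
```

In a real deployment the `score_fn` argument would call the saved model's prediction method on the padded vector.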
In addition, the DGA domain name detection method based on deep learning of the embodiment of the present application described in conjunction with fig. 1 can be implemented by a computer device. Fig. 4 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may include a processor 41 and a memory 42 storing computer program instructions.
Specifically, the processor 41 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 42 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 42 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 42 may include removable or non-removable (or fixed) media, where appropriate. The memory 42 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 42 is a non-volatile memory. In particular embodiments, memory 42 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.
Memory 42 may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by processor 41.
The processor 41 reads and executes computer program instructions stored in the memory 42 to implement any one of the deep learning based DGA domain name detection methods in the above embodiments.
In some of these embodiments, the computer device may also include a communication interface 43 and a bus 40. As shown in fig. 4, the processor 41, the memory 42, and the communication interface 43 are connected via the bus 40 to complete mutual communication.
The communication interface 43 is used for implementing communication between the modules, devices, units and/or apparatuses in the embodiments of the present application. The communication interface 43 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Bus 40 comprises hardware, software, or both, coupling the components of the computer device to each other. Bus 40 includes, but is not limited to, at least one of the following: a Data Bus, an Address Bus, a Control Bus, an Expansion Bus, and a Local Bus. By way of example, and not limitation, Bus 40 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 40 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
The computer device may execute the deep learning based DGA domain name detection method of the embodiments of the present application based on the acquired computer program instructions, thereby implementing the method described with reference to fig. 1.
In addition, in combination with the deep learning based DGA domain name detection method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium for its implementation. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the deep learning based DGA domain name detection methods in the embodiments described above.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A DGA domain name detection method based on deep learning is characterized by comprising the following steps:
acquiring domain name data to be detected;
preprocessing the domain name data to be detected to obtain discretized domain name data to be detected, wherein the preprocessing at least comprises discretizing vectorization on the domain name data to be detected;
inputting the discretized domain name data to be detected into the trained neural network model, and outputting a classification result.
2. The method according to claim 1, wherein the preprocessing the domain name data to be detected to obtain discretized domain name data to be detected comprises:
acquiring a preset vectorization format;
and converting the domain name data to be detected into discretized domain name data to be detected based on the vectorization format.
3. The method of claim 2, wherein the obtaining the preset vectorization format comprises:
acquiring first training domain name data;
converting the first training domain name data into a first training domain name discrete vector based on the character features of the first training domain name data, wherein the character features at least comprise letters, numbers and special characters, and the first training domain name discrete vector is composed of the letters and the numbers;
acquiring a mapping relation based on the first training domain name data and the first training domain name discrete vector;
and acquiring the vectorization format based on the mapping relation.
4. The method of claim 1, wherein before inputting the discretized domain name data to be detected into the trained neural network model, the method further comprises:
acquiring second training domain name data and a training classification result of each second training domain name data, wherein the second training domain name data is a known domain name data set and comprises a normal domain name and a DGA domain name, and the training classification result is a classification result of the second training domain name data;
preprocessing the second training domain name data to obtain discretized second training vector domain name data;
establishing a training set according to the discretized second training vector domain name data and the classification result of the second training vector domain name data;
training the neural network model based on the training set to obtain a trained neural network model.
5. The method of claim 4, wherein training the neural network model based on the training set to obtain the trained neural network model comprises:
acquiring third training domain name data and a training classification result of each third training domain name data, wherein the third training domain name data is a known domain name data set and comprises a normal domain name and a DGA domain name, and the training classification result is a classification result of the third training domain name data;
preprocessing the third training domain name data to obtain discretized third training vector domain name data; establishing a verification set by using the discretized third training vector domain name data and the classification result of the third training vector domain name data;
and verifying the trained neural network model based on the verification set to verify whether the trained neural network model meets a preset condition.
6. The method according to claim 1, wherein preprocessing the domain name data to be detected to obtain discretized domain name data to be detected further comprises:
and utilizing a filling function to fill the discretized domain name data to be detected to a preset vector dimension.
7. The method of claim 1, wherein after determining that the domain name data is the normal domain name or the DGA domain name according to the classification result, the method further comprises:
and storing the judgment result into a memory.
8. A domain name detecting apparatus, comprising:
the domain name acquisition module is used for acquiring domain name data to be detected;
the preprocessing module is used for preprocessing the domain name data to be detected into discrete vector data, wherein the preprocessing at least comprises the discrete vectorization of the domain name data;
and the domain name judging module is used for inputting the discretized domain name data to be detected into the trained neural network model and outputting a classification result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110142074.9A 2021-02-02 2021-02-02 DGA domain name detection method and device based on deep learning and computer equipment Active CN112966713B (en)

Publications (2)

Publication Number Publication Date
CN112966713A true CN112966713A (en) 2021-06-15
CN112966713B CN112966713B (en) 2024-03-19

Family

ID=76271822

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant