WO2019096099A1 - DGA domain name real-time detection method and apparatus - Google Patents

DGA domain name real-time detection method and apparatus

Info

Publication number
WO2019096099A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain name
model
feature
deep learning
classifier
Prior art date
Application number
PCT/CN2018/115087
Other languages
English (en)
French (fr)
Inventor
曾凤
常朔
万晓川
Original Assignee
瀚思安信(北京)软件技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 瀚思安信(北京)软件技术有限公司
Priority to US16/764,741 (patent US11334764B2)
Publication of WO2019096099A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • The present invention relates to the field of network security technology, and in particular to a DGA domain name real-time detection method and apparatus.
  • A botnet (BotNet) is a network in which a large number of hosts (bots) are infected through one or more means of propagation, so that a one-to-many control network is formed between the controller (botmaster) and the infected hosts through a command and control server (C2 server). Its purpose is to infect as many hosts as possible. Botnets are a serious threat both to the secure operation of networks and to the protection of user data.
  • A domain generation algorithm (Domain Generation Algorithms, DGA) is a technique that uses random characters to generate C&C domain names (DGA domain names) in order to evade domain name blacklist detection.
  • One existing detection method collects DGA domain name samples and reverse-engineers the DGA to predict which domain names are likely to be generated and pre-registered in the future, and then adds them to a blacklist.
  • However, this solution has an obvious drawback: because a DGA can generate thousands of domain names in a short period of time, network security personnel cannot re-collect domain name samples and update blacklists every day.
  • Another classic detection technique performs feature extraction and classification on domain name data. It mainly comprises two stages: feature engineering and a classification algorithm.
  • Feature engineering is the most tedious part of this work and mainly covers two aspects:
  • 1) Filter-based methods: domain names are checked against the Alexa top 1 million websites and blacklists;
  • 2) Statistical-feature-based methods: typical domain name statistical features include, for example, domain name length, bigrams, N-grams, information entropy, life cycle, and character frequency distribution.
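  • To make the statistical features listed above concrete, the following is a minimal Python sketch of how length, character entropy, bigram counts, and character frequencies might be computed for one domain name. It is not taken from the cited patent applications; the function names and the choice to analyse only the left-most label are assumptions made for illustration.

```python
import math
from collections import Counter


def char_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def ngrams(label: str, n: int) -> Counter:
    """Counts of all contiguous character n-grams in the label."""
    return Counter(label[i:i + n] for i in range(len(label) - n + 1))


def statistical_features(domain: str) -> dict:
    # Assumption: only the left-most label is analysed; a production system
    # would need proper public-suffix handling.
    label = domain.lower().split(".")[0]
    return {
        "length": len(label),
        "entropy": char_entropy(label),
        "bigrams": ngrams(label, 2),
        "char_freq": Counter(label),
    }
```

  • Each of these values has to be designed and tuned by hand, which is the dependence on manual feature engineering that the text criticises.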
  • Chinese patent application publication No. CN105577660A proposes a DGA domain name detection method based on random forests.
  • Chinese patent application publication No. CN105897714A proposes a botnet detection method based on DNS traffic characteristics.
  • U.S. patent application publication No. US 2013/0191915 A1 also proposes a DGA domain name detection method and system.
  • These patent applications all use the classic statistical-feature-based feature engineering approach described above to detect DGA domain names.
  • This approach has several shortcomings: it depends heavily on manual feature engineering and is difficult to implement; its detection rate is low and its false alarm rate is high; and its detection speed is slow, so real-time detection cannot be achieved.
  • With the development of machine learning, and deep learning in particular, researchers have begun to explore deep learning solutions for DGA domain name detection. The publication "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks" (Woodbridge J et al., https://arxiv.org/abs/1611.00791, November 2016) proposes a method that detects DGA domain names with a long short-term memory (LSTM) network. The LSTM model is a special type of recurrent neural network that can learn long-term dependencies, for example in text and language. The method uses the LSTM model for automatic feature extraction, which eliminates the tedious feature engineering step and does not rely on context information, and to some extent achieves real-time detection of DGA domain names. However, the method requires a large amount of training data to train the LSTM model, the parameter weights of the model must be adjusted during training, and the computational cost of training is high. In addition, the model is sensitive to class imbalance in the training set, and its ability to detect DGA domain name families that lack sufficient training data is limited.
  • An aspect of the present invention provides a DGA domain name real-time detection method, which is characterized by the following steps:
  • Step S1 converting the original domain name into a multi-dimensional numerical vector
  • Step S2 inputting the multi-dimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature
  • Step S3 training the domain name classifier based on the generated domain name feature
  • Step S4 classifying and predicting DGA domain names based on the trained domain name classifier.
  • In some embodiments, the step S1 of converting the original domain name into a multi-dimensional numerical vector comprises the following steps:
  • Step S11 converting a string of the original domain name into a multi-dimensional image byte matrix to match the input of the deep learning model pre-trained based on the ImageNet data set;
  • Step S12 reducing the size of the multi-dimensional image byte matrix to a predetermined size.
  • the step S2 further includes:
  • Step S2' normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  • the generating the domain name feature in the step S2 further includes:
  • extracting the third-to-last layer of the pre-trained deep learning model to generate the domain name feature.
  • the pre-trained deep learning model based on the ImageNet data set comprises: an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  • the domain name classifier includes a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  • training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a similarity distance between the two domain names.
  • the training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • Another aspect of the present invention provides a DGA domain name real-time detecting apparatus, which is characterized by comprising the following modules:
  • a conversion module for converting the original domain name into a multidimensional numerical vector
  • a deep learning module configured to input the multidimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature
  • a classifier training module configured to train a domain name classifier based on the generated domain name feature
  • the prediction module is configured to classify and predict the DGA domain name based on the trained domain name classifier.
  • the conversion module includes:
  • a pre-processing unit for converting a string of the original domain name into a multi-dimensional image byte matrix to match an input of a deep learning model pre-trained based on the ImageNet data set;
  • an adjusting unit configured to reduce the size of the multi-dimensional image byte matrix to a predetermined size.
  • the detecting device further comprises:
  • a normalization module configured to perform normalization processing on the multi-dimensional image byte matrix reduced to a predetermined size.
  • the deep learning module extracts a third to last layer of the pre-trained deep learning model to generate a domain name feature.
  • the pre-trained deep learning model based on the ImageNet data set comprises: an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  • the domain name classifier includes a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  • the classifier training module includes:
  • the similarity calculation unit is configured to calculate a similarity distance between two domain names.
  • the classifier training module includes:
  • a feature calculation unit configured to calculate a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • Another aspect of the invention provides a computer readable storage medium having stored thereon computer program instructions for performing the following steps in a computer:
  • Step S1 converting the original domain name into a multi-dimensional numerical vector
  • Step S2 inputting the multi-dimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature
  • Step S3 training the domain name classifier based on the generated domain name feature
  • Step S4 classifying and predicting DGA domain names based on the trained domain name classifier.
  • In some embodiments, the step S1 of converting the original domain name into a multi-dimensional numerical vector comprises the following steps:
  • Step S11 converting a string of the original domain name into a multi-dimensional image byte matrix to match the input of the deep learning model pre-trained based on the ImageNet data set;
  • Step S12 reducing the size of the multi-dimensional image byte matrix to a predetermined size.
  • the step S2 further includes:
  • Step S2' normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  • the generating the domain name feature in the step S2 further includes:
  • extracting the third-to-last layer of the pre-trained deep learning model to generate the domain name feature.
  • the pre-trained deep learning model based on the ImageNet data set comprises: an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  • the domain name classifier includes a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  • training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a similarity distance between the two domain names.
  • the training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • Another aspect of the present invention provides a computer apparatus comprising a processor and a memory, the memory storing computer program instructions, wherein the computer program instructions are used to perform the following steps while the processor is running:
  • Step S1 converting the original domain name into a multi-dimensional numerical vector
  • Step S2 inputting the multi-dimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature
  • Step S3 training the domain name classifier based on the generated domain name feature
  • Step S4 classifying and predicting DGA domain names based on the trained domain name classifier.
  • In some embodiments, the step S1 of converting the original domain name into a multi-dimensional numerical vector comprises the following steps:
  • Step S11 converting a string of the original domain name into a multi-dimensional image byte matrix to match the input of the deep learning model pre-trained based on the ImageNet data set;
  • Step S12 reducing the size of the multi-dimensional image byte matrix to a predetermined size.
  • the step S2 further includes:
  • Step S2' normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  • the generating the domain name feature in the step S2 further includes:
  • extracting the third-to-last layer of the pre-trained deep learning model to generate the domain name feature.
  • the pre-trained deep learning model based on the ImageNet data set comprises: an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  • the domain name classifier includes a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  • training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a similarity distance between the two domain names.
  • the training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • Through word-embedding conversion of the domain name data and transfer learning of the deep learning model, some embodiments of the present invention, for the first time, take a deep learning model pre-trained on the ImageNet data set from the field of visual image classification and use it for real-time detection of DGA domain names. This avoids the high-intensity training and parameter weight adjustment of a deep learning model in DGA domain name detection, and achieves a high detection rate, a low false positive rate, and a faster detection speed.
  • FIG. 1 is a schematic flowchart of a DGA domain name real-time detection method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a DGA domain name real-time detection method according to another embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a DGA domain name real-time detection method according to another embodiment of the present invention.
  • FIG. 4 is a diagram showing an application example of an embodiment of performing domain name conversion according to the present invention.
  • FIG. 5 is a schematic structural diagram of a DGA domain name real-time detecting apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a conversion module according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a DGA domain name real-time detecting apparatus according to another embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a classifier training module according to an embodiment of the present invention.
  • FIG. 9 is a comparison diagram of DGA domain name detection speed performance according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart diagram of a DGA domain name real-time detection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S1 converting the original domain name into a multi-dimensional numerical vector
  • Step S2 inputting the multi-dimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature;
  • Step S3 training the domain name classifier based on the generated domain name feature
  • Step S4 classifying and predicting DGA domain names based on the trained domain name classifier.
  • ImageNet is the name of a well-known computer vision recognition project. Its data set is currently the largest image recognition database in the world, containing more than 10 million hand-labeled images and more than 20,000 object categories. A number of excellent deep learning models have been developed and trained on this large-scale data set, such as the AlexNet, VGG, SqueezeNet, Inception, and ResNet models. At present these models are mainly used in computer vision, speech recognition, natural language processing, and other fields, where they have achieved great success, but no precedent has been found for applying them to computer network security, and in particular to DGA domain name detection.
  • As the object of learning and classification, a DGA domain name is essentially character-type data, which differs from the original image data in the ImageNet data set in both size and content.
  • The embodiments of the present invention address the above two difficulties.
  • The original character-type domain name data is converted by word embedding into a multi-dimensional numerical vector in image format, so that the domain name data can be processed by a deep learning model pre-trained on the ImageNet data set in the same way as the image data in the ImageNet data set.
  • Word embedding is a term from natural language processing. Mathematically, it is defined as a mapping that projects a document space into a numerical vector space (usually low-dimensional). The mapping is injective, i.e. each Y corresponds to a unique X and vice versa.
  • With word embedding, document-type data can be represented numerically, so that a document analysis problem is converted into a problem on the corresponding numerical vectors.
  • The parameter weights of the deep learning model that has been pre-trained on the ImageNet data set are transferred directly to the target learning model for the domain name data set obtained after word-embedding conversion.
  • This makes effective use of the knowledge and experience that these deep learning models gained from training on the ImageNet data set, avoids high-intensity training and parameter weight adjustment of a deep learning model on large-scale domain name data, and allows DGA domain name detection to meet real-time performance requirements while achieving a high detection rate and a low false positive rate.
  • the deep learning model based on the ImageNet dataset pre-training comprises: an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  • generating the domain name feature in the step S2 further comprises: extracting the third-to-last layer of the pre-trained deep learning model to generate the domain name feature. This is because, in a pre-trained deep learning model, the topmost output layer usually suffers from overfitting, and the features of layers below the top output layer are often better suited to classification.
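  • The idea of taking features from the third-to-last layer can be illustrated with a small, framework-free Python sketch. The "model" here is a hypothetical list of four fixed layer functions standing in for a pre-trained network; it is not AlexNet or any model named in the text. In a deep learning framework the same effect is typically obtained by truncating the network or capturing an intermediate activation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_collect(layers: Sequence[Layer], x: Vector) -> List[Vector]:
    """Run x through every layer in order and keep each layer's output."""
    activations = []
    for layer in layers:
        x = layer(x)
        activations.append(x)
    return activations


def third_to_last_features(layers: Sequence[Layer], x: Vector) -> Vector:
    """Return the output of the third-to-last layer as the feature vector."""
    if len(layers) < 3:
        raise ValueError("model needs at least three layers")
    return forward_collect(layers, x)[-3]


# Hypothetical stand-in for a pre-trained network: four fixed layers.
toy_model = [
    lambda v: [2.0 * a for a in v],            # layer 1
    lambda v: [max(0.0, a - 1.0) for a in v],  # layer 2 (third-to-last)
    lambda v: [sum(v)],                        # layer 3
    lambda v: [1.0 if v[0] > 0 else 0.0],      # layer 4 (top output layer)
]
```

  • With this toy model, `third_to_last_features(toy_model, x)` returns the activation of layer 2 and ignores the task-specific outputs of the last two layers.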
  • the domain name classifier includes a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  • training the domain name classifier based on the generated domain name feature in the step S3 further comprises: calculating a similarity distance between the two domain names.
  • A similarity score based on the Euclidean distance between two domain names helps to improve the accuracy of the classification.
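  • As a sketch of the similarity distance described above, the following Python code computes the Euclidean distance between the feature vectors of two domain names and maps it to a score. The text does not specify how the distance is turned into a score, so the `1 / (1 + d)` form and the function names are assumptions.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map distance to (0, 1]; 1.0 means identical features.

    Assumption: the 1 / (1 + d) mapping is illustrative only.
    """
    return 1.0 / (1.0 + euclidean_distance(a, b))
```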
  • the training the domain name classifier based on the generated domain name feature in the step S3 comprises: calculating a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • DGA domain names belong to multiple domain name families.
  • The embodiment of the present invention uses the average feature value of the domain names in a domain name family as the feature of that family, thereby enabling classification and detection of DGA domain name families.
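  • The family-level feature described above amounts to a per-family mean of the feature vectors. The sketch below computes those means in plain Python; assigning a new domain name to the family with the nearest mean is an assumption added for illustration and is not stated in the text.

```python
import math
from typing import Dict, Iterable, List, Tuple


def family_centroids(
    samples: Iterable[Tuple[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average the feature vectors of the domain names in each family."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for family, vec in samples:
        if family not in sums:
            sums[family] = [0.0] * len(vec)
            counts[family] = 0
        sums[family] = [s + v for s, v in zip(sums[family], vec)]
        counts[family] += 1
    return {f: [s / counts[f] for s in sums[f]] for f in sums}


def nearest_family(centroids: Dict[str, List[float]], vec: List[float]) -> str:
    # Assumption: nearest-mean assignment by Euclidean distance.
    return min(centroids, key=lambda f: math.dist(centroids[f], vec))
```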
  • FIG. 2 is a schematic flowchart diagram of a DGA domain name real-time detection method according to another embodiment of the present invention.
  • the step S1 of converting the original domain name into the multi-dimensional value vector in the embodiment of the present invention includes:
  • Step S11 converting a string of the original domain name into a multi-dimensional image byte matrix to match the input of the deep learning model pre-trained based on the ImageNet data set;
  • Step S12 reducing the size of the multi-dimensional image byte matrix to a predetermined size.
  • The step S2 includes inputting the multi-dimensional image byte matrix into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature.
  • In this embodiment, the original domain name is converted into a multi-dimensional image byte matrix. Because a domain name string is much shorter than typical image data, reducing the converted image byte matrix to a predetermined size can significantly reduce memory usage.
  • FIG. 3 is a schematic flowchart diagram of a DGA domain name real-time detection method according to another embodiment of the present invention. As shown in FIG. 3, the method further includes the following steps before step S2 of the embodiment shown in FIG. 2:
  • Step S2' normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  • Normalizing the multi-dimensional image byte matrix obtained from the word-embedding conversion gives the domain name data a more standardized vector representation and further improves the classification accuracy for domain names.
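  • The text does not say which normalization is applied. One common choice for models pre-trained on ImageNet is to scale byte values to [0, 1] and then standardize each channel with the widely used ImageNet channel means and standard deviations. The Python sketch below shows that choice as an assumption, using nested lists so that it has no dependencies.

```python
from typing import List

# Widely used ImageNet per-channel statistics (RGB order). Assumption: the
# source does not state which constants, if any, the invention uses.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize(matrix: List[List[List[int]]]) -> List[List[List[float]]]:
    """Normalize a height x width x 3 byte matrix (values 0..255)."""
    out = []
    for row in matrix:
        out_row = []
        for pixel in row:
            out_row.append([
                (value / 255.0 - IMAGENET_MEAN[c]) / IMAGENET_STD[c]
                for c, value in enumerate(pixel)
            ])
        out.append(out_row)
    return out
```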
  • FIG. 4 is a diagram showing an application example of an embodiment of performing domain name conversion according to the present invention.
  • In this example, a domain name zzzzanerraticallyqozaw.com is generated by a DGA. The domain name string is first converted by word embedding into an image byte matrix of size [224×224×3].
  • Because the maximum length of a domain name string usually does not exceed 25, the [224×224×3] image byte matrix can be further reduced to [25×25×3], and is finally input into the AlexNet deep learning model pre-trained on the ImageNet data set to generate the domain name features.
  • reducing the size of the converted image byte matrix to a predetermined size can significantly reduce the occupation of the memory space.
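  • The text does not spell out how characters are mapped to matrix entries. The following Python sketch shows one hypothetical mapping, in which each character becomes its ASCII byte and row i of a 25×25×3 matrix carries the byte of character i in all three channels. The layout, the zero padding, and the truncation of longer names are all assumptions made for illustration.

```python
from typing import List

MAX_LEN = 25  # predetermined size taken from the example in the text


def domain_to_bytes(domain: str, max_len: int = MAX_LEN) -> List[int]:
    """Encode a domain as ASCII bytes, truncated or zero-padded to max_len."""
    raw = domain.lower().encode("ascii", errors="replace")[:max_len]
    return list(raw) + [0] * (max_len - len(raw))


def domain_to_matrix(domain: str, size: int = MAX_LEN) -> List[List[List[int]]]:
    """Build a size x size x 3 byte matrix; row i carries character i."""
    codes = domain_to_bytes(domain, size)
    return [[[codes[i]] * 3 for _ in range(size)] for i in range(size)]
```

  • A [224×224×3] byte matrix holds 150,528 values, whereas a [25×25×3] matrix holds 1,875, which is roughly 80 times fewer and illustrates the memory saving mentioned above.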
  • FIG. 5 is a schematic structural diagram of a DGA domain name real-time detecting apparatus 100 according to an embodiment of the present invention. As shown in FIG. 5, the DGA domain name real-time detecting apparatus 100 includes the following modules:
  • the conversion module 10 is configured to convert the original domain name into a multi-dimensional numerical vector
  • the deep learning module 20 is configured to input the multi-dimensional numerical vector into a deep learning model pre-trained based on the ImageNet data set to generate a domain name feature;
  • a classifier training module 30 configured to train a domain name classifier based on the generated domain name feature
  • the prediction module 40 is configured to classify and predict the DGA domain name based on the trained domain name classifier.
  • FIG. 6 is a schematic structural diagram of a conversion module 10 according to an embodiment of the present invention. As shown in FIG. 6, the conversion module 10 includes the following units:
  • the pre-processing unit 11 is configured to convert the string of the original domain name into a multi-dimensional image byte matrix to match the input of the deep learning model pre-trained based on the ImageNet data set.
  • the adjusting unit 12 is configured to reduce the size of the multi-dimensional image byte matrix to a predetermined size.
  • FIG. 7 is a schematic structural diagram of a DGA domain name real-time detecting apparatus 200 according to another embodiment of the present invention. As shown in FIG. 7, the DGA domain name real-time detecting apparatus 200 further includes the following modules based on the DGA domain name real-time detecting apparatus 100 shown in FIG. 5:
  • the normalization module 50 is configured to perform normalization processing on the multi-dimensional image byte matrix reduced to a predetermined size.
  • FIG. 8 is a schematic structural diagram of a classifier training module 30 according to an embodiment of the present invention. As shown in FIG. 8, the classifier training module 30 includes the following units:
  • the similarity calculation unit 31 is configured to calculate a similarity distance between two domain names
  • the feature calculation unit 32 is configured to calculate a feature average value of the domain name in the domain name family as a feature of the domain name family.
  • The embodiment of the present invention uses the Alexa top 1 million domain names as non-DGA domain names and 33 million real DGA malicious domain names as test data. These DGA malicious domain names cover 64 domain name families.
  • The above data were classified and detected using several deep learning models pre-trained on the ImageNet data set. The experimental results are shown in Table 1. The true positive rate of DGA domain name detection in the embodiment of the present invention reaches as high as 99.863%, and the accuracy reaches 98.568%.
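  • For reference, the metrics quoted above follow from the counts of a binary confusion matrix using the standard definitions. The Python sketch below shows the arithmetic; the counts used in any call are the caller's own and are not the experimental data behind Table 1.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: DGA domains flagged as DGA      fn: DGA domains missed
    fp: benign domains flagged as DGA   tn: benign domains passed
    """
    total = tp + fp + tn + fn
    return {
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```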
  • FIG. 9 is a comparison diagram of DGA domain name detection speed performance according to an embodiment of the present invention.
  • the amount of domain name data processed per day is twice or more than the number of domain names processed by one CPU runtime, when two
  • the amount of domain name data that can be processed per day can be up to 5 million or more.
  • The above experimental results show that some embodiments of the present invention, for the first time, take deep learning models pre-trained on the ImageNet data set from the field of visual image classification and use them for real-time detection of DGA domain names. This avoids the high-intensity training and parameter weight adjustment of deep learning models in DGA domain name detection, and achieves a high detection rate, a low false alarm rate, and a faster detection speed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

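A DGA domain name real-time detection method and apparatus, which convert an original domain name into a multi-dimensional numerical vector, input the vector into a deep learning model pre-trained on the ImageNet data set to generate a domain name feature, train a domain name classifier based on the generated domain name feature, and classify and predict DGA domain names based on the trained domain name classifier. The method is the first to take a deep learning model pre-trained on the ImageNet data set from the field of visual image classification and use it for real-time detection of DGA domain names. It avoids the high-intensity training and parameter weight adjustment of a deep learning model in DGA domain name detection, and offers a high detection rate and a faster detection speed.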

Description

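Real-time detection method and apparatus for DGA domain names
Technical Field
The present invention relates to the technical field of network security, and in particular to a method and apparatus for real-time detection of DGA domain names.
Background Art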
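A botnet is a network in which a large number of hosts (bots) are infected with malware through one or more propagation methods, so that a botmaster can control the infected hosts in a one-to-many fashion through a command and control server (C2 server). Its aim is to infect as many hosts as possible. Botnets pose a serious threat both to the secure operation of networks and to the protection of user data.
At present, attackers operating a botnet usually connect to the C2 server through multiple domain names in order to control the victims' machines. These domain names are usually hard-coded in the malware, which gives attackers great flexibility: they can easily change the domain names and IP addresses. The main advantage of this connection method is that it can be implemented with very simple code; its disadvantage is that it is easily detected by government authorities. A domain generation algorithm (DGA) is a technique that uses random characters to generate C&C domain names (DGA domain names for short) in order to evade domain blacklist detection. With a DGA, an attacker can automatically generate thousands of domain names in a short time and thereby effectively evade domain blacklists and government detection. The emergence of DGAs poses a great threat to network security; for example, the CryptoLocker ransomware, which recently spread around the world, used such an algorithm. Effective detection of malicious DGA domain names has therefore long been a research goal in network security.
To this end, one existing detection approach collects samples of DGA domain names and reverse-engineers the DGA to predict which domain names are likely to be generated and pre-registered in the future, and adds them to a blacklist. This approach has an obvious drawback: because a DGA can generate thousands of domain names in a short time, network security personnel cannot collect domain name samples and update the blacklist every day.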
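Another classic detection technique performs feature extraction and classification on domain name data. It consists of two stages: feature engineering and a classification algorithm. Feature engineering is the most laborious part and mainly involves two approaches:
1) Filtering-based methods: domain names are checked against the Alexa top 1 million websites and against blacklists;
2) Statistical-feature-based methods: typical statistical features of a domain name include its length, bigrams, n-grams, information entropy, lifetime, and character frequency distribution.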
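Statistical features of this kind are typically computed directly from the domain name string. The following is a minimal sketch of a few of them (length, information entropy, bigram counts, and character frequencies). The function names and the choice to analyze only the left-most label are illustrative assumptions, not details taken from the cited applications.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_counts(label: str, n: int = 2) -> Counter:
    """Counts of overlapping character n-grams (bigrams when n == 2)."""
    return Counter(label[i:i + n] for i in range(len(label) - n + 1))


def statistical_features(domain: str) -> dict:
    """Hand-crafted features for the left-most label of a domain name (assumption)."""
    label = domain.lower().split(".")[0]
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "distinct_bigrams": len(ngram_counts(label, 2)),
        "char_frequency": dict(Counter(label)),
    }
```

Each of these features has to be designed and tuned by hand, which is the manual effort that the approaches discussed in this document try to avoid.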
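Chinese patent application publication No. CN105577660A proposes a DGA domain name detection method based on random forests. Chinese patent application publication No. CN105897714A proposes a botnet detection method based on DNS traffic features. US patent application publication No. US2013/0191915A1 also proposes a method and system for detecting DGA domain names. All of these applications detect DGA domain names with the classic statistical-feature-based feature engineering described above. This kind of detection has several drawbacks: it relies heavily on manual feature engineering and is difficult to implement; its detection rate is low and its false alarm rate is high; and it is slow, so it cannot detect in real time.
With the development of machine learning, and deep learning in particular, in recent years, researchers have begun to explore deep learning solutions for DGA domain name detection to address the drawbacks of the classic techniques above. The publication "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks" (Woodbridge J. et al., https://arxiv.org/abs/1611.00791, November 2016) proposes a method for detecting DGA domain names with a long short-term memory (LSTM) network. The method trains an LSTM model on a training dataset, feeds the character sequence of a domain name into the model for feature extraction, and then classifies and predicts using logistic regression. An LSTM is a special type of recurrent neural network that can learn long-term dependencies, such as those in text and language. Because the method extracts features automatically with the LSTM model, it dispenses with the tedious feature engineering step and does not depend on contextual information, and so achieves real-time detection of DGA domain names to some extent. However, the method needs a large amount of training data to train the LSTM model, the model's parameter weights must be adjusted during training, and training is computationally intensive. In addition, the model is sensitive to class imbalance in the training set, and its ability to detect DGA domain name families that lack sufficient training data is limited.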
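Summary of the Invention
In one aspect, the present invention provides a method for real-time detection of DGA domain names, characterized by comprising the following steps:
Step S1: converting an original domain name into a multi-dimensional numerical vector;
Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
Step S3: training a domain name classifier based on the generated domain name features;
Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.
In some embodiments, converting the original domain name into a multi-dimensional numerical vector in step S1 comprises the following steps:
Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.
In some embodiments, the method further comprises, before step S2:
Step S2': normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
In some embodiments, generating the domain name features in step S2 further comprises:
extracting the third-from-last layer of the pre-trained deep learning model to generate the domain name features.
In some embodiments, the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
In some embodiments, the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating a similarity distance between two domain names.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating the mean of the features of the domain names in a domain name family as the feature of that family.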
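In another aspect, the present invention provides an apparatus for real-time detection of DGA domain names, characterized by comprising the following modules:
a conversion module, configured to convert an original domain name into a multi-dimensional numerical vector;
a deep learning module, configured to input the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
a classifier training module, configured to train a domain name classifier based on the generated domain name features;
a prediction module, configured to classify and predict DGA domain names based on the trained domain name classifier.
In some embodiments, the conversion module comprises:
a preprocessing unit, configured to convert the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
an adjustment unit, configured to reduce the size of the multi-dimensional image byte matrix to a predetermined size.
In some embodiments, the detection apparatus further comprises:
a normalization module, configured to normalize the multi-dimensional image byte matrix that has been reduced to the predetermined size.
In some embodiments, the deep learning module extracts the third-from-last layer of the pre-trained deep learning model to generate the domain name features.
In some embodiments, the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
In some embodiments, the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
In some embodiments, the classifier training module comprises:
a similarity calculation unit, configured to calculate a similarity distance between two domain names.
In some embodiments, the classifier training module comprises:
a feature calculation unit, configured to calculate the mean of the features of the domain names in a domain name family as the feature of that family.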
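In another aspect, the present invention provides a computer-readable storage medium storing computer program instructions, the computer program instructions being used to execute the following steps in a computer:
Step S1: converting an original domain name into a multi-dimensional numerical vector;
Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
Step S3: training a domain name classifier based on the generated domain name features;
Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.
In some embodiments, converting the original domain name into a multi-dimensional numerical vector in step S1 comprises the following steps:
Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.
In some embodiments, the steps further comprise, before step S2:
Step S2': normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
In some embodiments, generating the domain name features in step S2 further comprises:
extracting the third-from-last layer of the pre-trained deep learning model to generate the domain name features.
In some embodiments, the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
In some embodiments, the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating a similarity distance between two domain names.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating the mean of the features of the domain names in a domain name family as the feature of that family.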
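In another aspect, the present invention provides a computer apparatus comprising a processor and a memory, the memory storing computer program instructions, characterized in that the computer program instructions, when run by the processor, are used to execute the following steps:
Step S1: converting an original domain name into a multi-dimensional numerical vector;
Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
Step S3: training a domain name classifier based on the generated domain name features;
Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.
In some embodiments, converting the original domain name into a multi-dimensional numerical vector in step S1 comprises the following steps:
Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.
In some embodiments, the steps further comprise, before step S2:
Step S2': normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
In some embodiments, generating the domain name features in step S2 further comprises:
extracting the third-from-last layer of the pre-trained deep learning model to generate the domain name features.
In some embodiments, the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
In some embodiments, the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating a similarity distance between two domain names.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises: calculating the mean of the features of the domain names in a domain name family as the feature of that family.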
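Some embodiments of the present invention exploit the knowledge captured by excellent deep learning models pre-trained on the large-scale ImageNet image dataset. Through word embedding conversion of domain name data and transfer learning of the deep learning model, deep learning models pre-trained on the ImageNet dataset are, for the first time, taken from the field of visual image classification and applied to real-time detection of DGA domain names. This avoids the intensive training and parameter-weight tuning of a deep learning model for DGA domain name detection, and achieves a high detection rate, a low false alarm rate, and a faster detection speed.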
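Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for real-time detection of DGA domain names according to one embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for real-time detection of DGA domain names according to another embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for real-time detection of DGA domain names according to another embodiment of the present invention;
FIG. 4 is an application example of domain name conversion according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for real-time detection of DGA domain names according to one embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a conversion module according to one embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for real-time detection of DGA domain names according to another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a classifier training module according to one embodiment of the present invention;
FIG. 9 is a comparison diagram of DGA domain name detection speed according to an embodiment of the present invention.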
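Detailed Description
The present invention is described clearly and completely below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a method for real-time detection of DGA domain names according to one embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:
Step S1: converting an original domain name into a multi-dimensional numerical vector;
Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
Step S3: training a domain name classifier based on the generated domain name features;
Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.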
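In the embodiments of the present invention, ImageNet is the name of a well-known computer vision recognition project. It is currently the largest image recognition database in the world, containing more than 10 million hand-annotated images and more than 20,000 object categories. A number of excellent deep learning models have been developed and trained on this large-scale dataset, such as the AlexNet, VGG, SqueezeNet, Inception, and ResNet models. At present, these models are mainly applied in fields such as computer vision, speech recognition, and natural language processing, where they have been very successful, but no precedent has been found for applying them to computer network security, and in particular to DGA domain name detection.
Applying these deep learning models pre-trained on the ImageNet dataset to the problem of DGA domain name detection therefore presents two main difficulties:
First, the DGA domain names to be learned and classified are essentially character data, which differ from the original image data in the ImageNet dataset in both size and content;
Second, domain name detection can involve millions of domain names, and retraining a deep learning model on such massive domain name data would be computationally very intensive and consume a great deal of time and resources.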
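To address these two difficulties, the embodiments of the present invention first use word embedding to convert the original character-type domain name data into an image format of multi-dimensional numerical vectors, so that the domain name data can be processed by a deep learning model pre-trained on the ImageNet dataset in the same way as the image data in that dataset. Word embedding is a term from natural language processing; mathematically, it is defined as a mapping from a document space to a numerical vector space (usually of low dimension). The mapping is injective: each output vector Y corresponds to exactly one input X, and each input X is mapped to exactly one vector Y. With word embedding, document-type data can be converted into numerical form, which turns a document analysis problem into a problem on the corresponding numerical vectors.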
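The injectivity property can be illustrated with a character-level mapping. The sketch below is an assumption for illustration only: the source does not disclose the exact embedding, and `ALPHABET`, `embed`, and `unembed` are hypothetical names. Domain names are lowercased first because they are case-insensitive.

```python
from typing import List

# Hypothetical alphabet: characters that occur in host names, plus the dot.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
# Index 0 is reserved for padding, so real characters start at 1.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
INDEX_TO_CHAR = {i: ch for ch, i in CHAR_TO_INDEX.items()}


def embed(domain: str) -> List[int]:
    """Map each character of a domain name to a distinct positive integer."""
    return [CHAR_TO_INDEX[ch] for ch in domain.lower()]


def unembed(vector: List[int]) -> str:
    """Invert embed(); this works only because the character mapping is one-to-one."""
    return "".join(INDEX_TO_CHAR[i] for i in vector)
```

Because no two characters share an index, two different (lowercased) domain names never produce the same vector, and the original string can be recovered from its vector.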
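Second, drawing on transfer learning, the parameter weights of a deep learning model pre-trained on the ImageNet dataset are transferred directly to the target learning model for the domain name dataset produced by the word embedding conversion. This makes effective use of the knowledge captured by the excellent deep learning models trained on ImageNet, avoids the intensive training and parameter-weight tuning of a deep learning model on large-scale domain name data, and allows DGA domain name detection to achieve a high detection rate and a low false alarm rate while meeting the real-time performance requirement.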
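The core of this transfer-learning setup is that the feature extractor's weights stay fixed and only a lightweight classifier is fitted. The sketch below illustrates that split in pure Python. It is not the patented implementation: `FROZEN_WEIGHTS` is a small random matrix standing in for real pretrained weights, and all names and hyperparameters are assumptions.

```python
import math
import random

INPUT_DIM, FEATURE_DIM = 6, 4

# Stand-in for pretrained weights (hypothetical values): created once, never updated.
_rng = random.Random(0)
FROZEN_WEIGHTS = [[_rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
                  for _ in range(FEATURE_DIM)]


def extract_features(x):
    """Frozen feature extractor: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_classifier(features, labels, epochs=100, lr=0.1):
    """Fit logistic regression by stochastic gradient descent on frozen features."""
    weights, bias = [0.0] * FEATURE_DIM, 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict_probability(x, weights, bias):
    """Probability that input x belongs to the positive (DGA) class."""
    f = extract_features(x)
    return sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
```

Only `weights` and `bias` change during training, so the cost of fitting scales with the small classifier rather than with the deep network.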
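In some embodiments, the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
In some embodiments, generating the domain name features in step S2 further comprises extracting the third-from-last layer of the pre-trained deep learning model to generate the domain name features. This is because, in a pre-trained deep learning model, the top output layer usually suffers from overfitting, and features from layers below the top output layer are often better suited to classification.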
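Taking features from an intermediate layer amounts to running the forward pass and keeping the output of the chosen layer instead of the final one. The sketch below shows this on a toy list-based network. `TOY_LAYERS` and its weights are hypothetical and are not pretrained values; a real model would expose its layers through the deep learning framework in use.

```python
def linear(weights, bias):
    """Return a layer computing weights @ v + bias for a list-based vector v."""
    def layer(v):
        return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]
    return layer


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def forward_collect(layers, x):
    """Run x through every layer and return each layer's output, in order."""
    outputs = []
    for layer in layers:
        x = layer(x)
        outputs.append(x)
    return outputs


def third_from_last_features(layers, x):
    """Use the output of the third-from-last layer as the feature vector."""
    if len(layers) < 3:
        raise ValueError("the network needs at least three layers")
    return forward_collect(layers, x)[-3]


# Hypothetical toy network; the weights are arbitrary, not pretrained values.
TOY_LAYERS = [
    linear([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    relu,
    linear([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),  # third-from-last layer
    relu,
    linear([[1.0, 1.0]], [0.0]),  # output layer, skipped for feature extraction
]
```

Calling `third_from_last_features(TOY_LAYERS, [1.0, -2.0])` returns the two-element activation of the third-from-last layer rather than the single output value.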
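In some embodiments, the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
In some embodiments, training the domain name classifier based on the generated domain name features in step S3 further comprises calculating a similarity distance between two domain names. Scoring the similarity of two domain names by Euclidean distance helps improve classification accuracy.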
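The Euclidean similarity distance between two domain names can be computed on their feature vectors. The sketch below assumes the features are plain lists of numbers of equal length; the function name is illustrative.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A smaller distance means the two domain names have more similar features.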
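In some embodiments, training the domain name classifier based on the generated domain name features in step S3 comprises calculating the mean of the features of the domain names in a domain name family as the feature of that family. In the real world, DGA domain names fall into multiple families. To detect these families, the embodiments of the present invention use the mean of the features of the domain names in a family as the feature of that family, which enables classification and detection of DGA domain name families.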
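One way to use a per-family mean feature is as a centroid, and then to assign a new domain name to the family whose centroid is nearest. The sketch below assumes this nearest-centroid reading, which the source does not spell out; the family names and function names are hypothetical.

```python
import math


def family_centroid(feature_vectors):
    """Element-wise mean of the feature vectors of one domain name family."""
    n = len(feature_vectors)
    return [sum(column) / n for column in zip(*feature_vectors)]


def build_family_profiles(features_by_family):
    """Map each family name to the mean feature vector of its members."""
    return {name: family_centroid(vectors) for name, vectors in features_by_family.items()}


def nearest_family(feature, profiles):
    """Return the family whose mean feature is closest in Euclidean distance."""
    return min(profiles, key=lambda name: math.dist(feature, profiles[name]))
```

For example, with two families whose members cluster around different points, a new feature vector is labeled with the family whose mean it lies closest to.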
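FIG. 2 is a schematic flowchart of a method for real-time detection of DGA domain names according to another embodiment of the present invention. As shown in FIG. 2, building on the embodiment shown in FIG. 1, step S1 of converting the original domain name into a multi-dimensional numerical vector comprises:
Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.
In this embodiment, step S2 comprises inputting the multi-dimensional image byte matrix into the deep learning model pre-trained on the ImageNet dataset to generate the domain name features.
In this embodiment, the original domain name is converted into a multi-dimensional image byte matrix. Because a domain name string is small compared with typical image data, reducing the converted image byte matrix to the predetermined size can significantly reduce memory usage.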
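FIG. 3 is a schematic flowchart of a method for real-time detection of DGA domain names according to another embodiment of the present invention. As shown in FIG. 3, the method further comprises the following step before step S2 of the embodiment shown in FIG. 2:
Step S2': normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
In this embodiment, normalizing the multi-dimensional image byte matrix obtained by the word embedding conversion makes the vector representation of the domain name data more standardized and regular, which further improves the accuracy of domain name classification.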
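The source does not say which normalization is applied. A common choice for models pre-trained on ImageNet is to scale byte values to the range 0 to 1 and then standardize each channel with the per-channel mean and standard deviation widely used for ImageNet inputs. The sketch below assumes that choice; `normalize` is a hypothetical name.

```python
# Per-channel statistics commonly used for ImageNet inputs (assumed here).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize(matrix):
    """Normalize a matrix indexed as matrix[row][col][channel] with byte values 0..255."""
    return [
        [
            [(value / 255.0 - IMAGENET_MEAN[c]) / IMAGENET_STD[c]
             for c, value in enumerate(pixel)]
            for pixel in row
        ]
        for row in matrix
    ]
```

After this step, every channel has roughly the value range that the pretrained network saw during its original training.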
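FIG. 4 is an application example of domain name conversion according to one embodiment of the present invention. As shown in FIG. 4, taking the DGA-generated domain name zzzzanerraticallyqozaw.com as an example, the domain name string is first converted by word embedding into a [224×224×3] image byte matrix. Because the maximum length of a domain name string usually does not exceed 25, the [224×224×3] image byte matrix can be further reduced to [25×25×3]. Finally, the matrix is input into an AlexNet deep learning model pre-trained on the ImageNet dataset to generate the domain name features. Reducing the converted image byte matrix to the predetermined size in this way significantly reduces memory usage.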
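The two conversion steps can be sketched as follows. The source does not specify how bytes are laid out in the matrix or how the matrix is reduced, so everything here is an assumption: the ASCII bytes are written row by row, 25 per row, copied to all three channels, and the reduction simply keeps the top-left 25×25 block. With this layout the reduction loses nothing for names of up to 625 bytes, including the 26-character example name. The function names and the `wrap` parameter are hypothetical.

```python
def to_image_matrix(domain, size=224, channels=3, wrap=25):
    """Write the ASCII bytes of a domain name into a zero-padded size x size x channels matrix.

    Bytes are laid out row by row, `wrap` per row, and copied to every channel.
    """
    data = domain.lower().encode("ascii")
    if wrap > size or len(data) > wrap * size:
        raise ValueError("domain name does not fit in the matrix")
    matrix = [[[0] * channels for _ in range(size)] for _ in range(size)]
    for i, byte in enumerate(data):
        row, col = divmod(i, wrap)
        matrix[row][col] = [byte] * channels
    return matrix


def shrink(matrix, size=25):
    """Reduce the matrix to size x size by keeping its top-left block."""
    return [row[:size] for row in matrix[:size]]
```

For the example name, `shrink(to_image_matrix("zzzzanerraticallyqozaw.com"))` yields a 25×25×3 matrix whose first row holds the first 25 characters and whose second row starts with the final `m`.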
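FIG. 5 is a schematic structural diagram of an apparatus 100 for real-time detection of DGA domain names according to one embodiment of the present invention. As shown in FIG. 5, the apparatus 100 comprises the following modules:
a conversion module 10, configured to convert an original domain name into a multi-dimensional numerical vector;
a deep learning module 20, configured to input the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
a classifier training module 30, configured to train a domain name classifier based on the generated domain name features;
a prediction module 40, configured to classify and predict DGA domain names based on the trained domain name classifier.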
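FIG. 6 is a schematic structural diagram of the conversion module 10 according to one embodiment of the present invention. As shown in FIG. 6, the conversion module 10 comprises the following units:
a preprocessing unit 11, configured to convert the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
an adjustment unit 12, configured to reduce the size of the multi-dimensional image byte matrix to a predetermined size.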
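FIG. 7 is a schematic structural diagram of an apparatus 200 for real-time detection of DGA domain names according to another embodiment of the present invention. As shown in FIG. 7, in addition to the modules of the apparatus 100 shown in FIG. 5, the apparatus 200 further comprises the following module:
a normalization module 50, configured to normalize the multi-dimensional image byte matrix that has been reduced to the predetermined size.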
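FIG. 8 is a schematic structural diagram of the classifier training module 30 according to one embodiment of the present invention. As shown in FIG. 8, the classifier training module 30 comprises the following units:
a similarity calculation unit 31, configured to calculate a similarity distance between two domain names;
a feature calculation unit 32, configured to calculate the mean of the features of the domain names in a domain name family as the feature of that family.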
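In the embodiments of the present invention, the Alexa top 1 million domain names were selected as non-DGA domain names, and 33 million real malicious DGA domain names, covering 64 domain name families, were selected as test data. Several deep learning models pre-trained on the ImageNet dataset were used to classify and detect these data; the experimental results are shown in Table 1. The true positive rate of the embodiments for DGA domain name detection reaches as high as 99.863%, and the accuracy reaches 98.568%.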
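For reference, the metrics quoted here are derived from a confusion matrix as sketched below. The function name is illustrative, and any counts passed to it are the caller's own; the source does not publish the underlying counts.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute detection metrics from confusion-matrix counts.

    tp: DGA domains flagged as DGA      fn: DGA domains missed
    fp: benign domains flagged as DGA   tn: benign domains passed
    """
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

The true positive rate corresponds to the detection rate, and the false positive rate corresponds to the false alarm rate discussed in this document.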
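FIG. 9 is a comparison diagram of DGA domain name detection speed according to an embodiment of the present invention. As shown in FIG. 9, thanks to the image data processing capability of GPUs, when the detection method runs on one GPU, the amount of domain name data processed per day is more than twice the amount processed when it runs on one CPU; when it runs on two GPUs, the amount of domain name data that can be processed per day can reach 5 million or more.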
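To put the daily figure in per-second terms, the conversion below is simple arithmetic on the 5 million domains per day stated for two GPUs; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def domains_per_second(domains_per_day):
    """Average sustained throughput implied by a daily volume."""
    return domains_per_day / SECONDS_PER_DAY


def milliseconds_per_domain(domains_per_day):
    """Average time budget per domain name implied by a daily volume."""
    return 1000.0 * SECONDS_PER_DAY / domains_per_day
```

At 5 million domain names per day, this works out to roughly 58 domain names per second, or about 17 ms per domain name on average.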
Table 1. Experimental results of model detection
Figure PCTCN2018115087-appb-000001
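The above experimental results show that some embodiments of the present invention are the first to take deep learning models pre-trained on the ImageNet dataset from the field of visual image classification and apply them to real-time detection of DGA domain names. This avoids the intensive training and parameter-weight tuning of a deep learning model for DGA domain name detection, and achieves a high detection rate, a low false alarm rate, and a faster detection speed.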

Claims (18)

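  1. A method for real-time detection of DGA domain names, characterized by comprising the following steps:
    Step S1: converting an original domain name into a multi-dimensional numerical vector;
    Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
    Step S3: training a domain name classifier based on the generated domain name features;
    Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.
  2. The method according to claim 1, characterized in that converting the original domain name into a multi-dimensional numerical vector in step S1 comprises the following steps:
    Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
    Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.
  3. The method according to claim 2, characterized by further comprising the following step before step S2:
    Step S2': normalizing the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  4. The method according to claim 3, characterized in that generating the domain name features in step S2 comprises: extracting the third-from-last layer of the trained deep learning model to generate the domain name features.
  5. The method according to claim 4, characterized in that the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  6. The method according to claim 5, characterized in that the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  7. The method according to claim 6, characterized in that training the domain name classifier based on the generated domain name features in step S3 comprises: calculating a similarity distance between two domain names.
  8. The method according to claim 7, characterized in that training the domain name classifier based on the generated domain name features in step S3 comprises: calculating the mean of the features of the domain names in a domain name family as the feature of that family.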
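  9. An apparatus for real-time detection of DGA domain names, characterized by comprising the following modules:
    a conversion module, configured to convert an original domain name into a multi-dimensional numerical vector;
    a deep learning module, configured to input the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
    a classifier training module, configured to train a domain name classifier based on the generated domain name features;
    a prediction module, configured to classify and predict DGA domain names based on the trained domain name classifier.
  10. The apparatus according to claim 9, characterized in that the conversion module comprises:
    a preprocessing unit, configured to convert the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
    an adjustment unit, configured to reduce the size of the multi-dimensional image byte matrix to a predetermined size.
  11. The apparatus according to claim 10, characterized in that the detection apparatus further comprises:
    a normalization module, configured to normalize the multi-dimensional image byte matrix that has been reduced to the predetermined size.
  12. The apparatus according to claim 11, characterized in that the deep learning module extracts the third-from-last layer of the pre-trained deep learning model to generate the domain name features.
  13. The apparatus according to claim 12, characterized in that the deep learning model pre-trained on the ImageNet dataset comprises an AlexNet model, a VGG model, a SqueezeNet model, an Inception model, or a ResNet model.
  14. The apparatus according to claim 13, characterized in that the domain name classifier comprises a decision tree model, a support vector machine model, a logistic regression model, or a random forest model.
  15. The apparatus according to claim 14, characterized in that the classifier training module comprises:
    a similarity calculation unit, configured to calculate a similarity distance between two domain names.
  16. The apparatus according to claim 15, characterized in that the classifier training module comprises:
    a feature calculation unit, configured to calculate the mean of the features of the domain names in a domain name family as the feature of that family.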
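  17. A computer-readable storage medium storing computer program instructions, the computer program instructions being used to execute the following steps in a computer:
    Step S1: converting an original domain name into a multi-dimensional numerical vector;
    Step S2: inputting the multi-dimensional numerical vector into a deep learning model pre-trained on the ImageNet dataset to generate domain name features;
    Step S3: training a domain name classifier based on the generated domain name features;
    Step S4: classifying and predicting DGA domain names based on the trained domain name classifier.
  18. The storage medium according to claim 17, characterized in that converting the original domain name into a multi-dimensional numerical vector in step S1 comprises the following steps:
    Step S11: converting the character string of the original domain name into a multi-dimensional image byte matrix that matches the input of the deep learning model pre-trained on the ImageNet dataset;
    Step S12: reducing the size of the multi-dimensional image byte matrix to a predetermined size.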
PCT/CN2018/115087 2017-11-15 2018-11-12 Real-time detection method and apparatus for DGA domain name WO2019096099A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/764,741 US11334764B2 (en) 2017-11-15 2018-11-12 Real-time detection method and apparatus for DGA domain name

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711130020.0A CN109788079B (zh) 2017-11-15 2017-11-15 Real-time detection method and apparatus for DGA domain name
CN201711130020.0 2017-11-15

Publications (1)

Publication Number Publication Date
WO2019096099A1 true WO2019096099A1 (zh) 2019-05-23

Family

ID=66494337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115087 WO2019096099A1 (zh) 2017-11-15 2018-11-12 Real-time detection method and apparatus for DGA domain name

Country Status (3)

Country Link
US (1) US11334764B2 (zh)
CN (1) CN109788079B (zh)
WO (1) WO2019096099A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
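CN111125700A (zh) * 2019-12-11 2020-05-08 中山大学 DGA family classification method based on host correlation
CN112217787A (zh) * 2020-08-31 2021-01-12 北京工业大学 Method and system for generating counterfeit domain name training data based on ED-GAN
CN114095176A (zh) * 2021-10-29 2022-02-25 北京天融信网络安全技术有限公司 Malicious domain name detection method and apparatus
CN114499906A (zh) * 2020-11-12 2022-05-13 清华大学 DGA domain name detection method and system
CN114978558A (zh) * 2021-02-20 2022-08-30 中国电信股份有限公司 Domain name identification method and apparatus, computer apparatus, and storage medium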

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11423143B1 (en) 2017-12-21 2022-08-23 Exabeam, Inc. Anomaly detection based on processes executed within a network
US11431741B1 (en) * 2018-05-16 2022-08-30 Exabeam, Inc. Detecting unmanaged and unauthorized assets in an information technology network with a recurrent neural network that identifies anomalously-named assets
US11625366B1 (en) 2019-06-04 2023-04-11 Exabeam, Inc. System, method, and computer program for automatic parser creation
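CN110674370A (zh) * 2019-09-23 2020-01-10 鹏城实验室 Domain name identification method and apparatus, storage medium, and electronic device
CN110798481A (zh) * 2019-11-08 2020-02-14 杭州安恒信息技术股份有限公司 Malicious domain name detection method and apparatus based on deep learning
US11227122B1 (en) * 2019-12-31 2022-01-18 Facebook, Inc. Methods, mediums, and systems for representing a model in a memory of device
CN111628970B (zh) * 2020-04-24 2021-10-15 中国科学院计算技术研究所 Method, medium, and electronic device for detecting DGA-type botnets
US11956253B1 (en) 2020-06-15 2024-04-09 Exabeam, Inc. Ranking cybersecurity alerts from multiple sources using machine learning
CN112866257B (zh) * 2021-01-22 2023-09-26 网宿科技股份有限公司 Domain name detection method, system, and apparatus
CN113572770B (zh) * 2021-07-26 2022-09-02 清华大学 Method and apparatus for detecting domain names generated by a domain generation algorithm
CN113746952B (zh) * 2021-09-14 2024-04-16 京东科技信息技术有限公司 DGA domain name detection method and apparatus, electronic device, and computer storage medium
CN114648069A (zh) * 2022-03-23 2022-06-21 三六零数字安全科技集团有限公司 Domain name detection method and apparatus, device, and storage medium
CN116074081B (zh) * 2023-01-28 2023-06-13 鹏城实验室 DGA domain name detection method, apparatus, device, and storage medium
CN116132154B (zh) * 2023-02-03 2023-06-30 北京六方云信息技术有限公司 Verification method, apparatus, device, and storage medium for a DNS tunnel traffic detection system
CN115913792B (zh) * 2023-03-08 2023-05-23 浙江鹏信信息科技股份有限公司 DGA domain name identification method, system, and readable medium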

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105577660A (zh) * 2015-12-22 2016-05-11 国家电网公司 DGA domain name detection method based on random forest
CN105610830A (zh) * 2015-12-30 2016-05-25 山石网科通信技术有限公司 Domain name detection method and apparatus
CN105897714A (zh) * 2016-04-11 2016-08-24 天津大学 Botnet detection method based on DNS traffic features
US20160359887A1 (en) * 2015-06-04 2016-12-08 Cisco Technology, Inc. Domain name system (dns) based anomaly detection
CN106911717A (zh) * 2017-04-13 2017-06-30 成都亚信网络安全产业技术研究院有限公司 Domain name detection method and apparatus

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8516585B2 (en) * 2010-10-01 2013-08-20 Alcatel Lucent System and method for detection of domain-flux botnets and the like
CN102333313A (zh) * 2011-10-18 2012-01-25 中国科学院计算技术研究所 Mobile botnet signature generation method and mobile botnet detection method
US20130263226A1 (en) * 2012-01-22 2013-10-03 Frank W. Sudia False Banking, Credit Card, and Ecommerce System
US9922190B2 (en) * 2012-01-25 2018-03-20 Damballa, Inc. Method and system for detecting DGA-based malware
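JP2015043486A (ja) * 2013-08-26 2015-03-05 ソニー株式会社 Proxy server device, information processing method, program, terminal device, and content supply system
CN104778407B (zh) * 2015-04-14 2017-08-08 电子科技大学 Multi-dimensional signature-free malicious program detection method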
US9917852B1 (en) * 2015-06-29 2018-03-13 Palo Alto Networks, Inc. DGA behavior detection
US9781139B2 (en) * 2015-07-22 2017-10-03 Cisco Technology, Inc. Identifying malware communications with DGA generated domains by discriminative learning
US10185761B2 (en) * 2015-08-07 2019-01-22 Cisco Technology, Inc. Domain classification based on domain name system (DNS) traffic
CN106850571A (zh) * 2016-12-29 2017-06-13 北京奇虎科技有限公司 Botnet family identification method and apparatus
US10819724B2 (en) * 2017-04-03 2020-10-27 Royal Bank Of Canada Systems and methods for cyberbot network detection
US11025648B2 (en) * 2017-09-21 2021-06-01 Infoblox Inc. Detection of algorithmically generated domains based on a dictionary

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
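CN111125700A (zh) * 2019-12-11 2020-05-08 中山大学 DGA family classification method based on host correlation
CN111125700B (zh) * 2019-12-11 2023-02-07 中山大学 DGA family classification method based on host correlation
CN112217787A (zh) * 2020-08-31 2021-01-12 北京工业大学 Method and system for generating counterfeit domain name training data based on ED-GAN
CN112217787B (zh) * 2020-08-31 2022-11-04 北京工业大学 Method and system for generating counterfeit domain name training data based on ED-GAN
CN114499906A (zh) * 2020-11-12 2022-05-13 清华大学 DGA domain name detection method and system
CN114978558A (zh) * 2021-02-20 2022-08-30 中国电信股份有限公司 Domain name identification method and apparatus, computer apparatus, and storage medium
CN114095176A (zh) * 2021-10-29 2022-02-25 北京天融信网络安全技术有限公司 Malicious domain name detection method and apparatus
CN114095176B (zh) * 2021-10-29 2024-04-09 北京天融信网络安全技术有限公司 Malicious domain name detection method and apparatus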

Also Published As

Publication number Publication date
CN109788079B (zh) 2022-03-15
US11334764B2 (en) 2022-05-17
US20210182612A1 (en) 2021-06-17
CN109788079A (zh) 2019-05-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18878986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18878986

Country of ref document: EP

Kind code of ref document: A1