CN114793170A - DNS tunnel detection method, system, equipment and terminal based on open set identification - Google Patents

Info

Publication number
CN114793170A
Authority
CN
China
Prior art keywords
dns
probability
data
class
set identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210308273.7A
Other languages
Chinese (zh)
Other versions
CN114793170B (en)
Inventor
付玉龙
焦小彬
刘璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210308273.7A
Publication of CN114793170A
Application granted
Publication of CN114793170B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of network security and discloses a DNS tunnel detection method, system, device and terminal based on open set identification. Captured DNS data packets are parsed; features are extracted from the DNS query subdomain using a neural network; the data space is divided into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors and calibrating the classification boundaries with extreme value theory; the probability that a test sample belongs to each class, including the unknown class, is calculated with a probability estimation method; and the test sample is assigned the class label with the maximum probability, or identified as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε. The invention applies open set identification to DNS tunnel detection and is expected to solve the problem of identifying covert tunnels of unknown classes.

Description

DNS tunnel detection method, system, equipment and terminal based on open set identification
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a DNS tunnel detection method, system, device and terminal based on open set identification.
Background
The DNS protocol is a core piece of Internet infrastructure. Its main function is to convert host names into IP addresses, which allows other network applications to run smoothly, and DNS services have become an indispensable part of the Internet. Because the protocol lacks confidentiality and integrity protection for its data, it has become a preferred covert communication channel for attackers, who use DNS tunnels to bypass security control policies in order to transmit remote control commands or steal sensitive data, posing a serious threat to network security.
Existing DNS covert channel detection usually relies on feature engineering based on expert knowledge: features of the communication behavior of DNS data and word frequency statistics of the query records are extracted, and a machine learning algorithm performs the classification. Before feature extraction, the captured DNS messages must be reassembled into data streams so that more auxiliary identification information can be extracted.
The above analysis shows that the prior art has the following problems and defects:
(1) excessive auxiliary identification information is used during data acquisition, which degrades the performance of an online system;
(2) after-the-fact detection that reassembles data messages into data streams cannot meet an intrusion detection system's requirement to respond to attack threats in time, which can lead to serious consequences such as data leakage;
(3) manual feature extraction is limited because features are extracted only from a known knowledge set, and the false alarm rate is high;
(4) in an online scenario, existing schemes cannot accurately identify variant DNS tunnels, and the classification algorithm needs to be improved to handle data of unknown classes.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a DNS tunnel detection method and system based on open set identification.
The DNS tunnel detection method based on open set identification is implemented as follows: DNS data is preprocessed; a deep learning network produces a feature vector representation of the data; an open set identification algorithm applies a segmentation transformation to the resulting feature vector space to determine the boundary between the known space and the unknown space; and finally an identification result is output through probability evaluation.
Further, the method comprises: parsing the captured DNS data packets; extracting features from the DNS query subdomain using a neural network; dividing the data space into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors and calibrating the classification boundaries with extreme value theory; calculating the probability that the test sample belongs to each class, including the unknown class, with a probability estimation method; and assigning the test sample the class label with the maximum probability, or identifying it as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
Further, the open set identification-based DNS tunnel detection method comprises the following steps:
the first step, a data preprocessing stage: parsing the captured DNS data packets and converting characters into integer values;
the second step, a feature vector representation stage: extracting features from the DNS query subdomain using a neural network, representing the DNS query domain name as a feature vector using a CNN (convolutional neural network), an LSTM (long short-term memory network) or different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the third step, constructing the open set identification classification model: dividing the data space into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors, calculating the average path length of each class as the class center, and calibrating the classification boundaries with extreme value theory;
the fourth step, a probability evaluation stage: calculating, based on the similarity between the test sample and each known class, the probability that the test sample belongs to each class, including the unknown class, with a probability estimation method;
the fifth step, outputting the identification result: assigning the test sample the class label with the maximum probability, or identifying it as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
Further, the first step includes: retaining only the query domain name field extracted from the DNS packet, and creating two lookup tables based on an analysis of the DNS domain name encoding specification: one mapping characters to numbers and the other mapping numbers to characters, so that after processing each character has a corresponding integer value; after numerical encoding, the sequence is padded at the tail with '0', and data exceeding the input length is truncated.
Further, after the network training in the second step is completed, the activation vector of the layer preceding the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the SoftMax classification layer in the network is selected as the feature vector.
Further, the third step includes: performing a segmentation transformation on the obtained feature vector space using an extreme forest adapted to open set identification, so that each resulting class boundary contains a minimum distance boundary.
Further, training the classifier in the third step comprises:
(1) for all known-class data, sub-sampling each class; randomly selecting a feature as the initial node and randomly selecting a value within the value range of that feature to divide the ψ samples into two parts, with samples smaller than the value assigned to the left branch and samples larger than the value assigned to the right branch; repeating this binary partitioning in the left and right branches until one of the following conditions is satisfied: the tree has reached the height limit; only one sample remains on a node; or all samples on a node have identical features; repeating the partitioning process a number of times determined by the amount of sample data, the resulting partition trees forming a binary tree forest;
(2) calculating the average path length of each class; for ψ samples, this is equal to the average search length of an unsuccessful search in a binary search tree:
Figure BDA0003566865230000031
(3) calculating the path length vector l(x) of the training samples in each class, and calculating its distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain the distance feature set L(x);
(4) determining the classification boundary of each class using extreme value theory, where the extreme value model describes the distribution of abnormally high or abnormally low data and can model right-skewed, left-skewed or symmetric data; fitting d(x) for each class separately with a Weibull distribution to delimit the known class boundaries:
Figure BDA0003566865230000041
wherein
Figure BDA0003566865230000042
β, λ and κ are the location, scale and shape parameters, respectively.
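Steps (1) to (3) above can be sketched in plain Python as follows. The formula images are not reproduced in this text, so `c(psi)` assumes the standard isolation forest expression for the average unsuccessful search length in a binary search tree; the height limit `ceil(log2(psi))`, the choice of splitting only on features that still vary, sending ties to the right branch, and averaging the path length over all trees of a class are likewise assumptions of this sketch. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(psi):
    """Average path length of an unsuccessful BST search over psi samples (assumed form)."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    harmonic = math.log(psi - 1) + EULER_GAMMA  # approximation of H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def build_tree(samples, depth, max_depth, rng):
    """Recursively split feature vectors (lists of floats) with random cuts."""
    first = samples[0]
    # Stop: height limit reached, one sample left, or all samples identical.
    if depth >= max_depth or len(samples) == 1 or all(s == first for s in samples):
        return {"size": len(samples)}
    # Assumption: choose only among features that still vary, so a split exists.
    varying = [j for j in range(len(first))
               if min(s[j] for s in samples) < max(s[j] for s in samples)]
    j = rng.choice(varying)
    lo = min(s[j] for s in samples)
    hi = max(s[j] for s in samples)
    split = rng.uniform(lo, hi)
    left = [s for s in samples if s[j] < split]
    right = [s for s in samples if s[j] >= split]  # ties go right (assumption)
    if not left or not right:  # degenerate draw, e.g. split == lo
        return {"size": len(samples)}
    return {"feature": j, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x):
    """Number of edges from the root to the leaf that x falls into."""
    depth = 0
    while "size" not in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
        depth += 1
    return depth

def build_forest(samples, n_trees, psi, rng):
    """One forest per known class: each tree is grown on psi sub-sampled vectors."""
    max_depth = math.ceil(math.log2(psi))  # assumed height limit
    return [build_tree(rng.sample(samples, min(psi, len(samples))), 0, max_depth, rng)
            for _ in range(n_trees)]

def distance_feature(forest, x, psi):
    """d_i(x) = l_i(x) - C_i(psi), with l_i(x) taken as the mean path length."""
    mean_len = sum(path_length(t, x) for t in forest) / len(forest)
    return mean_len - c(psi)
```

Calling `build_forest` once per known class and then `distance_feature` on each training vector of that class would yield the distance values that step (4) fits with a Weibull distribution.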
Further, the probability prediction specifically includes: calculating the probability that the sample under test belongs to each class, and calculating the rejection probability indirectly from these probabilities; for a sample x under test, the probability score of belonging to the distribution of a known class i is calculated from the Weibull distribution as follows:
Figure BDA0003566865230000043
where the path vector L(x) is the difference between the path length of the sample x under test and the average path length of each known class; the path vector is transformed using the above formula so that it can be applied to the class division calibrated by the extreme value model, and the probability score of each class is calculated as follows:
Figure BDA0003566865230000044
after the probability that the test sample belongs to each known class is obtained, the rejection probability of the unknown class is calculated; using the inverse Weibull distribution 1-W_i, the probability score that the sample x under test belongs to the unknown class is obtained as follows:
Figure BDA0003566865230000045
where N+1 denotes the unknown class;
based on the N+1 probability values obtained, the sample under test is identified as the class with the maximum probability value; the open space risk is further limited by setting a threshold ε: if the calculated maximum probability value P_max < ε, the sample is rejected as unknown.
It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the open set identification based DNS tunnel detection method.
Another object of the present invention is to provide an information data processing terminal for implementing the open set identification-based DNS tunnel detection method.
Another object of the present invention is to provide an open set identification based DNS tunnel detection system for implementing the open set identification based DNS tunnel detection method, the open set identification based DNS tunnel detection system including:
the data preprocessing module, used for parsing the captured DNS data packets and converting characters into integer values;
the feature extraction module, used for extracting features from the DNS query subdomain using a neural network, representing the DNS query domain name as a feature vector using a CNN (convolutional neural network), an LSTM (long short-term memory network) or different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the training classifier module, used for dividing the data space into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors, calculating the average path length of each class as the class center, and calibrating the classification boundaries with extreme value theory;
the output prediction probability module, used for calculating, based on the similarity between the test sample and each known class, the probability that the test sample belongs to each class, including the unknown class, with a probability estimation method; and for assigning the test sample the class label with the maximum probability, or identifying it as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
In combination with the above technical solutions and the technical problems to be solved, the advantages and positive effects of the claimed technical solutions are analyzed from the following aspects:
First, with respect to the technical problems in the prior art and the difficulty of solving them, the technical solution of the invention is closely tied to the results and data obtained during research and development, and solving these problems brings inventive technical effects. Specifically: the invention applies open set identification to DNS tunnel detection and is expected to solve the problem of identifying covert tunnels of unknown classes, open set identification being a solution for classification over open classes in the real world. Because DNS data resembles text data, semantic features can be extracted with deep learning by borrowing some methods from natural language processing, which avoids the limitations of manual feature extraction. A detection approach that combines deep learning feature extraction with an open set identification algorithm is therefore of great significance for DNS covert channel detection.
Second, considering the technical solution as a whole or from a product perspective, its technical effects and advantages are as follows:
In view of the characteristics of DNS data, a new open set identification algorithm based on a compact abating probability model is proposed in combination with an extreme forest design. It applies a segmentation transformation to the DNS feature vector space extracted by the neural network in order to separate the known space from the unknown space, and thereby detects DNS tunnels of unknown classes in real time and with high accuracy.
Third, as supplementary evidence of the inventiveness of the claims, the following important aspects are also noted:
(1) The technical solution of the invention fills a technical gap in the industry at home and abroad:
Because a detection method for identifying unknown DNS tunnels is still lacking, the invention introduces open set identification into the field of DNS tunnel detection, defines the boundary between the known and unknown data spaces based on the proposed open set identification model, and fills the gap in detecting DNS tunnels of unknown types.
(2) The technical solution of the invention solves a technical problem that people have long been eager to solve without success:
For the massive and complex DNS data in current network environments, and in view of the real-time detection requirement of online intrusion detection systems, the invention provides an effective solution for detecting DNS tunnels of unknown types.
Drawings
Fig. 1 is a flowchart of a DNS tunnel detection method based on open set identification according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a DNS tunnel detection system based on open set identification according to an embodiment of the present invention;
Fig. 3 is a flowchart of an implementation of a DNS tunnel detection method based on open set identification according to an embodiment of the present invention;
In the figures: 1. data preprocessing module; 2. feature extraction module; 3. training classifier module; 4. output prediction probability module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
First, an embodiment is explained. This section is an illustrative example developed to explain the claims in order to enable those skilled in the art to fully understand how to implement the present invention.
As shown in fig. 1, the DNS tunnel detection method based on open set identification according to the embodiment of the present invention includes the following steps:
S101: parsing the captured DNS data packets and converting characters into integer values;
S102: extracting features from the DNS query subdomain using a neural network, representing the DNS query domain name as a feature vector using a CNN (convolutional neural network), an LSTM (long short-term memory network) or different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
S103: dividing the data space into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors, calculating the average path length of each class as the class center, and calibrating the classification boundaries with extreme value theory;
S104: calculating, based on the similarity between the test sample and each known class, the probability that the test sample belongs to each class, including the unknown class, with a probability estimation method;
S105: assigning the test sample the class label with the maximum probability, or identifying it as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
As shown in fig. 2, the open set identification based DNS tunnel detection system according to the embodiment of the present invention includes:
the data preprocessing module 1, used for parsing the captured DNS data packets and converting characters into integer values;
the feature extraction module 2, used for extracting features from the DNS query subdomain using a neural network, representing the DNS query domain name as a feature vector using a CNN (convolutional neural network), an LSTM (long short-term memory network) or different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the training classifier module 3, used for dividing the data space into bounded regions by constructing a partition forest for each known class from the input DNS query domain name feature vectors, calculating the average path length of each class as the class center, and calibrating the classification boundaries with extreme value theory;
the output prediction probability module 4, used for calculating, based on the similarity between the test sample and each known class, the probability that the test sample belongs to each class, including the unknown class, with a probability estimation method; and for assigning the test sample the class label with the maximum probability, or identifying it as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
As shown in fig. 3, the method for detecting a DNS tunnel based on open set identification according to the embodiment of the present invention specifically includes the following steps:
Step one: data preprocessing. First, the captured DNS data packets are parsed, and only the DNS query domain name field is extracted and retained. Because the scheme uses a neural network for automatic feature extraction in order to avoid the drawbacks of manual feature extraction, the characters must be converted into integer values to meet the input requirements of the neural network model. An analysis of the DNS query domain name encoding specification shows that the characters include "a-z", "0-9", "-", etc., 68 in total. Two lookup tables are created: one mapping characters to numbers and the other mapping numbers to characters, so that after processing each character has a corresponding integer value. After numerical encoding, the sequence is padded at the tail with '0', and data exceeding the input length is truncated.
Step two: feature extraction. To obtain a feature vector representation of the DNS query domain name, a neural network extracts features from the DNS query subdomain. The representation is produced by a CNN, an LSTM or different combinations of the two. After the network training is completed, the activation vector of the layer preceding the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the SoftMax classification layer is selected as the feature vector, so the space of the penultimate layer of the trained network serves as the feature space.
Step three: classifier training. An extreme forest adapted to open set identification applies a segmentation transformation to the obtained feature vector space, so that each resulting class boundary contains a minimum distance boundary. This divides the data space into bounded regions: a partition forest is constructed for each known class from the input DNS query domain name feature vectors, the average path length of each class is calculated as the class center, and the classification boundaries are calibrated with extreme value theory.
Step four: prediction probability output. In the prediction phase, based on the similarity between the test sample and each known class, a probability estimation method is used to calculate the probability that the test sample belongs to each class, including the unknown class. Finally, the test sample is assigned the class label with the maximum probability, or identified as the unknown class when its probabilities of belonging to the known classes are all less than a threshold ε.
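The character encoding described in step one can be sketched as follows. The text states that 68 characters are used but does not list them, so the `ALPHABET` below is only an illustrative subset; the input length of 64, the lower-casing, and the dropping of characters outside the alphabet are also assumptions of this sketch, and the function names are hypothetical.

```python
import string

# Illustrative subset only: the source names "a-z", "0-9", "-" and a total of 68
# characters without listing the rest.
ALPHABET = string.ascii_lowercase + string.digits + "-._"
PAD = 0  # index 0 is reserved for tail padding

# The two lookup tables: characters to integers and integers back to characters.
char_to_int = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
int_to_char = {i: ch for ch, i in char_to_int.items()}

def encode(domain, max_len=64):
    """Map a query name to a fixed-length list of integers."""
    # Characters outside the assumed alphabet are dropped (an assumption).
    ids = [char_to_int[ch] for ch in domain.lower() if ch in char_to_int]
    ids = ids[:max_len]                         # truncate over-long input
    return ids + [PAD] * (max_len - len(ids))   # pad the tail with 0

def decode(ids):
    """Invert encode(), ignoring padding."""
    return "".join(int_to_char[i] for i in ids if i != PAD)
```

Reserving 0 for padding keeps padded positions distinguishable from every real character, which is why the table starts at 1.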
In the embodiment of the present invention, the classifier training of step three is described in detail as follows:
(1) For all known-class data, each class is sub-sampled. A feature is randomly selected as the initial node and a value is randomly selected within the value range of that feature, dividing the ψ samples into two parts: samples smaller than the value go to the left branch, and samples larger than the value go to the right branch. This binary partitioning is repeated in the left and right branches until one of the following conditions is satisfied:
a. the tree has reached the height limit;
b. only one sample remains on a node;
c. all samples on the node have identical features.
The partitioning process is then repeated a number of times determined by the amount of sample data, and the resulting partition trees form a binary tree forest.
(2) The average path length of each class is calculated. For ψ samples, it is equal to the average search length of an unsuccessful search in a binary search tree:
Figure BDA0003566865230000091
(3) The path length vector l(x) of the training samples in each class is calculated, together with its distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), which gives the distance feature set L(x).
(4) The classification boundary of each class is determined using extreme value theory; Equation (2) gives the probability density of the Weibull distribution. An extreme value model describes the distribution of abnormally high or abnormally low data and can model right-skewed, left-skewed or symmetric data. Because the training set contains no auxiliary identification information about unknown classes, the distribution of path lengths is long-tailed, so modeling the path length vector with a measure of central tendency alone is inappropriate. The known class boundaries are instead delimited by fitting d(x) for each class separately with a Weibull distribution.
Figure BDA0003566865230000101
Wherein
Figure BDA0003566865230000102
β, λ and κ are the location, scale and shape parameters, respectively.
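The formula images for equation (2) are not reproduced in this text. As a point of reference only, the standard three-parameter Weibull density and its cumulative distribution with location β, scale λ and shape κ are given below; whether the patent's equation (2) is written in exactly this form is an assumption.

```latex
f(x;\beta,\lambda,\kappa) = \frac{\kappa}{\lambda}
  \left(\frac{x-\beta}{\lambda}\right)^{\kappa-1}
  \exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right],
\qquad
W(x) = 1 - \exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right],
\qquad x \ge \beta
```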
In an embodiment of the present invention, the probability prediction is described in detail as follows:
The objective of the prediction phase is to obtain the probability that a sample belongs to each class, and the main task is to calculate the rejection probability of an unknown sample. Because open set identification lacks prior knowledge of the unknown class, the rejection probability is difficult to calculate directly. Instead, the probability that the sample to be detected belongs to each known category is calculated, and the rejection probability is derived indirectly from these probabilities.
For a sample x to be detected, the probability score of belonging to a known class i is calculated from the Weibull distribution as shown in formula (3):
Figure BDA0003566865230000103
The path vector l(x) is the difference between the average path length of the sample x to be measured and that of each known class. It is transformed using formula (3) so that it can be applied to the class division calibrated by the extreme value model, and the probability score of each class is calculated by formula (4):
Figure BDA0003566865230000104
After the probability that the test sample belongs to each known class is obtained, the rejection probability of the unknown class is calculated through the inverse Weibull distribution 1 - W_i, so the probability score that the sample x to be detected belongs to the unknown category is:
Figure BDA0003566865230000105
where N + 1 denotes the unknown class.
Finally, the sample to be detected is assigned to the category with the maximum of the N + 1 probability values. The open space risk is further limited by setting a threshold ε: if the calculated maximum probability value R_max < ε, the sample is rejected as unknown.
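The final decision rule can be sketched as follows. The sketch takes the N known-class probabilities and the unknown-class probability as given inputs, since the formulas that produce them are not reproduced in this text; the function name and the convention of returning index N for the unknown class are assumptions.

```python
def classify(known_probs, unknown_prob, epsilon):
    """Return the index of the predicted class; index N denotes the unknown class.

    known_probs  -- probabilities for the N known classes (index = class id)
    unknown_prob -- probability for the unknown class (class N + 1 in the text)
    epsilon      -- rejection threshold limiting open space risk
    """
    probs = list(known_probs) + [unknown_prob]
    unknown_index = len(known_probs)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Reject as unknown when even the largest probability falls below epsilon.
    if probs[best] < epsilon:
        return unknown_index
    return best
```

With two known classes, `classify([0.4, 0.35], 0.25, 0.5)` rejects the sample as unknown because the largest probability is below ε.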
Second, the application embodiment. To demonstrate the inventiveness and technical value of the technical scheme of the invention, this part gives an application example of the claimed technical scheme on specific products or related technologies.
The DNS tunnel detection method based on open set identification provided by the embodiment of the invention comprises four parts: data preprocessing, deep feature extraction, open set identification model construction, and probability evaluation. First, the collected DNS data are preprocessed and input into a neural network for deep feature extraction. Because the weights of the untrained network are only initialized, it cannot yet produce an effective feature vector representation of domain name data. The feature extraction network and the SoftMax layer are therefore first trained in closed set identification mode on a multi-class task over the known classes, which completes the training of the feature extraction backbone weights.
After the feature extraction network is trained, the output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and input into the open set identification algorithm. The open set recognition model is then trained on the extracted feature vectors to complete the boundary division between the known space and the open space. A partition forest is constructed for each class from the input DNS query domain name feature vectors, the average path length of each category is calculated as the category center, and the classification boundary is calibrated using extreme value theory to determine the boundary between the known categories and the open space.
Finally, according to the probabilities output by the recognition algorithm, the sample is judged to be a normal request or an attack class among the known classes, or an unknown class.
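The preprocessing part (character-to-integer mapping with two lookup tables, zero-padding at the tail, and truncation to the input length, as set out in claim 4) can be sketched as follows. The alphabet, the extra index for unexpected characters, and the default input length of 64 are assumptions for illustration only.

```python
import string

# Assumed character set; the actual table derives from the DNS domain name specification.
ALPHABET = string.ascii_letters + string.digits + "-._"
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved for padding
INT_TO_CHAR = {i: ch for ch, i in CHAR_TO_INT.items()}
UNKNOWN = len(ALPHABET) + 1  # assumed index for characters outside the table

def encode_domain(name, max_len=64):
    """Map each character to its integer code, truncate, then pad the tail with 0."""
    codes = [CHAR_TO_INT.get(ch, UNKNOWN) for ch in name[:max_len]]
    return codes + [0] * (max_len - len(codes))
```

The resulting fixed-length integer sequence is what an embedding layer of the feature extraction network would consume.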
It should be noted that embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, for example code provided on a carrier medium such as a diskette, CD-ROM or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The apparatus of the present invention and its modules may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of hardware circuits and software, e.g., firmware.
Third, evidence of the effects of the embodiment. The embodiment of the invention showed positive effects during research, development and use, and has clear advantages over the prior art; these are described below with data and charts from the test process.
Existing deep-learning-based DNS tunnel detection schemes do not judge unknown categories and therefore cannot be used directly as reference models for experimental comparison. An LSTM-CNN neural network structure is trained for feature extraction with the following hyperparameter settings: the output vector dimension of the Embedding layer is 128, the LSTM layer has 128 units, and the window size of the convolutional layer is 3. A SoftMax layer is added at the end of each of the four network structures to complete the learning of the network weights; the learning rate is 0.001, the optimization strategy is AdamOptimizer, batch_size is 100, and training runs for 20 epochs in total. The output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and input into each open set identification algorithm for DNS tunnel identification. Data sets with different openness are constructed, and the overall recognition accuracy of the existing open set identification algorithms OpenMax and W-SVM is compared with that of the method of the invention.
Figure BDA0003566865230000121
Experiments show that, even when the openness of the data set is large, the method of the invention has a clear advantage in overall recognition accuracy over the other two methods.
The above description is only a specific embodiment of the present invention and is not intended to limit its scope of protection; all modifications, equivalents and improvements made within the spirit and principles of the invention are intended to fall within the scope of the appended claims.

Claims (10)

1. A DNS tunnel detection method based on open set identification is characterized in that after DNS data are preprocessed, a deep learning network is used for representing feature vectors of the data, an open set identification algorithm is used for carrying out segmentation transformation on an obtained feature vector space so as to determine the boundary between a known space and an unknown space, and finally, an identification result is output through probability evaluation.
2. The open-set identification based DNS tunnel detection method according to claim 1, wherein said method further comprises: analyzing the captured DNS data packet; performing feature extraction on DNS query sub-domain names by using a neural network; dividing boundaries of a data space, respectively constructing and dividing forests for input DNS query domain name feature vectors, and calibrating classification boundaries by using an extreme value theory; calculating the probability that the test sample belongs to all classes including the unknown class by using a probability estimation method; and identifying the test sample into a classification label according to the maximum probability, and identifying the test sample into an unknown class when the probabilities that the test sample belongs to the known class are all less than a threshold value epsilon.
3. The open set identification based DNS tunnel detection method of claim 2, wherein the open set identification based DNS tunnel detection method comprises the steps of:
firstly, analyzing the captured DNS data packet, and converting characters into integer numerical values;
secondly, extracting features of DNS inquiry sub-domain names by using a neural network, expressing the feature vectors of the DNS inquiry domain names by adopting CNN, LSTM and different combination modes, carrying out network training, and taking the space where the trained network is located at the penultimate layer as a feature space;
thirdly, completing boundary division of a data space, respectively constructing and dividing forests for input DNS query domain name feature vectors, calculating the average path length of each category as a category center, and calibrating a classification boundary by using an extreme value theory;
a fourth step of calculating probabilities that the test sample belongs to all classes including the unknown class using a probability estimation method based on a similarity between the test sample and each of the known classes;
and fifthly, identifying the test sample into a classification label according to the maximum probability, and identifying the sample as an unknown class when the probabilities of the test sample belonging to the known classes are all smaller than a threshold epsilon.
4. The open-set identification based DNS tunnel detection method of claim 3, wherein the first step comprises: extracting a domain name field queried by a DNS (domain name system) for reservation, and creating two comparison tables by analyzing the DNS domain name code specification: one for mapping characters to numbers and the other for mapping numbers to characters, each character having a corresponding integer value after processing; carrying out padding filling after numerical value coding, and completing the tail by '0', and if data exceeds the input length, carrying out truncation;
and after the network training in the second step is finished, extracting an activation vector of a previous layer of the classification layer as a feature vector of a current sample, namely selecting the input of a SoftMax classification layer in the network as the feature vector.
5. The open-set identification based DNS tunnel detection method of claim 3, wherein the third step comprises: and performing segmentation transformation on the obtained feature vector space by using the extreme forest applied to open set identification, so that each divided class boundary comprises a minimum distance boundary.
6. The open-set identification-based DNS tunnel detection method of claim 4, wherein training the classifier in the third step comprises:
(1) according to all known category data, performing sub-sampling on each category, randomly selecting a feature as an initial node, randomly selecting a value in a value range of the feature, dividing psi samples into two parts, dividing samples smaller than the value into left branches, and dividing samples larger than the value into right branches; this binary partitioning is repeated in the left and right branches until the following condition is satisfied: the tree has reached a limited height; there is only one sample on a node; all the characteristics of the samples on the nodes are the same; iterating and repeating the partitioning process according to the sample data capacity, and forming a binary tree forest by the generated partitioning trees;
(2) calculating the average path length of each category for a sample number ψ, which is equal to the average search length of an unsuccessful search in a binary ordered tree:
Figure FDA0003566865220000021
(3) calculating the path length vector l(x) of the training samples in each class, and calculating the distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain a distance feature set L(x);
(4) determining the classification boundary of each category by using extreme value theory, wherein an extreme value model describes the distribution of abnormally high or low data and models right-skewed, left-skewed or symmetric data; d(x) is fitted separately using a Weibull distribution to demarcate the known class boundaries:
Figure FDA0003566865220000031
wherein
Figure FDA0003566865220000032
β, λ and κ are the location, scale and shape parameters, respectively.
7. The open-set identification-based DNS tunnel detection method of claim 3, wherein the probabilistic prediction specifically comprises: respectively calculating the probability that the sample to be detected belongs to each category, and indirectly calculating the rejection probability from these probabilities; for a sample x to be detected, the probability score of belonging to a known class i is calculated through the Weibull distribution by the formula:
Figure FDA0003566865220000033
the path vector L(x) is the difference between the average path length of the sample x to be measured and that of each known class; the path vector is transformed using the formula so as to be applied to the class division calibrated by the extreme value model, and the probability score of each class is calculated by the formula:
Figure FDA0003566865220000034
after the probability that the test sample belongs to each known class is obtained, the rejection probability of the unknown class is calculated through the inverse Weibull distribution 1 - W_i, and the probability score that the sample x to be detected belongs to the unknown category is:
Figure FDA0003566865220000035
wherein N + 1 represents the unknown class;
identifying the sample to be detected as the category of the maximum probability value based on the obtained N + 1 probability values; the open space risk is further limited by setting a threshold ε: if the calculated maximum probability value P_max < ε, the sample is rejected as unknown.
8. A computer arrangement, characterized in that the computer arrangement comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the open set identification based DNS tunnel detection method according to any one of claims 1 to 7.
9. An information data processing terminal, characterized in that the information data processing terminal is used for implementing the open set identification based DNS tunnel detection method of any one of claims 1 to 7.
10. An open set identification based DNS tunnel detection system implementing the open set identification based DNS tunnel detection method according to any one of claims 1 to 7, wherein the open set identification based DNS tunnel detection system comprises:
the data preprocessing module is used for analyzing the captured DNS data packet and converting characters into integer numerical values;
the feature extraction module is used for extracting features of the DNS query sub-domain name by using a neural network, representing the feature vector of the DNS query domain name by adopting CNN (convolutional neural network), LSTM (long short-term memory) and different combinations thereof, performing network training, and taking the space of the penultimate layer of the trained network as the feature space;
the training classifier module is used for completing boundary division of a data space, respectively constructing and dividing forests for input DNS query domain name feature vectors, calculating the average path length of each category as a category center, and calibrating a classification boundary by using an extreme value theory;
the output prediction probability module is used for calculating the probability that the test sample belongs to all the classes including the unknown class by using a probability estimation method based on the similarity between the test sample and each known class; and identifying the test samples into classification labels according to the maximum probability, and identifying the samples as unknown classes when the probabilities that the test samples belong to the known classes are all smaller than a threshold epsilon.
CN202210308273.7A 2022-03-28 2022-03-28 DNS tunnel detection method, system, equipment and terminal based on open set identification Active CN114793170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308273.7A CN114793170B (en) 2022-03-28 2022-03-28 DNS tunnel detection method, system, equipment and terminal based on open set identification

Publications (2)

Publication Number Publication Date
CN114793170A true CN114793170A (en) 2022-07-26
CN114793170B CN114793170B (en) 2024-03-19

Family

ID=82462291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308273.7A Active CN114793170B (en) 2022-03-28 2022-03-28 DNS tunnel detection method, system, equipment and terminal based on open set identification

Country Status (1)

Country Link
CN (1) CN114793170B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120198549A1 (en) * 2011-02-01 2012-08-02 Manos Antonakakis Method and system for detecting malicious domain names at an upper dns hierarchy
US20190147336A1 (en) * 2017-11-10 2019-05-16 Fujitsu Limited Method and apparatus of open set recognition and a computer readable storage medium
US10778702B1 (en) * 2017-05-12 2020-09-15 Anomali, Inc. Predictive modeling of domain names using web-linking characteristics
CN111898038A (en) * 2020-07-04 2020-11-06 西北工业大学 Social media false news detection method based on man-machine cooperation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115643087A (en) * 2022-10-24 2023-01-24 天津大学 DNS tunnel detection method based on fusion of coding characteristics and statistical behavior characteristics
CN115643087B (en) * 2022-10-24 2024-04-30 天津大学 DNS tunnel detection method based on fusion of coding features and statistical behavior features

Also Published As

Publication number Publication date
CN114793170B (en) 2024-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant