CN114793170B - DNS tunnel detection method, system, equipment and terminal based on open set identification - Google Patents

DNS tunnel detection method, system, equipment and terminal based on open set identification

Publication number
CN114793170B
Authority
CN
China
Legal status
Active
Application number
CN202210308273.7A
Other languages
Chinese (zh)
Other versions
CN114793170A (en
Inventor
付玉龙
焦小彬
刘璐璐
Current Assignee
Xidian University
Original Assignee
Xidian University

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Abstract

The invention belongs to the technical field of network security and discloses a DNS tunnel detection method, system, equipment and terminal based on open set identification. The method parses captured DNS data packets; extracts features from the DNS query subdomain using a neural network; delimits the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class and calibrating the classification boundaries using extreme value theory; calculates, using a probability estimation method, the probability that a test sample belongs to each category, including the unknown category; and assigns the test sample the classification label with the maximum probability, identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε. By applying open set recognition, an approach to the open-category classification problems encountered in the real world, to DNS tunnel detection, the invention is expected to solve the difficult problem of identifying covert tunnels of unknown classes.

Description

DNS tunnel detection method, system, equipment and terminal based on open set identification
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a DNS tunnel detection method, system, equipment and terminal based on open set identification.
Background
The DNS protocol is a critical piece of Internet infrastructure. It is mainly used to convert host names into IP addresses so that other network applications can run smoothly, and DNS services have become an indispensable part of the Internet. Because the protocol lacks confidentiality and integrity protection for its data, it has become a preferred covert communication channel for attackers, who use DNS tunnels to bypass security control policies in order to transmit remote control commands or steal sensitive data, posing a serious threat to network security.
Existing DNS covert channel detection usually relies on feature engineering based on expert knowledge: features related to communication behavior and to word-frequency analysis of the query records in DNS data are extracted, and machine learning algorithms are then used for classification. Before feature extraction, the captured DNS messages must be reassembled into data streams so that more auxiliary identification information can be extracted.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Excessive auxiliary identification information is used during data acquisition, which degrades the performance of an online system;
(2) Post-hoc detection that reassembles data messages into data streams cannot meet an intrusion detection system's requirement to respond promptly to attack threats, which can lead to serious consequences such as data leakage;
(3) Manual feature extraction has inherent limitations: because features are derived from a known knowledge set, the false alarm rate is high;
(4) In online scenarios, existing schemes cannot accurately identify variant DNS tunnels, and the classification algorithm needs to be improved to handle unknown data.
Disclosure of Invention
In view of the problems existing in the prior art, the invention provides a DNS tunnel detection method and system based on open set identification.
The invention is realized as follows: the DNS tunnel detection method based on open set identification preprocesses the DNS data, uses a deep learning network to represent the data as feature vectors, uses an open set recognition algorithm to partition and transform the resulting feature vector space so as to delimit the boundary between the known space and the unknown space, and finally outputs the identification result through probability evaluation.
Further, the method parses the captured DNS data packets; extracts features from the DNS query subdomain using a neural network; delimits the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class and calibrating the classification boundaries using extreme value theory; calculates, using a probability estimation method, the probability that the test sample belongs to each category, including the unknown category; and assigns the test sample the classification label with the maximum probability, identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
Further, the DNS tunnel detection method based on open set identification comprises the following steps:
the first step, data preprocessing: parsing the captured DNS data packets and converting the characters into integer values;
the second step, feature vector representation: extracting features from the DNS query subdomain using a neural network, representing the DNS query subdomain as feature vectors using a CNN, an LSTM, and different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the third step, construction of the open set recognition classification model: delimiting the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class, calculating the average path length of each category as the category center, and calibrating the category boundaries using extreme value theory;
the fourth step, probability evaluation: based on the similarity between the test sample and each known class, calculating the probability that the test sample belongs to each class, including the unknown class, using a probability estimation method;
the fifth step, output of the identification result: assigning the test sample the classification label with the maximum probability, and identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
Further, the first step includes: extracting and retaining the DNS query domain name field; creating two lookup tables based on an analysis of the DNS query domain name encoding specification, one mapping characters to numbers and the other mapping numbers to characters, so that each character has a corresponding integer value after processing; and, after numerical encoding, padding the end of each sequence with 0 and truncating any data that exceeds the input length.
Further, after the network training in the second step is completed, the activation vector of the layer before the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the SoftMax classification layer in the network is selected as the feature vector.
Further, the third step includes: partitioning and transforming the obtained feature vector space using a limit forest applied to open set recognition, so that each partitioned category boundary contains a minimum distance boundary.
Further, training the classifier in the third step comprises:
(1) Based on all known-class data, each class is sub-sampled to obtain ψ samples. A feature is randomly selected as the initial node, and a value is randomly selected within the value range of that feature; the ψ samples are divided into two parts, with samples smaller than the value assigned to the left branch and samples larger than the value assigned to the right branch. This binary division is repeated in the left and right branches until any of the following conditions is satisfied: the tree reaches the height limit; only one sample remains at the node; or all samples at the node have identical feature values. The division process is repeated iteratively according to the sample data capacity, and the generated partition trees form a binary tree forest;
(2) Calculating the average path length of each category; given the number of samples ψ, the average path length equals the average search length of an unsuccessful search in a binary sort tree:
(3) Calculating the path length vector $l(x)$ of the training samples in each class, and calculating the distance from the average path length, $d_i(x) = l_i(x) - C_i(\psi)$, to obtain the distance feature set $L(x)$;
(4) Determining the classification boundary of each category using extreme value theory, wherein an extreme value model describes the distribution of abnormally high or abnormally low values and can model right-skewed, left-skewed, or symmetric data; $D(x)$ is fitted separately using Weibull distributions to delimit the known-class boundaries:
where β, λ and κ are the location, scale and shape parameters, respectively.
Further, the probability prediction specifically includes: calculating the probability that the sample under test belongs to each category, and indirectly calculating the rejection probability from these probabilities. For a sample x under test, the probability score of belonging to the distribution of a known category i is calculated through the Weibull distribution:
The path vector $L(x)$ is the difference between the sample x under test and the average path length of each known class; combined with the preceding formula, the path vector is transformed so that it can be applied to the class partition calibrated by the extreme value model, and the probability score of each class is calculated as:
After the probabilities that the test sample belongs to the known classes are obtained, the rejection probability of the unknown class is calculated through the inverse Weibull distribution $1 - W_i$. The probability score of the sample x under test belonging to the unknown class is obtained as:
where n+1 denotes the unknown class;
Based on the obtained n+1 probability values, the sample under test is identified as the category with the maximum probability value. The open space risk is further limited by setting a threshold ε: if the calculated maximum probability value satisfies $P_{\max} < \varepsilon$, the sample is rejected as an unknown class.
It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the DNS tunnel detection method based on open set identification.
Another object of the present invention is to provide an information data processing terminal, which is configured to implement the DNS tunnel detection method based on open set identification.
Another object of the present invention is to provide an open set identification-based DNS tunnel detection system implementing the open set identification-based DNS tunnel detection method, the open set identification-based DNS tunnel detection system including:
the data preprocessing module, used for parsing the captured DNS data packets and converting the characters into integer values;
the feature extraction module, used for extracting features from the DNS query subdomain using a neural network, representing the DNS query subdomain as feature vectors using a CNN, an LSTM, and different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the training classifier module, used for delimiting the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class, calculating the average path length of each category as the category center, and calibrating the classification boundaries using extreme value theory;
the output prediction probability module, used for calculating, based on the similarity between the test sample and each known category, the probability that the test sample belongs to each category, including the unknown category, using a probability estimation method; and for assigning the test sample the classification label with the maximum probability, identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
In combination with the above technical solutions and the technical problems to be solved, the advantages and positive effects of the claimed technical solution are analyzed from the following aspects:
First, with regard to the technical problems in the prior art and the difficulty of solving them, the technical problems solved by the technical solution of the invention are analyzed in detail, in close combination with the claimed technical solution and the results and data obtained during research and development; solving these problems brings inventive technical effects. Specifically: the invention applies open set recognition, an approach to the open-category classification problems encountered in the real world, to DNS tunnel detection, and is expected to solve the difficult problem of identifying covert tunnels of unknown classes. DNS data can be treated as text-like data, so deep learning techniques borrowed from natural language processing can be used to extract semantic features from it, which avoids the limitations of manual feature extraction. Combining deep-learning feature extraction with an open set recognition algorithm is therefore of great significance for detecting existing DNS covert channels.
Second, viewed as a whole or from the product perspective, the claimed technical solution has the following technical effects and advantages:
In view of the characteristics of DNS data, the invention proposes a novel open set recognition algorithm, based on the compact abating probability model and designed around a limit forest, that partitions and transforms the DNS feature vector space extracted by the neural network so as to separate the known space from the unknown space and to detect unknown-class DNS tunnels in real time with high accuracy.
Third, as supplementary evidence of the inventiveness of the claims of the present invention, the following important aspects are also presented:
(1) The technical solution of the invention fills a technical gap in the industry at home and abroad:
Because detection methods for identifying unknown DNS tunnels are still lacking, the invention introduces open set recognition into the field of DNS tunnel detection and, based on the proposed open set recognition model, further delimits the boundary between the known and unknown data spaces, filling the gap in detecting unknown DNS tunnels.
(2) The technical solution of the invention solves a technical problem that people have long wanted to solve but have never succeeded in solving:
For the massive and complex DNS data in current network environments, and given that an online intrusion detection system requires real-time detection, the invention provides an effective solution for detecting DNS tunnels of unknown types.
Drawings
Fig. 1 is a flowchart of a DNS tunnel detection method based on open set identification provided in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a DNS tunnel detection system based on open set identification according to an embodiment of the present invention;
Fig. 3 is a flowchart of an implementation of a DNS tunnel detection method based on open set identification according to an embodiment of the present invention;
In the figures: 1. data preprocessing module; 2. feature extraction module; 3. training classifier module; 4. output prediction probability module.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
1. Explanation of the embodiments. To enable those skilled in the art to fully understand how the invention may be implemented, this section presents illustrative embodiments of the claimed technical solution.
As shown in Fig. 1, the DNS tunnel detection method based on open set identification provided by the embodiment of the present invention includes the following steps:
S101: parsing the captured DNS data packets and converting the characters into integer values;
S102: extracting features from the DNS query subdomain using a neural network, representing the DNS query subdomain as feature vectors using a convolutional neural network (CNN), a long short-term memory (LSTM) network, and different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
S103: delimiting the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class, calculating the average path length of each category as the category center, and calibrating the category boundaries using extreme value theory;
S104: based on the similarity between the test sample and each known class, calculating the probability that the test sample belongs to each class, including the unknown class, using a probability estimation method;
S105: assigning the test sample the classification label with the maximum probability, and identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
As shown in fig. 2, the DNS tunnel detection system based on open set identification provided in the embodiment of the present invention includes:
the data preprocessing module 1, used for parsing the captured DNS data packets and converting the characters into integer values;
the feature extraction module 2, used for extracting features from the DNS query subdomain using a neural network, representing the DNS query domain name as feature vectors using a CNN, an LSTM, and different combinations of the two, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
the training classifier module 3, used for delimiting the data space by constructing a partition forest for the input DNS query domain name feature vectors of each class, calculating the average path length of each category as the category center, and calibrating the classification boundaries using extreme value theory;
the output prediction probability module 4, used for calculating, based on the similarity between the test sample and each known category, the probability that the test sample belongs to each category, including the unknown category, using a probability estimation method; and for assigning the test sample the classification label with the maximum probability, identifying the sample as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
As shown in Fig. 3, the DNS tunnel detection method based on open set identification provided by the embodiment of the present invention specifically includes the following steps:
Step one: data preprocessing. First, the captured DNS data packets are parsed, and only the DNS query domain name field is extracted and retained. Because the scheme uses a neural network for automatic feature extraction, avoiding the drawbacks of manual feature extraction, the characters must be converted into integer values to meet the input requirements of the neural network model. Analysis of the DNS query domain name encoding specification shows that the characters comprise 68 in total, including "a-z", "0-9", "-", etc. Two lookup tables are created: one mapping characters to numbers and the other mapping numbers to characters, so that each character has a corresponding integer value after processing. After numerical encoding, padding is applied: the end of each sequence is padded with 0, and data exceeding the input length is truncated.
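The encoding described in step one can be sketched as follows. This is only an illustration under stated assumptions: the patent mentions 68 characters but does not list them all, so the alphabet below (lowercase letters, digits, hyphen and dot) is a smaller placeholder, and `MAX_LEN`, `encode_domain` and `decode_domain` are hypothetical names.

```python
import string

# Assumed placeholder alphabet; the patent's full 68-character set is not listed.
ALPHABET = string.ascii_lowercase + string.digits + "-."

# Index 0 is reserved for padding, so real characters start at 1.
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
INT_TO_CHAR = {i: ch for ch, i in CHAR_TO_INT.items()}

MAX_LEN = 64  # hypothetical fixed input length of the network


def encode_domain(name, max_len=MAX_LEN):
    """Map characters to integers, then truncate or zero-pad to max_len."""
    # Characters outside the assumed alphabet are silently skipped here.
    codes = [CHAR_TO_INT[ch] for ch in name.lower() if ch in CHAR_TO_INT]
    codes = codes[:max_len]                      # cut input that is too long
    return codes + [0] * (max_len - len(codes))  # pad the end with zeros


def decode_domain(codes):
    """Inverse mapping from integers back to characters; padding is dropped."""
    return "".join(INT_TO_CHAR[c] for c in codes if c != 0)
```

Reserving 0 for padding keeps padded positions distinguishable from real characters, which is one common design choice rather than something the text prescribes.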
Step two: feature extraction. To obtain a feature vector representation of the DNS query domain name, a neural network is used to extract features from the DNS query subdomain, mainly using a CNN, an LSTM, and different combinations of the two. After network training is completed, the activation vector of the layer before the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the SoftMax classification layer in the network is selected as the feature vector, so the space of the penultimate layer of the trained network serves as the feature space.
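The idea of reusing the penultimate activations as features can be illustrated with a toy forward pass. This sketch does not reproduce the CNN or LSTM architectures named in the text: it substitutes a mean-pooled embedding followed by one hidden layer, uses random untrained weights, and assumes the classification layer is a dense layer followed by softmax. All sizes and names are hypothetical.

```python
import math
import random

random.seed(0)

VOCAB, EMB, HID, N_CLASSES = 39, 8, 16, 3  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


E = rand_matrix(VOCAB, EMB)       # embedding table (stand-in for CNN/LSTM layers)
W1 = rand_matrix(HID, EMB)        # penultimate (hidden) layer weights
W2 = rand_matrix(N_CLASSES, HID)  # classification layer weights


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(codes):
    """Return (features, class_probabilities) for one encoded domain name."""
    tokens = [c for c in codes if c != 0] or [0]
    # Mean-pool the embeddings of the non-padding tokens.
    pooled = [sum(E[t][j] for t in tokens) / len(tokens) for j in range(EMB)]
    # Penultimate activations: this is the vector kept as the feature vector.
    features = [math.tanh(z) for z in matvec(W1, pooled)]
    logits = matvec(W2, features)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return features, [e / total for e in exps]
```

In a trained network, only `features` would be passed on to the open set recognition stage; the softmax output is needed during closed-set training only.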
Step three: classifier training. The obtained feature vector space is partitioned and transformed using a limit forest applied to open set recognition, so that each partitioned category boundary contains a minimum distance boundary, which completes the delimitation of the data space. A partition forest is constructed for the input DNS query domain name feature vectors of each class, the average path length of each category is calculated as the category center, and the category boundaries are calibrated using extreme value theory.
Step four: prediction probability output. In the prediction phase, based on the similarity between the test sample and each known class, a probability estimation method is used to calculate the probability that the test sample belongs to each class, including the unknown class. Finally, the test sample is assigned the classification label with the maximum probability, and it is identified as an unknown class when its probabilities of belonging to the known classes are all smaller than a threshold ε.
In an embodiment of the present invention, the classifier training of step three is described in detail as follows:
(1) Based on all known-class data, each class is sub-sampled to obtain ψ samples. A feature is randomly selected as the initial node, and a value is randomly selected within the value range of that feature; the ψ samples are divided into two parts, with samples smaller than the value assigned to the left branch and samples larger than the value assigned to the right branch. This binary division is repeated in the left and right branches until any of the following conditions is satisfied:
a. the tree reaches the height limit;
b. only one sample remains at the node;
c. all samples at the node have identical feature values.
The division process is then repeated iteratively according to the sample data capacity, and the generated partition trees form a binary tree forest.
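The tree construction in (1) closely resembles an isolation tree, and can be sketched as below. Several details are assumptions, not taken from the text: the height limit is set to ceil(log2 ψ), splits are drawn only from features that still vary at the node, samples equal to the split value go to the right branch, and all function names are hypothetical.

```python
import math
import random


def build_tree(samples, height_limit, depth=0):
    """Recursively split samples (lists of floats) on a random feature and value."""
    if depth >= height_limit or len(samples) <= 1:
        return {"size": len(samples)}            # stop conditions a and b
    n_features = len(samples[0])
    varying = [j for j in range(n_features)
               if min(s[j] for s in samples) < max(s[j] for s in samples)]
    if not varying:                              # stop condition c
        return {"size": len(samples)}
    j = random.choice(varying)
    lo = min(s[j] for s in samples)
    hi = max(s[j] for s in samples)
    split = random.uniform(lo, hi)
    left = [s for s in samples if s[j] < split]
    right = [s for s in samples if s[j] >= split]
    return {"feature": j, "split": split,
            "left": build_tree(left, height_limit, depth + 1),
            "right": build_tree(right, height_limit, depth + 1)}


def path_length(x, tree):
    """Number of edges x traverses before reaching a leaf."""
    depth, node = 0, tree
    while "feature" in node:
        node = node["left"] if x[node["feature"]] < node["split"] else node["right"]
        depth += 1
    return depth


def build_forest(samples, n_trees=100, psi=256):
    """Build n_trees trees, each on a fresh sub-sample of psi samples."""
    psi = min(psi, len(samples))
    limit = math.ceil(math.log2(max(psi, 2)))    # assumed height limit
    return [build_tree(random.sample(samples, psi), limit) for _ in range(n_trees)]


def mean_path_length(x, forest):
    return sum(path_length(x, t) for t in forest) / len(forest)
```

Under the method described here, one such forest would be built per known class from that class's feature vectors; points far from a class tend to be separated after fewer splits and therefore have shorter paths.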
(2) The average path length of each category is calculated; given the number of samples ψ, the average path length equals the average search length of an unsuccessful search in a binary sort tree:
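The formula itself does not appear in this text. For reference, the expression commonly used in the isolation forest literature for the average path length of an unsuccessful search in a binary search tree built from ψ samples is given below; treating it as the intended $C_i(\psi)$ is an assumption.

$$
C(\psi) =
\begin{cases}
2H(\psi - 1) - \dfrac{2(\psi - 1)}{\psi}, & \psi > 2 \\
1, & \psi = 2 \\
0, & \text{otherwise}
\end{cases}
\qquad H(k) \approx \ln k + 0.5772156649
$$

Here $H(k)$ is the $k$-th harmonic number, approximated using the Euler-Mascheroni constant.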
(3) The path length vector $l(x)$ of the training samples in each class is calculated, and the distance from the average path length, $d_i(x) = l_i(x) - C_i(\psi)$, is calculated to obtain the distance feature set $L(x)$.
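A minimal sketch of step (3), assuming that the average path length $C(\psi)$ takes the harmonic-number form commonly used for isolation forests; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(psi):
    """Average path length of an unsuccessful BST search over psi samples."""
    if psi > 2:
        harmonic = math.log(psi - 1) + EULER_GAMMA   # H(psi - 1) approximation
        return 2.0 * harmonic - 2.0 * (psi - 1) / psi
    return 1.0 if psi == 2 else 0.0


def distance_features(path_lengths, psi):
    """d_i(x) = l_i(x) - C_i(psi) for each training sample of one class."""
    c = average_path_length(psi)
    return [l - c for l in path_lengths]
```

Calling `distance_features` once per known class, with that class's path lengths, would give the per-class distance sets that are later fitted with a Weibull distribution.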
(4) The classification boundary of each category is determined using extreme value theory; formula (2) describes the probability density of the Weibull distribution. Extreme value models describe the distribution of abnormally high or abnormally low values and can model right-skewed, left-skewed, or symmetric data. Because the training set contains no auxiliary identification information about unknown classes and the distribution of data path lengths is long-tailed, modeling the path length vector with a measure of central tendency alone is clearly unsuitable. The known-class boundaries can instead be delimited by fitting $D(x)$ separately using Weibull distributions.
where β, λ and κ are the location, scale and shape parameters, respectively.
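For reference, the standard three-parameter Weibull density and cumulative distribution function, with location β, scale λ > 0 and shape κ > 0, are written out below. Whether formula (2) of the patent uses exactly this parameterization is an assumption.

$$
f(x;\beta,\lambda,\kappa) = \frac{\kappa}{\lambda}\left(\frac{x-\beta}{\lambda}\right)^{\kappa-1}
\exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right], \qquad x \ge \beta
$$

$$
F(x;\beta,\lambda,\kappa) = 1 - \exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right], \qquad x \ge \beta
$$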
In an embodiment of the present invention, probability prediction is described in detail as follows:
The goal of the prediction stage is to obtain the probability that a sample belongs to each class; the main task is to calculate the rejection probability of an unknown sample. In open set identification, however, prior knowledge of the unknown class is lacking, so the rejection probability is difficult to calculate directly. Instead, the probability that the sample under test belongs to each category is calculated first, and the rejection probability is then derived indirectly from these probabilities.
For a sample x under test, the probability score of belonging to the distribution of a known class i is computed from the Weibull distribution as shown in formula (3):
The path vector L(x) is the difference between the path length of the sample x under test and the average path length of each known class. It is transformed using formula (3) so that it can be applied to the class division calibrated by the extreme value model, and the probability score of each class is calculated as in formula (4):
After the probabilities that the test sample belongs to the known classes are obtained, the rejection probability of the unknown class is calculated through the inverse Weibull distribution 1 - W_i. The probability score of the sample x under test belonging to the unknown class is obtained as follows:
where n+1 denotes the unknown class.
Finally, based on the n+1 probability values obtained, the sample under test is assigned to the category with the maximum probability value. The open space risk is further limited by setting a threshold ε: if the computed maximum probability value R_max < ε, the sample is rejected as an unknown class.
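The final decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `classify_open_set` is hypothetical, the n+1 probability values are taken as given (formulas (3) and (4) are not reproduced here), and the unknown class is assumed to be stored last.

```python
from typing import Optional, Sequence


def classify_open_set(probs: Sequence[float], epsilon: float) -> Optional[int]:
    """Return the 0-based index of the winning known class, or None for 'unknown'.

    probs holds n+1 scores: indices 0..n-1 are the known classes and
    index n is the unknown class (class n+1 in the text).
    """
    if len(probs) < 2:
        raise ValueError("need at least one known class plus the unknown class")
    unknown_index = len(probs) - 1
    # Pick the category with the maximum probability value.
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Reject when the unknown class wins or the best score is below epsilon.
    if best == unknown_index or probs[best] < epsilon:
        return None
    return best
```

For example, with two known classes and ε = 0.5, the scores `[0.7, 0.2, 0.1]` select class 0, while `[0.4, 0.35, 0.25]` are rejected because the maximum falls below the threshold.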
2. Application example. To demonstrate the inventiveness and technical value of the technical solution of the present invention, this section gives an application example of the claimed technical solution on specific products or related technologies.
The DNS tunnel detection method based on open set identification provided by the embodiment of the invention divides the whole process into four parts: data preprocessing, deep feature extraction, open set identification model construction, and probability evaluation. First, the acquired DNS data is preprocessed and then input into a neural network for deep feature extraction. Because the weights of the untrained network have only been initialized, the network cannot yet represent domain name data with effective feature vectors. A closed set identification approach is therefore adopted first: a SoftMax layer is added to the feature extraction network and a multi-class classification task over the known classes is trained, which completes the training of the weights of the feature extraction backbone network.
After the feature extraction network has been trained, the output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and input into the open set recognition algorithm. Once the feature vectors have been extracted, an open set recognition model is trained to delimit the boundary between the known space and the unknown space. A partition forest is constructed for the input DNS query domain name feature vectors of each category, the average path length of each category is calculated as the category center, and the classification boundary is calibrated using extreme value theory to determine the dividing limit between the known categories and the open space.
Finally, according to the probabilities output by the recognition algorithm, the sample is judged to be a normal request or an attack category among the known categories, or an unknown category.
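The average path length that serves as the category center can be illustrated with the normalization term of the standard isolation forest algorithm, which is likewise defined as the average length of an unsuccessful search in a binary search tree. The sketch below is a minimal illustration under that assumption; the function names `average_path_length` and `distance_features` are hypothetical, and the patent's own formula may differ in detail. The distance follows the relation d_i(x) = l_i(x) - C_i(ψ) stated in claim 1.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(psi: int) -> float:
    """c(psi): average path length of an unsuccessful BST search over psi samples."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    # H(psi - 1) is approximated by ln(psi - 1) + Euler's constant.
    harmonic = math.log(psi - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi


def distance_features(path_lengths, psi):
    """d_i(x) = l_i(x) - C_i(psi) for each path length measured in one class."""
    center = average_path_length(psi)
    return [length - center for length in path_lengths]
```

With a sub-sample size of ψ = 256, `average_path_length(256)` is about 10.24, so a sample whose path length is well below that value sits far from the class center.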
It should be noted that the embodiments of the present invention can be implemented in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, for example provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, in a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of such hardware circuitry and software, such as firmware.
3. Evidence of the effect of the embodiments. The embodiment of the invention shows significant advantages during research and development or use, as described below with reference to data, charts and the like from the test process.
Existing DNS tunnel detection schemes based on deep learning do not identify unknown categories and therefore cannot be used directly as reference models for experimental comparison. The LSTM-CNN neural network structure is trained for feature extraction with the following hyperparameter settings: the output vector dimension of the Embedding layer is 128, the LSTM layer in the structure uses 128 units, and the window size of the convolution layer is set to 3. A SoftMax layer is then added to each of the four network structures to complete the learning of the network weights; the learning rate is set to 0.001, AdamOptimizer is chosen as the optimization strategy with a setting of 100, and training runs for 20 epochs in total. The output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and input into each open set recognition algorithm for DNS tunnel identification. Data sets with different degrees of openness are constructed, and the existing open set recognition algorithms OpenMax and W-SVM are compared with the present method in terms of overall recognition accuracy.
Experiments show that the method of the invention has significant advantages over the other two methods, even when the openness of the data set is higher.
The foregoing describes only specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto; any modifications, equivalent substitutions, and improvements made by those skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (5)

1. A DNS tunnel detection method based on open set identification, characterized in that, after DNS data is preprocessed, a deep learning network is used to represent the data as feature vectors, an open set recognition algorithm is used to divide and transform the resulting feature vector space so as to define the boundary between the known space and the unknown space, and finally a recognition result is output through probability evaluation;
the method further comprises: parsing the captured DNS data packets; performing feature extraction on the DNS query subdomain name using a neural network; dividing the boundary of the data space by constructing a partition forest for the input DNS query domain name feature vectors of each category and calibrating the classification boundary using extreme value theory; calculating the probability that the test sample belongs to each category, including the unknown category, using a probability estimation method; and assigning the test sample a classification label according to the maximum probability, the sample being identified as an unknown class when the probabilities of the test sample belonging to the known classes are all smaller than a threshold ε;
the DNS tunnel detection method based on open set identification comprises the following steps:
a first step of parsing the captured DNS data packets and converting characters into integer values;
a second step of performing feature extraction on the DNS query subdomain name using a neural network, representing the DNS query subdomain name as feature vectors using CNN, LSTM, and different combinations thereof, training the network, and taking the space of the penultimate layer of the trained network as the feature space;
a third step of completing the boundary division of the data space, constructing a partition forest for the input DNS query domain name feature vectors of each category, calculating the average path length of each category as the category center, and calibrating the category boundary using extreme value theory;
a fourth step of calculating the probabilities that the test sample belongs to all categories, including the unknown category, using a probability estimation method based on the similarity between the test sample and each known category;
a fifth step of assigning the test sample a classification label according to the maximum probability, and identifying the sample as an unknown class when the probabilities of the test sample belonging to the known classes are all smaller than a threshold ε;
the third step comprises: dividing and transforming the obtained feature vector space by using a limit forest applied to open set identification, so that each divided category boundary comprises a minimum distance boundary;
training the classifier in the third step comprises:
(1) from all the known-class data, sub-sampling each class; randomly selecting a feature as the initial node and randomly selecting a value within the value range of that feature, so that the ψ samples are divided into two parts, with samples smaller than the value assigned to the left branch and samples larger than the value assigned to the right branch; repeating this binary division in the left and right branches until one of the following conditions is satisfied: the tree reaches the height limit; only one sample remains on the node; or all features of the samples on the node are identical; repeating the dividing process iteratively according to the sample data capacity, the generated partition trees forming a binary tree forest;
(2) calculating the average path length of each category, where, for ψ samples, the average path length equals the average search length of an unsuccessful search in a binary search tree:
(3) calculating the path length vector l(x) of the training samples in each category, and calculating the distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain a distance feature set L(x);
(4) determining the classification boundary of each category using extreme value theory, wherein the extreme value model describes the distribution of abnormally high or abnormally low values and can model right-skewed, left-skewed, or symmetric data; D(x) is fitted separately using Weibull distributions to delimit the known class boundaries:
where β, λ and κ are the location, scale and shape parameters, respectively;
the probability prediction specifically comprises: calculating the probability that the sample under test belongs to each category, and calculating the rejection probability indirectly from these probabilities; for a sample x under test, the probability score of belonging to the distribution of a known category i is calculated from the Weibull distribution by the formula:
the path vector L(x) is the difference between the path length of the sample x under test and the average path length of each known class; the path vector is transformed using the formula so that it can be applied to the class division calibrated by the extreme value model, and the probability score of each class is calculated by the formula:
after the probabilities that the test sample belongs to the known classes are obtained, the rejection probability of the unknown class is calculated through the inverse Weibull distribution 1 - W_i, and the probability score of the sample x under test belonging to the unknown class is obtained as follows:
where n+1 denotes the unknown class;
based on the n+1 probability values obtained, the sample under test is assigned to the category with the maximum probability value; the open space risk is further limited by setting a threshold ε: if the computed maximum probability value P_max < ε, the sample is rejected as an unknown class.
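Step (1) of the classifier training in claim 1 can be sketched as follows. This is a minimal, illustrative sketch rather than the patented implementation: the names `build_tree`, `build_forest` and `path_length`, the height limit of ceil(log2(ψ)) (the usual isolation forest default), sending ties to the right branch, and splitting only on features that still vary are all assumptions.

```python
import math
import random


def build_tree(samples, height_limit, rng, depth=0):
    """Recursively split samples (lists of floats) into a binary partition tree."""
    # Stop when the height limit is reached, at most one sample remains,
    # or all samples on the node are identical.
    if (depth >= height_limit or len(samples) <= 1
            or all(s == samples[0] for s in samples)):
        return {"size": len(samples)}
    # Assumption: choose only among features that still vary on this node.
    varying = [f for f in range(len(samples[0]))
               if len({s[f] for s in samples}) > 1]
    feature = rng.choice(varying)
    values = [s[feature] for s in samples]
    split = rng.uniform(min(values), max(values))
    # Smaller values go left; larger (and, by assumption, equal) values go right.
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, height_limit, rng, depth + 1),
        "right": build_tree(right, height_limit, rng, depth + 1),
    }


def build_forest(samples, n_trees, psi, seed=0):
    """Build n_trees trees, each from a random sub-sample of size psi."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    psi = min(psi, len(samples))
    height_limit = math.ceil(math.log2(psi))  # assumed default height limit
    return [build_tree(rng.sample(samples, psi), height_limit, rng)
            for _ in range(n_trees)]


def path_length(tree, x):
    """Count the edges traversed by x from the root to its leaf."""
    depth = 0
    while "feature" in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
        depth += 1
    return depth
```

In this sketch one forest would be built per known class from that class's feature vectors, and the path lengths of a sample across the trees of a forest would be averaged before the distance to the class center is taken.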
2. The DNS tunnel detection method based on open set identification according to claim 1, wherein the first step comprises: extracting and retaining the DNS query domain name field, and creating two lookup tables based on an analysis of the DNS query domain name encoding specification, one mapping characters to numbers and the other mapping numbers to characters, so that each character has a corresponding integer value after processing; after numerical encoding, padding with 0 at the tail, and truncating if the data exceeds the input length;
and, after the network training of the second step is completed, extracting the activation vector of the layer preceding the classification layer as the feature vector representation of the current sample, that is, selecting the input of the SoftMax classification layer in the network as the feature vector.
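The preprocessing described in claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the alphabet (lowercase letters, digits, hyphen, underscore and dot), the names `CHAR_TO_INT`, `INT_TO_CHAR`, `encode_domain` and `decode_domain`, and the choice to drop characters outside the alphabet are hypothetical and not taken from the patent.

```python
import string
from typing import List

# Hypothetical alphabet; the integer 0 is reserved for padding.
ALPHABET = string.ascii_lowercase + string.digits + "-_."
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
INT_TO_CHAR = {i: ch for ch, i in CHAR_TO_INT.items()}


def encode_domain(domain: str, max_len: int) -> List[int]:
    """Map characters to integers, truncate to max_len, pad the tail with 0."""
    # Assumption: characters outside the alphabet are skipped.
    codes = [CHAR_TO_INT[ch] for ch in domain.lower() if ch in CHAR_TO_INT]
    codes = codes[:max_len]
    return codes + [0] * (max_len - len(codes))


def decode_domain(codes: List[int]) -> str:
    """Map integers back to characters, ignoring padding zeros."""
    return "".join(INT_TO_CHAR[c] for c in codes if c != 0)
```

Reserving 0 for padding keeps the padded positions distinguishable from real characters when the sequence is fed to an embedding layer.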
3. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the open set identification-based DNS tunnel detection method according to any one of claims 1 to 2.
4. An information data processing terminal, characterized in that the information data processing terminal is configured to implement the open set identification-based DNS tunnel detection method according to any one of claims 1 to 2.
5. An open set identification-based DNS tunnel detection system for implementing the open set identification-based DNS tunnel detection method according to any one of claims 1 to 2, wherein the open set identification-based DNS tunnel detection system includes:
a data preprocessing module, configured to parse the captured DNS data packets and convert characters into integer values;
a feature extraction module, configured to perform feature extraction on the DNS query subdomain name using a neural network, represent the DNS query subdomain name as feature vectors using CNN, LSTM, and different combinations thereof, train the network, and take the space of the penultimate layer of the trained network as the feature space;
a training classifier module, configured to complete the boundary division of the data space, construct a partition forest for the input DNS query domain name feature vectors of each category, calculate the average path length of each category as the category center, and calibrate the classification boundary using extreme value theory;
an output prediction probability module, configured to calculate the probabilities that the test sample belongs to all categories, including the unknown category, using a probability estimation method based on the similarity between the test sample and each known category; and to assign the test sample a classification label according to the maximum probability, identifying the sample as an unknown class when the probabilities of the test sample belonging to the known classes are all smaller than a threshold ε.
CN202210308273.7A 2022-03-28 2022-03-28 DNS tunnel detection method, system, equipment and terminal based on open set identification Active CN114793170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308273.7A CN114793170B (en) 2022-03-28 2022-03-28 DNS tunnel detection method, system, equipment and terminal based on open set identification

Publications (2)

Publication Number Publication Date
CN114793170A CN114793170A (en) 2022-07-26
CN114793170B true CN114793170B (en) 2024-03-19

Family

ID=82462291

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10778702B1 (en) * 2017-05-12 2020-09-15 Anomali, Inc. Predictive modeling of domain names using web-linking characteristics
CN111898038A (en) * 2020-07-04 2020-11-06 西北工业大学 Social media false news detection method based on man-machine cooperation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8631489B2 (en) * 2011-02-01 2014-01-14 Damballa, Inc. Method and system for detecting malicious domain names at an upper DNS hierarchy
CN109784325A (en) * 2017-11-10 2019-05-21 富士通株式会社 Opener recognition methods and equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant