CN114793170A

CN114793170A - DNS tunnel detection method, system, equipment and terminal based on open set identification

Info

Publication number: CN114793170A
Application number: CN202210308273.7A
Authority: CN
Inventors: 付玉龙; 焦小彬; 刘璐璐
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-07-26
Anticipated expiration: 2042-03-28
Also published as: CN114793170B

Abstract

The invention belongs to the technical field of network security and discloses a DNS tunnel detection method, system, device and terminal based on open set recognition. Captured DNS packets are parsed; a neural network extracts features from the DNS query subdomain; the data space is partitioned by building a partition forest for each class of input DNS query domain name feature vectors, and the classification boundaries are calibrated using extreme value theory; a probability estimation method computes the probability that a test sample belongs to each class, including the unknown class; the test sample is assigned the label with the highest probability, and is identified as unknown when its probabilities of belonging to the known classes are all below a threshold ε. By applying open set recognition to DNS tunnel detection, the invention is expected to solve the problem of identifying covert tunnels of unknown classes.

Description

DNS tunnel detection method, system, device and terminal based on open set recognition

Technical Field

The invention belongs to the technical field of network security, and in particular relates to a DNS tunnel detection method, system, device and terminal based on open set recognition.

Background

The DNS protocol is among the most important pieces of Internet infrastructure. Its main function is to translate host names into IP addresses so that other network applications can run smoothly, and the DNS service has become an indispensable part of the Internet. Because the protocol lacks confidentiality and integrity protection, it has become attackers' preferred covert communication channel: DNS tunnels are used to bypass security control policies, transmit remote control commands or steal sensitive data, which poses a serious threat to network security.

Existing DNS covert channel detection usually relies on feature engineering based on expert knowledge. Features related to the communication behavior of DNS traffic and to word-frequency analysis of query records are extracted, and machine learning algorithms are then used for classification. Before feature extraction, the captured DNS packets must be reassembled into data flows so that more auxiliary identification information can be extracted.

From the above analysis, the problems and defects of the prior art are:

(1) Too much auxiliary identification information is used during data collection, which degrades the performance of an online system;

(2) After-the-fact detection that reassembles packets into data flows cannot meet an intrusion detection system's need to respond to attacks in time, which can lead to serious consequences such as data leakage;

(3) Manual feature extraction is limited: features are derived only from the known knowledge set, and the false alarm rate is high;

(4) In online scenarios, existing schemes cannot accurately identify variant DNS tunnels; handling data of unknown classes requires improving the classification algorithm.

Summary of the Invention

In view of the problems in the prior art, the present invention provides a DNS tunnel detection scheme and system based on open set recognition.

The invention is implemented as follows. In a DNS tunnel detection method based on open set recognition, DNS data is preprocessed, a deep learning network represents the data as feature vectors, an open set recognition algorithm partitions and transforms the resulting feature vector space to make the boundary between known space and unknown space explicit, and the recognition result is finally output through probability evaluation.

Further, the captured DNS packets are parsed; a neural network extracts features from the DNS query subdomain; the data space is partitioned by building a partition forest for each class of input DNS query domain name feature vectors, and the classification boundaries are calibrated using extreme value theory; a probability estimation method computes the probability that a test sample belongs to each class, including the unknown class; the test sample is assigned the label with the highest probability, and is identified as unknown when its probabilities of belonging to the known classes are all below the threshold ε.

Further, the DNS tunnel detection method based on open set recognition includes the following steps:

Step 1, data preprocessing: parse the captured DNS packets and convert characters into integer values;

Step 2, feature vector representation: use a neural network to extract features from the DNS query subdomain, representing DNS query domain names as feature vectors with CNN, LSTM and different combinations of the two; train the network and take the space of the penultimate layer of the trained network as the feature space;

Step 3, construction of the open set recognition classification model: partition the data space by building a partition forest for each class of input DNS query domain name feature vectors, compute the average path length of each class as the class center, and calibrate the classification boundaries using extreme value theory;

Step 4, probability evaluation: based on the similarity between the test sample and each known class, use a probability estimation method to compute the probability that the test sample belongs to each class, including the unknown class;

Step 5, output of the recognition result: assign the test sample the label with the highest probability; when its probabilities of belonging to the known classes are all below the threshold ε, identify the sample as unknown.

Further, Step 1 includes: extracting and retaining the DNS query domain name field; based on an analysis of the DNS query domain name encoding rules, creating two lookup tables, one mapping characters to numbers and the other mapping numbers back to characters, so that after processing every character has a corresponding integer value; after numerical encoding, padding the sequence at the end with "0", and truncating it if it exceeds the input length.

Further, after the network training in Step 2 is complete, the activation vector of the layer preceding the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the network's SoftMax classification layer is taken as the feature vector.

Further, Step 3 includes: using an extreme forest adapted to open set recognition to partition and transform the resulting feature vector space, so that each resulting class boundary contains a minimum distance boundary.

Further, the classifier training in Step 3 is as follows:

(1) Using all known-class data, subsample each class. Randomly select a feature as the starting node and randomly select a value within that feature's range, splitting the ψ samples into two parts: samples below the value go to the left branch and samples above it go to the right branch. Repeat this binary split in the left and right branches until one of the following conditions is met: the tree reaches the height limit; the node holds only one sample; all samples at the node have identical features. Repeat the partitioning process iteratively according to the amount of sample data, and combine the resulting partition trees into a forest of binary trees;

(2) Compute the average path length of each class. For ψ samples, the average path length equals the average search length of an unsuccessful search in a binary search tree:

(3) Compute the path length vector l(x) of the training samples in each class and its distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain the distance feature set L(x);

(4) Use extreme value theory to determine the classification boundary of each class. An extreme value model describes the distribution of abnormally high or abnormally low values and can model right-skewed, left-skewed or symmetric data. Fit a Weibull distribution to D(x) for each class to delimit the known-class boundaries:

where β, λ and κ are the location, scale and shape parameters, respectively.

Further, probability prediction specifically includes: computing the probability that the sample under test belongs to each class, and deriving the rejection probability indirectly from these probabilities. For a sample x under test, the probability score of belonging to the distribution of a known class i is computed from the Weibull distribution by the formula:

The path vector L(x) is the difference between the sample x under test and the average path length of each known class. Combined with the formula, the path vector is transformed so that it can be applied to the class partition calibrated by the extreme value model; the probability score of each class is computed by the formula:

After the probabilities that the test sample belongs to the known classes are obtained, the rejection probability of the unknown class is obtained from the inverse Weibull distribution 1 - W_i; the probability score of the sample x belonging to the unknown class is:

where N+1 denotes the unknown class;

Based on the resulting N+1 probability values, the sample under test is assigned to the class with the largest probability. The open space risk is further limited by setting a threshold ε: if the largest probability value P_max < ε, the sample is rejected as unknown.

Another object of the present invention is to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the DNS tunnel detection method based on open set recognition.

Another object of the present invention is to provide an information data processing terminal for implementing the DNS tunnel detection method based on open set recognition.

Another object of the present invention is to provide a DNS tunnel detection system based on open set recognition that implements the above method, the system comprising:

a data preprocessing module, which parses the captured DNS packets and converts characters into integer values;

a feature extraction module, which uses a neural network to extract features from the DNS query subdomain, represents DNS query domain names as feature vectors with CNN, LSTM and different combinations of the two, trains the network, and takes the space of the penultimate layer of the trained network as the feature space;

a classifier training module, which partitions the data space by building a partition forest for each class of input DNS query domain name feature vectors, computes the average path length of each class as the class center, and calibrates the classification boundaries using extreme value theory;

a prediction probability output module, which, based on the similarity between the test sample and each known class, uses a probability estimation method to compute the probability that the test sample belongs to each class, including the unknown class; it assigns the test sample the label with the highest probability and identifies the sample as unknown when its probabilities of belonging to the known classes are all below the threshold ε.

In view of the above technical solutions and the technical problems they solve, the advantages and positive effects of the claimed technical solution are as follows:

First, with respect to the technical problems of the prior art described above: the invention applies open set recognition to DNS tunnel detection and is expected to solve the problem of identifying covert tunnels of unknown classes, open set recognition being an approach to the open-category classification encountered in the real world. Because DNS data is text-like, some methods from natural language processing can be transferred to it, and deep learning can be used to extract semantic features from DNS data, which avoids the limitations of manual feature extraction. Combining deep learning feature extraction with an open set recognition algorithm is therefore of great significance for detecting DNS covert channels.

Second, considering the technical solution as a whole or from a product perspective, its technical effects and advantages are as follows:

Addressing the characteristics of DNS data, the invention proposes a new open set recognition algorithm based on the compact abating probability model and an extreme forest design. The algorithm partitions and transforms the DNS feature vector space extracted by the neural network in order to separate known space from unknown space, with the aim of detecting DNS tunnels of unknown classes in real time and with high accuracy.

Third, as supporting evidence of the inventive step of the claims, the following aspects are also relevant:

(1) The technical solution of the present invention fills a technical gap in the industry at home and abroad:

Detection methods for identifying unknown DNS tunnels are still lacking. The invention introduces open set recognition into the field of DNS tunnel detection and, based on the proposed open set recognition model, makes the boundary between known data space and unknown data space explicit, filling the gap in detecting DNS tunnels of unknown classes.

(2) The technical solution of the present invention solves a technical problem that people have long wanted to solve without success:

For the massive and complex DNS data found in live network environments, and given the real-time detection requirement of online intrusion detection systems, the invention provides an effective solution for detecting DNS tunnels of unknown classes.

Description of Drawings

FIG. 1 is a flowchart of the DNS tunnel detection method based on open set recognition provided by an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the DNS tunnel detection system based on open set recognition provided by an embodiment of the present invention;

FIG. 3 is an implementation flowchart of the DNS tunnel detection method based on open set recognition provided by an embodiment of the present invention;

In the figures: 1. data preprocessing module; 2. feature extraction module; 3. classifier training module; 4. prediction probability output module.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

I. Explanatory embodiment. So that those skilled in the art can fully understand how the invention is implemented, this section elaborates on the technical solution of the claims.

As shown in FIG. 1, the DNS tunnel detection method based on open set recognition provided by an embodiment of the present invention includes the following steps:

S101: parse the captured DNS packets and convert characters into integer values;

S102: use a neural network to extract features from the DNS query subdomain, representing DNS query domain names as feature vectors with a convolutional neural network (CNN), a long short-term memory (LSTM) network and different combinations of the two; train the network and take the space of the penultimate layer of the trained network as the feature space;

S103: partition the data space by building a partition forest for each class of input DNS query domain name feature vectors, compute the average path length of each class as the class center, and calibrate the classification boundaries using extreme value theory;

S104: based on the similarity between the test sample and each known class, use a probability estimation method to compute the probability that the test sample belongs to each class, including the unknown class;

S105: assign the test sample the label with the highest probability; when its probabilities of belonging to the known classes are all below the threshold ε, identify the sample as unknown.

As shown in FIG. 2, the DNS tunnel detection system based on open set recognition provided by an embodiment of the present invention includes:

a data preprocessing module 1, which parses the captured DNS packets and converts characters into integer values;

a feature extraction module 2, which uses a neural network to extract features from the DNS query subdomain, represents DNS query domain names as feature vectors with CNN, LSTM and different combinations of the two, trains the network, and takes the space of the penultimate layer of the trained network as the feature space;

a classifier training module 3, which partitions the data space by building a partition forest for each class of input DNS query domain name feature vectors, computes the average path length of each class as the class center, and calibrates the classification boundaries using extreme value theory;

a prediction probability output module 4, which, based on the similarity between the test sample and each known class, uses a probability estimation method to compute the probability that the test sample belongs to each class, including the unknown class; it assigns the test sample the label with the highest probability and identifies the sample as unknown when its probabilities of belonging to the known classes are all below the threshold ε.

As shown in FIG. 3, the DNS tunnel detection method based on open set recognition provided by an embodiment of the present invention specifically includes the following steps:

Step 1: Data preprocessing. The captured DNS packets are first parsed, and only the DNS query domain name field is extracted and retained. Because the scheme uses a neural network for automatic feature extraction, avoiding the drawbacks of manual feature extraction, characters must be converted into integer values to meet the input requirements of the neural network model. Based on an analysis of the DNS query domain name encoding rules, the character set comprises 68 characters in total, including "A-z", "0-9", "-" and ".". Two lookup tables are created, one mapping characters to numbers and the other mapping numbers back to characters, so that after processing every character has a corresponding integer value. After numerical encoding the sequence is padded at the end with "0"; if it exceeds the input length, it is truncated.
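The encoding described in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the exact 68-character table is not listed in the text, so the `ALPHABET` below (letters, digits, "-", "." and "_"), the fixed length `MAX_LEN`, the choice to reserve 0 for padding, and the function names are all assumptions.

```python
import string

# Hypothetical alphabet: the text states 68 characters but names only
# letters, digits, "-" and "."; the extra "_" here is an assumption.
ALPHABET = string.ascii_letters + string.digits + "-._"
MAX_LEN = 64  # assumed fixed input length of the network

# Index 0 is reserved for padding, so real characters start at 1.
char_to_int = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
int_to_char = {i: ch for ch, i in char_to_int.items()}


def encode_domain(name, max_len=MAX_LEN):
    """Map a query name to a fixed-length list of integers."""
    # Characters missing from the table are skipped (a choice of this sketch).
    codes = [char_to_int[ch] for ch in name if ch in char_to_int]
    codes = codes[:max_len]                      # truncate over-long names
    return codes + [0] * (max_len - len(codes))  # pad the end with 0


def decode_domain(codes):
    """Inverse mapping: drop padding and turn integers back into characters."""
    return "".join(int_to_char[c] for c in codes if c != 0)
```

Reserving 0 for padding keeps padded positions distinguishable from every real character, which is what allows `decode_domain` to invert the encoding.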

Step 2: Feature extraction. To obtain a feature vector representation of the DNS query domain name, a neural network extracts features from the DNS query subdomain. The design mainly uses CNN, LSTM and different combinations of the two to represent DNS query domain names as feature vectors. After network training is complete, the activation vector of the layer preceding the classification layer is extracted as the feature vector representation of the current sample; that is, the input of the network's SoftMax classification layer is taken as the feature vector, so the space of the penultimate layer of the trained network is the feature space.

Step 3: Classifier training. An extreme forest adapted to open set recognition partitions and transforms the resulting feature vector space so that each resulting class boundary contains a minimum distance boundary, completing the partition of the data space. A partition forest is built for each class of input DNS query domain name feature vectors, the average path length of each class is computed as the class center, and the classification boundaries are calibrated using extreme value theory.

Step 4: Prediction probability output. In the prediction stage, based on the similarity between the test sample and each known class, a probability estimation method computes the probability that the test sample belongs to each class, including the unknown class. Finally, the test sample is assigned the label with the highest probability; when its probabilities of belonging to the known classes are all below the threshold ε, the sample is identified as unknown.

In an embodiment of the present invention, the classifier training of Step 3 is described in detail as follows:

(1) Using all known-class data, subsample each class. Randomly select a feature as the starting node and randomly select a value within that feature's range, splitting the ψ samples into two parts: samples below the value go to the left branch and samples above it go to the right branch. Repeat this binary split in the left and right branches until one of the following conditions is met:

a. the tree has reached the height limit;

b. the node holds only one sample;

c. all samples at the node have identical features.

The partitioning process is then repeated iteratively according to the amount of sample data, and the resulting partition trees are combined into a forest of binary trees.
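The partitioning procedure above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patent's implementation: the dictionary-based tree layout, the function names, the default `n_trees` and `psi`, and the height limit of ceil(log2(ψ)) (the usual isolation forest convention) are all assumed, and samples equal to the split value are sent to the right branch.

```python
import math
import random


def build_tree(samples, depth, max_depth, rng):
    """Recursively split `samples` (lists of floats) at random feature values."""
    # Stop when the height limit is hit, at most one sample remains,
    # or every sample at the node is identical.
    if (depth >= max_depth or len(samples) <= 1
            or all(s == samples[0] for s in samples)):
        return {"size": len(samples)}
    # Only features that still vary at this node can split it.
    candidates = [j for j in range(len(samples[0]))
                  if min(s[j] for s in samples) < max(s[j] for s in samples)]
    feat = rng.choice(candidates)
    lo = min(s[feat] for s in samples)
    hi = max(s[feat] for s in samples)
    split = rng.uniform(lo, hi)
    left = [s for s in samples if s[feat] < split]
    right = [s for s in samples if s[feat] >= split]
    return {"feat": feat, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def build_forest(samples, n_trees=100, psi=256, seed=0):
    """Build one forest for a single known class from psi-sized subsamples."""
    rng = random.Random(seed)
    psi = min(psi, len(samples))
    max_depth = max(1, math.ceil(math.log2(psi)))  # assumed height limit
    return [build_tree(rng.sample(samples, psi), 0, max_depth, rng)
            for _ in range(n_trees)]


def path_length(x, tree):
    """Number of edges traversed by x from the root to a leaf."""
    depth = 0
    while "feat" in tree:
        tree = tree["left"] if x[tree["feat"]] < tree["split"] else tree["right"]
        depth += 1
    return depth
```

Calling `build_forest` once per known class and evaluating `path_length` over every tree of a forest yields one path length value per tree for a given sample.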

(2) Compute the average path length of each class. Assuming ψ samples, the average path length equals the average search length of an unsuccessful search in a binary search tree:
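As a hedged illustration, the isolation forest literature commonly writes this quantity as c(ψ) = 2H(ψ-1) - 2(ψ-1)/ψ, where H(i) is the i-th harmonic number. Assuming that this is the form intended here, it can be computed as below; the function names are illustrative.

```python
def harmonic(n):
    """Exact n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def average_path_length(psi):
    """c(psi) = 2*H(psi - 1) - 2*(psi - 1)/psi, with c(psi) = 0 for psi <= 1."""
    if psi <= 1:
        return 0.0
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi
```

For large ψ the harmonic number is often approximated by ln(i) plus the Euler-Mascheroni constant; the exact sum is used here because ψ is small after subsampling.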

(3) Compute the path length vector l(x) of the training samples in each class and its distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain the distance feature set L(x).

(4) Use extreme value theory to determine the classification boundary of each class; formula (2) describes the probability density of the Weibull distribution. An extreme value model describes the distribution of abnormally high or abnormally low values and can model right-skewed, left-skewed or symmetric data. Because the training set contains no auxiliary identification information about unknown classes and the path length distribution is long-tailed, modeling the path length vector with a measure of central tendency alone is clearly inappropriate. Fitting a Weibull distribution to D(x) for each class delimits the known-class boundaries.

where β, λ and κ are the location, scale and shape parameters, respectively.
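For reference, the standard three-parameter Weibull distribution with location β, scale λ and shape κ has the density f and cumulative distribution function W shown below. This is the textbook form; it is assumed, not confirmed by the text, to match formula (2).

```latex
f(x;\beta,\lambda,\kappa)
  = \frac{\kappa}{\lambda}
    \left(\frac{x-\beta}{\lambda}\right)^{\kappa-1}
    \exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right],
  \qquad x \ge \beta,\ \lambda > 0,\ \kappa > 0

W(x;\beta,\lambda,\kappa)
  = 1 - \exp\!\left[-\left(\frac{x-\beta}{\lambda}\right)^{\kappa}\right],
  \qquad x \ge \beta
```

Both functions are zero for x < β.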

在本发明的实施例中，概率预测详细描述如下：In the embodiment of the present invention, the probability prediction is described in detail as follows:

预测阶段的目标是获得样本属于每个类别的概率，主要任务是计算属于未知样本的拒绝概率，然而对于开放集识别来讲缺少未知类别的先验知识，很难直接计算拒绝概率。首先分别计算待测样本属于每个类别的概率，通过这些概率间接计算拒绝概率。The goal of the prediction stage is to obtain the probability that the sample belongs to each category. The main task is to calculate the rejection probability of an unknown sample. However, for open set recognition, there is no prior knowledge of the unknown category, so it is difficult to directly calculate the rejection probability. First, the probability that the sample to be tested belongs to each category is calculated separately, and the rejection probability is indirectly calculated through these probabilities.

For a sample x under test, the probability score of belonging to the distribution of a known category i is computed from the Weibull distribution as in formula (3):

The path vector L(x) is the difference between the path length of the sample x under test and the average path length of each known category. Using formula (3), the path vector is transformed so that it can be applied to the category partition calibrated by the extreme value model; the probability score of each category is computed as in formula (4):

After the probabilities that the test sample belongs to the known categories are obtained, the rejection probability of the unknown category is computed from the inverse Weibull distribution 1 - W_i; that is, the probability score that the sample x under test belongs to the unknown category is:

where N+1 denotes the unknown category.

Finally, based on the N+1 probability values obtained, the sample under test is assigned to the category with the largest probability value. A threshold ε further limits the open space risk: if the largest probability value satisfies R_max < ε, the sample is rejected as an unknown class.
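The final decision rule can be sketched in a few lines of Python. The sketch covers only the last step (maximum probability plus the ε threshold); it takes the N+1 scores as given, because the formulas that produce them are not reproduced in this text. The function name, the label strings and the convention that the unknown-class score sits at the last index are illustrative assumptions.

```python
from typing import Sequence

UNKNOWN = "unknown"  # hypothetical label for class N+1


def classify_open_set(probs: Sequence[float],
                      labels: Sequence[str],
                      epsilon: float) -> str:
    """Choose a label from N+1 probability scores.

    probs   : N known-class scores followed by the unknown-class score.
    labels  : names of the N known classes, in the same order.
    epsilon : rejection threshold (assumed to be on the same scale as probs).
    """
    if len(probs) != len(labels) + 1:
        raise ValueError("expected one score per known class plus one unknown score")
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Reject when even the largest score falls below the threshold.
    if probs[best] < epsilon:
        return UNKNOWN
    # The last index is the unknown class itself.
    return UNKNOWN if best == len(labels) else labels[best]
```

For example, with two hypothetical known classes `["normal", "tunnel_a"]`, the scores `[0.7, 0.2, 0.1]` and ε = 0.5 yield `"normal"`, while `[0.4, 0.35, 0.25]` with the same ε is rejected as unknown.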

2. Application embodiment. To demonstrate the inventiveness and technical value of the technical solution of the present invention, this part presents an application embodiment of the claimed technical solution in specific products or related technologies.

The DNS tunnel detection method based on open set identification provided by this embodiment is divided into four main parts: data preprocessing, deep feature extraction, open set identification model construction, and probability evaluation. The collected DNS data is first preprocessed and then fed into the neural network for deep feature extraction. Because the weights of an untrained network have only been initialized, the network cannot yet produce an effective feature vector representation of domain name data. Therefore, closed set recognition is used first: a SoftMax layer is appended to the feature extraction network, which is trained on a multi-class classification task over the known categories, thereby training the weights of the feature extraction backbone.
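The preprocessing stage is specified in claim 4 as two lookup tables (characters to numbers and numbers to characters), zero padding at the end, and truncation to the input length. A minimal Python sketch of that encoding follows. The alphabet, the default input length of 64 and the choice to skip characters outside the alphabet are assumptions made for illustration; the text does not fix them.

```python
import string
from typing import Dict, List

# Assumed alphabet: lowercase letters, digits, hyphen, underscore and dot.
# Code 0 is reserved for padding, so real characters start at 1.
ALPHABET = string.ascii_lowercase + string.digits + "-_."
CHAR_TO_INT: Dict[str, int] = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
INT_TO_CHAR: Dict[int, str] = {i: ch for ch, i in CHAR_TO_INT.items()}


def encode_domain(domain: str, max_len: int = 64) -> List[int]:
    """Map characters to integers, truncate to max_len, pad the end with 0."""
    # Characters outside the assumed alphabet are skipped (an assumption).
    codes = [CHAR_TO_INT[ch] for ch in domain.lower() if ch in CHAR_TO_INT]
    codes = codes[:max_len]
    return codes + [0] * (max_len - len(codes))


def decode_domain(codes: List[int]) -> str:
    """Inverse mapping; padding zeros are dropped."""
    return "".join(INT_TO_CHAR[c] for c in codes if c != 0)
```

Reserving 0 for padding keeps padded positions distinguishable from real characters, which is what an embedding layer with masking would typically expect.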

After the feature extraction network has been trained, the output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and fed into the open set recognition algorithm. The open set recognition model is then trained on the extracted feature vectors to delimit the boundary between the known space and the unknown (open) space. A partition forest is constructed for the input DNS query domain name feature vectors of each category, the average path length of each category is calculated as the category center, and extreme value theory is used to calibrate the classification boundaries, which determines the dividing line between the known categories and the open space.
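The partition forest and the distance d_i(x) = l_i(x) - C_i(ψ) can be illustrated with a small pure-Python sketch. It makes several assumptions not stated in the text: C(ψ) is taken to be the usual isolation-forest normalizer (the average length of an unsuccessful binary search tree lookup), l_i(x) is taken to be the mean path length over the trees of class i, splits are drawn only from features that still vary at a node, and the defaults `n_trees=50` and `psi=64` are arbitrary. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(psi: int) -> float:
    """Average length of an unsuccessful search in a binary search tree of psi nodes."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    harmonic = math.log(psi - 1) + EULER_GAMMA  # approximation of H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi


class Node:
    def __init__(self, size, feature=None, split=None, left=None, right=None):
        self.size, self.feature, self.split = size, feature, split
        self.left, self.right = left, right


def build_tree(samples, depth, max_depth, rng):
    dims = range(len(samples[0])) if samples else []
    varying = [f for f in dims
               if min(s[f] for s in samples) < max(s[f] for s in samples)]
    # Stop at the height limit, on a single sample, or when all samples coincide.
    if depth >= max_depth or len(samples) <= 1 or not varying:
        return Node(size=len(samples))
    feature = rng.choice(varying)
    lo = min(s[feature] for s in samples)
    hi = max(s[feature] for s in samples)
    split = rng.uniform(lo, hi)
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return Node(len(samples), feature, split,
                build_tree(left, depth + 1, max_depth, rng),
                build_tree(right, depth + 1, max_depth, rng))


def path_length(x, node, depth=0):
    if node.feature is None:
        # Samples left unresolved in a leaf add the expected depth of a subtree.
        return depth + average_path_length(node.size)
    child = node.left if x[node.feature] < node.split else node.right
    return path_length(x, child, depth + 1)


def build_forest(class_samples, n_trees=50, psi=64, seed=0):
    """Build one forest for one known class from sub-samples of size psi."""
    rng = random.Random(seed)
    psi = min(psi, len(class_samples))
    max_depth = math.ceil(math.log2(max(psi, 2)))
    trees = [build_tree(rng.sample(class_samples, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def distance_to_class(x, forest, psi):
    """d_i(x) = l_i(x) - C_i(psi), with l_i(x) averaged over the class forest."""
    mean_len = sum(path_length(x, tree) for tree in forest) / len(forest)
    return mean_len - average_path_length(psi)
```

Under these assumptions, a point far from a class is isolated after few splits, so its distance to that class is more negative than that of a point in the dense region of the class.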

Finally, the probabilities output by the recognition algorithm determine whether the sample is a normal request or an attack category among the known categories, or belongs to an unknown category.

It should be noted that the embodiments of the present invention may be implemented in hardware, software, or a combination of the two. The hardware portion may be implemented using dedicated logic; the software portion may be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those of ordinary skill in the art will understand that the above devices and methods may be implemented using computer-executable instructions and/or processor control code, with such code provided, for example, on a carrier medium such as a magnetic disk, CD or DVD-ROM, on programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of such hardware circuits and software, for example firmware.

3. Evidence of the effects of the embodiment. The embodiments of the present invention achieved some positive effects during development and use, and do have considerable advantages over the prior art. These are described below with reference to data and charts from the experiments.

None of the existing deep-learning-based DNS tunnel detection schemes can judge unknown categories, so they cannot be used directly as baseline models for experimental comparison. The LSTM-CNN neural network structure used for feature extraction is trained with the following hyperparameters: the Embedding layer outputs 128-dimensional vectors, the LSTM layer has 128 units, and the convolution layer window size is 3. A SoftMax layer is appended to each of the four network structures to learn the network weights, with a learning rate of 0.001, AdamOptimizer as the optimization strategy, a batch_size of 100, and 20 training epochs in total. The output of the penultimate fully connected layer is extracted as the feature vector representation of the DNS query domain name and fed into each open set recognition algorithm for DNS tunnel identification. On data sets of varying openness, the overall recognition accuracy of the method of the present invention is compared with that of the existing open set recognition algorithms OpenMax and W-SVM.
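For reference, the training settings quoted above can be collected in one configuration object. The class and field names below are illustrative only; the values are the ones stated in the text. The helper assumes that a final partial batch is still used for an update, which the text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureExtractorConfig:
    """Hyperparameters quoted in the text; field names are hypothetical."""
    embedding_dim: int = 128       # Embedding layer output dimension
    lstm_units: int = 128          # units in the LSTM layer
    conv_window: int = 3           # convolution window size
    learning_rate: float = 0.001
    optimizer: str = "AdamOptimizer"
    batch_size: int = 100
    epochs: int = 20


def total_updates(n_samples: int, cfg: FeatureExtractorConfig) -> int:
    """Gradient updates over the whole run, keeping the last partial batch."""
    return math.ceil(n_samples / cfg.batch_size) * cfg.epochs
```

With these settings, a hypothetical training set of 10,050 domain names would take 101 updates per epoch and 2,020 updates in total.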

Experiments show that the method of the present invention has a significant advantage over the other two methods, even when the openness of the data set is high.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any modification, equivalent replacement or improvement made by a person skilled in the art within the technical scope disclosed by the present invention, and within its spirit and principles, shall fall within the protection scope of the present invention.

Claims

1. A DNS tunnel detection method based on open set identification, characterized in that, after DNS data preprocessing, a deep learning network is used to represent the data as feature vectors; an open set identification algorithm is used to partition and transform the resulting feature vector space so as to clarify the boundary between the known space and the unknown space; and the recognition result is finally output through probability evaluation.

2. The DNS tunnel detection method based on open set identification as claimed in claim 1, characterized in that the method further comprises: parsing the captured DNS data packets; using a neural network to perform feature extraction on DNS query subdomain names; dividing the boundaries of the data space by constructing a partition forest for the input DNS query domain name feature vectors and calibrating the classification boundaries using extreme value theory; using a probability estimation method to calculate the probability that the test sample belongs to each category, including the unknown category; and assigning the test sample a classification label according to the maximum probability, the sample being identified as an unknown category when its probabilities of belonging to the known categories are all less than the threshold ε.

3. The DNS tunnel detection method based on open set identification as claimed in claim 2, characterized in that the method comprises the following steps:

In the first step, the captured DNS packets are parsed and the characters are converted to integer values;

In the second step, a neural network is used to perform feature extraction on DNS query subdomain names: CNN, LSTM and different combinations of the two represent DNS query domain names as feature vectors, and the network is trained; the space in which the penultimate layer of the trained network lies is the feature space;

In the third step, the boundary division of the data space is completed: a partition forest is constructed for the input DNS query domain name feature vectors, the average path length of each category is calculated as the category center, and extreme value theory is used to calibrate the classification boundary;

In the fourth step, based on the similarity between the test sample and each known category, a probability estimation method is used to calculate the probability that the test sample belongs to each category, including the unknown category;

In the fifth step, the test sample is assigned a classification label according to the maximum probability; when the probabilities that the test sample belongs to the known categories are all less than the threshold ε, the sample is identified as an unknown category.

4. The DNS tunnel detection method based on open set identification as claimed in claim 3, wherein the first step comprises: extracting and retaining the DNS query domain name field and, by analyzing the DNS query domain name encoding specification, creating two lookup tables, one mapping characters to numbers and the other mapping numbers to characters, so that after processing each character has a corresponding integer value; after numerical encoding, padding is performed by filling the end with "0", and data exceeding the input length is truncated;

in the second step, after network training is completed, the activation vector of the layer preceding the classification layer is extracted as the feature vector representation of the current sample, that is, the input of the SoftMax classification layer in the network is selected as the feature vector.

5. The DNS tunnel detection method based on open set identification as claimed in claim 3, characterized in that the third step comprises: using the limit forest applied to open set identification to partition and transform the obtained feature vector space, such that each class boundary produced by the division contains the minimum distance boundary.

6. The DNS tunnel detection method based on open set identification as claimed in claim 4, characterized in that training the classifier in the third step comprises:

(1) Using the data of all known categories, sub-sample each category; randomly select a feature as the starting node and randomly select a value within the value range of that feature, dividing the ψ samples into two parts: samples smaller than the value go to the left branch, and samples larger than the value go to the right branch; repeat this binary division in the left and right branches until one of the following conditions is met: the tree reaches the height limit; only one sample remains on the node; or all samples on the node have identical features; the division process is repeated iteratively according to the sample data capacity, and the generated partition trees form a binary tree forest;

(2) Calculate the average path length of each category; for ψ samples, the average path length is equivalent to the average search length of an unsuccessful search in a binary sort tree:

(3) Calculate the path-length vector l(x) of the training samples in each category, and calculate its distance from the average path length, d_i(x) = l_i(x) - C_i(ψ), to obtain the distance feature set L(x);

(4) Use extreme value theory to determine the classification boundary of each category; the extreme value model describes the distribution of abnormally high or abnormally low values and models right-skewed, left-skewed or symmetric data; a Weibull distribution is fitted to D(x) for each category to delimit the known class boundaries:

where β, λ and κ are the location, scale and shape parameters, respectively.

7. The DNS tunnel detection method based on open set identification as claimed in claim 3, characterized in that probability prediction specifically comprises: calculating the probability that the sample under test belongs to each category, and calculating the rejection probability indirectly from those probabilities; for a sample x under test, the probability score of belonging to the distribution of a known category i is calculated from the Weibull distribution by the formula:

The path vector L(x) is the difference between the path length of the sample x under test and the average path length of each known category; the path vector is transformed with the formula so that it can be applied to the category partition calibrated by the extreme value model, and the probability score of each category is calculated by the formula:

After the probabilities that the test sample belongs to the known categories are obtained, the rejection probability of the unknown category is calculated from the inverse Weibull distribution 1 - W_i, and the probability score that the sample x under test belongs to the unknown category is:

where N+1 denotes the unknown category;

Based on the N+1 probability values obtained, the sample under test is assigned to the category with the maximum probability value; the open space risk is further limited by setting the threshold ε: if the maximum probability value satisfies P_max < ε, the sample is rejected as an unknown class.

8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the DNS tunnel detection method based on open set identification according to any one of claims 1 to 7.

9. An information data processing terminal, wherein the information data processing terminal is configured to implement the DNS tunnel detection method based on open set identification according to any one of claims 1 to 7.

10. A DNS tunnel detection system based on open set identification for implementing the method according to any one of claims 1 to 7, wherein the system comprises:

a data preprocessing module, configured to parse the captured DNS packets and convert characters into integer values;

a feature extraction module, configured to perform feature extraction on DNS query subdomain names using a neural network, to represent DNS query domain names as feature vectors using CNN, LSTM and different combinations of the two, and to train the network, the space in which the penultimate layer of the trained network lies being the feature space;

a classifier training module, configured to complete the boundary division of the data space, construct a partition forest for the input DNS query domain name feature vectors, calculate the average path length of each category as the category center, and calibrate the classification boundary using extreme value theory;

a prediction probability output module, configured to calculate, based on the similarity between the test sample and each known category and using a probability estimation method, the probability that the test sample belongs to each category, including the unknown category, and to assign the test sample a classification label according to the maximum probability, the sample being identified as an unknown class when its probabilities of belonging to the known classes are all less than the threshold ε.