CN102142068A

CN102142068A - Method for detecting unknown malicious code

Info

Publication number: CN102142068A
Application number: CN201110076525XA
Authority: CN
Inventors: 李元诚; 李盼
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2011-03-29
Filing date: 2011-03-29
Publication date: 2011-08-03

Abstract

The invention discloses a method for detecting unknown malicious code in the technical field of information security, which can detect malicious code in files in advance without updating a malicious code library. The method includes: using the Byte n-grams method to extract the feature vectors of the files in the training set; using a locally linear embedding algorithm to reduce the dimensionality of the extracted feature vectors; taking the dimension-reduced feature vectors as input and using the kernel coverage learning algorithm to train a kernel coverage classifier; then using the Byte n-grams method to extract the feature vectors of the files in the test set; using the locally linear embedding algorithm to reduce the dimensionality of those feature vectors; and inputting the dimension-reduced results into the kernel coverage classifier for classification, compiling statistics on the classification results, and determining whether the files in the test set contain malicious code. The invention improves the detection speed of files and achieves accurate detection of malicious code in advance.

Description

A Detection Method of Unknown Malicious Code

Technical field

The invention belongs to the technical field of information security, and in particular relates to a method for detecting unknown malicious code.

Background

At present, malicious code is ubiquitous on the Internet, and its ability to spread, cause harm, and conceal itself keeps increasing, so detecting it poses a major challenge. Existing detection technologies fall into two main types: signature-based pattern matching, and detection based on malicious code behavior rules.

In signature-based pattern matching, after a malicious code file appears, analysts manually analyze it, extract a signature that uniquely identifies the file, add the signature to the malicious code signature database, and then provide the database to users for detecting and removing malicious code in computer programs. Detection based on behavior rules identifies malicious code according to behavior rules predefined by experts. The drawback of both methods is that the malicious code database must be updated constantly; otherwise new types of malicious code can bypass detection. In addition, both are after-the-fact techniques: they cannot detect newly emerged malicious code before it executes. Detection becomes possible only after the malicious code has appeared, analysts have extracted its features, and its signature has been added to the signature database. In the meantime, the malicious code may already have run and caused damage.

Summary of the invention

The purpose of the present invention is to address the shortcomings of current malicious code detection technology by proposing a method for detecting unknown malicious code. A sample set containing both malicious and non-malicious files is used as the training set, a classification algorithm is used to train a classifier, and the trained classifier is then used to classify unknown files to determine whether they are malicious code files.

To achieve this purpose, the technical solution provided by the present invention is a method for detecting unknown malicious code, characterized in that the method comprises the following steps:

Step 1: Use the Byte n-grams method to extract the feature vectors of the files in the training set;

Step 2: Use a locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the training set;

Step 3: Take the dimension-reduced feature vectors as input and use the kernel coverage learning algorithm to train a kernel coverage classifier;

Step 4: Use the Byte n-grams method to extract the feature vectors of the files in the test set;

Step 5: Use the locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the test set;

Step 6: Input the dimension-reduced results into the kernel coverage classifier for classification and, after compiling statistics on the classification results, determine whether the files in the test set contain malicious code.

Using the locally linear embedding algorithm to reduce the dimensionality of the feature vectors specifically includes:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K neighbor points of each sample point, where K is a preset value;

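Step 22: Construct the local reconstruction weight matrix of each sample point x_i using the formula

$\min \varepsilon(W) = \sum_{i=1}^{N} \left\| x_{i} - \sum_{k=1}^{K} w_{jk,i} x_{jk,i} \right\|^{2}$

where x_{jk,i} (k = 1, 2, …, K) are the K neighbor points of x_i, w_{jk,i} are their weights, and N is the number of sample points;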

Step 23: Compute the low-dimensional output value of each sample point x_i from its local reconstruction weight matrix and its neighbor points.

In step 23, the low-dimensional output y_i of the sample point x_i satisfies the following mapping conditions:

$\sum_{i=1}^{N} y_{i} = 0$

and

$\frac{1}{N} \sum_{i=1}^{N} y_{i} y_{i}^{T} = I$

where I is the m×m identity matrix and m is the dimension after dimensionality reduction.

Step 3 specifically includes:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains;

Step 32: Fuse the coverage domains, merging coverage domains of the same class into one spherical surface in the feature space;

Step 33: Construct the fusion surface f(x) and compute f(x_i) for each sample point x_i. If f(x_i) is greater than zero, the sample point x_i represents a file that does not contain malicious code; if f(x_i) is less than zero, the sample point x_i represents a file that contains malicious code.

The invention introduces a manifold learning algorithm for feature selection on files, which can discover meaningful low-dimensional structure in high-dimensional data and reduce its dimensionality, improving the detection speed of files. In addition, introducing the kernel coverage learning algorithm into classification learning makes it possible to construct a kernel function that accurately partitions the sample set in a single pass, thereby achieving accurate detection of newly emerging malicious code.

Description of drawings

Fig. 1 is a schematic diagram of the process of the method for detecting unknown malicious code;

Fig. 2 is a flowchart of reducing the dimensionality of the extracted feature vectors with the locally linear embedding algorithm;

Fig. 3 is a flowchart of training the kernel coverage classifier with the kernel coverage learning algorithm.

Detailed description

The preferred embodiments are described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope of the invention or its application.

The idea behind the invention is as follows. A set of files, some containing malicious code and some not, is used as training samples. A manifold learning algorithm performs feature selection on the training files so that each file corresponds to one feature vector, and the feature vectors are used as input to the kernel coverage classification algorithm to train a kernel coverage classifier. Finally, feature selection is performed on an unknown file to produce its feature vector, which is fed to the classifier to decide whether the file is malicious or non-malicious.

The specific implementation of the invention is described below with reference to the drawings. Fig. 1 is a schematic diagram of the detection process of the intelligent detection method for unknown malicious code provided by the invention. The method comprises the following steps:

Step 1: Use the Byte n-grams method to extract the feature vectors of the files in the training set.

The training set can be constructed from standard datasets downloaded from the Internet. Standard datasets intended specifically for malicious code detection are available online; they contain both malicious code files and normal files, and files can be selected from them according to specific rules to build the training set.

The Byte n-grams method slides an n-byte window over a binary byte stream or text to extract words, each n bytes long. For example, if the content of a text file is "abcdef", its 2-gram sequence is: ab bc cd de ef, and its 3-gram sequence is: abc bcd cde def.

Take a file whose content is "abcd" as an example. Its 2-gram sequence is: ab bc cd. The file is then said to have three attributes, and the vector formed from them can represent the file: {ab, bc, cd}.

Quantizing each attribute yields the feature vector of the file. For example, if a is at position 1 in the alphabet, b at position 2, and so on, the attributes can be quantized by summing positions, giving {3, 5, 7}. The vector {3, 5, 7} is the feature vector of the file.
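The sliding-window extraction and the position-sum quantization described above can be sketched in a few lines of Python. This is only an illustration of the "abcd" example: the function names `byte_ngrams` and `position_sum` are hypothetical, and the position-sum rule is the toy quantization from the text, not a rule suited to arbitrary binary files.

```python
def byte_ngrams(data, n):
    """Slide an n-byte window over data, advancing one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def position_sum(gram):
    """Toy quantization from the text: a=1, b=2, ..., summed over the gram."""
    return sum(b - ord("a") + 1 for b in gram)


grams = byte_ngrams(b"abcd", 2)
features = [position_sum(g) for g in grams]
print(grams)     # [b'ab', b'bc', b'cd']
print(features)  # [3, 5, 7]
```

For real executables, a more typical choice would be to count how often each n-gram occurs, but the text does not specify that, so it is left out here.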

Step 2: Use a locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the training set. Fig. 2 is a flowchart of this dimensionality reduction, which includes the following steps:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K neighbor points of each sample point, where K is a preset value.

The K sample points closest to a given sample point are defined as its K neighbor points, where K is a preset value. Distance can be computed as Euclidean distance: for x, y ∈ R^N, the Euclidean distance between x and y is given by

$\left( \sum_{i=1}^{N} (x^{i} - y^{i})^{2} \right)^{\frac{1}{2}}$
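As a minimal sketch of step 21, the following Python code finds the K nearest neighbors of every sample point by brute force using the Euclidean distance above. The names `euclidean` and `k_nearest_neighbors` and the sample coordinates are illustrative assumptions; a point is not counted as its own neighbor.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_nearest_neighbors(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [(euclidean(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()  # ascending by distance, ties broken by index
        neighbors.append([j for _, j in others[:k]])
    return neighbors


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(k_nearest_neighbors(points, 2))  # [[1, 2], [0, 2], [0, 1], [2, 1]]
```

The brute-force search costs O(N²) distance evaluations, which is acceptable for a small training set; a spatial index would be a natural substitute for larger ones.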

Step 22: Construct the local reconstruction weight matrix of each sample point x_i using the formula

$\min \varepsilon(W) = \sum_{i=1}^{N} \left\| x_{i} - \sum_{k=1}^{K} w_{jk,i} x_{jk,i} \right\|^{2}$

where

$\sum_{k=1}^{K} w_{jk,i} = 1$

and N is the number of sample points.

W = (w_ij) ∈ M_{n,n} is a weight matrix such that w_ij = 0 if x_i and x_j are not adjacent. If x_i and x_j (j = 1, 2, …, K) are adjacent, the weights satisfy the constraint

$\sum_{j=1}^{K} w_{ij} = 1$

Approximating X by XW introduces some error. The Frobenius norm of a matrix is defined as follows: for an m-order matrix A = (a_{i,j}) ∈ M_{m,m},

$\| A \|_{F} = \left( \sum_{i=1}^{m} \sum_{j=1}^{m} | a_{i,j} |^{2} \right)^{\frac{1}{2}}$

W is found from the following constraint:

$\min_{W} \| X - XW \|_{F}^{2}$

that is,

$\min \sum_{i=1}^{n} \| x_{i} - X w_{i} \|^{2}$

This amounts to solving a series of least squares problems. For x_i, for example, the weights w_{jk,i} can be obtained from the following system of equations:

$w_{jk,i} : \left\{ \begin{matrix} \sum_{k=1}^{K} w_{jk,i} = 1 \\ X w_{i} = x_{i} \end{matrix} \right.$
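One common way to solve these per-point least squares problems is through the local Gram matrix of the centered neighbors. The sketch below uses NumPy and makes several assumptions not stated in the text: samples are stored as rows, `W[i, j]` holds the weight of x_j in the reconstruction of x_i, and a small regularization term `reg` is added so the linear system stays solvable when K exceeds the input dimension. The name `lle_weights` is hypothetical.

```python
import numpy as np


def lle_weights(X, neighbors, reg=1e-3):
    """Reconstruction weights; row i of W sums to 1 over the neighbors of x_i."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        idx = list(neighbors[i])
        Z = X[idx] - X[i]                 # neighbors centered on x_i
        G = Z @ Z.T                       # local Gram matrix, K x K
        # Assumed regularization: keeps G invertible when it is singular.
        G += reg * (np.trace(G) or 1.0) * np.eye(len(idx))
        w = np.linalg.solve(G, np.ones(len(idx)))
        W[i, idx] = w / w.sum()           # enforce the sum-to-one constraint
    return W


X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
neighbors = [[1, 2], [0, 2], [1, 3], [1, 2]]
W = lle_weights(X, neighbors)
print(W[1])  # weights close to 0.5 for neighbors 0 and 2, zero elsewhere
```

Point 1 lies midway between its neighbors 0 and 2, so equal weights reconstruct it exactly, which is what the example prints.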

With the weight matrix W, suitable outputs y_i in the low-dimensional space can be found from the following constraint:

$\min \varepsilon(Y) = \sum_{i=1}^{N} \left\| y_{i} - \sum_{k=1}^{K} w_{jk,i} y_{jk,i} \right\|^{2}$

where y_i is the output vector of x_i and y_{jk,i} (k = 1, 2, …, K) are the neighbor points of y_i. Two conditions must be satisfied:

$\sum_{i=1}^{N} y_{i} = 0$

and

$\frac{1}{N} \sum_{i=1}^{N} y_{i} y_{i}^{T} = I$

where I is the m×m identity matrix. The loss function can then be rewritten as

$\min \varepsilon(Y) = \sum_{i=1}^{N} \sum_{j=1}^{N} M_{ij} y_{i}^{T} y_{j}$

where M is an n×n symmetric matrix: M = (I − W)^T (I − W).

To minimize the loss function, Y is taken as the eigenvectors corresponding to the m smallest non-zero eigenvalues of M. In practice, the eigenvalues of M are sorted in ascending order; the first eigenvalue is almost zero and is discarded. The eigenvectors corresponding to the 2nd through (m+1)-th eigenvalues are usually taken as the output.
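The eigenvector selection just described can be sketched with NumPy's symmetric eigensolver, which returns eigenvalues in ascending order. The function name `lle_embedding`, the row convention for W (row i holds the weights of x_i), the scaling by √N, and the ring-shaped example matrix are all assumptions made for illustration.

```python
import numpy as np


def lle_embedding(W, m):
    """Return an N x m array whose rows are the low-dimensional outputs y_i."""
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues sorted ascending
    Y = eigvecs[:, 1:m + 1]               # skip the first, near-zero eigenvalue
    return Y * np.sqrt(N)                 # so that (1/N) * sum y_i y_i^T = I


# Example weights: six points on a ring, each the average of its two neighbors.
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i - 1) % N] = 0.5
    W[i, (i + 1) % N] = 0.5

Y = lle_embedding(W, 2)
print(Y.shape)  # (6, 2)
```

Because the discarded eigenvector is constant across samples and the retained ones are orthogonal to it, the outputs are automatically centered, so both mapping conditions hold for this example.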

Step 3: Take the dimension-reduced feature vectors as input and use the kernel coverage learning algorithm to train a kernel coverage classifier.

The kernel coverage learning algorithm introduces a kernel function into the covering algorithm. First, take a kernel function K(x, y) = <T(x), T(y)> and apply the transformation T: D → Z, x ∈ D, where the input domain D is a bounded set in n-dimensional space containing p samples; this transformation maps the points of D into the P-dimensional kernel space. Denote the input set in the kernel space by P_t, t = 1, 2, …, p. In the kernel space, assume without loss of generality that the first k values of the output set Y are all distinct. Let I_j be the set of indices of all samples whose output is Y_j (j ≤ k), and denote the corresponding input set by P_j (j = 0, 1, …, k−1). After this initialization, a set of covers in the kernel space can be computed. Fig. 3 is a flowchart of training the kernel coverage classifier with the kernel coverage learning algorithm, which includes the following steps:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains.

(1) Take any point x_j ∈ P_t in the sample set that has not yet been covered, and compute

$d_{j}^{(1)} = \min_{m \notin I_{j}} \{ K(x_{j}, x_{m}) \}$

$d_{j}^{(2)} = \max_{m \in I_{j}} \{ K(x_{j}, x_{m}) \mid K(x_{j}, x_{m}) < d_{j}^{(1)} \}$

$d_{j} = [ d_{j}^{(1)} + d_{j}^{(2)} ] / 2$

$\theta_{j} = [ d_{j}^{(1)} - d_{j}^{(2)} ] / 2$

Then construct a cover from x_j and d_j. The center of the cover is x_j, its radius is d_j, and its classification margin is θ_j. Here I_j is a set of indices and x_m denotes a sample point; in the first formula, m ∉ I_j means that the index m does not belong to I_j.

(2) Once the cover has been obtained, delete from P_t all points covered by it, then select another x_j (j ∈ I_j) from P_t and repeat step (1) until all x_j with j ∈ I_j have been deleted. This constructs all coverage domains of one class.
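The loop in steps (1) and (2) can be sketched as follows. This is a rough illustration under several assumptions rather than the patented procedure: the text applies min and max to K(x_j, x_m) directly, whereas the sketch uses the distance induced by the kernel, sqrt(K(x,x) − 2K(x,y) + K(y,y)), so that smaller values mean closer points; the Gaussian kernel and its `gamma` are arbitrary choices; a point counts as covered when its distance to the center is at most the radius, since the cover's defining formula is not reproduced in this text; and all classes are processed in one loop. The names `rbf_kernel`, `kernel_distance`, and `build_covers` are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel; the kernel choice and gamma are assumptions."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between the images of x and y in the kernel space."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))


def build_covers(samples, labels, kernel=rbf_kernel):
    covers = []
    uncovered = set(range(len(samples)))
    while uncovered:
        j = min(uncovered)  # any uncovered point would do
        dist = [kernel_distance(samples[j], s, kernel) for s in samples]
        other = [dist[m] for m in range(len(samples)) if labels[m] != labels[j]]
        d1 = min(other) if other else math.inf  # nearest point of another class
        inside = [dist[m] for m in range(len(samples))
                  if labels[m] == labels[j] and dist[m] < d1]
        d2 = max(inside) if inside else 0.0     # farthest same-class point within d1
        if math.isinf(d1):
            radius, margin = d2, 0.0            # single-class data: no margin defined
        else:
            radius, margin = (d1 + d2) / 2.0, (d1 - d2) / 2.0
        covers.append({"center": j, "label": labels[j],
                       "radius": radius, "margin": margin})
        # Remove every same-class point that falls inside the new cover.
        uncovered -= {m for m in uncovered
                      if labels[m] == labels[j] and dist[m] <= radius}
    return covers


samples = [(0.0,), (0.1,), (5.0,), (5.1,)]
labels = [1, 1, -1, -1]
covers = build_covers(samples, labels)
print(len(covers))  # 2
```

The center of each cover is always removed in the iteration that creates it, so the loop terminates after at most one pass per sample.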

Step 32: Fuse the coverage domains, merging coverage domains of the same class into one spherical surface in the feature space.

For all the coverage domains obtained, where d_i denotes the radius of the domain centered on x_i, solve the following quadratic programming problem:

$\left\{ \begin{matrix} \max w(\alpha) = \sum_{i=1}^{m} \alpha_{i} - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_{i} \alpha_{j} y_{i} y_{j} \frac{K(d_{i}, d_{j}) + K(d_{j}, d_{i})}{2} \\ \sum_{i=1}^{m} \alpha_{i} y_{i} = 0, \quad \alpha_{i} \geq 0, \quad i = 1, 2, \ldots, m \end{matrix} \right.$

This yields the optimal solution α^* = {α₁, α₂, …, α_m}.

Step 33: Construct the fusion surface f(x) and compute f(x_i) for each sample point x_i. If f(x_i) is greater than zero, the sample point x_i represents a file that does not contain malicious code; if f(x_i) is less than zero, the sample point x_i represents a file that contains malicious code.

Use the α^* obtained in step 32 to construct the hyperplane:

$f(x) = \sum_{i=1}^{m} \alpha_{i} y_{i} K(d_{i}, x)$

Its discriminant function is F(x) = Sign(f(x) + b₀), where b₀ is the decision threshold.

To classify samples, compute f(x) for each sample x. If f(x) > 0, x belongs to the positive class (it does not contain malicious code); if f(x) < 0, x belongs to the negative class (it contains malicious code); if f(x) = 0, x is said to be rejected. A threshold ε can also be set so that x is treated as rejected when |f(x)| < ε, which reduces errors.
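The three-way decision rule, including the optional rejection band, is simple enough to show directly. In this sketch `score` stands for the value of f(x) (or f(x) + b₀ if the caller applies the decision threshold first); the function name `classify` and the string labels are illustrative assumptions.

```python
def classify(score, eps=0.0):
    """Map a decision value to 'clean', 'malicious', or 'rejected'."""
    if score == 0 or abs(score) < eps:
        return "rejected"  # too close to the decision surface to call
    return "clean" if score > 0 else "malicious"


print(classify(0.8))            # clean
print(classify(-0.3))           # malicious
print(classify(0.0))            # rejected
print(classify(0.05, eps=0.1))  # rejected
```

A larger ε trades coverage for accuracy: more samples are left undecided, but fewer borderline samples are misclassified.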

Step 4: Use the Byte n-grams method to extract the feature vectors of the files in the test set.

As in step 1, the Byte n-grams method is used to extract the feature vectors of the files in the test set. The test set can be selected from datasets available online.

Step 5: Use the locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the test set.

As in step 2, the locally linear embedding algorithm is used to reduce the dimensionality of the feature vectors extracted in step 4.

Step 6: Input the dimension-reduced results into the kernel coverage classifier for classification and, after compiling statistics on the classification results, determine whether the files in the test set contain malicious code.

The dimension-reduced results of step 5 are fed to the kernel coverage classifier obtained in step 3, which classifies them. Statistics are compiled on the classification results, and these statistics are then used to determine whether the files in the test set contain malicious code.
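The text does not say which statistics are compiled. As one plausible reading, the sketch below tallies the predicted labels and, when the true labels of the test files are known, computes a detection rate and a false positive rate. The function name `summarize`, the string labels, and the choice of these two metrics are assumptions.

```python
from collections import Counter


def summarize(predicted, actual):
    """Tally predictions and compute detection and false positive rates."""
    counts = Counter(predicted)
    on_malicious = [p for p, a in zip(predicted, actual) if a == "malicious"]
    on_clean = [p for p, a in zip(predicted, actual) if a == "clean"]
    detection_rate = (on_malicious.count("malicious") / len(on_malicious)
                      if on_malicious else 0.0)
    false_positive_rate = (on_clean.count("malicious") / len(on_clean)
                           if on_clean else 0.0)
    return counts, detection_rate, false_positive_rate


predicted = ["malicious", "clean", "malicious", "rejected"]
actual = ["malicious", "clean", "clean", "malicious"]
counts, detection_rate, false_positive_rate = summarize(predicted, actual)
print(counts["malicious"], detection_rate, false_positive_rate)  # 2 0.5 0.5
```

Rejected samples count against the detection rate here because they were not flagged; a deployment might instead route them to manual analysis.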

The invention uses a sample set containing both malicious and non-malicious files as the training set, trains a classifier with a classification algorithm, and then uses the trained classifier to classify unknown files to determine whether they are malicious. A manifold learning algorithm is introduced into feature selection to analyze a large number of file feature attributes and discover meaningful low-dimensional structure hidden in the high-dimensional data, thereby reducing the dimensionality of the file feature attributes and increasing processing speed. The kernel coverage learning algorithm is introduced into classification learning; it brings the kernel function concept of support vector machines into the covering algorithm. Compared with the support vector machine algorithm, for any given sample set it can construct a kernel function that accurately partitions the sample set in a single pass, which ensures that the system retains good classification accuracy and low computational cost even with insufficient prior knowledge and small samples.

The above is only a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within its scope of protection. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims

1. A method for detecting unknown malicious code, characterized in that the method comprises the following steps:

Step 1: Use the Byte n-grams method to extract the feature vectors of the files in the training set;

Step 2: Use a locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the training set;

Step 3: Take the dimension-reduced feature vectors as input and use the kernel coverage learning algorithm to train a kernel coverage classifier;

Step 4: Use the Byte n-grams method to extract the feature vectors of the files in the test set;

Step 5: Use the locally linear embedding algorithm to reduce the dimensionality of the feature vectors extracted from the files in the test set;

Step 6: Input the dimension-reduced results into the kernel coverage classifier for classification and, after compiling statistics on the classification results, determine whether the files in the test set contain malicious code.

2. The method for detecting unknown malicious code according to claim 1, characterized in that using the locally linear embedding algorithm to reduce the dimensionality of the feature vectors specifically comprises:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K neighbor points of each sample point, where K is a preset value;

Step 22: Construct the local reconstruction weight matrix of each sample point x_i using the formula

$\min \varepsilon(W) = \sum_{i=1}^{N} \left\| x_{i} - \sum_{k=1}^{K} w_{jk,i} x_{jk,i} \right\|^{2}$

where

$\sum_{k=1}^{K} w_{jk,i} = 1$

and N is the number of sample points;

Step 23: Compute the low-dimensional output value of each sample point x_i from its local reconstruction weight matrix and its neighbor points.

In step 23, the low-dimensional output y_i of the sample point x_i satisfies the following mapping conditions:

$\sum_{i=1}^{N} y_{i} = 0$

and

$\frac{1}{N} \sum_{i=1}^{N} y_{i} y_{i}^{T} = I$

3. The method for detecting unknown malicious code according to claim 1, characterized in that step 3 specifically comprises:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains;

Step 32: Fuse the coverage domains, merging coverage domains of the same class into one spherical surface in the feature space;

Step 33: Construct the fusion surface f(x) and compute f(x_i) for each sample point x_i. If f(x_i) is greater than zero, the sample point x_i represents a file that does not contain malicious code; if f(x_i) is less than zero, the sample point x_i represents a file that contains malicious code.