CN102142068A - Method for detecting unknown malicious code - Google Patents


Info

Publication number
CN102142068A
Authority
CN
China
Prior art keywords
files
coverage
malicious code
sample point
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110076525XA
Other languages
Chinese (zh)
Inventor
李元诚
李盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University filed Critical North China Electric Power University
Priority to CN201110076525XA priority Critical patent/CN102142068A/en
Publication of CN102142068A publication Critical patent/CN102142068A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for detecting unknown malicious code, in the technical field of information security, that can detect malicious code in files in advance without updating a malicious code library. The method comprises: extracting feature vectors from the files in a training set using the Byte n-grams method; reducing the dimensionality of those feature vectors with the local linear embedding algorithm; taking the reduced feature vectors as input and training a kernel coverage classifier with the kernel coverage learning algorithm; extracting feature vectors from the files in a test set using the Byte n-grams method; reducing the dimensionality of those feature vectors with the local linear embedding algorithm; and feeding the reduced results to the kernel coverage classifier, tallying the classification results, and determining whether the files in the test set contain malicious code. The invention speeds up file inspection and achieves accurate detection of malicious code in advance.

Description

A Method for Detecting Unknown Malicious Code

Technical Field

The invention belongs to the technical field of information security, and in particular relates to a method for detecting unknown malicious code.

Background

Malicious code is now ubiquitous on the Internet, and its ability to spread, cause harm, and conceal itself keeps increasing, which poses a major challenge for malicious code detection. Existing detection techniques fall into two main types: signature-based pattern matching, and detection based on malicious code behavior rules.

In signature-based pattern matching, analysts manually analyze a malicious code file after it appears, extract a signature that uniquely identifies it, add the signature to the malicious code signature database, and distribute the database to users so that malicious code in computer programs can be found and removed. Detection based on behavior rules identifies malicious code using behavior rules predefined by experts. Both methods share a drawback: the malicious code database must be updated continually, otherwise new types of malicious code can bypass detection. Both are also after-the-fact techniques that cannot detect newly emerged malicious code before it executes. Detection becomes possible only after the malicious code has appeared, analysts have extracted its features, and its signature has been added to the database. In the meantime, the malicious code may already have run and caused damage.

Summary of the Invention

The purpose of the invention is to address the shortcomings of current malicious code detection techniques by proposing a method for detecting unknown malicious code. A sample set containing both malicious and non-malicious files serves as the training set, a classification algorithm is used to train a classifier, and the trained classifier then classifies unknown files to determine whether they are malicious code files.

To achieve this purpose, the technical solution provided by the invention is a method for detecting unknown malicious code, characterized in that the method comprises the following steps:

Step 1: Extract the feature vectors of the files in the training set using the Byte n-grams method;

Step 2: Reduce the dimensionality of the extracted training-set feature vectors using the local linear embedding algorithm;

Step 3: Take the reduced feature vectors as input and train a kernel coverage classifier using the kernel coverage learning algorithm;

Step 4: Extract the feature vectors of the files in the test set using the Byte n-grams method;

Step 5: Reduce the dimensionality of the extracted test-set feature vectors using the local linear embedding algorithm;

Step 6: Feed the reduced results to the kernel coverage classifier for classification, tally the classification results, and determine whether the files in the test set contain malicious code.

Reducing the dimensionality of the feature vectors with the local linear embedding algorithm specifically includes:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K nearest neighbors of each sample point, where K is a preset value;

Step 22: Use the formula

Figure BDA0000052621520000021

to construct the local reconstruction weight matrix of each sample point $x_i$, where N is the number of sample points;

Step 23: Compute the low-dimensional output of each sample point $x_i$ from its local reconstruction weight matrix and its neighbors.

In step 23, the low-dimensional output $y_i$ of sample point $x_i$ satisfies the following mapping conditions:

Figure BDA0000052621520000031

and

Figure BDA0000052621520000032

Figure BDA0000052621520000033

where I is the m×m identity matrix and m is the dimension after reduction.
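
The formula images referenced in steps 22 and 23 did not survive extraction. For orientation only, the standard locally linear embedding formulation that these steps appear to follow is given below. It is not a transcription of the patent's figures; the notation $x_{i_j}$ for the j-th neighbor of $x_i$ is an assumption made here.

```latex
% Step 22 (standard LLE): reconstruction weights minimize
\varepsilon(W) = \sum_{i=1}^{N} \Bigl\| x_i - \sum_{j=1}^{K} w_{ij}\, x_{i_j} \Bigr\|^2
\quad \text{subject to} \quad \sum_{j=1}^{K} w_{ij} = 1 .

% Step 23 (standard LLE): low-dimensional outputs minimize
\Phi(Y) = \sum_{i=1}^{N} \Bigl\| y_i - \sum_{j=1}^{K} w_{ij}\, y_{i_j} \Bigr\|^2
\quad \text{subject to} \quad \sum_{i=1}^{N} y_i = 0 , \qquad
\frac{1}{N} \sum_{i=1}^{N} y_i y_i^{T} = I .
```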

Step 3 specifically includes:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains;

Step 32: Fuse the coverage domains, merging those that belong to the same class into a sphere in the feature space;

Step 33: Construct the fusion surface f(x) and compute $f(x_i)$ for each sample point $x_i$. If $f(x_i)$ is greater than zero, the sample point $x_i$ represents a file that contains no malicious code; if $f(x_i)$ is less than zero, the sample point $x_i$ represents a file that contains malicious code.

By introducing a manifold learning algorithm for feature selection on files, the invention can discover meaningful low-dimensional structure in high-dimensional data and reduce its dimensionality, which speeds up file inspection. In addition, by introducing the kernel coverage learning algorithm as the classification learning algorithm, it can construct a kernel function that correctly partitions the sample set in a single pass, achieving the goal of accurately detecting newly emerged malicious code.

Description of the Drawings

Fig. 1 is a schematic diagram of the process of the method for detecting unknown malicious code;

Fig. 2 is a flowchart of reducing the dimensionality of the extracted feature vectors using the local linear embedding algorithm;

Fig. 3 is a flowchart of training the kernel coverage classifier using the kernel coverage learning algorithm.

Detailed Description

The preferred embodiments are described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is exemplary only and is not intended to limit the scope of the invention or its application.

The idea behind the invention is as follows. A set of files, some containing malicious code and some not, serves as the training samples. A manifold learning algorithm performs feature selection on the training-set files so that each file corresponds to one feature vector, and these feature vectors are the input used to train the kernel coverage classifier. Finally, feature selection is applied to an unknown file to produce its feature vector, which is fed to the classifier to determine whether the file is malicious or non-malicious.

The specific implementation of the invention is described below with reference to the drawings. Fig. 1 is a schematic diagram of the detection process of the intelligent detection method for unknown malicious code provided by the invention. The method comprises the following steps:

Step 1: Extract the feature vectors of the files in the training set using the Byte n-grams method.

The training set can be built from standard datasets downloaded from the Internet. Standard datasets intended specifically for malicious code detection are available online; they contain both malicious code files and normal files, and the training set can be constructed by selecting files from such a dataset according to specific rules.

The Byte n-grams method slides a window of n bytes over a binary byte stream or text to extract words, each of which is n bytes long. For example, if the content of a text file is "abcdef", its 2-grams sequence is: ab bc cd de ef, and its 3-grams sequence is: abc bcd cde def.

Take a file whose content is "abcd" as an example. Its 2-grams sequence is: ab bc cd. The file is then said to have three attributes, and it can be represented by the vector formed from them: {ab, bc, cd}.

Quantizing each attribute yields the feature vector of the file. For example, if a is at position 1 in the alphabet, b at position 2, and so on, the attributes can be quantized by summing letter positions, giving {3, 5, 7}. The vector {3, 5, 7} is the feature vector of the file.
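
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not the patent's implementation: the function names `byte_ngrams` and `position_sum` are hypothetical, and the position-sum quantization is only the toy rule used in the text's example (it assumes lowercase ASCII letters).

```python
def byte_ngrams(data: bytes, n: int) -> list:
    """Slide an n-byte window over the data, one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def position_sum(gram: bytes) -> int:
    """Toy quantization from the text: sum of alphabet positions (a=1, b=2, ...).

    Assumes the n-gram holds lowercase ASCII letters only.
    """
    return sum(b - ord("a") + 1 for b in gram)


grams = byte_ngrams(b"abcd", 2)              # the three attributes ab, bc, cd
features = [position_sum(g) for g in grams]  # the quantized feature vector
print(grams, features)
```

For real binaries a position sum is not meaningful; a common alternative, not described in the source, is to count how often each n-gram occurs.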

Step 2: Reduce the dimensionality of the extracted training-set feature vectors using the local linear embedding algorithm. Fig. 2 is a flowchart of this dimensionality reduction. As shown in Fig. 2, it includes:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K nearest neighbors of each sample point, where K is a preset value.

The K sample points closest to a given sample point are defined as its K nearest neighbors, where K is given in advance. Distances can be computed as Euclidean distances, as follows: let $x, y \in R^N$; then the Euclidean distance between x and y is

$$\left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{\frac{1}{2}}$$
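
Step 21 can be sketched directly from the Euclidean distance formula. The snippet below is an illustrative brute-force search; the names `euclidean` and `k_nearest` are hypothetical, and a real implementation would likely use a spatial index for large sample sets.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_nearest(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    dists = [
        (euclidean(points[index], p), i)
        for i, p in enumerate(points)
        if i != index
    ]
    dists.sort()  # ascending by distance, ties broken by index
    return [i for _, i in dists[:k]]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
print(k_nearest(points, 0, 2))
```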

Step 22: Use the formula

Figure BDA0000052621520000052

to construct the local reconstruction weight matrix of each sample point $x_i$

Figure BDA0000052621520000053

where

Figure BDA0000052621520000054

N is the number of sample points.

$W = (w_{ij}) \in M_{n,n}$ is the weight matrix: if $x_i$ and $x_j$ are not neighbors, then $w_{ij} = 0$; if $x_i$ and $x_j$ ($j = 1, 2, \ldots, K$) are neighbors, a constraint applies [formula not reproduced].

Approximating X by XW introduces some error. The Frobenius norm of a matrix is defined here as follows: for an m-th order matrix $A = (a_{i,j}) \in M_{m,m}$, [formula not reproduced].

W is found subject to the following:

Figure BDA0000052621520000057

that is,

Figure BDA0000052621520000058

This amounts to solving a series of least-squares problems. For $x_i$, for example, the weights $w_{j_k,i}$ are obtained from the following system of equations:

$$\sum_{k=1}^{K} w_{j_k,i} = 1, \qquad X w_i = x_i$$
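
The least-squares problem for one sample point can be sketched with NumPy as follows. This follows the usual LLE recipe (solve a small linear system on the local Gram matrix, then normalize so the weights sum to one) rather than anything spelled out in the source; the function name `reconstruction_weights` and the regularization term `reg` are assumptions added for numerical stability.

```python
import numpy as np


def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights that best rebuild x from its K neighbors and sum to one."""
    x = np.asarray(x, dtype=float)
    neighbors = np.asarray(neighbors, dtype=float)  # shape (K, D)
    diffs = neighbors - x                           # shift so x is the origin
    gram = diffs @ diffs.T                          # local K x K Gram matrix
    k = gram.shape[0]
    trace = np.trace(gram)
    # Assumed regularization: keeps the system solvable when the Gram
    # matrix is singular (for example when K exceeds the dimension D).
    gram = gram + np.eye(k) * reg * (trace if trace > 0 else 1.0)
    w = np.linalg.solve(gram, np.ones(k))
    return w / w.sum()                              # enforce sum-to-one


w = reconstruction_weights([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]])
print(w)
```

In this example the point lies midway between its two neighbors, so the two weights come out equal.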

Step 23: Compute the low-dimensional output of each sample point $x_i$ from its local reconstruction weight matrix and its neighbors.

Using the weight matrix W, suitable $y_i$ can be found in the low-dimensional space subject to the following constraint:

Figure BDA0000052621520000061

where $y_i$ is the output vector of $x_i$ and $y_{j_k,i}$ ($k = 1, 2, \ldots, K$) are the neighbors of $y_i$. Two conditions must also be met:

Figure BDA0000052621520000062

and

Figure BDA0000052621520000063

where I is the m×m identity matrix. The loss function can then be rewritten [formula not reproduced], where M is an n×n symmetric matrix: $M = (I - W)^T (I - W)$.

To minimize the loss function, take Y to be the eigenvectors corresponding to the m smallest non-zero eigenvalues of M. In practice, the eigenvalues of M are sorted in ascending order; the first eigenvalue is nearly zero and is discarded. The eigenvectors corresponding to the 2nd through (m+1)-th eigenvalues are usually taken as the output.
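
The eigenvector selection described above can be sketched with NumPy. The function name `embed` and the small example weight matrix are illustrative assumptions; `numpy.linalg.eigh` is used because M is symmetric and it returns eigenvalues in ascending order.

```python
import numpy as np


def embed(W, m):
    """Low-dimensional coordinates from an N x N LLE weight matrix W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    identity = np.eye(n)
    M = (identity - W).T @ (identity - W)   # symmetric, positive semi-definite
    eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues in ascending order
    # Discard the first (near-zero) eigenvalue; keep eigenvectors 2 .. m+1.
    return eigvecs[:, 1:m + 1]


# Hypothetical weights for four points on a line; each row sums to one.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
Y = embed(W, 1)
print(Y.shape)
```

Because every row of W sums to one, the constant vector is an eigenvector of M with eigenvalue zero, which is why the first eigenvector carries no information and is dropped.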

Step 3: Take the reduced feature vectors as input and train a kernel coverage classifier using the kernel coverage learning algorithm.

The kernel coverage learning algorithm introduces a kernel function into the coverage algorithm. First, take a kernel function $K(x, y) = \langle T(x), T(y) \rangle$ and apply the transformation $T: D \to Z$, $x \in D$, where D is the input domain, a bounded set in n-dimensional space containing p samples. This transformation maps the points of D into the P-dimensional kernel space. Denote the input set in the kernel space as $P_t$, $t = 1, 2, \ldots, p$. In the kernel space, assume without loss of generality that the first k values of the output set Y are all distinct. Let $I_j$ be the set of indices of all samples whose output is $Y_j$ ($j \le k$), and denote the corresponding input set as $P_j$ ($j = 0, 1, \ldots, k-1$). After this initialization, the coverage domains in the kernel space can be computed. Fig. 3 is a flowchart of training the kernel coverage classifier using the kernel coverage learning algorithm. As shown in Fig. 3, the training includes the following steps:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains.

(1) Pick any point $x_j \in P_t$ in the sample set that has not yet been covered, and compute

$$d_j^{(1)} = \min_{m \notin I_j} \{ K(x_j, x_m) \}$$

$$d_j^{(2)} = \max_{m \in I_j} \{ K(x_j, x_m) \mid K(x_j, x_m) < d_j^{(1)} \}$$

$$d_j = [ d_j^{(1)} + d_j^{(2)} ] / 2$$

$$\theta_j = [ d_j^{(1)} - d_j^{(2)} ] / 2$$

Then construct a coverage domain from $x_i$ and $d_j$:

Figure BDA0000052621520000075

The center of this coverage domain is $x_i$, its radius is $d_j$, and its classification margin is $d_j$. Here $I_j$ is a set of indices and $x_m$ denotes a sample point; in the first formula, m ranges over the indices that do not belong to $I_j$.

(2) Once

Figure BDA0000052621520000076

has been obtained, delete from $P_t$ all points already covered by

Figure BDA0000052621520000077

then select another $x_j$ ($j \in I_j$) from $P_t$ and repeat step (1) until all $x_j \in I_j$ have been deleted. This constructs all the coverage domains of one class.
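
The loop in steps (1) and (2) can be sketched as below. Several points are assumptions rather than statements from the source: the coverage membership test sits in an omitted figure, so this sketch treats a point as covered when its measure to the center is at most $d_j$; the measure used is a distance induced by a Gaussian kernel, chosen because the min/max formulas behave like distances; and all names (`kernel_distance`, `build_covers`) are hypothetical.

```python
import math


def kernel_distance(x, y, gamma=1.0):
    """Distance induced by a Gaussian kernel (an assumed choice of measure)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(max(0.0, 2.0 - 2.0 * math.exp(-gamma * sq)))


def build_covers(points, labels, target, measure=kernel_distance):
    """Greedily build (center, radius d_j, theta_j) covers for one class."""
    same = [i for i, lab in enumerate(labels) if lab == target]
    other = [i for i, lab in enumerate(labels) if lab != target]
    if not other:
        raise ValueError("need at least one sample from another class")
    uncovered = list(same)
    covers = []
    while uncovered:
        j = uncovered[0]  # any not-yet-covered point of the target class
        d1 = min(measure(points[j], points[m]) for m in other)
        same_vals = (measure(points[j], points[m]) for m in same)
        d2 = max((v for v in same_vals if v < d1), default=0.0)
        d = (d1 + d2) / 2.0
        theta = (d1 - d2) / 2.0
        covers.append((points[j], d, theta))
        # Assumed membership test: points within radius d are covered.
        uncovered = [m for m in uncovered
                     if measure(points[j], points[m]) > d]
    return covers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 0, 1, 1]
covers = build_covers(points, labels, target=0)
print(covers)
```

The loop always terminates because the chosen center is at measure zero from itself and is therefore removed on every pass.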

Step 32: Fuse the coverage domains, merging those that belong to the same class into a sphere in the feature space.

For all the coverage domains obtained, let

Figure BDA0000052621520000078

where $d_i$ denotes the radius of the domain centered at $x_i$, and solve the quadratic programming problem:

$$\max\ w(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \bigl( K(d_i, d_j) + K(d_j, d_i) / 2 \bigr)$$

subject to

$$\sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, 2, \ldots, m$$

This gives the optimal solution $\alpha^* = \{ \alpha_1, \alpha_2, \ldots, \alpha_m \}$.

Step 33: Construct the fusion surface f(x) and compute $f(x_i)$ for each sample point $x_i$. If $f(x_i)$ is greater than zero, the sample point $x_i$ represents a file that contains no malicious code; if $f(x_i)$ is less than zero, the sample point $x_i$ represents a file that contains malicious code.

Use the $\alpha^*$ obtained in step 32 to construct the hyperplane:

$$f(x) = \sum_{i=1}^{m} \alpha_i y_i K(d_i, x)$$

Its discriminant function is $F(x) = \mathrm{Sign}(f(x) + b_0)$, where $b_0$ is the decision threshold.

To classify samples, compute f(x) for each sample. If f(x) > 0, x belongs to the positive class (no malicious code); if f(x) < 0, x belongs to the negative class (contains malicious code); if f(x) = 0, x is said to be rejected. A threshold ε can also be set so that x is treated as rejected whenever |f(x)| < ε, which reduces errors.
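
The three-way decision rule (positive, negative, rejected) can be sketched as follows. This is illustrative only: the Gaussian kernel, the function names, and the example values of α, y, and the centers are assumptions, and the first argument of K in f(x) is treated here as the center of the i-th coverage domain, which the source does not state explicitly.

```python
import math


def gaussian_kernel(u, v, gamma=1.0):
    """An assumed kernel choice; the source does not name one."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, centers, alphas, ys, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i * y_i * K(center_i, x)."""
    return sum(a * y * kernel(c, x) for a, y, c in zip(alphas, ys, centers))


def classify(x, centers, alphas, ys, eps=1e-6, kernel=gaussian_kernel):
    """Return 'benign', 'malicious', or 'rejected' when |f(x)| < eps."""
    f = decision_value(x, centers, alphas, ys, kernel)
    if abs(f) < eps:
        return "rejected"
    return "benign" if f > 0 else "malicious"


# Made-up parameters: one benign cover (y=+1) and one malicious cover (y=-1).
centers = [(0.0, 0.0), (5.0, 5.0)]
alphas = [1.0, 1.0]
ys = [1, -1]
print(classify((0.1, 0.0), centers, alphas, ys))
print(classify((5.0, 4.9), centers, alphas, ys))
print(classify((2.5, 2.5), centers, alphas, ys))
```

A larger `eps` widens the rejection band around the decision surface, trading coverage for fewer borderline misclassifications.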

Step 4: Extract the feature vectors of the files in the test set using the Byte n-grams method.

As in step 1, the Byte n-grams method is used to extract the feature vectors of the files in the test set. The test set can be selected from datasets available online.

Step 5: Reduce the dimensionality of the extracted test-set feature vectors using the local linear embedding algorithm.

As in step 2, the local linear embedding algorithm is used to reduce the dimensionality of the feature vectors extracted in step 4.

Step 6: Feed the reduced results to the kernel coverage classifier for classification, tally the classification results, and determine whether the files in the test set contain malicious code.

The reduced results from step 5 are classified by the kernel coverage classifier obtained in step 3 and the classification results are tallied. The tallied results are then used to determine whether the files in the test set contain malicious code.
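
The source does not say which statistics are computed in step 6. As one plausible reading, the sketch below simply counts each outcome and reports the share of decided files that were flagged as malicious; the function name `summarize` and the label strings are hypothetical.

```python
from collections import Counter


def summarize(predictions):
    """Count outcomes and compute the malicious share of decided files."""
    counts = Counter(predictions)
    decided = counts["benign"] + counts["malicious"]
    malicious_rate = counts["malicious"] / decided if decided else 0.0
    return counts, malicious_rate


counts, rate = summarize(["benign", "malicious", "malicious", "rejected"])
print(dict(counts), rate)
```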

The invention uses a sample set containing both malicious and non-malicious files as the training set, trains a classifier with a classification algorithm, and then uses the trained classifier to classify unknown files and determine whether they are malicious. A manifold learning algorithm is introduced into feature selection to analyze a large number of file feature attributes and uncover the meaningful low-dimensional structure hidden in high-dimensional data, which reduces the dimensionality of the file features and increases processing speed. The kernel coverage learning algorithm is introduced as the classification learning algorithm; it brings the kernel function concept from support vector machines into the coverage algorithm. Compared with the support vector machine algorithm, for any given sample set this algorithm can construct a kernel function that correctly partitions the sample set in a single pass, so that the system retains good classification accuracy and a low computational cost even with insufficient prior knowledge and small samples.

The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention falls within the scope of protection of the invention. The scope of protection of the invention is therefore determined by the scope of protection of the claims.

Claims (3)

1. A method for detecting unknown malicious code, characterized in that the method comprises the following steps:

Step 1: Extract the feature vectors of the files in the training set using the Byte n-grams method;

Step 2: Reduce the dimensionality of the extracted training-set feature vectors using the local linear embedding algorithm;

Step 3: Take the reduced feature vectors as input and train a kernel coverage classifier using the kernel coverage learning algorithm;

Step 4: Extract the feature vectors of the files in the test set using the Byte n-grams method;

Step 5: Reduce the dimensionality of the extracted test-set feature vectors using the local linear embedding algorithm;

Step 6: Feed the reduced results to the kernel coverage classifier for classification, tally the classification results, and determine whether the files in the test set contain malicious code.

2. The method for detecting unknown malicious code according to claim 1, characterized in that reducing the dimensionality of the feature vectors with the local linear embedding algorithm specifically includes:

Step 21: Treat the feature vectors as sample points and use the K-nearest-neighbor method to find the K nearest neighbors of each sample point, where K is a preset value;

Step 22: Use the formula

Figure FDA0000052621510000011

to construct the local reconstruction weight matrix of each sample point $x_i$, where

Figure FDA0000052621510000012

N is the number of sample points;

Step 23: Compute the low-dimensional output of each sample point $x_i$ from its local reconstruction weight matrix and its neighbors.

In step 23, the low-dimensional output $y_i$ of sample point $x_i$ satisfies the following mapping conditions:

Figure FDA0000052621510000022

and

Figure FDA0000052621510000023

where I is the m×m identity matrix and m is the dimension after reduction.

3. The method for detecting unknown malicious code according to claim 1, characterized in that step 3 specifically includes:

Step 31: In the sample space formed by the sample points, construct the system of coverage domains;

Step 32: Fuse the coverage domains, merging those that belong to the same class into a sphere in the feature space;

Step 33: Construct the fusion surface f(x) and compute $f(x_i)$ for each sample point $x_i$. If $f(x_i)$ is greater than zero, the sample point $x_i$ represents a file that contains no malicious code; if $f(x_i)$ is less than zero, the sample point $x_i$ represents a file that contains malicious code.
CN201110076525XA 2011-03-29 2011-03-29 Method for detecting unknown malicious code Pending CN102142068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110076525XA CN102142068A (en) 2011-03-29 2011-03-29 Method for detecting unknown malicious code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110076525XA CN102142068A (en) 2011-03-29 2011-03-29 Method for detecting unknown malicious code

Publications (1)

Publication Number Publication Date
CN102142068A true CN102142068A (en) 2011-08-03

Family

ID=44409571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110076525XA Pending CN102142068A (en) 2011-03-29 2011-03-29 Method for detecting unknown malicious code

Country Status (1)

Country Link
CN (1) CN102142068A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346830A (en) * 2011-09-23 2012-02-08 重庆大学 Gradient histogram-based virus detection method
CN102411687A (en) * 2011-11-22 2012-04-11 华北电力大学 Deep learning detection method for unknown malicious codes
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network
CN102779249A (en) * 2012-06-28 2012-11-14 奇智软件(北京)有限公司 Malicious program detection method and scan engine
CN104077524A (en) * 2013-03-25 2014-10-01 腾讯科技(深圳)有限公司 Training method used for virus identification and virus identification method and device
CN104504334A (en) * 2013-12-05 2015-04-08 卡巴斯基实验室封闭式股份公司 System and method used for evaluating selectivity of classification rules
CN104778407A (en) * 2015-04-14 2015-07-15 电子科技大学 Multi-dimensional feature-code-free rogue program detecting method
CN106446221A (en) * 2016-09-30 2017-02-22 北京奇虎科技有限公司 Data analyzing method and device
CN106447066A (en) * 2016-06-01 2017-02-22 上海坤士合生信息科技有限公司 Big data feature extraction method and device
US20180144131A1 (en) * 2016-11-21 2018-05-24 Michael Wojnowicz Anomaly based malware detection
WO2018184102A1 (en) * 2017-04-03 2018-10-11 Royal Bank Of Canada Systems and methods for malicious code detection
CN108985361A (en) * 2018-07-02 2018-12-11 北京金睛云华科技有限公司 A kind of malicious traffic stream detection implementation method and device based on deep learning
CN109934004A (en) * 2019-03-14 2019-06-25 中国科学技术大学 A method for protecting privacy in a machine learning service system
US12131294B2 (en) 2012-06-21 2024-10-29 Open Text Corporation Activity stream based interaction
US12149623B2 (en) 2018-02-23 2024-11-19 Open Text Inc. Security privilege escalation exploit detection and mitigation
US12164466B2 (en) 2010-03-29 2024-12-10 Open Text Inc. Log file management
US12197383B2 (en) 2015-06-30 2025-01-14 Open Text Corporation Method and system for using dynamic content types
US12235960B2 (en) 2019-03-27 2025-02-25 Open Text Inc. Behavioral threat detection definition and compilation
US12261822B2 (en) 2014-06-22 2025-03-25 Open Text Inc. Network threat prediction and blocking
US12282549B2 (en) 2005-06-30 2025-04-22 Open Text Inc. Methods and apparatus for malware threat research

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090300765A1 (en) * 2008-05-27 2009-12-03 Deutsche Telekom Ag Unknown malcode detection using classifiers with optimal training sets
CN101944167A (en) * 2010-09-29 2011-01-12 中国科学院计算技术研究所 Method and system for identifying malicious program
CN101984450A (en) * 2010-12-15 2011-03-09 北京安天电子设备有限公司 Malicious code detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
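2010 International Conference on Information, Networking and Automation (ICINA), 2010-10-19, Li Yuancheng et al., "An intrusion detection method based on LLE and BVM" *
Acta Electronica Sinica (《电子学报》), 2007-05-31, Zhou Mingzheng et al., "Anomaly intrusion detection based on a constructive kernel covering algorithm", Vol. 35, No. 5 *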

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12282549B2 (en) 2005-06-30 2025-04-22 Open Text Inc. Methods and apparatus for malware threat research
US12164466B2 (en) 2010-03-29 2024-12-10 Open Text Inc. Log file management
US12210479B2 (en) 2010-03-29 2025-01-28 Open Text Inc. Log file management
CN102346830A (en) * 2011-09-23 2012-02-08 重庆大学 Gradient histogram-based virus detection method
CN102411687A (en) * 2011-11-22 2012-04-11 华北电力大学 Deep learning detection method for unknown malicious codes
CN102411687B (en) * 2011-11-22 2014-04-23 华北电力大学 Deep learning detection method for unknown malicious code
CN102651088B (en) * 2012-04-09 2014-03-26 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network
US12131294B2 (en) 2012-06-21 2024-10-29 Open Text Corporation Activity stream based interaction
CN102779249B (en) * 2012-06-28 2015-07-29 北京奇虎科技有限公司 Malware detection methods and scanning engine
CN102779249A (en) * 2012-06-28 2012-11-14 奇智软件(北京)有限公司 Malicious program detection method and scan engine
CN104077524A (en) * 2013-03-25 2014-10-01 腾讯科技(深圳)有限公司 Training method used for virus identification and virus identification method and device
CN104077524B (en) * 2013-03-25 2018-01-09 腾讯科技(深圳)有限公司 Training method for virus identification, and virus identification method and device
CN104504334A (en) * 2013-12-05 2015-04-08 卡巴斯基实验室封闭式股份公司 System and method used for evaluating selectivity of classification rules
CN104504334B (en) * 2013-12-05 2018-08-10 卡巴斯基实验室封闭式股份公司 System and method for assessing classifying rules selectivity
US12301539B2 (en) 2014-06-22 2025-05-13 Open Text Inc. Network threat prediction and blocking
US12261822B2 (en) 2014-06-22 2025-03-25 Open Text Inc. Network threat prediction and blocking
CN104778407A (en) * 2015-04-14 2015-07-15 电子科技大学 Multi-dimensional feature-code-free rogue program detecting method
CN104778407B (en) * 2015-04-14 2017-08-08 电子科技大学 Multi-dimensional feature-code-free malicious program detection method
US12197383B2 (en) 2015-06-30 2025-01-14 Open Text Corporation Method and system for using dynamic content types
CN106447066A (en) * 2016-06-01 2017-02-22 上海坤士合生信息科技有限公司 Big data feature extraction method and device
CN106446221B (en) * 2016-09-30 2019-09-17 北京奇虎科技有限公司 Data analysis method and device
CN106446221A (en) * 2016-09-30 2017-02-22 北京奇虎科技有限公司 Data analyzing method and device
US10489589B2 (en) * 2016-11-21 2019-11-26 Cylance Inc. Anomaly based malware detection
US11210394B2 (en) 2016-11-21 2021-12-28 Cylance Inc. Anomaly based malware detection
US20180144131A1 (en) * 2016-11-21 2018-05-24 Michael Wojnowicz Anomaly based malware detection
US10685284B2 (en) 2017-04-03 2020-06-16 Royal Bank Of Canada Systems and methods for malicious code detection
WO2018184102A1 (en) * 2017-04-03 2018-10-11 Royal Bank Of Canada Systems and methods for malicious code detection
US12149623B2 (en) 2018-02-23 2024-11-19 Open Text Inc. Security privilege escalation exploit detection and mitigation
CN108985361A (en) * 2018-07-02 2018-12-11 北京金睛云华科技有限公司 A kind of malicious traffic stream detection implementation method and device based on deep learning
CN108985361B (en) * 2018-07-02 2021-06-18 北京金睛云华科技有限公司 Malicious traffic detection implementation method and device based on deep learning
CN109934004A (en) * 2019-03-14 2019-06-25 中国科学技术大学 A method for protecting privacy in a machine learning service system
US12235960B2 (en) 2019-03-27 2025-02-25 Open Text Inc. Behavioral threat detection definition and compilation

Similar Documents

Publication Publication Date Title
CN102142068A (en) Method for detecting unknown malicious code
Ali et al. Spike2vec: An efficient and scalable embedding approach for covid-19 spike sequences
CN107392015B (en) A kind of intrusion detection method based on semi-supervised learning
CN105915555A (en) Method and system for detecting network anomalous behavior
CN111382438B (en) Malware detection method based on multi-scale convolutional neural network
CN110110734B (en) Open set identification method, information processing apparatus, and storage medium
CN105335756A (en) Robust learning model and image classification system
CN105354595A (en) Robust visual image classification method and system
CN102263790A (en) An Intrusion Detection Method Based on Ensemble Learning
CN110084314B (en) A method for filtering false-positive genetic mutations for targeted capture gene sequencing data
CN107577702B (en) Method for distinguishing traffic information in social media
CN104820841B (en) Hyperspectral classification method based on low order mutual information and spectrum context waveband selection
CN111723666A (en) A signal recognition method and device based on semi-supervised learning
CN106156805A (en) A kind of classifier training method of sample label missing data
CN102158486A (en) Method for rapidly detecting network invasion
WO2020088338A1 (en) Method and apparatus for building recognition model
CN117633811A (en) A code vulnerability detection method based on multi-view feature fusion
CN116975863A (en) Malicious code detection method based on convolutional neural network
Aljabri et al. Fake news detection using machine learning models
CN110795736A (en) Malicious android software detection method based on SVM decision tree
Uhlig et al. Combining AI and AM–Improving approximate matching through transformer networks
CN116628695A (en) Vulnerability mining method and device based on multi-task learning
CN105469095A (en) Vehicle model identification method based on pattern set histograms of vehicle model images
CN111783088B (en) Malicious code family clustering method and device and computer equipment
CN104778478A (en) Handwritten numeral identification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 2011-08-03