CN102142068A - Method for detecting unknown malicious code - Google Patents
- Publication number
- CN102142068A
- Authority
- CN
- China
- Prior art keywords
- files
- coverage
- malicious code
- sample point
- feature vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for detecting unknown malicious code, in the technical field of information security, that can detect malicious code in files in advance without updating a malicious code library. The method comprises: extracting the feature vectors of the files in a training set with the byte n-grams method; reducing the dimensionality of those feature vectors with the locally linear embedding (LLE) algorithm; training a kernel covering classifier with the kernel covering learning algorithm, taking the reduced feature vectors as input; extracting the feature vectors of the files in a test set with the byte n-grams method; reducing the dimensionality of those feature vectors with the locally linear embedding algorithm; and feeding the reduced results to the kernel covering classifier, compiling statistics on the classification results, and determining whether the files in the test set contain malicious code. The invention increases the speed of file detection and achieves accurate detection of malicious code in advance.
Description
Technical field
The invention belongs to the technical field of information security and relates in particular to a method for detecting unknown malicious code.
Background
Malicious code is now ubiquitous on the Internet, and its ability to spread, cause harm, and conceal itself keeps growing, which poses a major challenge for malicious code detection. Existing detection technologies fall into two main types: pattern matching based on signatures, and detection based on malicious code behavior rules.
In signature-based pattern matching, analysts manually analyze a malicious code file after it appears, extract a signature that uniquely identifies it, and add the signature to a malicious code signature library, which is then provided to users to find and remove malicious code in computer programs. Detection based on behavior rules identifies malicious code according to behavior rules predefined by experts. Both methods share a drawback: the malicious code database must be updated continuously, otherwise new types of malicious code can bypass detection. Both are also after-the-fact techniques. They cannot detect newly emerged malicious code before it executes; detection becomes possible only after the code has appeared, analysts have extracted its features, and its signature has been added to the feature database. During that interval the malicious code may already have run and caused damage.
Summary of the invention
The purpose of the invention is to address the shortcomings of current malicious code detection technology by proposing a method for detecting unknown malicious code. A sample set containing both malicious and non-malicious files is used as the training set, a classifier is trained with a classification algorithm, and the trained classifier is then used to classify unknown files to determine whether they are malicious code files.
To achieve this purpose, the technical solution provided by the invention is a method for detecting unknown malicious code, characterized in that the method comprises the following steps:
Step 1: Extract the feature vectors of the files in the training set with the byte n-grams method;
Step 2: Reduce the dimensionality of the feature vectors extracted from the training set with the locally linear embedding algorithm;
Step 3: Taking the reduced feature vectors as input, train a kernel covering classifier with the kernel covering learning algorithm;
Step 4: Extract the feature vectors of the files in the test set with the byte n-grams method;
Step 5: Reduce the dimensionality of the feature vectors extracted from the test set with the locally linear embedding algorithm;
Step 6: Feed the reduced results to the kernel covering classifier for classification, compile statistics on the classification results, and determine whether the files in the test set contain malicious code.
Reducing the dimensionality of the feature vectors with the locally linear embedding algorithm specifically comprises:
Step 21: Taking the feature vectors as sample points, use the K-nearest-neighbor method to find the K nearest neighbors of each sample point, where K is a preset value;
Step 22: Construct the local reconstruction weight matrix of each sample point $x_i$ using the formula $\min \varepsilon(W) = \sum_{i=1}^{N} \lVert x_i - \sum_{j=1}^{K} w_{ij} x_j \rVert^2$, where the $x_j$ (j = 1, …, K) are the neighbors of $x_i$ and N is the number of sample points;
Step 23: Compute the low-dimensional output of each sample point $x_i$ from its local reconstruction weight matrix and its neighbors.
In step 23, the low-dimensional output $y_i$ of sample point $x_i$ satisfies the following mapping conditions: $\sum_{i=1}^{N} y_i = 0$ and $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$, where I is the m × m identity matrix and m is the dimension after reduction.
Step 3 specifically comprises:
Step 31: Construct the system of covering domains in the sample space formed by the sample points;
Step 32: Fuse the covering domains, merging those that belong to the same class into a sphere in the feature space;
Step 33: Construct the fused surface f(x) and compute f($x_i$) for each sample point $x_i$. If f($x_i$) is greater than zero, the sample point $x_i$ represents a file that contains no malicious code; if f($x_i$) is less than zero, the sample point $x_i$ represents a file that contains malicious code.
The invention introduces a manifold learning algorithm for feature selection on files. It can find meaningful low-dimensional structure in high-dimensional data and reduce the dimensionality accordingly, which increases the speed of file detection. In addition, the kernel covering learning algorithm is introduced as the classification learning algorithm; it can construct a kernel function that partitions the sample set accurately in a single pass, thereby achieving the goal of accurately detecting newly emerged malicious code.
Description of the drawings
Fig. 1 is a schematic diagram of the process of the method for detecting unknown malicious code;
Fig. 2 is a flowchart of reducing the dimensionality of the extracted feature vectors with the locally linear embedding algorithm;
Fig. 3 is a flowchart of training the kernel covering classifier with the kernel covering learning algorithm.
Detailed description
The preferred embodiments are described in detail below with reference to the drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its applications.
The approach of the invention is as follows. A set of files that includes both files containing malicious code and files without it is taken as the training samples. A manifold learning algorithm performs feature selection on the training-set files so that each file corresponds to one feature vector, and the feature vectors serve as input to the kernel covering classification algorithm to train a kernel covering classifier. Finally, feature selection is applied to an unknown file to produce its feature vector, which is fed to the classifier to determine whether the file is malicious or non-malicious.
The specific implementation of the invention is described below with reference to the drawings. Fig. 1 is a schematic diagram of the detection process of the intelligent method for detecting unknown malicious code provided by the invention. The method comprises the following steps:
Step 1: Extract the feature vectors of the files in the training set with the byte n-grams method.
The training set can be constructed from standard data sets downloaded from the Internet. Standard data sets built specifically for malicious code detection are available online; they contain both malicious code files and normal files, and files can be selected from them according to specific rules to build the training set.
The byte n-grams method extracts words from a binary byte stream or text with a sliding window n bytes wide, so each word is n bytes long. For example, if the content of a text file is "abcdef", its 2-grams sequence is ab bc cd de ef, and its 3-grams sequence is abc bcd cde def.
Taking a file whose content is "abcd" as an example, the 2-grams sequence extracted from it is ab bc cd. The file is then said to have three attributes, and the vector formed by these three attributes, {ab, bc, cd}, can be used to represent the file.
Quantizing each attribute yields the feature vector of the file. For example, if a is at position 1 in the alphabet, b at position 2, and so on, the attributes can be quantized as the sum of the positions, giving {3, 5, 7}. The vector {3, 5, 7} is the feature vector of the file.
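The sliding-window extraction and the position-sum quantization described in step 1 can be sketched as follows. This is a minimal illustration of the toy example only; the function names `byte_ngrams`, `position_sum`, and `feature_vector` are hypothetical and do not come from the patent, and the quantization rule assumes lowercase ASCII letters.

```python
def byte_ngrams(data: bytes, n: int) -> list:
    """Slide an n-byte window over the data, one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def position_sum(gram: bytes) -> int:
    """Toy quantization from the text: a=1, b=2, ..., summed over the gram.

    Assumes the gram holds lowercase ASCII letters only.
    """
    return sum(byte - ord("a") + 1 for byte in gram)


def feature_vector(data: bytes, n: int) -> list:
    """Quantize every n-gram of the data into one number."""
    return [position_sum(gram) for gram in byte_ngrams(data, n)]


print(byte_ngrams(b"abcdef", 2))   # the 2-grams of "abcdef"
print(feature_vector(b"abcd", 2))  # the quantized 2-grams of "abcd"
```

With this rule the vector length depends on the file length, so a practical system would likely need a fixed-length representation before dimensionality reduction; the patent text does not specify how that is done.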
Step 2: Reduce the dimensionality of the feature vectors extracted from the training set with the locally linear embedding algorithm. Fig. 2 is a flowchart of this dimensionality reduction. As shown in Fig. 2, it comprises:
Step 21: Taking the feature vectors as sample points, use the K-nearest-neighbor method to find the K nearest neighbors of each sample point, where K is a preset value.
The K sample points closest to a given sample point are defined as its K nearest neighbors, where K is a value given in advance. Distance can be computed as the Euclidean distance, defined as follows: let $x, y \in R^N$ with components $x_k$ and $y_k$; then the Euclidean distance between x and y is $d(x, y) = \sqrt{\sum_{k=1}^{N} (x_k - y_k)^2}$.
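A brute-force version of the neighbor search in step 21 might look like the sketch below. The function name `k_nearest_neighbours` is hypothetical; `math.dist` from the Python standard library (3.8+) computes the Euclidean distance between two points.

```python
import math


def k_nearest_neighbours(samples, index, k):
    """Return the indices of the k samples closest to samples[index].

    The sample itself is excluded; ties are broken by the lower index.
    """
    target = samples[index]
    distances = [
        (math.dist(target, other), j)
        for j, other in enumerate(samples)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
print(k_nearest_neighbours(samples, 0, 2))
```

This costs one full pass over the sample set per query, which is adequate for illustration; larger sets would normally use a spatial index instead.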
Step 22: Construct the local reconstruction weight matrix of each sample point $x_i$ using the formula
$$\min \varepsilon(W) = \sum_{i=1}^{N} \Bigl\lVert x_i - \sum_{j=1}^{K} w_{ij} x_j \Bigr\rVert^2,$$
where N is the number of sample points.
$W = (w_{ij}) \in M_{n,n}$ is a weight matrix such that $w_{ij} = 0$ if $x_i$ and $x_j$ are not neighbors; if $x_i$ and $x_j$ (j = 1, 2, …, K) are neighbors, the constraint $\sum_{j=1}^{K} w_{ij} = 1$ holds.
Approximating X by XW introduces some error. The Frobenius norm of a matrix is defined as follows: for an m-th order matrix $A = (a_{i,j}) \in M_{m,m}$, $\lVert A \rVert_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{m} a_{i,j}^2}$.
W is found from the constraint $\min_W \lVert X - XW \rVert_F^2$, which is equivalent to solving a series of least-squares problems, one for each sample point. For $x_i$, for example, the weights can be obtained from the following system of equations: [equations missing in the source]
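The system of equations for a single sample point is not legible in the source. As an assumption, the sketch below follows the standard LLE formulation: build the local Gram matrix of the differences between the point and its neighbors, solve it against a vector of ones, and rescale the solution so the weights sum to 1. The names `solve_linear` and `reconstruction_weights` and the regularization parameter `reg` are hypothetical and are not taken from the patent.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def reconstruction_weights(point, neighbours, reg=1e-3):
    """Weights that best rebuild `point` from `neighbours`, summing to 1.

    Assumption: standard LLE step, not confirmed by the patent text.
    `reg` adds a small multiple of the trace to the diagonal so the
    Gram matrix stays invertible when it is singular.
    """
    k = len(neighbours)
    diffs = [[p - q for p, q in zip(point, nb)] for nb in neighbours]
    gram = [
        [sum(a * b for a, b in zip(diffs[i], diffs[j])) for j in range(k)]
        for i in range(k)
    ]
    trace = sum(gram[i][i] for i in range(k))
    bump = reg * trace if trace > 0 else reg
    for i in range(k):
        gram[i][i] += bump
    w = solve_linear(gram, [1.0] * k)
    total = sum(w)
    return [wi / total for wi in w]


# A point midway between its two neighbours gets equal weights.
print(reconstruction_weights((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)]))
```

Running this once per sample point and writing each result into row i of W (zeros elsewhere) gives a weight matrix of the shape the text describes.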
Step 23: Compute the low-dimensional output of each sample point $x_i$ from its local reconstruction weight matrix and its neighbors.
With the weight matrix W, suitable $y_i$ can be found in the low-dimensional space by minimizing
$$\min \varepsilon(Y) = \sum_{i=1}^{N} \Bigl\lVert y_i - \sum_{k=1}^{K} w_{i j_k}\, y_{j_k,i} \Bigr\rVert^2,$$
where $y_i$ is the output vector of $x_i$ and the $y_{j_k,i}$ (k = 1, 2, …, K) are the neighbors of $y_i$, subject to two conditions: $\sum_{i=1}^{N} y_i = 0$ and $\frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = I$, where I is the m × m identity matrix. The loss function can therefore be rewritten as $\varepsilon(Y) = \sum_{i} \sum_{j} M_{ij}\, y_i^T y_j$, where M is an n × n symmetric matrix: $M = (I - W)^T (I - W)$.
To minimize the loss function, Y is taken as the eigenvectors corresponding to the m smallest non-zero eigenvalues of M. In practice, the eigenvalues of M are sorted in ascending order; the first eigenvalue is almost zero and is discarded. The eigenvectors corresponding to the 2nd through (m+1)-th eigenvalues are usually taken as the output.
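The rewriting of the loss function in terms of M, and the reason the smallest eigenvalue is discarded, can be checked with a short derivation. It is a sketch under the assumption that $w_{ij}$ is the weight of $x_j$ in the reconstruction of $x_i$, with $w_{ij} = 0$ for non-neighbors and each row of W summing to 1; $\delta_{ij}$ is the Kronecker delta and $\mathbf{1}$ the all-ones vector.

```latex
\begin{aligned}
\varepsilon(Y)
  &= \sum_{i} \Bigl\lVert y_i - \sum_{j} w_{ij}\, y_j \Bigr\rVert^2 \\
  &= \sum_{i} y_i^T y_i
     \;-\; 2 \sum_{i,j} w_{ij}\, y_i^T y_j
     \;+\; \sum_{k} \sum_{i,j} w_{ki}\, w_{kj}\, y_i^T y_j \\
  &= \sum_{i,j} \Bigl( \delta_{ij} - w_{ij} - w_{ji}
     + \sum_{k} w_{ki}\, w_{kj} \Bigr)\, y_i^T y_j \\
  &= \sum_{i,j} M_{ij}\, y_i^T y_j ,
  \qquad M = (I - W)^T (I - W). \\[6pt]
M \mathbf{1}
  &= (I - W)^T (I - W)\, \mathbf{1}
   = (I - W)^T (\mathbf{1} - W \mathbf{1})
   = 0
  \quad \text{because } W \mathbf{1} = \mathbf{1}.
\end{aligned}
```

The second part shows that the constant vector is an eigenvector of M with eigenvalue 0. It carries no information about the data, which is consistent with discarding the first (near-zero) eigenvalue and keeping the eigenvectors of the 2nd through (m+1)-th.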
Step 3: Taking the reduced feature vectors as input, train a kernel covering classifier with the kernel covering learning algorithm.
The kernel covering learning algorithm introduces a kernel function into the covering algorithm. First, take a kernel function K(x, y) = <T(x), T(y)> and apply the transformation T: D → Z, x ∈ D, where D, the domain of the input, is a bounded set in n-dimensional space containing p samples. This transformation maps the points of D into the P-dimensional kernel space. Denote the input set of the kernel space by $P_t$, t = 1, 2, …, p. In the kernel space, assume without loss of generality that the first k values of the output set Y are all distinct. Let $I_j$ be the set of indices of all samples whose output is $Y_j$ (j ≤ k), and denote the corresponding input set by $P_j$ (j = 0, 1, …, k-1). After this initialization, a set of covers in the kernel space can be computed. Fig. 3 is a flowchart of training the kernel covering classifier with the kernel covering learning algorithm. As shown in Fig. 3, the training comprises the following steps:
Step 31: Construct the system of covering domains in the sample space formed by the sample points.
(1) Take any point $x_j \in P_t$ in the sample set that has not yet been covered and compute according to the formula [formula missing in the source]. From $x_i$ and $d_j$, construct a cover [symbol missing in the source] whose center is $x_i$, whose covering radius is $d_j$, and whose classification margin is $d_j$. Here $I_j$ is a set of indices and $x_m$ denotes a sample point; in the first formula, $x_m$ indicates that m does not belong to $I_j$.
(2) After the cover is obtained, delete from $P_t$ all points that it covers, then select another $x_j$ (j ∈ $I_j$) from $P_t$ and repeat step (1) until all $x_j \in I_j$ have been deleted. This constructs all the covering domains of one class.
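The loop in sub-steps (1) and (2) can be illustrated with a greedy sketch. The radius formula is missing from the source, so the rule used here is an assumption borrowed from the classical distance-based covering algorithm: the radius is midway between the distance to the nearest sample of another class and the distance to the farthest same-class sample that is still closer than that. The sketch also measures distance in the input space rather than in the kernel space. The function name `build_covers` is hypothetical.

```python
import math


def build_covers(points, labels, target):
    """Greedily cover every point whose label equals `target`.

    Returns a list of (centre, radius) pairs. The radius rule is an
    assumed stand-in for the formula that is missing from the source.
    """
    uncovered = [i for i, lab in enumerate(labels) if lab == target]
    others = [p for p, lab in zip(points, labels) if lab != target]
    covers = []
    while uncovered:
        centre = points[uncovered[0]]
        # d1: distance to the nearest sample of a different class.
        d1 = min((math.dist(centre, q) for q in others), default=math.inf)
        # d2: farthest still-uncovered same-class sample closer than d1.
        same = [math.dist(centre, points[i]) for i in uncovered]
        d2 = max((d for d in same if d < d1), default=0.0)
        radius = (d1 + d2) / 2
        covers.append((centre, radius))
        # Delete every point that the new cover contains.
        uncovered = [
            i for i in uncovered if math.dist(centre, points[i]) > radius
        ]
    return covers


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [1, 1, -1, -1]
print(build_covers(points, labels, target=1))
```

The centre of each cover is always removed from the uncovered list (its distance to itself is 0), so the loop terminates after at most one iteration per sample of the target class.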
Step 32: Fuse the covering domains, merging those that belong to the same class into a sphere in the feature space.
For all the covering domains obtained, let [definition missing in the source], where $d_i$ denotes the radius of the domain centered at $x_i$, and solve the quadratic programming problem: [formula missing in the source]
This yields the optimal solution $\alpha^* = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$.
Step 33: Construct the fused surface f(x) and compute f($x_i$) for each sample point $x_i$. If f($x_i$) is greater than zero, the sample point $x_i$ represents a file that contains no malicious code; if f($x_i$) is less than zero, the sample point $x_i$ represents a file that contains malicious code.
Construct the hyperplane from the $\alpha^*$ obtained in step 32: [formula missing in the source]
Its discriminant function is $F(x) = \mathrm{Sign}(f(x) + b_0)$, where $b_0$ is the decision threshold.
To classify samples, compute f(x) for each sample. If f(x) > 0, x belongs to the positive class (no malicious code); if f(x) < 0, x belongs to the negative class (contains malicious code); if f(x) = 0, x is said to be rejected. A threshold ε can also be set so that x is treated as rejected whenever |f(x)| < ε, which reduces errors.
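The decision rule with a rejection band can be written in a few lines. This sketch takes the value f(x) as given, since the formula for the fused surface is missing from the source; the function name `classify` and the label strings are hypothetical.

```python
def classify(score: float, eps: float = 0.0) -> str:
    """Map the fused-surface value f(x) to a label.

    Positive scores mean no malicious code, negative scores mean
    malicious code. A score of exactly zero, or one with |score| < eps,
    is rejected rather than forced into either class.
    """
    if score == 0 or abs(score) < eps:
        return "rejected"
    return "benign" if score > 0 else "malicious"


for value in (0.8, -0.3, 0.0, 0.05):
    print(value, classify(value, eps=0.1))
```

Raising `eps` trades coverage for accuracy: more borderline files are left undecided, and fewer are misclassified.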
Step 4: Extract the feature vectors of the files in the test set with the byte n-grams method.
As in step 1, the feature vectors of the files in the test set are extracted with the byte n-grams method. The test set can be selected from data sets available online.
Step 5: Reduce the dimensionality of the feature vectors extracted from the test set with the locally linear embedding algorithm.
As in step 2, the locally linear embedding algorithm is applied to the feature vectors extracted in step 4.
Step 6: Feed the reduced results to the kernel covering classifier for classification, compile statistics on the classification results, and determine whether the files in the test set contain malicious code.
Taking the output of step 5 as input, classify it with the kernel covering classifier obtained in step 3 and compile statistics on the classification results. Then determine from those statistics whether the files in the test set contain malicious code.
The invention uses a sample set containing both malicious and non-malicious files as the training set, trains a classifier with a classification algorithm, and then uses the trained classifier to classify unknown files to determine whether they are malicious. A manifold learning algorithm is introduced into feature selection to analyze a large number of file feature attributes and discover meaningful low-dimensional structure hidden in the high-dimensional data; this reduces the dimensionality of the file feature attributes and increases processing speed. The kernel covering learning algorithm is introduced as the classification learning algorithm. It brings the kernel function concept of support vector machines into the covering algorithm. Compared with the support vector machine algorithm, it can construct, for any given sample set, a kernel function that partitions the sample set accurately in a single pass, so the system retains good classification accuracy and a small computational cost even with insufficient prior knowledge and small samples.
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110076525XA CN102142068A (en) | 2011-03-29 | 2011-03-29 | Method for detecting unknown malicious code |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110076525XA CN102142068A (en) | 2011-03-29 | 2011-03-29 | Method for detecting unknown malicious code |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102142068A true CN102142068A (en) | 2011-08-03 |
Family
ID=44409571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110076525XA Pending CN102142068A (en) | 2011-03-29 | 2011-03-29 | Method for detecting unknown malicious code |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102142068A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102346830A (en) * | 2011-09-23 | 2012-02-08 | 重庆大学 | Gradient histogram-based virus detection method |
CN102411687A (en) * | 2011-11-22 | 2012-04-11 | 华北电力大学 | Deep learning detection method for unknown malicious codes |
CN102651088A (en) * | 2012-04-09 | 2012-08-29 | 南京邮电大学 | Classification method for malicious code based on A_Kohonen neural network |
CN102779249A (en) * | 2012-06-28 | 2012-11-14 | 奇智软件(北京)有限公司 | Malicious program detection method and scan engine |
CN104077524A (en) * | 2013-03-25 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Training method used for virus identification and virus identification method and device |
CN104504334A (en) * | 2013-12-05 | 2015-04-08 | 卡巴斯基实验室封闭式股份公司 | System and method used for evaluating selectivity of classification rules |
CN104778407A (en) * | 2015-04-14 | 2015-07-15 | 电子科技大学 | Multi-dimensional feature-code-free rogue program detecting method |
CN106446221A (en) * | 2016-09-30 | 2017-02-22 | 北京奇虎科技有限公司 | Data analyzing method and device |
CN106447066A (en) * | 2016-06-01 | 2017-02-22 | 上海坤士合生信息科技有限公司 | Big data feature extraction method and device |
US20180144131A1 (en) * | 2016-11-21 | 2018-05-24 | Michael Wojnowicz | Anomaly based malware detection |
WO2018184102A1 (en) * | 2017-04-03 | 2018-10-11 | Royal Bank Of Canada | Systems and methods for malicious code detection |
CN108985361A (en) * | 2018-07-02 | 2018-12-11 | 北京金睛云华科技有限公司 | A kind of malicious traffic stream detection implementation method and device based on deep learning |
CN109934004A (en) * | 2019-03-14 | 2019-06-25 | 中国科学技术大学 | A method for protecting privacy in a machine learning service system |
US12131294B2 (en) | 2012-06-21 | 2024-10-29 | Open Text Corporation | Activity stream based interaction |
US12149623B2 (en) | 2018-02-23 | 2024-11-19 | Open Text Inc. | Security privilege escalation exploit detection and mitigation |
US12164466B2 (en) | 2010-03-29 | 2024-12-10 | Open Text Inc. | Log file management |
US12197383B2 (en) | 2015-06-30 | 2025-01-14 | Open Text Corporation | Method and system for using dynamic content types |
US12235960B2 (en) | 2019-03-27 | 2025-02-25 | Open Text Inc. | Behavioral threat detection definition and compilation |
US12261822B2 (en) | 2014-06-22 | 2025-03-25 | Open Text Inc. | Network threat prediction and blocking |
US12282549B2 (en) | 2005-06-30 | 2025-04-22 | Open Text Inc. | Methods and apparatus for malware threat research |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090300765A1 (en) * | 2008-05-27 | 2009-12-03 | Deutsche Telekom Ag | Unknown malcode detection using classifiers with optimal training sets |
CN101944167A (en) * | 2010-09-29 | 2011-01-12 | 中国科学院计算技术研究所 | Method and system for identifying malicious program |
CN101984450A (en) * | 2010-12-15 | 2011-03-09 | 北京安天电子设备有限公司 | Malicious code detection method and system |
Non-Patent Citations (2)
Title |
---|
《2010 International Conference on Information,Networking and Automation(ICINA)》 20101019 Li Yuancheng等 An intrusion detection method based on LLE and BVM , * |
《电子学报》 20070531 周鸣争等 基于构造性核覆盖算法的异常入侵检测 第35卷, 第5期 * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12282549B2 (en) | 2005-06-30 | 2025-04-22 | Open Text Inc. | Methods and apparatus for malware threat research |
US12164466B2 (en) | 2010-03-29 | 2024-12-10 | Open Text Inc. | Log file management |
US12210479B2 (en) | 2010-03-29 | 2025-01-28 | Open Text Inc. | Log file management |
CN102346830A (en) * | 2011-09-23 | 2012-02-08 | 重庆大学 | Gradient histogram-based virus detection method |
CN102411687A (en) * | 2011-11-22 | 2012-04-11 | 华北电力大学 | Deep learning detection method for unknown malicious codes |
CN102411687B (en) * | 2011-11-22 | 2014-04-23 | 华北电力大学 | Deep learning detection method for unknown malicious code |
CN102651088B (en) * | 2012-04-09 | 2014-03-26 | 南京邮电大学 | Classification method for malicious code based on A_Kohonen neural network |
CN102651088A (en) * | 2012-04-09 | 2012-08-29 | 南京邮电大学 | Classification method for malicious code based on A_Kohonen neural network |
US12131294B2 (en) | 2012-06-21 | 2024-10-29 | Open Text Corporation | Activity stream based interaction |
CN102779249B (en) * | 2012-06-28 | 2015-07-29 | 北京奇虎科技有限公司 | Malware detection methods and scanning engine |
CN102779249A (en) * | 2012-06-28 | 2012-11-14 | 奇智软件(北京)有限公司 | Malicious program detection method and scan engine |
CN104077524A (en) * | 2013-03-25 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Training method used for virus identification and virus identification method and device |
CN104077524B (en) * | 2013-03-25 | 2018-01-09 | 腾讯科技(深圳)有限公司 | Training method and viruses indentification method and device for viruses indentification |
CN104504334A (en) * | 2013-12-05 | 2015-04-08 | 卡巴斯基实验室封闭式股份公司 | System and method used for evaluating selectivity of classification rules |
CN104504334B (en) * | 2013-12-05 | 2018-08-10 | 卡巴斯基实验室封闭式股份公司 | System and method for assessing classifying rules selectivity |
US12301539B2 (en) | 2014-06-22 | 2025-05-13 | Open Text Inc. | Network threat prediction and blocking |
US12261822B2 (en) | 2014-06-22 | 2025-03-25 | Open Text Inc. | Network threat prediction and blocking |
CN104778407A (en) * | 2015-04-14 | 2015-07-15 | 电子科技大学 | Multi-dimensional feature-code-free rogue program detecting method |
CN104778407B (en) * | 2015-04-14 | 2017-08-08 | 电子科技大学 | A kind of multidimensional is without condition code malware detection methods |
US12197383B2 (en) | 2015-06-30 | 2025-01-14 | Open Text Corporation | Method and system for using dynamic content types |
CN106447066A (en) * | 2016-06-01 | 2017-02-22 | 上海坤士合生信息科技有限公司 | Big data feature extraction method and device |
CN106446221B (en) * | 2016-09-30 | 2019-09-17 | 北京奇虎科技有限公司 | Data analysis method and device |
CN106446221A (en) * | 2016-09-30 | 2017-02-22 | 北京奇虎科技有限公司 | Data analyzing method and device |
US10489589B2 (en) * | 2016-11-21 | 2019-11-26 | Cylance Inc. | Anomaly based malware detection |
US11210394B2 (en) | 2016-11-21 | 2021-12-28 | Cylance Inc. | Anomaly based malware detection |
US20180144131A1 (en) * | 2016-11-21 | 2018-05-24 | Michael Wojnowicz | Anomaly based malware detection |
US10685284B2 (en) | 2017-04-03 | 2020-06-16 | Royal Bank Of Canada | Systems and methods for malicious code detection |
WO2018184102A1 (en) * | 2017-04-03 | 2018-10-11 | Royal Bank Of Canada | Systems and methods for malicious code detection |
US12149623B2 (en) | 2018-02-23 | 2024-11-19 | Open Text Inc. | Security privilege escalation exploit detection and mitigation |
CN108985361A (en) * | 2018-07-02 | 2018-12-11 | 北京金睛云华科技有限公司 | A kind of malicious traffic stream detection implementation method and device based on deep learning |
CN108985361B (en) * | 2018-07-02 | 2021-06-18 | 北京金睛云华科技有限公司 | Malicious traffic detection implementation method and device based on deep learning |
CN109934004A (en) * | 2019-03-14 | 2019-06-25 | 中国科学技术大学 | A method for protecting privacy in a machine learning service system |
US12235960B2 (en) | 2019-03-27 | 2025-02-25 | Open Text Inc. | Behavioral threat detection definition and compilation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102142068A (en) | Method for detecting unknown malicious code | |
Ali et al. | Spike2vec: An efficient and scalable embedding approach for covid-19 spike sequences | |
CN107392015B (en) | A kind of intrusion detection method based on semi-supervised learning | |
CN105915555A (en) | Method and system for detecting network anomalous behavior | |
CN111382438B (en) | Malware detection method based on multi-scale convolutional neural network | |
CN110110734B (en) | Open set identification method, information processing apparatus, and storage medium | |
CN105335756A (en) | Robust learning model and image classification system | |
CN105354595A (en) | Robust visual image classification method and system | |
CN102263790A (en) | An Intrusion Detection Method Based on Ensemble Learning | |
CN110084314B (en) | A method for filtering false-positive genetic mutations for targeted capture gene sequencing data | |
CN107577702B (en) | Method for distinguishing traffic information in social media | |
CN104820841B (en) | Hyperspectral classification method based on low order mutual information and spectrum context waveband selection | |
CN111723666A (en) | A signal recognition method and device based on semi-supervised learning | |
CN106156805A (en) | A kind of classifier training method of sample label missing data | |
CN102158486A (en) | Method for rapidly detecting network invasion | |
WO2020088338A1 (en) | Method and apparatus for building recognition model | |
CN117633811A (en) | A code vulnerability detection method based on multi-view feature fusion | |
CN116975863A (en) | Malicious code detection method based on convolutional neural network | |
Aljabri et al. | Fake news detection using machine learning models | |
CN110795736A (en) | Malicious android software detection method based on SVM decision tree | |
Uhlig et al. | Combining AI and AM–Improving approximate matching through transformer networks | |
CN116628695A (en) | Vulnerability mining method and device based on multi-task learning | |
CN105469095A (en) | Vehicle model identification method based on pattern set histograms of vehicle model images | |
CN111783088B (en) | Malicious code family clustering method and device and computer equipment | |
CN104778478A (en) | Handwritten numeral identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20110803 |