CN104331436A

CN104331436A - Rapid classification method of malicious codes based on family genetic codes

Info

Publication number: CN104331436A
Application number: CN201410571621.5A
Authority: CN
Inventors: 沈超; 程颢; 张泽华; 管晓宏
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2014-10-23
Filing date: 2014-10-23
Publication date: 2015-02-04
Anticipated expiration: 2034-10-23
Also published as: CN104331436B

Abstract

The invention discloses a rapid classification method for malicious code based on family genetic codes. The method characterizes each malicious code sample by the frequencies with which behaviors occur across multiple behavioral layers, generates family genetic codes from the aggregation and difference among a massive set of malicious code samples, and classifies newly added malicious code accurately and rapidly by directly matching its feature vector against the family genetic codes. The method has the following advantages: the behavior of malicious code is described across multiple behavioral layers, and the family genetic codes are generated from the aggregation and difference among similar samples, which markedly improves how accurately and how generally the codes represent a malicious code family; directly matching feature vectors against family genetic codes speeds up comparison and classification; and the whole process is highly automated and requires no human intervention, which improves the stability and accuracy of the method.

Description

Rapid classification method for malicious code based on family gene codes

Technical field

The present invention relates to computer security technology, and in particular to a method for classifying malicious computer code.

Background technology

With the progress of society and the development of science and technology, computers have penetrated every aspect of people's lives, and more and more personal information (such as pictures, videos, and chat records) and sensitive information (such as bank account numbers and business documents) is stored on computers. Meanwhile, the amount of malicious code whose main purpose is to steal or destroy this information has increased sharply in recent years, and such code is characterized by numerous variants and increasing intelligence. As a result, the identification and classification of malicious computer code has drawn great attention from governments and the public.

Existing malicious code classification methods rely mainly on either the static features or the dynamic features of malicious code, and mostly concentrate on clustering and classifying known malicious code. They classify newly added malicious code poorly. In addition, they analyze malicious code behavior through a single form of feature and use relatively few samples, so they lack accuracy and generality.

Summary of the invention

The object of the present invention is to provide a method that classifies malicious computer code rapidly and accurately, and in particular a method that classifies newly added malicious code rapidly and accurately by using malicious code family gene codes obtained by clustering a massive set of malicious code samples.

To achieve the above object, the present invention adopts the following technical solution:

A rapid classification method for malicious code based on family gene codes, characterized in that it comprises the following two major steps:

The first step, generating the malicious code family gene codes, comprises the following sub-steps:

(1) obtain a malicious code sample set consisting of M malicious code samples, where M is at least 10 million;

(2) extract behavior information from the malicious code samples;

(3) sort the behaviors by how frequently they occur across all malicious code samples, and select the behaviors whose total frequency is not less than 3 to form the behavior vector that characterizes malicious code samples;

(4) use the frequency with which each element of the behavior vector occurs in a sample's behavior information to form the feature vector of that malicious code sample;

(5) use the Manhattan distance to calculate the distances between the feature vectors of the malicious code samples, forming the distance matrix $D = \{d_{ij}\}_{M \times M}$ of the malicious code sample set, where $d_{ij}$ denotes the distance from malicious code sample i to malicious code sample j, and D is symmetric about its diagonal;

(6) based on the distance matrix D of the malicious code sample set, extract the malicious code family gene codes from the sample set, so that malicious code families can be generated.

The second step, rapid classification of malicious code, comprises the following sub-steps:

(1) for a newly added malicious code sample, extract its behavior information and compare it with the behavior vector obtained from the malicious code sample set; the frequency with which each element of the behavior vector occurs in the behavior information of the newly added sample forms the feature vector of that sample;

(2) match the feature vector of the newly added malicious code sample against the malicious code family gene codes to determine the category to which the newly added malicious code belongs.
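As an illustration of sub-step (1) of the second step, the following minimal Python sketch projects the behaviors observed for one sample onto a fixed behavior vector. The function name `feature_vector`, the string encoding of behaviors, and the sample data are hypothetical and are not specified by the method.

```python
from collections import Counter

def feature_vector(behavior_vector, observed_behaviors):
    """Count how often each element of the behavior vector occurs in one sample."""
    counts = Counter(observed_behaviors)
    # Behaviors that are not part of the behavior vector are ignored;
    # elements that never occur in the sample get frequency 0.
    return [counts[behavior] for behavior in behavior_vector]

# Hypothetical behavior vector obtained from the sample set.
behavior_vector = ["api:CreateFileW", "reg:HKLM\\Run", "dll:ws2_32.dll"]

# Hypothetical behavior log of a newly added sample.
new_sample_log = [
    "api:CreateFileW",
    "api:CreateFileW",
    "dll:ws2_32.dll",
    "file:C:\\tmp\\a.exe",  # not in the behavior vector, so it is dropped
]

vec = feature_vector(behavior_vector, new_sample_log)
```

Because the behavior vector is fixed after the first step, every new sample yields a vector of the same dimension, which is what makes direct matching against the gene codes possible.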

In the above method, the behavior information of malicious code in sub-step (2) of the first step refers to the malicious code's accesses to computer resources during execution, including API import table access behavior, file operation behavior, process operation behavior, registry operation behavior, dynamic link library call behavior, and hook function call behavior.

The selection, in sub-step (3) of the first step, of behaviors whose total frequency is not less than 3 to form the behavior vector that characterizes malicious code samples proceeds as follows:

(1) statistically analyze the behavior information of every sample in the malicious code sample set, and form an initial feature set from all behaviors that occur;

(2) for each element of the initial feature set, calculate the total frequency with which it occurs in the behavior information of all samples; sort the elements, remove those whose total frequency is 1 or 2, and use the remaining elements as the features that characterize malicious code samples.
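The two sub-steps above can be sketched in Python as follows. This is only a minimal illustration: the function name `select_behavior_vector`, the parameter `min_total`, the tie-breaking rule in the sort, and the toy behavior logs are assumptions introduced here.

```python
from collections import Counter

def select_behavior_vector(sample_logs, min_total=3):
    """Keep behaviors whose total frequency over all samples is at least min_total."""
    totals = Counter()
    for log in sample_logs:
        totals.update(log)  # accumulate occurrence counts across samples
    # Sort by total frequency, highest first; ties are broken by name so that
    # the resulting order is deterministic (the tie rule is an assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [behavior for behavior, total in ranked if total >= min_total]

# Hypothetical behavior logs of three samples.
logs = [
    ["api:CreateFileW", "api:CreateFileW", "reg:Run"],
    ["api:CreateFileW", "reg:Run", "dll:ws2_32.dll"],
    ["reg:Run", "hook:keyboard"],
]

behavior_vector = select_behavior_vector(logs)
```

In this toy data the two behaviors that occur only once in total are dropped, which mirrors the removal of elements whose total frequency is 1 or 2.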

The specific method of extracting malicious code family gene codes from the sample set in sub-step (6) of the first step is:

1) sort the distances $d_{ij}$ ($i < j$) between malicious code samples in descending order, and take the median of the sorted result as the cutoff distance $d_c$;

2) use a Gaussian kernel function to calculate the aggregation degree $\rho_i$ of each malicious code sample, which indicates how closely the sample is surrounded by its neighboring samples; the formula is:

$$\rho_i = \sum_{j \in I_D \setminus \{i\}} e^{-\left(\frac{d_{ij}}{d_c}\right)^2};$$

3) generate the index sequence $\{s_i\}_{i=1}^{M}$ that sorts the malicious code samples by aggregation degree in descending order, so that $\rho_{s_1} \ge \rho_{s_2} \ge \dots \ge \rho_{s_M}$;

4) diversity factor of each malicious code sample is calculated represent the distance between this malicious code sample and the large malicious code sample of other concentration class, computing formula is:

$$\delta_{s_i} = \begin{cases} \min_{s_j,\, j < i} d_{s_i s_j}, & i \ge 2; \\ \min_{j \ge 2} \delta_{s_j}, & i = 1. \end{cases}$$

5) for each malicious code sample, calculate its decision value as a family gene code candidate; the decision value is the product of the sample's aggregation degree and its difference degree;

6) compare each sample's decision value with a preset threshold ε; if the decision value is greater than the threshold, the sample is judged to be a family gene code and is stored in the database.
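The six extraction steps above can be sketched in pure Python on a small distance matrix. This is a hedged illustration, not a definitive implementation: the function name `extract_gene_codes`, the `top_rule` parameter, and the toy data are assumptions. By default the top-ranked sample is handled with `min`, as the formula in the text states; passing `top_rule=max` is not part of the text and instead follows the convention of density-peak clustering, which this procedure resembles. The sketch also assumes at least two samples and a nonzero cutoff distance.

```python
import math
from statistics import median

def extract_gene_codes(D, epsilon, top_rule=min):
    """Return indices of samples whose decision value exceeds epsilon."""
    m = len(D)
    # 1) cutoff distance d_c: median of the pairwise distances d_ij, i < j
    d_c = median(D[i][j] for i in range(m) for j in range(i + 1, m))
    # 2) aggregation degree rho_i with a Gaussian kernel
    rho = [
        sum(math.exp(-((D[i][j] / d_c) ** 2)) for j in range(m) if j != i)
        for i in range(m)
    ]
    # 3) sample indices sorted by descending aggregation degree
    order = sorted(range(m), key=lambda i: -rho[i])
    # 4) difference degree: distance to the nearest higher-ranked sample
    delta = [0.0] * m
    for rank in range(1, m):
        s_i = order[rank]
        delta[s_i] = min(D[s_i][order[k]] for k in range(rank))
    # Top-ranked sample: min over the other difference degrees, as written in
    # the text; top_rule=max is an outside convention, not from the text.
    delta[order[0]] = top_rule(delta[order[k]] for k in range(1, m))
    # 5) decision value gamma_i = rho_i * delta_i
    gamma = [rho[i] * delta[i] for i in range(m)]
    # 6) keep samples whose decision value exceeds the threshold
    return [i for i in range(m) if gamma[i] > epsilon]

# Toy data: one-dimensional feature values forming two groups.
points = [0, 1, 2, 3, 20, 21, 22]
D = [[abs(a - b) for b in points] for a in points]

genes = extract_gene_codes(D, epsilon=10.0)
genes_max_rule = extract_gene_codes(D, epsilon=10.0, top_rule=max)
```

On this toy data the default rule selects only sample 4, because the sample with the highest aggregation degree receives the smallest difference degree; with `top_rule=max` samples 3 and 4 are both selected, one per group.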

The specific method in sub-step (2) of the second step for determining the category of a newly added malicious code sample from the result of matching its feature vector against the malicious code family gene codes is: match the feature vector of the newly added sample against each malicious code family gene code in the database to obtain a similarity value for each gene code; if any similarity value is greater than a preset threshold, assign the sample to the malicious code family with the largest similarity value; if no similarity value is greater than the preset threshold, assign the sample to a newly added malicious code family.

The malicious code sample feature vectors formed in sub-step (4) of the first step are stored using an index matrix, which records only the positions of those elements of a feature vector that are greater than 0 and less than 10.
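One literal reading of this storage scheme can be sketched in Python as follows. The function name `index_row` and the toy vectors are hypothetical, and the text does not say how elements equal to or greater than 10 are stored, so the sketch simply leaves them out.

```python
def index_row(feature_vec):
    """Positions of the elements that are greater than 0 and less than 10."""
    return [pos for pos, value in enumerate(feature_vec) if 0 < value < 10]

# Hypothetical feature vectors of three samples.
feature_vectors = [
    [0, 2, 0, 1],
    [3, 0, 0, 0],
    [0, 0, 12, 9],  # the value 12 falls outside the recorded range
]

# One row of positions per sample; rows may differ in length.
index_matrix = [index_row(v) for v in feature_vectors]
```

Since most behaviors do not occur in most samples, storing positions instead of full vectors keeps each row short.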

Compared with existing methods, the classification technique based on malicious code family gene codes has significant advantages. First, a very large number of malicious code samples is analyzed, malicious code behavior is described and characterized across multiple behavioral layers by combining dynamic and static features, and the family gene codes are generated from the aggregation and difference among similar malicious code samples, so the generated family gene codes are representative and general. Second, directly matching malicious code feature vectors against family gene codes effectively increases the speed of comparison and classification. In addition, the whole process is highly automated and requires no human intervention, which improves the stability and accuracy of the method.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall flow of the method of the invention.

Fig. 2 is a detailed flow diagram of the family gene code generation step in the first step of Fig. 1.

Fig. 3 is a detailed flow diagram of the second major step in Fig. 1.

Embodiment

Referring to Fig. 1, the present invention relates to a rapid classification method for malicious code based on family gene codes, which can be used to quickly identify the family of newly added malicious code and to classify massive amounts of malicious code rapidly and accurately. The invention comprises two parts, generation of family gene codes and rapid classification of malicious code, and the specific implementation steps are as follows:

1) The family gene code generation part comprises the following steps:

(1) Obtain a malicious code sample set (comprising M malicious code samples).

(2) Disassemble each malicious code sample and analyze the disassembly result to obtain the static behavior information of the malicious code, including API import table call behavior. Then run the sample in a sandbox while monitoring its dynamic operations on the host computer to obtain the dynamic behavior information of the malicious code, including file operation behavior, process operation behavior, registry operation behavior, dynamic link library call behavior, and hook function call behavior. Statistically analyze the behavior information of every sample in the malicious code sample set, and form an initial feature set from all behaviors that occur.

(3) Within the initial feature set, sort the behaviors by how frequently they occur across all malicious code samples, remove the elements whose frequency is 1 or 2, and use the remaining P elements to form the behavior vector C that characterizes malicious code samples. Here an element is an access, within one behavior category, by the malicious code to a specific target resource of the host computer, including a call to a specific API function, an operation on a specific file, an operation on a specific process, an operation on a specific registry entry, a call to a specific dynamic link library, and a call to a specific hook function.

(4) Use the frequency with which each element of behavior vector C occurs in a sample's behavior information as the feature vector of that malicious code sample. A P-dimensional feature vector is generated for every sample in the malicious code sample set, giving M feature vectors in total; each feature vector is expressed as $V_j = [S_1, S_2, S_3, \dots, S_P]$, where $S_i$ denotes the frequency with which the i-th element of the behavior vector occurs in sample j;

(5) Use the Manhattan distance to calculate the pairwise distances between malicious code samples and generate the distance matrix D of the malicious code sample set, where D is an M × M matrix, $d_{ij}$ denotes the distance from sample i to sample j, the diagonal elements of D are 0, and D is symmetric about its diagonal;

(6) Based on the distance matrix D of the malicious code sample set, extract the family gene codes from the sample set; the specific implementation steps are:

4) calculate the difference degree $\delta_{s_i}$ of each malicious code sample, which represents the distance between that sample and the other samples with a larger aggregation degree; the formula is:

$$\delta_{s_i} = \begin{cases} \min_{s_j,\, j < i} d_{s_i s_j}, & i \ge 2; \\ \min_{j \ge 2} \delta_{s_j}, & i = 1. \end{cases}$$

5) for each malicious code sample, calculate its decision value as a family gene code candidate, which is the product of the sample's aggregation degree and difference degree, $\gamma_i = \rho_i \delta_i$;

6) compare the decision value $\gamma_i$ of each malicious code sample with a preset threshold ε; if it is greater than the threshold, the sample is judged to be a family gene code and is stored in the database.

(7) For the malicious code samples that are not family gene codes, extract from the distance matrix the distance between each sample and every family gene code, assign each sample to its nearest family gene code, thereby forming the malicious code families, and store the families in the database.
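Steps (5) and (7) can be illustrated with a small Python sketch that builds the Manhattan distance matrix, $d_{ij} = \sum_k |V_i[k] - V_j[k]|$, and then assigns every remaining sample to its nearest gene code. The function names, the dictionary layout of the families, and the toy vectors are assumptions, and the gene code indices are simply given here rather than extracted. A full M × M matrix is only practical for small M, so this is a sketch for toy sizes.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(vectors):
    """Symmetric matrix of pairwise Manhattan distances with a zero diagonal."""
    m = len(vectors)
    D = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            D[i][j] = D[j][i] = manhattan(vectors[i], vectors[j])
    return D

def build_families(D, gene_indices):
    """Assign every non-gene sample to the gene code at minimum distance."""
    families = {g: [g] for g in gene_indices}
    for i in range(len(D)):
        if i in families:
            continue  # gene codes already head their own family
        nearest = min(gene_indices, key=lambda g: D[i][g])
        families[nearest].append(i)
    return families

# Hypothetical feature vectors; samples 0 and 2 are assumed to be gene codes.
vectors = [[2, 0, 1], [2, 1, 1], [0, 5, 0], [0, 4, 1]]
D = distance_matrix(vectors)
families = build_families(D, gene_indices=[0, 2])
```

Each family is keyed by the index of its gene code, so the stored result records both the representative and its members.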

2) The rapid classification part comprises the following steps:

(1) For a newly added malicious code sample B, disassemble B to obtain its static features and extract its API import table call behavior; run B in a sandbox and monitor its dynamic operations on the host computer to obtain its dynamic features, extracting file operation behavior, process operation behavior, registry operation behavior, dynamic link library call behavior, and hook function call behavior;

(2) Based on the behavior vector C obtained during gene code generation, perform feature extraction on the behavior information obtained for malicious code sample B, using the frequency with which each element of behavior vector C occurs in the behavior information of B as the feature vector of B;

(3) Match the feature vector of malicious code sample B against the gene code of each family in the database, calculating the Manhattan distance between them as the similarity between B and that family. If any similarity value is greater than a preset threshold, assign the sample to the family with the largest similarity value; if no similarity value is greater than the preset threshold, assign the sample to a newly added family, insert the feature vector of B into the database, and label it as a newly added family.
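The matching in step (3) can be sketched in Python as follows. The text uses the Manhattan distance as the similarity measure but does not state how a distance becomes a similarity value that grows with closeness, so the mapping `1 / (1 + distance)` used here is an assumption. The function name `classify`, the in-memory dictionary standing in for the database, and the family naming scheme are likewise hypothetical.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

def classify(feature_vec, gene_codes, threshold):
    """Return the family of feature_vec, registering a new family if needed.

    gene_codes maps a family name to the feature vector of its gene code.
    """
    best_family, best_similarity = None, 0.0
    for family, gene in gene_codes.items():
        # Assumed mapping from distance to similarity: identical vectors
        # give 1.0, and the value decreases as the distance grows.
        similarity = 1.0 / (1.0 + manhattan(feature_vec, gene))
        if similarity > best_similarity:
            best_family, best_similarity = family, similarity
    if best_family is not None and best_similarity > threshold:
        return best_family
    # No family is similar enough: store the sample as a new family.
    new_family = f"family_{len(gene_codes)}"
    gene_codes[new_family] = list(feature_vec)
    return new_family

# Hypothetical database of two family gene codes.
gene_codes = {"family_0": [2, 0, 1], "family_1": [0, 5, 0]}

first = classify([2, 1, 1], gene_codes, threshold=0.25)
second = classify([9, 9, 9], gene_codes, threshold=0.25)
```

The first sample lies at distance 1 from `family_0` and is assigned to it, while the second is far from both gene codes and is registered as a new family.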

Claims

1. A rapid classification method for malicious code based on family gene codes, characterized in that it comprises the following two major steps:

(2) extract behavior information from the malicious code samples;

(6) based on the distance matrix D of the malicious code sample set, extract the malicious code family gene codes from the sample set, so that malicious code families can be generated;

2. The rapid classification method for malicious code based on family gene codes according to claim 1, characterized in that the behavior information of malicious code in sub-step (2) of the first step refers to the malicious code's accesses to computer resources during execution, including API import table access behavior, file operation behavior, process operation behavior, registry operation behavior, dynamic link library call behavior, and hook function call behavior.

3. The rapid classification method for malicious code based on family gene codes according to claim 1, characterized in that the selection, in sub-step (3) of the first step, of behaviors whose total frequency is not less than 3 to form the behavior vector that characterizes malicious code samples proceeds as follows:

4. The rapid classification method for malicious code based on family gene codes according to claim 1, characterized in that the specific method of extracting malicious code family gene codes from the sample set in sub-step (6) of the first step is:

where $I_D$ is the set of indices of all malicious code samples;

4) calculate the difference degree $\delta_{s_i}$ of each malicious code sample, which represents the distance between that sample and the other samples with a larger aggregation degree; the formula is:

$$\delta_{s_i} = \begin{cases} \min_{s_j,\, j < i} d_{s_i s_j}, & i \ge 2; \\ \min_{j \ge 2} \delta_{s_j}, & i = 1. \end{cases}$$

5. The rapid classification method for malicious code based on family gene codes according to claim 1, characterized in that the specific method in sub-step (2) of the second step for determining the category of a newly added malicious code sample from the result of matching its feature vector against the malicious code family gene codes is: match the feature vector of the newly added sample against each malicious code family gene code in the database to obtain a similarity value for each gene code; if any similarity value is greater than a preset threshold, assign the sample to the malicious code family with the largest similarity value; if no similarity value is greater than the preset threshold, assign the sample to a newly added malicious code family.

6. The rapid classification method for malicious code based on family gene codes according to claim 1, characterized in that the malicious code sample feature vectors formed in sub-step (4) of the first step are stored using an index matrix, which records only the positions of those elements of a feature vector that are greater than 0 and less than 10.