CN103679019A

CN103679019A - Malicious file identifying method and device

Info

Publication number: CN103679019A
Application number: CN201210332168.3A
Authority: CN
Inventors: 王健
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-09-10
Filing date: 2012-09-10
Publication date: 2014-03-26
Anticipated expiration: 2032-09-10
Also published as: CN103679019B

Abstract

The invention discloses a malicious file identification method and device. The method comprises: extracting a weak feature set of a sample file to be identified; looking up a pre-established black weight table and a pre-established white weight table according to the weak feature set to obtain a black weight and a white weight; calculating a black weight coefficient from the black weight, the white weight, and a predetermined algorithm; and identifying the black/white attribute of the sample file according to the black weight coefficient. Because the black weight coefficient, that is, the black suspicion degree, is computed from multiple combined features of the sample file, the judgment is a comprehensive one. The weight of each combined feature can be obtained from manual experience and data statistics, which improves the accuracy of malicious sample judgment.

Description

Malicious file identification method and device

Technical field

The present invention relates to the field of Internet technology, and in particular to a malicious file identification method and device.

Background art

The features of a sample file comprise strong features and weak features. A strong feature is generally unique, and a strong feature alone can identify whether a sample file is a black (malicious) sample file or a white (benign) sample file. Compared with a strong feature, a single weak feature or a single group of weak features is rarely enough to determine the black/white attribute of a sample file. Examples of weak features are the file's version information, compilation information, sensitive strings (such as URLs or process names appearing in the sample), file icon, and file path.

At present, evasion techniques that counter detection concentrate on defeating strong features, and existing malicious file identification usually adopts an identification method based on strong-feature statistics. This method takes multiple features from fixed positions and uses statistics to obtain a black feature table and a white feature table; the attribute of an unknown file is then obtained directly by querying the black and white feature tables.

However, the existing identification method has the following shortcomings:

1. The features used by the existing feature-statistics method are exclusive: each feature is treated as either black or white. Although the sample detection rate is high, the false-positive rate is also high.

2. The positions from which features are extracted are relatively fixed. Evasion techniques typically deform or modify the features at certain fixed positions. Once a modified position coincides with a position from which the identifier extracts a feature, detection by this method is easily bypassed.

Summary of the invention

The main purpose of the present invention is to provide a malicious file identification method and device, aiming to improve the accuracy of malicious file identification.

To achieve the above purpose, the present invention proposes a malicious file identification method, comprising:

extracting a weak feature set of a sample file to be identified;

looking up pre-established black and white weight tables according to the weak feature set to obtain a black weight and a white weight;

calculating a black weight coefficient according to the black weight, the white weight, and a predetermined algorithm;

identifying the black/white attribute of the sample file according to the black weight coefficient.

The present invention also proposes a malicious file identification device, comprising:

an extraction module, configured to extract a weak feature set of a sample file to be identified;

a search module, configured to look up pre-established black and white weight tables according to the weak feature set to obtain a black weight and a white weight;

a computing module, configured to calculate a black weight coefficient according to the black weight, the white weight, and a predetermined algorithm;

an identification module, configured to identify the black/white attribute of the sample file according to the black weight coefficient.

In the malicious file identification method and device proposed by the present invention, the weak feature set of the sample file to be identified is extracted, the pre-established black and white weight tables are looked up to obtain the black weight and the white weight, and the black weight coefficient, that is, the black suspicion degree, is calculated based on a predetermined algorithm. The black/white attribute of the sample file is thus identified by a comprehensive judgment over multiple combined features of the sample file. Because the weight of each combined feature can be obtained from manual experience and data statistics, the accuracy of malicious sample judgment is improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a first embodiment of the malicious file identification method of the present invention;

Fig. 2 is a schematic flowchart of a second embodiment of the malicious file identification method of the present invention;

Fig. 3 is a schematic structural diagram of a first embodiment of the malicious file identification device of the present invention;

Fig. 4 is a schematic structural diagram of a second embodiment of the malicious file identification device of the present invention.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Detailed description of the embodiments

The solution of the embodiments of the present invention is mainly as follows: extract the weak feature set of the sample file to be identified; look up the pre-established black and white weight tables to obtain the black weight and the white weight; calculate, based on a predetermined algorithm, the black weight coefficient, that is, the black suspicion degree; and identify the black/white attribute of the sample file accordingly. Because the weight of each feature or each combined feature in the pre-established black and white weight tables can be obtained from manual experience and data statistics, the accuracy of malicious sample judgment can be improved.

The technical terms used in the present invention include:

One-dimensional feature: a special case of a multidimensional feature. It refers to a single independent feature extracted from a file and not combined with any other feature. For example, feature A alone forms a one-dimensional feature. Concretely: features T1, T2, T3, ..., Tn.

Multidimensional feature: a combination of two or more features extracted from a file. For example, feature A and feature B together form a two-dimensional feature. Specifically:

Two-dimensional features, for example: T1T2, T1T3, ..., T1Tn, T2T3, ..., T2Tn, ..., TmTn

Three-dimensional features, for example: T1T2T3, T1T3T4, ..., T2T3Tn, T2TmTn, ..., TiTmTn

......
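The enumeration of one-dimensional and multidimensional features described above can be sketched in Python as follows. This is only an illustration: the function name `feature_groups` and the `max_dim` parameter are hypothetical and do not appear in the original text, and each feature group is represented as a sorted tuple purely for convenience.

```python
from itertools import combinations


def feature_groups(features, max_dim=3):
    """Return every feature group of dimension 1..max_dim as a sorted tuple.

    A group of dimension 1 is a one-dimensional feature; a group of
    dimension 2 is a two-dimensional feature such as ("T1", "T2").
    """
    ordered = sorted(set(features))
    groups = []
    for dim in range(1, max_dim + 1):
        groups.extend(combinations(ordered, dim))
    return groups


if __name__ == "__main__":
    for group in feature_groups(["T1", "T2", "T3"], max_dim=2):
        print("".join(group))
```

With the three features T1, T2, T3 and `max_dim=2`, the sketch lists the three one-dimensional features followed by the two-dimensional features T1T2, T1T3, and T2T3.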

As shown in Fig. 1, a first embodiment of the present invention proposes a malicious file identification method, comprising:

Step S101: extract the weak feature set of the sample file to be identified;

The present embodiment takes into account that evasion techniques generally concentrate on defeating strong features, whereas the weak features of variant samples of the same family often change little. Weak features of a sample file are therefore also an effective means of identifying new virus variants or unknown viruses.

In the present embodiment, a black weight table and a white weight table are established in advance. These tables contain the correspondence between the weak feature sets extracted from a known black sample file set and a known white sample file set and the weights assigned to them. The weight for each weak feature set can be set automatically from statistics or set manually.

The present embodiment judges comprehensively whether a sample file is malicious by means of features or feature combinations. The weak feature set may be a set of one-dimensional features or a set of multidimensional feature combinations, such as two-dimensional or three-dimensional ones. Accordingly, in the pre-established black and white weight tables, every kind of feature combination (one-dimensional features, two-dimensional features, and so on) has a corresponding weight.

When a sample file needs to be identified, the weak feature set is first extracted from the sample file.

When the weak feature set is extracted, the positions of the extracted features need not be fixed and there is a certain range of choice, which may include both internal features and peripheral information of the file. The internal features may be the version information, compilation information, sensitive strings (such as URLs or process names appearing in the sample), and file icon of the sample file. The peripheral information may be the path at which the file is stored on the user's machine, the file name, and so on.
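As a rough illustration of step S101, the sketch below turns already-parsed file metadata into a set of weak-feature tokens. The source does not say how the internal features are parsed or how features are encoded, so everything here is an assumption: the function name `extract_weak_features`, the dictionary keys (`version`, `compiler`, `icon_hash`, `strings`, `path`), and the `key=value` token format are all hypothetical.

```python
import ntpath


def extract_weak_features(meta):
    """Build a set of weak-feature tokens from parsed file metadata.

    `meta` is a hypothetical dict; parsing the file itself (version
    resource, compiler stamp, icon, strings) is outside this sketch.
    """
    features = set()

    # Internal features: version info, compilation info, file icon.
    for key in ("version", "compiler", "icon_hash"):
        value = meta.get(key)
        if value:
            features.add(f"{key}={value}")

    # Internal features: sensitive strings such as URLs or process names.
    for text in meta.get("strings", []):
        features.add(f"string={text}")

    # Peripheral information: storage path and file name on the user machine.
    path = meta.get("path")
    if path:
        directory, name = ntpath.split(path)
        features.add(f"dir={directory.lower()}")
        features.add(f"name={name.lower()}")

    return features
```

`ntpath` is used so that Windows-style paths are split the same way regardless of the platform the sketch runs on.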

Step S102: look up the pre-established black and white weight tables according to the weak feature set to obtain the black weight and the white weight;

As described above, the pre-established black and white weight tables contain the correspondence between weak feature sets and the weights assigned to them. After the weak feature set of the sample file to be identified is extracted, the black and white weight tables are looked up for this weak feature set to obtain the corresponding black weight and white weight.
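A minimal sketch of the lookup in step S102 is shown below. It assumes, hypothetically, that each weight table is a Python dict keyed by a `frozenset` of feature names and that a feature set absent from a table has weight 0; neither the storage format nor the treatment of missing entries is specified in the source.

```python
def lookup_weights(feature_set, black_table, white_table, default=0.0):
    """Return (black_weight, white_weight) for one weak feature set.

    The tables map frozenset-of-features to a weight. A feature set
    missing from a table gets `default` (an assumption of this sketch).
    """
    key = frozenset(feature_set)
    return black_table.get(key, default), white_table.get(key, default)


if __name__ == "__main__":
    black = {frozenset({"A", "B"}): 0.8}
    white = {frozenset({"A", "B"}): 0.1}
    print(lookup_weights(["B", "A"], black, white))
```

Using a `frozenset` as the key makes the lookup independent of the order in which the features were extracted.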

Step S103: calculate the black weight coefficient according to the black weight, the white weight, and a predetermined algorithm;

The present embodiment uses Bayes' theorem, together with the black weight and white weight obtained in step S102, to calculate the black weight coefficient. The black weight coefficient is the black suspicion degree of the sample file, and the black/white attribute of the sample file can be judged from it.

The formula for calculating the black weight coefficient by Bayes' theorem is:

P(A \mid B) = \frac{P(A) P(B \mid A)}{P(B)}

where P(A|B) is the black weight coefficient, that is, the black suspicion degree of the sample file; P(B|A) is the black weight; and P(B) = black weight + white weight. If the sample file to be identified is assumed to be equally likely to be black or white, then P(A) = 50%.

Thus, for an unknown sample file, its black and white weights are obtained by querying the black and white weight tables, and Bayes' theorem then gives the Bayesian weight coefficient of the sample file being black, that is, the black suspicion degree of the sample.
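The calculation in step S103 can be sketched as follows. One point needs a stated assumption: the text defines P(B) as the black weight plus the white weight, but taken literally with P(A) = 50% the coefficient could never exceed 50%, which would make a 70% threshold unreachable. The sketch therefore assumes the sum is meant in the law-of-total-probability sense, P(B) = P(A) x black weight + (1 - P(A)) x white weight, which for equal priors reduces to black / (black + white). The function name and the fallback for a zero denominator are likewise hypothetical.

```python
def black_coefficient(black_weight, white_weight, prior_black=0.5):
    """Bayes-style black suspicion degree P(A|B) for one weak feature set.

    Assumption of this sketch: P(B) is expanded by the law of total
    probability, treating the white weight as P(B | not A).
    """
    evidence = prior_black * black_weight + (1.0 - prior_black) * white_weight
    if evidence == 0.0:
        # Feature set never observed in either sample set: fall back
        # to the prior (a choice made here, not stated in the source).
        return prior_black
    return prior_black * black_weight / evidence


if __name__ == "__main__":
    print(black_coefficient(0.8, 0.2))
```

For a black weight of 0.8 and a white weight of 0.2 with equal priors, the coefficient works out to 0.4 / 0.5 = 0.8.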

Step S104: identify the black/white attribute of the sample file according to the black weight coefficient.

After the black weight coefficient of the sample file has been calculated, it is compared with preset thresholds in order to judge the black/white attribute of the sample file.

The present embodiment uses two thresholds, a first preset threshold and a second preset threshold, where the second preset threshold is less than the first preset threshold. For example, the first and second preset thresholds may take the values 70% and 50% respectively.

The second preset threshold is used to judge whether the sample file is a suspicious black sample file, and the first preset threshold is used to judge whether the sample file is a black file.

Specifically, the calculated black weight coefficient is compared with the first preset threshold. If the black weight coefficient is greater than the first preset threshold, the sample file is identified as a black sample file.

When the first preset threshold indicates that the sample file is not a black sample file, that is, when the black weight coefficient is less than or equal to the first preset threshold, the black weight coefficient is compared with the second preset threshold in order to judge how likely the sample file is to be a black or a white sample file. If the black weight coefficient is greater than the second preset threshold and less than the first preset threshold, the sample file is identified as a suspicious black sample file. If the black weight coefficient is less than the second preset threshold, the sample file can be judged to be a suspicious white sample file or a white sample file.
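The two-threshold decision of step S104 can be sketched as below, using the example values 70% and 50% from the text. The label strings and the function name are hypothetical. The text does not say what happens when the coefficient is exactly equal to a threshold, so the handling of those boundary cases is an assumption marked in the comments.

```python
FIRST_THRESHOLD = 0.70   # example value from the text
SECOND_THRESHOLD = 0.50  # example value from the text


def classify(coefficient, first=FIRST_THRESHOLD, second=SECOND_THRESHOLD):
    """Map a black weight coefficient to a black/white attribute label."""
    if second >= first:
        raise ValueError("second threshold must be less than the first")
    if coefficient > first:
        return "black"
    # Assumption: a coefficient exactly equal to the first threshold
    # is treated as suspicious black.
    if coefficient > second:
        return "suspicious-black"
    # Assumption: a coefficient exactly equal to the second threshold
    # falls on the white side.
    return "white-or-suspicious-white"


if __name__ == "__main__":
    for value in (0.9, 0.6, 0.2):
        print(value, classify(value))
```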

Through the above scheme, the present embodiment judges a sample file comprehensively according to its features or combined features and produces a black weight coefficient. Because the weight of each feature group can be obtained from manual experience and data statistics, the accuracy of sample judgment can be improved.

It should be noted that when the weak feature set extracted from the sample file to be identified is used to query the black and white weight tables, obtain the black and white weights, and calculate the black weight coefficient, the weak feature set can first be built from the one-dimensional features of the sample file. The black and white weights are looked up and the black weight coefficient is calculated. If the calculated coefficient does not reach the required threshold (the sample is not judged to be a black sample file), the weak feature set can then be built from the two-dimensional features of the sample file, the black and white weight tables queried again, the black weight coefficient recalculated, and the black/white attribute of the sample file judged.

Further, the same principle can be extended by increasing the dimension to higher-dimensional combined features such as three-dimensional features, thereby improving the accuracy of sample file judgment.
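The dimension-by-dimension escalation described above might look like the following sketch. Several details are assumptions because the source leaves them open: feature groups are keyed by `frozenset`, the coefficient uses the equal-prior form black / (black + white), groups absent from both tables are skipped, and the scores of several groups are combined by taking the maximum. All names are hypothetical.

```python
from itertools import combinations


def coefficient(black_weight, white_weight):
    """Equal-prior black suspicion degree (assumed form)."""
    total = black_weight + white_weight
    return black_weight / total if total else 0.5


def escalate(features, black_table, white_table, threshold=0.70, max_dim=3):
    """Try 1-D groups first, then 2-D, then 3-D, stopping once the best
    coefficient seen so far exceeds `threshold`.

    Returns (dimension_reached, best_coefficient).
    """
    ordered = sorted(set(features))
    best = 0.0
    for dim in range(1, max_dim + 1):
        for group in combinations(ordered, dim):
            key = frozenset(group)
            if key not in black_table and key not in white_table:
                continue  # group unknown to both tables: no evidence
            score = coefficient(black_table.get(key, 0.0),
                                white_table.get(key, 0.0))
            best = max(best, score)  # assumption: keep the strongest group
        if best > threshold:
            return dim, best
    return max_dim, best
```

In this sketch a sample whose individual features look harmless can still be flagged once a two-dimensional group with a high black weight is reached.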

Of course, as the feature dimension increases, the number of feature combinations grows multiplicatively with the dimension, which greatly increases the computational complexity. In practical applications, the dimension can therefore be limited to three.
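The growth in the number of feature combinations can be made concrete with a short count. The sketch assumes a feature group is an unordered combination of distinct features, so the number of groups of dimension d drawn from n features is the binomial coefficient C(n, d). The function name and the figure of 20 features are illustrative only.

```python
from math import comb


def group_count(n_features, max_dim):
    """Number of feature groups of dimension 1..max_dim among n features."""
    return sum(comb(n_features, dim) for dim in range(1, max_dim + 1))


if __name__ == "__main__":
    for max_dim in (1, 2, 3, 4):
        print(max_dim, group_count(20, max_dim))
```

For 20 candidate features, the totals are 20 groups at one dimension, 210 up to two dimensions, 1350 up to three, and 6195 up to four.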

In addition, because the present invention does not fix the positions from which features are extracted, the efficiency of feature extraction also decreases as the dimension increases.

Through the above scheme, the present embodiment starts from the easily overlooked weak features of a sample file and strengthens the ability to distinguish Trojans and viruses. For example, the prior art cannot use information such as file version information or an icon as a single feature to identify a sample. With the technical means of the present embodiment, however, such a weak feature is combined with other weak features and the black weight coefficient is assessed comprehensively, so that weak features are in effect converted into a strong feature that identifies the sample effectively, which improves the accuracy of sample file judgment.

As shown in Fig. 2, a second embodiment of the present invention proposes a malicious file identification method which, on the basis of the first embodiment above, further comprises the following steps before step S101:

Step S90: select a known black sample set and a known white sample set for training, and extract the weak feature set of each sample;

Step S100: set a weight for each weak feature set, and establish the black and white weight tables.

The present embodiment differs from the embodiment above in that it also includes the scheme for establishing the black and white weight tables.

Specifically, when the black and white weight tables are established, a batch of known black samples and white samples is first collected for training, and the weak feature set of each sample in the black sample set and the white sample set is extracted. The weak feature sets are then weighted, that is, a weight is set for each weak feature set, which yields a white weight library and a black weight library. Finally, the black weight table and the white weight table are established from the black weight library and the white weight library.

The weight for each weak feature set can be set automatically from statistics or set according to manual experience.

Specifically, when weights are set according to manual experience:

For one-dimensional feature weights, if a one-dimensional feature A is sufficient on its own to judge a sample to be black, this one-dimensional feature can be manually given a relatively high weight;

For two-dimensional feature weights, if a sample that contains both feature A and feature B is a virus file, the two-dimensional feature AB can be given a high weight according to manual experience.

When weights are set according to statistics, they are calculated as follows:

The black and white sample sets are analyzed statistically, and the frequency with which a feature group occurs in a given sample set is used as the weight of that feature group in that sample set.
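The statistical weighting just described can be sketched as follows: the weight of a feature group in a sample set is the fraction of samples in that set containing the group. The function name, the `max_dim` limit, and the `frozenset` keys are assumptions made for illustration. The same function would be called once on the black sample set and once on the white sample set to obtain the two tables.

```python
from collections import Counter
from itertools import combinations


def build_weight_table(samples, max_dim=2):
    """Build a weight table from one labelled sample set.

    `samples` is a list of feature collections, one per sample. The
    weight of a feature group is its occurrence frequency in the set.
    """
    counts = Counter()
    for features in samples:
        ordered = sorted(set(features))
        for dim in range(1, max_dim + 1):
            for group in combinations(ordered, dim):
                counts[frozenset(group)] += 1
    total = len(samples)
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


if __name__ == "__main__":
    black_samples = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
    black_table = build_weight_table(black_samples)
    print(black_table[frozenset({"A"})])
```

Manually assigned weights, as in the one-dimensional and two-dimensional cases above, could then overwrite individual entries of the resulting dict.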

Through the above scheme, the present embodiment performs feature extraction on sample files known to be black or white and establishes the black and white weight tables. The black/white attribute of a sample file is identified by a comprehensive judgment over multiple combined features of the sample file, and the weight of each combined feature can be obtained from manual experience and data statistics, which improves the accuracy of malicious sample judgment.

As shown in Fig. 3, a first embodiment of the present invention proposes a malicious file identification device, comprising an extraction module 401, a search module 402, a computing module 403, and an identification module 404, wherein:

the extraction module 401 is configured to extract the weak feature set of the sample file to be identified;

the search module 402 is configured to look up the pre-established black and white weight tables according to the weak feature set to obtain the black weight and the white weight;

the computing module 403 is configured to calculate the black weight coefficient according to the black weight, the white weight, and a predetermined algorithm;

the identification module 404 is configured to identify the black/white attribute of the sample file according to the black weight coefficient.

When a sample file needs to be identified, the extraction module 401 first extracts the weak feature set from the sample file.

As described above, the pre-established black and white weight tables contain the correspondence between weak feature sets and the weights assigned to them. After the weak feature set of the sample file to be identified is extracted, the search module 402 looks up the black and white weight tables for this weak feature set to obtain the corresponding black weight and white weight.

After the black weight and the white weight are obtained, the computing module 403 uses Bayes' theorem to calculate the black weight coefficient. The black weight coefficient is the black suspicion degree of the sample file, and the black/white attribute of the sample file can be judged from it.

P(A \mid B) = \frac{P(A) P(B \mid A)}{P(B)}

After the black weight coefficient of the sample file has been calculated, the identification module 404 compares it with the preset thresholds in order to judge the black/white attribute of the sample file.

As shown in Fig. 4, a second embodiment of the present invention also proposes a malicious file identification device which, on the basis of the embodiment above, further comprises:

a setup module 400, configured to select a known black sample set and a known white sample set for training, extract the weak feature set of each sample, set a weight for each weak feature set, and establish the black and white weight tables.
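One hypothetical way to wire modules 400 to 404 together in software is sketched below. The class name, method names, and the `weak_features` key are invented for illustration. For brevity, each sample's whole weak feature set is treated as a single table key, frequencies are used as weights, and the coefficient uses the equal-prior form black / (black + white); none of these simplifications is prescribed by the source.

```python
class MaliciousFileIdentifier:
    """Illustrative composition of the five modules of Fig. 3 and Fig. 4."""

    def __init__(self, first_threshold=0.70, second_threshold=0.50):
        self.first = first_threshold
        self.second = second_threshold
        self.black_table = {}
        self.white_table = {}

    @staticmethod
    def _frequencies(samples):
        counts = {}
        for features in samples:
            key = frozenset(features)
            counts[key] = counts.get(key, 0) + 1
        return {key: n / len(samples) for key, n in counts.items()}

    def set_up(self, black_samples, white_samples):  # module 400
        self.black_table = self._frequencies(black_samples)
        self.white_table = self._frequencies(white_samples)

    def extract(self, sample):  # module 401
        return frozenset(sample["weak_features"])

    def search(self, key):  # module 402
        return self.black_table.get(key, 0.0), self.white_table.get(key, 0.0)

    def compute(self, black_weight, white_weight):  # module 403
        total = black_weight + white_weight
        return black_weight / total if total else 0.5

    def identify(self, sample):  # module 404
        coefficient = self.compute(*self.search(self.extract(sample)))
        if coefficient > self.first:
            return "black"
        if coefficient > self.second:
            return "suspicious-black"
        return "white-or-suspicious-white"


if __name__ == "__main__":
    device = MaliciousFileIdentifier()
    device.set_up([{"A", "B"}, {"A", "B"}, {"C"}],
                  [{"C"}, {"C"}, {"A", "B"}, {"D"}])
    print(device.identify({"weak_features": ["A", "B"]}))
```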

The above are only preferred embodiments of the present invention and do not thereby limit the scope of its claims. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A malicious file identification method, characterized by comprising:

extracting a weak feature set of a sample file to be identified;

2. The method according to claim 1, characterized in that the step of identifying the black/white attribute of the sample file according to the black weight coefficient comprises:

comparing the black weight coefficient with a first preset threshold;

if the black weight coefficient is greater than the first preset threshold, identifying the sample file as a black sample file.

3. The method according to claim 2, characterized in that the step of identifying the black/white attribute of the sample file according to the black weight coefficient further comprises:

comparing the black weight coefficient with a second preset threshold, the second preset threshold being less than the first preset threshold;

if the black weight coefficient is greater than the second preset threshold and less than the first preset threshold, identifying the sample file as a suspicious black sample file.

4. The method according to claim 1, 2 or 3, characterized in that, before the step of extracting the weak feature set of the sample file to be identified, the method further comprises:

selecting a known black sample set and a known white sample set for training, and extracting the weak feature set of each sample;

setting a weight for each weak feature set, and establishing the black and white weight tables.

5. The method according to claim 4, characterized in that the weak feature set comprises a set of one-dimensional features or a set of multidimensional feature combinations.

6. The method according to claim 4, characterized in that, when the weak feature set is extracted, the range of extracted features comprises internal features and peripheral information of the file; the internal features comprise at least one of the following: version information, compilation information, sensitive strings, and file icon of the sample file; and the peripheral information comprises at least one of the following: the path at which the file is stored on the user's machine, and the file name.

7. A malicious file identification device, characterized by comprising:

8. The device according to claim 7, characterized in that the identification module is further configured to compare the black weight coefficient with a first preset threshold and, if the black weight coefficient is greater than the first preset threshold, to identify the sample file as a black sample file.

9. The device according to claim 8, characterized in that the identification module is further configured to compare the black weight coefficient with a second preset threshold, the second preset threshold being less than the first preset threshold, and, if the black weight coefficient is greater than the second preset threshold and less than the first preset threshold, to identify the sample file as a suspicious black sample file.

10. The device according to claim 7, 8 or 9, characterized by further comprising:

a setup module, configured to select a known black sample set and a known white sample set for training, extract the weak feature set of each sample, set a weight for each weak feature set, and establish the black and white weight tables.

11. The device according to claim 10, characterized in that the weak feature set comprises a set of one-dimensional features or a set of multidimensional feature combinations.