CN102737186B

CN102737186B - Malicious file identification method, device and storage medium

Info

Publication number: CN102737186B
Application number: CN201210213078.2A
Authority: CN
Inventors: 崔精兵; 杨宜; 于涛; 白子潘; 吴家旭
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2012-06-26
Filing date: 2012-06-26
Publication date: 2015-06-17
Anticipated expiration: 2032-06-26
Also published as: CN102737186A

Abstract

The invention discloses a malicious file identification method, a device and a storage medium. The method comprises the steps of adopting a preset learning set consisting of a malicious file and a normal file to generate a machine learning model; reading a file to be detected other than the learning set; converting the file to be detected to a vector; and performing the malicious file identification on the file to be detected which is converted to the vector through the machine learning model. The machine learning model is generated by the preset learning set consisting of the malicious file and the normal file, and the generated machine learning model is used for performing the malicious file identification on the file to be detected except the learning set, so that the virus characteristics can be timely, accurately and efficiently extracted, any discovered malicious file can be immediately processed, and the detection efficiency of the malicious file can be greatly improved.

Description

Malicious file identification method, device and storage medium

Technical field

The present invention relates to the field of Internet technology, in particular to the field of security, and more particularly to a malicious file identification method, device and storage medium.

Background technology

With the development of Internet technology, the spread of viruses is also intensifying. Viruses cause great harm to the security of user information and user property. Developing an antivirus engine that responds quickly, runs efficiently, and achieves a high detection rate and high accuracy has therefore become a focus of current Internet information security.

The virus identification technique adopted by a traditional antivirus engine usually works as follows: an analyst analyzes virus files and extracts virus features; the features are added to a virus database; the antivirus engine scans existing files against the virus database and reports a virus whenever it finds a matching feature.

Traditional virus identification technology has the following drawbacks:

1. It demands a high level of professional skill from the analyst, and the quality of the extracted virus features determines the false alarm rate and the detection rate;

2. Analyzing virus files and extracting virus features is very time-consuming;

3. Efficiency is low: as the number of records in the virus database grows, the time needed to match against every record grows by a geometric multiple;

4. Viruses are not discovered in time. Given the massive number of new virus variants and the limited processing capacity of analysts, some viruses are only discovered or taken seriously, and then handled, once an outbreak occurs, by which time they have already caused considerable harm.

Summary of the invention

The main purpose of the present invention is to provide a malicious file identification method, device and storage medium, with the aim of improving the detection efficiency of malicious files.

To achieve the above purpose, the present invention proposes a malicious file identification method comprising the following steps:

Generating a machine learning model from a preset learning set composed of malicious files and normal files;

Reading a file to be detected that is outside the learning set;

Converting the file to be detected into a vector;

Performing malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

The present invention also proposes a malicious file identification device, comprising:

A model generation module, for generating a machine learning model from a preset learning set composed of malicious files and normal files;

A reading module, for reading a file to be detected that is outside the learning set;

A vector conversion module, for converting the file to be detected into a vector;

An identification module, for performing malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

The present invention also proposes a computer-readable storage medium on which a computer-executable program is stored. After the program is loaded into the memory of a computer, it generates a machine learning model from a preset learning set composed of malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

In the malicious file identification method, device and storage medium proposed by the present invention, a machine learning model is generated from a preset learning set composed of malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Virus features can thus be extracted in a timely, accurate and effective manner, any malicious file that is found can be handled immediately, and the detection efficiency of malicious files is greatly improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a preferred embodiment of the malicious file identification method of the present invention;

Fig. 2 is a schematic flowchart of generating the machine learning model from the preset learning set composed of malicious files and normal files in the preferred embodiment of the method;

Fig. 3 is a schematic flowchart of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the method;

Fig. 4 is a schematic flowchart of one example of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the method;

Fig. 5 is a schematic structural diagram of a preferred embodiment of the malicious file identification device of the present invention;

Fig. 6 is a schematic structural diagram of the model generation module in the preferred embodiment of the device;

Fig. 7 is a schematic structural diagram of the merging and screening unit in the preferred embodiment of the device;

Fig. 8 is a schematic structural diagram of the identification module in the preferred embodiment of the device.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Detailed description

The solution of the embodiments of the present invention is mainly as follows: a machine learning model is generated from a preset learning set composed of malicious files and normal files; a file to be detected that is outside the learning set is read and converted into a vector; and the machine learning model performs malicious file identification on the vectorized file. The timely response and fast processing speed of machine learning are used to improve the detection efficiency of malicious files.

In the present invention, a malicious file may be a virus file or any other malicious file; the following embodiments are illustrated with malicious files. The technical terms involved include:

Black file: a virus file

Black vector: the vector into which a virus file is converted

White file: a normal, non-virus file

White vector: the vector into which a normal, non-virus file is converted

SVM: Support Vector Machine

PE file: an executable file format used on Windows systems

As shown in Fig. 1, a preferred embodiment of the present invention proposes a malicious file identification method, comprising:

Step S101: generate a machine learning model from a preset learning set composed of malicious files and normal files;

Taking the Windows system as an example: in order to scan files on a Windows system for viruses, this embodiment first uses known virus files and non-virus files (that is, the malicious files and normal files referred to in this embodiment) to generate a machine learning model. The model is then used to identify viruses among the files on the Windows system, which solves the problem of classifying the files in the system as virus files or normal files.

The known virus files and non-virus files can be collected in advance by virus analysts to form a learning set. After feature extraction, dimension merging and screening have been performed on each virus file and normal file in the learning set, a classifier learns from the vectors of those files, and the machine learning model is finally generated.

Specifically, the malicious files and normal files in the learning set are first converted into vectors; that is, effective sample features are extracted from each malicious file and each normal file in the learning set.

For an executable file (PE file), the features that help identify viruses include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

This embodiment pairs each feature key with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
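The (key:value) representation described above can be sketched as a sparse dictionary. This is only an illustration: the feature names, the `to_sparse_vector` helper, and the choice to sum values for repeated keys are assumptions, not details given in the patent.

```python
from typing import Dict, Iterable, Tuple


def to_sparse_vector(features: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Collect (key, value) feature pairs into a sparse vector keyed by feature name."""
    vector: Dict[str, float] = {}
    for key, value in features:
        # Assumption: a key seen more than once has its values summed.
        vector[key] = vector.get(key, 0.0) + value
    return vector


# Hypothetical features that an extractor might emit for one PE file.
sample_features = [
    ("string:cmd.exe", 1.0),
    ("import:CreateRemoteThread", 1.0),
    ("string:cmd.exe", 1.0),
]
sample_vector = to_sparse_vector(sample_features)
```

Each distinct key is one dimension, so two files generally produce vectors with different dimension sets.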

Feature extraction thus converts a file into a multi-dimensional vector with an unfixed number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensions is to merge the dimensions (keys) of all files; if a given file lacks a certain dimension, its value for that dimension is set to 0. A massive number of files, however, yields a massive number of dimensions, and the curse of dimensionality arises, so these dimensions need to be merged and screened. Finally, the classifier learns from the merged and screened vectors and generates the machine learning model.
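A minimal sketch of fixing the dimensions by merging the keys of all files and filling missing dimensions with 0 follows. The helper names and the alphabetical ordering of dimensions are assumptions made for illustration; the screening step that limits the number of dimensions is not shown here.

```python
from typing import Dict, List

Vector = Dict[str, float]


def merge_dimensions(vectors: List[Vector]) -> List[str]:
    """Return the union of all keys, in a stable (sorted) order."""
    keys = set()
    for vector in vectors:
        keys.update(vector)
    return sorted(keys)


def to_dense(vector: Vector, dimensions: List[str]) -> List[float]:
    """Project a sparse vector onto fixed dimensions; missing keys become 0."""
    return [vector.get(key, 0.0) for key in dimensions]


files = [{"a": 1.0, "b": 2.0}, {"b": 3.0, "c": 4.0}]
dims = merge_dimensions(files)
dense = [to_dense(v, dims) for v in files]
```

With many files the merged key list grows without bound, which is the motivation the text gives for merging and screening dimensions.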

Step S102: read a file to be detected that is outside the learning set;

Step S103: convert the file to be detected into a vector;

Step S104: perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

In steps S102 to S104, when a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on the vectorized file.

As a preferred implementation, taking a PC as an example, the machine learning model generated in step S101 can be applied to the virus scanning engine on the PC front end to scan the user's PC for viruses. The specific process is as follows:

1. Read the file to be detected on the PC;

2. Convert the file to be detected that was read from the PC into a vector.

As described above, the classifier learns from the vectors of each virus file and normal file in the learning set to generate the machine learning model, so the objects handled by the machine learning model must be vectors. In this embodiment, therefore, when a file to be detected is read from the PC system, it must be converted into a vector; that is, effective sample features are extracted from the file to be detected. These effective sample features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

Each feature key is then paired with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.

3. Use the machine learning model to perform malicious file identification on the file to be detected on the PC that has been converted into a vector.

The vectorized file to be detected on the PC is submitted to the machine learning model for judgment, which distinguishes virus files from normal files. Specifically: the machine learning model performs a linear function calculation on the vectorized file to be detected; whether the file is malicious or normal is determined from the calculation result, and the malicious files and normal files among the files to be detected are output accordingly.
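The linear function calculation can be sketched as a weighted sum plus a bias, with the sign of the result deciding the class. The weights, the bias, and the convention that a positive score means "malicious" are illustrative assumptions; the patent does not state them.

```python
from typing import List


def linear_score(weights: List[float], bias: float, x: List[float]) -> float:
    """Compute w . x + b for a vector x with fixed dimensions."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights: List[float], bias: float, x: List[float]) -> str:
    """Assumed convention: positive score means malicious, otherwise normal."""
    return "malicious" if linear_score(weights, bias, x) > 0 else "normal"


# Hypothetical model over two dimensions.
model_weights = [1.0, -1.0]
model_bias = -0.5
first = classify(model_weights, model_bias, [1.0, 0.0])
second = classify(model_weights, model_bias, [0.0, 1.0])
```

Because the model is a single linear function, classifying one file costs time proportional to the number of dimensions K rather than to the size of a signature database.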

Specifically, as shown in Fig. 2, step S101 of generating the machine learning model from the preset learning set composed of malicious files and normal files comprises:

Step S1011: convert the malicious files and normal files in the learning set into vectors;

Converting the malicious files and normal files in the learning set into vectors means extracting effective sample features from each malicious file and each normal file in the learning set.

Step S1012: perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

Feature extraction converts a file into a multi-dimensional vector with an unfixed number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensions is to merge the dimensions (keys) of all files; if a given file lacks a certain dimension, its value for that dimension is set to 0. A massive number of files, however, yields a massive number of dimensions, and the curse of dimensionality arises, so these dimensions need to be merged and screened.

This embodiment specifically merges and screens the dimensions down to K dimensions, where the K dimensions are the top K dimensions selected from the many dimensions, according to certain rules, through merging and screening. This is described in detail below with reference to Fig. 3.

Step S1013: the classifier learns from the merged and screened vectors and generates the machine learning model.

In this embodiment, the classifier may be a linear classifier. A so-called linear SVM is one whose kernel function is the inner product function. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine that belongs to the family of generalized linear classifiers; its characteristic is that it minimizes the empirical error while maximizing the geometric margin. Applying an SVM to virus identification means solving the problem of classifying virus files and normal files.

The SVM learns from the merged and screened vectors, which generates the machine learning model.

Of course, in other embodiments, other machine learning methods may be used for the discrimination instead, without using an SVM.
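As a rough illustration of how a linear SVM could learn from fixed-dimension vectors, the sketch below runs sub-gradient descent on the regularized hinge loss. The patent does not describe a training procedure; this optimizer, the hyperparameters, and the toy data are assumptions, and a production system would more likely use an established SVM solver.

```python
from typing import List, Tuple


def train_linear_svm(samples: List[List[float]], labels: List[int],
                     epochs: int = 200, lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Train w, b by sub-gradient descent on hinge loss; labels are +1 (black) or -1 (white)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # The sample violates the margin: move the boundary toward classifying it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy data: black vectors (+1) have the first dimension set, white vectors (-1) do not.
train_x = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
train_y = [1, 1, -1, -1]
svm_w, svm_b = train_linear_svm(train_x, train_y)
predictions = [predict(svm_w, svm_b, x) for x in train_x]
```

The learned `svm_w` and `svm_b` together play the role of the machine learning model: identification reduces to evaluating one linear function per file.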

More specifically, as shown in Fig. 3, let the vectors of all malicious files in the learning set form the black vector set, and let the vectors of all normal files form the white vector set. Step S1012 of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set then comprises:

Step S10: randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as the black dimension set; randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as the white dimension set;

Step S11: remove from the black dimension set every dimension that appears in the white dimension set, forming a new black dimension set, and assign a weight to each dimension in the white dimension set and in the new black dimension set;

In steps S10 and S11, the dimensions are merged and screened down to K dimensions in the following way:

The problem of merging and screening the dimensions of the entire black vector set and white vector set is split into subproblems of two black vectors and two white vectors each. Each subproblem is then solved: the dimensions common to the two white vectors are extracted (by taking the intersection) as the white dimension set of the subproblem, the dimensions common to the two black vectors are extracted as the black dimension set of the subproblem, every dimension that appears in the white dimension set is removed from the black dimension set, and a weight is assigned to each selected black and white dimension.
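One subproblem, as described above, can be sketched with set operations. The `solve_subproblem` name and the uniform weight of 1.0 per selected dimension are assumptions; the patent does not say how weights are assigned.

```python
from typing import Dict, Set, Tuple

Vector = Dict[str, float]


def solve_subproblem(b1: Vector, b2: Vector, w1: Vector, w2: Vector,
                     weight: float = 1.0) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return weighted (new black dimension set, white dimension set) for one subproblem."""
    black_common: Set[str] = set(b1) & set(b2)   # dimensions shared by both black vectors
    white_common: Set[str] = set(w1) & set(w2)   # dimensions shared by both white vectors
    new_black = black_common - white_common      # drop dimensions that also look "white"
    # Assumption: every selected dimension receives the same initial weight.
    return ({key: weight for key in new_black},
            {key: weight for key in white_common})


black_dims, white_dims = solve_subproblem(
    {"inject": 1.0, "kernel32": 1.0, "pack": 1.0},
    {"inject": 1.0, "kernel32": 1.0},
    {"kernel32": 1.0, "gui": 1.0},
    {"kernel32": 1.0, "net": 1.0},
)
```

Here `kernel32` is common to both black vectors but is removed because it is also common to the white vectors, leaving `inject` as the only black dimension.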

Step S12: merge the white dimension set and the new black dimension set, respectively, by dimension according to weight, and discard any dimension whose weight after merging is below a preset weight threshold;

The solutions of all subproblems are merged by dimension. A weight threshold w is set for the merging process: if the weight of a dimension after merging (the weights of the same dimension are added together during merging) is below w, the dimension is discarded directly, which prevents the dimension set from growing without limit.
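The merge-and-discard rule can be sketched as follows. The function name and the sample weights are hypothetical; the only behavior taken from the text is that weights of the same dimension are added and dimensions below the threshold w are dropped.

```python
from typing import Dict


def merge_and_prune(total: Dict[str, float], partial: Dict[str, float],
                    threshold: float) -> Dict[str, float]:
    """Add the weights of one subproblem into the running set, then drop light dimensions."""
    merged = dict(total)
    for key, weight in partial.items():
        merged[key] = merged.get(key, 0.0) + weight
    # Dimensions whose merged weight is below the threshold are discarded.
    return {key: weight for key, weight in merged.items() if weight >= threshold}


running = {"a": 1.0, "b": 0.2}
update = {"a": 1.0, "c": 0.3}
pruned = merge_and_prune(running, update, threshold=0.5)
```

How aggressively this prunes depends on how the per-subproblem weights compare with the threshold, which the patent leaves open.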

Step S13: determine whether all vectors in the black vector set and in the white vector set have been processed; if so, proceed to step S14; otherwise, return to step S10;

Step S14: filter the merged black dimension set with the merged white dimension set;

Step S15: sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;

In steps S13 to S15, once all vectors in the black vector set and the white vector set have been learned, the black dimension set is filtered with the merged white dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.

Step S16: convert all vectors in the black vector set and the white vector set into K-dimensional vectors.

All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector, so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
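Steps S14 to S16 can be sketched as a filter, a sort, and a projection. The helper names and the alphabetical tie-break between equally weighted dimensions are assumptions added to make the sketch deterministic.

```python
from typing import Dict, List


def select_top_k(black: Dict[str, float], white: Dict[str, float], k: int) -> List[str]:
    """Filter black dimensions by the white set, then keep the K heaviest."""
    filtered = {key: w for key, w in black.items() if key not in white}
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ranked = sorted(filtered, key=lambda key: (-filtered[key], key))
    return ranked[:k]


def to_k_dim(vector: Dict[str, float], dims: List[str]) -> List[float]:
    """Convert a sparse file vector into the standard K-dimensional form."""
    return [vector.get(key, 0.0) for key in dims]


merged_black = {"a": 5.0, "b": 3.0, "c": 4.0, "d": 1.0}
merged_white = {"c": 2.0}
final_dims = select_top_k(merged_black, merged_white, k=2)
k_vector = to_k_dim({"b": 7.0, "z": 1.0}, final_dims)
```

Both black and white file vectors go through the same `to_k_dim` projection, so the classifier always sees vectors of length K.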

The process of merging and screening the vectors of all virus files and normal files in the learning set is described in detail below with a concrete example.

As shown in Fig. 4, let FB and FW denote the overall black and white sets respectively, let FBL and FWL denote the common dimension sets of the black and white vectors respectively, let B1 and B2 denote two black vectors randomly selected from the black vector set, and let W1 and W2 denote two white vectors randomly selected from the white vector set. The process of merging and screening the vectors of all virus files and normal files in the learning set is then as follows:

S1: initialize FB and FW, and select the black or white vector set; if the black vector set is selected, proceed to step S2; if the white vector set is selected, proceed to step S3;

S2: determine whether all black vectors in the black vector set have been marked; if so, proceed to step S4; otherwise, proceed to step S21;

S21: randomly select two black vectors B1 and B2;

S22: extract the common dimension set FBL and assign a weight to each dimension; proceed to S23;

S3: determine whether all white vectors in the white vector set have been marked; if so, proceed to step S4; otherwise, proceed to step S31;

S31: randomly select two white vectors W1 and W2;

S32: extract the common dimension set FWL and assign a weight to each dimension; proceed to S23;

S23: take the set difference of FBL and FWL as the new FBL;

S24: merge the new FBL and FWL into the overall sets FB and FW respectively, adding the weights together during merging;

S25: remove from FB and FW every dimension whose weight is less than w-limit (the preset weight threshold); return to steps S2 and S3 respectively.

S4: take the set difference of FB and FW as the new FB;

S5: sort FB by weight and take the top K dimensions to obtain the final FB result.
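The flow S1 to S5 can be sketched end to end as below. Several points are assumptions rather than details from the patent: black and white pairs are processed in lockstep, each pair contributes a weight of 1 per common dimension, a leftover unpaired vector is skipped, and a seeded random generator stands in for the random selection.

```python
import random
from typing import Dict, List, Set

Vector = Dict[str, float]


def _pair_common(pool: List[Vector]) -> Set[str]:
    """Pop two vectors (marking them as processed) and return their shared keys."""
    if len(pool) < 2:
        return set()
    return set(pool.pop()) & set(pool.pop())


def _merge(total: Dict[str, float], keys: Set[str], w_limit: float) -> Dict[str, float]:
    for key in keys:
        total[key] = total.get(key, 0.0) + 1.0  # assumed weight of 1 per subproblem
    return {key: w for key, w in total.items() if w >= w_limit}


def select_dimensions(blacks: List[Vector], whites: List[Vector], k: int,
                      w_limit: float = 1.0, seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    black_pool, white_pool = list(blacks), list(whites)
    rng.shuffle(black_pool)          # random selection of pairs (S21, S31)
    rng.shuffle(white_pool)
    fb: Dict[str, float] = {}
    fw: Dict[str, float] = {}
    while len(black_pool) >= 2 or len(white_pool) >= 2:
        fbl = _pair_common(black_pool)   # S22
        fwl = _pair_common(white_pool)   # S32
        fbl -= fwl                       # S23
        fb = _merge(fb, fbl, w_limit)    # S24 and S25
        fw = _merge(fw, fwl, w_limit)
    final = {key: w for key, w in fb.items() if key not in fw}   # S4
    ranked = sorted(final, key=lambda key: (-final[key], key))   # S5
    return ranked[:k]


black_set = [{"evil": 1.0, "common": 1.0, "x": 1.0}, {"evil": 1.0, "common": 1.0},
             {"evil": 1.0, "common": 1.0, "x": 1.0}, {"evil": 1.0, "common": 1.0}]
white_set = [{"common": 1.0, "ok": 1.0} for _ in range(4)]
top_one = select_dimensions(black_set, white_set, k=1)
```

In this toy data every black pair shares `evil`, so it accumulates the highest weight, while `common` is removed because it also appears in every white pair.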

This embodiment generates a machine learning model from a preset learning set composed of malicious files and normal files, and uses the generated model to perform malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without the participation of an analyst. Because machine learning responds in a timely manner, virus features can be extracted accurately and effectively, any malicious file that is found can be handled immediately, and the detection efficiency of malicious files is greatly improved.

As shown in Fig. 5, a preferred embodiment of the present invention proposes a malicious file identification device, comprising: a model generation module 501, a reading module 502, a vector conversion module 503 and an identification module 504, wherein:

The model generation module 501 is used to generate a machine learning model from a preset learning set composed of malicious files and normal files;

The reading module 502 is used to read a file to be detected that is outside the learning set;

The vector conversion module 503 is used to convert the file to be detected into a vector;

The identification module 504 is used to perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

The known virus files and non-virus files can be collected in advance by virus analysts to form a learning set. After the model generation module 501 has performed feature extraction, dimension merging and screening on each virus file and normal file in the learning set, a classifier learns from the vectors of those files, and the machine learning model is finally generated.

When a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on the vectorized file.

As a preferred implementation, taking a PC as an example, the machine learning model generated by the model generation module 501 can be applied to the virus scanning engine on the PC front end to scan the user's PC for viruses. The specific process is as follows:

1. Read the file to be detected on the PC;

2. Convert the file to be detected that was read from the PC into a vector.

As described above, the classifier learns from the vectors of each virus file and normal file in the learning set to generate the machine learning model, so the objects handled by the machine learning model must be vectors. In this embodiment, therefore, when the reading module 502 reads a file to be detected from the PC system, the vector conversion module 503 must convert the file into a vector; that is, effective sample features are extracted from the file to be detected. These effective sample features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

The identification module 504 submits the vectorized file to be detected on the PC to the machine learning model for judgment, which distinguishes virus files from normal files. Specifically: the machine learning model performs a linear function calculation on the vectorized file to be detected; whether the file is malicious or normal is determined from the calculation result, and the malicious files and normal files among the files to be detected are output accordingly.

Specifically, as shown in Fig. 6, the model generation module 501 comprises: a vector conversion unit 5011, a merging and screening unit 5012 and a generation unit 5013, wherein:

The vector conversion unit 5011 is used to convert the malicious files and normal files in the learning set into vectors;

The merging and screening unit 5012 is used to perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

The generation unit 5013 is used to have the classifier learn from the merged and screened vectors and to generate the machine learning model.

In this embodiment, converting the malicious files and normal files in the learning set into vectors means extracting effective sample features from each malicious file and each normal file in the learning set.

Feature extraction converts a file into a multi-dimensional vector with an unfixed number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensions is to merge the dimensions (keys) of all files; if a given file lacks a certain dimension, its value for that dimension is set to 0. A massive number of files, however, yields a massive number of dimensions, and the curse of dimensionality arises, so these dimensions need to be merged and screened. This embodiment specifically merges and screens the dimensions down to K dimensions, where the K dimensions are the top K dimensions selected from the many dimensions, according to certain rules, through merging and screening.

More specifically, as shown in Fig. 7, let the vectors of all malicious files in the learning set form the black vector set, and let the vectors of all normal files form the white vector set. The merging and screening unit 5012 then comprises: a first extraction subunit 50121, a screening subunit 50122, a merging subunit 50123, a filtering subunit 50124, a second extraction subunit 50125 and a conversion subunit 50126, wherein:

The first extraction subunit 50121 is used to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as the black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as the white dimension set;

The screening subunit 50122 is used to remove from the black dimension set every dimension that appears in the white dimension set, forming a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;

The merging subunit 50123 is used to merge the white dimension set and the new black dimension set, respectively, by dimension according to weight, and to discard any dimension whose weight after merging is below a preset weight threshold;

The filtering subunit 50124 is used to filter the merged black dimension set with the merged white dimension set once all vectors in the black vector set and the white vector set have been processed;

The second extraction subunit 50125 is used to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;

The conversion subunit 50126 is used to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.

In this embodiment, the dimensions are merged and screened down to K dimensions in the following way:

Once all vectors in the black vector set and the white vector set have been learned, the black dimension set is filtered with the merged white dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.

In addition, as shown in Fig. 8, the identification module 504 comprises: a computing unit 5041 and an output unit 5042, wherein:

The computing unit 5041 is used to perform, by means of the machine learning model, a calculation on the vectorized file to be detected and obtain a calculation result;

The output unit 5042 is used to output the malicious files and normal files among the files to be detected according to the calculation result.

In the malicious file identification method, device and storage medium of the embodiments of the present invention, a machine learning model is generated from a preset learning set composed of malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without the participation of an analyst. Because machine learning responds in a timely manner, virus features can be extracted accurately and effectively, any malicious file that is found can be handled immediately, and the detection efficiency of malicious files is greatly improved.

In addition, the present invention also proposes a computer-readable storage medium on which a computer-executable program is stored. After the program is loaded into the memory of a computer, it generates a machine learning model from a preset learning set composed of malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.

It should be noted that the above embodiments of the present invention are all illustrated with the Windows operating system but are not limited to it. Other operating systems, such as Mac or Linux systems, can likewise adopt the above scheme of the present invention to detect and identify malicious files; the specific principles are not repeated here.

The foregoing describes only preferred embodiments of the present invention and does not thereby limit the scope of the claims. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. A malicious file identification method, characterized in that it comprises the following steps:

reading a file to be detected outside the learning set;

converting the file to be detected into a vector; that is, extracting effective sample features from the file to be detected and combining the key of each feature with the value of that feature into a (key:value) pair, so that a file to be detected becomes a set of (key:value) pairs and is converted into a multi-dimensional vector whose number of dimensions is not fixed;

performing malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector; wherein the step of generating the machine learning model from the learning set composed of predetermined malicious files and normal files comprises:

converting the malicious files and the normal files in the learning set into vectors respectively;

performing dimension merging and screening on the vectors of the malicious files and the normal files in the learning set;

learning, by means of a classifier, from the merged and screened vectors to generate the machine learning model; wherein the vectors of all malicious files in the learning set are designated as a black vector set and the vectors of all normal files as a white vector set, and the step of performing dimension merging and screening on the vectors of the malicious files and the normal files in the learning set comprises:

randomly selecting two black vectors from the black vector set and extracting the dimensions common to the two black vectors as a black dimension set; randomly selecting two white vectors from the white vector set and extracting the dimensions common to the two white vectors as a white dimension set;

removing from the black dimension set all dimensions that appear in the white dimension set to form a new black dimension set, and assigning a weight to each dimension in the white dimension set and in the new black dimension set;

merging the dimensions of the white dimension set and of the new black dimension set respectively according to weight, and discarding the dimensions whose merged weight is lower than a predefined weight threshold; and repeating the above three steps until all vectors in the black vector set and the white vector set have been processed.
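The three-step loop recited in claim 1 can be sketched as follows. This is only one possible reading under stated assumptions: each dimension receives a weight of 1.0 per round in which it appears, merging means summing these weights, vectors are paired without replacement (an odd leftover vector is skipped), and the weight threshold is applied once to the accumulated totals after the loop rather than in every round. None of these choices is specified by the claim, and the names `random_pairs`, `shared_dims` and `merge_dimensions` are hypothetical.

```python
import random
from itertools import zip_longest


def random_pairs(vectors, rng):
    """Shuffle and pair vectors without replacement; an odd leftover is skipped."""
    shuffled = list(vectors)
    rng.shuffle(shuffled)
    return list(zip(shuffled[0::2], shuffled[1::2]))


def shared_dims(pair):
    """Keys present in both {key: value} vectors of a pair (empty if no pair)."""
    return set(pair[0]) & set(pair[1]) if pair else set()


def merge_dimensions(black_vectors, white_vectors, threshold=2.0, seed=0):
    rng = random.Random(seed)
    black_weights, white_weights = {}, {}
    rounds = zip_longest(random_pairs(black_vectors, rng),
                         random_pairs(white_vectors, rng))
    for black_pair, white_pair in rounds:
        # Step 1: dimensions common to the two vectors of each pair.
        black_dims = shared_dims(black_pair)
        white_dims = shared_dims(white_pair)
        # Step 2: drop black dimensions that also appear in the white set.
        new_black_dims = black_dims - white_dims
        # Step 3: merge by summing an assumed weight of 1.0 per appearance.
        for dim in new_black_dims:
            black_weights[dim] = black_weights.get(dim, 0.0) + 1.0
        for dim in white_dims:
            white_weights[dim] = white_weights.get(dim, 0.0) + 1.0

    def keep(weights):
        # Assumed simplification: threshold applied to the accumulated totals.
        return {d: w for d, w in weights.items() if w >= threshold}

    return keep(black_weights), keep(white_weights)
```

With four identical black vectors over dimensions `a`, `b`, `x` and two identical white vectors over `x`, `y`, this sketch keeps `a` and `b` on the black side at the default threshold, since `x` is removed in the round where a white pair is available and accumulates too little weight otherwise.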

2. The method according to claim 1, characterized in that the step of performing dimension merging and screening on the vectors of the malicious files and the normal files in the learning set further comprises:

after all vectors in the black vector set and the white vector set have been processed, filtering the merged black dimension set with the merged white dimension set;

sorting the filtered black dimension set by weight, and taking the K highest-ranked black dimensions as the final dimensions;

converting all vectors in the black vector set and the white vector set into K-dimensional vectors.
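The final selection recited in claim 2 might look like the sketch below. It assumes that "filtering" means removing black dimensions that also occur in the merged white dimension set, that ties in weight are broken by dimension name, and that a dimension missing from a file's vector is filled with 0.0; these details and the function names are illustrative assumptions, not taken from the claim.

```python
def select_final_dimensions(black_weights, white_weights, k):
    """Return the K highest-weighted black dimensions not present in the white set."""
    filtered = {d: w for d, w in black_weights.items() if d not in white_weights}
    # Sort by descending weight; break ties by name so the order is deterministic.
    ranked = sorted(filtered, key=lambda d: (-filtered[d], d))
    return ranked[:k]


def to_k_dim(vector, final_dims):
    """Project a sparse {key: value} vector onto the fixed list of K dimensions."""
    return [float(vector.get(d, 0.0)) for d in final_dims]
```

After this step every file, whether from the black vector set, the white vector set, or a later file to be detected, is represented by a list of the same length K, which is the form a conventional classifier expects.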

3. method according to claim 1 and 2, is characterized in that, is describedly comprised the step that the file to be detected changing into vector carries out malicious file identification by machine learning model:

Result of calculation is obtained by machine learning model to changing into the file to be detected after vector;

Malicious file in file to be detected and normal file is exported according to result of calculation.
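To make the identification step concrete, the sketch below uses a nearest-centroid classifier as a stand-in model. The claims do not name a specific classifier, so this choice, the labels `"malicious"` and `"normal"`, and the function names are assumptions for illustration; any classifier trained on the K-dimensional vectors could take its place.

```python
def train_centroids(black_rows, white_rows):
    """Build a toy model: the mean K-dimensional vector of each class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return {"malicious": mean(black_rows), "normal": mean(white_rows)}


def classify(model, vec):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], vec))


def identify(model, files):
    """Split {name: K-dim vector} into lists of malicious and normal file names."""
    result = {"malicious": [], "normal": []}
    for name, vec in files.items():
        result[classify(model, vec)].append(name)
    return result
```

Here the "calculation result" is the predicted label for each vector, and `identify` groups the files to be detected into malicious and normal outputs accordingly.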

4. The method according to claim 3, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.

5. A malicious file identification device, characterized in that it comprises:

a reading module, configured to read a file to be detected outside the learning set;

a vector conversion module, configured to convert the file to be detected into a vector; that is, to extract effective sample features from the file to be detected and combine the key of each feature with the value of that feature into a (key:value) pair, so that a file to be detected becomes a set of (key:value) pairs and is converted into a multi-dimensional vector whose number of dimensions is not fixed;

an identification module, configured to perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector; wherein the model generation module comprises:

a vector conversion unit, configured to convert the malicious files and the normal files in the learning set into vectors respectively;

a merging and screening unit, configured to perform dimension merging and screening on the vectors of the malicious files and the normal files in the learning set;

a generation unit, configured to learn, by means of a classifier, from the merged and screened vectors and to generate the machine learning model; wherein the vectors of all malicious files in the learning set are designated as a black vector set and the vectors of all normal files as a white vector set, and the merging and screening unit comprises:

a first extraction subunit, configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;

a screening subunit, configured to remove from the black dimension set all dimensions that appear in the white dimension set to form a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;

a merging subunit, configured to merge the dimensions of the white dimension set and of the new black dimension set respectively according to weight, and to discard the dimensions whose merged weight is lower than a predefined weight threshold;

a filtering subunit, configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;

a second extraction subunit, configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;

a conversion subunit, configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.

6. The device according to claim 5, characterized in that the identification module comprises:

a computing unit, configured to obtain a calculation result by applying the machine learning model to the file to be detected that has been converted into a vector;

an output unit, configured to output, according to the calculation result, the malicious files and the normal files among the files to be detected.

7. The device according to claim 6, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.