CN102737186A - Malicious file identification method, device and storage medium

Info

Publication number: CN102737186A
Application number: CN2012102130782A
Authority: CN
Inventors: 崔精兵; 杨宜; 于涛; 白子潘; 吴家旭
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2012-06-26
Filing date: 2012-06-26
Publication date: 2012-10-17
Anticipated expiration: 2032-06-26
Also published as: CN102737186B

Abstract

The invention discloses a malicious file identification method, a device and a storage medium. The method comprises the steps of adopting a preset learning set consisting of a malicious file and a normal file to generate a machine learning model; reading a file to be detected other than the learning set; converting the file to be detected to a vector; and performing the malicious file identification on the file to be detected which is converted to the vector through the machine learning model. The machine learning model is generated by the preset learning set consisting of the malicious file and the normal file, and the generated machine learning model is used for performing the malicious file identification on the file to be detected except the learning set, so that the virus characteristics can be timely, accurately and efficiently extracted, any discovered malicious file can be immediately processed, and the detection efficiency of the malicious file can be greatly improved.

Description

Malicious file identification method, device and storage medium

Technical field

The present invention relates to the field of Internet technology, in particular to the field of security, and more specifically to a malicious file identification method, device and storage medium.

Background art

With the development of Internet technology, the spread of viruses is also intensifying. Viruses cause great harm to the security of user information and to users' property. Developing an antivirus engine that responds quickly, runs efficiently, and has a high detection rate and high accuracy has therefore become a focus of current Internet information security work.

The virus identification technique that a traditional antivirus engine usually adopts is as follows: an analyst analyzes a virus file, extracts virus features, and adds them to a virus database; the antivirus engine then scans existing files against the virus database and reports a virus whenever a matching feature is found.

Traditional virus identification has the following drawbacks:

1. It places high demands on the analyst's professional skill, and the quality of the extracted virus features determines the false positive rate and the detection rate;

2. Analyzing virus files and extracting virus features is very time-consuming;

3. Efficiency is low: as the number of records in the virus database grows, the time needed to match against every record increases geometrically;

4. Viruses are not discovered in time. Relative to the massive number of new virus variants, analysts' processing capacity is limited; some viruses are noticed or taken seriously only once they break out, and by the time they are handled they have already caused considerable harm.

Summary of the invention

The main purpose of the present invention is to provide a malicious file identification method, device and storage medium, intended to improve the detection efficiency for malicious files.

To achieve the above purpose, the present invention proposes a malicious file identification method comprising the following steps:

generating a machine learning model from a learning set composed of predetermined malicious files and normal files;

reading a file to be detected that is outside the learning set;

converting the file to be detected into a vector;

performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.

The present invention also proposes a malicious file identification device, comprising:

a model generation module, configured to generate a machine learning model from a learning set composed of predetermined malicious files and normal files;

a reading module, configured to read a file to be detected that is outside the learning set;

a vector conversion module, configured to convert the file to be detected into a vector;

an identification module, configured to perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.

The present invention also proposes a computer-readable storage medium on which a computer-executable program is stored. After the program is loaded into the memory of a computer, it generates a machine learning model from a learning set composed of predetermined malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.

With the malicious file identification method, device and storage medium proposed by the present invention, a machine learning model is generated from a learning set composed of predetermined malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Virus features can thus be extracted promptly, accurately and effectively, and any malicious file that is found can be handled immediately, which greatly improves the detection efficiency for malicious files.

Description of drawings

Fig. 1 is a schematic flowchart of a preferred embodiment of the malicious file identification method of the present invention;

Fig. 2 is a schematic flowchart of generating a machine learning model from a learning set composed of predetermined malicious files and normal files in the preferred embodiment of the malicious file identification method of the present invention;

Fig. 3 is a schematic flowchart of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the malicious file identification method of the present invention;

Fig. 4 is a schematic flowchart of one example of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the malicious file identification method of the present invention;

Fig. 5 is a schematic structural diagram of a preferred embodiment of the malicious file identification device of the present invention;

Fig. 6 is a schematic structural diagram of the model generation module in the preferred embodiment of the malicious file identification device of the present invention;

Fig. 7 is a schematic structural diagram of the merging and screening unit in the preferred embodiment of the malicious file identification device of the present invention;

Fig. 8 is a schematic structural diagram of the identification module in the preferred embodiment of the malicious file identification device of the present invention.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Embodiment

The solution of the embodiments of the present invention is mainly as follows: a machine learning model is generated from a learning set composed of predetermined malicious files and normal files; a file to be detected that is outside the learning set is read and converted into a vector; and malicious file identification is performed on the converted file through the machine learning model. This exploits the prompt response and fast processing speed of machine learning to improve the detection efficiency for malicious files.

In the present invention a malicious file may be a virus file or any other malicious file; the following embodiments use malicious files for illustration. The technical terms involved include:

Black file: a virus file

Black vector: the vector into which a virus file is converted

White file: a normal, non-virus file

White vector: the vector into which a normal, non-virus file is converted

SVM: support vector machine

PE file: an executable file format on the Windows system

As shown in Fig. 1, a preferred embodiment of the present invention proposes a malicious file identification method, comprising:

Step S101: generating a machine learning model from a learning set composed of predetermined malicious files and normal files;

Taking the Windows system as an example, in order to scan files on a Windows system for viruses, this embodiment first uses known virus files and non-virus files (the malicious files and normal files referred to in this embodiment) to generate a machine learning model. The model is then used to perform virus identification on files on the Windows system, which solves the problem of classifying files in the system as virus files or normal files.

The known virus files and non-virus files can be collected in advance by virus analysts and assembled into a learning set. After feature extraction, dimension merging and screening have been performed on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.

Specifically, the malicious files and normal files in the learning set are first converted into vectors; that is, effective sample features are extracted from each malicious file and normal file in the learning set.

For an executable file (PE file), the features that help identify viruses include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

This embodiment pairs the key of each such feature with the value of that feature to form a (key:value) pair, so that a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
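
As a rough illustration of the (key:value) representation described above, the following Python sketch turns a few already-extracted features into a sparse mapping. The helper name `to_feature_map`, the key prefixes and the sample feature names are hypothetical; the description does not specify how keys are named or how values are computed, so valuing strings and imports by occurrence count is only an assumption.

```python
from collections import Counter


def to_feature_map(strings, imports, section_attrs):
    """Build a sparse {key: value} map for one file.

    Assumption: string and import features are valued by occurrence count,
    and section attributes are passed through as numeric values.
    """
    features = Counter()
    for text in strings:
        features["str:" + text] += 1
    for name in imports:
        features["imp:" + name] += 1
    for section, value in section_attrs.items():
        features["sec:" + section] = value
    return dict(features)


# Hypothetical features of a single file.
sample = to_feature_map(
    strings=["cmd.exe", "cmd.exe"],
    imports=["CreateRemoteThread"],
    section_attrs={".text.entropy": 6.5},
)
```

Because different files yield different keys, two such maps generally have different sets of dimensions, which is the "unfixed dimension" property noted in the text.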

Through feature extraction, a file is converted into a multi-dimensional vector whose number of dimensions is not fixed. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. One way to fix the dimensions is to merge the dimensions (keys) of all files and, where a file lacks a given dimension, set its value to 0. For a massive number of files this yields a massive number of dimensions, and the curse of dimensionality arises. These dimensions therefore need to be merged and screened; finally, the classifier learns from the merged and screened vectors and generates the machine learning model.
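
The naive way of fixing the dimensions mentioned above (take the union of all keys and fill absent ones with 0) can be sketched in a few lines of Python. The function name `to_fixed_vectors` is hypothetical, and sorting the keys is only an assumption made to obtain a stable dimension order.

```python
def to_fixed_vectors(feature_maps):
    """Merge the keys of all files into one dimension list and zero-fill gaps."""
    # Union of every key seen in any file; sorted only to make the order stable.
    dimensions = sorted({key for fmap in feature_maps for key in fmap})
    vectors = [[fmap.get(key, 0) for key in dimensions] for fmap in feature_maps]
    return dimensions, vectors


dims, vecs = to_fixed_vectors([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
```

The length of `dims` grows with every previously unseen key, which is why this approach does not scale to a massive number of files and why the dimensions are merged and screened instead.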

Step S102: reading a file to be detected that is outside the learning set;

Step S103: converting the file to be detected into a vector;

Step S104: performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.

In steps S102 to S104, when a file outside the learning set needs to be checked, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on the converted file.

As a preferred embodiment, taking a PC as an example, the machine learning model generated in step S101 can be applied to the virus-scanning engine on the PC front end to scan for viruses on the user's PC. The specific implementation is as follows:

1. Read the file to be detected on the PC;

2. Convert the file read from the PC into a vector.

As described above, the classifier learns from the vectors of each virus file and normal file in the learning set to generate the machine learning model; in other words, the objects that the machine learning model processes must be vectors. In this embodiment, therefore, when a file to be detected is read on the PC system, it needs to be converted into a vector, that is, effective sample features are extracted from it. These effective sample features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

The key of each such feature is then paired with the value of that feature to form a (key:value) pair, so that the file becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.

3. Perform malicious file identification, through the machine learning model, on the file from the PC that has been converted into a vector.

The file from the PC that has been converted into a vector is passed to the machine learning model for judgment, which identifies it as a virus file or a normal file. Specifically, the machine learning model evaluates a linear function on the vector of the file to be detected and determines from the result whether the file is malicious or normal, thereby outputting the malicious files and normal files among the files to be detected.
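
The "linear function calculation" described above can be pictured as evaluating w·x + b and checking its sign. The Python sketch below assumes the model is a weight list plus a bias and that a positive score means "malicious"; the function name, the sign convention and the zero cut-off are assumptions, not details given in the description.

```python
def classify(vector, weights, bias):
    """Evaluate the linear function w.x + b and map its sign to a label."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    # Assumed convention: positive score -> malicious, otherwise normal.
    return "malicious" if score > 0 else "normal"
```

For instance, with the hypothetical model `weights=[2, -1]`, `bias=-1`, the vector `[1, 0]` scores 1 and `[0, 1]` scores -2.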

Specifically, as shown in Fig. 2, step S101, generating a machine learning model from a learning set composed of predetermined malicious files and normal files, comprises:

Step S1011: converting the malicious files and normal files in the learning set into vectors;

Converting the malicious files and normal files in the learning set into vectors means extracting effective sample features from each malicious file and normal file in the learning set.

Step S1012: performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

Through feature extraction, a file is converted into a multi-dimensional vector whose number of dimensions is not fixed. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. One way to fix the dimensions is to merge the dimensions (keys) of all files and, where a file lacks a given dimension, set its value to 0. For a massive number of files this yields a massive number of dimensions, and the curse of dimensionality arises. These dimensions therefore need to be merged and screened.

Specifically, this embodiment merges the dimensions and screens out K dimensions, where the K dimensions are the first K dimensions selected from the many dimensions by merging and screening according to a certain rule. This is described in detail below with reference to Fig. 3.

Step S1013: learning from the merged and screened vectors through a classifier to generate the machine learning model.

In this embodiment the classifier may be a linear classifier; a so-called linear SVM is one whose kernel function is the inner product. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine that belongs to the family of generalized linear classifiers; its distinguishing property is that it minimizes the empirical error while maximizing the geometric margin. Applying an SVM to virus identification means solving the problem of classifying virus files and normal files.

The SVM learns from the merged and screened vectors, which generates the machine learning model.
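
To make the learning step more concrete, the Python sketch below fits a linear classifier by batch subgradient descent on the L2-regularized hinge loss. This is a simple stand-in for a real SVM solver, chosen only to keep the example dependency-free; it is not the training procedure of the described embodiment, and the function name and hyperparameter values are assumptions.

```python
def train_linear_svm(vectors, labels, epochs=500, lr=0.1, reg=0.01):
    """Fit (w, b) on K-dimensional vectors; labels are +1 (black) or -1 (white)."""
    n_samples = len(vectors)
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        # Subgradient of reg/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))).
        grad_w = [reg * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # sample lies inside the margin or is misclassified
                for i, xi in enumerate(x):
                    grad_w[i] -= y * xi / n_samples
                grad_b -= y / n_samples
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

The returned pair `(w, b)` is exactly what a linear decision function needs, which is why a linear kernel keeps identification on the client cheap: one inner product per file.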

Of course, in other embodiments, another machine learning method may be used for the discrimination instead of an SVM.

More specifically, as shown in Fig. 3, if the vectors of all malicious files in the learning set are taken as the black vector set and the vectors of all normal files as the white vector set, then step S1012, performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set, comprises:

Step S10: randomly selecting two black vectors from the black vector set and extracting the dimensions common to the two black vectors as a black dimension set; randomly selecting two white vectors from the white vector set and extracting the dimensions common to the two white vectors as a white dimension set;

Step S11: removing from the black dimension set every dimension that appears in the white dimension set to form a new black dimension set, and assigning a weight to each dimension in the white dimension set and in the new black dimension set;

In steps S10 and S11, the dimensions are merged and the K dimensions screened out as follows:

The problem of merging and screening dimensions over the entire black vector set and white vector set is split into subproblems, each consisting of two black vectors and two white vectors. Each subproblem is then solved: the dimensions common to the two white vectors are extracted (their intersection) as the white dimension set of the subproblem, the dimensions common to the two black vectors are extracted as the black dimension set of the subproblem, every dimension in the black dimension set that also appears in the white dimension set is removed, and each selected black and white dimension is assigned a weight.
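
One subproblem as described above can be sketched with plain set operations in Python. The function name `solve_subproblem` is hypothetical, and the uniform initial weight of 1.0 is an assumption, since the description does not state how the weights are assigned.

```python
def solve_subproblem(black_pair, white_pair, weight=1.0):
    """Return (new_black, white) dimension sets as {dimension: weight} maps.

    Each pair holds the sparse {key: value} maps of two files.
    """
    white_dims = set(white_pair[0]) & set(white_pair[1])  # shared by both white vectors
    black_dims = set(black_pair[0]) & set(black_pair[1])  # shared by both black vectors
    new_black_dims = black_dims - white_dims              # drop dims also seen in white
    return (
        {dim: weight for dim in new_black_dims},
        {dim: weight for dim in white_dims},
    )
```

Taking the intersection within a pair keeps only features that recur, and removing white dimensions from the black set keeps only features that did not also recur in normal files.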

Step S12: merging the white dimension set and the new black dimension set, respectively, by dimension according to weight, and discarding any dimension whose weight after merging is below a preset weight threshold;

The solutions of all subproblems are merged by dimension. A weight threshold w is set for the merging process; if the weight of a dimension after merging (the weights of the same dimension are added during merging) is below w, the dimension is discarded directly, which prevents the dimension set from growing without limit.
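
The merge-and-discard rule of step S12 can be sketched as follows in Python. The function name is hypothetical, and treating a weight exactly equal to the threshold as "kept" is an assumption based on the wording "lower than w".

```python
def merge_with_threshold(total, partial, threshold):
    """Add a subproblem's weights into the running set, then drop light dimensions."""
    merged = dict(total)
    for dim, weight in partial.items():
        merged[dim] = merged.get(dim, 0.0) + weight  # same dimension: weights add up
    # Keep only dimensions whose merged weight is not below the threshold.
    return {dim: weight for dim, weight in merged.items() if weight >= threshold}
```

One consequence worth noting: if the threshold is higher than the weight a dimension receives on its first appearance, that dimension is dropped before it can accumulate, so in this sketch the threshold and the initial weights have to be chosen together. The description leaves that choice open.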

Step S13: judging whether all vectors in the black vector set and the white vector set have been processed; if so, proceeding to step S14; otherwise, returning to step S10;

Step S14: filtering the merged black dimension set with the merged white dimension set;

Step S15: sorting the filtered black dimension set by weight and taking the K highest-ranked black dimensions as the final dimensions;

In steps S13 to S15, once all vectors in the black vector set and the white vector set have been learned, the merged white dimension set is used to filter the black dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.
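
Steps S14 and S15 amount to a set difference followed by a top-K selection, as in this Python sketch. The function name is hypothetical, and breaking weight ties by dimension name is an assumption added only to make the result deterministic.

```python
def select_top_k(black_weights, white_weights, k):
    """Filter black dimensions by the white set, then keep the K heaviest."""
    # S14: black dimension set = black dimension set - white dimension set.
    filtered = {d: w for d, w in black_weights.items() if d not in white_weights}
    # S15: rank by descending weight (ties broken by name) and take the first K.
    ranked = sorted(filtered.items(), key=lambda item: (-item[1], item[0]))
    return [dim for dim, _ in ranked[:k]]
```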

Step S16: converting all vectors in the black vector set and the white vector set into K-dimensional vectors.

All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
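
Converting a file's variable-length feature set into the K-dimensional standard form can be sketched as a simple projection onto the selected dimensions. The function name is hypothetical, and filling absent dimensions with 0 follows the zero-fill convention mentioned earlier in the description.

```python
def to_k_vector(feature_map, final_dims):
    """Project a sparse {key: value} map onto the K selected dimensions."""
    # Keys outside final_dims are ignored; missing selected keys become 0.
    return [feature_map.get(dim, 0) for dim in final_dims]
```

The same projection would be applied to a file to be detected at identification time, so that training vectors and detection vectors share one layout.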

The process of merging and screening the vectors of all virus files and normal files in the learning set is described in detail below with a concrete example.

As shown in Fig. 4, let FB and FW denote the overall black and white sets, FBL and FWL the sets of dimensions common to the selected black and white vectors, B1 and B2 the labels of two black vectors randomly selected from the black vector set, and W1 and W2 the labels of two white vectors randomly selected from the white vector set. The process of merging and screening the vectors of all virus files and normal files in the learning set is then:

S1: initialize FB and FW, and select the black or white vector set; if the black vector set is selected, go to step S2; if the white vector set is selected, go to step S3;

S2: judge whether all black vectors in the black vector set have been marked; if so, go to step S4; otherwise, go to step S21;

S21: randomly select two black vectors B1 and B2;

S22: extract the common dimension set FBL and assign a weight to each dimension; go to S23;

S3: judge whether all white vectors in the white vector set have been marked; if so, go to step S4; otherwise, go to step S31;

S31: randomly select two white vectors W1 and W2;

S32: extract the common dimension set FWL and assign a weight to each dimension; go to S23;

S23: take the set difference of FBL and FWL as the new FBL;

S24: merge the new FBL and FWL into the overall sets FB and FW respectively, adding the weights of matching dimensions during merging;

S25: remove from FB and FW the dimensions whose weight is less than w-limit (the preset weight threshold); return to steps S2 and S3 respectively.

S4: take the set difference of FB and FW as the new FB;

S5: sort FB by weight and take the first K dimensions to obtain the final FB.
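
The S1 to S5 flow above can be approximated end to end in Python as follows. This is a simplified sketch under several assumptions that the description does not settle: black and white pairs are processed in lockstep, every dimension gets an initial weight of 1.0, an unpaired leftover vector is skipped, and ties are broken by name. All function and parameter names are hypothetical.

```python
import random
from itertools import zip_longest


def random_pairs(vectors, rng):
    """Shuffle the vectors and group them into pairs (S21 / S31)."""
    order = list(vectors)
    rng.shuffle(order)
    # Assumption: an odd leftover vector is skipped.
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def common_dims(pair):
    """Dimensions shared by both vectors of a pair (empty if no pair is left)."""
    return set(pair[0]) & set(pair[1]) if pair else set()


def select_dimensions(black_vectors, white_vectors, k, w_limit=1.0, seed=0):
    rng = random.Random(seed)
    fb, fw = {}, {}                                    # S1: initialize FB, FW
    pairs = zip_longest(random_pairs(black_vectors, rng),
                        random_pairs(white_vectors, rng))
    for black_pair, white_pair in pairs:
        fwl = common_dims(white_pair)                  # S32
        fbl = common_dims(black_pair) - fwl            # S22, S23
        for dim in fbl:                                # S24: add weights
            fb[dim] = fb.get(dim, 0.0) + 1.0
        for dim in fwl:
            fw[dim] = fw.get(dim, 0.0) + 1.0
        fb = {d: w for d, w in fb.items() if w >= w_limit}   # S25
        fw = {d: w for d, w in fw.items() if w >= w_limit}
    fb = {d: w for d, w in fb.items() if d not in fw}  # S4: FB minus FW
    return sorted(fb, key=lambda d: (-fb[d], d))[:k]   # S5: top K by weight
```

Passing a fixed `seed` makes the random pairing reproducible, which is convenient when comparing different values of K or of the threshold.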

In this embodiment, a machine learning model is generated from a learning set composed of predetermined malicious files and normal files, and the generated model performs malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without analyst involvement. Because machine learning responds promptly and extracts virus features accurately and effectively, any malicious file that is found can be handled immediately, which greatly improves the detection efficiency for malicious files.

As shown in Fig. 5, a preferred embodiment of the present invention proposes a malicious file identification device, comprising a model generation module 501, a reading module 502, a vector conversion module 503 and an identification module 504, wherein:

The model generation module 501 is configured to generate a machine learning model from a learning set composed of predetermined malicious files and normal files;

The reading module 502 is configured to read a file to be detected that is outside the learning set;

The vector conversion module 503 is configured to convert the file to be detected into a vector;

The identification module 504 is configured to perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
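
How the modules listed above might fit together on the detection side can be sketched as a small Python class. Everything here is illustrative: the class name, the `extract` callable, and the representation of the model as a (weights, bias) pair are assumptions, and the model generation module is represented only by its outputs (`final_dims` and `model`).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

FeatureMap = Dict[str, float]


@dataclass
class IdentificationDevice:
    extract: Callable[[bytes], FeatureMap]  # hypothetical feature extractor
    final_dims: List[str]                   # K dimensions from model generation
    model: Tuple[List[float], float]        # (weights, bias) from model generation

    def read(self, path: str) -> bytes:
        """Reading module: load a file to be detected."""
        with open(path, "rb") as handle:
            return handle.read()

    def to_vector(self, data: bytes) -> List[float]:
        """Vector conversion module: project extracted features onto final_dims."""
        features = self.extract(data)
        return [features.get(dim, 0.0) for dim in self.final_dims]

    def identify(self, data: bytes) -> bool:
        """Identification module: True if the linear score is positive."""
        weights, bias = self.model
        vector = self.to_vector(data)
        return sum(w * x for w, x in zip(weights, vector)) + bias > 0
```

A caller would chain the modules as `device.identify(device.read(path))`; keeping the extractor as an injected callable leaves the choice of features open.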

The known virus files and non-virus files can be collected in advance by virus analysts and assembled into a learning set. After the model generation module 501 has performed feature extraction, dimension merging and screening on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.

When a file outside the learning set needs to be checked, the file is read and converted into a vector, and the machine learning model generated by the model generation module 501 (step S101) performs malicious file identification on the converted file.

As a preferred embodiment, taking a PC as an example, the machine learning model generated by the model generation module 501 can be applied to the virus-scanning engine on the PC front end to scan for viruses on the user's PC. The specific implementation is as follows:

1. Read the file to be detected on the PC;

2. Convert the file read from the PC into a vector.

As described above, the classifier learns from the vectors of each virus file and normal file in the learning set to generate the machine learning model; in other words, the objects that the machine learning model processes must be vectors. In this embodiment, therefore, when the reading module 502 reads a file to be detected on the PC system, the vector conversion module 503 needs to convert it into a vector, that is, extract effective sample features from it. These effective sample features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.

The identification module 504 passes the file from the PC that has been converted into a vector to the machine learning model for judgment, which identifies it as a virus file or a normal file. Specifically, the machine learning model evaluates a linear function on the vector of the file to be detected and determines from the result whether the file is malicious or normal, thereby outputting the malicious files and normal files among the files to be detected.

Specifically, as shown in Fig. 6, the model generation module 501 comprises a vector conversion unit 5011, a merging and screening unit 5012 and a generation unit 5013, wherein:

The vector conversion unit 5011 is configured to convert the malicious files and normal files in the learning set into vectors;

The merging and screening unit 5012 is configured to perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

The generation unit 5013 is configured to learn from the merged and screened vectors through a classifier and generate the machine learning model.

In this embodiment, converting the malicious files and normal files in the learning set into vectors means extracting effective sample features from each malicious file and normal file in the learning set.

Through feature extraction, a file is converted into a multi-dimensional vector whose number of dimensions is not fixed. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. One way to fix the dimensions is to merge the dimensions (keys) of all files and, where a file lacks a given dimension, set its value to 0. For a massive number of files this yields a massive number of dimensions, and the curse of dimensionality arises. These dimensions therefore need to be merged and screened. Specifically, this embodiment merges the dimensions and screens out K dimensions, where the K dimensions are the first K dimensions selected from the many dimensions by merging and screening according to a certain rule.

More specifically, as shown in Fig. 7, if the vectors of all malicious files in the learning set are taken as the black vector set and the vectors of all normal files as the white vector set, the merging and screening unit 5012 comprises a first extraction subunit 50121, a screening subunit 50122, a merging subunit 50123, a filtering subunit 50124, a second extraction subunit 50125 and a conversion subunit 50126, wherein:

The first extraction subunit 50121 is configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;

The screening subunit 50122 is configured to remove from the black dimension set every dimension that appears in the white dimension set to form a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;

The merging subunit 50123 is configured to merge the white dimension set and the new black dimension set, respectively, by dimension according to weight, and to discard any dimension whose weight after merging is below a preset weight threshold;

The filtering subunit 50124 is configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;

The second extraction subunit 50125 is configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;

The conversion subunit 50126 is configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.

In this embodiment, the dimensions are merged and the K dimensions screened out as follows:

Once all vectors in the black vector set and the white vector set have been learned, the merged white dimension set is used to filter the black dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.

In addition, as shown in Fig. 8, the identification module 504 comprises a computing unit 5041 and an output unit 5042, wherein:

The computing unit 5041 is configured to obtain a calculation result by passing the file to be detected, after conversion into a vector, through the machine learning model;

The output unit 5042 is configured to output, according to the calculation result, the malicious files and normal files among the files to be detected.

With the malicious file identification method, device and storage medium of the embodiments of the present invention, a machine learning model is generated from a learning set composed of predetermined malicious files and normal files, and the generated model performs malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without analyst involvement. Because machine learning responds promptly and extracts virus features accurately and effectively, any malicious file that is found can be handled immediately, which greatly improves the detection efficiency for malicious files.

In addition, the present invention also proposes a computer-readable storage medium on which a computer-executable program is stored. After the program is loaded into the memory of a computer, it generates a machine learning model from a learning set composed of predetermined malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.

It should be noted that the above embodiments of the present invention are all illustrated with the Windows operating system but are not limited to it. Other operating systems, such as Mac or Linux systems, can also adopt the above solution of the present invention to detect and identify malicious files; the specific principle is not repeated here.

The above are only preferred embodiments of the present invention and do not thereby limit the scope of its claims. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

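1. A malicious file identification method, characterized by comprising the following steps:

generating a machine learning model from a learning set composed of predetermined malicious files and normal files;

reading a file to be detected that is outside the learning set;

converting the file to be detected into a vector;

performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.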

2. The method according to claim 1, characterized in that the step of generating a machine learning model using a learning set composed of predetermined malicious files and normal files comprises:

Converting the malicious files and normal files in the learning set into vectors respectively;

Performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

Learning from the merged and screened vectors through a classifier to generate the machine learning model.

3. The method according to claim 2, characterized in that the vectors of all malicious files in the learning set are defined as a black vector set and the vectors of all normal files as a white vector set, and the step of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set comprises:

Randomly selecting two black vectors from the black vector set and extracting the dimensions common to the two black vectors as a black dimension set; randomly selecting two white vectors from the white vector set and extracting the dimensions common to the two white vectors as a white dimension set;

Removing from the black dimension set all dimensions that appear in the white dimension set to form a new black dimension set, and assigning a weight to each dimension in the white dimension set and in the new black dimension set;

Merging the dimensions of the white dimension set and of the new black dimension set respectively according to weight, and discarding the dimensions whose merged weight is lower than a preset weight threshold; and repeating the above three steps until all vectors in the black vector set and the white vector set have been processed.

4. The method according to claim 3, characterized in that the step of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set further comprises:

After all vectors in the black vector set and the white vector set have been processed, filtering the merged black dimension set with the merged white dimension set;

Sorting the filtered black dimension set by weight, and taking the K highest-ranked black dimensions as the final dimensions;

Converting all vectors in the black vector set and the white vector set into K-dimensional vectors.

5. The method according to claim 1, 2, 3 or 4, characterized in that the step of performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector comprises:

Obtaining a calculation result by passing the file to be detected, after conversion into a vector, through the machine learning model;

Outputting the malicious files and normal files among the files to be detected according to the calculation result.

6. The method according to claim 5, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.

7. A malicious file identification device, characterized by comprising:

8. The device according to claim 7, characterized in that the model generation module comprises:

A vector conversion unit, configured to convert the malicious files and normal files in the learning set into vectors respectively;

A merging and screening unit, configured to perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;

A generation unit, configured to learn from the merged and screened vectors through a classifier to generate the machine learning model.

9. The device according to claim 8, characterized in that the vectors of all malicious files in the learning set are defined as a black vector set and the vectors of all normal files as a white vector set, and the merging and screening unit comprises:

A first extraction subunit, configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;

A screening subunit, configured to remove from the black dimension set all dimensions that appear in the white dimension set to form a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;

A merging subunit, configured to merge the dimensions of the white dimension set and of the new black dimension set respectively according to weight, and to discard the dimensions whose merged weight is lower than a preset weight threshold;

A filtering subunit, configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;

A second extraction subunit, configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;

A conversion subunit, configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.

10. The device according to claim 7, 8 or 9, characterized in that the identification module comprises:

A calculation unit, configured to obtain a calculation result by passing the file to be detected, after conversion into a vector, through the machine learning model;

An output unit, configured to output the malicious files and normal files among the files to be detected according to the calculation result.

11. The device according to claim 10, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.

12. A computer-readable storage medium on which a computer-executable program is stored, wherein, after the program is loaded into the memory of a computer, it performs the following: generating a machine learning model using a learning set composed of predetermined malicious files and normal files; reading a file to be detected outside the learning set; converting the file to be detected into a vector; and performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
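
By way of illustration only, the dimension merging and screening recited in claims 3 and 4 can be sketched in Python as follows. The claims do not specify how weights are assigned or how pairs are drawn, so the sketch assumes that a shared dimension's weight is the mean of its two values, that merging adds weights, and that pairs are drawn without replacement from a shuffled copy (a leftover single vector is skipped). All function and parameter names are hypothetical.

```python
import random
from itertools import zip_longest
from typing import Dict, List, Optional, Tuple

Vector = Dict[str, float]   # sparse vector: dimension name -> value
Weights = Dict[str, float]  # dimension name -> accumulated weight
Pair = Tuple[Vector, Vector]


def random_pairs(vectors: List[Vector], rng: random.Random) -> List[Pair]:
    """Shuffle a copy and pair up vectors without replacement."""
    shuffled = vectors[:]
    rng.shuffle(shuffled)
    return list(zip(shuffled[0::2], shuffled[1::2]))


def common_dims(pair: Optional[Pair]) -> Weights:
    """Dimensions shared by both vectors; assumed weight = mean of the two values."""
    if pair is None:
        return {}
    a, b = pair
    return {d: (a[d] + b[d]) / 2 for d in a.keys() & b.keys()}


def merge(total: Weights, new: Weights, threshold: float) -> Weights:
    """Add new weights to the running set, then drop dimensions below the threshold."""
    merged = dict(total)
    for d, w in new.items():
        merged[d] = merged.get(d, 0.0) + w
    return {d: w for d, w in merged.items() if w >= threshold}


def select_dimensions(black: List[Vector], white: List[Vector], k: int,
                      threshold: float = 0.0, seed: int = 0) -> List[str]:
    """Return the K highest-weighted black dimensions after merging and filtering."""
    rng = random.Random(seed)
    black_total: Weights = {}
    white_total: Weights = {}
    pairs = zip_longest(random_pairs(black, rng), random_pairs(white, rng))
    for black_pair, white_pair in pairs:
        white_dims = common_dims(white_pair)
        # New black dimension set: shared black dimensions not in the white set.
        black_dims = {d: w for d, w in common_dims(black_pair).items()
                      if d not in white_dims}
        black_total = merge(black_total, black_dims, threshold)
        white_total = merge(white_total, white_dims, threshold)
    # Claim 4: filter the merged black set with the merged white set, then rank.
    filtered = {d: w for d, w in black_total.items() if d not in white_total}
    ranked = sorted(filtered.items(), key=lambda item: (-item[1], item[0]))
    return [d for d, _ in ranked[:k]]


def project(vector: Vector, dims: List[str]) -> List[float]:
    """Convert a sparse vector into a K-dimensional vector over the final dimensions."""
    return [vector.get(d, 0.0) for d in dims]


black_set = [{"inject": 1.0, "crypt": 0.5, "open": 1.0},
             {"inject": 1.0, "crypt": 1.0, "open": 0.5}]
white_set = [{"open": 1.0, "print": 1.0},
             {"open": 1.0, "save": 1.0}]
final_dims = select_dimensions(black_set, white_set, k=2)
print(final_dims, project({"inject": 1.0, "open": 1.0}, final_dims))
```

With this toy data the dimension `open` is shared by normal files and is therefore screened out, leaving only dimensions that are characteristic of the malicious files.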