CN102737186A - Malicious file identification method, device and storage medium - Google Patents

Publication number
CN102737186A
Authority
CN
China
Prior art keywords
file
dimension
vector
black
malice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102130782A
Other languages
Chinese (zh)
Other versions
CN102737186B (en)
Inventor
崔精兵
杨宜
于涛
白子潘
吴家旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210213078.2A
Publication of CN102737186A
Application granted
Publication of CN102737186B
Legal status: Active
Anticipated expiration

Landscapes

  • Storage Device Security (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a malicious file identification method, a device and a storage medium. The method comprises the steps of adopting a preset learning set consisting of a malicious file and a normal file to generate a machine learning model; reading a file to be detected other than the learning set; converting the file to be detected to a vector; and performing the malicious file identification on the file to be detected which is converted to the vector through the machine learning model. The machine learning model is generated by the preset learning set consisting of the malicious file and the normal file, and the generated machine learning model is used for performing the malicious file identification on the file to be detected except the learning set, so that the virus characteristics can be timely, accurately and efficiently extracted, any discovered malicious file can be immediately processed, and the detection efficiency of the malicious file can be greatly improved.

Description

Malicious file identification method, device and storage medium
Technical field
The present invention relates to the field of Internet technology, in particular the security field, and more specifically to a malicious file identification method, device and storage medium.
Background
With the development of Internet technology, the spread of viruses is also intensifying. Viruses cause great harm to the security of user information and to users' property. Developing antivirus engines that respond quickly, run efficiently, and achieve a high detection rate and high accuracy has therefore become a focus of the Internet information security field.
The virus identification technique adopted by traditional antivirus engines is usually as follows: an analyst analyzes virus files, extracts virus features and stores them in a virus database; the antivirus engine scans existing files against the virus database and reports a virus whenever a matching feature is found.
Traditional virus identification has the following drawbacks:
1. It places high demands on the analyst's professional skill, and the quality of the extracted virus features determines the false-positive rate and the detection rate;
2. Analyzing virus files and extracting virus features is very time-consuming;
3. Efficiency is low: as the number of records in the virus database grows, the time needed to match against every record increases geometrically;
4. Viruses are not discovered in time. Given the massive number of new virus variants and the analysts' limited processing capacity, some viruses are only noticed or taken seriously, and then handled, once they break out, by which time they have already caused considerable harm.
Summary of the invention
The main purpose of the present invention is to provide a malicious file identification method, device and storage medium, aimed at improving the detection efficiency for malicious files.
To achieve the above purpose, the present invention proposes a malicious file identification method comprising the following steps:
generating a machine learning model from a learning set composed of predetermined malicious files and normal files;
reading a file to be detected that is outside the learning set;
converting the file to be detected into a vector;
performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
The present invention also proposes a malicious file identification device, comprising:
a model generation module, configured to generate a machine learning model from a learning set composed of predetermined malicious files and normal files;
a reading module, configured to read a file to be detected that is outside the learning set;
a vector conversion module, configured to convert the file to be detected into a vector;
an identification module, configured to perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
The present invention also proposes a computer-readable storage medium on which a computer-executable program is stored. After the program is loaded into the memory of a computer, the computer generates a machine learning model from a learning set composed of predetermined malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
With the malicious file identification method, device and storage medium proposed by the present invention, a machine learning model is generated from a learning set composed of preset malicious files and normal files, and the generated model performs malicious file identification on files to be detected outside the learning set. Virus features can thus be extracted in a timely, accurate and effective manner, any malicious file that is found can be handled immediately, and the detection efficiency for malicious files is greatly improved.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a preferred embodiment of the malicious file identification method of the present invention;
Fig. 2 is a schematic flowchart of generating the machine learning model from the learning set composed of predetermined malicious files and normal files in the preferred embodiment of the method;
Fig. 3 is a schematic flowchart of merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the method;
Fig. 4 is a schematic flowchart of one example of merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the method;
Fig. 5 is a schematic structural diagram of a preferred embodiment of the malicious file identification device of the present invention;
Fig. 6 is a schematic structural diagram of the model generation module in the preferred embodiment of the device;
Fig. 7 is a schematic structural diagram of the merging and screening unit in the preferred embodiment of the device;
Fig. 8 is a schematic structural diagram of the identification module in the preferred embodiment of the device.
To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Detailed description
The solution of the embodiments of the present invention is mainly as follows: a machine learning model is generated from a learning set composed of predetermined malicious files and normal files; a file to be detected that is outside the learning set is read and converted into a vector; and the machine learning model performs malicious file identification on the vectorized file. This exploits the timely response and fast processing speed of machine learning to improve the detection efficiency for malicious files.
In the present invention, a malicious file may be a virus file or any other malicious file; the following embodiments are described in terms of malicious files. The technical terms involved include:
Black file: a virus file
Black vector: the vector into which a virus file is converted
White file: a normal, non-virus file
White vector: the vector into which a normal, non-virus file is converted
SVM: support vector machine
PE file: an executable file format on the Windows system
As shown in Fig. 1, a preferred embodiment of the present invention proposes a malicious file identification method, comprising:
Step S101: generate a machine learning model from a learning set composed of predetermined malicious files and normal files;
Taking the Windows system as an example, in order to scan files on a Windows system for viruses, this embodiment first uses known virus files and non-virus files (the malicious files and normal files of this embodiment) to generate a machine learning model. The model is then used to perform virus identification on files on the Windows system, solving the problem of classifying the files in the system into virus files and normal files.
The known virus files and non-virus files can be collected in advance by virus analysts and assembled into a learning set. After feature extraction, dimension merging and dimension screening have been performed on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.
Specifically, the malicious files and normal files in the learning set are first converted into vectors, that is, effective sample features are extracted from each malicious file and each normal file in the learning set.
For an executable file (PE file), the features that help identify viruses include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.
This embodiment pairs each feature, as a key, with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
Through feature extraction, a file is thus converted into a multi-dimensional vector with a variable number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensionality is to merge the dimensions (keys) of all files and, where a single file lacks a given dimension, to set its value for that dimension to 0. With a massive number of files there is a massive number of dimensions, and the curse of dimensionality arises; these dimensions therefore need to be merged and screened. Finally, the classifier learns from the merged and screened vectors and generates the machine learning model.
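The zero-filling rule described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `to_fixed_vector` and the feature keys are hypothetical and chosen only for illustration.

```python
def to_fixed_vector(features, dimensions):
    """Map a {key: value} feature dict onto an ordered list of dimensions.

    A dimension that the file does not contain gets the value 0.
    """
    return [features.get(key, 0) for key in dimensions]


# Hypothetical (key:value) sets for two files; keys and values are invented.
file_a = {"str:CreateRemoteThread": 1, "section:.text:entropy": 6.2}
file_b = {"str:CreateRemoteThread": 1, "import:kernel32.dll": 3}

# Merge the dimensions (keys) of all files into one fixed, ordered list.
dimensions = sorted(set(file_a) | set(file_b))

vec_a = to_fixed_vector(file_a, dimensions)  # [0, 6.2, 1]
vec_b = to_fixed_vector(file_b, dimensions)  # [3, 0, 1]
```

With only two files the merged list has three dimensions; with a very large number of files it grows with every new key, which is why the text goes on to merge and screen the dimensions.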
Step S102: read a file to be detected that is outside the learning set;
Step S103: convert the file to be detected into a vector;
Step S104: perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
In steps S102 to S104, when a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on the resulting vector.
As a preferred implementation, taking a PC as an example, the machine learning model generated in step S101 can be applied to the virus-scanning engine on the PC front end to scan for viruses on the user's PC. The specific procedure is as follows:
1. Read the files to be detected on the PC;
2. Convert the files read from the PC into vectors.
As described above, the classifier learns from the vectors of the virus files and normal files in the learning set and thereby generates the machine learning model; that is, the objects handled by the machine learning model must be vectors. In this embodiment, therefore, when a file to be detected is read on the PC system, it must be converted into a vector, that is, effective sample features are extracted from it. These features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.
Each feature, as a key, is then paired with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
3. Perform malicious file identification, through the machine learning model, on the files on the PC that have been converted into vectors.
The vectorized files to be detected on the PC are submitted to the machine learning model for judgment, which identifies the virus files and normal files among them. Specifically, the machine learning model evaluates a linear function on each vectorized file to be detected and determines from the result whether the file is malicious or normal, thereby outputting the malicious files and the normal files among the files to be detected.
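As a rough illustration of the linear function evaluation mentioned above, the following Python sketch scores a vector as w·x + b and decides by the sign of the result. The weights, bias, decision threshold of 0 and function names are assumptions made for the example; the patent does not give concrete values.

```python
def linear_score(weights, bias, vector):
    """Evaluate the linear function w . x + b for one vectorized file."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def classify(weights, bias, vector):
    """Assumed decision rule: a positive score means malicious."""
    return "malicious" if linear_score(weights, bias, vector) > 0 else "normal"


# Hypothetical model parameters over three dimensions.
weights = [1.5, -2.0, 0.5]
bias = -0.25

label_a = classify(weights, bias, [1, 0, 1])  # score 1.75
label_b = classify(weights, bias, [0, 1, 0])  # score -2.25
```

In a trained model the weights and bias would come from the classifier's learning step rather than being written by hand.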
Specifically, as shown in Fig. 2, step S101 of generating a machine learning model from a learning set composed of predetermined malicious files and normal files comprises:
Step S1011: convert the malicious files and normal files in the learning set into vectors;
Converting the malicious files and normal files in the learning set into vectors means extracting effective sample features from each malicious file and each normal file in the learning set.
For an executable file (PE file), the features that help identify viruses include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.
This embodiment pairs each feature, as a key, with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
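To make the (key:value) representation concrete, here is a small Python sketch that handles only one of the listed feature types, character strings. The `str:` key prefix, the use of occurrence counts as values, and the minimum run length of 4 are assumptions for illustration; the patent does not specify how keys or values are encoded.

```python
import re
from collections import Counter


def string_features(data: bytes, min_len: int = 4) -> dict:
    """Return {"str:<text>": count} for each printable ASCII run in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    runs = re.findall(pattern, data)
    return dict(Counter("str:" + run.decode("ascii") for run in runs))


# Invented byte content standing in for part of an executable file.
sample = b"\x00\x01kernel32.dll\x00\xffab\x00kernel32.dll\x00LoadLibraryA\x90"
features = string_features(sample)
# {"str:kernel32.dll": 2, "str:LoadLibraryA": 1}
```

The short run `ab` is skipped because it is below the minimum length. Each key of the resulting dict is one dimension, so two files generally yield vectors with different dimensions.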
Step S1012: merge and screen the dimensions of the vectors of the malicious files and normal files in the learning set;
Through feature extraction, a file is converted into a multi-dimensional vector with a variable number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensionality is to merge the dimensions (keys) of all files and, where a single file lacks a given dimension, to set its value for that dimension to 0. With a massive number of files there is a massive number of dimensions, and the curse of dimensionality arises; these dimensions therefore need to be merged and screened.
This embodiment merges the dimensions and screens out K dimensions, where the K dimensions are the top K dimensions selected from the many dimensions by merging and screening according to certain rules. This is described in detail below with reference to Fig. 3.
Step S1013: the classifier learns from the merged and screened vectors and generates the machine learning model.
In this embodiment the classifier may be a linear classifier; a so-called linear SVM is one whose kernel function is the inner product function. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine belonging to the class of generalized linear classifiers; its characteristic is that it simultaneously minimizes the empirical error and maximizes the geometric margin. Applying an SVM to virus identification means solving the problem of classifying virus files and normal files.
The SVM learns from the merged and screened vectors, which generates the machine learning model.
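The learning step can be sketched in plain Python as a linear SVM trained by subgradient descent on the regularised hinge loss. This is one common way to fit a linear SVM and is an assumption here; the patent does not say which solver is used. The function names, hyperparameters and toy vectors are likewise hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels use +1 for black (malicious) and -1 for white (normal) vectors.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation step: shrink the weights slightly.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Sample is inside the margin or misclassified: move towards it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (black) or -1 (white) from the sign of the linear function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy K-dimensional vectors (K = 3); the values are invented for illustration.
samples = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

The pair `(w, b)` plays the role of the generated machine learning model: identifying a new file only requires evaluating the linear function on its K-dimensional vector. A production system would more likely rely on an established SVM library than on a hand-written loop.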
Of course, in other embodiments another machine learning method may be used for the discrimination instead of an SVM.
More specifically, as shown in Fig. 3, if the vectors of all malicious files in the learning set are defined as the black vector set and the vectors of all normal files as the white vector set, step S1012 of merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set comprises:
Step S10: randomly select two black vectors from the black vector set and extract the dimensions they share as the black dimension set; randomly select two white vectors from the white vector set and extract the dimensions they share as the white dimension set;
Step S11: remove from the black dimension set every dimension that also appears in the white dimension set, forming a new black dimension set, and assign a weight to each dimension in the white dimension set and in the new black dimension set;
In steps S10 and S11, the dimensions are merged and the K dimensions screened out in the following way:
The problem of merging and screening dimensions over the whole black vector set and white vector set is split into subproblems, each consisting of two black vectors and two white vectors. Each subproblem is then solved: the dimensions shared by the two white vectors (their intersection) are extracted as the white dimension set of the subproblem; the dimensions shared by the two black vectors are extracted as the black dimension set of the subproblem; every dimension of the black dimension set that also appears in the white dimension set is removed; and a weight is assigned to each selected black and white dimension.
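One subproblem can be sketched in Python with set intersection and difference. The patent does not say how the weights are assigned, so the uniform weight of 1.0 per dimension is an assumption, as are the function name and the feature keys.

```python
def solve_subproblem(black_a, black_b, white_a, white_b, weight=1.0):
    """Return (black_dims, white_dims) as {dimension: weight} dicts.

    Each argument is a {key: value} feature dict for one file.
    """
    black_dims = set(black_a) & set(black_b)   # shared by both black vectors
    white_dims = set(white_a) & set(white_b)   # shared by both white vectors
    black_dims -= white_dims                   # drop dimensions seen in white
    return ({d: weight for d in black_dims}, {d: weight for d in white_dims})


# Invented feature dicts for two black and two white files.
b1 = {"str:inject": 1, "import:WriteProcessMemory": 1, "import:ExitProcess": 1}
b2 = {"str:inject": 2, "import:ExitProcess": 1, "str:hello": 1}
w1 = {"import:ExitProcess": 1, "str:hello": 1}
w2 = {"import:ExitProcess": 1, "str:about": 1}

black, white = solve_subproblem(b1, b2, w1, w2)
# black == {"str:inject": 1.0}; white == {"import:ExitProcess": 1.0}
```

`import:ExitProcess` is shared by both black files but also by both white files, so it is removed from the black dimension set and kept only on the white side.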
Step S12: merge the white dimension set and the new black dimension set, each by dimension and according to weight, and discard every dimension whose weight after merging is below a preset weight threshold;
The solutions of all subproblems are merged by dimension. A weight threshold w is set for the merging process: if the weight of a dimension after merging (the weights of the same dimension are added together during merging) is below w, the dimension is discarded outright, which prevents the dimension set from growing without limit.
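The merge with a weight threshold might look like the following Python sketch. Treating the running total and each subproblem solution as `{dimension: weight}` dicts is an assumption, and the names and numbers are invented.

```python
def merge_dimensions(total, partial, threshold):
    """Add the weights of `partial` into `total`, then prune light dimensions.

    Dimensions whose merged weight is below `threshold` are discarded.
    """
    merged = dict(total)
    for dim, weight in partial.items():
        merged[dim] = merged.get(dim, 0.0) + weight
    return {d: w for d, w in merged.items() if w >= threshold}


total = {"a": 2.0, "b": 0.5}      # running set from earlier subproblems
partial = {"b": 1.0, "c": 0.5}    # solution of the current subproblem
result = merge_dimensions(total, partial, threshold=1.0)
# {"a": 2.0, "b": 1.5}
```

Dimension `b` survives because its two weights add up to 1.5, while `c` is dropped at 0.5. The choice of threshold and of per-subproblem weights together determines how quickly rare dimensions are pruned.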
Step S13: determine whether all vectors in the black vector set and in the white vector set have been processed; if so, go to step S14; otherwise, return to step S10;
Step S14: filter the merged black dimension set with the merged white dimension set;
Step S15: sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;
In steps S13 to S15, once all vectors in the black vector set and the white vector set have been learned, the merged white dimension set is used to filter the black dimension set (that is, black dimension set = black dimension set minus white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.
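Filtering and top-K selection can be sketched as follows in Python. Breaking ties between equal weights by dimension name is an assumption added to make the result deterministic; the patent only says that dimensions are ranked by weight.

```python
def select_top_k(black_total, white_total, k):
    """Remove white dimensions from the black set, then keep the K heaviest."""
    filtered = {d: w for d, w in black_total.items() if d not in white_total}
    ranked = sorted(filtered.items(), key=lambda item: (-item[1], item[0]))
    return [dim for dim, _ in ranked[:k]]


# Invented merged dimension sets with their accumulated weights.
black_total = {
    "str:inject": 5.0,
    "import:ExitProcess": 9.0,
    "str:keylog": 3.0,
    "str:dropper": 4.0,
}
white_total = {"import:ExitProcess": 12.0}

final_dims = select_top_k(black_total, white_total, k=2)
# ["str:inject", "str:dropper"]
```

Although `import:ExitProcess` has the largest black weight, it is filtered out because it also appears in the white dimension set.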
Step S16: convert all vectors in the black vector set and the white vector set into K-dimensional vectors.
All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector, so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
The process of merging and screening the vectors of all virus files and normal files in the learning set is described in detail below with a concrete example.
As shown in Fig. 4, FB and FW denote the overall black and white sets, FBL and FWL denote the shared dimension sets of the black and white vectors, B1 and B2 denote two black vectors randomly selected (and marked) from the black vector set, and W1 and W2 denote two white vectors randomly selected (and marked) from the white vector set. The process of merging and screening the vectors of all virus files and normal files in the learning set is as follows:
S1: initialize FB and FW and select either the black or the white vector set; if the black vector set is selected, go to step S2; if the white vector set is selected, go to step S3;
S2: determine whether all black vectors in the black vector set have been marked; if so, go to step S4; otherwise, go to step S21;
S21: randomly select two black vectors B1 and B2;
S22: extract the shared dimension set FBL and assign a weight to each dimension; go to S23;
S3: determine whether all white vectors in the white vector set have been marked; if so, go to step S4; otherwise, go to step S31;
S31: randomly select two white vectors W1 and W2;
S32: extract the shared dimension set FWL and assign a weight to each dimension; go to S23;
S23: take the set difference of FBL and FWL as the new FBL;
S24: merge the new FBL and FWL into the overall sets FB and FW respectively, adding the weights together during merging;
S25: remove from FB and FW every dimension whose weight is less than w-limit (the preset weight threshold); return to steps S2 and S3 respectively.
S4: take the set difference of FB and FW as the new FB;
S5: sort FB by weight and take the top K dimensions to obtain the final FB.
This embodiment generates a machine learning model from a learning set composed of preset malicious files and normal files, and uses the generated model to perform malicious file identification on files to be detected outside the learning set. The features of malicious code such as viruses are thus extracted automatically by machine, which removes the need for analyst involvement. Machine learning also responds in a timely manner and can extract virus features accurately and effectively, so any malicious file that is found can be handled immediately, which greatly improves the detection efficiency for malicious files.
As shown in Fig. 5, a preferred embodiment of the present invention proposes a malicious file identification device, comprising a model generation module 501, a reading module 502, a vector conversion module 503 and an identification module 504, wherein:
the model generation module 501 is configured to generate a machine learning model from a learning set composed of predetermined malicious files and normal files;
the reading module 502 is configured to read a file to be detected that is outside the learning set;
the vector conversion module 503 is configured to convert the file to be detected into a vector;
the identification module 504 is configured to perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
Taking the Windows system as an example, in order to scan files on a Windows system for viruses, this embodiment first uses known virus files and non-virus files (the malicious files and normal files of this embodiment) to generate a machine learning model. The model is then used to perform virus identification on files on the Windows system, solving the problem of classifying the files in the system into virus files and normal files.
The known virus files and non-virus files can be collected in advance by virus analysts and assembled into a learning set. After the model generation module 501 has performed feature extraction, dimension merging and dimension screening on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.
Specifically, the malicious files and normal files in the learning set are first converted into vectors, that is, effective sample features are extracted from each malicious file and each normal file in the learning set.
For an executable file (PE file), the features that help identify viruses include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.
This embodiment pairs each feature, as a key, with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
Through feature extraction, a file is thus converted into a multi-dimensional vector with a variable number of dimensions. The classifier that generates the machine learning model, however, requires vectors with a fixed number of dimensions. The way to fix the dimensionality is to merge the dimensions (keys) of all files and, where a single file lacks a given dimension, to set its value for that dimension to 0. With a massive number of files there is a massive number of dimensions, and the curse of dimensionality arises; these dimensions therefore need to be merged and screened. Finally, the classifier learns from the merged and screened vectors and generates the machine learning model.
When a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on the resulting vector.
As a preferred implementation, taking a PC as an example, the machine learning model generated by the model generation module 501 can be applied to the virus-scanning engine on the PC front end to scan for viruses on the user's PC. The specific procedure is as follows:
1. Read the files to be detected on the PC;
2. Convert the files read from the PC into vectors.
As described above, the classifier learns from the vectors of the virus files and normal files in the learning set and thereby generates the machine learning model; that is, the objects handled by the machine learning model must be vectors. In this embodiment, therefore, when the reading module 502 reads a file to be detected on the PC system, the vector conversion module 503 must convert it into a vector, that is, extract effective sample features from it. These features include: character strings, instruction sequences, function procedures, imported and exported functions, the attributes of each section, and so on.
Each feature, as a key, is then paired with the value of that feature to form a (key:value) pair. A file (whether malicious or normal) thus becomes a set of (key:value) pairs. If each key is treated as one dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
3. Perform malicious file identification, through the machine learning model, on the files on the PC that have been converted into vectors.
The identification module 504 submits the vectorized files to be detected on the PC to the machine learning model for judgment, which identifies the virus files and normal files among them. Specifically, the machine learning model evaluates a linear function on each vectorized file to be detected and determines from the result whether the file is malicious or normal, thereby outputting the malicious files and the normal files among the files to be detected.
Particularly, as shown in Figure 6, said model generation module 501 comprises: vectorial conversion unit 5011, merging and screening unit 5012 and generation unit 5013, wherein:
Vector conversion unit 5011 is used for malice file and normal file that said study is concentrated are changed into vector respectively;
Merge and screening unit 5012, be used for concentrating the vector of malice file and normal file to carry out dimension merging and screening said study;
Generation unit 5013, the vector after being used for being combined and screening through sorter is learnt, and generates machine learning model.
In the present embodiment, malice file and normal file that study is concentrated change into vector respectively, promptly are that malice file and normal file that study is concentrated are accomplished the effective sample Feature Extraction respectively.
For an executable file (PE file), virus is discerned helpful characteristic comprise: character string, instruction sequence, functional procedure, import and export the attribute of function and each section etc.
Present embodiment is right with value value composition one (key:value) of these characteristics key and this characteristic; A file has then become (comprising malice file and normal file) set of (key:value); If as a dimension, then the set of (key:value) of a file can be regarded a unfixed multi-C vector of dimension as with each key.
Through feature extraction, file is converted into a unfixed multi-C vector of dimension.But what generate that the sorter of machine learning model needs is the fixing vector of a dimension, and is the dimension (key) that merges All Files with the method for fixed in dimension, if there is not a certain dimension in single file its value is not made as 0; For mass file the dimension with magnanimity is arranged then, dimension disaster can occur, therefore, need merge and screen these dimensions.Present embodiment specifically merges dimension and filters out the K dimension, and wherein, the K dimension is meant from a plurality of dimensions according to certain rule, through merging and screening preceding K the dimension of selecting.
The classifier in this embodiment may be a linear classifier; a so-called linear SVM is one whose kernel function is the inner-product function. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine that belongs to the generalized linear classifiers, and its defining characteristic is that it simultaneously minimizes the empirical error and maximizes the geometric margin. Applying an SVM to virus identification amounts to solving the classification problem of virus files versus normal files.
The SVM learns from the merged and screened vectors, thereby generating the machine learning model.
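The patent does not say how the SVM is trained. As a rough sketch only, a linear SVM can be fitted by subgradient descent on the L2-regularized hinge loss; the function names, the hyperparameters and the +1 (malicious) / -1 (normal) label convention below are all assumptions of this example.

```python
def train_linear_svm(vectors, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 (malicious) or -1 (normal).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            # The hinge loss contributes only when the margin is violated.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(model, x):
    """Return +1 or -1 from the sign of the inner product w.x + b."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data: the first two rows play the role of black vectors.
vectors = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, -1, -1]
model = train_linear_svm(vectors, labels)
print([predict(model, v) for v in vectors])
```

The decision function uses only the inner product of the weight vector and the input, which is what makes the classifier linear.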
Of course, in other embodiments, other machine learning methods may be used for the discrimination instead, and an SVM is not required.
More specifically, as shown in Fig. 7, if the vectors of all malicious files in the learning set are designated the black vector set and the vectors of all normal files the white vector set, the merging and screening unit 5012 comprises: a first extraction subunit 50121, a screening subunit 50122, a merging subunit 50123, a filtering subunit 50124, a second extraction subunit 50125 and a conversion subunit 50126, wherein:
The first extraction subunit 50121 is configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;
The screening subunit 50122 is configured to remove from the black dimension set every dimension that also appears in the white dimension set, forming a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;
The merging subunit 50123 is configured to merge the dimensions of the white dimension set and of the new black dimension set separately according to weight, and to discard any dimension whose merged weight is lower than a predetermined weight threshold;
The filtering subunit 50124 is configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;
The second extraction subunit 50125 is configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;
The conversion subunit 50126 is configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.
In this embodiment, the dimensions are merged and the K dimensions are screened out in the following manner:
The problem of merging and screening dimensions over the entire black vector set and white vector set is split into subproblems, each consisting of two black vectors and two white vectors. Each subproblem is then solved: the dimensions common to the two white vectors (their intersection) are extracted as the white dimension set of the subproblem; the dimensions common to the two black vectors are extracted as the black dimension set of the subproblem; every dimension in the black dimension set that also appears in the white dimension set is removed; and a weight is assigned to each selected black and white dimension.
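Solving one subproblem can be sketched with set operations. The patent does not state how the weight is assigned, so the constant initial weight of 1 is an assumption of this example, as are the function name and the sample keys.

```python
def solve_subproblem(black_a, black_b, white_a, white_b, weight=1):
    """Return (black_dims, white_dims) as dicts mapping dimension -> weight."""
    white_common = set(white_a) & set(white_b)   # intersection of the white pair
    black_common = set(black_a) & set(black_b)   # intersection of the black pair
    # Dimensions that also occur in normal files do not indicate malice.
    black_common -= white_common
    black_dims = {d: weight for d in black_common}
    white_dims = {d: weight for d in white_common}
    return black_dims, white_dims

black_a = {"s1": 1, "api_x": 1, "common": 1}
black_b = {"s1": 1, "s2": 1, "common": 1}
white_a = {"common": 1, "w1": 1}
white_b = {"common": 1, "w2": 1}
print(solve_subproblem(black_a, black_b, white_a, white_b))
# ({'s1': 1}, {'common': 1})
```

Here `common` is shared by both black vectors but is dropped from the black dimension set because the white pair also shares it.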
The solutions of all subproblems are merged by dimension. A weight threshold w is set for the merging process: if the weight of a merged dimension (the weights of the same dimension are added together during merging) is lower than w, the dimension is discarded directly, which prevents the dimension set from growing without limit.
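One possible reading of this merge is sketched below: the weights of each dimension are summed over a batch of subproblem solutions, and the threshold is applied to the sums. Whether the threshold is applied per batch or after every pairwise merge is not specified in the text, so that choice, like the function name, is an assumption.

```python
from collections import Counter

def merge_solutions(solutions, w):
    """Sum per-dimension weights across solutions, then drop dimensions below w."""
    total = Counter()
    for dims in solutions:
        total.update(dims)  # Counter.update adds the values of matching keys
    return {d: wt for d, wt in total.items() if wt >= w}

black_solutions = [{"s1": 1, "s2": 1}, {"s1": 1}, {"s1": 1, "s3": 1}]
print(merge_solutions(black_solutions, w=2))  # {'s1': 3}
```

The same function would be called once for the black dimension sets and once for the white dimension sets, since the two are merged separately.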
After all vectors in the black vector set and the white vector set have been processed, the merged black dimension set is filtered with the merged white dimension set (that is, black dimension set = black dimension set - white dimension set). The black dimension set is then ranked by weight, and the K highest-ranked black dimensions are taken as the result.
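The final filtering and ranking step can be sketched as a set difference followed by a sort. Breaking ties by dimension name is an assumption added only to make the result deterministic; the function name and sample weights are likewise hypothetical.

```python
def select_top_k(black_dims, white_dims, k):
    """Remove white dimensions from the black set, then keep the K heaviest."""
    filtered = {d: wt for d, wt in black_dims.items() if d not in white_dims}
    # Sort by descending weight; ties are broken by name (an assumption).
    ranked = sorted(filtered, key=lambda d: (-filtered[d], d))
    return ranked[:k]

black_dims = {"s1": 5, "s2": 3, "common": 9, "s3": 4}
white_dims = {"common": 7}
print(select_top_k(black_dims, white_dims, k=2))  # ['s1', 's3']
```

Although `common` carries the highest black weight, it is filtered out first because it also belongs to the white dimension set.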
All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector, so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
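Converting the learning set into K-dimensional training data might look like the sketch below. The +1 / -1 labels, the function name and the sample dimensions are assumptions; features outside the K final dimensions are simply ignored.

```python
def build_training_set(black_files, white_files, final_dims):
    """Project every file onto the K final dimensions and label it.

    Labels are an assumption of this sketch: +1 for malicious, -1 for normal.
    """
    vectors, labels = [], []
    for features in black_files:
        vectors.append([features.get(d, 0) for d in final_dims])
        labels.append(1)
    for features in white_files:
        vectors.append([features.get(d, 0) for d in final_dims])
        labels.append(-1)
    return vectors, labels

final_dims = ["s1", "s3"]  # hypothetical result of the screening, K = 2
black_files = [{"s1": 1, "s3": 1, "x": 4}, {"s1": 2}]
white_files = [{"x": 1}]
vectors, labels = build_training_set(black_files, white_files, final_dims)
print(vectors)  # [[1, 1], [2, 0], [0, 0]]
print(labels)   # [1, 1, -1]
```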
In addition, as shown in Fig. 8, the identification module 504 comprises a computing unit 5041 and an output unit 5042, wherein:
The computing unit 5041 is configured to obtain a calculation result by passing the file to be detected, once converted into a vector, through the machine learning model;
The output unit 5042 is configured to output, according to the calculation result, the malicious files and the normal files among the files to be detected.
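The two units can be sketched for a linear model as follows. The model weights, the file names and the rule that a non-negative decision value means "malicious" are assumptions made for illustration.

```python
def compute(model, vector):
    """Computing unit: linear decision value w.x + b for one vectorized file."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, vector)) + b

def output(results):
    """Output unit: split file names into malicious and normal by decision sign."""
    malicious = [name for name, score in results if score >= 0]
    normal = [name for name, score in results if score < 0]
    return malicious, normal

model = ([1.0, -1.0], 0.0)  # hypothetical trained weights and bias
pending = {"a.exe": [1, 0], "b.exe": [0, 1]}
results = [(name, compute(model, vec)) for name, vec in pending.items()]
print(output(results))  # (['a.exe'], ['b.exe'])
```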
With the malicious file identification method, device and storage medium of the embodiments of the present invention, a machine learning model is generated from a learning set composed of predetermined malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without the participation of analysts. Machine learning also responds promptly and can extract virus features accurately and efficiently, so that any malicious file discovered can be handled immediately, which greatly improves the detection efficiency of malicious files.
In addition, the present invention also provides a computer-readable storage medium storing a computer-executable program. After the program is loaded into the memory of a computer, the computer generates a machine learning model from a learning set composed of predetermined malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
It should be noted that the above embodiments of the present invention are all illustrated with the Windows operating system but are not limited to it. Other operating systems, such as Mac or Linux systems, can likewise detect and identify malicious files by adopting the above scheme of the present invention; the specific principles are not repeated here.
The above are merely preferred embodiments of the present invention and do not thereby limit the scope of its claims. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (12)

1. A malicious file identification method, characterized by comprising the following steps:
generating a machine learning model from a learning set composed of predetermined malicious files and normal files;
reading a file to be detected that is outside the learning set;
converting the file to be detected into a vector;
performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
2. The method according to claim 1, characterized in that the step of generating a machine learning model from a learning set composed of predetermined malicious files and normal files comprises:
converting the malicious files and normal files in the learning set into vectors respectively;
performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set;
learning from the merged and screened vectors through a classifier to generate the machine learning model.
3. The method according to claim 2, characterized in that the vectors of all malicious files in the learning set are designated the black vector set and the vectors of all normal files the white vector set, and the step of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set comprises:
randomly selecting two black vectors from the black vector set and extracting the dimensions common to the two black vectors as a black dimension set; randomly selecting two white vectors from the white vector set and extracting the dimensions common to the two white vectors as a white dimension set;
removing from the black dimension set every dimension that also appears in the white dimension set to form a new black dimension set, and assigning a weight to each dimension in the white dimension set and in the new black dimension set;
merging the dimensions of the white dimension set and of the new black dimension set separately according to weight, and discarding any dimension whose merged weight is lower than a predetermined weight threshold; and repeating the above three steps until all vectors in the black vector set and the white vector set have been processed.
4. The method according to claim 3, characterized in that the step of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set further comprises:
after all vectors in the black vector set and the white vector set have been processed, filtering the merged black dimension set with the merged white dimension set;
sorting the filtered black dimension set by weight and taking the K highest-ranked black dimensions as the final dimensions;
converting all vectors in the black vector set and the white vector set into K-dimensional vectors.
5. The method according to claim 1, 2, 3 or 4, characterized in that the step of performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector comprises:
obtaining a calculation result by passing the file to be detected, once converted into a vector, through the machine learning model;
outputting, according to the calculation result, the malicious files and the normal files among the files to be detected.
6. The method according to claim 5, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.
7. A malicious file identification device, characterized by comprising:
a model generation module, configured to generate a machine learning model from a learning set composed of predetermined malicious files and normal files;
a reading module, configured to read a file to be detected that is outside the learning set;
a vector conversion module, configured to convert the file to be detected into a vector;
an identification module, configured to perform malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
8. The device according to claim 7, characterized in that the model generation module comprises:
a vector conversion unit, configured to convert the malicious files and normal files in the learning set into vectors respectively;
a merging and screening unit, configured to perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;
a generation unit, configured to learn from the merged and screened vectors through a classifier to generate the machine learning model.
9. The device according to claim 8, characterized in that the vectors of all malicious files in the learning set are designated the black vector set and the vectors of all normal files the white vector set, and the merging and screening unit comprises:
a first extraction subunit, configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;
a screening subunit, configured to remove from the black dimension set every dimension that also appears in the white dimension set, forming a new black dimension set, and to assign a weight to each dimension in the white dimension set and in the new black dimension set;
a merging subunit, configured to merge the dimensions of the white dimension set and of the new black dimension set separately according to weight, and to discard any dimension whose merged weight is lower than a predetermined weight threshold;
a filtering subunit, configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;
a second extraction subunit, configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;
a conversion subunit, configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.
10. The device according to claim 7, 8 or 9, characterized in that the identification module comprises:
a computing unit, configured to obtain a calculation result by passing the file to be detected, once converted into a vector, through the machine learning model;
an output unit, configured to output, according to the calculation result, the malicious files and the normal files among the files to be detected.
11. The device according to claim 10, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.
12. A computer-readable storage medium storing a computer-executable program, wherein, after the program is loaded into the memory of a computer, the computer generates a machine learning model from a learning set composed of predetermined malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector.
CN201210213078.2A 2012-06-26 2012-06-26 Malicious file identification method, device and storage medium Active CN102737186B (en)


Publications (2)

Publication Number Publication Date
CN102737186A true CN102737186A (en) 2012-10-17
CN102737186B CN102737186B (en) 2015-06-17

Family

ID=46992673

Country Status (1)

Country Link
CN (1) CN102737186B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604364A (en) * 2009-07-10 2009-12-16 珠海金山软件股份有限公司 Computer rogue program categorizing system and sorting technique based on file instruction sequence
CN102479298A (en) * 2010-11-29 2012-05-30 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN101984450A (en) * 2010-12-15 2011-03-09 北京安天电子设备有限公司 Malicious code detection method and system
CN102346829A (en) * 2011-09-22 2012-02-08 重庆大学 Virus detection method based on ensemble classification

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008333A (en) * 2013-02-21 2014-08-27 腾讯科技(深圳)有限公司 Installation package detecting method and device
CN104008333B (en) * 2013-02-21 2017-12-01 腾讯科技(深圳)有限公司 The detection method and equipment of a kind of installation kit
CN104424437B (en) * 2013-08-28 2018-07-10 贝壳网际(北京)安全技术有限公司 Multi-file sample testing method and device and client
CN104424437A (en) * 2013-08-28 2015-03-18 贝壳网际(北京)安全技术有限公司 Multi-file sample testing method and device and client
CN103473506A (en) * 2013-08-30 2013-12-25 北京奇虎科技有限公司 Method and device of recognizing malicious APK files
CN103473506B (en) * 2013-08-30 2016-12-28 北京奇虎科技有限公司 For the method and apparatus identifying malice APK file
CN104598820A (en) * 2015-01-14 2015-05-06 国家电网公司 Trojan virus detection method based on feature behavior activity
CN106156120A (en) * 2015-04-07 2016-11-23 阿里巴巴集团控股有限公司 The method and apparatus that character string is classified
US10503903B2 (en) 2015-11-17 2019-12-10 Wuhan Antiy Information Technology Co., Ltd. Method, system, and device for inferring malicious code rule based on deep learning method
WO2017084586A1 (en) * 2015-11-17 2017-05-26 武汉安天信息技术有限责任公司 Method , system, and device for inferring malicious code rule based on deep learning method
CN105897752A (en) * 2016-06-03 2016-08-24 北京奇虎科技有限公司 Safety detection method and device of unknown domain name
CN105897751A (en) * 2016-06-03 2016-08-24 北京奇虎科技有限公司 Generation method and device of threat Intelligence
WO2019242442A1 (en) * 2018-06-20 2019-12-26 深信服科技股份有限公司 Multi-model feature-based malware identification method, system and related apparatus
CN110619213A (en) * 2018-06-20 2019-12-27 深信服科技股份有限公司 Malicious software identification method, system and related device based on multi-model features
CN109992969A (en) * 2019-03-25 2019-07-09 腾讯科技(深圳)有限公司 A kind of malicious file detection method, device and detection platform
CN109992969B (en) * 2019-03-25 2023-03-21 腾讯科技(深圳)有限公司 Malicious file detection method and device and detection platform
CN111859381A (en) * 2019-04-29 2020-10-30 深信服科技股份有限公司 File detection method, device, equipment and medium
CN111371812A (en) * 2020-05-27 2020-07-03 腾讯科技(深圳)有限公司 Virus detection method, device and medium
US11055443B1 (en) 2020-10-01 2021-07-06 Flexxon Pte. Ltd. Module and method for detecting malicious activities in a storage device
CN114282228A (en) * 2020-10-01 2022-04-05 丰立有限公司 Module and method for detecting malicious activity in a storage device
CN114282228B (en) * 2020-10-01 2022-11-01 丰立有限公司 Module and method for detecting malicious activity in a storage device
CN113935022A (en) * 2021-12-17 2022-01-14 北京微步在线科技有限公司 Homologous sample capturing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102737186B (en) 2015-06-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230706

Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518044, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
