CN102737186B - Malicious file identification method, device and storage medium - Google Patents

Malicious file identification method, device and storage medium

Info

Publication number
CN102737186B
Authority
CN
China
Prior art keywords
file
dimension
vector
black
white
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210213078.2A
Other languages
Chinese (zh)
Other versions
CN102737186A (en)
Inventor
崔精兵
杨宜
于涛
白子潘
吴家旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210213078.2A
Publication of CN102737186A
Application granted
Publication of CN102737186B
Legal status: Active

Abstract

The invention discloses a malicious file identification method, a device and a storage medium. The method comprises: generating a machine learning model from a preset learning set consisting of malicious files and normal files; reading a file to be detected that is outside the learning set; converting the file to be detected into a vector; and performing malicious file identification, through the machine learning model, on the file to be detected that has been converted into a vector. Because the machine learning model is generated from the preset learning set of malicious files and normal files and is then used to identify malicious files among files outside the learning set, virus features can be extracted in a timely, accurate and efficient manner, any discovered malicious file can be processed immediately, and the detection efficiency of malicious files is greatly improved.

Description

Malicious file identification method, device and storage medium
Technical field
The present invention relates to the field of Internet technology, in particular to the security field, and more particularly to a malicious file identification method, device and storage medium.
Background
With the development of Internet technology, the spread of viruses is also intensifying. Viruses cause great harm to the security of user information and user property. Developing an antivirus engine that responds quickly and efficiently, with a high detection rate and high accuracy, has therefore become a focus of current Internet information security.
The virus identification technique used by a traditional antivirus engine is usually as follows: an analyst analyzes virus files and extracts virus features; the features are stored in a virus database; the antivirus engine scans existing files against the virus database and reports a virus if a matching feature is found.
Traditional virus identification technology has the following drawbacks:
1. It places high demands on the professional skill of the analyst, and the quality of the extracted virus features determines the false positive rate and the detection rate;
2. Analyzing virus files and extracting virus features is very time-consuming;
3. Efficiency is low: as the number of records in the virus database grows, the time required to match against each record increases geometrically;
4. Viruses are not discovered in time: given the huge number of new virus variants and the limited processing capacity of analysts, some viruses are only noticed and handled after an outbreak, by which time they have already caused considerable harm.
Summary of the invention
The main purpose of the present invention is to provide a malicious file identification method, device and storage medium, with the aim of improving the detection efficiency of malicious files.
To achieve the above purpose, the present invention proposes a malicious file identification method comprising the following steps:
generating a machine learning model from a predetermined learning set composed of malicious files and normal files;
reading a file to be detected that is outside the learning set;
converting the file to be detected into a vector;
performing malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
The present invention also proposes a malicious file identification device, comprising:
a model generation module, configured to generate a machine learning model from a predetermined learning set composed of malicious files and normal files;
a reading module, configured to read a file to be detected that is outside the learning set;
a vector conversion module, configured to convert the file to be detected into a vector;
an identification module, configured to perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
The present invention also proposes a computer-readable storage medium storing a program to be run by a computer. After the program is loaded into the memory of the computer, it generates a machine learning model from a predetermined learning set composed of malicious files and normal files; reads a file to be detected that is outside the learning set; converts the file to be detected into a vector; and performs malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
In the malicious file identification method, device and storage medium proposed by the present invention, a machine learning model is generated from a preset learning set composed of malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Virus features can thus be extracted in a timely, accurate and effective manner, and any malicious file that is found can be processed immediately, which greatly improves the detection efficiency of malicious files.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a preferred embodiment of the malicious file identification method of the present invention;
Fig. 2 is a schematic flowchart of generating a machine learning model from the predetermined learning set of malicious files and normal files in the preferred embodiment of the malicious file identification method of the present invention;
Fig. 3 is a schematic flowchart of dimension merging and screening performed on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the malicious file identification method of the present invention;
Fig. 4 is a schematic flowchart of one example of dimension merging and screening performed on the vectors of the malicious files and normal files in the learning set in the preferred embodiment of the malicious file identification method of the present invention;
Fig. 5 is a schematic structural diagram of a preferred embodiment of the malicious file identification device of the present invention;
Fig. 6 is a schematic structural diagram of the model generation module in the preferred embodiment of the malicious file identification device of the present invention;
Fig. 7 is a schematic structural diagram of the merging and screening unit in the preferred embodiment of the malicious file identification device of the present invention;
Fig. 8 is a schematic structural diagram of the identification module in the preferred embodiment of the malicious file identification device of the present invention.
To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Detailed description
The solution of the embodiments of the present invention is mainly as follows: a machine learning model is generated from a predetermined learning set composed of malicious files and normal files; a file to be detected that is outside the learning set is read and converted into a vector; and the machine learning model performs malicious file identification on that vector. The timely response and fast processing speed of machine learning are used to improve the detection efficiency of malicious files.
In the present invention, a malicious file may be a virus file or another kind of malicious file; the following embodiments use malicious files for illustration. The technical terms involved include:
Black file: a virus file
Black vector: the vector into which a virus file is converted
White file: a normal, non-virus file
White vector: the vector into which a normal, non-virus file is converted
SVM: Support Vector Machine
PE file: an executable file format on Windows systems
As shown in Fig. 1, a preferred embodiment of the present invention proposes a malicious file identification method, comprising:
Step S101: generate a machine learning model from a predetermined learning set composed of malicious files and normal files;
Taking a Windows system as an example, in order to scan files on the system for viruses, this embodiment first uses known virus files and non-virus files (the malicious files and normal files referred to in this embodiment) to generate a machine learning model, and then uses that model to identify viruses among the files on the system, thereby solving the problem of classifying virus files and normal files in the system.
The known virus files and non-virus files may be collected in advance by virus analysts to form a learning set. After feature extraction, dimension merging and dimension screening have been performed on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.
Specifically, the malicious files and normal files in the learning set are first converted into vectors, that is, valid sample features are extracted from each malicious file and normal file in the learning set.
For an executable file (PE file), the features that help identify a virus include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
In this embodiment, each feature key and the value of that feature form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
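The (key:value) representation can be illustrated with a minimal Python sketch. It covers only one of the feature types listed above (character strings) and assumes, for illustration, that the value of a feature is its occurrence count; the function name `extract_features`, the `str:` key prefix and the sample bytes are hypothetical and do not come from the patent.

```python
import re
from collections import Counter

def extract_features(data: bytes, min_len: int = 4) -> dict:
    """Turn raw file bytes into a sparse {key: value} feature set.

    Only printable ASCII strings are extracted here; the value of each
    key is assumed to be the number of times the string occurs.
    """
    features = Counter()
    # Runs of at least min_len printable ASCII bytes.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    for match in re.finditer(pattern, data):
        features["str:" + match.group().decode("ascii")] += 1
    return dict(features)

# Hypothetical bytes standing in for the contents of a PE file.
sample = b"\x00\x01CreateFileA\x00\x90\x90kernel32.dll\x00CreateFileA\x00"
features = extract_features(sample)
print(features)
```

Each file yields a different set of keys, which is why the resulting vector has no fixed number of dimensions.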
Feature extraction thus converts a file into a multi-dimensional vector with a variable number of dimensions. However, the classifier that generates the machine learning model requires vectors with a fixed number of dimensions. The dimensions are fixed by merging the dimensions (keys) of all files; if a single file lacks a given dimension, its value for that dimension is set to 0. A massive number of files yields a massive number of dimensions and leads to the curse of dimensionality, so the dimensions must be merged and screened. Finally, the classifier learns from the merged and screened vectors to generate the machine learning model.
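Fixing the dimensions by merging the keys of all files can be sketched as follows. This is an illustrative sketch only: the function name `to_fixed_vectors` and the sample feature sets are hypothetical, and no screening is applied yet, so the number of dimensions grows with every new key.

```python
def to_fixed_vectors(feature_sets):
    """Merge the keys of all files and emit fixed-length vectors.

    A dimension that a file does not have is filled with 0.
    """
    dims = sorted(set().union(*feature_sets))
    vectors = [[fs.get(d, 0) for d in dims] for fs in feature_sets]
    return dims, vectors

# Two hypothetical files with partially overlapping feature keys.
files = [{"a": 1, "b": 2}, {"b": 3, "c": 1}]
dims, vectors = to_fixed_vectors(files)
print(dims)     # ['a', 'b', 'c']
print(vectors)  # [[1, 2, 0], [0, 3, 1]]
```

With many files the merged key set becomes very large, which is the motivation for the merging and screening step.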
Step S102: read a file to be detected that is outside the learning set;
Step S103: convert the file to be detected into a vector;
Step S104: perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
In steps S102 to S104, when a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on that vector.
In a preferred implementation, taking a PC as an example, the machine learning model generated in step S101 can be applied to the virus scanning engine on the PC front end to scan the user's PC for viruses. The specific process is as follows:
1. Read the file to be detected on the PC;
2. Convert the file read from the PC into a vector.
As described above, the classifier performs vector learning on each virus file and normal file in the learning set to generate the machine learning model, so the objects the model processes must be vectors. In this embodiment, therefore, when a file to be detected is read from the PC system, it must be converted into a vector, that is, valid sample features must be extracted from it. These features include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
Each feature key and the value of that feature then form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
3. Use the machine learning model to perform malicious file identification on the file from the PC that has been converted into a vector.
The vectorized file to be detected is fed to the machine learning model, which distinguishes virus files from normal files. Specifically, the model evaluates a linear function on the vector of the file to be detected; the result determines whether the file is malicious or normal, and the malicious files and normal files among the files to be detected are output accordingly.
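The linear function calculation can be sketched in a few lines of Python. The weights and bias below are hypothetical stand-ins for a trained model, and the convention that a positive score means "malicious" is an assumption made for illustration; the patent does not specify the sign convention.

```python
def classify(vector, weights, bias):
    """Evaluate the linear function w . x + b and map its sign to a label."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return "malicious" if score > 0 else "normal"

# Hypothetical model parameters for a two-dimensional feature space.
weights, bias = [0.6, -0.6], 0.0

print(classify([2, 0], weights, bias))  # score  1.2
print(classify([0, 2], weights, bias))  # score -1.2
```

Because identification reduces to one dot product per file, the cost of scanning does not grow with the number of samples used for learning.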
Specifically, as shown in Fig. 2, step S101 of generating a machine learning model from the predetermined learning set of malicious files and normal files comprises:
Step S1011: convert the malicious files and normal files in the learning set into vectors;
Converting the malicious files and normal files in the learning set into vectors means extracting valid sample features from each malicious file and normal file in the learning set.
For an executable file (PE file), the features that help identify a virus include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
In this embodiment, each feature key and the value of that feature form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
Step S1012: perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;
Feature extraction converts a file into a multi-dimensional vector with a variable number of dimensions. However, the classifier that generates the machine learning model requires vectors with a fixed number of dimensions. The dimensions are fixed by merging the dimensions (keys) of all files; if a single file lacks a given dimension, its value for that dimension is set to 0. A massive number of files yields a massive number of dimensions and leads to the curse of dimensionality, so the dimensions must be merged and screened.
Specifically, this embodiment merges the dimensions and screens out K dimensions, where the K dimensions are the top K dimensions selected from the many dimensions by merging and screening according to a certain rule. This is described in detail below with reference to Fig. 3.
Step S1013: the classifier learns from the merged and screened vectors to generate the machine learning model.
In this embodiment, the classifier may be a linear classifier. A so-called linear SVM is one whose kernel function is the inner product. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine and a generalized linear classifier, characterized by simultaneously minimizing the empirical error and maximizing the geometric margin. Applying an SVM to virus identification means solving the problem of classifying virus files and normal files.
The SVM learns from the merged and screened vectors, thereby generating the machine learning model.
Of course, in other embodiments, other machine learning methods may be used for discrimination instead of an SVM.
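As a rough illustration of how a linear SVM could learn from fixed-dimension vectors, the sketch below trains one by subgradient descent on the regularized hinge loss. This training procedure is an assumption chosen for brevity, since the patent does not state how the SVM is trained; the names `train_linear_svm` and `predict`, the hyperparameters and the toy data are all hypothetical. Labels are +1 for black vectors and -1 for white vectors.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Sample is inside the margin: push the boundary away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b > 0 else -1

# Toy K-dimensional vectors (K = 2): two black samples, two white samples.
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

The learned pair `(w, b)` is the whole model in this sketch, which is why identification later needs only a linear function evaluation.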
More specifically, as shown in Fig. 3, let the vectors of all malicious files in the learning set be the black vector set and the vectors of all normal files be the white vector set. Step S1012 of performing dimension merging and screening on the vectors of the malicious files and normal files in the learning set then comprises:
Step S10: randomly select two black vectors from the black vector set and extract the dimensions common to both as the black dimension set; randomly select two white vectors from the white vector set and extract the dimensions common to both as the white dimension set;
Step S11: remove from the black dimension set every dimension that appears in the white dimension set to form a new black dimension set, and assign a weight to each dimension in the white dimension set and the new black dimension set;
In steps S10 and S11, the dimensions are merged and K dimensions are screened out in the following way:
The problem of merging and screening dimensions over the whole black vector set and white vector set is split into subproblems, each involving two black vectors and two white vectors. Each subproblem is then solved: the dimensions common to the two white vectors (their intersection) are extracted as the white dimension set of the subproblem, the dimensions common to the two black vectors are extracted as the black dimension set of the subproblem, every dimension in the black dimension set that appears in the white dimension set is removed, and a weight is assigned to each selected black and white dimension.
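A single subproblem can be sketched with Python sets. The patent does not say how the weight of a dimension is computed, so this sketch assumes a unit weight per selected dimension; the function name `solve_subproblem` and the sample vectors are hypothetical.

```python
def solve_subproblem(b1, b2, w1, w2, initial_weight=1):
    """Solve one subproblem over two black and two white sparse vectors.

    Returns (black_dims, white_dims) as {dimension: weight} dicts.
    The unit initial weight is an assumption, not taken from the patent.
    """
    white_dims = set(w1) & set(w2)                # common white dimensions
    black_dims = (set(b1) & set(b2)) - white_dims  # common black, minus white
    return ({d: initial_weight for d in black_dims},
            {d: initial_weight for d in white_dims})

black_1 = {"evil": 1, "shared": 1, "x": 1}
black_2 = {"evil": 2, "shared": 1}
white_1 = {"shared": 1, "ok": 1}
white_2 = {"shared": 2, "ok": 1}

black_dims, white_dims = solve_subproblem(black_1, black_2, white_1, white_2)
print(black_dims)  # {'evil': 1}
print(white_dims)
```

Dimension `x` is dropped because only one black vector has it, and `shared` is dropped from the black side because it also appears in the white dimension set.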
Step S12: merge the white dimension set and the new black dimension set by dimension according to weight, and discard any dimension whose merged weight is below a predefined weight threshold;
The solutions of all subproblems are merged by dimension. A weight threshold w is set for the merging process: if the weight of a merged dimension (the weights of the same dimension are added during merging) is below w, the dimension is discarded directly, which prevents the dimension set from growing without limit.
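Merging one subproblem solution into a running total might look like the sketch below. The function name `merge_dimensions` and the sample weights are hypothetical; the threshold comparison follows the text (dimensions whose merged weight is lower than the threshold are dropped).

```python
def merge_dimensions(total, partial, threshold):
    """Add the weights of matching dimensions, then drop light dimensions."""
    merged = dict(total)
    for dim, weight in partial.items():
        merged[dim] = merged.get(dim, 0) + weight
    # Discard every dimension whose merged weight is below the threshold.
    return {d: w for d, w in merged.items() if w >= threshold}

running_total = {"a": 2, "b": 1}
subproblem_solution = {"a": 1, "c": 1}
merged = merge_dimensions(running_total, subproblem_solution, threshold=2)
print(merged)  # {'a': 3}
```

Here `a` accumulates weight 3 and survives, while `b` and `c` stay at weight 1 and are discarded.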
Step S13: determine whether all vectors in the black vector set and the white vector set have been processed; if so, go to step S14; otherwise, return to step S10;
Step S14: filter the merged black dimension set with the merged white dimension set;
Step S15: sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;
In steps S13 to S15, once all vectors in the black vector set and the white vector set have been learned, the black dimension set is filtered with the merged white dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimension set is ranked by weight, and the K highest-ranked black dimensions are taken as the result.
Step S16: convert all vectors in the black vector set and the white vector set into K-dimensional vectors.
All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
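Steps S14 to S16 can be sketched as a filter, a top-K selection and a projection. The names `select_top_k` and `project` and the sample weights are hypothetical; ties in weight are broken alphabetically here, which is an assumption made only to keep the output deterministic.

```python
def select_top_k(black_total, white_total, k):
    """Filter black dimensions by the white set and keep the K heaviest."""
    filtered = {d: w for d, w in black_total.items() if d not in white_total}
    ranked = sorted(filtered, key=lambda d: (-filtered[d], d))
    return ranked[:k]

def project(feature_set, dims):
    """Convert a sparse {key: value} vector into a K-dimensional vector."""
    return [feature_set.get(d, 0) for d in dims]

black_total = {"evil": 5, "packer": 3, "shared": 4, "rare": 1}
white_total = {"shared": 6, "ok": 2}

final_dims = select_top_k(black_total, white_total, k=2)
print(final_dims)                                      # ['evil', 'packer']
print(project({"evil": 2, "shared": 1}, final_dims))   # [2, 0]
```

Every file, black or white, is projected onto the same K dimensions, so the classifier always receives vectors of identical length.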
The process of merging and screening the vectors of all virus files and normal files in the learning set is described in detail below with a concrete example.
As shown in Fig. 4, let FB and FW denote the overall black and white sets respectively, FBL and FWL the common dimension sets of the black and white vectors respectively, B1 and B2 the two black vectors randomly selected from the black vector set, and W1 and W2 the two white vectors randomly selected from the white vector set. The process of merging and screening the vectors of all virus files and normal files in the learning set is then:
S1: initialize FB and FW and select the black or white vector set; if the black vector set is selected, go to step S2; if the white vector set is selected, go to step S3;
S2: determine whether all black vectors in the black vector set have been marked; if so, go to step S4; otherwise, go to step S21;
S21: randomly select two black vectors B1 and B2;
S22: extract the common dimension set FBL and assign a weight to each dimension; go to S23;
S3: determine whether all white vectors in the white vector set have been marked; if so, go to step S4; otherwise, go to step S31;
S31: randomly select two white vectors W1 and W2;
S32: extract the common dimension set FWL and assign a weight to each dimension; go to S23;
S23: take the set difference of FBL and FWL as the new FBL;
S24: merge the new FBL and FWL into the overall sets FB and FW respectively, adding the weights of matching dimensions during merging;
S25: remove from FB and FW the dimensions whose weight is less than w-limit (the configured weight threshold); return to steps S2 and S3 respectively.
S4: take the set difference of FB and FW as the new FB;
S5: sort FB by weight and take the top K dimensions to obtain the final FB.
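The S1 to S5 flow can be approximated by the compact sketch below. It simplifies the flowchart in several ways that are assumptions rather than part of the patent: one black pair and one white pair are handled per round, "marking" is modelled by popping vectors from shuffled lists, each selected dimension receives a unit weight, and `w_limit` defaults to 1 so that nothing is dropped in this tiny example. All function names and data are hypothetical.

```python
import random

def common_dims(v1, v2):
    return set(v1) & set(v2)

def merge(total, dims, w_limit):
    """S24 and S25: add unit weights, then reject dimensions below w_limit."""
    for d in dims:
        total[d] = total.get(d, 0) + 1  # assumed unit weight per subproblem
    return {d: w for d, w in total.items() if w >= w_limit}

def select_dimensions(black, white, k, w_limit=1, seed=0):
    rng = random.Random(seed)
    black, white = black[:], white[:]
    rng.shuffle(black)
    rng.shuffle(white)
    fb, fw = {}, {}                                        # S1
    while len(black) >= 2 or len(white) >= 2:              # S2 / S3
        fbl = common_dims(black.pop(), black.pop()) if len(black) >= 2 else set()
        fwl = common_dims(white.pop(), white.pop()) if len(white) >= 2 else set()
        fbl -= fwl                                         # S23
        fb = merge(fb, fbl, w_limit)                       # S24, S25
        fw = merge(fw, fwl, w_limit)
    fb = {d: w for d, w in fb.items() if d not in fw}      # S4
    return sorted(fb, key=lambda d: (-fb[d], d))[:k]       # S5

black = [{"evil": 1, "shared": 1, "x": 1}, {"evil": 2, "shared": 1},
         {"evil": 1, "shared": 3, "y": 1}, {"evil": 1, "shared": 1, "y": 2}]
white = [{"shared": 1, "ok": 1}, {"shared": 2, "ok": 1}]

top = select_dimensions(black, white, k=1)
print(top)  # ['evil']
```

Whatever pairing the shuffle produces, `evil` is common to every black pair and absent from the white set, so it accumulates the highest weight and is selected.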
In this embodiment, a machine learning model is generated from a preset learning set composed of malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Malicious code features such as virus features are thus extracted automatically by machine, without the participation of analysts. Because machine learning responds in time and extracts virus features accurately and effectively, any malicious file that is found can be processed immediately, which greatly improves the detection efficiency of malicious files.
As shown in Fig. 5, a preferred embodiment of the present invention proposes a malicious file identification device, comprising: a model generation module 501, a reading module 502, a vector conversion module 503 and an identification module 504, wherein:
the model generation module 501 is configured to generate a machine learning model from a predetermined learning set composed of malicious files and normal files;
the reading module 502 is configured to read a file to be detected that is outside the learning set;
the vector conversion module 503 is configured to convert the file to be detected into a vector;
the identification module 504 is configured to perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
Taking a Windows system as an example, in order to scan files on the system for viruses, this embodiment first uses known virus files and non-virus files (the malicious files and normal files referred to in this embodiment) to generate a machine learning model, and then uses that model to identify viruses among the files on the system, thereby solving the problem of classifying virus files and normal files in the system.
The known virus files and non-virus files may be collected in advance by virus analysts to form a learning set. After the model generation module 501 has performed feature extraction, dimension merging and dimension screening on each virus file and normal file in the learning set, a classifier learns from the resulting vectors, and the machine learning model is finally generated.
Specifically, the malicious files and normal files in the learning set are first converted into vectors, that is, valid sample features are extracted from each malicious file and normal file in the learning set.
For an executable file (PE file), the features that help identify a virus include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
In this embodiment, each feature key and the value of that feature form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
Feature extraction thus converts a file into a multi-dimensional vector with a variable number of dimensions. However, the classifier that generates the machine learning model requires vectors with a fixed number of dimensions. The dimensions are fixed by merging the dimensions (keys) of all files; if a single file lacks a given dimension, its value for that dimension is set to 0. A massive number of files yields a massive number of dimensions and leads to the curse of dimensionality, so the dimensions must be merged and screened. Finally, the classifier learns from the merged and screened vectors to generate the machine learning model.
When a file outside the learning set needs to be detected, the file is read and converted into a vector, and the machine learning model generated in step S101 performs malicious file identification on that vector.
In a preferred implementation, taking a PC as an example, the machine learning model generated by the model generation module 501 can be applied to the virus scanning engine on the PC front end to scan the user's PC for viruses. The specific process is as follows:
1. Read the file to be detected on the PC;
2. Convert the file read from the PC into a vector.
As described above, the classifier performs vector learning on each virus file and normal file in the learning set to generate the machine learning model, so the objects the model processes must be vectors. In this embodiment, therefore, when the reading module 502 reads a file to be detected from the PC system, the vector conversion module 503 must convert it into a vector, that is, extract valid sample features from it. These features include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
Each feature key and the value of that feature then form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
3. Use the machine learning model to perform malicious file identification on the file from the PC that has been converted into a vector.
The identification module 504 feeds the vectorized file to be detected to the machine learning model, which distinguishes virus files from normal files. Specifically, the model evaluates a linear function on the vector of the file to be detected; the result determines whether the file is malicious or normal, and the malicious files and normal files among the files to be detected are output accordingly.
Specifically, as shown in Fig. 6, the model generation module 501 comprises: a vector conversion unit 5011, a merging and screening unit 5012 and a generation unit 5013, wherein:
the vector conversion unit 5011 is configured to convert the malicious files and normal files in the learning set into vectors;
the merging and screening unit 5012 is configured to perform dimension merging and screening on the vectors of the malicious files and normal files in the learning set;
the generation unit 5013 is configured to have the classifier learn from the merged and screened vectors to generate the machine learning model.
In this embodiment, converting the malicious files and normal files in the learning set into vectors means extracting valid sample features from each malicious file and normal file in the learning set.
For an executable file (PE file), the features that help identify a virus include: character strings, instruction sequences, function procedures, imported and exported functions, and the attributes of each section.
In this embodiment, each feature key and the value of that feature form a (key:value) pair, so a file (whether malicious or normal) becomes a set of (key:value) pairs. If each key is treated as a dimension, the set of (key:value) pairs of a file can be regarded as a multi-dimensional vector whose number of dimensions is not fixed.
Feature extraction thus converts a file into a multi-dimensional vector with a variable number of dimensions. However, the classifier that generates the machine learning model requires vectors with a fixed number of dimensions. The dimensions are fixed by merging the dimensions (keys) of all files; if a single file lacks a given dimension, its value for that dimension is set to 0. A massive number of files yields a massive number of dimensions and leads to the curse of dimensionality, so the dimensions must be merged and screened. Specifically, this embodiment merges the dimensions and screens out K dimensions, where the K dimensions are the top K dimensions selected from the many dimensions by merging and screening according to a certain rule.
In this embodiment, the classifier may be a linear classifier. A so-called linear SVM is one whose kernel function is the inner product. This embodiment specifically uses a support vector machine (SVM). An SVM is a trainable learning machine and a generalized linear classifier, characterized by simultaneously minimizing the empirical error and maximizing the geometric margin. Applying an SVM to virus identification means solving the problem of classifying virus files and normal files.
The SVM learns from the merged and screened vectors, thereby generating the machine learning model.
Of course, in other embodiments, other machine learning methods may be used for discrimination instead of an SVM.
More specifically, as shown in Figure 7, if the vector setting all malicious files that described study is concentrated is black vector set, the vector of all normal files is white vector set, then described merging and screening unit 5012 comprise: the first extraction subelement 50121, screening subelement 50122, merging subelement 50123, filtration subelement 50124, second extract subelement 50125 and transformant unit 50126, wherein:
First extracts subelement 50121, for the black vector of random selecting two from described black vector set, extracts the total dimension of two black vectors, as black dimension collection; The white vector of random selecting two from described white vector set, extracts the total dimension of two white vectors, as white dimension collection;
Screening subelement 50122, for described black dimension being concentrated all dimensions appearing at described white dimension concentrated to remove, forming new black dimension collection, giving weight to each dimension that described white dimension collection and new black dimension are concentrated;
Merging subelement 50123, for described white dimension collection and new black dimension collection are carried out dimension merging respectively according to weight, and abandoning merging the dimension of rear weight lower than predefined weight threshold values;
Filter subelement 50124, for after Vector Processing all in described black vector set and white vector set, filter the black dimension collection after merging with the white dimension collection after merging;
Second extracts subelement 50125, for sorting according to weight size to the black dimension collection after filtration, takes out the black dimension of the highest front K dimension of rank as final dimension;
Transformant unit 50126, for changing into K dimensional vector by the institute's directed quantity in described black vector set and white vector set.
In the present embodiment, in order to dimension being merged and filtering out K dimension, in the following ways:
Whole black vector set and white vector set merged and screens the problem of dimension, splitting into the subproblem of the white vector of two black vector sums two; Then each subproblem is separated, two white vectors are extracted total dimension (getting common factor), as the white dimension collection of subproblem, two black vectors are extracted the black dimension collection of total dimension as subproblem, and black dimension concentrated all dimensions appearing at white dimension concentrated to remove, give weight to each black, the white dimension elected.
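Solving one such subproblem can be sketched in Python as below. The function name `solve_subproblem` is hypothetical, and the weighting rule (a unit weight of 1 per selected dimension) is an assumption, since the text only says that a weight is assigned without stating how.

```python
import random


def solve_subproblem(black_vectors, white_vectors, rng=random):
    """Solve one subproblem on two random black and two random white vectors.

    Vectors are dicts mapping feature key -> value. Returns two dicts of
    dimension -> weight: (black dimension set, white dimension set).
    """
    b1, b2 = rng.sample(black_vectors, 2)
    w1, w2 = rng.sample(white_vectors, 2)

    white_dims = set(w1) & set(w2)                # dimensions shared by both white vectors
    black_dims = (set(b1) & set(b2)) - white_dims  # shared black dims not seen in white

    # Assumed weighting rule: every selected dimension gets weight 1.
    return {d: 1 for d in black_dims}, {d: 1 for d in white_dims}


black = [{"a": 1, "b": 1, "c": 1}, {"a": 2, "b": 5}]
white = [{"b": 1, "x": 1}, {"b": 2, "x": 3, "a": 1}]
print(solve_subproblem(black, white))
```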
The solutions of all subproblems are merged by dimension. A weight threshold w is set during merging: if the weight of a merged dimension (the sum of the weights corresponding to that dimension across the merged solutions) is lower than w, the dimension is discarded directly, which prevents the dimension set from growing without bound.
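The merge step might look like the following sketch, which sums the weights per dimension and then drops dimensions below the threshold. The name `merge_solutions` is hypothetical, and applying the threshold once after a batch of solutions has been summed is an assumption; the text does not say at which point during merging the threshold is checked.

```python
from collections import Counter


def merge_solutions(solutions, w):
    """Merge subproblem solutions by summing weights per dimension.

    solutions: iterable of dicts mapping dimension -> weight.
    Dimensions whose summed weight is below the threshold w are dropped.
    """
    total = Counter()
    for weights in solutions:
        total.update(weights)  # adds the weights of matching dimensions
    return {dim: weight for dim, weight in total.items() if weight >= w}


merged = merge_solutions([{"a": 1, "b": 1}, {"a": 1}, {"a": 1, "c": 1}], w=2)
print(merged)
```

The same function would be applied separately to the black dimension sets and to the white dimension sets of the subproblems.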
Once all vectors in the black vector set and the white vector set have been learned, the black dimension set is filtered with the merged white dimension set (that is, black dimension set = black dimension set - white dimension set), the black dimensions are ranked by weight, and the K highest-ranked black dimensions are taken as the result.
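The final filtering and top-K selection can be sketched as follows. The name `select_top_k` is hypothetical, and breaking ties by dimension name is an assumption added only to make the ordering deterministic.

```python
def select_top_k(black_weights, white_weights, k):
    """Filter black dimensions with the white set, then keep the top K.

    black_weights / white_weights: dicts mapping dimension -> merged weight.
    Returns the K black dimensions with the highest weight.
    """
    # black dimension set = black dimension set - white dimension set
    filtered = {d: wt for d, wt in black_weights.items() if d not in white_weights}
    # Rank by descending weight; ties are broken by name (an assumption).
    ranked = sorted(filtered, key=lambda d: (-filtered[d], d))
    return ranked[:k]


print(select_top_k({"a": 5, "b": 3, "c": 9, "d": 3}, {"c": 2}, k=2))
```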
All vectors of the black and white files are converted into the standard form of the selected K-dimensional vector, so that the SVM can learn from the K-dimensional vectors and generate the machine learning model.
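Converting the black and white vectors into the K-dimensional standard form, ready for a classifier, could be sketched like this. The name `build_training_set` and the label encoding (+1 for malicious, -1 for normal) are assumptions made for the example.

```python
def build_training_set(black_vectors, white_vectors, final_dims):
    """Project every vector onto the K selected dimensions and label it.

    final_dims: ordered list of the K selected dimensions.
    Returns (X, y): K-dimensional rows and labels (+1 black, -1 white).
    """
    X, y = [], []
    for label, vectors in ((1, black_vectors), (-1, white_vectors)):
        for vec in vectors:
            # Dimensions absent from a file are filled with 0.
            X.append([vec.get(dim, 0) for dim in final_dims])
            y.append(label)
    return X, y


X, y = build_training_set([{"a": 2, "z": 7}], [{"b": 4}], final_dims=["a", "b"])
print(X, y)
```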
In addition, as shown in Figure 8, the identification module 504 comprises a computing unit 5041 and an output unit 5042, wherein:
The computing unit 5041 is configured to obtain a calculation result by applying the machine learning model to the file to be detected after its conversion into a vector;
The output unit 5042 is configured to output the malicious files and normal files among the files to be detected according to the calculation result.
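For a linear model, the computing and output steps might be sketched as below. The name `identify`, the representation of the model as a weight list `w` plus bias `b`, and the rule that a positive score means malicious are assumptions; the form of the calculation result is not specified in the text.

```python
def identify(files, w, b, final_dims):
    """Score each file with a linear model and split the files by result.

    files: dict mapping file name -> sparse feature dict.
    w, b: weights and bias of an already trained linear model.
    Returns (malicious_names, normal_names).
    """
    malicious, normal = [], []
    for name, sparse in files.items():
        x = [sparse.get(dim, 0) for dim in final_dims]    # K-dimensional vector
        score = sum(wj * xj for wj, xj in zip(w, x)) + b  # calculation result
        (malicious if score > 0 else normal).append(name)
    return malicious, normal


files = {"sample1": {"a": 3}, "sample2": {"b": 2}}
print(identify(files, w=[0.5, -0.5], b=0.0, final_dims=["a", "b"]))
```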
With the malicious file identification method, device and storage medium of the embodiments of the present invention, a machine learning model is generated from a learning set composed of preset malicious files and normal files, and the generated model is used to perform malicious file identification on files to be detected that are outside the learning set. Malicious code features such as those of viruses are thus extracted automatically by machine, without the participation of analysts. Because machine learning responds promptly, virus features can be extracted accurately and efficiently, and any malicious file that is found can be processed immediately, which greatly improves the detection efficiency for malicious files.
In addition, the present invention also proposes a computer-readable storage medium storing a program to be run by a computer. After the program is loaded into the memory of the computer, it causes the computer to: generate a machine learning model using a learning set composed of predetermined malicious files and normal files; read a file to be detected that is outside the learning set; convert the file to be detected into a vector; and perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector.
It should be noted that the above embodiments of the present invention are all illustrated with the Windows operating system but are not limited to it; other operating systems, such as Mac or Linux systems, can likewise adopt the above scheme of the present invention for malicious file detection and identification, and the specific principles are not repeated here.
The foregoing are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (7)

1. A malicious file identification method, characterized in that it comprises the following steps:
Generating a machine learning model using a learning set composed of predetermined malicious files and normal files;
Reading a file to be detected that is outside the learning set;
Converting the file to be detected into a vector; that is, extracting effective sample features from the file to be detected and combining each feature key and that feature's value into a (key:value) pair, so that the file to be detected becomes a set of (key:value) pairs and is converted into a multidimensional vector whose number of dimensions is not fixed;
Performing malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector; wherein the step of generating a machine learning model using a learning set composed of predetermined malicious files and normal files comprises:
Converting the malicious files and normal files in the learning set into vectors respectively;
Merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set;
Learning, by a classifier, from the merged and screened vectors to generate the machine learning model; wherein the vectors of all malicious files in the learning set are defined as a black vector set and the vectors of all normal files as a white vector set, and the step of merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set comprises:
Randomly selecting two black vectors from the black vector set and extracting the dimensions common to the two black vectors as a black dimension set; randomly selecting two white vectors from the white vector set and extracting the dimensions common to the two white vectors as a white dimension set;
Removing from the black dimension set all dimensions that appear in the white dimension set to form a new black dimension set, and assigning a weight to each dimension in the white dimension set and the new black dimension set;
Merging the white dimension set and the new black dimension set, respectively, by dimension according to weight, and discarding any dimension whose weight after merging is below a predefined weight threshold; and repeating the above three steps until all vectors in the black vector set and the white vector set have been processed.
2. The method according to claim 1, characterized in that the step of merging and screening the dimensions of the vectors of the malicious files and normal files in the learning set further comprises:
After all vectors in the black vector set and the white vector set have been processed, filtering the merged black dimension set with the merged white dimension set;
Sorting the filtered black dimension set by weight and taking the K highest-ranked black dimensions as the final dimensions;
Converting all vectors in the black vector set and the white vector set into K-dimensional vectors.
3. method according to claim 1 and 2, is characterized in that, is describedly comprised the step that the file to be detected changing into vector carries out malicious file identification by machine learning model:
Result of calculation is obtained by machine learning model to changing into the file to be detected after vector;
Malicious file in file to be detected and normal file is exported according to result of calculation.
4. method according to claim 3, is characterized in that, described predetermined malicious file and normal file refer to the known malicious file and normal file collected in advance.
5. A malicious file identification device, characterized in that it comprises:
A model generation module, configured to generate a machine learning model using a learning set composed of predetermined malicious files and normal files;
A reading module, configured to read a file to be detected that is outside the learning set;
A vector conversion module, configured to convert the file to be detected into a vector; that is, to extract effective sample features from the file to be detected and combine each feature key and that feature's value into a (key:value) pair, so that the file to be detected becomes a set of (key:value) pairs and is converted into a multidimensional vector whose number of dimensions is not fixed;
An identification module, configured to perform malicious file identification, by means of the machine learning model, on the file to be detected that has been converted into a vector; wherein the model generation module comprises:
A vector conversion unit, configured to convert the malicious files and normal files in the learning set into vectors respectively;
A merging and screening unit, configured to merge and screen the dimensions of the vectors of the malicious files and normal files in the learning set;
A generation unit, configured to learn, by a classifier, from the merged and screened vectors to generate the machine learning model; wherein the vectors of all malicious files in the learning set are defined as a black vector set and the vectors of all normal files as a white vector set, and the merging and screening unit comprises:
A first extraction subunit, configured to randomly select two black vectors from the black vector set and extract the dimensions common to the two black vectors as a black dimension set, and to randomly select two white vectors from the white vector set and extract the dimensions common to the two white vectors as a white dimension set;
A screening subunit, configured to remove from the black dimension set all dimensions that appear in the white dimension set, forming a new black dimension set, and to assign a weight to each dimension in the white dimension set and the new black dimension set;
A merging subunit, configured to merge the white dimension set and the new black dimension set, respectively, by dimension according to weight, and to discard any dimension whose weight after merging is below a predefined weight threshold;
A filtering subunit, configured to filter the merged black dimension set with the merged white dimension set after all vectors in the black vector set and the white vector set have been processed;
A second extraction subunit, configured to sort the filtered black dimension set by weight and take the K highest-ranked black dimensions as the final dimensions;
A conversion subunit, configured to convert all vectors in the black vector set and the white vector set into K-dimensional vectors.
6. The device according to claim 5, characterized in that the identification module comprises:
A computing unit, configured to obtain a calculation result by applying the machine learning model to the file to be detected after its conversion into a vector;
An output unit, configured to output the malicious files and normal files among the files to be detected according to the calculation result.
7. The device according to claim 6, characterized in that the predetermined malicious files and normal files are known malicious files and normal files collected in advance.
CN201210213078.2A 2012-06-26 2012-06-26 Malicious file identification method, device and storage medium Active CN102737186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210213078.2A CN102737186B (en) 2012-06-26 2012-06-26 Malicious file identification method, device and storage medium

Publications (2)

Publication Number Publication Date
CN102737186A CN102737186A (en) 2012-10-17
CN102737186B true CN102737186B (en) 2015-06-17

Family

ID=46992673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210213078.2A Active CN102737186B (en) 2012-06-26 2012-06-26 Malicious file identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN102737186B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008333B (en) * 2013-02-21 2017-12-01 腾讯科技(深圳)有限公司 The detection method and equipment of a kind of installation kit
CN104424437B (en) * 2013-08-28 2018-07-10 贝壳网际(北京)安全技术有限公司 Multi-file sample testing method and device and client
CN103473506B (en) * 2013-08-30 2016-12-28 北京奇虎科技有限公司 For the method and apparatus identifying malice APK file
CN104598820A (en) * 2015-01-14 2015-05-06 国家电网公司 Trojan virus detection method based on feature behavior activity
CN106156120B (en) * 2015-04-07 2020-02-28 阿里巴巴集团控股有限公司 Method and device for classifying character strings
CN105975857A (en) * 2015-11-17 2016-09-28 武汉安天信息技术有限责任公司 Method and system for deducing malicious code rules based on in-depth learning method
CN105897752B (en) * 2016-06-03 2019-08-02 北京奇虎科技有限公司 The safety detection method and device of unknown domain name
CN105897751B (en) * 2016-06-03 2019-08-02 北京奇虎科技有限公司 Threaten the generation method and device of information
CN110619213A (en) * 2018-06-20 2019-12-27 深信服科技股份有限公司 Malicious software identification method, system and related device based on multi-model features
CN109992969B (en) * 2019-03-25 2023-03-21 腾讯科技(深圳)有限公司 Malicious file detection method and device and detection platform
CN111859381A (en) * 2019-04-29 2020-10-30 深信服科技股份有限公司 File detection method, device, equipment and medium
CN111371812B (en) * 2020-05-27 2020-09-01 腾讯科技(深圳)有限公司 Virus detection method, device and medium
SG10202009754QA (en) 2020-10-01 2020-11-27 Flexxon Pte Ltd Module and method for detecting malicious activities in a storage device
CN113935022A (en) * 2021-12-17 2022-01-14 北京微步在线科技有限公司 Homologous sample capturing method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604364A (en) * 2009-07-10 2009-12-16 珠海金山软件股份有限公司 Computer rogue program categorizing system and sorting technique based on file instruction sequence
CN101984450A (en) * 2010-12-15 2011-03-09 北京安天电子设备有限公司 Malicious code detection method and system
CN102346829A (en) * 2011-09-22 2012-02-08 重庆大学 Virus detection method based on ensemble classification
CN102479298A (en) * 2010-11-29 2012-05-30 北京奇虎科技有限公司 Program identification method and device based on machine learning

Also Published As

Publication number Publication date
CN102737186A (en) 2012-10-17

Similar Documents

Publication Publication Date Title
CN102737186B (en) Malicious file identification method, device and storage medium
Islam et al. Classification of malware based on string and function feature selection
CN102346829B (en) Virus detection method based on ensemble classification
CN103177215B (en) Based on the computer malware new detecting method of software control stream feature
CN104700033A (en) Virus detection method and virus detection device
US20120284793A1 (en) Intrusion detection using mdl clustering
CN101604364B (en) Classification system and classification method of computer rogue programs based on file instruction sequence
CN108985064B (en) Method and device for identifying malicious document
CN105279277A (en) Knowledge data processing method and device
CN103810425A (en) Method and device for detecting malicious website
CN101604363A (en) Computer rogue program categorizing system and sorting technique based on the file instruction frequency
CN112307473A (en) Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism
CN105516128A (en) Detecting method and device of Web attack
CN109063478A (en) Method for detecting virus, device, equipment and the medium of transplantable executable file
CN115757991A (en) Webpage identification method and device, electronic equipment and storage medium
Jiang et al. A feature selection method for malware detection
CN104866764A (en) Object reference graph-based Android cellphone malicious software detection method
CN113543117A (en) Prediction method and device for number portability user and computing equipment
CN105989093B (en) The automatic discovering method and its device of sensitive word and application
AU2021100392A4 (en) A method for malware detection and classification using multi-level resnet paradigm on pe binary images
KR102192196B1 (en) An apparatus and method for detecting malicious codes using ai based machine running cross validation techniques
CN108287831A (en) A kind of URL classification method and system, data processing method and system
CN103632091A (en) Malicious feature extraction method and device and storage media
CN108875374B (en) Malicious PDF detection method and device based on document node type
CN116821903A (en) Detection rule determination and malicious binary file detection method, device and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230706

Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518044, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.