CN104598816B - File scanning method and device - Google Patents

Info

Publication number
CN104598816B
Authority
CN
China
Prior art keywords
model
file
malicious file
detected
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410806302.8A
Other languages
Chinese (zh)
Other versions
CN104598816A (en)
Inventor
熊蜀光
冯侦探
曹德强
王新
邓小路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Anyi Hengtong Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anyi Hengtong Beijing Technology Co Ltd
Priority to CN201410806302.8A
Publication of CN104598816A
Application granted
Publication of CN104598816B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Abstract

Embodiments of the present invention provide a file scanning method and device. In one aspect, an embodiment judges the type of a file to be detected using each of M first models, obtaining M judgment results, where M is an integer greater than or equal to 2; obtains, from the M judgment results, the number of first models that judged the file to be detected to be a malicious file; and then determines the type of the file to be detected according to that number. The technical solution provided by the embodiments can therefore improve the detection performance of the judgment models against malicious files during file scanning.

Description

File scanning method and device
【Technical field】
The present invention relates to the field of computer technology, and in particular to a file scanning method and device.
【Background】
The basic idea of a file scanning method based on machine learning is as follows: compute the feature vectors of files of known type, carry out machine training on those feature vectors to obtain a judgment model, and use the judgment model to judge the type of files of unknown type in order to detect the malicious files among them.
However, new malicious files keep emerging over time, and the judgment models obtained by machine-learning-based training are all single models. Consequently, the judgment models used for file scanning in the prior art have relatively low detection performance when faced with newly emerging malicious files.
【Summary of the invention】
In view of this, embodiments of the present invention provide a file scanning method and device that can improve the detection performance of the judgment models against malicious files during file scanning.
In one aspect, an embodiment of the present invention provides a file scanning method, comprising:
judging the type of a file to be detected using each of M first models, to obtain M judgment results, where M is an integer greater than or equal to 2;
obtaining, according to the M judgment results, the number of first models that judged the file to be detected to be a malicious file;
obtaining the type of the file to be detected according to the number of first models that judged the file to be detected to be a malicious file.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which obtaining the type of the file to be detected according to the number of first models that judged the file to be detected to be a malicious file comprises:
comparing the number of first models that judged the file to be detected to be a malicious file with a preset first threshold;
if the number of first models that judged the file to be detected to be a malicious file is less than the first threshold, determining that the file to be detected is a normal file;
if the number of first models that judged the file to be detected to be a malicious file is greater than or equal to the first threshold, determining that the file to be detected is a malicious file.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the method further comprises:
obtaining newly emerging malicious files as training samples;
carrying out machine training using the training samples, to generate a second model;
adjusting the M first models using the second model.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the M first models constitute a first set, and adjusting the M first models using the second model comprises:
adding the second model to a preset second set, the second set including K second models, where K is an integer greater than 0;
generating P model groups, each from one second model in the second set and one first model in the first set, where P is greater than 0 and less than or equal to the product of M and K;
for each model group, replacing, in the first set, the first model belonging to that model group with the second model of that model group, to obtain P third sets;
obtaining the malicious file recall rate and the malicious file error rate of each third set;
selecting one third set according to the malicious file recall rate and the malicious file error rate of each third set;
adjusting the first set using the second model in the model group corresponding to the selected third set.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which adjusting the first set using the second model in the model group corresponding to the selected third set comprises:
comparing the malicious file recall rate of the selected third set with the malicious file recall rate of the first set, and comparing the malicious file error rate of the selected third set with the malicious file error rate of the first set;
if the malicious file recall rate of the selected third set is greater than that of the first set, and the malicious file error rate of the selected third set is less than that of the first set, either replacing, in the first set, the first model belonging to the model group corresponding to the third set with the second model of that model group, or adding the second model of that model group to the first set.
In another aspect, an embodiment of the present invention provides a file scanning device, comprising:
a type judging unit, configured to judge the type of a file to be detected using each of M first models, to obtain M judgment results, where M is an integer greater than or equal to 2;
a result statistics unit, configured to obtain, according to the M judgment results, the number of first models that judged the file to be detected to be a malicious file;
a type determining unit, configured to obtain the type of the file to be detected according to the number of first models that judged the file to be detected to be a malicious file.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the type determining unit is specifically configured to:
compare the number of first models that judged the file to be detected to be a malicious file with a preset first threshold;
if the number of first models that judged the file to be detected to be a malicious file is less than the first threshold, determine that the file to be detected is a normal file;
if the number of first models that judged the file to be detected to be a malicious file is greater than or equal to the first threshold, determine that the file to be detected is a malicious file.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the device further comprises:
a file obtaining unit, configured to obtain newly emerging malicious files as training samples;
a model generation unit, configured to carry out machine training using the training samples, to generate a second model;
a model adjustment unit, configured to adjust the M first models using the second model.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the M first models constitute a first set, and the model adjustment unit is specifically configured to:
add the second model to a preset second set, the second set including K second models, where K is an integer greater than 0;
generate P model groups, each from one second model in the second set and one first model in the first set, where P is greater than 0 and less than or equal to the product of M and K;
for each model group, replace, in the first set, the first model belonging to that model group with the second model of that model group, to obtain P third sets;
obtain the malicious file recall rate and the malicious file error rate of each third set;
select one third set according to the malicious file recall rate and the malicious file error rate of each third set;
adjust the first set using the second model in the model group corresponding to the selected third set.
In the aspect described above and any possible implementation thereof, an implementation is further provided in which the model adjustment unit, when adjusting the first set using the second model in the model group corresponding to the selected third set, is specifically configured to:
compare the malicious file recall rate of the selected third set with the malicious file recall rate of the first set, and compare the malicious file error rate of the selected third set with the malicious file error rate of the first set;
if the malicious file recall rate of the selected third set is greater than that of the first set, and the malicious file error rate of the selected third set is less than that of the first set, either replace, in the first set, the first model belonging to the model group corresponding to the third set with the second model of that model group, or add the second model of that model group to the first set.
As can be seen from the above technical solutions, the embodiments of the present invention have the following beneficial effects:
In the technical solution provided by the embodiments, the type of the file to be detected is judged using multiple models and then determined comprehensively from the judgment results of those models. This improves the detection performance of the judgment models against malicious files during file scanning and improves their accuracy in detecting malicious files.
【Brief description of the drawings】
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of Embodiment 1 of the file scanning method provided by the embodiments of the present invention;
Fig. 2 is an example diagram of the judgment models judging a file to be detected, as provided by the embodiments of the present invention;
Fig. 3 is a schematic flowchart of Embodiment 2 of the file scanning method provided by the embodiments of the present invention;
Fig. 4 is a functional block diagram of the file scanning device provided by the embodiments of the present invention.
【Detailed description】
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the drawings.
It should be understood that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The terms used in the embodiments of the present invention serve only to describe specific embodiments and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
It should be understood that although the terms first, second and so on may be used in the embodiments of the present invention to describe sets or models, the sets and models so described should not be limited by these terms. The terms are used only to distinguish sets or models from one another. For example, without departing from the scope of the embodiments of the present invention, a first set could also be called a second set and, similarly, a second set could be called a first set.
Depending on context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Embodiment 1
An embodiment of the present invention provides a file scanning method. Refer to Fig. 1, a schematic flowchart of Embodiment 1 of the file scanning method provided by the embodiments of the present invention. As shown in the figure, the method comprises the following steps:
S101: judge the type of a file to be detected using each of M first models, to obtain M judgment results, where M is an integer greater than or equal to 2.
Specifically, refer to Fig. 2, an example diagram of the judgment models judging a file to be detected. As shown in the figure, in the embodiments of the present invention each of the M first models judges the type of the file to be detected, so that M judgment results are obtained.
Here, M is an integer greater than or equal to 2.
It should be noted that the M first models are generated by machine training on training samples that are not completely identical. This increases the diversity of the training samples and ensures broader coverage of the various kinds of malicious files.
Preferably, each first model may be a model obtained by machine training on training samples using the k-nearest-neighbor algorithm (k-Nearest Neighbor, kNN), which may be called a kNN model; or each first model may be a model obtained by machine training on training samples using the naive Bayes algorithm; or each first model may be a model obtained by machine training on training samples using a support vector machine (Support Vector Machine, SVM), which may be called an SVM model.
It should be noted that the M first models may all be obtained with the same algorithm, for example all SVM models. Alternatively, the M first models need not all be obtained with the same algorithm; for example, some may be SVM models and the others kNN models.
Taking an SVM model as the first model, the way a first model judges a file to be detected is illustrated as follows. The first model obtained by training is a matrix whose number of columns equals the number of file types. In the embodiments of the present invention the file types may include normal file and malicious file, so the number of columns equals 2. When the first model judges a file to be detected, the feature vector of the file to be detected of unknown type is first obtained and then multiplied by the matrix. In the resulting vector, the value of each element is the score for judging the file to be detected to be of the corresponding type, and the type with the higher score is the type determined for the file to be detected.
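The matrix-based judgment described above can be sketched as follows. This is only an illustration under stated assumptions: the weight values are made up rather than trained, the column order (normal first, malicious second) is chosen here for convenience, and the function name `judge` is hypothetical.

```python
# Illustrative sketch: the weights are invented, not the output of real training.
NORMAL, MALICIOUS = 0, 1  # assumed column order of the model matrix


def judge(feature_vector, model_matrix):
    """Multiply a feature vector by an (n_features x n_types) matrix.

    Returns the index of the highest-scoring type and the score vector.
    On a tie, the first type (NORMAL) is returned; the source does not
    say how ties are handled, so this is an assumption.
    """
    n_types = len(model_matrix[0])
    scores = [
        sum(x * row[c] for x, row in zip(feature_vector, model_matrix))
        for c in range(n_types)
    ]
    best = max(range(n_types), key=lambda c: scores[c])
    return best, scores


# One row per feature, one column per file type (normal, malicious).
model = [
    [0.2, 0.9],
    [0.5, 0.1],
    [0.1, 0.7],
]

label, scores = judge([1, 0, 1], model)
print("malicious" if label == MALICIOUS else "normal", scores)
```

With these invented weights, the feature vector [1, 0, 1] scores higher in the malicious column, so the sketch reports the file as malicious.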
The method of obtaining the feature vector of the file to be detected of unknown type may include, but is not limited to, the following. A feature of a file to be detected is a binary string 4 bytes long; for example, 0x01234567 and 0x89ABCDEF are features. Extracting the feature vector of a file to be detected requires a preconfigured feature set; for example, [0x01234567, 0x89ABCDEF, 0xAAAABBBB] is a feature set. A given file to be detected is then scanned sequentially to check whether each feature in the feature set is present in the file. If a feature is present, the element of the feature vector corresponding to that feature is set to 1; otherwise it is set to 0. For example, if 0x01234567 and 0xAAAABBBB are present in a file to be detected and 0x89ABCDEF is not, the feature vector of that file is [1, 0, 1].
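The feature extraction just described can be sketched in a few lines. The sketch assumes that each 4-byte feature is matched in big-endian byte order and that the whole file fits in memory; neither detail is stated in the source, and the function name `extract_feature_vector` is hypothetical.

```python
# Feature set taken from the example in the text.
FEATURE_SET = [0x01234567, 0x89ABCDEF, 0xAAAABBBB]


def extract_feature_vector(data: bytes, feature_set=FEATURE_SET):
    """Return a 0/1 vector marking which 4-byte features occur in `data`."""
    vector = []
    for feature in feature_set:
        pattern = feature.to_bytes(4, "big")  # byte order is an assumption
        vector.append(1 if pattern in data else 0)
    return vector


# A made-up file body containing the first and third features only.
sample = b"\x00\x01\x23\x45\x67\xff\xaa\xaa\xbb\xbb"
print(extract_feature_vector(sample))
```

For the made-up `sample` bytes, the first and third features are present and the second is absent, which reproduces the [1, 0, 1] vector from the example in the text.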
S102: obtain, according to the M judgment results, the number of first models that judged the file to be detected to be a malicious file.
Specifically, as shown in Fig. 2, after the M judgment results are obtained, the number k of first models that judged the file to be detected to be a malicious file is counted; that is, it is counted how many of the M first models judged the file to be detected to be a malicious file.
S103: obtain the type of the file to be detected according to the number of first models that judged the file to be detected to be a malicious file.
Specifically, as shown in Fig. 2, after the number k of first models that judged the file to be detected to be a malicious file is obtained, k is compared with a preset first threshold t.
If the number k of first models that judged the file to be detected to be a malicious file is less than the preset first threshold t, the file to be detected is determined to be a normal file.
Conversely, if the number k of first models that judged the file to be detected to be a malicious file is greater than or equal to the preset first threshold t, the file to be detected is determined to be a malicious file.
In this way, the type of the file to be detected can be determined, so as to detect whether the file to be detected is a malicious file.
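Steps S101 to S103 amount to a threshold vote over the M models. A minimal sketch follows, assuming each first model is represented as a callable that returns the string "malicious" or "normal"; that representation, the function name `scan` and the toy models are illustrative assumptions, not part of the source.

```python
def scan(feature_vector, first_models, t):
    """Judge a file with every first model and apply the first threshold t.

    Returns the final type and the count k of models voting "malicious".
    """
    results = [model(feature_vector) for model in first_models]  # S101
    k = sum(1 for r in results if r == "malicious")              # S102
    return ("malicious" if k >= t else "normal"), k              # S103


# Three toy stand-in models (hypothetical): each looks at one feature only.
toy_models = [
    lambda fv: "malicious" if fv[0] else "normal",
    lambda fv: "malicious" if fv[1] else "normal",
    lambda fv: "malicious" if fv[2] else "normal",
]

print(scan([1, 0, 1], toy_models, t=2))
print(scan([0, 1, 0], toy_models, t=2))
```

The threshold t controls the trade-off: a smaller t flags a file as soon as a few models agree, while a larger t requires broader agreement before a file is called malicious.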
It should be noted that the prior art judges the type of a file to be detected using a single model. Whenever the training samples are updated, for example when new malicious files appear, machine training must be redone on all training samples to generate a new model, which is then used to judge file types. This incurs a large computational overhead and makes model updates inefficient. In the embodiments of the present invention, the type of the file to be detected is judged using multiple models. The benefit is that models can be added or replaced conveniently, which eases maintenance of the file scanning device and sustains its ability to detect malicious files. When the training samples need to be updated, machine training is carried out only on the newly emerging training samples to generate one new model, which is added to the M judgment models; there is no need to retrain on all training samples. This reduces computational overhead, improves model update efficiency, improves the models' detection performance against malicious files, raises the malicious file recall rate, and lowers the malicious file false positive rate.
Embodiment 2
Refer to Fig. 3, a schematic flowchart of Embodiment 2 of the file scanning method provided by the embodiments of the present invention. As shown in Fig. 3, building on the file scanning method above, this file scanning method may further comprise:
obtaining newly emerging malicious files as training samples;
carrying out machine training using the training samples, to generate a second model;
adjusting the M first models using the second model.
In the embodiments of the present invention, the M first models are adjusted continually, so that the file scanning device's ability to detect malicious files keeps improving.
For example, newly emerging malicious files obtained recently may be used as training samples. When the malicious files have accumulated to a certain number, such as 10,000, machine training can be carried out using the training samples to generate the second model. Files determined to be malicious by the method above may be verified manually: if the manual verification confirms that a file is malicious, it can serve as a newly emerging training sample for machine training; conversely, if the manual verification finds that the file is normal, it is not used as a training sample.
Preferably, in the embodiments of the present invention, the second model may be a kNN model; or the second model may be a judgment model obtained by machine training on the training samples using the naive Bayes algorithm; or the second model may be an SVM model.
Preferably, the operation of adjusting the M first models using the second model may be carried out at regular intervals, for example once a month, so that the M first models are updated and adjusted to detect newly emerging malicious files.
The method of adjusting the M first models using the second model is illustrated below.
First, the M first models constitute a first set. The first set is the set used online to judge the type of files to be detected, so it may be called Online. The generated second model is added to a preset second set, which includes K second models, where K is an integer greater than 0. The second set serves as the backup set of the first set Online, so it may be called Backup.
Then the first set Online and the second set Backup are traversed. A first model Online[i] is taken from the first set Online, where i is an integer from 1 to M, and a second model Backup[j] is taken from the second set Backup, where j is an integer from 1 to K. From one second model Backup[j] in the second set and one first model Online[i] in the first set, P model groups are generated, where P is greater than 0 and less than or equal to the product of M and K. Each model group is denoted {Online[i], Backup[j]}.
Next, for each model group, the first model Online[i] belonging to that model group is replaced in the first set Online with the second model Backup[j] of that model group, so that P third sets are obtained. Each third set may be denoted New[i, j]. A third set thus contains the M first models other than Online[i], together with the newly added second model Backup[j].
It should be noted that this processing is carried out for every model group to generate the corresponding third set, so P model groups yield P third sets.
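The construction of the third sets New[i, j] can be sketched as below. The sketch enumerates every (i, j) pair, which corresponds to the upper bound P = M × K; the source allows P to be smaller but does not say how a subset would be chosen. Indices are 0-based here, whereas the text counts from 1, and the models are represented by placeholder strings.

```python
from itertools import product


def build_third_sets(online, backup):
    """Return {(i, j): New[i, j]} where New[i, j] is Online with
    Online[i] swapped out for Backup[j] (0-based indices)."""
    third_sets = {}
    for i, j in product(range(len(online)), range(len(backup))):
        third_sets[(i, j)] = online[:i] + [backup[j]] + online[i + 1:]
    return third_sets


# Placeholder names stand in for trained models (hypothetical).
online = ["svm_a", "svm_b", "knn_c"]   # M = 3 first models
backup = ["svm_new1", "svm_new2"]      # K = 2 second models

third_sets = build_third_sets(online, backup)
print(len(third_sets))     # number of third sets generated
print(third_sets[(1, 0)])  # Online with its second model replaced
```

Each candidate keeps the size of Online unchanged, so the first threshold t used for voting still applies to every third set during evaluation.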
Finally, each third set is used to judge the file type of a number of files of unknown type, to obtain the malicious file recall rate and the malicious file error rate of each third set. One third set is then selected according to the malicious file recall rate and the malicious file error rate of each third set, and the first set is adjusted using the second model in the model group corresponding to the selected third set.
The malicious file recall rate equals, when the third set judges the file type of a number of files of unknown type, the ratio of the number of malicious files correctly detected to the total number of malicious files among the files of unknown type. The higher the malicious file recall rate, the more malicious files the third set is able to detect.
The malicious file error rate (that is, the false positive rate) equals, when the third set judges the file type of the files of unknown type, the ratio of the number of normal files judged to be malicious files to the total number of files of unknown type. The lower the malicious file error rate, the more accurately the third set detects malicious files.
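The two rates defined above can be computed as in the following sketch. It assumes that the true type of each evaluation file is known (for example from manual verification), since otherwise neither ratio can be calculated; it also assumes each model is a callable returning True for "malicious" and that the set is evaluated with the threshold vote t. The name `evaluate` and the toy data are hypothetical.

```python
def evaluate(model_set, labelled_files, t):
    """Return (recall_rate, error_rate) of a model set.

    labelled_files: list of (feature_vector, is_malicious) pairs.
    recall_rate = correctly detected malicious files / all malicious files
    error_rate  = normal files judged malicious / all evaluated files
                  (denominator follows the definition in the text)
    """
    detected = false_alarms = total_malicious = 0
    for feature_vector, is_malicious in labelled_files:
        k = sum(1 for model in model_set if model(feature_vector))
        judged_malicious = k >= t
        if is_malicious:
            total_malicious += 1
            if judged_malicious:
                detected += 1
        elif judged_malicious:
            false_alarms += 1
    recall = detected / total_malicious if total_malicious else 0.0
    error = false_alarms / len(labelled_files) if labelled_files else 0.0
    return recall, error


# Toy stand-in models and evaluation data (hypothetical).
models = [lambda fv: fv[0] == 1, lambda fv: fv[1] == 1]
files = [([1, 0], True), ([0, 0], True), ([0, 1], False), ([0, 0], False)]
print(evaluate(models, files, t=1))
```

In the toy data, one of two malicious files is detected and one of four files is a normal file wrongly flagged, giving a recall rate of 0.5 and an error rate of 0.25.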
Preferably, the method of selecting one third set according to the malicious file recall rate and the malicious file error rate of each third set may include, but is not limited to, the following. The efficiency ratio of each third set is obtained as the ratio of its malicious file recall rate to its malicious file error rate. The P third sets are then sorted in descending order of efficiency ratio, and the third set ranked first in the sorted result is selected; that is, the third set New[i, j] with the largest efficiency ratio among the P third sets is found.
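The efficiency-ratio selection can be sketched as follows. The source does not say what to do when the error rate is zero; treating such a set as having an infinite ratio (when its recall rate is positive) is an assumption made here, as are the function name and the example figures.

```python
def select_third_set(metrics):
    """Pick the key (i, j) whose recall/error efficiency ratio is largest.

    metrics: {(i, j): (recall_rate, error_rate)}
    """
    def efficiency_ratio(item):
        recall, error = item[1]
        if error == 0:
            # Assumption: a set with no false alarms ranks first if it
            # detects anything at all; the source leaves this case open.
            return float("inf") if recall > 0 else 0.0
        return recall / error

    ranked = sorted(metrics.items(), key=efficiency_ratio, reverse=True)
    return ranked[0][0]


# Made-up rates for three candidate third sets.
metrics = {
    (0, 0): (0.90, 0.03),
    (0, 1): (0.80, 0.01),
    (1, 0): (0.95, 0.05),
}
print(select_third_set(metrics))
```

With these made-up rates the candidate (0, 1) wins: its recall rate is the lowest of the three, but its error rate is so small that its efficiency ratio is the largest.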
Preferably, the method of adjusting the first set Online using the second model Backup[j] in the model group corresponding to the selected third set New[i, j] may include, but is not limited to, the following:
Compare the malicious file recall rate of the selected third set New[i, j] with the malicious file recall rate of the first set Online, and compare the malicious file error rate of the selected third set New[i, j] with the malicious file error rate of the first set Online.
If the malicious file recall rate of the selected third set New[i, j] is greater than that of the first set Online, and the malicious file error rate of the selected third set New[i, j] is less than that of the first set Online, then the recall rate and error rate of the third set New[i, j] are better than those of the first set Online currently in use. In that case the first set Online needs to be adjusted using the second model Backup[j] in the model group {Online[i], Backup[j]} corresponding to the selected third set New[i, j]. The adjustment may be as follows. If the number of first models in the first set Online has reached a preset model threshold, the first model Online[i] belonging to the model group {Online[i], Backup[j]} is replaced in the first set Online with the second model Backup[j] of that model group. Alternatively, if the number of first models in the first set Online has not yet reached the preset model threshold, the second model Backup[j] of the model group {Online[i], Backup[j]} corresponding to the third set New[i, j] can be added to the first set Online directly.
Conversely, if the malicious file recall rate of the selected third set New[i, j] is less than or equal to that of the first set Online, and/or the malicious file error rate of the selected third set New[i, j] is greater than or equal to that of the first set Online, then the recall rate and/or error rate of the third set New[i, j] are not better than those of the first set Online currently in use. In that case the first set Online is not adjusted using the second model in the model group corresponding to the selected third set New[i, j], and the current first set Online is kept unchanged.
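The replace-or-add decision can be sketched as below. The function name `adjust_online`, its parameter layout and the placeholder model names are hypothetical, and indices are 0-based; the logic follows the comparison and the model-threshold rule described above.

```python
def adjust_online(online, backup, i, j, new_metrics, online_metrics,
                  model_threshold):
    """Return the adjusted first set Online.

    new_metrics / online_metrics: (recall_rate, error_rate) of the selected
    third set New[i, j] and of the current Online set.
    """
    new_recall, new_error = new_metrics
    old_recall, old_error = online_metrics

    # Adjust only if the candidate is strictly better on both rates.
    if not (new_recall > old_recall and new_error < old_error):
        return list(online)

    if len(online) >= model_threshold:
        # Online is full: replace Online[i] with Backup[j].
        return online[:i] + [backup[j]] + online[i + 1:]
    # Online still has room: add Backup[j] directly.
    return online + [backup[j]]


online = ["svm_a", "svm_b", "knn_c"]
backup = ["svm_new1", "svm_new2"]

print(adjust_online(online, backup, 1, 0, (0.95, 0.01), (0.90, 0.02), 3))
print(adjust_online(online, backup, 1, 0, (0.95, 0.01), (0.90, 0.02), 5))
print(adjust_online(online, backup, 1, 0, (0.85, 0.01), (0.90, 0.02), 3))
```

Note that when a model is added rather than replaced, the size of Online changes, so the first threshold t may need to be reviewed; the source does not discuss this point.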
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, a personal computer (Personal Computer, PC), a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, a tablet computer (Tablet Computer), a mobile phone, an MP3 player, an MP4 player and the like.
It should be noted that the above file scanning method may be executed by a file scanning device. The device may be an application located on a local terminal, or a functional unit such as a plug-in or a software development kit (Software Development Kit, SDK) within an application located on the local terminal; the embodiments of the present invention place no particular limitation on this.
It can be understood that the application may be a native application (nativeApp) installed on the terminal, or a web application (webApp) running in a browser on the terminal; the embodiments of the present invention place no limitation on this.
The embodiments of the present invention further provide device embodiments that implement the steps and methods of the method embodiments above.
Fig. 4 is refer to, the functional block diagram of its file scanning device provided by the embodiment of the present invention.As illustrated, The device includes:
Type judging unit 401, configured to judge the type of a file to be detected using each of M first models, to obtain M judgment results, M being an integer greater than or equal to 2;
Result statistic unit 402, configured to obtain, according to the M judgment results, the number of first models that judge the file to be detected to be a malicious file;
Type determining unit 403, configured to obtain the type of the file to be detected according to the number of first models that judge the file to be detected to be a malicious file.
Preferably, the type determining unit 403 is specifically configured to:
Compare the number of first models that judge the file to be detected to be a malicious file with a preset first threshold;
If the number of first models that judge the file to be detected to be a malicious file is less than the first threshold, determine that the file to be detected is a normal file;
If the number of first models that judge the file to be detected to be a malicious file is greater than or equal to the first threshold, determine that the file to be detected is a malicious file.
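The cooperation of units 401 to 403 amounts to threshold voting over M models, which can be sketched as below. The `Model` interface (a callable returning `True` for "malicious") and the function names are assumptions made for illustration; the source does not specify how a model is invoked.

```python
from typing import Callable, Sequence

# Assumed interface: a first model is any callable that returns True when it
# judges the file content to be malicious.
Model = Callable[[bytes], bool]


def count_malicious_votes(models: Sequence[Model], data: bytes) -> int:
    """Number of first models that judge the file to be malicious."""
    if len(models) < 2:
        raise ValueError("M must be an integer greater than or equal to 2")
    return sum(1 for model in models if model(data))


def classify_file(models: Sequence[Model], data: bytes, first_threshold: int) -> str:
    """Return 'malicious' when the vote count reaches the first threshold."""
    votes = count_malicious_votes(models, data)
    return "malicious" if votes >= first_threshold else "normal"
```

For example, with three models and `first_threshold=2`, a file is reported as malicious only when at least two of the three models flag it.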
Preferably, the device further includes:
File obtaining unit 404, configured to obtain newly emerged malicious files as training samples;
Model generation unit 405, configured to perform machine training using the training samples to generate a second model;
Model adjustment unit 406, configured to adjust the M first models using the second model.
Preferably, the M first models form a first set, and the model adjustment unit 406 is specifically configured to:
Add the second model to a preset second set, the second set containing K second models, K being an integer greater than 0;
Generate P model groups according to the second models in the second set and the first models in the first set, P being greater than 0 and less than or equal to the product of M and K;
For each model group, replace, in the first set, the first model belonging to the model group with the second model in the model group, to obtain P third sets;
Obtain the malicious file recall rate and malicious file error rate of each third set;
Select one third set according to the malicious file recall rate and malicious file error rate of each third set;
Adjust the first set using the second model in the model group corresponding to the selected third set.
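The steps performed by the model adjustment unit can be sketched as follows. Several points are assumptions made for this sketch rather than statements of the source: all M × K model groups are generated (the source only requires 0 < P ≤ M × K), the recall rate is taken as the fraction of malicious samples flagged and the error rate as the fraction of normal samples flagged, each set votes with a fixed threshold, and the selection rule is "highest recall rate, ties broken by lowest error rate". All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[bytes], bool]   # assumed interface: True means "malicious"
Sample = Tuple[bytes, bool]       # (file content, is_malicious label)


def ensemble_predict(models: Sequence[Model], data: bytes, first_threshold: int) -> bool:
    """Threshold vote over a set of models."""
    return sum(1 for m in models if m(data)) >= first_threshold


def evaluate(models: Sequence[Model], samples: Sequence[Sample],
             first_threshold: int) -> Tuple[float, float]:
    """Return (recall rate, error rate) of a set on labelled samples."""
    malicious = [d for d, label in samples if label]
    normal = [d for d, label in samples if not label]
    detected = sum(ensemble_predict(models, d, first_threshold) for d in malicious)
    false_alarms = sum(ensemble_predict(models, d, first_threshold) for d in normal)
    recall = detected / len(malicious) if malicious else 0.0
    error = false_alarms / len(normal) if normal else 0.0
    return recall, error


def build_third_sets(online: Sequence[Model],
                     backup: Sequence[Model]) -> List[Tuple[int, int, List[Model]]]:
    """For each group {Online[i], Backup[j]}, build New[i, j] by replacing
    Online[i] with Backup[j] in a copy of the first set."""
    third_sets = []
    for i in range(len(online)):
        for j in range(len(backup)):
            candidate = list(online)
            candidate[i] = backup[j]
            third_sets.append((i, j, candidate))
    return third_sets


def select_third_set(online: Sequence[Model], backup: Sequence[Model],
                     samples: Sequence[Sample],
                     first_threshold: int) -> Tuple[int, int, float, float]:
    """Return (i, j, recall, error) of the selected third set.

    Assumed rule: highest recall rate first, then lowest error rate.
    """
    scored = []
    for i, j, candidate in build_third_sets(online, backup):
        recall, error = evaluate(candidate, samples, first_threshold)
        scored.append((i, j, recall, error))
    return max(scored, key=lambda t: (t[2], -t[3]))
```

The metrics of the selected third set would then be compared with those of the current first set before any replacement or addition takes place.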
Preferably, when the model adjustment unit 406 adjusts the first set using the second model in the model group corresponding to the selected third set, it is specifically configured to:
Compare the malicious file recall rate of the selected third set with the malicious file recall rate of the first set, and compare the malicious file error rate of the selected third set with the malicious file error rate of the first set;
If the malicious file recall rate of the selected third set is greater than that of the first set, and the malicious file error rate of the selected third set is less than that of the first set, use the second model in the model group corresponding to the third set to replace, in the first set, the first model belonging to that model group, or add the second model in the model group corresponding to the third set to the first set.
Because each unit in this embodiment can perform the methods shown in Fig. 1 to Fig. 3, refer to the description of Fig. 1 to Fig. 3 for any part not described in detail in this embodiment.
The technical solution of the embodiments of the present invention has the following beneficial effects:
The embodiments of the present invention judge the type of a file to be detected using each of M first models, to obtain M judgment results, M being an integer greater than or equal to 2; then obtain, according to the M judgment results, the number of first models that judge the file to be detected to be a malicious file; and then obtain the type of the file to be detected according to that number.
Therefore, in the technical solution provided by the embodiments of the present invention, multiple models are used to judge the type of the file to be detected, and the type is determined comprehensively from the judgment results of the multiple models. This improves the detection performance of the judgment models on malicious files during file scanning and improves the accuracy with which malicious files are detected.
In addition, a benefit of using multiple models to judge the type of the file to be detected is that models can be conveniently added or replaced, which eases maintenance of the file scanning device and improves its ability to keep detecting malicious files over time.
In addition, in the embodiments of the present invention, the multiple models used to judge the type of the file to be detected can be adjusted in order to update them, and the new model used for the adjustment is obtained from newly emerged training samples. Compared with prior-art solutions that need to repeat machine training on all training samples, this can reduce computing overhead, improve model update efficiency, improve the detection performance of the models on malicious files, raise the recall rate of malicious files, and lower the false alarm rate for malicious files.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features can be omitted or not performed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware, or in hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. A file scanning method, characterized in that the method comprises:
Judging the type of a file to be detected using each of M first models, to obtain M judgment results, M being an integer greater than or equal to 2;
Obtaining, according to the M judgment results, the number of first models that judge the file to be detected to be a malicious file;
Obtaining the type of the file to be detected according to the number of first models that judge the file to be detected to be a malicious file;
The method further comprises: obtaining newly emerged malicious files as training samples; performing machine training using the training samples to generate a second model; and adjusting the M first models using the second model, including:
The M first models forming a first set; adding the second model to a preset second set, the second set containing K second models, K being an integer greater than 0;
Generating P model groups according to the second models in the second set and the first models in the first set, P being greater than 0 and less than or equal to the product of M and K; for each model group, replacing, in the first set, the first model belonging to the model group with the second model in the model group, to obtain P third sets;
Obtaining the malicious file recall rate and malicious file error rate of each third set; selecting one third set according to the malicious file recall rate and malicious file error rate of each third set;
Adjusting the first set using the second model in the model group corresponding to the selected third set.
2. The method according to claim 1, characterized in that obtaining the type of the file to be detected according to the number of first models that judge the file to be detected to be a malicious file comprises:
Comparing the number of first models that judge the file to be detected to be a malicious file with a preset first threshold;
If the number of first models that judge the file to be detected to be a malicious file is less than the first threshold, determining that the file to be detected is a normal file;
If the number of first models that judge the file to be detected to be a malicious file is greater than or equal to the first threshold, determining that the file to be detected is a malicious file.
3. The method according to claim 1, characterized in that adjusting the first set using the second model in the model group corresponding to the selected third set comprises:
Comparing the malicious file recall rate of the selected third set with the malicious file recall rate of the first set, and comparing the malicious file error rate of the selected third set with the malicious file error rate of the first set;
If the malicious file recall rate of the selected third set is greater than that of the first set, and the malicious file error rate of the selected third set is less than that of the first set, using the second model in the model group corresponding to the third set to replace, in the first set, the first model belonging to that model group, or adding the second model in the model group corresponding to the third set to the first set.
4. A file scanning device, characterized in that the device comprises:
A type judging unit, configured to judge the type of a file to be detected using each of M first models, to obtain M judgment results, M being an integer greater than or equal to 2;
A result statistic unit, configured to obtain, according to the M judgment results, the number of first models that judge the file to be detected to be a malicious file;
A type determining unit, configured to obtain the type of the file to be detected according to the number of first models that judge the file to be detected to be a malicious file;
The device further comprises:
A file obtaining unit, configured to obtain newly emerged malicious files as training samples;
A model generation unit, configured to perform machine training using the training samples to generate a second model;
A model adjustment unit, configured to adjust the M first models using the second model;
The M first models form a first set, and the model adjustment unit is specifically configured to:
Add the second model to a preset second set, the second set containing K second models, K being an integer greater than 0;
Generate P model groups according to the second models in the second set and the first models in the first set, P being greater than 0 and less than or equal to the product of M and K;
For each model group, replace, in the first set, the first model belonging to the model group with the second model in the model group, to obtain P third sets;
Obtain the malicious file recall rate and malicious file error rate of each third set;
Select one third set according to the malicious file recall rate and malicious file error rate of each third set;
Adjust the first set using the second model in the model group corresponding to the selected third set.
5. The device according to claim 4, characterized in that the type determining unit is specifically configured to:
Compare the number of first models that judge the file to be detected to be a malicious file with a preset first threshold;
If the number of first models that judge the file to be detected to be a malicious file is less than the first threshold, determine that the file to be detected is a normal file;
If the number of first models that judge the file to be detected to be a malicious file is greater than or equal to the first threshold, determine that the file to be detected is a malicious file.
6. The device according to claim 4, characterized in that, when the model adjustment unit adjusts the first set using the second model in the model group corresponding to the selected third set, it is specifically configured to:
Compare the malicious file recall rate of the selected third set with the malicious file recall rate of the first set, and compare the malicious file error rate of the selected third set with the malicious file error rate of the first set;
If the malicious file recall rate of the selected third set is greater than that of the first set, and the malicious file error rate of the selected third set is less than that of the first set, use the second model in the model group corresponding to the third set to replace, in the first set, the first model belonging to that model group, or add the second model in the model group corresponding to the third set to the first set.
CN201410806302.8A 2014-12-22 2014-12-22 A kind of file scanning method and device Active CN104598816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410806302.8A CN104598816B (en) 2014-12-22 2014-12-22 A kind of file scanning method and device

Publications (2)

Publication Number Publication Date
CN104598816A CN104598816A (en) 2015-05-06
CN104598816B true CN104598816B (en) 2017-07-04

Family

ID=53124593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410806302.8A Active CN104598816B (en) 2014-12-22 2014-12-22 A kind of file scanning method and device

Country Status (1)

Country Link
CN (1) CN104598816B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992969B (en) * 2019-03-25 2023-03-21 腾讯科技(深圳)有限公司 Malicious file detection method and device and detection platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346828A (en) * 2011-09-20 2012-02-08 海南意源高科技有限公司 Malicious program judging method based on cloud security
CN102799804A (en) * 2012-04-30 2012-11-28 珠海市君天电子科技有限公司 Comprehensive identification method and system for security of unknown file
EP2597569A1 (en) * 2011-11-24 2013-05-29 Kaspersky Lab Zao System and method for distributing processing of computer security tasks
CN103177217A (en) * 2013-04-08 2013-06-26 腾讯科技(深圳)有限公司 File scan method, file scan system, client-side and server
CN104091122A (en) * 2014-06-17 2014-10-08 北京邮电大学 Detection system of malicious data in mobile internet

Also Published As

Publication number Publication date
CN104598816A (en) 2015-05-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190812

Address after: 2/F, Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing 100085

Patentee after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Address before: 100193 room 1-01, 1-03, 1-04, block C, software Plaza, building 4, No. 8, Mong West Road, Beijing, Haidian District

Patentee before: Anyi Hengtong (Beijing) Technology Co., Ltd.

TR01 Transfer of patent right