Embodiment
As the general introduction to script viruses above shows, an important feature of a script virus is that its source code is highly readable, and the virus source code is normally interpreted and executed on the target computer. In view of this interpreted execution, the method for identifying such viruses proposed by the present invention imitates the basic principle of a compiler: it performs syntax analysis on the executable text in a file in order to determine the function that the executable text is intended to perform, that is, to analyze its behavioral features. These behavioral features are embodied in the syntactic features obtained by syntax analysis, for example the number of loop iterations, calls to specific functions, and so on. For script viruses of the same family that are produced as variants, the behavioral features, that is, the syntactic features, usually remain unchanged. Therefore, by matching the syntactic features of the file being scanned, variants of various viruses can be detected effectively.
The specific steps of the virus identification method proposed by the present invention are described in detail below with reference to Figs. 1-4, taking the identification of script viruses as an example. It should be noted that the method can also be applied to other viruses that have characteristics similar to those of script viruses, and is not limited to script viruses themselves.
Fig. 1 shows the overall flowchart of the method for identifying script viruses in one embodiment of the present invention.
As shown in Fig. 1, the script virus scanning program starts in step 1001. Because this embodiment takes the identification of script viruses as an example, the virus scanning program here scans only script-type files. If the method proposed by the present invention is used to identify other types of interpreted viruses, files of the corresponding type are scanned instead.
First, in step 1002, the scanned script file is preprocessed to extract the executable text in it. Then, in step 1003, it is judged whether the executable text was extracted successfully. If so, the flow continues with the lexical analysis of step 1004; otherwise step 1009 is executed to report an error and end this virus scan. The preprocessing operation here may include all steps needed to extract the executable text, for example decrypting encrypted text. The preprocessing step is described in detail later with reference to Fig. 2.
In step 1004, lexical analysis is performed on the executable text produced by preprocessing in order to extract a word sequence array. According to compiler principles, lexical analysis extracts words (tokens) such as variables and function names from the executable text, which appears as a continuous character string, so that the logical relationships between these words can be obtained during syntax analysis. In the present invention, the extraction of words during lexical analysis is designed according to the characteristics of scripts. In addition, to improve scanning speed and reliability, the lexical analysis step in the present invention also includes checking whether the extracted executable text contains system calls that can be exploited by a virus (step 1005). If it contains no such system calls, this virus scan can be abandoned (the flow advances to step 1010), on the ground that the extracted executable text cannot damage the computer system. Otherwise, the flow continues with the syntax analysis of step 1006. The lexical analysis step is described in detail later with reference to Fig. 3.
In step 1006, syntax analysis is performed on the word sequence array generated by lexical analysis, a syntax tree is generated, and syntactic features are extracted. The syntactic features referred to here include, for example, the maximum number of loop iterations, the number of loop statements, the nesting depth of conditional statements, the number of conditional statements, the number of function calls, the number and types of function parameters, the return values of function calls, system function calls, and so on. It should be noted that, because the object of the present invention is to identify script viruses quickly from the behavioral features of the executable text, the syntax analysis step in the present invention differs from the processing in a traditional compilation process: it places more emphasis on analyzing function bodies, which better reflect behavioral features. In addition, to improve the speed and accuracy of syntax analysis, the present invention performs the analysis by category and builds a separate parsing table for each category; the details are described later with reference to Fig. 4.
After the syntactic features of the executable text have been obtained, in step 1007 the extracted syntactic features are matched against the syntactic features of known or unknown viruses in the virus feature library, and it is judged whether they match. If they match, step 1008 is executed to report information such as the type and name of the virus contained in the scanned file and the method for handling the virus; otherwise the file does not contain a known script virus, and step 1010 is executed directly. Finally, in step 1010, this scan ends.
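The control flow of Fig. 1 can be summarized in a short Python sketch. This is only an illustration under stated assumptions: the names `scan_script`, `ScanResult`, and the callables passed in (`preprocess`, `tokenize`, `has_risky_calls`, `extract_features`) are hypothetical and do not appear in the source, and feature matching is reduced to simple equality.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class ScanResult(Enum):
    ERROR = "preprocessing failed"   # step 1009
    CLEAN = "no virus found"         # step 1010 reached without a match
    INFECTED = "virus reported"      # step 1008


def scan_script(
    raw_text: str,
    preprocess: Callable[[str], Optional[str]],        # hypothetical: returns None on failure
    tokenize: Callable[[str], Sequence[str]],          # hypothetical lexer
    has_risky_calls: Callable[[Sequence[str]], bool],  # hypothetical system call check
    extract_features: Callable[[Sequence[str]], dict], # hypothetical syntax analysis
    feature_library: Sequence[dict],
) -> ScanResult:
    executable = preprocess(raw_text)        # step 1002
    if executable is None:                   # step 1003
        return ScanResult.ERROR
    tokens = tokenize(executable)            # step 1004
    if not has_risky_calls(tokens):          # step 1005
        return ScanResult.CLEAN
    features = extract_features(tokens)      # step 1006
    # Step 1007: matching is simplified to equality in this sketch.
    if any(features == known for known in feature_library):
        return ScanResult.INFECTED
    return ScanResult.CLEAN
```

Passing the stages in as callables keeps the sketch independent of any particular script language; each stage corresponds to one of the three parts described in the following sections.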
In the virus identification method described above with reference to Fig. 1, the idea of the present invention is embodied mainly in three parts: the preprocessing operation, lexical analysis, and syntax analysis. The specific operations of these three parts are described in turn below.
Preprocessing
Fig. 2 shows the detailed flowchart of the preprocessing step shown in Fig. 1. As shown in Fig. 2, the preprocessing of the script file starts in step 2001. In step 2002, the parts of the script file that have no effect on the executable text, such as comments, are filtered out. In step 2003, it is judged whether the remaining executable text is encrypted. If it is encrypted, step 2004 is executed: the ciphertext is analyzed and a corresponding decryption method is applied; otherwise step 2006 is executed to extract the executable text directly. In step 2004, for example, if analysis of the encrypted text shows that a publicly known encryption algorithm was used, the corresponding decryption algorithm is applied to obtain the decrypted text. Alternatively, if the ciphertext itself contains the decryption steps, the corresponding decryption algorithm can be obtained by analysis. Furthermore, virtual execution can be used to generate the decrypted executable text temporarily in memory, thereby obtaining the decrypted text. After decryption, step 2005 judges whether decryption succeeded. If it succeeded, step 2006 is executed to extract the executable text; otherwise step 2007 is executed to report that preprocessing failed. In step 2006, the executable text is extracted for use by the subsequent lexical analysis. Finally, in step 2008, the preprocessing of this script file ends.
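A minimal sketch of the preprocessing flow of Fig. 2 follows. It rests on several assumptions that are not in the source: comments are taken to be C-style (`//` and `/* */`), the "publicly known algorithm" is stood in for by plain hexadecimal encoding, and an encoded body is flagged by the hypothetical marker `#@hex:`. Real script encodings and the virtual-execution path are not modeled.

```python
import re
from typing import Optional

# Hypothetical marker that flags an encoded script body in this sketch only.
ENCODED_PREFIX = "#@hex:"

# Assumed comment syntax: /* ... */ and // ... (string literals are not handled).
_COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(text: str) -> str:
    """Step 2002: remove parts that do not affect the executable text."""
    return _COMMENT_RE.sub("", text)


def preprocess(text: str) -> Optional[str]:
    """Return the executable text, or None if preprocessing fails."""
    body = strip_comments(text).strip()
    if body.startswith(ENCODED_PREFIX):                  # step 2003: is it encoded?
        try:                                             # step 2004: decode
            raw = bytes.fromhex(body[len(ENCODED_PREFIX):])
            body = raw.decode("utf-8")
        except ValueError:                               # step 2005: decoding failed
            return None                                  # step 2007: report failure
    return body or None                                  # step 2006: extracted text
```

Returning `None` on failure lets the caller distinguish "nothing executable was extracted" from a successfully extracted text, which mirrors the success check that follows preprocessing in the overall flow.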
Lexical analysis
After the executable text has been obtained, lexical analysis is performed. Fig. 3 shows the detailed flowchart of this step.
In step 3001, lexical analysis starts. In step 3002, regular-grammar analysis is used: words are extracted by means of a set of operator opcodes and compound-statement opcodes of the present invention. This set of operator opcodes and compound-statement opcodes is designed according to the characteristics of scripts so that feature strings can be extracted conveniently from the script text; each invocation of the operator opcodes and compound-statement opcodes yields one token feature string (it should be noted that the operator opcodes and compound-statement opcodes can be designed according to the characteristics of the token feature strings). Then, in step 3003, the token feature strings generated in step 3002 are accumulated to form the word sequence array. In the present invention, each extracted token feature string is checked in step 3004 to determine whether it matches some system call, that is, whether it is a system call. If it is, step 3005 is executed to accumulate its checksum; otherwise step 3006 is executed. In step 3005, the checksum of the system call is obtained by reference to the keyword list in the system and added to the checksums of the system calls obtained earlier, yielding a statistical checksum. Then, in step 3006, it is judged whether the text has been read to the end. If not, step 3002 is executed again to obtain the next token feature string, and this loop continues until all of the text has been read.
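The token loop of steps 3002 to 3006 might look like the following sketch. The source does not specify the opcode design, the keyword list, or the checksum function, so everything concrete here is an assumption: a single regular expression stands in for the operator and compound-statement opcodes, `SYSTEM_CALLS` is a hypothetical keyword list, and CRC-32 is used as the per-call checksum.

```python
import re
import zlib
from typing import List, Tuple

# Hypothetical keyword list; a real list would come from the target system.
SYSTEM_CALLS = {"CreateObject", "RegWrite", "CopyFile"}

# Assumed token classes: identifiers, integers, double-quoted strings, single symbols.
_TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|"[^"\n]*"|[^\s\w]')


def lex(executable_text: str) -> Tuple[List[str], int]:
    """Return the word sequence array and the accumulated system call checksum."""
    tokens: List[str] = []
    statistical_checksum = 0
    for match in _TOKEN_RE.finditer(executable_text):   # steps 3002 and 3006
        token = match.group()
        tokens.append(token)                            # step 3003
        if token in SYSTEM_CALLS:                       # step 3004
            # Step 3005: CRC-32 is an assumption, not the source's checksum.
            statistical_checksum += zlib.crc32(token.encode("utf-8"))
    return tokens, statistical_checksum
```

Because a quoted string is kept as a single token, a system call name that appears only inside a string literal does not contribute to the checksum in this sketch.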
After all of the executable text has been read, system call matching is performed in step 3007. In this embodiment, the characteristics of the system are analyzed in advance to determine the system calls that a virus may exploit; in other words, a virus must contain at least these system calls in order to be able to harm the system. The statistical checksum of these system calls is then calculated for use in the system call matching of step 3007. In step 3007 of lexical analysis, the statistical checksum obtained in step 3005 is compared with the precomputed checksum of the system calls that may be exploited by a virus, and it is then judged whether they match (step 3008). In this embodiment, if the system call checksum of the current text obtained through lexical analysis is greater than the precomputed checksum, the system calls match, that is, the current text contains system calls that can be exploited by a virus. In this case lexical analysis is reported as successful, and further syntax analysis must be performed on the current executable text. If they do not match, lexical analysis is reported as failed and this scanning process exits, that is, the current file is considered free of viruses (step 3009). The advantage of this approach is that a script that makes no system calls is not mistaken for a file carrying a virus, which improves detection accuracy, lowers the false-positive rate, and also speeds up virus scanning. Finally, in step 3010, this lexical analysis operation ends.
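The comparison in steps 3007 and 3008 can be illustrated as follows. This is a sketch under assumptions: `REQUIRED_CALLS` is a hypothetical set of calls a virus is assumed to need, CRC-32 again stands in for the unspecified checksum, and the strict "greater than" test follows the wording of this embodiment (whether equality should also count is not stated in the source).

```python
import zlib
from typing import Iterable


def call_checksum(name: str) -> int:
    """Per-call checksum; CRC-32 is an assumption of this sketch."""
    return zlib.crc32(name.encode("utf-8"))


# Hypothetical calls that a virus is assumed to need in order to harm the system.
REQUIRED_CALLS = ("CreateObject", "CopyFile")

# Computed once in advance, as described for step 3007.
PRECOMPUTED_CHECKSUM = sum(call_checksum(name) for name in REQUIRED_CALLS)


def system_calls_match(found_calls: Iterable[str]) -> bool:
    """Step 3008: compare the accumulated checksum with the precomputed one."""
    statistical_checksum = sum(call_checksum(name) for name in found_calls)
    # Strictly greater, following the wording of the embodiment.
    return statistical_checksum > PRECOMPUTED_CHECKSUM
```

A sum of checksums is a coarse filter: it rejects scripts with no system calls cheaply, but it does not prove that the specific required calls are present, which is why a match only triggers further syntax analysis rather than a virus report.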
This embodiment gives only one method, described above, for judging whether the executable text contains system calls that can be exploited by a virus. In practice, this judgment can be implemented in many other ways and is not limited to this method. In addition, depending on the needs of the application, the system call matching step is not always required.
The lexical analysis process described above with reference to Fig. 3 is usually necessary for most executable text written in a programming language. However, it is not excluded that some particular languages use extremely simple definitions and structures, so that syntax analysis can be performed directly after only simple processing; this depends on the application.
Syntax analysis
After the above lexical analysis has completed successfully, the present invention performs the most critical and also the most complex process: syntax analysis.
Based on the characteristics of script viruses, the present invention splits syntax analysis as follows:
1. The whole source program after lexical analysis is regarded as a chain of definitions; this part is called the analysis of global variables and function definitions.
2. Within a function definition, the parameter definitions and the function body definition are separated and analyzed separately.
3. Within the analysis of a function body, the parameter definitions and the expression definitions are again separated and analyzed separately.
The whole syntax analysis process uses SLR(1) parsing supplemented with analysis of shift-reduce conflicts, and uses operator-precedence parsing for expressions.
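To illustrate how operator precedence drives expression analysis, the following sketch uses the shunting-yard algorithm, one well-known operator-precedence technique; the source does not say which variant is used. The precedence table and the function name `to_postfix` are assumptions made for illustration, and only left-associative binary operators and parentheses are handled.

```python
from typing import Dict, List

# Precedence levels assumed for illustration; a real script language defines its own.
PRECEDENCE: Dict[str, int] = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_postfix(tokens: List[str]) -> List[str]:
    """Reorder infix tokens into postfix order using an operator stack."""
    output: List[str] = []
    operators: List[str] = []
    for token in tokens:
        if token in PRECEDENCE:
            # Pop operators that bind at least as tightly (left associativity).
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                output.append(operators.pop())
            operators.append(token)
        elif token == "(":
            operators.append(token)
        elif token == ")":
            while operators and operators[-1] != "(":
                output.append(operators.pop())
            if not operators:
                raise SyntaxError("unbalanced ')'")
            operators.pop()  # discard the matching "("
        else:
            output.append(token)  # operand: identifier or literal
    while operators:
        operator = operators.pop()
        if operator == "(":
            raise SyntaxError("unbalanced '('")
        output.append(operator)
    return output
```

For example, `to_postfix(["a", "+", "b", "*", "c"])` places `*` before `+` in the output because `*` has the higher assumed precedence. The postfix order makes the evaluation structure of the expression explicit, which is the kind of logical relationship a syntax tree records.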
Fig. 4 is the flowchart of this step. In step 4001, syntax analysis starts. In step 4002, the type of the syntax analysis to be performed next is determined, that is, the syntax analysis type is defined. In this embodiment, according to the characteristics of scripts, syntax analysis is divided into five types: expression, function, parameter, global variable, and system function analysis. For example, when syntax analysis of the executable text has just begun, the type may be determined by analysis to be global variable analysis. Then, in step 4003, the type of syntax analysis is judged so that the corresponding operation can be performed: for a global variable, the global variable is added to the symbol table in step 4007. Of course, the type may also be defined as one of the other types at the start of syntax analysis or as the analysis progresses. For example: for an expression, the flow enters step 4004 and performs expression analysis, which uses operator-precedence parsing; for a function, the function parameters are analyzed in step 4005 and the function body itself is analyzed in step 4006; for a system function, the system function is added to the symbol table in step 4008.
Further analysis then proceeds within each syntax analysis category. For example, after expression analysis (step 4004), the function calls in the expression are analyzed further in step 4009; after function body analysis (step 4006), the local variables in the function body are analyzed further in step 4010 and are added to the symbol table in step 4011.
Whichever of the above types of syntax analysis is performed, the flow enters step 4012, where it is judged against the grammar rules whether the analysis is correct. If it is correct, step 4013 is executed; otherwise step 4014 is executed and the flow exits with an error. In step 4013, it is then judged whether the analysis is complete. If it is complete, step 4015 is executed to end the analysis; otherwise step 4002 is executed to continue the analysis. For example, in the function body branch, after step 4012 judges that the local variable analysis is correct, step 4013 may find that the analysis is not yet complete because the expressions in the function body still need to be analyzed. The flow then loops back to step 4002, where the type of the next syntax analysis is determined to be expression analysis, and after the judgment in step 4003 the flow proceeds along the expression analysis branch. The flow of Fig. 4 is executed cyclically in this way until all analysis operations are complete (step 4015).
Because the object of the present invention is to analyze the behavioral features of script files, the analysis of function bodies is a key part of the present invention. The syntax analysis of a function body is usually very complex. In the present invention, therefore, function body analysis is carried out separately from the other types of syntax analysis and is given its own parsing table, which makes upgrades and updates easier.
After the syntax analysis shown in Fig. 4 ends, a syntax tree is obtained, which reflects the logical relationships between the variables, expressions, and so on. Returning to the flow shown in Fig. 1, a series of syntactic features of the executable text is obtained from the syntax tree: the maximum number of loop iterations, the number of loop statements, the nesting depth of conditional statements, the number of conditional statements, the number of function calls, the number and types of function parameters, the return values of function calls, system function call features, and so on. These are then matched against the viral syntactic features in the virus feature library to judge whether the file contains a virus. For example, if, among the syntactic features obtained by scanning, the system function calls, the number of loop jumps, the number of function calls, and the return values of function calls exactly match the corresponding syntactic features of a certain script virus, the script may contain a virus; the virus is therefore reported and the corresponding method for handling it is given.
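The step from syntax tree to syntactic features can be illustrated with a small sketch. As a stand-in for the script parser described in this embodiment, it uses Python's standard `ast` module on Python source, which is purely an assumption made so that the example is runnable. The feature names and the equality-based `matches_library` are likewise hypothetical, and only four of the listed features are computed.

```python
import ast
from typing import Dict, List


def extract_features(source: str) -> Dict[str, int]:
    """Count a few syntactic features by walking the syntax tree."""
    tree = ast.parse(source)
    features = {"loops": 0, "conditionals": 0, "max_if_depth": 0, "calls": 0}

    def walk(node: ast.AST, if_depth: int) -> None:
        if isinstance(node, (ast.For, ast.While)):
            features["loops"] += 1
        elif isinstance(node, ast.If):
            features["conditionals"] += 1
            if_depth += 1  # nesting depth of conditional statements
            features["max_if_depth"] = max(features["max_if_depth"], if_depth)
        elif isinstance(node, ast.Call):
            features["calls"] += 1
        for child in ast.iter_child_nodes(node):
            walk(child, if_depth)

    walk(tree, 0)
    return features


def matches_library(features: Dict[str, int], library: List[Dict[str, int]]) -> bool:
    """Simplified matching: exact equality with any stored feature set."""
    return any(features == known for known in library)
```

Two sources that differ only in identifier names or literal values yield the same feature dictionary here, which illustrates why matching on syntactic features can catch variants that defeat plain string matching.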
It should be explained here that the virus feature library can store the syntactic features of known viruses and/or of unknown viruses. A known virus is a virus for which information such as a definite name and identifier is recorded in the current virus library, whereas an unknown virus is a virus for which no definite name or identifier has yet been recorded. Usually, the syntactic features of an unknown virus are syntactic features that are inferred from the behavioral features of known viruses and are sufficient to constitute viral behavior. Such self-defined syntactic features of unknown viruses can be obtained through the learning function of the antivirus software, so that viruses can be prevented to a certain extent.
The virus identification method proposed by the present invention has been described above with reference to Figs. 1-4. The method can be implemented in computer software, in hardware, or in a combination of software and hardware.
An exemplary hardware structure implementing the method proposed by the present invention is given below, as shown in Fig. 5.
As shown in Fig. 5, the device 500 for implementing the virus identification method proposed by the present invention comprises a preprocessing unit 510, an analysis and processing unit 520, and a recognition unit 530, wherein the analysis and processing unit 520 comprises a lexical analysis unit 522 and a syntax analysis unit 526.
Specifically, the preprocessing unit 510 preprocesses (for example, decrypts) the text of the file being scanned to extract the executable text. The analysis and processing unit 520 performs syntax analysis on the executable text to obtain its syntactic features. The analysis and processing unit 520 specifically comprises the lexical analysis unit 522, which performs lexical analysis on the executable text to extract a word sequence, and the syntax analysis unit 526, which performs syntax analysis on the word sequence extracted by the lexical analysis unit. The recognition unit 530 matches the syntactic features obtained by the syntax analysis unit against the syntactic features of viruses in the virus feature library to judge whether the file contains a virus. The lexical analysis unit 522 further comprises a judging unit 523, which judges from the extracted word sequence whether system calls that can be exploited by a virus are present (the judging method is the same as that shown in Fig. 3); if no system call that can be exploited by a script virus is present, the file is judged not to contain a virus.
The method proposed by the present invention for identifying viruses by analyzing the behavioral features of the file being scanned has been described in detail above with reference to Figs. 1-5, taking script viruses as an example. The method is applicable not only to script viruses but also to virus detection by analysis of other source code files.
Beneficial effects
The virus identification method proposed by the present invention has been described in detail above in conjunction with embodiments of the invention. The method obtains the syntactic features of the file being scanned through syntax analysis, and then uses the result of matching these syntactic features against the syntactic features of a virus to determine whether the file contains that virus.
Compare with the method for merely carrying out search matched according to the character string of extracting in the Virus Sample, because the method that the present invention proposes is that the behavioral characteristic with virus is that poison is looked on the basis, thereby improved the accuracy of virus scan greatly, and can add fast scan speed, avoid the wasting of resources.Because the multiple mutation body of virus has fixing behavioral characteristic usually, i.e. the method that the present invention of grammar property, thereby employing proposes can effectively tackle the different mutation of same virus family.In addition, the method that the present invention proposes has very strong dirigibility, for example by the fractionation with grammatical analysis, makes only need the grammatical analysis of change function body partly just can finish to the expansion and the upgrading of grammatical analysis future.Moreover the present invention formats unformatted file by pre-service and lexical analysis, supports the ordering of virus, can significantly improve killing poison speed.
Those skilled in the art will appreciate that various improvements can be made to the virus identification method and device disclosed above without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.