CN100504903C

CN100504903C - Automatic malicious code identification method

Info

Publication number: CN100504903C
Application number: CNB2007101219336A
Authority: CN
Inventors: 梁知音; 韦韬; 邹维; 韩心慧; 诸葛建伟; 陈昱; 毛剑
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2007-09-18
Filing date: 2007-09-18
Publication date: 2009-06-24
Anticipated expiration: 2027-09-18
Also published as: CN101140611A

Abstract

The invention belongs to the field of automatic malicious code analysis and provides a method for automatically identifying malicious code. The method first splits an executable program sample under analysis into components, then compares each component with components of known malicious behavior in order to determine automatically whether the sample is malicious code. The invention offers broad analysis coverage, fast analysis of malicious samples, and a malicious behavior component library that can be updated.

Description

A method for automatically identifying malicious code

Technical field

The invention belongs to the field of automatic malicious code analysis, and specifically relates to a method that uses reverse engineering and code similarity comparison techniques to accelerate the analysis of malicious code.

Background art

Malicious code is now ubiquitous on the Internet and poses a serious threat to network security. Implementing malicious code used to be relatively difficult, but in recent years, as networks have become widespread, a growing number of websites dedicated to discussing malicious code implementation techniques have appeared. People can obtain the source code of malicious programs directly from the network, and source code for basic malicious functions has become extremely easy to get. This has fueled the spread of malicious code variants. Code reuse across variants is very pronounced: many newly emerging malicious programs adopt implementation techniques from earlier malicious code or even use existing source code directly, and only a few new samples add new functions or new implementation methods. The number of malicious programs and their variants is growing explosively, and traditional manual analysis can no longer meet the demand for rapid analysis of malicious code.

In the field of automatic malicious code analysis there are currently two main classes of automatic analysis methods: dynamic analysis and static analysis. Dynamic analysis executes the program under analysis in a secure environment and observes its execution and results. It can uncover part of the behavior of malicious code, but because the environment sometimes does not meet the requirements for running the code, or the trigger conditions of the malicious behavior are not met, dynamic execution can hardly cover all paths. In addition, the results obtained by monitoring during dynamic analysis still need further analysis and aggregation. Static analysis mainly comprises pattern matching and lexical analysis; it determines whether a program contains a specific malicious behavior by checking whether the program contains a given behavior or semantic pattern. This approach establishes a correspondence between malicious behaviors and program segments, but it is difficult to build models that are both robust and highly discriminative.

In view of the above, the invention proposes a method for statically analyzing malicious behavior in malicious code that places relatively low demands on modeling. Based on the characteristics of binary programs and of how malicious code is implemented, the program is split into components, which are then analyzed and matched. This automatic analysis method is very effective for malicious programs, botnet programs in particular, and can quickly and automatically find malicious behavior in a sample under analysis.

A botnet is a collection of compromised machines, which are called zombie hosts (bots). A botnet typically includes a command-and-control framework. Bots compromise hosts on the network by means of worms, Trojan horses, or backdoors and hide there, and then accept remote control from the source of the botnet, i.e. the botnet controller (bot herder), through a remote control module. Botnet programs usually also include propagation modules that, automatically or on the controller's command, scan the network, intrude into other vulnerable hosts, and leave a copy of themselves. Botnet controllers often use bot programs to achieve malicious goals, such as stealing users' bank card information or launching denial-of-service attacks against particular servers. Functional modules commonly found in a botnet program include: a command and control module, a replication/propagation module, a host control module, a file download/upload module, an information theft module, and an anti-detection/anti-analysis module. The structure is shown in Fig. 1 (for the structure and mechanisms of botnets, see: P. Barford and V. Yegneswaran, "An Inside Look at Botnets", Special Workshop on Malware Detection, Advances in Information Security, 2006).

Summary of the invention

The purpose of the invention is to provide a method for automatically identifying malicious code that uses reverse engineering and code similarity comparison techniques to accelerate malicious code analysis and effectively improve its efficiency and coverage.

The above purpose is achieved by the following technical solution:

A method for automatically identifying malicious code, comprising the steps of:

1) parsing the executable program sample under analysis to obtain the function nodes and function call information of the program;

2) according to the function nodes and function call information, extracting each component head function and all functions it calls directly or indirectly, thereby obtaining the components under analysis, each identified by its component head function;

3) comparing each component under analysis, one by one, for similarity with each known malicious behavior component in a known malicious behavior component library, until a component under analysis is found to be similar to a known malicious behavior component or all components under analysis have been compared;

4) if a component under analysis is similar to a known malicious behavior component in the known malicious behavior component library, judging that the program contains malicious behavior.

Further, starting from domain knowledge about malicious code, the invention proposes a representation of components in binary executables and a method for extracting them. Based on the characteristics of binary malicious code, a function call cluster is used to represent a malicious behavior component; the function in a cluster that directly or indirectly calls every other function in the cluster is called the component head function.

The invention uses three methods for identifying component head functions:

[1] Component identification based on program dispatching: identify the component head functions that are invoked by the program dispatcher:

Program dispatching is the process of mapping a particular sequence of messages to code. Malicious code, botnet programs in particular, usually invokes its functional components according to the commands it receives. Therefore, once the dispatch function has been identified, the malicious functional components it calls can easily be extracted.

[2] Component identification based on key APIs: identify the component head functions that directly or indirectly call a group of key APIs:

A functional component in malicious code usually has to accomplish some malicious task, such as killing antivirus processes, recording keyboard activity, or attacking a victim website. To accomplish these tasks, such components usually need to call a group of key system APIs, directly or indirectly. Conversely, potential components can be extracted by looking for the functions at which these APIs converge. The API set is therefore very important for extracting component functions; it can be collected from a definition library or from known malicious components obtained by other methods.

[3] Component identification based on the repeated-call principle: identify the component head functions that are called repeatedly by different function segments in the program:

A function cluster that is called repeatedly, if it is neither a library function nor an API provided by the system, is generally a functional module written by the user to accomplish some independent function. This characteristic can also be used to extract malicious components from a program. The information required is the set of calling and called functions and the corresponding function call graph. If a function A calls functions B and C, and functions B and C each call function D, then D can be regarded as a potential reusable module because it is called more than once, as shown in Fig. 2. To avoid misjudging library functions and system API functions as reusable modules, exclusion rules must be defined in advance so that repeated calls to these functions are ignored.
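The three identification methods can be illustrated with a minimal Python sketch over a toy call graph. It is not the patent's implementation: the graph, every function name in it, and the rule of keeping only the lowest function at which a key API group converges are assumptions made for illustration.

```python
from collections import defaultdict

# Toy call graph: caller -> set of callees. All function names are hypothetical.
CALL_GRAPH = {
    "main": {"parse_command"},
    "parse_command": {"kill_av", "keylog"},
    "kill_av": {"find_process", "TerminateProcess"},
    "keylog": {"GetAsyncKeyState", "write_log"},
    "find_process": {"OpenProcess", "write_log"},
    "write_log": set(),
}
SYSTEM_APIS = {"TerminateProcess", "OpenProcess", "GetAsyncKeyState"}


def reachable(graph, start):
    """Return every function that `start` calls directly or indirectly."""
    seen, stack = set(), [start]
    while stack:
        for callee in graph.get(stack.pop(), ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


def heads_by_dispatcher(graph, dispatcher):
    """Method [1]: functions called directly by the dispatch function."""
    return set(graph.get(dispatcher, ()))


def covers(graph, func, key_apis):
    """True if the call closure of `func` contains the whole key API group."""
    return key_apis <= reachable(graph, func)


def heads_by_key_apis(graph, key_apis):
    """Method [2]: lowest functions at which the key API group converges.

    Keeping only functions none of whose callees already cover the group
    is an assumption of this sketch.
    """
    return {
        f for f in graph
        if covers(graph, f, key_apis)
        and not any(covers(graph, c, key_apis) for c in graph[f])
    }


def heads_by_repeated_calls(graph, excluded):
    """Method [3]: non-excluded functions with at least two distinct callers."""
    callers = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            callers[callee].add(caller)
    return {f for f, c in callers.items() if len(c) >= 2 and f not in excluded}
```

With this graph, `heads_by_dispatcher(CALL_GRAPH, "parse_command")` yields `kill_av` and `keylog`, the key API group `{"TerminateProcess", "OpenProcess"}` converges at `kill_av`, and `write_log` is the only user function with two distinct callers once the system APIs are excluded.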

Further, the invention proposes a "weight-threshold" inexact matching method for comparing the similarity of the function call clusters of binary program components. The presence of a known malicious component is detected by comparing the function clusters of the components obtained from a new sample with those in the known malicious component library.

Further, the invention proposes a method for building the known malicious behavior component library, comprising the steps of:

1) parsing a known malicious program sample to obtain the function nodes and function call information of the program;

2) according to the characteristics of the known malicious program, extracting from the functions obtained in step 1) each component head function and all functions it calls directly or indirectly, thereby obtaining the components, each identified by its component head function;

3) analyzing each of these components, identifying the malicious behavior components, and storing them in the known malicious behavior component library in a set format.
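The text does not specify the storage format of the library. The following Python sketch shows one possible shape for a library entry; the class names, field names, JSON serialization, and the example entry are all assumptions for illustration, not the format used by the invention.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MaliciousComponent:
    """One library entry; the field names are illustrative assumptions."""
    head: str            # component head function
    call_graph: dict     # function -> list of callees inside the component
    source_sample: str   # sample the component was extracted from
    behavior: str        # analyst's description of the malicious behavior


@dataclass
class ComponentLibrary:
    """In-memory collection of known malicious behavior components."""
    entries: list = field(default_factory=list)

    def add(self, component):
        self.entries.append(component)

    def dumps(self):
        """Serialize the library to JSON text (an assumed storage format)."""
        return json.dumps([asdict(e) for e in self.entries], indent=2)


# Hypothetical entry: the function and sample names are made up.
library = ComponentLibrary()
library.add(MaliciousComponent(
    head="kill_av",
    call_graph={"kill_av": ["find_proc", "TerminateProcess"],
                "find_proc": ["OpenProcess"]},
    source_sample="example_sample",
    behavior="terminates antivirus processes",
))
```

Storing the call graph restricted to the component keeps each entry self-contained, so a later similarity comparison needs nothing beyond the entry itself.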

The invention has the following beneficial effects:

1. The method provided by the invention is a static method of malicious code analysis; compared with dynamic analysis, its analysis coverage is wider. Compared with model checking and lexical analysis, it does not require an accurate model of malicious behavior to be designed; instead a malicious component library is built, and the known malicious components are mainly extracted automatically according to the structural characteristics of malicious programs, which is much easier than building a malicious behavior model;

2. For a sample in which malicious components have been detected, the invention can divide the components obtained by splitting it into two classes, "known" and "unknown", which accelerates the analysis of malicious samples. For malicious components that have already been analyzed, the system can generate the analysis report automatically, avoiding repeated analysis work. If some components of a malicious program have no match in the known malicious component library, they very probably contain new malicious behavior; the invention can single out these components for focused analysis, and the results can be used to supplement and improve the known malicious component library;

3. Based on the similarity relations between malicious components, analyzed malicious samples can be classified, the origin of new samples can be traced, and family tree relationships among malicious code samples can be constructed.

Description of drawings

The invention is described in further detail below with reference to the drawings and specific embodiments:

Fig. 1 is a schematic diagram of the program structure of a botnet program.

Fig. 2 is a schematic diagram of a "component" as defined in the invention.

Fig. 3 is a flow chart of a specific embodiment.

Embodiment

In the invention, a component is defined as a program module consisting of a cluster of related functions that together accomplish a specific function. The cluster contains one head function, i.e. the component head, which directly or indirectly calls every other function in the cluster. Fig. 2 shows a schematic diagram of a component as defined in the invention.
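Under this definition a component is the call closure of its head function. The Python sketch below illustrates this with a plain graph traversal; the dictionary representation of the call graph and the node names are assumptions, with A, B, C and D following the Fig. 2 style example in which A calls B and C and both call D.

```python
def extract_component(call_graph, head):
    """Return the head function plus everything it calls directly or indirectly."""
    component, stack = {head}, [head]
    while stack:
        for callee in call_graph.get(stack.pop(), ()):
            if callee not in component:
                component.add(callee)
                stack.append(callee)
    return component


# A calls B and C, B and C both call D; E is an unrelated caller of A.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}
component_a = extract_component(graph, "A")
component_d = extract_component(graph, "D")
```

Here `component_a` contains A, B, C and D but not E, because E calls the head rather than being called by it, and `component_d` contains only D.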

The invention uses component extraction and binary code similarity comparison techniques from the field of reverse engineering. Component extraction is an important topic in software engineering research; its main goal is to identify reusable components in legacy code. Methods for extracting and evaluating components (see: Luo Jing, Zhang Lu, Sun Jiasu, "A survey of component extraction techniques", Computer Science, vol. 32, Dec. 2005) cover aspects such as domain knowledge, structure, and component metrics, and they apply equally to malicious code analysis. Binary code similarity comparison (see: E. Carrera and G. Erdelyi, "Digital genome mapping: Advanced binary malware analysis", Proceedings of the 15th Virus Bulletin International Conference (VB2004), pp. 187-197, 2004) is a widely studied topic in software engineering and is increasingly applied to malicious code research. API sequences, function call graphs, control flow graphs, program dependence graphs, and the like are all used for binary code similarity comparison.

Fig. 3 shows the flow chart of a specific embodiment of the invention. The flow is as follows:

1. Build the malicious component library

[1] For a number of known malicious program samples, parse them with a disassembler and, according to the characteristic information of program functions, extract the function nodes and function call information of the program (for the extraction method, see Kris Kaspersky, "Hacker Disassembling Uncovered", translated by Tan Mingjin, Publishing House of Electronics Industry, 2005, p. 85);

[2] According to the program dispatching characteristics of known malicious code, extract the head functions of the malicious components. Program dispatching is commonly implemented in two ways. The first is direct dispatching: a dispatch function parses the received message and then calls the corresponding function directly. The second is registered dispatching: each component registers itself in a global dispatch table, and the dispatcher looks up the global dispatch table according to the command parsing result and calls the corresponding function. Here, the known malicious component head functions are extracted from the functions called by the dispatch function irc_parseline() in the rbot and sdbot samples and from the functions registered through g_cMainCtrl.m_cCommands in the agobot sample;

[3] According to the relationship between each head function and the function nodes of the program, extract the known malicious components. After descriptions of the sample source and function are added to these known malicious components, they are saved in the known malicious component library.

2. Extract the candidate components from the executable program sample under analysis

[1] For an unknown sample under analysis, parse it with a disassembler and, according to the characteristic information of program functions, extract the function nodes and function call information of the program;

[2] Extract the candidate components, using key API convergence and the repeated-call principle to find malicious component head functions. For example, killing an antivirus program usually requires calling an API set such as (AdjustTokenPrivileges, CreateToolhelp32Snapshot, LookupPrivilegeValueA, Module32First, OpenProcess, Process32First, Process32Next, TerminateProcess); recording keystroke activity usually requires calling an API set such as (GetAsyncKeyState, GetForegroundWindow, GetKeyState);

3. Compare the candidate components with the components in the known malicious behavior component library. If the component similarity comparison shows that a candidate matches a known malicious component, the sample under analysis is judged to be malicious code that contains a malicious behavior component with a known purpose from a known family, and an analysis report for the sample is generated automatically from this information. At the same time, suspected malicious behavior components are filtered out for further in-depth analysis, and the results are used to supplement and improve the known malicious behavior component library.
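The comparison step can be sketched as a small triage routine that splits candidate components into matched ("known") and unmatched ("unknown") ones and derives the verdict. This is only an illustration: the Jaccard overlap of API names is a stand-in similarity measure and not the weight-threshold measure of the invention, and all component names are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of two API-name sets (not the patent's measure)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def triage(candidates, library, similarity=jaccard, threshold=0.8):
    """Split candidates into known and unknown components and give a verdict."""
    known, unknown = {}, []
    for name, apis in candidates.items():
        # First library component whose similarity exceeds the threshold, if any.
        match = next((lib_name for lib_name, lib_apis in library.items()
                      if similarity(apis, lib_apis) > threshold), None)
        if match is None:
            unknown.append(name)   # left for further manual analysis
        else:
            known[name] = match    # can be reported automatically
    return {"malicious": bool(known), "known": known, "unknown": unknown}


library = {
    "kill_antivirus": {"OpenProcess", "TerminateProcess",
                       "Process32First", "Process32Next"},
}
candidates = {
    "sub_401000": {"OpenProcess", "TerminateProcess",
                   "Process32First", "Process32Next"},
    "sub_402000": {"GetAsyncKeyState", "GetKeyState"},
}
report = triage(candidates, library)
```

In this toy run `sub_401000` matches the `kill_antivirus` entry, so the sample is flagged as malicious, while `sub_402000` stays in the unknown list and would be a candidate for further analysis and, if confirmed, for addition to the library.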

In the component similarity comparison, the presence of a known malicious component is identified by comparing the function clusters of the components obtained from the new sample with those in the known malicious component library. If, among the candidate components of the new sample, there is a function whose similarity to the head function of a known malicious component is higher than a preset threshold (set to 0.8 here), the new sample is considered to contain that malicious component. The problem of component detection is thus reduced to a problem of function comparison. The invention proposes a "weight-threshold" inexact matching method for function call graphs, which mainly comprises the following 4 steps:

1. Topologically sort the function call graph of the target sample and the function call graph of the known malicious component (see: Xu Zhuoqun, Yang Dongqing, Tang Shiwei, Zhang Ming, "Data Structures and Algorithms", Higher Education Press, chapter 6, p. 183, 2004). The function nodes of each call graph are arranged into an ordered sequence such that, except for recursive calls, no function in the sequence calls a function that comes after it;

2. Calculate the node weights W(F): compute the node weight of each function in the target sample and in the known malicious component along the topological sequence. If a function node calls no function, its weight is set to 1; if a function calls other functions (recursive calls excluded), its weight is set to the sum of the weights of the called functions plus 1;

3. Calculate the similarity S(F, G) based on weights and the threshold: compute, along the topological sequence, the similarity between each function F in the sample under analysis and each function G in the known component. There are three cases:

a) when both functions are APIs, the similarity is 1 if they are the same and 0 if they are different;

b) when one function is an API and the other is not, the similarity is 0;

c) when neither function is an API, the similarity is defined as the maximum of the sum of the weighted similarities of function pairs taken from the sets of functions the two call; if this maximum is lower than a preset threshold (e.g. 0.8), the similarity of the two functions is defined as 0. The calculation proceeds as follows:

i. Without loss of generality, assume that F has the larger node weight, and denote this weight by $T$;

ii. Let the set of functions called by F be $\{f_0, f_1, \ldots, f_m\}$ and the set of functions called by G be $\{g_0, g_1, \ldots, g_n\}$, and let $r = \min(m, n)$. Note that because the similarity is computed along the topological sequence, the similarities between all $f_i$ and $g_j$ have already been computed, i.e. $S(f_i, g_j)$ is known;

iii. For a sequence $G_k = \{g_{k_0}, g_{k_1}, \ldots, g_{k_r}\}$ (where $0 \le k_i \le r$, and $k_i \ne k_j$ if $i \ne j$), the candidate weighted similarity is $S' = \left(1 + \sum_{i=0}^{r} W(f_i) \cdot S(f_i, g_{k_i})\right) / T$;

iv. Let $G_k' = \{g_{k_0}', g_{k_1}', \ldots, g_{k_r}'\}$ be a sequence that maximizes $S'$, and denote this maximum by $S_{max}$;

v. If $S_{max}$ is greater than the preset threshold (e.g. 0.8), the similarity of F and G is $S_{max}$; otherwise the similarity of F and G is 0.

4. Determine the result: for the head function F' of a known malicious component, if the target sample contains a function G' such that S(F', G') is higher than the preset threshold (e.g. 0.8), the target sample is considered to contain that malicious component. In experiments, a preset threshold of 0.8 gave the best results.
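The weight-threshold matching described above can be sketched in Python as follows. This is one reading of the steps, not the patented implementation: the call graphs and function names are hypothetical, the pairing of called functions is found by brute force over one-to-one matchings, only direct self-recursion is dropped, and `graphlib` (Python 3.9+) raises `CycleError` on mutual recursion, which this sketch does not handle.

```python
from graphlib import TopologicalSorter
from itertools import permutations


def callees_of(graph, f):
    """Direct callees of f, ignoring direct self-recursion."""
    return [c for c in graph.get(f, ()) if c != f]


def topo_order(graph):
    """Order functions so that each one appears after everything it calls."""
    deps = {f: callees_of(graph, f) for f in graph}
    return list(TopologicalSorter(deps).static_order())


def node_weights(graph):
    """W(F) = 1 for a function that calls nothing, else 1 + sum of callee weights."""
    weights = {}
    for f in topo_order(graph):
        weights[f] = 1 + sum(weights[c] for c in callees_of(graph, f))
    return weights


def best_matching(heavy, light, score):
    """Maximum total score over one-to-one pairings (brute force)."""
    r = min(len(heavy), len(light))
    if len(heavy) >= len(light):
        options = (zip(p, light) for p in permutations(heavy, r))
    else:
        options = (zip(heavy, p) for p in permutations(light, r))
    return max(sum(score(h, l) for h, l in pairs) for pairs in options)


def similarity(graph_a, graph_b, apis, threshold=0.8):
    """Return {(F, G): S(F, G)} for every F in graph_a and G in graph_b."""
    wa, wb = node_weights(graph_a), node_weights(graph_b)
    sim = {}
    for f in topo_order(graph_a):
        for g in topo_order(graph_b):
            if f in apis or g in apis:
                # Cases a) and b): APIs only match the identical API.
                sim[f, g] = 1.0 if f == g else 0.0
                continue
            fs, gs = callees_of(graph_a, f), callees_of(graph_b, g)
            if wa[f] >= wb[g]:   # T and the pair weights come from the heavier side
                total, heavy, light = wa[f], fs, gs
                score = lambda h, l: wa[h] * sim[h, l]
            else:
                total, heavy, light = wb[g], gs, fs
                score = lambda h, l: wb[h] * sim[l, h]
            s = (1 + best_matching(heavy, light, score)) / total
            sim[f, g] = s if s > threshold else 0.0
    return sim


APIS = {"TerminateProcess", "OpenProcess", "Process32First",
        "GetAsyncKeyState", "GetKeyState"}
KNOWN = {"kill_av": ["find_proc", "TerminateProcess"],
         "find_proc": ["OpenProcess", "Process32First"]}
TARGET = {"sub_1": ["sub_2", "TerminateProcess"],
          "sub_2": ["OpenProcess", "Process32First"],
          "sub_3": ["GetAsyncKeyState", "GetKeyState"]}

sim = similarity(TARGET, KNOWN, APIS)
contains_kill_av = any(sim[f, "kill_av"] > 0.8 for f in TARGET)
```

In this example `sub_1` has the same call structure as the known head function `kill_av`, so their similarity is 1.0 and the target is reported as containing the component, while `sub_3` shares no APIs with it and scores 0. The brute-force matching is exponential in the number of callees; a practical implementation would presumably use an assignment algorithm instead, which the text does not discuss.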

As described above, the invention exploits the reuse of function code in malicious programs and proposes a new method, based on the analysis of binary components, for automatically identifying malicious code. The method has been applied to botnet program samples captured by the Peking University honeynet project team of the Chinese Honeynet alliance (http://www.icst.pku.edu.cn/honeynetweb/index.htm), where it greatly accelerated the analysis of malicious code samples and achieved good results, fulfilling the purpose of the invention. The invention is practical and has good prospects for wider application.

Although specific embodiments and drawings are disclosed to illustrate the invention, their purpose is to help the reader understand the content of the invention and implement it accordingly. Those skilled in the art will appreciate that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the preferred embodiments and the drawings; the scope of protection sought is defined by the claims.

Claims

1. A method for automatically identifying malicious code, comprising the steps of:

1.1 parsing the executable program sample under analysis to obtain the function nodes and function call information of the program;

1.2 according to the function nodes and function call information, extracting each component head function and all functions it calls directly or indirectly, thereby obtaining the components under analysis, each identified by its component head function, wherein the component head function is the function in a function cluster that directly or indirectly calls the other functions in the cluster;

1.3 comparing each component under analysis, one by one, for similarity with each known malicious behavior component stored in a known malicious behavior component library, until a component under analysis is found to be similar to a known malicious behavior component or all components under analysis have been compared;

1.4 if a component under analysis is similar to a known malicious behavior component in the known malicious behavior component library, judging that the program contains malicious behavior.

2. The method for automatically identifying malicious code according to claim 1, characterized in that a disassembler is used to parse the executable program sample under analysis of step 1.1.

3. The method for automatically identifying malicious code according to claim 1, characterized in that the component head functions of step 1.2 are obtained by identifying the functions called by a dispatch function.

4. The method for automatically identifying malicious code according to claim 1, characterized in that the component head functions of step 1.2 are obtained by identifying the functions that directly or indirectly call a group of key APIs.

5. The method for automatically identifying malicious code according to claim 1, characterized in that the component head functions of step 1.2 are obtained by identifying the functions that are called repeatedly by different function segments in the program.

6. The method for automatically identifying malicious code according to claim 1, characterized in that the similarity comparison of step 1.3 comprises the following steps:

6.1 topologically sorting the function call graphs of the component under analysis and of the known malicious behavior component:

the function nodes of each function call graph are arranged into a sequence such that, unless a recursive call occurs, the function corresponding to a function node earlier in the sequence does not call the function corresponding to a function node later in the sequence;

6.2 setting, along the sequence, the weights of the function nodes in the component under analysis and in the known malicious behavior component:

if the function corresponding to a function node calls no function, the weight of the node is set to 1; if the function corresponding to a function node calls other functions, recursive calls excluded, the weight of the node is set to 1 plus the weights of the function nodes corresponding to the called functions;

6.3 calculating, along the sequence, the similarity between each function in the component under analysis and each function in the known malicious behavior component:

when both functions are APIs, the similarity is 1 if they are the same and 0 if they are different; when one function is an API and the other is not, the similarity is 0; when neither function is an API, the similarity is the sum of the weighted similarities of the functions in the function sets they call, and when this function similarity is lower than a preset threshold, the similarity of the two functions is defined as 0;

6.4 judging the result of the similarity comparison:

if the component under analysis contains at least one function whose similarity to the head function of the known malicious component is higher than the preset threshold, the component under analysis is considered similar to the known malicious component.

7. The method for automatically identifying malicious code according to claim 6, characterized in that the preset threshold used in the similarity comparison step is 0.8.

8. The method for automatically identifying malicious code according to claim 1, characterized in that, if the executable program sample is judged to contain malicious behavior, the other components of the sample that are not similar to any component in the known malicious behavior component library are analyzed, and the components confirmed to be malicious behavior components are stored in the known malicious behavior component library in the format the library requires.

9. The method for automatically identifying malicious code according to claim 1, characterized in that building the known malicious behavior component library of step 1.3 comprises the steps of:

9.1 parsing a known malicious program sample to obtain the function nodes and function call information of the program;

9.2 according to the characteristics of the known malicious program, extracting from the functions obtained in step 9.1 each component head function and all functions it calls directly or indirectly, thereby obtaining the components, each identified by its component head function;

9.3 analyzing each of these components, identifying the malicious behavior components, and storing them in the known malicious behavior component library in a set format.

10. The method for automatically identifying malicious code according to claim 9, characterized in that, in building the known malicious behavior component library, a disassembler is used to parse the known malicious program sample of step 9.1.