CN109101816B - Malicious code homology analysis method based on system call control flow graph

Info

Publication number: CN109101816B
Application number: CN201810912373.4A
Authority: CN
Inventors: 王勇; 史小东; 梁杰; 孙青煜; 张继; 刘振岩; 薛静锋
Original assignee: Beijing Institute of Technology BIT; China Information Technology Security Evaluation Center
Current assignee: Beijing Institute of Technology BIT; China Information Technology Security Evaluation Center
Priority date: 2018-08-10
Filing date: 2018-08-10
Publication date: 2022-02-08
Anticipated expiration: 2038-08-10
Also published as: CN109101816A

Abstract

The invention discloses a malicious code homology analysis method based on a system call control flow graph. The method first constructs the system call control flow graph of each program to be analyzed; this graph is a directed unweighted graph whose nodes are system calls and whose edge directions represent the order in which the system calls are executed. The system call control flow graphs of different programs are then compared, and the graph similarity is used as the similarity measure for homology analysis. Because the system call control flow graph ignores the details of the software code and considers only the system call functions that are called, it reduces the amount of data to be processed and abstracts program behavior at a higher level than an instruction block control flow graph or a function call graph. Moreover, because only system calls are considered, instruction-level obfuscation is largely sidestepped, which gives the method resistance to obfuscation.

Description

Malicious code homology analysis method based on system call control flow graph

Technical Field

The invention relates to the technical field of network security, in particular to a malicious code homology analysis method based on a system call control flow graph.

Background

The number of malicious code samples has grown rapidly, but most newly appearing samples are variants of existing malicious code whose core functionality has changed little. Most of these variants use obfuscation techniques to evade virus detection.

Homology originally refers, in molecular evolution studies, to the degree of similarity between the nucleotide sequences of two nucleic acid molecules or between the amino acid sequences of two protein molecules. Applied to software, homology analysis means comparing two pieces of software, from source code to functionality, to determine whether they are the same or similar, and giving a similarity value that describes how similar they are.

Homology analysis of malicious code has mostly relied on feature recognition. Early methods extracted sequence fragments of the malicious code as signatures and identified target code by sequence matching. A sequence fragment may be an opcode sequence obtained by disassembly, or a dynamic system call sequence recorded during execution. Variants that use obfuscation techniques often cannot be identified by simple sequence matching. Methods based on behavioral features were therefore proposed; they focus on the actual behavior of the target code rather than its code structure, in order to resist obfuscation.

Common behavioral feature representations include the instruction block control flow graph and the function call graph. These features still change to some extent once obfuscation techniques such as polymorphism and metamorphism are applied, so their homology analysis capability on malicious code variants is poor.

Disclosure of Invention

In view of this, the invention provides a method for analyzing malicious code homology based on a system call control flow graph, which can improve the homology analysis capability of malicious code variants.

In order to solve the technical problem, the invention is realized as follows:

a malicious code homology analysis method based on a system call control flow graph comprises the following steps:

step one, constructing a system call control flow graph of a program to be analyzed; the system call control flow graph is a directed unweighted graph formed by system call nodes, and the direction of an edge represents the order in which the system calls are executed;

and step two, comparing the system call control flow graphs of different programs to be analyzed, and taking the graph similarity as the similarity measurement of homology analysis to realize the homology analysis.

Preferably, the first step is:

defining a function that is not called by any other function as an entry function; recursively traversing the instruction block control flow graphs of all entry functions in the program; for each instruction block control flow graph, when a 'CALL' instruction is encountered, judging whether the called function is an internal function or a system call function;

if the called function is a system call function, creating a node named after the system call; the nodes are connected according to the relative positions of their instruction blocks in the instruction block control flow graph, thereby establishing a system call graph;

if the called function is an internal function, traversing the instruction block control flow graph of the internal function, establishing the system call graph of the internal function in the same way as described for system call functions, and inserting the system call graph of the internal function into the upper-level system call graph at the call position of the internal function;

when the recursive traversal process is finished, a system call control flow graph only containing system call nodes is obtained.

Preferably, step one, before performing the recursive traversal, further prunes the instruction block control flow graph, and only retains the basic block and control flow containing the "CALL" instruction.

Preferably, the pruning is: for a basic block A in the instruction block control flow graph that does not contain a 'CALL' instruction, recording its predecessor basic blocks and its successor basic blocks, connecting each predecessor basic block to each successor basic block, and then deleting basic block A together with its jump branches;

if basic block A is the entry or exit of the function, the entry node and exit node of the function are updated before basic block A is deleted.

Preferably, in step one the nodes are connected according to the relative positions of their instruction blocks in the instruction block control flow graph as follows:

for multiple nodes belonging to the same basic block, connecting the nodes sequentially in execution order; then, according to the position of the basic block in the instruction block control flow graph, connecting the first node of the resulting chain to the predecessor basic blocks of that basic block, and connecting the last node to the successor basic blocks of that basic block.

Preferably, when an internal function is called, it is further determined whether the value of the parameter is_cross returned for the called function is 1; if it is 1, a path exists between the preceding node and the following node, so after the system call graph of the internal function is inserted into the upper-level system call graph, the preceding node and the following node of the current internal function are also connected; if the value of is_cross is 0, only the insertion is performed.

Preferably, the second step includes:

setting the I system call control flow graphs constructed from the first program to be analyzed as source graphs, and the J system call control flow graphs constructed from the second program to be analyzed as target graphs; for each source graph $g_i$, $i = 1, 2, \ldots, I$, computing the similarity between $g_i$ and each target graph $h_1 \sim h_J$ and recording the maximum value as $\mathrm{SimMax}_i$; then taking a weighted average of the maximum similarities $\mathrm{SimMax}_1 \sim \mathrm{SimMax}_I$ of all source graphs to obtain the similarity S of the two programs to be analyzed.

Preferably, to determine whether two programs to be analyzed are variants of the same family or are homologous, a threshold X is set and the similarity between the two programs is calculated; if the similarity is greater than X, the two programs are considered to be variants of the same family or homologous, and otherwise not.

Preferably, the similarity between source graph $g_i$ and target graph $h_j$ is calculated as follows:

step a: computing the graph edit distance $\mathrm{EditDistance}(g_i, h_j)$ between source graph $g_i$ and target graph $h_j$; the graph edit distance is the sum of the node cost and the edge cost;

the node cost is obtained as follows: if a node of source graph $g_i$ is matched to an empty node in target graph $h_j$, a cost c is added to the node cost NodeCost as the cost of deleting the node from source graph $g_i$; similarly, if a node of target graph $h_j$ is matched to an empty node in source graph $g_i$, a cost c is added to NodeCost as the cost of inserting the node into source graph $g_i$;

the edge cost is obtained as follows: finding the number $C_E$ of matching edges in source graph $g_i$ and target graph $h_j$; the edge cost is $\mathrm{EdgeCost} = (|E_g| - C_E) \times c(\sigma_{RE}) + (|E_h| - C_E) \times c(\sigma_{IE})$, where $E_g$ is the set of edges of source graph $g_i$, $E_h$ is the set of edges of target graph $h_j$, and $|E_g|$ and $|E_h|$ denote the numbers of edges in the respective graphs; the parameters $c(\sigma_{RE})$ and $c(\sigma_{IE})$ are the defined costs of deleting and adding an edge, respectively;

step b: computing the maximum graph edit distance $\mathrm{MaxEditDistance}(g_i, h_j) = \mathrm{NodeCost}_{max} + \mathrm{EdgeCost}_{max}$ between source graph $g_i$ and target graph $h_j$; $\mathrm{NodeCost}_{max}$ is the maximum node cost, equal to the total number of nodes of $g_i$ and $h_j$ multiplied by the cost c; $\mathrm{EdgeCost}_{max}$ is the maximum edge cost, $\mathrm{EdgeCost}_{max} = |E_g| \times c(\sigma_{RE}) + |E_h| \times c(\sigma_{IE})$;

step c: computing the graph similarity $\mathrm{Sim}(g_i, h_j)$ of source graph $g_i$ and target graph $h_j$ from the graph edit distance and the maximum graph edit distance;

preferably, $\mathrm{SimMax}_1 \sim \mathrm{SimMax}_I$ are combined by weighted average to obtain the similarity S, where the weights are determined by $B_i$ and $E_i$, the number of nodes and the number of edges of source graph $g_i$, respectively.

Advantageous effects:

(1) For the control flow structure of malicious code under static analysis, the invention proposes a control flow graph based on system calls: a directed unweighted graph formed by system call nodes, in which the direction of an edge represents the order in which the system calls are executed. The system call control flow graph ignores the details of the software code and considers only the system call functions that are called, which reduces the amount of data to be processed and abstracts program behavior at a higher level than an instruction block control flow graph or a function call graph. Moreover, because only system calls are considered, instruction-level obfuscation is largely sidestepped, which gives resistance to obfuscation.

(2) Building on the system call control flow graph, the invention further provides a homology analysis method. After the system call control flow graphs and the graph similarities are obtained, the similarity between samples is calculated by weighted averaging. This scheme can improve the homology analysis capability on malicious code variants.

Drawings

FIG. 1 shows the process of obtaining the system-call-based control flow graph;

FIG. 2 is a function call graph;

FIG. 3 is an instruction-block-based control flow graph;

FIG. 4 shows pruning of an instruction block control flow graph;

FIG. 5 is the system call control flow graph of the function Sub_1404A4;

FIG. 6 shows the insertion of the system call graph of an internal function;

FIG. 7 is a system-call-based control flow graph.

Detailed Description

In homology analysis of malicious code variants, the internal logic of malicious code processed by obfuscation techniques such as metamorphism and polymorphism is disturbed, so the instruction block control flow graph extracted by static analysis is of limited value, while dynamic analysis incurs a large overhead. In addition, the function call graph captures the call relationships between functions but carries no ordering information.

The invention therefore constructs a system call control flow graph. This control flow graph is a directed unweighted graph formed by system call nodes; the direction of an edge represents the order in which system calls are executed, so the graph preserves the temporal order between system calls. Here, a system call means that a program running in user space requests from the operating system kernel a service that requires higher privileges. System calls provide the interface between user programs and the operating system. Most operations that interact with the system, such as device I/O or inter-process communication, must run in kernel mode.

The system call control flow graph ignores the details of the software code; it preserves only the temporal order between system calls and considers only the system call functions that are called. This reduces the amount of data to be processed and gives a high level of abstraction of program behavior. Moreover, because only system calls are considered, instruction-level obfuscation is largely sidestepped, giving good resistance to obfuscation. Performing homology analysis of malicious code on the basis of the system call control flow graph can therefore improve the homology analysis capability on malicious code variants.

System calls are a very important aspect of software behavior, and a control flow graph based on system calls is more abstract than an instruction-block-based control flow graph or a function call graph, but it is often difficult to obtain. At present, system call control flow graphs are mainly constructed from API call sequences obtained by static or dynamic analysis, which is difficult and yields results that depend on chance. It is also noted that the function call graph and the control flow graph abstract the program structure and implicitly contain the execution path information of system calls. The invention therefore takes system calls under static analysis as the basic object for extracting and studying the behavioral features of target code, and provides a method for constructing a system call control flow graph. The method extracts the calling process of system calls from the instruction block control flow graph and constructs a system-call-based control flow graph to describe the behavioral features of the target program. On the basis of the system call control flow graph, graph similarity comparison is introduced, and the graph similarity is used as the similarity measure between malicious code samples for homology analysis, which improves the analysis capability.

Based on the above analysis, the invention provides a malicious code homology analysis method based on a system call control flow graph. Its basic idea is as follows: construct the system call control flow graph of each program to be analyzed (malicious code), where the system call control flow graph is a directed unweighted graph formed by system call nodes and the direction of an edge represents the order in which the system calls are executed; then compare the system call control flow graphs of different programs to be analyzed and use the graph similarity as the similarity measure for homology analysis. The invention also provides a scheme for constructing the system call control flow graph from the instruction block control flow graph.

The following is a detailed description of specific embodiments of the invention.

Step one, constructing a system call control flow graph.

The basic process of constructing the system call control flow graph of the program is as follows:

First, the instruction block control flow graph of each function in the program is obtained using existing techniques, and the instruction block control flow graphs of all entry functions in the program are then traversed recursively. Here, an entry function is a function that is not called by any other function, which is different from the program entry. A program has only one program entry, but it may have multiple entry functions; the entry functions include the program entry. As shown in FIG. 2, the functions in the first row are all entry functions, but only the light gray DllEntryPoint is the program entry.
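As an illustration of the entry function definition above, the following Python sketch selects the functions that no other function calls. The dictionary layout of the call graph and all function names other than DllEntryPoint are assumptions made for this example and are not part of the described method.

```python
def entry_functions(call_graph):
    """Return the internal functions that are not called by any other function.

    call_graph (hypothetical layout): maps each internal function name to the
    list of names it calls; system calls appear only as callees.
    """
    called = {
        callee
        for caller, callees in call_graph.items()
        for callee in callees
        if callee != caller  # a self-call does not count as a call by another function
    }
    return sorted(name for name in call_graph if name not in called)


# Hypothetical call graph: two functions are never called by anyone else.
example = {
    "DllEntryPoint": ["sub_A"],
    "sub_A": ["sub_B", "LoadLibraryA"],
    "sub_B": [],
    "sub_C": ["sub_B"],
}
print(entry_functions(example))  # ['DllEntryPoint', 'sub_C']
```

Each returned function would be the root of one recursive traversal, which is why a sample can yield several system call control flow graphs.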

Aiming at each instruction block control flow graph, when a 'CALL' instruction is encountered, whether a called function is an internal function or a system calling function is judged:

if the called function is an internal function, the instruction block control flow graph of the internal function is traversed in the same way, the system call graph of the internal function is established in the same way as described for system call functions, and the system call graph of the internal function is inserted into the upper-level system call graph at the call position of the internal function;

The flow of constructing a system-call-based control flow graph according to this idea is shown in FIG. 1. It comprises preprocessing the sample (disassembly, function delimitation, extraction of function call relations, and division into basic blocks), then pruning the resulting instruction-based control flow graphs so that only the basic blocks containing call instructions, and the control flow between them, are kept within each function. Then, starting from each entry function, the 'CALL' instructions are traversed recursively downward to construct the system-call-based control flow graph.

Step 1.1: preprocessing to obtain the instruction block control flow graph of each function in the program.

First, the sample program is disassembled to obtain disassembled code. On this basis, the instruction-block-based control flow graphs and the function call graph of the sample are obtained.

When binary code is disassembled, signatures or other technical means are used to locate the function structures in the code. These are similar to the functions defined in the source program: each is a distinct code sequence with an independent, complete structure. Disassembled code produced by IDA Pro already has function boundaries; each function starts with proc and ends with endp, so the disassembled code is divided into a number of functions. A CALL instruction in a function indicates that it calls an internal function or a system call that can return, so traversing the CALL instructions of a function reveals which functions it calls. The function call graph derived from the assembly code is shown in FIG. 2: the dark gray parts are system calls, the black parts are internal functions, and the light gray part is the program entry.

Inside a function, the instruction sequence is divided by jump statements, and the instruction sequence between two jump statements is taken as a basic unit of control flow (also called a basic block). Calls or jumps between basic units appear in the code as conditional branches and transfer decisions. The instruction-block-based control flow graph formed by the basic blocks and the control flow between them represents the execution flow of the function in a more compact form.

Fig. 3 shows an instruction block-based control flow graph, which contains three basic blocks and three edges. It is easy to see from the jump branch that it has two execution paths. The second basic block contains three CALL instructions (circled in the figure).

Step 1.2: pruning

To construct a system call control flow graph that contains only system calls, it suffices, viewed from the other direction, to remove the non-call instructions from the instruction block control flow graph. Therefore, once the instruction block control flow graph has been obtained, the basic blocks that contain no call instruction are first deleted (pruning); the recursive traversal in Step 1.3 then discards the non-call instructions inside the basic blocks that do contain call instructions. The basic operation of pruning is to delete each basic block that does not contain a "CALL" instruction while preserving the execution paths through it.

If a function Func has a basic block A without a call instruction, the predecessor basic blocks of A (the basic blocks whose jump branches point to A) and the successor basic blocks of A (the basic blocks to which the jump branches of A point) are recorded. The predecessors of A are cross-connected with the successors of A (i.e., each predecessor basic block is connected to each successor basic block), and basic block A and its jump branches are then deleted. The pruning process is shown in FIG. 4.

If basic block A is the entry or exit of the function, the function's entry node (which becomes the successor basic block of A) or exit node (which becomes the predecessor basic block of A) must be updated before A is deleted. If A is both an entry node and an exit node, there is an execution path through Func that calls neither an API nor an internal function; the parameter is_cross is set to 1 or 0 to record whether such a call-free path exists in the function.
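The pruning rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data layout (a dict of call lists per block, a dict of successor sets, and sets of entry and exit blocks), the function name prune, and the three-block example shape are all hypothetical; only the rule itself (cross-connect predecessors and successors, update entry and exit, set is_cross) comes from the description.

```python
def prune(blocks, succ, entries, exits):
    """Delete basic blocks without CALL instructions while keeping execution paths.

    blocks:  {block_id: [called function names]}   (empty list = no CALL)
    succ:    {block_id: set of successor block ids}
    entries, exits: sets of block ids that start / end the function
    Returns (kept_blocks, succ, entries, exits, is_cross).
    """
    succ = {b: set(s) for b, s in succ.items()}    # work on copies
    entries, exits = set(entries), set(exits)
    is_cross = 0
    for a in [b for b, calls in blocks.items() if not calls]:
        preds = {p for p, s in succ.items() if a in s and p != a}
        nexts = succ[a] - {a}
        for p in preds:                            # cross-connect around A
            succ[p].discard(a)
            succ[p] |= nexts
        if a in entries and a in exits:            # a call-free path crosses the function
            is_cross = 1
        if a in entries:                           # successors of A become entries
            entries = (entries - {a}) | nexts
        if a in exits:                             # predecessors of A become exits
            exits = (exits - {a}) | preds
        del succ[a]
    kept = {b: calls for b, calls in blocks.items() if calls}
    return kept, succ, entries, exits, is_cross


# Hypothetical three-block function: B1 -> B2, B1 -> B3, B2 -> B3; only B2 has CALLs.
blocks = {"B1": [], "B2": ["LoadLibraryA", "GetProcAddress", "VirtualProtect"], "B3": []}
succ = {"B1": {"B2", "B3"}, "B2": {"B3"}, "B3": set()}
kept, new_succ, entries, exits, is_cross = prune(blocks, succ, {"B1"}, {"B3"})
print(sorted(kept), is_cross)  # ['B2'] 1
```

In this example is_cross ends up as 1 because the path B1 -> B3 passes through the function without any call.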

Step 1.3: constructing a system call control flow graph through recursive traversal

Step 1.2 described the pruning process. After pruning, the control flow graph consists only of basic blocks that contain call instructions. The system call control flow graph is then constructed: while the instruction block control flow graph is traversed, non-call instructions are ignored and only call instructions are kept as nodes of the system call control flow graph.

For the instruction block control flow graph of each entry function, the instructions of each basic block are traversed one by one. When a "CALL" instruction is encountered and the called function is a system call function, a node named after that system call is created. Nodes belonging to the same basic block are connected sequentially in execution order; as shown in FIG. 5, the three system calls that belong to the same basic block in FIG. 3 yield three nodes connected in sequence. Then, according to the position of the basic block in the instruction block control flow graph, the first node of the chain is connected to the predecessor basic blocks of that basic block, and the last node is connected to its successor basic blocks.

Each basic block thus yields one directed graph composed of system call nodes. After pruning, the control flow graph of the function Sub_4014a4 shown in FIG. 3 is left with one basic block containing three system call functions called in sequence. Connecting them in execution order gives the system call control flow graph shown in FIG. 5, which has one start node and one end node. If there are multiple basic blocks, their graphs are connected to one another according to the relative positions of the basic blocks in the control flow graph.
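A minimal Python sketch of this connection rule is given below, assuming the control flow graph has already been pruned so that every block contains at least one system call. The node representation (block id, position, name), the function name function_graph, and the second block with ExitProcess are assumptions introduced only for illustration.

```python
def function_graph(blocks, succ):
    """Chain system calls inside each block, then link blocks by their CFG position.

    blocks: {block_id: [system call names in execution order]}  (non-empty lists)
    succ:   {block_id: set of successor block ids}
    Nodes are (block_id, index, name) tuples so repeated calls stay distinct.
    """
    nodes, edges = [], set()
    for b, calls in blocks.items():
        ids = [(b, i, name) for i, name in enumerate(calls)]
        nodes.extend(ids)
        edges.update(zip(ids, ids[1:]))           # sequential order within the block
    for b, targets in succ.items():
        tail = (b, len(blocks[b]) - 1, blocks[b][-1])
        for t in targets:
            head = (t, 0, blocks[t][0])
            edges.add((tail, head))               # last node of b -> first node of t
    return nodes, edges


# The three system calls of one block, followed by a hypothetical second block.
blocks = {"B2": ["LoadLibraryA", "GetProcAddress", "VirtualProtect"], "B4": ["ExitProcess"]}
succ = {"B2": {"B4"}, "B4": set()}
nodes, edges = function_graph(blocks, succ)
print(len(nodes), len(edges))  # 4 3
```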

Some of the call instructions in a basic block may call internal functions. When a call to an internal function is encountered, the system call control flow graph of that internal function is obtained by recursive traversal and then inserted into the upper-level system call control flow graph; that is, its start node and end node are connected to the node preceding and the node following the internal call, respectively. For example, if GetProcAddress in FIG. 5 is replaced by a call to the internal function Sub_A, the system call graph of Sub_A is constructed and connected as shown in FIG. 6.

When an internal function is called, the value of the returned parameter is_cross is also checked. If it is 1, a path exists between the preceding node and the following node, so in addition to the insertion described above, the node preceding and the node following the current internal function are connected directly; for example, if is_cross is 1 in FIG. 6, LoadLibraryA and VirtualProtect are connected to form a path. If is_cross is 0, only the insertion described above is performed.
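The insertion step and the is_cross rule can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: each function is assumed to be a single straight-line sequence of calls, is_cross is supplied as an input flag (in the description it results from pruning), the call graph is assumed to be free of recursion, and the names build, main, and the system call inside Sub_A are hypothetical.

```python
import itertools


def build(name, functions, counter=None):
    """Return (nodes, edges, starts, ends) of the system call graph of `name`.

    functions: {function_name: {"calls": [names], "is_cross": 0 or 1}}
    A callee that is a key of `functions` is internal; any other name is a system call.
    """
    if counter is None:
        counter = itertools.count()
    nodes, edges, starts = [], set(), []
    frontier = set()   # nodes that must be linked to whatever comes next
    at_start = True    # True while the function entry still reaches the next call directly
    for callee in functions[name]["calls"]:
        if callee in functions:                      # internal function: recurse, then insert
            sub_nodes, sub_edges, sub_starts, sub_ends = build(callee, functions, counter)
            if not sub_nodes:                        # empty graph: nothing to insert
                continue
            nodes += sub_nodes
            edges |= sub_edges
            edges |= {(p, s) for p in frontier for s in sub_starts}
            if at_start:
                starts += sub_starts
            if functions[callee]["is_cross"] == 1:   # bypass path: predecessors stay linked
                frontier = frontier | set(sub_ends)
            else:
                frontier, at_start = set(sub_ends), False
        else:                                        # system call: create a node
            node = (next(counter), callee)
            nodes.append(node)
            edges |= {(p, node) for p in frontier}
            if at_start:
                starts.append(node)
            frontier, at_start = {node}, False
    return nodes, edges, starts, sorted(frontier)


functions = {
    "main": {"calls": ["LoadLibraryA", "Sub_A", "VirtualProtect"], "is_cross": 0},
    "Sub_A": {"calls": ["GetModuleHandleA"], "is_cross": 1},  # hypothetical content
}
_, edges, _, _ = build("main", functions)
print(sorted((a[1], b[1]) for a, b in edges))
```

With is_cross set to 1 for Sub_A, the result contains the direct edge from LoadLibraryA to VirtualProtect in addition to the two edges through the inserted graph; with is_cross set to 0, only the two insertion edges remain.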

When the basic block instructions of a function are traversed, a start node list Entry_Node and a return node list Return_Node are maintained for the function and stored together with the function's system call control flow graph. The start (return) nodes can be obtained by setting up a corresponding data structure: the starting basic blocks of the control flow graph (which need not be unique) are traversed first, and the start (end) system call node of each such basic block is then obtained and added to the control flow graph.

When the system call graphs of all internal functions are connected through the relative position in the instruction block control flow graph, the system call control flow graph of the whole program is obtained. A system call control flow graph obtained by recursion of an entry function in malicious code is shown in fig. 7, and the timing relationship and the execution path of the function system call can be observed.

Because the entry function is not unique, the resulting system call control flow graph is not unique either: processing a sample that contains several entry functions yields a set of system call control flow graphs. In practice, many special cases must also be handled. For example, when the system call graph of a called internal function is being built, the function may contain basic blocks with calls and yet involve no system call; in that case an empty graph is returned.

Step two, homology analysis based on system call control flow graph

The invention provides a homology analysis algorithm based on system call control flow graphs and graph similarity comparison. The algorithm is suitable not only for detecting malicious code variants but also for homology analysis of other binary programs. The algorithm and the system call control flow graph also offer a reference for homology analysis research on other software.

This process is described in detail below.

Step 2.1: calculate the graph edit distance $\mathrm{EditDistance}(g_i, h_j)$ between source graph $g_i$ and target graph $h_j$ using an improved graph edit distance algorithm. The graph edit distance is the sum of the node cost and the edge cost.

Graph-edit-distance-based methods apply to many types of graphs, such as directed graphs, undirected graphs, attributed graphs and non-attributed graphs. The system call control flow graph used here is a directed unweighted graph in which the semantics of each node are fixed. The original graph edit distance algorithm considers directed weighted graphs; to adapt it to system call control flow graphs (directed unweighted graphs), it is slightly adjusted as follows:

a) The node cost is obtained as follows: if a node of source graph $g_i$ is matched to an empty node in target graph $h_j$, 1 is added to the node cost NodeCost as the cost of deleting the node from $g_i$. Similarly, if a node of $h_j$ is matched to an empty node in $g_i$, 1 is added to NodeCost as the cost of inserting the node into $g_i$. The result is the node cost of converting $g_i$ into $h_j$, i.e., the difference between the node sets of the two graphs.

b) The edge cost is obtained as follows: find the number $C_E$ of matching edges in $g_i$ and $h_j$. For any edge $(x, y) \in E_g$, if $(x, y) \in E_h$, the edge $(x, y)$ is matched and $C_E$ is incremented by 1. Here $E_g$ is the set of edges of $g_i$ and $E_h$ is the set of edges of $h_j$.

The edge cost is $\mathrm{EdgeCost} = (|E_g| - C_E) \times c(\sigma_{RE}) + (|E_h| - C_E) \times c(\sigma_{IE})$, where the two terms are the total costs of deleting edges from and adding edges to $g_i$, respectively; $|E_g|$ and $|E_h|$ denote the numbers of edges in the graphs; and $c(\sigma_{RE})$ and $c(\sigma_{IE})$ are the user-defined costs of the edge deletion and edge insertion operations.

c) The graph edit distance is the node cost plus the edge cost: $\mathrm{EditDistance}(g_i, h_j) = \mathrm{NodeCost} + \mathrm{EdgeCost}$.
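The node cost, edge cost and graph edit distance of Step 2.1 can be sketched in Python as follows. The sketch assumes, for simplicity, that each graph is given as a set of node names and a set of directed edges over those names, so that matching reduces to set membership; how nodes are matched when the same system call appears more than once is not specified here and is left out. The function names and default costs of 1 are illustrative.

```python
def node_cost(nodes_g, nodes_h, c=1.0):
    """Cost c for every node of g missing in h (delete) and of h missing in g (insert)."""
    deleted = nodes_g - nodes_h
    inserted = nodes_h - nodes_g
    return c * (len(deleted) + len(inserted))


def edge_cost(edges_g, edges_h, c_del=1.0, c_ins=1.0):
    """(|E_g| - C_E) * c(sigma_RE) + (|E_h| - C_E) * c(sigma_IE)."""
    matched = len(edges_g & edges_h)          # C_E: edges present in both graphs
    return (len(edges_g) - matched) * c_del + (len(edges_h) - matched) * c_ins


def edit_distance(g, h, c=1.0, c_del=1.0, c_ins=1.0):
    """g and h are (node_set, edge_set) pairs; returns NodeCost + EdgeCost."""
    return node_cost(g[0], h[0], c) + edge_cost(g[1], h[1], c_del, c_ins)


g = ({"LoadLibraryA", "GetProcAddress", "VirtualProtect"},
     {("LoadLibraryA", "GetProcAddress"), ("GetProcAddress", "VirtualProtect")})
h = ({"LoadLibraryA", "VirtualProtect"},
     {("LoadLibraryA", "VirtualProtect")})
print(edit_distance(g, h))  # 4.0
```

In this example one node is deleted (cost 1), no edge matches, two edges are deleted and one is added (cost 3), giving a distance of 4.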

Step 2.2: calculate the maximum graph edit distance $\mathrm{MaxEditDistance}(g_i, h_j)$ between $g_i$ and $h_j$.

When the graph edit distance is zero, the two graphs can be considered 100% similar; when it is sufficiently large, the similarity can be considered zero. The graph edit distance therefore needs to be mapped to a graph similarity in the range 0 to 1. The mapping is as follows:

The maximum edit distance for editing the source graph into the target graph is the total cost of deleting all nodes and edges of the source graph and then adding all nodes and edges of the target graph. The graph edit distance equals the maximum edit distance when the source graph and the target graph are completely different. For the pair $(g_i, h_j)$, the maximum graph edit distance is still a node cost plus an edge cost: $\mathrm{MaxEditDistance}(g_i, h_j) = \mathrm{NodeCost}_{max} + \mathrm{EdgeCost}_{max}$. Here $\mathrm{NodeCost}_{max} = c \times (N_g + N_h)$, where $N_g$ and $N_h$ are the numbers of nodes in $g_i$ and $h_j$, respectively; and, setting $C_E = 0$, $\mathrm{EdgeCost}_{max} = |E_g| \times c(\sigma_{RE}) + |E_h| \times c(\sigma_{IE})$.

Step 2.3: calculate the graph similarity $\mathrm{Sim}(g_i, h_j)$ of $g_i$ and $h_j$ from the graph edit distance and the maximum graph edit distance.
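The closed form of the similarity is not reproduced in this text. The Python sketch below covers Steps 2.2 and 2.3 under an explicit assumption: it uses the linear mapping Sim = 1 - EditDistance / MaxEditDistance, chosen only because it yields 1 at distance zero and 0 at the maximum distance, as the mapping described in Step 2.2 requires. The function names and the treatment of two empty graphs are likewise illustrative.

```python
def max_edit_distance(n_g, n_h, e_g, e_h, c=1.0, c_del=1.0, c_ins=1.0):
    """NodeCost_max + EdgeCost_max: delete all of g, then insert all of h.

    n_g, n_h: node counts; e_g, e_h: edge counts of the two graphs.
    """
    node_cost_max = c * (n_g + n_h)
    edge_cost_max = e_g * c_del + e_h * c_ins    # edge cost with C_E = 0
    return node_cost_max + edge_cost_max


def similarity(edit_dist, max_dist):
    """Assumed linear mapping of the edit distance to the range [0, 1]."""
    if max_dist == 0:
        return 1.0   # assumption: two empty graphs are treated as identical
    return 1.0 - edit_dist / max_dist


# A source graph with 3 nodes and 2 edges, a target graph with 2 nodes and 1 edge,
# and an edit distance of 4 between them.
max_dist = max_edit_distance(3, 2, 2, 1)
print(max_dist, similarity(4.0, max_dist))  # 8.0 0.5
```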

step 2.4, for source graph g₁～g_IEach of which calculates a source map g_i(I ═ 1,2, …, I) for each target map h₁～h_JThe similarity between the two is taken as the maximum value and is recorded as SimMax_i。

Step 2.5: Take a weighted average of the maximum similarities SimMax₁ to SimMax_I of all source graphs to obtain the similarity S of the two programs to be analyzed.

Specifically, the number of system calls and the number of edges in each system call control flow graph are used as the weight of that graph, and the program similarity is obtained by weighted calculation. Let the similarity of the i-th control flow graph be SimMax_i, its number of system calls (number of nodes) be B_i, and its number of edges be E_i; the similarity between the two programs is then:

where Σ_i denotes summation over i from 1 to I.
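The weighted-average expression itself is not reproduced in this text. The Python sketch below assumes that the weight of graph i is the sum B_i + E_i, which is one plausible reading of "the number of system calls and the number of edges as the weight"; the exact combination used in the patent may differ. The function name and the fallback value for graphs with no nodes or edges are hypothetical.

```python
def program_similarity(sim_max, node_counts, edge_counts):
    """Weighted average of SimMax_i with assumed weights B_i + E_i."""
    # Assumption: the weight of graph i is its node count plus its edge count.
    weights = [b + e for b, e in zip(node_counts, edge_counts)]
    total = sum(weights)
    if total == 0:
        # No nodes or edges in any source graph: nothing to compare.
        return 0.0
    return sum(s * w for s, w in zip(sim_max, weights)) / total

# Hypothetical example: a large graph with a strong match
# and a small graph with a weak match.
S = program_similarity([0.9, 0.5], node_counts=[6, 2], edge_counts=[6, 2])
```

In the example the weights are 12 and 4, so S comes out at approximately 0.8: the larger graph dominates the result, which is the intended effect of weighting by graph size.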

Step 2.6: To decide whether two malicious code samples are variants of the same family or are homologous, set a threshold X and compute the similarity S between the two samples. If S is greater than X, the two samples are considered variants of the same family or homologous; otherwise they are not.

In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A malicious code homology analysis method based on a system call control flow graph is characterized by comprising the following steps:

step one, constructing a system call control flow graph of a program to be analyzed; the system call control flow graph is a directed weightless graph formed by system call nodes, and the direction of an edge represents the precedence relationship of system call execution; the method specifically comprises the following steps:

when the recursive traversal process is finished, obtaining a system call control flow graph only containing system call nodes;

2. The method of claim 1, wherein step one further prunes the instruction block control flow graph, retaining only basic blocks and control flows containing a "CALL" instruction, prior to performing a recursive traversal.

3. The method of claim 2, wherein the pruning comprises: for a basic block A in the instruction block control flow graph that does not contain a "CALL" instruction, recording its preceding basic blocks and its succeeding basic blocks, connecting each preceding basic block to each succeeding basic block, and then deleting basic block A and its jump branches.

4. The method of claim 1, wherein in step one the nodes are connected according to the relative positions of their instruction blocks in the instruction block control flow graph, as follows:

for a plurality of nodes belonging to the same basic block, connecting the nodes sequentially in execution order; and then, according to the position of the basic block in the instruction block control flow graph, connecting the first node of the resulting chain to the preceding basic block of the basic block, and connecting the last node to the succeeding basic block of the basic block.

5. The method of claim 1, wherein, when an internal function is called, it is further determined whether the is_cross return parameter of the function call is 1; if it is 1, indicating that a path exists between the preceding node and the succeeding node, then after the system call graph of the internal function is inserted into the previous system call graph, the preceding node and the succeeding node of the current internal function are additionally connected; if the is_cross value is 0, only the insert operation is performed.

6. The method of claim 1, wherein step two comprises:

setting the I system call control flow graphs constructed from a first program to be analyzed as source graphs, and the J system call control flow graphs constructed from a second program to be analyzed as target graphs; for each source graph g_i, i = 1, 2, …, I, computing the similarity between g_i and each target graph h₁ to h_J, and recording the maximum value as SimMax_i; then taking a weighted average of the maximum similarities SimMax₁ to SimMax_I of all source graphs to obtain the similarity S of the two programs to be analyzed.

7. The method of claim 6, wherein, to determine whether two programs to be analyzed are variants of the same family or are homologous, a threshold X is set and the similarity between the two programs is computed; if the similarity is greater than X, the two programs are considered variants of the same family or homologous, and otherwise they are not.

8. The method of claim 6, wherein the similarity between source graph g_i and target graph h_j is computed as follows:

step a: computing the graph edit distance EditDistance(g_i, h_j) between source graph g_i and target graph h_j; the graph edit distance is the sum of the node cost and the edge cost;

the edge cost is obtained as follows: find the number C_E of matching edges between source graph g_i and target graph h_j; the edge cost is then EdgeCost = (|E_g| - C_E) × c(σ_RE) + (|E_h| - C_E) × c(σ_IE), where E_g is the set of edges in source graph g_i, E_h is the set of edges in target graph h_j, and the absolute-value bars around E_g and E_h denote the number of edges in each graph; the parameter c(σ_RE) is the set cost value of an edge-deletion operation, and c(σ_IE) is the set cost value of an edge-insertion operation;

step b: computing the maximum graph edit distance between source graph g_i and target graph h_j, MaxEditDistance(g_i, h_j) = NodeCost_max + EdgeCost_max; NodeCost_max is the maximum node cost, equal to the sum of the numbers of nodes of source graph g_i and target graph h_j multiplied by the cost c; EdgeCost_max is the maximum edge cost, EdgeCost_max = |E_g| × c(σ_RE) + |E_h| × c(σ_IE);

9. The method of claim 6, wherein the weighted average over SimMax₁ to SimMax_I that yields the similarity S is computed as follows: