CN109101816B - Malicious code homology analysis method based on system call control flow graph - Google Patents
- Publication number
- CN109101816B (application CN201810912373.4A)
- Authority
- CN
- China
- Prior art keywords
- graph
- system call
- control flow
- node
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
Abstract
The invention discloses a malicious code homology analysis method based on a system call control flow graph. The method first constructs the system call control flow graph of each program to be analyzed; the system call control flow graph is a directed unweighted graph formed by system call nodes, in which the direction of an edge represents the order in which system calls are executed. The system call control flow graphs of different programs are then compared, and the graph similarity is used as the similarity measure for homology analysis. Because the system call control flow graph ignores the details of the software code and records only the system call functions that are called, it reduces the amount of data to be processed and abstracts program behavior to a high degree. Moreover, because only system calls are considered, instruction-level obfuscation is largely avoided, which gives the method resistance to obfuscation.
Description
Technical Field
The invention relates to the technical field of network security, in particular to a malicious code homology analysis method based on a system call control flow graph.
Background
The volume of malicious code has grown sharply, but most newly observed samples are variants of existing malicious code whose core functions have changed little. Most variants use obfuscation techniques to evade antivirus detection.
Homology originally refers, in molecular evolution studies, to the degree of similarity between the nucleotide sequences of two nucleic acid molecules or between the amino acid sequences of two protein molecules. Applied to software, homology analysis compares two pieces of software, from source code to functionality, to determine whether they are the same or similar, and assigns a similarity value describing how similar they are.
Homology analysis of malicious code has mostly relied on feature recognition. Early methods extracted sequence fragments of the malicious code as signatures and identified target code by sequence matching. The sequence fragment may be an opcode sequence obtained by disassembly or a dynamic system call sequence recorded during execution. Variants that use obfuscation techniques often cannot be identified by simple sequence matching. Research methods based on behavioral features were subsequently proposed; they focus on the actual behavior of the target code rather than its code structure in order to resist obfuscation.
Common behavioral feature representations include the instruction block control flow graph and the function call graph. These features change to some extent once obfuscation techniques such as polymorphism and metamorphism are introduced, so their homology analysis capability for malicious code variants is poor.
Disclosure of Invention
In view of this, the invention provides a method for analyzing malicious code homology based on a system call control flow graph, which can improve the homology analysis capability of malicious code variants.
In order to solve the technical problem, the invention is realized as follows:
A malicious code homology analysis method based on a system call control flow graph comprises the following steps:
Step one: construct a system call control flow graph of the program to be analyzed; the system call control flow graph is a directed unweighted graph formed by system call nodes, and the direction of an edge represents the execution order of the system calls.
Step two: compare the system call control flow graphs of different programs to be analyzed, and use the graph similarity as the similarity measure to carry out the homology analysis.
Preferably, step one is:
A function that is not called by any other function is defined as an entry function. The instruction block control flow graphs of all entry functions in the program are traversed recursively. For each instruction block control flow graph, when a "CALL" instruction is encountered, it is determined whether the called function is an internal function or a system call function.
If the called function is a system call function, a node named after that system call is created; the nodes are connected according to the relative positions, in the instruction block control flow graph, of the instruction blocks they belong to, thereby building a system call graph.
If the called function is an internal function, the instruction block control flow graph of the internal function is traversed, a system call graph of the internal function is built in the same way as for system call functions, and that graph is inserted into the system call graph of the calling level at the position where the internal function is called.
When the recursive traversal finishes, a system call control flow graph containing only system call nodes is obtained.
Preferably, before the recursive traversal, step one further prunes the instruction block control flow graph so that only the basic blocks containing a "CALL" instruction, and the control flow between them, are retained.
Preferably, the pruning is: for a basic block A in the instruction block control flow graph that does not contain a "CALL" instruction, record its predecessor basic blocks and its successor basic blocks, connect each predecessor to each successor, and then delete basic block A together with its jump branches;
if basic block A is the function entry or exit, the entry node and exit node are updated before basic block A is deleted.
Preferably, in step one the nodes are connected according to the relative positions of their instruction blocks in the instruction block control flow graph as follows:
nodes belonging to the same basic block are connected in execution order; then, according to the position of the basic block in the instruction block control flow graph, the first node of the chain is connected to the predecessor basic blocks of that basic block, and the last node is connected to its successor basic blocks.
Preferably, when an internal function is called, it is further determined whether the is_cross parameter returned by the called function is 1. If it is 1, a path exists between the preceding node and the following node, so after the system call graph of the internal function is inserted into the calling-level system call graph, the preceding node and the following node of the internal function call are also connected directly. If is_cross is 0, only the insertion is performed.
Preferably, step two includes:
The $I$ system call control flow graphs constructed from the first program to be analyzed are taken as source graphs, and the $J$ system call control flow graphs constructed from the second program to be analyzed are taken as target graphs. For each source graph $g_i$, $i = 1, 2, \ldots, I$, the similarity between $g_i$ and each target graph $h_1, \ldots, h_J$ is computed, and the maximum is recorded as $SimMax_i$. The maxima $SimMax_1, \ldots, SimMax_I$ of all source graphs are then combined by weighted average to obtain the similarity $S$ of the two programs to be analyzed.
Preferably, to decide whether two programs to be analyzed are variants of the same family or are homologous, a threshold $X$ is set and the similarity of the two programs is computed; if the similarity is greater than $X$, the two programs are considered variants of the same family or homologous, and otherwise they are not.
Preferably, the similarity between source graph $g_i$ and target graph $h_j$ is calculated as follows:
Step a: compute the graph edit distance $EditDistance(g_i, h_j)$ between source graph $g_i$ and target graph $h_j$; the graph edit distance is the sum of the node cost and the edge cost.
The node cost is obtained as follows: if a node of source graph $g_i$ is matched to a null node in target graph $h_j$, a cost $c$ is added to the node cost $NodeCost$ as the cost of deleting a node from $g_i$; likewise, if a node of target graph $h_j$ is matched to a null node in source graph $g_i$, a cost $c$ is added to $NodeCost$ as the cost of inserting a node into $g_i$.
The edge cost is obtained as follows: find the number $C_E$ of matching edges in $g_i$ and $h_j$; the edge cost is then $EdgeCost = (|E_g| - C_E) \times c(\sigma_{RE}) + (|E_h| - C_E) \times c(\sigma_{IE})$, where $E_g$ is the set of edges of $g_i$, $E_h$ is the set of edges of $h_j$, $|E_g|$ and $|E_h|$ denote the numbers of edges in the two graphs, and the parameters $c(\sigma_{RE})$ and $c(\sigma_{IE})$ are the defined costs of deleting and adding an edge, respectively.
Step b: compute the maximum graph edit distance between $g_i$ and $h_j$, $MaxEditDistance(g_i, h_j) = NodeCost_{\max} + EdgeCost_{\max}$. $NodeCost_{\max}$ is the maximum node cost, equal to the total number of nodes in $g_i$ and $h_j$ multiplied by the cost $c$; $EdgeCost_{\max}$ is the maximum edge cost, $EdgeCost_{\max} = |E_g| \times c(\sigma_{RE}) + |E_h| \times c(\sigma_{IE})$.
Step c: compute the graph similarity $Sim(g_i, h_j)$ of source graph $g_i$ and target graph $h_j$ as:
Preferably, $SimMax_1, \ldots, SimMax_I$ are combined by weighted average to obtain the similarity $S$ as follows:
where $B_i$ and $E_i$ are the number of nodes and the number of edges in source graph $g_i$, respectively.
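Steps b and c can be sketched in Python as follows. The maximum edit distance follows the definition in step b. The similarity formula itself is not reproduced in this text, so the normalization used here, 1 minus the ratio of the edit distance to the maximum edit distance, is an assumption and not necessarily the patented formula. The function names and default costs are hypothetical.

```python
def max_edit_distance(n_nodes_g, n_nodes_h, n_edges_g, n_edges_h,
                      c_node=1.0, c_del_edge=1.0, c_ins_edge=1.0):
    """MaxEditDistance = NodeCost_max + EdgeCost_max, as defined in step b."""
    node_cost_max = (n_nodes_g + n_nodes_h) * c_node
    edge_cost_max = n_edges_g * c_del_edge + n_edges_h * c_ins_edge
    return node_cost_max + edge_cost_max


def similarity(edit_distance, max_distance):
    """ASSUMPTION: Sim = 1 - EditDistance / MaxEditDistance.

    The text does not show the formula; this is a common normalization.
    """
    if max_distance == 0:
        # Both graphs are empty; treating them as identical is an assumption.
        return 1.0
    return 1.0 - edit_distance / max_distance


# Two graphs with 3 nodes and 2 edges each, all costs equal to 1.
d_max = max_edit_distance(3, 3, 2, 2)
sim = similarity(5.0, d_max)
```

With unit costs the maximum distance is simply the total number of nodes and edges in both graphs, so under this assumed normalization the similarity falls between 0 and 1.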
Beneficial effects:
(1) For the control flow structure of malicious code under static analysis, the invention proposes a control flow graph based on system calls: a directed unweighted graph formed by system call nodes, in which the direction of an edge represents the execution order of the system calls. The system call control flow graph ignores the details of the software code and records only the system call functions that are called, which reduces the amount of data to be processed and abstracts program behavior to a high degree. Moreover, because only system calls are considered, instruction-level obfuscation is largely avoided, which gives the method resistance to obfuscation.
(2) Based on the system call control flow graph, the invention further provides a homology analysis method. After the system call graphs and the graph similarities are obtained, the similarity between samples is computed by weighted average. This scheme improves the homology analysis capability for malicious code variants.
Drawings
FIG. 1 shows the process of obtaining a control flow graph based on system calls;
FIG. 2 is a function call graph;
FIG. 3 is a control flow graph based on instruction blocks;
FIG. 4 shows the pruning of an instruction block control flow graph;
FIG. 5 is the system call control flow graph of the function Sub_1404A4;
FIG. 6 shows the insertion of an internal function's system call graph;
FIG. 7 is a control flow graph based on system calls.
Detailed Description
In homology analysis of malicious code variants, obfuscation techniques such as metamorphism and polymorphism disturb the internal logic of the malicious code, so the instruction block control flow graph extracted by static analysis is of limited value, while dynamic analysis is expensive. In addition, the function call graph captures the call relationships between functions but carries no timing information.
The invention therefore constructs a system call control flow graph: a directed unweighted graph formed by system call nodes, in which the direction of an edge represents the execution order of the system calls, so that the timing relationship between system calls is preserved. Here, a "system call" means that a program running in user space requests from the operating system kernel a service that requires higher privileges. System calls provide the interface between user programs and the operating system. Most operations that interact with the system, such as device I/O or inter-process communication, must run in kernel mode.
The system call control flow graph ignores the details of the software code, preserving only the timing relationship between system calls and recording only the system call functions that are called. This reduces the amount of data to be processed and abstracts program behavior to a high degree. Moreover, because only system calls are considered, instruction-level obfuscation is largely avoided, which gives good resistance to obfuscation. Performing homology analysis of malicious code on the basis of the system call control flow graph can therefore improve the homology analysis capability for malicious code variants.
System calls are a very important kind of software behavior, and a control flow graph based on system calls is more abstract than an instruction block control flow graph or a function call graph, but it is often difficult to obtain. Existing methods build the system call control flow graph mainly from API call sequences obtained by static or dynamic analysis; such graphs are hard to obtain and the results are subject to chance. Note also that the function call graph and the control flow graph abstract the program structure, which implicitly contains the execution path information of the system calls. The invention therefore takes system calls under static analysis as the basic object for extracting and studying the behavioral features of the target code, and provides a method for constructing the system call control flow graph. The method extracts the calling-process information of system calls from the instruction block control flow graph and builds a control flow graph based on system calls to describe the behavioral features of the target program. On top of the system call control flow graph, graph similarity comparison is introduced, and the graph similarity is used as the similarity measure between malicious code samples for homology analysis, which improves the analysis capability.
Based on the above analysis, the basic idea of the malicious code homology analysis method provided by the invention is as follows: construct the system call control flow graph of each program to be analyzed (malicious code), where the system call control flow graph is a directed unweighted graph formed by system call nodes and the direction of an edge represents the execution order of the system calls; then compare the system call control flow graphs of different programs, using the graph similarity as the similarity measure for homology analysis. The invention also provides a scheme for constructing the system call control flow graph from the instruction block control flow graph.
The following is a detailed description of specific embodiments of the invention.
Step one, constructing a system call control flow graph.
The basic process of constructing the system call control flow graph of the program is as follows:
First, the instruction block control flow graph of each function in the program is obtained using existing techniques, and the instruction block control flow graphs of the entry functions in the program are then traversed recursively. Here, an "entry function" is a function that is not called by any other function, which is different from the "program entry". A program has only one program entry, but it may have multiple entry functions; the program entry is itself one of the entry functions. As shown in FIG. 2, the functions in the first row are all entry functions, but only DllEntryPoint (light gray) is the program entry.
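As a minimal sketch of the entry function definition above, the following Python snippet selects the functions that no other function calls. The dictionary representation of the function call graph and the names sub_A and sub_B are hypothetical; ignoring self-calls is an assumption about how recursion should be treated.

```python
def find_entry_functions(call_graph):
    """Return the functions that are not called by any other function.

    call_graph: dict mapping each internal function name to the set of
    internal functions it calls (a hypothetical representation).
    """
    called = set()
    for caller, callees in call_graph.items():
        # Assumption: a function calling itself does not count as being
        # "called by other functions".
        called.update(c for c in callees if c != caller)
    return sorted(f for f in call_graph if f not in called)


call_graph = {
    "DllEntryPoint": {"sub_A"},
    "sub_B": {"sub_A"},
    "sub_A": set(),
}
entries = find_entry_functions(call_graph)
```

Here both DllEntryPoint and sub_B are entry functions, while only the former would be the program entry.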
For each instruction block control flow graph, when a "CALL" instruction is encountered, it is determined whether the called function is an internal function or a system call function:
If the called function is a system call function, a node named after that system call is created; the nodes are connected according to the relative positions, in the instruction block control flow graph, of the instruction blocks they belong to, thereby building a system call graph.
If the called function is an internal function, the instruction block control flow graph of the internal function is traversed by the same method, a system call graph of the internal function is built in the same way as for system call functions, and that graph is inserted into the system call graph of the calling level at the position where the internal function is called.
When the recursive traversal finishes, a system call control flow graph containing only system call nodes is obtained.
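The recursion described above can be illustrated with a deliberately simplified Python sketch. It assumes each function body is a straight-line list of CALL targets with no branching, so the resulting graph is a simple chain; the real method works on basic blocks and branches. The data layout, the names, and the guard that skips recursive calls are all assumptions made for illustration.

```python
def syscall_sequence(func, bodies, syscalls, _stack=()):
    """Flatten `func` into the ordered list of system calls it reaches.

    bodies:   dict, function name -> ordered list of CALL targets (hypothetical)
    syscalls: set of names treated as system call functions (hypothetical)
    """
    if func in _stack:
        # Assumption: a recursive call contributes nothing, to avoid looping.
        return []
    seq = []
    for target in bodies.get(func, []):
        if target in syscalls:
            seq.append(target)  # system call: becomes a node
        else:
            # Internal function: build its sequence and splice it in place.
            seq.extend(syscall_sequence(target, bodies, syscalls,
                                        _stack + (func,)))
    return seq


def to_edges(seq):
    """Connect consecutive system calls in execution order."""
    return list(zip(seq, seq[1:]))


bodies = {
    "entry": ["LoadLibraryA", "sub_A", "VirtualProtect"],
    "sub_A": ["GetProcAddress"],
}
syscalls = {"LoadLibraryA", "GetProcAddress", "VirtualProtect"}
seq = syscall_sequence("entry", bodies, syscalls)
edges = to_edges(seq)
```

The call to sub_A is replaced by the system calls it contains, so the result holds only system call nodes.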
The flow of the method for constructing a control flow graph based on system calls is shown in FIG. 1. The sample is first preprocessed (disassembly, function boundary delimitation, extraction of function call relationships, and basic block division); the resulting instruction-sequence-based control flow graph is then pruned so that only the basic blocks containing call instructions, and the control flow between them, are retained in each function. Then, starting from each entry function, the "CALL" instructions are traversed recursively downward to construct the control flow graph based on system calls.
Step 1.1: preprocess the sample to obtain the instruction block control flow graph of each function in the program.
First, the sample program is disassembled to obtain its disassembled code. On this basis, the instruction-block-based control flow graph and the function call graph of the sample are obtained.
When binary code is disassembled, signatures or other techniques are used to recover the function structures in the code. These resemble the functions defined in the source program: each is a distinct code sequence with an independent, complete structure. The disassembled code produced by IDA Pro already has function boundaries, with each function starting with proc and ending with endp, so the disassembled assembly code is divided into a number of functions. A CALL instruction in a function indicates a call to an internal function or a system call that can return; traversing the CALL instructions of an internal function therefore reveals how many functions it calls. The function call graph derived from the assembly code is shown in FIG. 2, where dark gray marks system calls, black marks internal functions, and light gray marks the program entry.
Inside a function, the instruction sequence is divided by jump statements, and the instruction sequence between two jump statements serves as a basic unit of control flow (also called a basic block). Transfers between basic blocks appear in the code as conditional branches and jumps. The instruction-block-based control flow graph formed by the basic blocks and the control flow between them represents the execution flow of the function in a simplified form.
FIG. 3 shows an instruction-block-based control flow graph containing three basic blocks and three edges. The jump branches show that it has two execution paths. The second basic block contains three CALL instructions (circled in the figure).
Step 1.2: pruning
To construct a system call control flow graph that contains only system calls, it suffices, viewed from the other direction, to remove the non-call instructions from the instruction block control flow graph. Therefore, after the instruction block control flow graph is obtained, the basic blocks that contain no call instruction are deleted (pruned); the recursive traversal of step 1.3 then discards the non-call instructions inside the basic blocks that do contain call instructions. The basic operation of pruning is to delete a basic block that contains no "CALL" instruction while preserving the execution paths through it.
If a function Func has a basic block A without a call instruction, the predecessor basic blocks of A (the basic blocks whose jump branches point to A) and the successor basic blocks of A (the basic blocks that A's jump branches point to) are collected and stored, the predecessors of A are cross-connected with the successors of A (that is, each predecessor is connected to each successor), and basic block A and its jump branches are then deleted. The pruning process is shown in FIG. 4.
If basic block A is a function entry or exit, the function's entry node (which becomes the successor basic blocks of A) or exit node (which becomes the predecessor basic blocks of A) must be updated before A is deleted. If A is both an entry node and an exit node, the function Func has an execution path that calls neither an API nor an internal function; the parameter is_cross, set to 1 or 0, records whether the function has such a path.
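The pruning step can be sketched in Python as below. The successor-set representation, the function name prune, and the exact way is_cross is derived (set to 1 when a deleted block is both an entry and an exit at the moment of deletion) are assumptions chosen to match the description, not the patent's own implementation.

```python
def prune(succ, has_call, entries, exits):
    """Delete call-free basic blocks while preserving execution paths.

    succ:     dict, block -> set of successor blocks
    has_call: dict, block -> True if the block contains a CALL instruction
    entries, exits: sets of entry and exit blocks of the function
    Returns (succ, entries, exits, is_cross).
    """
    succ = {b: set(s) for b, s in succ.items()}
    entries, exits = set(entries), set(exits)
    is_cross = 0
    for a in [b for b in succ if not has_call[b]]:
        preds = {p for p, s in succ.items() if a in s and p != a}
        nexts = succ[a] - {a}
        if a in entries and a in exits:
            is_cross = 1  # a call-free path runs through the whole function
        if a in entries:
            entries = (entries - {a}) | nexts  # successors become entries
        if a in exits:
            exits = (exits - {a}) | preds      # predecessors become exits
        for p in preds:
            # Cross-connect every predecessor with every successor of A.
            succ[p] = (succ[p] - {a}) | nexts
        del succ[a]
    return succ, entries, exits, is_cross


# P -> A -> S, where A contains no CALL instruction.
pruned = prune({"P": {"A"}, "A": {"S"}, "S": set()},
               {"P": True, "A": False, "S": True}, {"P"}, {"S"})
```

After pruning, P is connected directly to S, so the path through the deleted block A is preserved.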
Step 1.3: constructing a system call control flow graph through recursive traversal
Step 1.2 described the pruning process. After pruning, the control flow graph contains only basic blocks that include call instructions. The system call control flow graph is then constructed: while the instruction block control flow graph is traversed, non-call instructions are ignored and only call instructions are kept as nodes of the system call control flow graph.
For the instruction block control flow graph of each entry function, the instructions of each basic block are traversed one by one. When a "CALL" instruction is encountered and the called function is a system call function, a node named after that system call is created. Nodes belonging to the same basic block are connected in execution order; as shown in FIG. 5, the three system calls in the same basic block of FIG. 3 become three nodes connected in sequence. Then, according to the position of the basic block in the instruction block control flow graph, the first node of the chain is connected to the predecessor basic blocks of that basic block, and the last node is connected to its successor basic blocks.
Each basic block yields one directed graph composed of system call nodes. After pruning, the control flow graph of the function Sub_4014a4 shown in FIG. 3 retains one basic block, which contains three system call functions called in sequence. Connecting them in execution order gives the system call control flow graph shown in FIG. 5, which has one start node and one end node. If there are several basic blocks, their graphs are connected to one another according to the relative positions of the basic blocks in the control flow graph.
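A minimal Python sketch of the connection rule follows: calls inside a basic block are chained in execution order, and the last node of each block is linked to the first node of each successor block. It assumes pruning has already removed call-free blocks, so every block holds at least one call. The node identifiers (block, index), the block names, and the ExitProcess call in the second block are hypothetical.

```python
def build_syscall_graph(blocks, succ):
    """Chain calls within each block, then link blocks along control flow.

    blocks: dict, block id -> non-empty ordered list of system call names
    succ:   dict, block id -> iterable of successor block ids
    Returns (labels, edges): labels maps node id (block, index) to the
    call name; edges is a set of (source node, destination node) pairs.
    """
    labels, edges = {}, set()
    for b, calls in blocks.items():
        for i, name in enumerate(calls):
            labels[(b, i)] = name
            if i > 0:
                edges.add(((b, i - 1), (b, i)))  # execution order in block
    for b, nexts in succ.items():
        last = (b, len(blocks[b]) - 1)
        for n in nexts:
            edges.add((last, (n, 0)))  # tail of b -> head of successor
    return labels, edges


blocks = {
    "B1": ["LoadLibraryA", "GetProcAddress", "VirtualProtect"],
    "B2": ["ExitProcess"],
}
labels, edges = build_syscall_graph(blocks, {"B1": ["B2"], "B2": []})
```

Using (block, index) pairs as node identifiers keeps repeated calls to the same system function as distinct nodes.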
Among the call instructions of a basic block, some may call internal functions. When a call to an internal function is encountered, the system call control flow graph of the internal function is obtained by recursive traversal and inserted into the system call control flow graph of the calling level; that is, its start node and end node are connected to the nodes immediately before and after the internal call, respectively. For example, if GetProcAddress in FIG. 5 is replaced by a call to the internal function Sub_A, the system call graph of Sub_A is constructed and connected as shown in FIG. 6.
When an internal function is called, it is further determined whether the returned is_cross parameter is 1. If so, a path exists between the preceding node and the following node, and in addition to the insertion step above, the preceding node and the following node of the internal function call are connected directly; for example, if is_cross is 1 in FIG. 6, LoadLibraryA and VirtualProtect are connected to form a path. If is_cross is 0, only the insertion described above is performed.
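The insertion step, including the extra edge added when is_cross is 1, might look like the following Python sketch. The function name insert_subgraph, the edge-set representation, and the system calls placed inside Sub_A (CreateFileA and ReadFile) are hypothetical and only illustrate the wiring.

```python
def insert_subgraph(edges, prev_node, next_node,
                    sub_edges, sub_entries, sub_exits, is_cross):
    """Splice an internal function's system call graph between two nodes.

    edges:        edges of the calling-level system call graph
    prev_node:    node immediately before the internal call
    next_node:    node immediately after the internal call
    sub_edges:    edges of the internal function's system call graph
    sub_entries:  start nodes of that graph; sub_exits: its end nodes
    is_cross:     1 if the internal function has a call-free path
    """
    result = set(edges) | set(sub_edges)
    for e in sub_entries:
        result.add((prev_node, e))   # preceding node -> start nodes
    for x in sub_exits:
        result.add((x, next_node))   # end nodes -> following node
    if is_cross == 1:
        # The internal function can be crossed without any system call,
        # so the preceding and following nodes are also linked directly.
        result.add((prev_node, next_node))
    return result


sub = {("CreateFileA", "ReadFile")}
with_cross = insert_subgraph(set(), "LoadLibraryA", "VirtualProtect",
                             sub, ["CreateFileA"], ["ReadFile"], 1)
without_cross = insert_subgraph(set(), "LoadLibraryA", "VirtualProtect",
                                sub, ["CreateFileA"], ["ReadFile"], 0)
```

The two results differ only by the direct edge from LoadLibraryA to VirtualProtect.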
When the instructions of a function's basic blocks have been traversed, a start node list Entry_Node and a return node list Return_Node are set for the function and stored together with the function's system call control flow graph. The start (return) nodes can be obtained with a suitable data structure: first traverse the starting (ending) basic blocks of the control flow graph, which are not necessarily unique, then take the first (last) system call node of each such basic block and add it to the corresponding node list.
When the system call graphs of all internal functions have been connected according to their relative positions in the instruction block control flow graph, the system call control flow graph of the whole program is obtained. FIG. 7 shows a system call control flow graph obtained by recursion from an entry function in a malicious code sample; the timing relationships and execution paths of the function's system calls can be observed in it.
Because the entry function is not unique, the resulting system call control flow graph is not unique either: processing a sample with several entry functions yields a set of system call control flow graphs. Many special cases must also be handled in an implementation. For example, when the system call graph of an internally called function is being built, a basic block containing a call may be found even though the function involves no system call; in that case an empty graph is returned.
Step two, homology analysis based on system call control flow graph
The invention provides a homology analysis algorithm based on the system call control flow graph and graph similarity comparison. The algorithm is suitable not only for detecting malicious code variants but also for homology analysis of other binary programs. The algorithm and the system call control flow graph also provide a reference for homology analysis research on other software.
The $I$ system call control flow graphs constructed from the first program to be analyzed are taken as source graphs, and the $J$ system call control flow graphs constructed from the second program to be analyzed are taken as target graphs. For each source graph $g_i$, $i = 1, 2, \ldots, I$, the similarity between $g_i$ and each target graph $h_1, \ldots, h_J$ is computed, and the maximum is recorded as $SimMax_i$. The maxima $SimMax_1, \ldots, SimMax_I$ of all source graphs are then combined by weighted average to obtain the similarity $S$ of the two programs to be analyzed.
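The best-match and weighted-average steps can be sketched in Python as follows. The weighting formula is not reproduced in this text; the sketch assumes that each source graph is weighted by its node count plus its edge count, which is a guess based on the stated role of those two quantities and may differ from the patented formula. The function name and input layout are hypothetical.

```python
def program_similarity(sim_matrix, sizes):
    """Combine per-graph similarities into one program-level similarity S.

    sim_matrix: sim_matrix[i][j] = Sim(g_i, h_j) for source graph g_i and
                target graph h_j
    sizes:      sizes[i] = (number of nodes, number of edges) of g_i
    ASSUMPTION: the weight of g_i is its node count plus its edge count.
    """
    sim_max = [max(row) for row in sim_matrix]   # SimMax_i per source graph
    weights = [nodes + edges for nodes, edges in sizes]
    total = sum(weights)
    if total == 0:
        return 0.0  # all source graphs are empty (assumed convention)
    return sum(w * s for w, s in zip(weights, sim_max)) / total


# Two source graphs compared against two target graphs.
S = program_similarity([[0.5, 1.0], [0.25, 0.5]], [(3, 2), (2, 1)])
```

Under this weighting, larger source graphs contribute more to the final similarity than small ones.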
This process is described in detail below.
Step 2.1: Calculate the graph edit distance EditDistance(g_i, h_j) between source graph g_i and target graph h_j using an improved graph edit distance algorithm. The graph edit distance is the sum of the node cost and the edge cost.
Graph edit distance methods apply to many types of graphs, such as directed graphs, undirected graphs, attributed graphs and non-attributed graphs. The system call control flow graph used here is a directed unweighted graph whose node semantics are fixed. The original graph edit distance algorithm targets weighted directed graphs, so it is adjusted slightly here to suit the system call control flow graph (a directed unweighted graph). The adjusted algorithm comprises the following steps:
a) Node cost: if a node in source graph g_i is matched to a null node in target graph h_j, a cost of 1 is added to NodeCost as the cost of deleting that node from g_i. Similarly, if a node in target graph h_j is matched to a null node in g_i, a cost of 1 is added to NodeCost as the cost of inserting that node into g_i. The resulting NodeCost is the node cost of converting g_i into h_j, i.e., the difference between the nodes of the two graphs.
b) Edge cost: find the number C_E of matching edges between source graph g_i and target graph h_j. For any edge (x, y) ∈ E_g, if (x, y) ∈ E_h, then the edge (x, y) is matched and C_E is incremented by 1. Here E_g is the set of edges of source graph g_i and E_h is the set of edges of target graph h_j.
The edge cost is EdgeCost = (|E_g| - C_E) × c(σ_RE) + (|E_h| - C_E) × c(σ_IE), where the two terms are the total costs of deleting edges from and adding edges to source graph g_i, respectively; |E_g| and |E_h| denote the numbers of edges in the two graphs; and the parameters c(σ_RE) and c(σ_IE) are the user-definable costs of deleting and adding an edge, respectively.
c) The graph edit distance is the node cost plus the edge cost: EditDistance(g_i, h_j) = NodeCost + EdgeCost.
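Steps a) to c) can be illustrated with a short Python sketch. It rests on simplifying assumptions that the text does not state: nodes are identified by their system call name, two nodes match when their names are equal, and each graph is stored as a set of names plus a set of directed name pairs. The type `SysCallGraph`, the function names and the example call names are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Tuple


class SysCallGraph(NamedTuple):
    nodes: FrozenSet[str]               # system call names
    edges: FrozenSet[Tuple[str, str]]   # directed edges (from, to)


def node_cost(g: SysCallGraph, h: SysCallGraph, c: int = 1) -> int:
    # Nodes of g with no counterpart in h are deleted;
    # nodes of h with no counterpart in g are inserted.
    deleted = len(g.nodes - h.nodes)
    inserted = len(h.nodes - g.nodes)
    return c * (deleted + inserted)


def edge_cost(g: SysCallGraph, h: SysCallGraph,
              c_del: int = 1, c_ins: int = 1) -> int:
    matched = len(g.edges & h.edges)  # C_E: edges present in both graphs
    return (len(g.edges) - matched) * c_del + (len(h.edges) - matched) * c_ins


def edit_distance(g: SysCallGraph, h: SysCallGraph,
                  c: int = 1, c_del: int = 1, c_ins: int = 1) -> int:
    return node_cost(g, h, c) + edge_cost(g, h, c_del, c_ins)


g = SysCallGraph(frozenset({"open", "read", "close"}),
                 frozenset({("open", "read"), ("read", "close")}))
h = SysCallGraph(frozenset({"open", "write", "close"}),
                 frozenset({("open", "write"), ("write", "close")}))
print(edit_distance(g, h))  # 2 node edits + 4 edge edits = 6
```

A graph in which the same system call appears at several nodes would need a multiset or an explicit node matching instead of plain sets.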
Step 2.2: computing source graph giAnd target graph hjMaximum graph edit distance MaxEditDistance (g) therebetweeni,hj)。
When the graph edit distance is zero, the similarity of the two graphs can be considered to be 100%, and when the graph edit distance is sufficiently large, the graph similarity can be considered to be zero. For this we need to map the graph edit distance to a graph similarity of value range 0 to 1. The mapping process is described as follows:
the source graph is edited into the target graph, and the maximum edit distance is obtained, namely, the maximum cost sum is obtained after all points and edges of the source graph are deleted, the points and edges of the target graph are re-added, and the like. The graph edit distance is equal to the maximum edit distance when the source graph and the target graph are completely different. Calculation graph (g)i,hj) Maximum graph edit distance MaxEditdistance (g)i,hj). The node cost is still adopted as the maximum node cost plus the maximum edge costmax+EdgeCostmaxAt this time, NodeCostmax=c×(Ng+Nh),NgAnd NhRespectively show diagram giAnd hjThe number of nodes in;let CEWhen equal to 0, EdgeCostmax=(|Eg|)×c(σRE)+(|Eh|)×c(σIE)。
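The maximum edit distance depends only on the sizes of the two graphs, which the following sketch makes explicit. The function name and parameter names are illustrative; the default costs of 1 are an assumption for the example.

```python
def max_edit_distance(n_g: int, n_h: int, e_g: int, e_h: int,
                      c: int = 1, c_del: int = 1, c_ins: int = 1) -> int:
    """Worst-case cost: delete everything in g, then insert everything in h.

    n_g, n_h: node counts N_g and N_h; e_g, e_h: edge counts |E_g| and |E_h|.
    """
    node_cost_max = c * (n_g + n_h)
    edge_cost_max = e_g * c_del + e_h * c_ins  # edge cost with C_E = 0
    return node_cost_max + edge_cost_max


# Two graphs with 3 nodes and 2 edges each, unit costs.
print(max_edit_distance(3, 3, 2, 2))  # 6 + 4 = 10
```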
Step 2.3: Calculate the graph similarity Sim(g_i, h_j) of graph g_i and graph h_j from EditDistance(g_i, h_j) and MaxEditDistance(g_i, h_j): [formula not reproduced in the source text]
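The similarity formula itself does not survive in this text. The simplest mapping consistent with the endpoints stated in step 2.2 (similarity 1 at edit distance zero, similarity 0 at the maximum edit distance) is a linear normalization; the expression below is offered as an assumption in that spirit, not as the patent's confirmed formula.

```latex
\[
\mathrm{Sim}(g_i, h_j) \;=\; 1 - \frac{\mathrm{EditDistance}(g_i, h_j)}{\mathrm{MaxEditDistance}(g_i, h_j)}
\]
```

Under this assumption, two graphs with an edit distance of 6 and a maximum edit distance of 10 would have a similarity of 0.4.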
Step 2.4: For each source graph g_i (i = 1, 2, …, I), compute the similarity between g_i and each of the target graphs h_1 to h_J, and record the maximum as SimMax_i.
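Step 2.4 is a best-match search per source graph. The sketch below takes the pairwise similarity as a caller-supplied function so that it stays independent of how Sim is defined; the function name `sim_max_per_source`, the fallback of 0.0 for an empty target set, and the Jaccard stand-in used in the example are all assumptions for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

G = TypeVar("G")


def sim_max_per_source(sources: Sequence[G], targets: Sequence[G],
                       sim: Callable[[G, G], float]) -> List[float]:
    """Return SimMax_i for every source graph g_i.

    Assumption: if there are no target graphs, SimMax_i defaults to 0.0.
    """
    return [max((sim(g, h) for h in targets), default=0.0) for g in sources]


def jaccard(a: frozenset, b: frozenset) -> float:
    # Stand-in similarity for the example only; not the patent's Sim.
    return len(a & b) / len(a | b) if (a or b) else 1.0


sources = [frozenset({"a", "b"}), frozenset({"c"})]
targets = [frozenset({"a", "b"}), frozenset({"a"})]
print(sim_max_per_source(sources, targets, jaccard))  # [1.0, 0.0]
```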
Step 2.5: Take a weighted average of the maximum similarities SimMax_1 to SimMax_I of all source graphs to obtain the similarity S of the two programs to be analyzed.
Specifically, the number of system calls and the number of edges contained in each system call control flow graph are used as the weight of that graph, and the program similarity is obtained by weighted calculation. Let the similarity of the i-th control flow graph be SimMax_i, its number of system calls (number of nodes) be B_i, and its number of edges be E_i. The similarity between the two programs is then: [formula not reproduced in the source text]
where Σ_i denotes summation over i from 1 to I.
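The weighted-average formula is not reproduced in this text. A plausible reading, given that the node count B_i and the edge count E_i together serve as the weight of graph i, is S = Σ_i SimMax_i × (B_i + E_i) / Σ_i (B_i + E_i). The sketch below implements that reading; the combination B_i + E_i, the function name and the zero-weight fallback are assumptions.

```python
from typing import Sequence


def program_similarity(sim_max: Sequence[float],
                       node_counts: Sequence[int],
                       edge_counts: Sequence[int]) -> float:
    """Weighted average of SimMax_i with assumed weights w_i = B_i + E_i."""
    weights = [b + e for b, e in zip(node_counts, edge_counts)]
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: no nodes or edges means nothing to compare
    return sum(s * w for s, w in zip(sim_max, weights)) / total


# A large graph matched perfectly and a small graph matched halfway.
s = program_similarity([1.0, 0.5], node_counts=[3, 1], edge_counts=[2, 0])
print(round(s, 4))  # (1.0*5 + 0.5*1) / 6
```

Weighting by graph size means a large, well-matched graph dominates the result, while a small graph with a poor match has little influence on S.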
Step 2.6: To decide whether two malicious code samples are variants of the same family or are homologous, set a threshold X and calculate the similarity S between the two samples. If S is greater than X, the two samples are considered to be variants of the same family or homologous; otherwise they are not.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A malicious code homology analysis method based on a system call control flow graph is characterized by comprising the following steps:
step one, constructing a system call control flow graph of a program to be analyzed; the system call control flow graph is a directed unweighted graph formed by system call nodes, and the direction of an edge represents the order in which system calls are executed; the method specifically comprises the following steps:
defining a function which is not called by any other function as an entry function; recursively traversing the instruction block control flow graphs of all entry functions in the program; for each instruction block control flow graph, when a "CALL" instruction is encountered, judging whether the called function is an internal function or a system call function;
if the called function is a system call function, creating a node named after the system call; connecting the nodes according to the relative positions, in the instruction block control flow graph, of the instruction blocks to which they belong, so as to establish a system call graph;
if the called function is an internal function, traversing the instruction block control flow graph of the internal function, establishing a system call graph of the internal function in the same manner as for system call functions, and inserting the system call graph of the internal function into the upper-level system call graph at the call position of the internal function;
when the recursive traversal process is finished, obtaining a system call control flow graph only containing system call nodes;
and step two, comparing the system call control flow graphs of different programs to be analyzed, and taking the graph similarity as the similarity measurement of homology analysis to realize the homology analysis.
2. The method of claim 1, wherein step one further prunes the instruction block control flow graph, retaining only basic blocks and control flows containing a "CALL" instruction, prior to performing a recursive traversal.
3. The method of claim 2, wherein the pruning is: for a basic block A in the instruction block control flow graph that does not contain a "CALL" instruction, recording the predecessor basic blocks and the successor basic blocks of basic block A, connecting each predecessor basic block to each successor basic block, and then deleting basic block A and its jump branches;
if basic block A is a function entry or exit, the entry node and the exit node are updated before basic block A is deleted.
4. The method of claim 1, wherein in step one the nodes are connected according to the relative positions of their instruction blocks in the instruction block control flow graph as follows:
nodes belonging to the same basic block are connected in sequence in execution order; then, according to the position of the basic block in the instruction block control flow graph, the first node of the resulting chain is connected to the predecessor basic block of the basic block, and the last node is connected to the successor basic block of the basic block.
5. The method of claim 1, wherein, when an internal function is called, it is further determined whether the return parameter is_cross of the called function is 1; if is_cross is 1, a path exists between the preceding node and the following node, and after the system call graph of the internal function is inserted into the upper-level system call graph, the preceding node and the following node of the current internal function are additionally connected; if is_cross is 0, only the insertion is performed.
6. The method of claim 1, wherein step two comprises:
setting the I system call control flow graphs constructed from a first program to be analyzed as source graphs, and setting the J system call control flow graphs constructed from a second program to be analyzed as target graphs; for each source graph g_i, i = 1, 2, …, I, computing the similarity between g_i and each of the target graphs h_1 to h_J, and recording the maximum as SimMax_i; then taking a weighted average of the maximum similarities SimMax_1 to SimMax_I of all source graphs to obtain the similarity S of the two programs to be analyzed.
7. The method of claim 6, wherein, when it is necessary to analyze whether two programs to be analyzed are variants of the same family or are homologous, a threshold X is set and the similarity between the two programs is calculated; if the similarity is greater than X, the two programs are considered to be variants of the same family or homologous; otherwise they are not.
8. The method of claim 6, wherein the similarity between source graph g_i and target graph h_j is calculated in the following way:
step a: computing the graph edit distance EditDistance(g_i, h_j) between source graph g_i and target graph h_j; the graph edit distance is the sum of the node cost and the edge cost;
the node cost is obtained as follows: if a node of source graph g_i is matched to a null node of target graph h_j, a cost c is added to the node cost NodeCost as the cost of deleting the node from g_i; likewise, if a node of target graph h_j is matched to a null node of source graph g_i, a cost c is added to NodeCost as the cost of inserting the node into g_i;
the edge cost is obtained as follows: finding the number C_E of matching edges between source graph g_i and target graph h_j; the edge cost is EdgeCost = (|E_g| - C_E) × c(σ_RE) + (|E_h| - C_E) × c(σ_IE), wherein E_g is the set of edges of source graph g_i, E_h is the set of edges of target graph h_j, and |E_g| and |E_h| denote the numbers of edges in the graphs; the parameter c(σ_RE) represents the set cost of deleting an edge, and c(σ_IE) represents the set cost of adding an edge;
step b: computing the maximum graph edit distance between source graph g_i and target graph h_j, MaxEditDistance(g_i, h_j) = NodeCost_max + EdgeCost_max; NodeCost_max is the maximum node cost, equal to the sum of the numbers of nodes of g_i and h_j multiplied by the cost c; EdgeCost_max is the maximum edge cost, EdgeCost_max = |E_g| × c(σ_RE) + |E_h| × c(σ_IE);
step c: calculating the graph similarity Sim(g_i, h_j) of source graph g_i and target graph h_j as: [formula not reproduced in the source text]
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810912373.4A CN109101816B (en) | 2018-08-10 | 2018-08-10 | Malicious code homology analysis method based on system call control flow graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109101816A (en) | 2018-12-28 |
CN109101816B (en) | 2022-02-08 |
Family
ID=64849447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810912373.4A Expired - Fee Related CN109101816B (en) | 2018-08-10 | 2018-08-10 | Malicious code homology analysis method based on system call control flow graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109101816B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182568B (en) * | 2019-07-02 | 2022-09-27 | 四川大学 | Malicious code classification based on graph convolution network and topic model |
CN110554868B (en) * | 2019-09-11 | 2020-07-31 | 北京航空航天大学 | Software multiplexing code detection method and system |
CN111538989B (en) * | 2020-04-22 | 2022-08-26 | 四川大学 | Malicious code homology analysis method based on graph convolution network and topic model |
CN111832020B (en) * | 2020-06-22 | 2024-03-19 | 华中科技大学 | Android application maliciousness and malicious race detection model construction method and application |
CN112379922B (en) * | 2020-11-24 | 2022-07-05 | 中国科学院信息工程研究所 | Program comparison method and system |
CN112948828A (en) * | 2021-01-25 | 2021-06-11 | 厦门服云信息科技有限公司 | Binary program malicious code detection method, terminal device and storage medium |
CN113254068B (en) * | 2021-07-14 | 2021-10-22 | 苏州浪潮智能科技有限公司 | Control flow planarization automatic detection method and device |
WO2024191397A1 (en) * | 2023-03-13 | 2024-09-19 | Hacettepe Universitesi | A system for locating malicious codes |
CN117251171B (en) * | 2023-11-20 | 2024-04-12 | 常熟理工学院 | Predicate basic block detection method and equipment in control flow graph |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107169358A (en) * | 2017-05-24 | 2017-09-15 | 中国人民解放军信息工程大学 | Code homology detection method and its device based on code fingerprint |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359352B (en) * | 2008-09-25 | 2010-08-25 | 中国人民解放军信息工程大学 | API use action discovering and malice deciding method after confusion of multi-tier synergism |
CN104834837B (en) * | 2015-04-03 | 2017-10-31 | 西北大学 | A kind of antialiasing method of binary code based on semanteme |
US20160357965A1 (en) * | 2015-06-04 | 2016-12-08 | Ut Battelle, Llc | Automatic clustering of malware variants based on structured control flow |
Non-Patent Citations (1)
Title |
---|
"Malware classification based on call graph clustering"; Joris Kinable et al.; Journal in Computer Virology and Hacking Techniques; 2011-02-03; No. 7, 2011, pp. 233-245 * |
Also Published As
Publication number | Publication date |
---|---|
CN109101816A (en) | 2018-12-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220208 |