CN107169358A - Code homology detection method and its device based on code fingerprint

Info

Publication number: CN107169358A
Application number: CN201710375425.4A
Authority: CN
Inventors: 魏强; 刘臻; 曹琰; 尹中旭; 彭建山
Original assignee: Shanghai Red Neurons Co Ltd; PLA Information Engineering University
Current assignee: Shanghai Red Neurons Co Ltd; PLA Information Engineering University
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2017-09-15
Anticipated expiration: 2037-05-24
Also published as: CN107169358B

Abstract

The present invention relates to a code homology detection method and device based on code fingerprints. The method comprises: performing dependency analysis on the input code to obtain an original program dependency graph (PDG); performing structure simplification, nesting removal and coloring on the original PDG to obtain a simplified program dependency graph (sPDG); parsing key syntax information of the code based on an abstract syntax tree; extracting the system call sequences of the code execution paths, obtaining the full-path parameter vector set of the target code, and constructing a code fingerprint; calculating the homology coefficients between code fingerprint components; and calculating the homology index of two codes S and T from the homology coefficients, and judging the homology relationship between the two codes by the homology index. On the basis of similarity, the invention takes both code semantics and behavior into account, uses lightweight features and a simplification mechanism to improve detection efficiency, and measures the homology relationship between codes from multiple angles, so that detection efficiency is improved while accuracy is preserved.

Description

Code homology detection method and device based on code fingerprints

Technical Field

The invention belongs to the technical field of computer software application, and particularly relates to a code homology detection method and a code homology detection device based on code fingerprints.

Background

As the demand for internet applications grows and code iteration accelerates, higher requirements are placed on the efficiency and speed of software development. In the software development pipeline, template-based secondary development and reuse of existing components are common; meanwhile, to meet new requirements, developers often refer to code in open source repositories on the internet. As a result, the amount of homologous code spread through different channels keeps growing, and hidden defects and errors in that code spread with it. At the same time, with the continuous development of computer security technology and the continuous improvement of virus detection technology, malicious code on the internet, such as macro viruses, malicious VBS scripts and malicious JavaScript scripts, is increasingly likely to be detected, so attackers modify the content or transform the form of the original code to bypass detection and improve the survivability of the malicious code. Intrinsic homology exists among the versions of the same malicious code, and this homology is an important basis for detecting it.

As an important aspect of computer program research, current homology detection techniques for software source code fall mainly into three types: text-based, structural-analysis-based and semantics-based software homology detection.

(1) Text-based software homology detection takes the text of the source code as the detected object, for example detecting code similarity from text similarity and text attributes. One benefit of analyzing program source code as text is that the method is not tied to the programming language of the analyzed object; however, because it ignores the syntactic characteristics of the code, such a method is generally weak against code obfuscation. Simple obfuscation means, such as renaming variables and functions, inserting junk code, or reordering statements without affecting functionality, can greatly degrade the detection result. This technique can therefore perform only simple homology detection at the text level and is quite limited.

(2) Software homology detection based on structural analysis analyzes the code structure and expresses it in a comparable intermediate form; token-based, tree-based and graph-based detection methods are common. Compared with text-based detection, this technique detects better and has some resistance to common obfuscation means. However, its computational complexity depends on the intermediate representation, and complex structures bring a large performance overhead during detection.

(3) Semantics-based software homology detection either extracts features such as control flow, data flow and standard API flow on the basis of static semantic analysis and depicts program behavior from different angles, or compiles and executes the source code and records the program instruction stream and system call sequence to describe program behavior. This technique essentially describes the semantic and behavioral characteristics of a program and can effectively cope with the challenges that various kinds of code obfuscation pose to homology detection. However, semantics-based methods cannot effectively cover the characteristics of the code itself, and accurate semantic analysis is difficult to carry out.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a code homology detection method and device based on code fingerprints. It addresses the poor resistance to obfuscation and the low detection efficiency of existing software source code detection, extracts code features accurately, copes effectively with common code obfuscation methods, improves both the efficiency and the accuracy of homology detection, and helps prevent the spread of malicious code.

According to the design scheme provided by the invention, the code homology detection method based on the code fingerprint comprises the following steps:

step 1, analyzing the dependency relationship of two input codes S and T to obtain an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;

step 2, analyzing key syntax information of the code based on the abstract syntax tree;

step 3, extracting a system calling sequence of the code execution path, acquiring a full path parameter vector set of the target code, and constructing a code fingerprint;

step 4, calculating the homology coefficients between code fingerprint components, wherein the homology coefficients comprise the sPDG isomorphism coefficient $P_{S,T}$, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;

step 5, calculating the homology index of the two codes S and T according to the homology coefficients, and judging the homology relationship between the two codes according to the homology index.

As described above, in step 1, the original program dependency graph PDG is subjected to structure simplification, nesting removal, and coloring processing, and the simplified program dependency graph sPDG is obtained, which includes the following contents:

step 11, simplifying the structure of the original program dependence graph PDG according to a simplification principle;

step 12, removing the nested input nodes and output nodes from the nodes containing the nested relation, and removing the edges of the corresponding dependency relation to the outer layer function call nodes;

and step 13, classifying and coloring the nodes according to the statement types, and acquiring the simplified program dependence graph sPDG.

The simplification principle in step 11 includes: removing vertices that have exactly one outgoing edge and no incoming edge; removing vertices that have exactly one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges.

As described above, parsing the key syntax information of the code based on the abstract syntax tree in step 2 includes the following contents:

step 21, recording global variables, local variables and attributes thereof in the designated code domain to form a quadruple, wherein the quadruple comprises a scope of the variables, link attributes, storage types and names;

step 22, analyzing and recording the macro definition and the corresponding content thereof to form a triple, wherein the triple comprises a macro definition identifier, a content type and a name;

and step 23, analyzing a key data structure in the code based on the abstract syntax tree AST, and recording a custom structure body in the code in a sequence form.
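As an illustration of steps 21 to 23, the records could be held in small immutable types so that they can later be compared as sets. This is only a sketch: the class names, the field names and the choice to represent a structure as a tuple of member types are assumptions made for the example and are not specified in the text.

```python
from typing import NamedTuple

class VariableRecord(NamedTuple):
    """Quadruple from step 21 (field names are illustrative)."""
    scope: str    # scope of the variable, e.g. "global" or a function name
    linkage: str  # link attribute, e.g. "external", "internal", "none"
    storage: str  # storage type, e.g. "static", "auto", "extern"
    name: str

class MacroRecord(NamedTuple):
    """Triple from step 22 (field names are illustrative)."""
    identifier: str    # macro definition identifier
    content_type: str  # type of the macro content
    name: str

def build_syntax_info(variables, macros, structs):
    """Collect the three kinds of key syntax information.

    `structs` is assumed to be an iterable of tuples, each tuple being the
    ordered member types of one custom structure (the "sequence form").
    """
    return {
        "variable": set(variables),
        "macro": set(macros),
        "struct": set(structs),
    }
```

Because the records are hashable, duplicates collapse automatically and the resulting sets can be fed directly to a set-based similarity measure.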

As described above, step 3 includes the following steps:

step 31, starting from the entry function, generating the call graph and the post-dominator tree of the function f, extracting the system call sequence k along a single execution path, and recording the set K of system call sequences over all possible execution paths;

step 32, for each function f in the system call sequence k, locating the function domain d in which f is located, analyzing all parameters of f in the abstract syntax tree, determining the data source s of each parameter of f through static taint analysis, that is, judging where the parameter value comes from, and combining this with the data type t of the parameter to form the parameter vector $e_f = (d, f, t, s)$ of the function f; the parameter vector set $E_k$ of the system call sequence k is thereby obtained;

step 33, performing step 32 on each sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;

step 34, constructing the code fingerprint according to the full-path parameter vector set $E_K$ of the target code.
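Step 31 can be pictured with a much simplified model. The sketch below assumes each function is described by a list of alternative call lists, one per intra-procedural path, and expands them recursively into system call sequences. The data layout, the function name `expand` and the handling of recursion are assumptions for illustration; the post-dominator tree mentioned in step 31 is not modeled.

```python
from itertools import product

def expand(func, bodies, syscalls, stack=()):
    """Return the set of system call sequences (tuples) reachable from `func`.

    bodies:   dict mapping a function name to a list of alternative paths,
              each path being an ordered list of callee names (assumed model).
    syscalls: set of names treated as system calls.
    """
    if func in syscalls:
        return {(func,)}
    if func in stack or func not in bodies:
        # Recursive call or unknown callee: contributes no system calls here.
        return {()}
    result = set()
    for path in bodies[func]:
        # Expand every callee on this path, then combine the alternatives.
        parts = [expand(c, bodies, syscalls, stack + (func,)) for c in path]
        for combo in product(*parts):
            result.add(tuple(call for part in combo for call in part))
    return result
```

For example, with `bodies = {"main": [["open", "helper"], ["open", "close"]], "helper": [["read"], ["write", "close"]]}` and the four names treated as system calls, `expand("main", bodies, syscalls)` yields three sequences, one per execution path.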

As described above, step 4 includes the following steps:

step 41, for the simplified program dependency graphs sPDG of the target codes, finding the maximum isomorphic subgraph between the sPDGs through a progressive graph isomorphism solving algorithm, and calculating the isomorphism coefficient $P_{S,T}$ between the sPDGs;

step 42, calculating the coincidence coefficient $C_{S,T}$ through the Jaccard algorithm according to the key syntax information obtained in step 2;

step 43, for the full-path parameter vector sets $E_K$ of the target codes obtained in step 3, calculating the similarity of the subsets of $E_K$ through the Jaccard algorithm, and taking the highest value as the system call sequence similarity coefficient $A_{S,T}$.

Preferably, in step 41, the original program dependency graph PDG is represented as a directed graph $G = (V, E)$, where the node set V represents the set of predicate expressions or statements and the edge set E represents the data dependencies and control dependencies between them. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the simplified program dependency graphs sPDG of the two codes; the isomorphism coefficient P between the simplified program dependency graphs is calculated through an evaluation function.

Preferably, in step 42, the coincidence coefficient $H_\alpha$ of a single kind of syntax information α is calculated from the two sequences of syntax information α corresponding to the codes S and T respectively, and the key syntax information coincidence coefficient $C_{S,T}$ is then calculated, where $w_\alpha$ is the weight of syntax information α.

As described above, in step 5, the homology index Homology(S, T) of the two codes S and T is calculated from the three coefficients by a weighted formula, wherein $w_P$ is the weight of the sPDG isomorphism coefficient, $w_C$ is the weight of the syntax information coincidence coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient.

A code homology detection apparatus based on code fingerprints, comprising: the system comprises a program simplifying module, a grammar analyzing module, a fingerprint constructing module, a homology coefficient acquiring module and a homology judging module;

the program simplification module is used for analyzing the dependency relationship between the two input codes S and T and acquiring an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;

the syntax parsing module is used for parsing key syntax information of the code based on an abstract syntax tree and comprises a variable parsing unit, a macro definition parsing unit and a key data structure parsing unit, wherein the variable parsing unit is used for recording the global variables and local variables in the designated code domain together with their scopes, link attributes and storage types, the macro definition parsing unit is used for recording macro definitions and their corresponding content types, and the key data structure parsing unit is used for parsing all classes and structures defined in functions in the target code domain;

the fingerprint construction module is used for extracting a system calling sequence of a code execution path, acquiring a full-path parameter vector set of a target code and constructing a code fingerprint;

the homology coefficient acquisition module is used for calculating the homology coefficients between code fingerprint components according to the information acquired by the program simplification module, the syntax parsing module and the fingerprint construction module, wherein the homology coefficients comprise the sPDG isomorphism coefficient $P_{S,T}$, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;

the homology judging module is used for calculating the homology index of the two codes S and T according to the homology coefficients obtained by the homology coefficient acquisition module, and judging the homology relationship between the two codes according to the homology index.

The invention has the beneficial effects that:

For judging source code homology, the invention takes both code semantics and behavior into account on the basis of similarity, improves detection efficiency by using lightweight features and a simplification mechanism, and measures the homology relationship between codes from multiple angles. It addresses the following problems of the prior art: a) obfuscation that combines several means, such as format changes, renaming, junk code insertion and statement reordering, cannot be handled effectively; b) detection methods based on complex structures and algorithms achieve higher accuracy but require a large amount of computation and detect slowly, so that efficiency cannot be maintained while accuracy is improved. The invention improves detection efficiency while preserving accuracy.

The invention abstracts the logic and characteristics of the code through a code fingerprint. While the program dependency graph expresses the relationship between data flow and control flow, the fingerprint also integrates the syntactic and behavioral characteristics of the code. This addresses the weakness of existing code homology detection, which concentrates on the similarity of code text and features and reflects the internal logic and deep association between codes poorly. The invention improves homology detection efficiency while keeping accuracy high, helps prevent the spread of malicious code, provides technical support for homology detection and judgment of computer program source code, and is of guiding significance for computer network security and virus detection technology.

Description of the drawings:

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic flow chart of the homology analysis method in the example;

FIG. 3 is a schematic flow chart of obtaining a simplified program dependence graph sPDG in the embodiment;

FIG. 4 is a schematic diagram illustrating a process of parsing key syntax information of a code based on an abstract syntax tree in an embodiment;

FIG. 5 is a schematic flowchart of constructing a code fingerprint according to an embodiment;

FIG. 6 is a schematic diagram of a process for calculating the homology coefficients between code fingerprint components according to an embodiment;

FIG. 7 is a schematic view of the apparatus of the present invention;

FIG. 8 is an example input code schematic;

FIG. 9 is a simplified part of the structure of the dependency graph of the original program in the embodiment;

FIG. 10 is a schematic diagram of a nest removal process in an embodiment;

FIG. 11 is a diagram illustrating parsing based on an abstract syntax tree in an embodiment;

FIG. 12 is a diagram illustrating the extraction of the target code system call parameter in the embodiment.

Detailed description of the embodiments:

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and technical solutions.

At present, most of the research on methods for detecting code homology is carried out based on a single type. The coarse-grained feature detection can improve the detection efficiency but can reduce the detection accuracy, and the fine-grained features bring performance bottlenecks with large calculation amount while improving the detection accuracy. How to effectively deal with a complex code confusion means under the condition of efficient detection and accurately abstract code logic and generalize code characteristics is an important content which needs to be researched currently.

In an embodiment, a code homology detection method based on code fingerprints is provided; as shown in fig. 1, it comprises steps 1 to 5 described above.

The embodiment can extract code features accurately, cope effectively with common code obfuscation methods, and improve detection efficiency while improving the accuracy of homology detection.

For the two input code files to be detected, program analysis is performed according to programming language compilation principles, and the original program dependency graph PDG of each code is obtained as the basis of the code fingerprint. In another embodiment of the present invention, as shown in fig. 3, the original PDG is subjected to structure simplification, nesting removal and coloring to obtain the simplified program dependency graph sPDG, following steps 11 to 13 described above.

In another embodiment of the present invention, simplifying the structure of the original program dependency graph PDG includes applying the following operations to the nodes of the graph according to the simplification principle: removing vertices that have exactly one outgoing edge and no incoming edge; removing vertices that have exactly one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges. The simplification operations are repeated until no node matching the simplification principle remains.
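The iterative simplification can be sketched as a fixed-point loop over a directed graph stored as a node set and an edge set. This is a minimal illustration under stated assumptions: the function name `simplify_pdg`, the set-based representation and the decision not to create self-loops when bypassing a vertex are choices made for the example, not details given in the text.

```python
def simplify_pdg(nodes, edges):
    """Apply the four removal rules repeatedly until none of them matches.

    nodes: iterable of sortable node identifiers.
    edges: iterable of (source, target) pairs.
    Returns the simplified (nodes, edges) as sets.
    """
    nodes, edges = set(nodes), set(edges)
    changed = True
    while changed:
        changed = False
        for v in sorted(nodes):
            preds = [u for (u, w) in edges if w == v]
            succs = [w for (u, w) in edges if u == v]
            degree = (len(preds), len(succs))
            # (0,1): one outgoing edge only; (1,0): one incoming edge only;
            # (1,1): pass-through vertex; (0,0): isolated vertex.
            if degree in {(0, 1), (1, 0), (1, 1), (0, 0)}:
                edges = {(u, w) for (u, w) in edges if v not in (u, w)}
                if degree == (1, 1) and preds[0] != succs[0]:
                    edges.add((preds[0], succs[0]))  # bypass the removed vertex
                nodes.remove(v)
                changed = True
                break  # restart the scan, since degrees have changed
    return nodes, edges
```

Note that under these rules simple chains collapse completely; only vertices that keep a combined degree pattern outside the four cases survive, which is what makes the resulting graph compact.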

The nodes are classified according to statement type, then the nodes of different types are colored according to different colors, and each type is identified by a coloring number for comparison. Examples of classifications used are as follows: function calls, control statements, declaration statements, arithmetic statements, switch statements, logical expressions, jump statements, and return statements, among others. The details are shown in Table 1.

Type	Node representation information	Color number
Function call	Calls to functions and system APIs	1
Control statement	if, switch, while, for	2
Declaration statement	Variable declaration or formal parameters	3
Operation statement	Variable operation, increment/decrement operation	4
switch statement	case, default	5
Jump statement	goto, break, continue	6
Conditional statement	<, >, ==, !=	7
Return statement	return	8
Others	Others	0

TABLE 1
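The coloring of Table 1 amounts to a lookup from statement type to color number. The sketch below assumes the statement type of each node has already been determined by an earlier analysis step; the key names in `COLOR` and the function name `color_nodes` are illustrative.

```python
# Color numbers follow Table 1; the string keys are illustrative labels.
COLOR = {
    "function_call": 1,
    "control": 2,
    "declaration": 3,
    "operation": 4,
    "switch": 5,
    "jump": 6,
    "conditional": 7,
    "return": 8,
}

def color_nodes(node_types):
    """Map each node to its color number; unknown types get 0 ("Others")."""
    return {node: COLOR.get(kind, 0) for node, kind in node_types.items()}

def same_color(colors, u, v):
    """Two nodes are candidates for matching only if their colors agree."""
    return colors[u] == colors[v]
```

Comparing color numbers before comparing structure lets a graph matcher discard incompatible node pairs cheaply.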

In another embodiment of the present invention, the key syntax information of the code is parsed based on the abstract syntax tree (AST) and used as a component of the code fingerprint, as shown in fig. 4, following steps 21 to 23 described above.

In another embodiment of the present invention, the system call sequences of the code execution paths are extracted, the full-path parameter vector set $E_K$ of the target code is obtained, and the code fingerprint is constructed, as shown in fig. 5, following steps 31 to 34 described above. In particular:

step 32, for each function f in each system call sequence k, locating the function domain d in which f is located, analyzing all parameters of f in the abstract syntax tree, determining the data source s of each parameter of f through static taint analysis, that is, judging where the parameter value comes from, and combining this with the data type t of the parameter to form the parameter vector $e_f = (d, f, t, s)$ of the function f; the parameter vector set $E_k$ of the system call sequence k is thereby obtained.

For the homology determination of code fingerprints, in another embodiment of the present invention, the homology coefficients between code fingerprint components are calculated, as shown in fig. 6, following steps 41 to 43 described above. In particular:

step 43, for the full-path parameter vector sets $E_K$ of the target codes obtained in step 3, calculating the similarity coefficients of the subsets of $E_K$ through the Jaccard algorithm, and taking the highest value as the system call sequence similarity coefficient $A_{S,T}$.

In another embodiment, the original program dependency graph PDG of the target code is represented as a directed graph $G = (V, E)$, where the node set V represents the set of predicate expressions or statements and the edge set E represents the data dependencies and control dependencies between them. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the simplified program dependency graphs sPDG of the two codes. According to the result of the progressive graph isomorphism solving algorithm, the isomorphism coefficient P between the simplified program dependency graphs is calculated through an evaluation function, where $P = 0$ indicates that $G_1$ is a complete subgraph of $G_2$.

For the obtained key syntax information sequences of the code, in a further embodiment of the invention, the coincidence coefficient is calculated by the Jaccard algorithm: the coincidence coefficient $H_\alpha$ of a single kind of syntax information α is calculated from the two sequences of syntax information α corresponding to the codes S and T respectively, and the key syntax information coincidence coefficient $C_{S,T}$ is then calculated, where $w_\alpha$ is the weight of syntax information α.

For the two input codes S and T, after the sPDG isomorphism coefficient $P_{S,T}$, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$ have been calculated, in other embodiments of the present invention the homology index Homology(S, T) of the two codes is calculated by a weighted formula, wherein $w_P$ is the weight of the sPDG isomorphism coefficient, $w_C$ is the weight of the syntax information coincidence coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more evident the homology relationship between the input samples.

Corresponding to the above method, an embodiment of the present invention further provides a code homology detection apparatus based on a code fingerprint, as shown in fig. 7, including: a program simplifying module 201, a grammar parsing module 202, a fingerprint constructing module 203, a homology coefficient obtaining module 204 and a homology judging module 205;

the program simplification module 201 is used for analyzing the dependency relationship between the two input codes S and T to obtain an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;

the syntax parsing module 202 is configured to parse code key syntax information based on an abstract syntax tree, and includes a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, where the variable parsing unit is configured to record a global variable, a local variable, and a scope, a link attribute, and a storage type corresponding to the global variable and the local variable in a designated code domain, the macro definition parsing unit is configured to record a macro definition and a content type corresponding to the macro definition, and the key data structure parsing unit is configured to parse structural bodies defined in all classes and functions in a target code domain;

the fingerprint construction module 203 is configured to extract a system call sequence of a code execution path, obtain a full-path parameter vector set of a target code, and construct a code fingerprint;

the homology coefficient obtaining module 204 is configured to calculate the homology coefficients between code fingerprint components according to the information obtained by the program simplifying module, the syntax parsing module and the fingerprint constructing module, wherein the homology coefficients comprise the sPDG isomorphism coefficient $P_{S,T}$, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;

And the homology judgment module 205 is used for calculating the homology indexes of the two codes S and T according to the homology coefficients obtained by the homology coefficient acquisition module and judging the homology relationship between the two codes according to the homology indexes.

The effectiveness of the present invention is further explained through a specific example. Fig. 8 shows part of the contents of two input program code files. Program analysis is performed according to programming language compilation principles to obtain the original program dependency graph PDG, represented as a directed graph $G = (V, E)$, where the node set V represents the set of predicate expressions or statements and the edge set E represents the data dependencies and control dependencies between them; the original PDG serves as the basis of the code fingerprint. The structure of the original program dependency graph is simplified, and part of the simplified result is shown in fig. 9. Nesting removal is then performed on the simplified graph, as shown in fig. 10. The graph obtained after structure simplification and nesting removal is colored to obtain the simplified program dependency graph sPDG. An abstract syntax tree of the source code is constructed using LLVM or Clang, and the key syntax information of the code is parsed from it to form part of the code fingerprint; the parsing of the two input files is shown in fig. 11.

The parameter sequences of the system calls in the target code are extracted to complete the code fingerprint, as shown in fig. 12. Starting from an entry function such as main, the call graph and the post-dominator tree of the function f are generated, the system call sequence k along a single execution path is extracted, and the set K of system call sequences over all possible execution paths is recorded. For each function f in each system call sequence k, the function domain d in which f is located is determined, all parameters of f are analyzed in the abstract syntax tree, the data source s of each parameter of f is determined through static taint analysis, that is, whether the parameter value comes from outside or inside the function, and finally the parameter vector $e_f = (d, f, t, s)$ of the function f is formed together with the data type t of the parameter. Performing this operation on a system call sequence k yields the parameter vector set $E_k$ of that sequence; performing it on every sequence k in the system call sequence set K yields the full-path parameter vector set $E_K$ of the target code. To alleviate the problem that $E_K$ is too large or contains too many invalid paths, only paths whose parameter vector set satisfies $|E_k| \ge 5$ are retained.
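The construction of the full-path parameter vector set, including the size filter, can be sketched as follows. The type name `ParamVector`, the function name `full_path_vector_set` and the labels used for the data source are assumptions for illustration; the taint analysis that produces the vectors is taken as given.

```python
from typing import NamedTuple

class ParamVector(NamedTuple):
    """Parameter vector e_f = (d, f, t, s) for one parameter of a call."""
    d: str  # function domain in which the call to f is located
    f: str  # the called function
    t: str  # data type of the parameter
    s: str  # data source from taint analysis, e.g. "external" or "internal"

def full_path_vector_set(sequences, min_size=5):
    """Build E_K from per-path lists of parameter vectors.

    sequences: iterable of lists of ParamVector, one list per sequence k.
    Only sets E_k with at least `min_size` elements are kept, matching the
    |E_k| >= 5 filter described in the text.
    """
    E_K = []
    for vectors in sequences:
        E_k = frozenset(vectors)
        if len(E_k) >= min_size:
            E_K.append(E_k)
    return E_K
```

Storing each $E_k$ as a frozenset removes duplicate vectors on a path and makes the later Jaccard comparison a direct set operation.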

For the simplified program dependency graphs sPDG of the target codes, a progressive graph isomorphism solving algorithm is used to find the maximum isomorphic subgraph between the sPDGs, and the sPDG isomorphism coefficient is calculated from the result. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the simplified program dependency graphs sPDG of the two input code files. The algorithm is as follows:

The number of steps n and the expansion size $m_i$ of each step are determined before the algorithm starts, where $m_i$ and n satisfy a constraint (formula not reproduced) and $m_i \ge 1$. The function Isomorph($G_{1,i}$, $G_2$), implemented with the open-source VFLib graph isomorphism framework, performs isomorphism determination on the graphs $G_{1,i}$ and $G_2$ to obtain the maximal subgraph of $G_1$ that is isomorphic with $G_2$.
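One possible reading of the progressive procedure is sketched below: the subgraph $G_{1,i}$ grows by $m_i$ nodes per step and is tested against $G_2$ after each expansion, stopping at the first failure. This is a greedy simplification and does not guarantee a maximum subgraph. The small backtracking matcher `embeds` merely stands in for the VFLib call and is not the VFLib API; the node ordering, the color constraint and all function names are assumptions.

```python
def embeds(sub_nodes, sub_edges, g2_nodes, g2_edges, color1, color2):
    """Check whether the subgraph can be mapped injectively into G2 so that
    node colors agree and every subgraph edge maps onto an edge of G2."""
    order = list(sub_nodes)

    def extend(i, mapping):
        if i == len(order):
            return True
        u = order[i]
        for v in g2_nodes:
            if v in mapping.values() or color1[u] != color2[v]:
                continue
            mapping[u] = v
            consistent = all((mapping[a], mapping[b]) in g2_edges
                             for (a, b) in sub_edges
                             if a in mapping and b in mapping)
            if consistent and extend(i + 1, mapping):
                return True
            del mapping[u]  # backtrack
        return False

    return extend(0, {})

def progressive_match(g1_order, g1_edges, g2_nodes, g2_edges,
                      color1, color2, steps):
    """Grow G_{1,i} by m_i nodes per step (steps = [m_1, ..., m_n]) and
    return the largest prefix of g1_order that still embeds into G2."""
    best, size = [], 0
    for m in steps:
        size += m
        nodes = g1_order[:size]
        edges = {(a, b) for (a, b) in g1_edges if a in nodes and b in nodes}
        if not embeds(nodes, edges, g2_nodes, set(g2_edges), color1, color2):
            break
        best = nodes
    return best
```

Testing small prefixes first keeps each matching call cheap and lets the search stop early once the expansion no longer fits, which is the efficiency argument behind a progressive scheme.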

According to the result of the progressive graph isomorphism solving algorithm, an evaluation function calculates the proportion of the number of differing edges between the two graphs to the number of edges in the smaller graph.

The result P is the isomorphism coefficient of the simplified program dependency graphs sPDG. $P = 0$ indicates that $G_1$ is a complete subgraph of $G_2$.
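Since the evaluation function itself is not reproduced in the text, the following is only one reading of its verbal description: the differing edges are taken to be the edges of the smaller graph that are not covered by the common subgraph. The function name, the parameters and the treatment of an empty graph are assumptions.

```python
def isomorphism_coefficient(edge_count_1, edge_count_2, common_edge_count):
    """Proportion of differing edges relative to the smaller graph.

    edge_count_1, edge_count_2: numbers of edges in the two sPDGs.
    common_edge_count: number of edges in the common subgraph found by the
    graph matching step (assumed to be at most the smaller edge count).
    """
    smaller = min(edge_count_1, edge_count_2)
    if smaller == 0:
        return 0.0  # assumption: an empty graph is trivially contained
    return (smaller - common_edge_count) / smaller
```

Under this reading, a value of 0 means every edge of the smaller graph is matched, consistent with the statement that $P = 0$ indicates a complete subgraph, and larger values mean less structural overlap.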

For the obtained key syntax information sequences of the code, including the variable quadruple sequence, the macro definition triple sequence and the structure sequence, the coincidence coefficient C is calculated with the Jaccard algorithm, specifically as follows:

The coincidence coefficient $H_\alpha$ of a single kind of syntax information α is calculated from the two sequences of syntax information α corresponding to the codes S and T respectively; when both sequences are empty, $H_\alpha = 0$ by definition. The key syntax information coincidence coefficient $C_{S,T}$ is then calculated, where $w_\alpha$ is the weight of syntax information α. In this example, the variable information weight is set to 0.4, the macro definition information weight to 0.2, and the structure information weight to 0.4.
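The calculation can be sketched with the standard Jaccard coefficient and the example weights. The weighted sum used to combine the per-kind coefficients is an assumption, since the combining formula is not reproduced in the text; the dictionary keys and function names are likewise illustrative.

```python
# Example weights from the text; the dictionary keys are illustrative.
WEIGHTS = {"variable": 0.4, "macro": 0.2, "struct": 0.4}

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|, defined as 0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def coincidence_coefficient(info_s, info_t, weights=WEIGHTS):
    """Assumed combination: weighted sum of the per-kind Jaccard coefficients.

    info_s, info_t: dicts mapping each kind of syntax information to a
    collection of hashable records for code S and code T respectively.
    """
    return sum(w * jaccard(info_s.get(kind, ()), info_t.get(kind, ()))
               for kind, w in weights.items())
```

With weights summing to 1, the result stays within the range 0 to 1, so it can be combined with the other coefficients on a common scale.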

For the full-path parameter vector sets E_K of the target codes, the Jaccard algorithm is used to calculate the similarity coefficients between the subsets of E_K, and the highest value is taken as the system call sequence similarity coefficient A.
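As a rough sketch of this step, assuming the subsets compared are the per-path parameter vector sets E_k of code S against those of code T, the coefficient A can be taken as the maximum Jaccard coefficient over all such pairs. The function names and the integer stand-ins for parameter vectors are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard coefficient of two collections, defined as 0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def syscall_similarity(E_K_s, E_K_t):
    """A = highest Jaccard coefficient over all pairs (E_k of S, E_k of T).

    Returns 0.0 when either code has no retained paths.
    """
    return max(
        (jaccard(a, b) for a, b in product(E_K_s, E_K_t)),
        default=0.0,
    )


# Integers stand in for parameter vectors in this hypothetical example.
E_K_s = [frozenset({1, 2, 3}), frozenset({4})]
E_K_t = [frozenset({1, 2, 3, 5})]
print(syscall_similarity(E_K_s, E_K_t))  # best pair shares 3 of 4 vectors
```

Comparing every pair is quadratic in the number of paths, which is one reason the earlier |E_k| ≥ 5 filter helps keep E_K small.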

For the codes S and T, the sPDG graph isomorphism coefficient P_{S,T}, the syntax information coincidence coefficient C_{S,T} and the system call sequence similarity coefficient A_{S,T} are calculated, and the homology index of the two codes S and T is calculated by the following formula:

[formula not reproduced in the source text]

where w_P is the weight of the sPDG graph isomorphism coefficient, w_C is the weight of the syntax information coincidence coefficient, and w_A is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more evident the homology relationship between the input samples.
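Since the formula itself is not available here, the sketch below shows only one possible weighted combination and should not be read as the formula of the invention. It assumes that P, for which 0 means the closest match, is converted to a similarity as 1 - P after clipping to [0, 1], and that the three terms are summed with weights w_P, w_C and w_A. The function name and the default weight values are hypothetical.

```python
def homology_index(P, C, A, w_P=0.4, w_C=0.3, w_A=0.3):
    """Hypothetical combination: w_P * (1 - P) + w_C * C + w_A * A.

    P is a distance-like coefficient (0 means G1 is a subgraph of G2),
    so it is clipped to [0, 1] and inverted before weighting. The default
    weights are illustrative only and do not come from the text.
    """
    graph_similarity = 1.0 - min(max(P, 0.0), 1.0)
    return w_P * graph_similarity + w_C * C + w_A * A


# A larger index suggests a more evident homology relationship.
print(homology_index(P=0.0, C=1.0, A=1.0))
print(homology_index(P=1.0, C=0.0, A=0.0))
```

With weights that sum to 1 and coefficients in [0, 1], this form keeps the index in [0, 1], which makes thresholds easier to interpret.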

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Since the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is kept brief, and the relevant points can be found in the description of the method.

The units and method steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Those skilled in the art will appreciate that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, such as: read-only memory, magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A code homology detection method based on code fingerprints is characterized by comprising the following steps:

2. The code homology detection method based on code fingerprints according to claim 1, wherein the original program dependency graph PDG is subjected to structure simplification, nesting removal and coloring processing in step 1 to obtain a simplified program dependency graph sPDG, and the method comprises the following steps:

3. The code homology detection method based on code fingerprints as claimed in claim 2, wherein the simplification rules in step 11 comprise: removing vertices with only one outgoing edge and no incoming edge; removing vertices with only one incoming edge and no outgoing edge; removing vertices with only one incoming edge and one outgoing edge and adding an edge pointing from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges.

4. The method for detecting code homology based on code fingerprint as claimed in claim 1, wherein the key syntax information of the code is parsed based on the abstract syntax tree in the step 2, which comprises the following contents:

5. The code homology detection method based on code fingerprint as claimed in claim 1, wherein the step 3 comprises the following steps:

6. The code homology detection method based on code fingerprint as claimed in claim 1, wherein the step 4 comprises the following steps:

step 41, for the simplified program dependence graphs sPDG of the target code, obtaining the maximum isomorphic subgraph between the simplified program dependence graphs sPDG through a progressive graph isomorphism solving algorithm, and calculating the isomorphism coefficient P_{S,T} between the simplified program dependence graphs sPDG;

7. The code homology detection method based on code fingerprints as claimed in claim 6, wherein in step 41, the original program dependency graph PDG is represented as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies existing among them; G₁ = (V₁, E₁) and G₂ = (V₂, E₂) respectively represent the simplified program dependence graphs sPDG; and the evaluation function is:

$$I(e, E) = \begin{cases} 0 & \text{if } e \in E \\ 1 & \text{otherwise,} \end{cases} \qquad P = \frac{\sum_{e \in E_1} I(e, E_2) + \sum_{e \in E_2} I(e, E_1)}{|E_1|}$$

8. The method of claim 6, wherein the step 42 comprises: calculating the coincidence coefficient of a single kind of syntax information α [formula not reproduced in the source text] from the syntax information α sequences corresponding to the two codes S and T; and calculating the key syntax information coincidence coefficient [formula not reproduced in the source text], where w_α is the weight of the syntax information α.

9. The code homology detection method based on code fingerprints as claimed in claim 1, wherein in step 5 the homology index of the two codes S and T is calculated by the formula [formula not reproduced in the source text], where w_P is the weight of the sPDG graph isomorphism coefficient, w_C is the weight of the syntax information coincidence coefficient, and w_A is the weight of the system call sequence similarity coefficient.

10. A code homology detection apparatus based on a code fingerprint, comprising: the system comprises a program simplifying module, a grammar analyzing module, a fingerprint constructing module, a homology coefficient acquiring module and a homology judging module;

the homology coefficient acquiring module is used for calculating the homology coefficients between code fingerprint components according to the information acquired by the program simplifying module, the grammar analyzing module and the fingerprint constructing module, the homology coefficients comprising the graph isomorphism coefficient P_{S,T} of the simplified program dependence graphs sPDG, the syntax information coincidence coefficient C_{S,T} and the system call sequence similarity coefficient A_{S,T};

and the homology judging module is used for calculating the homology index of the two codes S and T according to the homology coefficients obtained by the homology coefficient acquiring module, and judging the homology relationship between the two codes according to the homology index.