CN107169358B - Code homology detection method and its device based on code fingerprint - Google Patents

Publication number
CN107169358B
Authority
CN
China
Prior art keywords
code
homology
coefficient
fingerprint
spdg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710375425.4A
Other languages
Chinese (zh)
Other versions
CN107169358A (en)
Inventor
魏强
刘臻
曹琰
尹中旭
彭建山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Red Neurons Co Ltd
PLA Information Engineering University
Original Assignee
Shanghai Red Neurons Co Ltd
PLA Information Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Red Neurons Co Ltd and PLA Information Engineering University
Priority to CN201710375425.4A
Publication of CN107169358A
Application granted
Publication of CN107169358B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention relates to a code homology detection method and device based on code fingerprints. The method includes: performing dependency analysis on the input code to obtain an original program dependency graph PDG; performing structure simplification, nesting removal and coloring on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG; parsing key syntax information of the code based on an abstract syntax tree; extracting the system call sequences of the code execution paths, obtaining the full-path parameter vector set of the target code, and constructing a code fingerprint; calculating the homology coefficients between the code fingerprint components; and calculating the homology index of two codes S and T from the homology coefficients and determining the homology relationship between the two codes from the homology index. The invention takes code semantics and behavior into account in addition to similarity, improves detection efficiency by using lightweight features and a simplification mechanism, measures the homology relationship between codes from multiple angles, and improves detection efficiency while guaranteeing accuracy.

Description

Code homology detection method and device based on code fingerprints
Technical Field
The invention belongs to the technical field of computer software application, and particularly relates to a code homology detection method and a code homology detection device based on code fingerprints.
Background
As the demand for internet applications grows and code iteration accelerates, higher requirements are placed on the efficiency and speed of software development. In the software development pipeline, template-based secondary development and reuse of existing components are common; in addition, developers often refer to code in open source repositories on the internet to meet new requirements. As a result, the amount of homologous code obtained through different channels keeps growing, and hidden defects and errors in that code spread with it. Meanwhile, with the continuous development of computer security technology and the continuous improvement of virus detection technology, malicious code on the internet such as macro viruses, malicious VBS scripts and malicious JavaScript scripts is increasingly likely to be detected, so attackers must bypass detection by modifying code content, transforming code form and similar means on the basis of the original code in order to improve the survivability of the malicious code. The versions of the same malicious code are inherently homologous, and this homology is an important basis for detecting them.
As an important aspect of computer program research, current homology detection techniques for software source code fall mainly into three types: text-based software homology detection, structural-analysis-based software homology detection, and semantics-based software homology detection. 1) In text-based software homology detection, the detected object is the text of the source code; for example, code similarity is detected from text similarity and text attributes. One benefit of analyzing program source code as text is that the analysis is not tied to the programming language of the analyzed object, but because it ignores the linguistic characteristics of the code, such methods are generally weak against code obfuscation. Even simple obfuscation, such as renaming variables and functions, inserting junk code, or reordering statements without affecting functionality, can greatly degrade the detection result. This technique can therefore only perform simple homology detection at the text level and is quite limited. 2) Software homology detection based on structural analysis analyzes the code structure and expresses it in other comparable intermediate forms; token-based, tree-based and graph-based detection methods are common. Compared with text-based methods, this technique detects better and has some resistance to common obfuscation. However, its computational complexity depends on the intermediate representation, and complex structures bring a large performance overhead during detection.
3) Semantics-based software homology detection extracts features such as control flow, data flow and standard API flow on the basis of static semantic analysis and characterizes program behavior from different angles; alternatively, it compiles and executes the source code and records the program instruction stream and system call sequence to describe program behavior. This technique essentially describes the semantic and behavioral characteristics of a program and copes effectively with the homology detection challenges posed by various kinds of code obfuscation. However, semantics-based methods cannot effectively cover the characteristics of the code itself, and accurate semantic analysis is difficult to perform.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a code homology detection method and a code homology detection device based on code fingerprints, which solve the problems of poor confusion interference resistance and low detection efficiency in the software source code detection process, can accurately extract code features, effectively cope with the influence caused by common code confusion methods, improve the homology detection efficiency and the detection accuracy, and effectively prevent the spread of malicious codes.
According to the design scheme provided by the invention, the code homology detection method based on the code fingerprint comprises the following steps:
step 1, analyzing the dependency relationship of two input codes S and T to obtain an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
step 2, analyzing key syntax information of the code based on the abstract syntax tree;
step 3, extracting a system calling sequence of the code execution path, acquiring a full path parameter vector set of the target code, and constructing a code fingerprint;
step 4, calculating the homology coefficients between the code fingerprint components, wherein the homology coefficients comprise the simplified program dependency graph sPDG isomorphism coefficient P_{S,T}, the syntax information coincidence coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
step 5, calculating the homology index of the two codes S and T according to the homology coefficients, and judging the homology relationship between the two codes according to the homology index.
As described above, in step 1, the original program dependency graph PDG is subjected to structure simplification, nesting removal, and coloring processing, and the simplified program dependency graph sPDG is obtained, which includes the following contents:
step 11, simplifying the structure of the original program dependence graph PDG according to a simplification principle;
step 12, removing the nested input nodes and output nodes from the nodes containing the nested relation, and removing the edges of the corresponding dependency relation to the outer layer function call nodes;
and step 13, classifying and coloring the nodes according to the statement types, and acquiring the simplified program dependence graph sPDG.
The simplification principle in step 11 includes: removing vertices that have only one outgoing edge and no incoming edge; removing vertices that have only one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges.
In the above, the parsing of the key syntax information of the code based on the abstract syntax tree in step 2 includes the following contents:
step 21, recording global variables, local variables and attributes thereof in the designated code domain to form a quadruple, wherein the quadruple comprises a scope of the variables, link attributes, storage types and names;
step 22, analyzing and recording the macro definition and the corresponding content thereof to form a triple, wherein the triple comprises a macro definition identifier, a content type and a name;
and step 23, analyzing a key data structure in the code based on the abstract syntax tree AST, and recording a custom structure body in the code in a sequence form.
As described above, step 3 includes the following steps:
step 31, starting from the entry function, generating the call graph and the post-dominator tree of a function f, extracting the system call sequence k in a single execution path, and recording the set K of system call sequences over all possible execution paths;
step 32, for each function f in each system call sequence k, locating the function domain d in which f is located, parsing all parameters of f in the abstract syntax tree, determining the data source s of each parameter of f through static taint analysis so as to judge where the parameter value comes from, and combining it with the data type t of the parameter to form the parameter vector e_f = (d, f, t, s) of the function f; obtaining the parameter vector set E_k of the system call sequence k;
step 33, executing step 32 on each sequence k in the system call sequence set K, and acquiring the full-path parameter vector set E_K of the target code;
step 34, constructing the code fingerprint according to the full-path parameter vector set E_K of the target code.
As described above, step 4 includes the following steps:
step 41, for the simplified program dependency graphs sPDG of the target codes, seeking the maximum isomorphic subgraph between the simplified program dependency graphs sPDG through a progressive graph isomorphism solving algorithm, and calculating the isomorphism coefficient P_{S,T} between the simplified program dependency graphs sPDG;
step 42, calculating the coincidence coefficient C_{S,T} through the Jaccard algorithm according to the key syntax information obtained in step 2;
step 43, according to the full-path parameter vector set E_K of the target code obtained in step 3, solving the Jaccard similarity coefficients of the subsets of E_K, and taking the highest of these similarity coefficients as the system call sequence similarity coefficient A_{S,T}.
Preferably, in step 41, the original program dependency graph PDG is represented as a directed graph G = (V, E), where the node set V represents the set of predicate expressions or statements and E represents the data dependencies and control dependencies existing between them. Let G_1 = (V_1, E_1) and G_2 = (V_2, E_2) respectively denote the two simplified program dependency graphs sPDG; the isomorphism coefficient P between the simplified program dependency graphs sPDG is calculated by an evaluation function.
Preferably, step 42 comprises the following: the coincidence coefficient of each single item of syntax information α is calculated from the sequences of syntax information α corresponding to the two codes S and T respectively; the key syntax information coincidence coefficient C_{S,T} is then calculated, where w_α is the weight of the syntax information α.
As described above, in step 5, the homology index Homology(S, T) of the two codes S and T is calculated by a formula over the three homology coefficients, where w_P is the weight of the sPDG isomorphism coefficient, w_C is the weight of the syntax information coincidence coefficient, and w_A is the weight of the system call sequence similarity coefficient.
A code homology detection apparatus based on code fingerprints, comprising: the system comprises a program simplifying module, a grammar analyzing module, a fingerprint constructing module, a homology coefficient acquiring module and a homology judging module;
the program simplification module is used for analyzing the dependency relationship between the two input codes S and T and acquiring an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
the grammar parsing module is used for parsing key grammar information of the codes based on an abstract grammar tree and comprises a variable parsing unit, a macro definition parsing unit and a key data structure parsing unit, wherein the variable parsing unit is used for recording global variables, local variables and corresponding action domains, link attributes and storage types of the global variables and the local variables in a designated code domain, the macro definition parsing unit is used for recording macro definitions and corresponding content types of the macro definitions, and the key data structure parsing unit is used for parsing all classes and structural bodies defined in functions in a target code domain;
the fingerprint construction module is used for extracting a system calling sequence of a code execution path, acquiring a full-path parameter vector set of a target code and constructing a code fingerprint;
the homology coefficient acquisition module is used for calculating the homology coefficients between the code fingerprint components according to the information acquired by the program simplification module, the grammar parsing module and the fingerprint construction module, wherein the homology coefficients comprise the simplified program dependency graph sPDG isomorphism coefficient P_{S,T}, the syntax information coincidence coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
the homology judging module is used for calculating the homology index of the two codes S and T according to the homology coefficients obtained by the homology coefficient acquisition module and judging the homology relationship between the two codes according to the homology index.
The invention has the beneficial effects that:
With respect to source code homology determination, the invention takes both code semantics and code behavior into account in addition to similarity, improves detection efficiency by using lightweight features and a simplification mechanism, and measures the homology relationship between codes from multiple angles. It addresses the following problems of the prior art: a) compound code obfuscation that combines format changes, renaming, junk code insertion, statement reordering and similar means cannot be handled effectively; b) detection methods based on complex structures and algorithms achieve higher accuracy, but they involve heavy computation and low detection efficiency, so accuracy and efficiency cannot both be achieved. The invention improves detection efficiency while guaranteeing accuracy.
The invention abstracts the logic and the characteristics of the code through the code fingerprint, integrates the grammatical characteristics and the behavior characteristics of the code while showing the relation between data flow and control flow through a program dependence graph, solves the problem that the existing code homology detection focuses on analyzing the similarity of code text and characteristics and reflects the insufficient capability of the internal logic and deep association between the codes, greatly improves the homology detection efficiency while keeping high accuracy, effectively prevents the spread of malicious codes, provides technical support for the homology detection and judgment of computer program source codes, and has important guiding significance for computer network security technology and virus detection technology.
Description of the drawings:
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic flow chart of the homology analysis method in the example;
FIG. 3 is a schematic flow chart of obtaining a simplified program dependence graph sPDG in the embodiment;
FIG. 4 is a schematic diagram illustrating a process of parsing key syntax information of a code based on an abstract syntax tree in an embodiment;
FIG. 5 is a schematic flowchart of constructing a code fingerprint according to an embodiment;
FIG. 6 is a schematic diagram of a process for calculating the homology coefficients between code fingerprint components according to an embodiment;
FIG. 7 is a schematic view of the apparatus of the present invention;
FIG. 8 is an example input code schematic;
FIG. 9 is a simplified part of the structure of the dependency graph of the original program in the embodiment;
FIG. 10 is a schematic diagram of a nest removal process in an embodiment;
FIG. 11 is a diagram illustrating parsing based on an abstract syntax tree in an embodiment;
FIG. 12 is a diagram illustrating the extraction of the target code system call parameter in the embodiment.
Detailed description of the embodiments:
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.
At present, most of the research on methods for detecting code homology is carried out based on a single type. The coarse-grained feature detection can improve the detection efficiency but can reduce the detection accuracy, and the fine-grained features bring performance bottlenecks with large calculation amount while improving the detection accuracy. How to effectively deal with a complex code confusion means under the condition of efficient detection and accurately abstract code logic and generalize code characteristics is an important content which needs to be researched currently.
In an embodiment, a code homology detection method based on code fingerprints is provided, and is shown in fig. 1, and includes the following steps:
step 1, analyzing the dependency relationship of two input codes S and T to obtain an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
step 2, analyzing key syntax information of the code based on the abstract syntax tree;
step 3, extracting a system calling sequence of the code execution path, acquiring a full path parameter vector set of the target code, and constructing a code fingerprint;
step 4, calculating the homology coefficients between the code fingerprint components, wherein the homology coefficients comprise the simplified program dependency graph sPDG isomorphism coefficient P_{S,T}, the syntax information coincidence coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
step 5, calculating the homology index of the two codes S and T according to the homology coefficients, and judging the homology relationship between the two codes according to the homology index.
The embodiment can accurately extract the code features, effectively cope with the influence of common code confusion methods, and greatly improve the detection efficiency while improving the homology detection accuracy.
For the two input code files to be detected, program analysis is performed according to programming language compilation principles, and the original program dependency graph PDG of each code is obtained as the basis of the code fingerprint. In another embodiment of the present invention, as shown in fig. 3, structure simplification, nesting removal and coloring are performed on the original program dependency graph PDG to obtain the simplified program dependency graph sPDG, which includes the following:
step 11, simplifying the structure of the original program dependence graph PDG according to a simplification principle;
step 12, removing the nested input nodes and output nodes from the nodes containing the nested relation, and removing the edges of the corresponding dependency relation to the outer layer function call nodes;
and step 13, classifying and coloring the nodes according to the statement types, and acquiring the simplified program dependence graph sPDG.
In another embodiment of the present invention, simplifying the structure of the original program dependency graph PDG includes applying the following simplification operations to the nodes of the graph according to the simplification principle: removing vertices that have only one outgoing edge and no incoming edge; removing vertices that have only one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; removing vertices that have no incoming or outgoing edges; and repeating the simplification operations until no node matching the simplification principle remains.
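The simplification loop described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `simplify_pdg`, the representation of the graph as a node set plus a set of `(source, target)` edge pairs, and the choice not to add a self-loop when bypassing a vertex are all assumptions made for the example.

```python
def simplify_pdg(nodes, edges):
    """Repeatedly apply the four removal rules until no vertex matches.

    nodes: iterable of hashable, sortable vertex names (hypothetical encoding)
    edges: iterable of (source, target) pairs
    """
    nodes = set(nodes)
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for v in sorted(nodes):
            preds = {u for (u, w) in edges if w == v}
            succs = {w for (u, w) in edges if u == v}
            degree = (len(preds), len(succs))
            # (0,1): only one outgoing edge; (1,0): only one incoming edge;
            # (1,1): one in and one out; (0,0): isolated vertex.
            if degree not in {(0, 1), (1, 0), (1, 1), (0, 0)}:
                continue
            if degree == (1, 1):
                (p,), (s,) = preds, succs
                # Bypass v with an edge from its input vertex to its output
                # vertex; skipping self-loops is an assumption of this sketch.
                if p != s and v not in (p, s):
                    edges.add((p, s))
            edges = {(u, w) for (u, w) in edges if v not in (u, w)}
            nodes.discard(v)
            changed = True
            break  # restart the scan, since degrees have changed
    return nodes, edges
```

Note that, read literally, the rules remove chains and tree-like fringes entirely; only vertices with richer connectivity survive, which is what keeps the resulting graph small.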
The nodes are classified according to statement type, then the nodes of different types are colored according to different colors, and each type is identified by a coloring number for comparison. Examples of classifications used are as follows: function calls, control statements, declaration statements, arithmetic statements, switch statements, logical expressions, jump statements, and return statements, among others. The details are shown in Table 1.
Type                     Node representation information                  Color number
Function call            Calling functions and system APIs                1
Control statement        if, switch, while, for                           2
Declaration statement    Variable declaration or formal parameters        3
Operation statement      Variable operation, increment/decrement          4
switch statement         case, default                                    5
Jump statement           goto, break, continue                            6
Conditional statement    <, >, ==, !=                                     7
Return statement         return                                           8
Others                   Others                                           0
Table 1
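As a small illustration of the coloring step, the mapping in Table 1 can be held in a lookup table. The type keys (`"function_call"`, `"switch_label"` and so on) and the function name `color_nodes` are hypothetical labels chosen for this sketch; only the color numbers come from Table 1.

```python
# Color numbers follow Table 1; the string keys are hypothetical labels.
COLOR_BY_TYPE = {
    "function_call": 1,
    "control": 2,
    "declaration": 3,
    "operation": 4,
    "switch_label": 5,
    "jump": 6,
    "condition": 7,
    "return": 8,
}


def color_nodes(node_types):
    """Map each node to its color number; unknown types fall back to 0 (Others)."""
    return {node: COLOR_BY_TYPE.get(kind, 0) for node, kind in node_types.items()}
```

Comparing color numbers instead of statement text lets two nodes match whenever they are of the same statement type, regardless of identifier names.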
In another embodiment of the present invention, the key Syntax information of the code is parsed based on the Abstract Syntax Tree, which is used as a component of the code fingerprint, as shown in fig. 4, and the following contents are included:
step 21, recording global variables, local variables and attributes thereof in the designated code domain to form a quadruple, wherein the quadruple comprises a scope of the variables, link attributes, storage types and names;
step 22, analyzing and recording the macro definition and the corresponding content thereof to form a triple, wherein the triple comprises a macro definition identifier, a content type and a name;
and step 23, analyzing a key data structure in the code based on the abstract syntax tree AST, and recording a custom structure body in the code in a sequence form.
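The records produced by steps 21 to 23 could be represented as plain tuples, for example as below. The class names, field names and the helper `struct_as_sequence` are hypothetical, and recording a structure as its name followed by its member types is only one possible reading of "sequence form".

```python
from typing import NamedTuple


class VariableRecord(NamedTuple):
    """Quadruple from step 21: scope, link attribute, storage type, name."""
    scope: str
    linkage: str
    storage: str
    name: str


class MacroRecord(NamedTuple):
    """Triple from step 22: macro definition identifier, content type, name."""
    identifier: str
    content_type: str
    name: str


def struct_as_sequence(name, members):
    """Flatten a custom structure into a sequence (assumed layout).

    members: list of (member_type, member_name) pairs in declaration order.
    """
    return (name,) + tuple(member_type for member_type, _ in members)
```

Because the records are hashable tuples, they can be placed directly into sets when the coincidence coefficients are calculated later.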
In another embodiment of the present invention, a system call sequence of a code execution path is extracted, a full path parameter vector set E of a target code is obtained, and a code fingerprint is constructed, as shown in fig. 5, which includes the following contents:
step 31, starting from the entry function, generating the call graph and the post-dominator tree of a function f, extracting the system call sequence k in a single execution path, and recording the set K of system call sequences over all possible execution paths;
step 32, for each function f in each system call sequence k, locating the function domain d in which f is located, parsing all parameters of f in the abstract syntax tree, determining the data source s of each parameter of f through static taint analysis so as to judge where the parameter value comes from, and combining it with the data type t of the parameter to form the parameter vector e_f = (d, f, t, s) of the function f; obtaining the parameter vector set E_k of the system call sequence k;
step 33, executing step 32 on each sequence k in the system call sequence set K, and acquiring the full-path parameter vector set E_K of the target code;
step 34, constructing the code fingerprint according to the full-path parameter vector set E_K of the target code.
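A simplified sketch of steps 31 to 33 follows. It replaces the call graph and post-dominator tree analysis with a plain depth-first enumeration over a hypothetical acyclic graph of call sites, and it takes the taint analysis result as given input, so it illustrates only the shape of the data. The names `enumerate_paths`, `system_call_sequences`, `parameter_vectors` and the `call_info` layout are assumptions, as is emitting one vector (d, f, t, s) per parameter.

```python
def enumerate_paths(graph, entry):
    """Yield every path from entry to a node without successors.

    graph: dict mapping a node to a list of successor nodes (assumed acyclic).
    """
    stack = [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        successors = graph.get(node, [])
        if not successors:
            yield path
        for nxt in successors:
            stack.append((nxt, path + [nxt]))


def system_call_sequences(graph, entry, system_calls):
    """Return the set K of system call sequences over all execution paths."""
    return {
        tuple(node for node in path if node in system_calls)
        for path in enumerate_paths(graph, entry)
    }


def parameter_vectors(sequence, call_info):
    """Build the parameter vector set E_k for one system call sequence k.

    call_info[f] = (d, [(t, s), ...]) gives the function domain d and, per
    parameter, its data type t and data source s (assumed precomputed).
    """
    vectors = set()
    for f in sequence:
        d, params = call_info[f]
        for t, s in params:
            vectors.add((d, f, t, s))
    return frozenset(vectors)
```

Applying `parameter_vectors` to every sequence in K gives one set per path; collecting those sets corresponds to the full-path parameter vector set E_K.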
In the homology determination of code fingerprints, in another embodiment of the present invention, the homology coefficients between code fingerprint components are calculated, as shown in fig. 6, and include the following:
step 41, for the simplified program dependency graphs sPDG of the target codes, seeking the maximum isomorphic subgraph between the simplified program dependency graphs sPDG through a progressive graph isomorphism solving algorithm, and calculating the isomorphism coefficient P_{S,T} between the simplified program dependency graphs sPDG;
step 42, calculating the coincidence coefficient C_{S,T} through the Jaccard algorithm according to the key syntax information obtained in step 2;
step 43, according to the full-path parameter vector set E_K of the target code obtained in step 3, solving the Jaccard similarity coefficients of the subsets of E_K, and taking the highest of these similarity coefficients as the system call sequence similarity coefficient A_{S,T}.
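Step 43 can be illustrated with a short sketch. It assumes that each code contributes a collection of per-path parameter vector sets and that every set of S is compared with every set of T; the function names and the convention that two empty sets score 0.0 are choices made for this example.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def call_sequence_similarity(path_sets_s, path_sets_t):
    """Highest pairwise Jaccard coefficient, used here as A_{S,T}."""
    return max(
        (jaccard(x, y) for x in path_sets_s for y in path_sets_t),
        default=0.0,
    )
```

Taking the maximum means that a single closely matching execution path is enough to yield a high coefficient, even when the other paths differ.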
In another embodiment, the original program dependency graph PDG of the target code is represented as a directed graph G = (V, E), where the node set V represents the set of predicate expressions or statements and E represents the data dependencies and control dependencies existing between them. Let G_1 = (V_1, E_1) and G_2 = (V_2, E_2) respectively denote the two simplified program dependency graphs sPDG. According to the result of the progressive graph isomorphism solving algorithm, the isomorphism coefficient P between the simplified program dependency graphs sPDG is calculated by an evaluation function, wherein P = 0 indicates that G_1 is a fully described subgraph of G_2.
For the obtained sequences of key syntax information, in another embodiment of the present invention, the coincidence coefficient is calculated by the Jaccard algorithm as follows: the coincidence coefficient of each single item of syntax information α is calculated from the sequences of syntax information α corresponding to the two codes S and T respectively; the key syntax information coincidence coefficient C_{S,T} is then calculated, where w_α is the weight of the syntax information α.
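One way to picture this calculation is the sketch below. The Jaccard coefficient itself is standard; combining the per-item coefficients as a weighted sum, and the example kinds `"variables"` and `"macros"`, are assumptions made for illustration, since the exact combining formula is not reproduced in the text above.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


def syntax_coincidence(info_s, info_t, weights):
    """Assumed form of C_{S,T}: sum over kinds α of w_α * Jaccard(S_α, T_α).

    info_s, info_t: dicts mapping a syntax information kind α to its records.
    weights: dict mapping each kind α to its weight w_α.
    """
    return sum(
        weight * jaccard(info_s.get(kind, ()), info_t.get(kind, ()))
        for kind, weight in weights.items()
    )
```

With weights that sum to 1, the result stays in the range 0 to 1, which keeps it comparable with the other coefficients.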
For the two input codes S and T, the sPDG isomorphism coefficient P_{S,T}, the syntax information coincidence coefficient C_{S,T} and the system call sequence similarity coefficient A_{S,T} are calculated. In other embodiments of the present invention, the homology index Homology(S, T) of the two codes S and T is then calculated by a formula over these coefficients, where w_P is the weight of the sPDG isomorphism coefficient, w_C is the weight of the syntax information coincidence coefficient, and w_A is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more pronounced the homology relationship between the input samples.
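A hedged sketch of the final combination step is given below. It assumes that the index is a weighted sum of the three coefficients and that each coefficient has already been normalized to the range 0 to 1 with larger values meaning greater similarity; a distance-like coefficient would have to be converted first. The default weights and the decision threshold are hypothetical values, not taken from the source.

```python
def homology_index(p, c, a, weights=(0.4, 0.3, 0.3)):
    """Assumed form: Homology(S, T) = w_P * P + w_C * C + w_A * A."""
    w_p, w_c, w_a = weights
    if abs(w_p + w_c + w_a - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1 in this sketch")
    return w_p * p + w_c * c + w_a * a


def is_homologous(p, c, a, threshold=0.6, weights=(0.4, 0.3, 0.3)):
    """Judge homology by comparing the index with a hypothetical threshold."""
    return homology_index(p, c, a, weights) >= threshold
```

In practice the weights and threshold would need to be tuned on labeled samples; the values here only make the example runnable.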
Corresponding to the above method, an embodiment of the present invention further provides a code homology detection device based on code fingerprints which, as shown in FIG. 7, comprises: a program simplification module 201, a syntax parsing module 202, a fingerprint construction module 203, a homology coefficient acquisition module 204 and a homology judgment module 205;
the program simplification module 201 is configured to analyze the dependency relationships of the two input codes S and T to obtain an original program dependency graph PDG, and to perform structure simplification, nesting removal and coloring on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
the syntax parsing module 202 is configured to parse the key syntax information of the code based on an abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit and a key data structure parsing unit, wherein the variable parsing unit is configured to record the global variables and local variables in a designated code domain together with their scope, linkage attribute and storage type, the macro definition parsing unit is configured to record each macro definition and its corresponding content type, and the key data structure parsing unit is configured to parse the structures defined in all classes and functions in the target code domain;
the fingerprint construction module 203 is configured to extract the system call sequences of the code execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
the homology coefficient acquisition module 204 is configured to calculate the homology coefficients between code fingerprints from the information obtained by the program simplification module, the syntax parsing module and the fingerprint construction module, wherein the homology coefficients comprise the isomorphism coefficient $P_{S,T}$ of the simplified program dependency graphs sPDG, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;
and the homology judgment module 205 is configured to calculate the homology index of the two codes S and T from the homology coefficients obtained by the homology coefficient acquisition module, and to judge the homology relationship between the two codes according to the homology index.
The effectiveness of the present invention is further illustrated by a specific example. FIG. 8 shows part of the contents of two input program code files. Program analysis is performed according to the compilation principles of the programming language to obtain the original program dependency graph PDG, which is represented as a directed graph G = (V, E): the node set V represents a set of predicate expressions or statements, and E represents the data dependencies and control dependencies among them. The original program dependency graph PDG serves as the basis of the code fingerprint. The structure of the original program dependency graph is simplified, with part of the simplified result shown schematically in FIG. 9. Nesting is then removed from the simplified graph, as shown in FIG. 10. The graph obtained after structure simplification and nesting removal is colored to obtain the simplified program dependency graph sPDG. An abstract syntax tree of the source code is constructed using LLVM or Clang, and the key syntax information of the code is parsed from the abstract syntax tree to form the code fingerprint; the parsing of the two input files is shown schematically in FIG. 11.
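The structure simplification step can be sketched on a plain edge list. The sketch below is only an illustration under stated assumptions: it uses the four vertex-removal rules listed in the claims, applies them in a single pass, and classifies vertices by their degrees in the original graph (the text does not say whether the rules are iterated). The function name `simplify_pdg` is hypothetical.

```python
def simplify_pdg(vertices, edges):
    """One pass of structure simplification over a directed graph.

    vertices: iterable of hashable, sortable vertex ids.
    edges: iterable of (src, dst) pairs whose endpoints are in vertices.
    Returns (remaining_vertices, remaining_edges) as sets.
    """
    vertices, edges = set(vertices), set(edges)
    # Degrees come from the ORIGINAL graph (assumption: rules are not iterated).
    indeg = {n: 0 for n in vertices}
    outdeg = {n: 0 for n in vertices}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1

    # Rule: a vertex with exactly one incoming and one outgoing edge is
    # removed and bypassed by an edge from its predecessor to its successor.
    for n in sorted(vertices):
        if indeg[n] == 1 and outdeg[n] == 1:
            preds = [u for (u, v) in edges if v == n]
            succs = [v for (u, v) in edges if u == n]
            edges -= {e for e in edges if n in e}
            if preds and succs and preds[0] != succs[0]:
                edges.add((preds[0], succs[0]))
            vertices.discard(n)

    # Rules: remove vertices with one outgoing edge and no incoming edge,
    # one incoming edge and no outgoing edge, or no edges at all.
    for n in sorted(vertices):
        if indeg[n] + outdeg[n] <= 1:
            edges -= {e for e in edges if n in e}
            vertices.discard(n)
    return vertices, edges
```

On a diamond 1→2→4, 1→3→4 with a tail 4→5 and an isolated vertex 6, this pass bypasses vertices 2 and 3, drops 5 and 6, and leaves the single edge 1→4.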
The parameter sequences of the system calls in the target code are then extracted to complete the code fingerprint, as shown in FIG. 12. Starting from an entry function such as main, a call graph and a post-dominator tree of a function f are generated, the system call sequence k of a single execution path is extracted, and the set K of system call sequences over all possible execution paths is recorded. For the function f in each system call sequence k, the function domain d in which f is located is determined, all parameters of f are analyzed in the abstract syntax tree, and the data source s of each parameter of f is determined by static taint analysis, that is, whether the parameter value comes from outside or from inside the function. Combined with the data type t of the parameter, this yields the parameter vector $e_f = (d, f, t, s)$ of the function f. Performing this operation over a system call sequence k gives the parameter vector set $E_k$ of k, and performing it for every sequence k in the set K gives the full-path parameter vector set $E_K$ of the target code. To mitigate the problem of $E_K$ becoming too large or containing too many invalid paths, only the paths whose parameter vector set satisfies $|E_k| \geq 5$ are retained.
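As a rough illustration of the data involved, the parameter vector (d, f, t, s) and the size filter on each path can be sketched as follows. The class name `ParamVector`, the function `full_path_vector_set`, the string encoding of each field, and the sample values are all hypothetical; only the four fields and the threshold of 5 come from the description.

```python
from typing import NamedTuple


class ParamVector(NamedTuple):
    d: str  # function domain in which the call is located
    f: str  # called function
    t: str  # data type of the parameter
    s: str  # data source from taint analysis: "external" or "internal"


def full_path_vector_set(paths, min_size=5):
    """Build the full-path set from per-path vector lists.

    paths: dict mapping a path id k to the list of ParamVector on that path.
    Returns {k: frozenset of vectors}, keeping only paths with at least
    min_size distinct vectors.
    """
    result = {}
    for k, vectors in paths.items():
        e_k = frozenset(vectors)
        if len(e_k) >= min_size:
            result[k] = e_k
    return result


# Hypothetical sample data: k1 has 3 vectors and is dropped, k2 has 5 and is kept.
example_paths = {
    "k1": [ParamVector("main", "read", t, "external")
           for t in ("int", "void *", "size_t")],
    "k2": [ParamVector("main", f, "int", "internal")
           for f in ("open", "read", "write", "lseek", "close")],
}
kept = full_path_vector_set(example_paths)
```

Using a frozenset for each path makes the later Jaccard comparison a plain set operation.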
For the simplified program dependency graphs sPDG of the target codes, a progressive graph isomorphism solving algorithm is used to find the maximum isomorphic subgraph between the sPDGs, and the sPDG graph isomorphism coefficient is calculated from the result. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the simplified program dependency graphs sPDG of the two input code files. The algorithm is as follows:
The number of steps n and the expansion size $m_i$ of each step are determined before the algorithm starts, where n and the $m_i$ are subject to a constraint and $m_i \geq 1$. The Isomorphh($G_{1,i}$, $G_2$) function, implemented with the open-source VFLib graph isomorphism framework, tests the graphs $G_{1,i}$ and $G_2$ for isomorphism, which yields the maximum subgraph of $G_1$ that is isomorphic to $G_2$.
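The progressive idea of growing a subgraph of the first graph step by step and re-testing it against the second graph can be sketched as below. This is not the VFLib-based procedure: a brute-force, edge-preserving embedding check stands in for the library call, and the vertex expansion order, the stopping rule (stop at the first step that no longer embeds) and the names `embeds` and `progressive_match` are all assumptions. It is exponential and only suitable for toy graphs.

```python
from itertools import permutations


def embeds(sub_vertices, sub_edges, g2_vertices, g2_edges):
    """Return an injective vertex mapping that sends every edge of the
    subgraph onto an edge of G2, or None if no such mapping exists."""
    sub_vertices = list(sub_vertices)
    for image in permutations(g2_vertices, len(sub_vertices)):
        mapping = dict(zip(sub_vertices, image))
        if all((mapping[u], mapping[v]) in g2_edges for u, v in sub_edges):
            return mapping
    return None


def progressive_match(g1_order, g1_edges, g2_vertices, g2_edges, steps):
    """Grow a subgraph of G1 by steps[i] vertices per step.

    g1_order: vertices of G1 in the (assumed) expansion order.
    steps: expansion sizes, each >= 1.
    Returns the mapping found for the largest prefix that still embeds.
    """
    best, size = {}, 0
    for m_i in steps:
        size += m_i
        chosen = g1_order[:size]
        induced = {(u, v) for (u, v) in g1_edges if u in chosen and v in chosen}
        mapping = embeds(chosen, induced, g2_vertices, g2_edges)
        if mapping is None:
            break  # assumption: stop at the first failing expansion
        best = mapping
    return best
```

For a chain a→b→c→d matched against x→y→z with steps of 2, 1 and 1 vertices, the first two expansions succeed and the third fails, so the returned mapping covers a, b and c.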
Based on the result of the progressive graph isomorphism solving algorithm, an evaluation function computes the ratio of the number of differing edges between the two graphs to the number of edges in the smaller graph.
The result P is the sPDG graph isomorphism coefficient of the simplified program dependency graphs. P = 0 indicates that $G_1$ is a complete subgraph of $G_2$.
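One possible reading of this ratio is sketched below. It assumes that the first graph is the smaller one, that a vertex mapping from the first graph into the second is already available, and that an edge counts as "differing" when it has an unmapped endpoint or its image is not an edge of the second graph. The function name `spdg_coefficient` is hypothetical, and the exact evaluation function of the patent may differ.

```python
def spdg_coefficient(g1_edges, g2_edges, mapping):
    """Ratio of G1 edges not reproduced in G2 to the number of G1 edges.

    g1_edges, g2_edges: sets of (src, dst) pairs; G1 is assumed to be the
    smaller graph. mapping: dict from G1 vertices to G2 vertices.
    """
    if not g1_edges:
        return 0.0
    differing = sum(
        1
        for (u, v) in g1_edges
        if u not in mapping
        or v not in mapping
        or (mapping[u], mapping[v]) not in g2_edges
    )
    return differing / len(g1_edges)
```

Under this reading a value of 0 means every edge of the first graph has a counterpart in the second, which matches the statement that P = 0 corresponds to a complete subgraph.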
For the key syntax information sequences obtained from the code, namely the variable quadruple sequence, the macro definition triple sequence and the structure sequence, the coincidence coefficient C is calculated with the Jaccard algorithm, as follows:
The coincidence coefficient $H_\alpha$ of a single kind of syntax information α is computed from the sequences of α that correspond to the two codes S and T; when both sequences are empty, $H_\alpha = 0$. The key syntax information coincidence coefficient is then calculated from these per-kind coefficients, where $w_\alpha$ is the weight of syntax information α. In this example, the variable information weight is set to 0.4, the macro definition information weight to 0.2, and the structure information weight to 0.4.
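A minimal sketch of this calculation follows, assuming that the per-kind coefficient is the standard Jaccard ratio of intersection size to union size and that the overall coefficient is the weighted sum of the per-kind coefficients. The dictionary keys and the function names are hypothetical; the weights 0.4, 0.2 and 0.4 and the empty-sequence convention come from the text.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention from the text: both empty gives 0
    return len(a & b) / len(a | b)


# Weights from the example: variables 0.4, macros 0.2, structures 0.4.
WEIGHTS = {"variables": 0.4, "macros": 0.2, "structs": 0.4}


def coincidence_coefficient(info_s, info_t, weights=WEIGHTS):
    """Weighted sum of per-kind Jaccard coefficients (assumed form of C).

    info_s, info_t: dicts mapping a kind of syntax information to the
    collection of records extracted from code S and code T.
    """
    return sum(
        w * jaccard(info_s.get(kind, ()), info_t.get(kind, ()))
        for kind, w in weights.items()
    )
```

For instance, if half of the variable quadruples coincide, the macro triples are identical and neither code defines a structure, the sketch gives 0.4 × 0.5 + 0.2 × 1 + 0.4 × 0 = 0.4.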
For the full-path parameter vector sets $E_K$ of the target codes, the Jaccard algorithm is used to calculate the similarity coefficients between the subsets of $E_K$, and the highest of these values is taken as the system call sequence similarity coefficient A.
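One way to read this step is to compare every per-path parameter vector set of one code with every per-path set of the other and keep the best Jaccard score. The sketch below follows that reading; the pairing of subsets across the two codes and the function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def call_sequence_similarity(paths_s, paths_t):
    """Highest Jaccard coefficient over all pairs of per-path vector sets.

    paths_s, paths_t: iterables of frozensets of parameter vectors, one
    frozenset per retained execution path of code S and code T.
    """
    return max(
        (jaccard(a, b) for a in paths_s for b in paths_t),
        default=0.0,
    )
```

Taking the maximum rather than an average means a single closely matching execution path is enough to produce a high coefficient.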
For the codes S and T, the sPDG graph isomorphism coefficient $P_{S,T}$, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$ are calculated, and the homology index Homology(S, T) of the two codes is then calculated by a formula in which $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntax information coincidence coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more evident the homology relationship between the input samples.
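The way the three coefficients could be combined is sketched below as a weighted sum. This form is an assumption, not the formula of the patent: since P = 0 denotes the closest structural match while a larger index denotes stronger homology, the sketch lets P contribute as 1 − P. The default weights and the function name `homology_index` are hypothetical.

```python
def homology_index(p, c, a, w_p=0.4, w_c=0.3, w_a=0.3):
    """Assumed weighted combination of the three homology coefficients.

    p: sPDG graph isomorphism coefficient (0 means complete subgraph).
    c: syntax information coincidence coefficient.
    a: system call sequence similarity coefficient.
    w_p, w_c, w_a: hypothetical weights; the text does not give values.
    """
    return w_p * (1.0 - p) + w_c * c + w_a * a
```

With these placeholder weights, a pair with p = 0.25, c = 0.4 and a = 0.5 scores 0.57, and a pair that matches perfectly on all three coefficients scores 1.0.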
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
The elements of the various examples and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and the components and steps of the examples have been described in a functional generic sense in the foregoing description for clarity of hardware and software interchangeability. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Those skilled in the art will appreciate that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, such as: read-only memory, magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A code homology detection method based on code fingerprints is characterized by comprising the following steps:
step 1, analyzing the dependency relationship of two input codes S and T to obtain an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
step 2, analyzing key syntax information of the code based on the abstract syntax tree;
step 3, extracting a system calling sequence of the code execution path, acquiring a full path parameter vector set of the target code, and constructing a code fingerprint;
step 4, calculating homology coefficients between the code fingerprints, wherein the homology coefficients comprise the isomorphism coefficient $P_{S,T}$ of the simplified program dependency graphs sPDG, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;
step 5, calculating the homology index of the two codes S and T according to the homology coefficients, and judging the homology relationship between the two codes according to the homology index.
2. The code homology detection method based on code fingerprints according to claim 1, wherein the original program dependency graph PDG is subjected to structure simplification, nesting removal and coloring processing in step 1 to obtain a simplified program dependency graph sPDG, and the method comprises the following steps:
step 11, simplifying the structure of the original program dependence graph PDG according to a simplification principle;
step 12, removing the nested input nodes and output nodes from the nodes containing the nested relation, and removing the edges of the corresponding dependency relation to the outer layer function call nodes;
and step 13, classifying and coloring the nodes according to the statement types, and acquiring the simplified program dependence graph sPDG.
3. The code homology detection method based on code fingerprint as claimed in claim 2, wherein the simplification rules in step 11 comprise: removing vertices that have only one outgoing edge and no incoming edge; removing vertices that have only one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges.
4. The method for detecting code homology based on code fingerprint as claimed in claim 1, wherein the key syntax information of the code is parsed based on the abstract syntax tree in the step 2, which comprises the following contents:
step 21, recording global variables, local variables and attributes thereof in the designated code domain to form a quadruple, wherein the quadruple comprises a scope of the variables, link attributes, storage types and names;
step 22, analyzing and recording the macro definition and the corresponding content thereof to form a triple, wherein the triple comprises a macro definition identifier, a content type and a name;
and step 23, analyzing a key data structure in the code based on the abstract syntax tree AST, and recording a custom structure body in the code in a sequence form.
5. The code homology detection method based on code fingerprint as claimed in claim 1, wherein the step 3 comprises the following steps:
step 31, starting from an entry function, generating a call graph and a post-dominator tree of a function f, extracting the system call sequence k of a single execution path, and recording the set K of system call sequences over all possible execution paths;
step 32, for the function f in each system call sequence k, locating the function domain d in which f is located, analyzing all parameters of f in the abstract syntax tree, determining the data source s of each parameter of f through static taint analysis to judge where the parameter value comes from, and combining the data type t of the parameter to form the parameter vector $e_f = (d, f, t, s)$ of the function f; obtaining the parameter vector set $E_k$ of the system call sequence k;
step 33, executing step 32 on each sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;
step 34, constructing the code fingerprint according to the full-path parameter vector set $E_K$ of the target code.
6. The code homology detection method based on code fingerprint as claimed in claim 1, wherein the step 4 comprises the following steps:
step 41, for the simplified program dependency graphs sPDG of the target codes, finding the maximum isomorphic subgraph between the simplified program dependency graphs sPDG through a progressive graph isomorphism solving algorithm, and calculating the isomorphism coefficient $P_{S,T}$ between the simplified program dependency graphs sPDG;
step 42, calculating the coincidence coefficient $C_{S,T}$ through the Jaccard algorithm according to the key syntax information obtained in step 2;
step 43, according to the full-path parameter vector sets $E_K$ of the target codes obtained in step 3, calculating the similarity coefficients between the subsets of $E_K$ through the Jaccard algorithm, and taking the highest value as the system call sequence similarity coefficient $A_{S,T}$.
7. The code homology detection method based on code fingerprints as claimed in claim 6, wherein in step 41 the original program dependency graph PDG is represented as a directed graph G = (V, E), the node set V represents a set of predicate expressions or statements, E represents the data dependencies and control dependencies among them, and $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ each represent a simplified program dependency graph sPDG; the isomorphism coefficient P between the simplified program dependency graphs sPDG is calculated by an evaluation function.
8. The code homology detection method based on code fingerprint as claimed in claim 6, wherein step 42 comprises: calculating the coincidence coefficient $H_\alpha$ of each single kind of syntax information α from the sequences of α corresponding to the two codes S and T; and calculating the key syntax information coincidence coefficient, where $w_\alpha$ is the weight of syntax information α.
9. The code homology detection method based on code fingerprint as claimed in claim 1, wherein in step 5 the homology index of the two codes S and T is calculated by a formula in which $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntax information coincidence coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient.
10. A code homology detection apparatus based on a code fingerprint, comprising: the system comprises a program simplifying module, a grammar analyzing module, a fingerprint constructing module, a homology coefficient acquiring module and a homology judging module;
the program simplification module is used for analyzing the dependency relationship between the two input codes S and T and acquiring an original program dependency graph PDG; carrying out structure simplification, nesting removal and coloring treatment on the original program dependency graph PDG to obtain a simplified program dependency graph sPDG;
the grammar parsing module is used for parsing the key syntax information of the code based on an abstract syntax tree and comprises a variable parsing unit, a macro definition parsing unit and a key data structure parsing unit, wherein the variable parsing unit is used for recording the global variables and local variables in a designated code domain together with their scope, linkage attribute and storage type, the macro definition parsing unit is used for recording each macro definition and its corresponding content type, and the key data structure parsing unit is used for parsing the structures defined in all classes and functions in the target code domain;
the fingerprint construction module is used for extracting a system calling sequence of a code execution path, acquiring a full-path parameter vector set of a target code and constructing a code fingerprint;
a homology coefficient obtaining module for calculating the homology coefficients between code fingerprints according to the information obtained by the program simplifying module, the grammar parsing module and the fingerprint constructing module, wherein the homology coefficients comprise the isomorphism coefficient $P_{S,T}$ of the simplified program dependency graphs sPDG, the syntax information coincidence coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;
And the homology judging module is used for calculating the homology indexes of the two codes S and T according to the homology coefficients obtained by the homology coefficient obtaining module and judging the homology relation between the two codes according to the homology indexes.
CN201710375425.4A 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint Active CN107169358B (en)

Publications (2)

Publication Number Publication Date
CN107169358A CN107169358A (en) 2017-09-15
CN107169358B true CN107169358B (en) 2019-10-08

Family

ID=59820829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710375425.4A Active CN107169358B (en) 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint

Country Status (1)

Country Link
CN (1) CN107169358B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399321B (en) * 2017-11-03 2021-05-18 西安邮电大学 Software local plagiarism detection method based on dynamic instruction dependence graph birthmark
CN107967152B (en) * 2017-12-12 2020-06-19 西安交通大学 Software local plagiarism evidence generation method based on minimum branch path function birthmarks
CN108287996A (en) * 2018-01-08 2018-07-17 北京工业大学 A kind of malicious code obscures feature cleaning method
CN108229170B (en) * 2018-02-02 2020-05-12 中科软评科技(北京)有限公司 Software analysis method and apparatus using big data and neural network
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 A kind of detection method and device of code similarity
CN110555305A (en) * 2018-05-31 2019-12-10 武汉安天信息技术有限责任公司 Malicious application tracing method based on deep learning and related device
CN109101235B (en) * 2018-06-05 2021-03-19 北京航空航天大学 Intelligent analysis method for software program
CN109190653B (en) * 2018-07-09 2020-06-05 四川大学 Malicious code family homology analysis method based on semi-supervised density clustering
CN109101816B (en) * 2018-08-10 2022-02-08 北京理工大学 Malicious code homology analysis method based on system call control flow graph
CN109918128B (en) * 2019-03-25 2022-04-08 湘潭大学 Code similarity detection method and system based on relation variable graph
CN110489973A (en) * 2019-08-06 2019-11-22 广州大学 A kind of intelligent contract leak detection method, device and storage medium based on Fuzz
CN110955758A (en) * 2019-12-18 2020-04-03 中国电子技术标准化研究院 Code detection method, code detection server and index server
CN111291373B (en) * 2020-02-03 2022-06-14 思客云(北京)软件技术有限公司 Method, apparatus and computer-readable storage medium for analyzing data pollution propagation
CN113064633A (en) * 2021-03-26 2021-07-02 山东师范大学 Automatic code abstract generation method and system
CN113138924B (en) * 2021-04-23 2023-10-31 扬州大学 Thread safety code identification method based on graph learning
CN113434145A (en) * 2021-06-09 2021-09-24 华东师范大学 Program code similarity measurement method based on abstract syntax tree path context
CN114879974B (en) * 2022-06-09 2024-09-13 西安交通大学 Implicit dependency pattern analysis method based on CPG+ graph
CN115129364B (en) * 2022-07-05 2023-04-18 四川大学 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN104933364A (en) * 2015-07-08 2015-09-23 中国科学院信息工程研究所 Automatic malicious code homology judgment method and system based on calling behaviors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8429628B2 (en) * 2007-12-28 2013-04-23 International Business Machines Corporation System and method for comparing partially decompiled software

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fingerprint selection method for code similarity detection; Huang Liuliu et al.; Computer Engineering and Applications; 2010-09-21 (No. 27); 169-171 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant