CN107169358B - Code homology detection method and its device based on code fingerprint - Google Patents

Publication number
CN107169358B
Authority
CN
China
Prior art keywords
code
dependency graph
fingerprint
spdg
coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710375425.4A
Other languages
Chinese (zh)
Other versions
CN107169358A (en
Inventor
魏强
刘臻
曹琰
尹中旭
彭建山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Red Neurons Co Ltd
PLA Information Engineering University
Original Assignee
Shanghai Red Neurons Co Ltd
PLA Information Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Red Neurons Co Ltd and PLA Information Engineering University
Priority to CN201710375425.4A
Publication of CN107169358A
Application granted
Publication of CN107169358B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a code homology detection method based on code fingerprints, and a device therefor. The method comprises: performing dependency analysis on the input code to obtain an original program dependency graph (PDG); applying structure simplification, nesting removal, and coloring to the original PDG to obtain a simplified program dependency graph (sPDG); parsing the key syntactic information of the code based on the abstract syntax tree; extracting the system call sequences of the code's execution paths and obtaining the full-path parameter vector set of the target code, thereby constructing the code fingerprint; calculating the homology coefficients between code fingerprint components; and calculating the homology index of the two pieces of code S and T from the homology coefficients, the homology index being used to determine the affinity between the two pieces of code. Building on similarity analysis, the invention takes both code semantics and behavior into account, uses lightweight features and a simplification mechanism, and measures the affinity between pieces of code from multiple angles, so that detection efficiency is improved while accuracy is maintained.

Description

Code homology detection method and its device based on code fingerprint
Technical field
The invention belongs to the technical field of computer software applications, and in particular relates to a code homology detection method based on code fingerprints and a device therefor.
Background
As demand for Internet applications grows and code iteration accelerates, programmers are expected to develop faster and more efficiently. On software development pipelines, template-based secondary development and reuse of existing components are common; to meet new requirements, developers also routinely borrow code from open-source repositories on the Internet. As a result, homologous code keeps spreading through different channels, and the defects and errors hidden in that code spread widely with it. Meanwhile, as computer security and virus detection techniques improve, existing malicious code on the Internet, such as macro viruses, malicious VBS scripts, and malicious JavaScript scripts, is increasingly likely to be detected. To evade detection, attackers modify the content or transform the form of existing code to improve its survivability. Versions of the same malicious code therefore share an inherent homology, which is an important basis for detecting it.
Homology detection is an important topic in computer program research. Current homology detection techniques for software source code fall into three broad types: text-based, structure-based, and semantics-based.
1) Text-based software homology detection operates on the source text itself, for example detection based on text similarity or on text attributes. One benefit of treating source code as text is that the analysis is independent of the programming language used. However, because it ignores the linguistic characteristics of code, this class of methods is generally weak against code obfuscation. Simple obfuscation such as renaming variables and functions, inserting junk code, or reordering statements can significantly degrade detection without affecting functionality. Such techniques can therefore perform only simple, text-level homology detection and are quite limited.
2) Structure-based software homology detection analyzes the code structure and expresses it in an intermediate form that can be compared; common approaches are token-based, tree-based, and graph-based. These techniques detect better than text-based methods and have some resistance to common obfuscation, but their computational complexity depends on the intermediate representation, and complex structures incur a large performance overhead during detection.
3) Semantics-based software homology detection either extracts features such as control flow, data flow, and standard API flow on the basis of static semantic analysis, characterizing program behavior from different angles, or compiles and executes the source code and records the instruction stream and system call sequence to characterize program behavior. In essence, these techniques characterize the semantics and behavior of a program, and they cope more effectively with the challenges that code obfuscation poses for homology detection. However, semantics-based methods do not effectively cover the characteristics of the code itself, and accurate semantic analysis is difficult to carry out.
Summary of the invention
To address the shortcomings of the prior art, the invention provides a code homology detection method based on code fingerprints and a device therefor. It addresses the weak resistance to obfuscation and the low detection efficiency of existing software source code detection: it extracts code features accurately, copes with the effects of common code obfuscation techniques, improves both the efficiency and the accuracy of homology detection, and helps prevent the spread of malicious code.
According to the design provided by the invention, a code homology detection method based on code fingerprints comprises the following steps:
Step 1: perform dependency analysis on the two pieces of input code S and T to obtain the original program dependency graph (PDG); then apply structure simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
Step 2: parse the key syntactic information of the code based on the abstract syntax tree;
Step 3: extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
Step 4: calculate the homology coefficients between code fingerprint components, namely the sPDG isomorphism coefficient P_{S,T}, the syntactic information overlap coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
Step 5: calculate the homology index of S and T from the homology coefficients, and use the homology index to determine the affinity between the two pieces of code.
In step 1 above, applying structure simplification, nesting removal, and coloring to the original PDG to obtain the sPDG comprises:
Step 11: simplify the structure of the original PDG according to the simplification principles;
Step 12: for each node that contains a nesting relationship, remove its nested internal input and output nodes, and move the edges of the corresponding dependencies onto the outer function call node;
Step 13: classify and color the nodes by statement type to obtain the sPDG.
The simplification principles in step 11 are: remove any vertex that has exactly one outgoing edge and no incoming edge; remove any vertex that has exactly one incoming edge and no outgoing edge; remove any vertex that has exactly one incoming edge and one outgoing edge, and add an edge from its input vertex to its output vertex; remove any vertex that has no incoming or outgoing edges.
In step 2 above, parsing the key syntactic information of the code based on the abstract syntax tree comprises:
Step 21: record the global and local variables in the specified code domain, together with their attributes, as four-tuples, each consisting of the variable's scope, linkage attribute, storage class, and name;
Step 22: parse and record macro definitions and their corresponding content as triples, each consisting of the macro definition identifier, content type, and name;
Step 23: parse the key data structures in the code based on the abstract syntax tree (AST), and record the user-defined structs in the code as a sequence.
Step 3 above comprises:
Step 31: starting from the entry function, generate the call graph and post-dominator tree of function f, extract the system call sequence k on each single execution path, and record the set K of system call sequences over all possible execution paths;
Step 32: for each function f in each system call sequence k, locate the function domain d it belongs to and parse all parameters of f in the abstract syntax tree; use static taint analysis to determine the data source s of each parameter of f, i.e., where the parameter value comes from; combine this with the parameter's data type t to form the parameter vector e_f = (d, f, t, s) of function f; this yields the parameter vector set E_k of system call sequence k;
Step 33: perform step 32 on every sequence k in the system call sequence set K to obtain the full-path parameter vector set E_K of the target code;
Step 34: construct the code fingerprint from the full-path parameter vector set E_K of the target code.
Step 3 above comprises:
Step 31: separately calculate the domain relevance, recommendation response rate, and recommendation satisfaction rate of each recommender in the initial recommender set;
Step 32: set acceptance thresholds for the recommendation response rate, recommendation satisfaction rate, and domain relevance; screen all recommenders in the initial recommender set against the acceptance thresholds, and remove from the initial set every recommender that falls below a threshold;
Step 33: obtain the candidate recommender set from the screening.
Step 4 above comprises:
Step 41: for the sPDGs of the target code, use a progressive graph isomorphism solving algorithm to find the maximum isomorphic subgraph between the sPDGs, and calculate the isomorphism coefficient P_{S,T} between them;
Step 42: from the key syntactic information obtained in step 2, calculate the overlap coefficient C_{S,T} using the Jaccard algorithm;
Step 43: from the full-path parameter vector sets E_K of the target code obtained in step 3, compute the Jaccard similarity coefficients of the subsets of E_K and take the maximum as the system call sequence similarity coefficient A_{S,T}.
Preferably, in step 41, the original PDG is represented as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies between the parts. Let G1 = (V1, E1) and G2 = (V2, E2) denote the two sPDGs; the isomorphism coefficient P between the sPDGs is calculated by an evaluation function [formula not reproduced in the source text].
Preferably, step 42 comprises: calculating the overlap coefficient of a single kind of syntactic information α [formula not reproduced in the source text] from the two sequences of syntactic information α that correspond to the two pieces of code S and T, respectively; and calculating the key syntactic information overlap coefficient C_{S,T} [formula not reproduced in the source text], where w_α is the weight of syntactic information α.
In step 5 above, the homology index of the two pieces of code S and T is calculated by a formula [formula not reproduced in the source text], where w_P is the weight of the sPDG isomorphism coefficient, w_C is the weight of the syntactic information overlap coefficient, and w_A is the weight of the system call sequence similarity coefficient.
A code homology detection device based on code fingerprints comprises a program simplification module, a syntax parsing module, a fingerprint construction module, a homology coefficient acquisition module, and a homology determination module;
The program simplification module is configured to perform dependency analysis on the two pieces of input code S and T to obtain the original program dependency graph (PDG), and to apply structure simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
The syntax parsing module is configured to parse the key syntactic information of the code based on the abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, where the variable parsing unit records the global and local variables in the specified code domain together with their scope, linkage attribute, and storage class; the macro definition parsing unit records macro definitions and their corresponding content types; and the key data structure parsing unit parses the structs defined in all classes and functions in the target code domain;
The fingerprint construction module is configured to extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
The homology coefficient acquisition module is configured to calculate, from the information obtained by the program simplification module, the syntax parsing module, and the fingerprint construction module, the homology coefficients between code fingerprint components, namely the sPDG isomorphism coefficient P_{S,T}, the syntactic information overlap coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
The homology determination module is configured to calculate the homology index of the two pieces of code S and T from the homology coefficients obtained by the homology coefficient acquisition module, and to determine the affinity between the two pieces of code from the homology index.
Beneficial effects of the present invention:
The invention targets source code homology determination. Building on similarity analysis, it takes both code semantics and behavior into account, uses lightweight features and a simplification mechanism to improve detection efficiency, and measures the affinity between pieces of code from multiple angles. It addresses the following problems of the prior art: a) the inability to cope with compound code obfuscation that combines several techniques such as format changes, renaming, junk code insertion, and statement reordering; b) detection methods based on complex structures and algorithms achieve high accuracy but are computationally expensive and inefficient, so accuracy and efficiency cannot both be achieved. The invention improves detection efficiency while maintaining accuracy.
The invention abstracts the logic and features of code through a code fingerprint: the program dependency graph expresses data flow and control flow relationships, and the fingerprint additionally incorporates the syntactic and behavioral features of the code. This addresses the weakness of existing code homology detection, which focuses on code text and feature similarity and is poor at reflecting the internal logic of code and its deeper associations. The invention substantially improves the efficiency of homology detection while maintaining high accuracy, helps prevent the spread of malicious code, provides technical support for detecting and determining the homology of computer program source code, and offers useful guidance for computer network security and virus detection techniques.
Brief description of the drawings:
Fig. 1 is a schematic flow chart of the method of the invention;
Fig. 2 is a schematic flow chart of the homology analysis method in an embodiment;
Fig. 3 is a schematic flow chart of obtaining the simplified program dependency graph sPDG in an embodiment;
Fig. 4 is a schematic flow chart of parsing the key syntactic information of the code based on the abstract syntax tree in an embodiment;
Fig. 5 is a schematic flow chart of constructing the code fingerprint in an embodiment;
Fig. 6 is a schematic flow chart of calculating the homology coefficients between code fingerprint components in an embodiment;
Fig. 7 is a schematic diagram of the device of the invention;
Fig. 8 illustrates the input code in an embodiment;
Fig. 9 illustrates part of the effect of simplifying the structure of the original program dependency graph in an embodiment;
Fig. 10 illustrates the nesting removal process in an embodiment;
Fig. 11 illustrates parsing based on the abstract syntax tree in an embodiment;
Fig. 12 illustrates the extraction of system call parameters from the target code in an embodiment.
Detailed description of the embodiments:
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and the technical solutions.
Current research on code homology detection methods is mostly based on a single type of feature. Coarse-grained features improve detection efficiency but reduce accuracy, while fine-grained features improve accuracy but bring a computational performance bottleneck. How to cope with complex code obfuscation under the constraint of efficient detection, abstracting code logic and summarizing code features accurately, is an important question that currently needs study.
An embodiment provides a code homology detection method based on code fingerprints which, as shown in Fig. 1, comprises the following steps:
Step 1: perform dependency analysis on the two pieces of input code S and T to obtain the original program dependency graph (PDG); then apply structure simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
Step 2: parse the key syntactic information of the code based on the abstract syntax tree;
Step 3: extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
Step 4: calculate the homology coefficients between code fingerprint components, namely the sPDG isomorphism coefficient P_{S,T}, the syntactic information overlap coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
Step 5: calculate the homology index of S and T from the homology coefficients, and use the homology index to determine the affinity between the two pieces of code.
This embodiment extracts code features accurately and copes with the effects of common code obfuscation techniques, improving the accuracy of homology detection while greatly improving its efficiency.
For the two input code files to be detected, program analysis is carried out according to the compilation principles of their programming language to obtain the original PDG of the code, which serves as the basis of the code fingerprint. In another embodiment of the invention, as shown in Fig. 3, applying structure simplification, nesting removal, and coloring to the original PDG to obtain the sPDG comprises:
Step 11: simplify the structure of the original PDG according to the simplification principles;
Step 12: for each node that contains a nesting relationship, remove its nested internal input and output nodes, and move the edges of the corresponding dependencies onto the outer function call node;
Step 13: classify and color the nodes by statement type to obtain the sPDG.
In another embodiment of the invention, simplifying the structure of the original PDG means applying the following operations to the nodes of the graph according to the simplification principles: remove any vertex that has exactly one outgoing edge and no incoming edge; remove any vertex that has exactly one incoming edge and no outgoing edge; remove any vertex that has exactly one incoming edge and one outgoing edge, and add an edge from its input vertex to its output vertex; remove any vertex that has no incoming or outgoing edges. These operations are repeated until no node satisfies any simplification principle.
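The four removal rules and the repeat-until-stable loop can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `simplify_pdg`, the representation of the graph as a vertex set plus a set of `(source, target)` edge pairs, and the choice to skip bypass edges that would form self-loops are all assumptions made for the example.

```python
def simplify_pdg(vertices, edges):
    """Repeatedly apply the four removal rules until no vertex matches."""
    vertices, edges = set(vertices), set(edges)
    changed = True
    while changed:
        changed = False
        for v in sorted(vertices):
            preds = {u for (u, w) in edges if w == v}
            succs = {w for (u, w) in edges if u == v}
            degree = (len(preds), len(succs))
            # (in, out) patterns covered by the four simplification principles
            if degree not in {(0, 1), (1, 0), (1, 1), (0, 0)}:
                continue
            bypass = None
            if degree == (1, 1):
                (p,), (s,) = preds, succs
                # Assumption: do not add a bypass edge that would be a self-loop.
                if len({p, s, v}) == 3:
                    bypass = (p, s)
            # Drop every edge touching v, then reconnect around it if needed.
            edges = {(u, w) for (u, w) in edges if v not in (u, w)}
            if bypass:
                edges.add(bypass)
            vertices.remove(v)
            changed = True
            break  # degrees have changed, so rescan from the start
    return vertices, edges


# A densely connected core survives; pendant and pass-through vertices do not.
core = {("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "a"), ("c", "b")}
extra = {("c", "d"), ("e", "a"), ("b", "m"), ("m", "c")}
kept_vertices, kept_edges = simplify_pdg("abcdem", core | extra)
```

Note that, applied literally, the rules collapse a plain chain such as a -> b -> c to an empty graph, so only vertices with richer connectivity remain in the result.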
Nodes are classified by statement type; nodes of different types are given different colors, and each type is identified by a coloring number to facilitate comparison. The classification used in this example is: function calls, control statements, declaration statements, arithmetic statements, switch statements, logical expressions, jump instructions, return statements, and so on. Table 1 gives the details.
Type                  | Node information                          | Coloring number
Function call         | Called functions and system APIs          | 1
Control statement     | if, switch, while, for                    | 2
Declaration statement | Variable declarations, formal parameters  | 3
Arithmetic statement  | Variable operations, increment/decrement  | 4
Switch statement      | case, default                             | 5
Jump instruction      | goto, break, continue                     | 6
Conditional statement | <, >, ==, !=                              | 7
Return statement      | return                                    | 8
Other                 | Other                                     | 0
Table 1
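The coloring step amounts to a lookup from statement type to coloring number. The sketch below uses the numbers from Table 1; the string keys such as `"function_call"` and the function name `color_nodes` are hypothetical labels chosen for the example, and the classification of each node is assumed to have been done already.

```python
# Coloring numbers taken from Table 1; the key names are illustrative only.
COLOR_BY_TYPE = {
    "function_call": 1,
    "control": 2,
    "declaration": 3,
    "arithmetic": 4,
    "switch_label": 5,
    "jump": 6,
    "conditional": 7,
    "return": 8,
}
OTHER = 0  # any statement type not listed in the table


def color_nodes(node_types):
    """Map each node id to the coloring number of its statement type."""
    return {node: COLOR_BY_TYPE.get(kind, OTHER) for node, kind in node_types.items()}


colors = color_nodes({"n1": "function_call", "n2": "jump", "n3": "inline_asm"})
```

Comparing integer colors rather than statement text is what lets two graphs be matched even when identifiers have been renamed.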
The abstract syntax tree (AST) of the source code is built with LLVM or Clang. In another embodiment of the invention, the key syntactic information of the code is parsed based on the AST and serves as a component of the code fingerprint; as shown in Fig. 4, this comprises:
Step 21: record the global and local variables in the specified code domain, together with their attributes, as four-tuples, each consisting of the variable's scope, linkage attribute, storage class, and name;
Step 22: parse and record macro definitions and their corresponding content as triples, each consisting of the macro definition identifier, content type, and name;
Step 23: parse the key data structures in the code based on the AST, and record the user-defined structs in the code as a sequence.
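One way to hold the records from steps 21 to 23 is shown below. It is only a data-layout sketch: the class and field names (`VariableInfo`, `MacroInfo`, `syntax_info`) and the sample values are assumptions, and the AST traversal that would populate them is not shown.

```python
from typing import NamedTuple


class VariableInfo(NamedTuple):
    """Four-tuple of step 21 (field names are illustrative)."""
    scope: str
    linkage: str
    storage_class: str
    name: str


class MacroInfo(NamedTuple):
    """Triple of step 22 (field names are illustrative)."""
    macro_id: str
    content_type: str
    name: str


def syntax_info(variables, macros, structs):
    """Bundle the three kinds of key syntactic information for one code sample."""
    return {
        "variables": set(variables),
        "macros": set(macros),
        "structs": list(structs),  # step 23 records structs as a sequence
    }


info = syntax_info(
    [VariableInfo("global", "external", "static", "counter")],
    [MacroInfo("M1", "integer", "BUF_SIZE")],
    ["packet_header"],
)
```

Using hashable tuples makes the records directly usable in set operations such as a Jaccard comparison.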
In another embodiment of the invention, the system call sequences of the code's execution paths are extracted and the full-path parameter vector set E_K of the target code is obtained to construct the code fingerprint; as shown in Fig. 5, this comprises:
Step 31: starting from the entry function, generate the call graph and post-dominator tree of function f, extract the system call sequence k on each single execution path, and record the set K of system call sequences over all possible execution paths;
Step 32: for each function f in each system call sequence k, locate the function domain d it belongs to and parse all parameters of f in the abstract syntax tree; use static taint analysis to determine the data source s of each parameter of f, i.e., where the parameter value comes from; combine this with the parameter's data type t to form the parameter vector e_f = (d, f, t, s) of function f; this yields the parameter vector set E_k of system call sequence k;
Step 33: perform step 32 on every sequence k in the system call sequence set K to obtain the full-path parameter vector set E_K of the target code;
Step 34: construct the code fingerprint from the full-path parameter vector set E_K of the target code.
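Steps 32 and 33 can be pictured as building one set of (d, f, t, s) tuples per execution path. The sketch below assumes that the call-graph walk and the static taint analysis have already produced, for every call on a path, its domain, its name, and a list of (type, source) pairs; the names `ParamVector`, `param_vectors_for_path`, and `full_path_vectors`, the one-vector-per-parameter reading, and the sample data are all assumptions.

```python
from typing import NamedTuple


class ParamVector(NamedTuple):
    domain: str     # d: function domain where the call is located
    function: str   # f: called function
    data_type: str  # t: data type of the parameter
    source: str     # s: data source, e.g. "external" or "internal"


def param_vectors_for_path(path_calls):
    """Build E_k for one path from (domain, function, [(type, source), ...]) entries."""
    return {
        ParamVector(d, f, t, s)
        for d, f, params in path_calls
        for t, s in params
    }


def full_path_vectors(paths):
    """Build E_K as the list of E_k over all recorded execution paths."""
    return [param_vectors_for_path(p) for p in paths]


paths = [
    [("io", "open", [("char*", "external"), ("int", "internal")])],
    [("io", "close", [("int", "internal")])],
]
E_K = full_path_vectors(paths)
```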
When code fingerprints are used for homology determination, in another embodiment of the invention, calculating the homology coefficients between code fingerprint components comprises, as shown in Fig. 6:
Step 41: for the sPDGs of the target code, use a progressive graph isomorphism solving algorithm to find the maximum isomorphic subgraph between the sPDGs, and calculate the isomorphism coefficient P_{S,T} between them;
Step 42: from the key syntactic information obtained in step 2, calculate the overlap coefficient C_{S,T} using the Jaccard algorithm;
Step 43: from the full-path parameter vector sets E_K of the target code obtained in step 3, compute the Jaccard similarity coefficients of the subsets of E_K and take the maximum as the system call sequence similarity coefficient A_{S,T}.
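Step 43 can be read as comparing every per-path set from S with every per-path set from T and keeping the best Jaccard score. The sketch below follows that reading; the function names and the conventions for empty inputs (two empty sets score 1.0, no paths at all scores 0.0) are assumptions, not taken from the patent.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def syscall_similarity(E_S, E_T):
    """Highest Jaccard score over all pairs of per-path sets from S and T."""
    return max((jaccard(es, et) for es in E_S for et in E_T), default=0.0)


A = syscall_similarity([{1, 2, 3}, {4}], [{1, 2}, {5}])
```

Here the best pair is {1, 2, 3} against {1, 2}, which shares two of three distinct elements.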
In another embodiment, the original PDG of the target code is represented as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies between the parts. Let G1 = (V1, E1) and G2 = (V2, E2) denote the two sPDGs. Based on the result of the progressive graph isomorphism solving algorithm, the isomorphism coefficient P between the sPDGs is calculated by an evaluation function [formula not reproduced in the source text]; P = 0 indicates that G1 is entirely a subgraph of G2.
For the key syntactic information sequences obtained from the code, in another embodiment of the invention, the overlap coefficient is calculated using the Jaccard algorithm as follows: the overlap coefficient of a single kind of syntactic information α is calculated [formula not reproduced in the source text] from the two sequences of syntactic information α that correspond to the two pieces of code S and T, respectively; the key syntactic information overlap coefficient C_{S,T} is then calculated [formula not reproduced in the source text], where w_α is the weight of syntactic information α.
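Because the formulas themselves are not reproduced in this text, the following is only one plausible reading: the per-kind coefficient is taken to be the standard Jaccard ratio, and C_{S,T} is taken to be the weighted sum of the per-kind coefficients. Both choices, along with the function names and the sample weights, are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def overlap_coefficient(info_S, info_T, weights):
    """Assumed form: sum over kinds of w_alpha * Jaccard(S_alpha, T_alpha)."""
    return sum(
        w * jaccard(info_S.get(kind, ()), info_T.get(kind, ()))
        for kind, w in weights.items()
    )


info_S = {"variables": {"a", "b"}, "macros": {"M"}}
info_T = {"variables": {"a"}, "macros": {"M"}}
C = overlap_coefficient(info_S, info_T, {"variables": 0.5, "macros": 0.5})
```

With these inputs the variables overlap by one half and the macros fully, giving 0.5 * 0.5 + 0.5 * 1.0.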
For the two pieces of input code S and T, the sPDG isomorphism coefficient P_{S,T}, the syntactic information overlap coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T} are calculated. In other embodiments of the invention, the homology index of S and T is then calculated by a formula [formula not reproduced in the source text], where w_P is the weight of the sPDG isomorphism coefficient, w_C is the weight of the syntactic information overlap coefficient, and w_A is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more evident the affinity between the input samples.
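The combining formula is missing from this text, so the sketch below should be read as a guess at its shape rather than the patented formula: it assumes a plain weighted sum, and it assumes that all three inputs have already been expressed as similarity scores where larger means more alike. The function name and the sample weights are hypothetical.

```python
def homology_index(p, c, a, w_p, w_c, w_a):
    """Assumed form: weighted sum of the three coefficients.

    p, c, a are taken to be similarity scores (larger = more similar);
    if the isomorphism coefficient is defined as a distance instead,
    it would need converting before being passed in.
    """
    return w_p * p + w_c * c + w_a * a


score = homology_index(p=0.5, c=0.75, a=1.0, w_p=0.4, w_c=0.3, w_a=0.3)
```

Keeping the weights as explicit parameters makes it easy to shift emphasis between structure, syntax, and behavior.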
Corresponding to the above method, an embodiment of the invention also provides a code homology detection device based on code fingerprints which, as shown in Fig. 7, comprises: a program simplification module 201, a syntax parsing module 202, a fingerprint construction module 203, a homology coefficient acquisition module 204, and a homology determination module 205;
The program simplification module 201 is configured to perform dependency analysis on the two pieces of input code S and T to obtain the original program dependency graph (PDG), and to apply structure simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
The syntax parsing module 202 is configured to parse the key syntactic information of the code based on the abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, where the variable parsing unit records the global and local variables in the specified code domain together with their scope, linkage attribute, and storage class; the macro definition parsing unit records macro definitions and their corresponding content types; and the key data structure parsing unit parses the structs defined in all classes and functions in the target code domain;
The fingerprint construction module 203 is configured to extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
The homology coefficient acquisition module 204 is configured to calculate, from the information obtained by the program simplification module, the syntax parsing module, and the fingerprint construction module, the homology coefficients between code fingerprint components, namely the sPDG isomorphism coefficient P_{S,T}, the syntactic information overlap coefficient C_{S,T}, and the system call sequence similarity coefficient A_{S,T};
The homology determination module 205 is configured to calculate the homology index of the two pieces of code S and T from the homology coefficients obtained by the homology coefficient acquisition module, and to determine the affinity between the two pieces of code from the homology index.
The effectiveness of the invention is further illustrated below with a concrete example. Fig. 8 shows part of the content of the two input program code files. Program analysis is performed according to the compilation principles of the programming language to obtain the original program dependency graph PDG. The PDG is represented as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies between them; the original PDG serves as the basis of the code fingerprint. Structure simplification is applied to the original PDG; part of the simplification effect is shown in Fig. 9. Nesting removal is then applied to the structurally simplified PDG; the process is shown in Fig. 10. Coloring is applied to the PDG after structure simplification and nesting removal, yielding the simplified program dependency graph sPDG. The abstract syntax tree of the source code is built using LLVM or Clang, and the key syntactic information of the code is parsed from it as a component of the code fingerprint; the parsing of the two input files is illustrated in Fig. 11.
The parameter sequences of the system calls in the target code are extracted to complete the code fingerprint, as shown in Fig. 12. Starting from an entry function such as main, the call graph and post-dominator tree of function f are generated, the system call sequence k along a single execution path is extracted, and the set K of system call sequences over all possible execution paths is recorded. For each function f in a system call sequence k, the function domain d in which it resides is located, all parameters of f are parsed from the abstract syntax tree, and static taint analysis determines the data source s of each parameter of f, that is, whether the parameter value is passed in from outside or originates inside the function. Combined with the parameter data type t, this forms a parameter vector $e_f = (d, f, t, s)$ of function f. Applying these operations to one system call sequence k yields the parameter vector set $E_k$ of that sequence; applying them to every sequence k in K yields the full-path parameter vector set $E_K$ of the target code. To mitigate the problem of $E_K$ being too large or containing too many invalid paths, only paths whose parameter vector set satisfies $|E_k| \ge 5$ are retained.
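The construction of $E_k$ and $E_K$ described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `ParamVector` type, the `CallRecord` input shape, and the function names are hypothetical, and the sketch assumes that call-graph traversal and static taint analysis have already produced, for each call on a path, the function domain, the function name, and a (type, source) pair per parameter, with one vector emitted per parameter.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class ParamVector(NamedTuple):
    """One parameter vector e_f = (d, f, t, s)."""
    domain: str    # d: function domain in which f resides
    function: str  # f: called function
    dtype: str     # t: parameter data type
    source: str    # s: data source, e.g. "external" or "internal"


# Hypothetical input shape: (domain, function, [(type, source), ...]) per call.
CallRecord = Tuple[str, str, List[Tuple[str, str]]]


def path_vectors(path: Iterable[CallRecord]) -> Set[ParamVector]:
    """Build E_k for one system call sequence k (assumed: one vector per parameter)."""
    return {
        ParamVector(domain, function, dtype, source)
        for domain, function, params in path
        for dtype, source in params
    }


def full_path_set(paths: Iterable[Iterable[CallRecord]],
                  min_size: int = 5) -> List[Set[ParamVector]]:
    """Build E_K, keeping only paths whose E_k has at least min_size vectors."""
    all_sets = (path_vectors(path) for path in paths)
    return [vectors for vectors in all_sets if len(vectors) >= min_size]
```

With the default `min_size=5`, a path that yields fewer than five distinct parameter vectors is dropped, matching the $|E_k| \ge 5$ filter in the text.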
For the simplified program dependency graphs sPDG of the target code, a progressive graph isomorphism algorithm is used to find the maximum isomorphic subgraph between the sPDGs, and the sPDG graph isomorphism coefficient is calculated from the result. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the sPDGs of the two input code files. The algorithm is as follows: [algorithm listing not reproduced]
Before the algorithm starts, the number of steps n and the extension size $m_i$ of each step are determined, where $m_i$ and n satisfy [formula not reproduced] and $m_i \ge 1$. The function isomorph($G_{1,i}$, $G_2$), implemented with the open-source VFLib graph isomorphism framework, tests the subgraph $G_{1,i}$ and $G_2$ for isomorphism, finally yielding the maximal subgraph of $G_1$ that is isomorphic with $G_2$.
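Since the algorithm listing is not available here, the following is only a rough sketch of the progressive idea: grow a subgraph of $G_1$ by $m_i$ vertices per step and stop when it no longer matches $G_2$. It does not use VFLib; a brute-force, edge-preserving embedding check (exponential, suitable only for tiny graphs) stands in for isomorph($G_{1,i}$, $G_2$). The function names, the fixed vertex order `order1`, and the `step_sizes` list of $m_i$ values are all assumptions made for illustration.

```python
from itertools import permutations
from typing import Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def embeds(nodes1: Sequence[Hashable], edges1: Sequence[Edge],
           nodes2: Sequence[Hashable], edges2: Sequence[Edge]) -> bool:
    """Brute-force stand-in for isomorph(): True if some injective mapping of
    nodes1 into nodes2 sends every edge of edges1 onto an edge of edges2."""
    edge_set2 = set(edges2)
    for image in permutations(nodes2, len(nodes1)):
        mapping = dict(zip(nodes1, image))
        if all((mapping[u], mapping[v]) in edge_set2 for u, v in edges1):
            return True
    return False


def progressive_match(order1: Sequence[Hashable], edges1: Sequence[Edge],
                      nodes2: Sequence[Hashable], edges2: Sequence[Edge],
                      step_sizes: Sequence[int]) -> List[Hashable]:
    """Extend the candidate subgraph of G1 by m_i vertices per step and return
    the largest prefix of order1 whose induced subgraph still embeds in G2."""
    matched: List[Hashable] = []
    for m in step_sizes:
        candidate = list(order1[:len(matched) + m])
        inside = set(candidate)
        sub_edges = [(u, v) for u, v in edges1 if u in inside and v in inside]
        if not embeds(candidate, sub_edges, nodes2, edges2):
            break  # the last successful extension is kept
        matched = candidate
    return matched
```

Larger steps mean fewer isomorphism tests but a coarser result when a step fails; smaller steps localize the mismatch at the cost of more tests.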
From the result of the progressive graph isomorphism algorithm, the following evaluation function computes the ratio of the number of edges that differ between the two graphs to the number of edges in the smaller graph: [formula not reproduced]
The result P is the graph isomorphism coefficient of the simplified program dependency graphs sPDG. P = 0 indicates that $G_1$ is a complete subgraph of $G_2$.
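The ratio described in words above can be illustrated with a small sketch. The exact evaluation function is not shown in this text, so the counting rule is an assumption: an edge of the smaller graph counts as "differing" when, under the vertex mapping produced by the matching step, it has no counterpart edge in the other graph. The function name and the `mapping` argument are hypothetical, and the caller is assumed to pass the smaller graph first.

```python
from typing import Dict, Hashable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def isomorphism_coefficient(small_edges: Sequence[Edge],
                            large_edges: Sequence[Edge],
                            mapping: Dict[Hashable, Hashable]) -> float:
    """P = (differing edges) / (edges in the smaller graph), under the
    assumed counting rule described in the lead-in."""
    large_set = set(large_edges)
    smaller_count = min(len(small_edges), len(large_edges))
    if smaller_count == 0:
        return 0.0  # assumption: an edgeless graph has no differing edges
    # Unmapped endpoints map to None, so their edges count as differing.
    differing = sum(
        1 for u, v in small_edges
        if (mapping.get(u), mapping.get(v)) not in large_set
    )
    return differing / smaller_count
```

Under this rule P = 0 exactly when every edge of the smaller graph is matched, which agrees with the statement that P = 0 means $G_1$ is a complete subgraph of $G_2$.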
For the acquired key syntactic information sequences of the code, comprising the variable four-tuple sequence, the macro definition triple sequence, and the structure sequence, the overlap coefficient C is calculated using the Jaccard algorithm, as follows:
The overlap coefficient of a single kind of syntactic information α is $H_\alpha = \frac{|S_\alpha \cap T_\alpha|}{|S_\alpha \cup T_\alpha|}$, where $S_\alpha$ and $T_\alpha$ are the sequences of syntactic information α corresponding to the two codes S and T respectively; $H_\alpha = 0$ is defined when $S_\alpha \cup T_\alpha$ is empty. The key syntactic information overlap coefficient is then $C = \sum_\alpha w_\alpha H_\alpha$, where $w_\alpha$ is the weight of syntactic information α. In this example the variable information weight is set to 0.4, the macro definition information weight to 0.2, and the structure information weight to 0.4.
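A weighted Jaccard overlap of this kind can be sketched as below. The weights are the example values from the text; the dictionary keys, the function names, and the choice to treat each sequence as a set of hashable records are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable

# Example weights from the text: variables 0.4, macro definitions 0.2, structures 0.4.
WEIGHTS: Dict[str, float] = {"variables": 0.4, "macros": 0.2, "structs": 0.4}


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, defined as 0 for an empty union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def syntax_overlap(info_s: Dict[str, Iterable[Hashable]],
                   info_t: Dict[str, Iterable[Hashable]],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """C = sum over kinds α of w_α * H_α."""
    return sum(
        weight * jaccard(info_s.get(kind, ()), info_t.get(kind, ()))
        for kind, weight in weights.items()
    )
```

For instance, if the two codes share one of two variable four-tuples, no macro definitions, and their only structure, the result is 0.4 × 0.5 + 0.2 × 0 + 0.4 × 1 = 0.6.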
For the full-path parameter vector set $E_K$ of the target code, the Jaccard algorithm is used to compute the similarity coefficients of the subsets of $E_K$, and the highest value is taken as the system call sequence similarity coefficient A. The algorithm is as follows: [algorithm listing not reproduced]
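One plausible reading of this step, offered only as a sketch, is to compare every per-path parameter vector set of code S with every per-path set of code T and keep the best Jaccard score. The pairwise comparison across the two codes, the function names, and the tuple representation of parameter vectors are assumptions, not details confirmed by the text.

```python
from typing import Iterable, Set, Tuple

ParamVector = Tuple[str, str, str, str]  # (d, f, t, s)


def jaccard(a: Set[ParamVector], b: Set[ParamVector]) -> float:
    """Jaccard coefficient of two parameter vector sets (0 for an empty union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def call_sequence_similarity(paths_s: Iterable[Iterable[ParamVector]],
                             paths_t: Iterable[Iterable[ParamVector]]) -> float:
    """A = highest Jaccard score over all pairs (E_k of S, E_k of T)."""
    sets_t = [set(path) for path in paths_t]
    return max(
        (jaccard(set(path_s), set_t) for path_s in paths_s for set_t in sets_t),
        default=0.0,  # no paths on either side: no evidence of similarity
    )
```

Taking the maximum means a single closely matching execution path is enough to produce a high coefficient, even when the remaining paths differ.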
The sPDG graph isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$ are calculated for codes S and T, and the homology index of the two codes is calculated by the following formula: [formula not reproduced], where $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more pronounced the kinship between the input samples.
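Because the formula itself is not available in this text, the following only illustrates how three weighted coefficients might be combined. It assumes a plain weighted sum, and, since P = 0 denotes the closest graph match while a larger homology index denotes stronger kinship, it assumes P enters as 1 − P. The function name and the default weight values are hypothetical and are not taken from the patent.

```python
def homology_index(p: float, c: float, a: float,
                   w_p: float = 0.4, w_c: float = 0.3, w_a: float = 0.3) -> float:
    """Assumed form: w_P * (1 - P) + w_C * C + w_A * A.

    p: sPDG graph isomorphism coefficient (0 means the best graph match)
    c: syntactic information overlap coefficient
    a: system call sequence similarity coefficient
    The default weights are illustrative placeholders only.
    """
    return w_p * (1.0 - p) + w_c * c + w_a * a
```

Under these assumptions two identical inputs (P = 0, C = 1, A = 1) score 1.0 and two entirely unrelated inputs (P = 1, C = 0, A = 0) score 0.0.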
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
The units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method may be completed by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or part of the steps of the above embodiments may also be implemented using one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The invention is not limited to any particular combination of hardware and software.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A code homology detection method based on code fingerprints, characterized by comprising the following steps:
Step 1: performing dependency analysis on two input codes S and T to obtain the original program dependency graph PDG; and performing structure simplification, nesting removal, and coloring on the original PDG to obtain the simplified program dependency graph sPDG;
Step 2: parsing the key syntactic information of the code based on the abstract syntax tree;
Step 3: extracting the system call sequences of the code execution paths, obtaining the full-path parameter vector set of the target code, and constructing the code fingerprint;
Step 4: calculating the homology coefficients between the code fingerprints, the homology coefficients comprising the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
Step 5: calculating the homology index of the two codes S and T from the homology coefficients, and determining from the homology index the kinship between the two codes.
2. The code homology detection method based on code fingerprints according to claim 1, characterized in that performing structure simplification, nesting removal, and coloring on the original program dependency graph PDG in step 1 to obtain the simplified program dependency graph sPDG comprises:
Step 11: performing structure simplification on the original PDG according to simplification principles;
Step 12: for nodes containing a nesting relationship, removing their internally nested input nodes and output nodes, and moving the edges of the corresponding dependencies onto the outer function call node;
Step 13: classifying and coloring the nodes according to statement type to obtain the simplified program dependency graph sPDG.
3. The code homology detection method based on code fingerprints according to claim 2, characterized in that the simplification principles in step 11 comprise: removing vertices that have only one outgoing edge and no incoming edge; removing vertices that have only one incoming edge and no outgoing edge; removing vertices that have exactly one incoming edge and one outgoing edge, and introducing an edge from the input vertex to the output vertex; and removing vertices that have no incoming or outgoing edges.
4. The code homology detection method based on code fingerprints according to claim 1, characterized in that parsing the key syntactic information of the code based on the abstract syntax tree in step 2 comprises:
Step 21: recording the global variables and local variables in the specified code domain and their attributes to form four-tuples, each four-tuple comprising the scope, linkage attribute, storage class, and name of a variable;
Step 22: parsing and recording macro definitions and their corresponding content to form triples, each triple comprising the macro definition identifier, content type, and name;
Step 23: parsing the key data structures in the code based on the abstract syntax tree AST, and recording the user-defined structures in the code in sequence form.
5. The code homology detection method based on code fingerprints according to claim 1, characterized in that step 3 comprises:
Step 31: starting from the entry function, generating the call graph and post-dominator tree of function f, extracting the system call sequence k along a single execution path, and recording the set K of system call sequences over all possible execution paths;
Step 32: for each function f in a system call sequence k, locating the function domain d in which it resides, parsing all parameters of f from the abstract syntax tree, determining the data source s of each parameter of f by static taint analysis so as to judge where the parameter value comes from, and combining the parameter data type t to form the parameter vector $e_f = (d, f, t, s)$ of function f, thereby obtaining the parameter vector set $E_k$ of system call sequence k;
Step 33: performing step 32 on every sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;
Step 34: constructing the code fingerprint from the full-path parameter vector set $E_K$ of the target code.
6. The code homology detection method based on code fingerprints according to claim 1, characterized in that step 4 comprises:
Step 41: for the simplified program dependency graphs sPDG of the target code, finding the maximum isomorphic subgraph between the sPDGs by a progressive graph isomorphism algorithm, and calculating the isomorphism coefficient $P_{S,T}$ between the sPDGs;
Step 42: calculating the overlap coefficient $C_{S,T}$ by the Jaccard algorithm from the key syntactic information obtained in step 2;
Step 43: from the full-path parameter vector set $E_K$ of the target code obtained in step 3, computing the similarity coefficients of the subsets of $E_K$ by the Jaccard algorithm, and taking the highest value as the system call sequence similarity coefficient $A_{S,T}$.
7. The code homology detection method based on code fingerprints according to claim 6, characterized in that in step 41 the original program dependency graph PDG is represented as a directed graph G = (V, E), the node set V representing a group of predicate expressions or statements and E representing the data dependencies and control dependencies between them; letting $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the respective simplified program dependency graphs sPDG, the isomorphism coefficient P between the sPDGs is calculated by the evaluation function: [formula not reproduced].
8. The code homology detection method based on code fingerprints according to claim 6, characterized in that step 42 comprises: calculating the overlap coefficient of a single kind of syntactic information α as $H_\alpha = \frac{|S_\alpha \cap T_\alpha|}{|S_\alpha \cup T_\alpha|}$, where $S_\alpha$ and $T_\alpha$ are the sequences of syntactic information α corresponding to the two codes S and T respectively; and calculating the key syntactic information overlap coefficient $C = \sum_\alpha w_\alpha H_\alpha$, where $w_\alpha$ is the weight of syntactic information α.
9. The code homology detection method based on code fingerprints according to claim 1, characterized in that in step 5 the homology index of the two codes S and T is calculated by the formula: [formula not reproduced], where $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient.
10. A code homology detection device based on code fingerprints, characterized by comprising: a program simplification module, a syntax parsing module, a fingerprint construction module, a homology coefficient acquisition module, and a homology determination module;
the program simplification module being configured to perform dependency analysis on two input codes S and T to obtain the original program dependency graph PDG, and to perform structure simplification, nesting removal, and coloring on the original PDG to obtain the simplified program dependency graph sPDG;
the syntax parsing module being configured to parse the key syntactic information of the code based on the abstract syntax tree, and comprising a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, wherein the variable parsing unit records the global variables and local variables in the specified code domain together with their scope, linkage attribute, and storage class; the macro definition parsing unit records macro definitions and their corresponding content types; and the key data structure parsing unit parses the structures defined in all classes and functions in the target code domain;
the fingerprint construction module being configured to extract the system call sequences of the code execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
the homology coefficient acquisition module being configured to calculate the homology coefficients between the code fingerprints from the information obtained by the program simplification module, the syntax parsing module, and the fingerprint construction module, the homology coefficients comprising the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
the homology determination module being configured to calculate the homology index of the two codes S and T from the homology coefficients obtained by the homology coefficient acquisition module, and to determine from the homology index the kinship between the two codes.
CN201710375425.4A 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint Active CN107169358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710375425.4A CN107169358B (en) 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710375425.4A CN107169358B (en) 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint

Publications (2)

Publication Number Publication Date
CN107169358A CN107169358A (en) 2017-09-15
CN107169358B true CN107169358B (en) 2019-10-08

Family

ID=59820829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710375425.4A Active CN107169358B (en) 2017-05-24 2017-05-24 Code homology detection method and its device based on code fingerprint

Country Status (1)

Country Link
CN (1) CN107169358B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399321B (en) * 2017-11-03 2021-05-18 西安邮电大学 Software local plagiarism detection method based on dynamic instruction dependence graph birthmark
CN107967152B (en) * 2017-12-12 2020-06-19 西安交通大学 Software local plagiarism evidence generation method based on minimum branch path function birthmarks
CN108287996A (en) * 2018-01-08 2018-07-17 北京工业大学 Method for cleaning obfuscation features of malicious code
CN108229170B (en) * 2018-02-02 2020-05-12 中科软评科技(北京)有限公司 Software analysis method and apparatus using big data and neural network
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 Method and device for detecting code similarity
CN110555305A (en) * 2018-05-31 2019-12-10 武汉安天信息技术有限责任公司 Malicious application tracing method based on deep learning and related device
CN109101235B (en) * 2018-06-05 2021-03-19 北京航空航天大学 Intelligent analysis method for software program
CN109190653B (en) * 2018-07-09 2020-06-05 四川大学 Malicious code family homology analysis method based on semi-supervised density clustering
CN109101816B (en) * 2018-08-10 2022-02-08 北京理工大学 Malicious code homology analysis method based on system call control flow graph
CN109918128B (en) * 2019-03-25 2022-04-08 湘潭大学 Code similarity detection method and system based on relation variable graph
CN110489973A (en) * 2019-08-06 2019-11-22 广州大学 Smart contract vulnerability detection method, device, and storage medium based on Fuzz
CN110955758A (en) * 2019-12-18 2020-04-03 中国电子技术标准化研究院 Code detection method, code detection server and index server
CN111291373B (en) * 2020-02-03 2022-06-14 思客云(北京)软件技术有限公司 Method, apparatus and computer-readable storage medium for analyzing data pollution propagation
CN113064633A (en) * 2021-03-26 2021-07-02 山东师范大学 Automatic code abstract generation method and system
CN113138924B (en) * 2021-04-23 2023-10-31 扬州大学 Thread safety code identification method based on graph learning
CN113434145A (en) * 2021-06-09 2021-09-24 华东师范大学 Program code similarity measurement method based on abstract syntax tree path context
CN115129364B (en) * 2022-07-05 2023-04-18 四川大学 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN104933364A (en) * 2015-07-08 2015-09-23 中国科学院信息工程研究所 Automatic malicious code homology judgment method and system based on calling behaviors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8429628B2 (en) * 2007-12-28 2013-04-23 International Business Machines Corporation System and method for comparing partially decompiled software

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
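Fingerprint selection method for code similarity detection; Huang Liuliu et al.; Computer Engineering and Applications (《计算机工程与应用》); 2010-09-21 (No. 27); pp. 169-171 *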

Also Published As

Publication number Publication date
CN107169358A (en) 2017-09-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant