CN107169358B - Code homology detection method and its device based on code fingerprint - Google Patents
- Publication number
- CN107169358B CN201710375425.4A
- Authority
- CN
- China
- Prior art keywords
- code
- dependency graph
- fingerprint
- spdg
- coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/425—Lexical analysis
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Virology (AREA)
- Stored Programmes (AREA)
Abstract
The present invention relates to a code homology detection method based on code fingerprints, and a device therefor. The method comprises: performing dependency analysis on the input code to obtain an original program dependency graph (PDG); applying structural simplification, nesting removal, and coloring to the original PDG to obtain a simplified program dependency graph (sPDG); parsing the key syntactic information of the code based on an abstract syntax tree; extracting the system call sequences of the code's execution paths to obtain the full-path parameter vector set of the target code, and constructing the code fingerprint; computing the homology coefficients between the code fingerprint components; and computing, from the homology coefficients, the homology index of two pieces of code S and T, by which the kinship between the two pieces of code is determined. Building on similarity, the invention takes both code semantics and code behavior into account, uses lightweight features and a simplification mechanism to improve detection efficiency, and measures the kinship between codes from multiple angles, so that detection efficiency is improved while accuracy is preserved.
Description
Technical field
The invention belongs to the technical field of computer software applications, and in particular relates to a code homology detection method based on code fingerprints and a device therefor.
Background art
As demand for Internet applications of all kinds grows and code iteration speeds up, greater demands are placed on programmers' development efficiency and speed. On the software development pipeline, secondary development from templates and reuse of existing components are common; at the same time, to meet new requirements, developers often borrow code from open-source repositories on the Internet. As a result, homologous code keeps spreading through different channels, and the defects and errors hidden in that code spread widely with it. Meanwhile, with the continuous development of computer security technology and the steady improvement of virus detection techniques, malicious code on the Internet such as macro viruses, malicious VBS scripts, and malicious JavaScript scripts is increasingly likely to be detected. To evade detection, attackers modify the content or transform the form of the original code, thereby improving the malicious code's ability to survive. Versions of the same malicious code share an inherent homology, which is an important basis for detecting them.
As an important aspect of computer program research, current homology detection techniques for software source code fall into three main types: text-based detection, detection based on structural analysis, and semantics-based detection.
(1) Text-based software homology detection takes the source code text as its object, for example detection based on text similarity or code similarity based on text attributes. One benefit of analyzing program source code as text is that the analysis is not bound to the programming language of the analyzed object; but precisely because it ignores the language characteristics of the code, such methods are generally weak against code obfuscation. Simple obfuscation such as renaming variables and functions, inserting junk code, or reordering statements can strongly affect the detection result without affecting functionality. This kind of technique can therefore only perform simple homology detection at the text level and has considerable limitations.
(2) Software homology detection based on structural analysis analyzes the code structure and expresses it in another, comparable intermediate form; common approaches are token-based, tree-based, and graph-based detection. Compared with text-based detection, these techniques give better detection results and have some resistance to common obfuscation. However, their computational complexity depends on the intermediate representation, and complex structures can cause considerable performance overhead during detection.
(3) Semantics-based software homology detection either extracts features such as control flow, data flow, and standard API flow on the basis of static semantic analysis, characterizing program behavior from different angles, or compiles and executes the source code and records the program's instruction stream and system call sequence to characterize program behavior. In essence these techniques characterize the semantics and behavioral features of a program, and they cope more effectively with the challenges that code obfuscation poses to homology detection. However, semantics-based methods cannot effectively cover the characteristics of the code itself, and accurate semantic analysis is difficult to carry out.
Summary of the invention
To address the shortcomings of the prior art, the present invention provides a code homology detection method based on code fingerprints and a device therefor. It addresses the weak resistance to obfuscation and the low detection efficiency of software source code detection: it extracts code features accurately, copes effectively with the effects of common code obfuscation techniques, improves both the efficiency and the accuracy of homology detection, and effectively prevents the spread of malicious code.
According to the design provided by the present invention, a code homology detection method based on code fingerprints comprises the following steps:
Step 1: perform dependency analysis on two pieces of input code S and T to obtain the original program dependency graph (PDG); apply structural simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
Step 2: parse the key syntactic information of the code based on the abstract syntax tree;
Step 3: extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
Step 4: compute the homology coefficients between the code fingerprint components, namely the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
Step 5: compute the homology index of the two pieces of code S and T from the homology coefficients, and determine the kinship between the two pieces of code from the homology index.
In step 1 above, applying structural simplification, nesting removal, and coloring to the original PDG to obtain the sPDG comprises:
Step 11: simplify the structure of the original PDG according to the simplification principles;
Step 12: for each node that contains a nesting relation, remove its internally nested input and output nodes, and move the edges of the corresponding dependencies onto the outer function call node;
Step 13: classify and color the nodes by statement type to obtain the sPDG.
The simplification principles in step 11 are: remove any vertex that has exactly one outgoing edge and no incoming edge; remove any vertex that has exactly one incoming edge and no outgoing edge; remove any vertex that has exactly one incoming edge and one outgoing edge, and add an edge from its input vertex to its output vertex; remove any vertex that has no incoming or outgoing edges.
In step 2 above, parsing the key syntactic information of the code based on the abstract syntax tree comprises:
Step 21: record the global variables and local variables in the specified code domain together with their attributes, forming a four-tuple that contains the variable's scope, linkage attribute, storage class, and name;
Step 22: parse and record macro definitions and their corresponding contents, forming a triple that contains the macro definition identifier, content type, and name;
Step 23: parse the key data structures in the code based on the abstract syntax tree (AST), and record the user-defined structs in the code as a sequence.
Step 3 above comprises:
Step 31: starting from the entry function, generate the call graph and the post-dominator tree of function f, extract the system call sequence k along a single execution path, and record the set K of system call sequences over all possible execution paths;
Step 32: for each function f in each system call sequence k, locate the function domain d in which it resides, parse all parameters of f in the abstract syntax tree, determine by static taint analysis the data source s of each parameter of f (that is, where the parameter value comes from), and combine it with the parameter's data type t to form the parameter vector $e_f = (d, f, t, s)$ of f; this yields the parameter vector set $E_k$ of the system call sequence k;
Step 33: perform step 32 for every sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;
Step 34: construct the code fingerprint from the full-path parameter vector set $E_K$ of the target code.
Step 3 above comprises:
Step 31: compute the domain relevance, recommendation response rate, and recommendation satisfaction rate of each recommender in the initial recommender set;
Step 32: set acceptance thresholds for the recommendation response rate, recommendation satisfaction rate, and domain relevance; screen all recommenders in the initial recommender set against the acceptance thresholds, and remove from the initial set any recommender that falls below them;
Step 33: obtain the candidate recommender set from the screening.
Step 4 above comprises:
Step 41: for the sPDGs of the target code, find the maximum isomorphic subgraph between the sPDGs using a progressive graph isomorphism algorithm, and compute the isomorphism coefficient $P_{S,T}$ between the sPDGs;
Step 42: from the key syntactic information obtained in step 2, compute the overlap coefficient $C_{S,T}$ using the Jaccard algorithm;
Step 43: from the full-path parameter vector set $E_K$ of the target code obtained in step 3, compute the Jaccard similarity coefficients of the subsets of $E_K$ and take the maximum as the system call sequence similarity coefficient $A_{S,T}$.
Preferably, in step 41, the original PDG is represented as a directed graph $G = (V, E)$, where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies between the parts. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the two sPDGs; the isomorphism coefficient P between the sPDGs is computed with an evaluation function [formula not reproduced in this text].
Preferably, step 42 comprises: computing the overlap coefficient of a single kind of syntactic information α [formula not reproduced in this text] from the sequences of syntactic information α that correspond to the two pieces of code S and T; and computing the key syntactic information overlap coefficient [formula not reproduced in this text], where $w_\alpha$ is the weight of syntactic information α.
In step 5 above, the homology index of the two pieces of code S and T is computed by a formula [not reproduced in this text], where $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient.
A code homology detection device based on code fingerprints comprises a program simplification module, a syntax parsing module, a fingerprint construction module, a homology coefficient acquisition module, and a homology determination module;
The program simplification module performs dependency analysis on two pieces of input code S and T to obtain the original PDG, and applies structural simplification, nesting removal, and coloring to the original PDG to obtain the sPDG;
The syntax parsing module parses the key syntactic information of the code based on the abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, where the variable parsing unit records the global and local variables in the specified code domain together with their scope, linkage attribute, and storage class; the macro definition parsing unit records macro definitions and their corresponding content types; and the key data structure parsing unit parses the structs defined in all classes and functions in the target code domain;
The fingerprint construction module extracts the system call sequences of the code's execution paths, obtains the full-path parameter vector set of the target code, and constructs the code fingerprint;
The homology coefficient acquisition module computes, from the information obtained by the program simplification module, the syntax parsing module, and the fingerprint construction module, the homology coefficients between the code fingerprint components, namely the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
The homology determination module computes the homology index of the two pieces of code S and T from the homology coefficients obtained by the homology coefficient acquisition module, and determines the kinship between the two pieces of code from the homology index.
Beneficial effects of the present invention:
The invention is directed at determining source code homology. Building on similarity, it takes both code semantics and code behavior into account, uses lightweight features and a simplification mechanism to improve detection efficiency, and measures the kinship between codes from multiple angles. It addresses the following problems of the prior art: a) the inability to cope effectively with compound code obfuscation that combines several techniques such as format changes, renaming, junk code insertion, and statement reordering; b) detection methods based on complex structures and algorithms achieve high accuracy but involve heavy computation and low detection efficiency, so that accuracy cannot be improved without sacrificing efficiency. The invention improves detection efficiency while preserving accuracy.
The invention abstracts the logic and features of code through the code fingerprint: the program dependency graph expresses the data flow and control flow relations, and the fingerprint additionally incorporates the code's syntactic and behavioral features. This addresses the limited ability of existing code homology detection, which focuses on code text and feature similarity, to reflect the internal logic of code and the deep associations between codes. The efficiency of homology detection is greatly improved while high accuracy is maintained, and the spread of malicious code is effectively prevented. The invention provides technical support for homology detection and determination of computer program source code, and is of guiding significance for computer network security technology and virus detection techniques.
Brief description of the drawings:
Fig. 1 is a schematic flow chart of the method of the invention;
Fig. 2 is a schematic flow chart of the homology analysis method in an embodiment;
Fig. 3 is a flow chart of obtaining the simplified program dependency graph sPDG in an embodiment;
Fig. 4 is a flow chart of parsing the key syntactic information of the code based on the abstract syntax tree in an embodiment;
Fig. 5 is a flow chart of constructing the code fingerprint in an embodiment;
Fig. 6 is a flow chart of computing the homology coefficients between code fingerprint components in an embodiment;
Fig. 7 is a schematic diagram of the device of the invention;
Fig. 8 illustrates the input code in an embodiment;
Fig. 9 illustrates part of the effect of structural simplification of the original PDG in an embodiment;
Fig. 10 illustrates the nesting removal process in an embodiment;
Fig. 11 illustrates parsing based on the abstract syntax tree in an embodiment;
Fig. 12 illustrates the extraction of system call parameters of the target code in an embodiment.
Detailed description of the embodiments:
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and technical solutions.
Current research on code homology detection methods is mostly based on a single type of feature. Coarse-grained features improve detection efficiency but reduce detection accuracy, while fine-grained features improve accuracy but introduce a performance bottleneck from heavy computation. How to cope effectively with complex code obfuscation under the condition of efficient detection, abstracting code logic and summarizing code features accurately, is an important topic that currently needs study.
An embodiment provides a code homology detection method based on code fingerprints which, referring to Fig. 1, comprises the following steps:
Step 1: perform dependency analysis on two pieces of input code S and T to obtain the original program dependency graph (PDG); apply structural simplification, nesting removal, and coloring to the original PDG to obtain the simplified program dependency graph (sPDG);
Step 2: parse the key syntactic information of the code based on the abstract syntax tree;
Step 3: extract the system call sequences of the code's execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
Step 4: compute the homology coefficients between the code fingerprint components, namely the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
Step 5: compute the homology index of the two pieces of code S and T from the homology coefficients, and determine the kinship between the two pieces of code from the homology index.
This embodiment extracts code features accurately and copes effectively with the effects of common code obfuscation techniques; it improves the accuracy of homology detection and at the same time greatly improves its efficiency.
For the two input code files to be examined, program analysis is performed according to the compilation principles of their programming language, yielding the original PDG of the code, which serves as the basis of the code fingerprint. In another embodiment of the invention, referring to Fig. 3, applying structural simplification, nesting removal, and coloring to the original PDG to obtain the sPDG comprises:
Step 11: simplify the structure of the original PDG according to the simplification principles;
Step 12: for each node that contains a nesting relation, remove its internally nested input and output nodes, and move the edges of the corresponding dependencies onto the outer function call node;
Step 13: classify and color the nodes by statement type to obtain the sPDG.
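Step 12 can be read as an edge-lifting operation. The sketch below is one possible interpretation, not the patent's implementation: it assumes the PDG is given as a set of directed edges, and that a hypothetical `parent` map (not described in the text) records the enclosing function call node of each nested input or output node.

```python
def remove_nesting(edges, parent):
    """Lift dependency edges from nested input/output nodes to the outer call node.

    edges  : iterable of (src, dst) dependency edges
    parent : dict mapping a nested node to its enclosing call node
             (hypothetical input, assumed to contain no cycles)
    """
    def outermost(node):
        # Follow the nesting chain up to the outermost enclosing call node.
        while node in parent:
            node = parent[node]
        return node

    lifted = set()
    for src, dst in edges:
        s, d = outermost(src), outermost(dst)
        if s != d:  # an edge that ends up inside a single call node is dropped
            lifted.add((s, d))
    return lifted


edges = {("x", "call.in0"), ("call.out", "y"), ("call.in0", "call.out")}
parent = {"call.in0": "call", "call.out": "call"}
print(sorted(remove_nesting(edges, parent)))
```

In this toy input the nested nodes `call.in0` and `call.out` disappear, and the dependencies of `x` and `y` attach directly to the node `call`.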
In another embodiment of the invention, structural simplification of the original PDG consists of applying the following operations to the nodes of the graph according to the simplification principles: remove any vertex that has exactly one outgoing edge and no incoming edge; remove any vertex that has exactly one incoming edge and no outgoing edge; remove any vertex that has exactly one incoming edge and one outgoing edge, and add an edge from its input vertex to its output vertex; remove any vertex that has no incoming or outgoing edges. These operations are repeated until no node satisfies any simplification principle.
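The four removal rules and the repeat-until-stable loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the graph is a set of nodes plus a set of directed edges, self-loops are ignored, and the function name `simplify_pdg` is hypothetical.

```python
def simplify_pdg(nodes, edges):
    """Repeatedly remove vertices with at most one incoming and one outgoing edge.

    That single condition covers all four rules: (0 in, 1 out), (1 in, 0 out),
    (1 in, 1 out, with a bypass edge added) and isolated vertices.
    """
    nodes = set(nodes)
    edges = {(u, v) for u, v in edges if u != v}  # assumption: self-loops ignored
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes):  # sorted only to make the removal order deterministic
            preds = {u for u, v in edges if v == n}
            succs = {v for u, v in edges if u == n}
            if len(preds) <= 1 and len(succs) <= 1:
                edges = {(u, v) for u, v in edges if n not in (u, v)}
                if preds and succs:
                    p, s = next(iter(preds)), next(iter(succs))
                    if p != s:
                        edges.add((p, s))  # bypass edge: input vertex -> output vertex
                nodes.remove(n)
                changed = True
                break  # degrees have changed, so rescan from the start
    return nodes, edges


triangle = {(a, b) for a in "ABC" for b in "ABC" if a != b}
core = triangle - {("A", "B")}
nodes, edges = simplify_pdg("ABCXD", core | {("A", "X"), ("X", "B"), ("C", "D")})
print(sorted(nodes))
```

Read literally, the rules leave only vertices with at least two incoming or at least two outgoing edges; a simple chain such as 1 → 2 → 3 is removed entirely.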
The nodes are classified by statement type, nodes of different types are colored with different colors, and each type is identified by a color number to make comparison easy. The classification used in this example is: function calls, control statements, declaration statements, arithmetic statements, switch statements, logical expressions, jump statements, return statements, and so on. Details are given in Table 1.
Type | Information represented by the node | Color number |
Function call | Calls to functions and system APIs | 1 |
Control statement | if, switch, while, for | 2 |
Declaration statement | Variable declarations or formal parameters | 3 |
Arithmetic statement | Variable operations, increment and decrement operations | 4 |
Switch statement | case, default | 5 |
Jump statement | goto, break, continue | 6 |
Conditional statement | <, >, ==, != | 7 |
Return statement | return | 8 |
Other | Other | 0 |
Table 1
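Table 1 amounts to a lookup from statement type to color number. The sketch below shows that lookup only; the type keys such as `"function_call"` are hypothetical labels, and it assumes that an earlier analysis pass has already assigned a type to each node.

```python
# Color numbers from Table 1; the string keys are illustrative labels.
COLOR_BY_TYPE = {
    "function_call": 1,
    "control": 2,
    "declaration": 3,
    "arithmetic": 4,
    "switch": 5,
    "jump": 6,
    "condition": 7,
    "return": 8,
}


def color_nodes(node_types):
    """Map each node to its color number; unknown types get 0 ("Other")."""
    return {node: COLOR_BY_TYPE.get(kind, 0) for node, kind in node_types.items()}


colored = color_nodes({"n1": "function_call", "n2": "return", "n3": "inline_asm"})
print(colored)
```

Comparing color numbers instead of statement text is what lets two nodes match even when identifiers have been renamed.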
The abstract syntax tree (AST) of the source code is built with LLVM or Clang. In another embodiment of the invention, the key syntactic information of the code is parsed based on the AST and forms a component of the code fingerprint. Referring to Fig. 4, this comprises:
Step 21: record the global variables and local variables in the specified code domain together with their attributes, forming a four-tuple that contains the variable's scope, linkage attribute, storage class, and name;
Step 22: parse and record macro definitions and their corresponding contents, forming a triple that contains the macro definition identifier, content type, and name;
Step 23: parse the key data structures in the code based on the abstract syntax tree (AST), and record the user-defined structs in the code as a sequence.
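The records produced by steps 21 to 23 could be held in structures like the following. This is only a sketch of the data layout: the class names, field names, and sample values are hypothetical, and the text does not explain how the macro "identifier" differs from its "name", so both are kept as separate fields.

```python
from typing import NamedTuple


class VariableInfo(NamedTuple):
    """Four-tuple of step 21."""
    scope: str
    linkage: str
    storage_class: str
    name: str


class MacroInfo(NamedTuple):
    """Triple of step 22."""
    identifier: str
    content_type: str
    name: str


# Key syntactic information of one piece of code (illustrative values only).
key_syntax = {
    "variables": {
        VariableInfo("file", "internal", "static", "g_count"),
        VariableInfo("main", "none", "auto", "buf"),
    },
    "macros": {MacroInfo("MAX_LEN", "integer", "MAX_LEN")},
    "structs": ["packet_header", "session"],  # step 23: user-defined structs in order
}
print(len(key_syntax["variables"]), key_syntax["structs"])
```

Named tuples are hashable, so the records can be placed in sets and compared directly when the overlap coefficient is computed.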
In another embodiment of the invention, the system call sequences of the code's execution paths are extracted, the full-path parameter vector set $E_K$ of the target code is obtained, and the code fingerprint is constructed. Referring to Fig. 5, this comprises:
Step 31: starting from the entry function, generate the call graph and the post-dominator tree of function f, extract the system call sequence k along a single execution path, and record the set K of system call sequences over all possible execution paths;
Step 32: for each function f in each system call sequence k, locate the function domain d in which it resides, parse all parameters of f in the abstract syntax tree, determine by static taint analysis the data source s of each parameter of f (that is, where the parameter value comes from), and combine it with the parameter's data type t to form the parameter vector $e_f = (d, f, t, s)$ of f; this yields the parameter vector set $E_k$ of the system call sequence k;
Step 33: perform step 32 for every sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;
Step 34: construct the code fingerprint from the full-path parameter vector set $E_K$ of the target code.
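Steps 32 and 33 can be illustrated with a small data-layout sketch. It does not perform call graph construction or taint analysis; it assumes those analyses have already produced, for each call, a record of the form (domain, function, list of (type, source) pairs). All names and sample values below are hypothetical.

```python
from typing import NamedTuple


class ParamVector(NamedTuple):
    domain: str    # d: function domain in which f is located
    function: str  # f: the called function
    dtype: str     # t: data type of the parameter
    source: str    # s: data source of the parameter, e.g. "external" or "internal"


def vectors_for_sequence(calls):
    """Build E_k for one system call sequence k.

    calls: iterable of (domain, function, [(dtype, source), ...]) records,
           a stand-in for the output of the AST and taint analyses.
    """
    return frozenset(
        ParamVector(d, f, t, s) for d, f, params in calls for t, s in params
    )


def full_path_vectors(sequences):
    """Build E_K as the list of E_k over all sequences k in K."""
    return [vectors_for_sequence(k) for k in sequences]


K = [
    [("main", "open", [("char *", "external"), ("int", "internal")]),
     ("main", "read", [("int", "internal")])],
    [("main", "open", [("char *", "external"), ("int", "internal")])],
]
E_K = full_path_vectors(K)
print([len(E_k) for E_k in E_K])
```

Each $E_k$ is stored as a frozen set, which makes the later set-based similarity computation straightforward.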
When the code fingerprints are used for homology determination, in another embodiment of the invention, the homology coefficients between the code fingerprint components are computed. Referring to Fig. 6, this comprises:
Step 41: for the sPDGs of the target code, find the maximum isomorphic subgraph between the sPDGs using a progressive graph isomorphism algorithm, and compute the isomorphism coefficient $P_{S,T}$ between the sPDGs;
Step 42: from the key syntactic information obtained in step 2, compute the overlap coefficient $C_{S,T}$ using the Jaccard algorithm;
Step 43: from the full-path parameter vector set $E_K$ of the target code obtained in step 3, compute the Jaccard similarity coefficients of the subsets of $E_K$ and take the maximum as the system call sequence similarity coefficient $A_{S,T}$.
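Step 43 can be sketched as a maximum over pairwise Jaccard coefficients. The sketch assumes one particular reading of the text: the "subsets of $E_K$" are the per-path sets $E_k$, and each path of S is compared with each path of T. The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b|, defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def syscall_similarity(E_K_s, E_K_t):
    """Assumed reading of A_{S,T}: best Jaccard score over all path pairs."""
    return max((jaccard(x, y) for x in E_K_s for y in E_K_t), default=0.0)


# Parameter vectors abbreviated to short labels for readability.
E_K_s = [{"a", "b", "c"}, {"d"}]
E_K_t = [{"b", "c", "e"}, {"f"}]
print(syscall_similarity(E_K_s, E_K_t))
```

Taking the maximum means that a single closely matching execution path is enough to produce a high coefficient, even if the remaining paths differ.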
In another embodiment, the original PDG of the target code is represented as a directed graph $G = (V, E)$, where the node set V represents a group of predicate expressions or statements and E represents the data dependencies and control dependencies between the parts. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the two sPDGs. From the result of the progressive graph isomorphism algorithm, the isomorphism coefficient P between the sPDGs is computed with an evaluation function [formula not reproduced in this text]; P = 0 indicates that $G_1$ is a complete subgraph of $G_2$.
For the sequences of key syntactic information obtained from the code, in another embodiment of the invention, the overlap coefficient is computed with the Jaccard algorithm as follows: the overlap coefficient of a single kind of syntactic information α is computed [formula not reproduced in this text] from the sequences of syntactic information α that correspond to the two pieces of code S and T; the key syntactic information overlap coefficient is then computed [formula not reproduced in this text], where $w_\alpha$ is the weight of syntactic information α.
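Because the formulas are not reproduced in this text, the following is only a plausible sketch of a weighted Jaccard overlap. It assumes that the per-kind coefficient is the plain Jaccard coefficient of the two sequences treated as sets, and that the overall coefficient is the weighted sum over the kinds α; the function name and the sample weights are hypothetical.

```python
def overlap_coefficient(info_s, info_t, weights):
    """Weighted sum of per-kind Jaccard coefficients (assumed form of C_{S,T}).

    info_s, info_t : dict mapping a kind of syntactic information to its items
    weights        : dict mapping the same kinds to their weights w_alpha
    """
    total = 0.0
    for kind, w in weights.items():
        s, t = set(info_s.get(kind, ())), set(info_t.get(kind, ()))
        union = s | t
        c_alpha = len(s & t) / len(union) if union else 0.0
        total += w * c_alpha
    return total


info_s = {"variables": {"a", "b"}, "macros": {"M"}}
info_t = {"variables": {"b", "c"}, "macros": {"M"}}
weights = {"variables": 0.5, "macros": 0.5}  # illustrative weights summing to 1
print(overlap_coefficient(info_s, info_t, weights))
```

With weights that sum to 1 the result stays in the range 0 to 1, which keeps it comparable with the other coefficients.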
For the two pieces of input code S and T, the sPDG graph isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$ are computed. In another embodiment of the invention, the homology index of the two pieces of code S and T is then computed by a formula [not reproduced in this text], where $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more evident the kinship between the input samples.
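The formula for the homology index is not reproduced in this text. One plausible reading, given the three weights, is a weighted linear combination; the sketch below assumes that form and also assumes that each coefficient has been oriented so that a larger value means greater similarity. The function name and the default weights are hypothetical.

```python
def homology_index(p, c, a, w_p=1 / 3, w_c=1 / 3, w_a=1 / 3):
    """Assumed form: Homology(S, T) = w_P * P + w_C * C + w_A * A."""
    return w_p * p + w_c * c + w_a * a


# Example with illustrative coefficients and weights.
score = homology_index(0.5, 0.2, 0.8, w_p=0.5, w_c=0.25, w_a=0.25)
print(round(score, 3))
```

If the weights sum to 1 and every coefficient lies between 0 and 1, the index also lies between 0 and 1, which makes it easy to compare against a decision threshold.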
Corresponding to the above method, an embodiment of the invention also provides a code homology detection device based on code fingerprints which, as shown in Fig. 7, comprises: a program simplification module 201, a syntax parsing module 202, a fingerprint construction module 203, a homology coefficient acquisition module 204, and a homology determination module 205;
The program simplification module 201 performs dependency analysis on two pieces of input code S and T to obtain the original PDG, and applies structural simplification, nesting removal, and coloring to the original PDG to obtain the sPDG;
The syntax parsing module 202 parses the key syntactic information of the code based on the abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit, and a key data structure parsing unit, where the variable parsing unit records the global and local variables in the specified code domain together with their scope, linkage attribute, and storage class; the macro definition parsing unit records macro definitions and their corresponding content types; and the key data structure parsing unit parses the structs defined in all classes and functions in the target code domain;
The fingerprint construction module 203 extracts the system call sequences of the code's execution paths, obtains the full-path parameter vector set of the target code, and constructs the code fingerprint;
The homology coefficient acquisition module 204 computes, from the information obtained by the program simplification module, the syntax parsing module, and the fingerprint construction module, the homology coefficients between the code fingerprint components, namely the sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$, and the system call sequence similarity coefficient $A_{S,T}$;
The homology determination module 205 computes the homology index of the two pieces of code S and T from the homology coefficients obtained by the homology coefficient acquisition module, and determines the kinship between the two pieces of code from the homology index.
The effectiveness of the invention is further illustrated below by a concrete example. Figure 8 shows part of the content of the two input program code files. Program analysis is carried out according to the compilation principles of the programming language to obtain the original program dependency graph PDG. The original PDG is expressed as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependences and control dependences that exist between the parts; the original PDG serves as the basis of the code fingerprint. Structural simplification is applied to the original program dependency graph; part of the simplification effect is shown in Figure 9. Nesting removal is then applied to the structurally simplified graph; the nesting removal process is shown in Figure 10. The graph obtained after structural simplification and nesting removal is colored, yielding the simplified program dependency graph sPDG. The abstract syntax tree of the source code is built with LLVM or Clang, and the key syntactic information of the code is parsed from the abstract syntax tree as a component of the code fingerprint; the parsing of the two input files is illustrated in Figure 11.
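The structural simplification step can be pictured with a small sketch. The following Python fragment is a minimal illustration, assuming the four removal rules listed in claim 3 and a single pass over the vertices in sorted order. The function name `simplify_pdg`, the edge-set representation, the single-pass policy and the handling of self-loops are assumptions of this sketch, not details given in the patent.

```python
def simplify_pdg(nodes, edges):
    """Return (nodes, edges) after one simplification pass over a directed graph.

    nodes: iterable of hashable, sortable vertex labels.
    edges: iterable of (source, target) pairs.
    """
    nodes = set(nodes)
    edges = set(edges)
    for v in sorted(nodes):
        preds = {a for (a, b) in edges if b == v}
        succs = {b for (a, b) in edges if a == v}
        # Isolated vertex, or vertex with exactly one edge in total.
        removable_leaf = len(preds) + len(succs) <= 1
        # Vertex with exactly one incoming and one outgoing edge.
        pass_through = len(preds) == 1 and len(succs) == 1
        if not (removable_leaf or pass_through):
            continue
        # Drop the vertex together with every edge that touches it.
        nodes.discard(v)
        edges = {(a, b) for (a, b) in edges if a != v and b != v}
        if pass_through:
            (p,) = preds
            (s,) = succs
            # Reconnect the single predecessor to the single successor,
            # skipping self-loops (an assumption of this sketch).
            if p != v and s != v and p != s:
                edges.add((p, s))
    return nodes, edges


example_nodes = ["A", "B", "C", "D", "I", "M", "T"]
example_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
                 ("C", "D"), ("A", "M"), ("M", "D"), ("D", "T")]
kept_nodes, kept_edges = simplify_pdg(example_nodes, example_edges)
```

In this hypothetical graph, the isolated vertex `I` and the sink `T` are dropped, and the pass-through vertex `M` is replaced by a direct edge from `A` to `D`.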
The parameter sequences of the system calls in the target code are extracted to form the complete code fingerprint. As shown in Figure 12, starting from an entry function such as main, the call graph and post-dominator tree of function f are generated, the system call sequence k in a single execution path is extracted, and the set K of system call sequences over all possible execution paths is recorded. For each function f in a system call sequence k, the function domain d in which it resides is located, all parameters of f are parsed from the abstract syntax tree, and static taint analysis determines the data source s of each parameter of f, that is, whether the parameter value is passed in from outside or originates inside the function. Finally, combined with the data type t of the parameter, a parameter vector $e_f = (d, f, t, s)$ of function f is formed. Performing these operations on one system call sequence k yields the parameter vector set $E_k$ of that sequence; performing them on every sequence k in the set K yields the full-path parameter vector set $E_K$ of the target code. To mitigate the problem of $E_K$ being too large or containing too many invalid paths, only the paths whose parameter vector set satisfies $|E_k| \ge 5$ are retained.
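As a rough illustration of the data involved, the sketch below models a parameter vector $e_f = (d, f, t, s)$ as a named tuple and applies the $|E_k| \ge 5$ retention rule to a list of per-path vector sets. The class name `ParamVector`, the helper `retain_valid_paths`, the `"external"`/`"internal"` source labels and the sample values are all hypothetical.

```python
from typing import NamedTuple


class ParamVector(NamedTuple):
    """One parameter vector e_f = (d, f, t, s)."""
    domain: str     # d: function domain in which the call is located
    function: str   # f: name of the called function
    dtype: str      # t: data type of the parameter
    source: str     # s: "external" or "internal" (labels assumed here)


def retain_valid_paths(path_vector_sets, min_size=5):
    """Keep only the per-path sets E_k with |E_k| >= min_size."""
    return [e_k for e_k in path_vector_sets if len(e_k) >= min_size]


long_path = {
    ParamVector("main", "open", "const char *", "external"),
    ParamVector("main", "open", "int", "internal"),
    ParamVector("main", "read", "int", "internal"),
    ParamVector("main", "read", "void *", "internal"),
    ParamVector("main", "read", "size_t", "external"),
}
short_path = {
    ParamVector("main", "close", "int", "internal"),
}
# Only the path with at least five parameter vectors survives.
full_path_set = retain_valid_paths([long_path, short_path])
```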
For the simplified program dependency graphs sPDG of the target code, a progressive graph isomorphism algorithm is used to find the maximum isomorphic subgraph between the sPDGs, and the sPDG graph isomorphism coefficient is calculated from the result. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the simplified program dependency graphs sPDG of the two input code files; the algorithm is as follows:
Before the algorithm starts, the number of steps n and the extension size $m_i$ of each step are determined, where $m_i$ and n satisfy a constraint whose formula is not reproduced in this text, and $m_i \ge 1$. Using the open-source graph isomorphism framework VFLib, a function isomorph($G_{1,i}$, $G_2$) is implemented to test the subgraph $G_{1,i}$ and $G_2$ for isomorphism, finally obtaining the cluster (clique) in $G_1$ that is isomorphic to $G_2$.
Based on the result of the progressive graph isomorphism algorithm, an evaluation function is used to calculate the ratio of the number of differing edges between the two graphs to the number of edges in the smaller graph.
The result P is the graph isomorphism coefficient of the simplified program dependency graphs sPDG. P = 0 indicates that $G_1$ is a complete subgraph of $G_2$.
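The evaluation function itself is not reproduced in this text, so the following Python fragment is only one plausible reading of the verbal description. It assumes that a vertex mapping from the smaller graph into the larger one is already available from the isomorphism search, and that an edge of the smaller graph counts as "differing" when an endpoint is unmapped or its image is not an edge of the larger graph. The function name `isomorphism_coefficient` and the sample graphs are hypothetical.

```python
def isomorphism_coefficient(edges_small, edges_large, mapping):
    """Ratio of differing edges to the edge count of the smaller graph.

    edges_small, edges_large: iterables of (source, target) pairs.
    mapping: dict from vertices of the smaller graph to vertices of the
             larger graph, as found by an isomorphism search.
    """
    edges_small = set(edges_small)
    edges_large = set(edges_large)
    if not edges_small:
        return 0.0
    differing = 0
    for (u, v) in edges_small:
        if u not in mapping or v not in mapping:
            differing += 1          # endpoint has no counterpart
        elif (mapping[u], mapping[v]) not in edges_large:
            differing += 1          # counterpart edge is missing
    return differing / len(edges_small)


g1 = [(1, 2), (2, 3), (3, 1), (3, 4)]
g2 = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "a")]

# Vertex 4 is unmapped, so one of the four edges of g1 differs.
p_partial = isomorphism_coefficient(g1, g2, {1: "a", 2: "b", 3: "c"})
# With vertex 4 mapped to "d", every edge of g1 has an image in g2.
p_full = isomorphism_coefficient(g1, g2, {1: "a", 2: "b", 3: "c", 4: "d"})
```

Under this reading, a value of 0 means every edge of the smaller graph is matched, which is consistent with the statement that P = 0 indicates a complete subgraph.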
For the key syntactic information sequences obtained from the code, namely the variable quadruple sequence, the macro definition triple sequence and the struct sequence, the overlap coefficient C is calculated with the Jaccard algorithm, as follows:
The overlap coefficient $H_\alpha$ of a single kind of syntactic information α is the Jaccard coefficient of the two sequences of α that correspond to the codes S and T, respectively; when the sequences are empty, $H_\alpha = 0$ by definition. The key syntactic information overlap coefficient is then calculated from the $H_\alpha$ values, where $w_\alpha$ is the weight of syntactic information α. In this example, the weight of the variable information is set to 0.4, the weight of the macro definition information to 0.2, and the weight of the struct information to 0.4.
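A minimal sketch of this calculation is given below. It assumes that the per-category coefficients $H_\alpha$ are combined as a weighted sum, which the weights 0.4, 0.2 and 0.4 suggest but which this text does not state explicitly. The names `jaccard`, `syntax_overlap`, the category keys and the sample quadruples and triples are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b|, defined as 0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Weights from the example in the description.
WEIGHTS = {"variables": 0.4, "macros": 0.2, "structs": 0.4}


def syntax_overlap(info_s, info_t, weights=WEIGHTS):
    """Weighted sum of per-category Jaccard coefficients (assumed form)."""
    return sum(
        w * jaccard(info_s.get(kind, ()), info_t.get(kind, ()))
        for kind, w in weights.items()
    )


code_s = {
    # (scope, linkage, storage class, name)
    "variables": [("global", "external", "static", "counter"),
                  ("local", "none", "auto", "i")],
    # (macro identifier, content type, name)
    "macros": [("#define", "integer", "MAX_LEN")],
    "structs": [],
}
code_t = {
    "variables": [("global", "external", "static", "counter"),
                  ("local", "none", "auto", "j")],
    "macros": [("#define", "integer", "MAX_LEN")],
    "structs": [],
}
# variables: 1/3, macros: 1, structs: 0 (both empty)
c_st = syntax_overlap(code_s, code_t)
```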
For the full-path parameter vector set $E_K$ of the target code, the Jaccard algorithm is used to obtain the similarity coefficients of the subsets of $E_K$, and the highest value is taken as the system call sequence similarity coefficient A. The algorithm is as follows:
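The patent's own listing is not reproduced in this text. As a hedged sketch of the idea, the fragment below assumes that every per-path parameter vector set $E_k$ of code S is compared with every per-path set of code T using the Jaccard coefficient, and that the maximum over all pairs is returned. The pairing scheme, the function names and the sample vectors are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b|, defined as 0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def syscall_similarity(paths_s, paths_t):
    """Highest pairwise Jaccard coefficient between per-path vector sets."""
    best = 0.0
    for e_k_s in paths_s:
        for e_k_t in paths_t:
            best = max(best, jaccard(e_k_s, e_k_t))
    return best


# Each element is a parameter vector (d, f, t, s); values are illustrative.
paths_s = [
    {("main", "open", "int", "internal"), ("main", "read", "int", "internal")},
    {("main", "close", "int", "internal")},
]
paths_t = [
    {("main", "open", "int", "internal"), ("main", "write", "int", "internal")},
    {("util", "socket", "int", "external")},
]
# The best pair shares one vector out of three distinct vectors.
a_st = syscall_similarity(paths_s, paths_t)
```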
For the codes S and T, the sPDG graph isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$ are calculated, and the homology index Homology(S, T) of the two codes is computed from them by a weighted formula, where $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient. The larger Homology(S, T) is, the more pronounced the kinship between the input samples.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
The units and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or part of the steps of the above embodiments can also be implemented with one or more integrated circuits; accordingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software function module. The invention is not limited to any particular combination of hardware and software.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A code homology detection method based on code fingerprints, characterized by comprising the following steps:
Step 1: perform dependence analysis on two input codes S and T to obtain the original program dependency graph PDG, and perform structural simplification, nesting removal and coloring on the original program dependency graph PDG to obtain the simplified program dependency graph sPDG;
Step 2: parse the key syntactic information of the code based on the abstract syntax tree;
Step 3: extract the system call sequences of the code execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
Step 4: calculate the homology coefficients between the code fingerprints, the homology coefficients comprising the simplified program dependency graph sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;
Step 5: calculate the homology index of the two codes S and T from the homology coefficients, and determine the kinship between the two codes from the homology index.
2. The code homology detection method based on code fingerprints according to claim 1, characterized in that performing structural simplification, nesting removal and coloring on the original program dependency graph PDG in step 1 to obtain the simplified program dependency graph sPDG comprises the following:
Step 11: perform structural simplification on the original program dependency graph PDG according to the simplification principles;
Step 12: for a node that contains a nesting relationship, remove its internal nested input nodes and output nodes, and move the edges of the corresponding dependences onto the outer function call node;
Step 13: classify and color the nodes according to statement type to obtain the simplified program dependency graph sPDG.
3. The code homology detection method based on code fingerprints according to claim 2, characterized in that the simplification principles in step 11 comprise: removing any vertex that has only one outgoing edge and no incoming edge; removing any vertex that has only one incoming edge and no outgoing edge; removing any vertex that has only one incoming edge and one outgoing edge, and introducing an edge from its input vertex to its output vertex; and removing any vertex that has no incoming or outgoing edges.
4. The code homology detection method based on code fingerprints according to claim 1, characterized in that parsing the key syntactic information of the code based on the abstract syntax tree in step 2 comprises the following:
Step 21: record the global variables and local variables in the specified code domain, together with their attributes, to form quadruples, each quadruple comprising the scope, linkage attribute, storage class and name of the variable;
Step 22: parse and record the macro definitions and their corresponding contents to form triples, each triple comprising the macro definition identifier, content type and name;
Step 23: parse the key data structures in the code based on the abstract syntax tree AST, and record the user-defined structs in the code in sequence form.
5. The code homology detection method based on code fingerprints according to claim 1, characterized in that step 3 comprises the following:
Step 31: starting from the entry function, generate the call graph and post-dominator tree of function f, extract the system call sequence k in a single execution path, and record the set K of system call sequences over all possible execution paths;
Step 32: for each function f in a system call sequence k, locate the function domain d in which it resides, parse all parameters of f from the abstract syntax tree, determine the data source s of each parameter of f by static taint analysis so as to establish the origin of the parameter value, and, combined with the data type t of the parameter, form the parameter vector $e_f = (d, f, t, s)$ of function f; obtain the parameter vector set $E_k$ of the system call sequence k;
Step 33: execute step 32 for every sequence k in the system call sequence set K to obtain the full-path parameter vector set $E_K$ of the target code;
Step 34: construct the code fingerprint according to the full-path parameter vector set $E_K$ of the target code.
6. The code homology detection method based on code fingerprints according to claim 1, characterized in that step 4 comprises the following:
Step 41: for the simplified program dependency graphs sPDG of the target code, find the maximum isomorphic subgraph between the simplified program dependency graphs sPDG by a progressive graph isomorphism algorithm, and calculate the isomorphism coefficient $P_{S,T}$ between the simplified program dependency graphs sPDG;
Step 42: calculate the overlap coefficient $C_{S,T}$ by the Jaccard algorithm from the key syntactic information obtained in step 2;
Step 43: according to the full-path parameter vector set $E_K$ of the target code obtained in step 3, compute the similarity coefficients of the subsets of $E_K$ by the Jaccard algorithm, and take the highest value as the system call sequence similarity coefficient $A_{S,T}$.
7. The code homology detection method based on code fingerprints according to claim 6, characterized in that, in step 41, the original program dependency graph PDG is expressed as a directed graph G = (V, E), where the node set V represents a group of predicate expressions or statements and E represents the data dependences and control dependences that exist between the parts; letting $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ denote the respective simplified program dependency graphs sPDG, the isomorphism coefficient P between the simplified program dependency graphs sPDG is calculated by an evaluation function.
8. The code homology detection method based on code fingerprints according to claim 6, characterized in that step 42 comprises the following: the overlap coefficient $H_\alpha$ of a single kind of syntactic information α is calculated from the sequences of syntactic information α that correspond to the two codes S and T, respectively; the key syntactic information overlap coefficient is then calculated, where $w_\alpha$ is the weight of syntactic information α.
9. The code homology detection method based on code fingerprints according to claim 1, characterized in that, in step 5, the homology index of the two codes S and T is calculated by a formula in which $w_P$ is the weight of the sPDG graph isomorphism coefficient, $w_C$ is the weight of the syntactic information overlap coefficient, and $w_A$ is the weight of the system call sequence similarity coefficient.
10. A code homology detection device based on code fingerprints, characterized by comprising: a program simplification module, a syntax parsing module, a fingerprint construction module, a homology coefficient acquisition module and a homology determination module;
the program simplification module is configured to perform dependence analysis on two input codes S and T to obtain the original program dependency graph PDG, and to perform structural simplification, nesting removal and coloring on the original program dependency graph PDG to obtain the simplified program dependency graph sPDG;
the syntax parsing module is configured to parse the key syntactic information of the code based on the abstract syntax tree, and comprises a variable parsing unit, a macro definition parsing unit and a key data structure parsing unit, wherein the variable parsing unit is configured to record the global variables and local variables in the specified code domain and their corresponding scope, linkage attribute and storage class, the macro definition parsing unit is configured to record the macro definitions and their corresponding content types, and the key data structure parsing unit is configured to parse the structs defined in all classes and functions within the target code domain;
the fingerprint construction module is configured to extract the system call sequences of the code execution paths, obtain the full-path parameter vector set of the target code, and construct the code fingerprint;
the homology coefficient acquisition module is configured to calculate the homology coefficients between the code fingerprints according to the information obtained by the program simplification module, the syntax parsing module and the fingerprint construction module, the homology coefficients comprising the simplified program dependency graph sPDG isomorphism coefficient $P_{S,T}$, the syntactic information overlap coefficient $C_{S,T}$ and the system call sequence similarity coefficient $A_{S,T}$;
the homology determination module is configured to calculate the homology index of the two codes S and T from the homology coefficients obtained by the homology coefficient acquisition module, and to determine the kinship between the two codes from the homology index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710375425.4A CN107169358B (en) | 2017-05-24 | 2017-05-24 | Code homology detection method and its device based on code fingerprint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107169358A CN107169358A (en) | 2017-09-15 |
CN107169358B true CN107169358B (en) | 2019-10-08 |
Family
ID=59820829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710375425.4A Active CN107169358B (en) | 2017-05-24 | 2017-05-24 | Code homology detection method and its device based on code fingerprint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169358B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108399321B (en) * | 2017-11-03 | 2021-05-18 | 西安邮电大学 | Software local plagiarism detection method based on dynamic instruction dependence graph birthmark |
CN107967152B (en) * | 2017-12-12 | 2020-06-19 | 西安交通大学 | Software local plagiarism evidence generation method based on minimum branch path function birthmarks |
CN108287996A (en) * | 2018-01-08 | 2018-07-17 | 北京工业大学 | A kind of malicious code obscures feature cleaning method |
CN108229170B (en) * | 2018-02-02 | 2020-05-12 | 中科软评科技(北京)有限公司 | Software analysis method and apparatus using big data and neural network |
CN110347428A (en) * | 2018-04-08 | 2019-10-18 | 北京京东尚科信息技术有限公司 | A kind of detection method and device of code similarity |
CN110555305A (en) * | 2018-05-31 | 2019-12-10 | 武汉安天信息技术有限责任公司 | Malicious application tracing method based on deep learning and related device |
CN109101235B (en) * | 2018-06-05 | 2021-03-19 | 北京航空航天大学 | Intelligent analysis method for software program |
CN109190653B (en) * | 2018-07-09 | 2020-06-05 | 四川大学 | Malicious code family homology analysis method based on semi-supervised density clustering |
CN109101816B (en) * | 2018-08-10 | 2022-02-08 | 北京理工大学 | Malicious code homology analysis method based on system call control flow graph |
CN109918128B (en) * | 2019-03-25 | 2022-04-08 | 湘潭大学 | Code similarity detection method and system based on relation variable graph |
CN110489973A (en) * | 2019-08-06 | 2019-11-22 | 广州大学 | A kind of intelligent contract leak detection method, device and storage medium based on Fuzz |
CN110955758A (en) * | 2019-12-18 | 2020-04-03 | 中国电子技术标准化研究院 | Code detection method, code detection server and index server |
CN111291373B (en) * | 2020-02-03 | 2022-06-14 | 思客云(北京)软件技术有限公司 | Method, apparatus and computer-readable storage medium for analyzing data pollution propagation |
CN113064633A (en) * | 2021-03-26 | 2021-07-02 | 山东师范大学 | Automatic code abstract generation method and system |
CN113138924B (en) * | 2021-04-23 | 2023-10-31 | 扬州大学 | Thread safety code identification method based on graph learning |
CN113434145A (en) * | 2021-06-09 | 2021-09-24 | 华东师范大学 | Program code similarity measurement method based on abstract syntax tree path context |
CN115129364B (en) * | 2022-07-05 | 2023-04-18 | 四川大学 | Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101697121A (en) * | 2009-10-26 | 2010-04-21 | 哈尔滨工业大学 | Method for detecting code similarity based on semantic analysis of program source code |
CN104407872A (en) * | 2014-12-04 | 2015-03-11 | 北京邮电大学 | Code clone detection method |
CN104933364A (en) * | 2015-07-08 | 2015-09-23 | 中国科学院信息工程研究所 | Automatic malicious code homology judgment method and system based on calling behaviors |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8429628B2 (en) * | 2007-12-28 | 2013-04-23 | International Business Machines Corporation | System and method for comparing partially decompiled software |
Non-Patent Citations (1)
Title |
---|
Fingerprint selection method for code similarity detection; Huang Liuliu et al.; Computer Engineering and Applications; 2010-09-21 (No. 27); 169-171 * |
Also Published As
Publication number | Publication date |
---|---|
CN107169358A (en) | 2017-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |