CN108399321B - Software local plagiarism detection method based on dynamic instruction dependence graph birthmark - Google Patents

Info

Publication number
CN108399321B
Authority
CN
China
Prior art keywords
function
instruction
ins
graph
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711072012.5A
Other languages
Chinese (zh)
Other versions
CN108399321A (en)
Inventor
田振洲
王忠民
陈彦萍
张恒山
夏虹
刘烃
郑庆华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications filed Critical Xian University of Posts and Telecommunications
Priority to CN201711072012.5A priority Critical patent/CN108399321B/en
Publication of CN108399321A publication Critical patent/CN108399321A/en
Application granted granted Critical
Publication of CN108399321B publication Critical patent/CN108399321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/16Program or content traceability, e.g. by watermarking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a software local plagiarism detection method based on dynamic instruction dependence graph birthmarks, comprising the following steps: 1) monitoring the program to be analyzed at the instruction level using dynamic instrumentation, and capturing the instruction trace of each function; 2) performing data dependence and control dependence analysis on the recorded dynamic instruction trace of each function, and constructing a dynamic instruction dependence graph birthmark; 3) calculating the similarity between instruction dependence graph birthmarks to measure the similarity between functions; 4) constructing a suspicious function table for each function in the original program based on a given threshold; 5) extracting the static function call graph of each program, and pairing suspicious functions accurately one to one under the guidance of the call dependency relationships; 6) assembling the matched function pairs, based on the call dependency relationships, into a plagiarism evidence graph, and measuring the proportion of the suspected plagiarized part. The method detects local plagiarism by constructing function-level birthmarks; the invention is the first to propose the concept of a plagiarism evidence graph, which can greatly enhance evidentiary strength.

Description

Software local plagiarism detection method based on dynamic instruction dependence graph birthmark
Technical Field
The invention relates to the fields of software dynamic behavior analysis and software plagiarism detection, and in particular to a software local plagiarism detection method based on dynamic instruction dependence graph birthmarks.
Background
Developing software is costly in both time and labor, and reusing existing code during project development is very common. The rapid growth of open source communities such as GitHub and SourceForge and of social programming websites such as CodeShare has brought prosperity to the software industry. However, the accompanying problem of software plagiarism is becoming more serious: misuse of other people's code is far from rare, and software infringement cases occur frequently. For example, the 'Green Dam' software was exposed as having copied a large amount of CyberSitter code, and its objectionable-image filtering files cximate.dll, cimage.dll, xcore.dll and others also contain a large amount of OpenCV code. More recently, APICloud, a product of Grapefruit Technology, became involved in an infringement case on suspicion of copying a large amount of code from DCloud by Digital Paradise; the judicial appraisal report confirmed that APICloud not only directly used DCloud's dll files but also copied a large amount of DCloud source code for functions such as running on a real device and viewing changes while editing. A plagiarist who disregards cost can use any currently feasible means, technology, and resource to achieve the goal, so research on a practical software plagiarism detection method is very challenging.
Software birthmarks are characteristics extracted from a program's code or execution that are hard to alter and can uniquely identify the software; birthmark techniques judge possible plagiarism between programs by measuring the similarity of their birthmarks. According to how the birthmark is generated, methods are divided into static and dynamic. The former statically analyze program code and generate a birthmark from lexical, syntactic, or structural characteristics of the program; the latter actually execute the software and construct a birthmark from the captured execution trace or from the control and data dependencies among elements of the trace, and can therefore describe the behavior and semantics of the software more closely. Although existing methods solve the plagiarism detection problem to some extent, they have a series of limitations, including:
1) Many methods apply only when source code is available. In many cases the program source code cannot be obtained; plagiarized commercial software in particular is usually released as executable files to avoid detection. Unless a judicial authority intervenes, the defendant will not readily hand over its source code, and in legal proceedings the defendant may stall or refuse to hand over the source code on the grounds that the burden of proof lies with the plaintiff.
2) To make detection harder, a plagiarist often uses automated code obfuscation techniques and tools to change the program's control structure, insert junk code, rearrange the code layout, and so on, producing code that looks different from the original program while preserving its semantics. In particular, packing (shell-based obfuscation) effectively defeats detection methods based on static analysis unless the software can be unpacked beforehand. In general, existing approaches remain weak against complex code obfuscation.
3) Plagiarism by the defendant (the plagiarizing party) of the plaintiff (the plagiarized party) takes two basic forms: whole plagiarism, in which all of the plaintiff's code is appropriated, and local plagiarism, in which only part of the plaintiff's code is appropriated. In the APICloud infringement case, for example, the defendant appropriated only DCloud functional modules such as code hints, running on a real device, and viewing changes while editing. Existing birthmark techniques mainly model the behavior and semantics of a whole program and only judge the overall similarity, or whole plagiarism, between the plaintiff's and defendant's software. In reality, local plagiarism is the more common case: the defendant may have appropriated a few functional modules of the plaintiff, or only a few core functions. The overall similarity between the two programs is therefore of limited value; it matters more to measure the plagiarized proportion accurately, to locate the counterpart of each suspected plagiarized part in the plaintiff's software, and to organize and present the local plagiarism detection results in a form that is more expressive and explanatory and even has some evidentiary strength.
Therefore, to solve the above problems, a better software plagiarism detection method is needed. It should work directly on binary object programs, resist the current mainstream automated code obfuscation techniques, and in particular effectively support the detection of local software plagiarism.
Disclosure of Invention
The invention aims to provide a software local plagiarism detection method based on dynamic instruction dependence graph birthmarks, so as to overcome the limitations of current birthmark-based plagiarism detection.
To achieve this purpose, the invention adopts the following technical scheme:
The software local plagiarism detection method based on dynamic instruction dependence graph birthmarks comprises the following steps:
S101) monitoring the execution of the original program and the reported program to be analyzed at the instruction level using dynamic instrumentation, and capturing and recording the dynamic instruction trace of each function in real time;
S102) performing data dependence and control dependence analysis on the recorded dynamic instruction trace of each function, and constructing a dynamic instruction dependence graph birthmark as a representation of the function's behavior and semantics;
S103) calculating the similarity between a function of the reported program and a function of the original program by measuring the similarity of their instruction dependence graph birthmarks;
S104) for each function in the original program, screening out all functions in the reported program that are highly similar to it, to form a suspicious function table;
S105) extracting the static call dependency graphs of the original and reported programs, and pairing the suspicious functions accurately one to one under the guidance of the function call dependency relationships;
S106) assembling the matched function pairs, based on the call dependency relationships, into a plagiarism evidence graph, measuring the proportion of the suspected plagiarized part, and outputting the evidence graph, which captures the correspondence of the suspected plagiarized parts, together with the proportion value as the result of local plagiarism detection.
In a further improvement of the invention, the dynamic instruction trace of each function in step S101) is captured and recorded as follows: using dynamic instrumentation, analysis code is inserted before each instruction of the monitored binary object program executes; the analysis code obtains the ID of the function containing the instruction, parses the instruction's assembly representation in real time, and appends the assembly representation and the instruction address to the end of the instruction sequence identified by that function ID. The dynamic instruction trace of each function is recorded in the form (function ID, trace), where $trace = \langle ins_1, ins_2, \ldots, ins_n \rangle$ is the instruction sequence, $i = 1, 2, 3, \ldots, n$, and $n$ is the length of the instruction sequence; $ins_i$ denotes a specific instruction, $ins_i.text$ is the assembly representation of the instruction, and $ins_i.location$ is the address of the instruction.
In a further improvement of the invention, the dynamic instruction dependence graph of a function in step S102) is generated as follows: let f be a function in the program to be analyzed; data and control dependence analysis is performed on its dynamic instruction trace to construct a directed graph $IDG(f) = \{V, E\}$, where $V$ is the set of nodes and $E$ is the set of edges. Each node $v_i \in V$ corresponds to a specific instruction in the instruction trace and is uniquely identified by the assembly form and the address of that instruction; that is, for every $v_i \in V$ there is at least one instruction $ins_i$ in the instruction trace corresponding to it, and node $v_i$ is identified by $ins_i.text\#ins_i.location$. Each edge $e_i \in E$ indicates that a dynamic data dependency or control dependency exists between the two associated instructions; the direction of the edge gives the direction of the dependency, and the label of the edge gives the number of times the dependency occurs. Specifically, for two instructions $ins_i$ and $ins_j$ in the trace, a dynamic data or control dependency exists between them if either of the following conditions holds: a) $ins_j$ accesses the destination operand of $ins_i$, and no other instruction between $ins_i$ and $ins_j$ writes to that destination operand, i.e. $ins_j$ is dynamically data dependent on $ins_i$; b) $ins_i$ is some type of control jump instruction, $ins_j$ lies on the branch taken in this execution, and there is no other control jump instruction between $ins_i$ and $ins_j$, i.e. $ins_j$ is dynamically control dependent on $ins_i$. In either case a directed edge is added from the node corresponding to $ins_i$ to the node corresponding to $ins_j$, or, if the edge already exists, its label value is updated. The directed graph $IDG(f)$ generated in this way is called the dynamic instruction dependence graph birthmark of the function.
In a further improvement of the invention, the similarity between functions in step S103) is calculated as follows: the similarity of two functions is measured by the similarity of their instruction dependence graph birthmarks; that is, for a function f in the original program and a function g in the reported program, $sim(f, g) = sim(IDG(f), IDG(g))$.
In a further improvement of the invention, the similarity of instruction dependence graph birthmarks in step S103) is calculated as follows: the instruction dependence graph is treated as a general complex network and converted into vector form by means of the random walk with restart (RWR) algorithm; the cosine similarity between the vectors is then taken as the similarity of the instruction dependence graph birthmarks.
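The text does not say how the random walk is parameterized, so the following is only a minimal sketch of one way RWR could score the nodes of a weighted directed graph. The function name `rwr_importance`, the restart probability of 0.15, the uniform restart distribution, and the handling of nodes without outgoing edges are all assumptions made for illustration.

```python
def rwr_importance(nodes, edges, restart=0.15, iterations=100):
    """Score nodes by a random walk with restart (power iteration).

    nodes: iterable of node identifiers
    edges: dict mapping (src, dst) -> positive weight (dependency count)
    Assumption: the walker restarts at a uniformly chosen node.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    # Total outgoing weight per node, used to normalize transition probabilities.
    out_weight = {v: 0.0 for v in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: restart / n for v in nodes}
        # Follow outgoing edges in proportion to their weights.
        for (src, dst), w in edges.items():
            nxt[dst] += (1 - restart) * score[src] * w / out_weight[src]
        # Assumption: a walker stuck at a node with no outgoing edge restarts uniformly.
        dangling = sum(score[v] for v in nodes if out_weight[v] == 0)
        for v in nodes:
            nxt[v] += (1 - restart) * dangling / n
        score = nxt
    return score

example_edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}
example_scores = rwr_importance({"a", "b", "c"}, example_edges)
```

In this toy graph node `c` receives flow from both other nodes and ends up with the highest score, while `a`, which has no incoming edge, ends up with the lowest.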
In a further improvement of the invention, the suspicious function table for each function of the original program in step S104) is constructed as follows: first, the pairwise similarity between all functions of the original program and all functions of the reported program is calculated using the method of step S103). Then, for each function in the original program, all functions in the reported program whose similarity is at or above an adjustable threshold $\varepsilon$ are screened out to form its suspicious function table; that is, for a function f in the original program F, the suspicious functions highly similar to it in the reported program G are
$$Candidate(f) = \{\, g \mid g \in G,\ sim(f, g) \ge \varepsilon \,\}.$$
The adjustable threshold $\varepsilon$ is between 0.7 and 0.85.
In a further improvement of the invention, the one-to-one pairing of suspicious functions in step S105) is realized as follows: first, the static function call graph of the original program F is extracted; for any function f in F, the set of functions that call f or are called by f is obtained from the call dependency relationships contained in the graph; this set is called the call context of f and is denoted CallRelation(f). The call context CallRelation(g) of any function g in the reported program G is obtained in the same way. Then the similarity between the call context of f and that of each function in its suspicious function table is calculated, i.e. for each g ∈ Candidate(f), ContextSim(f, g) is computed as
Figure RE-GDA0001525327780000051 (formula defining ContextSim(f, g))
The g with the maximum ContextSim(f, g) value is taken as the unique match of f, and g is deleted from Candidate(f') of every other function f' of the original program; the suspicious functions are paired one by one in this iterative manner.
In a further improvement of the invention, the plagiarism evidence graph in step S106) is constructed as follows: let SCG(F) and SCG(G) be the static function call graphs of the original program F and the reported program G respectively, and let Match be the set of one-to-one function pairs obtained in step S105); the graph structure built by the following steps is called a plagiarism evidence graph. First, the function pairs in Match are assembled onto the static function call graphs of the original and reported programs: for each function matching pair (f, g) ∈ Match, a virtual edge is added between the node of f in SCG(F) and the node of g in SCG(G) and labeled with the similarity sim(f, g) of the pair, which associates the suspicious functions of the original and reported programs on the static function call graphs. Then, all nodes not touched by a virtual edge are pruned from the graph.
In a further improvement of the invention, the plagiarism proportion in step S106) is measured as follows:
Figure RE-GDA0001525327780000052 (formula for the plagiarism proportion)
where |F| is the number of functions in the original program, and count returns the number of instructions of the specified function.
Compared with the prior art, the invention has the following advantages:
(1) The method works directly on the binary object code of the programs under test, so it can still be used when program source code is unavailable and is therefore more generally applicable.
(2) The invention takes the function as the basic unit for constructing birthmarks; function-level birthmarks support the detection of local plagiarism, whereas existing birthmark techniques can only detect whole plagiarism.
(3) By analyzing the control and data dependencies among instructions, the invention constructs a dynamic instruction dependence graph as the function birthmark, which is less susceptible to semantics-preserving code obfuscation and improves resistance to deep obfuscation.
(4) The method uses a random walk algorithm to convert graph-form birthmarks into vector form, which is easier to compare; compared with the traditional approach of computing graph similarity by subgraph isomorphism, the efficiency of birthmark similarity calculation is greatly improved.
(5) The method uses the call dependency relationships among functions to guide the accurate one-to-one pairing of suspicious functions in the original and reported programs, which both improves pairing efficiency and greatly reduces mismatches.
(6) The invention is the first to propose the concept of a plagiarism evidence graph. The evidence graph is generated by assembling scattered function matching pairs and captures the strong correlation among suspicious modules, so that fine-grained evidence accumulates; this can greatly enhance evidentiary strength and is of greater practical significance.
Drawings
FIG. 1 is an overall flowchart of the software local plagiarism detection method based on dynamic instruction dependence graph birthmarks according to the invention;
FIG. 2 is a flowchart for generating the dynamic instruction dependence graph birthmark from the dynamic instruction trace of a function;
FIG. 3 is a schematic diagram of a dynamic instruction dependence graph;
FIG. 4 is a flowchart of the call-dependency-guided one-to-one pairing of suspicious functions;
FIG. 5 is a schematic diagram of a plagiarism evidence graph.
Detailed Description
An embodiment of the software local plagiarism detection method based on dynamic instruction dependence graph birthmarks according to the invention is described in detail below with reference to the accompanying drawings.
FIG. 1 shows the processing flow of the method. The original program is the program developed by the program owner, and the reported program is the suspicious program believed to plagiarize the original program.
The software local plagiarism detection method based on dynamic instruction dependence graph birthmarks according to the invention comprises the following steps:
Step S101: Using a dynamic instrumentation framework such as Pin or Valgrind, analysis code is inserted before each instruction of the monitored binary target program executes; the analysis code obtains the ID of the function containing the instruction, parses the instruction's assembly representation in real time, and appends the assembly representation and the instruction address to the end of the instruction sequence identified by that function ID. The dynamic instruction trace of each function is recorded in the form (function ID, trace), where $trace = \langle ins_1, ins_2, \ldots, ins_n \rangle$ is the instruction sequence, $ins_i$ denotes a specific instruction, $ins_i.text$ is the assembly representation of the instruction, and $ins_i.location$ is the address of the instruction.
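The (function ID, trace) record can be illustrated with a small sketch. It does not call Pin or Valgrind; the `events` list is a hypothetical stand-in for what an instrumentation tool would emit per executed instruction, and the names `Ins` and `record_traces` are chosen here for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple

class Ins(NamedTuple):
    text: str      # assembly representation of the instruction
    location: int  # address of the instruction

def record_traces(events):
    """Group (function_id, text, location) events into per-function traces."""
    traces = defaultdict(list)
    for function_id, text, location in events:
        # Append to the end of the sequence identified by the function ID.
        traces[function_id].append(Ins(text, location))
    return dict(traces)

# Hypothetical event stream in execution order.
events = [
    ("main", "mov eax, ebx", 0x401000),
    ("helper", "sub esp, 0x15", 0x402000),
    ("main", "jmp 0x401010", 0x401002),
]
traces = record_traces(events)
```

Here `traces["main"]` holds the two instructions executed inside `main`, in execution order, even though an instruction of `helper` ran between them.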
Step S102: Data dependence and control dependence analysis is performed on the dynamic instruction trace of each function to construct a dynamic instruction dependence graph. With reference to FIG. 2, the specific process is as follows:
Step S201: determine whether the dynamic instruction trace of the function contains at least two instructions; if so, go to step S202, otherwise go to step S2010;
Step S202: take the first instruction $ins_i$ from the instruction trace and remove it from the current trace;
Step S203: take the next instruction $ins_j$ from the instruction trace and analyze whether a dynamic control dependency or data dependency exists between it and $ins_i$;
Dynamic control and data dependencies are identified as follows:
a) if $ins_j$ accesses the destination operand of $ins_i$, and no other instruction between $ins_i$ and $ins_j$ writes to that destination operand, then $ins_j$ is dynamically data dependent on $ins_i$;
b) if $ins_i$ is some type of control jump instruction, $ins_j$ lies on the branch taken in this execution, and there is no other control jump instruction between $ins_i$ and $ins_j$, then $ins_j$ is dynamically control dependent on $ins_i$.
Step S204: if a control dependency or data dependency exists, go to step S205; otherwise go to step S209;
Step S205: construct the tuple $(ins_i, ins_j)$ to indicate that $ins_j$ is control or data dependent on $ins_i$;
Step S206: using the tuple as the key, look up whether a corresponding element exists in the set B (B is initially empty); if so, go to step S208, otherwise go to step S207;
Step S207: create a new element with the tuple as its key and the value 1, add this key-value pair to the set B, and go to step S209;
Step S208: find the element in the set B and update its value, incrementing the occurrence count;
Step S209: determine whether $ins_j$ is the last element of the instruction trace; if not, i.e. instructions remain to be analyzed, go to step S203 for the next round; otherwise go to step S201;
Step S2010: output the set B of key-value pairs.
The keys of the elements in set B encode the dependencies between instructions, and the values record the number of times each dependency occurs. On this basis a directed graph $IDG(f) = \{V, E\}$ can be constructed, where $V$ is the set of nodes and $E$ is the set of edges. Each node $v_i \in V$ corresponds to a specific instruction in the instruction trace and is uniquely identified by the assembly form and the address of the instruction; each edge $e_i \in E$ indicates that a dynamic data dependency or control dependency exists between the two associated instructions, the direction of the edge gives the direction of the dependency, and the label of the edge gives the number of times the dependency occurs. The directed graph $IDG(f)$ is called the dynamic instruction dependence graph birthmark of the function f; FIG. 3 shows a schematic diagram of an IDG.
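The construction above can be sketched in a few lines. This is a simplified single-pass variant rather than the pairwise scan of steps S201 to S2010: it tracks the last writer of each operand for rule a) and the most recent jump for rule b). The fields `reads`, `writes`, and `is_jump` are assumed to have been parsed from the assembly text beforehand, and all names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple

class Ins(NamedTuple):
    text: str
    location: int
    reads: frozenset = frozenset()   # operands read (assumed pre-parsed)
    writes: frozenset = frozenset()  # operands written (assumed pre-parsed)
    is_jump: bool = False            # True for control jump instructions

def node_id(ins):
    # Nodes are identified by "text#location", as in the description.
    return f"{ins.text}#{ins.location:#x}"

def build_idg(trace):
    """Return (nodes, edges); edges maps (src, dst) -> occurrence count."""
    edges = Counter()
    last_writer = {}  # operand -> most recent instruction that wrote it
    last_jump = None  # most recent control jump instruction
    for ins in trace:
        # Rule a): data dependence on the latest writer of each operand read.
        for operand in ins.reads:
            producer = last_writer.get(operand)
            if producer is not None:
                edges[(node_id(producer), node_id(ins))] += 1
        # Rule b): control dependence on the nearest preceding jump.
        if last_jump is not None:
            edges[(node_id(last_jump), node_id(ins))] += 1
        for operand in ins.writes:
            last_writer[operand] = ins
        if ins.is_jump:
            last_jump = ins
    nodes = {node_id(ins) for ins in trace}
    return nodes, edges

trace = [
    Ins("mov eax, 0x2", 0x1000, writes=frozenset({"eax"})),
    Ins("cmp eax, 0x2", 0x1005, reads=frozenset({"eax"}), writes=frozenset({"eflags"})),
    Ins("jz 0x1010", 0x1008, reads=frozenset({"eflags"}), is_jump=True),
    Ins("add ebx, eax", 0x1010, reads=frozenset({"ebx", "eax"}), writes=frozenset({"ebx"})),
]
nodes, edges = build_idg(trace)
```

Because edges are keyed by node identifier, an instruction executed repeatedly in a loop maps to the same node, and repeated dependencies raise the edge count instead of adding new edges.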
Step S103: For a function f in the original program and a function g in the reported program, the similarity of their instruction dependence graph birthmarks is calculated to measure the similarity between the functions. The specific flow is as follows:
a) the instruction dependence graph is treated as a general complex network, and the importance of each node in the graph is calculated using the random walk with restart (RWR) algorithm;
b) the assembly form of the instruction corresponding to each node in the graph is normalized according to the following abstraction rules: memory addresses are uniformly abstracted to MEM, registers to REG, and immediate values to VAL. For example, the instructions mov [0x602c334], 0x2; mov [0x602c33c], eax; sub esp, 0x15; mov eax, ebx are normalized to mov MEM, VAL; mov MEM, REG; sub REG, VAL; mov REG, REG respectively;
c) a key-value pair <norm(v), dgr(v)> is constructed for each node v in the graph, where norm(v) is the normalized form of the instruction corresponding to the node and dgr(v) is the importance of the node; if several nodes have the same normalized form, the sum of their importance values is taken as the value of the key. The instruction dependence graph is thereby converted into a set of key-value pairs;
d) the instruction dependence graphs of functions f and g are each processed by steps a) to c), giving the key-value pair sets KVSet(f) and KVSet(g); the union of the keys of KVSet(f) and KVSet(g) is taken as the dimensions of a vector space and the value of each key as the coordinate on that dimension (0 when a key is absent from a set), so that KVSet(f) and KVSet(g) are converted into vectors $\vec{v}_f$ and $\vec{v}_g$ of the same dimension;
e) the cosine similarity of the two vectors is calculated as the similarity of the instruction dependence graph birthmarks, so the similarity between the functions is
$$sim(f, g) = sim(IDG(f), IDG(g)) = \frac{\vec{v}_f \cdot \vec{v}_g}{\lVert \vec{v}_f \rVert \, \lVert \vec{v}_g \rVert}.$$
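Steps b) to e) can be sketched as follows, assuming the node importance values from step a) are already available as a dictionary keyed by "text#location". The register list is a deliberately small x86-32 subset and the operand parsing is a simplification; the names `normalize`, `kv_set`, and `cosine_similarity` are hypothetical.

```python
import math
import re

# Assumption: a small subset of x86-32 registers, enough for the examples.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}

def normalize(text):
    """Abstract operands to MEM, REG, or VAL (step b)."""
    mnemonic, _, rest = text.strip().partition(" ")
    operands = []
    for op in filter(None, (o.strip() for o in rest.split(","))):
        if "[" in op:
            operands.append("MEM")
        elif op.lower() in REGISTERS:
            operands.append("REG")
        elif re.fullmatch(r"0x[0-9a-fA-F]+|\d+", op):
            operands.append("VAL")
        else:
            operands.append(op)  # leave anything unrecognized untouched
    return mnemonic + (" " + ", ".join(operands) if operands else "")

def kv_set(importance):
    """Sum node importance per normalized instruction form (step c)."""
    kv = {}
    for node, dgr in importance.items():
        text = node.rsplit("#", 1)[0]  # drop the "#location" suffix
        key = normalize(text)
        kv[key] = kv.get(key, 0.0) + dgr
    return kv

def cosine_similarity(kv_f, kv_g):
    """Embed both key-value sets over the union of keys and compare (steps d, e)."""
    keys = sorted(set(kv_f) | set(kv_g))
    vf = [kv_f.get(k, 0.0) for k in keys]
    vg = [kv_g.get(k, 0.0) for k in keys]
    dot = sum(a * b for a, b in zip(vf, vg))
    norm = math.sqrt(sum(a * a for a in vf)) * math.sqrt(sum(b * b for b in vg))
    return dot / norm if norm else 0.0

# Hypothetical importance values for two small functions.
imp_f = {"mov eax, ebx#0x1000": 0.5, "mov ecx, edx#0x1004": 0.3, "sub esp, 0x15#0x1008": 0.2}
imp_g = {"mov esi, edi#0x2000": 0.8, "sub esp, 0x20#0x2004": 0.2}
similarity = cosine_similarity(kv_set(imp_f), kv_set(imp_g))
```

The two toy functions use different registers and immediates but normalize to the same forms with the same total importance, so their similarity is 1 up to rounding.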
Step S104: A suspicious function table is constructed for each function in the original program. First, the method of step S103 is applied repeatedly to obtain the pairwise similarity between all functions of the original program and all functions of the reported program. Then, for each function in the original program, all functions in the reported program whose similarity is at or above an adjustable threshold $\varepsilon$ are screened out to form the suspicious function table of that function; that is, for a function f in the original program F, the suspicious function table is defined as
$$Candidate(f) = \{\, g \mid g \in G,\ sim(f, g) \ge \varepsilon \,\},$$
where G is the reported program and the adjustable threshold $\varepsilon$ is between 0.7 and 0.85.
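A minimal sketch of the screening step, assuming the pairwise similarity is available as a callable. The name `build_candidates`, the default threshold of 0.8 (chosen from the stated 0.7 to 0.85 range), the inclusive comparison, and the scores in the example are all assumptions for illustration.

```python
def build_candidates(sim, original_funcs, reported_funcs, threshold=0.8):
    """Map each original function to the reported functions at or above the threshold."""
    return {
        f: {g for g in reported_funcs if sim(f, g) >= threshold}
        for f in original_funcs
    }

# Hypothetical similarity scores; pairs not listed are treated as 0.
scores = {
    ("f1", "g1"): 0.95,
    ("f1", "g2"): 0.82,
    ("f2", "g2"): 0.90,
    ("f3", "g3"): 0.40,
}
candidates = build_candidates(
    lambda f, g: scores.get((f, g), 0.0),
    ["f1", "f2", "f3"],
    ["g1", "g2", "g3"],
)
```

In this example `f1` keeps two candidates, `f2` keeps one, and `f3` keeps none, which is the situation the subsequent pairing step has to resolve.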
Step S105: Owing to factors such as internal code cloning, several highly similar functions may be found in the reported program for a function f of the original program, i.e. $|Candidate(f)| > 1$. To address this, the suspicious functions of the original and reported programs are paired accurately one to one under the guidance of the call dependency relationships, so that after processing $|Candidate(f)| \le 1$ always holds for every $f \in F$. The processing flow is as follows:
First, the static function call graphs SCG(F) and SCG(G) of the original program F and the reported program G are extracted using a disassembly tool such as IDA;
Then, with reference to FIG. 4, the one-to-one pairing process is described in detail:
Step S401: determine whether the original program still has functions to be analyzed; if so, go to step S402, otherwise go to step S407;
Step S402: determine whether any unanalyzed function of the original program has a suspicious function table of length 1; if so, go to step S403, otherwise go to step S404;
Step S403: from the set S of functions of the original program, select any function f satisfying $|Candidate(f)| = 1$, i.e. a function with exactly one corresponding function g in the reported program; add the matching pair (f, g) to the set Match and delete f from S; at the same time update the suspicious function tables of the remaining functions in S by deleting g from each of them. Then go to step S402 for the next round;
Step S404: take any function f from the set S and delete it from S;
Step S405: determine whether the suspicious function table of f is empty; if so, go to step S401, otherwise go to step S406;
Step S406: select from the suspicious function table of f the function g with the greatest context similarity to f, add the function pair (f, g) to the set Match, and delete f from S; at the same time update the suspicious function tables of the remaining functions in S by deleting g from each of them. Then go to step S401;
specifically, the function g with the maximum context similarity is selected from the suspicious function table of f as follows:
a) from the call dependencies contained in SCG(F) of the original program, the set of functions that call f or are called by f is obtained; this set is called the calling context of f and is written CallRelation(f); in the same way, the calling context CallRelation(g) of any function g in the suspicious function table of f is obtained;
b) for each g ∈ Candidate(f), the similarity between its calling context and that of f is calculated as
[Formula image RE-GDA0001525327780000101: ContextSim(f, g)]
and the g with the maximum ContextSim(f, g) value is taken as the unique pairing of the function f;
step S407: output the set Match of function matching pairs.
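The pairing loop of steps S401 to S407 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the names `pair_functions`, `candidates` and `context_sim` are hypothetical, and because the ContextSim formula is only available as an image, it is passed in as a caller-supplied function. The sketch also stops as soon as the set S is empty, including when that happens right after step S403.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Func = Hashable  # any hashable function identifier (name, address, ...)


def pair_functions(
    candidates: Dict[Func, Set[Func]],
    context_sim: Callable[[Func, Func], float],
) -> List[Tuple[Func, Func]]:
    """One-to-one pairing following steps S401-S407.

    candidates maps each function f of the original program to its
    suspicious function table Candidate(f); context_sim is a stand-in
    for ContextSim(f, g), whose exact formula is not reproduced here.
    """
    # Work on copies so the caller's tables are left untouched.
    cand = {f: set(gs) for f, gs in candidates.items()}
    match: List[Tuple[Func, Func]] = []

    while cand:                                               # S401
        unique = [f for f, gs in cand.items() if len(gs) == 1]  # S402
        if unique:                                            # S403
            f = unique[0]
            (g,) = cand.pop(f)        # the single candidate of f
        else:
            f = next(iter(cand))                              # S404
            table = cand.pop(f)
            if not table:                                     # S405
                continue              # f has no counterpart in G
            # S406: candidate with the largest context similarity
            g = max(table, key=lambda c: context_sim(f, c))
        match.append((f, g))
        # g is now taken: remove it from every remaining table.
        for table in cand.values():
            table.discard(g)
    return match                                              # S407
```

Resolving the functions with a single candidate first shrinks the tables of the remaining functions, so the context similarity is only consulted where the instruction-level similarity alone leaves an ambiguity.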
Step S106: assemble the matched function pairs to generate a plagiarism evidence graph, measure the proportion of the suspected plagiarized part, and output the evidence graph, which captures the correspondence between the suspected plagiarized parts, together with the proportion value, as the result of local plagiarism detection.
Fig. 5 gives a schematic diagram of a plagiarism evidence graph, which is constructed as follows:
a) the function pairs in Match are assembled onto the static function call graphs of the original and reported programs, i.e., for each function matching pair (f, g) ∈ Match, a virtual edge is added between the node corresponding to f in SCG(F) and the node corresponding to g in SCG(G), and the similarity sim(f, g) of the pair is marked on the edge, thereby associating the suspicious functions of the two programs on their static function call graphs;
b) all nodes not associated by a virtual edge are pruned from the graph.
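Steps a) and b) can be sketched in Python as follows. The representation is an assumption made for illustration only: each call graph is a dict mapping a caller to the set of its callees, nodes are tagged with "F" or "G" so that equal function names in the two programs do not collide, and `build_evidence_graph` is a hypothetical name.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Tuple[str, Hashable]  # ("F", function) or ("G", function)


def build_evidence_graph(
    scg_f: Dict[Hashable, Set[Hashable]],
    scg_g: Dict[Hashable, Set[Hashable]],
    match: Iterable[Tuple[Hashable, Hashable, float]],
):
    """Return (nodes, call_edges, virtual_edges) of the evidence graph.

    match holds (f, g, sim(f, g)) triples for the paired functions.
    """
    # a) one virtual edge per matched pair, labeled with sim(f, g)
    virtual_edges: Dict[Tuple[Node, Node], float] = {
        (("F", f), ("G", g)): sim for f, g, sim in match
    }

    # b) prune: keep only nodes that a virtual edge touches
    nodes: Set[Node] = {n for edge in virtual_edges for n in edge}

    # Keep the call edges whose two endpoints both survive the pruning.
    call_edges: Set[Tuple[Node, Node]] = set()
    for side, scg in (("F", scg_f), ("G", scg_g)):
        for caller, callees in scg.items():
            for callee in callees:
                u, v = (side, caller), (side, callee)
                if u in nodes and v in nodes:
                    call_edges.add((u, v))
    return nodes, call_edges, virtual_edges
```

The surviving call edges show whether matched functions also call each other in the same way in both programs, which is the structural correspondence the evidence graph is meant to expose.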
Meanwhile, the proportion of the suspected plagiarized part is calculated with the following formula:
[Formula image RE-GDA0001525327780000111: proportion of the suspected plagiarized part]
where |F| is the number of functions in the original program, and count returns the number of instructions of the specified function.

Claims (8)

1. A software local plagiarism detection method based on dynamic instruction dependence graph birthmarks, characterized by comprising the following steps:
S101) monitoring the instruction-level execution of the original program and the reported program to be analyzed by means of dynamic instrumentation, and capturing and recording the dynamic instruction trace of each function in real time;
S102) performing data dependence and control dependence analysis on the recorded dynamic instruction trace of each function, and constructing a dynamic instruction dependence graph birthmark as a representation of the behavior and semantics of the function;
S103) calculating the similarity between functions by measuring the similarity between the instruction dependence graph of a function of the reported program and that of a function of the original program;
S104) for each function in the original program, screening all functions with high similarity from the reported program to form a suspicious function table;
S105) extracting the static call dependency graphs of the original and reported programs, and pairing the suspicious functions accurately, one to one, under the guidance of the function call dependencies;
S106) assembling the matched function pairs to generate a plagiarism evidence graph, measuring the proportion of the suspected plagiarized part, and outputting the evidence graph, which captures the correspondence between the suspected plagiarized parts, together with the proportion value, as the result of local plagiarism detection;
the plagiarism evidence graph in step S106) is constructed as follows: let SCG(F) and SCG(G) be the static function call graphs of the original program F and the reported program G, respectively, and let Match be the set of one-to-one function pairs obtained in step S105); the graph structure constructed through the following steps is called the plagiarism evidence graph: first, the function pairs in Match are assembled onto the static function call graphs of the original and reported programs, i.e., for each function matching pair (f, g) ∈ Match, a virtual edge is added between the node corresponding to f in SCG(F) and the node corresponding to g in SCG(G), and the similarity sim(f, g) of the pair is marked on the edge, thereby associating the suspicious functions of the two programs on their static function call graphs; then, all nodes not associated by a virtual edge are pruned from the graph.
2. The method according to claim 1, wherein the dynamic instruction trace of each function in step S101) is captured and recorded as follows: using dynamic instrumentation, analysis code is inserted before the execution of each instruction of the binary target program being monitored; the ID of the function containing each instruction is obtained, the assembly representation of the instruction is parsed in real time, and the assembly representation and the instruction address are appended to the end of the instruction sequence identified by that function ID; the dynamic instruction trace of each function is recorded in the form (function ID, trace), where trace = <ins_1, ins_2, …, ins_n> denotes an instruction sequence, i = 1, 2, 3, …, n, and n is the length of the instruction sequence; ins_i denotes a specific instruction, ins_i.text is the assembly representation of the instruction, and ins_i.location is the address of the instruction.
3. The method according to claim 1, wherein the dynamic instruction dependence graph of a function in step S102) is generated as follows: let f be a function in the program to be analyzed; data and control dependence analysis is performed on its dynamic instruction trace, and a directed graph IDG(f) = {V, E} is constructed, where V is the set of nodes and E is the set of edges; each node v_i ∈ V corresponds to a specific instruction in the instruction trace and is uniquely identified by the assembly form and the instruction address of that instruction, i.e., for each node (formula image FDA0003008253530000021) at least one instruction ins_i in the instruction trace corresponds to it, and the node v_i is identified by ins_i.text#ins_i.location; each edge e_i ∈ E indicates that a dynamic data dependence or control dependence exists between the two associated instructions, the direction of the edge describes the direction of the dependence, and the label of the edge describes the number of occurrences of the dependence; for two instructions ins_i and ins_j in the trace, a dynamic data dependence or control dependence exists between them if either of the following conditions is met: a) ins_j accesses the destination operand of ins_i, and no other instruction between ins_i and ins_j writes to the destination operand of ins_i, i.e., ins_j is dynamically data dependent on ins_i; b) ins_i is a control jump instruction of some type, ins_j lies on the branch taken in this execution, and there is no other control jump instruction between ins_i and ins_j, i.e., ins_j is dynamically control dependent on ins_i; in either case, a directed edge from the node corresponding to ins_i to the node corresponding to ins_j is added, or, if the edge already exists, its label value is updated; the directed graph IDG(f) thus generated is called the dynamic instruction dependence graph birthmark of the function.
4. The method according to claim 3, wherein the similarity between functions in step S103) is calculated as follows: the similarity of functions is measured by calculating the similarity of their instruction dependence graph birthmarks: for a function f in the original program and a function g in the reported program, the similarity between the two is sim(f, g) = sim(IDG(f), IDG(g)).
5. The method according to claim 4, wherein the similarity of instruction dependence graph birthmarks is calculated as follows: the instruction dependence graph is regarded as a general complex network and converted into vector form by a random walk with restart algorithm, and the cosine distance between the vectors is then calculated and used as the similarity of the instruction dependence graph birthmarks.
6. The method according to claim 1, wherein the suspicious function table for a function in the original program in step S104) is constructed as follows: first, the pairwise similarity of all functions between the original program and the reported program is calculated with the method of step S103); then, for each function in the original program, based on an adjustable threshold (symbol image FDA0003008253530000031), all functions in the reported program whose similarity is above the threshold are screened to form the suspicious function table: for a function f in the original program F, the suspicious functions in the reported program G that have high similarity to it are represented as Candidate(f) (formula image FDA0003008253530000032); the adjustable threshold (symbol image FDA0003008253530000033) takes a value between 0.7 and 0.85.
7. The method according to claim 1, wherein the suspicious functions in step S105) are paired as follows: first, the static function call graph of the original program F is extracted; for any function f in F, the set of functions that call f or are called by f is obtained from the call dependencies contained in the graph, and this set is called the calling context of f, written CallRelation(f); the calling context CallRelation(g) of any function g in the reported program G is obtained in the same way; then, the similarity between the calling context of f and that of each function in its suspicious function table is calculated: for each g ∈ Candidate(f), calculate
[Formula image FDA0003008253530000034: ContextSim(f, g)]
the g with the maximum ContextSim(f, g) value is taken as the unique pairing of the function f, and g is at the same time deleted from Candidate(f') of all other functions f' of the original program; the suspicious functions are paired one by one in this iterative manner.
8. The method according to claim 1, wherein the proportion of plagiarism in step S106) is measured as:
[Formula image FDA0003008253530000035: proportion of the suspected plagiarized part]
where |F| is the number of functions in the original program, and count returns the number of instructions of the specified function.
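The graph construction recited in claim 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the record type `Ins`, its fields `reads`, `writes` and `is_jump`, and the function `build_idg` are hypothetical, and operand extraction from the assembly text is assumed to have been done beforehand. Condition b) is read here as "every instruction is control dependent on the most recent preceding control jump", and a dependence between the same pair of nodes is counted at most once per executed instruction.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Ins:
    """One executed instruction of a trace (hypothetical record type)."""
    text: str                             # assembly representation
    location: int                         # instruction address
    reads: FrozenSet[str] = frozenset()   # operands read
    writes: FrozenSet[str] = frozenset()  # destination operands written
    is_jump: bool = False                 # control jump instruction?


def build_idg(trace: Iterable[Ins]) -> Tuple[Set[str], Counter]:
    """Build IDG(f) = {V, E} from the dynamic trace of one function.

    Returns the node set and a Counter mapping each directed edge
    (source node, target node) to its number of occurrences.
    """
    nodes: Set[str] = set()
    edges: Counter = Counter()
    last_writer: Dict[str, str] = {}   # operand -> node that last wrote it
    last_jump: Optional[str] = None    # most recent control jump node

    for ins in trace:
        # A node is identified by ins.text#ins.location.
        node = f"{ins.text}#{ins.location:#x}"
        nodes.add(node)

        # a) data dependence: the most recent writer of each operand read.
        deps = {last_writer[op] for op in ins.reads if op in last_writer}
        # b) control dependence on the most recent jump (assumed reading).
        if last_jump is not None:
            deps.add(last_jump)

        for src in deps:
            edges[(src, node)] += 1    # add the edge or update its label

        for op in ins.writes:
            last_writer[op] = node
        if ins.is_jump:
            last_jump = node
    return nodes, edges
```

Because nodes are keyed by assembly text and address, repeated executions of the same instruction in a loop fold into one node, and the repetition shows up only in the edge labels.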
CN201711072012.5A 2017-11-03 2017-11-03 Software local plagiarism detection method based on dynamic instruction dependence graph birthmark Active CN108399321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711072012.5A CN108399321B (en) 2017-11-03 2017-11-03 Software local plagiarism detection method based on dynamic instruction dependence graph birthmark

Publications (2)

Publication Number Publication Date
CN108399321A CN108399321A (en) 2018-08-14
CN108399321B 2021-05-18

Family

ID=63093563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711072012.5A Active CN108399321B (en) 2017-11-03 2017-11-03 Software local plagiarism detection method based on dynamic instruction dependence graph birthmark

Country Status (1)

Country Link
CN (1) CN108399321B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577323A (en) * 2013-09-27 2014-02-12 西安交通大学 Dynamic key command sequence birthmark-based software plagiarism detecting method
CN103870721A (en) * 2014-03-04 2014-06-18 西安交通大学 Multi-thread software plagiarism detection method based on thread slice birthmarks
CN107169358A (en) * 2017-05-24 2017-09-15 中国人民解放军信息工程大学 Code homology detection method and its device based on code fingerprint

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
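Behavior based software theft detection; Wang X et al.; Proceedings of the 16th ACM Conference on Computer and Communications Security; 2009-12-31; pp. 283-287 *
A survey of software plagiarism detection research (软件抄袭检测研究综述); Tian Zhenzhou (田振洲); Journal of Cyber Security (《信息安全学报》); 2016-07-31; Vol. 1, No. 3; pp. 52-70 *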

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant