WO2019114673A1 - Method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks - Google Patents

Method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks

Info

Publication number
WO2019114673A1
Authority
WO
WIPO (PCT)
Prior art keywords
path
function
fun
similarity
ins
Prior art date
Application number
PCT/CN2018/120179
Other languages
English (en)
French (fr)
Inventor
刘烃
徐茜
贾昂
刘欣宇
佟菲菲
郑庆华
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学
Publication of WO2019114673A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Definitions

  • The invention relates to the field of program feature discovery and software plagiarism detection, and in particular to a method for generating evidence of partial software plagiarism.
  • Existing software plagiarism detection techniques fall into three categories: source-code plagiarism detection, detection based on software watermarks, and detection based on software birthmarks.
  • The object of the present invention is to provide a method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks, in order to meet current software plagiarism detection needs.
  • The invention extracts the static information of a program by reverse analysis; extracts function birthmarks based on minimum branch paths from each function's control flow graph, basic blocks, and instruction sequences, to characterize the function's behavior; computes the similarity between function birthmarks, which gives the similarity between functions; and, from the inter-function similarity combined with the function call graph, constructs the optimal similar subgraph set, which serves as both the basis for judging partial plagiarism between programs and strong plagiarism evidence, providing prima facie evidence for real software infringement cases.
  • A method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks includes the following steps:
  • Step S101: Disassemble the executable binary files of the plaintiff program P and the defendant program Q, record and analyze the generated assembly code, preprocess the static information it contains, and store it in data tables;
  • Step S102: Based on the intra-function static control flow graph, take the instruction sequence contained in the basic blocks between the starting basic block of one branch and the starting basic block of the next branch as one minimum branch path of the function. The function birthmark FB_id of a function F_id is the set of all its minimum branch paths, PATH = {path_{id,i} | i = 0, 1, ..., n}. Extract the function birthmarks of all functions in P and Q, PB = {FB_i | i = 0, 1, ..., m_1} and QB = {FB_j' | j = 0, 1, ..., m_2}, where n is the number of minimum branch paths in FB_id, and m_1 and m_2 are the numbers of function birthmarks in P and Q, respectively;
  • Step S103: For every function birthmark in the plaintiff program P, compute its similarity to every function birthmark in the defendant program Q, SIM(FB_i, FB_j'), with FB_i ∈ PB and FB_j' ∈ QB;
  • Step S104: From the inter-function similarities and the inter-function call graph, find the similar subgraph set and construct the optimal similar subgraph set;
  • Step S105: Determine plagiarism from the optimal similar subgraph set and, if plagiarism exists, generate plagiarism evidence. Whether plagiarism exists is judged by comparing the size of the optimal similar subgraph set with the size of the original program; the optimal similar subgraph set itself can serve as evidence that the defendant program Q plagiarizes the plaintiff program P. If plagiarism exists, the optimal similar subgraph set obtained in step S104 is output as the plagiarism evidence.
  • Step S101 uses a reverse-analysis tool to undo compilation and assembly: the input is machine code and the output is assembly language. The binary executables of the plaintiff program P and the defendant program Q are disassembled, the resulting assembly code is analyzed, the static information is preprocessed, library functions and too-small functions are removed, and the remaining valid function information is stored in data tables in the form shown in Table 1;
  • The static information includes: basic blocks, functions, instructions, mnemonics, operands, intra-function static control flow graphs, and the inter-function call graph;
  • A too-small function is a function with fewer than 3 instructions.
  • To extract the minimum branch paths of a basic block b_id in step S102, create for each of its branches a path path_{id,i} starting at b_id, and keep appending successor basic blocks to each path until the next branch is reached.
  • The assembly instructions in the basic blocks that a path passes through constitute the minimum branch path, and the set PATH_id of these paths contains all minimum branch paths starting from that basic block.
  • To extract an assembly instruction from a basic block in step S102: first read the mnemonic of the instruction, then read the expression tree id of each of its operands, read the node ids belonging to that expression tree id, read the symbol or immediate value for each node id, traverse the nodes of the expression tree to obtain the operand, and finally combine the mnemonic with the operands to obtain the textual form of the instruction.
  • The similarity between function birthmarks in step S103 is calculated as follows. Let the birthmark FB_1 of function Fun_1 in the plaintiff program P and the birthmark FB_2' of function Fun_2 in the defendant program Q be PATH_1 = {path_{1,i} | i = 0, 1, ..., a} and PATH_2 = {path_{2,j} | j = 0, 1, ..., b}, where a and b are the numbers of minimum branch paths in the birthmarks of Fun_1 and Fun_2. For each path path_{1,i} in PATH_1, compute its similarity to every path path_{2,j} in PATH_2, find the path path_{2,match} that best matches path_{1,i}, and record the similarity sim(path_{1,i}, path_{2,match}).
  • Then, using the static information of Fun_1, take the number of assembly instructions l_i in each path as its weight and perform a weighted calculation to obtain the similarity between FB_1 and FB_2'; the formula is given as Figure PCTCN2018120179-appb-000001 in the description.
  • The similarity between the paths path_{1,i} and path_{2,j} in step S103 is computed in four stages: preprocessing, path alignment, computation of instruction similarity values from mnemonics and operands, and path similarity computation, as follows:
  • Preprocessing first deletes the jump instructions contained in the path and then abstracts the operands: the concrete operands of the instructions in the path are abstracted into three categories, registers, memory locations, and variable names, represented by REG, MEM, and VAL respectively;
  • Path alignment uses the LCS (longest common subsequence) algorithm, with identical mnemonics as the reference, to align the two paths path_{1,i} and path_{2,j} being compared;
  • The aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
  • The similarity values of the instructions in path_{1,i}' are summed to obtain the similarity value score(path_{1,i}', path_{2,j}') between the aligned paths; the same method gives the self-similarity values score(path_{1,i}, path_{1,i}) and score(path_{2,j}, path_{2,j}); finally, normalization yields the similarity between path_{1,i} and path_{2,j};
  • A similar subgraph is a subgraph in which functions are nodes and call relationships are edges, corresponding nodes have high similarity, and similar functions have the same call relationships;
  • The optimal similar subgraph is defined by giving each subgraph a score based on its number of nodes, the similarity values of corresponding nodes, and the node weights; the subgraph with the highest score is the optimal similar subgraph;
  • The optimal similar subgraph set is the set {G_1→G_1', G_2→G_2', ..., G_n→G_n'} obtained by adding one optimal similar subgraph at a time, where G_1, G_2, ..., G_n belong to the plaintiff program P, G_1', G_2', ..., G_n' belong to the defendant program Q, G_1, G_2, ..., G_n are pairwise disjoint, and G_1', G_2', ..., G_n' are pairwise disjoint; each G_i→G_i' (i = 1, 2, ..., n) is a similar subgraph.
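  • The optimal similar subgraph set in step S104 is found as follows:
  • 1) select the similar function pairs whose similarity exceeds the threshold ε_1: FF = {(Fun_i, Fun_j) | Fun_i ∈ P && Fun_j ∈ Q && SIM(Fun_i, Fun_j) > ε_1};
  • 2) based on the inter-function call graph, generate the similar subgraph set G of FF and compute the score S of each subgraph; the subgraph score S is the sum of the similarities of all function pairs in the subgraph, where n is the number of function pairs in the subgraph;
  • 3) extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
  • 4) if the score of the optimal similar subgraph is greater than ε_2, remove from FF the function pairs covered by the current optimal similar subgraph, FF = FF - {(Fun_i, Fun_j) | Fun_i ∈ G_b || Fun_j ∈ G_b}, and return to 2); otherwise stop and output the current optimal similar subgraph set;
  • The threshold ε_1 takes a value from 0.5 to 1; ε_2 is greater than 1 and less than the score S_b of the first extracted optimal similar subgraph G_b.
  • The similar subgraph set G of FF in step S104 is generated as follows: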
  • step S102 specifically includes the following steps:
  • Step S202 reading the content of the basic block b id from the static control flow graph in the function
  • Step S206 counter id++, and proceeds to step S202 for analysis of the next round;
  • Step S207 Output the set PATH of the minimum branch path as a function birthmark FB of the function F.
  • step S102 specifically includes the following steps:
  • Step S301 input basic block b id and its m+1 subsequent basic blocks b id,0 , b id,1 ,...b id,m ;
  • Step S304 creating a pointer pt pointing to the current subsequent basic block b id, i , pt ⁇ b id, i ;
  • Step S305 determining whether the basic block pointed to by the pointer pt has one and only one subsequent basic block pt.b s , and if so, proceeds to step S306, otherwise proceeds to step S307;
  • Step S308 determining whether the counter i>m, if yes, proceeding to step S309, otherwise proceeding to step S303 for analysis of the next round;
  • Step S309: Output the set PATH_id of all minimum branch paths of basic block b_id.
  • step S103 specifically includes the following steps:
  • Step S402: Read a minimum branch path path_{1,i} from the function birthmark PATH_1 = {path_{1,i} | i = 0, 1, ..., a} of function Fun_1; a is the number of minimum branch paths in the birthmark of Fun_1;
  • Step S403: Read the function birthmark PATH_2 = {path_{2,j} | j = 0, 1, ..., b} of function Fun_2 and compute the similarity of path_{1,i} to every path path_{2,j} in PATH_2; b is the number of minimum branch paths in the birthmark of Fun_2;
  • Step S406: Increment the counter i and return to step S402 for the next round of analysis;
  • Step S407: Using the inter-path similarity matrix SIM_Path and the static information of Fun_1 read from the function birthmark PATH_1, take the number of assembly instructions l_i in each path as its weight and perform the weighted calculation; the formula is given as Figure PCTCN2018120179-appb-000001 in the description.
  • A further improvement of the present invention is that the similarity between the paths path_{1,i} and path_{2,j} in step S103 is computed in four stages: preprocessing, path alignment, computation of instruction similarity values from mnemonics and operands, and path similarity computation. Specifically:
  • Step S501: Input the minimum branch paths path_{1,i} and path_{2,j};
  • Step S502: Preprocess path_{1,i} and path_{2,j}. First delete the jump instructions contained in each path (including JE, JNE, JZ, JNZ, JS, JNS, JC, JNC, JO, JNO, JA, JNA, JAE, JNAE, JG, JNG, JGE, JNGE, JB, JNB, JBE, JNBE, JL, JNL, JLE, JNLE, JP, JNP, JPE, JPO, etc.); then abstract the operands, so that the concrete operands of the instructions in the path fall into three categories, registers, memory locations, and variable names, represented by REG, MEM, and VAL respectively;
  • Step S503: Align the two paths path_{1,i} and path_{2,j} with the LCS (longest common subsequence) algorithm, using identical mnemonics as the reference.
  • The aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
  • Step S505: Compute the path similarity. Sum the similarity values of the instructions in path_{1,i}' to obtain the similarity value score(path_{1,i}', path_{2,j}') between the aligned paths. In the same way, obtain the self-similarity values score(path_{1,i}, path_{1,i}) and score(path_{2,j}, path_{2,j}). Finally, normalize to obtain the similarity between path_{1,i} and path_{2,j}; the formulas are given as Figures PCTCN2018120179-appb-000002 and PCTCN2018120179-appb-000003 in the description;
  • Step S506: Output the similarity sim(path_{1,i}, path_{2,j}) between the minimum branch paths path_{1,i} and path_{2,j}.
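
The preprocessing and alignment stages (steps S502 and S503) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the tuple representation of instructions, the `REGISTERS` subset, the rule that every mnemonic starting with `j` is a jump, and the rule that any operand that is neither a register nor a bracketed memory reference becomes `VAL` are all assumptions made for the example.

```python
from typing import List, Tuple

Instruction = Tuple[str, ...]  # (mnemonic, operand, operand, ...)

# Assumption: an illustrative subset of x86 register names.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def abstract_operand(op: str) -> str:
    """Map a concrete operand to REG, MEM, or VAL (heuristic, see lead-in)."""
    if op.lower() in REGISTERS:
        return "REG"
    if "[" in op:  # e.g. "ds:[dtor_idx.6161]"
        return "MEM"
    return "VAL"


def preprocess(path: List[Instruction]) -> List[Instruction]:
    """Drop jump instructions, then abstract every operand."""
    out = []
    for ins in path:
        mnemonic = ins[0].lower()
        if mnemonic.startswith("j"):  # assumption: j* mnemonics are jumps
            continue
        out.append((mnemonic,) + tuple(abstract_operand(o) for o in ins[1:]))
    return out


def align(p1: List[Instruction], p2: List[Instruction]):
    """Keep only the instructions on a longest common subsequence of mnemonics."""
    n, m = len(p1), len(p2)
    # dp[i][j] = LCS length of the mnemonic sequences p1[i:] and p2[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if p1[i][0] == p2[j][0]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    a1, a2 = [], []
    i = j = 0
    while i < n and j < m:
        if p1[i][0] == p2[j][0]:
            a1.append(p1[i])
            a2.append(p2[j])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return a1, a2
```

After `align`, both returned lists have the same length and the same mnemonic at each position, which is the property the aligned paths are required to have.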
  • Step S104 specifically includes the following steps:
  • Step S601: Input the thresholds ε_1 and ε_2; ε_1 is used to filter similar function pairs, and ε_2 is used to decide whether the loop can end. The threshold ε_1 takes a value from 0.5 to 1; ε_2 is greater than 1 and less than the score S_b of the first extracted optimal similar subgraph G_b;
  • Step S602: From the inter-function similarity matrix SIM_Fun, select the similar function pairs FF whose similarity is greater than the threshold ε_1:
  • FF = {(Fun_i, Fun_j) | Fun_i ∈ P && Fun_j ∈ Q && SIM(Fun_i, Fun_j) > ε_1};
  • Step S603: Based on the inter-function call graph, generate the similar subgraph set G of FF and compute the score S of each subgraph;
  • The subgraph score S is the sum of the similarities of all function pairs in the subgraph, where n is the number of function pairs in the subgraph;
  • Step S604: Extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
  • Step S605: Check whether the score of the current optimal similar subgraph satisfies S_b > ε_2; if so, go to step S606, otherwise go to step S607;
  • Step S607: Output the current optimal similar subgraph set.
  • A further improvement of the present invention is that generating the similar subgraph set G of FF in step S104 specifically includes the following steps:
  • FF = {ff_i | i = 0, 1, ..., n}, where n is the number of function pairs in FF;
  • Step S704: Check whether ff_i conflicts with G_j; if so, go to step S707, otherwise go to step S705;
  • Step S705: Based on the function call graph, check whether G_j contains a function pair whose call relationship matches that of ff_i; if so, go to step S706, otherwise go to step S707;
  • Step S708: Increment the counter j and return to step S704 for the next round of analysis;
  • Step S711: Increment the counter i and return to step S703 for the next round of analysis;
  • Step S712: Output the current similar subgraph set G.
  • The present invention has the following beneficial effects: 1) the method works directly on binary code and does not depend on source code or on a specific programming language or platform, so it is more widely applicable; 2) the detection method can cope with a variety of mature, powerful code obfuscation techniques and tools, improving the ability to detect deeply obfuscated code; 3) the method applies not only to whole-program plagiarism but also to partial plagiarism; 4) unlike existing plagiarism detection techniques, the method provides not only a plagiarism result but also specific, strong plagiarism evidence.
  • FIG. 1 is the overall flow chart of the method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks;
  • FIG. 2 is a flow chart of function birthmark extraction based on minimum branch paths;
  • FIG. 3 is a flow chart of minimum branch path extraction for a basic block;
  • FIG. 5 is a flow chart of the method for calculating the similarity between paths;
  • FIG. 6 is a flow chart of the method for finding the optimal similar subgraph set;
  • FIG. 8 is a schematic diagram of a function's control flow graph and its minimum branch paths, where FIG. 8(a) is the control flow graph of function F and FIG. 8(b) shows all minimum branch paths of function F;
  • FIG. 9 is a schematic diagram of the programs' function call graphs and the optimal similar subgraph, where FIG. 9(a) is the function call graph of program P, FIG. 9(b) is the function call graph of program Q, and FIG. 9(c) is the optimal similar subgraph of programs P and Q.
  • FIG. 1 shows the overall processing flow of the method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks.
  • The method comprises the following steps:
  • Step S101: Use a reverse-analysis tool such as IDA Pro or BinNavi to disassemble the executable binary code of the plaintiff program P and the defendant program Q, extract the static information it contains, preprocess it, and store it in data tables.
  • Table 1: Data table names and structures
  • The function birthmarks of all functions in the plaintiff program P and the defendant program Q are PB = {FB_i | i = 0, 1, ..., m_1} and QB = {FB_j' | j = 0, 1, ..., m_2}; n is the number of minimum branch paths in the function birthmark FB_id, and m_1 and m_2 are the numbers of function birthmarks in P and Q, respectively.
  • The instruction sequence contained in the basic blocks between the starting basic block of one branch and the starting basic block of the next branch is taken as one minimum branch path of the function; extracting the function birthmark based on minimum branch paths specifically includes the following steps:
  • Extracting the minimum branch paths of basic block b_id specifically includes the following steps: Step S301: input the basic block b_id and its m+1 successor basic blocks b_{id,0}, b_{id,1}, ...
  • From the control flow graph of function F, the minimum branch paths shown in Fig. 8(b) can be extracted by the above steps; together they form the function birthmark of the function.
  • Step S103: For every function birthmark in the plaintiff program P, compute its similarity to every function birthmark in the defendant program Q, SIM(FB_i, FB_j'), with FB_i ∈ PB and FB_j' ∈ QB.
  • Here a is the number of minimum branch paths in the function birthmark of Fun_1 and b is the number of minimum branch paths in the function birthmark of Fun_2. For each path path_{1,i} in PATH_1, compute its similarity to every path path_{2,j} in PATH_2, find the path path_{2,match} that best matches path_{1,i}, and record the similarity sim(path_{1,i}, path_{2,match}).
  • With the number of assembly instructions l_i contained in each path as its weight, a weighted calculation gives the similarity SIM(FB_i, FB_j') between the function birthmarks FB_1 and FB_2'.
  • Read PATH_2 = {path_{2,j} | j = 0, 1, ..., b} and compute the similarity of path_{1,i} to every path path_{2,j} in PATH_2.
  • the function Fun_1 includes the paths path1, path2, and path3, and the function Fun_2 includes the paths pathA, pathB, and pathC.
  • path1 = <(push, ebp), (mov, ebp, esp), (push, ebx), (sub, esp, 4h), (cmp, byte ds:[completed.6159], byte 0h), (jnz, loc_8049F6F), (mov, byte ds:[completed.6159], byte 1h)>
  • path2 = <(mov, eax, ds:[dtor_idx.6161]), (mov, ebx, __DTOR_END___), (sub, ebx, __DTOR_LIST__), (sar, ebx, byte 2h), (sub, ebx, 1h), (cmp, eax, ebx), (jnb, loc_8049F68), (lea, esi, ds:[esi+0h])>
  • Step S104: From the inter-function similarities and the inter-function call graph, find the similar subgraph set and construct the optimal similar subgraph set. First, similar function pairs are filtered using the given threshold and the inter-function similarities; the similar subgraph set of all similar function pairs is generated; then optimal similar subgraphs are extracted and the optimal similar subgraph set is built.
  • A similar subgraph G_1→G_1' is a subgraph in which functions are nodes and call relationships are edges, corresponding nodes have high similarity, and similar functions have the same call relationships.
  • The optimal similar subgraph is defined by giving each subgraph a score based on its number of nodes, the similarity values of corresponding nodes, and the node weights; the subgraph with the highest score is the optimal similar subgraph.
  • The optimal similar subgraph set is the set {G_1→G_1', G_2→G_2', ..., G_n→G_n'} obtained by adding one optimal similar subgraph at a time, where G_1, G_2, ..., G_n belong to the plaintiff program P, G_1', G_2', ..., G_n' belong to the defendant program Q, G_1, G_2, ..., G_n are pairwise disjoint, and G_1', G_2', ..., G_n' are pairwise disjoint.
  • Generating the similar subgraph set G of FF specifically includes the following steps:
  • The function call graphs of the plaintiff program P and the defendant program Q are shown in Fig. 9(a) and (b), where nodes represent functions and directed edges represent call relationships between functions; extracting the optimal similar subgraph gives the result shown in Fig. 9(c).
  • The functions on the left belong to the plaintiff program P, and those on the right belong to the defendant program Q.
  • Two functions connected by a dashed line form a similar function pair.
  • Step S105: Determine plagiarism from the optimal similar subgraph set and, if plagiarism exists, generate plagiarism evidence.
  • The generated optimal similar subgraph set can serve as evidence that the defendant program Q plagiarizes the plaintiff program P.
  • The judgment examines whether the modules contained in the optimal similar subgraph set are functional modules or general-purpose modules. If they are all general-purpose modules, it is judged that there is no plagiarism; if at least one functional module in the optimal similar subgraph set is identical, plagiarism may be determined; if plagiarism exists, the optimal similar subgraph set obtained in step S104 is output as the plagiarism evidence.
  • A functional module is an original module of the plaintiff program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

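A method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks. Static information is extracted from the programs by reverse analysis; function birthmarks based on minimum branch paths are extracted from each function's control flow graph, basic blocks, and instruction sequences, to characterize the function's behavior; the similarity between function birthmarks is computed, which gives the similarity between functions; and, from the inter-function similarity combined with the function call graph, the optimal similar subgraph set is constructed, serving as both the basis for judging partial plagiarism between programs and strong plagiarism evidence, and providing prima facie evidence for real software infringement cases.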

Description

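Method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks
Technical Field
The present invention relates to the field of program feature discovery and software plagiarism detection, and in particular to a method for generating evidence of partial software plagiarism.
Background
With the rapid development of the computer software industry, software security has drawn growing attention from researchers, educators, and software companies. The rise of open-source software has pushed the problem of software plagiarism further into the spotlight. In recent years software infringement cases have occurred from time to time, and companies such as Verizon, eBay, and Apple have been involved in such cases.
To combat software plagiarism and protect software intellectual property, researchers in China and abroad have proposed many software plagiarism detection techniques. Classified by application scenario and technical means, existing techniques fall into three categories: source-code plagiarism detection, detection based on software watermarks, and detection based on software birthmarks.
However, current software plagiarism detection techniques have a series of limitations:
1) Most authoritative software plagiarism detection methods target source code, whereas in practice software owners usually release software as binary files, and source code is hard to obtain before some evidence has been secured;
2) To evade detection, plagiarists typically use mature, powerful code obfuscation techniques and tools, so that the plagiarized program looks very different from the original on the surface, which defeats some detection methods;
3) Partial plagiarism is more common than whole-program plagiarism, partly because it is more flexible and better fits the plagiarist's needs, and partly because it lowers the overall similarity computed between the software and the original version, which defeats many whole-program detection methods;
4) Existing plagiarism detection only gives a simple result, with no specific, strong plagiarism evidence to support it.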
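Summary of the Invention
The object of the present invention is to provide a method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks, in order to meet current software plagiarism detection needs. The invention extracts the static information of a program by reverse analysis; extracts function birthmarks based on minimum branch paths from each function's control flow graph, basic blocks, and instruction sequences, to characterize the function's behavior; computes the similarity between function birthmarks, which gives the similarity between functions; and, from the inter-function similarity combined with the function call graph, constructs the optimal similar subgraph set, which serves as both the basis for judging partial plagiarism between programs and strong plagiarism evidence, providing prima facie evidence for real software infringement cases.
To achieve the above object, the present invention adopts the following technical solution:
A method for generating evidence of partial software plagiarism based on minimum-branch-path function birthmarks, comprising the following steps:
Step S101: Using disassembly, disassemble the executable binary files of the plaintiff program P and the defendant program Q, record and analyze the generated assembly code, preprocess the static information it contains, and store it in data tables;
Step S102: Based on the intra-function static control flow graph, take the instruction sequence contained in the basic blocks between the starting basic block of one branch and the starting basic block of the next branch as one minimum branch path of the function. The function birthmark FB_id of a function F_id is the set of all its minimum branch paths, PATH = {path_{id,i} | i = 0, 1, ..., n}. Extract the function birthmarks of all functions in P and Q, PB = {FB_i | i = 0, 1, ..., m_1} and QB = {FB_j' | j = 0, 1, ..., m_2}; n is the number of minimum branch paths in FB_id, and m_1 and m_2 are the numbers of function birthmarks in P and Q, respectively;
Step S103: For every function birthmark in the plaintiff program P, compute its similarity to every function birthmark in the defendant program Q, SIM(FB_i, FB_j'), with FB_i ∈ PB && FB_j' ∈ QB;
Step S104: From the inter-function similarities and the inter-function call graph, find the similar subgraph set and construct the optimal similar subgraph set;
Step S105: Determine plagiarism from the optimal similar subgraph set and, if plagiarism exists, generate plagiarism evidence. Whether plagiarism exists is judged by comparing the size of the optimal similar subgraph set with the size of the original program; the optimal similar subgraph set itself can serve as evidence that the defendant program Q plagiarizes the plaintiff program P. If plagiarism exists, the optimal similar subgraph set obtained in step S104 is output as the plagiarism evidence.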
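Further, step S101 uses a reverse-analysis tool to undo compilation and assembly: the input is machine code and the output is assembly language. The binary executables of the plaintiff program P and the defendant program Q are disassembled, the resulting assembly code is analyzed, the static information contained in the programs is preprocessed, library functions and too-small functions are removed, and the remaining valid function information is stored in data tables, as shown in the following table;
Table name    Table structure
Functions    address#name#type
BasicBlocks    id#parent_function#adress
BasicBlocks_Instructions    basicblock_id#instruction_address
Instructions    address#mnemonic
Operands    address#expression_tree_id
Expression_Tree_Nodes    expression_tree_id#expression_node_id
Expression_Nodes    id#type#symbol#immediate#parent_id
Control_Flow_Graphs    id#parent_function#source#destination
Callgraph    id#source#destination
The static information includes: basic blocks, functions, instructions, mnemonics, operands, intra-function static control flow graphs, and the inter-function call graph;
A too-small function is a function with fewer than 3 instructions.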
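Further, in step S102 the function birthmark FB_id based on minimum branch paths, that is, the set PATH of minimum branch paths, is extracted as follows: based on the function's static control flow graph, analyze every basic block b_id in the function. If the basic block has 2 or more branches, or is the starting basic block of its function, extract the set of all minimum branch paths starting at that block, PATH_id = {path_{id,i} | i = 0, 1, ..., m}, and add it to the function's birthmark set, PATH = PATH ∪ PATH_id; m is the number of minimum branch paths starting at basic block b_id.
Further, to extract the minimum branch paths of a basic block b_id in step S102, create for each of its branches a path path_{id,i} starting at b_id, and keep appending successor basic blocks to each path until the next branch is reached. The assembly instructions in the basic blocks that the path passes through constitute the minimum branch path, and the set PATH_id of these paths contains all minimum branch paths starting from that basic block.
Further, to extract an assembly instruction from a basic block in step S102: first read the mnemonic of the instruction, then read the expression tree id of each of its operands, read the node ids belonging to that expression tree id, read the symbol or immediate value for each node id, traverse the nodes of the expression tree to obtain the operand, and finally combine the mnemonic with the operands to obtain the textual form of the instruction.
Further, the similarity between function birthmarks in step S103 is calculated as follows. Let the birthmark FB_1 of function Fun_1 in the plaintiff program P and the birthmark FB_2' of function Fun_2 in the defendant program Q be PATH_1 = {path_{1,i} | i = 0, 1, ..., a} and PATH_2 = {path_{2,j} | j = 0, 1, ..., b}, where a is the number of minimum branch paths in the birthmark of Fun_1 and b is the number of minimum branch paths in the birthmark of Fun_2. For each path path_{1,i} in PATH_1, compute its similarity to every path path_{2,j} in PATH_2, find the path path_{2,match} that best matches path_{1,i}, and record the similarity sim(path_{1,i}, path_{2,match}). Using the static information of Fun_1, take the number of assembly instructions l_i in each path as its weight and perform a weighted calculation to obtain the similarity between FB_1 and FB_2'. The formula is:
Figure PCTCN2018120179-appb-000001
where:
l_i: the number of assembly instructions in the i-th minimum branch path of function Fun_1;
The similarity between the functions is then SIM(Fun_1, Fun_2) = SIM(FB_1, FB_2').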
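
The weighting formula itself is an image that is not reproduced in this text, so the following Python sketch is only one plausible reading of the prose: it assumes a normalized weighted mean, that is, the sum of l_i times the best-match similarity divided by the sum of l_i. The function name `birthmark_similarity` and the `path_sim` callback are hypothetical names introduced for the example.

```python
from typing import Callable, List, Sequence

Path = Sequence  # a minimum branch path, as a sequence of instructions


def birthmark_similarity(
    paths1: List[Path],
    paths2: List[Path],
    path_sim: Callable[[Path, Path], float],
) -> float:
    """Weighted best-match similarity between two function birthmarks.

    Assumption: the weights l_i (instruction counts of the paths of the
    first function) are normalized by their sum.
    """
    if not paths1 or not paths2:
        return 0.0
    weighted_sum = 0.0
    total_weight = 0
    for p1 in paths1:
        # similarity to the best-matching path of the other birthmark
        best = max(path_sim(p1, p2) for p2 in paths2)
        weight = len(p1)  # l_i: number of instructions in the path
        weighted_sum += weight * best
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Because the weights come only from the first function, the measure is asymmetric: `birthmark_similarity(A, B, f)` and `birthmark_similarity(B, A, f)` can differ, which matches the description of computing the similarity from the plaintiff's birthmarks toward the defendant's.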
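Further, the similarity between the paths path_{1,i} and path_{2,j} in step S103 is computed in four stages: preprocessing, path alignment, computation of instruction similarity values from mnemonics and operands, and path similarity computation, as follows:
a) Preprocessing: first delete the jump instructions contained in the path, then abstract the operands. Operand abstraction maps the concrete operands of the instructions in the path to three categories: registers, memory locations, and variable names, represented by REG, MEM, and VAL respectively;
b) Path alignment: use the LCS (longest common subsequence) algorithm, with identical mnemonics as the reference, to align the two paths path_{1,i} and path_{2,j} being compared. The aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
c) Instruction similarity from mnemonics and operands: let the aligned paths be path_{1,i}' = <ins_1, ins_2, ..., ins_n> and path_{2,j}' = <ins_1', ins_2', ..., ins_n'>, where n is the number of assembly instructions in each path. To compute the similarity value between path_{1,i}' and path_{2,j}', first compute the similarity value between the instructions ins_pos and ins_pos' at each corresponding position, defined as the number of identical operands at corresponding positions:
sim_ins(ins_pos, ins_pos') = |{i | args(ins_pos)[i] = args(ins_pos')[i]}|
where:
ins_pos, ins_pos': the two assembly instructions;
args(ins_pos)[i]: the i-th operand of the assembly instruction ins_pos;
d) Path similarity: sum the similarity values of the instructions in path_{1,i}' to obtain the similarity value score(path_{1,i}', path_{2,j}') between the aligned paths; in the same way, obtain the self-similarity values score(path_{1,i}, path_{1,i}) and score(path_{2,j}, path_{2,j}); finally, normalize to obtain the similarity between path_{1,i} and path_{2,j};
Figure PCTCN2018120179-appb-000002
Figure PCTCN2018120179-appb-000003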
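
Stages c) and d) can be sketched in Python as below. `sim_ins` and `score` follow the text directly. The normalization formula is an image not reproduced here, so the Dice-style form used in `path_similarity`, 2·score(p1', p2') / (score(p1, p1) + score(p2, p2)), is an assumption chosen for illustration; the original may normalize differently. Instructions are assumed to be tuples of the form (mnemonic, operand, ...).

```python
from typing import List, Tuple

Instruction = Tuple[str, ...]  # (mnemonic, operand, operand, ...)


def sim_ins(a: Instruction, b: Instruction) -> int:
    """Number of positions at which the two instructions have equal operands."""
    return sum(1 for x, y in zip(a[1:], b[1:]) if x == y)


def score(p1: List[Instruction], p2: List[Instruction]) -> int:
    """Sum of instruction similarity values over corresponding positions."""
    return sum(sim_ins(a, b) for a, b in zip(p1, p2))


def path_similarity(
    p1: List[Instruction],
    p2: List[Instruction],
    aligned1: List[Instruction],
    aligned2: List[Instruction],
) -> float:
    """Normalized similarity of two preprocessed paths.

    p1, p2 are the preprocessed paths; aligned1, aligned2 are their
    LCS-aligned versions. The normalization is an assumption (see lead-in).
    """
    denom = score(p1, p1) + score(p2, p2)
    return 2 * score(aligned1, aligned2) / denom if denom else 0.0
```

With this choice, two identical paths score 1.0 and paths that share no operands at aligned positions score 0.0.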
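Further, in step S104, a similar subgraph is a subgraph in which functions are nodes and call relationships are edges, corresponding nodes have high similarity, and similar functions have the same call relationships. The optimal similar subgraph is defined by giving each subgraph a score based on its number of nodes, the similarity values of corresponding nodes, and the node weights; the subgraph with the highest score is the optimal similar subgraph. The optimal similar subgraph set is the set {G_1→G_1', G_2→G_2', ..., G_n→G_n'} obtained by adding one optimal similar subgraph at a time, where G_1, G_2, ..., G_n belong to the plaintiff program P, G_1', G_2', ..., G_n' belong to the defendant program Q, G_1, G_2, ..., G_n are pairwise disjoint, and G_1', G_2', ..., G_n' are pairwise disjoint; each G_i→G_i' is a similar subgraph, where i = 1, 2, ..., n.
Further, the optimal similar subgraph set in step S104 is found as follows:
1) Select the similar function pairs whose similarity exceeds the threshold ε_1:
FF = {(Fun_i, Fun_j) | Fun_i ∈ P && Fun_j ∈ Q && SIM(Fun_i, Fun_j) > ε_1};
2) Based on the inter-function call graph, generate the similar subgraph set G of FF and compute the score S of each subgraph;
The subgraph score S is the sum of the similarities of all function pairs in the subgraph. The formula is:
Figure PCTCN2018120179-appb-000004
where n is the number of function pairs in the subgraph;
3) Extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
4) If the score of the optimal similar subgraph is greater than ε_2, update FF by removing the function pairs covered by the current optimal similar subgraph, FF = FF - {(Fun_i, Fun_j) | Fun_i ∈ G_b || Fun_j ∈ G_b}, and return to step 2); otherwise stop and output the current optimal similar subgraph set;
The threshold ε_1 takes a value from 0.5 to 1; ε_2 is greater than 1 and less than the score S_b of the first extracted optimal similar subgraph G_b.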
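Further, in step S104 the similar subgraph set G of FF is generated as follows:
2.1) Add the first function pair ff_0 of FF={ff_i|i=0,1,...,n} to the subgraph set G as a subgraph: G={{ff_0}}; n is the number of function pairs in FF;
2.2) Starting from ff_i with i=1, traverse FF; for each ff_i:
a) traverse the subgraph set G={G_j|j=0,1,...,m}; m is the number of subgraphs in G;
b) if ff_i does not conflict with G_j and, according to the function call graph, some function pair in G_j has a call relation consistent with ff_i, then G=G∪{G_j∪{ff_i}};
2.3) Add the function pair ff_i to the subgraph set G as a subgraph of its own, G=G∪{{ff_i}};
2.4) Output the similar subgraph set G;
In step S104, whether a function pair ff_i=(Fun_i,Fun_i') conflicts with a subgraph G_j is determined as follows: if there exists ff=(Fun,Fun')∈G_j such that Fun_i=Fun and Fun_i'≠Fun', or Fun_i'=Fun' and Fun_i≠Fun, then the function pair ff_i conflicts with the subgraph G_j.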
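In a further improvement of the invention, step S102 specifically comprises the following steps:
Step S201: initialize the set of minimum branch paths
$PATH = \varnothing$
and the counter id=0;
Step S202: read the contents of basic block b_id from the intra-function static control flow graph;
Step S203: check whether the counter id==0 or the number of branches of b_id is greater than or equal to 2; if so, go to step S204, otherwise go to step S206;
Step S204: extract the set PATH_id={path_{id,i}|i=0,1,...,m} of all minimum branch paths starting at basic block b_id and add it to the birthmark set of the function, PATH=PATH∪PATH_id; m is the number of minimum branch paths starting at basic block b_id;
Step S205: check whether the counter id==n; if so, go to step S207, otherwise go to step S206;
Step S206: increment the counter id and go to step S202 for the next round of analysis;
Step S207: output the set PATH of minimum branch paths as the function birthmark FB of function F.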
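In a further improvement of the invention, step S102 specifically comprises the following steps:
Step S301: input basic block b_id and its m+1 successor basic blocks b_{id,0},b_{id,1},...,b_{id,m};
Step S302: initialize the set of minimum branch paths of b_id
$PATH_{id} = \varnothing$
and the counter i=0;
Step S303: for successor basic block b_{id,i}, create a path path_{id,i} starting at b_id, path_{id,i}=b_id+b_{id,i};
Step S304: create a pointer pt pointing at the current successor basic block b_{id,i}, pt→b_{id,i};
Step S305: check whether the basic block pointed to by pt has exactly one successor basic block pt.b_s; if so, go to step S306, otherwise go to step S307;
Step S306: append the successor basic block pt.b_s to path_{id,i}, path_{id,i}=path_{id,i}+pt.b_s, move the pointer pt to that successor, pt→pt.b_s, and go to step S305 for the next round of analysis;
Step S307: add the current path path_{id,i} to the set PATH_id of minimum branch paths of b_id, PATH_id=PATH_id∪{path_{id,i}}, and increment the counter i;
Step S308: check whether the counter i>m; if so, go to step S309, otherwise go to step S303 for the next round of analysis;
Step S309: output the set PATH_id of all minimum branch paths of basic block b_id.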
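In a further improvement of the invention, step S103 specifically comprises the following steps:
Step S401: initialize the counter i=0;
Step S402: read the minimum branch path path_{1,i} from the function birthmark PATH_1={path_{1,i}|i=0,1,...,a} of function Fun_1; a is the number of minimum branch paths in the function birthmark of Fun_1;
Step S403: read the function birthmark PATH_2={path_{2,j}|j=0,1,...,b} of function Fun_2 and compute the similarity between path_{1,i} and every path path_{2,j} in PATH_2; b is the number of minimum branch paths in the function birthmark of Fun_2;
Step S404: find the path path_{2,match} that best matches path_{1,i}, record its similarity sim(path_{1,i},path_{2,match}) and store it in the inter-path similarity matrix SIM_Path, SIM_Path=[sim(path_{1,i},path_{2,match})], i=0,1,...,a;
Step S405: check whether the counter i==a; if so, go to step S407, otherwise go to step S406;
Step S406: increment the counter i and go to step S402 for the next round of analysis;
Step S407: based on the inter-path similarity matrix SIM_Path and the static information of Fun_1 read from the function birthmark PATH_1, perform a weighted calculation with the number of assembly instructions l_i in each path as the weight:
$$\mathrm{SIM}(FB_1, FB_2') = \frac{\sum_{i=0}^{a} l_i \cdot \mathrm{sim}(path_{1,i},\, path_{2,match})}{\sum_{i=0}^{a} l_i}$$
where:
l_i is the number of assembly instructions contained in the i-th minimum branch path of function Fun_1;
Step S408: output the similarity between Fun_1 and Fun_2, SIM(Fun_1,Fun_2)=SIM(FB_1,FB_2'), and store it in the inter-function similarity matrix SIM_Fun, SIM_Fun=[SIM(Fun_i,Fun_j)], i=0,1,...,m_1, j=0,1,...,m_2.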
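In a further improvement of the invention, the similarity between paths path_{1,i} and path_{2,j} in step S103 is computed in four stages: preprocessing, path alignment, computation of instruction similarity values based on the association of mnemonic and operands, and computation of the path similarity. The steps are:
Step S501: input the minimum branch paths path_{1,i} and path_{2,j};
Step S502: preprocess path_{1,i} and path_{2,j}. First delete the jump instructions contained in the paths (including JE, JNE, JZ, JNZ, JS, JNS, JC, JNC, JO, JNO, JA, JNA, JAE, JNAE, JG, JNG, JGE, JNGE, JB, JNB, JBE, JNBE, JL, JNL, JLE, JNLE, JP, JNP, JPE, JPO and other jump instructions); then abstract the operands, mapping the concrete operands of the instructions in the path onto three classes: register, memory location and variable name, denoted REG, MEM and VAL respectively;
Step S503: align the paths with the LCS (longest common subsequence) algorithm, using identical mnemonics as the reference, for the two paths path_{1,i} and path_{2,j} whose similarity is to be computed. The aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
Step S504: compute instruction similarity values based on the association of mnemonic and operands. Let the aligned paths be written path_{1,i}'=<ins_1,ins_2,...,ins_n> and path_{2,j}'=<ins_1',ins_2',...,ins_n'>, where n is the number of assembly instructions in each path. To compute the similarity value between path_{1,i}' and path_{2,j}', first compute the similarity value between the instructions ins_pos and ins_pos' at corresponding positions, defined as the number of identical operands at corresponding positions. With ins_pos and ins_pos' denoting two assembly instructions and args(ins_pos)[i] the i-th operand of ins_pos, the formula is:
sim_ins(ins_pos,ins_pos')=|{i | args(ins_pos)[i]=args(ins_pos')[i]}|
where:
ins_pos, ins_pos' are two assembly instructions;
args(ins_pos)[i] is the i-th operand of assembly instruction ins_pos;
Step S505: compute the path similarity. Sum the similarity values of the assembly instructions in path_{1,i}' to obtain the similarity value score(path_{1,i}',path_{2,j}') between path_{1,i}' and path_{2,j}':
$$\mathrm{score}(path_{1,i}', path_{2,j}') = \sum_{pos=1}^{n} \mathrm{sim\_ins}(ins_{pos}, ins_{pos}')$$
In the same way, obtain the similarity values of path_{1,i} and path_{2,j} with themselves, score(path_{1,i},path_{1,i}) and score(path_{2,j},path_{2,j}). Finally, normalize to obtain the similarity between paths path_{1,i} and path_{2,j}: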
Figure PCTCN2018120179-appb-000009
Step S506: output the similarity sim(path_{1,i},path_{2,j}) between the minimum branch paths path_{1,i} and path_{2,j}.
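In a further improvement of the invention, the discovery of the optimal similar subgraph set in step S104 specifically comprises the following steps:
Step S601: input thresholds ε_1 and ε_2; ε_1 is used to select similar function pairs and ε_2 to decide whether the loop can end. Threshold ε_1 takes a value from 0.5 to 1; ε_2 is greater than 1 and less than the score S_b of the first optimal similar subgraph G_b extracted;
Step S602: based on the inter-function similarity matrix SIM_Fun, select the set FF of similar function pairs whose similarity exceeds threshold ε_1:
FF={(Fun_i,Fun_j) | Fun_i∈P && Fun_j∈Q && SIM(Fun_i,Fun_j)>ε_1};
Step S603: based on the inter-function call graph, generate the similar subgraph set G of FF and compute the subgraph scores S;
The subgraph score S is the sum of the similarities of all function pairs in the subgraph:
$$S = \sum_{k=1}^{n} \mathrm{SIM}(Fun_k, Fun_k')$$
where n is the number of function pairs in the subgraph and (Fun_k,Fun_k') is its k-th function pair;
Step S604: extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
Step S605: check whether the score S_b of the current optimal similar subgraph satisfies S_b>ε_2; if so, go to step S606, otherwise go to step S607;
Step S606: update FF by removing from it the function pairs covered by the current optimal similar subgraph set, FF=FF-{(Fun_i,Fun_j) | Fun_i∈G_b || Fun_j∈G_b}, and go to step S603 for the next round of analysis;
Step S607: output the current optimal similar subgraph set.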
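In a further improvement of the invention, the generation of the similar subgraph set G of FF in step S104 specifically comprises the following steps:
Step S701: input the set of similar function pairs FF={ff_i|i=0,1,...,n}; n is the number of function pairs in FF;
Step S702: add the first function pair ff_0 of FF={ff_i|i=0,1,...,n} to the subgraph set as the first subgraph, initializing the similar subgraph set G={{ff_0}} and the counter i=1;
Step S703: traverse the subgraph set G={G_j|j=0,1,...,m}, initializing the counter j=1; m is the number of subgraphs in G;
Step S704: check whether ff_i conflicts with G_j; if so, go to step S707, otherwise go to step S705;
Step S705: based on the function call graph, check whether some function pair in G_j has a call relation consistent with ff_i; if so, go to step S706, otherwise go to step S707;
Step S706: add the subgraph formed by adding ff_i to G_j to the subgraph set G, G=G∪{G_j∪{ff_i}};
Step S707: check whether the counter j==m; if so, go to step S709, otherwise go to step S708;
Step S708: increment the counter j and go to step S704 for the next round of analysis;
Step S709: add the function pair ff_i to the subgraph set G as a subgraph of its own, G=G∪{{ff_i}};
Step S710: check whether the counter i==n; if so, go to step S712, otherwise go to step S711;
Step S711: increment the counter i and go to step S703 for the next round of analysis;
Step S712: output the current similar subgraph set G.
In a further improvement of the invention, whether a function pair ff_i=(Fun_i,Fun_i') conflicts with a subgraph G_j in step S104 is determined as follows: if there exists ff=(Fun,Fun')∈G_j such that Fun_i=Fun and Fun_i'≠Fun', or Fun_i'=Fun' and Fun_i≠Fun, then the function pair ff_i conflicts with the subgraph G_j.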
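Compared with the prior art, the invention has the following beneficial effects: 1) the method operates directly on binary code and depends neither on source code nor on a particular programming language or platform, giving it wider applicability; 2) the detection approach can cope with a wide variety of mature, powerful code obfuscation techniques and tools, improving the ability to detect deeply obfuscated code; 3) the method applies not only to whole-program plagiarism but also to local plagiarism; 4) unlike existing plagiarism detection techniques, the method not only reports whether plagiarism exists but, where it does, also provides specific and strong evidence of it.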
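Brief Description of the Drawings
Fig. 1 is the overall flowchart of the software local plagiarism evidence generation method based on minimum branch path function birthmarks according to the invention;
Fig. 2 is a flowchart of the extraction of function birthmarks based on minimum branch paths;
Fig. 3 is a flowchart of the extraction of the minimum branch paths of a basic block;
Fig. 4 is a flowchart of the inter-function similarity computation;
Fig. 5 is a flowchart of the inter-path similarity computation;
Fig. 6 is a flowchart of the discovery of the optimal similar subgraph set;
Fig. 7 is a flowchart of the generation of the similar subgraph set;
Fig. 8 is a schematic of a function's control flow graph and its minimum branch paths, where Fig. 8(a) is the control flow graph of function F and Fig. 8(b) shows all minimum branch paths of function F;
Fig. 9 is a schematic of the function call graphs of the programs and the optimal similar subgraph, where Fig. 9(a) is the function call graph of program P, Fig. 9(b) is the function call graph of program Q, and Fig. 9(c) is the optimal similar subgraph of programs P and Q.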
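Detailed Description
Specific embodiments of the software local plagiarism evidence generation method based on minimum branch path function birthmarks are described in detail below with reference to the drawings.
Fig. 1 shows the overall processing flow of the method.
The software local plagiarism evidence generation method based on minimum branch path function birthmarks comprises the following steps:
Step S101: use a reverse-analysis tool such as IDA Pro or BinNavi to disassemble the executable binary code of plaintiff program P and defendant program Q, extract the static information it contains, preprocess it and store it in data tables.
Specifically, extract and analyze the static information relating to basic blocks, functions, instructions, mnemonics, operands, intra-function static control flow graphs and the inter-function call graph; delete library functions and functions with fewer than 3 instructions to obtain the valid function information; organize and analyze it, and record all data in data tables as shown below.
Table 1: Data table names and structures
Table name                   Table structure
Functions                    address#name#type
BasicBlocks                  id#parent_function#address
BasicBlocks_Instructions     basicblock_id#instruction_address
Instructions                 address#mnemonic
Operands                     address#expression_tree_id
Expression_Tree_Nodes        expression_tree_id#expression_node_id
Expression_Nodes             id#type#symbol#immediate#parent_id
Control_Flow_Graphs          id#parent_function#source#destination
Callgraph                    id#source#destination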
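The way an instruction's textual form could be rebuilt from these tables can be sketched as follows. This is a minimal illustration only: the in-memory dictionaries stand in for the `Instructions`, `Operands`, `Expression_Tree_Nodes` and `Expression_Nodes` tables, and the prefix rendering of operator nodes (`+(ebx, 0x4)`) is an assumption, since the text does not specify how a tree is printed.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins for the data tables of Table 1.
Node = Tuple[str, Optional[str], Optional[int], Optional[int]]  # type, symbol, immediate, parent_id


def operand_text(tree_id: int, tree_nodes: Dict[int, List[int]], nodes: Dict[int, Node]) -> str:
    """Traverse one expression tree and render it as text (prefix form, an assumption)."""
    children = defaultdict(list)
    roots = []
    for node_id in tree_nodes[tree_id]:
        parent = nodes[node_id][3]
        if parent is None:
            roots.append(node_id)
        else:
            children[parent].append(node_id)

    def render(node_id: int) -> str:
        _type, symbol, immediate, _parent = nodes[node_id]
        token = symbol if symbol is not None else hex(immediate)
        kids = children.get(node_id, [])
        if not kids:
            return token
        return f"{token}({', '.join(render(k) for k in kids)})"

    return " ".join(render(r) for r in roots)


def instruction_text(address: int, instructions: Dict[int, str],
                     operands: List[Tuple[int, int]],
                     tree_nodes: Dict[int, List[int]], nodes: Dict[int, Node]) -> str:
    """Combine the mnemonic with its operands, in the order the operand rows appear."""
    ops = [operand_text(tree, tree_nodes, nodes) for addr, tree in operands if addr == address]
    return f"{instructions[address]} {', '.join(ops)}".strip()


# Toy data: "mov eax, [ebx + 4]"-like instruction with made-up ids.
instructions = {0x1000: "mov"}
operands = [(0x1000, 1), (0x1000, 2)]
tree_nodes = {1: [10], 2: [20, 21, 22]}
nodes = {
    10: ("register", "eax", None, None),
    20: ("operator", "+", None, None),
    21: ("register", "ebx", None, 20),
    22: ("immediate", None, 4, 20),
}
print(instruction_text(0x1000, instructions, operands, tree_nodes, nodes))
```

In a real pipeline the dictionaries would be filled from the stored tables; the node `type` values used here are placeholders.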
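Step S102: construct function birthmarks from the intra-function static control flow graphs of the program. The function birthmark FB_id of a function F_id is the set PATH={path_{id,i}|i=0,1,...,n} of all its minimum branch paths. Extract the function birthmarks of all functions in plaintiff program P and defendant program Q, PB={FB_i|i=0,1,...,m_1} and QB={FB_j'|j=0,1,...,m_2}; n is the number of minimum branch paths in FB_id, and m_1 and m_2 are the numbers of function birthmarks in P and Q respectively.
With reference to Fig. 2: the instruction sequence contained in the basic blocks from the starting basic block of one branch to the starting basic block of the next branch is taken as one minimum branch path of the function. Extraction of the function birthmark based on minimum branch paths comprises the following steps:
Step S201: initialize the set of minimum branch paths
$PATH = \varnothing$
and the counter id=0;
Step S202: read the contents of basic block b_id from the intra-function static control flow graph;
Step S203: check whether the counter id==0 or the number of branches of b_id is greater than or equal to 2; if so, go to step S204, otherwise go to step S206;
Step S204: extract the set PATH_id={path_{id,i}|i=0,1,...,m} of all minimum branch paths starting at basic block b_id and add it to the birthmark set of the function, PATH=PATH∪PATH_id; m is the number of minimum branch paths starting at basic block b_id;
Step S205: check whether the counter id==n; if so, go to step S207, otherwise go to step S206;
Step S206: increment the counter id and go to step S202 for the next round of analysis;
Step S207: output the set PATH of minimum branch paths as the function birthmark FB of function F.
The minimum branch paths of a basic block b_id are extracted by the following steps:
Step S301: input basic block b_id and its m+1 successor basic blocks b_{id,0},b_{id,1},...,b_{id,m};
Step S302: initialize the set of minimum branch paths of b_id
$PATH_{id} = \varnothing$
and the counter i=0;
Step S303: for successor basic block b_{id,i}, create a path path_{id,i} starting at b_id, path_{id,i}=b_id+b_{id,i};
Step S304: create a pointer pt pointing at the current successor basic block b_{id,i}, pt→b_{id,i};
Step S305: check whether the basic block pointed to by pt has exactly one successor basic block pt.b_s; if so, go to step S306, otherwise go to step S307;
Step S306: append the successor basic block pt.b_s to path_{id,i}, path_{id,i}=path_{id,i}+pt.b_s, move the pointer pt to that successor, pt→pt.b_s, and go to step S305 for the next round of analysis;
Step S307: add the current path path_{id,i} to the set PATH_id of minimum branch paths of b_id, PATH_id=PATH_id∪{path_{id,i}}, and increment the counter i;
Step S308: check whether the counter i>m; if so, go to step S309, otherwise go to step S303 for the next round of analysis;
Step S309: output the set PATH_id of all minimum branch paths of basic block b_id.
For example, if the control flow graph of function F is as shown in Fig. 8(a), the above steps extract the 4 minimum branch paths shown in Fig. 8(b), which together constitute the function birthmark of the function.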
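Steps S201 to S207 and S301 to S309 can be sketched roughly as below. The sketch works on block ids rather than instruction sequences (each block would still have to be expanded into its instructions), the dictionary encoding of the control flow graph is hypothetical, the toy graph is not the one in Fig. 8, and the cycle guard is an addition that the steps themselves do not mention.

```python
from typing import Dict, List

# Hypothetical CFG encoding: basic block id -> ordered list of successor block ids.
CFG = Dict[int, List[int]]


def paths_from_block(cfg: CFG, start: int) -> List[List[int]]:
    """S301-S309: one path per successor of `start`, extended until the next branch."""
    paths = []
    for successor in cfg.get(start, []):
        path = [start, successor]
        pt = successor
        # S305/S306: keep following while the current block has exactly one successor.
        # The `not in path` test is an added safety guard against cycles.
        while len(cfg.get(pt, [])) == 1 and cfg[pt][0] not in path:
            pt = cfg[pt][0]
            path.append(pt)
        paths.append(path)  # S307
    return paths


def function_birthmark(cfg: CFG, entry: int) -> List[List[int]]:
    """S201-S207: collect paths starting at the entry block and at every branching block."""
    birthmark: List[List[int]] = []
    for block in sorted(cfg):
        if block == entry or len(cfg[block]) >= 2:  # S203
            birthmark.extend(paths_from_block(cfg, block))  # S204
    return birthmark


# Toy diamond-shaped CFG: 0 -> 1, 1 -> {2, 3}, 2 -> 4, 3 -> 4.
toy_cfg = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: []}
print(function_birthmark(toy_cfg, entry=0))
```

Note that, following S305 literally, a path ends at the first block that does not have exactly one successor, so the branching block itself is the last element of the path that reaches it.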
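Step S103: for all function birthmarks in plaintiff program P, compute the function birthmark similarity SIM(FB_i,FB_j') to all functions in defendant program Q, with FB_i∈PB && FB_j'∈QB. The similarity between function birthmarks is computed as follows: let the birthmark FB_1 of function Fun_1 and the birthmark FB_2' of function Fun_2 be written as PATH_1={path_{1,i}|i=0,1,...,a} and PATH_2={path_{2,j}|j=0,1,...,b}, where a is the number of minimum branch paths in the function birthmark of Fun_1 and b is the number of minimum branch paths in the function birthmark of Fun_2. For each path path_{1,i} in PATH_1, compute its similarity to every path path_{2,j} in PATH_2; from these similarities, find the path path_{2,match} that best matches path_{1,i} and record the similarity sim(path_{1,i},path_{2,match}). Based on the static information of Fun_1, a weighted calculation is performed with the number of assembly instructions l_i in each path as the weight, giving the similarity SIM(FB_1,FB_2') between the two function birthmarks.
The steps are:
Step S401: initialize the counter i=0;
Step S402: read the minimum branch path path_{1,i} from the function birthmark PATH_1={path_{1,i}|i=0,1,...,a} of function Fun_1;
Step S403: read the function birthmark PATH_2={path_{2,j}|j=0,1,...,b} of function Fun_2 and compute the similarity between path_{1,i} and every path path_{2,j} in PATH_2;
Step S404: find the path path_{2,match} that best matches path_{1,i}, record its similarity sim(path_{1,i},path_{2,match}) and store it in the inter-path similarity matrix SIM_Path, SIM_Path=[sim(path_{1,i},path_{2,match})], i=0,1,...,a;
Step S405: check whether the counter i==a; if so, go to step S407, otherwise go to step S406;
Step S406: increment the counter i and go to step S402 for the next round of analysis;
Step S407: based on the inter-path similarity matrix SIM_Path and the static information of Fun_1 read from the function birthmark PATH_1, perform a weighted calculation with the number of assembly instructions l_i in each path as the weight:
$$\mathrm{SIM}(FB_1, FB_2') = \frac{\sum_{i=0}^{a} l_i \cdot \mathrm{sim}(path_{1,i},\, path_{2,match})}{\sum_{i=0}^{a} l_i}$$
where:
l_i is the number of assembly instructions contained in the i-th minimum branch path of function Fun_1;
Step S408: output the similarity between Fun_1 and Fun_2, SIM(Fun_1,Fun_2)=SIM(FB_1,FB_2'), and store it in the inter-function similarity matrix SIM_Fun, SIM_Fun=[SIM(Fun_i,Fun_j)], i=0,1,...,m_1, j=0,1,...,m_2.
For example, suppose function Fun_1 contains paths path1, path2 and path3 and function Fun_2 contains paths pathA, pathB and pathC, with the pairwise similarities shown in the table below. Taking the best match in each row, the inter-path similarity matrix is SIM_Path=[0.99 0.87 0.86].
Table 2: Example of path similarities
Similarity   pathA   pathB   pathC
path1        0.76    0.86    0.99
path2        0.54    0.87    0.18
path3        0.86    0.15    0.47
If paths path1, path2 and path3 contain 19, 25 and 8 assembly instructions respectively, the similarity between Fun_1 and Fun_2 is
$$\mathrm{SIM}(Fun\_1, Fun\_2) = \frac{19 \times 0.99 + 25 \times 0.87 + 8 \times 0.86}{19 + 25 + 8} \approx 0.912$$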
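The best-match selection and length-weighted average of steps S404 and S407 can be checked on the Table 2 numbers with a short sketch. It assumes the weighted calculation is a weighted mean (the sum of l_i times the best-match similarity, divided by the sum of l_i); the function name and argument layout are illustrative only.

```python
from typing import List, Sequence, Tuple


def function_similarity(sim_matrix: Sequence[Sequence[float]],
                        lengths: Sequence[int]) -> Tuple[List[float], float]:
    """Return (SIM_Path, SIM) for one function pair.

    sim_matrix[i][j] is sim(path_{1,i}, path_{2,j}); lengths[i] is l_i,
    the instruction count of path_{1,i}. Assumes a length-weighted mean.
    """
    if len(sim_matrix) != len(lengths):
        raise ValueError("one length is needed per row of the similarity matrix")
    best = [max(row) for row in sim_matrix]  # S404: best match per path of Fun_1
    total = sum(lengths)
    if total == 0:
        return best, 0.0
    weighted = sum(l * s for l, s in zip(lengths, best))  # S407
    return best, weighted / total


# Values from Table 2 and the instruction counts 19, 25, 8.
table2 = [
    [0.76, 0.86, 0.99],  # path1 vs pathA, pathB, pathC
    [0.54, 0.87, 0.18],  # path2
    [0.86, 0.15, 0.47],  # path3
]
sim_path, sim_fun = function_similarity(table2, [19, 25, 8])
print(sim_path, round(sim_fun, 4))
```

Because the weights come only from Fun_1, the measure is asymmetric: swapping the roles of the two functions can give a different value.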
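The similarity between paths path_{1,i} and path_{2,j} is computed in four stages: preprocessing, path alignment, computation of instruction similarity values based on the association of mnemonic and operands, and computation of the path similarity. The steps are:
Step S501: input the minimum branch paths path_{1,i} and path_{2,j};
Step S502: preprocess path_{1,i} and path_{2,j}. First delete the jump instructions contained in the paths (including JE, JNE, JZ, JNZ, JS, JNS, JC, JNC, JO, JNO, JA, JNA, JAE, JNAE, JG, JNG, JGE, JNGE, JB, JNB, JBE, JNBE, JL, JNL, JLE, JNLE, JP, JNP, JPE, JPO and other jump instructions); then abstract the operands, mapping the concrete operands of the instructions in the path onto three classes: register, memory location and variable name, denoted REG, MEM and VAL respectively;
Step S503: align the paths with the LCS (longest common subsequence) algorithm, using identical mnemonics as the reference, for the two paths path_{1,i} and path_{2,j} whose similarity is to be computed. The aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
Step S504: compute instruction similarity values based on the association of mnemonic and operands. Let the aligned paths be written path_{1,i}'=<ins_1,ins_2,...,ins_n> and path_{2,j}'=<ins_1',ins_2',...,ins_n'>, where n is the number of assembly instructions in each path. To compute the similarity value between path_{1,i}' and path_{2,j}', first compute the similarity value between the instructions ins_pos and ins_pos' at corresponding positions, defined as the number of identical operands at corresponding positions. With ins_pos and ins_pos' denoting two assembly instructions and args(ins_pos)[i] the i-th operand of ins_pos, the formula is:
sim_ins(ins_pos,ins_pos')=|{i | args(ins_pos)[i]=args(ins_pos')[i]}|
where:
ins_pos, ins_pos' are two assembly instructions; args(ins_pos)[i] is the i-th operand of assembly instruction ins_pos;
Step S505: compute the path similarity. Sum the similarity values of the assembly instructions in path_{1,i}' to obtain the similarity value score(path_{1,i}',path_{2,j}') between path_{1,i}' and path_{2,j}':
$$\mathrm{score}(path_{1,i}', path_{2,j}') = \sum_{pos=1}^{n} \mathrm{sim\_ins}(ins_{pos}, ins_{pos}')$$
In the same way, obtain the similarity values of path_{1,i} and path_{2,j} with themselves, score(path_{1,i},path_{1,i}) and score(path_{2,j},path_{2,j}). Finally, normalize to obtain the similarity between paths path_{1,i} and path_{2,j}: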
Figure PCTCN2018120179-appb-000016
Step S506: output the similarity sim(path_{1,i},path_{2,j}) between the minimum branch paths path_{1,i} and path_{2,j}.
For example, let path1=<(push,ebp),(mov,ebp,esp),(push,ebx),(sub,esp,4h),(cmp,byte ds:[completed.6159],byte 0h),(jnz,loc_8049F6F),(mov,byte ds:[completed.6159],byte 1h)> and path2=<(mov,eax,ds:[dtor_idx.6161]),(mov,ebx,__DTOR_END__),(sub,ebx,__DTOR_LIST__),(sar,ebx,byte 2h),(sub,ebx,1h),(cmp,eax,ebx),(jnb,loc_8049F68),(lea,esi,ds:[esi+0h])>. Preprocessing abstracts them to path1=<(push,REG),(mov,REG,REG),(push,REG),(sub,REG,VAL),(cmp,MEM,VAL),(mov,MEM,VAL)> and path2=<(mov,REG,MEM),(mov,REG,VAL),(sub,REG,VAL),(sar,REG,VAL),(sub,REG,VAL),(cmp,REG,REG),(lea,REG,MEM)>. Applying the LCS algorithm with identical mnemonics as the reference, the aligned paths are path1'=<(mov,REG,REG),(sub,REG,VAL),(cmp,MEM,VAL)> and path2'=<(mov,REG,MEM),(sub,REG,VAL),(cmp,REG,REG)>. The similarity values of the corresponding instructions are 1, 2 and 0 in turn, so the similarity value of the aligned paths is score(path1',path2')=3. Finally, normalization gives the similarity between the paths as
Figure PCTCN2018120179-appb-000017
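The alignment and scoring in this example can be reproduced with a short sketch. Two caveats apply: the paths are supplied with operands already abstracted to REG/MEM/VAL (no operand classifier is implemented), and the normalization used in `path_similarity`, 2·score(p1',p2') / (score(p1,p1) + score(p2,p2)), is an assumption chosen only to make the sketch complete; it should not be read as the normalization defined by the method.

```python
from typing import List, Tuple

Instruction = Tuple[str, ...]  # (mnemonic, operand, operand, ...)

JUMPS = {"je", "jne", "jz", "jnz", "js", "jns", "jc", "jnc", "jo", "jno",
         "ja", "jna", "jae", "jnae", "jg", "jng", "jge", "jnge", "jb", "jnb",
         "jbe", "jnbe", "jl", "jnl", "jle", "jnle", "jp", "jnp", "jpe", "jpo"}


def drop_jumps(path: List[Instruction]) -> List[Instruction]:
    """S502 (first half): remove jump instructions."""
    return [ins for ins in path if ins[0].lower() not in JUMPS]


def align(p1: List[Instruction], p2: List[Instruction]) -> List[Tuple[Instruction, Instruction]]:
    """S503: LCS over mnemonics, returned as pairs of aligned instructions."""
    n, m = len(p1), len(p2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # dp[i][j] = LCS length of p1[i:], p2[j:]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if p1[i][0] == p2[j][0]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if p1[i][0] == p2[j][0]:
            pairs.append((p1[i], p2[j]))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def sim_ins(a: Instruction, b: Instruction) -> int:
    """S504: number of positions at which the operands are identical."""
    return sum(1 for x, y in zip(a[1:], b[1:]) if x == y)


def score(p1: List[Instruction], p2: List[Instruction]) -> int:
    """S505: sum of instruction similarity values over the aligned paths."""
    return sum(sim_ins(a, b) for a, b in align(p1, p2))


def path_similarity(p1: List[Instruction], p2: List[Instruction]) -> float:
    """Normalized similarity; the 2*s/(s1+s2) form is an ASSUMPTION of this sketch."""
    p1, p2 = drop_jumps(p1), drop_jumps(p2)
    denom = score(p1, p1) + score(p2, p2)
    return 2 * score(p1, p2) / denom if denom else 0.0


PATH1 = [("push", "REG"), ("mov", "REG", "REG"), ("push", "REG"), ("sub", "REG", "VAL"),
         ("cmp", "MEM", "VAL"), ("jnz", "loc_8049F6F"), ("mov", "MEM", "VAL")]
PATH2 = [("mov", "REG", "MEM"), ("mov", "REG", "VAL"), ("sub", "REG", "VAL"),
         ("sar", "REG", "VAL"), ("sub", "REG", "VAL"), ("cmp", "REG", "REG"),
         ("jnb", "loc_8049F68"), ("lea", "REG", "MEM")]

aligned = align(drop_jumps(PATH1), drop_jumps(PATH2))
print([sim_ins(a, b) for a, b in aligned], score(drop_jumps(PATH1), drop_jumps(PATH2)))
```

The per-instruction values 1, 2, 0 and the total of 3 match the example; the self-similarity scores are simply the operand counts of each path (10 and 14 here).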
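Step S104: based on the inter-function similarities and the inter-function call graph, discover the similar subgraph set and construct the optimal similar subgraph set. First select similar function pairs using the given threshold and the inter-function similarities; generate the similar subgraph set of all similar function pairs; then extract the optimal similar subgraphs from it and construct the optimal similar subgraph set.
In detail: a similar subgraph G_1→G_1' is a subgraph whose nodes are functions and whose edges are call relations between functions, in which corresponding nodes have high similarity and similar functions share the same call relations. The optimal similar subgraph is defined by assigning each subgraph a score based on the number of nodes, the similarity values of corresponding nodes and the node weights; the subgraph with the highest score is the optimal similar subgraph. The optimal similar subgraph set is the set {G_1→G_1',G_2→G_2',...,G_n→G_n'} obtained by adding one optimal similar subgraph at a time, where G_1,G_2,...,G_n belong to plaintiff program P, G_1',G_2',...,G_n' belong to defendant program Q, G_1,G_2,...,G_n are disjoint and G_1',G_2',...,G_n' are disjoint.
The optimal similar subgraph set is discovered by the following steps:
Step S601: input thresholds ε_1 and ε_2; ε_1 is used to select similar function pairs and ε_2 to decide whether the loop can end. Threshold ε_1 takes a value from 0.5 to 1; ε_2 is greater than 1 and less than the score S_b of the first optimal similar subgraph G_b extracted;
Step S602: based on the inter-function similarity matrix SIM_Fun, select the set FF of similar function pairs whose similarity exceeds threshold ε_1:
FF={(Fun_i,Fun_j) | Fun_i∈P && Fun_j∈Q && SIM(Fun_i,Fun_j)>ε_1};
Step S603: based on the inter-function call graph, generate the similar subgraph set G of FF and compute the subgraph scores S;
The subgraph score S is the sum of the similarities of all function pairs in the subgraph:
$$S = \sum_{k=1}^{n} \mathrm{SIM}(Fun_k, Fun_k')$$
where n is the number of function pairs in the subgraph and (Fun_k,Fun_k') is its k-th function pair;
Step S604: extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
Step S605: check whether the score S_b of the current optimal similar subgraph satisfies S_b>ε_2; if so, go to step S606, otherwise go to step S607;
Step S606: update FF by removing from it the function pairs covered by the current optimal similar subgraph set, FF=FF-{(Fun_i,Fun_j) | Fun_i∈G_b || Fun_j∈G_b}, and go to step S603 for the next round of analysis;
Step S607: output the current optimal similar subgraph set.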
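The loop of steps S601 to S607 can be sketched as a greedy procedure. In this sketch the subgraph generator of step S603 is passed in as a callable (`generate`, a hypothetical parameter), a subgraph is represented simply as a frozen set of function pairs, and stopping when FF becomes empty is an added termination guard that the steps do not state explicitly.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = Tuple[str, str]       # (function in P, function in Q)
Graph = FrozenSet[Pair]      # a similar subgraph, as a set of function pairs


def optimal_subgraph_set(sim: Dict[Pair, float], eps1: float, eps2: float,
                         generate: Callable[[Set[Pair]], Iterable[Graph]]) -> List[Graph]:
    """Greedy discovery of the optimal similar subgraph set (S601-S607)."""

    def score(graph: Graph) -> float:
        # Subgraph score S: sum of the similarities of its function pairs.
        return sum(sim[pair] for pair in graph)

    ff = {pair for pair, value in sim.items() if value > eps1}  # S602
    result: List[Graph] = []
    while ff:                                   # added guard: stop when FF is empty
        candidates = list(generate(ff))         # S603
        if not candidates:
            break
        best = max(candidates, key=score)       # S604
        result.append(best)
        if score(best) <= eps2:                 # S605: below threshold, finish
            break
        used_p = {p for p, _ in best}
        used_q = {q for _, q in best}
        # S606: drop every pair that touches a function already covered by G_b.
        ff = {(p, q) for p, q in ff if p not in used_p and q not in used_q}
    return result
```

As in the steps, a subgraph is added to the result before its score is compared with ε_2, so the last subgraph in the output may score at or below the threshold.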
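The similar subgraph set G of FF is generated by the following steps:
Step S701: input the set of similar function pairs FF={ff_i|i=0,1,...,n}; n is the number of function pairs in FF;
Step S702: add the first function pair ff_0 of FF={ff_i|i=0,1,...,n} to the subgraph set as the first subgraph, initializing the similar subgraph set G={{ff_0}} and the counter i=1;
Step S703: traverse the subgraph set G={G_j|j=0,1,...,m}, initializing the counter j=1; m is the number of subgraphs in G;
Step S704: check whether ff_i conflicts with G_j (the test is: if there exists ff=(Fun,Fun')∈G_j such that Fun_i=Fun and Fun_i'≠Fun', or Fun_i'=Fun' and Fun_i≠Fun, then the function pair ff_i conflicts with the subgraph G_j); if so, go to step S707, otherwise go to step S705;
Step S705: based on the function call graph, check whether some function pair in G_j has a call relation consistent with ff_i; if so, go to step S706, otherwise go to step S707;
Step S706: add the subgraph formed by adding ff_i to G_j to the subgraph set G, G=G∪{G_j∪{ff_i}};
Step S707: check whether the counter j==m; if so, go to step S709, otherwise go to step S708;
Step S708: increment the counter j and go to step S704 for the next round of analysis;
Step S709: add the function pair ff_i to the subgraph set G as a subgraph of its own, G=G∪{{ff_i}};
Step S710: check whether the counter i==n; if so, go to step S712, otherwise go to step S711;
Step S711: increment the counter i and go to step S703 for the next round of analysis;
Step S712: output the current similar subgraph set G.
For example, the function call graphs of plaintiff program P and defendant program Q are shown in Fig. 9(a) and (b), where nodes represent functions and directed edges represent call relations between functions. Extracting the optimal similar subgraph yields the graph shown in Fig. 9(c): the functions on the left all belong to plaintiff program P and those on the right to defendant program Q, and two functions joined by a dashed line form a similar function pair.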
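Steps S701 to S712, including the conflict test, can be sketched as follows. The call graphs are encoded as sets of (caller, callee) edges, which is a hypothetical representation, and "consistent call relation" is interpreted here as: some pair already in the subgraph calls, or is called by, the new pair in both programs. That interpretation is an assumption of the sketch; the toy call graphs are not those of Fig. 9.

```python
from typing import FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]            # (function in P, function in Q)
Edge = Tuple[str, str]            # (caller, callee)
Graph = FrozenSet[Pair]


def conflicts(pair: Pair, graph: Graph) -> bool:
    """True if exactly one side of `pair` is already matched to a different partner."""
    p, q = pair
    return any((p == gp) != (q == gq) for gp, gq in graph)


def call_consistent(pair: Pair, graph: Graph, calls_p: Set[Edge], calls_q: Set[Edge]) -> bool:
    """Assumed reading of S705: a pair in the graph is linked to `pair` the same way in P and Q."""
    p, q = pair
    for gp, gq in graph:
        if (gp, p) in calls_p and (gq, q) in calls_q:
            return True
        if (p, gp) in calls_p and (q, gq) in calls_q:
            return True
    return False


def similar_subgraphs(ff: List[Pair], calls_p: Set[Edge], calls_q: Set[Edge]) -> List[Graph]:
    """S701-S712: grow the subgraph set one function pair at a time."""
    if not ff:
        return []
    graphs: List[Graph] = [frozenset([ff[0]])]               # S702
    for pair in ff[1:]:
        extended = [g | {pair} for g in graphs               # S704-S706
                    if not conflicts(pair, g) and call_consistent(pair, g, calls_p, calls_q)]
        graphs.extend(extended)
        graphs.append(frozenset([pair]))                      # S709
    return graphs


# Toy call graphs: main -> f -> g in P, main2 -> f2 -> g2 in Q.
calls_p = {("main", "f"), ("f", "g")}
calls_q = {("main2", "f2"), ("f2", "g2")}
ff = [("main", "main2"), ("f", "f2"), ("g", "g2"), ("f", "g2")]
graphs = similar_subgraphs(ff, calls_p, calls_q)
print(max(graphs, key=len))
```

Since every extension is added alongside the subgraph it extends, the set can grow quickly with the number of pairs; the ε_1 filter applied beforehand keeps FF small.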
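Step S105: based on the optimal similar subgraph set, determine whether plagiarism exists and, if it does, generate the plagiarism evidence.
In detail:
Whether plagiarism exists is judged from the size of the optimal similar subgraph set relative to the size of the original program, and the generated optimal similar subgraph set serves as evidence that defendant program Q plagiarized plaintiff program P. In practice one must also consider whether the modules contained in the optimal similar subgraph set are functional modules or generic modules: if the set consists entirely of generic modules, it is judged that no plagiarism exists; if at least one functional module in the set is the same, plagiarism can be established; if plagiarism exists, the optimal similar subgraph set obtained in step S104 is output as the plagiarism evidence. Here, a functional module is a module original to the plaintiff program.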
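The decision rule above can be illustrated with a small sketch. Everything in it is hypothetical scaffolding: the text gives no numeric size threshold, so the sketch only reports the fraction of plaintiff functions covered and applies the functional-versus-generic rule, with the set of original (functional) functions supplied by the caller.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]  # (function in P, function in Q)


def plagiarism_report(optimal_set: List[FrozenSet[Pair]],
                      plaintiff_functions: Set[str],
                      original_functions: Set[str]) -> Dict[str, object]:
    """Summarize the optimal similar subgraph set for the S105 decision.

    `original_functions` marks the plaintiff's own functional modules;
    anything else matched is treated as a generic module.
    """
    matched = {p for graph in optimal_set for p, _ in graph}
    coverage = len(matched) / len(plaintiff_functions) if plaintiff_functions else 0.0
    copied_original = sorted(matched & original_functions)
    return {
        "coverage": coverage,                    # size relative to the original program
        "plagiarism": bool(copied_original),     # at least one functional module matched
        "evidence": copied_original,
    }


report = plagiarism_report(
    [frozenset({("f", "f2"), ("g", "g2")})],
    plaintiff_functions={"f", "g", "h", "util"},
    original_functions={"f", "h"},
)
print(report)
```

Deciding which functions count as original remains a manual, case-specific judgment; the sketch only records its outcome.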

Claims (10)

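  1. A software local plagiarism evidence generation method based on minimum branch path function birthmarks, characterized by comprising the following steps:
    Step S101: using disassembly, disassemble the executable binary files of plaintiff program P and defendant program Q, record and analyze the generated assembly code, and preprocess the static information it contains and store it in data tables;
    Step S102: based on the intra-function static control flow graphs of the program, take the instruction sequence contained in the basic blocks from the starting basic block of one branch to the starting basic block of the next branch as one minimum branch path of the function; the function birthmark FB_id of a function F_id is the set PATH={path_{id,i}|i=0,1,...,n} of all its minimum branch paths; extract the function birthmarks of all functions in plaintiff program P and defendant program Q, PB={FB_i|i=0,1,...,m_1} and QB={FB_j'|j=0,1,...,m_2}; n is the number of minimum branch paths in FB_id, and m_1 and m_2 are the numbers of function birthmarks in P and Q respectively;
    Step S103: for all function birthmarks in plaintiff program P, compute the function birthmark similarity SIM(FB_i,FB_j') to all functions in defendant program Q, with FB_i∈PB && FB_j'∈QB;
    Step S104: based on the inter-function similarities and the inter-function call graph, discover the similar subgraph set and construct the optimal similar subgraph set;
    Step S105: based on the optimal similar subgraph set, determine whether plagiarism exists and, if it does, generate the plagiarism evidence.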
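  2. The method according to claim 1, characterized in that step S101 specifically uses a reverse-analysis tool to undo the compilation and assembly process, taking machine language as input and producing assembly language as output; the binary executables of the plaintiff and defendant programs P and Q are disassembled, the assembly code output by disassembly is analyzed, the static information contained in the programs is preprocessed, and library functions and excessively small functions are deleted to obtain the valid function information, which is recorded in data tables;
    the static information specifically comprises: basic blocks, functions, instructions, mnemonics, operands, intra-function static control flow graphs and the inter-function call graph;
    an excessively small function is a function with fewer than 3 instructions.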
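  3. The method according to claim 1, characterized in that in step S102 the function birthmark FB_id based on minimum branch paths, i.e. the set PATH of minimum branch paths, is extracted by analyzing each basic block b_id of the function on the basis of the function's static control flow graph: if the basic block has two or more branches, or is the starting basic block of its function, the set PATH_id={path_{id,i}|i=0,1,...,m} of all minimum branch paths starting at that basic block is extracted and added to the birthmark set of the function, PATH=PATH∪PATH_id, where m is the number of minimum branch paths starting at basic block b_id.
  4. The method according to claim 3, characterized in that in step S102 the minimum branch paths of a basic block b_id are extracted by creating, for each of its branches, a path path_{id,i} starting at b_id and, for each path, repeatedly appending the successor basic block until the next branch is reached; the assembly instructions in the basic blocks traversed by the path constitute that minimum branch path, and the set PATH_id of these paths is the set of all minimum branch paths starting at that basic block.
  5. The method according to claim 4, characterized in that in step S102 the assembly instructions in a basic block are extracted as follows: first read the mnemonic of the assembly instruction; then read the expression tree id of each operand of that instruction; from the expression tree id read the corresponding node ids, and from each node id the corresponding symbol or immediate value; traverse the nodes of the expression tree to obtain the operand; finally combine the mnemonic with the operands to obtain the textual form of the assembly instruction.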
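  6. The method according to claim 1, characterized in that in step S103 the similarity between function birthmarks is computed as follows: let the birthmark FB_1 of function Fun_1 in plaintiff program P and the birthmark FB_2' of function Fun_2 in defendant program Q be written as PATH_1={path_{1,i}|i=0,1,...,a} and PATH_2={path_{2,j}|j=0,1,...,b}; for each path path_{1,i} in PATH_1, compute its similarity to every path path_{2,j} in PATH_2; from these similarities, find the path path_{2,match} that best matches path_{1,i} and record the similarity sim(path_{1,i},path_{2,match}); based on the static information of Fun_1, perform a weighted calculation with the number of assembly instructions l_i in each path as the weight, giving the similarity between function birthmarks FB_1 and FB_2':
    $$\mathrm{SIM}(FB_1, FB_2') = \frac{\sum_{i=0}^{a} l_i \cdot \mathrm{sim}(path_{1,i},\, path_{2,match})}{\sum_{i=0}^{a} l_i}$$
    where:
    l_i is the number of assembly instructions contained in the i-th minimum branch path of function Fun_1;
    a is the number of minimum branch paths in the function birthmark of Fun_1, and b is the number of minimum branch paths in the function birthmark of Fun_2;
    the similarity between the functions is then SIM(Fun_1,Fun_2)=SIM(FB_1,FB_2').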
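  7. The method according to claim 6, characterized in that in step S103 the similarity between paths path_{1,i} and path_{2,j} is computed in four stages: preprocessing, path alignment, computation of instruction similarity values based on the association of mnemonic and operands, and computation of the path similarity; in detail:
    a) Preprocessing: first delete the jump instructions contained in the path, then abstract the operands; operand abstraction maps the concrete operands of the instructions in the path onto three classes: register, memory location and variable name, denoted REG, MEM and VAL respectively;
    b) Path alignment: using the LCS (longest common subsequence) algorithm, with identical mnemonics as the reference, align the two paths path_{1,i} and path_{2,j} whose similarity is to be computed; the aligned paths are path_{1,i}' and path_{2,j}'; they contain the same number of assembly instructions, and the instructions at corresponding positions have the same mnemonic;
    c) Instruction similarity value based on the association of mnemonic and operands: let the aligned paths be written path_{1,i}'=<ins_1,ins_2,...,ins_n> and path_{2,j}'=<ins_1',ins_2',...,ins_n'>, where n is the number of assembly instructions in each path; to compute the similarity value between path_{1,i}' and path_{2,j}', first compute the similarity value between the instructions ins_pos and ins_pos' at corresponding positions, defined as the number of identical operands at corresponding positions:
    sim_ins(ins_pos,ins_pos')=|{i | args(ins_pos)[i]=args(ins_pos')[i]}|
    where: ins_pos, ins_pos' are two assembly instructions;
    args(ins_pos)[i] is the i-th operand of assembly instruction ins_pos;
    d) Path similarity: sum the similarity values of the assembly instructions in path_{1,i}' to obtain the similarity value score(path_{1,i}',path_{2,j}') between path_{1,i}' and path_{2,j}'; in the same way, obtain the similarity values of path_{1,i} and path_{2,j} with themselves, score(path_{1,i},path_{1,i}) and score(path_{2,j},path_{2,j}); finally, normalize to obtain the similarity between paths path_{1,i} and path_{2,j};
    $$\mathrm{score}(path_{1,i}', path_{2,j}') = \sum_{pos=1}^{n} \mathrm{sim\_ins}(ins_{pos}, ins_{pos}')$$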
    Figure PCTCN2018120179-appb-100003
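  8. The method according to claim 1, characterized in that in step S104 a similar subgraph is a subgraph whose nodes are functions and whose edges are call relations between functions, in which corresponding nodes have high similarity and similar functions share the same call relations; the optimal similar subgraph is defined by assigning each subgraph a score based on the number of nodes, the similarity values of corresponding nodes and the node weights, the subgraph with the highest score being the optimal similar subgraph; the optimal similar subgraph set is the set {G_1→G_1',G_2→G_2',...,G_n→G_n'} obtained by adding one optimal similar subgraph at a time, where G_1,G_2,...,G_n belong to plaintiff program P, G_1',G_2',...,G_n' belong to defendant program Q, G_1,G_2,...,G_n are disjoint and G_1',G_2',...,G_n' are disjoint; each G_i→G_i' is a similar subgraph, with i=1,2,...,n.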
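  9. The method according to claim 8, characterized in that in step S104 the optimal similar subgraph set is discovered as follows:
    1) Select the similar function pairs whose similarity exceeds threshold ε_1:
    FF={(Fun_i,Fun_j) | Fun_i∈P && Fun_j∈Q && SIM(Fun_i,Fun_j)>ε_1};
    2) Based on the inter-function call graph, generate the similar subgraph set G of FF and compute the score S of each subgraph;
    the subgraph score S is the sum of the similarities of all function pairs in the subgraph:
    $$S = \sum_{k=1}^{n} \mathrm{SIM}(Fun_k, Fun_k')$$
    where n is the number of function pairs in the subgraph and (Fun_k,Fun_k') is its k-th function pair;
    3) Extract the optimal similar subgraph G_b, record its score S_b, and add it to the optimal similar subgraph set;
    4) If the score of the optimal similar subgraph is greater than ε_2, update FF by removing from it the function pairs covered by the current optimal similar subgraph set, FF=FF-{(Fun_i,Fun_j) | Fun_i∈G_b || Fun_j∈G_b}, and return to step 2); otherwise stop and output the current optimal similar subgraph set;
    where threshold ε_1 takes a value from 0.5 to 1, and ε_2 is greater than 1 and less than the score S_b of the first optimal similar subgraph G_b extracted.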
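  10. The method according to claim 9, characterized in that in step S104 the similar subgraph set G of FF is generated as follows:
    2.1) Add the first function pair ff_0 of FF={ff_i|i=0,1,...,n} to the subgraph set G as a subgraph: G={{ff_0}}; n is the number of function pairs in FF;
    2.2) Starting from ff_i with i=1, traverse FF; for each ff_i:
    a) traverse the subgraph set G={G_j|j=0,1,...,m}; m is the number of subgraphs in G;
    b) if ff_i does not conflict with G_j and, according to the function call graph, some function pair in G_j has a call relation consistent with ff_i, then G=G∪{G_j∪{ff_i}};
    2.3) Add the function pair ff_i to the subgraph set G as a subgraph of its own, G=G∪{{ff_i}};
    2.4) Output the similar subgraph set G;
    in step S104, whether a function pair ff_i=(Fun_i,Fun_i') conflicts with a subgraph G_j is determined as follows: if there exists ff=(Fun,Fun')∈G_j such that Fun_i=Fun and Fun_i'≠Fun', or Fun_i'=Fun' and Fun_i≠Fun, then the function pair ff_i conflicts with the subgraph G_j.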
PCT/CN2018/120179 2017-12-12 2018-12-11 Software local plagiarism evidence generation method based on minimum branch path function birthmark WO2019114673A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711322531.2A CN107967152B (zh) 2017-12-12 2017-12-12 Software local plagiarism evidence generation method based on minimum branch path function birthmark
CN201711322531.2 2017-12-12

Publications (1)

Publication Number Publication Date
WO2019114673A1 true WO2019114673A1 (zh) 2019-06-20

Family

ID=61994982

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120179 WO2019114673A1 (zh) 2017-12-12 2018-12-11 Software local plagiarism evidence generation method based on minimum branch path function birthmark

Country Status (2)

Country Link
CN (1) CN107967152B (zh)
WO (1) WO2019114673A1 (zh)

Also Published As

Publication number Publication date
CN107967152A (zh) 2018-04-27
CN107967152B (zh) 2020-06-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18888473

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18888473

Country of ref document: EP

Kind code of ref document: A1
