CN103729580A

CN103729580A - Method and device for detecting software plagiarism

Info

Publication number: CN103729580A
Application number: CN201410039084.XA
Authority: CN
Inventors: 刘楠; 崔宝江; 夏坤峰; 韩丽芳
Original assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; China Electric Power Research Institute Co Ltd CEPRI; State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Beijing University of Posts and Telecommunications; China Electric Power Research Institute Co Ltd CEPRI; State Grid Jiangsu Electric Power Co Ltd
Priority date: 2014-01-27
Filing date: 2014-01-27
Publication date: 2014-04-16

Abstract

The invention provides a method and a device for detecting software plagiarism. The method is based on an abstract syntax tree pruning comparison algorithm and comprises the following steps: A, acquiring a source code file; B, generating an abstract syntax tree; C, traversing the abstract syntax tree and converting it into the required storage type; D, assigning hash values to the nodes of the abstract syntax tree and sorting them; E, comparing the sorted nodes; and F, outputting a result. The method and the device detect source code plagiarism accurately at the syntax level; in particular, they accurately and effectively detect plagiarism by changing variable types or adding interfering variables, which conventional abstract syntax tree detection algorithms cannot distinguish. The invention can be embodied as a software product; the computer software product can be stored in a storage medium, such as a ROM (read-only memory)/RAM (random access memory), a magnetic disk, or an optical disc, and contains a number of instructions that cause a piece of computer equipment to execute the method disclosed by the invention.

Description

Method and device for detecting software plagiarism

Technical field

The invention belongs to the technical field of software security within information security, and specifically relates to a method and a device for detecting software plagiarism.

Background art

With the development of science and technology, the environment of the software industry keeps improving. The rapid emergence of new technologies has driven the fast growth of the industry, and more and more new software appears on the market every day. Among all this new software, however, quality is inevitably uneven, and cloning and plagiarism of software source code are increasingly common. Software plagiarism mainly takes the form of copying code outright, or of copying it and then making modifications that do not affect what the code does. Common modifications include changing the access type of methods, changing the attributes of variables, and adding or deleting classes. Techniques for detecting source code plagiarism therefore play a very important role in plagiarism detection and software evaluation, and against this background many research results on software plagiarism detection have appeared. The source code plagiarism detection tools in common use today are mainly text-based, token-based, or based on the syntactic structure of the code.

Text-based plagiarism detection generally converts the source code into character strings and finds cloned code by string search and matching. The most commonly used string matching algorithm is the Longest Common Substring (LCS) algorithm, which detects text plagiarism by computing the length of the longest text shared by two texts. At present, however, source code is usually plagiarized by copying whole blocks, or by copying and then modifying them, for example by renaming variables, reordering statements without affecting program behavior, or changing function names or function positions. Text-based detection works only at the text level and completely ignores the syntactic meaning of the source code. It therefore has significant limitations, cannot correctly detect any of the plagiarism techniques listed above, and cannot meet the needs of software plagiarism detection.
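The limitation described above can be illustrated with a minimal Python sketch of a longest common substring computation. The function name and the two sample statements are hypothetical and are not taken from any tool cited here; the point is only that a simple rename shrinks the shared text sharply.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical example: the second statement is the first with names changed.
original = "int total = price * count;"
renamed = "int sum = cost * n;"
shared = longest_common_substring(original, renamed)
print(len(shared), "of", len(original), "characters survive the rename")
```

Because the comparison sees only characters, the two statements look almost unrelated even though their structure is identical.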

Token-based source code plagiarism detection converts each word of the source code into a token and judges plagiarism by comparing tokens. Its algorithms have been studied extensively, and many comparison tools, such as CP-Miner and CCFinder, use this technique. Building on token-based detection, Muddu et al. recently proposed a robust technique that is based on a language-aware token sequence and also incorporates some abstract syntax tree techniques. It first traverses the syntax tree obtained by parsing the code with JDT to obtain the token sequence of the code, then uses the K-gram technique to split that token sequence into a series of K-grams stored in an index, and finally locates copied source code by comparing K-grams. This technique makes up for the shortcomings of the Longest Common Substring (LCS) algorithm to some extent. In token-based research, some domestic scholars have proposed combining token stream analysis with sequence mining to detect cloned code and the defects it causes together, reducing the impact of false positives and missed clones on the detection of inconsistent identifier renaming defects. Other researchers have introduced clustering into code homology detection and proposed a detection algorithm based on sequence clustering. This algorithm first splits the source code into segments according to its own structure, then applies a partial code transformation to each segment, and then clusters the resulting symbol sequences using weighted edit distance as the similarity measure, obtaining similar code segments and thereby detecting functionally similar parts of the source program. Token-based software homology detection takes language features into account to some extent, but its principle is to find the longest similar substring in the source code, so it cannot cope with plagiarism in which the order of the source code is rearranged.
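As a rough illustration of the K-gram indexing idea described above, the following Python sketch splits a token sequence into K-grams, stores them in an index, and reports positions where two sequences share a K-gram. It is not the cited tool's implementation; the function names and the sample token names are assumptions made for the example.

```python
from collections import defaultdict


def kgrams(tokens, k):
    """All contiguous runs of k tokens, as tuples."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_index(tokens, k):
    """Map each K-gram to the positions where it starts."""
    index = defaultdict(list)
    for pos, gram in enumerate(kgrams(tokens, k)):
        index[gram].append(pos)
    return index


def shared_kgrams(tokens_a, tokens_b, k):
    """Pairs (position in a, position in b) whose K-grams are identical."""
    index = build_index(tokens_a, k)
    return [(index[g][0], pos)
            for pos, g in enumerate(kgrams(tokens_b, k)) if g in index]


# Hypothetical token streams for "x = y + 1;" and "p = q * 2;".
tokens_a = ["ID", "=", "ID", "+", "NUM", ";"]
tokens_b = ["ID", "=", "ID", "*", "NUM", ";"]
print(shared_kgrams(tokens_a, tokens_b, 3))
```

Because identifiers are reduced to a generic token, renaming no longer hides the copy, but a K-gram still depends on token order, which is the weakness the paragraph above points out.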

A step beyond token-based detection is detection based on the syntactic structure of the source code. Several research results already perform code copy detection and comparison at the level of syntactic structure, such as DECKARD, which records syntax tree information in a Euclidean vector space for comparison; the method described by Wahler et al., which records syntax tree information in XML for comparison; and online comparison systems such as Moss and JPlag, the latter developed by Prechelt et al. Baxter et al. also proposed an algorithm that uses the source code syntax tree for copy detection, implemented in clone detection software named CloneDR. However, that algorithm keeps the tree structure throughout the comparison and has to traverse the tree many times. Homology comparison based on the abstract syntax tree is an improvement of the syntax tree comparison method: it converts the storage form of the syntax tree twice, changing the tree structure into a linear linked list; groups the syntax tree subtrees by number of child nodes; and sorts the subtrees by hash value, which improves comparison efficiency. The algorithm also gives special treatment to operations whose semantics change when their operands are swapped (for example, subtraction and division), which reduces the false positive rate. Nevertheless, the abstract syntax tree comparison method still has limitations: for plagiarism that modifies the underlying data structures, it either fails to detect the plagiarism or detects it with a large error.

Summary of the invention

To overcome the above shortcomings of the prior art, the invention provides a software plagiarism detection method and device based on an abstract syntax tree pruning comparison algorithm. The method generates the abstract syntax tree corresponding to the software source code and converts the storage type of each tree node into a unified structure; computes the hash value of each node in the abstract syntax tree and sorts the nodes by hash value; deletes the leaf nodes of a subtree and checks whether the hash values of the subtree root nodes agree; adds the deleted leaf nodes back to the subtree one at a time to find the nodes that affect the comparison result; and computes the similarity of the software source code and outputs the detection result.

To achieve the above object, the invention adopts the following technical solution:

In one aspect, the invention provides a method for detecting software plagiarism, based on a pruning comparison algorithm for abstract syntax trees, characterized in that the method comprises the following steps:

A. obtaining a source code file;

B. generating an abstract syntax tree;

C. traversing the abstract syntax tree and converting it into the required storage type;

D. assigning hash values to the nodes of the abstract syntax tree and sorting them;

E. comparing the sorted nodes;

F. outputting the result.

Preferably, step A comprises: obtaining the text content of the source code file according to the format information of the source code file.

Preferably, step B comprises:

B-1. preprocessing the text content of the source code file;

B-2. performing lexical analysis on the text content to obtain the identification information of the text;

B-3. according to the identification information obtained by lexical analysis and the syntax rules of the source code, allocating memory, generating the abstract syntax tree node information corresponding to the source code text, and building the corresponding abstract syntax tree;

The abstract syntax tree node information comprises: the node type information of the abstract syntax tree, the position of the node in the source code file, the node ID label, and the token identifier corresponding to the node.

Preferably, step C comprises: calling a depth-first traversal method that starts from the root node of the generated abstract syntax tree and traverses the whole syntax tree, allocating memory according to the content of each node, converting each node to the required type, and adding it to a node list.
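A minimal Python sketch of step C follows, assuming a node record that carries the four pieces of information listed above. The class name, field names, and sample tree are hypothetical; the patent does not specify a concrete layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AstNode:
    type_id: int          # node ID label (assumed to index the node type)
    kind: str             # node type information
    line: int             # position of the node in the source file
    token: str = ""       # token identifier attached to the node
    children: List["AstNode"] = field(default_factory=list)


def flatten_depth_first(root: AstNode) -> List[AstNode]:
    """Visit the tree depth-first from the root and collect nodes in a list."""
    nodes = []
    stack = [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return nodes


# Hypothetical tree for "int a; return;".
tree = AstNode(0, "unit", 1, children=[
    AstNode(1, "decl", 1, children=[
        AstNode(2, "type", 1, "int"),
        AstNode(3, "name", 1, "a"),
    ]),
    AstNode(4, "return", 2),
])
print([n.kind for n in flatten_depth_first(tree)])
```

An explicit stack is used instead of recursion so that deeply nested source files do not hit the interpreter's recursion limit; a recursive traversal would produce the same order.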

Preferably, step D comprises:

D-1. from a hash array randomly generated by the system, selecting the element whose array index equals the node ID label and assigning it to the hash field of the node; the node ID label is assigned by the system while the abstract syntax tree is generated;

D-2. for each node, accumulating the hash values of all of its child nodes to obtain the final hash value of the node;

D-3. sorting the nodes by hash value and by number of child nodes.
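Steps D-1 and D-2 might be sketched in Python as follows. The text does not say exactly how a node's own value and its children's values are combined, so this sketch assumes the final hash is the node's own table entry plus the final hashes of all its children; that rule, the 32-bit table entries, and all names are assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class AstNode:
    type_id: int                                   # node ID label
    children: List["AstNode"] = field(default_factory=list)
    hash_value: int = 0                            # hash field of the node


def make_hash_table(size: int, seed: int = 0) -> List[int]:
    """Randomly generated hash array, one entry per node ID label."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(size)]


def assign_hashes(node: AstNode, table: List[int]) -> int:
    """D-1: look up the node's own value; D-2: add the children's hashes."""
    node.hash_value = table[node.type_id] + sum(
        assign_hashes(child, table) for child in node.children)
    return node.hash_value


table = make_hash_table(8, seed=1)
first = AstNode(1, [AstNode(2), AstNode(3)])
second = AstNode(1, [AstNode(3), AstNode(2)])      # same children, reordered
print(assign_hashes(first, table) == assign_hashes(second, table))
```

Under this additive assumption, reordering the children of a node leaves its hash unchanged, which is one plausible reason for accumulating rather than concatenating child hashes.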

Preferably, step E comprises:

E-1. taking the corresponding nodes, in order, from the node list and checking whether their hash values are equal; if they are equal, going to step E-9; if not, continuing to the next step;

E-2. deleting all leaf nodes of the node, adding these leaf nodes to a leaf node list, and recomputing the hash value of the node;

E-3. comparing the hash values again; if they are equal, going to step E-5; if not, continuing to the next step;

E-4. adding the node to the dissimilar node list and going to step E-9;

E-5. choosing a leaf node from the leaf node list, adding it back to the node, and computing the hash value of the node after the addition;

E-6. checking whether the hash values are now the same; if so, going to step E-8; if not, continuing to the next step;

E-7. deleting the leaf node and adding it to the dissimilar leaf node list;

E-8. checking whether the leaf node list is now empty; if it is not empty, going to step E-5; otherwise, continuing to the next step;

E-9. checking whether all nodes have been compared; if so, continuing to the next step; otherwise, going to step E-1;

E-10. ending the comparison and returning the dissimilar node list information.

Preferably, in step B-1, the preprocessing handles the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code. A file inclusion directive is handled by deleting it. A macro definition directive is handled by finding the character strings in the software source code that are replaced by the macro definition and substituting the original strings back. A conditional compilation directive is handled by determining whether the condition in the directive holds and then, according to the result, selecting the code segment to delete and the code segment to keep;

In step B-2, the lexical analysis follows the syntax rules of the programming language used by the source code: regular expressions match character strings against the matching rules of that language, and a token identifying each matched string is returned. The lexical analysis returns the token identifiers that match the lexical rules and deletes the comments, line breaks, and whitespace characters of the text. The identification information of the text comprises the token information of the text.
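The regular-expression-based lexical analysis of step B-2 can be sketched with Python's `re` module. The token classes and patterns below cover only a tiny C-like subset and are assumptions made for illustration, not the rule set used by the invention.

```python
import re

# Order matters: comments must be tried before the "/" operator.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=<>!]=?|[(){};,]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(text):
    """Return (token class, lexeme) pairs, dropping comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup not in ("COMMENT", "SKIP"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


sample = "int a = 3; // note\n/* block\n comment */ a = a + 1;"
print(tokenize(sample))
```

Comments, line breaks, and blanks are matched like any other token and then discarded, so they cannot influence the syntax tree built in step B-3.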

In another aspect, the invention provides a device for detecting software plagiarism, characterized in that the device comprises a source code file reading module, a syntax tree processing module, a hash value processing module, a pruning comparison module, and a result output module, connected in sequence;

The source code file reading module is used to obtain the source code file;

The syntax tree processing module is used to generate the abstract syntax tree, traverse it, and convert it into the required storage type;

The hash value processing module is used to assign hash values to the nodes of the abstract syntax tree and sort them;

The pruning comparison module is used to compare the sorted nodes;

The result output module is used to output the result.

Preferably, the syntax tree processing module comprises the following units, connected in sequence:

A preprocessing unit, used to preprocess the text content of the source code file;

A lexical analysis unit, used to read in, one after another, the character strings of the source code text preprocessed by the preprocessing unit and, following the syntax rules of the programming language used by the source code, to match each character string against the matching rules of that language with the corresponding regular expressions and return a token identifying the string;

A syntax analysis unit, used to allocate memory according to the identification information obtained by lexical analysis and the syntax rules of the source code, generate the abstract syntax tree node information corresponding to the source code text, and build the corresponding abstract syntax tree;

A storage conversion unit, used to convert the nodes of the syntax tree to the required type and add them to the node list.

Preferably, the hash value processing module comprises a hash value assignment unit, a hash value accumulation unit, and a node sorting unit, connected in sequence. The hash value assignment unit selects, from the hash array randomly generated by the system, the element whose array index equals the node ID label and assigns it to the hash field of the node. The hash value accumulation unit accumulates, for each node, the hash values of all of its child nodes to obtain the final hash value of the node. The node sorting unit sorts the nodes by hash value and by number of child nodes.

Preferably, the pruning comparison module comprises a node acquisition unit, a hash value judgment unit, a pruning processing unit, and a node storage unit. The hash value judgment unit receives the data submitted by the node acquisition unit, exchanges data with the pruning processing unit, and then forwards the result to the node storage unit;

The node acquisition unit is used to obtain the syntax tree nodes to be compared after they have been processed and sorted by the hash value processing module;

The hash value judgment unit is used to evaluate the various states of the comparison flow, specifically: whether the node hash values are equal, whether all nodes have been compared, and whether the leaf node list is empty;

The pruning processing unit comprises three subunits: a node deletion unit, a node addition unit, and a hash value computation unit. The node deletion unit deletes leaf nodes from a node, the node addition unit adds leaf nodes to a node, and the hash value computation unit computes the hash value of the node after a leaf node has been deleted or added, operating in step with the node deletion unit and the node addition unit;

The node storage unit is used to store dissimilar nodes.

Compared with the prior art, the beneficial effects of the invention are as follows:

The invention detects source code plagiarism accurately at the syntax level. In particular, it accurately and effectively detects plagiarism by changing variable types or adding interfering variables, which the abstract syntax tree detection algorithm cannot distinguish. The invention can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and comprises a number of instructions that cause a piece of computer equipment (such as a personal computer, a server, or a network device) to carry out the method of the invention.

Brief description of the drawings

Fig. 1 is a flowchart of the software plagiarism detection method of the invention;

Fig. 2 is a flowchart of the processing performed by the syntax tree processing module of the invention;

Fig. 3 is a flowchart of the processing performed by the hash processing module of the invention;

Fig. 4 is a flowchart of the pruning comparison algorithm of the invention;

Fig. 5 is a schematic diagram of the modules of the software plagiarism detection device of the invention;

Fig. 6 is a schematic diagram of the relationships between abstract syntax tree nodes in the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings.

The invention is a software plagiarism detection device and detection method based on an abstract syntax tree pruning comparison algorithm. It can effectively detect software plagiarism that modifies variable types or adds interfering variables, which the abstract syntax tree algorithm cannot detect.

Referring to Fig. 1, the main operating steps of the plagiarism detection method of the invention are as follows:

Step 1: the source code file reading module of the device obtains the source code files to be compared and passes them to the syntax tree processing module;

Step 2: the syntax tree processing module of the device generates the corresponding abstract syntax tree from the obtained source code file;

Step 3: once the abstract syntax tree corresponding to the source code has been obtained, the syntax tree processing module traverses the syntax tree and calls the corresponding method to convert the generated abstract syntax tree into the required storage type (the storage type here is the "object array" type of the Java language);

Step 4: hash values are computed for and assigned to the abstract syntax tree nodes in the converted storage type, and a node sorting method is called to sort the nodes after assignment;

Step 5: the sorted node list is passed to the pruning comparison module, which calls the pruning comparison algorithm to compare the sorted nodes;

Step 6: the result output module of the device outputs the detection result. This module is responsible for outputting the comparison result, including the similarity of the source code files and the positions of the similar code in the source code (that is, the line number ranges of the similar code in the source files).

Through the above operations, the pruning comparison algorithm detects source code plagiarism accurately at the syntax level; in particular, it accurately and effectively detects plagiarism by changing variable types or adding interfering variables, which the abstract syntax tree detection algorithm cannot distinguish.

Referring to Fig. 2, the main operating steps of the syntax tree processing module are described in detail:

Step 11: the source code file is obtained, and its text content is read according to the format information of the source code file (the format information is the file name suffix of the code file);

Step 12: a preprocessing method is called to preprocess the text content. It handles the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code. File inclusion directives are simply deleted. For a macro definition directive, the character strings in the software source code that are replaced by the macro definition are found and the original strings are substituted back. For a conditional compilation directive, the preprocessing determines whether the condition in the directive holds and then, according to the result, selects the code segment to delete and the code segment to keep. The condition is not restricted in scope and differs from one conditional statement to another. For example, in the statement if (a>0), a>0 is the condition: if a=3 in the code, a satisfies a>0 and the condition holds; if a=0, a does not satisfy a>0 and the condition does not hold. Likewise, in the statement if (a != b), != means "not equal", so the condition holds when a is not equal to b; if a=5 and b=5, then a=b and the condition does not hold. Consider, for example, the following code segment:

#if expression

statement block 1

#else

statement block 2

#endif

If the expression after #if holds, that is, the condition described above is true, statement block 1 is kept and statement block 2 is deleted; if the condition does not hold, statement block 1 is deleted and statement block 2 is kept;
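The three preprocessing rules of step 12 can be sketched in Python as below. This is a deliberately narrow illustration under stated assumptions: only object-like `#define` macros are expanded, only `#if` followed by an integer literal or a macro name is evaluated (no `#ifdef`, `#elif`, or compound expressions), and the function names are hypothetical.

```python
import re


def _condition_holds(expr, macros):
    """Evaluate '#if X' where X is an integer literal or a macro name."""
    value = macros.get(expr, expr).strip()
    try:
        return int(value, 0) != 0
    except ValueError:
        return False  # undefined or non-numeric names count as false


def preprocess(source: str) -> str:
    macros = {}
    kept_lines = []
    keep_stack = []  # one flag per open #if: is the current branch kept?
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#include"):
            continue  # file inclusion directives are deleted
        cond = re.match(r"#if\s+(.+)", stripped)
        if cond:
            keep_stack.append(_condition_holds(cond.group(1).strip(), macros))
            continue
        if stripped.startswith("#else"):
            keep_stack[-1] = not keep_stack[-1]
            continue
        if stripped.startswith("#endif"):
            keep_stack.pop()
            continue
        if not all(keep_stack):
            continue  # inside a branch whose condition does not hold
        if stripped.startswith("#define"):
            parts = stripped.split(None, 2)
            macros[parts[1]] = parts[2] if len(parts) > 2 else ""
            continue
        for name, body in macros.items():
            # Replace each macro name with the text it stands for.
            line = re.sub(rf"\b{re.escape(name)}\b", lambda _m, b=body: b, line)
        kept_lines.append(line)
    return "\n".join(kept_lines)


demo = ("#include <stdio.h>\n#define LIMIT 10\n#define FAST 1\n"
        "#if FAST\nint n = LIMIT;\n#else\nint n = 0;\n#endif\n")
print(preprocess(demo))
```

Resolving directives before lexical analysis means two files that differ only in their headers or macro spellings produce the same token stream.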

Step 13: the lexical analysis unit performs lexical analysis on the preprocessed text to obtain the token information of the text. Specifically, it returns the token identifiers that match the lexical rules and deletes the comments, line breaks, whitespace characters, and the like from the text;

Step 14: according to the token identifiers obtained by lexical analysis and the defined syntax rules of the source code, the syntax analysis unit allocates memory, generates the abstract syntax tree node information corresponding to the source code text, and builds the corresponding abstract syntax tree;

Step 15: the storage conversion unit calls a depth-first traversal method that starts from the root node of the generated abstract syntax tree and traverses the whole syntax tree, allocates memory according to the content of each node, converts each node to the required type, and adds it to the node list;

Through the processing of this module, the method provided by the embodiment of the invention turns plagiarism detection on the syntax tree into detection on a unified in-memory structure, which significantly improves the efficiency of software plagiarism detection.

Referring to Fig. 3, the main operating steps of the hash processing module are described in detail:

Step 21: during the syntax tree processing stage, while the system generates the syntax tree of the software source code, it assigns an ID label to each node according to the type of the node. During hash assignment, the element whose array index equals the node label is selected from the hash array randomly generated by the system and assigned to the hash field of the node;

Step 22: the hash value accumulation unit is responsible for the accumulation: following the node hash computation rule, it accumulates the hash values of all child nodes of a node to obtain the final hash value of that node;

Step 23: the node sorting unit is responsible for sorting the nodes in descending order by number of child nodes and by hash value. Nodes are first sorted by number of child nodes, with nodes that have more child nodes placed first; if two nodes have the same number of child nodes, they are then sorted by hash value, with the larger hash value placed first and the smaller one after it. The result is an ordered sequence suitable for processing by the pruning comparison module.
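The two-key ordering of step 23 maps directly onto a tuple sort key. In the Python sketch below, the `FlatNode` record and the sample values are hypothetical; only the ordering rule (more child nodes first, then larger hash value first) comes from the text.

```python
from typing import List, NamedTuple


class FlatNode(NamedTuple):
    label: str
    child_count: int
    hash_value: int


def sort_for_comparison(nodes: List[FlatNode]) -> List[FlatNode]:
    """More child nodes first; ties broken by larger hash value first."""
    return sorted(nodes,
                  key=lambda n: (n.child_count, n.hash_value),
                  reverse=True)


nodes = [
    FlatNode("leaf", 0, 90),
    FlatNode("if", 2, 40),
    FlatNode("call", 2, 75),
    FlatNode("block", 3, 10),
]
print([n.label for n in sort_for_comparison(nodes)])
```

Sorting both files' node lists with the same rule places structurally similar nodes at similar positions, which is what lets the comparison walk the two lists in order.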

Fig. 6 illustrates the relationship between nodes and child nodes. In the tree shown in the figure, each circle represents a node, and every node carries the same kinds of information. The child nodes of node 1 are node 2 and node 3; likewise, the child nodes of node 2 are node 4, node 5, and node 6;

Nodes 4, 6, 8, 9, and 10 have no nodes below them, so during accumulation the accumulated hash value of their child nodes is 0.

Referring to Fig. 4, the main operating steps of the pruning comparison algorithm are described in detail:

Step 31: the corresponding nodes are taken from the node list;

Step 32: the next operation is decided by checking whether the hash values of the nodes are equal. If the hash values are the same, the nodes are similar, and the flow goes to step 310; if the hash values differ, the flow continues to the next step;

Step 33: all leaf nodes of the node are deleted and added to the leaf node list, and the hash value of the node is recomputed to obtain its new hash value without the leaf nodes;

Step 34: the hash values of the nodes are checked again. If they are the same, the flow goes to step 36; if not, it continues to the next step;

Step 35: the node is added to the dissimilar node list, and the flow goes to step 310;

Step 36: a leaf node is chosen from the leaf node list and added back to the node, and the hash value of the node after the addition is computed;

Step 37: the hash values are checked again. If they are the same, the flow goes to step 39; if not, it continues to the next step;

Step 38: the leaf node is deleted and added to the dissimilar leaf node list;

Step 39: the leaf node list is checked. If it is not empty, the flow goes to step 36; otherwise, it continues to the next step;

Step 310: whether all nodes have been compared is checked. If so, the flow continues to the next step; otherwise, it goes to step 31;

Step 311: the comparison ends, and the dissimilar node list information is returned.
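One possible reading of steps 31 to 311 is sketched below in Python. The text leaves several details open, so the sketch makes explicit assumptions: nodes from the two files are paired by position, a node's hash is its own table entry plus its children's hashes, and a re-added leaf counts as matching when the other node has an unmatched leaf that yields the same hash. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AstNode:
    type_id: int
    children: List["AstNode"] = field(default_factory=list)


def node_hash(node, table):
    return table[node.type_id] + sum(node_hash(c, table) for c in node.children)


def split_leaves(node):
    leaves = [c for c in node.children if not c.children]
    inner = [c for c in node.children if c.children]
    return leaves, inner


def compare_pair(a, b, table):
    """Return (similar, differing_leaves) for one pair of nodes."""
    if node_hash(a, table) == node_hash(b, table):            # step 32
        return True, []
    leaves_a, inner_a = split_leaves(a)                       # step 33
    leaves_b, inner_b = split_leaves(b)
    pruned_a = table[a.type_id] + sum(node_hash(c, table) for c in inner_a)
    pruned_b = table[b.type_id] + sum(node_hash(c, table) for c in inner_b)
    if pruned_a != pruned_b:                                  # step 34
        return False, []                                      # step 35
    running, unmatched_b, differing = pruned_a, list(leaves_b), []
    for leaf in leaves_a:                                     # steps 36 to 39
        trial = running + node_hash(leaf, table)
        hit = next((i for i, other in enumerate(unmatched_b)
                    if running + node_hash(other, table) == trial), None)
        if hit is None:
            differing.append(leaf)                            # step 38
        else:
            unmatched_b.pop(hit)
            running = trial
    return True, differing + unmatched_b


def prune_compare(nodes_a, nodes_b, table):
    different_nodes, different_leaves = [], []
    for a, b in zip(nodes_a, nodes_b):                        # steps 31 and 310
        similar, leaves = compare_pair(a, b, table)
        if not similar:
            different_nodes.append((a, b))
        different_leaves.extend(leaves)
    return different_nodes, different_leaves                  # step 311


table = [11, 23, 37, 53, 71]
# Hypothetical declarations that differ only in a leaf (e.g. the variable type).
decl_a = AstNode(1, [AstNode(2, [AstNode(3)]), AstNode(3)])
decl_b = AstNode(1, [AstNode(2, [AstNode(3)]), AstNode(4)])
nodes, leaves = prune_compare([decl_a], [decl_b], table)
print(len(nodes), [leaf.type_id for leaf in leaves])
```

In this reading, a changed variable type shows up as a pair of differing leaves under an otherwise matching node, rather than making the whole declaration look different.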

During homology detection, the information of each abstract syntax tree has already been converted into its storage format, that is, a list. The detection can therefore traverse and compare, node by node, all the node information of the abstract syntax trees recorded in the two lists. It proceeds in this way until all nodes of the target software source code and the sample software source code have been checked, and thereby determines whether plagiarism exists. The criterion is whether there are nodes with the same hash value: if there are, plagiarism exists, and the similarity of the code files can be obtained from the number of identical nodes; if there are no identical nodes, there is no plagiarism in the source code.
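The text says the similarity is obtained from the number of identical nodes but gives no formula. The Python sketch below therefore assumes one common choice: twice the number of shared node hashes divided by the total number of nodes in both files. The formula and the function name are assumptions, not the invention's definition.

```python
from collections import Counter
from typing import List


def similarity(hashes_a: List[int], hashes_b: List[int]) -> float:
    """Share of nodes whose hash value also occurs in the other file."""
    total = len(hashes_a) + len(hashes_b)
    if total == 0:
        return 0.0
    # Multiset intersection: each node can be matched at most once.
    shared = sum((Counter(hashes_a) & Counter(hashes_b)).values())
    return 2 * shared / total


print(similarity([1, 2, 3, 4], [1, 2, 3, 5]))
```

With this definition, two identical files score 1.0 and two files with no node hash in common score 0.0, which matches the criterion stated above.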

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or replaced by equivalents, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the claims of the invention.

Claims

1. A method for detecting software plagiarism, based on a pruning comparison algorithm for abstract syntax trees, characterized in that the method comprises the following steps:

A. obtaining a source code file;

B. generating an abstract syntax tree;

D. assigning hash values to the nodes of the abstract syntax tree and sorting them;

E. comparing the sorted nodes;

F. outputting the result.

2. The method as claimed in claim 1, characterized in that step A comprises: obtaining the text content of the source code file according to the format information of the source code file.

3. The method as claimed in claim 1, characterized in that step B comprises:

B-1. preprocessing the text content of the source code file;

4. The method as claimed in claim 1, characterized in that step C comprises: calling a depth-first traversal method that starts from the root node of the generated abstract syntax tree and traverses the whole syntax tree, allocating memory according to the content of each node, converting each node to the required type, and adding it to a node list.

5. The method as claimed in claim 1, characterized in that step D comprises:

D-3. sorting the nodes by hash value and by number of child nodes.

6. The method as claimed in claim 1, characterized in that step E comprises:

E-4. adding the node to the dissimilar node list and going to step E-9;

E-7. deleting the leaf node and adding it to the dissimilar leaf node list;

E-10. ending the comparison and returning the dissimilar node list information.

7. The method as claimed in claim 3, characterized in that: in step B-1, the preprocessing handles the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code; a file inclusion directive is handled by deleting it; a macro definition directive is handled by finding the character strings in the software source code that are replaced by the macro definition and substituting the original strings back; a conditional compilation directive is handled by determining whether the condition in the directive holds and then, according to the result, selecting the code segment to delete and the code segment to keep;

8. A device for detecting software plagiarism, characterized in that the device comprises a source code file reading module, a syntax tree processing module, a hash value processing module, a pruning comparison module, and a result output module, connected in sequence;

The source code file reading module is used to obtain the source code file;

The result output module is used to output the result.

9. The device as claimed in claim 8, characterized in that the syntax tree processing module comprises the following units, connected in sequence:

A preprocessing unit, used to preprocess the text content of the source code file;

10. The device as claimed in claim 8, characterized in that the hash value processing module comprises a hash value assignment unit, a hash value accumulation unit, and a node sorting unit, connected in sequence; the hash value assignment unit selects, from the hash array randomly generated by the system, the element whose array index equals the node ID label and assigns it to the hash field of the node; the hash value accumulation unit accumulates, for each node, the hash values of all of its child nodes to obtain the final hash value of the node; the node sorting unit sorts the nodes by hash value and by number of child nodes.

11. The device as claimed in claim 8, characterized in that the pruning comparison module comprises a node acquisition unit, a hash value judgment unit, a pruning processing unit, and a node storage unit; the hash value judgment unit receives the data submitted by the node acquisition unit, exchanges data with the pruning processing unit, and then forwards the result to the node storage unit;

The node storage unit is used to store dissimilar nodes.