CN103729580A - Method and device for detecting software plagiarism - Google Patents

Info

Publication number
CN103729580A
Authority
CN
China
Prior art keywords
node
syntax tree
cryptographic hash
source code
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410039084.XA
Other languages
Chinese (zh)
Inventor
刘楠
崔宝江
夏坤峰
韩丽芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
China Electric Power Research Institute Co Ltd CEPRI
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
China Electric Power Research Institute Co Ltd CEPRI
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Beijing University of Posts and Telecommunications, China Electric Power Research Institute Co Ltd CEPRI, State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201410039084.XA priority Critical patent/CN103729580A/en
Publication of CN103729580A publication Critical patent/CN103729580A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/75: Structural analysis for program understanding
    • G06F8/751: Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides a method and a device for detecting software plagiarism. The method is based on a pruning comparison algorithm over abstract syntax trees and comprises the following steps: A, acquiring a source code file; B, generating an abstract syntax tree; C, traversing the abstract syntax tree and converting it into the required storage type; D, assigning hash values to the nodes of the abstract syntax tree and sorting them; E, comparing the sorted nodes; and F, outputting a result. The method and device detect source code plagiarism accurately at the syntax level; in particular, they accurately and effectively detect plagiarism that changes variable types or adds interfering variables, which conventional abstract syntax tree detection cannot distinguish. The invention can be embodied as a computer software product stored in a storage medium, such as a ROM (read-only memory)/RAM (random access memory), a magnetic disk, or a compact disc, containing a number of instructions that cause a piece of computer equipment to execute the disclosed method.

Description

A method and device for detecting software plagiarism
Technical field
The invention belongs to the field of software security within information security, and specifically relates to a method and device for detecting software plagiarism.
Background
With the development of science and technology, the environment of the software industry keeps improving, and the rapid emergence of new technologies has driven the industry's fast growth, so more and more new software reaches the market every day. The quality of this new software is inevitably uneven, and cloning and plagiarism of software source code are increasingly common. Software plagiarism mainly takes the form of copying code outright, or of copying it and then making modifications that do not affect its function. Common modifications include changing the access type of a method, changing the attributes of a variable, and adding or deleting classes. Detection of source code plagiarism therefore plays a very important role in plagiarism identification and evaluation work, and against this background many research results on software plagiarism detection have appeared. The source code plagiarism detection tools in common use today fall into three groups: text-based, token-based, and based on the syntactic structure of the code.
Text-based plagiarism detection generally converts the source code into character strings and detects cloned code by string search and matching. The most commonly used string matching algorithm is the Longest Common Substring (LCS) algorithm, which detects text plagiarism by computing the length of the longest identical text shared by two texts. Source code plagiarism today, however, generally consists of copying whole blocks, or of copying them and then modifying them without affecting program function, for example by renaming variables, reordering statements, or changing function names or function positions. A text-based scheme examines software only at the text level and completely ignores the syntactic meaning of the source code, so it has significant limitations and cannot correctly detect any of the plagiarism techniques listed above.
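The limitation described above can be illustrated with a minimal sketch of a longest common substring comparison. The function name and the sample statements are illustrative assumptions, not part of the patent; the sketch only shows how renaming variables shortens the longest shared text even though the statement is functionally unchanged.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that occurs contiguously in both strings."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical sample statements: a verbatim copy and a copy with renamed variables.
original = "int total = price * count;"
copied = "int total = price * count;"
renamed = "int sum = cost * qty;"

for suspect in (copied, renamed):
    shared = longest_common_substring(original, suspect)
    print(f"{len(shared) / len(original):.2f} of the original is shared with {suspect!r}")
```

The verbatim copy shares the whole statement, while the renamed copy shares only a short fragment, which is why a purely textual measure misses this kind of plagiarism.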
Token-based source code plagiarism detection converts each word of the source code into a token and judges plagiarism by comparing tokens. The approach has been studied extensively, and many comparison tools, such as CP-Miner and CCFinder, adopt it. Building on token-based detection, Muddu et al. recently proposed a robust technique that is based on a language-aware token sequence and also incorporates some abstract syntax tree techniques. It first obtains the syntax tree of the code by parsing with JDT and traverses it to obtain the token sequence of the code, then uses the K-gram technique to split the token sequence into a series of K-grams stored in an index, and finally locates source code copies by comparing K-grams. To some extent this technique makes up for the defects of the Longest Common Substring (LCS) algorithm. Among token-based work by domestic researchers, some have proposed combining token stream analysis with sequence mining to jointly detect cloned code and the defects it causes, reducing the impact of false positives and missed clones on the detection of inconsistent identifier renaming defects. Other researchers have introduced clustering into code homology detection and proposed a detection algorithm based on sequence clustering. This algorithm first segments the source code according to its own structure, then applies a partial code transformation to each segment, and then clusters the resulting symbol sequences using weighted edit distance as the similarity measure, yielding similar code segments and thereby detecting functionally similar parts of the source program. Token-based software homology detection takes language features into account to some extent, but its principle is to search for the longest similar substring in the source code, so it cannot cope with plagiarism that reorders the source code.
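The K-gram indexing idea mentioned above can be sketched as follows. This is a simplified illustration under assumptions, not the cited authors' implementation: the token names (ID, NUM) and the helper names are hypothetical, and the tokens are assumed to be already normalized so that every identifier maps to ID.

```python
from collections import defaultdict


def kgrams(tokens, k):
    """Split a token sequence into overlapping windows of k tokens."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def build_index(tokens, k):
    """Map each K-gram to the positions where it starts in the sequence."""
    index = defaultdict(list)
    for pos, gram in enumerate(kgrams(tokens, k)):
        index[gram].append(pos)
    return index


def shared_kgrams(tokens_a, tokens_b, k=3):
    """Return (position in b, positions in a) for every K-gram of b that also occurs in a."""
    index = build_index(tokens_a, k)
    return [(pos_b, index[gram])
            for pos_b, gram in enumerate(kgrams(tokens_b, k))
            if gram in index]


# Two statements in one order, and the same two statements swapped.
a = ["ID", "=", "NUM", ";", "ID", "=", "ID", "+", "ID", ";"]
b = ["ID", "=", "ID", "+", "ID", ";", "ID", "=", "NUM", ";"]
matches = shared_kgrams(a, b, k=3)
print(f"{len(matches)} of {len(kgrams(b, 3))} K-grams in b also occur in a")
```

Because each K-gram is looked up independently, most windows still match after the two statements are swapped; only the windows that straddle the boundary between the statements are lost.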
A further step beyond token-based detection is detection based on the syntactic structure of the source code. Several research results already perform code copy detection and comparison at the level of syntactic structure, such as DECKARD, which records syntax tree information as vectors in Euclidean space for comparison, and the method described by Wahler et al., which records syntax tree information in XML for comparison. There are also online comparison systems such as Moss, and JPlag, developed by Prechelt et al. Baxter et al. also proposed an algorithm that uses source code syntax trees for code copy detection, implemented in a clone detection tool named CloneDR. That algorithm, however, keeps the tree structure throughout the comparison and requires repeated tree traversals. Homology comparison based on abstract syntax trees improves on the syntax tree comparison method: it converts the storage form of the syntax tree twice, turning the tree structure into a linear linked list, groups the subtrees by number of child nodes, and sorts the subtrees by hash value, which improves comparison efficiency. It also gives special treatment to operations whose meaning changes when the operands are swapped (for example subtraction and division), which reduces the false positive rate. The abstract syntax tree comparison method still has limitations, however: plagiarism that modifies the underlying data structures either goes undetected or is detected with large error.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a software plagiarism detection method and device based on an abstract syntax tree pruning comparison algorithm. The method generates the abstract syntax tree corresponding to the software source code and converts each tree node into a storage type with a unified structure; computes the hash value of each node in the abstract syntax tree and sorts the nodes by hash value; deletes the leaf nodes of a subtree and judges whether the hash values of the subtree roots agree; adds the deleted leaf nodes back to the subtree one at a time to find the nodes that affect the comparison result; and computes the similarity of the software source code and outputs the detection result.
To achieve the above object, the invention adopts the following technical solution:
In one aspect, the invention provides a method for detecting software plagiarism, based on a pruning comparison algorithm over abstract syntax trees, characterized in that the method comprises the following steps:
A. obtaining a source code file;
B. generating an abstract syntax tree;
C. traversing the abstract syntax tree and converting it into the required storage type;
D. assigning hash values to the nodes of the abstract syntax tree and sorting them;
E. comparing the sorted nodes;
F. outputting the result.
Preferably, step A comprises: obtaining the text content of the source code file according to the format information of the source code file.
Preferably, step B comprises:
B-1. preprocessing the text content of the source code file;
B-2. performing lexical analysis on the text content to obtain the identification information of the text;
B-3. according to the identification information obtained by lexical analysis and the syntax rules of the source code, allocating memory, generating the abstract syntax tree node information corresponding to the source code text, and building the corresponding abstract syntax tree;
The abstract syntax tree node information comprises: the node type, the position of the node in the source code file, the node ID label, and the token identifier corresponding to the node.
Preferably, step C comprises: calling a depth-first traversal method that starts from the root node of the generated abstract syntax tree and traverses the whole syntax tree, allocating memory according to the content of each node, converting each node into the required type, and adding it to a node list.
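The conversion in step C can be sketched as a depth-first traversal that copies every tree node into a flat list of records with a unified structure. The class and field names below are illustrative assumptions; the patent only states that a node carries its type, its position in the source file, its ID label, and its token identifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    node_type: str          # kind of syntax element (hypothetical names)
    line: int               # position in the source file
    type_id: int            # ID label assigned while the tree is generated
    token: str = ""         # token identifier of the node
    children: List["TreeNode"] = field(default_factory=list)


@dataclass
class FlatNode:
    node_type: str
    line: int
    type_id: int
    token: str
    child_indexes: List[int]    # positions of the children in the flat list


def flatten(root: TreeNode) -> List[FlatNode]:
    """Depth-first (pre-order) conversion of a tree into a node list."""
    flat: List[FlatNode] = []

    def visit(node: TreeNode) -> int:
        index = len(flat)
        record = FlatNode(node.node_type, node.line, node.type_id, node.token, [])
        flat.append(record)
        for child in node.children:
            record.child_indexes.append(visit(child))
        return index

    visit(root)
    return flat


# x = 1 + 2, written as a small hand-built tree
tree = TreeNode("assign", 1, 0, "=", [
    TreeNode("ident", 1, 1, "x"),
    TreeNode("add", 1, 2, "+", [TreeNode("num", 1, 3, "1"), TreeNode("num", 1, 3, "2")]),
])
for i, node in enumerate(flatten(tree)):
    print(i, node.node_type, node.token, node.child_indexes)
```

Storing child positions instead of object references keeps the parent-child relation available after the tree itself is discarded, which is one plausible reading of the "required storage type".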
Preferably, step D comprises:
D-1. from a hash array randomly generated by the system, selecting the element whose array index equals the node ID label and assigning its value to the hash field of the node, the node ID label being assigned by the system while the abstract syntax tree is generated;
D-2. for each node, accumulating the hash values of all its child nodes as the final hash value of the node;
D-3. sorting the nodes by hash value and by number of child nodes.
Preferably, step E comprises:
E-1. taking the corresponding nodes, in order, from the node lists and judging whether their hash values are equal; if equal, going to step E-9; if not equal, proceeding to the next step;
E-2. deleting all leaf nodes of the node and adding them to a leaf node list, while recomputing the hash value of the node;
E-3. comparing the hash values again; if equal, going to step E-5; if not equal, proceeding to the next step;
E-4. adding the node to the dissimilar node list and going to step E-9;
E-5. selecting a leaf node from the leaf node list and adding it to the node, while computing the hash value of the node after the addition;
E-6. judging whether the hash values are now identical; if identical, going to step E-8; if not, proceeding to the next step;
E-7. deleting the leaf node and adding it to the dissimilar leaf node list;
E-8. judging whether the leaf node list is now empty; if not empty, going to step E-5; otherwise proceeding to the next step;
E-9. judging whether all nodes have been compared; if so, proceeding to the next step; otherwise going to step E-1;
E-10. ending the comparison and returning the dissimilar node list.
Preferably, in step B-1, the preprocessing comprises handling the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code. A file inclusion directive is deleted. For a macro definition directive, the strings in the source code that were replaced by the macro definition are located and changed back to the original strings. For a conditional compilation directive, the condition in the directive is evaluated, and the code segment to delete and the code segment to keep are selected according to whether the condition holds;
In step B-2, the lexical analysis follows the syntax rules of the programming language in which the source code is written: regular expressions match each string against the matching rules of that language, and a token identifying the string is returned. The lexical analysis comprises returning the token identifiers that match the lexical rules and deleting comments, line breaks, and whitespace from the text. The identification information of the text comprises the token information of the text.
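A minimal sketch of the regular-expression-based lexical analysis in step B-2 follows. The token classes, the keyword set, and the C-like comment syntax are assumptions chosen for illustration; the patent does not list concrete lexical rules.

```python
import re

# Order matters: comments must be tried before the "/" operator.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=<>!]=?|[(){};,]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"int", "float", "if", "else", "return", "while"}  # hypothetical subset
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
                    re.DOTALL)


def tokenize(text):
    """Return (kind, value) pairs; comments, line breaks, and whitespace are dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = match.end()
        kind, value = match.lastgroup, match.group()
        if kind in ("COMMENT", "SKIP"):
            continue
        if kind == "IDENT" and value in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, value))
    return tokens


print(tokenize("int a = 3; // counter\n/* block */ a = a + 1;"))
```

Only the token kinds and values survive, so two files that differ in comments or layout produce the same token sequence.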
In another aspect, the invention provides a device for detecting software plagiarism, characterized in that the device comprises a source code file reading module, a syntax tree processing module, a hash processing module, a pruning comparison module, and a result output module, connected in sequence;
The source code file reading module obtains the source code file;
The syntax tree processing module generates the abstract syntax tree, traverses it, and converts it into the required storage type;
The hash processing module assigns hash values to the nodes of the abstract syntax tree and sorts them;
The pruning comparison module compares the sorted nodes;
The result output module outputs the result.
Preferably, the syntax tree processing module comprises the following units, connected in sequence:
A preprocessing unit, for preprocessing the text content of the source code file;
A lexical analysis unit, for reading in, one after another, the strings of the source code text preprocessed by the preprocessing unit, matching each string against the matching rules of the programming language with the corresponding regular expressions according to the syntax rules of that language, and returning a token identifying the string;
A syntax analysis unit, for allocating memory according to the identification information obtained by lexical analysis and the syntax rules of the source code, generating the abstract syntax tree node information corresponding to the source code text, and building the corresponding abstract syntax tree;
A storage conversion unit, for converting the nodes of the syntax tree into the required type and adding them to a node list.
Preferably, the hash processing module comprises a hash assignment unit, a hash accumulation unit, and a node sorting unit, connected in sequence. The hash assignment unit selects, from a hash array randomly generated by the system, the element whose array index equals the node ID label and assigns its value to the hash field of the node. The hash accumulation unit accumulates, for each node, the hash values of all its child nodes as the final hash value of the node. The node sorting unit sorts the nodes by hash value and by number of child nodes.
Preferably, the pruning comparison module comprises a node acquisition unit, a hash judgment unit, a pruning processing unit, and a node storage unit. The hash judgment unit receives the data submitted by the node acquisition unit, exchanges data with the pruning processing unit, and then forwards the result to the node storage unit;
The node acquisition unit obtains the syntax tree nodes to be compared, after they have been processed and sorted by the hash processing module;
The hash judgment unit judges the various states of the comparison flow, specifically: whether the node hash values are equal, whether all nodes have been compared, and whether the leaf list is empty;
The pruning processing unit comprises three subunits: a node deletion unit, a node addition unit, and a hash computation unit. The node deletion unit deletes leaf nodes from a node, the node addition unit adds leaf nodes to a node, and the hash computation unit computes the hash value of the node after leaf nodes are deleted or added, running in step with the node deletion unit and the node addition unit;
The node storage unit stores the dissimilar nodes.
Compared with the prior art, the invention has the following beneficial effects:
The invention detects source code plagiarism accurately at the syntax level, and in particular accurately and effectively detects plagiarism that changes variable types or adds interfering variables, which abstract syntax tree detection algorithms cannot distinguish. It can be embodied as a software product; the computer software product can be stored in a storage medium such as a ROM/RAM, magnetic disk, or optical disc, and comprises a number of instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the method of the invention.
Brief description of the drawings
Fig. 1 is a flowchart of the software plagiarism detection method of the invention;
Fig. 2 is a flowchart of the syntax tree processing module of the invention;
Fig. 3 is a flowchart of the hash processing module of the invention;
Fig. 4 is a flowchart of the pruning comparison algorithm of the invention;
Fig. 5 is a module diagram of the software plagiarism detection device of the invention;
Fig. 6 is a schematic diagram of the relations between abstract syntax tree nodes in the invention.
Detailed description
The invention is described in further detail below with reference to the drawings.
The invention is a software plagiarism detection device and method based on an abstract syntax tree pruning comparison algorithm. It can effectively detect plagiarism that modifies variable types or adds interfering variables, which the abstract syntax tree algorithm cannot detect.
With reference to Fig. 1, the main operating steps of the plagiarism detection method are as follows:
Step 1: the source code file reading module of the device obtains the source code files to be compared and passes them to the syntax tree processing module;
Step 2: the syntax tree processing module generates the corresponding abstract syntax tree from the obtained source code file;
Step 3: once the abstract syntax tree corresponding to the source code has been obtained, the syntax tree processing module traverses it and calls the corresponding method to convert the generated abstract syntax tree into the required storage type (here the storage type refers to the "object array" type in the Java language);
Step 4: hash values are computed for and assigned to the abstract syntax tree nodes in their converted storage type, and the node sorting method is called to sort the nodes after assignment;
Step 5: the sorted node list is passed to the pruning comparison module, which calls the pruning comparison algorithm to compare the sorted nodes;
Step 6: the result output module outputs the detection result. This module is responsible for outputting the comparison result, including the similarity of the source code files and the positions of the similar code in the source code (that is, the line number ranges of the similar code in the source files).
Through the above operations, the pruning comparison algorithm detects source code plagiarism accurately at the syntax level, and in particular accurately and effectively detects plagiarism that changes variable types or adds interfering variables, which abstract syntax tree detection algorithms cannot distinguish.
With reference to Fig. 2, the main operating steps of the syntax tree processing module are described in detail:
Step 11: obtain the source code file, and obtain its text content according to its format information (the format information is the file name extension of the code file);
Step 12: call the preprocessing method to preprocess the text content. This covers the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code. File inclusion directives are simply deleted. For macro definition directives, the strings in the source code that were replaced by the macro definition are located and changed back to the original strings. For conditional compilation directives, the condition in the directive is evaluated. (There is no restriction on the form of the condition; it differs from one conditional statement to another. For example, in the statement if (a>0), the condition is a>0: if a=3 in the code, a satisfies a>0 and the condition holds; if a=0, a does not satisfy a>0 and the condition does not hold. Similarly, in the statement if (a != b), where != means "not equal to", the condition holds exactly when a is not equal to b; if a=5 and b=5, then a=b and the condition does not hold.) The code segment to delete and the code segment to keep are then selected according to whether the condition holds. For example, consider the following code segment:
#if expression
statement segment 1
#else
statement segment 2
#endif
If the expression after #if holds, that is, the condition described above holds, statement segment 1 is kept and statement segment 2 is deleted; if the condition does not hold, statement segment 1 is deleted and statement segment 2 is kept;
Step 13: the lexical analysis unit performs lexical analysis on the preprocessed text to obtain the token information of the text; specifically, it returns the token identifiers that match the lexical rules and deletes the comments, line breaks, whitespace, and similar characters in the text;
Step 14: the syntax analysis unit allocates memory according to the token identifiers obtained by lexical analysis and the defined source code syntax rules, generates the abstract syntax tree node information corresponding to the source code text, and builds the corresponding abstract syntax tree;
Step 15: the storage conversion unit calls the depth-first traversal method, traverses the whole syntax tree starting from the root node of the generated abstract syntax tree, allocates memory according to the content of each node, converts each node into the required type, and adds it to the node list;
Through the processing of this module, the method provided by the embodiment converts plagiarism detection on syntax trees into plagiarism detection on a unified in-memory structure, which significantly improves the efficiency of software plagiarism detection.
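The preprocessing described in step 12 can be sketched as a small line-based filter. This is a toy illustration under stated assumptions, not the patent's implementation: the function name is hypothetical, an #if condition is assumed to be a single name looked up in a caller-supplied dictionary, and only object-like macros of the form "#define NAME value" are handled.

```python
import re


def preprocess(source: str, defined: dict) -> str:
    """Drop #include lines, expand simple macros, and resolve #if/#else/#endif."""
    macros = {}
    keep_stack = [True]      # keep_stack[-1]: is the current region being kept?
    out = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#include"):
            continue                                   # file inclusion: delete
        if stripped.startswith("#if "):
            name = stripped[4:].strip()
            keep_stack.append(keep_stack[-1] and bool(defined.get(name, 0)))
            continue
        if stripped == "#else":
            branch_taken = keep_stack.pop()
            keep_stack.append(keep_stack[-1] and not branch_taken)
            continue
        if stripped == "#endif":
            keep_stack.pop()
            continue
        if not keep_stack[-1]:
            continue                                   # inside a deleted segment
        if stripped.startswith("#define"):
            parts = stripped.split(None, 2)
            if len(parts) == 3:
                macros[parts[1]] = parts[2]
            continue
        for name, value in macros.items():             # put the original text back
            line = re.sub(rf"\b{re.escape(name)}\b", lambda m, v=value: v, line)
        out.append(line)
    return "\n".join(out)


code = """#include <stdio.h>
#define LIMIT 10
#if DEBUG
int level = LIMIT;
#else
int level = 0;
#endif
"""
print(preprocess(code, {"DEBUG": 1}))
print(preprocess(code, {"DEBUG": 0}))
```

With DEBUG set, the first segment is kept and the macro is expanded; with DEBUG unset, only the #else segment survives. A real preprocessor would also evaluate full expressions and function-like macros, which this sketch leaves out.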
With reference to Fig. 3, the main operating steps of the hash processing module are described in detail:
Step 21: during the syntax tree processing stage, while generating the syntax tree of the software source code, the system assigns each node an ID label according to the type of the node. During hash assignment, the element whose array index equals the node's label is selected from the hash array randomly generated by the system, and its value is assigned to the hash field of the node;
Step 22: following the node hash computation rule, the hash accumulation unit accumulates the hash values of all child nodes of a node as the final hash value of that node;
Step 23: the node sorting unit sorts the nodes in descending order by number of child nodes and by hash value. Nodes are sorted first by number of child nodes, with nodes that have more child nodes placed first; if two nodes have the same number of child nodes, they are then sorted by hash value, with the larger hash value placed first. This yields an ordered sequence suitable for processing by the pruning comparison module.
Fig. 6 illustrates the relation between nodes and child nodes. In the tree shown, each circle represents a node, and every node carries the same kinds of information. The child nodes of node 1 are node 2 and node 3; likewise, the child nodes of node 2 are node 4, node 5, and node 6. Nodes 4, 6, 8, 9, and 10 have no nodes below them, so during accumulation the sum over their child nodes is simply 0.
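Steps 21 to 23 can be sketched as follows. Several details are assumptions made for illustration: the final hash is taken to be the node's own table value plus the accumulated hashes of its children (the text does not say explicitly whether the node's own value is included), the values are kept within 32 bits, type IDs are invented, and the parents of nodes 7 to 10 are guessed because only part of the Fig. 6 tree is described in the text.

```python
import random
from dataclasses import dataclass, field

MASK = (1 << 32) - 1


@dataclass
class Node:
    label: int                       # node number as drawn in the figure
    type_id: int                     # ID label assigned according to node type
    children: list = field(default_factory=list)
    hash_value: int = 0


def make_hash_table(num_types, seed=0):
    """Step 21: one random value per node type, indexed by the type's ID label."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(num_types)]


def assign_hashes(node, table):
    """Step 22 (assumed rule): own table value plus the hashes of all children."""
    child_sum = sum(assign_hashes(child, table) for child in node.children)
    node.hash_value = (table[node.type_id] + child_sum) & MASK
    return node.hash_value


def collect(node):
    """List a node and all its descendants."""
    nodes = [node]
    for child in node.children:
        nodes.extend(collect(child))
    return nodes


def sort_nodes(nodes):
    """Step 23: more children first, then larger hash value first."""
    return sorted(nodes, key=lambda n: (len(n.children), n.hash_value), reverse=True)


def build_fig6_like_tree():
    n = {i: Node(label=i, type_id=i % 4) for i in range(1, 11)}
    n[1].children = [n[2], n[3]]
    n[2].children = [n[4], n[5], n[6]]
    n[3].children = [n[7]]           # assumed placement of nodes 7 to 10
    n[5].children = [n[8]]
    n[7].children = [n[9], n[10]]
    return n[1]


root = build_fig6_like_tree()
assign_hashes(root, make_hash_table(4, seed=1))
for node in sort_nodes(collect(root)):
    print(node.label, len(node.children), node.hash_value)
```

Because the table is indexed by node type rather than by identifier text, two subtrees with the same shape and node types receive the same hash regardless of variable names. A plain sum can collide for different subtrees, so a match is a strong hint rather than a proof of identity.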
With reference to Fig. 4, the main operating steps of the pruning comparison algorithm are described in detail:
Step 31: take the corresponding nodes from the node lists;
Step 32: judge whether the hash values of the nodes are equal to decide the next operation. If the hash values are identical, the nodes are similar; go to step 310. If the hash values differ, continue with the next step;
Step 33: delete all leaf nodes of the node and add them to the leaf node list, while recomputing the hash value of the node to obtain the new hash value after the leaf nodes are deleted;
Step 34: judge whether the hash values of the nodes are now equal. If they are identical, go to step 36; if not, continue with the next step;
Step 35: add the node to the dissimilar node list and go to step 310;
Step 36: select a leaf node from the leaf node list and add it to the node, while computing the hash value of the node after the addition;
Step 37: judge whether the hash values are now identical. If identical, go to step 39; if not, continue with the next step;
Step 38: delete the leaf node and add it to the dissimilar leaf node list;
Step 39: judge whether the leaf node list is now empty. If it is not empty, go to step 36; otherwise, continue with the next step;
Step 310: judge whether all nodes have been compared. If so, continue with the next step; otherwise, go to step 31;
Step 311: end the comparison and return the dissimilar node list.
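The comparison of one pair of nodes (steps 32 to 39) can be sketched as below. The text does not say exactly what the hash is compared against after a leaf is added back, so this sketch makes an explicit assumption: a leaf of the target node is kept if some not yet used leaf of the sample node brings the sample's hash to the same value, and is otherwise recorded as dissimilar. The class, the hash rule (own value plus children), and the numeric hash values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    kind: str                        # hypothetical node type name
    own_hash: int                    # value taken from the random hash table
    children: List["Node"] = field(default_factory=list)


def total_hash(node: Node) -> int:
    """Assumed rule: own value plus the accumulated hashes of all children."""
    return node.own_hash + sum(total_hash(c) for c in node.children)


def split_leaves(node: Node) -> Tuple[Node, List[Node]]:
    """Step 33: return a pruned copy of the node and the leaves removed from it."""
    leaves = [c for c in node.children if not c.children]
    inner = [c for c in node.children if c.children]
    return Node(node.kind, node.own_hash, inner), leaves


def prune_compare(target: Node, sample: Node) -> Tuple[bool, List[Node]]:
    """Return (similar, dissimilar_leaves) for one pair of nodes."""
    if total_hash(target) == total_hash(sample):            # step 32
        return True, []
    t_core, t_leaves = split_leaves(target)                 # step 33
    s_core, s_leaves = split_leaves(sample)
    if total_hash(t_core) != total_hash(s_core):            # steps 34 and 35
        return False, []
    dissimilar = []
    for leaf in t_leaves:                                   # steps 36 to 39
        t_core.children.append(leaf)
        match = next((s for s in s_leaves
                      if total_hash(s_core) + s.own_hash == total_hash(t_core)), None)
        if match is not None:                               # hashes agree again
            s_leaves.remove(match)
            s_core.children.append(match)
        else:                                               # step 38
            t_core.children.pop()
            dissimilar.append(leaf)
    return True, dissimilar


def body(*decls):
    assign = Node("assign", 5, [Node("id", 2), Node("num", 3)])
    return Node("block", 7, [*decls, assign])


sample = body(Node("decl_int", 11), Node("decl_int", 11))
# The suspect copy changes one variable type and adds an extra declaration.
target = body(Node("decl_int", 11), Node("decl_float", 13), Node("decl_int", 11))
similar, extra = prune_compare(target, sample)
print(similar, [leaf.kind for leaf in extra])
```

In this example the two blocks differ in total hash, agree once their leaf declarations are pruned, and the re-adding loop isolates the changed declaration instead of rejecting the whole block.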
During homology detection, the abstract syntax tree information has already been converted into its storage format, namely the list format. Detection can therefore traverse and compare, node by node, all the node information of the abstract syntax trees recorded in the two lists. This continues until all nodes of the target software and the sample software source code have been checked, which determines whether plagiarism exists. (Criterion: check whether nodes with identical hash values exist. If they do, plagiarism exists, and the similarity of the code files can be obtained from the number of identical nodes; if no identical nodes exist, there is no plagiarism in the source code.)
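The text states that similarity is derived from the number of identical nodes but gives no formula. The sketch below uses one common choice, twice the number of matched node hashes divided by the total number of nodes in both files; this formula and the function name are assumptions, not the patent's definition.

```python
from collections import Counter


def similarity(target_hashes, sample_hashes):
    """Share of nodes whose hash values occur in both node lists (0.0 to 1.0)."""
    total = len(target_hashes) + len(sample_hashes)
    if total == 0:
        return 0.0
    # Multiset intersection: a hash seen twice in both lists counts twice.
    shared = Counter(target_hashes) & Counter(sample_hashes)
    return 2 * sum(shared.values()) / total


print(similarity([1, 2, 3, 4], [1, 2, 3, 9]))   # three of four nodes match
print(similarity([1, 2], [3, 4]))               # no identical nodes
```

A value of 0.0 corresponds to the "no identical nodes" case in the criterion above; any positive value indicates at least one shared node, and a threshold for reporting plagiarism would have to be chosen separately.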
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments may still be modified or replaced with equivalents, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the claims.

Claims (11)

1. A method for detecting software plagiarism, based on a pruning comparison algorithm over abstract syntax trees, characterized in that the method comprises the following steps:
A. obtaining a source code file;
B. generating an abstract syntax tree;
C. traversing the abstract syntax tree and converting it into the required storage type;
D. assigning hash values to the nodes of the abstract syntax tree and sorting them;
E. comparing the sorted nodes;
F. outputting the result.
2. the method for claim 1, is characterized in that, described steps A comprises: the content of text that obtains source code file according to the format information of source code file.
3. the method for claim 1, is characterized in that, described step B comprises:
B-1. the content of text of Pretreatment of Source code file;
B-2. text content is carried out to lexical analysis, to obtain the identification information of text;
B-3. the identification information obtaining according to lexical analysis and the syntax rule of source code, application memory headroom, generates abstract syntax tree nodal information corresponding to source code text, and builds its corresponding abstract syntax tree;
Described abstract syntax tree nodal information comprises: the node type information of abstract syntax tree, node is present position information in source code file, the corresponding Token identifier information of node ID label and node.
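The node information listed in this claim could be held in a record such as the following Python sketch. The class name `AstNode` and its field names are hypothetical; only the four kinds of information (type, position, ID label, Token identifier) come from the claim, and the `children` field is added so the record can form a tree.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AstNode:
    node_type: str                 # kind of syntax element, e.g. "Assign"
    position: Tuple[int, int]      # (line, column) in the source file
    label: int                     # node ID label assigned while the tree is built
    token: str                     # Token identifier attached to the node
    children: List["AstNode"] = field(default_factory=list)

# Tree for the statement `x = 1;` (labels are arbitrary example values).
root = AstNode("Assign", (1, 3), 3, "=", [
    AstNode("Identifier", (1, 1), 1, "x"),
    AstNode("Number", (1, 5), 2, "1"),
])
print(root.node_type, [child.token for child in root.children])
```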
4. the method for claim 1, it is characterized in that, described step C comprises: call depth-first traversal method, take the abstract syntax root vertex that generates as whole syntax tree of start node traversal, according to the content of node, open up memory headroom, and be needed type by node unloading, join in node listing.
5. the method for claim 1, is characterized in that, described step D comprises:
D-1. from the random Hash array generating of system, choose the Hash field that element value that array index equals node ID label is assigned to node; Described node ID label is that system is distributed in generation abstract syntax tree process;
D-2. in cumulative each node the cryptographic hash of all child nodes as the final hash value of this node;
D-3. node is sorted by its cryptographic hash size and child node number.
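Steps D-1 to D-3 can be sketched in Python as follows. Several points are assumptions rather than statements from the claim: the node ID label is taken to identify the node kind, so that equal kinds receive the same base value; the final hash is taken to be the node's own base value plus the final hashes of its children; and the names `make_hash_array`, `assign_hashes`, and `sort_nodes` are hypothetical.

```python
import random

class Node:
    def __init__(self, label, children=None):
        self.label = label              # node ID label (assumed to identify the node kind)
        self.children = children or []
        self.hash_value = 0

def make_hash_array(size, seed=None):
    """Randomly generated hash array, one entry per possible label."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(size)]

def assign_hashes(node, hash_array):
    """Post-order pass: base value by label (D-1) plus children's hashes (D-2)."""
    total = hash_array[node.label]
    for child in node.children:
        total += assign_hashes(child, hash_array)
    node.hash_value = total
    return total

def sort_nodes(nodes):
    """Sort by hash value, then by number of child nodes (D-3)."""
    return sorted(nodes, key=lambda n: (n.hash_value, len(n.children)))

leaf_a, leaf_b = Node(1), Node(2)
root = Node(0, [leaf_a, leaf_b])
assign_hashes(root, [10, 20, 30])   # fixed array for a reproducible example
ordered = sort_nodes([root, leaf_a, leaf_b])
print([n.hash_value for n in ordered])
```

Both files would need to be hashed with the same array for their node hash values to be comparable.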
6. the method for claim 1, is characterized in that, described step e comprises:
E-1. from node listing, take out the node that order is corresponding, whether interpretation cryptographic hash equates; If equate, perform step E-9; If etc., do not carry out next step;
E-2. all leaf nodes of deletion of node, and these leaf nodes are joined to leaf node list, recalculate the cryptographic hash of described node simultaneously;
E-3. again compare described cryptographic hash, if equate, execution step E-5; If unequal, continue next step;
E-4. node is joined in dissimilar node listing to execution step E-9;
E-5. from leaf node list, choose leaf node and join in node, calculate the cryptographic hash that adds posterior nodal point simultaneously;
E-6. whether judgement cryptographic hash is now identical, if identical, execution step E-8; If not identical, continue next step;
E-7. delete leaf node and join in dissimilar leaf node list;
E-8. judge whether leaf node list is now empty, if not empty, execution step E-5; Otherwise, continue next step;
E-9. judge that whether all nodes have all been compared, if so, continue next step; Otherwise, execution step E-1;
E-10. finish comparison, return to dissimilar node listing information.
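One possible reading of steps E-1 to E-10 is sketched below in Python. It is an interpretation, not the claimed procedure itself: it assumes node hash values are additive (a node's own value plus its children's), treats "leaf nodes of the node" as its direct leaf children, and replaces the add-one-leaf-and-recompare loop with an equivalent lookup of each target leaf among the unmatched sample leaves. The names `Node`, `split_leaves`, and `prune_compare` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    hash_value: int
    children: List["Node"] = field(default_factory=list)

def split_leaves(node):
    """Return (hash with direct leaf children pruned, list of those leaves)."""
    leaves = [c for c in node.children if not c.children]
    pruned_hash = node.hash_value - sum(leaf.hash_value for leaf in leaves)
    return pruned_hash, leaves

def prune_compare(target_nodes, sample_nodes):
    dissimilar_nodes, dissimilar_leaves = [], []
    # zip stops at the shorter list; extra nodes are ignored in this sketch.
    for t, s in zip(target_nodes, sample_nodes):
        if t.hash_value == s.hash_value:           # E-1: identical subtrees
            continue
        t_core, t_leaves = split_leaves(t)         # E-2: prune leaves
        s_core, s_leaves = split_leaves(s)
        if t_core != s_core:                       # E-3 / E-4
            dissimilar_nodes.append(t)
            continue
        # E-5 to E-8: re-add leaves one at a time; a leaf is kept only if the
        # sample node has an unmatched leaf with the same hash value.
        available = Counter(leaf.hash_value for leaf in s_leaves)
        for leaf in t_leaves:
            if available[leaf.hash_value] > 0:
                available[leaf.hash_value] -= 1
            else:
                dissimilar_leaves.append(leaf)     # E-7
    return dissimilar_nodes, dissimilar_leaves     # E-10

target = [Node("assign", 100, [Node("x", 10), Node("1", 20)]), Node("call", 50)]
sample = [Node("assign", 110, [Node("x", 10), Node("2", 30)]), Node("loop", 90)]
nodes, leaves = prune_compare(target, sample)
print([n.name for n in nodes], [n.name for n in leaves])
```

In the example, the two `assign` nodes differ only in one leaf, so only that leaf is reported, while the `call` node differs from `loop` even after pruning and is reported as a whole.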
7. The method of claim 3, characterized in that: in step B-1, the preprocessing comprises handling the file inclusion directives, macro definition directives, and conditional compilation directives in the software source code; a file inclusion directive is handled by deleting it; a macro definition directive is handled by finding the character strings in the software source code that were replaced by the macro definition and restoring the original character strings; a conditional compilation directive is handled by evaluating its condition and, according to whether the condition holds, selecting which code segment to delete and which to retain;
in step B-2, the lexical analysis matches character strings against the matching rules of the programming language, using regular expressions chosen according to the syntax rules of the programming language in which the source code is written, and returns a mark identifying each character string; the lexical analysis comprises returning the Token identifiers that match the lexical rules and deleting comments, line breaks, and whitespace from the text; the identification information of the text comprises the Token information of the text.
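The preprocessing and lexical analysis of this claim can be illustrated for C-like source with the Python sketch below. It is deliberately partial and rests on stated assumptions: `#include` lines are dropped, only object-like `#define NAME value` macros are substituted (one reading of the macro handling in the claim), conditional compilation is not handled, and the token classes in `TOKEN_SPEC` are a small hypothetical subset rather than a full C lexer.

```python
import re

def preprocess(source):
    """Drop #include lines and substitute simple object-like macros."""
    macros, kept = {}, []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#include"):
            continue
        match = re.match(r"#define\s+(\w+)\s+(.+)", stripped)
        if match:
            macros[match.group(1)] = match.group(2).strip()
            continue
        kept.append(line)
    text = "\n".join(kept)
    for name, value in macros.items():
        # A function is used as the replacement so backslashes stay literal.
        text = re.sub(r"\b%s\b" % re.escape(name), lambda _m, v=value: v, text)
    return text

TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=<>!]=?|[(){};,]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC), re.DOTALL)

def tokenize(text):
    """Return (kind, text) pairs; comments, line breaks and whitespace are dropped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind in ("COMMENT", "SKIP"):
            continue
        tokens.append((kind, match.group()))
    return tokens  # characters matching no rule are silently skipped

source = "#include <stdio.h>\n#define MAX 10\nint x = MAX; // limit\n"
print(tokenize(preprocess(source)))
```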
8. A device for detecting software plagiarism, characterized in that the device comprises a source code file reading module, a syntax tree processing module, a hash value processing module, a pruning comparison module, and a result output module connected in sequence;
the source code file reading module is configured to obtain a source code file;
the syntax tree processing module is configured to generate an abstract syntax tree, traverse it, and convert it into the required storage type;
the hash value processing module is configured to assign values to the nodes of the abstract syntax tree and sort them;
the pruning comparison module is configured to compare the sorted nodes;
the result output module is configured to output the result.
9. The device of claim 8, characterized in that the syntax tree processing module comprises the following units connected in sequence:
a preprocessing unit, configured to preprocess the text content of the source code file;
a lexical analysis unit, configured to read in, one after another, the character strings of the source code text preprocessed by the preprocessing unit, match each character string against the matching rules of the programming language using regular expressions chosen according to the syntax rules of the programming language in which the source code is written, and return a mark identifying the character string;
a syntax analysis unit, configured to allocate memory, generate the abstract syntax tree node information corresponding to the source code text, and build the corresponding abstract syntax tree, according to the identification information obtained by lexical analysis and the syntax rules of the source code;
a storage conversion unit, configured to convert the nodes of the syntax tree to the required type and add them to the node list.
10. The device of claim 8, characterized in that the hash value processing module comprises a hash value assignment unit, a hash value accumulation unit, and a node sorting unit connected in sequence; the hash value assignment unit is configured to select, from a hash array randomly generated by the system, the element whose array index equals the node ID label and assign its value to the hash field of the node; the hash value accumulation unit is configured to accumulate, for each node, the hash values of all its child nodes to form the final hash value of the node; the node sorting unit is configured to sort the nodes by hash value and by number of child nodes.
11. The device of claim 8, characterized in that the pruning comparison module comprises a node acquisition unit, a hash value judging unit, a pruning processing unit, and a node storage unit; the hash value judging unit receives the data submitted by the node acquisition unit, exchanges data with the pruning processing unit, and then forwards the data to the node storage unit;
the node acquisition unit is configured to obtain the syntax tree nodes to be compared after they have been processed and sorted by the hash value processing module;
the hash value judging unit is configured to judge the various states of the comparison flow, specifically: whether node hash values are equal, whether all nodes have been compared, and whether the leaf list is empty;
the pruning processing unit comprises three subunits: a node deletion unit, a node addition unit, and a hash value calculation unit; the node deletion unit is configured to delete leaf nodes from a node, the node addition unit is configured to add leaf nodes to a node, and the hash value calculation unit is configured to calculate the node hash value after leaf nodes are deleted or added, operating in synchronization with the node deletion unit and the node addition unit;
the node storage unit is configured to store the dissimilar nodes.
CN201410039084.XA 2014-01-27 2014-01-27 Method and device for detecting software plagiarism Pending CN103729580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410039084.XA CN103729580A (en) 2014-01-27 2014-01-27 Method and device for detecting software plagiarism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410039084.XA CN103729580A (en) 2014-01-27 2014-01-27 Method and device for detecting software plagiarism

Publications (1)

Publication Number Publication Date
CN103729580A true CN103729580A (en) 2014-04-16

Family

ID=50453651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410039084.XA Pending CN103729580A (en) 2014-01-27 2014-01-27 Method and device for detecting software plagiarism

Country Status (1)

Country Link
CN (1) CN103729580A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315599A (en) * 2007-05-29 2008-12-03 北京航空航天大学 Method and device for detecting similarity of source codes
US7503035B2 (en) * 2003-11-25 2009-03-10 Software Analysis And Forensic Engineering Corp. Software tool for detecting plagiarism in computer source code
CN101398758A (en) * 2008-10-30 2009-04-01 北京航空航天大学 Detection method of code copy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘楠等: "一种改进的基于抽象语法树的软件源代码比对算法", 《信息网络安全》 *
李建松: "基于抽象语法树的软件抄袭检测算法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036156A (en) * 2014-06-27 2014-09-10 麦永浩 Method and system for evidence collection and identification of electronic data of software piracy
CN104104680A (en) * 2014-07-14 2014-10-15 中国电子科技集团公司第四十一研究所 Method for carrying out Rapid IO protocol decoding by means of formalization description language
CN104408023A (en) * 2014-11-05 2015-03-11 中国农业银行股份有限公司 Index calculation method and index calculator
CN104408023B (en) * 2014-11-05 2017-11-03 中国农业银行股份有限公司 Method and indicia calculator that a kind of index is calculated
CN104536984B (en) * 2014-12-08 2017-10-13 北京邮电大学 The verification method and system of a kind of space text Top k inquiries in Outsourced database
CN104536984A (en) * 2014-12-08 2015-04-22 北京邮电大学 Verification method and system for space text Top-k query in outsourced database
CN104572471B (en) * 2015-01-28 2017-10-03 杭州电子科技大学 A kind of Java software Code Clones detection method based on index
CN104572471A (en) * 2015-01-28 2015-04-29 杭州电子科技大学 Index-based Java software code clone detection method
CN106919431A (en) * 2015-12-25 2017-07-04 航天信息股份有限公司 Code comparison method, equipment and system in continuous integrating
CN106919431B (en) * 2015-12-25 2021-03-26 航天信息股份有限公司 Code comparison method, equipment and system in continuous integration
CN106250769B (en) * 2016-07-30 2019-08-16 北京明朝万达科技股份有限公司 A kind of the source code data detection method and device of multistage filtering
CN106250769A (en) * 2016-07-30 2016-12-21 北京明朝万达科技股份有限公司 The source code data detection method of a kind of multistage filtering and device
CN106294139A (en) * 2016-08-02 2017-01-04 上海理工大学 A kind of Detection and Extraction method of repeated fragment in software code
CN106294139B (en) * 2016-08-02 2018-08-31 上海理工大学 A kind of Detection and Extraction method of repeated fragment in software code
CN106951743A (en) * 2017-03-22 2017-07-14 上海英慕软件科技有限公司 A kind of software code infringement detection method
CN108446540A (en) * 2018-03-19 2018-08-24 中山大学 Program code based on source code multi-tag figure neural network plagiarizes type detection method and system
CN108446540B (en) * 2018-03-19 2022-02-25 中山大学 Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 A kind of detection method and device of code similarity
CN109347639A (en) * 2018-09-21 2019-02-15 浪潮电子信息产业股份有限公司 Method and device for generating serial number
CN109347639B (en) * 2018-09-21 2021-06-29 浪潮电子信息产业股份有限公司 Method and device for generating serial number
CN109739509A (en) * 2018-09-30 2019-05-10 北京奇虎科技有限公司 Hide detection method, device and the computer storage medium of API Calls
CN109445834B (en) * 2018-10-30 2021-04-30 北京计算机技术及应用研究所 Program code similarity rapid comparison method based on abstract syntax tree
CN109445834A (en) * 2018-10-30 2019-03-08 北京计算机技术及应用研究所 The quick comparative approach of program code similitude based on abstract syntax tree
CN109558314B (en) * 2018-11-09 2021-07-27 国网四川省电力公司电力科学研究院 Java source code clone detection oriented method
CN109558314A (en) * 2018-11-09 2019-04-02 国网四川省电力公司电力科学研究院 A method of it clones and detects towards Java source code
CN109558706B (en) * 2018-11-16 2021-09-07 杭州师范大学 Detection method of SM4 cryptographic block algorithm
CN109558706A (en) * 2018-11-16 2019-04-02 杭州师范大学 The detection method of the close SM4 block cipher of state
CN109635569A (en) * 2018-12-10 2019-04-16 国家电网有限公司信息通信分公司 A kind of leak detection method and device
CN109933365B (en) * 2018-12-28 2022-08-19 蜂巢能源科技有限公司 Method and device for generating function call tree
CN109933365A (en) * 2018-12-28 2019-06-25 蜂巢能源科技有限公司 A kind of generation method and device of function call tree
CN109816038B (en) * 2019-01-31 2022-07-29 广东工业大学 Internet of things firmware program classification method and device
CN109816038A (en) * 2019-01-31 2019-05-28 广东工业大学 A kind of Internet of Things firmware program classification method and its device
CN109885491B (en) * 2019-02-12 2022-07-05 科华恒盛股份有限公司 Method for detecting existence of data overflow expression and terminal equipment
CN109885491A (en) * 2019-02-12 2019-06-14 科华恒盛股份有限公司 To there are the detection methods and terminal device that data overflow expression formula
CN110109920B (en) * 2019-03-19 2022-03-22 咪咕文化科技有限公司 Data comparison method and server
CN110109920A (en) * 2019-03-19 2019-08-09 咪咕文化科技有限公司 Data comparison method and server
CN110347416A (en) * 2019-07-19 2019-10-18 网易(杭州)网络有限公司 The update method and device of script
CN110737466A (en) * 2019-10-16 2020-01-31 南京航空航天大学 Source code coding sequence representation method based on static program analysis
CN110989991B (en) * 2019-10-25 2023-12-01 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN110989991A (en) * 2019-10-25 2020-04-10 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN113111345B (en) * 2020-01-13 2024-05-24 深信服科技股份有限公司 XXE attack detection method, system, equipment and computer storage medium
CN113111345A (en) * 2020-01-13 2021-07-13 深信服科技股份有限公司 XXE attack detection method, system, device and computer storage medium
CN111090856A (en) * 2020-03-23 2020-05-01 杭州有数金融信息服务有限公司 Crawler detection method based on browser feature detection and event monitoring
CN111562944B (en) * 2020-05-11 2023-08-29 南京域智智能科技有限公司 Program code comparison method and comparison device
CN111562944A (en) * 2020-05-11 2020-08-21 南京域智智能科技有限公司 Program code comparison method and device
US11782819B2 (en) 2020-07-15 2023-10-10 Microsoft Technology Licensing, Llc Program execution monitoring using deep memory tracing
CN112148609A (en) * 2020-09-28 2020-12-29 南京大学 Method for measuring codes submitted in online programming test
CN113535229B (en) * 2021-06-30 2022-12-02 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113535229A (en) * 2021-06-30 2021-10-22 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN114138414A (en) * 2021-12-02 2022-03-04 国汽大有时空科技(安庆)有限公司 Incremental compression method and system for container mirror image
CN114138414B (en) * 2021-12-02 2023-08-15 国汽大有时空科技(安庆)有限公司 Incremental compression method and system for container mirror image
CN114816519A (en) * 2022-04-26 2022-07-29 南京航空航天大学 Code clone detection method and application based on abstract syntax tree and token
CN115718907A (en) * 2022-11-29 2023-02-28 广发银行股份有限公司 Font copyright detection method and device
CN116302074A (en) * 2023-05-12 2023-06-23 卓望数码技术(深圳)有限公司 Third party component identification method, device, equipment and storage medium
CN118278003A (en) * 2024-03-25 2024-07-02 中国人民解放军61660部队 Software back door detection method based on normalized grammar tree
CN118331637A (en) * 2024-06-14 2024-07-12 北京迪力科技有限责任公司 Code similarity evaluation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103729580A (en) Method and device for detecting software plagiarism
CN110245496B (en) Source code vulnerability detection method and detector and training method and system thereof
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
US9542477B2 (en) Method of automated discovery of topics relatedness
US9317260B2 (en) Query-by-example in large-scale code repositories
CN101894236B (en) Software homology detection method and device based on abstract syntax tree and semantic matching
He et al. Pyart: Python api recommendation in real-time
CN108959433A (en) A kind of method and system extracting knowledge mapping and question and answer from software project data
CN104573503B (en) The detection method and device that a kind of internal storage access overflows
Jadon Code clones detection using machine learning technique: Support vector machine
CN111459799A (en) Software defect detection model establishing and detecting method and system based on Github
Bulychev et al. Duplicate code detection using anti-unification
CN108664237B (en) It is a kind of based on heuristic and neural network non-API member's recommended method
CN111045670B (en) Method and device for identifying multiplexing relationship between binary code and source code
US11288266B2 (en) Candidate projection enumeration based query response generation
CN110019384A (en) A kind of acquisition methods of blood relationship data provide the method and device of blood relationship data
Solanki et al. Comparative study of software clone detection techniques
CN116974554A (en) Code data processing method, apparatus, computer device and storage medium
CN110737469B (en) Source code similarity evaluation method based on semantic information on function granularity
Li et al. Toward less hidden cost of code completion with acceptance and ranking models
JP2021060800A (en) Data extraction method and data extraction device
Hübner et al. Using interaction data for continuous creation of trace links between source code and requirements in issue tracking systems
Tukaram Design and development of software tool for code clone search, detection, and analysis
CN109299453A (en) A kind of method and apparatus for constructing dictionary
Zhang et al. Long Method Detection Using Graph Convolutional Networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140416
