CN101894236B - Software homology detection method and device based on abstract syntax tree and semantic matching - Google Patents

Publication number
CN101894236B
Authority
CN
China
Prior art keywords
syntax tree
abstract syntax
subtree
software
source code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010102384099A
Other languages
Chinese (zh)
Other versions
CN101894236A (en)
Inventor
崔宝江
吴世忠
郭涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING HUAXIA INFOSEC TECHNOLOGY Ltd
China Information Technology Security Evaluation Center
Original Assignee
BEIJING HUAXIA INFOSEC TECHNOLOGY Ltd
China Information Technology Security Evaluation Center
Application filed by BEIJING HUAXIA INFOSEC TECHNOLOGY Ltd, China Information Technology Security Evaluation Center filed Critical BEIJING HUAXIA INFOSEC TECHNOLOGY Ltd
Priority to CN2010102384099A priority Critical patent/CN101894236B/en
Publication of CN101894236A publication Critical patent/CN101894236A/en
Application granted granted Critical
Publication of CN101894236B publication Critical patent/CN101894236B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a software homology detection method and device based on an abstract syntax tree and semantic matching. The method comprises: generating the abstract syntax tree corresponding to the software source code, and adjusting subtrees of the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure; calculating the hash values of the subtrees of the abstract syntax tree; and performing software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent. Software homology detection is thereby performed accurately and effectively at the syntax level in combination with semantics.

Description

Software homology detection method and device based on abstract syntax tree and semantic matching
Technical field
The present invention relates to the field of software security within information security, and in particular to a software homology detection method and device based on an abstract syntax tree and semantic matching.
Background art
Software homology detection is an important topic in programming language research. Classified by detection method, the mainstream approaches in this field are text-based software homology detection and token-based software homology detection.
Text-based software homology detection works at the text level. Source code plagiarism, however, usually consists of copying a whole block, possibly followed by modifications such as renaming variables, reordering statements without affecting program behavior, or changing function names or function positions. Detection performed only at the text level therefore cannot meet the needs of software homology detection. Moreover, it completely ignores the syntactic meaning of the source code, which severely limits it: none of the plagiarism techniques above can be detected correctly.
Token-based software homology detection is mainly used for source code clone detection, and can also be used for software homology detection. It takes language features into account to some extent, but it works by searching for the longest similar substring in the source code, so it cannot cope with plagiarism that changes the order of the source code.
At present there is no mature and reliable software homology detection scheme that detects the homology of software source code effectively and accurately at the syntax level in combination with semantics.
Summary of the invention
In view of this, the purpose of the embodiments of the invention is to provide a software homology detection method and device based on an abstract syntax tree and semantic matching, so that software homology detection is carried out effectively and accurately, various kinds of source code plagiarism can be handled, software copyright is protected, and software evaluation work is advanced.
To achieve the above purpose, an embodiment of the invention provides a software homology detection method based on an abstract syntax tree and semantic matching, comprising:
generating the abstract syntax tree corresponding to the software source code, and adjusting subtrees of the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
calculating the hash values of the subtrees of the abstract syntax tree;
performing software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent.
Preferably, in the method, generating the abstract syntax tree corresponding to the software source code comprises:
deleting file inclusion directives, comments and redundant whitespace from the software source code; finding the strings in the software source code that were replaced by macro definition directives and replacing them with the original strings; and judging whether the condition of each conditional compilation directive in the software source code holds, then deleting or keeping the corresponding section of source code accordingly.
Preferably, in the method, the semantic feature rule is at least one of the following features: the conditional expression is a comparison expression, a variable identifier, or a function call.
Preferably, in the method, calculating the hash values of the subtrees of the abstract syntax tree comprises:
calculating the hash value of each node according to the node type information of the abstract syntax tree, and saving the node information, including the hash value, in the form of a linear linked list;
summing the hash values of the nodes contained in a subtree of the abstract syntax tree to obtain the hash value of that subtree.
Preferably, in the method, summing the hash values of the nodes contained in a subtree of the abstract syntax tree to obtain the hash value of that subtree comprises:
assigning different weights to the operands before and after a division, subtraction or remainder operation.
Preferably, in the method, performing software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent comprises:
grouping the subtrees of the abstract syntax tree according to the number of nodes each subtree contains.
An embodiment of the invention also provides a software homology detection device based on an abstract syntax tree and semantic matching, comprising:
a generation module, configured to generate the abstract syntax tree corresponding to the software source code, and to adjust subtrees of the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
a computing module, configured to calculate the hash values of the subtrees of the abstract syntax tree generated by the generation module;
a detection module, configured to perform software homology detection, based on the hash values calculated by the computing module, by judging whether the hash values of subtrees with the same number of nodes are consistent.
Preferably, in the device, the generation module comprises:
a preprocessing unit, configured to delete file inclusion directives, comments and redundant whitespace from the software source code; to find the strings in the software source code that were replaced by macro definition directives and replace them with the original strings; and to judge whether the condition of each conditional compilation directive holds, then delete or keep the corresponding section of source code accordingly;
a lexical analysis unit, configured to read in, in order, the strings of the software source code text preprocessed by the preprocessing unit and, according to the grammar rules of the programming language in which the software source code is written, to use regular expressions to match each string against the matching rules of that language and then return a token identifying the string;
a syntax analysis unit, configured to match the token sequence returned by the lexical analysis unit against the grammar rules of the programming language in which the software source code is written and, after a match, to allocate memory, generate the corresponding node information of the abstract syntax tree, and build the abstract syntax tree;
a semantic matching unit, configured to analyze the structure of the abstract syntax tree built by the syntax analysis unit, and to adjust subtrees that match the same semantic feature rule and have the same semantics into a unified structure.
Preferably, in the device, the computing module comprises:
a node computing unit, configured to calculate the hash value of each node according to the node type information of the abstract syntax tree as adjusted by the semantic matching unit;
a subtree computing unit, configured to sum the node hash values calculated by the node computing unit over the nodes contained in each subtree of the abstract syntax tree, to obtain the hash value of that subtree.
Preferably, in the device, the detection module comprises:
a grouping unit, configured to group the subtrees of the abstract syntax tree according to the number of nodes each subtree contains.
As the technical scheme above shows, the embodiments of the invention generate the abstract syntax tree corresponding to the software source code and adjust subtrees that match the same semantic feature rule and have the same semantics into a unified structure; calculate the hash values of the subtrees of the abstract syntax tree; and perform software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent. Software homology detection is thereby performed accurately and effectively at the syntax level in combination with semantics.
Description of drawings
Fig. 1 is a first schematic diagram of a specific implementation of the method provided by an embodiment of the invention;
Fig. 2 is a second schematic diagram of a specific implementation of the method provided by an embodiment of the invention;
Fig. 3 is a third schematic diagram of a specific implementation of the method provided by an embodiment of the invention;
Fig. 4 is a first schematic structural diagram of a specific implementation of the device provided by an embodiment of the invention;
Fig. 5 is a second schematic structural diagram of a specific implementation of the device provided by an embodiment of the invention;
Fig. 6 is a third schematic structural diagram of a specific implementation of the device provided by an embodiment of the invention;
Fig. 7 is a fourth schematic structural diagram of a specific implementation of the device provided by an embodiment of the invention.
Detailed description of the embodiments
An embodiment of the invention provides a software homology detection method based on an abstract syntax tree and semantic matching which, as shown in Fig. 1, comprises:
Step 11: generate the abstract syntax tree corresponding to the software source code, and adjust subtrees of the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
Step 12: calculate the hash values of the subtrees of the abstract syntax tree;
Step 13: perform software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent.
By implementing the software homology detection method based on an abstract syntax tree and semantic matching provided by the embodiment of the invention, software homology detection can be performed accurately and effectively at the syntax level in combination with semantics.
To make the purpose, technical scheme and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.
The method provided by the embodiment of the invention performs software homology detection at the syntax level and takes the original semantics of the software source code fully into account, which safeguards the correctness of homology detection. It also proposes a new means of detection that uses the tree structure corresponding to the software source code, the abstract syntax tree: the hash values of the nodes of the abstract syntax tree serve as the basis for homology detection, and a semantic feature rule matching method is proposed for common source code plagiarism techniques, raising detection to the semantic level and improving the accuracy of homology detection. In addition, the subtrees of the abstract syntax tree are grouped by node count and compared only within groups, which avoids unnecessary comparisons and greatly improves detection efficiency.
In a specific implementation, one embodiment of the method may, as shown in Fig. 2, comprise the following steps:
Step 21: obtain the detection parameters needed to perform homology detection on the target software and the sample software.
In the embodiments of the invention, the software to be checked is called the target software, and the software it is compared against is called the sample software. These names may of course be changed as required.
In a specific embodiment of the invention, step 21 may comprise the following operations:
Step 211: select the target software source code and the sample software source code to be checked;
In this step, if a database file of the abstract syntax tree generated from the target or sample software source code has already been obtained and stored, that database file can be loaded directly for the subsequent comparison, and the source code of the target or sample software need not be obtained again.
Step 212: input the comparison threshold.
In this step, the comparison threshold can be input according to the programming language (C, C++, Java and so on) used by the target software and sample software to be compared. The threshold is the precision of the comparison: for example, if the user sets the threshold to 5, only subtrees of the abstract syntax tree with 5 or more child nodes are compared.
Of course, the threshold selection may also be omitted and the comparison run directly, or a fixed value specified in advance may be used as the default comparison threshold.
Step 213: determine the storage paths of the intermediate files.
In this step, the storage paths of the intermediate files can be determined; these files may include the preprocessed software source code, log files, the database file storing the comparison results, and so on.
In a specific implementation, the embodiment of the invention can also record and store operation information about the homology detection process in real time, such as the start and end times of the detection and the storage path of the database file.
Note that, in a specific implementation, step 21 does not have to be performed; it can be an optional step or can run with default settings.
Step 22: generate the abstract syntax trees of the target software and sample software source code, and adjust the abstract syntax tree structure.
In the embodiment of the invention, the abstract syntax tree corresponding to the software source code is generated according to the grammar rules of the programming language used by the target or sample software. Because the abstract syntax tree records the syntactic structure of the source code and the position of each code fragment within it, the abstract syntax tree file is important and effective data for homology detection. After the abstract syntax tree is generated it is analyzed and matched against the semantic feature rules, and its structure is adjusted so that subtrees that satisfy the same semantic feature rule and have the same semantics take a unified structure.
Step 22 may, as shown in Fig. 3, comprise:
Step 221: preprocess the software source code.
Preprocessing brings all software source code (both target and sample) into the same canonical format.
Preprocessing may include handling the file inclusion directives, macro definition directives, conditional compilation directives, comments and redundant whitespace in the software source code. File inclusion directives, comments and redundant whitespace are simply deleted. For macro definition directives, the strings replaced by the macro definition are found in the source code and replaced with the original strings. For conditional compilation directives, the condition is evaluated, and the code sections to delete and to keep are chosen accordingly.
In a specific embodiment of the invention, step 221 may comprise the following steps:
Step 2211: delete redundant whitespace and line breaks from the software source code;
Step 2212: handle line continuation characters in the software source code, joining code written across two lines into a single line;
Step 2213: delete the comments in the software source code, leaving only meaningful code;
Step 2214: find the file inclusion directives "#include" in the software source code and delete them;
Step 2215: find the conditional compilation directives in the software source code, judge whether each condition holds, and keep or delete the enclosed code section accordingly.
Note that steps 2211 to 2215 need not be performed in any particular order.
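As an illustration only, steps 2211 to 2214 might be sketched in Python as follows. The function name preprocess and the regular expressions are assumptions made for this sketch and do not come from the patent; comment markers that occur inside string literals are not handled.

```python
import re

def preprocess(source: str) -> str:
    """Normalize C-like source text before lexing (illustrative sketch)."""
    # Step 2212: join continuation lines (a backslash followed by a newline).
    text = source.replace("\\\n", "")
    # Step 2213: delete block comments, then line comments.
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", "", text)
    kept = []
    for line in text.split("\n"):
        # Step 2214: drop file inclusion directives.
        if re.match(r"\s*#\s*include\b", line):
            continue
        # Step 2211: collapse runs of whitespace and skip empty lines.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)
```

The sketch applies the steps in a fixed order for simplicity; conditional compilation (step 2215) is left out here.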
When preprocessing software source code, the method provided by the embodiment of the invention takes its semantics fully into account. For example, the statement "typedef float FL" is handled by replacing every "FL" in the source code with "float"; and the conditional compilation statement "#ifdef" is handled by judging whether its condition holds and keeping only the statements that should be kept. The method can therefore preprocess the source code according to its semantics and preserve the accuracy of homology detection as far as possible.
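The two examples above could be sketched as follows. This is a simplified illustration under stated assumptions: the helper names expand_typedefs and resolve_ifdef are hypothetical, only single-word typedef aliases are handled, and only "#ifdef", "#else" and "#endif" are recognized.

```python
import re

def expand_typedefs(text: str) -> str:
    """Replace simple `typedef <type> <alias>;` aliases with the original type."""
    aliases = {
        alias: base
        for base, alias in re.findall(r"typedef\s+(\w+)\s+(\w+)\s*;", text)
    }
    # Remove the typedef statements themselves, then substitute each alias.
    text = re.sub(r"typedef\s+\w+\s+\w+\s*;\s*", "", text)
    for alias, base in aliases.items():
        text = re.sub(rf"\b{re.escape(alias)}\b", base, text)
    return text

def resolve_ifdef(lines, defined):
    """Keep only the branch of each #ifdef/#else/#endif whose condition holds."""
    out, stack = [], []  # stack: one "branch is active" flag per nesting level
    for line in lines:
        s = line.strip()
        if s.startswith("#ifdef"):
            stack.append(s.split()[1] in defined)
        elif s.startswith("#else"):
            stack[-1] = not stack[-1]
        elif s.startswith("#endif"):
            stack.pop()
        elif all(stack):
            out.append(line)
    return out
```

For instance, expand_typedefs("typedef float FL; FL x;") yields "float x;", and resolve_ifdef keeps the "#else" branch when the macro name is absent from the set of defined names.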
Step 222: perform lexical analysis on the preprocessed software source code.
Specifically, the strings of the preprocessed source code text are read in order. Then, according to the programming language in which the source code is written (C, C++, Java and so on), regular expressions are used to match each string against the matching rules of that language, and a token identifying the string is returned (a token identifies which class of strings a particular string belongs to).
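A regular-expression lexer of this kind might look like the following minimal sketch. The token classes (KEYWORD, NUMBER, ID, OP, PUNCT) and the small keyword list are assumptions chosen for illustration; a lexer for real C, C++ or Java would need many more rules.

```python
import re

# Ordered token rules: earlier entries win when several could match.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while|return|int|float)\b"),
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"==|!=|<=|>=|[-+*/%<>=!]"),
    ("PUNCT",   r"[(){};,]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token_class, string) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":  # whitespace is matched but not emitted
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Each returned pair plays the role of the token described above: the first element names the class of string, the second is the matched string itself.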
Step 223: perform syntax analysis on the lexically analyzed source code and generate the abstract syntax tree.
Specifically, when a token sequence produced in step 222 matches a grammar rule of the programming language in which the source code is written (for example, the function definition rule), memory is allocated and a node is generated; the node records the corresponding node type and the position of the node in the software source code. Once all nodes of the source code have been generated, the abstract syntax tree corresponding to the source code is complete.
Note that if a token sequence obtained in step 222 matches no grammar rule in step 223, the statement fragment containing it is deleted to obtain new source code, and processing continues from step 222 until the abstract syntax tree corresponding to the source code is generated correctly.
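To make the node generation concrete, here is a small sketch of a recursive-descent parser for arithmetic expressions only. The Node fields, the "BinOp:<op>" type names, and the (kind, text, line) token layout are assumptions for this example; the patent does not specify a node format beyond type and source position.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str   # e.g. "BinOp:-", "ID", "NUMBER"
    line: int        # position of the construct in the source text
    children: list = field(default_factory=list)

class Parser:
    """Recursive-descent parser for + - * / % over (kind, text, line) tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None, None)

    def parse_expr(self):
        left = self.parse_term()
        while self.peek()[1] in ("+", "-"):
            _, op, line = self.tokens[self.pos]
            self.pos += 1
            left = Node(f"BinOp:{op}", line, [left, self.parse_term()])
        return left

    def parse_term(self):
        left = self.parse_atom()
        while self.peek()[1] in ("*", "/", "%"):
            _, op, line = self.tokens[self.pos]
            self.pos += 1
            left = Node(f"BinOp:{op}", line, [left, self.parse_atom()])
        return left

    def parse_atom(self):
        kind, text, line = self.peek()
        if kind not in ("ID", "NUMBER"):
            raise SyntaxError(f"unexpected token {text!r} on line {line}")
        self.pos += 1
        return Node(kind, line)  # only type and position are recorded
```

Parsing the tokens of "a - b * 2" gives a "BinOp:-" root whose second child is a "BinOp:*" subtree, reflecting operator precedence.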
Step 224: adjust the abstract syntax tree structure according to the semantic feature rules.
For common statements in C, C++ and Java, such as the if-else statement and the while statement, the embodiment of the invention abstracts semantic feature rules for the corresponding abstract syntax trees, for example that the conditional expression is a comparison expression, a variable identifier (ID), or a function call. The abstract syntax tree of the source code is then traversed to find the subtrees that satisfy these rules. For example, the feature of the if-else statement structure is that a node of type if is the root, its first child is a node of type condition, and its second child is a node of type else.
The syntactic structure information of the source code stored in the abstract syntax tree carries some semantic information. By abstracting and matching the semantic information in the abstract syntax tree, subtrees can be found that satisfy the same semantic feature rule but differ in structure; that is, the source code corresponding to these subtrees differs in syntactic structure but has the same semantics.
Subtrees that match the same semantic feature rule are then adjusted into a unified structure, so that subtrees that satisfy the same semantic feature rule and have the same semantics have the same structure; statements with different syntactic structure but the same semantics are thus unified into statements with the same syntactic structure.
Examples of specific adjustments are as follows:
When the conditional expression of an if statement matches the rule that it begins with "!" (negation), the "!" is ignored and the second subtree of the if node is exchanged with the subtree of the else node. Because hash values are summed from the bottom up, exchanging the subtrees does not affect the final hash value, so the exchange itself can be skipped. When the conditional expression of an if statement matches the rule that it is a comparison expression, the two sides of the comparison operator are examined: if both are variable names or function calls, the comparison operator is unified to "<"; if they are a variable name and a number, the variable name is moved in front of the comparison operator.
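One possible reading of these two adjustment rules is sketched below. The node type names ("If", "Not", "Cmp:<", "NUMBER") and the three-child layout of the If node (condition, then-branch, else-branch) are assumptions made for the sketch; the handling of ">=" by mirroring it to "<=" is likewise an interpretation, since the text only mentions unifying to "<".

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    children: list = field(default_factory=list)

# Operator obtained when the two operands of a comparison change places.
MIRROR = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "==": "==", "!=": "!="}

def normalize(node):
    """Rewrite comparison and if subtrees into one canonical shape, bottom-up."""
    for child in node.children:
        normalize(child)
    if node.node_type.startswith("Cmp:"):
        op = node.node_type[4:]
        left, right = node.children
        number_first = left.node_type == "NUMBER" and right.node_type != "NUMBER"
        both_named = "NUMBER" not in (left.node_type, right.node_type)
        # Variable goes before a number; name-vs-name comparisons use "<" or "<=".
        if number_first or (both_named and op in (">", ">=")):
            node.children = [right, left]
            node.node_type = "Cmp:" + MIRROR[op]
    elif (node.node_type == "If" and len(node.children) == 3
          and node.children[0].node_type == "Not"):
        # if (!c) A else B  becomes  if (c) B else A
        node.children[0] = node.children[0].children[0]
        node.children[1], node.children[2] = node.children[2], node.children[1]
    return node
```

The sketch performs the branch exchange explicitly so that the rewritten tree keeps its meaning; with a purely additive hash the exchange would not change the hash value, as noted above.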
With this step, the method provided by the embodiment of the invention raises source code homology detection from the syntax level to the syntax-plus-semantics level. Plagiarized source code that is synonymous with the original but differs in structure can be detected accurately, which significantly improves the effectiveness of software homology detection.
Step 225: convert the storage format of the adjusted abstract syntax tree and store it.
To make the abstract syntax tree information easy to store and retrieve, the embodiment of the invention can convert the abstract syntax tree into a linear linked list structure for storage after it has been generated and adjusted.
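The patent does not describe the layout of the linked list. One plausible sketch, assuming a pre-order traversal and a hypothetical ListEntry record that keeps the child count so the tree shape can be rebuilt, is:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_type: str
    line: int
    children: list = field(default_factory=list)

@dataclass
class ListEntry:
    node_type: str
    line: int
    child_count: int                      # lets a reader rebuild the tree shape
    next: Optional["ListEntry"] = None

def to_linked_list(root):
    """Flatten a tree into a singly linked list in pre-order."""
    head = tail = None
    stack = [root]
    while stack:
        node = stack.pop()
        entry = ListEntry(node.node_type, node.line, len(node.children))
        if head is None:
            head = tail = entry
        else:
            tail.next = entry
            tail = entry
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return head
```

A flat list of this kind can be written to and read back from a file or database row by row, which is the convenience the step above refers to.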
Step 23: calculate the hash values of the subtrees of the abstract syntax tree.
Step 23 may comprise:
Step 231: calculate the node hash values.
After the abstract syntax tree corresponding to the source code has been generated, the hash value of each node can be calculated from the type of that node in the abstract syntax tree.
In this step, the node information, including the hash value, can also be saved in the form of a linear linked list.
Step 232: calculate the subtree hash values.
The hash values of all nodes contained in each subtree of the abstract syntax tree are summed to obtain the hash value of that subtree.
The benefit of calculating subtree hash values is that the subsequent homology detection can effectively detect plagiarism that shuffles the order of the source code.
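The additive scheme of steps 231 and 232 can be sketched as follows. The use of CRC-32 over the node type name is a stand-in chosen for the example, since the patent does not name a node hash function; the names node_hash and subtree_hashes are likewise hypothetical.

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    children: list = field(default_factory=list)

def node_hash(node):
    # Assumption: CRC-32 of the type name stands in for the node hash.
    return zlib.crc32(node.node_type.encode("utf-8"))

def subtree_hashes(root):
    """Return (hash, node_count) for every subtree, computed bottom-up."""
    results = []

    def visit(node):
        total, count = node_hash(node), 1
        for child in node.children:
            child_hash, child_count = visit(child)
            total += child_hash
            count += child_count
        results.append((total, count))
        return total, count

    visit(root)
    return results
```

Because addition is commutative, two blocks that contain the same statements in a different order receive the same hash value, which is what makes reordered code detectable.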
Because subtree hash values are calculated by summation, hashing the abstract syntax tree generated from source code can produce false detections. These cases all involve particular arithmetic operations such as division, subtraction and remainder: if the operands before and after such an operator change places, the meaning of the whole operation changes. To avoid wrongly reporting such code as similar, the method designs a special hash calculation for these operations that introduces weights: the operands before and after the operator are given different weights. If the operands change places, the hash value of the whole operation then changes, so the code is not wrongly identified as similar. This avoids reporting different code as plagiarized and reduces the false alarm rate.
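A weighted variant might be sketched like this. The weights 1 and 2, the "BinOp:<op>" type names and the CRC-32 node hash are all assumptions for illustration; the patent states only that the operands before and after the operator receive different weights.

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    children: list = field(default_factory=list)

# Operators whose meaning changes when the operands change places.
ORDER_SENSITIVE = {"BinOp:-", "BinOp:/", "BinOp:%"}

def subtree_hash(node):
    own = zlib.crc32(node.node_type.encode("utf-8"))
    child_hashes = [subtree_hash(child) for child in node.children]
    if node.node_type in ORDER_SENSITIVE:
        # Weight each operand by its position (1 for the first, 2 for the second).
        return own + sum((i + 1) * h for i, h in enumerate(child_hashes))
    return own + sum(child_hashes)
```

With this scheme "a - 1" and "1 - a" hash differently, while "a + 1" and "1 + a" still hash the same.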
Step 24: perform software homology detection.
Using the abstract syntax tree information for the target software source code and for the sample software source code generated in step 23, software homology detection is performed by judging whether the hash values of subtrees with the same number of nodes are consistent. This yields the homology of the target software and the sample software, that is, information on how similar they are, from which it can be judged whether the target software is plagiarized.
Before homology detection, to improve efficiency and accuracy, the subtrees of the abstract syntax tree can be grouped by the number of child nodes of each subtree: the abstract syntax tree stored in the linear linked list structure is transferred into an array of linked lists, so that subtrees with the same number of child nodes are stored in the linked list at the same array index. For example, all subtrees of the abstract syntax tree that have 6 nodes are placed in the linked list stored in the 6th element of the array.
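The grouping can be sketched as below. A Python dict keyed by node count stands in for the array of linked lists described above, and the threshold parameter mirrors the comparison threshold of step 212; both choices, and the name group_by_size, are assumptions of this example.

```python
from collections import defaultdict

def group_by_size(subtrees, threshold=1):
    """Bucket (hash, node_count) pairs by node count, skipping small subtrees."""
    buckets = defaultdict(list)
    for subtree_hash, node_count in subtrees:
        if node_count >= threshold:
            buckets[node_count].append(subtree_hash)
    return dict(buckets)
```

With a threshold of 3, the input [(10, 1), (20, 3), (30, 3), (40, 6)] is grouped into one bucket for size 3 holding two hashes and one bucket for size 6 holding one hash; the single-node subtree is dropped.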
During homology detection the abstract syntax tree information has already been converted to the array-of-linked-lists format, so the detection can traverse and compare node by node over all the node information recorded in the two arrays of linked lists. For example, first choose a subtree from the element of the target software array for node count 1 (subtrees containing one node) and compare its hash value one by one with those of all subtrees in the element of the sample software array for node count 1; whether the hash values are consistent determines whether similar source code exists. Then choose another subtree from the same element of the target software array and again compare it one by one with all subtrees in the corresponding element of the sample software array. Continue in this way until all array elements of the target and sample software have been checked, which determines whether the target software and the sample software are homologous.
If the target software and the sample software are not homologous, their similarity can still be determined from the proportion of identical hash values in the comparison results.
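The comparison and the similarity ratio might be sketched as follows. The exact definition of the ratio is an assumption here (matched target subtrees divided by all target subtrees); the patent says only that similarity follows from the proportion of identical hash values. A set lookup replaces the one-by-one comparison for brevity.

```python
def compare(target_groups, sample_groups):
    """Fraction of target subtrees whose hash occurs in the sample bucket of the same size."""
    matched = total = 0
    for size, target_hashes in target_groups.items():
        # Only subtrees with the same node count are ever compared.
        sample_hashes = set(sample_groups.get(size, ()))
        for h in target_hashes:
            total += 1
            if h in sample_hashes:
                matched += 1
    return matched / total if total else 0.0
```

Both arguments are mappings from node count to a list of subtree hash values; a result of 1.0 means every target subtree has a counterpart of the same size and hash in the sample.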
Because the embodiment of the invention groups the subtrees of the abstract syntax tree by number of child nodes before homology detection, unnecessary comparisons are avoided and the efficiency of homology detection improves considerably. In addition, homology detection uses node-by-node traversal comparison: it compares not only the data structures themselves but also their internals. Plagiarized code inside a data structure can therefore be found even when the data structures differ.
In addition, the embodiment of the invention can apply the comparison threshold set in step 212, so that detection is as precise as required.
Step 25: store and output the homology detection result.
The embodiment of the invention can store the homology detection result and, if required, output it, for example to Word and Excel files.
As the above description shows, the software homology detection method based on abstract syntax tree and semantic matching provided by the embodiment of the invention generates the abstract syntax tree corresponding to the software source code; adjusts subtrees in the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure; calculates the hash values of the subtrees in the abstract syntax tree; and performs software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent. Software homology detection is thus carried out accurately and effectively at the syntactic level combined with semantics.
Moreover, the embodiment of the invention fully takes the original semantics of the software source code into account during preprocessing, so the accuracy of the homology comparison is preserved as far as possible.
Moreover, when the embodiment of the invention calculates the hash values of abstract syntax tree nodes, certain special operations are handled specially: for operations such as subtraction and division, the operands before and after the operator are given different weights. This prevents some different code from being misreported as plagiarized code and lowers the false-positive rate.
Moreover, after generating the abstract syntax tree of the software source code, the embodiment of the invention converts it into a linear linked list structure, which makes the abstract syntax tree convenient to store and retrieve.
Moreover, before software homology detection the embodiment of the invention groups the subtrees of the abstract syntax tree by node count, which avoids unnecessary comparisons and greatly improves comparison efficiency.
Moreover, the embodiment of the invention uses node-by-node traversal comparison during software homology detection: it compares not only the data structures but also their internals, so plagiarized code inside them can be found even when the data structures differ, and the precision of the comparison can be adjusted by changing the input comparison threshold.
Moreover, the embodiment of the invention can save the information produced during software homology detection, together with the detection results, in real time, which makes it convenient to repeat the detection later and greatly improves practical working efficiency.
Based on the above technical features, the software homology detection method based on abstract syntax tree and semantic matching provided by the embodiment of the invention can cope with multiple source code plagiarism techniques and performs software homology detection conveniently, efficiently, and accurately.
The embodiment of the invention also provides a software homology detection device based on abstract syntax tree and semantic matching. As shown in Fig. 4, the device may comprise a generation module 41, a computing module 42, and a detection module 43, wherein:
The generation module 41 is used to generate the abstract syntax tree corresponding to the software source code, and to adjust subtrees in the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
The computing module 42 is used to calculate the hash values of the subtrees in the abstract syntax tree generated by the generation module 41;
The detection module 43 is used to perform software homology detection, based on the hash values calculated by the computing module 42, by judging whether the hash values of subtrees with the same number of nodes are consistent.
In a specific embodiment of the invention, optionally, the generation module 41 may comprise, as shown in Fig. 5:
A preprocessing unit 411, used to delete file inclusion directives, comments, and redundant whitespace from the software source code; to find strings in the software source code that were replaced by macro definition directives and restore the original strings; and to evaluate whether the conditions of conditional compilation directives in the software source code hold, then delete or keep the corresponding source code segments accordingly.
The specific operations of the preprocessing unit 411 may include:
Deleting redundant whitespace and newline characters from the software source code;
Handling continuation characters in the software source code, joining code written across two lines into a single line;
Deleting the comments in the software source code, leaving only meaningful code segments;
Finding the file inclusion directives "#include" in the software source code and deleting them directly;
Finding the conditional compilation directives in the software source code, evaluating whether their conditions hold, and keeping or deleting the enclosed code segments accordingly.
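Some of the preprocessing operations listed above can be sketched with regular expressions. This is a simplified illustration under stated assumptions: the function name `preprocess` is hypothetical, the input is assumed to be C-like source, comment removal is naive (it does not protect comment markers inside string literals), and macro restoration and conditional compilation are deliberately left out.

```python
import re


def preprocess(source):
    """Naive preprocessing of C-like source text (illustrative only)."""
    # Join lines split by a trailing backslash (continuation character).
    source = re.sub(r"\\\n", "", source)
    # Remove block comments, then line comments.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    lines = []
    for line in source.splitlines():
        stripped = line.strip()
        # Drop blank lines and file inclusion directives.
        if not stripped or stripped.startswith("#include"):
            continue
        # Collapse runs of whitespace into a single space.
        lines.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(lines)


example = (
    "#include <stdio.h>\n"
    "int   x = 1;  // counter\n"
    "/* block\n"
    "   comment */\n"
    "int y = \\\n"
    "    2;\n"
)
cleaned = preprocess(example)
```

For the example input, only the two declarations survive, each on one line with single spaces.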
A lexical analysis unit 412, used to read in, in order, the text strings of the software source code preprocessed by the preprocessing unit 411 and, according to the syntax rules of the programming language in which the software source code is written, to match each string against the corresponding regular expression and return a token identifying that string.
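Regular-expression-driven tokenization of this kind can be sketched as follows. The token classes, the patterns, and the name `tokenize` are assumptions chosen for a small C-like subset; the patent does not list its actual rules.

```python
import re

# Hypothetical token classes for a small C-like subset.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:if|else|while|return|int)\b"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"==|!=|<=|>=|[-+*/%<>=]"),
    ("PUNCT", r"[(){};,]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        # Whitespace is matched so it can be skipped, not emitted.
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("if (a >= 10) x = x - 1;")
```

Listing `KEYWORD` before `ID` makes reserved words win over identifiers, while the word-boundary anchors keep a name such as `iffy` from being split.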
A syntax analysis unit 413, used to match the source code sequence corresponding to the tokens returned by the lexical analysis unit 412 against the syntax rules of the programming language in which the software source code is written, allocate memory, generate the node information of the abstract syntax tree corresponding to the software source code, and build the abstract syntax tree.
Specifically, the syntax analysis unit 413 may take the source code sequence carrying the tokens generated by the lexical analysis unit 412 and, once it matches a syntax rule of the programming language (for example a specific rule such as the function definition rule), allocate memory and generate an abstract syntax tree node, recording in the node its node type and its position in the software source code. Once all nodes of the software source code have been generated, the abstract syntax tree corresponding to the software source code is complete.
A semantic matching unit 414, used to analyze the structural information of the abstract syntax tree built by the syntax analysis unit 413, and to adjust subtrees that match the same semantic feature rule and have the same semantics into a unified structure.
Specifically, the semantic matching unit 414 may traverse the whole abstract syntax tree depth-first and analyze its structural information. The abstract syntax tree structures corresponding to if-else and while statements conform to certain semantic rules; by matching these rules, if-else and while statements with particular features can be found, for example those whose conditional expression is a comparison expression, a variable ID, a function call, and so on. For if-else and while statements that match a given semantic rule and whose conditional expressions are semantically identical, the corresponding abstract syntax trees are arranged into a unified structure, so that statements with the same semantics have the same abstract syntax tree structure. Statements such as if-else and while whose syntactic structure has been modified but whose semantics are unchanged can thereby be detected.
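The idea of rewriting semantically identical conditions into one structure can be illustrated with Python's standard `ast` module, which stands in here for the C-like abstract syntax tree of the patent. The single rewrite rule shown (turning `x > y` into `y < x`, and `x >= y` into `y <= x`) is a hypothetical example of a unification rule; the patent does not specify its rules at this level of detail.

```python
import ast


class NormalizeComparisons(ast.NodeTransformer):
    """Rewrite `x > y` as `y < x` and `x >= y` as `y <= x`."""

    def visit_Compare(self, node):
        self.generic_visit(node)  # normalize nested expressions first
        if len(node.ops) == 1 and isinstance(node.ops[0], (ast.Gt, ast.GtE)):
            flipped = ast.Lt() if isinstance(node.ops[0], ast.Gt) else ast.LtE()
            swapped = ast.Compare(
                left=node.comparators[0],
                ops=[flipped],
                comparators=[node.left],
            )
            return ast.copy_location(swapped, node)
        return node


def canonical_form(source):
    """Return a position-independent dump of the normalized tree."""
    tree = NormalizeComparisons().visit(ast.parse(source))
    return ast.dump(tree)


same = canonical_form("while a > b:\n    a -= 1") == canonical_form(
    "while b < a:\n    a -= 1"
)
different = canonical_form("if a > b:\n    pass") != canonical_form(
    "if a < b:\n    pass"
)
```

After normalization, the two `while` loops produce identical trees, so any hash computed over the tree would also agree, while the genuinely different conditions remain distinguishable.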
Through the generation module 41, and especially the operations of the semantic matching unit 414, the device provided by the embodiment of the invention raises source code homology detection from the syntactic level to a level that combines syntax with semantics, can accurately detect plagiarism of source code that is synonymous but structurally different, and significantly improves the efficiency of software homology detection.
In a specific embodiment of the invention, optionally, the computing module 42 may comprise, as shown in Fig. 6:
A node computing unit 421, used to calculate the hash value of each node in the abstract syntax tree according to the node type information of the abstract syntax tree as adjusted by the semantic matching unit 414.
A subtree computing unit 422, used to accumulate, based on the node hash values calculated by the node computing unit 421, the hash values of the nodes contained in each subtree of the abstract syntax tree, giving the hash value of the subtree.
When hash values are calculated for the abstract syntax tree generated from the software source code, accumulating node hash values to obtain a subtree hash value may cause some false detections. These cases all involve certain special arithmetic operations, such as division, subtraction, and remainder: if the operands before and after such an operator swap positions, the meaning of the whole operation changes. To avoid wrongly detecting such cases as similar code, the software homology detection device based on abstract syntax tree and semantic matching provided by the embodiment of the invention uses a special hash calculation for these operations that introduces weights: the operands before and after the operator are given different weights. If the operand positions are swapped, the hash value of the whole operation changes, so the code is not wrongly identified as similar. This prevents some different code from being misreported as plagiarized code and lowers the false-positive rate.
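The weighted accumulation described above can be sketched as follows. Everything concrete here is an assumption made for illustration: trees are written as nested tuples `(operator, left, right)` with string leaves, `zlib.crc32` over the label stands in for the node hash (the patent derives it from node type information), and the weights 2 and 3 are arbitrary.

```python
import zlib

# Operators whose operands may not be swapped without changing the meaning.
NON_COMMUTATIVE = {"-", "/", "%"}


def node_hash(label):
    """Deterministic per-node hash (hypothetical stand-in)."""
    return zlib.crc32(label.encode("utf-8"))


def subtree_hash(tree):
    """Accumulate node hashes; weight operands of non-commutative operators."""
    if isinstance(tree, str):
        return node_hash(tree)
    op, left, right = tree
    # Unequal weights make the sum sensitive to operand order.
    left_weight, right_weight = (2, 3) if op in NON_COMMUTATIVE else (1, 1)
    return (
        node_hash(op)
        + left_weight * subtree_hash(left)
        + right_weight * subtree_hash(right)
    )


commutative_equal = subtree_hash(("+", "a", "b")) == subtree_hash(("+", "b", "a"))
subtraction_differs = subtree_hash(("-", "a", "b")) != subtree_hash(("-", "b", "a"))
```

With plain accumulation, `a - b` and `b - a` would collide; with unequal weights they differ whenever the two operand hashes differ, while `a + b` and `b + a` still hash identically, which is what allows reordered but equivalent code to be matched.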
Through the operations of the computing module 42, the device provided by the embodiment of the invention can effectively detect plagiarism that reorders the software source code.
In a specific embodiment of the invention, optionally, the detection module 43 may comprise:
A grouping unit 431, used to group the subtrees according to the number of nodes each subtree of the abstract syntax tree contains.
Specifically, to improve detection efficiency and accuracy, the grouping unit may group the subtrees of the abstract syntax tree corresponding to the software source code by the number of child nodes of each subtree, and dump the abstract syntax tree stored in the linear linked list structure into an array of linked lists, ensuring that subtrees with the same number of child nodes are stored in the linked list at the same array index. For example, all subtrees in the abstract syntax tree that have 6 nodes are placed in the linked list stored at the 6th element of the array.
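The grouping step can be sketched as follows, with ordinary Python lists standing in for the linked lists. The tuple representation of a tree (`(label, child, ...)` with string leaves) and the function names are assumptions made for illustration, and the index is taken to be the total number of nodes in the subtree, following the 6-node example above.

```python
def count_nodes(tree):
    """Number of nodes in a tree given as a string leaf or (label, *children)."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])


def iter_subtrees(tree):
    """Yield the tree itself and every subtree below it, in preorder."""
    yield tree
    if not isinstance(tree, str):
        for child in tree[1:]:
            yield from iter_subtrees(child)


def group_by_node_count(tree):
    """Return a list in which element k holds all subtrees with k nodes."""
    buckets = [[] for _ in range(count_nodes(tree) + 1)]
    for subtree in iter_subtrees(tree):
        buckets[count_nodes(subtree)].append(subtree)
    return buckets


# (a + b) - c: five nodes in total.
tree = ("-", ("+", "a", "b"), "c")
buckets = group_by_node_count(tree)
```

Two programs grouped this way only need their same-index buckets compared, which is where the saving over an all-pairs comparison comes from.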
During homology detection, the abstract syntax tree information is held in the array-of-linked-lists format, so the comparison can proceed node by node over all the abstract syntax tree node information recorded in the arrays of linked lists. For example, a subtree is first selected from the list at array index 1 of the source code of software A (that is, the subtrees containing one node), and its hash value is compared one by one with the hash values of all subtrees in the list at array index 1 of the source code of software B; whether the hash values agree determines whether similar source code exists. Another subtree is then selected from the list at index 1 of software A and compared in the same way against all subtrees in the list at index 1 of software B. This continues until every array element of the source code of software A and software B has been checked, which determines whether software A and software B are homologous. If they are not homologous, their similarity can still be determined from the proportion of identical hash values in the comparison results.
Because the embodiment of the invention groups the subtrees of the abstract syntax tree by number of child nodes before homology detection, unnecessary comparisons are avoided and the efficiency of homology detection improves considerably. In addition, homology detection uses node-by-node traversal comparison: it compares not only the data structures themselves but also their internals. Plagiarized code inside a data structure can therefore be found even when the data structures differ.
In a specific embodiment of the invention, optionally, as shown in Fig. 7, the software homology detection device based on abstract syntax tree and semantic matching may further comprise:
A database module 44, used to save the information produced during software homology detection in real time.
Specifically, the database module 44 may save the abstract syntax tree information output by the generation module 41, converting it into a linear linked list for storage. In addition, the database module 44 may also save, in real time, the information output by the computing module 42 and the detection module 43, and so on.
An output module 45, used to output the detection result of the detection module 43.
The embodiment of the invention can store the homology detection result and, if required, output it, for example to Word and Excel files.
Before performing software homology detection, the software homology detection device based on abstract syntax tree and semantic matching provided by the embodiment of the invention can also set the comparison threshold for homology detection and the storage path for the detection information.
As the above description shows, the software homology detection device based on abstract syntax tree and semantic matching provided by the embodiment of the invention generates the abstract syntax tree corresponding to the software source code; adjusts subtrees in the abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure; calculates the hash values of the subtrees in the abstract syntax tree; and performs software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent. Software homology detection is thus carried out accurately and effectively at the syntactic level combined with semantics.
Moreover, the embodiment of the invention fully takes the original semantics of the software source code into account during preprocessing, so the accuracy of the homology comparison is preserved as far as possible.
Moreover, when the embodiment of the invention calculates the hash values of abstract syntax tree nodes, certain special operations are handled specially: for operations such as subtraction and division, the operands before and after the operator are given different weights. This prevents some different code from being misreported as plagiarized code and lowers the false-positive rate.
Moreover, after generating the abstract syntax tree of the software source code, the embodiment of the invention converts it into a linear linked list structure, which makes the abstract syntax tree convenient to store and retrieve.
Moreover, before software homology detection the embodiment of the invention groups the subtrees of the abstract syntax tree by node count, which avoids unnecessary comparisons and greatly improves comparison efficiency.
Moreover, the embodiment of the invention uses node-by-node traversal comparison during software homology detection: it compares not only the data structures but also their internals, so plagiarized code inside them can be found even when the data structures differ, and the precision of the comparison can be adjusted by changing the input comparison threshold.
Moreover, the embodiment of the invention can save the information produced during software homology detection, together with the detection results, in real time, which makes it convenient to repeat the detection later and greatly improves practical working efficiency.
Based on the above technical features, the software homology detection device based on abstract syntax tree and semantic matching provided by the embodiment of the invention can cope with multiple source code plagiarism techniques and performs software homology detection conveniently, efficiently, and accurately.
From the description of the above embodiments, those skilled in the art can clearly understand that the invention can be implemented by software plus the necessary hardware platform, or entirely in hardware, although in many cases the former is the better implementation. On this understanding, all or part of the contribution that the technical solution of the invention makes over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the invention or in parts of the embodiments.
The above is merely a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A software homology detection method based on abstract syntax tree and semantic matching, characterized in that it comprises:
generating the abstract syntax tree corresponding to software source code, and adjusting subtrees in said abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
calculating the hash values of the subtrees in said abstract syntax tree;
performing software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent;
wherein said calculating the hash values of the subtrees in said abstract syntax tree comprises:
calculating the hash value of each node according to the node type information of said abstract syntax tree, and saving the information of said node, including the hash value, in linear linked list form;
accumulating the hash values of the nodes contained in a subtree of said abstract syntax tree to obtain the hash value of said subtree.
2. The method according to claim 1, characterized in that said generating the abstract syntax tree corresponding to software source code comprises:
deleting file inclusion directives, comments, and redundant whitespace from said software source code; finding strings in said software source code that were replaced by macro definition directives and restoring the original strings; and evaluating whether the conditions of conditional compilation directives in said software source code hold, then deleting or keeping the corresponding source code segments accordingly.
3. The method according to claim 1, characterized in that said semantic feature rule is at least one of the following features: the conditional expression is a comparison expression, a variable identifier, or a function call.
4. The method according to claim 1, characterized in that said accumulating the hash values of the nodes contained in a subtree of said abstract syntax tree to obtain the hash value of said subtree comprises:
giving different weights to the operands before and after division, subtraction, and remainder operations.
5. The method according to claim 1, characterized in that said performing software homology detection by judging whether the hash values of subtrees with the same number of nodes are consistent comprises:
grouping the subtrees according to the number of nodes contained in each subtree of said abstract syntax tree.
6. A software homology detection device based on abstract syntax tree and semantic matching, characterized in that it comprises:
a generation module, used to generate the abstract syntax tree corresponding to software source code, and to adjust subtrees in said abstract syntax tree that match the same semantic feature rule and have the same semantics into a unified structure;
a computing module, used to calculate the hash values of the subtrees in the abstract syntax tree generated by said generation module;
a detection module, used to perform software homology detection, based on said hash values calculated by said computing module, by judging whether the hash values of subtrees with the same number of nodes are consistent;
said computing module further comprising:
a node computing unit, used to calculate the hash value of each node according to the node type information of said abstract syntax tree generated by said generation module, and to save the information of said node, including the hash value, in linear linked list form;
a subtree computing unit, used to accumulate, based on the node hash values calculated by said node computing unit, the hash values of the nodes contained in a subtree of said abstract syntax tree to obtain the hash value of said subtree.
7. The device according to claim 6, characterized in that said generation module comprises:
a preprocessing unit, used to delete file inclusion directives, comments, and redundant whitespace from said software source code; to find strings in said software source code that were replaced by macro definition directives and restore the original strings; and to evaluate whether the conditions of conditional compilation directives in said software source code hold, then delete or keep the corresponding source code segments accordingly.
8. The device according to claim 6 or 7, characterized in that said detection module comprises:
a grouping unit, used to group said subtrees according to the number of nodes contained in each subtree of said abstract syntax tree.
CN2010102384099A 2010-07-28 2010-07-28 Software homology detection method and device based on abstract syntax tree and semantic matching Expired - Fee Related CN101894236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102384099A CN101894236B (en) 2010-07-28 2010-07-28 Software homology detection method and device based on abstract syntax tree and semantic matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102384099A CN101894236B (en) 2010-07-28 2010-07-28 Software homology detection method and device based on abstract syntax tree and semantic matching

Publications (2)

Publication Number Publication Date
CN101894236A CN101894236A (en) 2010-11-24
CN101894236B true CN101894236B (en) 2012-01-11

Family

ID=43103426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102384099A Expired - Fee Related CN101894236B (en) 2010-07-28 2010-07-28 Software homology detection method and device based on abstract syntax tree and semantic matching

Country Status (1)

Country Link
CN (1) CN101894236B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI484413B (en) 2012-04-03 2015-05-11 Mstar Semiconductor Inc Function-based software comparison method
CN103377040B (en) * 2012-04-16 2016-08-03 晨星软件研发(深圳)有限公司 Based on functional program comparative approach
CN103838666B (en) * 2012-11-27 2017-12-19 百度在线网络技术(北京)有限公司 A kind of method and apparatus for determining code implementation coverage
CN103336890A (en) * 2013-06-08 2013-10-02 东南大学 Method for quickly computing similarity of software
WO2015015622A1 (en) * 2013-08-01 2015-02-05 松崎 務 Apparatus and program
CN103425771B (en) * 2013-08-12 2016-12-28 深圳市华傲数据技术有限公司 The method for digging of a kind of data regular expression and device
CN104866502B (en) 2014-02-25 2020-10-13 深圳市中兴微电子技术有限公司 Data matching method and device
CN104021075A (en) * 2014-05-22 2014-09-03 小米科技有限责任公司 Method and device for evaluating program codes
CN104866765B (en) * 2015-06-03 2017-11-10 康绯 The malicious code homology analysis method of Behavior-based control characteristic similarity
CN106294139B (en) * 2016-08-02 2018-08-31 上海理工大学 A kind of Detection and Extraction method of repeated fragment in software code
CN106384048B (en) * 2016-08-30 2021-05-07 北京奇虎科技有限公司 Threat information processing method and device
CN106951743A (en) * 2017-03-22 2017-07-14 上海英慕软件科技有限公司 A kind of software code infringement detection method
CN108021559B (en) * 2018-02-05 2022-05-03 威盛电子股份有限公司 Natural language understanding system and semantic analysis method
CN108874396A (en) * 2018-05-31 2018-11-23 苏州蜗牛数字科技股份有限公司 The cross-compiler and Compilation Method of multi-platform multiple target language based on HLSL
CN109062792A (en) * 2018-07-21 2018-12-21 东南大学 A kind of Open Source Code detection method based on String matching and characteristic matching
CN109445834B (en) * 2018-10-30 2021-04-30 北京计算机技术及应用研究所 Program code similarity rapid comparison method based on abstract syntax tree
CN109635569B (en) * 2018-12-10 2020-11-03 国家电网有限公司信息通信分公司 Vulnerability detection method and device
CN110737466B (en) * 2019-10-16 2021-04-02 南京航空航天大学 Source code coding sequence representation method based on static program analysis
CN110989991B (en) * 2019-10-25 2023-12-01 深圳开源互联网安全技术有限公司 Method and system for detecting source code clone open source software in application program
CN113312904A (en) * 2021-05-31 2021-08-27 南京航空航天大学 Code segment recommendation method and system based on abstract syntax tree
CN114742028A (en) * 2022-02-24 2022-07-12 中电科数字科技(集团)有限公司 Feature-based JSON consistency comparison detection method and system
CN116989838B (en) * 2023-09-27 2023-12-26 苏州中电科启计量检测技术有限公司 Meter metering detection calibration method and system based on grammar tree

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100461132C (en) * 2007-03-02 2009-02-11 北京邮电大学 Software safety code analyzer based on static analysis of source code and testing method therefor
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code

Also Published As

Publication number Publication date
CN101894236A (en) 2010-11-24


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120111

Termination date: 20130728