CN110347428A - A kind of detection method and device of code similarity - Google Patents


Publication number
CN110347428A
Authority
CN
China
Prior art keywords
code
abstract
vocabulary
syntax tree
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810305719.4A
Other languages
Chinese (zh)
Inventor
陆韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810305719.4A priority Critical patent/CN110347428A/en
Publication of CN110347428A publication Critical patent/CN110347428A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for detecting code similarity, and relates to the field of computer technology. One specific embodiment of the method includes: obtaining a code file and building the abstract syntax tree corresponding to the code; extracting the vocabulary in the abstract syntax tree and mapping the abstract syntax tree to a space vector according to the vocabulary; and calculating code similarity based on the cosine distance between space vectors. The embodiment addresses a problem in the prior art: cases where the code does not violate program logic or coding style, and its variables, names and types are all normal, yet the code architecture is clearly problematic, cannot be detected.

Description

A kind of detection method and device of code similarity
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for detecting code similarity.
Background art
At present, software companies generally run static checks on their code with various tools in order to fix the unreasonable parts of the code. Existing static analysis tools include Coverity and Infer. Static testing covers code inspection, static structure analysis, code quality measurement and so on. It can be carried out manually, making full use of human logical reasoning, or automatically by software tools. Code inspection includes code walkthroughs, desk checking, code reviews and so on. It mainly checks the correctness of the code's logical expression, and it can find violations of programming standards, unsafe, ambiguous and vague parts of the program, non-portable parts, and violations of programming style. It covers variable checks, naming and type review, program logic review, program syntax checks and program structure checks.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:
Current static detection only addresses code that violates logic or does not conform to coding style, together with variable checks and naming and type reviews. It cannot assess the quality of the code's architecture, for example whether several similar, repetitive classes could be given a higher level of abstraction, or whether utility methods scattered across files could be unified in one place. Cases of this kind, in which the code does not violate program logic or coding style and its variables, names and types are all normal, yet the code architecture is clearly problematic, cannot be detected.
Summary of the invention
In view of this, embodiments of the present invention provide a method and device for detecting code similarity, which can solve the problem in the prior art that cases where the code does not violate program logic or coding style, and its variables, names and types are all normal, yet the code architecture is clearly problematic, cannot be detected.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for detecting code similarity is provided, including: obtaining a code file and building the abstract syntax tree corresponding to the code; extracting the vocabulary in the abstract syntax tree and mapping the abstract syntax tree to a space vector according to the vocabulary; and calculating code similarity based on the cosine distance between space vectors.
Optionally, mapping the abstract syntax tree to a space vector according to the vocabulary includes:
converting the abstract syntax tree into an abstract code tree array according to the vocabulary, so that the array can be mapped to a space vector.
Optionally, converting the abstract syntax tree into an abstract code tree array according to the vocabulary includes:
removing the variable values from the abstract syntax tree;
generating the abstract code tree array according to the vocabulary by traversing the abstract syntax tree longitudinally, from left to right, and deleting repeated vocabulary.
Optionally, after the abstract syntax tree is converted into an abstract code tree array according to the vocabulary, the method includes:
prepending the name of the project package to which the code file belongs to the generated abstract code tree array, and storing the result in a default abstract table in a database;
and mapping the array to a space vector includes:
converting the text information in the abstract table into a space vector, the text information including the project package name, the code file name and the abstract code tree array.
In addition, according to one aspect of the embodiments of the present invention, a device for detecting code similarity is provided, including: an obtaining module for obtaining a code file and building the abstract syntax tree corresponding to the code; a mapping module for extracting the vocabulary in the abstract syntax tree and mapping the abstract syntax tree to a space vector according to the vocabulary; and a computing module for calculating code similarity based on the cosine distance between space vectors.
Optionally, the mapping module mapping the abstract syntax tree to a space vector according to the vocabulary includes:
converting the abstract syntax tree into an abstract code tree array according to the vocabulary, so that the array can be mapped to a space vector.
Optionally, the mapping module converting the abstract syntax tree into an abstract code tree array according to the vocabulary includes:
removing the variable values from the abstract syntax tree;
generating the abstract code tree array according to the vocabulary by traversing the abstract syntax tree longitudinally, from left to right, and deleting repeated vocabulary.
Optionally, after the mapping module converts the abstract syntax tree into an abstract code tree array according to the vocabulary, the following is further included:
prepending the name of the project package to which the code file belongs to the generated abstract code tree array, and storing the result in a default abstract table in a database;
and the mapping module mapping the array to a space vector includes:
converting the text information in the abstract table into a space vector, the text information including the project package name, the code file name and the abstract code tree array.
According to another aspect of the embodiments of the present invention, an electronic device is also provided, including:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the above embodiments for detecting code similarity.
According to another aspect of the embodiments of the present invention, a computer-readable medium is also provided, on which a computer program is stored; when the program is executed by a processor, it implements the method of any of the above embodiments for detecting code similarity.
One embodiment of the above invention has the following advantages or beneficial effects: the code file is processed into an abstract syntax tree of the kind a compiler understands, the abstract syntax tree is then mapped to a space vector, and code similarity is calculated from the cosine distance between space vectors. With this technical means, code in the system that may need to be refactored can be predicted at the level of code architecture, which in turn can substantially improve code structure and readability.
Further effects of the above non-conventional optional implementations are described below in connection with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of a method for detecting code similarity according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of a method for detecting code similarity according to a referable embodiment of the present invention;
Fig. 3 is a schematic diagram of a code abstract syntax tree according to a referable embodiment of the present invention;
Fig. 4 is a schematic diagram of the main modules of a device for detecting code similarity according to an embodiment of the present invention;
Fig. 5 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 6 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 shows a method for detecting code similarity according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S101: obtain a code file and build the abstract syntax tree corresponding to the code.
A project usually contains many packages and the code inside those packages. The code in each package is referred to as a code file.
Step S102: extract the vocabulary in the abstract syntax tree and map the abstract syntax tree to a space vector according to the vocabulary. The specific implementation process includes:
The abstract syntax tree of the code is converted into an abstract code tree array, which is then mapped to a space vector. Preferably, the variable values in the abstract syntax tree are removed first, and the abstract code tree array is then generated by traversing the abstract syntax tree longitudinally, from left to right. Further, repeated vocabulary in the abstract code tree array can be deleted. In addition, the name of the project package to which the code file belongs can be prepended to the generated abstract code tree array, and the result stored in a default abstract table in a database. The text information in the abstract table is then converted into a space vector, where the text information includes the project package name, the code file name and the abstract code tree array.
It is worth noting that Word2vec can be used to convert the text information into a space vector.
Step S103: calculate code similarity based on the cosine distance between space vectors.
In an embodiment, once the code similarity has been calculated, it can be used to optimize the code files. Further, similar code can be consolidated under a higher-level abstraction, and useless, obsolete code can be deleted.
From the embodiments above it can be seen that the method can predict, at the level of code architecture, which code in the system may need to be refactored, and can then remind developers to pay attention to code files that are very similar. Recommendations for adjusting the code architecture can thus be made on the basis of the similarity between pieces of code, which substantially improves code structure and readability.
Fig. 2 is a schematic diagram of the main flow of a method for detecting code similarity according to a referable embodiment of the present invention. The method may include:
Step S201: obtain a code file.
Step S202: generate an abstract syntax tree from the code file.
Preferably, each code file under a package can be turned into an abstract syntax tree (AST) with JavaParser (a Java parser) or a similar parser tool for another language. The tree defines the semantic structure of the code and strips out differences in text layout (spaces, line breaks, etc.) as well as useless comments. An example is the code abstract syntax tree shown in Fig. 3.
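The effect of step S202 can be illustrated with a minimal sketch. It assumes Python's standard `ast` module as a stand-in for JavaParser, which is not what the source uses; the two source strings are made up for illustration. It shows that two snippets differing only in spacing, blank lines, redundant parentheses and comments yield the same tree.

```python
import ast

# Two hypothetical snippets with the same structure but different layout.
source_a = "x = 1\ny = x + 2\n"
source_b = "# a comment\nx   =   1\n\n\ny = (x + 2)  # trailing comment\n"

tree_a = ast.parse(source_a)
tree_b = ast.parse(source_b)

# ast.dump omits line and column attributes by default, so only the
# structure of the two trees is compared here.
same_structure = ast.dump(tree_a) == ast.dump(tree_b)
print(same_structure)
```

A Java code base would need a Java parser instead; the point of the sketch is only that layout and comments do not survive parsing.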
Step S203: convert the abstract syntax tree into an abstract code tree array.
Preferably, the array is generated by traversing the abstract code tree longitudinally, from left to right, while the variable values in the abstract syntax tree are removed. In other words, only the structure of the code matters; what the variable values are is of no concern. Preferably, duplicated code is deleted while the array is being generated. For example, the array produced from the code abstract syntax tree shown in Fig. 3 is: [' package name ', program, variabledeclaration, Identifier, program, Variabledeclaration, Numeric Literal, program, expression statement, Binaryexpression, program, expression statement, Binary expression, NumericLiteral]
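One possible reading of step S203 is sketched below, again with Python's `ast` module standing in for a Java parser. The function name `code_tree_array`, the `dedupe` flag and the package name `com.example.demo` are hypothetical. The traversal here is a plain pre-order walk that records only node type names; the exact traversal and deduplication rule used by the source is not fully specified, so this is an approximation rather than the patented procedure.

```python
import ast

def code_tree_array(source, package_name, dedupe=False):
    """Flatten source code into [package_name, node type, node type, ...]."""
    tree = ast.parse(source)
    names = []

    def visit(node):
        # Only the node type is recorded, so identifier names and literal
        # values (the "variable values") never reach the array.
        names.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):  # children in field order
            visit(child)

    visit(tree)
    if dedupe:
        # Assumed interpretation of "delete repeated vocabulary":
        # keep only the first occurrence of each node type.
        names = list(dict.fromkeys(names))
    return [package_name] + names

# Same structure, different names and values: the arrays come out identical.
a = code_tree_array("total = 1\nresult = total + 2\n", "com.example.demo")
b = code_tree_array("count = 7\nanswer = count + 9\n", "com.example.demo")
print(a == b)
print(code_tree_array("total = 1\nresult = total + 2\n", "com.example.demo", dedupe=True))
```

Because values are dropped, renaming variables or changing constants does not alter the array, which is the property the step relies on.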
In a further embodiment, the package name is added to the front of the generated array. The main reason is that highly similar code that could be refactored through abstraction has often been copied and pasted, so it is more likely to sit under the same package. Further, the generated array can be stored in a default abstract table in the database, such as Table 1, where the file name is the name of the code file, for example xxx.java.
Table 1: abstract table
Id | Package name + file name | Abstract code tree array
1  | ...                      | ...
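A table with the columns of Table 1 could be created as in the following sketch. The source does not name a database, so SQLite (via Python's standard `sqlite3` module) is an assumption, as are the table name `abstract_table`, the column names, the helper `save_array` and the sample row.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
conn.execute(
    """
    CREATE TABLE abstract_table (
        id INTEGER PRIMARY KEY,
        package_and_file TEXT NOT NULL,
        code_tree_array TEXT NOT NULL  -- JSON-encoded list of node names
    )
    """
)

def save_array(conn, package_name, file_name, array):
    """Insert one row for a code file and return its id."""
    cursor = conn.execute(
        "INSERT INTO abstract_table (package_and_file, code_tree_array) VALUES (?, ?)",
        (f"{package_name}/{file_name}", json.dumps(array)),
    )
    conn.commit()
    return cursor.lastrowid

# Hypothetical row: package name, file name and a short array.
row_id = save_array(
    conn, "com.example.demo", "Foo.java",
    ["com.example.demo", "Program", "VariableDeclaration"],
)
stored = conn.execute(
    "SELECT package_and_file, code_tree_array FROM abstract_table WHERE id = ?",
    (row_id,),
).fetchone()
print(stored[0], json.loads(stored[1]))
```

Storing the array as JSON text keeps the schema close to Table 1; a production schema might normalize it differently.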
Step S204: map the abstract code tree array to a space vector.
In an embodiment, the abstract code tree arrays in the generated abstract table need to be converted into a numerical representation. Preferably, Word2vec can be used to convert the text information into a space vector, where the text information includes the package name, the file name and the abstract code tree array.
It is also worth noting that the mapped space vector is stored in the abstract table, in correspondence with the abstract code tree array.
Word2vec is an efficient tool, open-sourced by Google in 2013, that represents words as real-valued vectors. Drawing on ideas from deep learning, it uses training to reduce the processing of text content to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity between texts. The word vectors output by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms and part-of-speech analysis.
The Word2vec training model is a neural network with a hidden layer. Its input is a vocabulary vector: when a training sample is fed in, the position in the vocabulary of each word in the sample is set to 1, and all other positions are set to 0. Its output is also a vocabulary vector: for each word in the label of the training sample, the corresponding position in the vocabulary is set to 1, and all other positions are set to 0. The network is trained on all training samples, and after convergence the weights from the input layer to the hidden layer are taken as the word vectors of the words in the vocabulary.
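The input encoding described above (a 1 at the vocabulary position of each word in the sample, 0 elsewhere) can be sketched as follows. This covers only the encoding, not the training of the network; the function name `encode` and the four-word vocabulary are hypothetical.

```python
def encode(words, vocabulary):
    """Return a 0/1 vector with a 1 at the vocabulary position of each given word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in words:
        if word in index:  # words outside the vocabulary are ignored
            vector[index[word]] = 1
    return vector

# Hypothetical vocabulary made of syntax-tree node names.
vocabulary = ["Program", "VariableDeclaration", "Identifier", "NumericLiteral"]
print(encode(["Identifier"], vocabulary))                 # [0, 0, 1, 0]
print(encode(["Program", "NumericLiteral"], vocabulary))  # [1, 0, 0, 1]
```

In practice a library implementation of Word2vec would handle this encoding internally; the sketch only makes the description concrete.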
For the spam comments collected in 1, keywords are extracted by the Word2vec model. The language model is modeled to obtain a representation of words in a vector space. The basic idea is to assume that, for a given text, word order, grammar and syntax are ignored, the text is regarded merely as a set of words, and each word in the text is independent.
Suppose there are two simple texts:
John likes to watch movies. Mary likes too.
John also likes to watch football games.
Based on the words that occur in these two documents, the following dictionary is built:
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
The dictionary above contains 10 words, each with a unique index, so each text can be represented by a 10-dimensional vector, as follows:
[1,2,1,1,1,0,0,0,1,1]
[1,1,1,1,0,1,1,1,0,0]
The resulting vector has nothing to do with the order in which the words appear in the original text; each component is the number of times the corresponding word occurs in the text.
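The two count vectors above can be reproduced with a short sketch. The dictionary is copied from the text; the function name `count_vector` and the simple letters-only tokenizer are assumptions made for illustration.

```python
import re

# Dictionary from the text: word -> 1-based index.
DICTIONARY = {
    "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
    "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10,
}

def count_vector(text, dictionary):
    """Count how often each dictionary word occurs in text, ignoring word order."""
    vector = [0] * len(dictionary)
    for word in re.findall(r"[A-Za-z]+", text):  # crude tokenizer: runs of letters
        if word in dictionary:
            vector[dictionary[word] - 1] += 1
    return vector

text_1 = "John likes to watch movies. Mary likes too."
text_2 = "John also likes to watch football games."
print(count_vector(text_1, DICTIONARY))  # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(count_vector(text_2, DICTIONARY))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

The second component of the first vector is 2 because "likes" occurs twice in the first text.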
Step S205: calculate code similarity based on the cosine distance between space vectors.
As an embodiment, vector-space cosine similarity (Cosine Similarity) uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; when the angle equals 0, the two vectors point in the same direction. This is called cosine similarity. The abstract syntax trees of the code files that were vectorized in step S204 are taken two at a time, and the cosine similarity of each pair is calculated.
The formula for cosine similarity is as follows, where x and y are two n-dimensional vectors and θ is the angle between them:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
Cosine distance, also referred to here as cosine similarity, uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals.
A vector is a directed line segment in a multidimensional space. If two vectors have the same direction, that is, if the angle between them is close to zero, the two vectors are close. To determine whether the directions of two vectors agree, the law of cosines is used to calculate the angle between them.
The law of cosines describes the relationship between any angle of a triangle and its three sides. Given the three sides of a triangle, the law of cosines can be used to find each of its angles. Suppose the three sides of the triangle are a, b and c, and the corresponding angles are A, B and C. Then the cosine of angle A is:
$$\cos A = \frac{b^2 + c^2 - a^2}{2bc}$$
If the two sides b and c of the triangle are regarded as two vectors, the third side is the vector b − c, so a² = |b|² + |c|² − 2⟨b, c⟩, and the formula above is equivalent to:
$$\cos A = \frac{\langle \mathbf{b}, \mathbf{c} \rangle}{|\mathbf{b}|\,|\mathbf{c}|}$$
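The equivalence of the two forms can be checked numerically with a small sketch. The function names are hypothetical, and the two input vectors are the 10-dimensional count vectors from the bag-of-words example earlier in the description.

```python
import math

def cosine_similarity(u, v):
    """Dot-product form: <u, v> / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_from_sides(b, c):
    """Law-of-cosines form: sides |b| and |c|, third side a = |b - c|."""
    side_a = math.dist(b, c)   # length of the vector b - c
    side_b = math.hypot(*b)
    side_c = math.hypot(*c)
    return (side_b ** 2 + side_c ** 2 - side_a ** 2) / (2 * side_b * side_c)

v1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
v2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(round(cosine_similarity(v1, v2), 4))  # 0.5976, i.e. 5 / sqrt(70)
print(math.isclose(cosine_similarity(v1, v2), cosine_from_sides(v1, v2)))
```

Both functions assume non-zero vectors; a zero vector would need separate handling because its norm is 0.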
In addition, the specific implementation of the method for detecting code similarity in this referable embodiment has already been described in detail in the method for detecting code similarity above, so the repeated content is not described again here.
Fig. 4 shows a device for detecting code similarity according to an embodiment of the present invention. As shown in Fig. 4, the device 400 for detecting code similarity includes an obtaining module 401, a mapping module 402 and a computing module 403. The obtaining module 401 obtains a code file and builds the abstract syntax tree corresponding to the code. The mapping module 402 extracts the vocabulary in the abstract syntax tree and maps the abstract syntax tree to a space vector according to the vocabulary. The computing module 403 calculates code similarity based on the cosine distance between space vectors.
As a preferred embodiment, the specific process by which the mapping module 402 maps the abstract syntax tree of the code to a space vector includes:
The abstract syntax tree of the code is converted into an abstract code tree array, which is then mapped to a space vector. Further, the variable values in the abstract syntax tree can be removed first, and the abstract code tree array then generated by traversing the abstract syntax tree longitudinally, from left to right. Further still, repeated vocabulary in the abstract code tree array can be deleted. In addition, the name of the project package to which the code file belongs can be prepended to the generated abstract code tree array, and the result stored in a default abstract table in a database. The text information in the abstract table is then converted into a space vector, where the text information includes the project package name, the code file name and the abstract code tree array.
It is worth noting that the mapping module 402 can use Word2vec to convert the text information into a space vector.
In addition, as an embodiment, after the computing module 403 has calculated the code similarity, the similarity can be used to optimize the code files. Further, similar code can be consolidated under a higher-level abstraction, and useless, obsolete code can be deleted.
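The division into three modules can be sketched as three small classes. Everything here is an assumption made for illustration: the class and method names, the use of Python's `ast` module in place of a Java parser, and plain node-type counts in place of Word2vec vectors.

```python
import ast
import math
from collections import Counter
from itertools import combinations

class ObtainingModule:
    """Counterpart of module 401: take source text and build its syntax tree."""
    def build_tree(self, source):
        return ast.parse(source)

class MappingModule:
    """Counterpart of module 402: tree -> array of node names -> count vector."""
    def to_array(self, tree):
        # Traversal order does not matter here because only counts are kept.
        return [type(node).__name__ for node in ast.walk(tree)]

    def to_vectors(self, arrays):
        vocabulary = sorted({name for array in arrays.values() for name in array})
        vectors = {}
        for file_name, array in arrays.items():
            counts = Counter(array)
            vectors[file_name] = [counts[name] for name in vocabulary]
        return vectors

class ComputingModule:
    """Counterpart of module 403: cosine similarity for every pair of files."""
    def similarity(self, u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    def compare_all(self, vectors):
        return {
            (a, b): self.similarity(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)
        }

# Hypothetical input files: a.py and b.py share a structure, c.py does not.
sources = {
    "a.py": "x = 1\ny = x + 2\n",
    "b.py": "p = 5\nq = p + 9\n",
    "c.py": "for i in range(3):\n    print(i)\n",
}
obtain, mapper, compute = ObtainingModule(), MappingModule(), ComputingModule()
arrays = {name: mapper.to_array(obtain.build_tree(src)) for name, src in sources.items()}
scores = compute.compare_all(mapper.to_vectors(arrays))
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

Sorting the pairs by score puts the structurally identical files first, which is the kind of list a developer could review for refactoring candidates.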
It should be noted that the specific implementation of the device for detecting code similarity of the present invention has already been described in detail in the method for detecting code similarity above, so the repeated content is not described again here.
Fig. 5 shows an exemplary system architecture 500 to which the method or device for detecting code similarity of the embodiments of the present invention can be applied.
As shown in Fig. 5, the system architecture 500 may include terminal devices 501, 502 and 503, a network 504 and a server 505. The network 504 is the medium that provides communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired links, wireless communication links or fiber-optic cables.
A user can use the terminal devices 501, 502, 503 to interact with the server 505 through the network 504 in order to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 501, 502, 503, such as shopping applications, web browsers, search applications, instant messaging tools, mailbox clients and social platform software (illustrative only).
The terminal devices 501, 502, 503 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
The server 505 can be a server that provides various services, for example a back-end management server (illustrative only) that supports shopping websites browsed by users on the terminal devices 501, 502, 503. The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (for example, targeted push information or product information; illustrative only) back to the terminal devices.
It should be noted that the method for detecting code similarity provided by the embodiments of the present invention is generally executed by the server 505, and accordingly the device for detecting code similarity is generally provided in the server 505.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 5 are only schematic. There can be any number of terminal devices, networks and servers, as the implementation requires.
Referring now to Fig. 6, a schematic structural diagram is shown of a computer system 600 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in Fig. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 602 or a program loaded from a storage section 608 into random access memory (RAM) 603. The RAM 603 also stores the various programs and data required for the operation of the system 600. The CPU 601, ROM 602 and RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, etc.; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a loudspeaker, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to the embodiments disclosed by the present invention, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the embodiments disclosed by the present invention include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the functions defined above in the system of the present invention are performed.
It should be noted that the computer-readable medium described in the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, electric wire, optical cable, RF, etc., or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment or a portion of code, and that module, program segment or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and combinations of boxes in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be provided in a processor; for example, it can be described as: a processor including an obtaining module, a mapping module and a computing module. The names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves.
As on the other hand, the present invention also provides a kind of computer-readable medium, which be can be Included in equipment described in above-described embodiment;It is also possible to individualism, and without in the supplying equipment.Above-mentioned calculating Machine readable medium carries one or more program, when said one or multiple programs are executed by the equipment, makes Obtaining the equipment includes: acquisition code file, to establish the corresponding abstract syntax tree of the code;It extracts in the abstract syntax tree Vocabulary, abstract syntax tree is mapped to space vector according to the vocabulary;COS distance based on space vector calculates generation Code similarity.
The technical solution according to the embodiments of the present invention addresses a problem in the prior art: code whose structure is clearly problematic cannot be detected when it does not violate programming logic or coding style and its variables, names, and types are all normal.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for detecting code similarity, characterized by comprising:
obtaining a code file and building the abstract syntax tree corresponding to the code;
extracting the vocabulary in the abstract syntax tree, and mapping the abstract syntax tree to a space vector according to the vocabulary;
calculating code similarity based on the cosine distance between space vectors.
2. The method according to claim 1, characterized in that mapping the abstract syntax tree to a space vector according to the vocabulary comprises:
converting the abstract syntax tree into an abstract code tree array according to the vocabulary, and mapping the array to a space vector.
3. The method according to claim 2, characterized in that converting the abstract syntax tree into an abstract code tree array according to the vocabulary comprises:
removing the variable values in the abstract syntax tree;
generating, according to the vocabulary, the abstract code tree array by traversing the abstract syntax tree vertically, from left to right, and deleting duplicate vocabulary items.
4. The method according to claim 2, characterized in that, after converting the abstract syntax tree into an abstract code tree array according to the vocabulary, the method comprises:
prepending the name of the project package to which the code file belongs to the generated abstract code tree array, and storing the result in a preset abstract table of a database;
wherein mapping the array to a space vector comprises:
converting the text information in the abstract table into a space vector, the text information comprising the project package name, the code file name, and the abstract code tree array.
5. A device for detecting code similarity, characterized by comprising:
an obtaining module, configured to obtain a code file and build the abstract syntax tree corresponding to the code;
a mapping module, configured to extract the vocabulary in the abstract syntax tree and map the abstract syntax tree to a space vector according to the vocabulary;
a computing module, configured to calculate code similarity based on the cosine distance between space vectors.
6. The device according to claim 5, characterized in that the mapping module mapping the abstract syntax tree to a space vector according to the vocabulary comprises:
converting the abstract syntax tree into an abstract code tree array according to the vocabulary, and mapping the array to a space vector.
7. The device according to claim 6, characterized in that the mapping module converting the abstract syntax tree into an abstract code tree array according to the vocabulary comprises:
removing the variable values in the abstract syntax tree;
generating, according to the vocabulary, the abstract code tree array by traversing the abstract syntax tree vertically, from left to right, and deleting duplicate vocabulary items.
8. The device according to claim 6, characterized in that, after the mapping module converts the abstract syntax tree into an abstract code tree array according to the vocabulary, the device further performs:
prepending the name of the project package to which the code file belongs to the generated abstract code tree array, and storing the result in a preset abstract table of a database;
wherein the mapping module mapping the array to a space vector comprises:
converting the text information in the abstract table into a space vector, the text information comprising the project package name, the code file name, and the abstract code tree array.
9. An electronic device, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-4.
10. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-4.
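The conversion recited in claims 3 and 4 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the applicant's implementation: the "vertical, left-to-right" order is interpreted as a depth-first pre-order traversal that visits children left to right; recording only node type names is assumed to be what removes variable names and values; and the function name `abstract_code_tree_array` and the package name `com.example.demo` are hypothetical.

```python
import ast
from typing import List


def abstract_code_tree_array(source: str, package_name: str) -> List[str]:
    """Flatten an AST into a deduplicated list of node type names.

    Assumptions: pre-order, left-to-right traversal; node type names are the
    vocabulary; identifiers and literal values are dropped because only the
    type name of each node is recorded.
    """
    words: List[str] = []
    seen = set()

    def visit(node: ast.AST) -> None:
        word = type(node).__name__
        if word not in seen:  # delete duplicate vocabulary items
            seen.add(word)
            words.append(word)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    # Prepend the project package name the code file belongs to (claim 4).
    return [package_name] + words


row = abstract_code_tree_array("x = 1\ny = 2", "com.example.demo")
```

In a fuller pipeline the resulting array, together with the code file name, would be stored as a row of text and then vectorized; the storage schema and the vectorizer are not specified here and are left out of the sketch.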

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810305719.4A CN110347428A (en) 2018-04-08 2018-04-08 A kind of detection method and device of code similarity

Publications (1)

Publication Number Publication Date
CN110347428A true CN110347428A (en) 2019-10-18

Family

ID=68173114

Country Status (1)

Country Link
CN (1) CN110347428A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000031626A1 (en) * 1998-11-19 2000-06-02 Netron Inc. Method of identifying recurring code constructs
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
CN103729580A (en) * 2014-01-27 2014-04-16 国家电网公司 Method and device for detecting software plagiarism
WO2016085273A1 (en) * 2014-11-28 2016-06-02 주식회사 파수닷컴 Method for classifying alarm types in detecting source code error, computer program therefor, recording medium thereof
WO2017052318A1 (en) * 2015-09-25 2017-03-30 (주)씽크포비엘 Method and apparatus for analyzing software
CN107169358A (en) * 2017-05-24 2017-09-15 中国人民解放军信息工程大学 Code homology detection method and its device based on code fingerprint

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
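全上克; 杨新锋: "程序代码相似度检测方法的设计与实现" [Design and implementation of a program code similarity detection method], 微型电脑应用 [Microcomputer Applications], no. 10 *
刘军娜; 邢琪; 赵卫东: "程序相似度检测算法" [Program similarity detection algorithms], 计算机与数字工程 [Computer & Digital Engineering], no. 12 *
朱波; 郑虹; 孙琳琳; 杨友星: "基于AST的程序代码相似性度量研究" [Research on AST-based program code similarity measurement], 吉林大学学报(信息科学版) [Journal of Jilin University (Information Science Edition)], vol. 33, no. 01, pages 2 - 2 *
陈凯; 刘建宾: "代码标识符属性特征向量相似度检测技术研究" [Research on similarity detection based on feature vectors of code identifier attributes], 福建电脑 [Fujian Computer], vol. 32, no. 01, pages 3 *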

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111857660A (en) * 2020-07-06 2020-10-30 南京航空航天大学 Context-aware API recommendation method and terminal based on query statement
CN111857660B (en) * 2020-07-06 2021-10-08 南京航空航天大学 Context-aware API recommendation method and terminal based on query statement
CN111813444A (en) * 2020-07-10 2020-10-23 北京思特奇信息技术股份有限公司 Method, system and electronic equipment for analyzing similarity of source codes
CN112035165A (en) * 2020-08-26 2020-12-04 山谷网安科技股份有限公司 Code clone detection method and system based on homogeneous network
CN112148609A (en) * 2020-09-28 2020-12-29 南京大学 Method for measuring codes submitted in online programming test
CN112698836A (en) * 2021-01-18 2021-04-23 昆明理工大学 Code quality attribute judgment method for complex user comments
CN112698836B (en) * 2021-01-18 2022-05-17 昆明理工大学 Code quality attribute judgment method for complex user comments
CN113641588A (en) * 2021-08-31 2021-11-12 北京航空航天大学 Software intelligibility determination method and system based on LDA topic modeling
CN113641588B (en) * 2021-08-31 2024-05-24 北京航空航天大学 Software understandability determination method and system based on LDA topic modeling
CN115129364A (en) * 2022-07-05 2022-09-30 四川大学 Fingerprint identity recognition method and system based on abstract syntax tree and graph neural network

Similar Documents

Publication Publication Date Title
CN110347428A (en) A kind of detection method and device of code similarity
US11062089B2 (en) Method and apparatus for generating information
CN109376234A (en) A kind of method and apparatus of trained summarization generation model
CN109697641A (en) The method and apparatus for calculating commodity similarity
US10191946B2 (en) Answering natural language table queries through semantic table representation
US20210240784A1 (en) Method, apparatus and storage medium for searching blockchain data
CN110377289A (en) A kind of data analysis method, device, medium and electronic equipment
CN108628830A (en) A kind of method and apparatus of semantics recognition
CN107908615A (en) A kind of method and apparatus for obtaining search term corresponding goods classification
CN107506256A (en) A kind of method and apparatus of crash data monitoring
CN109948141A (en) A kind of method and apparatus for extracting Feature Words
CN107526718A (en) Method and apparatus for generating text
CN110119445A (en) The method and apparatus for generating feature vector and text classification being carried out based on feature vector
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN110471848A (en) A kind of method and apparatus of dynamic returned packet
JP2023036681A (en) Task processing method, processing device, electronic equipment, storage medium, and computer program
CN107908662A (en) The implementation method and realization device of search system
CN110532352A (en) Text duplicate checking method and device, computer readable storage medium, electronic equipment
CN109190123A (en) Method and apparatus for output information
CN111143394B (en) Knowledge data processing method, device, medium and electronic equipment
CN110188113A (en) Method, device and storage medium for comparing data by using complex expression
CN110019802A (en) A kind of method and apparatus of text cluster
CN110110153A (en) A kind of method and apparatus of node searching
CN109902152A (en) Method and apparatus for retrieving information
CN109871540A (en) A kind of calculation method and relevant device of text similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination