CN115729797A - Code similarity function detection method and device, electronic equipment and storage medium - Google Patents

Publication number
CN115729797A
Authority
CN
China
Prior art keywords
subtrees
node
nodes
similar
subtree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110983815.6A
Other languages
Chinese (zh)
Inventor
刘江虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110983815.6A
Publication of CN115729797A

Abstract

Embodiments of the present disclosure provide a code similarity function detection method and device, an electronic device, and a storage medium. The method parses the code to be detected to obtain an abstract syntax tree; traverses the abstract syntax tree to obtain at least two subtrees, each subtree representing a function body in the code to be detected; determines at least two similar subtrees according to the nodes of the subtrees; and determines the function bodies corresponding to the at least two similar subtrees as similar functions, wherein corresponding nodes of the similar subtrees have the same type. Because the code to be detected is converted into an abstract syntax tree, and similar subtrees whose nodes have the same types are identified through static analysis of that tree, the function bodies corresponding to the similar subtrees implement the same functionality. Similar functions are then determined from those function bodies, so that functions with the same functionality are detected and the accuracy of code repetition detection is improved.

Description

Code similarity function detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, and in particular relates to a code similarity function detection method and device, an electronic device and a storage medium.
Background
At present, code repetition is an important metric in front-end project quality inspection. As a project goes through continuous development iterations and accumulates features, repeated code gradually appears, and an excessively high degree of repetition reduces project stability and raises maintenance costs.
In the prior art, code repetition detection usually performs exact clone detection on the code through an open-source code detection library. This approach can only detect functions that are completely identical; it cannot detect similar functions that implement the same functionality but differ in parameters and function names, which results in low accuracy when measuring the repetition rate of the code.
Disclosure of Invention
The embodiment of the disclosure provides a code similarity function detection method and device, electronic equipment and a storage medium, so as to overcome the problem that a similarity function cannot be detected.
In a first aspect, an embodiment of the present disclosure provides a method for detecting a code similarity function, including:
analyzing the code to be detected to obtain an abstract syntax tree; obtaining at least two subtrees in the abstract syntax tree by traversing the abstract syntax tree, wherein the subtrees are used for representing a function body in the code to be detected; and determining at least two similar subtrees according to the nodes of the subtrees, and determining function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
In a second aspect, an embodiment of the present disclosure provides a code similarity function detection apparatus, including:
an analysis module, configured to analyze the code to be detected to obtain an abstract syntax tree;
a traversal module, configured to traverse the abstract syntax tree to obtain at least two subtrees in the abstract syntax tree, wherein the subtrees are used for representing function bodies in the code to be detected; and
a determining module, configured to determine at least two similar subtrees according to the nodes of the subtrees and to determine the function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the code similarity function detection method as described above in the first aspect and in various possible designs of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, where a computer executable instruction is stored, and when a processor executes the computer executable instruction, the code similarity function detection method according to the first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements a code similarity function detection method as described above in the first aspect and various possible designs of the first aspect.
According to the code similarity function detection method and device, the electronic device, and the storage medium provided herein, the code to be detected is parsed to obtain an abstract syntax tree; the abstract syntax tree is traversed to obtain at least two subtrees, each subtree representing a function body in the code to be detected; at least two similar subtrees are determined according to the nodes of the subtrees; and the function bodies corresponding to the at least two similar subtrees are determined as similar functions, wherein corresponding nodes of the similar subtrees have the same type. Because the code to be detected is converted into an abstract syntax tree, and similar subtrees whose nodes have the same types are identified through static analysis of that tree, the function bodies corresponding to the similar subtrees implement the same functionality. Similar functions are then determined from those function bodies, so that functions with the same functionality are detected and the accuracy of code repetition detection is improved.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario diagram of a code similarity function detection method provided in an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a code to be detected according to an embodiment of the present disclosure;
Fig. 3 is a first flowchart illustrating a code similarity function detection method according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of one implementation of step S102 in the embodiment shown in Fig. 3;
Fig. 5 is a schematic diagram of a correspondence between a subtree and a function body according to an embodiment of the present disclosure;
Fig. 6 is a second flowchart illustrating a code similarity function detection method according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of a code similarity function detection apparatus provided in an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The following explains an application scenario of the embodiment of the present disclosure:
Fig. 1 is an application scenario diagram of the code similarity function detection method provided in an embodiment of the present disclosure. The method may be applied in software development scenarios such as system testing and code repetition detection, and its execution subject may be a server that communicates with a terminal device and provides a code detection service to it. Specifically, as shown in Fig. 1, the server receives the code to be detected uploaded by the terminal device, detects it based on the method, determines the code lines corresponding to the similar functions in it, and returns those code lines to the terminal device for display, so that a developer on the terminal device side can see where the similar functions are located in the code to be detected.
Fig. 2 is a schematic diagram of a code to be detected according to an embodiment of the present disclosure. In the prior art, exact clone detection of the code to be detected can be performed with a code detection library such as jscpd; that is, two completely identical functions in the code to be detected can be detected. However, as shown in Fig. 2, consider function 1 (fn1(rule1: string)) and function 2 (fn2(rule2: number)). Although the internal processing and the implemented functionality of function 1 and function 2 are the same (both execute X_function()), their function names (fn1 and fn2) and input parameters (rule1: string and rule2: number) differ. Prior-art detection therefore treats function 1 and function 2 as two different functions and does not count them as repeated code, so repeated code is missed and the accuracy of repetition detection suffers. The cause of this problem is that the prior art can only detect identical functions and cannot detect similar functions that implement the same functionality. To solve this problem, the code similarity function detection method provided in the embodiments of the present disclosure determines similar functions through static analysis of the code to be detected, which addresses the low accuracy of repetition detection caused by the inability to identify similar functions.
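Fig. 2 itself is not reproduced in this text. As a hypothetical reconstruction of the two functions it describes, the following sketch assumes that X_function takes no arguments and returns a string; the function bodies and return types are assumptions made only for illustration.

```typescript
// Hypothetical reconstruction of the Fig. 2 example; the bodies are assumptions.
// X_function stands in for the shared operation that both functions execute.
function X_function(): string {
  return "result";
}

// Function 1: takes a string parameter named rule1.
function fn1(rule1: string): string {
  return X_function();
}

// Function 2: same body, but a different name and a number parameter named rule2.
function fn2(rule2: number): string {
  return X_function();
}

// Both calls do the same work, yet the two declarations differ as text.
const sameBehavior: boolean = fn1("a") === fn2(1);
const sameText: boolean = fn1.toString() === fn2.toString();
```

A detector that only matches identical text would treat fn1 and fn2 as unrelated, even though only the function name, the parameter name, and the parameter type differ.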
Referring to fig. 3, fig. 3 is a first flowchart illustrating a code similarity function detection method according to an embodiment of the present disclosure. The method of the present embodiment may be applied to a server or a terminal device, and the present embodiment is described with the server as an execution subject, and the method for detecting a code similarity function includes:
Step S101, analyzing the code to be detected to obtain an abstract syntax tree.
Illustratively, the code to be detected may be obtained by loading a code file stored in the server, and more specifically, the code file is, for example, a ts format file, a tsx format file, or a js format file. The code to be detected can comprise one or more function bodies, wherein the function bodies are an integral body formed by all codes defining one function in the programming language.
The code to be detected is processed by a compiler and converted into a corresponding abstract syntax tree (AST). An AST is an abstract representation of the syntactic structure of source code, in which each node represents one construct in the source code. An AST can be generated by performing lexical analysis, syntax analysis, compilation, and other steps on each line of the source code. Specifically, the abstract syntax tree corresponding to the code to be detected is obtained by converting each line of the code to be detected. Illustratively, this process may be implemented with the Babel compiler: a preset Babel compiler is loaded, and the code to be detected is parsed by the Babel compiler together with preset parsing keywords to obtain the corresponding abstract syntax tree. Babel is a toolchain mainly used to convert code written in ECMAScript 2015+ syntax into backwards-compatible JavaScript so that it can run in current and older browsers or other environments. The specific usage of the Babel compiler is prior art and is not described here.
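To make the tree structure concrete, the sketch below hand-builds a simplified AST for a function like fn1 instead of calling a parser. The node type names (FunctionDeclaration, Identifier, BlockStatement, ExpressionStatement, CallExpression) follow the ESTree-style naming that Babel uses. The AstNode shape with a generic children array, the makeNode helper, and the omission of type annotations and position data are simplifying assumptions; real parser output carries more fields.

```typescript
// Simplified, hypothetical node shape; real Babel nodes use named fields
// such as id, params, and body instead of one generic children array.
interface AstNode {
  type: string;        // node type, e.g. "FunctionDeclaration"
  name?: string;       // identifier text, when the node has one
  children: AstNode[]; // child nodes in source order
}

// Small helper so the tree literal below stays readable.
function makeNode(type: string, children: AstNode[] = [], name?: string): AstNode {
  return name === undefined ? { type, children } : { type, name, children };
}

// Hand-written tree for: function fn1(rule1) { X_function(); }
const fn1Ast: AstNode = makeNode("FunctionDeclaration", [
  makeNode("Identifier", [], "fn1"),   // function name
  makeNode("Identifier", [], "rule1"), // parameter
  makeNode("BlockStatement", [
    makeNode("ExpressionStatement", [
      makeNode("CallExpression", [makeNode("Identifier", [], "X_function")]),
    ]),
  ]),
]);
```

In this simplified form the identifier text (fn1, rule1) lives in the name field, separate from the node type, which is what later allows functions to be compared by type alone.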
Step S102, traversing the abstract syntax tree to obtain at least two subtrees in the abstract syntax tree, wherein the subtrees are used for representing function bodies in the code to be detected.
After the abstract syntax tree corresponding to the code to be detected is obtained, traversing the abstract syntax tree to obtain all nodes included in the abstract syntax tree. In an embodiment of the present disclosure, as shown in fig. 4, step S102 includes two specific implementation steps S1021 and S1022:
Step S1021, traversing the abstract syntax tree based on a preset algorithm to obtain at least two nodes.
Step S1022, determining at least two subtrees according to the node information of each node.
Illustratively, the preset algorithm is, for example, a depth-first search (DFS) pre-order traversal algorithm. Traversing the abstract syntax tree with the preset algorithm yields the nodes in the abstract syntax tree.
Illustratively, the node information includes a node type, and determining at least two subtrees according to the node information of each node includes: determining the node type (type) of each node according to its node information; and determining at least two subtrees according to the node type of each node. The node information of a node is the description information corresponding to that node and includes a node identifier (Id), a node type, and the like. The node identifier is an object name, such as a function name or a variable name. The node type characterizes the kind of code block the node corresponds to. For example, a node type of "FunctionDeclaration" means that the code block corresponding to the node is a function declaration; as another example, a node type of "VariableDeclarator" means that the code block corresponding to the node is an assignment declaration. Therefore, according to the node type, the part of the abstract syntax tree that corresponds to a code block representing a function declaration is determined as a subtree. Traversing the abstract syntax tree thus yields at least two subtrees (that is, the code to be detected corresponding to the abstract syntax tree includes at least two function bodies). For example, when the traversal reaches a node whose node type corresponds to a function declaration, the part of the abstract syntax tree in which that node is located may be determined as a subtree based on that node type. The other nodes in the subtree are also obtained while traversing the abstract syntax tree.
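Steps S1021 and S1022 may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the AstNode shape, the preorder and functionSubtrees helpers, and the makeFunction builder are all hypothetical names, and the tree is hand-built instead of coming from a parser.

```typescript
// Simplified, hypothetical node shape (real parser nodes carry more fields).
interface AstNode {
  type: string;
  name?: string;
  children: AstNode[];
}

// Pre-order depth-first traversal: visit the node first, then each child in order.
function preorder(root: AstNode, visited: AstNode[] = []): AstNode[] {
  visited.push(root);
  for (const child of root.children) {
    preorder(child, visited);
  }
  return visited;
}

// Every node typed "FunctionDeclaration" is treated as the root of one subtree.
function functionSubtrees(root: AstNode): AstNode[] {
  return preorder(root).filter((n) => n.type === "FunctionDeclaration");
}

// Hypothetical builder for the simplified tree of a function with one call.
function makeFunction(fnName: string, paramName: string): AstNode {
  return {
    type: "FunctionDeclaration",
    children: [
      { type: "Identifier", name: fnName, children: [] },
      { type: "Identifier", name: paramName, children: [] },
      { type: "BlockStatement", children: [
        { type: "ExpressionStatement", children: [
          { type: "CallExpression", children: [
            { type: "Identifier", name: "X_function", children: [] },
          ] },
        ] },
      ] },
    ],
  };
}

const program: AstNode = {
  type: "Program",
  children: [makeFunction("fn1", "rule1"), makeFunction("fn2", "rule2")],
};

const subtrees: AstNode[] = functionSubtrees(program);
const visitOrder: string[] = preorder(subtrees[0]).map((n) => n.type);
```

Here subtrees holds one entry per function declaration, and visitOrder lists the node types of the first subtree in the order the traversal reaches them.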
Step S103, determining at least two similar subtrees according to the nodes of the subtrees, and determining the function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
Illustratively, after at least two subtrees are determined, each subtree corresponds to one or more nodes. When the functions corresponding to two subtrees implement the same functionality, the node types of the corresponding nodes in the two subtrees should also be the same. Fig. 5 is a schematic diagram of a correspondence between a subtree and a function body according to an embodiment of the present disclosure. As shown in Fig. 5, the code to be detected, which includes an fn1 function and an fn2 function, is parsed to obtain subtree 1 corresponding to the function body of fn1 and subtree 2 corresponding to the function body of fn2. Subtree 1 and subtree 2 each include a plurality of nodes. When fn1 and fn2 implement the same functionality, the nodes in subtree 1 and the nodes in subtree 2 are in one-to-one correspondence and corresponding nodes have the same node type; for example, node A1 and node B1 have the same node type, node A2 and node B2 have the same node type, node A3 and node B3 have the same node type, and so on. Comparing the node types of the nodes in the subtrees thus amounts to comparing the function bodies corresponding to the subtrees, which makes it possible to judge whether two or more functions are similar functions implementing the same functionality. In Fig. 5, rule1 is the parameter name of fn1, rule2 is the parameter name of fn2, and X_function is the name of the function executed in both fn1 and fn2; the other statements in the code shown in Fig. 5 are common programming statements known to those skilled in the art and are not described one by one here.
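The one-to-one comparison of node types may be sketched as a recursive check. The AstNode shape, the haveSameNodeTypes function, and the makeFunction builder are hypothetical and serve only to illustrate the comparison; identifier names are deliberately ignored.

```typescript
// Simplified, hypothetical node shape.
interface AstNode {
  type: string;
  name?: string;
  children: AstNode[];
}

// Two subtrees match when corresponding nodes have the same type;
// the name field (function, parameter, and callee names) is not compared.
function haveSameNodeTypes(a: AstNode, b: AstNode): boolean {
  if (a.type !== b.type || a.children.length !== b.children.length) {
    return false;
  }
  return a.children.every((child, i) => haveSameNodeTypes(child, b.children[i]));
}

// Hypothetical builder: a function whose body is one statement wrapping a call.
function makeFunction(fnName: string, paramName: string, statementType: string): AstNode {
  const ident = (text: string): AstNode => ({ type: "Identifier", name: text, children: [] });
  return {
    type: "FunctionDeclaration",
    children: [
      ident(fnName),
      ident(paramName),
      { type: "BlockStatement", children: [
        { type: statementType, children: [
          { type: "CallExpression", children: [ident("X_function")] },
        ] },
      ] },
    ],
  };
}

const subtree1: AstNode = makeFunction("fn1", "rule1", "ExpressionStatement");
const subtree2: AstNode = makeFunction("fn2", "rule2", "ExpressionStatement");
const subtree3: AstNode = makeFunction("fn3", "rule3", "ReturnStatement");

const fn1MatchesFn2: boolean = haveSameNodeTypes(subtree1, subtree2);
const fn1MatchesFn3: boolean = haveSameNodeTypes(subtree1, subtree3);
```

In this sketch subtree1 and subtree2 match despite different names, while subtree3 does not match because one of its nodes has a different type.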
In this embodiment, the code to be detected is parsed to obtain an abstract syntax tree; the abstract syntax tree is traversed to obtain at least two subtrees, each subtree representing a function body in the code to be detected; at least two similar subtrees are determined according to the nodes of the subtrees; and the function bodies corresponding to the at least two similar subtrees are determined as similar functions, wherein corresponding nodes of the similar subtrees have the same type. Because the code to be detected is converted into an abstract syntax tree, and similar subtrees whose nodes have the same types are identified through static analysis of that tree, the function bodies corresponding to the similar subtrees implement the same functionality. Similar functions are then determined from those function bodies, so that functions with the same functionality are detected and the accuracy of code repetition detection is improved.
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating a second method for detecting a code similarity function according to an embodiment of the present disclosure. This embodiment further refines steps S102-S103 on the basis of the embodiment shown in fig. 3, and the code similarity function detecting method includes:
S201: analyzing the code to be detected to obtain an abstract syntax tree.
S202: storing each node in the abstract syntax tree to first dictionary data, wherein keys of the first dictionary data represent storage addresses and values of the first dictionary data are used to represent a set of each node in the abstract syntax tree.
Illustratively, after the abstract syntax tree is traversed in DFS pre-order, all the obtained nodes are stored in the first dictionary data, which may be a map. Each entry in this map consists of a key and a corresponding value. The key is the storage address of the code to be detected, more specifically, for example, a file path. The value stores the node information of all nodes, which realizes the storage of all nodes. Because all nodes in the abstract syntax tree are stored in the first dictionary data, the similar subtrees determined later can be restored in a subsequent step based on the first dictionary data, thereby determining the position of the similar functions in the code.
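One possible shape for the first dictionary data is sketched below. A plain object keyed by file path stands in for the map; the NodeRecord fields (id, line, endLine), the flatten and storeNodes helpers, and the path src/example.ts are assumptions, since the source only states that the value holds the node information of all nodes.

```typescript
// Hypothetical tree node carrying the source lines it covers.
interface AstNode {
  type: string;
  line: number;
  endLine: number;
  children: AstNode[];
}

// Hypothetical flattened record stored in the first dictionary.
interface NodeRecord {
  id: number;      // index of the node in pre-order
  type: string;
  line: number;    // first source line covered by the node
  endLine: number; // last source line covered by the node
}

// First dictionary: storage address (file path) -> all nodes of that file.
type FirstDictionary = Record<string, NodeRecord[]>;

// Pre-order flattening; the id is the position at which the node is visited.
function flatten(root: AstNode, out: NodeRecord[] = []): NodeRecord[] {
  out.push({ id: out.length, type: root.type, line: root.line, endLine: root.endLine });
  for (const child of root.children) {
    flatten(child, out);
  }
  return out;
}

function storeNodes(dictionary: FirstDictionary, filePath: string, root: AstNode): void {
  dictionary[filePath] = flatten(root);
}

const exampleTree: AstNode = {
  type: "Program", line: 1, endLine: 6,
  children: [
    { type: "FunctionDeclaration", line: 1, endLine: 3, children: [] },
    { type: "FunctionDeclaration", line: 4, endLine: 6, children: [] },
  ],
};

const firstDictionary: FirstDictionary = {};
storeNodes(firstDictionary, "src/example.ts", exampleTree);
```

Keeping line information next to each node is what later allows a matched subtree to be traced back to a position in the code.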
S203: acquiring node types of at least two nodes of each subtree.
S204: determining the subtree type according to the node types of at least two nodes of each subtree.
S205: determining subtrees with the same subtree type as similar subtrees.
In an embodiment of the present disclosure, determining a subtree type according to node types of at least two nodes of each subtree includes: performing hash calculation on the node types of at least two nodes in each sub-tree to generate type hash values respectively corresponding to each sub-tree; and determining the type of the subtree according to the type hash value corresponding to each subtree.
More specifically, for example, the nodes in each subtree are obtained by traversing the abstract syntax tree, and hash calculation is then performed on the set of node types of the nodes in each subtree to obtain the type hash value corresponding to that subtree. For example, the abstract syntax tree corresponding to the code to be detected includes subtree A and subtree B, where subtree A includes nodes a1, a2, and a3, and subtree B includes nodes b1, b2, and b3. Hash calculation on the set of node types of nodes a1, a2, and a3 gives the type hash value of subtree A; hash calculation on the set of node types of nodes b1, b2, and b3 gives the type hash value of subtree B. If the type hash value of subtree A is consistent with the type hash value of subtree B, the node types of each pair of corresponding nodes in subtree A and subtree B are considered to be the same, that is, nodes a1, a2, and a3 in subtree A have the same node types as nodes b1, b2, and b3 in subtree B, respectively, and subtree A and subtree B are determined to be similar subtrees.
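The type hash may be sketched as follows. The source does not name a hash function, so the djb2 string hash below is only a stand-in; joining the node types in pre-order with a separator before hashing is likewise an assumption about how the set of node types is serialized.

```typescript
// Simplified, hypothetical node shape.
interface AstNode {
  type: string;
  children: AstNode[];
}

// Collect node types in pre-order.
function nodeTypes(root: AstNode, out: string[] = []): string[] {
  out.push(root.type);
  for (const child of root.children) {
    nodeTypes(child, out);
  }
  return out;
}

// djb2 string hash (hash * 33 + char code), kept to 32 bits. Stand-in only.
function djb2(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// Type hash of a subtree: hash of its node types, ignoring all identifier names.
function typeHash(subtree: AstNode): string {
  return djb2(nodeTypes(subtree).join("|"));
}

const leaf = (type: string): AstNode => ({ type, children: [] });

// Subtree A (nodes a1, a2, a3) and subtree B (nodes b1, b2, b3) with equal types.
const subtreeA: AstNode = {
  type: "FunctionDeclaration",
  children: [leaf("Identifier"), leaf("BlockStatement")],
};
const subtreeB: AstNode = {
  type: "FunctionDeclaration",
  children: [leaf("Identifier"), leaf("BlockStatement")],
};

const hashesMatch: boolean = typeHash(subtreeA) === typeHash(subtreeB);
```

Equal type sequences always produce equal hashes. The converse holds only up to hash collisions, so a production implementation might confirm a match with a direct node-by-node comparison.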
In a possible implementation, nodes of a preset length in each subtree are obtained to generate target nodes. Illustratively, the nodes of a preset length are a preset number of nodes in the subtree, or the node information of the nodes within a preset number of lines. For example, if subtree A includes 10 nodes, obtaining the nodes of a preset length means obtaining the first 5 nodes in subtree A and generating target nodes from them for the subsequent hash calculation. When hash calculation is performed on the node types of at least two nodes in each subtree to generate the type hash value corresponding to each subtree, the hash is computed over the node types of the target nodes, which improves computational efficiency and saves computing resources.
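Selecting the target nodes may be sketched as taking a fixed-length prefix of the pre-order node types. The helper names and the two example subtrees are hypothetical; the resulting key is what would then be passed to the hash calculation.

```typescript
// Simplified, hypothetical node shape.
interface AstNode {
  type: string;
  children: AstNode[];
}

// Collect node types in pre-order.
function nodeTypes(root: AstNode, out: string[] = []): string[] {
  out.push(root.type);
  for (const child of root.children) {
    nodeTypes(child, out);
  }
  return out;
}

// Node types of the first presetLength nodes in pre-order (the target nodes).
function targetNodeTypes(subtree: AstNode, presetLength: number): string[] {
  return nodeTypes(subtree).slice(0, presetLength);
}

const leafNode = (type: string): AstNode => ({ type, children: [] });

// Two six-node subtrees that differ only in their last node.
const longSubtree1: AstNode = {
  type: "FunctionDeclaration",
  children: [
    leafNode("Identifier"),
    leafNode("Identifier"),
    { type: "BlockStatement", children: [leafNode("ExpressionStatement"), leafNode("ReturnStatement")] },
  ],
};
const longSubtree2: AstNode = {
  type: "FunctionDeclaration",
  children: [
    leafNode("Identifier"),
    leafNode("Identifier"),
    { type: "BlockStatement", children: [leafNode("ExpressionStatement"), leafNode("ThrowStatement")] },
  ],
};

const prefixKey1: string = targetNodeTypes(longSubtree1, 5).join("|");
const prefixKey2: string = targetNodeTypes(longSubtree2, 5).join("|");
const fullKey1: string = nodeTypes(longSubtree1).join("|");
const fullKey2: string = nodeTypes(longSubtree2).join("|");
```

In this sketch the two prefix keys are equal while the full keys differ, which illustrates the trade-off: a shorter preset length costs less to hash but can group subtrees that diverge after the prefix.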
S206: storing the similar subtrees to second dictionary data, wherein keys of the second dictionary data characterize subtree types of the similar subtrees, and values of the second dictionary data are used to characterize a set of nodes of the similar subtrees.
Illustratively, after the similar subtrees are determined, the similar subtrees of fixed length are stored in the second dictionary data, which is also a map. In each entry of this map, the key is the unique hash value generated from the node types of all nodes in the similar subtrees and characterizes the subtree type; the value is the set of all node groups that have the same subtree type, that is, the nodes of the similar subtrees.
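Building the second dictionary data may be sketched as grouping subtrees by their type hash and keeping only groups with at least two members. A plain object stands in for the map; SubtreeRecord, groupSimilarSubtrees, and the djb2 stand-in hash are hypothetical names and choices, not taken from the source.

```typescript
// Hypothetical record for one function subtree: a label plus its node types in pre-order.
interface SubtreeRecord {
  label: string;
  nodeTypes: string[];
}

// Second dictionary: subtree type (type hash) -> all subtrees sharing that type.
type SecondDictionary = Record<string, SubtreeRecord[]>;

// djb2 string hash as a stand-in; the source does not name a hash function.
function djb2(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

function subtreeType(subtree: SubtreeRecord): string {
  return djb2(subtree.nodeTypes.join("|"));
}

function groupSimilarSubtrees(subtrees: SubtreeRecord[]): SecondDictionary {
  // First pass: bucket every subtree under its subtree type.
  const byType: SecondDictionary = {};
  for (const subtree of subtrees) {
    const key = subtreeType(subtree);
    if (!byType[key]) {
      byType[key] = [];
    }
    byType[key].push(subtree);
  }
  // Second pass: a subtree type shared by fewer than two subtrees is not a similarity.
  const similar: SecondDictionary = {};
  for (const key of Object.keys(byType)) {
    if (byType[key].length >= 2) {
      similar[key] = byType[key];
    }
  }
  return similar;
}

const callBody: string[] = [
  "FunctionDeclaration", "Identifier", "Identifier", "BlockStatement",
  "ExpressionStatement", "CallExpression", "Identifier",
];
const returnBody: string[] = ["FunctionDeclaration", "Identifier", "BlockStatement", "ReturnStatement"];

const secondDictionary: SecondDictionary = groupSimilarSubtrees([
  { label: "fn1", nodeTypes: callBody },
  { label: "fn2", nodeTypes: callBody },
  { label: "fn3", nodeTypes: returnBody },
]);
const fnGroupLabels: string[] = (secondDictionary[djb2(callBody.join("|"))] || []).map((s) => s.label);
```

Here fn1 and fn2 land under the same key, so fnGroupLabels contains both labels.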
S207: determining the node positions of all nodes corresponding to the similar subtrees according to the values of the first dictionary data and the values of the second dictionary data.
Illustratively, the value of the second dictionary data records all node groups having the same subtree type, and the value of the first dictionary data stores the node information of all nodes. The nodes in each node group having the same subtree type are looked up in the values of the first dictionary data, which determines where the nodes of the similar subtrees are located in the code, that is, the node positions of the nodes corresponding to the similar subtrees.
Illustratively, in the second dictionary data, key1 is abcd1 and represents the subtree type of a group of similar subtrees (that is, it corresponds to one functionality); the corresponding value1 includes three node groups, namely node group a, node group b, and node group c. Each node group corresponds to one subtree, and the three subtrees corresponding to node groups a, b, and c are similar subtrees. Based on the position information of all nodes held in the first dictionary data, each node in node groups a, b, and c is located through the first dictionary data, which determines the node position of each node corresponding to the similar subtrees.
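The lookup in S207 may be sketched as follows, using the key abcd1 and three node groups from the example. Representing a node group as a list of node ids, and a position as a line range, are assumptions; the file path, the record fields, and the locateGroups helper are hypothetical.

```typescript
// Hypothetical node record held in the first dictionary.
interface NodeRecord {
  id: number;
  type: string;
  line: number;
  endLine: number;
}

type FirstDictionary = Record<string, NodeRecord[]>; // file path -> all nodes
type SecondDictionary = Record<string, number[][]>;  // subtree type -> node groups (node ids)

interface LineRange {
  start: number;
  end: number;
}

// For each node group of one subtree type, find the lines its nodes cover.
function locateGroups(
  first: FirstDictionary,
  second: SecondDictionary,
  filePath: string,
  subtreeTypeKey: string,
): LineRange[] {
  const nodes = first[filePath] || [];
  const nodeGroups = second[subtreeTypeKey] || [];
  return nodeGroups.map((ids) => {
    let start = Infinity;
    let end = -Infinity;
    for (const record of nodes) {
      if (ids.indexOf(record.id) !== -1) {
        start = Math.min(start, record.line);
        end = Math.max(end, record.endLine);
      }
    }
    return { start, end };
  });
}

const firstData: FirstDictionary = {
  "src/example.ts": [
    { id: 0, type: "Program", line: 1, endLine: 9 },
    { id: 1, type: "FunctionDeclaration", line: 1, endLine: 3 },
    { id: 2, type: "BlockStatement", line: 1, endLine: 3 },
    { id: 3, type: "FunctionDeclaration", line: 4, endLine: 6 },
    { id: 4, type: "BlockStatement", line: 4, endLine: 6 },
    { id: 5, type: "FunctionDeclaration", line: 7, endLine: 9 },
    { id: 6, type: "BlockStatement", line: 7, endLine: 9 },
  ],
};

// key1 = "abcd1" with node groups a, b, and c.
const secondData: SecondDictionary = { abcd1: [[1, 2], [3, 4], [5, 6]] };

const ranges: LineRange[] = locateGroups(firstData, secondData, "src/example.ts", "abcd1");
```

With this data, ranges holds one line range per node group: lines 1 to 3, 4 to 6, and 7 to 9.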
In an embodiment of the present disclosure, before the node positions are determined, the node groups corresponding to the similar subtrees in the second dictionary data may be sorted with a stable sorting algorithm so that similar subtrees with more nodes are placed first and processed with priority, which facilitates the subsequent deduplication of similar functions.
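The ordering step may be sketched with the built-in array sort, which ECMAScript 2019 and later require to be stable. The NodeGroup shape, its labels, and the node counts are hypothetical.

```typescript
// Hypothetical summary of one node group in the second dictionary.
interface NodeGroup {
  label: string;
  nodeCount: number;
}

// Sort a copy in descending order of node count. Because the sort is stable,
// groups with equal node counts keep their original relative order.
function sortByNodeCount(nodeGroups: NodeGroup[]): NodeGroup[] {
  return nodeGroups.slice().sort((a, b) => b.nodeCount - a.nodeCount);
}

const unsortedGroups: NodeGroup[] = [
  { label: "a", nodeCount: 5 },
  { label: "b", nodeCount: 9 },
  { label: "c", nodeCount: 5 },
  { label: "d", nodeCount: 12 },
];

const sortedLabels: string[] = sortByNodeCount(unsortedGroups).map((g) => g.label);
```

The sorted order is d, b, a, c: the two five-node groups a and c stay in their original order. One plausible reason for handling larger subtrees first is that a smaller similar subtree nested inside an already reported larger one can then be skipped, although the source does not spell out the deduplication rule.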
S208: restoring each node corresponding to the similar subtrees according to the node positions to generate the similar functions.
The similar subtrees are then restored according to their node positions, which determines the similar functions corresponding to the similar subtrees. For example, if the similar subtrees comprise three subtrees, the similar functions also comprise three functions. Restoring a subtree of the abstract syntax tree to generate the corresponding function code segment is known to those skilled in the art and is not described here.
A similar function may be presented, for example, as its specific code text or as code line identifiers in the code to be detected. Optionally, after the similar functions are determined, they may be sent to the terminal device for display, so that a developer can see where the similar functions are located in the code to be detected and optimize the code accordingly, thereby improving code quality.
In this embodiment, the implementation manner of step S201 is the same as the implementation manner of step S101 in the embodiment shown in fig. 3 of the present disclosure, and is not described in detail here.
Fig. 7 is a block diagram of a code similarity function detection apparatus according to an embodiment of the present disclosure, which corresponds to the code similarity function detection method according to the above embodiment. For ease of illustration, only portions relevant to embodiments of the present disclosure are shown. Referring to fig. 7, the code similarity function detecting apparatus 3 includes:
the analysis module 31 is configured to analyze the code to be detected to obtain an abstract syntax tree;
the traversing module 32 is configured to obtain at least two sub-trees in the abstract syntax tree by traversing the abstract syntax tree, where the sub-trees are used to represent function bodies in the to-be-detected code;
the determining module 33 is configured to determine at least two similar subtrees according to the nodes of each subtree, and determine the function bodies corresponding to the at least two similar subtrees as similar functions, where the types of the corresponding nodes of each similar subtree are the same.
In an embodiment of the present disclosure, the traversal module 32 is specifically configured to: traversing the abstract syntax tree based on a preset algorithm to obtain at least two nodes; and determining at least two subtrees according to the node information of each node.
In an embodiment of the present disclosure, when determining at least two subtrees according to the node information of each node, the traversing module 32 is specifically configured to: determining the node type of each node according to the node information of each node; and determining at least two subtrees according to the node type of each node.
In an embodiment of the present disclosure, the preset algorithm includes a depth-first traversal algorithm, and the traversal module 32 is specifically configured to, when traversing the abstract syntax tree based on the preset algorithm to obtain at least two nodes: perform a pre-order traversal on the abstract syntax tree based on the depth-first traversal algorithm to obtain at least two nodes.
In an embodiment of the present disclosure, the determining module 33 is specifically configured to: acquiring node types of at least two nodes of each subtree; determining a sub-tree type according to the node types of at least two nodes of each sub-tree; and determining subtrees with the same subtree type as similar subtrees.
In an embodiment of the present disclosure, when determining the subtree type according to the node types of at least two nodes of each subtree, the determining module 33 is specifically configured to: performing hash calculation on node types of at least two nodes in each subtree to generate type hash values respectively corresponding to each subtree; and determining the type of the subtree according to the type hash value corresponding to each subtree.
In an embodiment of the present disclosure, the determining module 33 is further configured to: acquire nodes of a preset length in each subtree to generate target nodes. When performing hash calculation on the node types of at least two nodes in each subtree to generate the type hash values respectively corresponding to the subtrees, the determining module 33 is specifically configured to: perform hash calculation on the node types of the target nodes in each subtree to generate the type hash values respectively corresponding to the subtrees.
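One possible reading of the preset length is that only the first N nodes of each subtree, taken in traversal order, are hashed. The following sketch illustrates that reading only; the interpretation, the parameter name `preset_length`, the SHA-256 hash, and the use of Python's `ast` module as a stand-in parser are all assumptions.

```python
import ast
import hashlib
from itertools import islice
from typing import Iterator


def preorder_types(node: ast.AST) -> Iterator[str]:
    """Yield node type names in depth-first pre-order."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder_types(child)


def truncated_type_hash(subtree: ast.AST, preset_length: int) -> str:
    """Hash the node types of at most `preset_length` leading nodes (the target nodes)."""
    target_types = list(islice(preorder_types(subtree), preset_length))
    return hashlib.sha256(",".join(target_types).encode("utf-8")).hexdigest()


func_add = ast.parse("def f(a, b):\n    return a + b\n").body[0]
func_sub = ast.parse("def g(a, b):\n    return a - b\n").body[0]

same_prefix = truncated_type_hash(func_add, 8) == truncated_type_hash(func_sub, 8)
same_full = truncated_type_hash(func_add, 50) == truncated_type_hash(func_sub, 50)
```

Under this reading, a smaller preset length makes the comparison more tolerant: the two functions above match on their first eight nodes but no longer match once the differing operator node is included.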
In an embodiment of the present disclosure, the determining module 33 is further configured to: store each node in the abstract syntax tree in first dictionary data, wherein keys of the first dictionary data represent storage addresses and values of the first dictionary data represent the set of nodes in the abstract syntax tree; and store the similar subtrees in second dictionary data, wherein keys of the second dictionary data represent the subtree types of the similar subtrees and values of the second dictionary data represent the sets of nodes of the similar subtrees.
When determining the function bodies corresponding to the at least two similar subtrees as similar functions, the determining module 33 is specifically configured to: determine the node positions of the nodes corresponding to the similar subtrees according to the values of the first dictionary data and the values of the second dictionary data; and restore each node corresponding to the similar subtrees according to the node positions to generate the similar functions.
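The two dictionaries and the restoration step may be sketched as follows. The sketch makes several assumptions that are not part of this disclosure: a pre-order index serves as the storage address, each value of the first dictionary is a single node, the second dictionary stores the addresses of subtree roots, SHA-256 is the hash, and Python's `ast` module (with `ast.get_source_segment` for restoration) stands in for the parser described here.

```python
import ast
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, List

SOURCE = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def plus(x, y):\n"
    "    return x + y\n"
    "\n"
    "def negate(v):\n"
    "    return -v\n"
)


def preorder(node: ast.AST) -> Iterator[ast.AST]:
    """Yield the node first, then each child subtree from left to right."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)


def type_hash(subtree: ast.AST) -> str:
    """Hash the pre-order sequence of node type names."""
    joined = ",".join(type(n).__name__ for n in preorder(subtree))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


tree = ast.parse(SOURCE)

# First dictionary: key = storage address (assumed: pre-order index), value = node.
first_dict: Dict[int, ast.AST] = {}
# Second dictionary: key = subtree type (type hash), value = addresses of subtree roots.
second_dict: Dict[str, List[int]] = defaultdict(list)

for address, node in enumerate(preorder(tree)):
    first_dict[address] = node
    if isinstance(node, ast.FunctionDef):
        second_dict[type_hash(node)].append(address)

# Subtree types shared by at least two subtrees identify similar subtrees.
similar_groups = [addrs for addrs in second_dict.values() if len(addrs) >= 2]

# Restore each similar subtree to source text through its address in the first dictionary.
similar_functions = [
    [ast.get_source_segment(SOURCE, first_dict[a]) for a in addrs]
    for addrs in similar_groups
]
```

Keeping addresses rather than node copies in the second dictionary means each similar subtree is resolved back to its original position through the first dictionary, which is the role the node positions play in the restoration step.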
In an embodiment of the present disclosure, the analysis module 31 is specifically configured to: load a preset Babel editor; and analyze the code to be detected through the Babel editor and preset analysis keywords to obtain the corresponding abstract syntax tree.
The analysis module 31, the traversal module 32, and the determination module 33 are connected in sequence. The code similarity function detecting apparatus 3 provided in this embodiment may execute the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 8, the electronic device 4 includes at least one processor 41 and a memory 42;
the memory 42 stores computer-executable instructions;
the at least one processor 41 executes the computer-executable instructions stored in the memory 42, causing the at least one processor 41 to perform the code similarity function detection method in the embodiments shown in fig. 3 to fig. 6.
The processor 41 and the memory 42 are connected by a bus 43.
For details, refer to the description and effects of the corresponding steps in the embodiments of fig. 3 to fig. 6, which are not repeated here.
Referring to fig. 9, a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure is shown. The electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (PAD), a portable multimedia player (PMP), and a vehicle-mounted terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data necessary for the operation of the electronic device 900. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 9 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself; for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a code similarity function detection method, including:
analyzing the code to be detected to obtain an abstract syntax tree; obtaining at least two subtrees in the abstract syntax tree by traversing the abstract syntax tree, wherein the subtrees are used for representing a function body in the code to be detected; and determining at least two similar subtrees according to the nodes of the subtrees, and determining function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
According to one or more embodiments of the present disclosure, traversing the abstract syntax tree to obtain at least two sub-trees in the abstract syntax tree comprises: traversing the abstract syntax tree based on a preset algorithm to obtain at least two nodes; and determining at least two subtrees according to the node information of each node.
According to one or more embodiments of the present disclosure, determining at least two subtrees according to node information of each of the nodes includes: determining the node type of each node according to the node information of each node; and determining at least two subtrees according to the node type of each node.
According to one or more embodiments of the present disclosure, the preset algorithm includes a depth-first traversal algorithm, and traversing the abstract syntax tree based on the preset algorithm to obtain at least two nodes includes: performing a pre-order traversal of the abstract syntax tree based on the depth-first traversal algorithm to obtain at least two nodes.
According to one or more embodiments of the present disclosure, determining at least two similar subtrees according to the nodes of each of the subtrees includes: obtaining node types of at least two nodes of each subtree; determining a sub-tree type according to the node types of at least two nodes of each sub-tree; and determining subtrees with the same subtree type as similar subtrees.
According to one or more embodiments of the present disclosure, determining a subtree type according to node types of at least two nodes of each subtree includes: performing hash calculation on node types of at least two nodes in each subtree to generate type hash values respectively corresponding to each subtree; and determining the type of the subtree according to the type hash value corresponding to each subtree.
According to one or more embodiments of the present disclosure, the method further comprises: acquiring nodes of a preset length in each subtree to generate target nodes; and performing hash calculation on the node types of at least two nodes in each subtree to generate the type hash values respectively corresponding to the subtrees includes: performing hash calculation on the node types of the target nodes in each subtree to generate the type hash values respectively corresponding to the subtrees.
According to one or more embodiments of the present disclosure, the method further comprises: storing each node in the abstract syntax tree to first dictionary data, wherein keys of the first dictionary data characterize storage addresses, and values of the first dictionary data are used to characterize a set of each node in the abstract syntax tree; storing the similar subtree to second dictionary data, wherein a key of the second dictionary data characterizes a subtree type of the similar subtree, and a value of the second dictionary data is used to characterize a set of nodes of the similar subtree; determining function bodies corresponding to the at least two similar subtrees as similar functions, including: determining the node positions of all nodes corresponding to the similar subtrees according to the first dictionary data and the second dictionary data; and restoring each node corresponding to the similar subtree according to the node position to generate the similar function.
According to one or more embodiments of the present disclosure, parsing a code to be detected to obtain an abstract syntax tree includes: loading a preset Babel editor; and analyzing the code to be detected through the Babel editor and preset analysis keywords to obtain a corresponding abstract syntax tree.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a code similarity function detection apparatus including:
the analysis module is used for analyzing the codes to be detected to obtain an abstract syntax tree;
the traversal module is used for obtaining at least two subtrees in the abstract syntax tree by traversing the abstract syntax tree, wherein the subtrees are used for representing a function body in the code to be detected;
and the determining module is used for determining at least two similar subtrees according to the nodes of the subtrees and determining the function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
According to one or more embodiments of the present disclosure, the traversal module is specifically configured to: traversing the abstract syntax tree based on a preset algorithm to obtain at least two nodes; and determining at least two subtrees according to the node information of each node.
According to one or more embodiments of the present disclosure, when determining at least two subtrees according to the node information of each of the nodes, the traversal module is specifically configured to: determining the node type of each node according to the node information of each node; and determining at least two subtrees according to the node type of each node.
According to one or more embodiments of the present disclosure, the preset algorithm includes a depth-first traversal algorithm, and when traversing the abstract syntax tree based on the preset algorithm to obtain at least two nodes, the traversal module is specifically configured to: perform a pre-order traversal of the abstract syntax tree based on the depth-first traversal algorithm to obtain at least two nodes.
According to one or more embodiments of the present disclosure, the determining module is specifically configured to: obtaining node types of at least two nodes of each subtree; determining a sub-tree type according to the node types of at least two nodes of each sub-tree; and determining subtrees with the same subtree type as similar subtrees.
According to one or more embodiments of the present disclosure, when determining the subtree type according to the node types of at least two nodes of each subtree, the determining module is specifically configured to: performing hash calculation on node types of at least two nodes in each subtree to generate type hash values respectively corresponding to each subtree; and determining the type of the subtree according to the type hash value corresponding to each subtree.
According to one or more embodiments of the present disclosure, the determining module is further configured to: acquire nodes of a preset length in each subtree to generate target nodes. When performing hash calculation on the node types of at least two nodes in each subtree to generate the type hash values respectively corresponding to the subtrees, the determining module is specifically configured to: perform hash calculation on the node types of the target nodes in each subtree to generate the type hash values respectively corresponding to the subtrees.
According to one or more embodiments of the present disclosure, the determining module is further configured to: store each node in the abstract syntax tree in first dictionary data, wherein keys of the first dictionary data represent storage addresses and values of the first dictionary data represent the set of nodes in the abstract syntax tree; and store the similar subtrees in second dictionary data, wherein keys of the second dictionary data represent the subtree types of the similar subtrees and values of the second dictionary data represent the sets of nodes of the similar subtrees.
According to one or more embodiments of the present disclosure, when determining the function bodies corresponding to the at least two similar subtrees as similar functions, the determining module is specifically configured to: determine the node positions of the nodes corresponding to the similar subtrees according to the first dictionary data and the second dictionary data; and restore each node corresponding to the similar subtrees according to the node positions to generate the similar functions.
According to one or more embodiments of the present disclosure, the analysis module is specifically configured to: load a preset Babel editor; and analyze the code to be detected through the Babel editor and preset analysis keywords to obtain the corresponding abstract syntax tree.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the code similarity function detection method described in the first aspect and the various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the code similarity function detection method as described in the first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements a code similarity function detection method as described above in the first aspect and various possible designs of the first aspect.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the present disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (13)

1. A method for detecting a code similarity function, the method comprising:
analyzing the code to be detected to obtain an abstract syntax tree;
obtaining at least two subtrees in the abstract syntax tree by traversing the abstract syntax tree, wherein the subtrees are used for representing a function body in the code to be detected;
and determining at least two similar subtrees according to the nodes of the subtrees, and determining function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
2. The method of claim 1, wherein traversing the abstract syntax tree to obtain at least two sub-trees in the abstract syntax tree comprises:
traversing the abstract syntax tree based on a preset algorithm to obtain at least two nodes;
and determining at least two subtrees according to the node information of each node.
3. The method of claim 2, wherein the node information includes a node type; determining at least two subtrees according to the node information of each node, comprising:
determining the node type of each node according to the node information of each node;
and determining at least two subtrees according to the node type of each node.
4. The method of claim 2, wherein the predetermined algorithm comprises a depth-first traversal algorithm, and wherein traversing the abstract syntax tree based on the predetermined algorithm to obtain at least two nodes comprises:
and performing a pre-order traversal of the abstract syntax tree based on the depth-first traversal algorithm to obtain at least two nodes.
5. The method of claim 1, wherein determining at least two similar subtrees based on the nodes of each of the subtrees comprises:
obtaining node types of at least two nodes of each subtree;
determining a sub-tree type according to the node types of at least two nodes of each sub-tree;
and determining subtrees with the same subtree type as similar subtrees.
6. The method of claim 5, wherein determining the subtree type based on the node types of at least two nodes of each of the subtrees comprises:
performing hash calculation on node types of at least two nodes in each subtree to generate type hash values respectively corresponding to each subtree;
and determining the type of the subtree according to the type hash value corresponding to each subtree.
7. The method of claim 5, further comprising:
acquiring nodes with preset lengths in the subtrees to generate target nodes;
performing hash calculation on node types of at least two nodes in each subtree to generate type hash values respectively corresponding to each subtree, including:
and performing hash calculation on the node types of the target nodes in the subtrees to generate type hash values respectively corresponding to the subtrees.
8. The method of claim 5, further comprising:
storing each node in the abstract syntax tree to first dictionary data, wherein keys of the first dictionary data characterize storage addresses, and values of the first dictionary data are used to characterize a set of each node in the abstract syntax tree;
storing the similar subtree to second dictionary data, wherein a key of the second dictionary data characterizes a subtree type of the similar subtree, and a value of the second dictionary data is used to characterize a set of nodes of the similar subtree;
determining the function bodies corresponding to the at least two similar subtrees as similar functions, including:
determining the node positions of all nodes corresponding to the similar subtrees according to the first dictionary data and the second dictionary data;
and restoring each node corresponding to the similar subtree according to the node position to generate the similar function.
9. The method according to any of claims 1-8, wherein parsing the code to be detected to obtain an abstract syntax tree comprises:
loading a preset Babel editor;
and analyzing the code to be detected through the Babel editor and preset analysis keywords to obtain a corresponding abstract syntax tree.
10. A code similarity function detecting apparatus, comprising:
the analysis module is used for analyzing the code to be detected to obtain an abstract syntax tree;
the traversal module is used for obtaining at least two subtrees in the abstract syntax tree by traversing the abstract syntax tree, wherein the subtrees are used for representing a function body in the code to be detected;
and the determining module is used for determining at least two similar subtrees according to the nodes of the subtrees and determining the function bodies corresponding to the at least two similar subtrees as similar functions, wherein the types of the corresponding nodes of the similar subtrees are the same.
11. An electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the code similarity function detection method of any of claims 1 to 9.
12. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, implement the code similarity function detection method of any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the code similarity function detection method of any of claims 1 to 9.
CN202110983815.6A 2021-08-25 2021-08-25 Code similarity function detection method and device, electronic equipment and storage medium Pending CN115729797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110983815.6A CN115729797A (en) 2021-08-25 2021-08-25 Code similarity function detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110983815.6A CN115729797A (en) 2021-08-25 2021-08-25 Code similarity function detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115729797A true CN115729797A (en) 2023-03-03

Family

ID=85289807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110983815.6A Pending CN115729797A (en) 2021-08-25 2021-08-25 Code similarity function detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115729797A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349803A (en) * 2023-12-06 2024-01-05 浙江大学 Code confusion method, device, electronic equipment and computer readable storage medium
CN117349803B (en) * 2023-12-06 2024-03-19 浙江大学 Code confusion method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
KR20160073402A (en) Callpath finder
CN111506900B (en) Vulnerability detection method and device, electronic equipment and computer storage medium
CN114422267B (en) Flow detection method, device, equipment and medium
CN110764748B (en) Code calling method, device, terminal and storage medium
CN110502227B (en) Code complement method and device, storage medium and electronic equipment
CN109726217B (en) Database operation method, device, equipment and storage medium
CN111736840A (en) Compiling method and running method of applet, storage medium and electronic equipment
CN112395253B (en) Index file generation method, terminal device, electronic device and medium
CN111552640A (en) Code detection method, device, equipment and storage medium
CN114328208A (en) Code detection method and device, electronic equipment and storage medium
CN114035805A (en) Code conversion method, apparatus, medium, and device for pre-compiler
CN111241823A (en) Dependency configuration management method and device, electronic equipment and storage medium
CN115729797A (en) Code similarity function detection method and device, electronic equipment and storage medium
CN112527302B (en) Error detection method and device, terminal and storage medium
CN111124541A (en) Configuration file generation method, device, equipment and medium
CN111026629A (en) Method and device for automatically generating test script
CN116185805A (en) Code detection method, device, equipment and storage medium
CN114047923A (en) Error code positioning method, device, storage medium and electronic equipment
CN110716946B (en) Method and device for updating feature rule matching library, storage medium and electronic equipment
US11853751B2 (en) Indirect function call target identification in software
CN113609309B (en) Knowledge graph construction method and device, storage medium and electronic equipment
CN117235744B (en) Source file online method, device, electronic equipment and computer readable medium
CN109446078B (en) Code testing method and device, storage medium and electronic equipment
CN114116517A (en) Front-end item analysis method, device, medium and electronic equipment
CN115878091A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination