CN115309632A - Method and device for detecting repeated codes - Google Patents



Publication number
CN115309632A
Authority
CN
China
Prior art keywords
code
function
token
leaf node
same
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210804142.8A
Other languages
Chinese (zh)
Inventor
任小伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210804142.8A priority Critical patent/CN115309632A/en
Publication of CN115309632A publication Critical patent/CN115309632A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs

Abstract

The specification discloses a method and a device for detecting repeated codes. The method comprises the following steps: replacing the text identifier in the target code with a preset token; performing syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code; detecting whether token lists of different function codes are the same; and in the case that the same token list is detected, determining a function code corresponding to the same token list as a repeated code in the target code. The repeated code detection is carried out based on the code logic, so that the repeated codes with different text identifiers such as function names and variable names can be effectively detected, the detection accuracy of the repeated codes is improved, and the difficulty of subsequent software maintenance is further reduced.

Description

Method and device for detecting repeated codes
Technical Field
The present disclosure relates to the field of software development technologies, and in particular, to a method and an apparatus for detecting a duplicate code.
Background
Often, there are many duplicate codes in large software that can increase the difficulty of software maintenance. For example, in fixing a vulnerability, some duplicate code may be missed, etc. How to detect repeated codes is an important research topic in the field of software development.
Disclosure of Invention
In view of the above, the present specification provides a method and an apparatus for detecting a duplicate code.
Specifically, the description is realized by the following technical scheme:
a method of detecting a duplicate code, comprising:
replacing the text identifier in the target code with a preset token;
performing syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code;
detecting whether the token lists of different function codes are the same;
and in the case that the same token list is detected, determining a function code corresponding to the same token list as a repeated code in the target code.
Optionally, after obtaining the token list corresponding to each function code in the target code, the method further includes:
for each segment of function code, constructing a Merkle tree corresponding to the function code according to the token list corresponding to the function code, wherein the leaf nodes of the Merkle tree are the hash values of the tokens in the token list;
determining the number of code lines covered by each non-leaf node in the Merkle tree;
acquiring a detection line number range specified for repeated code detection;
detecting whether the hash values of the non-leaf nodes in the Merkle trees that fall within the detection line number range are the same;
and in the case that the same hash value is detected, determining the function codes corresponding to the Merkle trees to which the same hash value belongs as repeated code in the target code.
Optionally, the determining the number of code lines covered by each non-leaf node in the Merkle tree includes:
determining the line position, in the function code, of the token corresponding to each leaf node in the Merkle tree;
for each non-leaf node in the Merkle tree, determining the number of code lines covered by the non-leaf node according to the line positions of the leaf nodes under the non-leaf node.
Optionally, after determining the number of code lines covered by each non-leaf node in the Merkle tree, the method further includes:
storing the hash value and the number of covered code lines of each non-leaf node in the Merkle tree as a data record in a database;
the detecting whether the hash values of the non-leaf nodes in the Merkle trees that fall within the detection line number range are the same includes:
performing uniqueness detection on the hash values of the data records in the database that fall within the detection line number range, so as to detect whether the hash values of the non-leaf nodes in the Merkle trees that fall within the detection line number range are the same.
Optionally, the text identifier includes a function name and a variable name, and the replacing of the text identifier in the target code with a preset token includes:
replacing the function name in the target code with a preset function name token;
and replacing the variable name in the target code with a preset variable name token.
Optionally, before replacing the text identifier in the target code with the preset token, the method further includes:
and filtering out log printing statements and code comments in the target code.
Optionally, the method further includes: and outputting the function name of the repeated code so as to locate the repeated code.
An apparatus for detecting a repetitive code, comprising:
the character replacing unit is used for replacing the text identifier in the target code with a preset token;
the code analysis unit is used for carrying out syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code;
the token detection unit is used for detecting whether the token lists of different function codes are the same or not;
and the repeated determining unit determines the function codes corresponding to the same token list as the repeated codes in the target codes when the same token list is detected.
An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the aforementioned method by executing the executable instructions.
A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method as previously described.
By adopting the embodiment, the text identifier in the target code can be replaced by the preset token, so that the expression of the text identifier in the target code is unified, the grammar of the replaced target code is analyzed, the token list which can represent the code logic and corresponds to each section of function code in the target code is obtained, and then the detection of the repeated code is realized by detecting whether the token lists of different function codes are the same.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting a duplicate code according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating another method for detecting a duplicate code according to an exemplary embodiment of the present disclosure.
Fig. 3 is a diagram illustrating a Merkle tree corresponding to function code in an exemplary embodiment of the present specification.
Fig. 4 is a hardware configuration diagram of an electronic device in which a device for detecting a repetitive code is provided according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of a device for detecting a duplicate code according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present description. The word "if" as used herein may be interpreted as "at the time of …", "when …", or "in response to a determination", depending on the context.
Often, there are many duplicate codes in large software that can increase the difficulty of software maintenance. For example, when a bug is fixed, some duplicate code may be missed, etc. How to detect repeated codes is an important research topic in the field of software development.
At present, the detection method in the industry can only detect completely repeated codes, the number of the repeated codes detected on the whole is small, and the requirement for reducing the software maintenance difficulty cannot be met.
The description provides a detection scheme of repeated codes, can effectively detect the repeated codes with different function names or variable names, improves the detection accuracy of the repeated codes, and further reduces the difficulty of subsequent software maintenance.
Fig. 1 is a flowchart illustrating a method for detecting a duplicate code according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, the method for detecting a repetition code may include the following steps:
Step 102, replacing the text identifier in the target code with a preset token.
In this specification, the target code is the code on which repeated code detection is to be performed. The target code usually includes a plurality of functions, and the function code corresponding to each function includes a plurality of characters, for example function names, variable names, and grammar tokens (i.e., grammar keywords). Characters in function code can generally be classified into two types. One type is related to the code logic and belongs to the logic type of characters, for example grammar tokens. The other type is usually unrelated to the code logic, for example function names, variable names, and other text identifiers.
In this specification, the text identifier in the target code may be replaced with a preset token. To distinguish the types of text identifiers, a text identifier such as a function name may be replaced with a preset function name token, such as funcToken, and a text identifier such as a variable name may be replaced with a preset variable name token, such as variableToken.
It should be noted that a "function" in the code described in this specification may also be referred to as a "method", and the name used may differ between programming languages. This specification uses "function" throughout, but this does not limit the specification to any particular programming language.
Step 104, performing syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code.
Based on the foregoing step 102, after the text identifier in the target code is replaced, parsing may be performed on the replaced target code by taking a function as a unit, and a token is extracted from each segment of function code of the target code, so as to obtain a token list corresponding to the function code, where the token list may embody a code logic corresponding to the function code.
The extracting of the token from the function code generally refers to extracting key information such as a syntax token, a function name token, and a variable name token from the function code.
In this specification, each piece of function code may be parsed, and each token included in the function code may be extracted from it, so as to obtain a token list corresponding to the function code. In other examples, the target code may be parsed in units of classes, which is not particularly limited in this specification.
For example, assume a piece of function code in the target code is:
[Figure: example function code; image not reproduced in this text]
For this section of function code, the function name call is replaced with the function name token funcToken, and the variable names maxVal, sum, and so on are replaced with the variable name token variableToken. Syntax analysis is then performed, and the token list corresponding to the function code can be extracted:
int、funcToken、int、variableToken、{、int、variableToken、=、0、;、for、(、int、variableToken、++、variableToken、)、{、variableToken、+=、variableToken、}、return、variableToken、}
where the enumeration comma (、) is the separator between tokens.
In this step, a token list corresponding to each segment of function code in the target code can be obtained.
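Steps 102 and 104 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: it assumes C-like source, uses a regular-expression lexer in place of real syntax analysis, and treats any identifier followed by "(" as a function name. The names `KEYWORDS`, `TOKEN_RE`, and `token_list` are hypothetical.

```python
import re

# Hypothetical keyword set for a C-like language; a real lexer would use the full list.
KEYWORDS = {"int", "for", "return", "if", "else", "while", "void"}

# Token classes, tried in order: identifiers/keywords, integer literals,
# two-character operators, then any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|\+=|-=|==|!=|<=|>=|&&|\|\||\S")


def token_list(function_code: str) -> list[str]:
    """Return the normalized token list of one function's source text."""
    raw = TOKEN_RE.findall(function_code)
    tokens = []
    for i, tok in enumerate(raw):
        is_identifier = re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS
        if not is_identifier:
            tokens.append(tok)  # keywords, literals, and punctuation carry the code logic
        elif i + 1 < len(raw) and raw[i + 1] == "(":
            tokens.append("funcToken")  # assumption: identifier before "(" is a function name
        else:
            tokens.append("variableToken")
    return tokens
```

Two functions that differ only in their function and variable names, such as `int add(int a, int b) { return a + b; }` and `int plus(int x, int y) { return x + y; }`, produce identical token lists under this sketch.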
Step 106, checking whether the token lists of different function codes are the same.
In this specification, it is possible to detect whether or not token lists corresponding to different function codes are the same. For example, for each piece of function code, it may be detected whether the token list corresponding to the piece of function code is the same as the token list corresponding to any other piece of function code, so as to achieve detection whether the token lists between any two pieces of function codes in all the function codes are the same.
In one example, an edit distance algorithm may be used to calculate the edit distance between any two token lists, and when the edit distance is less than an edit distance threshold, the two token lists may be determined to be the same.
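The edit distance comparison can be sketched as below. The specification does not name a particular algorithm, so the Levenshtein distance is used here as an assumption, and the default threshold of 3 in the hypothetical `is_same` helper is purely illustrative.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token lists (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,           # delete tok_a
                            curr[j - 1] + 1,       # insert tok_b
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def is_same(a: list[str], b: list[str], threshold: int = 3) -> bool:
    """Treat two token lists as the same when their distance is below the threshold."""
    return edit_distance(a, b) < threshold
```

Because every pair of token lists must be compared, this approach costs more than the hash-based comparison, but it tolerates small differences between functions.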
In another example, the hash value of each token list may be calculated separately, and then whether the hash values of the token lists of different function codes are the same or not is detected, and when the hash values are the same, it may be determined that the two token lists are the same.
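A minimal sketch of the hash-based comparison, assuming SHA-256 as the hash function (the specification does not name one). The helper names `list_hash` and `find_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict


def list_hash(tokens: list[str]) -> str:
    """Hash a token list; the NUL separator keeps ["a", "b"] distinct from ["ab"]."""
    return hashlib.sha256("\x00".join(tokens).encode("utf-8")).hexdigest()


def find_duplicates(token_lists: dict[str, list[str]]) -> list[list[str]]:
    """Take a mapping of function name to token list; return groups of names sharing a hash."""
    groups = defaultdict(list)
    for name, tokens in token_lists.items():
        groups[list_hash(tokens)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Grouping by hash needs one pass over the functions rather than a comparison of every pair, and the returned function names can be reported directly to locate the repeated code.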
Of course, in other examples, other methods may also be used to detect whether the token lists of different function codes are the same, for example, similarity between any two token lists may be calculated, and whether the two token lists are the same is detected according to the similarity, which is not limited in this specification.
Step 108, in the case that the same token list is detected, determining the function codes corresponding to the same token list as repeated code in the target code.
In this specification, a detected identical token list may correspond to two segments of function code, or to three or more segments. Regardless of how many segments it corresponds to, those function codes may be determined to be repetitions of one another.
Function code      Token list
Function code 1    Token list A
Function code 2    Token list A
Function code 3    Token list A
Function code 4    Token list C
TABLE 1
Referring to the example of Table 1, assume that the target code includes 4 functions, corresponding to function code 1 through function code 4. The token lists of function code 1 through function code 3 are all token list A, while the token list of function code 4 is token list C, which differs from token list A. It can then be determined that function code 1, function code 2, and function code 3 repeat one another and are repeated code in the target code.
In this specification, after the duplicate code is determined, the function name of the duplicate code may also be output so as to locate the duplicate code in the target code.
The description shows that the text identifier in the target code can be replaced by the preset token, so that the expression of the text identifier in the target code is unified, the syntax analysis is performed on the replaced target code, the token list which can represent the code logic and corresponds to each section of function code in the target code is obtained, and then the detection of the repeated code is realized by detecting whether the token lists of different function codes are the same.
In practical application, the target code usually also includes some codes irrelevant to code logic, such as log printing statements, code comments and the like, which do not affect the processing logic of the codes, and when the detection of the repeated codes is performed, the codes can be filtered from the target code, and then the filtered codes are detected, thereby further improving the accuracy of the detection of the repeated codes.
A log printing statement is usually a statement beginning with printf, for example: printf("sum = %d\n", sum). A code comment typically begins with "//" or "/*", and a comment that begins with "/*" ends with "*/".
In this specification, for target code on which repeated code detection is to be performed, the log printing statements and code comments unrelated to the code logic may be filtered out first, and the text identifiers may then be replaced. Of course, in other examples, the text identifiers in the target code may be replaced first, and the log printing statements and code comments filtered out afterwards, which is not limited in this specification.
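The filtering step can be sketched as below for C-like source. This is a simplified illustration under stated assumptions: a log statement is taken to be a whole line starting with a `printf` call, and comment markers inside string literals are not handled, which a real implementation would need to address. The names `COMMENT_RE`, `LOG_RE`, and `strip_noise` are hypothetical.

```python
import re

# Block comments (/* ... */) may span lines; line comments run to the end of the line.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Assumption: a log statement occupies a whole line that starts with a printf call.
LOG_RE = re.compile(r"^[ \t]*printf\s*\(.*\)\s*;[ \t]*$", re.MULTILINE)


def strip_noise(source: str) -> str:
    """Remove comments and printf-style log lines, then drop lines left blank."""
    source = COMMENT_RE.sub("", source)
    source = LOG_RE.sub("", source)
    return "\n".join(line for line in source.splitlines() if line.strip())
```

Note that dropping blank lines changes line positions, so if line counts are needed later they would have to be taken from the filtered text.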
The specification also provides a duplicate code detection scheme capable of specifying the number of code lines.
Referring to fig. 2, the method for detecting the repetition code may include the following steps:
Step 202, replacing the text identifier in the target code with a preset token.
Step 204, performing syntax analysis on the replaced target code by taking the function as a unit to obtain a token list corresponding to each section of function code in the target code.
In this specification, for the implementation of step 202 and step 204, refer to step 102 and step 104 in the foregoing embodiment of fig. 1; details are not repeated here.
Step 206, for each segment of function code, constructing a Merkle tree corresponding to the function code according to the token list corresponding to the function code, where the leaf nodes of the Merkle tree are the hash values of the tokens in the token list.
In this specification, for each piece of function code, a Merkle tree corresponding to the function code may be constructed according to the token list corresponding to the function code.
A Merkle tree, also called a hash tree, is a binary tree whose nodes can be divided into leaf nodes and non-leaf nodes; the non-leaf nodes include the root node and intermediate nodes. A leaf node may store the hash value of a piece of data, while a non-leaf node may store a hash value computed from the hash values of its two child nodes.
In this specification, for each piece of function code, the hash value of each token in the token list corresponding to the function code may be used as a leaf node of the Merkle tree, so that the leaf nodes of the Merkle tree are constructed first, and the non-leaf nodes are then constructed upward from the leaf nodes.
For example, assuming that there are 8 tokens in the token list corresponding to a certain function code, referring to fig. 3, these 8 tokens may correspond in turn to leaf nodes 1-8 of the Merkle tree shown in fig. 3, and the construction of the Merkle tree continues upward from leaf nodes 1-8.
Node of Merkle tree    Token      Hash value
Leaf node 1            Token 1    H1
Leaf node 2            Token 2    H2
Leaf node 3            Token 3    H3
Leaf node 4            Token 4    H4
Leaf node 5            Token 5    H5
Leaf node 6            Token 6    H6
Leaf node 7            Token 7    H7
Leaf node 8            Token 8    H8
TABLE 2
Assuming that 8 tokens in the token list corresponding to the function code are token 1-token 8, please refer to table 2, a leaf node 1 may be a hash value H1 of token 1, a leaf node 2 may be a hash value H2 of token 2, and so on, a leaf node 8 may be a hash value H8 of token 8.
Node of Merkle tree    Hash value
Non-leaf node 9        H12
Non-leaf node 10       H34
Non-leaf node 11       H56
Non-leaf node 12       H78
TABLE 3
Then, non-leaf nodes 9-12 may be constructed based on leaf nodes 1-8. Referring to table 3, the non-leaf node 9 is the hash value H12 calculated after the hash value H1 of the leaf node 1 and the hash value H2 of the leaf node 2 are spliced, the non-leaf node 10 is the hash value H34 calculated after the hash value H3 of the leaf node 3 and the hash value H4 of the leaf node 4 are spliced, and so on, the hash value of the non-leaf node 12 is H78.
Similarly, non-leaf nodes 16-18 may continue to be constructed and their hash values determined, where non-leaf node 18 is also referred to as the root node.
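The bottom-up construction described above can be sketched as follows. SHA-256 and the handling of an unpaired last node (it is paired with itself) are assumptions, since the specification does not state either; `h` and `build_merkle_levels` are hypothetical names.

```python
import hashlib


def h(data: str) -> str:
    """Hex SHA-256 digest of a string."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build_merkle_levels(tokens: list[str]) -> list[list[str]]:
    """Return the tree level by level: levels[0] holds leaf hashes, levels[-1] the root."""
    if not tokens:
        raise ValueError("token list must not be empty")
    levels = [[h(tok) for tok in tokens]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        above = []
        for i in range(0, len(below), 2):
            left = below[i]
            # Assumption: an unpaired last node is paired with itself.
            right = below[i + 1] if i + 1 < len(below) else left
            above.append(h(left + right))  # hash of the two spliced child hashes
        levels.append(above)
    return levels
```

For a token list of 8 tokens this yields levels of 8, 4, 2, and 1 nodes, and the first node of the second level corresponds to H12 in Table 3.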
Step 208, determining the number of code lines covered by each non-leaf node in the Merkle tree.
In this specification, the number of code lines covered by each non-leaf node can be determined according to the line position, in the function code, of the token corresponding to each leaf node in the Merkle tree.
The line position of a token is the number of the line on which the token is located in the function code. For example, if a token is located on the first line of the function code, its line position is 1, and if a token is located on the third line of the function code, its line position is 3.
In this specification, for each non-leaf node in the Merkle tree, the number of code lines covered by the non-leaf node is determined according to the line positions of the leaf nodes under that non-leaf node. That is, the number of code lines of a non-leaf node represents the number of lines, in the function code to which the Merkle tree belongs, spanned by the tokens under that node. It is calculated by subtracting the minimum line position from the maximum line position among those tokens and then adding 1.
Node of Merkle tree    Number of code lines covered
Non-leaf node 9        3
Non-leaf node 10       2
Non-leaf node 13       9
TABLE 4
For example, still taking the Merkle tree shown in fig. 3, assume that the line positions in the function code of the tokens corresponding to leaf nodes 1-4 are 1, 3, 8 and 9 respectively. Referring to Table 4, the number of code lines covered by non-leaf node 9 is the number of lines spanned by token 1 and token 2, which correspond to the leaf nodes under it. The line position of token 1 is 1 and the line position of token 2 is 3, so 3 lines of code are covered, that is, the number of code lines covered by non-leaf node 9 is 3. Similarly, non-leaf node 10 covers 2 lines of code. For non-leaf node 13, the leaf nodes under it are leaf node 1 through leaf node 4. The minimum line position 1 of leaf node 1 is subtracted from the maximum line position 9 of leaf node 4 to obtain 8, and 1 is then added, so the number of code lines covered by non-leaf node 13 is 9.
In this specification, the number of code lines covered by the root node of the Merkle tree is generally the number of code lines of the entire function code, not counting log printing statements and code comments.
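The line coverage calculation (maximum line position minus minimum line position, plus 1) can be sketched as below. The function name `covered_lines` and the level-by-level return shape are illustrative choices, not taken from the specification.

```python
def covered_lines(leaf_rows: list[int]) -> list[list[int]]:
    """For each level above the leaves, return the line count covered by each node."""
    spans = [(row, row) for row in leaf_rows]  # (min line, max line) per leaf
    counts = []
    while len(spans) > 1:
        merged = []
        for i in range(0, len(spans), 2):
            pair = spans[i:i + 2]  # one or two child spans
            lo = min(s[0] for s in pair)
            hi = max(s[1] for s in pair)
            merged.append((lo, hi))
        spans = merged
        counts.append([hi - lo + 1 for lo, hi in spans])
    return counts
```

With the line positions 1, 3, 8 and 9 from the example, `covered_lines([1, 3, 8, 9])` returns `[[3, 2], [9]]`, matching the values for non-leaf nodes 9, 10 and 13 in Table 4.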
Step 210, acquiring a detection line number range specified for repeated code detection.
In practical applications, if a section of code has only a few lines, there is little value in checking it for repetition. In addition, the related art detects repeated code only in units of whole functions, which is inflexible. In this specification, the range of line counts to be detected can be specified in advance when repeated code detection is performed, so that repeated code of the specified size is detected. Compared with the function-based repeated code detection in the related art, this is more flexible.
Assuming that each segment of function code in the target code has 50-70 lines, with the technical solution provided in this specification a technician may specify the detection line number range in advance, for example 10-40 lines. Repetition detection is then performed on code of between 10 and 40 lines, and not on code of fewer than 10 lines or more than 40 lines. Of course, the detection line number range may also be, for example, more than 20 lines, which is not particularly limited in this specification.
Step 212, detecting whether the hash values of the non-leaf nodes in the Merkle trees that fall within the detection line number range are the same.
Based on the foregoing step 210, after the detection line number range is obtained, the non-leaf nodes whose covered code line counts fall within the detection line number range may be obtained from the Merkle tree corresponding to each function code, and it may then be detected whether the hash values of these non-leaf nodes are the same. During detection, it may be detected whether the hash values of non-leaf nodes from different Merkle trees are the same.
Step 214, in the case that the same hash value is detected, determining the function codes corresponding to the Merkle trees to which the same hash value belongs as repeated code in the target code.
Based on the detection result of the foregoing step 212, in the case that the same hash value is detected, the function codes corresponding to the Merkle trees to which the same hash value belongs may be determined as repeated code in the target code.
Merkle tree      Non-leaf node         Hash value
Merkle tree A    Non-leaf node A-12    153512345
Merkle tree B    Non-leaf node B-18    153512345
Merkle tree C    Non-leaf node C-20    168665459
TABLE 5
For example, referring to Table 5, assume that the numbers of code lines covered by non-leaf node A-12 of Merkle tree A, non-leaf node B-18 of Merkle tree B, and non-leaf node C-20 of Merkle tree C all fall within the detection line number range obtained in the foregoing step. Since the hash values of non-leaf node A-12 and non-leaf node B-18 are the same, the function codes corresponding to Merkle tree A and Merkle tree B may be determined as repeated code in the target code.
It should be noted that the hash values in table 5 are merely exemplary, and do not represent actual hash values.
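Steps 212 and 214 can be sketched in memory as below. The input shape (a tuple of function name, node hash, and covered line count per non-leaf node) and the name `duplicate_functions` are assumptions made for illustration; the line counts used in the usage note are illustrative as well.

```python
from collections import defaultdict


def duplicate_functions(nodes, min_lines, max_lines):
    """nodes: iterable of (function name, node hash, covered lines) for non-leaf nodes.

    Returns, for each hash shared by at least two different functions,
    the sorted names of those functions.
    """
    by_hash = defaultdict(set)
    for func, node_hash, lines in nodes:
        if min_lines <= lines <= max_lines:  # keep only nodes inside the detection range
            by_hash[node_hash].add(func)
    return {hv: sorted(funcs) for hv, funcs in by_hash.items() if len(funcs) > 1}
```

Given the three nodes of Table 5 with illustrative line counts of 28, 28 and 36 and a detection range of 10-40 lines, the call returns the functions of Merkle tree A and Merkle tree B grouped under the hash value 153512345.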
As can be seen from the above description, this specification may construct a Merkle tree according to the token list corresponding to a function code, and may also determine the number of code lines covered by each non-leaf node in the Merkle tree, so that a detection line number range may be specified when detecting repeated code, thereby improving the flexibility of repeated code detection.
In practical applications, when the amount of target code is large, the number of non-leaf nodes within the detection line number range is also large, and the device memory may not have the capacity to support detection on that scale. In another example of this specification, the hash values of the non-leaf nodes in the Merkle tree and the numbers of code lines they cover may be stored in a database.
For example, the hash value of each non-leaf node, the number of code lines it covers, and the function name of the function code corresponding to the Merkle tree to which the non-leaf node belongs may be stored as a data record in a database.
Hash value    Number of code lines covered    Function name
153512345     28                              Function A
153512345     28                              Function B
168665459     36                              Function C
TABLE 6
Still taking table 5 as an example, the above information of 3 non-leaf nodes in table 5 is stored in the database, and 3 database records shown in table 6 can be obtained.
When repeated code detection is carried out, database records in a range of matching detection line numbers can be screened from a database, then the database records can be summarized into a data table, and then the uniqueness detection is carried out on the hash value field in the data table, so that the same hash value is detected. Then, the function names of the same hash values in the data table can be output to a technician for viewing, so that the technician can quickly locate the repeated codes in the target codes according to the function names.
With this embodiment, the information of each non-leaf node in the Merkle tree is stored in a database, and repeated code can be detected by performing uniqueness detection on a database field, which effectively addresses insufficient memory capacity when a large amount of code must be checked for repetition.
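The database-backed variant can be sketched with Python's built-in `sqlite3` module. The specification does not name a database, so SQLite, the table name `node`, and the column names are all assumptions; the rows are those of Table 6.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for large code bases
conn.execute(
    "CREATE TABLE node (hash_value TEXT, covered_lines INTEGER, function_name TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        ("153512345", 28, "Function A"),
        ("153512345", 28, "Function B"),
        ("168665459", 36, "Function C"),
    ],
)


def repeated_functions(conn, min_lines, max_lines):
    """Return (hash value, function name) rows whose hash is shared by several functions."""
    query = """
        SELECT hash_value, function_name FROM node
        WHERE covered_lines BETWEEN :lo AND :hi
          AND hash_value IN (
              SELECT hash_value FROM node
              WHERE covered_lines BETWEEN :lo AND :hi
              GROUP BY hash_value
              HAVING COUNT(DISTINCT function_name) > 1)
        ORDER BY hash_value, function_name
    """
    return conn.execute(query, {"lo": min_lines, "hi": max_lines}).fetchall()
```

With a detection range of 10-40 lines, the query returns the rows for Function A and Function B, whose names can then be shown to a technician to locate the repeated code. Counting distinct function names rather than rows avoids reporting a hash that repeats only within a single function.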
In correspondence with the foregoing embodiments of the method for detecting a repetitive code, the present specification also provides embodiments of a device for detecting a repetitive code.
The embodiment of the device for detecting repeated code in this specification can be applied to an electronic device. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed when the processor of the electronic device in which it is located reads the corresponding computer program instructions from the nonvolatile memory into the memory and runs them. In terms of hardware, fig. 4 shows the hardware structure of the electronic device in which the device for detecting repeated code is located. In addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 4, the electronic device may include other hardware according to its actual function, which is not described again here.
Fig. 5 is a block diagram of a device for detecting a repetitive code according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, the device for detecting a repetitive code may be applied to the electronic device shown in fig. 4, and includes: the device comprises a character replacing unit, a code analyzing unit, a token detecting unit and a repeated determining unit.
The character replacing unit is configured to replace the text identifier in the target code with a preset token;
the code analysis unit is configured to perform syntax analysis on the replaced target code function by function to obtain a token list corresponding to each piece of function code in the target code;
the token detection unit is configured to detect whether the token lists of different pieces of function code are the same;
and the repetition determining unit is configured to determine, when identical token lists are detected, the function code corresponding to the identical token lists as repeated code in the target code.
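The work of the token detection unit and the repetition determining unit can be illustrated with a short sketch. It assumes the token lists have already been produced by the earlier units and are supplied as a mapping from function name to token list; the function `find_repeated_functions` and the sample token values are hypothetical.

```python
from collections import defaultdict


def find_repeated_functions(token_lists):
    """Group function names whose token lists are identical.

    token_lists maps a function name to its list of tokens (identifiers
    already replaced by preset tokens). Returns a list of name groups,
    each containing two or more functions with the same token list.
    """
    groups = defaultdict(list)
    for name, tokens in token_lists.items():
        # A tuple is hashable, so it can serve as the dictionary key.
        groups[tuple(tokens)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical token lists: "add" and "sum_two" differ only in their names.
token_lists = {
    "add": ["def", "FUNC", "(", "VAR", ",", "VAR", ")", ":", "return", "VAR", "+", "VAR"],
    "sum_two": ["def", "FUNC", "(", "VAR", ",", "VAR", ")", ":", "return", "VAR", "+", "VAR"],
    "negate": ["def", "FUNC", "(", "VAR", ")", ":", "return", "-", "VAR"],
}
repeated = find_repeated_functions(token_lists)
```

Here `add` and `sum_two` are grouped together because their token lists match once the names have been normalized, while `negate` is left out.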
Optionally, after obtaining the token list corresponding to each piece of function code in the target code, the detection of repeated code further includes:
for each piece of function code, constructing a Merkle tree corresponding to the function code according to the token list corresponding to the function code, wherein the leaf nodes of the Merkle tree are the hash values of the tokens in the token list;
determining the number of code lines covered by each non-leaf node in the Merkle tree;
acquiring the detection line number range specified for repeated code detection;
detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees;
and in the case that identical hash values are detected, determining the function code corresponding to the Merkle trees to which the identical hash values belong as repeated code in the target code.
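A Merkle tree over a token list can be sketched as below. The specification does not state the hash function or how an unpaired node is handled, so this sketch assumes SHA-256, pairwise grouping, and hashing an unpaired last node on its own; `sha256_hex` and `build_merkle_levels` are hypothetical helper names.

```python
import hashlib


def sha256_hex(text):
    """Return the hexadecimal SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_merkle_levels(tokens):
    """Build a Merkle tree bottom-up and return it as a list of levels.

    levels[0] holds the leaf hashes (one per token) and levels[-1] holds
    the single root hash.
    """
    if not tokens:
        raise ValueError("token list must not be empty")
    level = [sha256_hex(token) for token in tokens]
    levels = [level]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            # Two children are combined; an unpaired last node is hashed
            # alone (an assumption of this sketch).
            parents.append(sha256_hex("".join(level[i:i + 2])))
        level = parents
        levels.append(level)
    return levels


first = build_merkle_levels(["def", "FUNC", "(", "VAR", ")", ":", "return", "VAR"])
second = build_merkle_levels(["def", "FUNC", "(", "VAR", ")", ":", "return", "VAR"])
third = build_merkle_levels(["def", "FUNC", "(", "VAR", ")", ":", "return", "NUM"])

same_root = first[-1][0] == second[-1][0]
shared_subtree = first[2][0] == third[2][0]
different_root = first[-1][0] != third[-1][0]
```

Identical token lists yield identical root hashes, and token lists that share only part of their content still share the hashes of the matching subtrees. This is what allows non-leaf nodes, and not just whole functions, to be compared.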
Optionally, the determining the number of code lines covered by each non-leaf node in the Merkle tree includes:
determining the line position, in the function code, of the token corresponding to each leaf node in the Merkle tree;
and for each non-leaf node in the Merkle tree, determining the number of code lines covered by the non-leaf node according to the line positions of the leaf nodes beneath it.
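One way to derive the line coverage of each non-leaf node from the leaf line positions is sketched below. It assumes a tree built by pairing adjacent nodes level by level and defines coverage as the span from the first to the last line among the leaves beneath a node; both choices, and the name `covered_line_counts`, are assumptions made for illustration.

```python
def covered_line_counts(leaf_lines):
    """Compute the number of code lines covered by each non-leaf node.

    leaf_lines[i] is the line on which the token of leaf i appears.
    Returns one list per non-leaf level, ordered from the level just
    above the leaves up to the root.
    """
    if not leaf_lines:
        raise ValueError("leaf_lines must not be empty")
    # Each node is tracked as (first_line, last_line) of the leaves below it.
    level = [(line, line) for line in leaf_lines]
    counts = []
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            group = level[i:i + 2]
            first_line = min(start for start, _ in group)
            last_line = max(end for _, end in group)
            parents.append((first_line, last_line))
        level = parents
        counts.append([end - start + 1 for start, end in level])
    return counts


# Six tokens: three on line 1, two on line 2, one on line 3.
counts = covered_line_counts([1, 1, 1, 2, 2, 3])
```

For this input the lowest non-leaf level covers 1, 2, and 2 lines, the next level covers 2 and 2 lines, and the root covers all 3 lines. These counts are what would be compared against the detection line number range.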
Optionally, after determining the number of code lines covered by each non-leaf node in the Merkle tree, the detection of repeated code further includes:
storing the hash value and the number of code lines of each non-leaf node in the Merkle tree as a data record in a database;
the detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees includes:
performing uniqueness detection on the hash values of the data records in the database that fall within the detection line number range, so as to detect whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees.
Optionally, the text identifier includes a function name and a variable name, and the replacing the text identifier in the target code with a preset token includes:
replacing the function name in the target code with a preset function name token;
and replacing the variable name in the target code with a preset variable name token.
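The identifier replacement can be illustrated for Python source using the standard `tokenize` module. This is only a sketch under simplifying assumptions: the target language is taken to be Python, the replacement is applied to the token stream and not to the raw text, the name following `def` is treated as the function name, and every other non-keyword name is treated as a variable. The preset tokens `FUNC` and `VAR` and the function `normalize_identifiers` are hypothetical.

```python
import io
import keyword
import tokenize

FUNC_TOKEN = "FUNC"  # preset function name token (hypothetical value)
VAR_TOKEN = "VAR"    # preset variable name token (hypothetical value)

# Layout-only tokens that carry no code logic in this sketch.
SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}


def normalize_identifiers(source):
    """Return the token list of source with names replaced by preset tokens."""
    result = []
    previous = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        text = tok.string
        if tok.type == tokenize.NAME and not keyword.iskeyword(text):
            # The name right after "def" is the function name; all other
            # non-keyword names are treated as variables (a simplification).
            text = FUNC_TOKEN if previous == "def" else VAR_TOKEN
        result.append(text)
        previous = tok.string
    return result


tokens_a = normalize_identifiers("def add(x, y):\n    return x + y\n")
tokens_b = normalize_identifiers("def plus(left, right):\n    return left + right\n")
```

The two functions differ in every name they use, yet they produce the same token list, which is the property the detection relies on.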
Optionally, before replacing the text identifier in the target code with a preset token, the detection of repeated code further includes:
filtering out the log printing statements and the code comments in the target code.
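A simple line-based filter for log statements and comments might look like the following. It assumes Python source, full-line `#` comments, and single-line calls to `print` or to objects named `logging`, `logger`, or `log`; these patterns and the name `strip_logs_and_comments` are illustrative assumptions, and trailing comments or multi-line log calls are not handled.

```python
import re

# Lines that start with a logging-style call (assumed naming conventions).
LOG_CALL = re.compile(r"^\s*(print|logging\.\w+|logger\.\w+|log\.\w+)\s*\(")
# Lines that consist only of a comment.
COMMENT = re.compile(r"^\s*#")


def strip_logs_and_comments(source):
    """Drop full-line comments and single-line log statements from source."""
    kept = []
    for line in source.splitlines():
        if LOG_CALL.match(line) or COMMENT.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


source = (
    "def double(x):\n"
    "    # double the input\n"
    "    logger.info('doubling %s', x)\n"
    "    return x * 2\n"
)
filtered = strip_logs_and_comments(source)
```

After filtering, only the function header and the return statement remain, so two functions that differ only in their logging or comments are tokenized identically. A production filter would more likely work on the syntax tree, since removing the only statement of a block this way can leave invalid code.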
Optionally, the detection of repeated code further includes:
outputting the function name of the repeated code so that the repeated code can be located.
For the specific implementation of the functions and roles of each unit in the above device, refer to the implementation of the corresponding steps in the above method, which is not described again here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Corresponding to the foregoing embodiments of the method for detecting repeated code, this specification also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
replacing the text identifier in the target code with a preset token;
performing syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code;
detecting whether the token lists of different function codes are the same;
and in the case that the same token list is detected, determining a function code corresponding to the same token list as a repeated code in the target code.
Optionally, after obtaining the token list corresponding to each piece of function code in the target code, the method further includes:
for each piece of function code, constructing a Merkle tree corresponding to the function code according to the token list corresponding to the function code, wherein the leaf nodes of the Merkle tree are the hash values of the tokens in the token list;
determining the number of code lines covered by each non-leaf node in the Merkle tree;
acquiring the detection line number range specified for repeated code detection;
detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees;
and in the case that identical hash values are detected, determining the function code corresponding to the Merkle trees to which the identical hash values belong as repeated code in the target code.
Optionally, the determining the number of code lines covered by each non-leaf node in the Merkle tree includes:
determining the line position, in the function code, of the token corresponding to each leaf node in the Merkle tree;
and for each non-leaf node in the Merkle tree, determining the number of code lines covered by the non-leaf node according to the line positions of the leaf nodes beneath it.
Optionally, after determining the number of code lines covered by each non-leaf node in the Merkle tree, the method further includes:
storing the hash value and the number of code lines of each non-leaf node in the Merkle tree as a data record in a database;
the detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees includes:
performing uniqueness detection on the hash values of the data records in the database that fall within the detection line number range, so as to detect whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees.
Optionally, the text identifier includes a function name and a variable name, and the replacing of the text identifier in the target code with a preset token includes:
replacing the function name in the target code with a preset function name token;
and replacing the variable name in the target code with a preset variable name token.
Optionally, before replacing the text identifier in the target code with the preset token, the method further includes:
and filtering out log printing statements and code comments in the target code.
Optionally, the method further includes: outputting the function name of the repeated code so that the repeated code can be located.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A method of detecting repeated code, comprising:
replacing the text identifier in the target code with a preset token;
performing syntax analysis on the replaced target code by taking a function as a unit to obtain a token list corresponding to each section of function code in the target code;
detecting whether the token lists of different function codes are the same;
and in the case that the same token list is detected, determining a function code corresponding to the same token list as a repeated code in the target code.
2. The method of claim 1, after obtaining the token list corresponding to each piece of function code in the target code, the method further comprising:
for each piece of function code, constructing a Merkle tree corresponding to the function code according to the token list corresponding to the function code, wherein the leaf nodes of the Merkle tree are the hash values of the tokens in the token list;
determining the number of code lines covered by each non-leaf node in the Merkle tree;
acquiring the detection line number range specified for repeated code detection;
detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees;
and in the case that identical hash values are detected, determining the function code corresponding to the Merkle trees to which the identical hash values belong as repeated code in the target code.
3. The method of claim 2, wherein said determining the number of code lines covered by each non-leaf node in the Merkle tree comprises:
determining the line position, in the function code, of the token corresponding to each leaf node in the Merkle tree;
and for each non-leaf node in the Merkle tree, determining the number of code lines covered by the non-leaf node according to the line positions of the leaf nodes beneath it.
4. The method of claim 2, after determining the number of code lines covered by each non-leaf node in the Merkle tree, the method further comprising:
storing the hash value and the number of code lines of each non-leaf node in the Merkle tree as a data record in a database;
the detecting whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees includes:
performing uniqueness detection on the hash values of the data records in the database that fall within the detection line number range, so as to detect whether the hash values of the non-leaf nodes that fall within the detection line number range are the same across the Merkle trees.
5. The method of claim 1, the text identifier comprising a function name and a variable name, the replacing the text identifier in the target code with a preset token comprising:
replacing the function name in the target code with a preset function name token;
and replacing the variable name in the target code with a preset variable name token.
6. The method of claim 1, prior to replacing the text identifier in the target code with a preset token, further comprising:
and filtering out log printing statements and code comments in the target code.
7. The method of claim 1, further comprising:
and outputting the function name of the repeated code so as to locate the repeated code.
8. An apparatus for detecting repeated code, comprising:
a character replacing unit, configured to replace the text identifier in the target code with a preset token;
a code analysis unit, configured to perform syntax analysis on the replaced target code function by function to obtain a token list corresponding to each piece of function code in the target code;
a token detection unit, configured to detect whether the token lists of different pieces of function code are the same;
and a repetition determining unit, configured to determine, in the case that identical token lists are detected, the function code corresponding to the identical token lists as repeated code in the target code.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1-7 by executing the executable instructions.
10. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202210804142.8A 2022-07-07 2022-07-07 Method and device for detecting repeated codes Pending CN115309632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210804142.8A CN115309632A (en) 2022-07-07 2022-07-07 Method and device for detecting repeated codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210804142.8A CN115309632A (en) 2022-07-07 2022-07-07 Method and device for detecting repeated codes

Publications (1)

Publication Number Publication Date
CN115309632A true CN115309632A (en) 2022-11-08

Family

ID=83857758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210804142.8A Pending CN115309632A (en) 2022-07-07 2022-07-07 Method and device for detecting repeated codes

Country Status (1)

Country Link
CN (1) CN115309632A (en)

Similar Documents

Publication Publication Date Title
CN109697162B (en) Software defect automatic detection method based on open source code library
Xin et al. Production machine learning pipelines: Empirical analysis and optimization opportunities
AU2021269302C1 (en) System and method for coupled detection of syntax and semantics for natural language understanding and generation
US20150066814A1 (en) Sentiment Analysis of Data Logs
US20120072988A1 (en) Detection of global metamorphic malware variants using control and data flow analysis
CN108363634B (en) Method, device and equipment for identifying service processing failure reason
CN109918296B (en) Software automation test method and device
US10839308B2 (en) Categorizing log records at run-time
CN112036187A (en) Context-based video barrage text auditing method and system
CN111338692A (en) Vulnerability classification method and device based on vulnerability codes and electronic equipment
US20160132809A1 (en) Identifying and amalgamating conditional actions in business processes
CN109492401B (en) Content carrier risk detection method, device, equipment and medium
CN110750297A (en) Python code reference information generation method based on program analysis and text analysis
CN108681490B (en) Vector processing method, device and equipment for RPC information
CN112069052A (en) Abnormal object detection method, device, equipment and storage medium
CN115309632A (en) Method and device for detecting repeated codes
CN115796146A (en) File comparison method and device
CN114706766A (en) False alarm elimination method and device of security function, electronic equipment and storage medium
CN115774784A (en) Text object identification method and device
CN111143203B (en) Machine learning method, privacy code determination method, device and electronic equipment
CN110968500A (en) Test case execution method and device
US20150006578A1 (en) Dynamic search system
CN111898762B (en) Deep learning model catalog creation
US11727059B2 (en) Retrieval sentence utilization device and retrieval sentence utilization method
KR102382017B1 (en) Apparatus and method for malware lineage inference system with generating phylogeny

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination