Disclosure of Invention
The invention provides a method for carrying out homologous detection on code clones, which aims to solve the problems of low detection speed and a high missed-detection rate in text-based code clone detection in the prior art.
The invention provides a method for carrying out homologous detection on code clone, which comprises the following steps: carrying out code clone detection on a preset code based on a preset knowledge base to obtain a code clone detection result; and qualitatively and quantitatively analyzing the code clone detection result to determine a source file corresponding to the preset code.
Optionally, the method further comprises: establishing a preset knowledge base;
the establishing of the preset knowledge base comprises the following steps:
storing the collected source codes according to a preset format;
and extracting the characteristics of the source code to obtain the preset knowledge base.
Optionally, the performing feature extraction on the source code includes:
extracting the HASH value of the source code, preprocessing the source code through a lexical analyzer to obtain a structured text of the source code, and extracting the HASH value of the structured text of the source code and the HASH value of a code segment of the structured text;
after the feature extraction is performed on the source code, the method further includes:
and storing the mapping relation among the HASH value extracted from the source code, the HASH value obtained by the lexical analyzer, the HASH value of the code fragment level of the structured text obtained by the lexical analyzer, the source code and the relative file storage path of the source code.
Optionally, the processing rule of the lexical analyzer includes:
removing meaningless code fragments;
removing code segments of non-preset semantics and non-preset functions;
and searching and reserving keywords, reserved words and common grammars with code semantic information, and performing unified character replacement on components without code structure semantics.
Optionally, the performing, based on a preset knowledge base, code clone detection on the preset code includes:
extracting a HASH value corresponding to the source code of the preset code, preprocessing the preset code through a lexical analyzer to obtain the HASH value of its structured text, and determining a code fragment level HASH value set of the preset code; and sequentially carrying out clone collision detection between the preset code and the preset knowledge base on the HASH value extracted from the source code, the HASH value obtained by the lexical analyzer and the code fragment level HASH value set, and if any collision detection is successful, determining that the preset code is a code clone.
Optionally, performing clone collision detection of MD5 information on the preset code and the preset knowledge base, including: and extracting MD5 information of the preset code, performing clone collision detection on the MD5 information and MD5 information stored in the preset knowledge base, and if the collision is successful, determining that the preset code is code clone and the cloning degree is 100%.
Optionally, performing clone collision detection of the HASH value on the preset code and the preset knowledge base, including: preprocessing the preset code through a lexical analyzer to obtain a HASH value of the preset code, carrying out clone collision detection on the HASH value of the preset code and the HASH value stored in the preset knowledge base, and if collision is successful, determining that the preset code is a code clone and the cloning degree is 100%.
Optionally, performing clone collision detection of a code segment level HASH value set on the preset code and the preset knowledge base, including: processing the preset code through a lexical analyzer, converting the preset code into a preset character string, performing window division processing on the preset character string according to a preset character length to sequentially obtain a plurality of substring sets, sequentially and respectively generating HASH values for the substring sets, and performing deduplication processing on the generated HASH values to obtain a code fragment level feature set of the preset code;
and sequentially carrying out clone collision detection on the code segment level feature set and the code segment level HASH value set in the preset knowledge base, and if the collision is successful, determining that the preset code is a code clone.
Optionally, the code segment level feature set is sequentially subjected to clone collision detection with the code segment level HASH value set in the preset knowledge base, and if collision is successful, after the preset code is determined to be a code clone, the method further includes: and grouping HASH values with successful collision according to source files, counting the feature quantity of collision in each group, and taking the source file with the most collision feature quantity as the source file of the preset code clone.
Optionally, the code segment level feature set is sequentially subjected to clone collision detection with the code segment level HASH value set in the preset knowledge base, and if collision is successful, after the preset code is determined to be a code clone, the method further includes:
and dividing the HASH value number of the successful collision of the preset code by the feature number in the code segment level feature set of the preset code to obtain the cloning degree of the preset code.
The invention has the following beneficial effects:
the invention extracts features from the source code to establish a dedicated knowledge base, then carries out code clone detection on the preset code based on the knowledge base to obtain a code clone detection result, and then carries out qualitative and quantitative analysis on the code clone detection result to determine the source file corresponding to the preset code.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Detailed Description
Aiming at the problems of low detection speed and a high missed-detection rate in existing text-based code clone detection, the embodiment of the invention extracts features from source code to establish a dedicated knowledge base, then carries out code clone detection on a preset code based on the knowledge base to obtain a code clone detection result, and then carries out qualitative and quantitative analysis on the code clone detection result to determine a source file corresponding to the preset code. The present invention will be described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The embodiment of the invention provides a method for carrying out homologous detection on code clones, and referring to fig. 1, the method comprises the following steps:
s101, carrying out code clone detection on a preset code based on a preset knowledge base to obtain a code clone detection result;
It should be noted that, in the embodiment of the present invention, the preset knowledge base stores the feature data obtained by performing feature extraction on the collected source code.
Specifically, the collected source codes are stored according to a preset format, and then all the stored source codes are subjected to feature extraction to obtain the preset knowledge base.
The feature extraction of the source code according to the embodiment of the present invention includes: extracting a HASH value of the source code, preprocessing the source code through a lexical analyzer to obtain a structured text of the source code, extracting the HASH value of the structured text of the source code and the HASH value of a code segment of the structured text, and after extracting features of the source code, the method further comprises: and storing the mapping relation among the HASH value extracted from the source code, the HASH value obtained by the lexical analyzer, the HASH value of the code fragment level of the structured text obtained by the lexical analyzer, the source code and the relative file storage path of the source code.
That is, the preset knowledge base of the present invention stores the correspondence between the different HASH values and the source code, and the mapping between each HASH value and the relative file storage path of the source code. The source code corresponding to the code to be detected can then be looked up accurately and quickly according to this mapping, which provides a practical basis for improving the speed and accuracy of code clone detection.
In addition, the data storage mode of the invention can reduce the hardware requirement of the storage device to a certain extent, thereby obtaining better user experience.
It should be noted that, in the embodiment of the present invention, the MD5 information is used as a HASH value to perform feature extraction on the source code, and in a specific implementation, a person skilled in the art may also perform feature extraction on the source code through other HASH values according to actual needs.
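The mapping relations described above can be sketched as follows. This is only an illustrative sketch, not the storage format of the embodiment: the dictionary layout, the names `new_knowledge_base` and `add_source`, the `normalize` and `fragment_hashes` callables, and the choice to hash the UTF-8 encoding of the text are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def md5_hex(text: str) -> str:
    """MD5 of a text (UTF-8 encoded, an assumption), used as the HASH value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def new_knowledge_base() -> dict:
    """Hypothetical in-memory layout of the knowledge base."""
    return {
        "original": {},                # MD5 of original text -> relative path
        "structured": {},              # MD5 of structured text -> relative path
        "fragment": defaultdict(set),  # fragment-level MD5 -> set of relative paths
    }


def add_source(kb, relative_path, source, normalize, fragment_hashes):
    """Extract the three kinds of features of one source file and store the mappings.

    `normalize` stands in for the lexical analyzer, and `fragment_hashes`
    for the code fragment level feature extraction; both are supplied by the caller.
    """
    structured = normalize(source)
    kb["original"][md5_hex(source)] = relative_path
    kb["structured"][md5_hex(structured)] = relative_path
    for h in fragment_hashes(structured):
        kb["fragment"][h].add(relative_path)
    return kb
```

With this layout, each of the three collision detections becomes a dictionary lookup keyed by a HASH value, and the relative path of the matching source file is returned directly.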
In specific implementation, as shown in fig. 2, the processing rule of the lexical analyzer according to the embodiment of the present invention includes:
removing meaningless code fragments, mainly comprising code comments, blank lines and the like;
removing code segments with non-preset semantics and non-preset functions, specifically, removing parts without specific semantics and functions, such as package information and import information in Java;
searching for and retaining keywords, reserved words and common grammar carrying code semantic information, such as public, static, int, String, Thread, println and the like in Java, and performing unified character replacement on components without code structure semantics, for example uniformly replacing them with the letter 'P';
and removing unnecessary spaces between characters, etc., which is not particularly limited in the present invention, and those skilled in the art may arbitrarily set the processing rules of the lexical analyzer according to actual needs.
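The processing rules above can be sketched for Java as follows. This is a simplified regular-expression sketch under stated assumptions, not the lexical analyzer of the embodiment: the function name `normalize_java`, the contents of `JAVA_KEPT_WORDS` beyond the six words named in the text, and the decision to drop all whitespace between tokens are assumptions. A real lexical analyzer would also handle comment markers that appear inside string literals, which this sketch does not.

```python
import re

# Words retained as-is; the first six come from the text, the rest are assumed.
JAVA_KEPT_WORDS = {
    "public", "static", "int", "String", "Thread", "println",
    "class", "void", "return", "if", "else", "for", "while", "new",
}

_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
_PACKAGE_IMPORT = re.compile(r"^\s*(?:package|import)\b[^;]*;", re.MULTILINE)
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'    # string literal
    r"|'(?:\\.|[^'\\])*'"   # character literal
    r"|\d+(?:\.\d+)?"       # numeric value
    r"|[A-Za-z_$][\w$]*"    # identifier or keyword
    r"|\S"                  # any other single symbol
)


def normalize_java(source: str) -> str:
    """Return a structured text: comments, blank lines, package and import
    information removed; names, strings and numbers replaced by 'P'."""
    text = _COMMENT.sub("", source)
    text = _PACKAGE_IMPORT.sub("", text)
    lines = []
    for line in text.splitlines():
        tokens = []
        for tok in _TOKEN.findall(line):
            if tok[0] in "\"'" or tok[0].isdigit():
                tokens.append("P")  # string, character or numeric literal
            elif tok[0].isalpha() or tok[0] in "_$":
                tokens.append(tok if tok in JAVA_KEPT_WORDS else "P")
            else:
                tokens.append(tok)  # punctuation and operators are kept
        if tokens:  # blank lines are dropped
            lines.append("".join(tokens))
    return "\n".join(lines)
```

Under these rules, two methods that differ only in their names, spacing and comments produce the same structured text, which is what allows the later HASH comparison to match renamed clones.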
In the embodiment of the present invention, the performing code clone detection on the preset code based on the preset knowledge base includes: extracting a HASH value corresponding to the source code of the preset code, preprocessing the preset code through a lexical analyzer to obtain the HASH value of its structured text, and determining a code fragment level HASH value set of the preset code; and sequentially carrying out clone collision detection between the preset code and the preset knowledge base on the HASH value extracted from the source code, the HASH value obtained by the lexical analyzer and the code fragment level HASH value set, and if any collision detection is successful, determining that the preset code is a code clone.
S102, carrying out qualitative and quantitative analysis on the code clone detection result to determine a source file corresponding to the preset code.
That is to say, the embodiment of the present invention extracts features from source code to establish a dedicated knowledge base, then performs code clone detection on a preset code based on the knowledge base to obtain a code clone detection result, and then performs qualitative and quantitative analysis on the code clone detection result to determine a source file corresponding to the preset code.
In specific implementation, in the embodiment of the present invention, the process of sequentially performing clone collision detection between the preset code and the preset knowledge base on the HASH value extracted from the source code, the HASH value obtained by the lexical analyzer and the code fragment level HASH value set, and determining that the preset code is a code clone if any collision detection is successful, comprises the following steps, as shown in fig. 4:
firstly, performing clone collision detection of the HASH value extracted from the source code on the preset code and the preset knowledge base, specifically, extracting the HASH value of the original text of the preset code, performing clone collision detection between this HASH value and the corresponding HASH values stored in the preset knowledge base, and if the collision is successful, determining that the preset code is a code clone and the cloning degree is 100%.
If the collision is not successful, performing clone collision detection of the HASH value obtained by the lexical analyzer on the preset code and the preset knowledge base, which specifically comprises the following steps: preprocessing the preset code through a lexical analyzer to obtain the structured text of the preset code and the HASH value of that structured text, carrying out clone collision detection between this HASH value and the corresponding HASH values stored in the preset knowledge base, and if the collision is successful, determining that the preset code is a code clone and the cloning degree is 100%.
If the collision is not successful, carrying out clone collision detection of the code fragment level HASH value set on the preset code and the preset knowledge base, wherein the specific collision detection process comprises the following steps: processing the preset code through a lexical analyzer, converting the preset code into a preset character string, performing window division processing on the preset character string according to a preset character length to sequentially obtain a plurality of substring sets, sequentially and respectively generating HASH values for the substring sets, and performing deduplication processing on the generated HASH values to obtain a code fragment level feature set of the preset code; and sequentially carrying out clone collision detection on the code segment level feature set and the code segment level HASH value set in the preset knowledge base, and if the collision is successful, determining that the preset code is a code clone.
After the collision is successful and the preset code is determined to be a code clone, the method further comprises the following steps: grouping the successfully collided HASH values by source file, counting the number of collided features in each group, and taking the source file with the largest number of collided features as the source file of the preset code clone; and dividing the number of successfully collided HASH values of the preset code by the number of features in the code segment level feature set of the preset code to obtain the cloning degree of the preset code.
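The three successive collision detections, the grouping by source file and the cloning degree calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function `detect_clone`, the dictionary layout of `kb` (keys "original", "structured" and "fragment"), the result labels, and the `normalize` and `fragment_hashes` callables are hypothetical.

```python
import hashlib
from collections import Counter


def md5_hex(text: str) -> str:
    """MD5 of a text (UTF-8 encoded, an assumption), used as the HASH value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def detect_clone(source, kb, normalize, fragment_hashes):
    """Return (level, source_file, cloning_degree) for the preset code.

    kb is assumed to hold three mappings:
      kb["original"]:   MD5 of original text   -> relative path
      kb["structured"]: MD5 of structured text -> relative path
      kb["fragment"]:   fragment-level MD5     -> set of relative paths
    """
    # First detection: HASH value of the original text.
    h = md5_hex(source)
    if h in kb["original"]:
        return ("original-text", kb["original"][h], 1.0)

    # Second detection: HASH value of the text structured by the lexical analyzer.
    structured = normalize(source)
    h = md5_hex(structured)
    if h in kb["structured"]:
        return ("file", kb["structured"][h], 1.0)

    # Third detection: code fragment level HASH value set.
    features = set(fragment_hashes(structured))
    hits_per_file = Counter()  # collided features grouped by source file
    collided = set()           # features of the preset code that collided
    for feature in features:
        for path in kb["fragment"].get(feature, ()):
            hits_per_file[path] += 1
            collided.add(feature)
    if not hits_per_file:
        return ("none", None, 0.0)

    best_file, _ = hits_per_file.most_common(1)[0]
    return ("fragment", best_file, len(collided) / len(features))
```

For example, if the preset code has 4 fragment-level features and 2 of them collide with the knowledge base, the cloning degree is 2 / 4 = 50%, and the source file that accounts for the most collided features is reported as the source.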
The method according to the embodiment of the present invention will be explained in detail below, taking MD5 as the HASH value, in conjunction with fig. 2, 3 and 4:
the invention mainly solves the problem of homologous detection of code cloning, and mainly comprises the following 2 steps:
firstly, establishing a knowledge base (storing source code characteristics);
then, the code file is subjected to clone detection and qualitative and quantitative analysis.
In the embodiment of the invention, the specific steps of establishing the knowledge base comprise:
step 1, collecting and sorting open source codes, and storing the open source codes in a uniform format;
step 2, extracting features from the source codes stored in the previous step, mainly extracting the MD5 of the original text of each file, the MD5 of the text obtained after the original text is preprocessed by a lexical analyzer, the code segment level MD5 information of the preprocessed text, and the path mapping relation between the relative path of the open source file and the MD5 of the original text of the file;
in step 2, the local disk stores a single copy of each unique original file, that is, the source code mentioned above, which is used to visually display code clone pairs; the local file is named with the MD5 of the original text, and its file suffix is the file suffix of the original file, so as to facilitate subsequent path indexing, for example: e10adc3949ba59abbe56e057f20f883e.java;
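The naming rule for the locally stored file can be sketched as follows. The function name `stored_file_name` and the choice to hash the UTF-8 encoding of the text are assumptions made for illustration; the text does not specify whether the MD5 is computed over the raw bytes or over decoded text.

```python
import hashlib
from pathlib import PurePosixPath


def stored_file_name(original_text: str, original_path: str) -> str:
    """Name the stored copy as <MD5 of original text><suffix of original file>."""
    digest = hashlib.md5(original_text.encode("utf-8")).hexdigest()
    return digest + PurePosixPath(original_path).suffix
```

Because the name is derived from the content, identical files collected from different open source projects map to the same stored copy, and the stored file can be located directly from the MD5 found during collision detection.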
As shown in fig. 4, the specific steps of the code clone detection in the embodiment of the present invention include:
1) reading the file content, generating its MD5, and performing collision detection against the hash_token table; if a collision occurs, the file is qualitatively determined to be in a clone relation and, quantitatively, the cloning degree is 100%; if no collision occurs, proceeding to step 2);
2) determining the corresponding programming language according to the file suffix, for example, a file with the java suffix is a Java programming language file, calling a lexical analyzer to perform text preprocessing, and outputting the preprocessed text, wherein different lexical analyzers are constructed according to the grammar rules of different languages, and the core processing rules of the lexical analyzers are as follows:
(1) removing meaningless code segments, mainly comprising code comments, blank lines and the like;
(2) removing parts without specific semantics and functions, such as package information and import information in Java;
(3) searching for and retaining information such as keywords, reserved words and common grammar carrying code semantic information, such as public, static, int, String, Thread, println and the like in Java, and performing unified character replacement on components without code structure semantics, such as class names, function names, parameter names, character strings, numerical values and the like, wherein the components are uniformly replaced by the letter P;
(4) removing unnecessary spaces between characters;
(5) the text obtained after processing according to steps (1) to (4) is the structured, preprocessed text.
The flow of the structured preprocessing of the lexical analyzer in the embodiment of the present invention is specifically shown in fig. 2.
3) generating MD5 information from the preprocessed text obtained in step 2), and performing collision detection between the MD5 information and the file-level feature table {language}_file_token of the corresponding programming language; if a collision occurs, the file is qualitatively determined to be a file-level code clone and, quantitatively, the cloning degree is 100%; if no collision occurs, proceeding to step 4);
4) the extracting of the file code segment level feature set comprises the following steps:
(1) removing every '\n' character from the text subjected to the structured preprocessing in step 2), and converting the text into a single-line character string;
(2) performing substring extraction on the character string of step (1) by window division: starting from the first character, extracting substrings of 50 characters each with a window that advances one character at a time; for example, if the length of the character string is 200, then 200 - 50 + 1 = 151 substrings are generated, covering characters 1-50, 2-51, 3-52, ..., 151-200 respectively, and the order of the substrings is maintained;
(3) generating an MD5 for each substring of the substring set of step (2) in sequence, so that 151 MD5 values are generated with their order maintained;
(4) performing feature extraction on the 151 MD5 values of step (3) to effectively compress the data, wherein the extraction method is to perform window division with windows of length 51, each window containing 51 MD5 values, giving 151 - 51 + 1 = 101 windows in total, which respectively contain the MD5 values of the substrings with sequence numbers 1-51, 2-52, ..., 101-151;
(5) in each of the 101 windows generated in (4), sorting the MD5 elements in the window and taking the smallest MD5 element as the feature of that window, so that 101 features are generated;
(6) performing deduplication on the 101 features of (5), which leaves 1 feature at one extreme (all features are the same) and 101 features at the other (all features are different);
(7) finally, the feature set after the deduplication in (6) is the code segment level feature set corresponding to the file.
The flow of the code segment level feature extraction in the embodiment of the invention is specifically shown in fig. 3.
5) traversing the code segment level feature set extracted in step 4) and performing collision detection of each feature against the code segment level feature library {language}_code_token of the corresponding programming language; if no collision occurs, the file is determined to be an original file (without code cloning behavior); if a collision occurs, a 1-to-n collision relation is obtained (the fragments of 1 file may collide with code segments of a plurality of files in the knowledge base), the open source files are then sorted by the number of collided features belonging to the same open source file, the open source file with the largest number of collided features is selected as the optimal solution, and step 6) is then performed;
6) performing the structured preprocessing of step 2) on the two files A and B that were qualitatively determined in step 5) to be in a clone relation, then performing the code segment level feature extraction of step 4) on each of them, this time carrying out only the processing of steps (1) (2) (3) (4), to obtain feature sets A1 and A2, computing the set B of all MD5 features that A1 and A2 have in common, and dividing the number of elements in the set B by the number of elements in the set A1 to obtain the cloning degree of the file;
7) at this point, the qualitative and quantitative clone detection of the code file is finished, and a single file has 4 possible detection results, namely original-text-level clone, file-level clone, code-fragment-level clone and original.
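The code segment level feature extraction of step 4), including the 200-character example, can be sketched as follows. This is an illustrative sketch rather than the implementation of the embodiment: the names `sliding_windows` and `fragment_features` are hypothetical, hashing the UTF-8 encoding of each substring is an assumption, and so is the behavior for texts shorter than one window (an empty feature set), which the description does not address.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 of a text (UTF-8 encoded, an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sliding_windows(items, size):
    """All contiguous windows of `size` elements, advancing one element at a time."""
    return [items[i:i + size] for i in range(len(items) - size + 1)]


def fragment_features(structured_text, substring_len=50, window_len=51):
    """Code segment level feature set of a structured text.

    (1) remove '\\n' to obtain a single-line string;
    (2) cut it into substrings of `substring_len` characters;
    (3) generate one MD5 per substring, keeping the order;
    (4) divide the MD5 sequence into windows of `window_len` values;
    (5) keep the smallest MD5 of each window;
    (6) deduplicate the selected values.
    """
    line = structured_text.replace("\n", "")
    substrings = sliding_windows(line, substring_len)
    hashes = [md5_hex(s) for s in substrings]
    windows = sliding_windows(hashes, window_len)
    minima = [min(w) for w in windows]
    return set(minima)
```

For a 200-character string this produces 200 - 50 + 1 = 151 substrings and 151 - 51 + 1 = 101 windows, so the resulting feature set contains between 1 and 101 MD5 values. Keeping only the minimum of each window compresses the data while still guaranteeing that any sufficiently long shared run of text contributes at least one common feature.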
Generally speaking, the invention only needs a device for storing the source code and a device for performing clone detection, so the hardware deployment cost can be reduced. The method of the invention supports structured code clone detection, which conforms to actual programming practice, and supports type 3.5 code clone detection. It can also take into account the actual grammar and semantics of each programming language by using a different lexical analyzer for each language, so the detection of the invention is more accurate and effective.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and the scope of the invention should not be limited to the embodiments described above.