CN112579155B - Code similarity detection method and device and storage medium

Info

Publication number: CN112579155B
Application number: CN202110198641.2A
Authority: CN
Inventors: 高庆; 李玫; 张世琨; 马森
Original assignee: Beijing Peking University Software Engineering Co ltd
Current assignee: Beijing Peking University Software Engineering Co ltd
Priority date: 2021-02-23
Filing date: 2021-02-23
Publication date: 2021-05-18
Anticipated expiration: 2041-02-23
Also published as: CN112579155A

Abstract

An embodiment of the invention relates to the field of software detection and discloses a code similarity detection method comprising three stages. In the preprocessing stage, a massive set of source code files is preprocessed, features are extracted, and similar hash fingerprint values are output. In the fingerprint indexing stage, the fingerprints produced by the previous stage are segmented and recombined using a segmented indexing strategy and stored in a similar hash fingerprint library, and segment indexes are built to enable fast matching. In the similarity matching stage, the project files to be detected are processed to generate similar hash values, and the tracing detection results are retrieved segment by segment from the similar hash fingerprint library according to those values. By filtering out the common lines of each language, the invention reduces the influence of the line coverage problem on the result.

Description

Code similarity detection method and device and storage medium

Technical Field

The invention relates to the field of software detection, in particular to a code similarity detection method, a system, a device and a storage medium.

Background

Nowadays, as software development becomes increasingly widespread, the amount of open source code is growing rapidly. Whether in enterprises or research institutions, more and more developers copy and paste existing code to improve development efficiency. However, as software is continuously updated and its functionality grows, the negative impact of this duplicated and cloned code on software quality, availability, and maintainability becomes more pronounced. Code introduced from open source projects reduces developers' understanding and control of the whole software system, conflicts can arise between the external code and the system's own code, and bugs in the open source code can be carried into the project along with the copied code, creating potential security risks. To address this problem, researchers often use code similarity detection techniques to detect similar code in software projects.

Since the 1970s, a large number of code similarity detection tools and methods have emerged in academia and have been widely applied to code clone detection, software license violation detection, software plagiarism detection, vulnerability and defect discovery, and similar tasks. Commonly used code similarity detection methods fall into five levels: metric-based, text-based, token-based (lexical), tree-based and graph-based. As the amount of open source code grows, large-scale code similarity detection has become an inevitable trend. Traditional methods achieve high precision on small-scale data, but with limited hardware, if the detected code must undergo string comparison or lexical, syntactic or semantic analysis and then be compared with large-scale data in a library, the time cost becomes excessive.

Similar hash (Simhash) was first proposed by Charikar et al. and is a locality-sensitive hashing (LSH) algorithm. With a conventional hash algorithm, the generated digital signature only indicates whether two original inputs are equal or not; when the inputs differ, it provides no further information. Even if only a single space in the input is changed, a completely different signature is very likely to be generated. With the similar hash algorithm, the Simhash fingerprints obtained from two source code files express not only whether the two are equal but also how similar they are.

Because Simhash similarity search is highly efficient, it is widely used in many fields. At Google, Simhash is used to deduplicate massive numbers of similar web pages; the algorithm as adapted by Google performed well in tests on a database of billions of web pages crawled by the Google crawler and has been applied in actual products. Simhash also appears in the field of clone analysis: Uddin et al. proposed SimCad, a code clone clustering tool combining Simhash and NiCad, and Qiao et al. applied Simhash to homology detection of assembly code.

Because repeated words often appear in code, similar hash features extracted at word granularity tend to let high-frequency features mask low-frequency ones. To avoid this, it has been proposed to accumulate features at code-line granularity and to filter out pure-symbol lines and blank lines during preprocessing, which further improves the applicability of similar hash to code detection. As shown in Fig. 1, the algorithm flow is roughly as follows:

1) preprocess the input code file by dividing it into lines and filtering out pure-symbol lines and blank lines;

2) for each line, obtain a signature value with a specified number of bits using a conventional hash method;

3) for each bit of each feature, set the bit to the feature's weight if the bit is 1, and to the negative of the weight if the bit is 0;

4) add the weighted features bit by bit to obtain a result vector;

5) reduce the dimension: replace each positive element of the result vector with 1 and each negative element with 0 to obtain the final fingerprint value.
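The five prior-art steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the referenced implementation: the name `baseline_simhash` is hypothetical, every line is given weight 1, a "pure-symbol line" is taken to mean a line with no letter or digit, and BLAKE2b from the standard library stands in for the unspecified conventional hash.

```python
import hashlib

BITS = 64  # fingerprint width assumed for this sketch


def line_hash(line: str) -> int:
    """Return a 64-bit hash of one line (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def is_feature(line: str) -> bool:
    """Keep a line only if it has a letter or digit (drops blank and pure-symbol lines)."""
    return any(ch.isalnum() for ch in line)


def baseline_simhash(source: str) -> int:
    """Line-granularity similar hash with unit weights."""
    totals = [0] * BITS
    for raw in source.splitlines():            # step 1: split into lines and filter
        line = raw.strip()
        if not is_feature(line):
            continue
        h = line_hash(line)                    # step 2: conventional hash
        for i in range(BITS):                  # steps 3-4: map bits to +1/-1, accumulate
            totals[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, total in enumerate(totals):         # step 5: positive -> 1, otherwise 0
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint
```

With a single surviving line, the fingerprint equals that line's hash, since every bit total is either +1 or -1.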

Experimental analysis and statistics show that, in real projects, the existing similar hash method suffers from high-frequency line features masking low-frequency line features, which makes the accuracy of the results extremely low. This line coverage phenomenon mainly occurs in the following two situations:

First, some code lines contain keywords that appear frequently in code, such as "break", "return" and "try {". In some scenarios these lines occur so frequently that their effect on the result masks all other lines.

Second, due to the particular function of the code (for example, utility classes or test classes that output informational prompts), some functional lines appear frequently, such as frequent calls to a certain method or the same message-passing functionality implemented in multiple methods.

Of these two cases, the latter kind of coverage can be regarded as a manifestation of the file's functional uniformity, so a pair of similar code files that share a large number of identical repeated methods does not count as a false positive in code similarity measurement. By contrast, the comments and the heavily repeated code lines of the first case do not characterize the functionality of the code file and therefore have a major impact on the accuracy of the result.

For the line coverage phenomenon, simply reducing the weight of lines that appear frequently within a file may also weaken the influence on the fingerprint of the repeated method-call lines of the second case; the file's specific function is then no longer reflected, which leads to false positives for the file.

Disclosure of Invention

The invention aims to provide a code similarity detection method which can reduce the influence of a line coverage problem on a result from the angle of eliminating common lines in different languages.

In order to solve the above technical problem, an embodiment of the present invention provides a code similarity detection method, including the following steps:

S101: in the preprocessing stage, a massive set of source code files is preprocessed and features are extracted, and similar hash fingerprint values are output;

S102: in the fingerprint indexing stage, according to the result of the preprocessing stage, the fingerprints are segmented and recombined using a segmented indexing strategy and stored in a similar hash fingerprint library, and segment indexes are built to enable fast matching;

S103: in the similarity matching stage, the project file to be detected is processed to generate a similar hash value, and the tracing detection result is retrieved segment by segment from the similar hash fingerprint library according to that value.

Preferably, step S101 includes the following sub-steps:

S1011, code preprocessing: the code of a source code file contains irrelevant factors, including blank lines, spaces and comments; the code is uniformly formatted to remove these irrelevant factors;

S1012, feature extraction: common-line statistics are collected in advance on open source project files of multiple common languages, and for each language the top N lines sorted by frequency are taken as its common line list, where N is a preset threshold; the language of the source code file is determined from its file suffix, the common line filter of that language is selected from the common line lists, all lines of the source code file that appear in the filter are removed, and each remaining line is taken as a feature of the source code file;

S1013, hash processing: a hash value is computed for each feature of the source code file using a hash algorithm, the hash value being a 64-bit binary string;

S1014, weighted summation: the frequency of each feature is used as its weight; each bit of the hash value, being 0 or 1, determines whether the weight is applied negatively or positively, and the weighted hash values are added bit by bit to obtain a result sequence;

S1015, dimensionality reduction: the result sequence obtained by weighted summation is transformed so that each positive element becomes 1 and every other element becomes 0, yielding the final similar hash fingerprint value.

Preferably, in step S102, each fingerprint is divided into 5 segments, and the segments are recombined by permutation and combination and recorded as indexes in the respective fields of the similar hash fingerprint library.

Preferably, in step S103, first, as in the preprocessing stage, the project file to be detected undergoes standardization, feature extraction and similar hash fingerprint generation; then, as in the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, which are combined in pairs to obtain 10 index segments, and each index segment is matched exactly against the corresponding field in the database, the matched files being files that may be similar to the file to be detected; finally, the Hamming distance is further computed for the hash values of all candidates, and the final results are summarized.

The embodiment of the invention also provides a code similarity detection system based on the similar hash algorithm, which comprises the following modules:

the preprocessing module is used for preprocessing and extracting features of the massive source code files and outputting similar hash fingerprint values;

the fingerprint index module is used for segmenting and recombining fingerprints according to the result of the preprocessing module by adopting a segmented index strategy and then storing the segmented fingerprints into a similar hash fingerprint library, and establishing a segmented index for facilitating quick matching;

and the similar matching module is used for processing the engineering file to be detected to generate a similar hash value, and retrieving the tracing detection result from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected.

Preferably, the preprocessing module comprises the following sub-modules:

the code preprocessing submodule is used for carrying out uniform formatting processing on the code and removing the irrelevant factors;

a feature extraction submodule to: performing common line statistics on open source project files of multiple common languages in advance, taking the first N lines of each language which are sorted according to frequency as a common line list, wherein N is a preset threshold value, judging the language to which the open source project files belong according to the suffix name of the source code file, selecting a common line filter of the corresponding language from the common line list, completely screening out the corresponding lines contained in the filter in the source code file, and taking the remaining each line as the characteristic of the source code file;

a hash processing sub-module to: calculating a corresponding hash value of each characteristic of the source code file by using a hash algorithm, wherein the hash value is a 64-bit binary string;

a weighted sum sub-module for: taking the frequency of each feature as the weight of the feature to carry out weighted summation, determining whether the hash value is multiplied by the weight value in a positive or negative way by 0 or 1 of each bit of the hash value, and adding the weighted hash values bit by bit to obtain a result sequence string;

a dimension reduction submodule to: and transforming the result sequence string obtained by weighted summation, if each bit is a positive number, transforming into 1, otherwise, transforming into 0, and obtaining the final similar hash fingerprint value.

Preferably, the fingerprint indexing module divides each fingerprint into 5 segments and records the 5 segments as indexes in fields of the similar hash fingerprint library respectively through permutation and combination.

Preferably, the similarity matching module is configured to: first, as in the preprocessing stage, perform standardization, feature extraction and similar hash fingerprint generation on the project file to be detected; then, as in the fingerprint indexing stage, divide the similar hash fingerprint value to be queried into 5 segments, combine them in pairs to obtain 10 index segments, and match each index segment exactly against the corresponding field in the database, the matched files being files that may be similar to the file to be detected; and finally, further compute the Hamming distance for the hash values of all candidates and summarize the final results.

An embodiment of the present invention also provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the code similarity detection method as described above.

Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the code similarity detection method as described above.

Compared with the prior art, the embodiment of the invention, taking scale and efficiency into account, optimizes the text-based line-granularity similar hash (Simhash) algorithm and provides a similar hash detection method with per-language filtering. On the one hand, the method retains the efficient similarity retrieval of similar hash: the preprocessed source code is mapped to a binary number (the fingerprint of the code), which achieves dimension reduction and indexing of code files and speeds up the construction of a large-scale library. On the other hand, the method takes the characteristics of code in different languages into account and eliminates the influence of common lines on fingerprint generation, so that the fingerprint reflects the characteristics of the code and the accuracy of similarity detection is greatly improved. The method is easy to implement and simple to deploy, does not depend on any lexical or syntactic analyzer, is efficient to build and cheap to migrate to new languages; at the same time, its high precision keeps the number of wrong matching pairs in the result low and reduces the cost of subsequent manual verification.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.

Fig. 1 is a schematic diagram of a similar hash fingerprint acquisition method in the prior art.

Fig. 2 is a schematic diagram of an improved similar hash fingerprint acquisition method according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to give the reader a better understanding of the invention. The technical solution claimed in the present invention can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and should not limit the specific implementation of the invention; the embodiments may be combined with and refer to one another where they do not contradict.

The invention is described in detail below with reference to the drawings and specific examples.

In order to eliminate the influence on the result of code lines that appear frequently because of the syntactic structure of the code, the invention selects 10 common languages (C#, C/C++, Go, Java, JavaScript, PHP, Python, Ruby, SQL and Swift) and collects common-line statistics for each language separately. The data source is the top 50,000 open source projects on GitHub ranked by star count. The source code files of each language are preprocessed, their lines are sorted in descending order of occurrence frequency, and the top 20,000 lines are taken as the filtering content of that language's line filter. The top 15 common lines for some of the languages are shown in Table 1.

TABLE 1 statistical results of common lines in different languages

As can be seen from Table 1, the lines that appear frequently in code files include not only pure-symbol lines but also some special semantics-related lines. Their frequent occurrence does not directly reflect the function of the code, and they carry little semantic information, so the coverage they cause differs from the coverage caused by frequently repeated functional lines described above. If only pure-symbol lines are filtered out, the extraction of fingerprint features is still affected by these frequent but weakly characteristic lines, and when their proportion is high the resulting line coverage seriously degrades the accuracy of the result. Moreover, although some common lines such as "return false" and "} else {" are shared between languages, the common line lists differ considerably because the languages have different grammatical features, and some common lines are language-specific, such as "pass" in Python and "ensure" in Ruby.
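The common-line statistics described above can be sketched as a frequency count per language. This is an illustrative sketch only: the function names, the in-memory corpus layout and the simplified normalization (whitespace removal and lowercasing, with comment stripping omitted) are assumptions, and `top_n` corresponds to the threshold of 20,000 lines mentioned in the text.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def normalize(line: str) -> str:
    """Remove all whitespace and lowercase the line (comment stripping omitted here)."""
    return "".join(line.split()).lower()


def build_common_lines(
    files_by_language: Dict[str, Iterable[str]], top_n: int
) -> Dict[str, Set[str]]:
    """Return, for each language, the set of its top_n most frequent normalized lines."""
    common: Dict[str, Set[str]] = {}
    for language, sources in files_by_language.items():
        counts: Counter = Counter()
        for source in sources:
            for raw in source.splitlines():
                line = normalize(raw)
                if line:                      # skip blank lines
                    counts[line] += 1
        common[language] = {line for line, _ in counts.most_common(top_n)}
    return common
```

A toy call such as `build_common_lines({"python": ["pass\nx = 1\npass\n"]}, top_n=1)` yields `{"python": {"pass"}}`; a real corpus would be read from disk rather than passed as strings.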

Therefore, the invention improves on the line-granularity similar hash algorithm. As shown in Fig. 2, the source code to be detected is preprocessed by removing blank lines, comments and spaces and converting all uppercase letters to lowercase, and the code file is split into lines. The language of the file is determined from its file suffix, the common line list of that language is selected from the common line library as a filter, all lines of the file that appear in the filter are removed, and each remaining line is taken as a feature of the code file. The features are hashed in turn with the MurmurHash method, each result is mapped bit by bit to a sequence of 1 and -1, and the final fingerprint is obtained through accumulation and dimensionality reduction.
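The preprocessing and common-line filtering just described might look like the sketch below. Everything named here is an assumption made for illustration: the `SUFFIX_TO_LANGUAGE` table covers only a few suffixes, and the comment stripping is deliberately naive (it handles `#`, `//` and `/* ... */` but does not respect string literals).

```python
import os
import re
from typing import Dict, List, Set

# Hypothetical suffix table; a real deployment would cover all supported languages.
SUFFIX_TO_LANGUAGE = {".py": "python", ".java": "java", ".go": "go", ".js": "js"}


def preprocess(source: str, language: str) -> List[str]:
    """Strip comments, blank lines and spaces, lowercase, and split into lines."""
    if language != "python":
        # Naive removal of /* ... */ block comments.
        source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)
    marker = "#" if language == "python" else "//"
    lines = []
    for raw in source.splitlines():
        code = raw.split(marker, 1)[0]        # drop a trailing line comment
        line = "".join(code.split()).lower()  # remove spaces, lowercase
        if line:
            lines.append(line)
    return lines


def extract_features(
    path: str, source: str, common_lines: Dict[str, Set[str]]
) -> List[str]:
    """Pick the language by file suffix and drop every line found in its common line list."""
    suffix = os.path.splitext(path)[1].lower()
    language = SUFFIX_TO_LANGUAGE.get(suffix)
    if language is None:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    blocked = common_lines.get(language, set())
    return [line for line in preprocess(source, language) if line not in blocked]
```

Note that the common line list must be stored in the same normalized form (no spaces, lowercase) as the preprocessed lines, otherwise the filter never matches.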

The method disclosed by the invention is mainly divided into three stages, namely a preprocessing stage, a fingerprint indexing stage and a similarity matching stage. And in the preprocessing stage, preprocessing is carried out on the massive source code files, then screening and extracting of features are carried out, and finally similar Hash fingerprints are generated. And in the fingerprint indexing stage, according to the result of the previous stage, the fingerprints are segmented and recombined by adopting a segmented indexing strategy and then stored in a similar hash fingerprint library, and segmented indexes are established to facilitate quick matching. And in the similar matching stage, after the engineering file to be detected is processed, a similar hash value is generated, and the tracing detection result is retrieved from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected.

The method and system of the present invention are further described below. The method of the embodiment of the present invention may be executed by processors in various types of intelligent computing devices, and specifically, the method of the embodiment of the present invention includes the following steps:

S101, preprocessing stage:

In the preprocessing stage, a number of preprocessing and feature extraction operations are performed on the massive set of source code files, and fingerprint values are output. The improved similar hash fingerprint acquisition method is shown in Fig. 2 and includes the following sub-steps:

S1011, code preprocessing: irrelevant factors in the code, such as blank lines, spaces and comments, affect the generated hash result. To improve the accuracy of the similarity matching result, the source code is uniformly formatted: blank lines, comments and spaces are removed from the source code to be detected and all uppercase letters are converted to lowercase, which removes the irrelevant factors.

S1012, feature extraction: to eliminate the influence on the result of lines that occur in large numbers but carry no meaning, common-line statistics are collected in advance on the open source project files of the top 50,000 GitHub projects by star count for 10 common languages, and for each language the top 20,000 lines by frequency are taken as its common line list. The language of a file is determined from its file suffix, the common line filter of that language is selected from the common line lists, all lines of the file that appear in the filter are removed, and each remaining formatted line is taken as a feature of the source code file.

S1013, hash processing: using the conventional hash algorithm MurmurHash, a hash value is computed for each feature of the source code file. The hash value is a 64-bit binary string; Fig. 2 shows it truncated to 8 bits for illustration.

S1014, weighted summation: the frequency of each feature is used as its weight, and a weighted summation yields a result sequence. As described in S1012 and S1013, each line is a feature and a hash value is computed for each feature. Within one file, identical lines are repeated occurrences of the same feature; counting the occurrences of each distinct line in the file gives the frequency of that line, which is its weight w, and weighting turns the line's hash value (a binary string) bit by bit into w or -w. In practice, no frequency statistics are needed: each line is treated as an independent item with weight 1, so that each bit of the hash value, being 0 or 1, determines whether the weight is applied negatively or positively, and the line's hash value (binary string) is turned bit by bit into 1 or -1, which completes the weighting. The weighted hash values are then added bit by bit to obtain the result sequence. See the "map" and "accumulate" steps in Fig. 2.

S1015, dimensionality reduction: and transforming the result sequence string obtained by weighted summation, wherein if each bit is a positive number, the result sequence string is transformed into 1, otherwise, the result sequence string is transformed into 0, and a final similar Hash fingerprint value is obtained.
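Sub-steps S1013 to S1015 can be sketched as follows, including a check of the remark in S1014 that frequency weighting and per-line unit weighting give the same result. The function names are hypothetical, and BLAKE2b from the Python standard library is used as a stand-in for MurmurHash, which is not in the standard library; the feature list is assumed to come from the earlier filtering step.

```python
import hashlib
from collections import Counter
from typing import List

BITS = 64


def feature_hash(feature: str) -> int:
    """64-bit hash of one feature (stand-in for the MurmurHash named in the text)."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def reduce_to_bits(totals: List[int]) -> int:
    """S1015: positive totals become 1, all others 0."""
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def fingerprint_weighted(features: List[str]) -> int:
    """S1014 with explicit weights: each distinct line contributes +w or -w per bit."""
    totals = [0] * BITS
    for feature, w in Counter(features).items():
        h = feature_hash(feature)
        for i in range(BITS):
            totals[i] += w if (h >> i) & 1 else -w
    return reduce_to_bits(totals)


def fingerprint_per_line(features: List[str]) -> int:
    """S1014 as done in practice: every occurrence contributes +1 or -1 per bit."""
    totals = [0] * BITS
    for feature in features:
        h = feature_hash(feature)
        for i in range(BITS):
            totals[i] += 1 if (h >> i) & 1 else -1
    return reduce_to_bits(totals)
```

The two variants agree because a line occurring w times adds w copies of the same +1/-1 pattern, which is exactly +w/-w per bit.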

S102, fingerprint indexing:

For two source code files, the criterion for deciding whether they are similar, and how similar, is the Hamming distance between their fingerprints. In a massive data set, searching for fingerprint values close to the fingerprint of the detected code is time-consuming, so the invention designs an index optimization that greatly improves query efficiency through segmented indexing.

Specifically, the invention divides each fingerprint into 5 segments with lengths of 13, 13, 13, 13 and 12 bits respectively, recombines the segments by permutation and combination, and records them as indexes in the fields of the database. In the similarity matching stage, the hash value of the item under test can therefore be quickly located, through the index, to the candidates that may match, from which the tracing result is then obtained. On a data set of size 2⁶⁴, this indexing method greatly narrows the range of candidates for each similarity search and greatly improves retrieval efficiency.
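The segmentation can be sketched as below. The pairing of segments into 10 keys follows the "two-by-two" combination described in the similarity matching stage; the key layout `(i, j, segment_i, segment_j)` and the function names are assumptions of this sketch. The reason pairs suffice for a distance limit of 3 is a pigeonhole argument: at most 3 differing bits can touch at most 3 of the 5 segments, so at least 2 segments are identical and at least one of the 10 pair keys matches exactly.

```python
from itertools import combinations
from typing import List, Tuple

SEGMENT_LENGTHS = (13, 13, 13, 13, 12)  # sums to 64 bits


def split_segments(fingerprint: int) -> List[int]:
    """Split a 64-bit fingerprint into 5 segments, most significant bits first."""
    segments = []
    shift = sum(SEGMENT_LENGTHS)
    for length in SEGMENT_LENGTHS:
        shift -= length
        segments.append((fingerprint >> shift) & ((1 << length) - 1))
    return segments


def index_keys(fingerprint: int) -> List[Tuple[int, int, int, int]]:
    """Build the 10 pairwise index keys; positions are kept so fields never mix."""
    segs = split_segments(fingerprint)
    return [(i, j, segs[i], segs[j]) for i, j in combinations(range(len(segs)), 2)]
```

Each key would be stored in its own indexed database field, so a lookup is an exact-match query rather than a scan over all fingerprints.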

S103, similarity matching:

After the above steps, the similar hash fingerprint library has been built. In the similarity matching stage, tracing detection is performed on the project to be detected: the input of this stage is the source code of the project to be detected, and the output is the set of results, found across the entire fingerprint library, that are similar to the code in the input project.

First, as in the preprocessing stage, the file undergoes standardization, feature extraction and similar hash fingerprint generation. Then, as in the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, which are combined in pairs to obtain 10 index segments; each index segment is matched exactly against the corresponding field in the database, and the matched files are files that may be similar to the file under test. Finally, the Hamming distance is computed for the hash values of all candidates, and the final results are summarized.
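The query flow above (exact match on index segments, then Hamming distance on the candidates) can be sketched with an in-memory dictionary standing in for the database. The class name `FingerprintLibrary` and its methods are hypothetical, and the default threshold of 3 is taken from the Hamming-distance limit mentioned in the experiments.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Tuple

SEGMENT_LENGTHS = (13, 13, 13, 13, 12)
MAX_DISTANCE = 3  # similarity limit on the Hamming distance


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def index_keys(fingerprint: int) -> List[Tuple[int, int, int, int]]:
    """Return the 10 pairwise segment keys of a 64-bit fingerprint."""
    segs, shift = [], sum(SEGMENT_LENGTHS)
    for length in SEGMENT_LENGTHS:
        shift -= length
        segs.append((fingerprint >> shift) & ((1 << length) - 1))
    return [(i, j, segs[i], segs[j]) for i, j in combinations(range(5), 2)]


class FingerprintLibrary:
    """In-memory stand-in for the similar hash fingerprint library."""

    def __init__(self) -> None:
        self._buckets = defaultdict(set)

    def add(self, name: str, fingerprint: int) -> None:
        """Store a file's fingerprint under each of its 10 index keys."""
        for key in index_keys(fingerprint):
            self._buckets[key].add((name, fingerprint))

    def query(self, fingerprint: int, max_distance: int = MAX_DISTANCE):
        """Collect exact-match candidates, then keep those within max_distance."""
        candidates = set()
        for key in index_keys(fingerprint):
            candidates |= self._buckets.get(key, set())
        hits = [
            (name, hamming(fingerprint, stored))
            for name, stored in candidates
            if hamming(fingerprint, stored) <= max_distance
        ]
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Only candidates sharing at least one index key are compared bit by bit, which is what keeps the Hamming-distance step cheap on a large library.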

The above method is described below by way of an example:

To verify the effect of the invention, the 60 top-ranked projects by star count were obtained from the GitHub website as a test set. GitHub is a well-known open source community that hosts a large number of high-quality, widely used and widely referenced open source projects, and the projects with the most stars are usually those that attract the most attention and reuse.

The acquired project source code was filtered so that only files with code-related suffixes remained, eliminating interference from irrelevant files. This left 73453 files totaling 7 GB.

Sensitivity was tested first, on files of three different sizes. As shown in Table 2, the Simhash values of files of different sizes differ in their sensitivity to file changes: the smaller the file, the more sensitive it is to change, and the larger the file, the less sensitive. This property is exactly what file similarity matching needs, since the larger the file to be detected, the larger the range of variation that can be tolerated. Moreover, regardless of the number of lines in a file, its Simhash value reaches the upper limit of 3 used for the Hamming-distance similarity decision after insertions and deletions amounting to 2% to 4% of the total lines.

The accuracy of homology matching was then tested. The 73453 collected files were stored in the fingerprint library using the preprocessing and indexing stages described above. Then 1317 files were selected from the fingerprint library and 100 files, each a manually and slightly modified version of a source file in the library, were added, giving 1417 files in total as the project files to be tested. Each file was passed through the similarity matching stage to obtain its homology analysis results, and the matched clone pairs in the results were verified one by one. Statistical analysis shows that all identical files are matched correctly and that the partially modified files are matched to their corresponding code files. Across the matching results of all tested files, the false positive rate is only 1.41%, which shows that the method performs well in real projects.

Table 2 improved similar hash fingerprint sensitivity test

Those skilled in the art will understand that all or part of the steps of the method according to the above embodiments may be implemented by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that, in practice, various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. A code similarity detection method is characterized by comprising the following steps:

S103: in the similar matching stage, a similar hash value is generated after the engineering file to be detected is processed, and a tracing detection result is retrieved from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected;

the feature extraction in step S101 is: common line statistics is carried out on open source project files of multiple common languages in advance, the first N lines of each language which are sorted according to frequency are used as a common line list, N is a preset threshold value, the language to which the source code file belongs is judged according to the suffix name of the source code file, a common line filter of the corresponding language is selected from the common line list, all corresponding lines contained in the filter in the source code file are screened, and the remaining lines are used as the characteristics of the source code file.

2. The method for detecting code similarity according to claim 1, wherein step S101 comprises the following sub-steps:

3. The method for detecting code similarity according to claim 1, wherein in step S102, each fingerprint is divided into 5 segments, and the 5 segments are recombined by permutation and combination as indexes and are respectively recorded in the fields of the similar hash fingerprint database.

4. The method for detecting code similarity according to claim 1, wherein in step S103, firstly, corresponding to the preprocessing stage, operations of standardization, feature extraction, and similar hash fingerprint generation are performed on the engineering file to be detected; then, corresponding to the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, the segments are combined in pairs to obtain 10 index segments, and each index segment is exactly matched against the corresponding field in the database, the matched files being files possibly similar to the file to be detected; and finally, a further Hamming distance calculation is performed on the hash values in all candidate sequences, and the final calculation results are summarized.

5. A code similarity detection system based on a similar hash algorithm is characterized by comprising the following modules:

the similar matching module is used for processing the engineering file to be detected to generate a similar hash value, and retrieving a tracing detection result from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected;

the feature extraction in the preprocessing module is as follows: common line statistics is carried out on open source project files of multiple common languages in advance, the first N lines of each language which are sorted according to frequency are used as a common line list, N is a preset threshold value, the language to which the source code file belongs is judged according to the suffix name of the source code file, a common line filter of the corresponding language is selected from the common line list, all corresponding lines contained in the filter in the source code file are screened, and the remaining lines are used as the characteristics of the source code file.

6. The system for detecting code similarity based on the similar hash algorithm as claimed in claim 5, wherein said preprocessing module comprises the following sub-modules:

7. The system for detecting code similarity based on similar hash algorithm as claimed in claim 5, wherein said fingerprint indexing module divides each fingerprint into 5 segments and records them as indexes in fields of similar hash fingerprint database respectively by permutation and combination.

8. The system according to claim 5, wherein the similar matching module is configured to: firstly, corresponding to the preprocessing stage, perform operations of standardization, feature extraction, and similar hash fingerprint generation on the engineering file to be detected; then, corresponding to the fingerprint indexing stage, divide the similar hash fingerprint value to be queried into 5 segments, combine the segments in pairs to obtain 10 index segments, and exactly match each index segment against the corresponding field in the database, the matched files being files possibly similar to the file to be detected; and finally, perform a further Hamming distance calculation on the hash values in all candidate sequences and summarize the final calculation results.

9. An electronic device, comprising:

at least one processor; and a memory, wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the code similarity detection method of any one of claims 1-4.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the code similarity detection method according to any one of claims 1 to 4.
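The common-line feature extraction recited in claims 1 and 5 can be illustrated with the following minimal sketch. It is a sketch under stated assumptions, not the claimed implementation: the suffix table `SUFFIX_TO_LANGUAGE`, the function names, and the whitespace stripping used as a stand-in for standardization are all hypothetical, and the threshold N is passed in as `top_n`.

```python
from collections import Counter
from pathlib import Path

# Hypothetical suffix table; the claims only state that the language is
# determined from the suffix of the source code file.
SUFFIX_TO_LANGUAGE = {".c": "c", ".java": "java", ".py": "python"}


def build_common_line_filters(corpus, top_n):
    """corpus maps a language name to an iterable of source texts.
    Returns, per language, the set of the top_n most frequent lines."""
    filters = {}
    for language, texts in corpus.items():
        counts = Counter(
            line.strip()
            for text in texts
            for line in text.splitlines()
            if line.strip()
        )
        filters[language] = {line for line, _ in counts.most_common(top_n)}
    return filters


def extract_features(filename, source, filters):
    """Drop every line found in the common-line filter of the file's
    language; the remaining lines are the features of the file."""
    language = SUFFIX_TO_LANGUAGE.get(Path(filename).suffix.lower())
    common = filters.get(language, set())  # unknown suffix: nothing filtered
    stripped = (line.strip() for line in source.splitlines())
    return [line for line in stripped if line and line not in common]
```

For example, if `return 0;` and `}` are the two most frequent lines in the C corpus, a C file keeps only its remaining, more distinctive lines as features, so such ubiquitous lines no longer inflate the similarity between unrelated files.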