CN112579155B - Code similarity detection method and device and storage medium - Google Patents

Info

Publication number
CN112579155B
Authority
CN
China
Prior art keywords
similar
fingerprint
hash
source code
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110198641.2A
Other languages
Chinese (zh)
Other versions
CN112579155A (en)
Inventor
高庆
李玫
张世琨
马森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Peking University Software Engineering Co ltd
Original Assignee
Beijing Peking University Software Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Peking University Software Engineering Co ltd filed Critical Beijing Peking University Software Engineering Co ltd
Priority to CN202110198641.2A priority Critical patent/CN112579155B/en
Publication of CN112579155A publication Critical patent/CN112579155A/en
Application granted granted Critical
Publication of CN112579155B publication Critical patent/CN112579155B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection

Abstract

An embodiment of the invention relates to the field of software detection and discloses a code similarity detection method comprising three stages. In the preprocessing stage, a large volume of source code files is preprocessed, features are extracted, and similar hash (Simhash) fingerprint values are output. In the fingerprint indexing stage, the fingerprints from the previous stage are segmented and recombined using a segmented indexing strategy, stored in a similar hash fingerprint library, and indexed by segment to enable fast matching. In the similarity matching stage, the project files under test are processed to generate similar hash values, and source tracing detection results are retrieved segment by segment from the similar hash fingerprint library according to those values. By filtering out the common lines of each programming language, the invention reduces the influence of the line coverage problem on the result.

Description

Code similarity detection method and device and storage medium
Technical Field
The invention relates to the field of software detection, in particular to a code similarity detection method, a system, a device and a storage medium.
Background
Nowadays, as software development becomes ever more widespread, the amount of open source code is growing rapidly. In both enterprises and research institutions, more and more developers copy and paste existing code to improve development efficiency. However, as software is continuously updated and its functionality grows, the negative impact of this duplicated and cloned code on software quality, usability, and maintainability becomes more pronounced. Code introduced from open source projects reduces developers' understanding and control of the overall software system; conflicts can arise between the external code and the system's own code; and vulnerabilities in the open source code can be carried into the project along with the copied code, creating security risks. To address this problem, researchers commonly use code similarity detection techniques to find similar code in software projects.
Since the 1970s, a large number of code similarity detection tools and methods have emerged in academia, and they are widely applied to code clone detection, software license violation detection, software plagiarism detection, vulnerability and defect discovery, and related tasks. Commonly used code similarity detection methods fall into five categories: metric-based, text-based, token-based, tree-based, and graph-based. As the amount of open source code grows, large-scale code similarity detection has become an inevitable trend. Traditional methods achieve high precision on small-scale data, but with limited hardware they consume excessive time if the code under test must first undergo string comparison or lexical, syntactic, or semantic analysis and then be compared against large-scale data in a library.
Similar hashing (Simhash) was first proposed by Charikar et al. and is a locality-sensitive hashing (LSH) algorithm. With a conventional hash algorithm, the resulting digital signature only indicates whether two inputs are equal; when they are unequal, it provides no further information. Even if only a single space in the input is changed, a completely different signature is very likely to result. With a similar hash algorithm, the Simhash fingerprints of two source code files indicate not only whether the files are equal but also how similar they are.
Because Simhash similarity search is highly efficient, it is widely used in many fields. Google uses Simhash to deduplicate massive numbers of near-duplicate web pages; the algorithm as adapted by Google performed well on a database of billions of web pages gathered by Google's crawler and has been applied in Google's products. Simhash also appears in the field of clone analysis: Uddin et al. proposed SimCad, a code clone clustering tool that combines Simhash with NiCad, and Qiao et al. used Simhash for homology detection of assembly code.
Because words repeat frequently in code, similar hash features computed at word granularity tend to let high-frequency features cover low-frequency ones. To avoid this, it has been proposed to accumulate features at code line granularity and to filter out pure symbol lines and blank lines during preprocessing, which further improves the applicability of similar hashing to code detection. As shown in Fig. 1, the algorithm flow is roughly as follows:
1) preprocess the input code file, split it into lines, and filter out pure symbol lines and blank lines;
2) for each line, obtain a signature value of a specified number of bits using a traditional hash method;
3) for each bit of each feature, set the bit to the feature's weight if the bit is 1, and to the negative of the weight if the bit is 0;
4) add the features bit by bit to obtain a result vector;
5) reduce dimensionality: replace each positive entry of the result vector with 1 and each negative entry with 0 to obtain the final fingerprint value.
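The five prior-art steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation referenced by Fig. 1: the text only says "a traditional hash method", so `hashlib.blake2b` truncated to 64 bits is used as a stand-in, every line is given weight 1, and the helper names (`line_hash`, `is_pure_symbol`, `simhash_lines`) are hypothetical.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption for this sketch


def line_hash(line: str) -> int:
    # Stand-in 64-bit hash (assumption): the source does not name the hash here.
    digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def is_pure_symbol(line: str) -> bool:
    # A line with no letters or digits, such as "}" or "});".
    return not any(ch.isalnum() for ch in line)


def simhash_lines(source: str, weight: int = 1) -> int:
    totals = [0] * BITS
    for raw in source.splitlines():
        line = raw.strip()
        if not line or is_pure_symbol(line):
            continue  # step 1: drop blank lines and pure symbol lines
        h = line_hash(line)  # step 2: one signature per line
        for i in range(BITS):
            # steps 3 and 4: +weight for a 1 bit, -weight for a 0 bit, summed per position
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # step 5: positive entries become 1, the rest 0
            fingerprint |= 1 << i
    return fingerprint
```

With a single remaining line, the fingerprint equals that line's hash, which is a quick way to check the mapping and reduction steps.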
Experimental analysis and statistics show that, in real projects, the existing similar hash method suffers from high-frequency line features covering low-frequency line features, which makes the results highly inaccurate. This line coverage phenomenon mainly occurs in the following two situations:
1) Some code lines contain keywords that appear frequently in code, such as "break", "return", and "try {". In some scenarios these lines occur so often that their effect on the result covers that of all other lines.
2) Owing to the particular function of the code (for example, utility classes or test classes that present informational prompts), certain functional lines appear frequently in it, such as repeated calls to a particular method or functions in several methods that pass the same information.
Of these two situations, the second can be regarded as a manifestation of the file's single-purpose function, so two code files that are similar because they contain many repetitions of the same methods do not count as false positives in code similarity measurement. In the first situation, by contrast, comments and the many repeated generic lines do not characterize the function of the code file and therefore have a major impact on the accuracy of the result.
If the line coverage phenomenon is handled simply by reducing the weight of lines that appear frequently in a file, the influence on the fingerprint of the repeated method-call lines in the second situation may also be weakened; the file's specific function is then no longer reflected, which leads to false positives for that file.
Disclosure of Invention
The invention aims to provide a code similarity detection method which can reduce the influence of a line coverage problem on a result from the angle of eliminating common lines in different languages.
In order to solve the above technical problem, an embodiment of the present invention provides a code similarity detection method, including the following steps:
S101: a preprocessing stage, in which a large volume of source code files is preprocessed, features are extracted, and similar hash fingerprint values are output;
S102: a fingerprint indexing stage, in which, based on the result of the preprocessing stage, the fingerprints are segmented and recombined using a segmented indexing strategy and stored in a similar hash fingerprint library, and a segmented index is built to enable fast matching;
S103: a similarity matching stage, in which the project file under test is processed to generate a similar hash value, and source tracing detection results are retrieved segment by segment from the similar hash fingerprint library according to that value.
Preferably, step S101 includes the following sub-steps:
S1011, code preprocessing: the code of a source code file contains irrelevant factors, including blank lines, spaces, and comments; the code is uniformly formatted to remove them;
S1012, feature extraction: common-line statistics are computed in advance over open source project files in several common languages, and for each language the top N lines by frequency are taken as its common-line list, where N is a preset threshold; the language of a source code file is determined from its file extension, the common-line filter for that language is selected from the common-line lists, every line of the source code file that appears in the filter is removed, and each remaining line is taken as a feature of the source code file;
S1013, hash processing: a hash algorithm is used to compute a hash value for each feature of the source code file, the hash value being a 64-bit binary string;
S1014, weighted summation: the frequency of each feature is used as its weight; each bit of the feature's hash value determines the sign of the contribution (the weight is added for a 1 bit and subtracted for a 0 bit), and the weighted hash values are added bit by bit to obtain a result sequence;
S1015, dimensionality reduction: the result sequence obtained by weighted summation is transformed entry by entry, each positive entry becoming 1 and every other entry becoming 0, to obtain the final similar hash fingerprint value.
Preferably, in step S102, each fingerprint is divided into 5 segments, and the segments are recombined by permutation and combination into index values that are recorded in the corresponding fields of the similar hash fingerprint library.
Preferably, in step S103, the project file under test first undergoes, as in the preprocessing stage, normalization, feature extraction, and similar hash fingerprint generation; then, as in the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, which are combined in pairs to obtain 10 index segments, and each index segment is matched exactly against the corresponding field in the database, the matched files being candidates that may be similar to the file under test; finally, the Hamming distance is computed for the hash values of all candidates, and the results are summarized.
The embodiment of the invention also provides a code similarity detection system based on the similar hash algorithm, which comprises the following modules:
the preprocessing module is used for preprocessing and extracting features of the massive source code files and outputting similar hash fingerprint values;
the fingerprint index module is used for segmenting and recombining fingerprints according to the result of the preprocessing module by adopting a segmented index strategy and then storing the segmented fingerprints into a similar hash fingerprint library, and establishing a segmented index for facilitating quick matching;
and the similar matching module is used for processing the engineering file to be detected to generate a similar hash value, and retrieving the tracing detection result from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected.
Preferably, the preprocessing module comprises the following sub-modules:
the code preprocessing submodule is used for carrying out uniform formatting processing on the code and removing the irrelevant factors;
a feature extraction submodule to: performing common line statistics on open source project files of multiple common languages in advance, taking the first N lines of each language which are sorted according to frequency as a common line list, wherein N is a preset threshold value, judging the language to which the open source project files belong according to the suffix name of the source code file, selecting a common line filter of the corresponding language from the common line list, completely screening out the corresponding lines contained in the filter in the source code file, and taking the remaining each line as the characteristic of the source code file;
a hash processing sub-module to: calculating a corresponding hash value of each characteristic of the source code file by using a hash algorithm, wherein the hash value is a 64-bit binary string;
a weighted sum sub-module for: taking the frequency of each feature as the weight of the feature to carry out weighted summation, determining whether the hash value is multiplied by the weight value in a positive or negative way by 0 or 1 of each bit of the hash value, and adding the weighted hash values bit by bit to obtain a result sequence string;
a dimension reduction submodule to: and transforming the result sequence string obtained by weighted summation, if each bit is a positive number, transforming into 1, otherwise, transforming into 0, and obtaining the final similar hash fingerprint value.
Preferably, the fingerprint indexing module divides each fingerprint into 5 segments and recombines the segments by permutation and combination into index values that are recorded in the corresponding fields of the similar hash fingerprint library.
Preferably, the similarity matching module is configured to: first, as in the preprocessing stage, perform normalization, feature extraction, and similar hash fingerprint generation on the project file under test; then, as in the fingerprint indexing stage, divide the similar hash fingerprint value to be queried into 5 segments, combine the segments in pairs to obtain 10 index segments, and match each index segment exactly against the corresponding field in the database, the matched files being candidates that may be similar to the file under test; and finally, compute the Hamming distance for the hash values of all candidates and summarize the results.
An embodiment of the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the code similarity detection method as described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the code similarity detection method as described above.
Compared with the prior art, and with scale and efficiency in mind, the embodiment of the invention optimizes the text-based, line-granularity similar hash (Simhash) algorithm and provides a similar hash detection method that filters lines by language. On the one hand, the method retains the efficient similarity retrieval of similar hashing: it maps preprocessed source code to a binary number (the fingerprint of the code), which reduces the dimensionality of code files, makes them indexable, and speeds up the construction of a large-scale library. On the other hand, the method takes the characteristics of each language's code into account and eliminates the influence of common lines on fingerprint generation, so that the fingerprint reflects the characteristics of the code and the accuracy of similarity detection improves greatly. The method is easy to implement and deploy, does not depend on lexical or syntactic analyzers, is efficient to build, and is inexpensive to migrate to other languages; its high precision keeps the number of incorrect matching pairs in the result low, which reduces the cost of subsequent manual verification.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings. Elements with the same reference numerals in the drawings denote similar elements, and the drawings are not to scale unless otherwise specified.
Fig. 1 is a schematic diagram of a similar hash fingerprint acquisition method in the prior art.
Fig. 2 is a schematic diagram of an improved similar hash fingerprint acquisition method according to the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to give the reader a better understanding of the invention; the technical solution claimed can nevertheless be implemented without these details and with various changes and modifications based on the following embodiments. The embodiments are divided as they are for convenience of description only; this division does not limit the specific implementation of the invention, and the embodiments may be combined with and refer to one another where no contradiction arises.
The invention is described in detail below with reference to the drawings and specific examples.
To eliminate the influence on the result of code lines that appear frequently because of the syntactic structure of a language, the invention selects 10 common languages (C#, C/C++, Go, Java, JavaScript, PHP, Python, Ruby, SQL, and Swift) and computes common-line statistics for each. The data source is the top 50,000 open source projects on GitHub ranked by star count. The source code files of each language are preprocessed, their lines are sorted in descending order of occurrence frequency, and the top 20,000 lines are taken as the filter content for that language's line filter. The top 15 common lines for some of the languages are shown in Table 1.
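The common-line statistics described above amount to a frequency count followed by a top-N cut. The sketch below is illustrative only: the function name `build_common_line_list` is hypothetical, the per-line normalization (strip all whitespace, lowercase) is a simplified assumption about the preprocessing, comment removal is omitted, and the caller is assumed to pass in the files of one language at a time.

```python
from collections import Counter
from typing import Iterable, List


def build_common_line_list(files: Iterable[str], top_n: int) -> List[str]:
    """Return the top_n most frequent normalized lines across the given file contents."""
    counts = Counter()
    for text in files:
        for raw in text.splitlines():
            # Simplified normalization (assumption): drop all whitespace and lowercase.
            line = "".join(raw.split()).lower()
            if line:  # skip blank lines
                counts[line] += 1
    # most_common sorts in descending order of frequency
    return [line for line, _ in counts.most_common(top_n)]
```

In the setting described in the text, `top_n` would be 20,000 and the function would be run once per language.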
[Table 1 is an image in the original publication and is not reproduced here.]
TABLE 1 statistical results of common lines in different languages
As Table 1 shows, the lines that appear frequently in code files include not only pure symbol lines but also some lines with specific semantics. Their frequent appearance does not directly reflect the function of the code, and they carry little semantic information, so this kind of line coverage differs from the coverage caused by frequently occurring functional lines described above. If only pure symbol lines are filtered out, fingerprint feature extraction is still affected by these frequent but weakly characteristic lines, and when their proportion is high the resulting line coverage seriously reduces accuracy. Moreover, although some common lines, such as "return false", "} else {", and "return", are shared across languages, the common-line lists differ considerably because the languages have different syntax, and some common lines are language-specific, such as "pass" in Python and "ensure" in Ruby.
The invention therefore improves on the line-granularity similar hash algorithm. As shown in Fig. 2, the source code under test is preprocessed by removing blank lines, comments, and spaces and converting all uppercase letters to lowercase, and the code file is split into lines. The language of the file is determined from its file extension, the common-line list for that language is selected from the common-line library as a filter, every line of the file that appears in the filter is removed, and each remaining line is taken as a feature of the code file. The features are hashed in turn with MurmurHash, each result is mapped bit by bit to a sequence of 1 and -1 values, and the final fingerprint is obtained through accumulation and dimensionality reduction.
The method disclosed by the invention is mainly divided into three stages, namely a preprocessing stage, a fingerprint indexing stage and a similarity matching stage. And in the preprocessing stage, preprocessing is carried out on the massive source code files, then screening and extracting of features are carried out, and finally similar Hash fingerprints are generated. And in the fingerprint indexing stage, according to the result of the previous stage, the fingerprints are segmented and recombined by adopting a segmented indexing strategy and then stored in a similar hash fingerprint library, and segmented indexes are established to facilitate quick matching. And in the similar matching stage, after the engineering file to be detected is processed, a similar hash value is generated, and the tracing detection result is retrieved from the similar hash fingerprint library in a segmented manner according to the similar hash value of the engineering file to be detected.
The method and system of the present invention are further described below. The method of the embodiment of the present invention may be executed by processors in various types of intelligent computing devices, and specifically, the method of the embodiment of the present invention includes the following steps:
S101, the preprocessing stage:
In the preprocessing stage, a large volume of source code files is preprocessed, features are extracted, and fingerprint values are output. The improved similar hash fingerprint acquisition method is shown in Fig. 2 and includes the following sub-steps:
S1011, code preprocessing: irrelevant factors in the code, such as blank lines, spaces, and comments, affect the generated hash. To improve the accuracy of similarity matching, the source code under test is uniformly formatted: blank lines, comments, and spaces are removed, and all uppercase letters are converted to lowercase.
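A minimal sketch of this normalization for C-style languages might look as follows. It is an assumption-laden illustration: the function name `normalize` is hypothetical, only `/* ... */` and `//` comments are handled, and the regular expressions are deliberately naive (they would also strip a `//` that appears inside a string literal, such as a URL).

```python
import re
from typing import List

# Naive comment patterns for C-style languages (assumption; other languages need other rules).
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalize(source: str) -> List[str]:
    """Remove comments, whitespace, and blank lines; lowercase what remains."""
    text = _BLOCK_COMMENT.sub("", source)
    text = _LINE_COMMENT.sub("", text)
    lines = []
    for raw in text.splitlines():
        line = "".join(raw.split()).lower()  # drop every whitespace character
        if line:  # drop lines that are now empty
            lines.append(line)
    return lines
```

A production implementation would more likely use a per-language tokenizer so that string literals are left intact.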
S1012, feature extraction: to eliminate the influence of lines that occur in large numbers but carry no meaning, common-line statistics are computed in advance over the open source project files of the top 50,000 GitHub projects by star count for the 10 common languages, and for each language the top 20,000 lines by frequency are taken as its common-line list. The language of a file is determined from its file extension, the common-line filter for that language is selected from the common-line lists, every line of the file that appears in the filter is removed, and each remaining formatted line is taken as a feature of the source code file.
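The extension-based filter selection can be sketched as a dictionary lookup followed by a set-membership test. The filter contents below are tiny hypothetical samples, not entries from Table 1, and the names `COMMON_LINES` and `extract_features` are illustrative; the lines are assumed to be already normalized.

```python
import os
from typing import Dict, List, Set

# Hypothetical, heavily shortened common-line filters keyed by file extension.
COMMON_LINES: Dict[str, Set[str]] = {
    ".py": {"pass", "else:", "returnfalse"},
    ".java": {"}", "}else{", "returnfalse;", "break;"},
}


def extract_features(filename: str, normalized_lines: List[str]) -> List[str]:
    """Drop every line found in the filter of the file's language; keep the rest as features."""
    ext = os.path.splitext(filename)[1].lower()
    common = COMMON_LINES.get(ext, set())  # unknown extension: nothing is filtered
    return [line for line in normalized_lines if line not in common]
```

Storing each filter as a set keeps the per-line check at constant expected time, which matters when each filter holds 20,000 entries.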
S1013, hash processing: the conventional hash algorithm MurmurHash is used to compute a hash value for each feature of the source code file. The hash value is a 64-bit binary string; Fig. 2 illustrates the process with a truncated 8-bit string.
S1014, weighted summation: the frequency of each feature is used as its weight, and a weighted sum yields the result sequence. As described in S1012 to S1013, each line is a feature and each feature has a hash value. Within one file, identical lines are repeated occurrences of the same feature. Counting the occurrences of each distinct line gives its frequency, which is its weight w, and weighting turns the line's hash value (a binary string) bit by bit into w or -w. In practice the frequency need not be counted explicitly: each line is treated as an independent item with weight w = 1, each bit of its hash value (1 or 0) determines whether the weight is added or subtracted, so the hash value is mapped bit by bit to 1 or -1, and the weighted hash values are then added bit by bit to obtain the result sequence. See the "map" and "accumulate" steps in Fig. 2.
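The claim that explicit frequency counting is unnecessary can be checked directly: weighting each distinct line by its frequency gives the same per-bit totals as adding every occurrence with weight 1. The sketch below demonstrates this under one assumption that should be noted: `hashlib.blake2b` truncated to 64 bits stands in for MurmurHash, which is not in the Python standard library. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List

BITS = 64


def h64(feature: str) -> int:
    # Stand-in for MurmurHash (assumption): any stable 64-bit hash serves the demonstration.
    return int.from_bytes(hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest(), "big")


def accumulate_by_frequency(features: List[str]) -> List[int]:
    """Weight each distinct line by its occurrence count w, contributing +w or -w per bit."""
    totals = [0] * BITS
    for feature, w in Counter(features).items():
        h = h64(feature)
        for i in range(BITS):
            totals[i] += w if (h >> i) & 1 else -w
    return totals


def accumulate_per_occurrence(features: List[str]) -> List[int]:
    """Treat every line occurrence independently with weight 1, contributing +1 or -1 per bit."""
    totals = [0] * BITS
    for feature in features:
        h = h64(feature)
        for i in range(BITS):
            totals[i] += 1 if (h >> i) & 1 else -1
    return totals
```

Both functions return identical totals for any input, because a line occurring w times contributes w copies of the same signed bit pattern.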
S1015, dimensionality reduction: the result sequence obtained by weighted summation is transformed entry by entry, each positive entry becoming 1 and every other entry becoming 0, to obtain the final similar hash fingerprint value.
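A small worked example of the map, accumulate, and reduce steps on 8-bit strings may help. The three hash strings below are made up for illustration and are not the values shown in Fig. 2; the function names are hypothetical.

```python
from typing import List


def map_bits(bits: str) -> List[int]:
    """Map a binary string to +1 / -1 per position (weight 1 per feature)."""
    return [1 if b == "1" else -1 for b in bits]


def accumulate(hashes: List[str]) -> List[int]:
    """Add the mapped sequences position by position."""
    totals = [0] * len(hashes[0])
    for h in hashes:
        for i, v in enumerate(map_bits(h)):
            totals[i] += v
    return totals


def reduce_to_fingerprint(totals: List[int]) -> str:
    """Positive entries become 1; zero and negative entries become 0."""
    return "".join("1" if t > 0 else "0" for t in totals)


# Illustrative 8-bit feature hashes (made up for this example).
hashes = ["10110010", "10010110", "00110011"]
totals = accumulate(hashes)                  # [1, -3, 1, 3, -3, -1, 3, -1]
fingerprint = reduce_to_fingerprint(totals)  # "10110010"
```

Note that a total of exactly zero, which can occur with an even number of features, falls under the "otherwise" branch and is mapped to 0.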
S102, fingerprint indexing:
For two source code files, the criterion for deciding whether they are similar, and how similar, is the Hamming distance between their fingerprints. In a massive data set, searching for fingerprint values close to the fingerprint of the code under test is time-consuming, so the invention designs an index optimization that greatly improves query efficiency through segmented indexing.
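The Hamming distance between two fingerprints is the number of bit positions in which they differ, which for integers is the population count of their XOR. The sketch below uses a default threshold of 3, the upper limit mentioned in the sensitivity test of this document; the function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, threshold: int = 3) -> bool:
    """Two fingerprints are treated as similar when their distance is within the threshold."""
    return hamming_distance(a, b) <= threshold
```

Comparing one query against every stored fingerprint this way is linear in the library size, which is the cost the segmented index is meant to avoid.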
Specifically, the invention divides each fingerprint into 5 segments of lengths 13, 13, 13, 13, and 12 bits, and records the segments, recombined by permutation and combination, as indexes in the fields of the database. In the similarity matching stage, the hash value of the item under test can therefore be located quickly, via the index, among the candidates that may match, from which the source tracing result is then obtained. On a data set of size 2^64, this indexing method greatly narrows the range of candidates for each similarity search and greatly improves retrieval efficiency.
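The reason pairwise segment keys suffice follows from a counting argument: if two 64-bit fingerprints differ in at most 3 bits, those bits fall in at most 3 of the 5 segments, so at least 2 segments are identical and at least one of the C(5,2) = 10 segment pairs matches exactly. The sketch below generates those 10 keys. The key layout (a tuple of segment positions and values) is an assumption for illustration; the document does not specify how the database fields are arranged.

```python
from itertools import combinations
from typing import List, Tuple

SEGMENT_LENGTHS = (13, 13, 13, 13, 12)  # sums to 64 bits


def split_segments(fp: int) -> List[int]:
    """Split a 64-bit fingerprint into 5 segments, most significant bits first."""
    segments = []
    shift = 64
    for n in SEGMENT_LENGTHS:
        shift -= n
        segments.append((fp >> shift) & ((1 << n) - 1))
    return segments


def index_keys(fp: int) -> List[Tuple[int, int, int, int]]:
    """Build the 10 pairwise keys (i, j, segment_i, segment_j) for one fingerprint."""
    segs = split_segments(fp)
    return [(i, j, segs[i], segs[j]) for i, j in combinations(range(5), 2)]
```

Each stored fingerprint is written under all 10 of its keys, so a query only needs exact lookups on its own 10 keys to reach every fingerprint within distance 3.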
S103, similarity matching:
Once the above steps are complete, the similar hash fingerprint library has been built. The similarity matching stage performs source tracing detection on the project under test: its input is the source code of the project under test, and its output is every result found in the fingerprint library that is similar to code in the input project.
First, as in the preprocessing stage, each file undergoes normalization, feature extraction, and similar hash fingerprint generation. Then, as in the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, which are combined in pairs to obtain 10 index segments; each index segment is matched exactly against the corresponding field in the database, and the matched files are candidates that may be similar to the file under test. Finally, the Hamming distance is computed for the hash values of all candidates, and the results are summarized.
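The two-step lookup, exact matching on index segments followed by a Hamming distance check on the candidates, can be sketched with an in-memory dictionary standing in for the database. Everything here is illustrative: the class name `FingerprintLibrary`, the key layout, and the default threshold of 3 are assumptions, and a real deployment would use indexed database fields as the text describes.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Tuple

SEGMENT_LENGTHS = (13, 13, 13, 13, 12)


def index_keys(fp: int) -> List[Tuple[int, int, int, int]]:
    """The 10 pairwise segment keys of a 64-bit fingerprint."""
    segs, shift = [], 64
    for n in SEGMENT_LENGTHS:
        shift -= n
        segs.append((fp >> shift) & ((1 << n) - 1))
    return [(i, j, segs[i], segs[j]) for i, j in combinations(range(5), 2)]


class FingerprintLibrary:
    """In-memory stand-in for the similar hash fingerprint library (assumption)."""

    def __init__(self) -> None:
        self._index = defaultdict(set)

    def add(self, name: str, fp: int) -> None:
        for key in index_keys(fp):
            self._index[key].add((name, fp))

    def query(self, fp: int, max_distance: int = 3) -> List[Tuple[str, int]]:
        # Step 1: exact match on each of the 10 index keys gives the candidate set.
        candidates = set()
        for key in index_keys(fp):
            candidates |= self._index.get(key, set())
        # Step 2: verify each candidate with the full Hamming distance.
        scored = [(name, bin(fp ^ cand).count("1")) for name, cand in candidates]
        return sorted((r for r in scored if r[1] <= max_distance), key=lambda r: (r[1], r[0]))
```

A candidate that shares an index key is not necessarily within the threshold, which is why the second step is still required.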
The above method is described below by way of an example:
in order to verify the effect of the invention, a total of 60 items with top star ranking were obtained from the Github website as a test set. Github is a famous open source project community, a large number of high-quality open source project source codes which are widely applied and quoted are arranged in the community, and projects with the most attention and the most reuse are usually projects with the most star counts.
And screening the acquired project source codes, only leaving files with suffixes related to partial codes, and eliminating interference of irrelevant files. The finally obtained files are 73453 in total and total 7 GB.
The sensitivity was first tested and the test was performed on three different scale files. As shown in table 2, the Simhash values of the files with different sizes have different sensitivity degrees to file changes, and the smaller the file is, the more sensitive the file is to the change; the larger the file, the less sensitive to changes. This feature is precisely what is needed in document affinity matching, since the larger the document to be detected, the larger the range of variation that can be tolerated. Meanwhile, no matter how many lines of the file is, the Simhash value of the file reaches the upper limit 3 of the Hamming distance similarity judgment after the adding and deleting operations accounting for 2% -4% of the total lines.
The accuracy of homologous matching was then tested. The 73453 collected files were stored in the fingerprint library using the preprocessing and indexing methods described above. Then 1317 files were selected from the fingerprint library, and 100 files produced by manually making small modifications to source files in the library were added, giving a total of 1417 files as the project files to be tested. In the similarity matching stage, homology analysis results were obtained for each file in turn, and the matched clone pairs in the results were verified one by one. Statistical analysis shows that all identical files were matched correctly and that the partially modified files were matched to their corresponding code files. Across the matching results of all tested files, the false alarm rate was only 1.41%, which indicates that the method can achieve good results in practical engineering.
Table 2. Sensitivity test of the improved similar hash fingerprint (the table is an image in the original publication and is not reproduced here)
Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program that instructs the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods in the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention, and that in practice various changes in form and detail may be made to them without departing from the spirit and scope of the invention.

Claims (10)

1. A code similarity detection method, characterized by comprising the following steps:
S101: a preprocessing stage, in which massive source code files are preprocessed, features are extracted from them, and similar hash fingerprint values are output;
S102: a fingerprint indexing stage, in which, according to the result of the preprocessing stage, a segmented indexing strategy is adopted to segment and recombine the fingerprints, the fingerprints are stored in a similar hash fingerprint library, and a segmented index is established to facilitate fast matching;
S103: a similarity matching stage, in which the engineering file to be detected is processed to generate a similar hash value, and a tracing detection result is retrieved segment by segment from the similar hash fingerprint library according to the similar hash value of the engineering file to be detected;
wherein the feature extraction in step S101 is as follows: common-line statistics are computed in advance over open source project files in multiple common languages; for each language, the top N lines ranked by frequency are taken as a common-line list, N being a preset threshold; the language to which the source code file belongs is determined from the suffix of the source code file; the common-line filter of the corresponding language is selected from the common-line list; all lines of the source code file that are contained in the filter are screened out; and each remaining line is taken as a feature of the source code file.
2. The code similarity detection method according to claim 1, wherein step S101 comprises the following sub-steps:
S1011, code preprocessing: the code of the source code file contains irrelevant factors, including blank lines, spaces, and comments; the code is formatted uniformly to remove the irrelevant factors;
S1012, feature extraction: common-line statistics are computed in advance over open source project files in multiple common languages; for each language, the top N lines ranked by frequency are taken as a common-line list, N being a preset threshold; the language to which the source code file belongs is determined from the suffix of the source code file; the common-line filter of the corresponding language is selected from the common-line list; all lines of the source code file that are contained in the filter are screened out; and each remaining line is taken as a feature of the source code file;
S1013, hash processing: a hash algorithm is used to calculate the hash value of each feature of the source code file, the hash value being a 64-bit binary string;
S1014, weighted summation: the frequency of each feature is taken as its weight; for each bit of the hash value, whether the bit is 0 or 1 determines whether the weight is applied with a negative or a positive sign; and the weighted hash values are added bit by bit to obtain a result sequence string;
S1015, dimensionality reduction: the result sequence string obtained by weighted summation is transformed, each position being transformed into 1 if its value is positive and into 0 otherwise, to obtain the final similar hash fingerprint value.
3. The code similarity detection method according to claim 1, wherein in step S102 each fingerprint is divided into 5 segments, and the 5 segments are recombined by permutation and combination to serve as indexes, which are recorded in the respective fields of the similar hash fingerprint database.
4. The code similarity detection method according to claim 1, wherein in step S103: first, corresponding to the preprocessing stage, the engineering file to be detected is normalized, its features are extracted, and its similar hash fingerprint is generated; then, corresponding to the fingerprint indexing stage, the similar hash fingerprint value to be queried is divided into 5 segments, the segments are combined in pairs to obtain 10 index segments, and each index segment is matched exactly against the corresponding field in the database, the matched files being files that may be similar to the file to be detected; and finally, the Hamming distance is further calculated for the hash values of all candidates, and the results are summarized.
5. A code similarity detection system based on a similar hash algorithm, characterized by comprising the following modules:
a preprocessing module, configured to preprocess massive source code files, extract features from them, and output similar hash fingerprint values;
a fingerprint indexing module, configured to adopt a segmented indexing strategy to segment and recombine the fingerprints according to the result of the preprocessing module, store the fingerprints in a similar hash fingerprint library, and establish a segmented index to facilitate fast matching;
a similarity matching module, configured to process the engineering file to be detected to generate a similar hash value, and to retrieve a tracing detection result segment by segment from the similar hash fingerprint library according to the similar hash value of the engineering file to be detected;
wherein the feature extraction in the preprocessing module is as follows: common-line statistics are computed in advance over open source project files in multiple common languages; for each language, the top N lines ranked by frequency are taken as a common-line list, N being a preset threshold; the language to which the source code file belongs is determined from the suffix of the source code file; the common-line filter of the corresponding language is selected from the common-line list; all lines of the source code file that are contained in the filter are screened out; and each remaining line is taken as a feature of the source code file.
6. The code similarity detection system based on a similar hash algorithm according to claim 5, wherein the preprocessing module comprises the following sub-modules:
a code preprocessing sub-module, configured to format the code uniformly and remove the irrelevant factors;
a feature extraction sub-module, configured to: compute common-line statistics in advance over open source project files in multiple common languages; for each language, take the top N lines ranked by frequency as a common-line list, N being a preset threshold; determine the language to which the source code file belongs from the suffix of the source code file; select the common-line filter of the corresponding language from the common-line list; screen out all lines of the source code file that are contained in the filter; and take each remaining line as a feature of the source code file;
a hash processing sub-module, configured to: calculate the hash value of each feature of the source code file using a hash algorithm, the hash value being a 64-bit binary string;
a weighted summation sub-module, configured to: take the frequency of each feature as its weight; for each bit of the hash value, apply the weight with a negative or a positive sign according to whether the bit is 0 or 1; and add the weighted hash values bit by bit to obtain a result sequence string;
a dimensionality reduction sub-module, configured to: transform the result sequence string obtained by weighted summation, each position being transformed into 1 if its value is positive and into 0 otherwise, to obtain the final similar hash fingerprint value.
7. The code similarity detection system based on a similar hash algorithm according to claim 5, wherein the fingerprint indexing module divides each fingerprint into 5 segments, recombines them by permutation and combination to serve as indexes, and records the indexes in the respective fields of the similar hash fingerprint database.
8. The system according to claim 5, wherein the similarity matching module is configured to: first, corresponding to the preprocessing stage, normalize the engineering file to be detected, extract its features, and generate its similar hash fingerprint; then, corresponding to the fingerprint indexing stage, divide the similar hash fingerprint value to be queried into 5 segments, combine the segments in pairs to obtain 10 index segments, and match each index segment exactly against the corresponding field in the database, the matched files being files that may be similar to the file to be detected; and finally, further calculate the Hamming distance for the hash values of all candidates and summarize the results.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the code similarity detection method of any one of claims 1-4.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the code similarity detection method according to any one of claims 1 to 4.
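Sub-steps S1011 to S1015 of claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions: the `COMMON_LINES` set is a hypothetical common-line filter for one language, comment removal handles only `//` line comments, MD5 truncated to 64 bits stands in for the unspecified hash algorithm, and the sign convention (bit 1 adds the weight, bit 0 subtracts it) follows the usual Simhash convention rather than anything stated in the claims.

```python
import hashlib
from collections import Counter

# Hypothetical common-line filter for a C-like language (assumed, not from the text).
COMMON_LINES = {"{", "}", "return;", "else{", "break;"}


def normalize(source):
    """S1011: drop `//` comments, all whitespace, and blank lines (simplified)."""
    lines = []
    for raw in source.splitlines():
        line = raw.split("//", 1)[0]   # naive: ignores `//` inside strings
        line = "".join(line.split())   # remove spaces and tabs
        if line:
            lines.append(line)
    return lines


def extract_features(lines, common_lines=COMMON_LINES):
    """S1012: screen out common lines; map each remaining line to its frequency."""
    return Counter(line for line in lines if line not in common_lines)


def hash64(feature):
    """S1013: 64-bit hash of one feature (MD5 truncated to 8 bytes, an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features):
    """S1014 and S1015: frequency-weighted bitwise sum, then sign per position."""
    totals = [0] * 64
    for feature, weight in features.items():
        value = hash64(feature)
        for bit in range(64):
            if (value >> bit) & 1:
                totals[bit] += weight   # bit 1: add the weight (assumed convention)
            else:
                totals[bit] -= weight   # bit 0: subtract the weight
    fingerprint = 0
    for bit in range(64):
        if totals[bit] > 0:             # positive sum becomes 1, otherwise 0
            fingerprint |= 1 << bit
    return fingerprint
```

For example, `simhash(extract_features(normalize(text)))` yields the 64-bit fingerprint of one file; two files that differ only in whitespace, blank lines, or `//` comments produce the same fingerprint under this sketch.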
CN202110198641.2A 2021-02-23 2021-02-23 Code similarity detection method and device and storage medium Active CN112579155B (en)


Publications (2)

Publication Number Publication Date
CN112579155A (en) 2021-03-30
CN112579155B (en) 2021-05-18




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Code similarity detection method, device and storage medium

Effective date of registration: 20220826

Granted publication date: 20210518

Pledgee: Beijing first financing Company limited by guarantee

Pledgor: BEIJING PEKING UNIVERSITY SOFTWARE ENGINEERING CO.,LTD.

Registration number: Y2022980013696
