CN116166321B - Code clone detection method, system and computer readable storage medium - Google Patents


Info

Publication number
CN116166321B
Authority
CN
China
Prior art keywords
source code
code
segmentation
detected
codes
Prior art date
Legal status
Active
Application number
CN202310457759.1A
Other languages
Chinese (zh)
Other versions
CN116166321A (en)
Inventor
陈晓莉
国毓芯
朱崇
赵祥廷
林建洪
Current Assignee
Zhejiang Ponshine Information Technology Co ltd
Original Assignee
Zhejiang Ponshine Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Ponshine Information Technology Co ltd filed Critical Zhejiang Ponshine Information Technology Co ltd
Priority to CN202310457759.1A
Publication of CN116166321A
Application granted
Publication of CN116166321B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a code clone detection method, a system and a computer readable storage medium. The code clone detection method comprises the following steps: S1, collecting a source code data set, performing cluster analysis, and outputting class labels and label features of n classes of source codes; S2, sequentially processing the code to be detected to obtain segmentation matrices; S3, matching the segmentation matrices against the source codes to obtain the target source code class label corresponding to each segmentation matrix; S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; S5, inputting the retained source code fragments and the code to be detected into an LSTM-DSSM network model to calculate similarity scores, and outputting the source code fragment with the highest similarity. The invention can effectively detect whether source code cloning exists.

Description

Code clone detection method, system and computer readable storage medium
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a code clone detection method, a code clone detection system and a computer readable storage medium.
Background
Source code cloning refers to the presence of two or more identical or similar source code fragments in a code base, and is a common phenomenon in software development. Source code cloning can improve developer efficiency to a certain extent, but it also easily introduces external vulnerabilities and causes a series of security problems.
Disclosure of Invention
Based on the foregoing deficiencies in the art, the present invention is directed to a code clone detection method, system and computer readable storage medium.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
a code clone detection method comprises the following steps:
S1, acquiring a source code data set, and sequentially performing fragment segmentation and data preprocessing on the source codes of the source code data set to obtain a source code fragment matrix data set; performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set, and outputting class labels and label features of n classes of source codes; n is an integer greater than 1;
S2, sequentially performing fragment segmentation and data preprocessing on the code to be detected to obtain s/x segmentation matrices; wherein s represents the length of the code to be detected, and x represents the fragment segmentation length;
S3, matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; wherein top-N denotes the first N entries in the descending order;
S5, inputting the source code fragments corresponding to the top-N scores and the code to be detected into an LSTM-DSSM network model, calculating the similarity scores between those source code fragments and the code to be detected, and outputting the source code fragment with the highest similarity.
Preferably, in step S5, the processing procedure of the LSTM-DSSM network model comprises:
S51, performing word segmentation on the input code based on a BERT model, and converting it through a token embedding layer to obtain the LSTM input;
S52, passing the LSTM input through an LSTM model to output a latent semantic vector;
S53, inputting the latent semantic vector corresponding to the source code fragment and the latent semantic vector corresponding to the code to be detected into a DSSM model, and calculating the similarity score between the source code fragment and the code to be detected.
As a preferred scheme, there are N DSSM models; the output of each DSSM model is connected to a fully connected layer, and a Softmax layer follows the fully connected layer so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-N scores.
Preferably, the data preprocessing comprises data cleaning, text word segmentation and vectorization, and matrices with missing values after vectorization are padded with 0.
As a preferred scheme, the text word segmentation uses the word segmentation library NLTK to tokenize English code.
Preferably, in step S3, the source code class with the largest attribution weight is selected, according to a voting method, as the target source code class label corresponding to the segmentation matrix.
In a preferred embodiment, in step S1, the cluster analysis uses a GMM clustering algorithm, and the GMM parameters are estimated by the EM algorithm.
Preferably, the value of N is 3-6.
The invention also provides a code clone detection system, which applies the code clone detection method according to any of the above schemes and comprises:
the acquisition module, which is used for acquiring the source code data set or the code to be detected;
the data processing module, which is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected;
the cluster analysis module, which is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes;
the matching module, which is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
the calculation and sorting module, which is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores;
the detection module, which is used for inputting the source code fragments corresponding to the top-N scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
The present invention also provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the code clone detection method according to any one of the above aspects.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention can effectively detect whether source code cloning or tampering exists; by inputting the code to be detected, the user can obtain the source code with the highest similarity;
(2) The invention recognizes that the bag-of-words (BOW) model used in the representation layer of the DSSM model loses word order information and context information, and therefore introduces an LSTM-DSSM network model, using the LSTM to capture long-range context features and word order information.
Drawings
FIG. 1 is a flow chart of a code clone detection method of embodiment 1 of the present invention;
FIG. 2 is a block diagram of the LSTM-DSSM network model of embodiment 1 of the invention;
FIG. 3 is a network architecture diagram of the LSTM fused with BERT in embodiment 1 of the invention;
FIG. 4 is a block diagram of a code clone detection system of embodiment 1 of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
Example 1:
As shown in fig. 1, the code clone detection method of the present embodiment includes the following steps:
S1, constructing a source code repository. The construction process of the source code repository in this embodiment comprises the following steps:
S11, collecting a source code data set, and sequentially performing fragment segmentation and data preprocessing on each source code file in the source code data set to obtain a source code fragment matrix data set.
The fragment segmentation divides the source code into fragments at intervals of x lines of code, where x can be 100 lines, 200 lines, etc., and can be set according to the actual situation; the data preprocessing comprises data cleaning, text word segmentation and vectorization to obtain source code fragment matrices, and fragment matrices with missing values after vectorization are padded with 0.
The specific processes of data cleaning and vectorization in this embodiment may refer to the prior art and are not described here in detail. The text word segmentation uses the common word segmentation library NLTK to tokenize the English code, and the tokenization result is then vectorized to obtain a data set consisting of source code fragment matrices, i.e., the source code fragment matrix data set, as sketched below.
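For illustration only, a minimal sketch of step S11 is given below. It assumes NLTK's word_tokenize for the English code word segmentation and a TF-IDF vectorizer for the vectorization (the embodiment does not fix a particular vectorization method); the function names, the regular-expression cleaning rules and the parameter X_LINES are illustrative and not part of the claimed method.

```python
# Sketch of S11: fragment segmentation and preprocessing (assumptions: NLTK tokenization, TF-IDF vectorization).
import re
import numpy as np
from nltk.tokenize import word_tokenize               # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer

X_LINES = 100  # fragment segmentation interval x (100 lines, 200 lines, etc., per the embodiment)

def split_into_fragments(source, x=X_LINES):
    """Cut a source file into fragments of x lines each."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + x]) for i in range(0, len(lines), x)]

def clean(fragment):
    """Illustrative data cleaning: strip comments and collapse whitespace."""
    fragment = re.sub(r"//.*|/\*.*?\*/", " ", fragment, flags=re.S)
    return re.sub(r"\s+", " ", fragment).strip()

def preprocess(fragments, vec=None):
    """Tokenize with NLTK, vectorize, and pad missing values with 0.
    Pass the vectorizer fitted on the repository when preprocessing the code to be
    detected, so that repository and detection matrices share the same dimensions."""
    tokenized = [" ".join(word_tokenize(clean(f))) for f in fragments]
    if vec is None:
        vec = TfidfVectorizer(token_pattern=r"\S+")
        matrix = vec.fit_transform(tokenized)
    else:
        matrix = vec.transform(tokenized)
    return np.nan_to_num(matrix.toarray(), nan=0.0), vec
```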
S12, to avoid the complexity and time cost of traversing the full source code repository with similarity calculations, this embodiment performs cluster analysis on the source code fragment matrix data set, and outputs n source code class labels, the n classes of label features, and the source code fragment matrices corresponding to the n labels, which together form the source code repository; n is an integer greater than 1.
The cluster analysis uses the Gaussian mixture model clustering algorithm with EM estimation (GMM-EM). A GMM is a linear combination of several Gaussian distribution functions; in theory a GMM can fit any type of distribution, so it is usually used when the data in one set contain several different distributions. The GMM parameters are estimated by the EM algorithm, which typically alternates two steps: the first step finds rough values of the estimated parameters, and the second step maximizes the likelihood function using the values from the first step.
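As an illustration only, the clustering of step S12 could be sketched with scikit-learn's GaussianMixture, which performs the EM estimation internally; treating the mixture means as the "label features" and the choice of covariance type are assumptions made for this sketch.

```python
# Sketch of S12: GMM-EM cluster analysis over the source code fragment matrices (assumption: scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

def build_repository(fragment_matrix, n_classes):
    """Cluster the fragment vectors into n classes; return labels and per-class label features."""
    gmm = GaussianMixture(n_components=n_classes, covariance_type="diag", random_state=0)
    class_labels = gmm.fit_predict(fragment_matrix)   # EM estimation happens inside fit
    label_features = gmm.means_                       # assumption: the class means serve as label features
    return class_labels, label_features, gmm
```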
S2, inputting a code A to be detected;
S3, fragment segmentation is carried out on the code A to be detected. Specifically, assuming the length of the code to be detected is s lines, the code is segmented at intervals of x lines to obtain a set of s/x segmentation fragments.
S4, data cleaning and data preprocessing are carried out on the set of s/x segmentation fragments to obtain s/x segmentation matrices. The specific processes of data cleaning and data preprocessing are the same as those applied to the source codes.
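Reusing the illustrative helpers from the sketch of step S11, steps S3 and S4 could look as follows; the input path and the variable repo_vec (the vectorizer fitted when the repository was preprocessed) are assumptions of the sketch.

```python
# Sketch of S3-S4: segment and preprocess the code A to be detected.
code_a = open("code_to_detect.java").read()              # illustrative input path
fragments_a = split_into_fragments(code_a, X_LINES)      # the s/x segmentation fragments
seg_matrices, _ = preprocess(fragments_a, vec=repo_vec)  # reuse the vectorizer fitted on the repository
s_lines = len(code_a.splitlines())
weight = X_LINES / s_lines                               # the x/s weight used later in S6
```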
S5, each segmentation matrix is compared against the class labels of the n classes of source codes, and source code class matching is carried out according to the n classes of label features to obtain the target class label t to which the segmentation matrix belongs. Specifically, following the voting method of machine learning, the source code class with the largest attribution weight is selected as the target class label of the segmentation matrix. The specific procedure of the machine-learning voting method may refer to the prior art and is not described here.
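One possible reading of this voting step, continuing the earlier sketch, is to take the GMM posterior probability of each class as its attribution weight and vote for the largest one; this interpretation is an assumption of the sketch, not a definition given by the embodiment.

```python
# Sketch of S5: source code class matching by voting (assumption: GMM posterior probabilities as class weights).
import numpy as np

def match_target_class(seg_row, gmm):
    """Return the target class label t with the largest attribution weight."""
    weights = gmm.predict_proba(seg_row.reshape(1, -1))[0]   # attribution weight of each class
    return int(np.argmax(weights))                           # voting: the largest weight wins

target_labels = [match_target_class(row, gmm) for row in seg_matrices]
```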
S6, all source code fragment matrices under the target class label t corresponding to each segmentation matrix are traversed to calculate cosine similarity.
Specifically, the cosine similarity calculation proceeds as follows: each cosine similarity value obtained for a segmentation matrix is given the weight x/s, the similarity score between the code to be detected and each source code is computed as the weighted average, the similarity scores are sorted in descending order, and the top-5 source code fragments (i.e., the first five in the ranking) are retained to form the filtered source code detection data set D.
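A minimal sketch of step S6 follows, under the same assumptions as the earlier sketches; the flat dictionary of per-fragment scores and the fragment identifiers are an illustrative data layout, not the embodiment's storage format.

```python
# Sketch of S6: x/s-weighted cosine similarity and top-5 filtering.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def top5_candidates(seg_matrices, target_labels, repo_matrix, repo_labels, repo_ids, weight):
    """Score every repository fragment under the matched class and keep the five best."""
    scores = {}
    for seg_row, t in zip(seg_matrices, target_labels):
        mask = repo_labels == t
        for frag_vec, frag_id in zip(repo_matrix[mask], np.asarray(repo_ids)[mask]):
            # each segmentation matrix contributes its cosine similarity, weighted by x/s
            scores[frag_id] = scores.get(frag_id, 0.0) + weight * cosine(seg_row, frag_vec)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [frag_id for frag_id, _ in ranked[:5]]            # data set D: the top-5 fragments
```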
S7, the code A to be detected and the source code detection data set D output in step S6 are input into a pre-trained LSTM-DSSM network model, which outputs the source code fragment with the highest matching degree and traces the source code position.
Specifically, the LSTM-DSSM network model is an improvement on the DSSM algorithm. DSSM is a method commonly used in recommendation or retrieval systems to calculate text similarity; it has a three-layer structure from bottom to top: an input layer, a representation layer and a matching layer. Its principle is to take the two sentences whose similarity is to be calculated, map each into a space vector, convert it into a low-dimensional semantic vector with a DNN, and calculate the distance between the two semantic vectors with the cosine distance. The input layer converts the input text into a vector format that can be fed into a deep network; English scenarios are generally handled with word hashing, mainly letter-based n-grams, chiefly to reduce the dimensionality of the input vectors, while Chinese scenarios usually require word segmentation or a pre-trained Chinese BERT model. The representation layer is usually a complex deep learning network such as a CNN or RNN. The two representation branches finally enter the matching layer, where the similarity is calculated with cosine or a similar measure and the distance is output. The specific logical structure of the DSSM model may refer to the prior art and is not described here.
The DSSM algorithm model uses the bag-of-words (BOW) model in its representation layer. BOW is a simplifying assumption in natural language processing and information retrieval: text or paragraphs are treated as unordered collections of words, ignoring grammar and even word order, and thereby losing word order information and context information. The LSTM-DSSM network model of this embodiment therefore uses an LSTM to encode the text into a vector. The LSTM requires the text to be preprocessed: the sentence is segmented into words, and an embedding is obtained from a pre-trained BERT model mapping. An embedding is a way of converting discrete tokens into continuous vector representations; in a neural network, an embedding can both reduce the spatial dimensionality of discrete variables and represent them as vectors. The whole sentence is input into the LSTM, the LSTM output is trained, and the LSTM latent semantic vector is obtained; the latent semantic vector output by the LSTM is input into the DSSM model, and the subsequent logic is consistent with the DSSM model. The LSTM solves the problem of long-range context features and word order information.
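For illustration, the encoding and matching idea just described could be sketched in PyTorch as follows; the use of the Hugging Face transformers library, the bert-base-uncased checkpoint, the frozen BERT embeddings and the hidden size are assumptions of the sketch, not specifics of the embodiment.

```python
# Sketch of the LSTM-DSSM idea (assumptions: PyTorch, Hugging Face BERT for the token embedding layer).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class LstmDssmEncoder(nn.Module):
    """Encode a code snippet into an LSTM latent semantic vector."""
    def __init__(self, hidden_size=128, bert_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size, batch_first=True)

    def forward(self, text):
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():                  # token embedding layer (kept frozen here for brevity)
            emb = self.bert(**tokens).last_hidden_state
        _, (h_n, _) = self.lstm(emb)           # latent semantic vector = final hidden state
        return h_n[-1]                         # shape: (1, hidden_size)

def dssm_similarity(encoder, code_a, fragment):
    """Cosine similarity between the two latent vectors, as in the DSSM matching layer."""
    return F.cosine_similarity(encoder(code_a), encoder(fragment)).item()
```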
As shown in fig. 2 and 3, the processing procedure of the LSTM-DSSM network model of the present embodiment includes:
S71, word segmentation is carried out on the input code based on the BERT model, and the LSTM input is obtained through the token embedding layer conversion; the input codes comprise the code A to be detected and the source code detection data set D, where the source code detection data set D comprises source code fragments D1, D2, D3, D4 and D5;
S72, the LSTM input is passed through the LSTM model to output latent semantic vectors; specifically, the latent semantic vectors corresponding to the code A to be detected and the source code fragments D1, D2, D3, D4 and D5 are A*, D1*, D2*, D3*, D4* and D5*, respectively;
S73, the latent semantic vector corresponding to each source code fragment and the latent semantic vector corresponding to the code to be detected are input into a DSSM model, and the similarity score between the source code fragment and the code to be detected is calculated to obtain the target source code with the highest similarity score.
The number of DSSM models in this embodiment is five, namely DSSM1, DSSM2, DSSM3, DSSM4 and DSSM5; the output of each DSSM model is connected to the fully connected layer FC, and the Softmax layer follows the fully connected layer FC so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-5 scores, namely P(D1|A), P(D2|A), P(D3|A), P(D4|A) and P(D5|A).
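Continuing the PyTorch sketch above, the five DSSM branches, the fully connected layer FC and the Softmax output could be assembled roughly as follows; sharing a single encoder across the five branches and the dimensions of FC are assumptions of the sketch.

```python
# Sketch of S71-S73: five DSSM branches feeding the fully connected layer FC and a Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmDssmTop5(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder          # shared LSTM encoder from the previous sketch
        self.fc = nn.Linear(5, 5)       # fully connected layer FC over the five branch scores

    def forward(self, code_a, fragments):
        a_vec = self.encoder(code_a)                                            # A*
        d_vecs = [self.encoder(d) for d in fragments]                           # D1*..D5*
        cos = torch.stack([F.cosine_similarity(a_vec, d)[0] for d in d_vecs])   # DSSM1..DSSM5
        return torch.softmax(self.fc(cos), dim=0)                               # P(D1|A)..P(D5|A)

# Usage: probs = LstmDssmTop5(LstmDssmEncoder())(code_a, [d1, d2, d3, d4, d5]); best = int(probs.argmax())
```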
Based on the above code clone detection method, as shown in fig. 4, the code clone detection system of this embodiment comprises an acquisition module, a data processing module, a cluster analysis module, a matching module, a calculation and sorting module, and a detection module. Specifically, the acquisition module of this embodiment is used for acquiring the source code data set or the code to be detected; the data processing module is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected; the cluster analysis module is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes; the matching module is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix; the calculation and sorting module is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-5 scores; the detection module is used for inputting the source code fragments corresponding to the top-5 scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
The specific processing procedure of each functional module may refer to the specific description in the code clone detection method, which is not repeated herein.
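Purely as an illustration of how these modules fit together, the earlier sketches can be chained end to end as follows; source_files, the input path and the fragment identifiers are assumptions of the sketch.

```python
# End-to-end sketch tying the illustrative helpers from the earlier sketches together.
repo_fragments = [frag for path in source_files                    # source_files: assumed list of file paths
                  for frag in split_into_fragments(open(path).read())]
repo_matrix, repo_vec = preprocess(repo_fragments)
repo_labels, label_features, gmm = build_repository(repo_matrix, n_classes=4)

code_a = open("code_to_detect.java").read()                         # illustrative input path
seg_matrices, _ = preprocess(split_into_fragments(code_a), vec=repo_vec)
weight = X_LINES / len(code_a.splitlines())                         # the x/s weight
targets = [match_target_class(row, gmm) for row in seg_matrices]
top5_ids = top5_candidates(seg_matrices, targets, repo_matrix, repo_labels,
                           list(range(len(repo_fragments))), weight)

model = LstmDssmTop5(LstmDssmEncoder())
probs = model(code_a, [repo_fragments[i] for i in top5_ids])
print("most similar fragment:", top5_ids[int(probs.argmax())])
```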
The computer readable storage medium of this embodiment stores instructions which, when run on a computer, cause the computer to execute the above code clone detection method, thereby realizing intelligent detection of codes.
Example 2:
The code clone detection method of the present embodiment differs from that of embodiment 1 in that:
the retained source code fragments are not limited to the top 5 defined in embodiment 1; they may also be the top 3, top 4, top 6, etc., determined according to the actual application requirements;
the remaining steps are the same as in embodiment 1.
The code clone detection system of this embodiment is adapted accordingly to the code clone detection method of this embodiment;
the computer readable storage medium of this embodiment stores instructions which, when run on a computer, cause the computer to execute the code clone detection method of this embodiment, thereby realizing intelligent detection of codes.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.

Claims (10)

1. A code clone detection method, characterized by comprising the following steps:
S1, acquiring a source code data set, and sequentially performing fragment segmentation and data preprocessing on the source codes of the source code data set to obtain a source code fragment matrix data set; performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set, and outputting class labels and label features of n classes of source codes; n is an integer greater than 1;
S2, sequentially performing fragment segmentation and data preprocessing on the code to be detected to obtain s/x segmentation matrices; wherein s represents the length of the code to be detected, and x represents the fragment segmentation length;
S3, matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; wherein top-N denotes the first N entries in the descending order;
S5, inputting the source code fragments corresponding to the top-N scores and the code to be detected into an LSTM-DSSM network model, calculating the similarity scores between those source code fragments and the code to be detected, and outputting the source code fragment with the highest similarity.
2. The code clone detection method according to claim 1, wherein in step S5, the processing procedure of the LSTM-DSSM network model comprises:
S51, performing word segmentation on the input code based on a BERT model, and converting it through a token embedding layer to obtain the LSTM input;
S52, passing the LSTM input through an LSTM model to output a latent semantic vector;
S53, inputting the latent semantic vector corresponding to the source code fragment and the latent semantic vector corresponding to the code to be detected into a DSSM model, and calculating the similarity score between the source code fragment and the code to be detected.
3. The code clone detection method according to claim 2, wherein there are N DSSM models, the output of each DSSM model is connected to a fully connected layer, and a Softmax layer follows the fully connected layer so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-N scores.
4. The code clone detection method according to claim 1, wherein the data preprocessing comprises data cleaning, text word segmentation and vectorization, and matrices with missing values after vectorization are padded with 0.
5. The code clone detection method according to claim 4, wherein the text word segmentation uses the word segmentation library NLTK for English code word segmentation.
6. The code clone detection method according to claim 1, wherein in step S3, the source code class with the largest attribution weight is selected, according to a voting method, as the target source code class label corresponding to the segmentation matrix.
7. The code clone detection method according to claim 1, wherein in the step S1, a GMM clustering algorithm is adopted for cluster analysis, and GMM parameters are estimated by an EM algorithm.
8. The code clone detection method according to claim 1, wherein the value of N is 3 to 6.
9. A code clone detection system applying the code clone detection method according to any one of claims 1 to 8, characterized in that the code clone detection system comprises:
the acquisition module, which is used for acquiring the source code data set or the code to be detected;
the data processing module, which is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected;
the cluster analysis module, which is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes;
the matching module, which is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
the calculation and sorting module, which is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores;
the detection module, which is used for inputting the source code fragments corresponding to the top-N scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
10. A computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the code clone detection method according to any one of claims 1-8.
CN202310457759.1A 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium Active CN116166321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310457759.1A CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310457759.1A CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116166321A CN116166321A (en) 2023-05-26
CN116166321B (en) 2023-06-27

Family

ID=86416785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310457759.1A Active CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116166321B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297750A (en) * 2018-03-22 2019-10-01 北京京东尚科信息技术有限公司 The method and apparatus of program similitude detection
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN110825877A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Semantic similarity analysis method based on text clustering
CN111290784A (en) * 2020-01-21 2020-06-16 北京航空航天大学 Program source code similarity detection method suitable for large-scale samples
CN112257068A (en) * 2020-11-17 2021-01-22 南方电网科学研究院有限责任公司 Program similarity detection method and device, electronic equipment and storage medium
CN112306494A (en) * 2020-12-03 2021-02-02 南京航空航天大学 Code classification and clustering method based on convolution and cyclic neural network
CN112698861A (en) * 2021-03-25 2021-04-23 深圳开源互联网安全技术有限公司 Source code clone identification method and system
WO2023028721A1 (en) * 2021-08-28 2023-03-09 Huawei Technologies Co.,Ltd. Systems and methods for detection of code clones
CN114153496A (en) * 2021-09-08 2022-03-08 北京天德科技有限公司 Block chain-based high-speed parallelizable code similarity comparison method and system
CN114925702A (en) * 2022-06-13 2022-08-19 深圳市北科瑞声科技股份有限公司 Text similarity recognition method and device, electronic equipment and storage medium
CN115562721A (en) * 2022-10-28 2023-01-03 南开大学 Clone code detection method and system for mining features from assembly language

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research progress on code clone detection; Chen Qiuyuan; Li Shanping; Yan Meng; Xia Xin; Journal of Software (No. 04); full text *
A code provenance analysis method based on code clone detection; Li Suo; Wu Yijian; Zhao Wenyun; Computer Applications and Software (No. 02); full text *
Analysis of string-based code clone detection methods; An Diwen; Tang Yanbin; Computer Knowledge and Technology (No. 31); full text *
A survey of management-oriented clone code research; Su Xiaohong; Zhang Fanlong; Chinese Journal of Computers (No. 03); full text *

Also Published As

Publication number Publication date
CN116166321A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN111325028B (en) Intelligent semantic matching method and device based on deep hierarchical coding
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN112232053B (en) Text similarity computing system, method and storage medium based on multi-keyword pair matching
CN110796160A (en) Text classification method, device and storage medium
CN110222184A (en) A kind of emotion information recognition methods of text and relevant apparatus
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN111858878A (en) Method, system and storage medium for automatically extracting answer from natural language text
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113157919A (en) Sentence text aspect level emotion classification method and system
CN114492661B (en) Text data classification method and device, computer equipment and storage medium
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115169429A (en) Lightweight aspect-level text emotion analysis method
CN114492459A (en) Comment emotion analysis method and system based on convolution of knowledge graph and interaction graph
CN113779966A (en) Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention
CN117113094A (en) Semantic progressive fusion-based long text similarity calculation method and device
CN116680407A (en) Knowledge graph construction method and device
CN116166321B (en) Code clone detection method, system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant