CN112215013B - Clone code semantic detection method based on deep learning - Google Patents

Clone code semantic detection method based on deep learning

Info

Publication number
CN112215013B
CN112215013B (application CN202011205774.XA)
Authority
CN
China
Prior art keywords
tpe
code
clone
token
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011205774.XA
Other languages
Chinese (zh)
Other versions
CN112215013A (en
Inventor
Cheng Xiaoyun
Wang Jianrong
Wang Zan
Jia Yongzhe
Ma Guoning
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Thai Technology Co ltd
Tianjin University
Original Assignee
Tianjin University
Tianjin Thai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University, Tianjin Thai Technology Co ltd
Priority to CN202011205774.XA priority Critical patent/CN112215013B/en
Publication of CN112215013A publication Critical patent/CN112215013A/en
Application granted granted Critical
Publication of CN112215013B publication Critical patent/CN112215013B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic clone detection method based on deep learning. For a given pair of code blocks, each block is first preprocessed into a sequence of TPE basic units; word embedding is then applied to these units, and the embedded sequences are fed into a BiLSTM module combined with contextual features. Next, a self-attention mechanism concentrates the useful clone-related information learned by the neural network. Each code fragment is converted into a vector representation, the Euclidean distance between the vectors is computed as the classification feature, and the pair is classified into one of two classes, i.e. predicted as clone or non-clone: if two code blocks are similar, the vectors they produce through the neural network should also be similar. Compared with the prior art, the invention saves more time while capturing rich syntactic and semantic information; TPE also avoids the out-of-vocabulary (OOV) problem.

Description

Clone code semantic detection method based on deep learning
Technical Field
The invention relates to the fields of program analysis and machine learning, and in particular to a method for representing source code and detecting semantic clones.
Background
Clone code is duplicated or similar code. Code clones are commonly classified into four types: 1) type-one clones are two code segments that are identical except for differences in whitespace, formatting, and comments; 2) type-two clones are two code segments that are completely identical except for differences in identifiers, constants, and variable types; 3) type-three clones are two code segments in which the copied segment has been modified by a few statements, such as changing, adding, or deleting a few statements; 4) type-four clones are mainly associated with functional similarity, whereas the first three types are primarily related to textual similarity. The presence of code clones unnecessarily increases program size; changes to a code segment also require modifications to its clones, increasing maintenance effort; and copying code segments that contain errors leads to error propagation. Detecting code clones therefore helps to reduce software maintenance costs and prevent errors.
Among the various methods of detecting code clones, semantic clones are rarely detected because they are the most difficult to detect: they include clones that differ in syntax but still perform the same function. It is therefore necessary to propose a method for efficiently detecting semantic code clones.
One key issue in semantic clone detection is how to effectively learn a representation of the source code that captures its semantics. Tokens and abstract syntax trees (ASTs) are commonly used to detect code clones. However, Tokens cannot learn the semantic information contained in the code structure well, which is insufficient for the semantic clone detection task. Recent semantic clone detection efforts that use ASTs to represent code together with syntactic information have proven effective, but they are less efficient, because the AST of code is typically more complex than the parse tree of text. Code clone detection must consider not only accuracy but also efficiency.
In the code clone detection task, it is very common to use Tokens as the basic segmentation method for representing program source code. The normalized Token vocabulary is so small (typically no more than 300 distinct normalized tokens) that the learned vocabulary (external knowledge obtained by pre-training) has limited capacity, making external pre-training ineffective for neural models. The semantic clone detection task therefore typically employs unnormalized tokens, but Tokens may still fail to capture rich semantic information, especially when programs use meaningless variable names.
To extract more information from pre-training, a straightforward approach is to enlarge the input vocabulary. Sentence-level segmentation might be a natural choice, but due to the diversity of sentences its vocabulary would be infinite; it is not possible to train a vocabulary containing all possible sentence representations. Input statements may therefore be encountered for which no vector representation exists in the vocabulary, which is known as the OOV (out-of-vocabulary) problem. The OOV problem severely limits the effectiveness of code representation.
Disclosure of Invention
The invention aims to solve the problem of detecting semantic code clones in programs, and provides a semantic clone detection method based on deep learning.
The invention relates to a clone code semantic detection method based on deep learning, which specifically comprises the following processes:
step 1, determining the basic unit of the TPE (Token Pair Encoding) code representation for the semantic clone detection task, wherein the TPE generation process is as follows: first, each piece of code in the input corpus is cut into a Token sequence and the resulting Tokens initialize the vocabulary vocab; all adjacent Token pairs (binary groups) appearing in the current corpus are then counted, sorted, and marked; the Token pair with the highest combination frequency in the corpus identifies a new basic unit, which is added to the vocabulary, and the corpus is regenerated with the merged pair treated as a single new Token; this process is carried out iteratively, continually searching for higher-frequency Token combinations and updating the vocabulary; after the final vocabulary is obtained, code sentences are divided into TPE units according to the vocabulary using a backward maximum matching method;
obtaining TPE basic units for different programming languages by applying the TPE algorithm to the selected corpus;
step 2, establishing a neural network model suited to code clone detection: the TPE basic units obtained in step 1 are pre-trained with a Skip-Gram model to generate a vocabulary in the corresponding TPE-unit-to-word-vector format; the discrete sequences over this vocabulary are converted into continuous vector representations, and a standard BiLSTM model is implemented and trained; the vector representations of the TPE basic units learned by the BiLSTM model are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method; the specific formula is as follows:
$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t represents the importance of the sequence node at position t; \bar{h}_t denotes the hidden-layer output of the BiLSTM; a_t is the attention weight of the element at position t over the entire sequence; h_{CODE} is the final code vector representation; j is the summation index and t the current position;
step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, specifically: the two vectors converted from two code blocks are given as input, the vector representation of each code block being obtained after Skip-Gram pre-training and encoding; the difference between the two output vectors, computed as the Euclidean distance between them, serves as the classification feature; vector pairs whose Euclidean distance is small are clone pairs, yielding the final clone/non-clone prediction.
Compared with the prior art, the invention has the following beneficial effects:
(1) the new source-code representation basic unit, TPE, saves more time than AST-based representations while capturing rich syntactic and semantic information;
(2) TPE also avoids the out-of-vocabulary (OOV) problem.
Drawings
FIG. 1 is an overall flow chart of a semantic clone detection method based on deep learning according to the present invention;
FIG. 2 is an exemplary diagram of TPE units generated using the TPE algorithm;
FIG. 3 is a structural diagram of the BiLSTM network.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
FIG. 1 shows the overall flowchart of the semantic clone detection method based on deep learning of the present invention. The process specifically comprises the following steps:
Step 1, determining the TPE basic units for code representation in the clone code detection task: latent syntactic/semantic information is learned from a large-scale raw code corpus through pre-trained embeddings, and an embedding matrix with stronger expressive capacity is pre-trained based on the vocabulary generated by TPE;
the TPE basic unit generation process is as follows: firstly, each code in an input corpus is cut into a Token sequence, the obtained Token is used for initializing a vocabulary vocab, the combination of all binary units Token appearing in the corpus is carried out, then all Token binary units in the current corpus are counted, the binary units Token are sorted and marked, then the Token with the highest iteration combination frequency in the corpus is used for identifying a new basic unit, the newly obtained unit is added into the vocabulary, the newly generated corpus is composed of the newly added binary units Token, the TPE regards the binary units Token as the new Token, the process is carried out iteratively, and the vocabulary is updated by continuously iteratively searching for the Token combination with higher frequency. After obtaining a final vocabulary table, dividing the code sentence into TPE units according to the obtained vocabulary table by utilizing a backward maximum matching method (for example, one code sentence is ABC, two pointers are arranged to respectively point to a first character A and a last character C of the sentence, firstly, whether the basic unit of ABC exists in the vocabulary table is searched, if so, the whole program sentence ABC is regarded as a TPE unit, if not, the first pointer is moved backwards, whether BC is in the vocabulary table is searched, if not, the process is continued, and finally, the whole code sentence is expressed as a TPE unit in the vocabulary table;
as shown in fig. 2, an exemplary diagram of generating TPE units using TPEs is used to demonstrate two iterations of the TPE algorithm. The dashed arrow labeled (iv) is the process of statistics of Token in the segmented corpus. The dashed arrows labeled c represent the frequency of occurrence of each marker after statistical data is obtained, and then the vocabulary is updated to incorporate the newly identified Token combinations. As shown. The dashed arrow labeled as (c) indicates that the last step of the iteration is to update the corpus with the newly obtained dyad Token units. V0A vocabulary obtained from the original code snippet is shown.
TPE (Token Pair Encoding) is an innovative segmentation method that constructs a new code representation using Tokens as basic components. TPE units carry rich code information, which helps to better exploit the advantages of deep-learning-based clone detection. At the same time, TPE avoids the out-of-vocabulary (OOV) problem. (A sketch of the backward maximum matching segmentation follows.)
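A matching sketch of the backward maximum matching segmentation (the ABC example above), assuming, as in the merge sketch, that vocabulary entries are concatenated Token strings; all names are illustrative.

```python
def backward_max_match(tokens, vocab):
    """Segment a Token sequence into TPE units, preferring the longest
    vocabulary match that ends at the current right boundary."""
    units, end = [], len(tokens)
    while end > 0:
        start = 0
        # Try the longest candidate tokens[start:end] first; if it is not in
        # the vocabulary, move the left pointer backward (toward the end).
        while start < end - 1 and "".join(tokens[start:end]) not in vocab:
            start += 1
        # A single Token always falls through, since base Tokens are in vocab.
        units.append("".join(tokens[start:end]))
        end = start
    units.reverse()  # matching ran right-to-left; restore left-to-right order
    return units
```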
Step 2, establishing a neural network model suitable for code clone detection, and training:
before training, pre-training by adopting a Skip-Gram model: firstly, obtaining TPE basic units of different languages by using a TPE algorithm according to a selected corpus; then, the TPE units are pre-trained by using a Skip-Gram model to obtain a vocabulary of the TPE units in a word vector representation format.
The principle of the Skip-Gram model is to predict the context given a central word, adjusting the central word's vector according to the predictions of the surrounding words. In Skip-Gram, each word is influenced by its surrounding words, and each word is predicted and adjusted when it serves as the central word; this repeated adjustment and prediction makes the learned word vectors more accurate. A Skip-Gram model is therefore selected as the training model for the TPE unit vectors, and the fastText tool is used to train it and generate the vocabulary of code basic units and their vectors. (Settings: learning rate lr = 0.025; word-vector dimension dim = 100; context window size ws = 5, the default; epoch and minimum word frequency minCount at their default value of 5; number of negative samples neg = 5, the default; loss function loss = ns; number of subword buckets bucket = 2,000,000; maximum and minimum character n-gram lengths maxn = 6 and minn = 3, the defaults; number of threads thread = 4. fastText finally generates two files, with suffixes .vec and .bin; the .bin file is a binary containing the model parameters and all hyper-parameters.)
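For illustration, the same training step expressed through the fastText Python API; the corpus file name and output path are assumptions, and the parameters mirror the settings listed above.

```python
import fasttext

# One TPE-segmented code method per line in corpus.txt (assumed file name).
model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",
    lr=0.025, dim=100, ws=5, epoch=5,
    minCount=5, neg=5, loss="ns",
    bucket=2000000, minn=3, maxn=6, thread=4,
)
model.save_model("tpe_units.bin")        # binary model with all hyper-parameters
vec = model.get_word_vector("someUnit")  # 100-dimensional vector for a TPE unit
```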
TPE basic units for different languages are learned from the selected initial corpus following the idea of the BPE algorithm, and the TPE units are pre-trained with the Skip-Gram model to obtain the TPE-unit-to-word-vector vocabulary. Combined with this vocabulary, code sequences represented as TPE units are converted into the corresponding vector representations.
Preprocessing is performed: blank lines and comments are deleted from the source code, and the source code is represented as a sequence of TPE units combined according to the backward maximum matching algorithm. The preprocessed source code is looked up in the pre-trained vocabulary to obtain the specific vector representation of each TPE unit. After this preprocessing step, the source code has been converted into a sequence of TPE units. This series of discrete units must be converted into a continuous vector representation, so a standard BiLSTM model is implemented and trained, enhancing the TPE-embedding clone detection capability. The LSTM is a widely used network for encoding the sequential input of TPE units; the BiLSTM is a bidirectional extension of the LSTM that adds a right-to-left pass. The hyper-parameters of the model were determined through preliminary experiments: the hidden-layer dimension is set to 100; dropout is applied on the input embedding layer and on the BiLSTM hidden layer with a proportion of 0.33; parameters are optimized with the Adam algorithm with an initial learning rate of 5×10⁻⁴, a gradient clipping threshold of 5, and a minimum batch size of 32. The bidirectional long short-term memory network learns the semantic information of the element at each position in the sequence while recording both forward and reverse sequence information in the learned vectors. On this basis, the hidden-layer vectors generated by the bidirectional network are condensed through a global pooling layer, and self-attention pooling is used to achieve this goal. Using a layer of attention-based neural network, each Java method is converted by weighted summation into a vector that can be compared with others. (A PyTorch sketch of this encoder is given below.)
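A minimal PyTorch sketch of the encoder just described; the class name and argument names are illustrative, and the exact layer arrangement beyond the stated hyper-parameters is an assumption.

```python
import torch
import torch.nn as nn

class TPEEncoder(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=100, dropout=0.33):
        super().__init__()
        # embedding_matrix: (vocab_size, 100) tensor from the fastText pre-training.
        self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.drop = nn.Dropout(dropout)  # dropout on the input embedding layer
        self.bilstm = nn.LSTM(embedding_matrix.size(1), hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, unit_ids):         # unit_ids: (batch, seq_len) TPE unit indices
        x = self.drop(self.embed(unit_ids))
        h, _ = self.bilstm(x)            # h: (batch, seq_len, 2 * hidden_size)
        return h                         # hidden states for the attention pooling
```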
Using the self-attention mechanism, the vector representations of the TPE units learned by the BiLSTM are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method. The specific calculation formula is as follows:

$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t, expressed numerically, represents the importance of \bar{h}_t, i.e. the weight of the sequence node vector at each position within the whole sentence; \bar{h}_t denotes the hidden-layer output of the BiLSTM; the attention weight a_t of the position-t element over the entire sequence is computed from \bar{h}_t, normalized, and continually updated as the model trains; h_{CODE} is the final code vector representation; j is the summation index and t the current position.
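The pooling formula above, transcribed directly as a PyTorch module; realizing the learned vector v as a bias-free linear layer is an implementation assumption.

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    def __init__(self, dim):                    # dim = 2 * hidden_size for a BiLSTM
        super().__init__()
        self.v = nn.Linear(dim, 1, bias=False)  # the parameter vector v

    def forward(self, h):                       # h: (batch, seq_len, dim)
        scores = self.v(h).squeeze(-1)          # s_t = v^T h_t
        a = torch.softmax(scores, dim=1)        # a_t = exp(s_t) / sum_j exp(s_j)
        h_code = torch.sum(a.unsqueeze(-1) * h, dim=1)  # h_CODE = sum_t a_t h_t
        return h_code                           # (batch, dim): one vector per method
```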
the bidirectional long and short memory neural network learns all information of the forward sequence and the reverse sequence, can capture the semantic meaning and the time sequence information of the sequence more effectively, and is more sufficient and accurate in prediction compared with the common long and short memory neural network.
FIG. 3 shows the structure of the BiLSTM network. The bidirectional long short-term memory network (BiLSTM) is composed of two ordinary long short-term memory networks (LSTMs): the forward LSTM uses information from past time steps, and the reverse LSTM uses information from future time steps. The BiLSTM recursively computes the hidden output vectors as follows:
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where x_1, x_2, …, x_n are the inputs, h_1, h_2, …, h_n are the outputs, and the other variables such as W and b are model parameters. The right-to-left direction simply performs the same computation in the opposite order.
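For concreteness, the six gate equations transcribed as a single LSTM step in NumPy; the weight shapes and the concatenation [h_{t-1}, x_t] follow the equations above, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate cell state
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # elementwise product is the ⊙ above
    h_t = o_t * np.tanh(c_t)             # hidden output
    return h_t, c_t
```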
Step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, which specifically comprises: comparing the code representations of the input fragment pair, computing the difference between the representations as the classification feature, and finally making the clone/non-clone prediction from that feature. The code clone detection problem is formalized as a supervised binary classification task: the two vectors converted from two code blocks are given as input, the difference between the two output vectors is computed using the Euclidean distance between them as the classification feature, and vector pairs with a small Euclidean distance are clone pairs. If two blocks form a clone pair, their label is set to 1; if not, their label is set to 0. The probability that a given input pair is a clone or a non-clone is ultimately obtained.
The main idea of the Siamese network is to map the inputs into a target space through a function and to compare similarity in that space using a simple distance. Code clone detection is thereby converted into a binary classification problem: a simple and effective BiLSTM model is proposed, and discrete code fragments are converted into low-dimensional continuous vector representations through the BiLSTM. A Siamese architecture is designed in which the two BiLSTM sub-networks have the same structure and share weights (see the sketch below).
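A sketch of the Siamese arrangement described above: both code blocks pass through the same encoder with shared weights, and the Euclidean distance between the two vectors is the classification feature. Mapping the distance to a clone probability through a small logistic layer is an assumption; the patent specifies only that the distance serves as the feature.

```python
import torch
import torch.nn as nn

class SiameseCloneDetector(nn.Module):
    def __init__(self, encoder, pooling):
        super().__init__()
        self.encoder = encoder     # one BiLSTM encoder shared by both branches
        self.pooling = pooling     # the self-attention pooling defined earlier
        self.out = nn.Linear(1, 1) # assumed logistic head over the distance

    def encode(self, unit_ids):
        return self.pooling(self.encoder(unit_ids))  # one vector per code block

    def forward(self, ids_a, ids_b):
        va, vb = self.encode(ids_a), self.encode(ids_b)
        dist = torch.norm(va - vb, p=2, dim=1, keepdim=True)  # Euclidean distance
        return torch.sigmoid(self.out(dist))  # probability that the pair is a clone

# Training sketch with the Adam settings given above (label 1 = clone, 0 = non-clone):
# model = SiameseCloneDetector(TPEEncoder(emb), SelfAttentionPooling(200))
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# loss = nn.BCELoss()(model(batch_a, batch_b).squeeze(1), labels)
```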
The invention adopts the Siamese framework and a standard BiLSTM classification model for code clone detection, improving efficiency while ensuring the effectiveness of clone detection. Rich syntactic and semantic information is obtained through the TPE units, enabling effective semantic clone detection.
The embodiments of the present invention are described below:
step 1, data selection and pretreatment:
1-1, preparing a data set, the data set comprising:
(1) BigCloneBench is one of the widely used evaluation benchmarks for clone detection, and many code clone detection tools use this dataset for evaluation. The BigCloneBench dataset was created for clone code in the Java language; the old version contains 6 million labeled true clone pairs and 260,000 labeled false clone pairs, covering 10 functionalities. The new version contains over 8 million Java code pairs labeled as clones (most of which are type-three and type-four clones) and 279,032 pairs labeled as non-clones. For the BigCloneBench dataset, 20,000 function pairs are selected from each type for the experiments; for types with fewer than 20,000 pairs, all pairs are used.
(2) OJClone, another code clone evaluation benchmark, was created for C-language programs. OJClone is based on an online programming Open Judge System, often referred to as an OJ system [8]; it is constructed by selecting 104 programming problems from the OJ system and treating code fragments submitted by different people for the same problem as clone pairs. The OJClone dataset does not explicitly specify clone types, but it is generally accepted that most OJClone clone pairs are type three or type four. 500 programs were selected from each of the first 15 programming problems of OJClone. The same programming problem yields 124,750 clone pairs, and different programming problems combine into more than 28 million non-clone pairs. 50,000 function pairs were randomly selected, with a clone to non-clone ratio of 1:14.
(3) Google Code Jam (GCJ) is an international online programming competition held by Google every year. The competition comprises a series of algorithmic problems that must be solved within a specified time; participants may answer using any programming language and development environment of their choice. Each submission for the same problem is implemented by a different programmer, and Google verifies its correctness, so solutions to the same problem should be functionally similar. 1,665 Java functions covering 12 problems were selected from the 2016 competition, forming about 270,000 clone pairs and 1 million non-clone pairs; finally, 50,000 function pairs were randomly selected, with a clone to non-clone ratio of 1:4.
1-2. For the data selected in step 1-1, functions are represented by TPE basic units using the aforementioned TPE algorithm. Separate TPE vocabularies were trained for the Java and C languages.
Step 2, dividing the training and test sets: each dataset was randomly divided into three parts of 60%, 20%, and 20% for training, validation, and testing, respectively (a sketch of this split follows).
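A minimal sketch of this 60/20/20 split using scikit-learn; the variable pairs holding the labeled function pairs and the fixed random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# pairs: list of (code_a, code_b, label) tuples for one dataset (assumed structure).
train, rest = train_test_split(pairs, train_size=0.6, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)  # 20% / 20%
```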
Step 3, training with the model:
The preprocessed TPE basic unit vector representations are input into the first-layer LSTM units to obtain features capturing the influence of the preceding unit on each TPE basic unit, and into the second-layer LSTM units to obtain the influence of the following unit on it. The outputs of the first-layer and second-layer LSTMs are then spliced and combined. Through training, the output feature vector contains the contextual information of the code units as well as their sequence information.
The bidirectional long short-term memory network learns the semantic information of the element at each position in the sequence, recording forward and reverse sequence information in the learned vectors. On this basis, the hidden-layer vectors generated by the bidirectional network are condensed through a global pooling layer, and self-attention pooling is used to achieve this goal. Using a layer of attention-based neural network, each Java method is converted by weighted summation into a vector that can be compared with others. At this point, the original plain code text has been converted into a numeric vector, and the difference between two such vectors is then calculated using the Euclidean distance formula.
The present invention was compared with two state-of-the-art clone detection methods, TBCCD and ASTNN, in terms of precision (P), recall (R), F1 score (F1), data processing time (data-time), and test time (test-time).
Table 1 shows the results of testing the prior-art ASTNN and TBCCD models and the TPE model on the BigCloneBench dataset.
TABLE 1
[Table 1 is rendered as an image in the original publication; it reports P, R, F1, data-time, and test-time for ASTNN, TBCCD, and the TPE model on BigCloneBench.]
It can be seen that on the BigCloneBench dataset, the model of the present invention achieves a higher F1 score than both of the other recent semantic clone detection tools. In particular, representing code by TPE basic units speeds up data processing (data-time) by nearly 2.5 times compared with the two tools that represent code by ASTs.
In terms of model detection speed, the standard BiLSTM model of the invention is nearly 3 times more efficient than the two baseline models. This is because TBCCD uses a tree-based convolution model with max-pooling, and ASTNN uses RvNN and RNN models for statement and code encoding, whereas the invention uses only a simple BiLSTM model; the BiLSTM learns all the information of the forward and reverse sequences and can strongly capture the semantics and temporal information of the sequence.
Table 2 lists the detection results of the model of the invention, ASTNN, and TBCCD on the OJClone dataset.
TABLE 2
[Table 2 is rendered as an image in the original publication; it reports the same metrics on OJClone.]
The detection effect of the invention on the OJClone dataset is still better than that of TBCCD, but the model does not perform as well as ASTNN. This is because, when representing code, ASTNN first constructs an AST for each code fragment and decomposes the entire AST into small statement trees (trees composed of the AST nodes of a statement, with the statement node as root); it then applies a recursive encoder over the multi-way statement trees to capture statement-level lexical and syntactic information and obtain statement vectors; finally, the code representation is obtained through a recurrent neural network. The captured code information is relatively comprehensive, but at the cost of considerable time.
Table 3 shows the detection results of the TPE model, the ASTNN model, and the TBCCD model on the GCJ dataset.
TABLE 3
[Table 3 is rendered as an image in the original publication; it reports the same metrics on the GCJ dataset.]
On the GCJ dataset, the model of the invention still performs better than TBCCD and achieves an effect similar to that of ASTNN, with an obvious advantage in time: the processing times of ASTNN and TBCCD are longer than that of TPE because both represent code based on ASTs.

Claims (1)

1. A semantic clone detection method based on deep learning is characterized by comprising the following specific steps:
step 1, determining the basic unit of the TPE (Token Pair Encoding) code representation for the semantic clone detection task, wherein the TPE generation process is as follows: first, each piece of code in the input corpus is cut into a Token sequence and the resulting Tokens initialize the vocabulary vocab; all adjacent Token pairs (binary groups) appearing in the current corpus are then counted, sorted, and marked; the Token pair with the highest combination frequency in the corpus identifies a new basic unit, which is added to the vocabulary, and the corpus is regenerated with the merged pair treated as a single new Token; this process is carried out iteratively, continually searching for higher-frequency Token combinations and updating the vocabulary; after the final vocabulary is obtained, code sentences are divided into TPE units according to the vocabulary using a backward maximum matching method;
obtaining TPE basic units for different programming languages by applying the TPE algorithm to the selected corpus;
step 2, establishing a neural network model suited to code clone detection: the TPE basic units obtained in step 1 are pre-trained with a Skip-Gram model to generate a vocabulary in the corresponding TPE-unit-to-word-vector format; the discrete sequences over this vocabulary are converted into continuous vector representations, and a standard BiLSTM model is implemented and trained; the vector representations of the TPE basic units learned by the BiLSTM model are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method; the specific formula is as follows:
$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t represents the importance of the sequence node at position t; \bar{h}_t denotes the hidden-layer output of the BiLSTM; a_t is the attention weight of the element at position t over the entire sequence; h_{CODE} is the final code vector representation; j is the summation index and t the current position;
step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, specifically: the two vectors converted from two code blocks are given as input, the vector representation of each code block being obtained after Skip-Gram pre-training and encoding; the difference between the two output vectors, computed as the Euclidean distance between them, serves as the classification feature; vector pairs whose Euclidean distance is small are clone pairs, yielding the final clone/non-clone prediction.
CN202011205774.XA 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning Active CN112215013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011205774.XA CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011205774.XA CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN112215013A CN112215013A (en) 2021-01-12
CN112215013B true CN112215013B (en) 2022-04-19

Family

ID=74057987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011205774.XA Active CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN112215013B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835620B (en) * 2021-02-10 2022-03-25 中国人民解放军军事科学院国防科技创新研究院 Semantic similar code online detection method based on deep learning
CN113220301A (en) * 2021-04-13 2021-08-06 广东工业大学 Clone consistency change prediction method and system based on hierarchical neural network
CN113656066B (en) * 2021-08-16 2022-08-05 南京航空航天大学 Clone code detection method based on feature alignment
CN113704108A (en) * 2021-08-27 2021-11-26 浙江树人学院(浙江树人大学) Similar code detection method and device, electronic equipment and storage medium
CN113986345B (en) * 2021-11-01 2024-05-07 天津大学 Pre-training enhanced code clone detection method
CN114780103B (en) * 2022-04-26 2022-12-20 中国人民解放军国防科技大学 Semantic code clone detection method based on graph matching network
CN115373737B (en) * 2022-07-06 2023-05-26 武汉大学 Code clone detection method based on feature fusion
CN117435246B (en) * 2023-12-14 2024-03-05 四川大学 Code clone detection method based on Markov chain model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012079230A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Intelligent code differencing using code clone detection
CN107943516A (en) * 2017-12-06 2018-04-20 南京邮电大学 Cloned codes detection method based on LLVM
JP2018136900A (en) * 2017-02-24 2018-08-30 東芝情報システム株式会社 Sentence analysis device and sentence analysis program
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111552969A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Embedded terminal software code vulnerability detection method and device based on neural network
CN111639344A (en) * 2020-07-31 2020-09-08 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101780233B1 (en) * 2016-04-26 2017-09-21 고려대학교 산학협력단 Apparatus and method for deteting code cloning of software

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012079230A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Intelligent code differencing using code clone detection
JP2018136900A (en) * 2017-02-24 2018-08-30 東芝情報システム株式会社 Sentence analysis device and sentence analysis program
CN107943516A (en) * 2017-12-06 2018-04-20 南京邮电大学 Cloned codes detection method based on LLVM
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111552969A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Embedded terminal software code vulnerability detection method and device based on neural network
CN111639344A (en) * 2020-07-31 2020-09-08 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"TECCD: A Tree Embedding Approach for Code Clone Detection";Yi Gao etc.;《2019 IEEE International Conference on Software Maintenance and Evolution》;20191205;全文 *
"函数级别结构化克隆与语义克隆的检测";杨燕鸣;《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》;20200215;全文 *

Also Published As

Publication number Publication date
CN112215013A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112215013B (en) Clone code semantic detection method based on deep learning
CN110929030B (en) Text abstract and emotion classification combined training method
CN112270379B (en) Training method of classification model, sample classification method, device and equipment
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN112800776A (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN113420296A (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN110442880B (en) Translation method, device and storage medium for machine translation
CN111651974A (en) Implicit discourse relation analysis method and system
CN113505225B (en) Small sample medical relation classification method based on multi-layer attention mechanism
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN114742069A (en) Code similarity detection method and device
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114327609A (en) Code completion method, model and tool
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN114969763A (en) Fine-grained vulnerability detection method based on seq2seq code representation learning
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210512

Address after: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Applicant after: Tianjin University

Applicant after: Tianjin Thai Technology Co.,Ltd.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Applicant before: Tianjin University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant