CN113656066A

CN113656066A - Clone code detection method based on feature alignment

Info

Publication number: CN113656066A
Application number: CN202110936377.8A
Authority: CN
Inventors: 方黎明; 张爱平
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-08-16
Filing date: 2021-08-16
Publication date: 2021-11-16
Anticipated expiration: 2041-08-16
Also published as: CN113656066B

Abstract

The invention discloses a clone code detection method based on feature alignment. First, source code is parsed into an abstract syntax tree, the abstract syntax tree is divided into a statement tree sequence, and word embedding and statement tree encoding are performed. Second, a bidirectional causal convolutional neural network extracts feature representations of the code fragments that are rich in structural and semantic information. After feature extraction, an alignment matrix representing the correspondence between the two code fragments is learned in a data-driven manner through sparse reconstruction, so that the two code fragments are aligned and their similarity is obtained. Compared with the prior art, the method extracts richer features, addresses the structural differences that arise between functionally similar code because of differing statement positions, and achieves higher detection precision.

Description

Clone code detection method based on feature alignment

Technical Field

The invention belongs to the technical field of software code analysis.

Background

The purpose of code clone detection is to decide whether two code fragments are clones by measuring their similarity. Code clone detection has proven valuable throughout the software development lifecycle. Identifying textually, syntactically, or functionally similar code fragments is the basis for many software engineering tasks, such as code classification, code refactoring, bug detection, and malicious code detection. In recent years, deep learning techniques have achieved good results in code clone detection, especially in detecting functionally similar clones.

However, the prior art focuses only on how to extract more distinctive features from source code, and some problems, such as the structural differences between functionally similar code, have not been clearly solved. In software development, when a programmer copies a code fragment, statements are often added or deleted, or a more flexible syntactic structure is used to implement the same function. As a result, the statements before and after copying are shifted relative to each other, producing structural differences.

Code fragments are usually converted into an abstract syntax tree or a program dependency graph, a CNN or RNN then learns a feature representation, and the similarity between features is computed to decide whether the fragments are similar. The learned features are typically two-dimensional tensors, and a global pooling operation is typically used to reduce them to vectors from which similarity can be computed. However, global pooling is inherently weak at handling code misalignment, so feature misalignment remains. When the similarity of code pairs with different structures but similar functions is computed, the misaligned features yield a low similarity, which can lead to wrong decisions. Performing an alignment operation on the code features can close the gap between the different structures of functionally similar code.

Disclosure of Invention

The purpose of the invention is as follows: in order to solve the problems in the prior art, the invention provides a clone code detection method based on feature alignment.

The technical scheme is as follows: the invention provides a clone code detection method based on feature alignment, which specifically comprises: inputting target code x and code y into a trained clone code detection model; the trained model outputs the similarity of code x and code y, and whether code x and code y are similar code is judged from that similarity; the clone code detection model processes the input code x and code y as follows:

Step 1: generating an abstract syntax tree T_x for code x and an abstract syntax tree T_y for code y using a code parsing tool;

Step 2: according to the statement nodes of the abstract syntax tree T_x, dividing T_x into a plurality of statement trees, and forming the statement trees into a statement tree sequence ST_x in the order of a pre-order traversal of the original abstract syntax tree T_x; according to the statement nodes of the abstract syntax tree T_y, dividing T_y into a plurality of statement trees, and forming the statement trees into a statement tree sequence ST_y in the order of a pre-order traversal of the original abstract syntax tree T_y;

Step 3: constructing statement vector matrices: performing word embedding on the node entities of each statement tree in each statement tree sequence, encoding each embedded statement tree into a statement vector with an encoder, and arranging the statement vectors in the order of the statement tree sequence to form the statement vector matrix of that sequence;

Step 4: using a bidirectional causal convolutional network to extract features from the statement vector matrix X_x of code x and the statement vector matrix X_y of code y respectively, obtaining the code feature F_x of code x and the code feature F_y of code y;

Step 5: computing, by a sparse reconstruction method, the alignment feature of code feature F_y with respect to code feature F_x and the alignment feature of code feature F_x with respect to code feature F_y;

Step 6: computing R_xy, the absolute value of the difference between F_x and the alignment feature of F_y with respect to F_x, and R_yx, the absolute value of the difference between F_y and the alignment feature of F_x with respect to F_y, and performing a max pooling operation on R_xy and R_yx respectively to obtain similarity feature vectors V_xy and V_yx;

Step 7: concatenating V_xy and V_yx along the feature dimension, inputting the concatenated vector into a fully connected layer, and inputting the output of the fully connected layer into a sigmoid function layer, which outputs the similarity S of code x and code y.
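Steps 1 and 2 can be illustrated with a minimal sketch. It assumes the input is Python source and uses Python's standard `ast` module as a stand-in for the code parsing tool, which the method does not name. The helper `statement_trees` and the choice to treat every `ast.stmt` node as a statement node are illustrative assumptions.

```python
import ast

def statement_trees(source: str) -> list[ast.stmt]:
    """Split a program's AST into statement subtrees, listed in pre-order."""
    tree = ast.parse(source)          # step 1: source code -> abstract syntax tree
    sequence = []

    def visit(node: ast.AST) -> None:
        # Pre-order: record a statement node before descending into it.
        if isinstance(node, ast.stmt):
            sequence.append(node)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)                       # step 2: statement tree sequence
    return sequence

code_x = "def total(xs):\n    s = 0\n    for v in xs:\n        s += v\n    return s\n"
st_x = statement_trees(code_x)
kinds = [type(node).__name__ for node in st_x]
print(kinds)  # ['FunctionDef', 'Assign', 'For', 'AugAssign', 'Return']
```

In this sketch a compound statement such as `For` still contains its nested statements as children; whether the statement trees of the method detach nested bodies is not stated, so the sketch leaves them in place.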

Further, the bidirectional causal convolutional network in step 4 comprises a first bidirectional causal convolution module and a second bidirectional causal convolution module connected in series; the two modules have the same structure, each comprising a 1 × 1 convolutional layer, a first bidirectional causal convolutional layer and a second bidirectional causal convolutional layer; the outputs of the 1 × 1 convolutional layer, the first bidirectional causal convolutional layer and the second bidirectional causal convolutional layer are added to form the output of the bidirectional causal convolution module; the first bidirectional causal convolutional layer has a 3 × 1 convolution kernel, and the second bidirectional causal convolutional layer has a 3 × 1 convolution kernel and a step size of 2.

Further, the bidirectional causal convolutional layer performs a forward convolution and a backward convolution on its input features, wherein:

the input comprises feature vectors indexed by t, with t = 0, 1, 2, …, T'; when features are extracted for code x, T' is the length of the statement tree sequence ST_x; when features are extracted for code y, T' is the length of the statement tree sequence ST_y;

K is the convolution kernel size of the bidirectional causal convolutional layer; the layer has one convolution kernel per direction, with f and b denoting forward and backward respectively;

the forward convolution operation outputs a t-th feature vector and the backward convolution operation outputs a t-th feature vector; concat denotes concatenation, and the t-th output feature vector F_t of the bidirectional causal convolutional layer is obtained by concatenating the two directional feature vectors.
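One plausible form of this operation is sketched below under stated assumptions. The symbols $x_t$ for the t-th input feature vector and $W^{f}_k$, $W^{b}_k$ for the slices of the forward and backward kernels are assumed notation, and inputs outside the range $0 \le t \le T'$ are taken to be zero padding.

```latex
\begin{aligned}
F^{f}_t &= \sum_{k=0}^{K-1} W^{f}_k \, x_{t-k}, \\
F^{b}_t &= \sum_{k=0}^{K-1} W^{b}_k \, x_{t+k}, \\
F_t &= \operatorname{concat}\left(F^{f}_t,\; F^{b}_t\right).
\end{aligned}
```

In this form the forward sum reads only positions at or before t and the backward sum reads only positions at or after t, which is what makes each direction causal.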

Further, step 5 specifically comprises:

computing the alignment feature of code feature F_y with respect to code feature F_x, specifically as follows:

constructing, by sparse reconstruction, an objective function that aligns code feature F_y to code feature F_x, wherein W_yx is the sparse reconstruction coefficient and β is a balance coefficient;

solving the objective function by the least squares method to obtain W_yx, wherein I_y is an identity matrix whose order is the length of the statement tree sequence ST_y and the superscript T denotes transposition, thereby obtaining the alignment feature of F_y with respect to F_x;

computing the alignment feature of code feature F_x with respect to code feature F_y, specifically as follows:

constructing, by sparse reconstruction, an objective function that aligns code feature F_x to code feature F_y, wherein W_xy is the sparse reconstruction coefficient;

solving the objective function by the least squares method to obtain W_xy, thereby obtaining the alignment feature of F_x with respect to F_y, wherein I_x is an identity matrix whose order is the length of the statement tree sequence ST_x.
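A sketch of one form consistent with this description is given below, under stated assumptions. F_x and F_y are assumed to store one statement feature per row; $T'_x$ and $T'_y$ denote the lengths of ST_x and ST_y; a squared Frobenius-norm penalty is assumed so that the least squares solution is closed-form; and $\hat{F}_{yx}$ is assumed notation for the alignment feature of F_y with respect to F_x.

```latex
\begin{aligned}
\min_{W_{yx}} \;& \left\lVert F_x - W_{yx}^{T} F_y \right\rVert_F^2 + \beta \left\lVert W_{yx} \right\rVert_F^2, \\
W_{yx} &= \left( F_y F_y^{T} + \beta I_y \right)^{-1} F_y F_x^{T}, \qquad I_y \in \mathbb{R}^{T'_y \times T'_y}, \\
\hat{F}_{yx} &= W_{yx}^{T} F_y \in \mathbb{R}^{T'_x \times D}.
\end{aligned}
```

Under the same assumptions, the reverse direction follows by exchanging x and y throughout.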

Further, the loss function of the clone code detection model is a binary cross-entropy loss,

wherein N is the total number of samples in the training set, each sample comprises two pieces of code, y^j is the label of the j-th sample, and S^j is the similarity between the two pieces of code in the j-th sample.
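The detailed description identifies this loss as binary cross-entropy. Assuming the standard averaged form (the symbol $\mathcal{L}$ is assumed notation), it can be written as:

```latex
\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y^{j} \log S^{j} + \left(1 - y^{j}\right) \log\left(1 - S^{j}\right) \right]
```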

Beneficial effects: the method extracts code features with a bidirectional causal convolutional network, which captures context information in both directions and improves the representation of the code; the network has fewer parameters than the bidirectional RNNs used by other methods, and detects code clones faster and more accurately. Aligning the code features through sparse reconstruction effectively eliminates the influence of code feature misalignment. For the more difficult clone types, namely similar code in which statements have been added, deleted or modified, and similar code with the same semantics but substantially different syntactic structure, the invention significantly improves detection accuracy and recall.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a schematic diagram of a method of code alignment;

FIG. 3 is a schematic diagram of a bidirectional causal convolution network, where (a) is a schematic diagram of the structure of the bidirectional causal convolution network and (b) is a schematic diagram of the structure of the bidirectional causal convolution module.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.

The embodiment provides a clone code detection method based on feature alignment, which specifically comprises the following steps: inputting the code x and the code y into a trained clone code detection model; the trained clone code detection model outputs the similarity of the code x and the code y, and whether the code x and the code y are similar codes is judged according to the similarity of the code x and the code y; the clone code detection model converts an original code segment pair into an abstract syntax tree, then divides the abstract syntax tree to generate a statement tree sequence consisting of a plurality of statement trees, constructs a statement vector matrix, extracts features by using a bidirectional causal convolution network, aligns the features by using the proposed sparse reconstruction feature alignment method, and calculates the similarity of the aligned code features.

As shown in fig. 1 and 2, the specific clone code detection model processes the code x and the code y as follows:

(1) taking code x and code y as a code pair, and generating the abstract syntax trees T_x and T_y of code x and code y with an existing code parsing tool;

(2) according to the statement nodes of the abstract syntax tree, dividing T_x into a plurality of statement trees and forming them into a statement tree sequence ST_x in the traversal order of the original abstract syntax tree; likewise dividing T_y into a plurality of statement trees and forming them into a statement tree sequence ST_y in the traversal order of the original abstract syntax tree; this step serves two purposes: first, keeping a fixed order preserves as much of the structural information of the original code as possible; second, once a sequence is formed, subsequent feature extraction can be performed;

(3) performing word embedding on the node entities in the statement trees using Word2Vec; encoding each embedded statement tree into a statement vector with a statement encoder, and constructing a statement vector matrix in the order of the statement tree sequence, thereby obtaining the statement vector matrix of the statement tree sequence ST_x and the statement vector matrix of the statement tree sequence ST_y;

(4) inputting the statement vector matrices into a bidirectional causal convolutional network for feature extraction, thereby obtaining the code feature F_x of code x and the code feature F_y of code y;

(5) generating alignment features from the code features of the code pair by sparse reconstruction, namely the alignment feature of code feature F_y with respect to code feature F_x and the alignment feature of code feature F_x with respect to code feature F_y;

(6) subtracting the alignment feature from the original feature, taking the absolute value, and performing max pooling along the dimension that indexes the statement trees, to obtain the similarity feature vector of each code with respect to the other;

(7) concatenating the two similarity feature vectors along the feature dimension, inputting the concatenated vector into a fully connected layer, and inputting the output of the fully connected layer into a sigmoid function layer, which outputs the similarity S of code x and code y.

In step 4, as shown in FIG. 3, the bidirectional causal convolutional network is formed by stacking two bidirectional causal convolution modules in series. Each module consists of a 1 × 1 convolutional layer and two bidirectional causal convolutional layers, and is used to capture information at different scales. The convolution kernels of the two bidirectional causal convolutional layers are both 3 × 1, with step sizes of 1 and 2 respectively.

In the bidirectional causal convolutional layer, the forward convolution operation outputs a t-th feature vector and the backward convolution operation outputs a t-th feature vector; the two are concatenated to form the t-th output feature vector F_t of the layer.

The output of the bidirectional causal convolution module is obtained by adding the output of the 1 × 1 convolutional layer to the outputs of the two bidirectional causal convolutional layers. Passing the statement vector matrices X_x and X_y through the bidirectional causal convolutional network yields the code features F_x and F_y respectively. Each feature is a matrix indexed by the length of the statement tree sequence of the corresponding code and by D, where D represents the dimension of the feature.
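A bidirectional causal convolutional layer of this kind can be sketched in plain Python. The function names, the zero padding at the sequence ends, and the toy all-ones kernels are illustrative assumptions. The sketch also exposes a `dilation` parameter: reading the step size of 2 as a dilation of 2 is an assumption, adopted here because the module adds the layer outputs together and they therefore need the same sequence length.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def causal_conv(xs, kernel, direction, dilation=1):
    """One causal 1-D convolution over a sequence of feature vectors.

    direction=+1 reads positions t, t-d, t-2d, ... (forward);
    direction=-1 reads positions t, t+d, t+2d, ... (backward).
    Positions outside the sequence are treated as zeros.
    """
    out_dim = len(kernel[0])
    result = []
    for t in range(len(xs)):
        acc = [0.0] * out_dim
        for k, w in enumerate(kernel):
            src = t - direction * k * dilation
            if 0 <= src < len(xs):
                acc = [a + b for a, b in zip(acc, matvec(w, xs[src]))]
        result.append(acc)
    return result

def bidirectional_causal_conv(xs, kernel_f, kernel_b, dilation=1):
    """Concatenate forward and backward causal outputs at every position."""
    fwd = causal_conv(xs, kernel_f, +1, dilation)
    bwd = causal_conv(xs, kernel_b, -1, dilation)
    return [f + b for f, b in zip(fwd, bwd)]

# Toy input: 4 statement vectors of dimension 1, kernel size K = 3.
xs = [[1.0], [2.0], [3.0], [4.0]]
ones = [[[1.0]], [[1.0]], [[1.0]]]   # three 1x1 kernel slices, all weights 1
features = bidirectional_causal_conv(xs, ones, ones)
print(features)  # [[1.0, 6.0], [3.0, 9.0], [6.0, 7.0], [9.0, 4.0]]
```

Each output vector holds the forward result first and the backward result second, so the first entry summarizes the statements up to position t and the second summarizes the statements from position t onward.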

In an embodiment of the present invention, step 5 specifically comprises:

Computing the alignment feature of code feature F_y with respect to code feature F_x. An objective function is constructed by sparse reconstruction, in which W_yx is the sparse reconstruction coefficient, that is, the alignment matrix, and β is the balance coefficient. The objective function is solved by the least squares method, giving W_yx in closed form; the solution involves a matrix transpose, a matrix inverse and an identity matrix. The alignment feature of code y with respect to code x is then obtained from the transpose of W_yx and the code feature F_y.

Computing the alignment feature of code feature F_x with respect to code feature F_y. An objective function is constructed in the same way, in which W_xy is the sparse reconstruction coefficient. The objective function is solved by the least squares method, giving W_xy in closed form; the solution again involves a matrix transpose, a matrix inverse and an identity matrix. The alignment feature of code x with respect to code y is then obtained from the transpose of W_xy and the code feature F_x.
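The closed-form alignment can be sketched in plain Python. The sketch assumes a ridge-style least squares solution, W = (F_src F_src^T + βI)^{-1} F_src F_dst^T, with one statement feature per row; this particular form, the function names and the toy features are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def align(f_src, f_dst, beta):
    """Reconstruct f_dst (rows = statements) from the rows of f_src."""
    gram = matmul(f_src, transpose(f_src))            # T_src x T_src
    for i in range(len(gram)):
        gram[i][i] += beta                            # add beta * identity
    w = solve(gram, matmul(f_src, transpose(f_dst)))  # alignment matrix, T_src x T_dst
    return matmul(transpose(w), f_src)                # aligned feature, T_dst x D

f_x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
f_y = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # same statements, swapped order
aligned = align(f_y, f_x, beta=1e-9)       # alignment feature of F_y w.r.t. F_x
```

In this toy case code y contains the same two statement features as code x in the opposite order, and with a very small β the aligned feature is numerically almost equal to F_x, which illustrates how alignment removes the effect of statement reordering.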

In an embodiment of the present invention, in step 6, the feature of code x is F_x and the alignment feature of code y with respect to code x is obtained as in step 5. The difference between the two features is computed and its absolute value taken, giving R_xy; max pooling over R_xy yields the vector V_xy. Likewise, the feature of code y is F_y and the alignment feature of code x with respect to code y is obtained as in step 5; the absolute value of their difference gives R_yx, and max pooling over R_yx yields the vector V_yx.
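Step 6 can be sketched as an element-wise absolute difference followed by a maximum over the statement dimension. The function name and the numbers, including the alignment feature values, are hypothetical.

```python
def similarity_vector(feature, aligned):
    """Element-wise |feature - aligned|, then max over the statement dimension."""
    diff = [[abs(a - b) for a, b in zip(row_f, row_a)]
            for row_f, row_a in zip(feature, aligned)]
    return [max(col) for col in zip(*diff)]

f_x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]         # 3 statements, D = 2
aligned_yx = [[1.5, 2.0], [3.0, 1.0], [4.0, 6.5]]  # hypothetical alignment feature
v_xy = similarity_vector(f_x, aligned_yx)
print(v_xy)  # [1.0, 3.0]
```

The result has one entry per feature dimension regardless of how many statements the code contains, so codes of different lengths yield vectors of the same size.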

In one embodiment of the invention, in step 7, the vectors V_xy and V_yx are concatenated along the feature dimension to obtain a vector V; the vector V is input into a fully connected layer FC, and the similarity S is obtained through a sigmoid function.
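Step 7 can be sketched as follows, assuming a fully connected layer with a single output unit. The weights and bias are hypothetical, untrained values chosen only to make the example concrete.

```python
import math

def similarity(v_xy, v_yx, weights, bias):
    """Concatenate both vectors, apply one linear unit, then a sigmoid."""
    v = v_xy + v_yx                                    # concatenation along the feature dimension
    z = sum(w * x for w, x in zip(weights, v)) + bias  # fully connected layer, one output
    return 1.0 / (1.0 + math.exp(-z))                  # sigmoid maps z into (0, 1)

s = similarity([1.0, 3.0], [0.5, 2.0], weights=[-0.5, -0.5, -0.5, -0.5], bias=3.0)
print(round(s, 4))
```

With negative weights, as in this hypothetical setting, larger differences between original and aligned features push the similarity toward 0, and a pair with no differences would score close to 1.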

The clone code detection model uses a binary cross-entropy loss function, wherein N is the total number of samples in the training set and each sample comprises two pieces of code; y^j is the label of the j-th sample, set manually (1 if the two pieces of code in the sample are similar code, that is, clone code, and 0 otherwise); and S^j is the similarity between the two pieces of code in the j-th sample, with a value between 0 and 1. During training, the loss is computed and its gradient back-propagated, and the model parameters are updated by gradient descent so that the loss decreases; training ends after a given number of iterations or when the loss falls below a given value, yielding the final clone code detection model.
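The loss computation can be illustrated numerically with a short sketch. It assumes the standard averaged binary cross-entropy; the function name and sample values are hypothetical.

```python
import math

def bce_loss(labels, similarities):
    """Mean binary cross-entropy between labels (0 or 1) and scores in (0, 1)."""
    n = len(labels)
    total = 0.0
    for y, s in zip(labels, similarities):
        total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total / n

good = bce_loss([1, 0], [0.9, 0.1])   # predictions agree with the labels
bad = bce_loss([1, 0], [0.1, 0.9])    # predictions contradict the labels
print(good < bad)  # True
```

Predictions that agree with the labels give a small loss and predictions that contradict them give a large one, which is the quantity that gradient descent drives down during training.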

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. To avoid unnecessary repetition, the possible combinations are not described separately.

Claims

1. A clone code detection method based on feature alignment is characterized in that a target code x and a code y are input into a trained clone code detection model; the trained clone code detection model outputs the similarity of the code x and the code y, and whether the code x and the code y are similar codes is judged according to the similarity of the code x and the code y; the clone code detection model performs the following processing on the input code x and the input code y:

Step 1: generating an abstract syntax tree T_x for code x and an abstract syntax tree T_y for code y using a code parsing tool;

Step 2: according to the statement nodes of the abstract syntax tree T_x, dividing T_x into a plurality of statement trees, and forming the statement trees into a statement tree sequence ST_x in pre-order traversal order; according to the statement nodes of the abstract syntax tree T_y, dividing T_y into a plurality of statement trees, and forming the statement trees into a statement tree sequence ST_y in pre-order traversal order;

Step 5: computing the alignment feature of code feature F_y with respect to code feature F_x and the alignment feature of code feature F_x with respect to code feature F_y;

Step 6: computing

2. The clone code detection method based on feature alignment according to claim 1, wherein the bidirectional causal convolutional network in step 4 comprises a first bidirectional causal convolution module and a second bidirectional causal convolution module connected in series; the two modules have the same structure, each comprising a 1 × 1 convolutional layer, a first bidirectional causal convolutional layer and a second bidirectional causal convolutional layer; the outputs of the 1 × 1 convolutional layer, the first bidirectional causal convolutional layer and the second bidirectional causal convolutional layer are added to form the output of the bidirectional causal convolution module; the first bidirectional causal convolutional layer has a 3 × 1 convolution kernel, and the second bidirectional causal convolutional layer has a 3 × 1 convolution kernel and a step size of 2.

3. The method according to claim 2, wherein the bidirectional causal convolutional layer performs the following operations on the input features: