CN112905186A - High signal-to-noise ratio code classification method and device suitable for open-source software supply chain - Google Patents


Info

Publication number
CN112905186A
CN112905186A (Application CN202110168454.XA)
Authority
CN
China
Prior art keywords
node
path
ast
code
syntax tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110168454.XA
Other languages
Chinese (zh)
Other versions
CN112905186B (en)
Inventor
李浩晨
吴敬征
武延军
罗天悦
杨牧天
崔星
段旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110168454.XA priority Critical patent/CN112905186B/en
Publication of CN112905186A publication Critical patent/CN112905186A/en
Application granted granted Critical
Publication of CN112905186B publication Critical patent/CN112905186B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/42: Syntactic analysis
    • G06F8/427: Parsing
    • G06F8/43: Checking; Contextual analysis
    • G06F8/436: Semantic checking
    • G06F8/437: Type checking
    • G06F8/44: Encoding
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a high signal-to-noise-ratio code classification method and device suitable for an open-source software supply chain. The method comprises the following steps: convert the code to be predicted into a PE-AST, digitize each node, extract the PE-AST paths, convert each PE-AST path into an operable tuple, calculate the correlation coefficient WS, update the path representations, and predict from the PE-AST feature vector. The invention improves the signal-to-noise ratio of the code representation process, thereby improving the accuracy of machine code classification; the resulting classification improves programmers' working efficiency in code understanding and code maintenance.

Description

High signal-to-noise ratio code classification method and device suitable for open-source software supply chain
Technical Field
The invention belongs to the technical field of computers, and relates to a high signal-to-noise ratio code classification method and device suitable for an open source software supply chain.
Background
Over the past decade a large amount of open-source software has emerged, and it forms the core of the open-source software supply chain. For a programmer, correctly classifying the large amount of source code contained in open-source software helps improve work efficiency. First, grouping applications with similar functionality makes it more convenient for programmers to find the functions they need to implement among applications belonging to the same group or category. Second, the same code vulnerabilities often exist widely in code implementing the same type of function; that is, code of the same type often shares common vulnerabilities, so when a programmer finds a vulnerability in one piece of code, other places where a similar error may occur can be located quickly, which improves maintenance efficiency.
Currently, code classification methods are generally based on neural networks: through learning from a large number of samples, the neural network model finds specific rules in the data and then classifies code according to those rules in practical use. However, if the samples and the program under test are complex, that is, the number of tokens (the words obtained by segmenting the program's statements) is large, the noise contained in the code increases significantly, and such methods cannot fully capture the effective rules applicable to code classification, which reduces classification accuracy. In this case, the signal-to-noise ratio should be increased to obtain a semantic representation of the code that contains the program's key information, thereby avoiding the loss of accuracy.
In summary, the prior art suffers from insufficient code-classification accuracy for the current open-source software supply chain.
Disclosure of Invention
The invention aims to provide a high signal-to-noise ratio code classification method and device suitable for an open source software supply chain.
In order to achieve the purpose, the invention adopts the following technical scheme:
a high signal-to-noise ratio code classification method suitable for an open source software supply chain comprises the following steps:
1) Parse the syntax tree of the program to be predicted to generate an abstract syntax tree T_AST of the code of the program to be predicted, and construct the tuple <T_AST, pos>, where T_AST = (N, T, X, s, δ, φ): N is the non-end node set, T is the end node set, X is the actual value of each node in the abstract syntax tree, s is the root node, δ is the correspondence between parent and child nodes in the abstract syntax tree, φ is the correspondence between each node in the abstract syntax tree and its actual value, and pos is the position coordinate of each node in the abstract syntax tree;
2) Input the tuple <T_AST, pos> into a code classification model to obtain the classification prediction result of the program to be predicted;
wherein the code classification model is trained by a deep-learning method on the classification indexes of a plurality of sample programs and their corresponding tuples <T′_AST, pos′>; the code classification model parses the tuple <T_AST, pos> as follows:
a) encode the correspondence φ(n) of each node, map the encoded result to a vector space, and obtain the final vector representation v(n) of each node from the resulting node vector and the distance from the corresponding position coordinate pos to the root node s;
b) extract paths from the abstract syntax tree T_AST according to the non-end node set N, the end node set T, the root node s and the correspondence δ, and combine the final vector representations v(n) to construct the vector representation emb(L_i) of each path L_i, where i is the end-node number;
c) update the vector representations emb(L_i) by computing the correlation coefficient WS between each path and the other paths to obtain the path representations z_i, and apply max pooling to the path representations z_i to obtain the final vector representation e_code of the program code to be predicted;
d) obtain the classification index from the final vector representation e_code, yielding the classification prediction result of the program to be predicted.
Further, the method for parsing the syntax tree includes: using the javalang package in Python.
Further, the position coordinates pos are obtained as follows:
1) compute the coordinate x of node n from the depth n_depth of node n in the abstract syntax tree T_AST and the depth T_depth of the abstract syntax tree: x = n_depth / T_depth;
2) compute the coordinate y of node n from the x value of the parent of node n, the number of sibling nodes, and the position n_q of node n among its siblings (the exact formula appears only as an image in the original);
3) obtain the position coordinate pos of node n from the coordinates x and y.
Further, the framework for training the code classification model includes: the PyTorch framework.
Further, the encoded result is E(φ(n)) = W_E · φ(n), where W_E ∈ R^{N′×E}, N′ is the number of distinct node types, and E is the embedding dimension. The final vector representation v(n) weights E(φ(n)) by the distance of node n from the root (the exact formula appears only as an image in the original), where (x, y) is the coordinate of node n in the coordinate system with the root node s as the origin.
Further, the correlation coefficient WS is calculated as follows:
1) for any two paths L_i and L_j, compute the end-node semantic similarity WS_token from the end-node vectors, where v(L_i1) denotes the end node of path L_i, v(L_j1) denotes the end node of path L_j, j is the serial number of the path, and j ≠ i (the exact formula appears only as an image in the original);
2) for the two paths L_i and L_j, compute the path semantic similarity WS_path = sigmoid(W_path · [emb(L_i), emb(L_j)] + b_path), where W_path ∈ R^{6E×1}, E is the embedding dimension used when encoding φ(n), and b_path is a bias;
3) the correlation coefficient is WS_{i,j} = α · WS_token + β · WS_path, where α is the first coefficient and β is the second coefficient.
Further, the path representation z_i is computed from emb(L_i), the correlation coefficients WS_{i,j}, and a linear transformation W_v (the exact formula appears only as an image in the original), where N_L is the set of paths other than path L_i.
Further, the classification index is obtained as follows:
1) apply linear and nonlinear transformations to the final vector representation e_code;
2) classify the transformed result with the Softmax() function to obtain the probability distribution P_d of the predicted result;
3) select the index corresponding to the maximum value in the probability distribution P_d to obtain the classification index of the program to be predicted.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above-mentioned method when executed.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method as described above.
Compared with the prior art, the invention has the following advantages:
1) the signal-to-noise ratio in the code representation process can be improved, so that the accuracy of machine classification codes is improved;
2) according to the classification of the codes, the working efficiency of programmers in the aspects of code understanding and code maintenance is improved.
Drawings
FIG. 1 is a flow chart of a high signal-to-noise ratio code classification method suitable for use in an open source software supply chain.
Fig. 2 is a diagram illustrating a PE-AST path corresponding to a code.
Detailed Description
In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features, and advantages of the present invention more comprehensible, the technical core of the present invention is described in further detail below with reference to the accompanying drawings and examples.
The general flow of the high signal-to-noise-ratio code classification method of this embodiment is shown in FIG. 1 and mainly includes the following steps:
1. Convert the code to be predicted into a PE-AST. Specifically:
1a) Perform AST analysis on the Java program with the javalang package in Python to generate the abstract syntax tree of the code segment, denoted T_AST = (N, T, X, s, δ, φ): N is the set of non-end nodes, T is the set of end nodes, X is the actual value of each node, s is the root node, δ is the correspondence between parent and child nodes, and φ is the correspondence between nodes and actual values. Go to 1b).
1b) Traverse T_AST and generate the position coordinates pos = (x, y) of each node. For a node n, the coordinate x is
x = n_depth / T_depth,
where n_depth is the depth of node n in T_AST and T_depth is the depth of T_AST (i.e., the depth of the deepest node in T_AST). The coordinate y is computed from x_p, the x value of the parent node of n, n_num, the number of sibling nodes of n, and n_i, the position of node n among its siblings, where positions start at 1 from the left and increase (the exact formula appears only as an image in the original). Go to 1c).
1c) Construct the tuple <T_AST, pos> as the PE-AST of the code segment.
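Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tree is a hand-built toy stand-in for a javalang AST, x = n_depth / T_depth follows the text, and since the y formula is given only as an image in the source, the y used below (parent's x plus the node's normalized sibling position) is purely an assumption for demonstration.

```python
# Sketch of step 1: attach position coordinates pos = (x, y) to each AST node.
# The tree is a hand-built toy stand-in for a javalang AST. x = n_depth / T_depth
# follows the text; the y formula is NOT recoverable from the source, so the
# variant below (parent x plus normalized sibling position) is an assumption.

class Node:
    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = children or []
        self.pos = None  # (x, y), filled in by annotate_positions

def tree_depth(node, depth=0):
    if not node.children:
        return depth
    return max(tree_depth(c, depth + 1) for c in node.children)

def annotate_positions(node, t_depth, depth=0, parent_x=0.0, sib_idx=1, n_sib=1):
    x = depth / t_depth                      # normalized depth, per the patent text
    y = parent_x + sib_idx / n_sib           # ASSUMED form of the y coordinate
    node.pos = (x, y)
    k = len(node.children)
    for i, child in enumerate(node.children, start=1):
        annotate_positions(child, t_depth, depth + 1, x, i, k)

# Toy AST for a statement like: int a = 1;
root = Node("MethodDecl", [
    Node("VarDecl", [Node("Type"), Node("Name"), Node("Literal")]),
])
annotate_positions(root, tree_depth(root))
print(root.pos)              # root sits at depth 0, so x = 0.0
print(root.children[0].pos)  # depth 1 of 2, so x = 0.5
```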
2. Digitize each node. Specifically:
2a) Encode φ(n) with a matrix W_E ∈ R^{N×E}, i.e.:
E(φ(n)) = W_E · φ(n),
where N is the number of distinct node types (the types include initialization, declaration, identification, and many others) and E is the embedding dimension. Go to 2b).
2b) Compute an embedding weight coefficient from the distance of node n to the origin and multiply it by the result of 2a) to obtain the final node vector representation v(n), where the origin is the root node of the AST (the exact formula appears only as an image in the original). Go to 3.
3. Extract the PE-AST paths from the PE-AST. A PE-AST path is a sequence n_1 … n_{k-1} s of length k, where n_1 is an end node of the PE-AST and s is the root node; for i ∈ [2, k-1], every n_i is a non-end node. A PE-AST path is written in the form <n_1, p, s>, where p denotes the sequence with n_1 and s removed. A PE-AST path thus starts at an end node, traverses a series of non-end nodes, and ends at the root node. For example, the portion enclosed by the dashed box in FIG. 2 is an example of a PE-AST path.
4. Convert each PE-AST path into an operable tuple of the form <v(n_1), v(p), v(s)>. Specifically:
4a) v(p) is computed by summing the vectors of the nodes on the PE-AST path other than the start and end points, i.e.:
v(p) = Σ_{i=2}^{k-1} v(n_i).
Go to 4b).
4b) Construct the triple <v(n_1), v(p), v(s)> as the vector representation of the PE-AST path for subsequent calculation.
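Steps 3 and 4 together can be sketched as follows: walk every end node up to the root, then form <v(n_1), v(p), v(s)> with v(p) as the sum of the interior node vectors, exactly as the text describes. The 2-D node vectors here are toy stand-ins for the v(n) produced in step 2.

```python
# Sketch of steps 3-4: extract each PE-AST path (end node up to the root) and
# convert it into an operable triple <v(n1), v(p), v(s)>, where v(p) sums the
# vectors of the interior (non-end) nodes, as the text describes.

def leaf_to_root_paths(node, trail=None):
    """Yield every PE-AST path as a list [end node, ..., root]."""
    trail = (trail or []) + [node]
    if not node.children:
        yield list(reversed(trail))  # end node first, root last
    for child in node.children:
        yield from leaf_to_root_paths(child, trail)

def vec_sum(vectors, dim):
    out = [0.0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            out[i] += x
    return out

class Node:
    def __init__(self, kind, vec, children=None):
        self.kind, self.vec, self.children = kind, vec, children or []

# Toy tree: node vectors are stand-ins for the v(n) of step 2.
root = Node("Root", [1.0, 0.0], [
    Node("Stmt", [0.0, 1.0], [Node("Name", [2.0, 2.0]), Node("Literal", [3.0, 1.0])]),
])

triples = []
for path in leaf_to_root_paths(root):
    n1, s = path[0], path[-1]
    interior = path[1:-1]  # everything between the end node and the root
    triples.append((n1.vec, vec_sum([n.vec for n in interior], 2), s.vec))

print(len(triples))  # one triple per end node
```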
5. Select any PE-AST path and compute the correlation coefficient WS between the other PE-AST paths and this path. The correlation coefficient indicates the degree to which different PE-AST paths in the same code segment act in concert. The steps are as follows:
5a) For two PE-AST paths L_1 and L_2, compute the end-node semantic similarity WS_token, where v(L_11) denotes the end node of L_1 and v(L_21) denotes the end node of L_2 (the exact formula appears only as an image in the original). Go to 5b).
5b) For the two PE-AST paths L_1 and L_2, compute the path semantic similarity WS_path, i.e.:
WS_path = sigmoid(W_path · [emb(L_1), emb(L_2)] + b_path),
where W_path ∈ R^{6E×1} and b_path is the bias. Go to 5c).
5c) Average the end-node semantic similarity WS_token and the path semantic similarity WS_path to obtain the path correlation coefficient WS, i.e.:
WS = 0.5 · WS_token + 0.5 · WS_path.
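Step 5 can be sketched as below. WS_path and the 0.5/0.5 averaging follow the formulas in the text; WS_token, however, is shown only as an image in the source, so the cosine similarity of the two end-node vectors used here is an assumed stand-in, and w_path / b_path are illustrative rather than learned values.

```python
# Sketch of step 5: the correlation coefficient WS between two PE-AST paths.
# WS_path = sigmoid(W_path . [emb(L1), emb(L2)] + b_path) follows the text;
# WS_token is shown only as an image in the source, so cosine similarity of
# the two end-node vectors is used here as an ASSUMED stand-in.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ws(emb1, emb2, tok1, tok2, w_path, b_path, alpha=0.5, beta=0.5):
    """WS = alpha * WS_token + beta * WS_path (the text uses 0.5 / 0.5)."""
    ws_token = cosine(tok1, tok2)          # assumed form of WS_token
    concat = emb1 + emb2                   # [emb(L1), emb(L2)]
    ws_path = sigmoid(sum(w * x for w, x in zip(w_path, concat)) + b_path)
    return alpha * ws_token + beta * ws_path

# Two tiny paths with 3-D embeddings, so w_path has 6 entries (cf. R^{6E x 1}).
score = ws([1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
           [1.0, 0.0], [1.0, 0.0],
           w_path=[0.1] * 6, b_path=0.0)
print(score)
```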
6. Update the path representations. Take the set of PE-AST path representations {emb(L_1), …, emb(L_{N_L})} as input, where N_L is the number of PE-AST paths. With emb(L) as input and the updated path representation z as output, z is computed from the correlation coefficients and a linear transformation W_v, implemented with a 1×1 convolution kernel, where i denotes a given PE-AST path and j ranges over the remaining PE-AST paths (the exact formula appears only as an image in the original).
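Step 6 can be sketched as below. The exact update formula appears only as an image in the source, so the variant here, which keeps each path's own embedding and adds in the other paths' embeddings transformed by W_v and weighted by WS_{i,j}, is an assumed form; the scalar w_v stands in for the 1×1-convolution linear transform.

```python
# Sketch of step 6: update each path representation using the correlation
# coefficients. The exact formula is shown only as an image in the patent; the
# ASSUMED variant below keeps each path's own embedding and adds the other
# paths' embeddings, transformed by W_v and weighted by WS_ij.

def update_paths(embs, ws_matrix, w_v):
    """embs: list of path embeddings; ws_matrix[i][j]: WS between paths i and j;
    w_v: scalar stand-in for the 1x1-convolution linear transform W_v."""
    updated = []
    for i, emb_i in enumerate(embs):
        z = list(emb_i)  # start from the path's own representation (assumption)
        for j, emb_j in enumerate(embs):
            if j == i:
                continue
            for k, x in enumerate(emb_j):
                z[k] += ws_matrix[i][j] * w_v * x
        updated.append(z)
    return updated

embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ws_matrix = [[0.0, 0.5, 0.2],
             [0.5, 0.0, 0.4],
             [0.2, 0.4, 0.0]]
zs = update_paths(embs, ws_matrix, w_v=1.0)
print(zs[0])  # path 0 blended with paths 1 and 2
```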
7. Predict from the PE-AST feature vector. The steps are as follows:
7a) Apply max pooling over all updated PE-AST path representations to obtain the final vector representation of the whole code segment, i.e. e_code = [max(z_{i,1}), max(z_{i,2}), …, max(z_{i,E})], where i ∈ [1, N_p]. Go to 7b).
7b) Take e_code as the feature vector of the whole code segment, apply linear and nonlinear transformations, and use the Softmax() function to obtain the probability distribution P_d of the prediction result; select the index corresponding to the maximum value as the final prediction, giving the classification prediction result, i.e. P_d = Softmax(ReLU(W_code · e_code + b_code)), where N_r is the number of possible answers and b_code is the bias (the dimensions of W_code and b_code appear only as images in the original).
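Step 7 can be sketched end to end: element-wise max pooling over the path representations, then P_d = Softmax(ReLU(W_code · e_code + b_code)) and an argmax. Only the shapes and the formula follow the text; W_code and b_code below are random and zero stand-ins for learned parameters.

```python
# Sketch of step 7: max-pool the updated path representations z_i into e_code,
# then classify with P_d = Softmax(ReLU(W_code . e_code + b_code)). W_code and
# b_code are stand-ins for learned parameters; only the shapes follow the text
# (E-dimensional e_code, N_r possible answers).
import math
import random

random.seed(0)
E, N_R = 3, 4  # embedding dimension and number of possible answers

def max_pool(paths):
    """e_code = [max_i z_{i,1}, ..., max_i z_{i,E}]"""
    return [max(z[k] for z in paths) for k in range(len(paths[0]))]

def classify(e_code, w_code, b_code):
    logits = [max(0.0, sum(w * x for w, x in zip(row, e_code)) + b)  # ReLU
              for row, b in zip(w_code, b_code)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable Softmax
    total = sum(exps)
    probs = [e / total for e in exps]         # probability distribution P_d
    return probs, probs.index(max(probs))     # prediction = argmax of P_d

paths = [[0.2, -1.0, 0.5], [0.9, 0.1, -0.3], [0.4, 0.8, 0.0]]
e_code = max_pool(paths)                      # element-wise maxima
w_code = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(N_R)]
b_code = [0.0] * N_R
p_d, label = classify(e_code, w_code, b_code)
print(label)
```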
8. Train the model described in the above steps on a data set to obtain a trained deep-learning model; the PyTorch framework is used during training.
The inventors trained the model described above on the java14m data set. Its data are drawn from 10,072 GitHub projects and comprise 12,636,998 training samples, 371,362 validation samples and 368,445 test samples, with cloned code removed, which gives the data set strong specificity and rigor. An Adam optimizer was used with an initial learning rate of 0.01, and a trained deep-learning model was obtained after 10 passes over the full data set.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A high signal-to-noise ratio code classification method suitable for an open source software supply chain comprises the following steps:
1) Parse the syntax tree of the program to be predicted to generate an abstract syntax tree T_AST of the code of the program to be predicted, and construct the tuple <T_AST, pos>, where T_AST = (N, T, X, s, δ, φ): N is the non-end node set, T is the end node set, X is the actual value of each node in the abstract syntax tree, s is the root node, δ is the correspondence between parent and child nodes in the abstract syntax tree, φ is the correspondence between each node in the abstract syntax tree and its actual value, and pos is the position coordinate of each node in the abstract syntax tree;
2) Input the tuple <T_AST, pos> into a code classification model to obtain the classification prediction result of the program to be predicted;
wherein the code classification model is trained by a deep-learning method on the classification indexes of a plurality of sample programs and their corresponding tuples <T′_AST, pos′>; the code classification model parses the tuple <T_AST, pos> as follows:
a) encode the correspondence φ(n) of each node, map the encoded result to a vector space, and obtain the final vector representation v(n) of each node from the resulting node vector and the distance from the corresponding position coordinate pos to the root node s;
b) extract paths from the abstract syntax tree T_AST according to the non-end node set N, the end node set T, the root node s and the correspondence δ, and combine the final vector representations v(n) to construct the vector representation emb(L_i) of each path L_i, where i is the end-node number;
c) update the vector representations emb(L_i) by computing the correlation coefficient WS between each path and the other paths to obtain the path representations z_i, and apply max pooling to the path representations z_i to obtain the final vector representation e_code of the program code to be predicted;
d) obtain the classification index from the final vector representation e_code to obtain the classification prediction result of the program to be predicted.
2. The method of claim 1, wherein the method of syntax tree parsing comprises: using the javalang package in Python.
3. The method of claim 1, wherein the position coordinates pos are obtained by:
1) computing the coordinate x of node n from the depth n_depth of node n in the abstract syntax tree T_AST and the depth T_depth of the abstract syntax tree: x = n_depth / T_depth;
2) computing the coordinate y of node n from the x value of the parent of node n, the number of sibling nodes, and the position n_q of node n among its siblings (the exact formula appears only as an image in the original);
3) obtaining the position coordinate pos of node n from the coordinates x and y.
4. The method of claim 1, wherein the framework for training the code classification model comprises: the PyTorch framework.
5. The method of claim 1, wherein the encoded result is E(φ(n)) = W_E · φ(n) and the final vector representation v(n) weights E(φ(n)) by the distance of node n from the root (the exact formula appears only as an image in the original), where W_E ∈ R^{N′×E}, N′ is the number of distinct node types, E is the embedding dimension, and (x, y) is the coordinate of node n in the coordinate system with the root node s as the origin.
6. The method of claim 1, wherein the correlation coefficient WS is calculated by:
1) for any two paths L_i and L_j, computing the end-node semantic similarity WS_token, where v(L_i1) denotes the end node of path L_i, v(L_j1) denotes the end node of path L_j, j is the serial number of the path, and j ≠ i (the exact formula appears only as an image in the original);
2) for the two paths L_i and L_j, computing the path semantic similarity WS_path = sigmoid(W_path · [emb(L_i), emb(L_j)] + b_path), where W_path ∈ R^{6E×1}, E is the embedding dimension used when encoding φ(n), and b_path is a bias;
3) the correlation coefficient is WS_{i,j} = α · WS_token + β · WS_path, where α is the first coefficient and β is the second coefficient.
7. The method of claim 6, wherein the path representation z_i is computed from emb(L_i), the correlation coefficients WS_{i,j}, and a linear transformation W_v (the exact formula appears only as an image in the original), where N_L is the set of paths other than path L_i.
8. The method of claim 1, wherein the classification index is obtained by:
1) applying linear and nonlinear transformations to the final vector representation e_code;
2) classifying the transformed result with the Softmax() function to obtain the probability distribution P_d of the predicted result;
3) selecting the index corresponding to the maximum value in the probability distribution P_d to obtain the classification index of the program to be predicted.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202110168454.XA 2021-02-07 2021-02-07 High signal-to-noise ratio code classification method and device suitable for open-source software supply chain Active CN112905186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110168454.XA CN112905186B (en) 2021-02-07 2021-02-07 High signal-to-noise ratio code classification method and device suitable for open-source software supply chain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110168454.XA CN112905186B (en) 2021-02-07 2021-02-07 High signal-to-noise ratio code classification method and device suitable for open-source software supply chain

Publications (2)

Publication Number Publication Date
CN112905186A true CN112905186A (en) 2021-06-04
CN112905186B CN112905186B (en) 2023-04-07

Family

ID=76123652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110168454.XA Active CN112905186B (en) 2021-02-07 2021-02-07 High signal-to-noise ratio code classification method and device suitable for open-source software supply chain

Country Status (1)

Country Link
CN (1) CN112905186B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222419A1 (en) * 2005-12-06 2009-09-03 National Ict Australia Limited Succinct index structure for xml
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
US9262406B1 (en) * 2014-05-07 2016-02-16 Google Inc. Semantic frame identification with distributed word representations
CN107729925A (en) * 2017-09-26 2018-02-23 中国科学技术大学 The automatic method classified with scoring is done according to solution approach to program competition type source code
US20190005163A1 (en) * 2017-06-29 2019-01-03 International Business Machines Corporation Extracting a knowledge graph from program source code
CN109445834A (en) * 2018-10-30 2019-03-08 北京计算机技术及应用研究所 The quick comparative approach of program code similitude based on abstract syntax tree
CN110597735A (en) * 2019-09-25 2019-12-20 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN112181428A (en) * 2020-09-28 2021-01-05 北京航空航天大学 Abstract syntax tree-based open-source software defect data classification method and system


Also Published As

Publication number Publication date
CN112905186B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109086805B (en) Clustering method based on deep neural network and pairwise constraints
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and transform encoder
CN109063021B (en) Knowledge graph distributed expression method capable of coding relation semantic diversity structure
CN110673840A (en) Automatic code generation method and system based on tag graph embedding technology
CN109871454B (en) Robust discrete supervision cross-media hash retrieval method
US11900250B2 (en) Deep learning model for learning program embeddings
CN113987174A (en) Core statement extraction method, system, equipment and storage medium for classification label
CN113412492A (en) Quantum algorithm for supervised training of quantum Boltzmann machine
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN115617614A (en) Log sequence anomaly detection method based on time interval perception self-attention mechanism
Jahanshahi et al. nTreeClus: A tree-based sequence encoder for clustering categorical series
CN117077586B (en) Register transmission level resource prediction method, device and equipment for circuit design
CN111737694B (en) Malicious software homology analysis method based on behavior tree
CN113076545A (en) Deep learning-based kernel fuzzy test sequence generation method
CN112905186B (en) High signal-to-noise ratio code classification method and device suitable for open-source software supply chain
CN116861373A (en) Query selectivity estimation method, system, terminal equipment and storage medium
CN117271701A (en) Method and system for extracting system operation abnormal event relation based on TGGAT and CNN
CN116226864A (en) Network security-oriented code vulnerability detection method and system
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN112735604B (en) Novel coronavirus classification method based on deep learning algorithm
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN112381280A (en) Algorithm prediction method based on artificial intelligence
Su et al. A wavelet transform based protein sequence similarity model
Zhang et al. Reducing Test Cases with Attention Mechanism of Neural Networks
Wu et al. Discovering Mathematical Expressions Through DeepSymNet: A Classification-Based Symbolic Regression Framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant