CN116841869A - Java code examination comment generation method and device based on code structured information and examination knowledge - Google Patents


Info

Publication number
CN116841869A
CN116841869A (application CN202310658279.1A)
Authority
CN
China
Prior art keywords
code
model
text
examination
training stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310658279.1A
Other languages
Chinese (zh)
Inventor
杨立
李凌伟
马肖肖
张凤军
左春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202310658279.1A
Publication of CN116841869A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/3628 Software debugging of optimised code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Java code examination comment generation method and device based on code structured information and examination knowledge. The method comprises the following steps: obtaining original Java code examination data, and vectorizing the code text and the code examination comments; in the unsupervised training stage, introducing noise into the word vectors by perturbing the vector representation and adding mask labels, establishing a discriminative graph model and a language model, and restoring the perturbed input content; in the supervised training stage, generating code examination comments by fusing the models of the unsupervised training stage, assisted by a code change prediction task; and outputting automatic code examination comments for an input code to be examined according to the models produced in the unsupervised and supervised training stages. The invention improves the efficiency of code examination comment generation; by combining unsupervised and supervised training, it achieves high decoupling while improving the original performance of the model.

Description

Java code examination comment generation method and device based on code structured information and examination knowledge
Technical Field
The invention relates to a Java code examination comment generation method and device based on code structured information and examination knowledge, and belongs to the field of computer technology application.
Background
Since the concept of "code review" was first proposed in 1976, code review has evolved over the last few decades.
However, in practice two factors seriously hamper the realization of these advantages: 1) writing code examination comments consumes a great deal of time and yields unstable quality; 2) understanding the examined code is equally time-consuming, and the effectiveness of code understanding varies from person to person. Although related work has studied these problems, engineers in today's software development still need to put great effort into writing code examination comments.
Current code examination methods mainly fall into static-analysis methods and machine-learning-model methods. Static-analysis methods have the following disadvantages. First, their performance is limited and unstable: it depends on the accuracy of each static analysis tool, on the quality of the feature engineering, and on the accuracy of manual analysis, and such methods cannot examine code outside the scope of the selected static tools. Second, the automation process is insufficient and generalization is poor, because the model input is the result of secondary processing rather than the examined code itself; the method cannot analyze specific content within the code, so its application scenarios are limited. Existing machine-learning-model methods, being designed for general text tasks, lack performance and specialization on the code examination comment generation task, and suffer from coarse-grained representation and destruction of code structured information, so there remains room for improvement.
In general, although automatic code examination comment generation has received preliminary attention in academia, existing work ignores the specific structured knowledge of code and other useful information in the code examination scenario, and the limited performance of the models used prevents practical deployment.
Disclosure of Invention
Through analysis and research, the invention provides a Java code examination comment generation pre-training model based on code structured information and examination knowledge. The problem is modeled by combining Java code structured information (in the form of a code abstract syntax tree) with examination knowledge (in the form of code change information); a method is designed, model performance is verified through experiments, and the effectiveness of the generated results is analyzed.
The technical scheme adopted for solving the technical problems is as follows:
a Java code examination comment generation method based on code structured information and examination knowledge comprises the following steps:
data preparation stage: obtaining original Java code examination data;
word vector representation stage: performing syntactic analysis on the code text in the original Java code examination data using an abstract syntax tree to obtain a structured representation of the code text; performing word segmentation and related processing on the code examination comments in the original Java code examination data; and vectorizing the code text, the code structure, and the code examination comment text according to the corresponding model vocabularies to generate the corresponding word vectors;
unsupervised training stage: based on a language model, introducing noise into the word vectors in two ways, by perturbing the vector representation and by adding mask labels; then establishing a discriminative graph model and a language model (the graph model and language model within the pre-training model), and restoring the perturbed input content through unsupervised training;
supervised training stage: performing supervised training by fusing the graph model and language model of the unsupervised training stage, generating code examination comments, assisted by a code change prediction task; the code change prediction task predicts whether the current code segment needs to be deleted, added, or modified;
and outputting automatic code examination comments for the input code to be examined according to the models produced in the unsupervised and supervised training stages.
Further, the method for establishing the word vector representation of the code text comprises the following steps: a code text serialization representation and an abstract syntax tree structured representation are established.
Further, a vector representation of the code text and the code review comments is obtained by:
establishing a code abstract syntax tree, forming a formalized representation of the code text through level extraction, and converting it into a serialized vector representation through the SBT (Structure-Based Traversal) algorithm;
and (3) establishing a mapping dictionary, and mapping English labels into corresponding numbers to obtain vector representations of label data (code review comment text).
Further, the code review comments comprise the opinion content left after the code is reviewed, including but not limited to: bug finding, code improvement, and team communication.
Further, code text vectorization and code structure vectorization are achieved by:
1) The code text is represented as a code abstract syntax tree by using the formal representation functions in the javalang toolkit (https://github.com/c2nes/javalang);
2) The Java code text and its module levels are displayed as a tree, and the function-level nodes are extracted and separated, realizing vectorization of the code text;
3) The SBT algorithm converts the structured information, by depth-first search, into serialized information delimited by special symbols;
4) The obtained serialized information is input into a T5 model to obtain a vectorized representation of the structured information of the code text, realizing vectorization of the code structure.
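The bracketed serialization produced by the SBT step can be illustrated with a minimal sketch. The dict-based tree format and node names below are illustrative stand-ins for the javalang output, not the actual toolkit API:

```python
def sbt(node):
    """Structure-Based Traversal: serialize an AST into a bracketed token
    sequence that preserves node nesting. Hypothetical node format:
    {"type": ..., "children": [...]}."""
    tokens = ["(", node["type"]]
    for child in node.get("children", []):
        tokens.extend(sbt(child))          # depth-first descent
    tokens.extend([")", node["type"]])     # closing marker repeats the type
    return tokens

# Toy tree for a method declaration containing one return statement
tree = {"type": "MethodDeclaration",
        "children": [{"type": "ReturnStatement",
                      "children": [{"type": "Literal", "children": []}]}]}
print(" ".join(sbt(tree)))
```

Because each closing bracket repeats its node type, the flat token sequence can be unambiguously mapped back to the tree, which is what lets a sequence model consume structural information without losing it.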
Further, the structured information includes: node information, node relation and hierarchical division.
Further, the language model is a model pre-trained on a large dataset; such models can be used for transfer learning, i.e., fine-tuning on a new task, to obtain better performance.
Further, the language model includes: the T5 model (Text-to-Text Transfer Transformer).
Further, the T5 model includes: the t5-base-java model open-sourced by the Hugging Face community (https://huggingface.co/SEBIS/code_trans_t5_large_code_documentation_generation_java_multitask_finetune).
Further, the method for automatically outputting code examination comments for the input code to be examined comprises: in the unsupervised training stage, a noise-injection-and-denoising scheme combining random scrambling and a masking model; in the supervised training stage, multi-task fine-tuning. The multiple tasks are the code change prediction task and the code examination comment generation task; multi-task fine-tuning means fine-tuning the pre-training model parameters on these two tasks.
Further, the automatic code examination comment output is performed on the input code to be examined, and the code examination comment is obtained through the following steps:
1) Discriminative training. Based on the T5 model, a discriminator is added after the output layer of the model, and the positions of corrupted words are inversely predicted so as to recover the original text data. The discriminator predicts the corrupted node information from the text word vector representation output by the T5 model, takes the maximum prediction probability as the prediction result, and compares it with the original result to obtain feedback; training runs for 5,000 steps with 16 samples per step.
2) Mask model training. Text words are randomly covered with masks, the masked words are then inversely predicted, and the content is restored; training runs for 5,000 steps with 16 samples per step. The masked content may be one word, several consecutive words, or no words at all.
3) Code change information prediction. For each line of code in the code to be examined, the model captures the sentence vector label represented by the previous layer and performs a four-way classification task at the output layer: added, deleted, modified, or unchanged; training runs for 3,600 steps with 16 samples per step.
4) Code examination comment generation. The model generates text in the text-to-text form of the T5 model, sequentially predicting output words at the output layer until a stop token is produced or the maximum prediction length is reached; training runs for 3,600 steps with 16 samples per step.
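The sequential word prediction in step 4) can be sketched as a greedy decoding loop. The `toy_decoder` below is a hypothetical stand-in for the T5 decoder's next-token distribution, not the actual model:

```python
def greedy_decode(next_token_logits, max_len=32, stop="</s>"):
    """Emit the highest-scoring token at each position until the stop
    token is produced or the maximum prediction length is reached."""
    out = []
    while len(out) < max_len:
        scores = next_token_logits(out)    # dict: token -> score
        tok = max(scores, key=scores.get)  # greedy argmax
        if tok == stop:
            break
        out.append(tok)
    return out

# Toy "decoder": walks through a fixed comment, then emits the stop token.
CANNED = ["please", "add", "a", "null", "check", "</s>"]
def toy_decoder(prefix):
    nxt = CANNED[len(prefix)]
    return {t: (1.0 if t == nxt else 0.0) for t in CANNED}

print(" ".join(greedy_decode(toy_decoder)))
```

A real system would replace the toy scorer with the fine-tuned model's softmax output and could also use beam search; the termination conditions are the same ones the patent describes.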
Further, the word corruption includes: random word scrambling, random word removal, random word duplication, and random word rotation.
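The four corruption operations can be sketched as follows, assuming token lists and a seeded random generator; the function names are illustrative, not from the patent:

```python
import random

def shuffle_words(tokens, rng):   # random word scrambling
    out = tokens[:]
    rng.shuffle(out)
    return out

def drop_word(tokens, rng):       # random word removal
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def duplicate_word(tokens, rng):  # random word duplication
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def rotate_words(tokens, rng):    # random word rotation
    k = rng.randrange(1, len(tokens))
    return tokens[k:] + tokens[:k]

rng = random.Random(7)
sent = ["return", "a", "+", "b", ";"]
corrupted = rotate_words(duplicate_word(sent, rng), rng)
```

Each operation preserves enough of the original sequence that a discriminator can be trained to locate and undo the damage.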
A Java code review comment generation apparatus based on code structured information and review knowledge, comprising:
the data acquisition module is used for acquiring original Java code examination data;
the vectorization representation module is used for vectorizing the code text and the code examination comments in the original Java code examination data, and comprises code text vectorization, code structure vectorization and code examination comment text vectorization to generate corresponding word vectors;
the non-supervision training module is used for introducing noise to word vectors through two modes of disturbing vector representation and adding mask labels respectively in a non-supervision training stage, then establishing a discriminant graph model and a language model, and restoring the disturbed input content;
the supervised training module is used for generating code examination comments by fusing a graph model and a language model of the unsupervised training stage in the supervised training stage and assisting in a code change prediction task;
and the review comment generation module is used for outputting automatic code review comments of the input code to be reviewed according to the model generated in the unsupervised training stage and the supervised training stage.
A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
A computer readable storage medium storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
Compared with the prior art, the invention has the remarkable advantages that:
1) The method uses the pre-training language model to automatically generate the code examination comments, and the model has the capability of automatically carrying out feature expression, so that the efficiency of code examination comments generation is improved, and the problems of time consumption, labor consumption and unstable effectiveness in manual examination are alleviated;
2) The invention provides a language model based on structured information and examination knowledge, and the text representation capability of the model is further improved through the introduction of examination scene knowledge and explicit structured semantic information;
3) The invention establishes two stages, unsupervised training and supervised training. In the unsupervised stage, a noise-injection-and-denoising scheme lets the model learn code text representations and examination comment text representations; the supervised stage improves the model's ability to generate code examination comments, including prediction using code change information. The advantage is that unsupervised training learns high-quality text representations, and combining it with supervised training improves the quality of the generated results, so the method is highly decoupled and improves the original performance of the model.
Drawings
FIG. 1 is a flow chart of steps of a method for generating code review comments based on code structured information and review knowledge.
Fig. 2 is a general flow chart of code audit dataset construction.
Fig. 3 is a comparative example of the present method and the prior art method.
FIG. 4 is an example automated code review comment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The Java code examination comment generation method based on the code structured information and examination knowledge, provided by the invention, has the flow shown in figure 1, and comprises the following steps:
step 1: collecting text content of the censoring codes from the project code files to be censored, and preprocessing the text content to generate a code censoring data set;
step 2: performing vectorization representation on text data in the code examination data set, wherein the vectorization representation comprises code text vectorization, code structure vectorization and code examination comment text vectorization;
step 3: using the code structured information, performing unsupervised training through the model (T5) to obtain text data representations and continuously drive the model toward high-quality representations;
step 4: using code examination scenario knowledge, performing supervised training through the model (T5) to automatically generate code examination comments, predict code change information, and improve the quality of the code examination comments;
step 5: for a newly input code to be examined, executing step 2 to obtain the identification vector corresponding to the code, then inputting it into the model constructed in steps 3 and 4 to obtain the automatically generated code examination comments.
In one embodiment of the invention, a code review comment generation method based on code structural information and review knowledge is provided, which comprises the following specific steps:
step 1: as shown in fig. 2, the text content of the censored code is collected from the code file to be censored, and the text content is preprocessed to generate a code censored data set, which specifically comprises the following steps:
step 11: crawling project history censoring data from a Github community platform;
step 12: code audit activities typically occur during software iterations, so that these data are saved in the pull history of the item. At least one code audit is performed on the commit code for each commit that occurs. By utilizing the characteristics, the part of data is crawled to obtain the original code examination data;
step 13: The data cleaning stage removes invalid and duplicated code examination activity data, including non-Java code, code or examination comments that are too short, restorative examination data, and the like;
step 14: Using data enhancement, the target examination comment text corresponding to the same code text is amplified to 10 copies by means such as paraphrase replacement, increasing the data volume of the original dataset tenfold.
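The paraphrase-replacement augmentation of step 14 can be sketched as below; the synonym table is hypothetical, since the patent does not name the lexical resource actually used:

```python
import random

# Hypothetical synonym table; the patent does not specify the actual resource.
SYNONYMS = {
    "method": ["function", "routine"],
    "remove": ["delete", "drop"],
    "check": ["verify", "validate"],
}

def paraphrase(comment, rng):
    """Replace each word that has a synonym entry with a random synonym."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in comment]

def augment(comment, n=10, seed=0):
    """Amplify one review comment into n variants (the 10x of step 14)."""
    rng = random.Random(seed)
    return [paraphrase(comment, rng) for _ in range(n)]

variants = augment("please remove this unused method".split())
```

The augmented comments keep the original sentence structure, so each variant remains a valid target for the same code text.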
Step 2: the text data is vectorized, including code text vectorization, code structure vectorization and code review comment text vectorization, and the specific steps are as follows:
step 21: Representing the code text as a code abstract syntax tree by using the formal representation functions in javalang;
step 22: extracting and separating the nodes of the function level to obtain each function representation in the code file, thereby realizing vectorization of the code text;
step 23: The SBT algorithm converts the structured information, by depth-first search, into serialized information delimited by special symbols, realizing vectorization of the code structure;
step 24: Performing word segmentation and lemmatization with the NLTK toolkit, expanding abbreviations using an Oxford dictionary, and removing non-text data such as emoticons;
step 25: and establishing a mapping dictionary, mapping English words into corresponding numbers, and obtaining vector representation of code examination text data, thereby realizing vectorization of code examination comment text.
Step 3: the method comprises the following specific steps of performing unsupervised training by using code structured information and through a language model (T5) to obtain text data representation and continuously enabling the model to obtain high-quality representation:
step 31: based on a T5 model, adding a discriminator after the output layer of the model to reversely predict the position of the damaged word so as to recover the original text data, and predicting the damaged node information by the discriminator according to the text word vector representation output by the T5 model, wherein the maximum value of the prediction probability is taken as a prediction result;
step 32: Text words are randomly covered with masks, the masked words are then inversely predicted, and the content is restored. The masked content may be one word, several consecutive words, or no words at all;
step 33: using a T5 model, recovering the destroyed text, comparing with the content before recovering, measuring the fitting degree, and repeatedly training:
where p_D denotes the probability of predicting the original text in Softmax form: for a particular word position x_t, the prediction probability is computed from the word vector representation e(x) learned by the T5 model h_{T5}, normalized into a probability value:

$$p_D(x_t \mid x) = \mathrm{Softmax}\big(e(x_t)^{\top} h_{T5}(x)_t\big)$$

The learning goal is to raise the model's prediction accuracy for every word to be predicted, p_D(x_t \mid x), and to lower the prediction failure rate, 1 - p_D(x_t \mid x); the latter term is inverted and added to the former to obtain the loss function of the discriminative model:

$$L_{u1}(x, \theta_D) = \mathbb{E}\left[-\sum_{t=1}^{n}\Big(\mathbb{1}(\tilde{x}_t = x_t)\,\log p_D(x_t \mid x) + \mathbb{1}(\tilde{x}_t \neq x_t)\,\log\big(1 - p_D(x_t \mid x)\big)\Big)\right]$$

where x denotes the text to be predicted, n the total length of the predicted text, \theta_D the model parameters, \mathbb{E} the mathematical expectation, \tilde{x} the original (uncorrupted) text, and \mathbb{1} the indicator function.
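The random masking of step 32 can be sketched in the T5 style, where sentinel tokens such as `<extra_id_0>` mark masked spans; the span length of 1 to 3 tokens is an illustrative assumption:

```python
import random

def mask_spans(tokens, mask_prob=0.15, rng=None):
    """Randomly cover tokens with T5-style sentinel masks; a masked
    stretch may be one token, several consecutive tokens, or nothing
    at all. Returns (corrupted input, restoration target)."""
    rng = rng or random.Random(0)
    inp, target, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, 3)  # mask 1-3 consecutive tokens
            inp.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            target.extend(tokens[i:i + span])
            sentinel += 1
            i += span
        else:
            inp.append(tokens[i])
            i += 1
    return inp, target

code = "public int add ( int a , int b ) { return a + b ; }".split()
corrupted, restore = mask_spans(code, rng=random.Random(42))
```

The model is trained to emit the `restore` sequence given `corrupted`, which is the inverse-prediction objective described above.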
Step 4: the code examination scene knowledge is utilized, the supervised training is carried out through a language model (T5), the code examination comments are automatically generated, the code examination comments are predicted, and the quality of the code examination comments is improved, and the specific steps are as follows:
step 41: For each line of code in the code to be examined, code change information is predicted: the model captures the sentence vector label represented by the previous layer and performs a four-way classification at the output layer: added, deleted, modified, or unchanged;
step 42: The model generates text in the text-to-text form of the T5 model, i.e., code examination comment generation, sequentially predicting output words at the output layer until a stop token is produced or the maximum prediction length is reached;
step 43: using a T5 model, automatically generating target content, comparing the target content with labeling data, measuring fitting degree, and repeatedly training:
where L_{s1} denotes the code change prediction loss function and L_{s2} denotes the code examination comment generation loss function; p^{l}_{ij} denotes the probability that the current i-th changed code line is predicted as the j-th change label (0: unchanged, 1: added, 2: deleted, 3: modified), and y^{l}_{ij} is the binary label (0: prediction wrong, 1: prediction correct):

$$L_{s1} = -\frac{1}{L}\sum_{i=1}^{L}\sum_{j=0}^{3} y^{l}_{ij}\,\log p^{l}_{ij}$$

$$L_{s2} = -\frac{1}{S}\sum_{i=1}^{S} \log p(\hat{s}_i \mid s_{<i})$$

Thus L_{s1} measures the average fit of the predictions over the L code lines of the current code; S denotes the total length of the examination comment text, p(\hat{s}_i \mid s_{<i}) the prediction probability of the i-th examination text token, and s_i the i-th original examination text token.
Step 5: For a newly input code to be examined, steps 2 and 3 are executed to obtain the vector corresponding to the code, which is then input into the model saved in step 4 to obtain the automatically generated code examination comments.
Fig. 3 is a comparative example of the present method and prior methods, where Review Code Changes denotes the code changes, Reviewers' Comments denotes the code examination comments, Our Model denotes the method of the present invention, and LSTM, CopyNet, and CodeBERT denote three existing automatic text generation methods.
FIG. 4 is an example of an automated code examination comment. The gray text box contains the code examination comment automatically generated by the present method; the other box contains the code function to be examined.
To illustrate the performance advantages of the present invention, the LSTM, CopyNet, and CodeBERT automatic text generation models are used as baselines for comparison experiments. For objectivity, the baseline models are implemented directly following their original papers, and the code examination data are divided with the same random seed into training, validation, and test sets at a ratio of 8:1:1. After repeated experiments and averaging of the results, the word accuracy of the LSTM classification method is 12.80%, that of CopyNet is 13.74%, and that of CodeBERT is 21.52%; the code examination comment generation method based on code structured information and examination knowledge achieves the highest prediction accuracy, 26.11%, which is 21.33% higher in relative terms than the best baseline model, CodeBERT.
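The 8:1:1 split with a shared random seed described above can be sketched as follows (the function name is illustrative):

```python
import random

def split_dataset(samples, seed=0, ratios=(8, 1, 1)):
    """Shuffle with a fixed random seed and split 8:1:1 into
    training, validation, and test sets, as in the experiments."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    total = sum(ratios)
    n_train = len(data) * ratios[0] // total
    n_val = len(data) * ratios[1] // total
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_dataset(list(range(1000)))
```

Fixing the seed is what makes the comparison fair: every baseline sees exactly the same partition of the examination data.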
Another embodiment of the present invention provides a Java code review comment generation apparatus based on code structured information and review knowledge, including:
the data acquisition module is used for acquiring original Java code examination data;
the vectorization representation module is used for vectorizing the code text and the code examination comments in the original Java code examination data, and comprises code text vectorization, code structure vectorization and code examination comment text vectorization to generate corresponding word vectors;
the non-supervision training module is used for introducing noise to word vectors through two modes of disturbing vector representation and adding mask labels respectively in a non-supervision training stage, then establishing a discriminant graph model and a language model, and restoring the disturbed input content;
the supervised training module is used for generating code examination comments by fusing a graph model and a language model of the unsupervised training stage in the supervised training stage and assisting in a code change prediction task;
and the review comment generation module is used for outputting automatic code review comments of the input code to be reviewed according to the model generated in the unsupervised training stage and the supervised training stage.
Wherein the specific implementation of each module is referred to the previous description of the method of the present invention.
Another embodiment of the invention provides a computer device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Another embodiment of the invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.

Claims (10)

1. A Java code review comment generation method based on code structural information and review knowledge, characterized by comprising the following steps:
obtaining original Java code review data;
vectorizing the code text and the code review comments in the original Java code review data, including code text vectorization, code structure vectorization, and code review comment text vectorization, to generate the corresponding word vectors;
in the unsupervised training stage, introducing noise into the word vectors in two ways (perturbing the vector representation and adding mask labels), then building a discriminative graph model and a language model to restore the perturbed input content;
in the supervised training stage, generating code review comments by fusing the graph model and the language model from the unsupervised training stage, assisted by a code change prediction task;
and outputting automatic code review comments for an input code to be reviewed according to the models produced in the unsupervised and supervised training stages.
2. The method of claim 1, wherein vectorizing the code text and the code review comments comprises:
building a code abstract syntax tree, forming a formal representation of the code text through level extraction, and converting the formal representation into a serialized vector representation with the SBT algorithm;
and building a mapping dictionary that maps English tokens to corresponding numbers, thereby obtaining the vector representation of the code review comments.
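The mapping dictionary of claim 2 can be sketched as follows. This is a hypothetical minimal sketch in Python; the reserved special-token ids and whitespace tokenization are illustrative assumptions, not details specified by the claim.

```python
def build_mapping_dictionary(comments):
    """Build a token-to-number mapping dictionary over review-comment text.

    Ids 0-3 are reserved for special tokens (an illustrative convention,
    not taken from the patent).
    """
    vocab = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3}
    for comment in comments:
        for token in comment.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def vectorize_comment(comment, vocab):
    # Map each English token to its number; unseen tokens fall back to <unk>.
    ids = [vocab.get(t, vocab["<unk>"]) for t in comment.split()]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]
```

For example, building the dictionary from two comments and vectorizing one of them yields a small integer sequence bracketed by the start and end ids.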
3. The method of claim 1, wherein code text vectorization and code structure vectorization are achieved by:
representing the code text as a code abstract syntax tree using the formal representation function of the javalang toolkit;
displaying the Java code text and its module levels as a tree, and extracting and separating the function-level nodes, thereby achieving code text vectorization;
using the SBT algorithm to convert the structural information, by depth-first search, into serialized information delimited by special symbols;
inputting the resulting serialized information into a T5 model to obtain a vectorized representation of the structural information of the code text, thereby achieving code structure vectorization; the structural information includes: node information, node relations, and hierarchical divisions.
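The SBT (structure-based traversal) serialization described above can be sketched as follows. This is a minimal sketch assuming the abstract syntax tree has already been reduced to `(label, children)` pairs (e.g. from javalang output); the node labels below are illustrative, and the bracketing scheme follows the common SBT formulation in which each node is delimited by special symbols that preserve node relations and hierarchy.

```python
def sbt(node):
    """Serialize an AST node into a bracketed token sequence (SBT).

    `node` is a (label, children) pair. Each subtree is wrapped in
    "(" ... ")" followed by its own label, so the depth-first structure
    remains recoverable from the flat sequence.
    """
    label, children = node
    seq = ["(", label]
    for child in children:
        seq.extend(sbt(child))
    seq.extend([")", label])
    return seq
```

Feeding such a sequence to a T5 encoder (as the claim describes) would then produce the vectorized representation of the code structure.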
4. The method of claim 1, wherein the language model comprises a T5 model; the T5 model comprises the T5-base-java model open-sourced by the Huggingface community.
5. The method of claim 1, wherein outputting the automatic code review comments for the input code to be reviewed comprises: adopting a random-scrambling and masking-model "introduce noise, then denoise" scheme in the unsupervised training stage, and adopting multi-task fine-tuning in the supervised training stage, wherein the multiple tasks are the code change prediction task and the code review comment generation task.
6. The method of claim 1, wherein outputting the automatic code review comments for the input code to be reviewed comprises:
1) Discriminative training: based on the T5 model, adding a discriminator after the model's output layer to inversely predict the positions of the destroyed words and thereby recover the original text data; the discriminator predicts the destroyed node information from the text word-vector representation output by the T5 model, takes the maximum of the prediction probabilities as the prediction result, and compares it with the original to obtain a feedback result;
2) Mask model training: randomly masking text words with a mask, then inversely predicting the masked words to restore the content, wherein the content masked by the masking model is one word, several consecutive words, or no word at all;
3) Code change information prediction: for each line of the code to be reviewed, capturing the sentence-vector label represented by the preceding model layer and performing a four-way classification task at the output layer, the four classes being: added, deleted, modified, and unchanged;
4) Code review comment generation: generating text in the text-to-text form of the T5 model, predicting the output words in sequence at the output layer until a stop token is output or the maximum prediction length is reached.
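The span masking of step 2) can be sketched as follows. This is an illustrative sketch only: the sentinel-token naming follows the T5 convention (`<extra_id_0>`, `<extra_id_1>`, ...), and the masking probability and span length are assumed hyperparameters, not values taken from the claim.

```python
import random


def mask_spans(tokens, mask_prob=0.15, max_span=3, seed=0):
    """Randomly replace contiguous word spans with T5-style sentinels.

    Returns the corrupted sequence and a list of (sentinel, original span)
    pairs, so the denoising target can reconstruct the input. Depending on
    the random draws, a span may cover one word, several consecutive
    words, or no word at all (when no position is selected).
    """
    rng = random.Random(seed)
    out, targets, i, sentinel_id = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, max_span)
            sentinel = f"<extra_id_{sentinel_id}>"
            out.append(sentinel)
            targets.append((sentinel, tokens[i:i + span]))
            i += span
            sentinel_id += 1
        else:
            out.append(tokens[i])
            i += 1
    return out, targets
```

A denoising model trained on such pairs learns to emit the spans hidden behind each sentinel; substituting the targets back into the corrupted sequence recovers the original text exactly.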
7. The method of claim 6, wherein destroying the words comprises: random word scrambling, random word removal, random word duplication, and random word rotation.
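The four word-destruction modes of claim 7 can be sketched as follows. This is a minimal illustrative sketch; the choice of positions, span sizes, and rotation offsets is an assumption, since the claim only names the modes.

```python
import random


def corrupt(tokens, mode, rng):
    """Apply one of the four word-destruction modes to a token list."""
    tokens = list(tokens)
    if mode == "shuffle":        # random word scrambling
        rng.shuffle(tokens)
    elif mode == "delete":       # random word removal
        del tokens[rng.randrange(len(tokens))]
    elif mode == "duplicate":    # random word duplication
        i = rng.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    elif mode == "rotate":       # random word rotation
        k = rng.randrange(1, len(tokens))
        tokens = tokens[k:] + tokens[:k]
    return tokens
```

Each mode preserves enough of the original content that the discriminator of claim 6 can be trained to detect which positions were destroyed.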
8. A Java code review comment generation apparatus based on code structural information and review knowledge, characterized by comprising:
a data acquisition module for acquiring original Java code review data;
a vectorized representation module for vectorizing the code text and the code review comments in the original Java code review data, including code text vectorization, code structure vectorization, and code review comment text vectorization, to generate the corresponding word vectors;
an unsupervised training module for, in the unsupervised training stage, introducing noise into the word vectors in two ways (perturbing the vector representation and adding mask labels), then building a discriminative graph model and a language model to restore the perturbed input content;
a supervised training module for, in the supervised training stage, generating code review comments by fusing the graph model and the language model from the unsupervised training stage, assisted by a code change prediction task;
and a review comment generation module for outputting automatic code review comments for an input code to be reviewed according to the models produced in the unsupervised and supervised training stages.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
CN202310658279.1A 2023-06-05 2023-06-05 Java code examination comment generation method and device based on code structured information and examination knowledge Pending CN116841869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310658279.1A CN116841869A (en) 2023-06-05 2023-06-05 Java code examination comment generation method and device based on code structured information and examination knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310658279.1A CN116841869A (en) 2023-06-05 2023-06-05 Java code examination comment generation method and device based on code structured information and examination knowledge

Publications (1)

Publication Number Publication Date
CN116841869A true CN116841869A (en) 2023-10-03

Family

ID=88171525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310658279.1A Pending CN116841869A (en) 2023-06-05 2023-06-05 Java code examination comment generation method and device based on code structured information and examination knowledge

Country Status (1)

Country Link
CN (1) CN116841869A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117806973A (en) * 2024-01-03 2024-04-02 西南民族大学 Code review method and system based on review type perception


Similar Documents

Publication Publication Date Title
Gharbi et al. On the classification of software change messages using multi-label active learning
CN112711953A (en) Text multi-label classification method and system based on attention mechanism and GCN
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111124487B (en) Code clone detection method and device and electronic equipment
CN107103363B (en) A kind of construction method of the software fault expert system based on LDA
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN112199512B (en) Scientific and technological service-oriented case map construction method, device, equipment and storage medium
CN117033571A (en) Knowledge question-answering system construction method and system
CN113168499A (en) Method for searching patent document
CN112231431A (en) Abnormal address identification method and device and computer readable storage medium
Fazayeli et al. Towards auto-labelling issue reports for pull-based software development using text mining approach
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN116841869A (en) Java code examination comment generation method and device based on code structured information and examination knowledge
CN117215935A (en) Software defect prediction method based on multidimensional code joint graph representation
Nicholson et al. Issue link label recovery and prediction for open source software
Hu et al. Measuring code maintainability with deep neural networks
Wang et al. Exploring semantics of software artifacts to improve requirements traceability recovery: a hybrid approach
Gelman et al. A language-agnostic model for semantic source code labeling
CN117151222B (en) Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium
CN113868422A (en) Multi-label inspection work order problem traceability identification method and device
Magalhães et al. Mare: an active learning approach for requirements classification
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN114610576A (en) Log generation monitoring method and device
CN114218580A (en) Intelligent contract vulnerability detection method based on multi-task learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination