CN116661797A - Code completion method based on enhanced Transformer under word element granularity - Google Patents

Code completion method based on enhanced Transformer under word element granularity

Info

Publication number
CN116661797A
CN116661797A CN202310543114.XA CN202310543114A CN116661797A CN 116661797 A CN116661797 A CN 116661797A CN 202310543114 A CN202310543114 A CN 202310543114A CN 116661797 A CN116661797 A CN 116661797A
Authority
CN
China
Prior art keywords
word
code
model
completion
java
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310543114.XA
Other languages
Chinese (zh)
Inventor
支宝
王楚越
陈希希
文万志
程实
王则林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202310543114.XA priority Critical patent/CN116661797A/en
Publication of CN116661797A publication Critical patent/CN116661797A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Dc-Dc Converters (AREA)

Abstract

The invention belongs to the technical field of code completion, and particularly relates to a code completion method based on an enhanced Transformer at word element granularity. The invention comprises the following steps: S1: collecting Java code segments, constructing a code corpus, and flattening Java source code into a word element sequence form; S2: applying the BPE word segmentation algorithm to the data and encoding the data with subwords to obtain the word vectors required by the model; S3: in the model framework, using a Transformer Encoder to encode the word vector information to be learned, and obtaining the result to be completed through Transformer Decoder decoding; S4: improving the Multi-Head Attention used by the traditional Transformer model with Talking-Heads Attention; S5: in the inference decoding stage, using a beam search method to generate the top-5 recommended completion codes while avoiding repeated word elements in the recommendation list of the code completion stage. The invention makes better use of the semantic information of the source code to perform word element granularity code completion, and the method can effectively improve the accuracy of code completion.

Description

Code completion method based on enhanced Transformer under word element granularity
Technical Field
The invention belongs to the technical field of code completion, and particularly relates to a code completion method based on an enhanced Transformer at word element granularity.
Background
Code completion is an important part of intelligent software engineering. As a branch of automated software development, it can promptly provide programmers with predictions of class names, method names and the like during software development, reducing the typing burden on developers, reducing spelling errors, and directly improving the efficiency of software development.
Early code completion techniques performed completion and prediction through manually defined heuristic rules based on the entered code and grammar rules. This kind of code completion gradually loses effectiveness as versions change, and a great deal of manpower and material resources must be spent defining and revising a new round of rules. Nowadays, learning models are trained on open-source corpora so that the completed code follows the grammar rules of the programming language and achieves higher accuracy.
Disclosure of Invention
The invention aims to solve the technical problem of further improving the accuracy of code completion, effectively assisting software developers in using the model to improve efficiency during software development and save development time. Based on deep learning technology, this patent provides a code completion method based on an enhanced Transformer at word element granularity, which helps improve the accuracy of code completion.
In order to achieve the aim of the invention, the technical scheme adopted by the invention is as follows:
A code completion method based on an enhanced Transformer at word element granularity comprises the following steps:
S1: collecting Java code segments, constructing a code corpus, and flattening Java source code into a word element sequence form;
S2: applying the BPE word segmentation algorithm to the data and encoding the data with subwords to obtain the word vectors required by the model;
S3: in the model framework, using a Transformer Encoder to encode the word vector information to be learned, and obtaining the result to be completed through a Transformer Decoder;
S4: improving the Multi-Head Attention used by the traditional Transformer model with Talking-Heads Attention;
S5: in the inference decoding stage, using a beam search method to generate the top-5 recommended completion codes, and avoiding repeated word elements in the recommendation list of the code completion stage.
Further, as a preferred technical solution of the present invention, the step S1 includes the following steps:
S1.1: searching GitHub for open-source projects with star counts greater than 10, downloading them, and collecting the Java methods;
S1.2: deleting the comments of each Java method, filtering out code with fewer than 5 lines, and deleting repeated code fragments;
S1.3: flattening each Java code segment into a single-line sequence form and writing it into a file; this file serves as the corpus and is divided into a training set, a validation set and a test set in a 4:4:2 ratio.
Further, as a preferred technical solution of the present invention, the step S2 includes the following steps:
S2.1: adopting the BPE algorithm to encode the data into subwords;
S2.2: in the process of constructing the vocabulary with BPE, first splitting all word elements into character sequences and building an initial vocabulary from all the characters, then counting the frequency of each contiguous byte pair in the training corpus, merging the most frequent byte pair into a new subword, updating the vocabulary, and repeating the previous step until the highest frequency among the remaining byte pairs is 1;
S2.3: in the corpus encoding process, sorting all subwords in the table from longest to shortest; for each given word element, traversing the sorted vocabulary and checking whether each subword is a substring of the word element; if the match succeeds, outputting the subword and continuing to match the remaining characters of the word element; finally, if any remaining substring is still unmatched after the full traversal, replacing it with the special token <UNK> and ending the whole encoding process.
Further, as a preferred embodiment of the present invention, in the step S3 a neural network is constructed using a Transformer Encoder and a Transformer Decoder; the input is a partial program of the original text sequence, and the output is the word element at the next position of the partial program; the neural network model formula is:
o = Trans((e_t)_{t ∈ src_seq})
where o is the distribution over all possible tokens, and e_t is the embedded representation of each individual token t in the original token sequence.
Further, as a preferred technical scheme of the invention, in the selection of the model parameters the word vector dimension is 128, the number of attention blocks is 6, the number of heads of the multi-head attention layer is 8, and the dropout probability is 0.5.
Further, as a preferred technical solution of the present invention, in step S4 Talking-Heads Attention fuses information from different positions into the same head by sharing parameters among the different heads; for each of the different heads:
a parameter matrix λ is used to superimpose each low-rank, separately computed (Q_i K_i)^T result of the Multi-Head mechanism, so that the isolated Attention heads are connected and the model gains a new learnable parameter λ; this occurs before the softmax calculation in the attention mechanism, and the final Talking-Heads Attention can be expressed by the following formula:
further, as a preferred technical solution of the present invention, in step S5, the generated top5 word probabilities are converted once, the numerical value thereof is amplified, and the value thereof is finally decoded and output.
Compared with the prior art, the code completion method based on an enhanced Transformer at word element granularity has the following technical effects:
(1) The invention makes better use of the semantic information of the source code to perform word element granularity code completion, and the method can effectively improve the accuracy of code completion.
(2) The invention can effectively assist software developers in using the model to improve efficiency during software development and save development time.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention;
FIG. 2 is an explanatory diagram of Talking-Heads Attention;
FIG. 3 is a diagram of a model overall framework;
FIG. 4 is an example diagram of a user obtaining code completions.
Detailed Description
The invention is further explained in the following detailed description with reference to the drawings, so that those skilled in the art can more fully understand and practice the invention; the invention is, however, described below by way of example only and not by way of limitation.
As shown in FIG. 1, a code completion method based on an enhanced Transformer at word element granularity, mainly used to help a user perform code completion, comprises the following steps:
S1, collecting Java code segments, constructing a code corpus, and flattening Java source code into a word element sequence form;
S2, applying the BPE (Byte Pair Encoding) word segmentation algorithm to the data and encoding the data with subwords to obtain the word vectors required by the model;
S3, in the model framework, using a Transformer Encoder to encode the word vector information to be learned, and obtaining the result to be completed through a Transformer Decoder;
S4, improving the Multi-Head Attention used by the traditional Transformer model with Talking-Heads Attention, breaking the original modeling bottleneck and obtaining a better effect;
S5, in the inference decoding stage, generating the top-5 recommended completion codes with a beam search (Beam Search) method, and avoiding repeated word elements in the recommendation list of the code completion stage.
Step S1, collecting Java code segments, constructing a code corpus, and flattening Java source code into a word element sequence form, comprises the following specific steps:
Search GitHub for open-source projects with star counts greater than 10, download them, and collect the Java methods; delete the comments of each Java method, filter out code with fewer than 5 lines, and delete repeated code fragments. Flatten each Java code segment into a single-line sequence form and write it into a file; this file serves as the corpus and is divided into a training set, a validation set and a test set in a 4:4:2 ratio.
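As a concrete illustration only (the patent does not fix an implementation), the collection and preprocessing of step S1 could be sketched in Python as follows; helper names such as strip_comments and build_corpus are hypothetical, and the comment-stripping regex is deliberately simplified.

```python
import re
from pathlib import Path

def strip_comments(java_src: str) -> str:
    # Naive removal of /* ... */ block comments and // line comments
    # (a simplification: it does not respect string literals).
    java_src = re.sub(r"/\*.*?\*/", "", java_src, flags=re.DOTALL)
    return re.sub(r"//.*", "", java_src)

def build_corpus(method_files, min_lines=5):
    seen, corpus = set(), []
    for path in method_files:
        src = strip_comments(Path(path).read_text(encoding="utf-8"))
        if len([ln for ln in src.splitlines() if ln.strip()]) < min_lines:
            continue                              # filter methods shorter than 5 lines
        flat = " ".join(src.split())              # flatten into a one-line token sequence
        if flat in seen:
            continue                              # drop duplicate code fragments
        seen.add(flat)
        corpus.append(flat)
    n = len(corpus)                               # 4:4:2 split, as described above
    return corpus[:int(0.4 * n)], corpus[int(0.4 * n):int(0.8 * n)], corpus[int(0.8 * n):]
```

Each returned slice corresponds to the training, validation and test portion of the corpus, respectively.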
Step S2, applying the BPE (Byte Pair Encoding) word segmentation algorithm to the data and encoding the data with subwords to obtain the word vectors required by the model, comprises the following specific steps:
Because of Java's camel-case naming convention, many word elements such as class names and method names are unique to the current Java method or class. If a dictionary were built from the traditional whitespace-split sequence, the dictionary would be too large, training resources would be wasted, and rare words or words never seen during training (the OOV problem) would be hard to handle when testing the model. Therefore, the BPE algorithm is adopted to encode the data into subwords.
In the process of constructing the vocabulary with BPE, all word elements are split into character sequences and an initial vocabulary is built from all the characters; then the frequency of each contiguous byte pair in the training corpus is counted, the most frequent byte pair is merged into a new subword, the vocabulary is updated, and the previous step is repeated until the highest frequency among the remaining byte pairs is 1. As an example, for the word elements 'cooked' and 'cooking', the initial vocabulary is 'c', 'o', 'k', 'e', 'd', 'i', 'n' and 'g'. The subword 'cook' ends up with the highest frequency in the original corpus and is included in the table as a new subword, after which the next byte pairs 'ed' and 'ing' have a highest frequency of only 1, so construction of the vocabulary ends. The two word elements 'cooked' and 'cooking' are thus divided into 'cook', 'ed' and 'ing', so that the semantic information of the words can be learned while the vocabulary is reduced.
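A minimal Python sketch of this merge loop, assuming the toy corpus of 'cooked' and 'cooking' from the example above; learn_bpe_vocab is an illustrative name, and a full BPE implementation would also record the merge rules rather than only the resulting vocabulary.

```python
from collections import Counter

def learn_bpe_vocab(word_elements):
    # Start from character sequences; the initial vocabulary is the character set.
    words = Counter(tuple(w) for w in word_elements)
    vocab = {ch for w in word_elements for ch in w}
    while True:
        pairs = Counter()                          # count every adjacent symbol pair
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs or max(pairs.values()) <= 1:
            break                                  # stop once the best pair occurs only once
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)                           # merge the most frequent pair into a subword
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

# learn_bpe_vocab(["cooked", "cooking"]) contains the subword "cook"
# (via the intermediate merges "co" and "coo") plus the single characters.
```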
In the corpus encoding process, all subwords in the table are sorted from longest to shortest. For each given word element, the sorted vocabulary is traversed to check whether each subword is a substring of the word element; if the match succeeds, the subword is output and the remaining characters of the word element continue to be matched. Finally, if any remaining substring is still unmatched after the full traversal, it is replaced with the special token <UNK>, and the whole encoding process ends.
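Correspondingly, the longest-subword-first matching of step S2.3 could look like the sketch below, assuming greedy left-to-right prefix matching (one plausible reading of the substring search described above); encode_word_element is an illustrative name.

```python
def encode_word_element(token: str, vocab: set, unk: str = "<UNK>"):
    subwords = sorted(vocab, key=len, reverse=True)   # longest subwords tried first
    out, rest = [], token
    while rest:
        for sw in subwords:
            if rest.startswith(sw):
                out.append(sw)                        # emit the matched subword
                rest = rest[len(sw):]                 # keep matching the remainder
                break
        else:
            return out + [unk]                        # unmatched remainder -> <UNK>
    return out

# encode_word_element("cooking", {"cook", "ing", "ed"}) -> ["cook", "ing"]
```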
Step S3, in the model framework, using a Transformer Encoder to encode the word vector information to be learned and a Transformer Decoder to decode it into the result to be completed, comprises the following specific steps:
Because the structured code has already been flattened into a sequential form, which naturally matches the Transformer's strength at processing long text sequences, the neural network is constructed from a Transformer Encoder and a Transformer Decoder. The input here is a partial program of the original text sequence, and the output is the word element at the next position of that partial program. The model can be written as:
o = Trans((e_t)_{t ∈ src_seq})
where o is the distribution over all possible tokens, and e_t is the embedded representation of each individual token t in the original token sequence.
In the selection of the model parameters, the word vector dimension (embedding_size) is 128, the number of attention blocks (block_num) is 6, the number of heads of the multi-head attention layer (num_heads) is 8, and the dropout probability is 0.5.
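As a sketch only (the patent does not name a framework), these parameters map onto a standard PyTorch encoder-decoder Transformer as follows; TokenCompletionModel is a hypothetical name, and the stock Multi-Head Attention inside nn.Transformer is exactly what step S4 below replaces with Talking-Heads Attention.

```python
import torch.nn as nn

class TokenCompletionModel(nn.Module):
    """o = Trans((e_t)_{t in src_seq}) with the parameters stated above."""
    def __init__(self, vocab_size, embedding_size=128, block_num=6, num_heads=8, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.transformer = nn.Transformer(
            d_model=embedding_size, nhead=num_heads,
            num_encoder_layers=block_num, num_decoder_layers=block_num,
            dropout=dropout, batch_first=True)
        self.out = nn.Linear(embedding_size, vocab_size)

    def forward(self, src_seq, tgt_seq):
        src, tgt = self.embed(src_seq), self.embed(tgt_seq)
        # Causal mask so each decoder position only attends to earlier word elements.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)                 # logits over all possible word elements
```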
Step S4, improving the Multi-Head Attention used by the traditional Transformer model with Talking-Heads Attention, breaking the original modeling bottleneck and obtaining a better effect, comprises the following specific steps:
In this step, the Multi-Head Attention used by the traditional Transformer model is modified with Talking-Heads Attention. In Multi-Head Attention the input sequence is projected onto different heads, attention is computed separately within each head, and the attention of the heads is finally weighted and summed to obtain the final context representation; as a result, the (Q_i K_i)^T computed in isolation within each head has insufficient expressive power. Talking-Heads Attention shares parameters among the different heads and fuses information from different positions into the same head for processing, so that the model pays more attention to changes of sequence position and its final expressive power is improved. Specifically, for each different head:
a parameter matrix λ is used to superimpose each low-rank, separately computed (Q_i K_i)^T result of the Multi-Head mechanism, so that the isolated Attention heads are connected, the model gains a new learnable parameter λ, and the performance of the attention is further improved. This operation occurs before the softmax calculation in the attention mechanism, and the final Talking-Heads Attention can be expressed by the following formula:
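The formula itself does not reproduce in this text; based on the description above (a learnable matrix λ that mixes the per-head logits before the softmax), the pre-softmax Talking-Heads form can be reconstructed, in notation that is ours rather than the patent's, as:

```latex
% Reconstruction: \lambda mixes the per-head attention logits before the softmax.
\[
  J_i = \frac{Q_i K_i^{\mathsf{T}}}{\sqrt{d_k}}, \qquad
  \tilde{J}_i = \sum_{j=1}^{h} \lambda_{ij}\, J_j, \qquad
  \mathrm{head}_i = \operatorname{softmax}\bigl(\tilde{J}_i\bigr)\, V_i
\]
```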
as shown in the explanatory diagram of the talker-Heads Attention in fig. 2.
Step S5, in the inference decoding stage, generating the top-5 recommended completion codes with a beam search method and avoiding repeated word elements in the recommendation list of the code completion stage, comprises the following specific steps:
The model performs code completion at the word element level, and when recommending code it returns, in a single pass, a list composed of 5 different word elements. Using a beam search algorithm to predict the 5 possible next word elements from the given context avoids repeating the decoding 5 times and producing duplicate word elements; as a greedy algorithm, the resulting solution can be regarded as a near-optimal solution under the task conditions described here. Because the generated word elements are the 5 options ranked highest by probability, and the probability of any single word element relative to the whole vocabulary is extremely small, these probabilities may be very close to one another; therefore, the probabilities of the generated top-5 word elements are converted once, their numerical values are amplified, and the values are finally decoded and output.
The invention provides a code completion method based on an enhanced Transformer at word element granularity: a corpus is built, and a deep neural network is constructed and trained to obtain the model; by inputting the Java code segment to be completed into the model, the user finally obtains a completion list for the next word element. FIG. 3 is a diagram of the overall framework of the model.
The limitation of existing neural code completion models is that the traditional text embedding approach has insufficient capacity to capture the rich semantic information of code, and Multi-Head Attention has its own modeling bottleneck, in which the per-head (Q_i K_i)^T computation has insufficient expressive power. The model proposed in this patent improves on both of these points, and the method is considered capable of improving the accuracy of code completion. FIG. 4 is an example diagram of a user obtaining completion code.
The invention provides a code completion method based on an enhanced Transformer at word element granularity, which makes better use of the semantic information of the source code to perform word element granularity code completion.
While the foregoing is directed to embodiments of the present invention, other and further embodiments may be devised without departing from its basic scope, and the scope of the invention is determined by the claims that follow.

Claims (7)

1. A code completion method based on an enhanced Transformer at word element granularity, characterized by comprising the following steps:
S1: collecting Java code segments, constructing a code corpus, and flattening Java source code into a word element sequence form;
S2: applying the BPE word segmentation algorithm to the data and encoding the data with subwords to obtain the word vectors required by the model;
S3: in the model framework, using a Transformer Encoder to encode the word vector information to be learned and a Transformer Decoder to decode it into the result to be completed;
S4: improving the Multi-Head Attention used by the traditional Transformer model with Talking-Heads Attention;
S5: in the inference decoding stage, using a beam search method to generate the top-5 recommended completion codes, and avoiding repeated word elements in the recommendation list of the code completion stage.
2. The code completion method based on an enhanced Transformer at word element granularity according to claim 1, wherein the step S1 comprises the following steps:
S1.1: searching GitHub for open-source projects with star counts greater than 10, downloading them, and collecting the Java methods;
S1.2: deleting the comments of each Java method, filtering out code with fewer than 5 lines, and deleting repeated code fragments;
S1.3: flattening each Java code segment into a single-line sequence form and writing it into a file; this file serves as the corpus and is divided into a training set, a validation set and a test set in a 4:4:2 ratio.
3. The code completion method based on an enhanced Transformer at word element granularity according to claim 2, wherein the step S2 comprises the following steps:
S2.1: adopting the BPE algorithm to encode the data into subwords;
S2.2: in the process of constructing the vocabulary with BPE, first splitting all word elements into character sequences and building an initial vocabulary from all the characters, then counting the frequency of each contiguous byte pair in the training corpus, merging the most frequent byte pair into a new subword, updating the vocabulary, and repeating the previous step until the highest frequency among the remaining byte pairs is 1;
S2.3: in the corpus encoding process, sorting all subwords in the table from longest to shortest; for each given word element, traversing the sorted vocabulary and checking whether each subword is a substring of the word element; if the match succeeds, outputting the subword and continuing to match the remaining characters of the word element; finally, if any remaining substring is still unmatched after the full traversal, replacing it with the special token <UNK> and ending the whole encoding process.
4. The code completion method based on an enhanced Transformer at word element granularity according to claim 3, wherein in step S3, a neural network is constructed using a Transformer Encoder and a Transformer Decoder; the input is a partial program of the original text sequence, and the output is the word element at the next position of the partial program; the neural network model formula is:
o = Trans((e_t)_{t ∈ src_seq})
where o is the distribution over all possible tokens, and e_t is the embedded representation of each individual token t in the original token sequence.
5. The method according to claim 4, wherein in the selection of the model parameters, the word vector dimension is 128, the number of attention blocks is 6, the number of heads of the multi-head attention layer is 8, and the dropout probability is 0.5.
6. The method according to claim 4, wherein in step S4, Talking-Heads Attention fuses information from different positions into the same head by sharing parameters among the different heads, and for each of the different heads:
a parameter matrix λ is used to superimpose each low-rank, separately computed (Q_i K_i)^T result of the Multi-Head mechanism, so that the isolated Attention heads are connected and the model gains a new learnable parameter λ; before the attention mechanism performs the softmax calculation, the final Talking-Heads Attention can be expressed by the following formula:
7. The method according to claim 6, wherein in step S5, the probabilities of the generated top-5 word elements are converted once, their numerical values are amplified, and the values are finally decoded and output.
CN202310543114.XA 2023-05-15 2023-05-15 Code completion method based on enhanced Transformer under word element granularity Pending CN116661797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310543114.XA CN116661797A (en) 2023-05-15 2023-05-15 Code completion method based on enhanced Transformer under word element granularity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310543114.XA CN116661797A (en) 2023-05-15 2023-05-15 Code completion method based on enhanced Transformer under word element granularity

Publications (1)

Publication Number Publication Date
CN116661797A true CN116661797A (en) 2023-08-29

Family

ID=87723363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310543114.XA Pending CN116661797A (en) 2023-05-15 2023-05-15 Code completion method based on enhanced Transformer under word element granularity

Country Status (1)

Country Link
CN (1) CN116661797A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992888A (en) * 2023-09-25 2023-11-03 天津华来科技股份有限公司 Data analysis method and system based on natural semantics


Similar Documents

Publication Publication Date Title
CN110018820B (en) Method for automatically generating Java code annotation based on Graph2Seq of deep reinforcement learning
WO2021000362A1 (en) Deep neural network model-based address information feature extraction method
Shi et al. Incsql: Training incremental text-to-sql parsers with non-deterministic oracles
CN113064586B (en) Code completion method based on abstract syntax tree augmented graph model
CN112069199B (en) Multi-round natural language SQL conversion method based on intermediate syntax tree
CN112559556A (en) Language model pre-training method and system for table mode analysis and sequence mask
CN112463424B (en) Graph-based end-to-end program repairing method
CN113342318B (en) Fine-grained code automatic generation method and system based on multi-view code characteristics
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN113254616B (en) Intelligent question-answering system-oriented sentence vector generation method and system
CN117153294B (en) Molecular generation method of single system
CN116661797A (en) Code completion method based on enhanced Transformer under word element granularity
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
CN114924741A (en) Code completion method based on structural features and sequence features
CN116700780A (en) Code completion method based on abstract syntax tree code representation
CN115438709A (en) Code similarity detection method based on code attribute graph
CN115543437A (en) Code annotation generation method and system
CN111309896A (en) Deep learning text abstract generation method based on secondary attention
Li et al. Toward less hidden cost of code completion with acceptance and ranking models
CN112287641B (en) Synonym sentence generating method, system, terminal and storage medium
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN113342343A (en) Code abstract generation method and system based on multi-hop inference mechanism
CN117763363A (en) Cross-network academic community resource recommendation method based on knowledge graph and prompt learning
CN116010621B (en) Rule-guided self-adaptive path generation method
CN117238436A (en) Model pre-training method and device for drug molecular analysis design

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination