CN113821198A - Code completion method, system, storage medium and computer program product - Google Patents

Code completion method, system, storage medium and computer program product

Info

Publication number
CN113821198A
CN113821198A (application CN202111072772.2A)
Authority
CN
China
Prior art keywords
code
model
anonymization
prediction result
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111072772.2A
Other languages
Chinese (zh)
Other versions
CN113821198B (en)
Inventor
杨浩 (Yang Hao)
邝砾 (Kuang Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202111072772.2A priority Critical patent/CN113821198B/en
Publication of CN113821198A publication Critical patent/CN113821198A/en
Application granted granted Critical
Publication of CN113821198B publication Critical patent/CN113821198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/30: Creation or generation of source code
    • G06F 8/33: Intelligent editors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting


Abstract

The invention discloses a code completion method, system, storage medium, and computer program product. A code fragment is preprocessed, and the preprocessed fragment is fed to a non-anonymization model to obtain a first prediction result. If the first prediction result is not the identifier UNK, the process ends; otherwise, the anonymized code is fed to an anonymization model to obtain a second prediction result. By anonymizing identifiers and building a dynamic vocabulary, the method efficiently handles the rare words and new words among OOV words, and, since it does not depend on the characteristics of any specific model, it can quickly adopt the latest advances in the field of code completion.

Description

Code completion method, system, storage medium and computer program product
Technical Field
The present invention relates to the field of software development technologies, and in particular, to a code completion method, system, storage medium, and computer program product.
Background
Intelligent code completion is an essential component of next-generation intelligent IDEs. In an IDE, the code completion component suggests the next likely identifier based on the code the developer has already written and the current cursor position. With the wave of intelligent tooling, existing code completion methods have turned to approaches based on neural language models. This statistical-learning approach, inspired by Hindle et al., treats a programming language as a natural language and trains a language model on a large code corpus. However, such language-model approaches cannot handle the OOV (out-of-vocabulary) problem. In code, new and rare words occur in a much higher proportion than in natural language, so the OOV problem has a greater impact on model performance in code completion and limits the models' completion capability.
The existing anonymization process operates on the whole corpus: it counts how often each word occurs in the corpus, sorts the words by frequency in descending order, and assigns anonymized IDs in turn. Because the vocabulary size is limited, a global vocabulary cannot accommodate all words, so new words cannot be predicted. As for rare words, computational efficiency constraints mean that only globally high-frequency words can be considered, and not all words can be brought into the computation.
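The conventional global-vocabulary construction described above can be sketched as follows. This is an illustrative sketch, not part of the invention; the function name, the `max_size` cutoff, and the sample token list are assumptions chosen only to show why rare words fall out of a global vocabulary.

```python
from collections import Counter

def build_global_vocab(corpus_tokens, max_size):
    # Count every identifier across the whole corpus, sort by frequency,
    # and keep only the most frequent ones; everything else maps to UNK.
    counts = Counter(corpus_tokens)
    vocab = {"UNK": 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

tokens = ["name", "name", "age", "age", "hp", "flags"]
vocab = build_global_vocab(tokens, max_size=2)
# "hp" and "flags" are rare, so they fall outside the vocabulary and map to UNK
assert vocab == {"UNK": 0, "name": 1, "age": 2}
assert vocab.get("hp", vocab["UNK"]) == 0
```

A rare identifier such as `hp` is thus indistinguishable from any other out-of-vocabulary word, which is exactly the limitation the anonymization approach targets.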
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a code completion method, system, storage medium, and computer program product that effectively handle rare words.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a code completion method, comprising the steps of:
s1, preprocessing the code fragment, and taking the preprocessed code fragment as the input of a non-anonymization model to obtain a first prediction result; judging whether the first prediction result is the identifier UNK, and if not, ending the process; otherwise, go to S2;
and S2, using the anonymized code as the input of the anonymization model to obtain a second prediction result.
The invention sets the anonymization process, thereby effectively processing OOV words (rare vocabularies) and improving the prediction accuracy.
In step S1, the training process of the non-anonymization model includes:
1) collecting code Code_o;
2) preprocessing Code_o to obtain the training corpus Code_c;
3) taking the training corpus Code_c as the input of a code completion model to obtain the non-anonymization model.
The training of the non-anonymization model has the advantages that the corpus is easy to obtain and the model is easy to train.
In step S1, the obtaining of the first prediction result includes:
A) preprocessing the existing code snippet Code_s to obtain the preprocessed code snippet Code_n;
B) taking the processed Code_n as the input of the non-anonymization model, and taking the highest-probability result output by the non-anonymization model as the first prediction result.
The non-anonymization model of the invention can use any existing code completion model, so the framework keeps pace with the state of the art.
In step S2, the training process of the anonymization model includes:
c) collecting codes, and preprocessing the collected codes to obtain a training corpus;
d) and training a completion code model by using the training corpus to obtain an anonymization model.
The anonymization model is specially designed and trained for the OOV problem. It differs from the non-anonymization model in that its training data goes through an anonymization process. This processing lets the trained model learn the structure of the code well, making OOV words convenient to predict.
In step S2, the obtaining of the second prediction result includes:
i) anonymizing the existing code fragment to obtain anonymized code; establishing an anonymization word list that stores the mapping from anonymized IDs to original identifiers; taking the anonymized code as the input of the anonymization model, and taking the highest-probability result as the second prediction result;
ii) judging whether the second prediction result is in the anonymization word list, if so, searching the original identifier according to the anonymization word list, and outputting the original identifier, otherwise, outputting the identifier UNK.
In the invention, the anonymization model establishes a dynamically changed vocabulary, updates encountered new vocabularies in real time and predicts the encountered vocabularies in time. The vocabulary is upgraded from the static vocabulary of the traditional model to the dynamic vocabulary, so that the new vocabulary in the code can be processed conveniently, and the problem that the OOV vocabulary cannot be processed in the traditional model is solved.
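The dynamically changing vocabulary described above can be sketched as a small two-way mapping that grows as new identifiers are encountered. The `DynamicVocab` class and its method names are hypothetical, chosen only for illustration.

```python
class DynamicVocab:
    """Per-file anonymization table that grows as new identifiers appear."""

    def __init__(self):
        self.to_id = {}        # original identifier -> anonymized ID
        self.to_original = {}  # anonymized ID -> original identifier

    def anonymize(self, identifier):
        # Assign the next "varN" ID the first time an identifier is seen;
        # later occurrences reuse the same ID.
        if identifier not in self.to_id:
            anon = f"var{len(self.to_id) + 1}"
            self.to_id[identifier] = anon
            self.to_original[anon] = identifier
        return self.to_id[identifier]

    def recover(self, anon_id):
        # Map a predicted anonymized ID back to the original identifier,
        # or UNK if that ID was never assigned (an invalid prediction).
        return self.to_original.get(anon_id, "UNK")

v = DynamicVocab()
assert [v.anonymize(t) for t in ["Person", "name", "name"]] == ["var1", "var2", "var2"]
assert v.recover("var2") == "name"
assert v.recover("var9") == "UNK"
```

Because the table is rebuilt per file, even a brand-new identifier immediately receives an ID and can be recovered from a prediction, which is the upgrade from a static to a dynamic vocabulary.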
The invention also provides a code completion system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described method.
As one inventive concept, the present invention also provides a computer-readable storage medium including a program running in a computer device; the program is configured or programmed for carrying out the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that: the invention fully utilizes the characteristics of the code, namely the functional invariance of the code, to solve the OOV problem in the code completion. The code completion method can efficiently process rare words and new words in OOV words by anonymizing and establishing a dynamic word list, has strong universality, and can be used on methods based on a deep language model, such as LSTM, Transformer and the like.
Drawings
FIG. 1 is a schematic block diagram of a method according to an embodiment of the invention;
FIG. 2 illustrates a process for predicting rare words according to an embodiment of the present invention.
Detailed Description
The present invention proposes a framework named AOOV to solve the OOV problem in code completion; the "A" in AOOV refers to anonymization. The AOOV framework consists of three parts: the non-anonymization model, the anonymization model, and the fusion algorithm that combines the two. The overall process is shown in Figure 1. The method feeds the source code into the non-anonymization model and the anonymization model to obtain their respective results, then applies the fusion algorithm to obtain the final result. In AOOV, all components are replaceable.
The main difference between the non-anonymization model and the anonymization model is how the code is processed; the two may use the same language model or different ones. The non-anonymization model builds a vocabulary by counting the identifiers in the code and then uses it as the language model's input. The anonymization model anonymizes the identifiers in the original code and then builds a word list as the language model's input. Since the non-anonymized and anonymized language models do not share a vocabulary, their prediction results cannot be understood by each other. Therefore, a bridge must be built between the two models so that they can communicate. This bridge is the fusion algorithm of the anonymized and non-anonymized language models. The non-anonymization model is an existing code completion model, so the invention focuses only on the anonymization model.
Consider the following three code files:
# File 1
class Person:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender
# File 2
class DataIterator:
    def __init__(self, data_stream, request_iterator, as_dict):
        self.data_stream = data_stream
        self.request_iterator = request_iterator
        self.as_dict = as_dict
# File 3
class Model:
    def __init__(self, input_data, hp, flags):
        self.input_data = input_data
        self.hp = hp
        self.flags = flags
The three files all express the same thing: each defines a class whose constructor parameters become class member variables. Suppose we predict with the existing method. The process is as follows. First, count the occurrences of each identifier; the results after counting are shown in Table 1. Next, sort the identifiers by occurrence count and assign IDs in turn. In a real model, however, words with low occurrence counts must be discarded for computational efficiency. If the model's vocabulary keeps only words that appear more than 10 times in the corpus, the resulting vocabulary has no way to predict words such as hp. Tables 1 to 3 show the raw data. With the anonymization method, each file is anonymized separately, and an anonymized word list is built for each file. Tables 4 to 6 show the anonymized data.
TABLE 1 detailed information after statistics
TABLE 2 detailed information given ID
TABLE 3 details after discarding rare words
Table 4 anonymized vocabulary for file 1
TABLE 5 anonymized vocabulary for File 2
TABLE 6 anonymized vocabulary for File 3
# File 1 after anonymization
class var1:
    def __init__(self, var2, var3, var4):
        self.var2 = var2
        self.var3 = var3
        self.var4 = var4
# File 2 after anonymization
class var1:
    def __init__(self, var2, var3, var4):
        self.var2 = var2
        self.var3 = var3
        self.var4 = var4
# File 3 after anonymization
class var1:
    def __init__(self, var2, var3, var4):
        self.var2 = var2
        self.var3 = var3
        self.var4 = var4
After anonymization is complete, the contents of the three files are identical. For the model, identical inputs from the three files yield identical prediction results. Finally, we only need to recover the original identifier from the anonymized ID to complete the whole prediction process.
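The functional invariance that makes the three files identical after anonymization can be checked with a rough sketch. The regex-based tokenizer and the small keyword set here are simplifying assumptions (a real implementation would anonymize at the AST level, as the invention's preprocessing suggests).

```python
import re

def anonymize_file(source):
    # Map each user-defined identifier to varN in order of first appearance,
    # leaving keywords and dunder names untouched. Regex matching is only a
    # sketch; it does not handle strings, comments, or attribute scoping.
    keywords = {"class", "def", "self", "__init__"}
    table = {}
    for ident in re.findall(r"[A-Za-z_]\w*", source):
        if ident not in keywords and ident not in table:
            table[ident] = f"var{len(table) + 1}"
    # Replace longer names first so shorter names cannot clobber them.
    for ident, anon in sorted(table.items(), key=lambda kv: -len(kv[0])):
        source = re.sub(rf"\b{re.escape(ident)}\b", anon, source)
    return source

file1 = "class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age\n"
file3 = "class Model:\n    def __init__(self, hp, flags):\n        self.hp = hp\n        self.flags = flags\n"
# Functionally identical code anonymizes to the same text
assert anonymize_file(file1) == anonymize_file(file3)
```

Since both files reduce to the same anonymized text, a model trained on anonymized code sees one frequent pattern instead of many rare identifiers.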
Fig. 2 takes File 2 as an example and fully illustrates the prediction process for rare words according to an embodiment of the present invention.
The anonymization model training process is as follows:
1. Collect code Code_original from GitHub;
2. Preprocess Code_original to obtain the model's training corpus Code_corpus;
3. Train an anonymization model, which may be any existing code completion model (e.g., LSTM, Transformer), to obtain Model_anonymous.
The anonymization model prediction process is as follows:
1. Anonymize the existing code snippet Code_snippet, replacing each identifier with an anonymized identifier, to obtain the anonymized code Code_anonymous; at the same time, build an anonymized word list Vocab_anonymous that maintains the mapping from anonymized IDs to original identifiers;
2. Feed the anonymized code Code_anonymous to the anonymization model Model_anonymous for prediction;
3. The prediction yields several candidates with different probabilities; take the anonymization model's highest-probability candidate as the prediction result Ret_pred;
4. If the prediction result Ret_pred is in the anonymized word list Vocab_anonymous, look up the original identifier in the word list and output Identifier_original;
5. Otherwise, output UNK.
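Steps 1 to 5 above can be sketched as follows, treating the trained anonymization model as a black box that returns candidate probabilities. The stub model and all names here are illustrative assumptions, not the invention's actual implementation.

```python
def predict_with_anonymization(snippet_tokens, model_anonymous):
    # Step 1: anonymize the snippet (assumed here to be identifier tokens
    # only) and record the anonymized-ID -> original-identifier mapping.
    vocab, anonymized = {}, []
    for tok in snippet_tokens:
        anon = vocab.setdefault(tok, f"var{len(vocab) + 1}")
        anonymized.append(anon)
    reverse = {anon: tok for tok, anon in vocab.items()}
    # Steps 2-3: the model returns candidates with probabilities;
    # keep the most probable one.
    candidates = model_anonymous(anonymized)  # e.g. {"var2": 0.9, ...}
    best = max(candidates, key=candidates.get)
    # Steps 4-5: recover the original identifier, or UNK if the predicted
    # ID was never assigned in this snippet.
    return reverse.get(best, "UNK")

# A stub model standing in for a trained Model_anonymous
stub = lambda seq: {"var2": 0.9, "var1": 0.1}
assert predict_with_anonymization(["Person", "name", "name"], stub) == "name"
```

An ID the model emits that was never assigned (the invalid-prediction case discussed later) falls through to UNK.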
the training process of the non-anonymization model is as follows:
1. Collect code Code_o from GitHub;
2. Preprocess Code_o (i.e., perform lexical and syntactic analysis on the code to build an abstract syntax tree, then flatten the tree into a sequence; see Li J, Wang Y, Lyu M R, et al. Code completion with neural attention and pointer networks[J]. arXiv preprint arXiv:1711.09573, 2017, and Liu F, Li G, Wei B, et al. A self-attentional neural architecture for code completion with multi-task learning[C]//Proceedings of the 28th International Conference on Program Comprehension. 2020: 37-47) to obtain the training corpus Code_c;
3. Train a non-anonymization model, which may be any existing code completion model (e.g., LSTM, Transformer), to obtain Model_n.
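The preprocessing in step 2 (building an abstract syntax tree and flattening it into a sequence) can be sketched with Python's standard `ast` module. This is a simplified stand-in for the flattening procedures of the cited papers, which encode richer node information.

```python
import ast

def flatten_code(source):
    # Parse the source into an abstract syntax tree, then flatten the
    # tree into a sequence of node types and the names they carry.
    tree = ast.parse(source)
    seq = []
    for node in ast.walk(tree):  # breadth-first traversal of the AST
        seq.append(type(node).__name__)
        if isinstance(node, ast.Name):
            seq.append(node.id)
        elif isinstance(node, (ast.ClassDef, ast.FunctionDef)):
            seq.append(node.name)
    return seq

seq = flatten_code(
    "class Person:\n"
    "    def __init__(self, name):\n"
    "        self.name = name\n"
)
assert seq[:3] == ["Module", "ClassDef", "Person"]
```

The resulting token sequence is what a sequence model such as an LSTM or Transformer would consume as its training corpus.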
The prediction process of the non-anonymization model is as follows:
1. Preprocess the existing code snippet Code_s to obtain Code_n, using the same preprocessing as in the training process;
2. Feed the processed Code_n to the non-anonymization model Model_n for prediction;
3. The prediction yields several candidates with different probabilities; take the non-anonymization model's highest-probability candidate as the prediction result Identifier_non-anonymous;
4. Output Identifier_non-anonymous.
The AOOV-POFA prediction process of the embodiment of the invention is as follows:
1. Process the existing code snippet Code_s to obtain the input Code_in for the non-anonymization model and the input Code_ia for the anonymization model, each produced with its model's own preprocessing;
2. Predict with the non-anonymization model Model_non-anonymous to obtain the prediction result Identifier_non-anonymous;
3. If the prediction result is not UNK, output Identifier_non-anonymous;
4. Otherwise, predict with the anonymization model Model_anonymous and output the prediction result Identifier_anonymous.
The OOV problem has two main causes: rare words and new words. For rare words, as the three-file example above shows, anonymization masks the differences between identifiers, so that words which are low-frequency at the corpus level become high-frequency at the source-file level. For new words, suppose we have trained the anonymization model on an earlier corpus and a new code fragment arrives:
# File 4
class Foo:
    def __init__(self, bar, new_word_1, new_word_2):
        self.bar = bar
        self.new_word_1 = new_word_1
        self.new_word_2 = new_word_2
Assume the new words are new_word_1 and new_word_2; a conventional neural language model cannot predict these two words. With an anonymized model, however, they can be predicted. The present invention first anonymizes the above code.
# File 4 after anonymization
class var1:
    def __init__(self, var2, var3, var4):
        self.var2 = var2
        self.var3 = var3
        self.var4 = var4
The source code thus becomes the same as the earlier files, and in this case the model predicts it very easily. The invention then restores the predicted word, and the new word is thereby predicted.
Regarding the invalid prediction problem:
When the anonymization model is used, the anonymized vocabulary is built dynamically. However, the vocabulary has a fixed size at model training time, which creates the problem of invalid predictions. Consider the following example, where we have written:
# File 5
class Foo:
    def __init__(self, bar, ___ ___
In this case, the anonymized word list is shown in Table 7.
TABLE 7 anonymous vocabularies
However, at this point the model may predict 'var3'. Since there is no 'var3' in the word list, there is no way to recover what its real identifier is. Such a prediction is therefore meaningless and provides no substantive suggestion for code completion.
To fuse the prediction results of the anonymization and non-anonymization language models, a simple and direct fusion algorithm, POFA, is provided. Its implementation proceeds as follows:
1. predicting using a non-anonymized language model;
2. if the predicted result is non-UNK, ending;
3. if the predicted result is UNK, predicting by using an anonymized language model;
4. and outputting a prediction result, and ending.
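The four POFA steps above can be sketched directly. The stub models stand in for trained completion models and are assumptions for illustration only.

```python
def pofa_predict(snippet, model_non_anonymous, model_anonymous):
    # Prefer the non-anonymization model; fall back to the anonymization
    # model only when the first prediction is the OOV marker UNK.
    prediction = model_non_anonymous(snippet)
    if prediction != "UNK":
        return prediction
    return model_anonymous(snippet)

# Stub models standing in for trained completion models
in_vocab = lambda s: "name"
oov = lambda s: "UNK"
anon = lambda s: "request_iterator"
assert pofa_predict([], in_vocab, anon) == "name"
assert pofa_predict([], oov, anon) == "request_iterator"
```

The fallback order reflects the design choice described next: the non-anonymization model is trusted whenever it has an in-vocabulary answer, and the anonymization model handles only the OOV cases.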
When predicting an OOV word such as request_iterator, the anonymization model and the non-anonymization model each first obtain their prediction results. The invention then selects the final prediction based on the non-anonymization model's result: if that result is UNK, the invention selects the anonymization model's result; otherwise, it adopts the non-anonymization model's result. Through this fusion, the invention combines the advantages of both models.
Comparative experiments on embodiments of the invention are as follows.
In the embodiment of the invention, a comparison experiment is performed based on two major deep learning models, namely LSTM and Transformer, and the experimental results are shown in tables 8 and 9.
Experiment 1: the non-anonymization model and the anonymization model in the method were each replaced with different deep learning models. The experimental results show that OOV-word prediction performs well no matter which model is used.
Table 8 results of experiment 1
Experiment 2: comparison with other OOV prediction methods
The method of the invention is compared with the Pointer Mixture Network, a copy-based method. The experimental results show that the method of the invention far outperforms the copy-based method on both the Python and JavaScript datasets.
TABLE 9 results of experiment 2

Claims (8)

1. A method of code completion, comprising the steps of:
s1, preprocessing the code fragment, and taking the preprocessed code fragment as the input of a non-anonymization model to obtain a first prediction result; judging whether the first prediction result is the identifier UNK, and if not, ending the process; otherwise, go to S2;
and S2, using the anonymized code as the input of the anonymization model to obtain a second prediction result.
2. The code completion method according to claim 1, wherein in step S1, the training process of the non-anonymization model comprises:
1) collecting code Code_o;
2) preprocessing Code_o to obtain the training corpus Code_c;
3) taking the training corpus Code_c as the input of a code completion model to obtain the non-anonymization model.
3. The code completion method according to claim 1 or 2, wherein in step S1, the obtaining of the first predicted result comprises:
A) preprocessing the existing code snippet Code_s to obtain the preprocessed code snippet Code_n;
B) taking the processed Code_n as the input of the non-anonymization model, and taking the highest-probability result output by the non-anonymization model as the first prediction result.
4. The code completion method according to claim 1, wherein in step S2, the training process of the anonymization model comprises:
a) collecting codes, and preprocessing the collected codes to obtain a training corpus;
b) and training a completion code model by using the training corpus to obtain an anonymization model.
5. The code completion method according to claim 1 or 4, wherein in step S2, the obtaining of the second prediction result comprises:
i) anonymizing the existing code fragment to obtain an anonymized code; establishing an anonymization word list, and storing the mapping from the anonymization ID to the original identifier; taking the anonymized code as the input of the prediction model, and taking the obtained result with the highest probability as a second prediction result;
ii) judging whether the second prediction result is in the anonymization word list, if so, searching the original identifier according to the anonymization word list, and outputting the original identifier, otherwise, outputting the identifier UNK.
6. A code completion system comprising a computer device; the computer device is configured or programmed for carrying out the steps of the method according to one of claims 1 to 5.
7. A computer-readable storage medium comprising a program running in a computer device; the program is configured or programmed for carrying out the steps of the method according to one of claims 1 to 5.
8. A computer program product comprising a computer program/instructions; characterized in that the computer program/instructions, when executed by a processor, performs the steps of the method according to one of claims 1 to 5.
CN202111072772.2A 2021-09-14 2021-09-14 Code completion method, system, storage medium and computer program product Active CN113821198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111072772.2A CN113821198B (en) 2021-09-14 2021-09-14 Code completion method, system, storage medium and computer program product


Publications (2)

Publication Number Publication Date
CN113821198A true CN113821198A (en) 2021-12-21
CN113821198B CN113821198B (en) 2023-10-24

Family

ID=78922222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111072772.2A Active CN113821198B (en) 2021-09-14 2021-09-14 Code completion method, system, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN113821198B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092336A1 (en) * 2014-09-29 2016-03-31 Radu-Florian Atanasiu Code analysis for cloud-based integrated development environments
US20180082105A1 (en) * 2016-09-22 2018-03-22 Gracious Eloise, Inc. Digitized handwriting sample ingestion systems and methods
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 A method of based on LSTM auto-complete codes
CN109614103A (en) * 2018-10-19 2019-04-12 北京硅心科技有限公司 A kind of code completion method and system based on character
US20190303109A1 (en) * 2018-03-29 2019-10-03 Microsoft Technology Licensing, Llc. Code completion for overloaded methods
CN110673836A (en) * 2019-08-22 2020-01-10 阿里巴巴集团控股有限公司 Code completion method and device, computing equipment and storage medium
CN113076089A (en) * 2021-04-15 2021-07-06 南京大学 API completion method based on object type


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KENTA TERADA, 2019 IEEE 11th International Workshop on Computational Intelligence and Applications, pages 1 - 6 *
NADEZHDA CHIRKOVA: "A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code", Software Engineering, pages 1 - 6 *
Anonymous: "How does mainstream NLP research currently handle out-of-vocabulary words?", pages 1 - 5, Retrieved from the Internet <URL: https://www.zhihu.com/question/55172758> *
Yang Bo: "A Survey of Intelligent Code Completion Research", Journal of Software, pages 1 - 60 *
Yang Junwen; Wang Hai; Peng Xin; Zhao Wenyun: "Web Resource Recommendation Based on Developer Behavior Analysis", Computer Science, no. 07, pages 1 - 4 *
Hu Xing: "Research Progress on Program Generation and Completion Techniques Based on Deep Learning", Journal of Software, pages 1 - 6 *

Also Published As

Publication number Publication date
CN113821198B (en) 2023-10-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant