CN109582352A - Code completion method and system based on dual AST sequences - Google Patents

Code completion method and system based on dual AST sequences

Info

Publication number
CN109582352A
CN109582352A
Authority
CN
China
Prior art keywords
sequence
code
ast
sequences
completion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811224521.XA
Other languages
Chinese (zh)
Inventor
李戈
郝逸洋
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Silicon Heart Technology Co Ltd
Original Assignee
Beijing Silicon Heart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Silicon Heart Technology Co Ltd filed Critical Beijing Silicon Heart Technology Co Ltd
Priority to CN201811224521.XA priority Critical patent/CN109582352A/en
Publication of CN109582352A publication Critical patent/CN109582352A/en
Pending legal-status Critical Current

Classifications

    • G - Physics
    • G06 - Computing; calculating or counting
    • G06F - Electric digital data processing
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/72 - Code refactoring
    • G06N - Computing arrangements based on specific computational models
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present invention provides a code completion method and system based on dual AST sequences, comprising: a source code processing step, which analyzes the source code using an abstract syntax tree (AST); an AST-to-binary-tree step, in which the abstract syntax tree is converted into two different sequences simultaneously; a model training step, in which the two different sequences are input into an LSTM model to train a language model; and a prediction/completion step, in which code is completed according to the trained language model. The invention converts the AST of the program code to be learned into two sequences simultaneously (for example, a preorder sequence and an inorder sequence), and uses the information of both sequences together to train a single LSTM model. The LSTM trained by the method of the invention achieves higher accuracy. The technical solution of the invention is simple and fast, and can improve both the accuracy and the efficiency of code recommendation.

Description

Code completion method and system based on dual AST sequences
Technical field
The present invention relates to the field of computer software engineering, and more particularly to a code completion method and system based on dual AST sequences.
Background art
A program typically has structure at several levels, and the structure at each level corresponds to a particular stage of program analysis, so program information at different levels of abstraction can be obtained from different analysis stages. Many programs, such as those written in C, C++, C#, or Java, must be compiled before they can run, and the techniques used in compilation are also commonly used in program analysis tasks; the overall compilation process is shown in Figure 1. Typically, lexical analysis, syntax analysis, and semantic analysis yield a program's lexical, grammatical, and semantic information, which can loosely be understood as the program's "literal" information and "structural" information. Both kinds of analysis are clearly very important for understanding what a program does.
Much recent research applies deep learning models to program analysis. An intuitive approach is to use a recurrent neural network (RNN) to build a language model over the program source code (or over the token sequence of the source program). However, such an approach uses only the lowest-level information of the program, its lexical information, to analyze the program. Unlike natural language, the structural information of a program carries more of its essential meaning, and modeling the raw source code or token sequence directly cannot reflect the program's own information well. In other words, analyzing a program using lexical information alone is incomplete and does not fully exploit the many facets of information in the program source code. Moreover, different tasks are sensitive to different kinds of program information, and some program analysis tasks are more effective when they use more abstract program information. Program classification, for example, is particularly sensitive to program structure, because program structure reflects program functionality: if a user-defined identifier i is renamed to iii, the structure of the program, and hence its function, is completely unchanged.
Most programmers reuse code through frameworks or library APIs during software development, but it is almost impossible for a programmer to remember all APIs, because the number of existing APIs is enormous. Code auto-completion has therefore become an indispensable component of modern integrated development environments (IDEs). Statistics show that code completion is among the ten commands developers use most often. A code completion mechanism attempts to complete the rest of the program as the programmer types. Intelligent code completion accelerates the software development process by eliminating typing errors and recommending suitable APIs.
One existing code generation approach converts the code into an AST (Abstract Syntax Tree), converts the abstract syntax tree into a token sequence, and uses the resulting AST sequence data to train an LSTM. However, according to basic data structure theory, a single sequence alone, such as the preorder sequence, cannot unambiguously describe the original AST structure. That is, converting an AST into a single sequence loses much of the tree-structure information (relying on one sequence alone, the original AST cannot be reconstructed). To preserve all the information of a syntax tree, at least two sequences must be used simultaneously; for example, using the preorder sequence and the inorder sequence together preserves the complete information of the tree.
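The ambiguity of a single traversal sequence can be demonstrated concretely. The sketch below (illustrative; the class and function names are not from the patent) shows two distinct binary trees with identical preorder sequences, and how adding the inorder sequence tells them apart:

```python
# Two distinct binary trees sharing the same preorder traversal: a single
# sequence cannot recover the original tree, but adding the inorder
# traversal disambiguates them.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def preorder(t):
    return [] if t is None else [t.val] + preorder(t.left) + preorder(t.right)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.val] + inorder(t.right)

t1 = Node("A", left=Node("B"))    # B is the left child of A
t2 = Node("A", right=Node("B"))   # B is the right child of A

assert preorder(t1) == preorder(t2) == ["A", "B"]  # indistinguishable
assert inorder(t1) == ["B", "A"]                   # inorder breaks the tie
assert inorder(t2) == ["A", "B"]
```

This is exactly why the invention feeds two traversal sequences, rather than one, to the model.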
Summary of the invention
To solve the above problem, the present invention converts the abstract syntax tree (AST) of the program code to be learned into two sequences simultaneously (for example, a preorder sequence and an inorder sequence), and uses the information of both sequences together to train a single LSTM model.
Specifically, the present invention provides a code completion method based on dual AST sequences, comprising:
a source code processing step, which analyzes the source code using an abstract syntax tree;
a sequence generation step, which converts the abstract syntax tree into two different sequences simultaneously;
a model training step, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion step, which completes code according to the trained language model.
Preferably, in the source code processing step, the source code is parsed into different forms to obtain the classes, method lists, and code identifiers of the code.
Preferably, the sequence generation step comprises: obtaining a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenating the preorder sequence and the inorder sequence as the input to the subsequent LSTM network.
Preferably, the sequence generation step further comprises: obtaining an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenating the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
Preferably, the LSTM model is a stacked LSTM model, and the LSTM model is located at the hidden layer of an RNN model.
Preferably, in the prediction/completion step, a partial code fragment is input into the trained language model, which outputs recommended code elements according to the context.
According to another aspect of the present invention, a code completion system based on dual AST sequences is also provided, comprising the following sequentially connected modules:
a source code processing module, which analyzes the source code using an abstract syntax tree;
a sequence generating module, which converts the abstract syntax tree into two different sequences simultaneously;
a model training module, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion module, which completes code according to the trained language model.
Preferably, the source code processing module parses the source code into different forms to obtain the classes, method lists, and code identifiers of the code.
Preferably, the sequence generating module obtains a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenates the preorder sequence and the inorder sequence as the input to the subsequent LSTM network.
Preferably, the sequence generating module further obtains an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenates the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
The LSTM trained by the method of the invention achieves higher accuracy. The technical solution of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is the compilation flowchart of an existing program.
Fig. 2 is a flowchart of the code completion method based on dual AST sequences of the present invention.
Fig. 3 is a structural diagram of the code completion system based on dual AST sequences of the present invention.
Fig. 4 is a schematic diagram of the experimental training results for the "preorder + inorder" concatenated input sequence of the present invention.
Fig. 5 is a schematic diagram of the experimental training results for the "inorder + postorder" concatenated input sequence of the present invention.
Fig. 6 is a schematic diagram of the experimental training results for four different LSTM input sequences.
Detailed description of the embodiments
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that the invention can be thoroughly understood and its scope fully conveyed to those skilled in the art.
The present invention serializes the AST (Abstract Syntax Tree) and models the serialization result, so that an LSTM-family model (long short-term memory network, a kind of recurrent neural network) can analyze the structural information of a program and thereby perform the program classification task. In other words, the invention builds on performing program classification with an LSTM-family model, replacing the lexical-level input of the original language model (the token sequence) with the AST serialization result, so that the analysis mainly exploits the structural information of the program; this achieves good results.
An RNN (recurrent neural network) is a common artificial neural network suited to processing sequential inputs; its output can be a sequence of a different length (even of length 1). Early RNNs could not handle long-term dependencies, i.e., an RNN would "forget" information. The LSTM (long short-term memory network) is a variant of the RNN that solves the long-term dependency problem; it has a certain memory capability and is suited to processing and predicting time series with long intervals and delays between events. However, whatever kind of RNN is used, the input must be a sequence, so to analyze program structure with an LSTM one must consider a serialization that captures the structural information of the program; here, the invention uses the AST.
The structure of a program is usually a tree, while the LSTM is a sequence model with a linear structure. From data structure theory: (1) multiway trees and binary trees are in one-to-one correspondence; (2) a binary tree is in one-to-one correspondence with the pair consisting of its inorder traversal and its preorder (or postorder) traversal; in other words, an inorder traversal together with a preorder or postorder traversal uniquely determines a binary tree. It follows that the source code and the AST correspond one-to-one with the concatenation of the inorder traversal sequence and the preorder/postorder traversal sequence of the binary tree converted from the AST. The present invention therefore proposes and tests two serialization modes:
First: convert the AST to a binary tree, obtain the preorder sequence and the inorder sequence by preorder and inorder traversal, and concatenate the preorder sequence and the inorder sequence as the input to the network;
Second: convert the AST to a binary tree, obtain the inorder sequence and the postorder sequence by inorder and postorder traversal, and concatenate the inorder sequence and the postorder sequence as the input to the network.
Fig. 2 is a flowchart of the code completion method based on dual AST sequences of the present invention. The method comprises the following steps:
S1, source code processing step: analyze the source code using an abstract syntax tree. In this step, the source code is parsed into different forms to obtain the classes, method lists, code identifiers, etc. of the code.
An abstract syntax tree (AST), or syntax tree, is a tree representation of the abstract syntactic structure of source code, in particular the source code of a programming language. The counterpart of the abstract syntax tree is the concrete syntax tree, commonly called the parse tree. Generally, during the translation and compilation of source code, the syntax analyzer creates the parse tree. Once the AST is created, information can be added to it in subsequent processing stages, such as the semantic analysis stage.
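The embodiment later obtains ASTs for C/C++ with the pycparser tool; Python's built-in `ast` module illustrates the same idea on a one-line snippet, with no third-party dependency:

```python
# Parsing a statement into an AST with Python's standard-library ast module.
# The parser exposes structural node types ("Assign", "BinOp") that a flat
# token sequence does not make explicit.

import ast

tree = ast.parse("total = price * qty + tax")
node_types = [type(n).__name__ for n in ast.walk(tree)]

assert node_types[0] == "Module"                      # root of the tree
assert "Assign" in node_types and "BinOp" in node_types
```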
The abstract syntax tree is then converted into a binary tree.
S2, sequence generation step: convert the abstract syntax tree (AST) into two different sequences simultaneously. Specifically, the invention uses two conversion modes, as follows:
First: obtain the preorder sequence and the inorder sequence by preorder and inorder traversal, and concatenate the preorder sequence and the inorder sequence as the input to the subsequent LSTM network;
Second: obtain the inorder sequence and the postorder sequence by inorder and postorder traversal, and concatenate the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
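The sequence generation step can be sketched as two traversals of the binary tree followed by list concatenation (a minimal illustration; the tree and names are not from the patent):

```python
# Sequence generation: traverse the binary tree twice and concatenate the
# results into a single LSTM input sequence, in either of the two modes.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def preorder(t):
    return [] if t is None else [t.val] + preorder(t.left) + preorder(t.right)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.val] + inorder(t.right)

def postorder(t):
    return [] if t is None else postorder(t.left) + postorder(t.right) + [t.val]

#       A
#      / \
#     B   C
bt = Node("A", Node("B"), Node("C"))

mode1 = preorder(bt) + inorder(bt)    # first mode:  "preorder + inorder"
mode2 = inorder(bt) + postorder(bt)   # second mode: "inorder + postorder"
assert mode1 == ["A", "B", "C", "B", "A", "C"]
assert mode2 == ["B", "A", "C", "B", "C", "A"]
```

Either concatenated sequence determines the binary tree (and hence the AST) uniquely, which is the property the invention relies on.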
S3, model training step: input the two different sequences into the LSTM model to train the language model. The two sequences obtained from the parsing in step S2 are used for a recurrent neural network language model based on long short-term memory (LSTM). The LSTM model is a stacked LSTM model, located at the hidden layer of an RNN model.
S4, prediction/completion step: complete code according to the trained language model. In this step, a partial code fragment is input into the trained language model, which outputs recommended code elements according to the context.
As shown in Fig. 3, according to another aspect of the present invention, a code completion system 100 based on dual AST sequences is also provided, comprising the following sequentially connected modules:
a source code processing module 110, which analyzes the source code using an abstract syntax tree; preferably, the source code processing module parses the source code into different forms to obtain the classes, method lists, and code identifiers of the code;
a sequence generating module 120, which converts the abstract syntax tree (AST) into two different sequences simultaneously;
a model training module 130, which inputs the two different sequences into the LSTM model to train the language model;
a prediction/completion module 140, which completes code according to the trained language model.
In the specific embodiment described with reference to Fig. 2, the data set used by the invention is the C/C++ program source code of 104 problem classes from POJ (the Peking University Online Judge system); each class corresponds to one problem of the system and contains 500 source files submitted by students that fully satisfy the problem requirements, with the classes of the files labeled 1-104. In the preprocessing stage, the AST of the C/C++ code is first obtained with the pycparser tool; the AST is then converted into a binary tree by a uniform method; the preorder/inorder/postorder sequences are obtained by preorder/inorder/postorder traversal (the traversal results are sequences of the node types in the AST); and the input sequence is obtained by concatenation: preorder sequence + inorder sequence, or inorder sequence + postorder sequence. As described above, either concatenated input sequence corresponds one-to-one with the source program. In the network, the input sequence first passes through an embedding layer; each node is converted into a one-hot vector (whose length is the vocabulary size), and each such one-hot vector becomes the actual input of one time step. After the one-hot input of the last node of the input sequence, a predicted output one-hot vector is obtained; after a softmax layer (not shown in the figure), the index of the maximum value (the vector length is the number of program classes, i.e., 104) is taken as the predicted class of the input sequence. During training, labels are quantized as one-hot vectors (of length 104) in which only the entry for the corresponding class is 1, and the loss function used is cross-entropy.
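The encoding and decoding ends of this pipeline can be sketched in a few lines of pure Python (the real network in between is an LSTM; node-type names and the high-scoring class below are illustrative assumptions):

```python
# Minimal sketch of the embodiment's encoding/decoding pieces: one-hot node
# encoding, a softmax over 104 class logits, and argmax for the prediction.

import math

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["FuncDef", "Decl", "Assign", "BinOp"]   # illustrative node types
x = one_hot(vocab.index("Assign"), len(vocab))   # input of one time step
assert x == [0.0, 0.0, 1.0, 0.0]

logits = [0.1] * 104                             # 104 POJ problem classes
logits[41] = 3.0                                 # pretend one class scores high
probs = softmax(logits)
pred = max(range(104), key=probs.__getitem__)    # argmax over the softmax
assert pred == 41 and abs(sum(probs) - 1.0) < 1e-9
```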
In Fig. 2, Xi denotes the input of the current time step, i.e., the embedding of the one-hot vector of an AST node; hi denotes the hidden layer state of the current time step; and y denotes the current output. Since this is a program classification task, the invention only needs to output the corresponding predicted program class at the time step that reads the last input; in practice, y is passed through a softmax layer and an argmax to determine the corresponding class.
Experiments and results
To demonstrate the effect of the model, the present invention also ran two groups of control experiments. The first is the original version, whose input sequence is the lexical analysis result of the program, i.e., the token sequence/word sequence of the source code is used directly as input. The second uses, as input, a traversal-sequence combination that is not in one-to-one correspondence with the program; for simplicity, the invention uses the depth-first traversal node sequence of the AST for this purpose.
Experimental setup (hyperparameters):
Hidden size=300
Batch size=22
Learning rate=1e-4
Experimental results: the training processes for the two concatenated input sequences are shown in Figs. 4 and 5. The "preorder + inorder" input finally converges to a prediction accuracy of 90.08% (Fig. 4), and the "inorder + postorder" input finally converges to an accuracy of 88.43% (Fig. 5).
The results of the other two control experiments are shown in Fig. 6, where ast2bt_iap denotes the inorder sequence concatenated with the postorder sequence as input, ast2bt_pai denotes the preorder sequence concatenated with the inorder sequence as input, ast_dfs denotes the depth-first traversal sequence of the AST as input, and src_code denotes the token sequence of the source code used directly as input. Table 1 below lists the test accuracy obtained for each of the four model inputs.
Table 1
Model input                                    Test accuracy
AST to binary tree: preorder + inorder         90.08%
AST to binary tree: inorder + postorder        88.43%
Depth-first traversal sequence of the AST      88.60%
Token sequence of the source code              91.87%
It can be seen that even shallow structural information of the program achieves very good results on the program classification task. The ast_dfs method (depth-first traversal sequence as input) clearly converges more slowly, because the depth-first traversal sequence does not correspond one-to-one with the source program, so some of the programs' distinguishability is lost; the model therefore naturally learns more slowly, and its accuracy is also lower than that of the other methods. The convergence of src_code is fast, partly because its input sequences are shorter (less than half the length of the concatenated-sequence inputs), so the model can learn faster; moreover, since the LSTM model itself can extract higher-level abstract information of a program, the model using the dual concatenated AST sequences as input does not yet exploit the structural information of the program fully. The key point is that even with only incomplete structural information of the program (the dual concatenated AST sequences cannot fully reflect the structure of the entire program, but such serialization must be accepted in order to serialize for the LSTM model), good results can be obtained on the program classification task, differing from the source-code-input experiment by less than 2%, which already demonstrates the ability of the model.
Overall, the model of the invention models the structural information of the program well. Furthermore, the model can also be used for other tasks: as long as a complete AST structure can be obtained, similar processing applies. On the other hand, if a more structured network could be used, taking the original AST, or a data structure that fully reflects the program structure, as input, the experimental results might be even better.
It should be understood that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other equipment. Various general-purpose devices may also be used with the teachings herein. As described above, the structure required to construct such devices is obvious. Moreover, the present invention is not directed to any particular programming language. It should be understood that the contents of the invention described herein can be realized with various programming languages, and the above description of a specific language is made to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided here. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. The disclosed method, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. Modules, units, or components of an embodiment can be combined into one module, unit, or component, and they can also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of a creating device according to embodiments of the invention. The invention can also be implemented as device or apparatus programs (e.g., computer programs and computer program products) for executing part or all of the methods described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.
The above are merely preferred embodiments of the present invention, but the protection scope of the invention is not limited thereto. Any change or substitution that can readily be conceived by any person skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

1. A code completion method based on dual AST sequences, characterized by comprising:
a source code processing step, which analyzes the source code using an abstract syntax tree;
a sequence generation step, which converts the abstract syntax tree into two different sequences simultaneously;
a model training step, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion step, which completes code according to the trained language model.
2. the code completion method according to claim 1 based on double AST sequences, it is characterised in that:
In source code processing step, the source code is resolved to different form, to obtain class, the method list, generation of code Code identifier.
3. The code completion method based on double AST sequences according to claim 1, characterized in that:
the sequence generation step comprises: obtaining a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenating the preorder sequence and the inorder sequence as the input of the subsequent LSTM network.
4. The code completion method based on double AST sequences according to claim 3, characterized in that:
the sequence generation step further comprises: obtaining an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenating the inorder sequence and the postorder sequence as the input of the subsequent LSTM network.
5. The code completion method based on double AST sequences according to claim 1 or 2, characterized in that:
the LSTM model is a stacked LSTM model.
6. The code completion method based on double AST sequences according to claim 1, characterized in that:
in the prediction completion step, a partial code fragment is input into the trained language model, which thereby outputs recommended code elements based on the context.
7. A code completion system based on double AST sequences, characterized by comprising the following sequentially connected modules:
a source code processing module, for parsing source code using an abstract syntax tree;
a sequence generation module, for simultaneously converting the abstract syntax tree into two different sequences;
a model training module, for inputting the two different sequences into an LSTM model to train a language model; and
a prediction completion module, for completing code according to the trained language model.
8. The code completion system based on double AST sequences according to claim 7, characterized in that:
the source code processing module parses the source code into different forms so as to obtain the classes, method lists, and code identifiers of the code.
9. The code completion system based on double AST sequences according to claim 7, characterized in that:
the sequence generation module obtains a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenates the preorder sequence and the inorder sequence as the input of the subsequent LSTM network.
10. The code completion system based on double AST sequences according to claim 9, characterized in that:
the sequence generation module further obtains an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenates the inorder sequence and the postorder sequence as the input of the subsequent LSTM network.
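As an illustration of the source code processing step in the claims above (not part of the claims themselves), Python's built-in `ast` module can parse source into an abstract syntax tree, and one possible serialization flattens it into a sequence of node-type names; the helper names here are illustrative, not taken from the patent:

```python
import ast

def source_to_ast(source: str) -> ast.AST:
    """Source code processing step: parse source code into an abstract syntax tree."""
    return ast.parse(source)

def ast_node_types(tree: ast.AST) -> list:
    """Flatten the AST into a sequence of node-type names (one possible serialization)."""
    return [type(node).__name__ for node in ast.walk(tree)]

source = "def add(a, b):\n    return a + b\n"
tree = source_to_ast(source)
types = ast_node_types(tree)
print(types)
```

From this tree, information such as function definitions and identifiers (claims 2 and 8) can be read directly off the corresponding node types.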
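Since inorder traversal is only well defined for binary trees, an implementation of the traversal claims above would presumably binarize the AST first (the patent does not specify how). Assuming a small binarized tree, the two sequence pairs and their concatenations could be sketched as follows (all names are illustrative):

```python
class Node:
    """A node of a binarized AST, labeled with its node type."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def preorder(n):
    return [n.label] + preorder(n.left) + preorder(n.right) if n else []

def inorder(n):
    return inorder(n.left) + [n.label] + inorder(n.right) if n else []

def postorder(n):
    return postorder(n.left) + postorder(n.right) + [n.label] if n else []

# Tiny binarized "AST" for an assignment of a binary expression.
tree = Node("Assign", Node("Name"), Node("BinOp", Node("Num"), Node("Num")))

# Claim 3: concatenate preorder + inorder as one LSTM input sequence.
seq_pre_in = preorder(tree) + inorder(tree)
# Claim 4: concatenate inorder + postorder.
seq_in_post = inorder(tree) + postorder(tree)
```

A plausible motivation for these particular pairs is that a preorder sequence together with an inorder sequence uniquely determines a binary tree (likewise inorder plus postorder), so the concatenated input carries the full tree structure in flat form.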
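The prediction completion step feeds a partial code fragment to the trained language model and recommends the most probable next element. The trained stacked LSTM cannot be reproduced here, so a toy bigram frequency model stands in for it purely to illustrate the completion interface:

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy stand-in for the trained language model; not the patent's LSTM."""
    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sequences):
        # Count which element follows which across the training sequences.
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.next_counts[a][b] += 1

    def complete(self, partial):
        """Recommend the most frequent continuation of the last element."""
        counts = self.next_counts.get(partial[-1])
        return counts.most_common(1)[0][0] if counts else None

model = BigramModel()
model.train([["for", "i", "in", "range"], ["for", "j", "in", "range"]])
print(model.complete(["for", "i", "in"]))
```

A real implementation would replace `BigramModel` with the stacked LSTM of claim 5, but the completion interface (partial context in, recommended element out) is the same.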
CN201811224521.XA 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences Pending CN109582352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811224521.XA CN109582352A (en) 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences


Publications (1)

Publication Number Publication Date
CN109582352A true CN109582352A (en) 2019-04-05

Family

ID=65920215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811224521.XA Pending CN109582352A (en) 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences

Country Status (1)

Country Link
CN (1) CN109582352A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070277163A1 (en) * 2006-05-24 2007-11-29 Syver, Llc Method and tool for automatic verification of software protocols
CN102185930A (en) * 2011-06-09 2011-09-14 北京理工大学 Method for detecting SQL (structured query language) injection vulnerability
US9928040B2 (en) * 2013-11-12 2018-03-27 Microsoft Technology Licensing, Llc Source code generation, completion, checking, correction
US20180088937A1 (en) * 2016-09-29 2018-03-29 Microsoft Technology Licensing, Llc Code refactoring mechanism for asynchronous code optimization using topological sorting
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 A method of based on LSTM auto-complete codes
CN108563433A (en) * 2018-03-20 2018-09-21 北京大学 A kind of device based on LSTM auto-complete codes
CN108595165A (en) * 2018-04-25 2018-09-28 清华大学 A kind of code completion method, apparatus and storage medium based on code intermediate representation


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DRCRYPTO: "Determining from preorder + postorder traversals whether the sequence is unique, and outputting one inorder sequence", 《HTTPS://BLOG.CSDN.NET/U011240016/ARTICLE/DETAILS/53193754》 *
JIAN LI et al.: "Code Completion with Neural Attention and Pointer Networks", 《INT'L JOINT CONF. ON ARTIFICIAL INTELLIGENCE (IJCAI)》 *
VESELIN RAYCHEV et al.: "Code Completion with Statistical Language Models", 《ACM》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223553B (en) * 2019-05-20 2021-08-10 北京师范大学 Method and system for predicting answer information
CN110223553A (en) * 2019-05-20 2019-09-10 北京师范大学 A kind of prediction technique and system of answering information
CN111966817A (en) * 2020-07-24 2020-11-20 复旦大学 API recommendation method based on deep learning and code context structure and text information
CN111966817B (en) * 2020-07-24 2022-05-20 复旦大学 API recommendation method based on deep learning and code context structure and text information
CN112035099A (en) * 2020-09-01 2020-12-04 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112035099B (en) * 2020-09-01 2024-03-15 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112905188A (en) * 2021-02-05 2021-06-04 中国海洋大学 Code translation method and system based on generation type countermeasure GAN network
CN112860362B (en) * 2021-02-05 2022-10-04 达而观数据(成都)有限公司 Visual debugging method and system for robot automation process
CN112860362A (en) * 2021-02-05 2021-05-28 达而观数据(成都)有限公司 Visual debugging method and system for robot automation process
CN113010182A (en) * 2021-03-25 2021-06-22 北京百度网讯科技有限公司 Method and device for generating upgrade file and electronic equipment
CN113076089A (en) * 2021-04-15 2021-07-06 南京大学 API completion method based on object type
CN113076089B (en) * 2021-04-15 2023-11-21 南京大学 API (application program interface) completion method based on object type
CN113064586A (en) * 2021-05-12 2021-07-02 南京大学 Code completion method based on abstract syntax tree augmented graph model
CN113064586B (en) * 2021-05-12 2022-04-22 南京大学 Code completion method based on abstract syntax tree augmented graph model
CN117573084A (en) * 2023-08-02 2024-02-20 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573084B (en) * 2023-08-02 2024-04-12 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573085A (en) * 2023-10-17 2024-02-20 广东工业大学 Code complement method based on hierarchical structure characteristics and sequence characteristics
CN117573085B (en) * 2023-10-17 2024-04-09 广东工业大学 Code complement method based on hierarchical structure characteristics and sequence characteristics

Similar Documents

Publication Publication Date Title
CN109582352A (en) A kind of code completion method and system based on double AST sequences
CN108388425A (en) A method of based on LSTM auto-complete codes
Biallas et al. Arcade. PLC: A verification platform for programmable logic controllers
Chakraborty et al. On multi-modal learning of editing source code
Marre et al. Test sequences generation from lustre descriptions: Gatel
Medeiros et al. DEKANT: a static analysis tool that learns to detect web application vulnerabilities
CN108595341B (en) Automatic example generation method and system
WO2019075390A1 (en) Blackbox matching engine
CN108563433A (en) A kind of device based on LSTM auto-complete codes
WO2019051426A1 (en) Pruning engine
CN109614103A (en) A kind of code completion method and system based on character
WO2018226598A1 (en) Method and system for arbitrary-granularity execution clone detection
CN109492402A (en) A kind of intelligent contract safe evaluating method of rule-based engine
CN106682343A (en) Method for formally verifying adjacent matrixes on basis of diagrams
CN114911711A (en) Code defect analysis method and device, electronic equipment and storage medium
Shrestha et al. DeepFuzzSL: Generating models with deep learning to find bugs in the Simulink toolchain
CN107194065A (en) A kind of method for being checked in PCB design and setting binding occurrence
CN108563561B (en) Program implicit constraint extraction method and system
CN106775913A (en) A kind of object code controlling stream graph generation method
Xu et al. Dsmith: Compiler fuzzing through generative deep learning model with attention
Meffert Supporting design patterns with annotations
Hashtroudi et al. Automated test case generation using code models and domain adaptation
Ribeiro et al. Gpt-3-powered type error debugging: Investigating the use of large language models for code repair
US7543274B2 (en) System and method for deriving a process-based specification
KR102421274B1 (en) voice control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination