CN109582352A - Code completion method and system based on dual AST sequences - Google Patents

Code completion method and system based on dual AST sequences

Info

Publication number
CN109582352A
CN109582352A
Authority
CN
China
Prior art keywords
sequence
code
ast
sequences
completion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811224521.XA
Other languages
Chinese (zh)
Inventor
李戈
郝逸洋
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Silicon Heart Technology Co Ltd
Original Assignee
Beijing Silicon Heart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Silicon Heart Technology Co Ltd filed Critical Beijing Silicon Heart Technology Co Ltd
Priority to CN201811224521.XA priority Critical patent/CN109582352A/en
Publication of CN109582352A publication Critical patent/CN109582352A/en
Pending legal-status Critical Current

Classifications

    • G - Physics
    • G06 - Computing; calculating or counting
    • G06F - Electric digital data processing
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/72 - Code refactoring
    • G06N - Computing arrangements based on specific computational models
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present invention provides a code completion method and system based on dual AST sequences, comprising: a source code processing step, which analyzes the source code using an abstract syntax tree (AST); an AST-to-binary-tree step, in which the abstract syntax tree is converted into two different sequences simultaneously; a model training step, in which the two different sequences are input into an LSTM model to train a language model; and a prediction/completion step, in which code is completed according to the trained language model. The invention converts the AST of the program code to be learned into two sequences simultaneously (for example, a preorder sequence and an inorder sequence), and uses the information of both sequences together to train a single LSTM model. The LSTM trained by the method of the invention achieves higher accuracy. The technical solution of the invention is simple and fast, and can improve both the accuracy and the efficiency of code recommendation.

Description

Code completion method and system based on dual AST sequences
Technical field
The present invention relates to the field of computer software engineering, and more particularly to a code completion method and system based on dual AST sequences.
Background art
A program typically has structure at several levels, and the structure at each level corresponds to a particular stage of program analysis, so program information at different levels of abstraction can be obtained from different analysis stages. Many programs, such as those written in C, C++, C#, or Java, must be compiled before they can run, and the techniques used in compilation are also commonly used in program analysis tasks; the overall compilation process is shown in Figure 1. Typically, lexical analysis, syntax analysis, and semantic analysis yield a program's lexical, grammatical, and semantic information, which can loosely be understood as the program's "literal" information and "structural" information. Both kinds of analysis are clearly very important for understanding what a program does.
Much recent research applies deep learning models to program analysis. An intuitive approach is to use a recurrent neural network (RNN) to build a language model over the program source code (or over the token sequence of the source program). However, such an approach uses only the lowest-level information of the program, its lexical information, to analyze the program. Unlike natural language, the structural information of a program carries more of its essential meaning, and modeling the raw source code or token sequence directly cannot reflect the program's own information well. In other words, analyzing a program using lexical information alone is incomplete and does not fully exploit the many facets of information in the program source code. Moreover, different tasks are sensitive to different kinds of program information, and some program analysis tasks are more effective when they use more abstract program information. Program classification, for example, is particularly sensitive to program structure, because program structure reflects program functionality: if a user-defined identifier i is renamed to iii, the structure of the program, and hence its function, is completely unchanged.
Most programmers reuse code through frameworks or library APIs during software development, but it is almost impossible for a programmer to remember all APIs, because the number of existing APIs is enormous. Code auto-completion has therefore become an indispensable component of modern integrated development environments (IDEs). Statistics show that code completion is among the ten commands developers use most often. A code completion mechanism attempts to complete the rest of the program as the programmer types. Intelligent code completion accelerates the software development process by eliminating typing errors and recommending suitable APIs.
One existing code generation approach converts the code into an AST (Abstract Syntax Tree), converts the abstract syntax tree into a token sequence, and uses the resulting AST sequence data to train an LSTM. However, according to basic data structure theory, a single sequence alone, such as the preorder sequence, cannot unambiguously describe the original AST structure. That is, converting an AST into a single sequence loses much of the tree-structure information (relying on one sequence alone, the original AST cannot be reconstructed). To preserve all the information of a syntax tree, at least two sequences must be used simultaneously; for example, using the preorder sequence and the inorder sequence together preserves the complete information of the tree.
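The ambiguity of a single traversal sequence can be demonstrated concretely. The sketch below (illustrative; the class and function names are not from the patent) shows two distinct binary trees with identical preorder sequences, and how adding the inorder sequence tells them apart:

```python
# Two distinct binary trees sharing the same preorder traversal: a single
# sequence cannot recover the original tree, but adding the inorder
# traversal disambiguates them.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def preorder(t):
    return [] if t is None else [t.val] + preorder(t.left) + preorder(t.right)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.val] + inorder(t.right)

t1 = Node("A", left=Node("B"))    # B is the left child of A
t2 = Node("A", right=Node("B"))   # B is the right child of A

assert preorder(t1) == preorder(t2) == ["A", "B"]  # indistinguishable
assert inorder(t1) == ["B", "A"]                   # inorder breaks the tie
assert inorder(t2) == ["A", "B"]
```

This is exactly why the invention feeds two traversal sequences, rather than one, to the model.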
Summary of the invention
To solve the above problem, the present invention converts the abstract syntax tree (AST) of the program code to be learned into two sequences simultaneously (for example, a preorder sequence and an inorder sequence), and uses the information of both sequences together to train a single LSTM model.
Specifically, the present invention provides a code completion method based on dual AST sequences, comprising:
a source code processing step, which analyzes the source code using an abstract syntax tree;
a sequence generation step, which converts the abstract syntax tree into two different sequences simultaneously;
a model training step, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion step, which completes code according to the trained language model.
Preferably, in the source code processing step, the source code is parsed into different forms to obtain the classes, method lists, and code identifiers of the code.
Preferably, the sequence generation step comprises: obtaining a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenating the preorder sequence and the inorder sequence as the input to the subsequent LSTM network.
Preferably, the sequence generation step further comprises: obtaining an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenating the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
Preferably, the LSTM model is a stacked LSTM model, and the LSTM model is located at the hidden layer of an RNN model.
Preferably, in the prediction/completion step, a partial code fragment is input into the trained language model, which outputs recommended code elements according to the context.
According to another aspect of the present invention, a code completion system based on dual AST sequences is also provided, comprising the following sequentially connected modules:
a source code processing module, which analyzes the source code using an abstract syntax tree;
a sequence generating module, which converts the abstract syntax tree into two different sequences simultaneously;
a model training module, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion module, which completes code according to the trained language model.
Preferably, the source code processing module parses the source code into different forms to obtain the classes, method lists, and code identifiers of the code.
Preferably, the sequence generating module obtains a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenates the preorder sequence and the inorder sequence as the input to the subsequent LSTM network.
Preferably, the sequence generating module further obtains an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenates the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
The LSTM trained by the method of the invention achieves higher accuracy. The technical solution of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is the compilation flowchart of an existing program.
Fig. 2 is a flowchart of the code completion method based on dual AST sequences of the present invention.
Fig. 3 is a structural diagram of the code completion system based on dual AST sequences of the present invention.
Fig. 4 is a schematic diagram of the experimental training results for the "preorder + inorder" concatenated input sequence of the present invention.
Fig. 5 is a schematic diagram of the experimental training results for the "inorder + postorder" concatenated input sequence of the present invention.
Fig. 6 is a schematic diagram of the experimental training results for four different LSTM input sequences.
Detailed description of the embodiments
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that the invention can be thoroughly understood and its scope fully conveyed to those skilled in the art.
The present invention serializes the AST (Abstract Syntax Tree) and models the serialization result, so that an LSTM-family model (long short-term memory network, a kind of recurrent neural network) can analyze the structural information of a program and thereby perform the program classification task. In other words, the invention builds on performing program classification with an LSTM-family model, replacing the lexical-level input of the original language model (the token sequence) with the AST serialization result, so that the analysis mainly exploits the structural information of the program; this achieves good results.
An RNN (recurrent neural network) is a common artificial neural network suited to processing sequential inputs; its output can be a sequence of a different length (even of length 1). Early RNNs could not handle long-term dependencies, i.e., an RNN would "forget" information. The LSTM (long short-term memory network) is a variant of the RNN that solves the long-term dependency problem; it has a certain memory capability and is suited to processing and predicting time series with long intervals and delays between events. However, whatever kind of RNN is used, the input must be a sequence, so to analyze program structure with an LSTM one must consider a serialization that captures the structural information of the program; here, the invention uses the AST.
The structure of a program is usually a tree, while the LSTM is a sequence model with a linear structure. From data structure theory: (1) multiway trees and binary trees are in one-to-one correspondence; (2) a binary tree is in one-to-one correspondence with the pair consisting of its inorder traversal and its preorder (or postorder) traversal; in other words, an inorder traversal together with a preorder or postorder traversal uniquely determines a binary tree. It follows that the source code and the AST correspond one-to-one with the concatenation of the inorder traversal sequence and the preorder/postorder traversal sequence of the binary tree converted from the AST. The present invention therefore proposes and tests two serialization modes:
First: convert the AST to a binary tree, obtain the preorder sequence and the inorder sequence by preorder and inorder traversal, and concatenate the preorder sequence and the inorder sequence as the input to the network;
Second: convert the AST to a binary tree, obtain the inorder sequence and the postorder sequence by inorder and postorder traversal, and concatenate the inorder sequence and the postorder sequence as the input to the network.
Fig. 2 is a flowchart of the code completion method based on dual AST sequences of the present invention. The method comprises the following steps:
S1, source code processing step: analyze the source code using an abstract syntax tree. In this step, the source code is parsed into different forms to obtain the classes, method lists, code identifiers, etc. of the code.
An abstract syntax tree (AST), or syntax tree, is a tree representation of the abstract syntactic structure of source code, in particular the source code of a programming language. The counterpart of the abstract syntax tree is the concrete syntax tree, commonly called the parse tree. Generally, during the translation and compilation of source code, the syntax analyzer creates the parse tree. Once the AST is created, information can be added to it in subsequent processing stages, such as the semantic analysis stage.
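The embodiment later obtains ASTs for C/C++ with the pycparser tool; Python's built-in `ast` module illustrates the same idea on a one-line snippet, with no third-party dependency:

```python
# Parsing a statement into an AST with Python's standard-library ast module.
# The parser exposes structural node types ("Assign", "BinOp") that a flat
# token sequence does not make explicit.

import ast

tree = ast.parse("total = price * qty + tax")
node_types = [type(n).__name__ for n in ast.walk(tree)]

assert node_types[0] == "Module"                      # root of the tree
assert "Assign" in node_types and "BinOp" in node_types
```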
The abstract syntax tree is then converted into a binary tree.
S2, sequence generation step: convert the abstract syntax tree (AST) into two different sequences simultaneously. Specifically, the invention uses two conversion modes, as follows:
First: obtain the preorder sequence and the inorder sequence by preorder and inorder traversal, and concatenate the preorder sequence and the inorder sequence as the input to the subsequent LSTM network;
Second: obtain the inorder sequence and the postorder sequence by inorder and postorder traversal, and concatenate the inorder sequence and the postorder sequence as the input to the subsequent LSTM network.
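The sequence generation step can be sketched as two traversals of the binary tree followed by list concatenation (a minimal illustration; the tree and names are not from the patent):

```python
# Sequence generation: traverse the binary tree twice and concatenate the
# results into a single LSTM input sequence, in either of the two modes.

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def preorder(t):
    return [] if t is None else [t.val] + preorder(t.left) + preorder(t.right)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.val] + inorder(t.right)

def postorder(t):
    return [] if t is None else postorder(t.left) + postorder(t.right) + [t.val]

#       A
#      / \
#     B   C
bt = Node("A", Node("B"), Node("C"))

mode1 = preorder(bt) + inorder(bt)    # first mode:  "preorder + inorder"
mode2 = inorder(bt) + postorder(bt)   # second mode: "inorder + postorder"
assert mode1 == ["A", "B", "C", "B", "A", "C"]
assert mode2 == ["B", "A", "C", "B", "C", "A"]
```

Either concatenated sequence determines the binary tree (and hence the AST) uniquely, which is the property the invention relies on.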
S3, model training step: input the two different sequences into the LSTM model to train the language model. The two sequences obtained from the parsing in step S2 are used for a recurrent neural network language model based on long short-term memory (LSTM). The LSTM model is a stacked LSTM model, located at the hidden layer of an RNN model.
S4, prediction/completion step: complete code according to the trained language model. In this step, a partial code fragment is input into the trained language model, which outputs recommended code elements according to the context.
As shown in Fig. 3, according to another aspect of the present invention, a code completion system 100 based on dual AST sequences is also provided, comprising the following sequentially connected modules:
a source code processing module 110, which analyzes the source code using an abstract syntax tree; preferably, the source code processing module parses the source code into different forms to obtain the classes, method lists, and code identifiers of the code;
a sequence generating module 120, which converts the abstract syntax tree (AST) into two different sequences simultaneously;
a model training module 130, which inputs the two different sequences into the LSTM model to train the language model;
a prediction/completion module 140, which completes code according to the trained language model.
In the specific embodiment described with reference to Fig. 2, the data set used by the invention is the C/C++ program source code of 104 problem classes from POJ (the Peking University Online Judge system); each class corresponds to one problem of the system and contains 500 source files submitted by students that fully satisfy the problem requirements, with the classes of the files labeled 1-104. In the preprocessing stage, the AST of the C/C++ code is first obtained with the pycparser tool; the AST is then converted into a binary tree by a uniform method; the preorder/inorder/postorder sequences are obtained by preorder/inorder/postorder traversal (the traversal results are sequences of the node types in the AST); and the input sequence is obtained by concatenation: preorder sequence + inorder sequence, or inorder sequence + postorder sequence. As described above, either concatenated input sequence corresponds one-to-one with the source program. In the network, the input sequence first passes through an embedding layer; each node is converted into a one-hot vector (whose length is the vocabulary size), and each such one-hot vector becomes the actual input of one time step. After the one-hot input of the last node of the input sequence, a predicted output one-hot vector is obtained; after a softmax layer (not shown in the figure), the index of the maximum value (the vector length is the number of program classes, i.e., 104) is taken as the predicted class of the input sequence. During training, labels are quantized as one-hot vectors (of length 104) in which only the entry for the corresponding class is 1, and the loss function used is cross-entropy.
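The encoding and decoding ends of this pipeline can be sketched in a few lines of pure Python (the real network in between is an LSTM; node-type names and the high-scoring class below are illustrative assumptions):

```python
# Minimal sketch of the embodiment's encoding/decoding pieces: one-hot node
# encoding, a softmax over 104 class logits, and argmax for the prediction.

import math

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["FuncDef", "Decl", "Assign", "BinOp"]   # illustrative node types
x = one_hot(vocab.index("Assign"), len(vocab))   # input of one time step
assert x == [0.0, 0.0, 1.0, 0.0]

logits = [0.1] * 104                             # 104 POJ problem classes
logits[41] = 3.0                                 # pretend one class scores high
probs = softmax(logits)
pred = max(range(104), key=probs.__getitem__)    # argmax over the softmax
assert pred == 41 and abs(sum(probs) - 1.0) < 1e-9
```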
In Fig. 2, Xi denotes the input of the current time step, i.e., the embedding of the one-hot vector of an AST node; hi denotes the hidden layer state of the current time step; and y denotes the current output. Since this is a program classification task, the invention only needs to output the corresponding predicted program class at the time step that reads the last input; in practice, y is passed through a softmax layer and an argmax to determine the corresponding class.
Experiments and results
To demonstrate the effect of the model, the present invention also ran two groups of control experiments. The first is the original version, whose input sequence is the lexical analysis result of the program, i.e., the token sequence/word sequence of the source code is used directly as input. The second uses, as input, a traversal-sequence combination that is not in one-to-one correspondence with the program; for simplicity, the invention uses the depth-first traversal node sequence of the AST for this purpose.
Experimental setup (hyperparameters):
Hidden size=300
Batch size=22
Learning rate=1e-4
Experimental results: the training processes for the two concatenated input sequences are shown in Figs. 4 and 5. The "preorder + inorder" input finally converges to a prediction accuracy of 90.08% (Fig. 4), and the "inorder + postorder" input finally converges to an accuracy of 88.43% (Fig. 5).
The results of the other two control experiments are shown in Fig. 6, where ast2bt_iap denotes the inorder sequence concatenated with the postorder sequence as input, ast2bt_pai denotes the preorder sequence concatenated with the inorder sequence as input, ast_dfs denotes the depth-first traversal sequence of the AST as input, and src_code denotes the token sequence of the source code used directly as input. Table 1 below lists the test accuracy obtained for each of the four model inputs.
Table 1
Model input                                    Test accuracy
AST to binary tree: preorder + inorder         90.08%
AST to binary tree: inorder + postorder        88.43%
Depth-first traversal sequence of the AST      88.60%
Token sequence of the source code              91.87%
It can be seen that even shallow structural information of the program achieves very good results on the program classification task. The ast_dfs method (depth-first traversal sequence as input) clearly converges more slowly, because the depth-first traversal sequence does not correspond one-to-one with the source program, so some of the programs' distinguishability is lost; the model therefore naturally learns more slowly, and its accuracy is also lower than that of the other methods. The convergence of src_code is fast, partly because its input sequences are shorter (less than half the length of the concatenated-sequence inputs), so the model can learn faster; moreover, since the LSTM model itself can extract higher-level abstract information of a program, the model using the dual concatenated AST sequences as input does not yet exploit the structural information of the program fully. The key point is that even with only incomplete structural information of the program (the dual concatenated AST sequences cannot fully reflect the structure of the entire program, but such serialization must be accepted in order to serialize for the LSTM model), good results can be obtained on the program classification task, differing from the source-code-input experiment by less than 2%, which already demonstrates the ability of the model.
Overall, the model of the invention models the structural information of the program well. Furthermore, the model can also be used for other tasks: as long as a complete AST structure can be obtained, similar processing applies. On the other hand, if a more structured network could be used, taking the original AST, or a data structure that fully reflects the program structure, as input, the experimental results might be even better.
It should be understood that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual device, or other equipment. Various general-purpose devices may also be used with the teachings herein. As described above, the structure required to construct such devices is obvious. Moreover, the present invention is not directed to any particular programming language. It should be understood that the contents of the invention described herein can be realized with various programming languages, and the above description of a specific language is made to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided here. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. The disclosed method, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. Modules, units, or components of an embodiment can be combined into one module, unit, or component, and they can also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of a creating device according to embodiments of the invention. The invention can also be implemented as device or apparatus programs (e.g., computer programs and computer program products) for executing part or all of the methods described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.
The above are merely preferred embodiments of the present invention, but the protection scope of the invention is not limited thereto. Any change or substitution that can readily be conceived by any person skilled in the art within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

1. A code completion method based on dual AST sequences, characterized by comprising:
a source code processing step, which analyzes the source code using an abstract syntax tree;
a sequence generation step, which converts the abstract syntax tree into two different sequences simultaneously;
a model training step, which inputs the two different sequences into an LSTM model to train a language model;
a prediction/completion step, which completes code according to the trained language model.
2. the code completion method according to claim 1 based on double AST sequences, it is characterised in that:
In source code processing step, the source code is resolved to different form, to obtain class, the method list, generation of code Code identifier.
3. The code completion method based on double AST sequences according to claim 1, characterized in that:
the sequence generation step comprises: obtaining a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenating the preorder sequence and the inorder sequence as the input of the subsequent LSTM network.
4. The code completion method based on double AST sequences according to claim 3, characterized in that:
the sequence generation step further comprises: obtaining an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenating the inorder sequence and the postorder sequence as the input of the subsequent LSTM network.
5. The code completion method based on double AST sequences according to claim 1 or 2, characterized in that:
the LSTM model is a stacked LSTM model.
6. The code completion method based on double AST sequences according to claim 1, characterized in that:
in the prediction completion step, a partial code fragment is input into the trained language model, which thereby outputs recommended code elements based on the context.
7. A code completion system based on double AST sequences, characterized by comprising the following sequentially connected modules:
a source code processing module, for parsing source code using an abstract syntax tree;
a sequence generation module, for simultaneously converting the abstract syntax tree into two different sequences;
a model training module, for inputting the two different sequences into an LSTM model to train a language model; and
a prediction completion module, for completing code according to the trained language model.
8. The code completion system based on double AST sequences according to claim 7, characterized in that:
the source code processing module parses the source code into different forms so as to obtain the classes, method lists, and code identifiers of the code.
9. The code completion system based on double AST sequences according to claim 7, characterized in that:
the sequence generation module obtains a preorder sequence and an inorder sequence by preorder traversal and inorder traversal, and concatenates the preorder sequence and the inorder sequence as the input of the subsequent LSTM network.
10. The code completion system based on double AST sequences according to claim 9, characterized in that:
the sequence generation module further obtains an inorder sequence and a postorder sequence by inorder traversal and postorder traversal, and concatenates the inorder sequence and the postorder sequence as the input of the subsequent LSTM network.
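As an illustration of the source code processing step in the claims above (not part of the claims themselves), Python's built-in `ast` module can parse source into an abstract syntax tree, and one possible serialization flattens it into a sequence of node-type names; the helper names here are illustrative, not taken from the patent:

```python
import ast

def source_to_ast(source: str) -> ast.AST:
    """Source code processing step: parse source code into an abstract syntax tree."""
    return ast.parse(source)

def ast_node_types(tree: ast.AST) -> list:
    """Flatten the AST into a sequence of node-type names (one possible serialization)."""
    return [type(node).__name__ for node in ast.walk(tree)]

source = "def add(a, b):\n    return a + b\n"
tree = source_to_ast(source)
types = ast_node_types(tree)
print(types)
```

From this tree, information such as function definitions and identifiers (claims 2 and 8) can be read directly off the corresponding node types.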
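Since inorder traversal is only well defined for binary trees, an implementation of the traversal claims above would presumably binarize the AST first (the patent does not specify how). Assuming a small binarized tree, the two sequence pairs and their concatenations could be sketched as follows (all names are illustrative):

```python
class Node:
    """A node of a binarized AST, labeled with its node type."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def preorder(n):
    return [n.label] + preorder(n.left) + preorder(n.right) if n else []

def inorder(n):
    return inorder(n.left) + [n.label] + inorder(n.right) if n else []

def postorder(n):
    return postorder(n.left) + postorder(n.right) + [n.label] if n else []

# Tiny binarized "AST" for an assignment of a binary expression.
tree = Node("Assign", Node("Name"), Node("BinOp", Node("Num"), Node("Num")))

# Claim 3: concatenate preorder + inorder as one LSTM input sequence.
seq_pre_in = preorder(tree) + inorder(tree)
# Claim 4: concatenate inorder + postorder.
seq_in_post = inorder(tree) + postorder(tree)
```

A plausible motivation for these particular pairs is that a preorder sequence together with an inorder sequence uniquely determines a binary tree (likewise inorder plus postorder), so the concatenated input carries the full tree structure in flat form.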
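The prediction completion step feeds a partial code fragment to the trained language model and recommends the most probable next element. The trained stacked LSTM cannot be reproduced here, so a toy bigram frequency model stands in for it purely to illustrate the completion interface:

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy stand-in for the trained language model; not the patent's LSTM."""
    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sequences):
        # Count which element follows which across the training sequences.
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.next_counts[a][b] += 1

    def complete(self, partial):
        """Recommend the most frequent continuation of the last element."""
        counts = self.next_counts.get(partial[-1])
        return counts.most_common(1)[0][0] if counts else None

model = BigramModel()
model.train([["for", "i", "in", "range"], ["for", "j", "in", "range"]])
print(model.complete(["for", "i", "in"]))
```

A real implementation would replace `BigramModel` with the stacked LSTM of claim 5, but the completion interface (partial context in, recommended element out) is the same.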
CN201811224521.XA 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences Pending CN109582352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811224521.XA CN109582352A (en) 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences


Publications (1)

Publication Number Publication Date
CN109582352A true CN109582352A (en) 2019-04-05

Family

ID=65920215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811224521.XA Pending CN109582352A (en) 2018-10-19 2018-10-19 A kind of code completion method and system based on double AST sequences

Country Status (1)

Country Link
CN (1) CN109582352A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070277163A1 (en) * 2006-05-24 2007-11-29 Syver, Llc Method and tool for automatic verification of software protocols
CN102185930A (en) * 2011-06-09 2011-09-14 北京理工大学 Method for detecting SQL (structured query language) injection vulnerability
US9928040B2 (en) * 2013-11-12 2018-03-27 Microsoft Technology Licensing, Llc Source code generation, completion, checking, correction
US20180088937A1 (en) * 2016-09-29 2018-03-29 Microsoft Technology Licensing, Llc Code refactoring mechanism for asynchronous code optimization using topological sorting
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 A method of based on LSTM auto-complete codes
CN108563433A (en) * 2018-03-20 2018-09-21 北京大学 A kind of device based on LSTM auto-complete codes
CN108595165A (en) * 2018-04-25 2018-09-28 清华大学 A kind of code completion method, apparatus and storage medium based on code intermediate representation


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DRCRYPTO: "Determining from preorder + postorder traversals whether the sequence is unique, and outputting one inorder sequence", 《HTTPS://BLOG.CSDN.NET/U011240016/ARTICLE/DETAILS/53193754》 *
JIAN LI et al.: "Code Completion with Neural Attention and Pointer Networks", 《INT'L JOINT CONF. ON ARTIFICIAL INTELLIGENCE (IJCAI)》 *
VESELIN RAYCHEV et al.: "Code Completion with Statistical Language Models", 《ACM》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223553B (en) * 2019-05-20 2021-08-10 北京师范大学 Method and system for predicting answer information
CN110223553A (en) * 2019-05-20 2019-09-10 北京师范大学 A kind of prediction technique and system of answering information
CN111966817A (en) * 2020-07-24 2020-11-20 复旦大学 API recommendation method based on deep learning and code context structure and text information
CN111966817B (en) * 2020-07-24 2022-05-20 复旦大学 API recommendation method based on deep learning and code context structure and text information
CN112035099A (en) * 2020-09-01 2020-12-04 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112035099B (en) * 2020-09-01 2024-03-15 北京天融信网络安全技术有限公司 Vectorization representation method and device for nodes in abstract syntax tree
CN112905188A (en) * 2021-02-05 2021-06-04 中国海洋大学 Code translation method and system based on generation type countermeasure GAN network
CN112860362B (en) * 2021-02-05 2022-10-04 达而观数据(成都)有限公司 Visual debugging method and system for robot automation process
CN112860362A (en) * 2021-02-05 2021-05-28 达而观数据(成都)有限公司 Visual debugging method and system for robot automation process
CN113010182A (en) * 2021-03-25 2021-06-22 北京百度网讯科技有限公司 Method and device for generating upgrade file and electronic equipment
CN113076089A (en) * 2021-04-15 2021-07-06 南京大学 API completion method based on object type
CN113076089B (en) * 2021-04-15 2023-11-21 南京大学 API (application program interface) completion method based on object type
CN113064586A (en) * 2021-05-12 2021-07-02 南京大学 Code completion method based on abstract syntax tree augmented graph model
CN113064586B (en) * 2021-05-12 2022-04-22 南京大学 Code completion method based on abstract syntax tree augmented graph model
CN117573084A (en) * 2023-08-02 2024-02-20 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573084B (en) * 2023-08-02 2024-04-12 广东工业大学 Code complement method based on layer-by-layer fusion abstract syntax tree
CN117573085A (en) * 2023-10-17 2024-02-20 广东工业大学 Code complement method based on hierarchical structure characteristics and sequence characteristics
CN117573085B (en) * 2023-10-17 2024-04-09 广东工业大学 Code complement method based on hierarchical structure characteristics and sequence characteristics

Similar Documents

Publication Publication Date Title
CN109582352A (en) A kind of code completion method and system based on double AST sequences
CN108388425A (en) A method of based on LSTM auto-complete codes
Biallas et al. Arcade. PLC: A verification platform for programmable logic controllers
Chakraborty et al. On multi-modal learning of editing source code
Marre et al. Test sequences generation from lustre descriptions: Gatel
Medeiros et al. DEKANT: a static analysis tool that learns to detect web application vulnerabilities
CN108595341B (en) Automatic example generation method and system
WO2019075390A1 (en) Blackbox matching engine
CN108563433A (en) A kind of device based on LSTM auto-complete codes
WO2019051426A1 (en) Pruning engine
CN109614103A (en) A kind of code completion method and system based on character
WO2018226598A1 (en) Method and system for arbitrary-granularity execution clone detection
CN109492402A (en) A kind of intelligent contract safe evaluating method of rule-based engine
CN106682343A (en) Method for formally verifying adjacent matrixes on basis of diagrams
CN114911711A (en) Code defect analysis method and device, electronic equipment and storage medium
Shrestha et al. DeepFuzzSL: Generating models with deep learning to find bugs in the Simulink toolchain
CN107194065A (en) A kind of method for being checked in PCB design and setting binding occurrence
CN108563561B (en) Program implicit constraint extraction method and system
CN106775913A (en) A kind of object code controlling stream graph generation method
Xu et al. Dsmith: Compiler fuzzing through generative deep learning model with attention
Meffert Supporting design patterns with annotations
Hashtroudi et al. Automated test case generation using code models and domain adaptation
Ribeiro et al. Gpt-3-powered type error debugging: Investigating the use of large language models for code repair
US7543274B2 (en) System and method for deriving a process-based specification
KR102421274B1 (en) voice control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination