CN111966817B

CN111966817B - API recommendation method based on deep learning and code context structure and text information

Info

Publication number: CN111966817B
Application number: CN202010723230.6A
Authority: CN
Inventors: 彭鑫; 陈驰; 赵文耘
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2022-05-20
Anticipated expiration: 2040-07-24
Also published as: CN111966817A

Abstract

The invention belongs to the technical field of software engineering and specifically relates to a deep-learning-based API (application programming interface) recommendation method that fuses code structure and text information. The method comprises: constructing a large number of training samples by analyzing a large body of source code that uses the target APIs; constructing a deep learning network that fuses code structure and text information; training the deep learning network on the training samples to obtain a trained network; and performing intelligent API recommendation with the trained network. The invention designs an API context graph for representing code context, a method for processing code text information, and a deep learning network that jointly learns from code structure and text information. The invention provides software developers with intelligent API recommendation based on code context: working from the code the developer has already written, it recommends, line by line, the APIs the developer may need next, including API method calls, member variable accesses, control statements, and variable and object instance declarations.

Description

API recommendation method based on deep learning and code context structure and text information

Technical Field

The invention belongs to the technical field of software engineering. It relates to intelligent recommendation and coding assistance in software development, and in particular to an API recommendation method for use during software coding.

Background

In the software development task, a software developer relies on an Application Programming Interface (API) to implement required software features. However, the number of APIs is large and each API contains a large number of method calls and member variables, which makes it difficult for developers to know the functions of all APIs and their applications. In addition, many APIs have their specific usage requirements, such as usage combinations between APIs, calling order, corresponding control structures (e.g., condition judgment, loop, etc.), and so on. Therefore, how to select an appropriate API in a particular code context to accomplish a particular development task often becomes a difficult problem for developers.

One effective solution to this problem is to provide intelligent API recommendation in development tools such as an integrated development environment (IDE). Such a capability should recommend APIs (e.g., API method calls and member variables) by analyzing and reasoning about the code context the developer is writing, so as to help the developer select the correct API and complete the development task efficiently and with high quality.

Software code contains two core types of information: structural information and textual information. Code structure information (e.g., control flow and data flow) represents, in the form of a graph, the program logic of the software feature being implemented. Code text information (e.g., code comments, method names, and variable names) reflects, in natural language, the semantics of that feature. The invention therefore designs a deep learning network that fuses code structure information and text information to achieve more effective API recommendation.

Disclosure of Invention

The invention aims to provide software developers with an intelligent API recommendation method based on code context. Working from the code the developer has already written, the method recommends, line by line, the APIs the developer may need, including API method calls, member variable accesses, control statements (if, while, and the like), and variable and object instance declarations.

The API recommendation method provided by the invention is based on a custom-designed deep learning network that fuses code structure and text information. Specifically, the method constructs a large number of training samples by analyzing a large body of source code (open-source or enterprise code) that uses the target APIs (such as the APIs of the JDK and Android); constructs a deep learning network that fuses code structure and text information; trains the deep learning network on the training samples to obtain a trained network; and performs intelligent API recommendation with the trained network.

The invention designs an API context graph for representing the structure of a code context. The API context graph is a directed graph (N, E), where N represents the set of nodes and E represents the set of edges. Each node in N represents an API method call, an API member variable access, a variable declaration, a variable assignment, a control unit, or a hole. Edges represent control flow and data flow relationships between nodes. Fig. 1 is an example of an API context graph (see the appendix for the source code corresponding to the API context graph of Fig. 1).

Each node in the API context graph is an abstract representation: variables and constants in the code are abstracted away, and only API object creation, method calls, attribute accesses, control nodes (if, while, etc.), variable declarations, and the like are retained. An API in the code is abstracted to its complete method signature; for example, result.add(hashCode) in the code shown in the appendix is abstracted to java.util.ArrayList.add(java.lang.Object). A variable declaration or assignment node abstracts a variable declaration or assignment in the code to a representation that ignores the variable name and the assigned constant, such as String null in the code shown in the appendix, which is abstracted to java.

Control structure nodes are labeled If, Elseif, Else, While, DoWhile, For, Foreach, Try, Catch, Finally, Switch, Case, and Default, each representing the corresponding control structure. A control structure node may have multiple child nodes that represent the code in different control flows. For example, a While control node may have child nodes that represent the code of the condition part, the code inside the structure body, and the code outside the structure body. The Condition node represents the initial node of the code in the condition of the While node (if the condition contains an API call, a corresponding node is generated as a child of the Condition node, such as the java.io.BufferedReader.readLine() node). The Body node represents the initial node of the code in the body of the While node, and the children of the Body node are processed in the same way as the children of the Condition node, such as the int.declaration node. The node corresponding to the first API call after leaving the body of the While node becomes a child of the While node, such as a java.

The edges between nodes are classified, according to the control flow and data flow relationships between the nodes, into four types: control flow, data flow, control flow and data flow, and special. The control flow type indicates that a control flow relationship exists between two nodes; the data flow type indicates that a data flow relationship exists between two nodes; the control flow and data flow type indicates that both relationships exist between two nodes; and the special type is the type of any edge connected to a Hole node. Given a piece of code, parsing starts from the first line of the code, and the API context graph of the code is obtained iteratively.
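As an illustration of the structure described above, the following minimal Python sketch models an API context graph with the four edge types. The names ApiContextGraph, EdgeType, add_node, and add_edge are hypothetical and do not come from the invention, and the small graph built at the end is illustrative only; it is not the graph of Fig. 1.

```python
from dataclasses import dataclass, field
from enum import Enum


class EdgeType(Enum):
    CONTROL_FLOW = "control flow"
    DATA_FLOW = "data flow"
    CONTROL_AND_DATA_FLOW = "control flow and data flow"
    SPECIAL = "special"  # any edge connected to a Hole node


@dataclass
class ApiContextGraph:
    labels: list = field(default_factory=list)  # node index -> abstract label
    edges: list = field(default_factory=list)   # (source, target, EdgeType)

    def add_node(self, label):
        self.labels.append(label)
        return len(self.labels) - 1

    def add_edge(self, src, dst, control=False, data=False):
        # Edges touching a Hole node are always of the special type.
        if "Hole" in (self.labels[src], self.labels[dst]):
            kind = EdgeType.SPECIAL
        elif control and data:
            kind = EdgeType.CONTROL_AND_DATA_FLOW
        elif control:
            kind = EdgeType.CONTROL_FLOW
        elif data:
            kind = EdgeType.DATA_FLOW
        else:
            raise ValueError("an edge needs a control flow or data flow relationship")
        self.edges.append((src, dst, kind))
        return kind


# Illustrative graph (an assumption, not Fig. 1): a While loop whose
# condition calls readLine() and whose body ends in a hole.
graph = ApiContextGraph()
loop = graph.add_node("While")
cond = graph.add_node("Condition")
read = graph.add_node("java.io.BufferedReader.readLine()")
body = graph.add_node("Body")
hole = graph.add_node("Hole")
graph.add_edge(loop, cond, control=True)
graph.add_edge(cond, read, control=True)
graph.add_edge(loop, body, control=True)
graph.add_edge(body, hole, control=True)
```

Keeping the edge type on each edge matters because a gated graph neural network typically uses a separate weight matrix per edge type.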

The invention designs a method for processing code text information to obtain a code token bag, comprising the following steps:

(1) remove the digits from the method names, parameter names, and variable names in the code text information; for example, "file2" becomes "file";

(2) split the method names, parameter names, and variable names on the two special characters "_" and "$"; for example, "file_name" is split into the two tokens "file" and "name";

(3) further split the resulting tokens according to the camel case naming convention (see reference 12); for example, "fileName" is split into the two tokens "file" and "name";

(4) lemmatize each resulting token; for example, "files" is reduced to "file";

(5) filter out duplicate and meaningless tokens, which yields the code token bag. Meaningless tokens include single-character tokens (e.g., "i" and "j") and tokens not contained in the GloVe vocabulary. The GloVe vocabulary contains 400K unique tokens obtained from Wikipedia and Gigaword.
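The five steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: TOY_VOCAB is a tiny hypothetical stand-in for the GloVe vocabulary, toy_lemmatize is a placeholder for a real lemmatizer, and the function names are illustrative rather than taken from the invention.

```python
import re

# Hypothetical stand-in for the GloVe vocabulary (the real one has about 400K tokens).
TOY_VOCAB = {"file", "name", "compute", "hash", "code", "path", "result"}


def toy_lemmatize(token):
    # Placeholder for a real lemmatizer: strips a plural "s" only when
    # the remaining stem is a known vocabulary word.
    if token.endswith("s") and token[:-1] in TOY_VOCAB:
        return token[:-1]
    return token


def code_token_bag(identifiers, vocab=TOY_VOCAB, lemmatize=toy_lemmatize):
    bag = []
    for identifier in identifiers:
        identifier = re.sub(r"\d+", "", identifier)       # step 1: remove digits
        for part in re.split(r"[_$]+", identifier):       # step 2: split on "_" and "$"
            # step 3: camel case split (acronym runs stay together)
            for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part):
                token = lemmatize(word.lower())           # step 4: lemmatize
                # step 5: drop single characters, out-of-vocabulary tokens, duplicates
                if len(token) > 1 and token in vocab and token not in bag:
                    bag.append(token)
    return bag


bag = code_token_bag(["file2", "file_name", "fileName", "files", "i", "computeHashCode"])
```

The bag is kept as an ordered list without duplicates so that results are reproducible; the method itself only requires a bag without duplicates, so a set would serve equally well.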

The invention designs a deep learning network that jointly learns from the context code structure and text information, as shown in Fig. 2. The network comprises an API context graph network, a code Token network, a combination layer, and a Softmax function. The API context graph network learns code structure features. It consists of an embedding layer and a gated graph neural network (GG-NN; see reference 1), and learns an API context graph vector from a given API context graph. The code Token network learns code text features. It consists of an embedding layer, several hidden layers, and a Sum operation, and learns a Token vector from a given code token bag. The combination layer fuses the code structure features and the code text features into a joint vector. The Softmax function computes the probability of each candidate API from the joint vector for API recommendation.
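To make the role of the gated graph neural network more concrete, the following pure-Python sketch shows one common formulation of gated propagation over a typed graph: each node sums messages from its predecessors through a per-edge-type matrix and then updates its state with a GRU-style gate. It is a sketch under assumptions, not the invention's implementation: the weights are random and untrained, messages flow only along edge direction, and the graph vector is a plain sum of node states rather than a learned readout. All function names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_square(dim, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def ggnn_graph_vector(node_states, edges, edge_types, steps=2, seed=0):
    """node_states: one vector per node; edges: (src, dst, type) triples."""
    rng = random.Random(seed)
    dim = len(node_states[0])
    w_edge = {kind: random_square(dim, rng) for kind in edge_types}
    w_z, u_z, w_r, u_r, w_h, u_h = (random_square(dim, rng) for _ in range(6))
    states = [list(s) for s in node_states]
    for _ in range(steps):
        # Message passing: sum the transformed states of all predecessors.
        messages = [[0.0] * dim for _ in states]
        for src, dst, kind in edges:
            for i, m in enumerate(matvec(w_edge[kind], states[src])):
                messages[dst][i] += m
        # GRU-style update of every node state from its incoming message.
        updated = []
        for a, h in zip(messages, states):
            z = [sigmoid(p + q) for p, q in zip(matvec(w_z, a), matvec(u_z, h))]
            r = [sigmoid(p + q) for p, q in zip(matvec(w_r, a), matvec(u_r, h))]
            reset_h = [ri * hi for ri, hi in zip(r, h)]
            cand = [math.tanh(p + q)
                    for p, q in zip(matvec(w_h, a), matvec(u_h, reset_h))]
            updated.append([(1 - zi) * hi + zi * ci
                            for zi, hi, ci in zip(z, h, cand)])
        states = updated
    # Simplified readout (an assumption): sum node states into one graph vector.
    return [sum(column) for column in zip(*states)]


nodes = [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2], [0.4, -0.1, 0.0]]
edges = [(0, 1, "control"), (1, 2, "data")]
vector = ggnn_graph_vector(nodes, edges, ["control", "data"])
```

In a trained network the node vectors would come from the embedding layer and all matrices would be learned; the sketch only shows how information moves along control flow and data flow edges.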

The invention provides an API recommendation method based on deep learning and integrating a code structure and text information, which comprises the following specific steps:

(I) Constructing training samples for training the deep learning network, comprising the following sub-steps:

(1) taking the method as the minimum unit, parsing each method in each source code file in the code base to obtain an API context graph and a code token bag;

(2) for each parsed API context graph, traversing iteratively from the root node of the graph, removing the last N nodes that have a control flow relationship with the currently traversed node, and replacing the removed N nodes with a single Hole node, thereby obtaining an API context graph with a hole; for the API context graph with a hole, re-parsing to obtain the bag formed by the corresponding remaining code tokens; the API context graph with a hole, the bag of remaining code tokens, and the label of the first replaced (removed) node together constitute one training sample; repeating this process to construct a number of training samples;

(II) constructing a deep learning network fusing a code structure and text information;

(III) Training the deep learning network

All training samples are input into the learning model, which is trained to obtain a trained deep learning model. Specifically, the training samples are divided into a training set and a validation set at a ratio of 9:1. The training set is used to train and optimize the model parameters, and the validation set is used to evaluate the model after each round of training. If the model's performance on the validation set has not improved for 5 consecutive rounds of training, training stops and the model from 5 rounds earlier is taken as the final model;

(IV) Performing API prediction and recommendation with the trained deep learning model, comprising the following sub-steps:

(1) the user inputs a program with a hole;

(2) the user input is parsed into an API context graph and a code token bag, which are input into the deep learning model;

(3) the deep learning model is run and produces API recommendation results;

(4) the user selects from the API recommendation results;

(5) the program currently input by the user is updated according to the user's selection.
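Steps (I), (III), and (IV) above can be sketched in Python as follows. This is a simplified illustration under several assumptions: a method is treated as a straight-line sequence of nodes (real API context graphs branch), everything after the cut point is replaced by one Hole node, and the model is represented by plain callables. All function and parameter names are hypothetical.

```python
import random

HOLE = "Hole"


def build_samples(statements):
    """statements: (node_label, tokens) pairs in control-flow order.
    Each cut keeps a prefix, replaces the rest with a hole, and labels
    the sample with the first removed node."""
    samples = []
    for cut in range(1, len(statements)):
        kept = statements[:cut]
        nodes = [label for label, _ in kept] + [HOLE]
        bag = []
        for _, tokens in kept:  # bag of the remaining code tokens
            for token in tokens:
                if token not in bag:
                    bag.append(token)
        samples.append((nodes, bag, statements[cut][0]))
    return samples


def split_samples(samples, train_parts=9, validation_parts=1, seed=0):
    """Shuffle and split into training and validation sets (9:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + validation_parts)
    return shuffled[:cut], shuffled[cut:]


def train_with_early_stopping(train_one_round, evaluate, patience=5, max_rounds=100):
    """Stop after `patience` consecutive rounds without improvement on the
    validation set and return the best model seen before them."""
    best_score, best_round, best_model = float("-inf"), -1, None
    for round_no in range(max_rounds):
        model = train_one_round(round_no)
        score = evaluate(model)
        if score > best_score:
            best_score, best_round, best_model = score, round_no, model
        elif round_no - best_round >= patience:
            break
    return best_model, best_round


def recommend_once(lines, predict, choose, k=10):
    """Fill the hole in `lines` with the candidate the user chooses from the
    k most probable APIs. `predict` stands in for parsing plus the model."""
    hole_at = lines.index(HOLE)
    probabilities = predict(lines)
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    return lines[:hole_at] + [choose(ranked)] + lines[hole_at + 1:]
```

For example, `recommend_once(["a", "Hole", "b"], lambda lines: {"x": 0.2, "y": 0.7}, lambda ranked: ranked[0])` returns `["a", "y", "b"]`, with the highest-ranked candidate standing in for the user's choice.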

The deep-learning-based API recommendation method that fuses code structure and text information has the following characteristics:

(1) an API context graph is designed for representing the code context, i.e., the API context and structure information of the code. The API context graph is a directed graph (N, E), where N represents the set of nodes and E represents the set of edges. Each node in N represents an API method call, an API member variable access, a variable declaration, a variable assignment, a control unit, or a hole. Edges represent control flow and data flow relationships between nodes;

(2) a method for processing code text information is designed, which splits the method names, parameter names, and variable names in the code into words, thereby tokenizing them;

(3) a deep learning network that jointly learns from code structure and text information is designed, comprising an API context graph network, a code Token network, a combination layer, and a Softmax function.

Existing API recommendation methods use either code structure information or code text information in isolation. Based on observations of the naturalness of source code (see reference 2), many approaches (see references 2-5) propose using statistical language models to implement API recommendation. The statistical language models used in these methods may be simple or enhanced n-gram models, or complex deep learning models such as recurrent neural networks (RNNs) (see references 6-8). Regardless of the type of statistical language model, however, these methods essentially treat code as a sequence of text tokens (possibly enriched with simple program syntax information, such as program construct keywords and data types) and do not exploit the structure information of the source code. To overcome this limitation of token-sequence-based API recommendation, another important class of methods (see references 9-10) analyzes control flow and data flow graphs to recommend APIs, that is, it takes code structure information into account. However, these methods consider only the local semantics of subgraphs of the control flow and data flow graph rather than the graph as a whole, and they do not consider code text information. A further deep-learning-based API recommendation method (see reference 11) considers the structure information of the code and learns from the code as a whole in the form of a tree; however, its tree representation lacks data flow information, and it likewise does not consider code text information.

Compared with these methods, the invention overcomes two limitations of existing methods: that they model code structure information and code text information in isolation, and that they do not learn and reason about code structure from a holistic view. The deep learning model designed by the invention combines the API usage and the text information in code by means of an API context graph network and a code Token network, thereby learning code structure features and text features simultaneously for API recommendation. In a comparison with two state-of-the-art API recommendation methods, GraLan (see reference 9) and Tree-LSTM (see reference 11), the invention achieves top-1, top-5, and top-10 API recommendation accuracies of 58.6%, 81.4%, and 87.9%, respectively, versus 31.5%, 64.5%, and 77.6% for GraLan and 46.7%, 70.4%, and 79.3% for Tree-LSTM.
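The top-1, top-5, and top-10 figures above are presumably computed in the usual way: a recommendation counts as correct when the expected API appears among the k highest-ranked candidates. The text does not spell out the metric, so the following Python sketch rests on that assumption; the function name and the toy data are hypothetical.

```python
def top_k_accuracy(ranked_lists, expected, k):
    """Fraction of holes whose expected API is among the first k candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, expected)
               if truth in ranked[:k])
    return hits / len(expected)


# Toy data: three holes, each with a ranked candidate list.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
expected = ["a", "a", "a"]
top1 = top_k_accuracy(ranked_lists, expected, 1)
top3 = top_k_accuracy(ranked_lists, expected, 3)
```

Because a larger k can only add hits, top-k accuracy never decreases as k grows, which matches the ordering of the reported figures.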

Drawings

FIG. 1 is an example of an API context graph used by the present invention.

Fig. 2 is a diagram illustrating a deep learning network structure used in the present invention.

Detailed Description

One embodiment for Java programs and JDK APIs is as follows:

(1) The API context graph is built by using JavaParser to parse Java code, with the statement as the basic unit, into an AST (abstract syntax tree), and traversing the AST nodes corresponding to each statement using the visitor pattern. The complete list of APIs is obtained through the Java reflection mechanism and used to extract the APIs; the complete method signature of each API is obtained and added as a node to the current API context graph, and control flow relationships are established between the nodes. The uses of all variables and objects in the code are analyzed on the basis of the AST to obtain the data dependencies, and data flow relationships between the nodes are established from these data dependencies.

(2) The word segmentation of the method names, parameter names, and variable names in the code is implemented in Java. Lemmatization is performed with Stanford CoreNLP, and the GloVe vocabulary is obtained from the official GloVe website.

(3) The embedding layer in the API context graph network converts the API represented by each node in the API context graph into a 300-dimensional vector, and these vectors are input into the gated graph neural network (GG-NN; see reference 1) to obtain a 300-dimensional API context graph vector. The embedding layer in the code Token network converts each input code token into a 300-dimensional vector; three fully connected hidden layers of size 300 (each with a tanh activation function) then further learn the semantic information in the code tokens; finally, the Sum operation adds up all the code token vectors produced by the last hidden layer to obtain the final 300-dimensional Token vector. The combination layer first concatenates the 300-dimensional API context graph vector and the 300-dimensional Token vector into a 600-dimensional vector, and then further learns the joint semantics of the API context graph and the code tokens through a fully connected layer (with a tanh activation function) to obtain the final 600-dimensional vector. The Softmax function takes the 600-dimensional vector obtained from the combination layer as input and computes the normalized class probabilities as follows:

$$p(y \mid x) = \frac{\exp(W_y \cdot x)}{\sum_{c} \exp(W_c \cdot x)}$$

where p(y | x) denotes the probability of class y given the input x, exp denotes the exponential function with base e, W_y represents the parameters of the deep learning network corresponding to class y, W_c represents the parameters of the deep learning network corresponding to a class c, and the sum runs over all candidate classes c.

(4) The deep learning network is implemented with the APIs provided by TensorFlow; the implementation of the gated graph neural network is adapted from Microsoft's open-source code.

(5) Code recommendation embodiments. The user puts the cursor at a position in the code editor where API recommendation is needed, clicks a recommendation button in the code editor or presses a recommendation shortcut key, and therefore a recommendation list containing N recommendation results is popped up in the code editor. The user may select a recommendation from the list of recommendations, which will automatically fill in the location of the cursor.
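The data flow through the code Token network, the combination layer, and the Softmax function described in item (3) can be sketched in pure Python as follows. This is not the TensorFlow implementation of the embodiment: the weights are random and untrained, the API context graph vector is taken as a given input, and the class and function names (FusionNetwork, dense_tanh, and so on) are hypothetical. The default dimension of 300 follows the embodiment; the usage example uses a small dimension to stay fast.

```python
import math
import random


def dense_tanh(weights, bias, vector):
    """Fully connected layer with tanh activation: tanh(W v + b)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


def softmax(scores):
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow in exp.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FusionNetwork:
    def __init__(self, vocab, num_apis, dim=300, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.token_index = {token: i for i, token in enumerate(vocab)}
        self.embedding = random_matrix(len(vocab), dim, rng)
        # Three fully connected hidden layers of size dim, each with tanh.
        self.hidden = [(random_matrix(dim, dim, rng), [0.0] * dim) for _ in range(3)]
        # Combination layer: fully connected, 2*dim -> 2*dim, tanh.
        self.joint = (random_matrix(2 * dim, 2 * dim, rng), [0.0] * (2 * dim))
        # One parameter row W_c per candidate API class c.
        self.output = random_matrix(num_apis, 2 * dim, rng)

    def token_vector(self, bag):
        total = [0.0] * self.dim
        for token in bag:
            vec = self.embedding[self.token_index[token]]
            for weights, bias in self.hidden:
                vec = dense_tanh(weights, bias, vec)
            total = [t + v for t, v in zip(total, vec)]  # Sum operation
        return total

    def predict(self, graph_vector, bag):
        joint_in = list(graph_vector) + self.token_vector(bag)  # concatenation
        joint_out = dense_tanh(self.joint[0], self.joint[1], joint_in)
        scores = [sum(w * x for w, x in zip(row, joint_out)) for row in self.output]
        return softmax(scores)  # p(y | x) for every candidate API


network = FusionNetwork(vocab=["hash", "code"], num_apis=4, dim=8)
probabilities = network.predict([0.1] * 8, ["hash", "code"])
```

Summing the per-token vectors makes the Token vector independent of token order, which fits the bag representation of the code tokens.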

The JDK contains a large number of APIs, and these APIs are used frequently in Java programs, but it is difficult for developers to remember and master how to use all of them. The API recommendation method of the invention can therefore provide API recommendations to developers during development. The invention overcomes two limitations of existing methods: that they model code structure information and code text information in isolation, and that they do not learn and reason about code structure from a holistic view. The deep learning model designed by the invention combines the API usage and the text information in code by means of an API context graph network and a code Token network, thereby learning code structure features and text features simultaneously for API recommendation. The API context graph is constructed by parsing code with JavaParser so that it can be input into the deep learning network for prediction. The deep learning network is implemented with TensorFlow and can make full use of GPU acceleration, which speeds up prediction and recommendation.

Take the code in the appendix as an example. The user writes the code in the appendix in a code editor (e.g., IntelliJ IDEA) and does not know which API to use on line 8 to obtain the hash value of a string, so the user places the cursor on line 8 and clicks the recommendation button in the code editor to invoke the recommendation service of the invention. On receiving the call, the invention first uses JavaParser to parse the code in the appendix into the API context graph shown in Fig. 1, which contains two evident semantics: semantics 1) reading the contents of a file line by line with a reader; semantics 2) adding a value to a list that has already been created.

From a holistic view, the variable "str", declared as type String in semantics 1), is used only to store content from the file and is not used afterwards. Furthermore, the variable "hashCode", declared as type int in semantics 2), has not yet been assigned. In addition, an API is missing to connect semantics 1) and 2) and keep the program logic complete. From this holistic view it can be inferred that the hole position needs to process a String variable to obtain an int value. All of these semantics can be learned by the API context graph network. With the API context graph network alone, however, the model can predict only that the hole position needs to process a String variable to obtain an int value; it cannot determine the specific operation. The invention therefore introduces the code Token network to capture the code text information. When the code in the appendix is parsed, besides the API context graph shown in Fig. 1, the code tokens computer, hash, code, path, result, rd, br, and str are also obtained; of these, the three tokens computer, hash, and code carry the semantics of computing a hash value, which the code Token network can learn. By combining the API context graph network and the code Token network and learning jointly, the invention can predict that the hole position needs to compute the hash value of a String variable. The API that the invention successfully recommends for the hole position is thus java.

References:

1. Daniel Beck, Gholamreza Haffari, Trevor Cohn: Graph-to-Sequence Learning using Gated Graph Neural Networks. ACL (1) 2018: 273-283.

2. Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, Premkumar T. Devanbu: On the naturalness of software. ICSE 2012: 837-847.

3. Miltiadis Allamanis, Charles A. Sutton: Mining source code repositories at massive scale using language modeling. MSR 2013: 207-216.

4. Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, Tien N. Nguyen: A statistical semantic language model for source code. ESEC/SIGSOFT FSE 2013: 532-542.

5. Zhaopeng Tu, Zhendong Su, Premkumar T. Devanbu: On the localness of software. SIGSOFT FSE 2014: 269-280.

6. Veselin Raychev, Martin T. Vechev, Eran Yahav: Code completion with statistical language models. PLDI 2014: 419-428.

7. Hoa Khanh Dam, Truyen Tran, Trang Pham: A deep language model for software code. CoRR abs/1608.02715 (2016).

8. Anh Tuan Nguyen, Trong Duc Nguyen, Hung Dang Phan, Tien N. Nguyen: A deep neural network language model with contexts for source code. SANER 2018: 323-334.

9. Anh Tuan Nguyen, Tien N. Nguyen: Graph-Based Statistical Language Model for Code. ICSE (1) 2015: 858-868.

10. Xiaoyu Liu, LiGuo Huang, Vincent Ng: Effective API recommendation without historical software repositories. ASE 2018: 282-292.

11. Chi Chen, Xin Peng, Jun Sun, Zhenchang Xing, Xin Wang, Yifan Zhao, Hairui Zhang, Wenyun Zhao: Generative API usage code recommendation with parameter concretization. Sci. China Inf. Sci. 62(9): 192103:1-192103:22 (2019).

12. "Camel case," 2020. [Online]. Available: https://en.wikipedia.org/wiki/Camel_case

appendix:

Claims

1. An API recommendation method based on deep learning and fusing code structure and text information, characterized by comprising: constructing a large number of training samples by analyzing a large body of source code containing the target APIs; constructing a deep learning network fusing code structure and text information; training the deep learning network on the training samples to obtain a trained deep learning network; and performing intelligent API recommendation with the trained deep learning network; the specific steps being as follows:

(I) constructing samples for training the deep learning network, comprising the following sub-steps:

(2) for each parsed API context graph, traversing iteratively from the root node of the API context graph, removing the last N nodes that have a control flow relationship with the currently traversed node, and replacing the removed N nodes with a Hole node representing a hole, thereby obtaining an API context graph with a hole; for the API context graph with a hole, re-parsing to obtain the bag formed by the corresponding remaining code tokens; the API context graph with a hole, the bag formed by the remaining code tokens, and the label of the first replaced node then form one training sample; repeating this process to construct a number of training samples;

the deep learning network includes: an API context graph network, a code Token network, a combination layer, and a Softmax function; wherein the API context graph network is used for learning code structure features; the API context graph network consists of an embedding layer and a gated graph neural network (GG-NN), and learns an API context graph vector from a given API context graph; the code Token network is used for learning code text features; the code Token network consists of an embedding layer, a plurality of hidden layers, and a Sum operation, and learns a Token vector from a given code token bag; the combination layer is used for fusing the code structure features and the code text features into a joint vector; the Softmax function calculates the probability of each candidate API from the joint vector for API recommendation;

(III) training deep learning network

Inputting all training samples into a deep learning network model, and training to obtain a trained deep learning model;

(1) a user inputs a program with a hole;

(4) the user selects according to the API recommendation result;

(5) updating the program input by the current user according to the selection of the user;

wherein the API context graph is a directed graph (N, E), where N represents the set of nodes and E represents the set of edges; each node in N represents an API method call, an API member variable access, a variable declaration, a variable assignment, a control unit, or a hole; edges represent control flow and data flow relationships between nodes; each node in the API context graph is an abstract representation that abstracts away the variables and constants in the code and retains only API object creation, method calls/attribute accesses, control nodes, and variable declarations; APIs in the code are abstracted to complete method signatures; a variable declaration or assignment node abstracts a variable declaration or assignment in the code to a representation that ignores the variable name and the assigned constant; control structure nodes are labeled If, Elseif, Else, While, DoWhile, For, Foreach, Try, Catch, Finally, Switch, Case, and Default, each representing the corresponding control structure; a control structure node has a plurality of child nodes, which respectively represent the code in different control flows, the code inside the structure body, and the code outside the structure body; wherein the Condition node represents the initial node of the code in the condition of the While node; the Body node represents the initial node of the code in the body of the While node, and the child nodes of the Body node are processed in the same way as the child nodes of the Condition node; the node corresponding to the first API call after leaving the body of the While node becomes a child node of the While node; the edges between nodes are classified, according to the control flow and data flow relationships between the nodes, into control flow, data flow, control flow and data flow, and special types; the control flow type indicates that a control flow relationship exists between two nodes, the data flow type indicates that a data flow relationship exists between two nodes, the control flow and data flow type indicates that both a control flow relationship and a data flow relationship exist between two nodes, and the special type is the type of an edge connected to a Hole node; given a piece of code, parsing starts from the first line of the code, and the API context graph of the code is obtained iteratively;

the method for obtaining the code token bag comprises the following operations:

(1) removing the digits from the method names, parameter names, and variable names in the code text information;

(2) splitting the method names, parameter names, and variable names on the two special characters "_" and "$";

(3) splitting the resulting tokens according to the camel case naming convention;

(4) lemmatizing each token;

(5) filtering out duplicate and meaningless tokens; meaningless tokens include single-character tokens and tokens not contained in the GloVe vocabulary.

2. The deep learning-based API recommendation method for integrating the code structure and the text information according to claim 1, wherein the Softmax function calculates the probability of each candidate API based on the joint vector, and the specific formula is as follows: