CN114996567A

CN114996567A - API recommendation method based on context and graph learning

Info

Publication number: CN114996567A
Application number: CN202210487835.9A
Authority: CN
Inventors: 郭俊霞; 赖宝强; 李征; 赵瑞莲
Original assignee: Beijing University of Chemical Technology
Current assignee: Beijing University of Chemical Technology
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2022-09-02

Abstract

The API recommendation method based on context and graph learning analyzes and predicts APIs from limited context information, addressing the degradation of recommendation performance caused by cold start when context information is insufficient. The method models and analyzes APIs at a fine granularity and enriches the context information by fusing structure and attribute information, which provides more feature information for the API prediction task, uncovers more potential relationships and usage relationships, and expands the range of APIs that can be predicted. Link prediction is then performed on the feature representations learned from the graph in order to discover potential relationships. In parallel, Bayesian prediction is performed on historical usage information to infer the API likely to be used next. Finally, the prediction scores of the two are combined to predict the API that is likely to be called next, improving recommendation performance. Compared with existing API recommendation methods, the method can effectively improve API recommendation accuracy, can provide rich API recommendation lists, and has some application value.

Description

API recommendation method based on context and graph learning

Technical Field

The invention belongs to the field of intelligent software development in software engineering and relates to a method that applies graph learning techniques, in combination with programming context information, to recommend suitable APIs to developers.

Background

APIs serve as the function access interfaces of software toolkits, software frameworks, and other applications, and are widely used in everyday software development tasks. At present, many help documents suffer from low quality, incomplete code examples, and similar problems, so developers often find APIs difficult to use. In recent years, big data and artificial intelligence technologies have made progress in the field of code search and recommendation, and many researchers have studied the API recommendation problem in depth and produced a series of research results.

However, current API recommendation methods fall short in their use of context information. When context information is insufficient, recommendation is prone to a "cold start", which degrades recommendation performance. Graph learning is a technique that learns by fusing relationship and attribute features; it can learn deep feature representations of related entities from a data set and thereby discover hidden relationships. Graph learning is therefore widely applied in fields such as classification, prediction, and recommendation.

To address the "cold start" problem in the field of API recommendation, the invention proposes an API recommendation method based on graph learning. By studying fine-grained contextual API usage relationships and API attribute information, and by fusing API structure and attribute information with graph learning, the method strengthens the representation of context information and improves API prediction capability.

Disclosure of Invention

The invention aims to provide developers with rich API usage suggestions from a limited context when the context information of the code they are working on is insufficient. Its main advantage is that a graph learning mechanism fuses fine-grained API structural relationships with API attribute features, so that deep feature representations of APIs are learned, the context information is enhanced, and the cold start problem is addressed. These API features are used in downstream prediction tasks, providing more effective information for the recommendation task. The API feature fusion model based on graph learning is shown in FIG. 1. In addition, the invention combines link prediction with Bayesian prediction to predict candidate APIs: it predicts the probability that an unknown relation edge is generated and uses this prediction as the basis for recommendation.

The method mainly comprises the following steps:

(1) Relationship construction

Construct a relation graph of API usage that represents the API usage relationships of the context.

(2) Feature extraction

Extract the features of the nodes in the API relation graph.

(3) Model training

Fuse the API relation graph with the node features to obtain deep feature representations of the nodes.

(4) Prediction and recommendation

Predict unknown APIs from the feature representations of the context API nodes and from API usage distribution information, then rank the candidates by predicted value and recommend them in that order.

Detailed Description

The API recommendation framework based on graph learning is shown in FIG. 2. The specific implementation steps are as follows:

Step 1: Relationship construction.

Static analysis is used to extract the call relationships between methods and APIs in a project. Association relationships are built from these call relationships, yielding the API association subgraph of the project. The association relationship is defined as follows:

Let the data set be $P = \{M_i \mid i = 1, 2, \ldots, n\}$, where $M_i$ denotes the set of APIs called in the $i$-th client method. If a pair of API nodes $(u, v)$ is used together in client method $M_i$, then $o(u, v) = 1$; otherwise $o(u, v) = 0$. When API nodes $u$ and $v$ satisfy

$$\sum_{M_i \in P} o(u, v) \ge minsup$$

an association is considered to exist between API node $u$ and API node $v$, where $minsup$ denotes the minimum support and defaults to 1.

The fine-grained API structural relationship is represented by a weighted undirected API association graph $G = \{V, E\}$, where the node set $V$ is the set of APIs and the edge set $E$ is the set of association relationships. Each node in $V$ is labeled with the fully qualified name of the API and identified by a unique number. Each edge in $E$ is represented by a node pair with a weight: for example, $(u, v, w)$ denotes an edge between node $u$ and node $v$ with association strength $w$, that is, $e = (u, v, w)$ with $e \in E$.

The API structural relationship is constructed by the extraction method above. FIG. 3 shows the API association graph of one project.
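The relationship construction of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_association_graph`, the toy data, and the choice of the co-occurrence count as the association strength w are all assumptions, since the text does not state how w is computed.

```python
from collections import Counter
from itertools import combinations


def build_association_graph(client_methods, minsup=1):
    """Return {(u, v): w} for API pairs co-used in at least `minsup` client methods.

    Assumption: the edge weight w is the number of client methods that use
    both APIs; the source only calls w the "association strength".
    """
    support = Counter()
    for apis in client_methods:
        # Each client method M_i contributes at most 1 to the support of a pair.
        for u, v in combinations(sorted(set(apis)), 2):
            support[(u, v)] += 1
    return {pair: w for pair, w in support.items() if w >= minsup}


# Hypothetical toy data: each inner list is the API set M_i of one client method.
methods = [
    ["java.io.File.exists", "java.io.File.delete"],
    ["java.io.File.exists", "java.io.File.delete", "java.io.File.mkdir"],
    ["java.io.File.mkdir"],
]
edges = build_association_graph(methods, minsup=1)
```

Because the graph is undirected, each pair is stored once with its endpoints in sorted order. Raising `minsup` prunes pairs that are rarely used together.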

Step 2: Feature extraction.

Feature extraction operates on the node attributes of the API association graph, and the extracted features serve as the initial node features in the graph learning process. Node attribute information is considered from two aspects: API project structure information and API functional semantic information.

To benefit from both the project structure information and the functional semantic information of an API, the method fuses the two kinds of information as node attributes and embeds them in the same vector space as feature vectors. Node attributes are extracted as follows. The fully qualified name of the API is split along the hierarchy of project name (project), package name (package), class name (class), and method name (method) to obtain the API project structure information. Each name is then split by camel case into a word sequence to obtain the API functional semantic information. Finally, all parts are concatenated to obtain the attribute information of the API node.
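The attribute extraction described above can be sketched as follows. The helper names `split_camel_case` and `api_text_attribute` are hypothetical, and the sketch assumes the hierarchy levels are separated by dots in the fully qualified name; how the project name is attached is not specified in the text.

```python
import re


def split_camel_case(name):
    """Split an identifier such as 'readLine' or 'URLDecoder' into lowercase words."""
    # Acronym runs, capitalized or lowercase words, and digit runs are separate tokens.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def api_text_attribute(qualified_name):
    """Tokenize every hierarchy level of a fully qualified API name and concatenate."""
    tokens = []
    for level in qualified_name.split("."):  # package, class, and method levels
        tokens.extend(split_camel_case(level))
    return tokens


tokens = api_text_attribute("java.io.BufferedReader.readLine")
```

The resulting word sequence keeps the hierarchy order, so structural words (package, class) and functional words (method) end up in one text attribute per node.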

A vector space model is then built to vectorize the text attribute of each node. The weight of each word in the bag-of-words model of the node text attribute is computed with the TF-IDF algorithm, which completes the encoding and initial vector representation of the API. The initial feature vector of API node $i$ consists of the weights $i_w$ of the words $w$, where $i_w$ is computed by TF-IDF from the following quantities: $f_w$ is the number of times the word $w$ occurs in the text attribute of API node $i$, $|i|$ is the number of API nodes in the API association graph, and $a_w$ is the number of APIs in the API association graph whose text attribute contains the word $w$.
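One form of the weight that is consistent with these definitions is the standard TF-IDF product shown below. It is an assumption: the exact formula, including any smoothing or normalization, is not recoverable from the text, and the vector symbol $x_i$ and the vocabulary size $m$ are introduced here only for illustration.

```latex
x_i = \left(i_{w_1}, i_{w_2}, \ldots, i_{w_m}\right), \qquad
i_w = f_w \cdot \log\frac{|i|}{a_w}
```

Under this form, a word that appears in the text attribute of every API node receives weight 0, while a word specific to a few APIs is weighted more heavily.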

Step 3: Model training.

To learn jointly from the topology information and the node attributes, the method performs graph representation learning with GraphSAGE, a spatial-domain graph learning framework, and thereby obtains the fused feature representation of each API. The GraphSAGE-based API association graph learning model has three parts: an input layer, a convolutional layer, and an output layer.

① Input layer

The input consists of two parts: structure information and attribute information. The structure information is the API association graph G, which is used for the computations on the graph. The attribute information is the text attribute T of each node, which is first encoded with the vector space model described above to obtain the initial feature vector of the API.

② Convolutional layer

The convolutional layer exploits the structure of graph G, using a sampling strategy and an aggregation function to aggregate the feature information of neighbor nodes and thereby fuse and update the features. Neighbor nodes are sampled with a fixed-length method: the number of neighbors S is defined first, so that the number of neighbors stays constant and multiple nodes can be batched together for training. Sampling with replacement is then used to reach S neighbors, which yields the neighbor node set N(v) of API node v. The aggregation function passes messages from the k-order neighbor nodes to update the feature representation of a node, balancing the attribute characteristics of the nodes while preserving the structural characteristics of the graph.
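The fixed-length neighbor sampling can be sketched as follows. This is one possible reading, labeled as an assumption: when a node has at least S neighbors, S of them are drawn without replacement, and otherwise all neighbors are kept and the remainder is drawn with replacement. The function name `sample_neighbors` and the adjacency-dictionary layout are hypothetical.

```python
import random


def sample_neighbors(adjacency, v, S, rng=random):
    """Return exactly S sampled neighbors of v, or [] if v has no neighbors."""
    neighbors = list(adjacency.get(v, ()))
    if not neighbors:
        return []
    if len(neighbors) >= S:
        # Enough neighbors: draw S distinct ones.
        return rng.sample(neighbors, S)
    # Too few neighbors: keep them all and pad by sampling with replacement.
    return neighbors + rng.choices(neighbors, k=S - len(neighbors))


adjacency = {"a": ["b", "c"], "d": []}
sampled = sample_neighbors(adjacency, "a", 5, random.Random(0))
```

Keeping the sample size fixed means the neighbor lists of all nodes in a batch have the same length, which is what allows them to be stacked for batch training.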

The following three types of aggregation functions are provided to update the features of API node v:

1) Mean aggregation function

The weighted average of the features of the neighbor nodes u of node v is taken as the update of node v, where W is a parameter to be learned.

2) LSTM aggregation function

The neighbor node set N(v) of v is treated as a sequence and processed with an LSTM module.

3) Pooling aggregation function

Information from the neighbor nodes is aggregated by max pooling, where σ is the Sigmoid function and W and B are parameters to be learned.
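For reference, the mean and max-pooling aggregators of the GraphSAGE framework are commonly written as below. These forms are assumptions drawn from that framework, not formulas confirmed by this text: the hidden-state symbol $h_v^{k}$ (representation of node v after layer k) does not appear in the source, and the source's "weighted" average may use edge weights that the unweighted MEAN shown here ignores.

```latex
% Mean aggregator: average v's own state with its sampled neighbors, then transform
h_v^{k} = \sigma\left(W \cdot \mathrm{MEAN}\left(\{h_v^{k-1}\} \cup \{h_u^{k-1} : u \in N(v)\}\right)\right)

% Max-pooling aggregator: transform each neighbor, then take the element-wise maximum
h_{N(v)}^{k} = \max\left(\{\sigma\left(W h_u^{k-1} + B\right) : u \in N(v)\}\right)
```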

③ Output layer

The output layer outputs the final feature representation $z_v$ of API node v, which is used for API prediction. To make training more stable, the output vector of each convolutional layer is normalized.

The vectors of the individual convolutional layers carry the low-order and high-order feature signals of a node and reflect the local characteristics of the graph. The node representations of the k convolutional layers are therefore concatenated and passed through a linear transformation layer to produce the output: the connection function concat() joins the per-layer feature representations of a node in order, and W and B are the parameters to be learned.
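Under the descriptions above, the output layer can be written as the following sketch. The per-layer symbol $h_v^{j}$ and the choice of the L2 norm are assumptions (L2 normalization is the usual choice in GraphSAGE); only $z_v$, concat(), W, and B come from the text.

```latex
% Normalize the output of each convolutional layer j = 1, ..., k
h_v^{j} \leftarrow \frac{h_v^{j}}{\lVert h_v^{j} \rVert_2}

% Concatenate the per-layer representations and apply a linear transformation
z_v = W \cdot \mathrm{concat}\left(h_v^{1}, h_v^{2}, \ldots, h_v^{k}\right) + B
```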

The parameters of the aggregation functions are learned and updated by backpropagation in an unsupervised manner. Following the idea of the SkipGram model, a graph-based loss function is used so that adjacent nodes have similar representations:

$$J(z_v) = -\log\left(\sigma\left(z_v^{\top} z_u\right)\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left(\sigma\left(-z_v^{\top} z_{v_n}\right)\right)$$

where $z_u$ and $z_v$ are the final feature representations of nodes u and v, node u is a neighbor of node v sampled in the k-th order neighborhood, $v_n$ is a negative sample drawn from $P_n$, $P_n$ is the negative sampling probability distribution, Q is the number of negative samples, and σ is the Sigmoid function.

Step 4: Prediction and recommendation.

Given the limited context API information, deep API feature representations obtained through graph learning are used to predict and recommend APIs. Given a set B of API nodes, the corresponding node embeddings only need to be looked up in a parameter server or database, and the latest feature vectors of the nodes are then obtained by a forward propagation pass. The mini-batch-based forward propagation algorithm is shown in FIG. 4.

In the algorithm, lines 2 to 7 are the sampling process, which samples the 1st-order, 2nd-order, and higher-order neighbor nodes of node u. Lines 9 to 15 are the aggregation process, which aggregates only the nodes of the local neighborhood and thus speeds up iteration.

These feature vectors fuse the structure information and the attribute information of the APIs. Link prediction on them predicts unknown connections and yields a prediction probability for each candidate API. In addition, Bayesian prediction is performed on the historical API usage information in the code base to make the prediction more stable. Finally, the two prediction probabilities are combined into the final score of each candidate API. Let Q denote the API usage information of the context, which is the input of the model, and let D denote the output, a ranked list of APIs. The API prediction problem then becomes that of computing the probability P(D | Q) of generating D given Q.

Link prediction based on API feature representation

To capture higher-order potential relationships, the potential edges of candidate APIs are predicted from the API association graph features, using the similarity between nodes as the feature representation of a potential edge. Given API node u and candidate API node v, the similarity between the two nodes can be computed as the inner product of their feature vectors, i.e. $z_u^{\top} z_v$.

Given a set Q of context API nodes, the feature representation between Q and the candidate API nodes is denoted $E_Q$.

The probability distribution $y_d$ of the potential edges is obtained through a softmax function:

$$y_d = P_1(d \mid Q) = \mathrm{softmax}(E_Q)$$

where d is the API node to be predicted.
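The link prediction step can be sketched as follows. How the feature representation between Q and a candidate is formed is not recoverable from the text, so this sketch assumes it is the sum of the inner products between the candidate and every context node. The function names and the embedding dictionary `z` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link_prediction_scores(context, candidates, z):
    """Return {d: P_1(d | Q)} as a softmax over candidate-to-context similarities.

    Assumption: the similarity between a candidate d and the context set Q
    is the sum of the inner products z_q . z_d over all q in Q.
    """
    raw = {d: sum(dot(z[q], z[d]) for q in context) for d in candidates}
    m = max(raw.values())  # subtract the maximum for numerical stability
    exp = {d: math.exp(s - m) for d, s in raw.items()}
    total = sum(exp.values())
    return {d: e / total for d, e in exp.items()}


# Hypothetical 2-dimensional embeddings.
z = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
p1 = link_prediction_scores(["a"], ["b", "c"], z)
```

Candidate "b" points in the same direction as the context node "a", so it receives the larger share of the probability mass.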

Bayesian prediction based on API usage information

Bayesian prediction estimates the posterior probability from the prior probability and can thus predict the API likely to be called next. According to the Bayes formula below, the prior probability $P(d)$ can be estimated from the frequency with which d occurs in the code base.

$$P(d \mid Q) = \frac{P(d, Q)}{P(Q)} \propto P(d, Q) = P(Q \mid d)\,P(d)$$

The probability $P(Q \mid d)$ is computed from conditional probabilities in a manner similar to an n-gram model. Given the API set Q, this yields the prediction probability $P_2(d \mid Q)$ of candidate API d, computed from the co-occurrence frequency f.
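The Bayesian prediction can be sketched as follows. The exact factorization of P(Q | d) is not recoverable from the text, so this sketch assumes the context APIs are conditionally independent given d and estimates each P(q | d) as the co-occurrence frequency of q and d divided by the frequency of d, without smoothing. The function name and data layout are hypothetical.

```python
from collections import Counter
from itertools import permutations


def bayes_scores(context, candidates, usage_history):
    """Return {d: P(Q | d) * P(d)}, an unnormalized estimate of P_2(d | Q).

    Assumptions: P(Q | d) = product of P(q | d) over q in Q, with
    P(q | d) = f(q, d) / f(d), where f counts client methods in the history.
    """
    api_count = Counter()
    pair_count = Counter()
    for apis in usage_history:
        unique = set(apis)
        api_count.update(unique)
        pair_count.update(permutations(unique, 2))  # ordered pairs (q, d)
    n = len(usage_history)
    scores = {}
    for d in candidates:
        if api_count[d] == 0:
            scores[d] = 0.0  # d was never observed in the history
            continue
        p = api_count[d] / n  # prior P(d)
        for q in context:
            p *= pair_count[(q, d)] / api_count[d]  # P(q | d)
        scores[d] = p
    return scores


history = [["a", "b"], ["a", "b"], ["a", "c"], ["c"]]
p2 = bayes_scores(["a"], ["b", "c", "z"], history)
```

In the toy history, "b" always appears together with "a" while "c" does so only half of the time, so "b" scores higher even though both have the same prior.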

With the formulas above, each candidate result d in the candidate result set D obtains a prediction probability, which is used as its API recommendation score; each score is normalized. The weighted sum of the two scores is then taken as the final score of candidate API d:

$$\mathrm{Score}(d) = \alpha\,\mathrm{Score}_1(d) + \beta\,\mathrm{Score}_2(d)$$

where $\mathrm{Score}_1(d)$ is the normalized score of each candidate API obtained from $P_1(d \mid Q)$ and $\mathrm{Score}_2(d)$ is the normalized score obtained from $P_2(d \mid Q)$. α and β both default to 0.5. Finally, the scores in the result set D are sorted and the Top-k results are returned as the final recommendation.
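The score fusion and Top-k selection can be sketched as follows. The text says only that each score is normalized, so dividing by the sum of the scores is an assumption, as are the function names and the alphabetical tie-breaking.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization scheme)."""
    total = sum(scores.values())
    if total == 0:
        return {d: 0.0 for d in scores}
    return {d: s / total for d, s in scores.items()}


def recommend(p1, p2, k, alpha=0.5, beta=0.5):
    """Fuse the two normalized scores and return the k best (api, score) pairs."""
    s1, s2 = normalize(p1), normalize(p2)
    candidates = set(s1) | set(s2)
    final = {d: alpha * s1.get(d, 0.0) + beta * s2.get(d, 0.0) for d in candidates}
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(final.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical raw scores from link prediction (p1) and Bayesian prediction (p2).
top = recommend({"b": 3.0, "c": 1.0}, {"b": 1.0, "c": 1.0, "e": 2.0}, k=2)
```

A candidate scored by only one of the two predictors, such as "e" here, still enters the ranking with a zero contribution from the other predictor.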

Drawings

FIG. 1 API feature fusion model based on graph learning

FIG. 2 API recommendation framework based on graph learning

FIG. 3 Example API association graph of a project

FIG. 4 Mini-batch-based forward propagation algorithm

Claims

1. An API recommendation method based on context and graph learning, characterized in that an API association graph is constructed by a relational modeling method to represent the contextual API usage relationships; a graph learning model fuses the features of API relationships and attributes to alleviate the problem of insufficient context information; and candidate APIs are predicted by a prediction method based on multi-source information fusion.

2. The relational modeling method according to claim 1, characterized in that class file information, method declaration information, method parameter information, API call information, and API parameter information in a project are extracted by static analysis; an API association graph is constructed from the co-occurrence relationships between methods and APIs; and attribute features are extracted from the API project structure information and semantic information.

3. The graph learning model according to claim 1, characterized in that the relationship structure and attribute information of the APIs are fused using the direct-push graph learning framework GraphSAGE, so as to obtain the feature representations of the API nodes and learn the parameters of the aggregation function; when API nodes are input, their feature vectors can be recalled using the shared parameters for the downstream prediction task.

4. The prediction method according to claim 1, characterized in that the target API is predicted by combining a link prediction method and a Bayesian prediction method on the basis of the API feature representations; when only limited context information is known, potential relationships can be predicted from the existing API node information, and the prediction and recommendation of candidate APIs are completed on the basis of the prediction probabilities.