CN107506414B - Code recommendation method based on long-term and short-term memory network

Code recommendation method based on long-term and short-term memory network

Info

Publication number
CN107506414B
CN107506414B (application CN201710687197.4A)
Authority
CN
China
Prior art keywords
api
sequence
input
dictionary
matrix
Prior art date
Legal status
Expired - Fee Related
Application number
CN201710687197.4A
Other languages
Chinese (zh)
Other versions
CN107506414A (en)
Inventor
余啸
殷晓飞
刘进
伍蔓
姜加明
崔晓晖
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201710687197.4A priority Critical patent/CN107506414B/en
Publication of CN107506414A publication Critical patent/CN107506414A/en
Application granted granted Critical
Publication of CN107506414B publication Critical patent/CN107506414B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a code recommendation method based on a long short-term memory (LSTM) network, aimed at problems of existing code recommendation techniques such as low recommendation accuracy and low recommendation efficiency. The method extracts source code into API sequences, constructs a code recommendation model with a long short-term memory network, learns the relations between API calls, and then recommends code. Dropout is used to prevent model overfitting. Meanwhile, the ReLU function replaces the traditional saturating activation functions, alleviating the vanishing-gradient problem, speeding up model convergence, improving model performance, and fully exploiting the advantages of the neural network. The technical scheme of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.

Description

Code recommendation method based on long-term and short-term memory network
Technical Field
The invention belongs to the field of code recommendation, and particularly relates to a code recommendation method based on a long-term and short-term memory network.
Background
(1) Code recommendation system
Developers often build on mature software frameworks and class libraries to improve the efficiency and quality of software development, so they frequently need to know how to reuse an existing class library or framework by calling the corresponding APIs. Learning the APIs of an unfamiliar library or framework, however, is a major obstacle in the software development process. On one hand, the number of APIs newly added to the various mature software frameworks in recent years is very large, so developers must spend ever more time getting to know them. On the other hand, insufficient or inaccurate API code samples, incomplete or erroneous API documentation, and the complexity of the APIs themselves make learning and using them exceptionally difficult.
The centerpiece of the modern software development workflow is the Integrated Development Environment (IDE). IDEs were originally introduced as user interfaces for specific programming languages such as the widely used C++ and Java. By now, the IDE has evolved into a stand-alone computing product, closer to a full-featured document management and control system than a mere user interface for coding and debugging tools. To ease the difficulty developers have in using APIs, the core features of many advanced IDEs include automatic code recommendation. However, the code recommendation systems built into IDEs consider only the type compatibility and visibility of APIs, and their recommendation accuracy is low when facing a complex software framework. The main reason is that, after screening all APIs by such simple rules, these methods still recommend a large number of methods or fields, and finally just sort the recommendation results alphabetically.
A more accurate approach is to mine API usage patterns, apply them in a code recommendation system, and recommend and present to the developer the APIs most relevant to the developer's needs. Existing approaches to mining API usage patterns have certain shortcomings. For example, search-based code recommendation techniques are fast but do not exploit ordering information. Experience shows that the ordering information within a method is important: in an API call sequence, any use of an object must come after the object is constructed and declared, and any read or write of a file must come after the file is created. The order of API calls, i.e. their time-sequence information, can therefore help mine API usage patterns more reasonably. Graph-based approaches consider not only the ordering information but also structural information in the code, such as data dependencies and control dependencies, but the subgraph-search technique they rely on is inefficient in practice. Approaches based on natural language processing consider ordering information, strike a compromise on efficiency, and can capture usage patterns spanning multiple APIs.
(2) Deep learning
In recent years, deep learning has performed very well in the field of natural language processing, and the Recurrent Neural Network (RNN) is one of the most commonly used deep learning models. An RNN can process time sequences of arbitrary length and has shown remarkable ability in text classification, machine translation, part-of-speech tagging, image semantic analysis, and the like. However, the RNN model also has drawbacks. The essence of an RNN is to maintain a state in the hidden layer of the neural network to memorize history information, but as the sequence grows, training suffers from the vanishing-gradient or exploding-gradient problem. RNNs therefore do not perform well once the input sequence exceeds a certain length. On the other hand, a deep neural network trained beyond a certain number of iterations easily exhibits overfitting.
1) Long and short term memory network
To solve these problems of the conventional RNN, the Long Short-Term Memory (LSTM) model was developed. The LSTM model replaces the hidden-layer neurons of the neural network with a block structure, which adds an input gate, an output gate, a forget gate, and a cell structure used to control the learning and forgetting of historical information, making the model suitable for long-sequence problems. On this basis a large number of researchers have studied the LSTM model and derived many improved variants, such as the LSTM model proposed by Gers, which incorporates "peephole connections" and feeds the cell state into the gate layers. One derivative of LSTM proposed by Chung et al. is the Gated Recurrent Unit (GRU), which merges the forget gate and the input gate into an "update gate" and also merges the cell state with the hidden state; it is increasingly widely accepted. Other derived structures, such as Tree-LSTM (tree-structured long short-term memory network) and Bi-LSTM (bidirectional long short-term memory network), are widely used to solve many natural language processing problems.
At time t, denote the memory cell of the LSTM model by c_t, the forget gate by f_t, the input gate by i_t, and the output gate by o_t; the element values of the three gates all lie in the interval [0, 1]. At time t, the LSTM is computed as shown in equations (1) to (6):

i_t = σ(w_i x_t + u_i b_{t-1} + v_i c_{t-1})    (1)
f_t = σ(w_f x_t + u_f b_{t-1} + v_f c_{t-1})    (2)
o_t = σ(w_o x_t + u_o b_{t-1} + v_o c_t)    (3)
ĉ_t = tanh(w_c x_t + u_c b_{t-1})    (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t    (5)
b_t = o_t · tanh(c_t)    (6)
As shown in equation (1), the input of the input gate has three components: the input of the input layer at the current time t, the output of the hidden layer at the previous time t-1, and the state of the LSTM cell at the previous time t-1. The input gate controls what enters the cell state of the current hidden layer: it determines, through the gate computation, whether the input information is written into the cell state, where 1 means the information is allowed through and the corresponding value is updated, and 0 means it is blocked and the corresponding value is not updated.
The input of the forget gate has the same three components as that of the input gate, as shown in equation (2). The forget gate controls the historical information stored in the hidden layer at the previous time t-1: based on the previous hidden-layer output and the current input, it decides what to retain of the previous cell state c_{t-1}, where 1 means the corresponding information is retained and 0 means it is discarded.
As shown in equation (3), the input of the output gate also has three components: the input of the input layer at the current time t, the output of the hidden layer at the previous time t-1, and the state of the LSTM cell at the current time t. The output gate controls the output of the current hidden node, where 1 means the corresponding value is output and 0 means it is not.
As shown in equation (6), the output of the hidden layer at time t is b_t, with the output gate controlling what information is emitted.
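To make the gate computations concrete, the following is a minimal, self-contained Java sketch of one forward step of an LSTM cell following equations (1) to (6). It is illustrative only: the weights are reduced to one shared scalar per gate (instead of full matrices), and every class, method, and variable name is an assumption, not taken from the patent.

    import java.util.Arrays;

    public class LstmCellSketch {
        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

        /** One forward step for a cell of width m; scalar weights for brevity.
         *  x: current input x_t; bPrev: previous hidden output b_{t-1}; cPrev: previous cell state c_{t-1}. */
        static double[][] step(double[] x, double[] bPrev, double[] cPrev,
                               double wi, double ui, double vi,
                               double wf, double uf, double vf,
                               double wo, double uo, double vo,
                               double wc, double uc) {
            int m = x.length;
            double[] c = new double[m], b = new double[m];
            for (int k = 0; k < m; k++) {
                double i = sigmoid(wi * x[k] + ui * bPrev[k] + vi * cPrev[k]); // input gate, eq. (1)
                double f = sigmoid(wf * x[k] + uf * bPrev[k] + vf * cPrev[k]); // forget gate, eq. (2)
                double cand = Math.tanh(wc * x[k] + uc * bPrev[k]);            // candidate state, eq. (4)
                c[k] = f * cPrev[k] + i * cand;                                // cell-state update, eq. (5)
                double o = sigmoid(wo * x[k] + uo * bPrev[k] + vo * c[k]);     // output gate, eq. (3), peephole on c_t
                b[k] = o * Math.tanh(c[k]);                                    // hidden output, eq. (6)
            }
            return new double[][]{b, c};
        }

        public static void main(String[] args) {
            double[] x = {1.0, -0.5};
            double[] b0 = new double[2], c0 = new double[2];  // zero initial state
            double[][] out = step(x, b0, c0, 0.5,0.5,0.5, 0.5,0.5,0.5, 0.5,0.5,0.5, 0.5,0.5);
            System.out.println("b_1 = " + Arrays.toString(out[0]) + ", c_1 = " + Arrays.toString(out[1]));
        }
    }

Note how the forget and input gates jointly decide, per equation (5), how much of the old state survives and how much of the new candidate enters, which is what lets the model keep information over long sequences.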
2) Dropout techniques
Dropout is a technique proposed by Hinton in 2012 to prevent overfitting of a neural network. Its working mechanism is to randomly select a certain proportion of hidden-layer nodes and deactivate them; a deactivated node does not update its weights during that training pass, but the weights are kept, because in the next pass the node may again be randomly selected as active. During validation and use of the model, all nodes are used. The deep convolutional neural network AlexNet proposed by Hinton's student Alex Krizhevsky put dropout into practical use, applying it to the last fully connected layers of AlexNet and demonstrating its effect in preventing overfitting and improving generalization.
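A minimal Java sketch of the mechanism just described; the "inverted dropout" rescaling by 1/keepProb used here is one common convention and an assumption, as are all names:

    import java.util.Random;

    public class DropoutSketch {
        /** Randomly deactivates nodes during training; at validation/use time every node participates. */
        static double[] dropout(double[] h, double keepProb, Random rng, boolean training) {
            if (!training) return h.clone();
            double[] out = new double[h.length];
            for (int k = 0; k < h.length; k++) {
                // a deactivated node contributes 0 this pass; survivors are rescaled by 1/keepProb
                out[k] = rng.nextDouble() < keepProb ? h[k] / keepProb : 0.0;
            }
            return out;
        }

        public static void main(String[] args) {
            double[] h = {0.5, -1.2, 0.3, 2.0};
            System.out.println(java.util.Arrays.toString(dropout(h, 0.5, new Random(42), true)));
        }
    }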
3) ReLU function
The ReLU function was proposed by Nair & Hinton in 2010 and first applied to restricted Boltzmann machines. Because ReLU maps most values to 0, it adds sparsity to the network, which better matches the characteristics of biological neurons. Compared with the traditional sigmoid activation function, which saturates easily and causes the vanishing-gradient problem, ReLU does not suffer from this issue. Moreover, the ReLU function accelerates the convergence of model training.
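A small illustrative comparison (assumed inputs) of the ReLU and sigmoid derivatives in Java, showing why the sigmoid saturates for large |z| while ReLU keeps a unit gradient for positive inputs:

    public class ReluSketch {
        static double relu(double z) { return Math.max(0.0, z); }
        static double reluGrad(double z) { return z > 0 ? 1.0 : 0.0; }
        static double sigmoidGrad(double z) {
            double s = 1.0 / (1.0 + Math.exp(-z));
            return s * (1 - s);   // vanishes for large |z|
        }

        public static void main(String[] args) {
            for (double z : new double[]{-6, -1, 0.5, 6}) {
                System.out.printf("z=%5.1f  relu'=%.3f  sigmoid'=%.4f%n", z, reluGrad(z), sigmoidGrad(z));
            }
        }
    }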
(3) Language model
1) Word vector
Word vectors are a key technology of deep learning in the field of natural language processing. The word vector technique represents a natural-language word with a dense feature vector instead of the original one-hot vector, compressing the original high-dimensional sparse vector into a low-dimensional dense one. The invention treats an API as a word: corresponding to the vocabulary in natural language processing it proposes an API dictionary, and corresponding to word vectors it proposes API vectors.
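As an illustration of this compression, using the dictionary size n = 15 and vector dimension M = 100 of the embodiment below (the dense values are those assigned to w_1 there):

one-hot representation: v_{File.new} = (1, 0, 0, …, 0) ∈ {0, 1}^{15}
API vector: w_1 = (0.1, 0.3, 0.5, 0.5, …, 0.5) ∈ ℝ^{100}

The one-hot vector grows with the dictionary and carries no notion of similarity, whereas the learned API vector is dense and allows related APIs to end up close together.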
2) Probabilistic language model
Software has naturalness, and statistical language models have been applied to various software engineering tasks such as code recommendation and code completion. These techniques treat source code as a special kind of natural language and analyze it using statistical natural language processing techniques.
A language model is a probabilistic model of how a language is generated: it tells us the likelihood that a particular sentence is produced in the language. For a sentence y, let y = (y_1, y_2, …, y_n) be its word sequence; the role of the language model is to estimate the joint probability Pr(y_1, y_2, …, y_n). By the known formula

Pr(y_1, y_2, …, y_n) = ∏_{t=1}^{n} Pr(y_t | y_1, …, y_{t-1}),

computing the joint probability Pr(y_1, y_2, …, y_n) can be translated into computing, for each word in the sentence, the conditional probability of that word given the preceding words. However, estimating these conditional probabilities is difficult, so they are currently approximated with an "n-gram" model, as in

Pr(y_t | y_1, …, y_{t-1}) ≈ Pr(y_t | y_{t-n+1}, …, y_{t-1}).

The drawback is the assumption that the next word depends only on the previous n-1 words.
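As a concrete illustration with an assumed three-word sentence and a bigram model (n = 2):

Pr(y_1, y_2, y_3) = Pr(y_1) · Pr(y_2 | y_1) · Pr(y_3 | y_1, y_2) ≈ Pr(y_1) · Pr(y_2 | y_1) · Pr(y_3 | y_2),

so the bigram approximation conditions each word only on its single predecessor.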
A neural language model is a neural network-based language model. Unlike "n-grams" which predict the next word from a fixed length of previous words, neural language models can predict the next word from a longer sequence of previous words, while at the same time, they can learn word vectors very efficiently.
Disclosure of Invention
The invention provides a code recommendation method based on a long-short term memory network, aiming at the problems that in a code recommendation system, the existing code recommendation algorithm cannot consider time sequence information, is low in recommendation efficiency and the like.
The technical scheme provided by the invention is a code recommendation method based on a long short-term memory network, which comprises the following steps:
step 1, crawling at least ten thousand Java open source software codes from a GitHub website through a web crawler, wherein the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form a source code library, then preprocessing the source codes to form an API sequence transaction library, and generating an API dictionary and an API vector matrix, wherein the method specifically comprises the following steps of:
step 1.1, at least ten thousand Java open source software codes are crawled from a GitHub website by using a web crawler, the number of times of updating versions of each Java open source software code exceeds 1000, and the open source software codes form a source code library.
Step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains; the API sequences extracted from all methods in the source code library form the API sequence transaction library. The rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted. The API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs. The API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs.
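The extraction rule of step 1.2 can be sketched as follows. This is a deliberately naive, regex-based Java illustration — a real implementation would parse the abstract syntax tree — and every pattern, class, and method name here is an assumption for illustration only:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ApiSequenceSketch {
        // "Type var = new Type(...)": emit "Type.new" and remember var -> Type
        private static final Pattern NEW_STMT =
            Pattern.compile("(\\w+)\\s+(\\w+)\\s*=\\s*new\\s+(\\w+)\\s*\\(");
        // "var.method(...)": emit "Type.method" if var's type is known
        private static final Pattern CALL_STMT = Pattern.compile("(\\w+)\\.(\\w+)\\s*\\(");

        static List<String> extract(String methodBody) {
            List<String> apis = new ArrayList<>();
            java.util.Map<String, String> varTypes = new java.util.HashMap<>();
            for (String line : methodBody.split("\\R")) {
                Matcher m = NEW_STMT.matcher(line);
                if (m.find()) {
                    apis.add(m.group(3) + ".new");       // object-creation statement
                    varTypes.put(m.group(2), m.group(3));
                    continue;
                }
                Matcher c = CALL_STMT.matcher(line);
                if (c.find() && varTypes.containsKey(c.group(1))) {
                    apis.add(varTypes.get(c.group(1)) + "." + c.group(2)); // method-call statement
                }
            }
            return apis;
        }

        public static void main(String[] args) {
            String body = "File file = new File(fileName);\nif (file.isFile()) {\n}";
            System.out.println(extract(body)); // [File.new, File.isFile]
        }
    }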
Step 1.3, extracting the API dictionary from the API sequence transaction library and generating the API vector matrix.
The API dictionary is defined as follows: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D.
The generation process of the API dictionary and the API vector matrix is as follows: traverse the API sequence transaction library and judge whether the current API already exists in the API dictionary; if so, ignore it and continue traversing with the next API; otherwise, add the current API to the API dictionary and assign it a unique ID and a random M-dimensional API vector. The n M-dimensional API vectors of the n APIs contained in the API dictionary form the API vector matrix. The API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors are learned when the LSTM model is trained.
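A compact Java sketch of the dictionary and vector-matrix construction in step 1.3 (M = 100 follows the embodiment; all identifiers are illustrative assumptions):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class ApiDictionarySketch {
        static final int M = 100;
        final Map<String, Integer> idOf = new LinkedHashMap<>(); // API name -> 1-based unique ID
        final List<double[]> vectors = new ArrayList<>();        // row i-1 holds the vector of API i
        private final Random rng = new Random(0);

        void addSequence(List<String> apiSequence) {
            for (String api : apiSequence) {
                if (idOf.containsKey(api)) continue;   // already in the dictionary: ignore
                idOf.put(api, idOf.size() + 1);        // assign the next unique ID
                double[] w = new double[M];
                for (int k = 0; k < M; k++) w[k] = rng.nextDouble(); // random initial API vector
                vectors.add(w);                        // becomes one row of the API vector matrix
            }
        }

        public static void main(String[] args) {
            ApiDictionarySketch dict = new ApiDictionarySketch();
            dict.addSequence(List.of("File.new", "File.isFile", "File.new"));
            System.out.println(dict.idOf); // {File.new=1, File.isFile=2}
        }
    }

The LinkedHashMap preserves insertion order, so the i-th dictionary entry and the i-th row of the vector matrix stay aligned, which is the property the training step relies on.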
Step 2, constructing the API recommendation model, namely constructing the long short-term memory network. The long short-term memory network is defined to comprise an input layer, a hidden layer, a fully connected layer and an output layer; wherein,
the input layer receives a string of numerical inputs, which are propagated forward to the hidden layer; the current output of the hidden layer is also influenced by the output of the hidden layer at the previous moment. The output generated by the hidden layer is input into the fully connected layer, the output data of the fully connected layer is input into the output layer, and the Softmax classifier in the output layer outputs the final classification result.
The neural unit of the hidden layer is the long short-term memory (LSTM) unit, the dropout technique is used to prevent the long short-term memory network from overfitting, and the ReLU function is used as the neuron activation function. The number of neurons in the input layer is M, the dimension of the API vector generated in step 1.3. The number of neurons in the hidden layer is M, the number of neurons in the fully connected layer is M, and the number of neurons in the output layer is n, where n is the number of APIs contained in the API dictionary and M, n are positive integers.
Step 3, training the API recommendation model, namely training the long short-term memory network.
The input of the API recommendation model is a matrix T_input of N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, M denotes the dimension of the API vector, and the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence.
The target matrix T_target of the API recommendation model is a matrix of N_b rows and N_s columns, where the entry in row i and column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th input API of the i-th batch.
The output of the API recommendation model is an output probability matrix T_prob of N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary, and the entry in row i and column j is the probability that the next API predicted after the i-th API of the input sequence belongs to the j-th API in the API dictionary.
The method comprises the following steps:
Step 3.1, connecting all the API sequences in the API sequence transaction library end to end to generate the total API sequence.
Step 3.2, setting a pointer variable point with initial value 1. Starting from the point-th API of the total API sequence, N_s consecutive APIs are extracted at a time, for N_b batches in total. For each extracted API, its corresponding ID is read from the API dictionary, and the ID is used to fetch the corresponding vector from the API vector matrix and store it in the input matrix T_input; the vector of the j-th API of the i-th batch is stored in row (i−1)×N_s + j of T_input. For the target matrix, starting from the (point+1)-th API of the total sequence, N_s APIs are again extracted at a time for N_b batches, and for each API its corresponding ID is read from the API dictionary and stored in the target matrix. Finally, after the input matrix and the target matrix are filled, point is set to the position in the total sequence of the last API read into the target matrix. It is worth noting that after the last API of the total sequence has been extracted, extraction continues from the first API of the total sequence. (A sketch of this bookkeeping is given after step 3.6 below.)
And 3.3, sequentially extracting API vectors from the input matrix to serve as input of an API recommendation model, and regarding the moment t, sequentially taking the vector of each row of API in the input matrix as the input vector of the model, and recording the API as the APItMarking the input as xtIf the calculation result of the hidden layer input gate of the LSTM model is it=σ(wixt+uibt+vict-1) Forgetting to calculate as result ft=σ(wfxt+ufbt+vfct-1) The output gate is calculated as ot=σ(woxt+uobt+voct) Finally the output of the hidden layer is bt=ot·tanh(ct) Data is transmitted from the hidden layer to the full link layer and finally transmittedAnd using a Softmax classifier for layer outlet. The output layer is obtained:
Figure BDA0001377040230000071
wherein | VDI represents the number of APIs contained in the API dictionary, theta represents the current weight of the neural network, and theta represents the current weight of the neural network1And representing a set of weight values corresponding to the first output node of the output layer. Finally, the formula is transposed and stored in the output probability matrix. This step is repeated until all the API vectors in the input matrix are entered into the API recommendation model.
And 3.4, calculating a cross entropy loss function by using the output probability matrix and the target matrix. A cross entropy loss function of
Figure BDA0001377040230000072
Wherein l represents an indicator function, l (y)tJ) represents when ytWhen j is equal, l (y)tJ) 1, otherwise l (y)t=j)=0,ytThe ID of the target output API at time i.
Figure BDA0001377040230000073
Representing the output probability of the ith row and the jth column in the output probability matrix.
And 3.5, calculating the gradients of all weights in the network by taking the weights in the network as variables according to the cross entropy loss function. Meanwhile, based on gradient cutting, the updating of the weight value is controlled within a set range; the method comprises the following steps: firstly, a constant named gradient clipping is set, and is marked as clip _ gradient, when the backward propagation is carried out, the gradient of each parameter is obtained, and is marked as diff, at the moment, the weight is not selected to be directly updated, the sum of squares of all weight gradients is firstly solved, and is marked as sumsq _ diff, if the sum of squares of all weight gradients is greater than the clip _ gradient, the scaling factor is continuously solved, and is marked as scale _ factor which is clip _ gradient/sumsq _ diff. This scale _ factor is between (0, 1). If the sum of squares of the weight gradients is larger, the scaling factor will be smaller. Finally, all the weight gradients are multiplied by the scaling factor, and the obtained gradient is the final gradient information. The weight is updated according to the formula W ═ W ∑ J (θ), ∑ J (θ) represents the corresponding weight gradient, and η represents the learning rate.
And 3.6, repeating the steps 3.2-3.5 until convergence, namely the loss J (theta) does not rise or fall any more.
Step 4, extracting an API sequence from the code being edited by the developer, and then generating the predicted subsequence set.
Step 4.1, extracting an API sequence from the code being edited by the developer, denoted P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API in the API sequence P and P_L denotes the L-th API, i.e. the API sequence P contains L APIs. The rule for extracting the API sequence is the same as in step 1.2.
Step 4.2, taking the L-th API as the reference position and selecting backwards all subsequences of length at most the threshold γ, i.e. the subsequences Sub_i = {P_{L−i+1}, …, P_L}, where 1 ≤ i ≤ γ. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ} (see the sketch after step 5).
Step 5, inputting the sequences of the predicted subsequence set V_Sub generated in step 4 one by one into the API recommendation model trained in step 3, and outputting a probability matrix of |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1. The entry in row i and column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given that the current API sequence is the predicted subsequence Sub_i, the next API is the j-th API in the API dictionary. Taking the maximum of each column of the generated prediction probability matrix T_prediction gives a one-dimensional probability vector t; if the largest value of t lies in column m, the m-th API in the API dictionary is recommended first.
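Steps 4.2 and 5 amount to taking suffixes of the edited sequence and an argmax over the column maxima of the model's output. A Java sketch (the model itself is stubbed by a fixed probability matrix, and all names are assumptions):

    import java.util.ArrayList;
    import java.util.List;

    public class RecommendSketch {
        /** Step 4.2: Sub_i is the suffix of P consisting of its last i APIs, for i = 1..gamma. */
        static List<List<String>> suffixSubsequences(List<String> p, int gamma) {
            List<List<String>> subs = new ArrayList<>();
            int L = p.size();
            for (int i = 1; i <= Math.min(gamma, L); i++)
                subs.add(new ArrayList<>(p.subList(L - i, L)));
            return subs;
        }

        /** Step 5: each row of probs is the model output Pr(w_j | Sub_i); recommend the column
         *  whose maximum over all rows is largest. Returns a 1-based dictionary ID. */
        static int recommend(double[][] probs) {
            int n = probs[0].length;
            double[] colMax = new double[n];
            for (double[] row : probs)
                for (int j = 0; j < n; j++) colMax[j] = Math.max(colMax[j], row[j]);
            int best = 0;
            for (int j = 1; j < n; j++) if (colMax[j] > colMax[best]) best = j;
            return best + 1;
        }

        public static void main(String[] args) {
            List<String> p = List.of("File.new", "Scanner.new", "Scanner.hasNextLine", "Scanner.nextLine");
            System.out.println(suffixSubsequences(p, 3));
            double[][] probs = {{0.1, 0.2, 0.7}, {0.3, 0.3, 0.4}};  // stubbed model output, 2 subsequences x 3 APIs
            System.out.println("recommend API id " + recommend(probs)); // 3
        }
    }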
Aiming at the problems of low recommendation accuracy and low recommendation efficiency common in existing code recommendation techniques, the method extracts source code into API sequences, constructs a code recommendation model with a long short-term memory network, learns the relations between API calls, and then recommends code. Dropout is used to prevent model overfitting. Meanwhile, the ReLU function replaces the traditional saturating activation functions, alleviating the vanishing-gradient problem, speeding up model convergence, improving model performance, and fully exploiting the advantages of the neural network. The technical scheme of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows the code of the readTxtFile method of the embodiment;
FIG. 3 shows the code of the writeTxtFile method of the embodiment;
FIG. 4 illustrates the API sequence transaction library extracted in the embodiment;
FIG. 5 illustrates the extracted API dictionary;
FIG. 6 illustrates the long short-term memory network;
Detailed Description
To facilitate understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here merely illustrates and explains the invention and does not restrict it.
The flow of the code recommendation method based on the long short-term memory network provided by the invention is shown in FIG. 1; all steps can be run automatically by those skilled in the art using computer software technology. The embodiment is specifically implemented as follows:
step 1, in order to make a source code library have high reliability and practicability, at least ten thousand Java open source software codes are crawled from a GitHub website through a web crawler, the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form the source code library, then the source codes are preprocessed to form an API sequence transaction library, and an API dictionary and an API vector matrix are generated, specifically comprising:
step 1.1, at least ten thousand Java open source software codes are crawled, the times of updating versions of each Java open source software code exceed 1000, and the open source software codes form a source code library.
Step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains; the API sequences extracted from all methods in the source code library form the API sequence transaction library. The rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted. The API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs. The API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs.
In this embodiment, in the code of the readTxtFile method in FIG. 2, the first statement "File file = new File(fileName)" is an object-creation statement, and the extracted API is File.new; the second statement "if (file.isFile())" is an object method-call statement, and the extracted API is File.isFile; the third statement "FileInputStream fileInputStream = new FileInputStream(file)" is an object-creation statement, and the extracted API is FileInputStream.new; the fourth statement "InputStreamReader read = new InputStreamReader(fileInputStream)" is an object-creation statement, and the extracted API is InputStreamReader.new; the fifth statement "BufferedReader bufferedReader = new BufferedReader(read)" is an object-creation statement, and the extracted API is BufferedReader.new; the call to "bufferedReader.readLine()" is an object method-call statement, and the extracted API is BufferedReader.readLine; the call to "System.out.println(...)" yields the API System.out.println; and the ninth statement "read.close()" is an object method-call statement, and the extracted API is InputStreamReader.close. Therefore the API sequence extracted from the code of the readTxtFile method in FIG. 2 is File.new, File.isFile, FileInputStream.new, InputStreamReader.new, BufferedReader.new, BufferedReader.readLine, System.out.println, InputStreamReader.close.
In the present embodiment, in the code of the writeTxtFile() method in FIG. 3, the first statement "File bookFile = new File("page.txt")" is an object-creation statement, and the extracted API is File.new; the second statement "Scanner bookSc = new Scanner(bookFile)" is an object-creation statement, and the extracted API is Scanner.new; the third statement "File authorFile = new File("author.txt")" is an object-creation statement, and the extracted API is File.new; the fourth statement "FileWriter fw = new FileWriter(authorFile)" is an object-creation statement, and the extracted API is FileWriter.new; the call "bookSc.hasNextLine()" is an object method-call statement, and the extracted API is Scanner.hasNextLine; the call "bookSc.nextLine()" yields Scanner.nextLine; the call "fw.append(...)" yields FileWriter.append; the eighth statement "fw.close()" is an object method-call statement, and the extracted API is FileWriter.close; and the ninth statement "bookSc.close()" is an object method-call statement, and the extracted API is Scanner.close. Thus the API sequence extracted from the code of the writeTxtFile method in FIG. 3 is File.new, Scanner.new, File.new, FileWriter.new, Scanner.hasNextLine, Scanner.nextLine, FileWriter.append, FileWriter.close, Scanner.close.
Finally, all the API sequences extracted by the two methods form an API sequence transaction library shown in FIG. 4.
Step 1.3, extracting the API dictionary from the API sequence transaction library and generating the API vector matrix.
The API dictionary is defined as follows: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D.
The generation process of the API dictionary and the API vector matrix is as follows: traverse the API sequence transaction library and judge whether the current API already exists in the API dictionary; if so, ignore it and continue traversing with the next API; otherwise, add the current API to the API dictionary and assign it a unique ID and a random M-dimensional API vector. The n M-dimensional API vectors of the n APIs contained in the API dictionary form the API vector matrix. The API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors are learned when the LSTM model is trained.
In this embodiment, the API sequence transaction library of FIG. 4 is traversed. The first API of the first API sequence, File.new, does not exist in the API dictionary, so it is given the unique ID 1 and a random 100-dimensional API vector w_1 = [0.1, 0.3, 0.5, 0.5, …, 0.5] and added to the dictionary; the current dictionary is V_D = {1: File.new, w_1}. The second API, File.isFile, does not exist in the dictionary, so it is given ID 2 and a random 100-dimensional vector w_2 = [0.2, 0.5, 0.5, 0.4, …, 0.7]; the dictionary becomes V_D = {1: File.new, w_1, 2: File.isFile, w_2}. In the same way, the third API, FileInputStream.new, receives ID 3 and w_3 = [0.4, 0.2, 0.5, 0.2, …, 0.2]; the fourth, InputStreamReader.new, receives ID 4 and w_4 = [0.3, 0.3, 0.5, 0.2, …, 0.9]; the fifth, BufferedReader.new, receives ID 5 and w_5 = [0.1, 0.6, 0.5, 0.6, …, 0.5]; the sixth, BufferedReader.readLine, receives ID 6 and w_6 = [0.5, 0.3, 0.5, 0.7, …, 0.3]; the seventh, System.out.println, receives ID 7 and w_7 = [0.1, 0.3, 0.5, 0.5, …, 0.5]; and the eighth, InputStreamReader.close, receives ID 8 and w_8 = [0.7, 0.2, 0.1, 0.8, …, 0.3]. After the first sequence, the dictionary is V_D = {1: File.new, w_1, 2: File.isFile, w_2, 3: FileInputStream.new, w_3, 4: InputStreamReader.new, w_4, 5: BufferedReader.new, w_5, 6: BufferedReader.readLine, w_6, 7: System.out.println, w_7, 8: InputStreamReader.close, w_8}.
The first API of the second API sequence, File.new, already exists in the API dictionary and is ignored. The second API, Scanner.new, does not exist in the dictionary and receives ID 9 and w_9 = [0.3, 0.8, 0.2, 0.1, …, 0.7]. The third API, File.new, already exists and is ignored. The fourth API, FileWriter.new, receives ID 10 and w_10 = [0.4, 0.2, 0.8, 0.7, …, 0.3]; the fifth, Scanner.hasNextLine, receives ID 11 and w_11 = [0.1, 0.4, 0.5, 0.3, …, 0.1]; the sixth, Scanner.nextLine, receives ID 12 and w_12 = [0.5, 0.3, 0.5, 0.7, …, 0.3]; the seventh, FileWriter.append, receives ID 13 and w_13 = [0.3, 0.1, 0.7, 0.3, …, 0.6]; the eighth, FileWriter.close, receives ID 14 and w_14 = [0.4, 0.8, 0.4, 0.2, …, 0.1]; and the ninth, Scanner.close, receives ID 15 and w_15 = [0.5, 0.2, 0.3, 0.1, …, 0.2]. The finally extracted API dictionary is V_D = {1: File.new, w_1, 2: File.isFile, w_2, 3: FileInputStream.new, w_3, 4: InputStreamReader.new, w_4, 5: BufferedReader.new, w_5, 6: BufferedReader.readLine, w_6, 7: System.out.println, w_7, 8: InputStreamReader.close, w_8, 9: Scanner.new, w_9, 10: FileWriter.new, w_10, 11: Scanner.hasNextLine, w_11, 12: Scanner.nextLine, w_12, 13: FileWriter.append, w_13, 14: FileWriter.close, w_14, 15: Scanner.close, w_15}. In this embodiment, the 15 100-dimensional API vectors of the 15 APIs contained in the API dictionary form the API vector matrix shown in FIG. 5.
Step 2, constructing the API recommendation model, namely constructing the long short-term memory network. Referring to FIG. 6, the long short-term memory network consists of an input layer, a hidden layer, a fully connected layer and an output layer. The input layer receives a string of numerical inputs, which are propagated forward to the hidden layer; the current output of the hidden layer is also influenced by the output of the hidden layer at the previous moment. The output generated by the hidden layer is input into the fully connected layer, the output data of the fully connected layer is input into the output layer, and the Softmax classifier in the output layer outputs the final classification result. In the concrete implementation, the neural unit of the hidden layer is the long short-term memory (LSTM) unit, the dropout technique is used to prevent the network from overfitting, and the ReLU function is used as the neuron activation function. In this embodiment, the number of neurons in the input layer is 100, 100 being the dimension of the API vectors generated in step 1.3; the hidden layer has 100 neurons, the fully connected layer 100 neurons, and the output layer 15 neurons, 15 being the number of APIs contained in the API dictionary.
Step 3, training the API recommendation model, namely training the long short-term memory network.
The input of the API recommendation model is a matrix T_input of N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, M denotes the dimension of the API vector, and the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence.
The target matrix T_target of the API recommendation model is a matrix of N_b rows and N_s columns, where the entry in row i and column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th input API of the i-th batch.
The output of the API recommendation model is an output probability matrix T_prob of N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary, and the entry in row i and column j is the probability that the next API predicted after the i-th API of the input sequence belongs to the j-th API in the API dictionary.
The method mainly comprises the following steps:
Step 3.1, connecting all the API sequences in the API sequence transaction library end to end to generate the total API sequence.
In this embodiment, after all API sequences in the API sequence transaction library of FIG. 4 are connected end to end, the generated total API sequence is: File.new, File.isFile, FileInputStream.new, InputStreamReader.new, BufferedReader.new, BufferedReader.readLine, System.out.println, InputStreamReader.close, File.new, Scanner.new, File.new, FileWriter.new, Scanner.hasNextLine, Scanner.nextLine, FileWriter.append, FileWriter.close, Scanner.close.
Step 3.2, setting a pointer variable point (the initial value of point is 1). Starting from the point-th API of the total API sequence, N_s consecutive APIs are extracted at a time, for N_b batches in total. For each extracted API, its corresponding ID is read from the API dictionary, and the ID is used to extract the vector corresponding to the API from the API vector matrix and store it in the input matrix T_input; the vector of the j-th API of the i-th batch is stored in row (i−1)×N_s + j of T_input. For the target matrix, starting from the (point+1)-th API of the total API sequence, N_s APIs are again extracted at a time for N_b batches, and for each API its corresponding ID is read from the API dictionary and stored in the target matrix. Finally, after the input matrix and the target matrix are filled, point is set to the position in the total sequence of the last API read into the target matrix. It is worth noting that after the last API of the total sequence has been extracted, extraction continues from the first API of the total sequence.
In this embodiment, the batch size N_b is set to 2, the sequence length N_s to 2, and the API vector dimension to 100. In the initial stage point is 1, and 2 APIs are extracted at a time from the 1st API of the total API sequence, for 2 batches in total, so the extracted APIs are File.new, File.isFile (batch 1) and FileInputStream.new, InputStreamReader.new (batch 2). For each API, its corresponding ID is read from the API dictionary and used to extract the corresponding vector from the API vector matrix, which is stored in the input matrix T_input; the input matrix is therefore the 4-row, 100-column matrix whose rows are w_1, w_2, w_3, w_4.
For the target matrix, 2 APIs are extracted at a time starting from the 2nd API of the total API sequence, again for 2 batches, so the extracted APIs are File.isFile, FileInputStream.new (batch 1) and InputStreamReader.new, BufferedReader.new (batch 2); reading their corresponding IDs from the API dictionary and storing them in the target matrix gives
T_target = [[2, 3], [4, 5]].
Since the number of APIs contained in the API dictionary is 15, an input matrix of 4 (= 2 × 2) rows and 100 columns, an output probability matrix of 4 (= 2 × 2) rows and 15 columns, and a target matrix of 2 rows and 2 columns are established.
Step 3.3, sequentially extracting API vectors from the input matrix as input to the API recommendation model. For time t, the vector of the current row of the input matrix is taken as the input vector of the model; denote this API by API_t and its input vector by x_t. The hidden-layer input gate of the LSTM model is then computed as i_t = σ(w_i x_t + u_i b_{t-1} + v_i c_{t-1}), the forget gate as f_t = σ(w_f x_t + u_f b_{t-1} + v_f c_{t-1}), and the output gate as o_t = σ(w_o x_t + u_o b_{t-1} + v_o c_t); finally the output of the hidden layer is b_t = o_t · tanh(c_t). The data is passed from the hidden layer to the fully connected layer, and finally the output layer uses a Softmax classifier, which yields, for j = 1, …, |V_D|,

Pr(API_{t+1} = j | x_1, …, x_t; θ) = exp(θ_j b_t) / Σ_{k=1}^{|V_D|} exp(θ_k b_t),

where |V_D| denotes the number of APIs contained in the API dictionary, θ denotes the current weights of the neural network, and θ_1 denotes the set of weights corresponding to the first output node of the output layer. Finally this probability vector is transposed and stored in the output probability matrix. This step is repeated until all the API vectors in the input matrix have been input into the API recommendation model.
In this embodiment, the input matrix generated in step 3.2 (the 4-row, 100-column matrix whose rows are w_1, w_2, w_3, w_4) is input into the API recommendation model, and a 4-row, 15-column output probability matrix is obtained (its concrete values are shown in the figure of the original publication).
And 3.4, calculating a loss function by using the output probability matrix and the target matrix. A cross entropy loss function of
Figure BDA0001377040230000164
Wherein l represents an indicator function, l (y)tJ) represents when ytWhen j is equal, l (y)tJ) 1, otherwise l (y)t=j)=0,ytThe ID of the target output API at time i.
Figure BDA0001377040230000166
Representing the output probability of the ith row and the jth column in the output probability matrix.
The target matrix of the embodiment is T_target = [[2, 3], [4, 5]]. From the output probability matrix in step 3.3, the corresponding loss J(θ) is obtained (its numeric value is shown in the figure of the original publication).
Step 3.5, taking the weights of the network as variables, calculating the gradients of all weights in the network from the loss function. Meanwhile, the gradient clipping technique is introduced to keep the weight updates within a suitable range, better controlling the exploding-gradient problem. In the concrete implementation, a constant called the clipping threshold, denoted clip_gradient, is set first. During backpropagation the gradient of each parameter, denoted diff, is obtained; the weights are not updated immediately, but the sum of squares of all weight gradients, denoted sumsq_diff, is computed first. If sumsq_diff is greater than clip_gradient, a scaling factor scale_factor = clip_gradient / sumsq_diff is computed; this scale_factor lies in (0, 1), and the larger the sum of squared weight gradients, the smaller the scaling factor. Finally all the weight gradients are multiplied by scale_factor, and the result is the final gradient information. The weights are updated according to W = W − η∇J(θ), where ∇J(θ) denotes the corresponding weight gradient and η denotes the learning rate.
Step 3.6, repeating steps 3.2-3.5 until convergence, i.e. until the loss J(θ) no longer rises or falls.
Step 4, extracting an API sequence from the code being edited by the developer, and then generating the predicted subsequence set.
Step 4.1, extracting an API sequence from the code being edited by the developer, denoted P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API in the API sequence P and P_L denotes the L-th API, i.e. the API sequence P contains L APIs. The rule for extracting the API sequence is the same as in step 1.2.
Step 4.2, taking the L-th API as the reference position and selecting backwards all subsequences of length at most the threshold γ, i.e. the subsequences Sub_i = {P_{L−i+1}, …, P_L}, where 1 ≤ i ≤ γ. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ}.
In this embodiment, suppose the user is editing the code shown in the figures of the original publication. The API sequence extracted from the code is File.new, Scanner.new, Scanner.hasNextLine, Scanner.nextLine, and the statement whose API needs to be predicted is "noteSc.?". With the threshold γ set to 3, the predicted subsequences Sub_1 = {Scanner.nextLine}, Sub_2 = {Scanner.hasNextLine, Scanner.nextLine} and Sub_3 = {Scanner.new, Scanner.hasNextLine, Scanner.nextLine} are obtained. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, Sub_3}.
Step 5, inputting the sequences of the predicted subsequence set V_Sub generated in step 4 one by one into the API recommendation model trained in step 3, and outputting a probability matrix of |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1. The entry in row i and column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given that the current API sequence is the predicted subsequence Sub_i, the next API is the j-th API in the API dictionary. Taking the maximum of each column of the generated prediction probability matrix T_prediction gives a one-dimensional probability vector t; if the largest value of t lies in column m, the m-th API in the API dictionary is recommended first.
In this embodiment, taking the predicted subsequence set obtained in step 4 as an example, a prediction probability matrix of 3 rows and 15 columns is established, and the predicted subsequences Sub_1 = {Scanner.nextLine}, Sub_2 = {Scanner.hasNextLine, Scanner.nextLine} and Sub_3 = {Scanner.new, Scanner.hasNextLine, Scanner.nextLine} are input into the trained model in turn, with the outputs stored in the prediction probability matrix (its concrete values are shown in the figure of the original publication). Taking the maximum of each column of the generated prediction probability matrix T_prediction yields the one-dimensional probability vector t = [0.6, 0.3, 0.5, 0.2, 0.3, 0.2, 0.3, 0.4, 0.3, 0.5, 0.2, 0.3, 0.4, 0.3, 0.8]. Since the value in the 15th column of t is the largest, the 15th API in the API dictionary, Scanner.close, is recommended first.
The specific embodiment described herein is merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiment, or substitute it in similar ways, without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (1)

1. A code recommendation method based on a long-term and short-term memory network is characterized by comprising the following steps:
step 1, crawling at least ten thousand Java open source software codes from a GitHub website through a web crawler, wherein the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form a source code library, then preprocessing the source codes to form an API sequence transaction library, and generating an API dictionary and an API vector matrix, wherein the method specifically comprises the following steps of:
step 1.1, using a web crawler to crawl from the GitHub website at least ten thousand Java open-source software projects, each with more than 1000 version updates, these open-source codes forming the source code library;
step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains, all API sequences extracted from all methods in the source code library forming the API sequence transaction library; the rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted; the API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs; the API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs;
step 1.3, extracting an API dictionary from the API sequence transaction library and generating an API vector matrix;
the API dictionary is defined as: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D;
the generation process of the API dictionary and the API vector matrix is: traversing the API sequence transaction library and judging whether the current API already exists in the API dictionary; if so, ignoring the current API and continuing with the next API; otherwise, adding the current API to the API dictionary and assigning it a unique ID and a random M-dimensional API vector; the n M-dimensional API vectors of the n APIs contained in the API dictionary forming the API vector matrix; the API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors can be learned when the LSTM model is trained;
step 2, constructing an API recommendation model, namely constructing a long-term and short-term memory network; defining a long-term and short-term memory network comprising an input layer, a hidden layer, a full link layer and an output layer; wherein,
the input layer receives a string of numerical value input, the numerical value input is input into the hidden layer through forward propagation, the current output of the hidden layer is influenced by the output at the moment on the hidden layer, the output generated by the hidden layer is input into the full-link layer, the full-link layer outputs data and is input into the output layer, and a Softmax classifier in the output layer outputs the final classification result;
the neural units of the hidden layer are long short-term memory units; the dropout technique is used to prevent the network from overfitting, and the ReLU function is used as the neuron activation function; the number of neurons of the input layer is M, the dimension of the API vectors generated in step 1.3; the number of neurons of the hidden layer is M, the number of neurons of the fully connected layer is M, and the number of neurons of the output layer is n, the number of APIs contained in the API dictionary; M and n are positive integers;
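For illustration, the architecture of step 2 could be sketched in PyTorch roughly as follows; treating the API vector matrix as a trainable embedding table follows step 1.3, while the exact placement of dropout and anything beyond the layer sizes stated in the claim are assumptions.

```python
import torch
import torch.nn as nn

class APIRecommender(nn.Module):
    """Input layer (M) -> LSTM hidden layer (M) -> fully connected layer (M,
    ReLU) -> output layer (n, Softmax), with dropout against overfitting."""
    def __init__(self, n_apis, M=128, dropout=0.5):
        super().__init__()
        # The API vector matrix is a trainable parameter (embedding table).
        self.embed = nn.Embedding(n_apis, M)
        self.lstm = nn.LSTM(M, M, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(M, M)
        self.out = nn.Linear(M, n_apis)

    def forward(self, ids):             # ids: (N_b, N_s) API IDs
        x = self.embed(ids)             # (N_b, N_s, M) API vectors
        h, _ = self.lstm(x)             # hidden-layer outputs b_t
        h = self.drop(torch.relu(self.fc(h)))
        return self.out(h)              # logits; Softmax is applied in the loss
```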
step 3, training an API recommendation model, namely training a long-term and short-term memory network;
the input of the API recommendation model is a matrix T_input with N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, and M denotes the dimension of the API vector; the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence;
the target matrix T_target of the API recommendation model is a matrix with N_b rows and N_s columns, where the entry in row i, column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th API of the i-th input sequence;
the output of the API recommendation model is an output probability matrix T_prob with N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary; the entry in row i, column j is the predicted probability that the API following the i-th API of the input sequence is the j-th API of the API dictionary;
the method comprises the following steps:
step 3.1, connecting all API sequences in the API sequence transaction library end to generate an API total sequence;
step 3.2, setting a pointer variable point with initial value 1; starting from the point-th API of the API total sequence, fetching N_s APIs at a time, N_b times in total; for each fetched API, reading its ID from the API dictionary, using the ID to extract the corresponding vector from the API vector matrix, and storing the vector in the input matrix T_input, so that, for example, the vector of the j-th API of the i-th fetch is stored in row i × j of T_input; for the target matrix, starting from the API immediately following the point-th API of the API total sequence, likewise fetching N_s APIs at a time, N_b times in total, reading the ID of each API from the API dictionary and storing it in the target matrix; finally, after the input matrix and the target matrix are filled, the pointer variable point is set to the position in the API total sequence of the last API read into the target matrix; it is worth noting that after the last API of the API total sequence is fetched, fetching continues from its first API;
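A sketch of the batch filling of step 3.2 under the reading reconstructed above (targets offset by one, wrap-around at the end of the total sequence); it works on API IDs and leaves the ID-to-vector lookup to the embedding layer, an implementation shortcut rather than the claim's literal (N_b × N_s)-row vector matrix.

```python
import numpy as np

def next_batch(total_seq_ids, point, Nb, Ns):
    """Fill one input matrix and one target matrix from the API total sequence
    (given as a list of API IDs); the target of each API is the API after it."""
    L = len(total_seq_ids)
    inputs  = np.empty((Nb, Ns), dtype=np.int64)
    targets = np.empty((Nb, Ns), dtype=np.int64)
    for i in range(Nb):
        for j in range(Ns):
            idx = (point - 1 + i * Ns + j) % L   # point is 1-based; wrap around
            inputs[i, j]  = total_seq_ids[idx]
            targets[i, j] = total_seq_ids[(idx + 1) % L]
    new_point = (point - 1 + Nb * Ns) % L + 1    # last API read into the targets
    return inputs, targets, new_point
```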
step 3.3, sequentially extracting API vectors from the input matrix as inputs of the API recommendation model; at time t, the vector in the corresponding row of the input matrix is taken as the input vector of the model; denoting that API as API_t and its input vector as x_t, the input gate of the LSTM hidden layer is computed as i_t = σ(w_i x_t + u_i b_{t−1} + v_i c_{t−1}), the forget gate as f_t = σ(w_f x_t + u_f b_{t−1} + v_f c_{t−1}), the cell state as c_t = f_t · c_{t−1} + i_t · tanh(w_c x_t + u_c b_{t−1}), and the output gate as o_t = σ(w_o x_t + u_o b_{t−1} + v_o c_t); the output of the hidden layer is then b_t = o_t · tanh(c_t); the data passes from the hidden layer to the fully connected layer, and finally the output layer applies a Softmax classifier, which yields:
Pr(next API = j | x_1, …, x_t; θ) = exp(θ_j^T b_t) / Σ_{k=1}^{|V_D|} exp(θ_k^T b_t)

wherein |V_D| denotes the number of APIs contained in the API dictionary, θ denotes the current weights of the neural network, and θ_j denotes the set of weights corresponding to the j-th output node of the output layer (θ_1, for example, corresponds to the first output node); finally, the resulting probability vector is transposed and stored in the output probability matrix; the above is repeated until all API vectors in the input matrix have been fed into the API recommendation model;
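The gate equations of step 3.3 can be checked with a small NumPy sketch; the diagonal (elementwise) peephole terms v_* · c and the weight shapes are assumptions consistent with the formulas above, and the parameter names in `p` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, b_prev, c_prev, p):
    """One hidden-layer step with the gates of step 3.3; `p` holds the
    weight matrices w_*, u_* and peephole vectors v_*."""
    i_t = sigmoid(p["wi"] @ x_t + p["ui"] @ b_prev + p["vi"] * c_prev)
    f_t = sigmoid(p["wf"] @ x_t + p["uf"] @ b_prev + p["vf"] * c_prev)
    c_t = f_t * c_prev + i_t * np.tanh(p["wc"] @ x_t + p["uc"] @ b_prev)
    o_t = sigmoid(p["wo"] @ x_t + p["uo"] @ b_prev + p["vo"] * c_t)
    b_t = o_t * np.tanh(c_t)
    return b_t, c_t

def softmax_output(b_t, theta):
    """Output-layer Softmax: row j of `theta` holds the weights of output node j."""
    scores = theta @ b_t                  # one score per API in V_D
    e = np.exp(scores - scores.max())     # numerically stabilized exponentials
    return e / e.sum()                    # Pr(next API = j | input so far)
```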
step 3.4, calculating the cross entropy loss function from the output probability matrix and the target matrix; the cross entropy loss function is

J(θ) = −(1/N) Σ_{t=1}^{N} Σ_{j=1}^{|V_D|} 1(y_t = j) · log ŷ_{t,j},  with N = N_b × N_s,

wherein 1(·) denotes the indicator function, with 1(y_t = j) = 1 when y_t = j and 1(y_t = j) = 0 otherwise; y_t denotes the ID of the target output API at time t; and ŷ_{t,j} denotes the output probability in row t, column j of the output probability matrix;
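The loss of step 3.4, written directly over the output probability matrix (the 1e-12 epsilon is a numerical-safety assumption, not part of the formula):

```python
import numpy as np

def cross_entropy(prob_matrix, target_ids):
    """J(theta) from step 3.4: prob_matrix is (Nb*Ns) x n; target_ids holds
    the dictionary ID y_t of the target API for each row t."""
    N = prob_matrix.shape[0]
    picked = prob_matrix[np.arange(N), target_ids]   # y-hat_{t, y_t}
    return -np.log(picked + 1e-12).sum() / N
```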
step 3.5, taking the weights W of the network as variables, calculating the gradients of all weights from the cross entropy loss function, and using gradient clipping to keep the weight updates within a set range, as follows: first, a constant named the gradient clipping threshold, denoted clip_gradient, is set; during backpropagation the gradient of each parameter, denoted diff, is obtained; instead of updating the weights directly, the sum of squares of all weight gradients, denoted sumsq_diff, is computed first; if sumsq_diff is greater than clip_gradient, a scaling factor scale_factor = clip_gradient / sumsq_diff is computed; this scale_factor lies in (0, 1), and the larger the sum of squares of the gradients, the smaller the scaling factor; finally, all weight gradients are multiplied by the scaling factor, giving the final gradient information; the weights are then updated according to W = W − η∇J(θ), where ∇J(θ) denotes the corresponding weight gradient and η denotes the learning rate;
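A sketch of the clipping-and-update rule of step 3.5, implemented exactly as the text describes it (scaling by clip_gradient / sumsq_diff; common practice clips on the L2 norm instead); clip_gradient = 5.0 and the learning rate are illustrative values.

```python
def clip_and_update(weights, grads, clip_gradient=5.0, lr=0.01):
    """Rescale all gradients when their total sum of squares exceeds
    clip_gradient, then apply W = W - lr * grad.
    `weights` and `grads` are parallel lists of NumPy arrays."""
    sumsq_diff = sum(float((g ** 2).sum()) for g in grads)
    if sumsq_diff > clip_gradient:
        scale_factor = clip_gradient / sumsq_diff   # lies in (0, 1)
        grads = [g * scale_factor for g in grads]
    return [w - lr * g for w, g in zip(weights, grads)]
```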
step 3.6, repeating steps 3.2-3.5 until convergence, i.e., until the loss J(θ) no longer changes;
step 4, extracting an API sequence from the code being edited by the developer, and then generating a prediction subsequence set;
step 4.1, extracting an API sequence from the code being edited by the developer, recorded as P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API of the API sequence P and P_L denotes the L-th API, i.e., the API sequence P contains L APIs; the rule for extracting the API sequence is the same as in step 1.2;
step 4.2, taking the L-th API as the reference position and extending toward the front of the sequence, selecting all subsequences of length less than or equal to a threshold γ, i.e., Sub_i = {P_{L−i+1}, …, P_L} with 1 ≤ i ≤ γ; the set of these subsequences is the prediction subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ};
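The subsequence generation of step 4.2, under the indexing reconstructed above (Sub_i is the length-i suffix of P):

```python
def prediction_subsequences(P, gamma):
    """All suffixes of P ending at the last API, of length <= gamma."""
    L = len(P)
    return [P[L - i:] for i in range(1, min(gamma, L) + 1)]

# prediction_subsequences(['A.new', 'A.open', 'A.read'], 2)
# -> [['A.read'], ['A.open', 'A.read']]
```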
step 5, sequentially inputting the sequences of the prediction subsequence set V_Sub generated in step 4 into the API recommendation model trained in step 3, and outputting a probability matrix with |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1; the entry in row i, column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given the current API sequence Sub_i, the next API is the j-th API of the API dictionary; taking the maximum of each column of the generated prediction probability matrix T_prediction yields a one-dimensional probability matrix t; if the largest value of t lies in the m-th column, the m-th API of the API dictionary is recommended first.
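And a sketch of the recommendation step 5; returning a ranked top-10 list rather than only the single m-th API is an assumption about how "recommended first" would be surfaced in a tool.

```python
import numpy as np

def recommend(model_probs, api_names, k=10):
    """model_probs is the |V_Sub| x n matrix of next-API probabilities, one row
    per predicted subsequence; take the column-wise maximum and rank APIs by it."""
    t = model_probs.max(axis=0)          # one-dimensional probability matrix t
    order = np.argsort(t)[::-1]          # best column m comes first
    return [api_names[m] for m in order[:k]]
```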
CN201710687197.4A 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network Expired - Fee Related CN107506414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710687197.4A CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710687197.4A CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Publications (2)

Publication Number Publication Date
CN107506414A CN107506414A (en) 2017-12-22
CN107506414B (en) 2020-01-07

Family

ID=60690777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710687197.4A Expired - Fee Related CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Country Status (1)

Country Link
CN (1) CN107506414B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053033A (en) * 2017-12-27 2018-05-18 中南大学 A kind of function calling sequence generation method and system
CN110084356B (en) * 2018-01-26 2021-02-02 赛灵思电子科技(北京)有限公司 Deep neural network data processing method and device
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
CN108710634B (en) * 2018-04-08 2023-04-18 平安科技(深圳)有限公司 Protocol file pushing method and terminal equipment
CN108717423B (en) * 2018-04-24 2020-07-07 南京航空航天大学 Code segment recommendation method based on deep semantic mining
CN110209920A (en) * 2018-05-02 2019-09-06 腾讯科技(深圳)有限公司 Treating method and apparatus, storage medium and the electronic device of media resource
CN110502226B (en) * 2018-05-16 2023-06-09 富士通株式会社 Method and device for recommending codes in programming environment
CN110569030B (en) * 2018-06-06 2023-04-07 富士通株式会社 Code recommendation method and device
CN108733359B (en) * 2018-06-14 2020-12-25 北京航空航天大学 Automatic generation method of software program
CN108717470B (en) * 2018-06-14 2020-10-23 南京航空航天大学 Code segment recommendation method with high accuracy
CN108920457B (en) * 2018-06-15 2022-01-04 腾讯大地通途(北京)科技有限公司 Address recognition method and device and storage medium
CN108682418B (en) * 2018-06-26 2022-03-04 北京理工大学 Speech recognition method based on pre-training and bidirectional LSTM
CN109144498B (en) * 2018-07-16 2021-12-03 山东师范大学 API automatic recommendation method and device for object instantiation-oriented tasks
CN109086186B (en) * 2018-07-24 2022-02-15 中国联合网络通信集团有限公司 Log detection method and device
CN109146166A (en) * 2018-08-09 2019-01-04 南京安链数据科技有限公司 A kind of personal share based on the marking of investor's content of the discussions slumps prediction model
CN109117480B (en) * 2018-08-17 2022-05-27 腾讯科技(深圳)有限公司 Word prediction method, word prediction device, computer equipment and storage medium
CN110881966A (en) * 2018-09-10 2020-03-17 深圳市游弋科技有限公司 Algorithm for processing electrocardiogram data by using LSTM network
CN109522011B (en) * 2018-10-17 2021-05-25 南京航空航天大学 Code line recommendation method based on context depth perception of programming site
CN109857459B (en) * 2018-12-27 2022-03-08 中国海洋大学 E-level super-calculation ocean mode automatic transplanting optimization method and system
US10983761B2 (en) * 2019-02-02 2021-04-20 Microsoft Technology Licensing, Llc Deep learning enhanced code completion system
CN109886021A (en) * 2019-02-19 2019-06-14 北京工业大学 A kind of malicious code detecting method based on API overall situation term vector and layered circulation neural network
CN111666404A (en) * 2019-03-05 2020-09-15 腾讯科技(深圳)有限公司 File clustering method, device and equipment
CN110554860B (en) * 2019-06-27 2021-03-12 北京大学 Construction method and code generation method of software project natural language programming interface (NLI)
CN110750240A (en) * 2019-08-28 2020-02-04 南京航空航天大学 Code segment recommendation method based on sequence-to-sequence model
CN111159223B (en) * 2019-12-31 2021-09-03 武汉大学 Interactive code searching method and device based on structured embedding
CN112036963B (en) * 2020-09-24 2023-12-08 深圳市万佳安物联科技股份有限公司 Webpage advertisement putting device and method based on multilayer random hidden feature model
CN112860879A (en) * 2021-03-08 2021-05-28 南通大学 Code recommendation method based on joint embedding model
CN113111254B (en) * 2021-03-08 2023-04-07 支付宝(杭州)信息技术有限公司 Training method, fitting method and device of recommendation model and electronic equipment
CN113076089B (en) * 2021-04-15 2023-11-21 南京大学 API (application program interface) completion method based on object type
CN113239354A (en) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on recurrent neural network
CN114357146B (en) * 2021-12-13 2024-07-23 武汉大学 Diversified API recommendation method and device based on LSTM and diversified bundle search
CN115858942B (en) * 2023-02-27 2023-05-12 西安电子科技大学 User input-oriented serialization recommendation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129463A (en) * 2011-03-11 2011-07-20 北京航空航天大学 Project correlation fused and probabilistic matrix factorization (PMF)-based collaborative filtering recommendation system
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106886846A (en) * 2017-04-26 2017-06-23 中南大学 A kind of bank outlets' excess reserve Forecasting Methodology that Recognition with Recurrent Neural Network is remembered based on shot and long term
CN106980683A (en) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Blog text snippet generation method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10453099B2 (en) * 2015-12-11 2019-10-22 Fuji Xerox Co., Ltd. Behavior prediction on social media using neural networks


Also Published As

Publication number Publication date
CN107506414A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
US11983513B2 (en) Multi-lingual code generation with zero-shot inference
US20200334326A1 (en) Architectures for modeling comment and edit relations
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN109214006A (en) The natural language inference method that the hierarchical semantic of image enhancement indicates
CN113886601B (en) Electronic text event extraction method, device, equipment and storage medium
Haije et al. Automatic comment generation using a neural translation model
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN110807335A (en) Translation method, device, equipment and storage medium based on machine learning
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN118276913B (en) Code completion method based on artificial intelligence
CN116432184A (en) Malicious software detection method based on semantic analysis and bidirectional coding characterization
CN118318222A (en) Automatic notebook completion using sequence-to-sequence converter
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
US11941360B2 (en) Acronym definition network
CN111476035B (en) Chinese open relation prediction method, device, computer equipment and storage medium
CN117436451A (en) Agricultural pest and disease damage named entity identification method based on IDCNN-Attention
CN116501864A (en) Cross embedded attention BiLSTM multi-label text classification model, method and equipment
CN114372467A (en) Named entity extraction method and device, electronic equipment and storage medium
US11983488B1 (en) Systems and methods for language model-based text editing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200107

Termination date: 20200811