CN107506414B - Code recommendation method based on long-term and short-term memory network

Code recommendation method based on long-term and short-term memory network

Info

Publication number
CN107506414B
CN107506414B (application CN201710687197.4A)
Authority
CN
China
Prior art keywords
api
sequence
input
dictionary
matrix
Prior art date
Legal status
Expired - Fee Related
Application number
CN201710687197.4A
Other languages
Chinese (zh)
Other versions
CN107506414A (en)
Inventor
余啸
殷晓飞
刘进
伍蔓
姜加明
崔晓晖
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201710687197.4A priority Critical patent/CN107506414B/en
Publication of CN107506414A publication Critical patent/CN107506414A/en
Application granted granted Critical
Publication of CN107506414B publication Critical patent/CN107506414B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a code recommendation method based on a long short-term memory (LSTM) network, aimed at problems of existing code recommendation techniques such as low recommendation accuracy and low recommendation efficiency. The method extracts source code into API sequences, constructs a code recommendation model with a long short-term memory network, learns the relations between API calls, and then recommends code. Dropout is used to prevent model overfitting. Meanwhile, the ReLU function replaces the traditional saturating activation functions, alleviating the vanishing-gradient problem, speeding up model convergence, improving model performance, and fully exploiting the advantages of the neural network. The technical scheme of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.

Description

Code recommendation method based on long-term and short-term memory network
Technical Field
The invention belongs to the field of code recommendation, and particularly relates to a code recommendation method based on a long-term and short-term memory network.
Background
(1) Code recommendation system
Developers often build on mature software frameworks and class libraries to improve the efficiency and quality of software development, so they frequently need to know how to reuse an existing class library or framework by calling the corresponding APIs. Learning the APIs of an unfamiliar library or framework, however, is a major obstacle in the software development process. On one hand, the number of APIs newly added to the various mature software frameworks in recent years is very large, so developers must spend ever more time getting to know them. On the other hand, insufficient or inaccurate API code samples, incomplete or erroneous API documentation, and the complexity of the APIs themselves make learning and using them exceptionally difficult.
The centerpiece of the modern software development workflow is the Integrated Development Environment (IDE). IDEs were originally introduced as user interfaces for specific programming languages such as the widely used C++ and Java. By now, the IDE has evolved into a stand-alone computing product, closer to a full-featured document management and control system than a mere user interface for coding and debugging tools. To ease the difficulty developers have in using APIs, the core features of many advanced IDEs include automatic code recommendation. However, the code recommendation systems built into IDEs consider only the type compatibility and visibility of APIs, and their recommendation accuracy is low when facing a complex software framework. The main reason is that, after screening all APIs by such simple rules, these methods still recommend a large number of methods or fields, and finally just sort the recommendation results alphabetically.
A more accurate approach is to mine API usage patterns, apply them in a code recommendation system, and recommend and present to the developer the APIs most relevant to the developer's needs. Existing approaches to mining API usage patterns have certain shortcomings. For example, search-based code recommendation techniques are fast but do not exploit ordering information. Experience shows that the ordering information within a method is important: in an API call sequence, any use of an object must come after the object is constructed and declared, and any read or write of a file must come after the file is created. The order of API calls, i.e. their time-sequence information, can therefore help mine API usage patterns more reasonably. Graph-based approaches consider not only the ordering information but also structural information in the code, such as data dependencies and control dependencies, but the subgraph-search technique they rely on is inefficient in practice. Approaches based on natural language processing consider ordering information, strike a compromise on efficiency, and can capture usage patterns spanning multiple APIs.
(2) Deep learning
In recent years, deep learning has performed very well in the field of natural language processing, and the Recurrent Neural Network (RNN) is one of the most commonly used deep learning models. An RNN can process time sequences of arbitrary length and has shown remarkable ability in text classification, machine translation, part-of-speech tagging, image semantic analysis, and the like. However, the RNN model also has drawbacks. The essence of an RNN is to maintain a state in the hidden layer of the neural network to memorize history information, but as the sequence grows, training suffers from the vanishing-gradient or exploding-gradient problem. RNNs therefore do not perform well once the input sequence exceeds a certain length. On the other hand, a deep neural network trained beyond a certain number of iterations easily exhibits overfitting.
1) Long and short term memory network
To solve these problems of the conventional RNN, the Long Short-Term Memory (LSTM) model was developed. The LSTM model replaces the hidden-layer neurons of the neural network with a block structure, which adds an input gate, an output gate, a forget gate, and a cell structure used to control the learning and forgetting of historical information, making the model suitable for long-sequence problems. On this basis a large number of researchers have studied the LSTM model and derived many improved variants, such as the LSTM model proposed by Gers, which incorporates "peephole connections" and feeds the cell state into the gate layers. One derivative of LSTM proposed by Chung et al. is the Gated Recurrent Unit (GRU), which merges the forget gate and the input gate into an "update gate" and also merges the cell state with the hidden state; it is increasingly widely accepted. Other derived structures, such as Tree-LSTM (tree-structured long short-term memory network) and Bi-LSTM (bidirectional long short-term memory network), are widely used to solve many natural language processing problems.
At time t, denote the memory cell of the LSTM model by c_t, the forget gate by f_t, the input gate by i_t, and the output gate by o_t; the element values of the three gates all lie in the interval [0, 1]. At time t, the LSTM is computed as shown in equations (1) to (6):

i_t = σ(w_i x_t + u_i b_{t-1} + v_i c_{t-1})    (1)
f_t = σ(w_f x_t + u_f b_{t-1} + v_f c_{t-1})    (2)
o_t = σ(w_o x_t + u_o b_{t-1} + v_o c_t)    (3)
ĉ_t = tanh(w_c x_t + u_c b_{t-1})    (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t    (5)
b_t = o_t · tanh(c_t)    (6)
As shown in equation (1), the input of the input gate has three components: the input of the input layer at the current time t, the output of the hidden layer at the previous time t-1, and the state of the LSTM cell at the previous time t-1. The input gate controls what enters the cell state of the current hidden layer: it determines, through the gate computation, whether the input information is written into the cell state, where 1 means the information is allowed through and the corresponding value is updated, and 0 means it is blocked and the corresponding value is not updated.
The input of the forget gate has the same three components as that of the input gate, as shown in equation (2). The forget gate controls the historical information stored in the hidden layer at the previous time t-1: based on the previous hidden-layer output and the current input, it decides what to retain of the previous cell state c_{t-1}, where 1 means the corresponding information is retained and 0 means it is discarded.
As shown in equation (3), the input of the output gate also has three components: the input of the input layer at the current time t, the output of the hidden layer at the previous time t-1, and the state of the LSTM cell at the current time t. The output gate controls the output of the current hidden node, where 1 means the corresponding value is output and 0 means it is not.
As shown in equation (6), the output of the hidden layer at time t is b_t, with the output gate controlling what information is emitted.
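To make the gate computations concrete, the following is a minimal, self-contained Java sketch of one forward step of an LSTM cell following equations (1) to (6). It is illustrative only: the weights are reduced to one shared scalar per gate (instead of full matrices), and every class, method, and variable name is an assumption, not taken from the patent.

    import java.util.Arrays;

    public class LstmCellSketch {
        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

        /** One forward step for a cell of width m; scalar weights for brevity.
         *  x: current input x_t; bPrev: previous hidden output b_{t-1}; cPrev: previous cell state c_{t-1}. */
        static double[][] step(double[] x, double[] bPrev, double[] cPrev,
                               double wi, double ui, double vi,
                               double wf, double uf, double vf,
                               double wo, double uo, double vo,
                               double wc, double uc) {
            int m = x.length;
            double[] c = new double[m], b = new double[m];
            for (int k = 0; k < m; k++) {
                double i = sigmoid(wi * x[k] + ui * bPrev[k] + vi * cPrev[k]); // input gate, eq. (1)
                double f = sigmoid(wf * x[k] + uf * bPrev[k] + vf * cPrev[k]); // forget gate, eq. (2)
                double cand = Math.tanh(wc * x[k] + uc * bPrev[k]);            // candidate state, eq. (4)
                c[k] = f * cPrev[k] + i * cand;                                // cell-state update, eq. (5)
                double o = sigmoid(wo * x[k] + uo * bPrev[k] + vo * c[k]);     // output gate, eq. (3), peephole on c_t
                b[k] = o * Math.tanh(c[k]);                                    // hidden output, eq. (6)
            }
            return new double[][]{b, c};
        }

        public static void main(String[] args) {
            double[] x = {1.0, -0.5};
            double[] b0 = new double[2], c0 = new double[2];  // zero initial state
            double[][] out = step(x, b0, c0, 0.5,0.5,0.5, 0.5,0.5,0.5, 0.5,0.5,0.5, 0.5,0.5);
            System.out.println("b_1 = " + Arrays.toString(out[0]) + ", c_1 = " + Arrays.toString(out[1]));
        }
    }

Note how the forget and input gates jointly decide, per equation (5), how much of the old state survives and how much of the new candidate enters, which is what lets the model keep information over long sequences.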
2) Dropout techniques
Dropout is a technique proposed by Hinton in 2012 to prevent overfitting of a neural network. Its working mechanism is to randomly select a certain proportion of hidden-layer nodes and deactivate them; a deactivated node does not update its weights during that training pass, but the weights are kept, because in the next pass the node may again be randomly selected as active. During validation and use of the model, all nodes are used. The deep convolutional neural network AlexNet proposed by Hinton's student Alex Krizhevsky put dropout into practical use, applying it to the last fully connected layers of AlexNet and demonstrating its effect in preventing overfitting and improving generalization.
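A minimal Java sketch of the mechanism just described; the "inverted dropout" rescaling by 1/keepProb used here is one common convention and an assumption, as are all names:

    import java.util.Random;

    public class DropoutSketch {
        /** Randomly deactivates nodes during training; at validation/use time every node participates. */
        static double[] dropout(double[] h, double keepProb, Random rng, boolean training) {
            if (!training) return h.clone();
            double[] out = new double[h.length];
            for (int k = 0; k < h.length; k++) {
                // a deactivated node contributes 0 this pass; survivors are rescaled by 1/keepProb
                out[k] = rng.nextDouble() < keepProb ? h[k] / keepProb : 0.0;
            }
            return out;
        }

        public static void main(String[] args) {
            double[] h = {0.5, -1.2, 0.3, 2.0};
            System.out.println(java.util.Arrays.toString(dropout(h, 0.5, new Random(42), true)));
        }
    }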
3) ReLU function
The ReLU function was proposed by Nair & Hinton in 2010 and first applied to restricted Boltzmann machines. Because ReLU maps most values to 0, it adds sparsity to the network, which better matches the characteristics of biological neurons. Compared with the traditional sigmoid activation function, which saturates easily and causes the vanishing-gradient problem, ReLU does not suffer from this issue. Moreover, the ReLU function accelerates the convergence of model training.
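A small illustrative comparison (assumed inputs) of the ReLU and sigmoid derivatives in Java, showing why the sigmoid saturates for large |z| while ReLU keeps a unit gradient for positive inputs:

    public class ReluSketch {
        static double relu(double z) { return Math.max(0.0, z); }
        static double reluGrad(double z) { return z > 0 ? 1.0 : 0.0; }
        static double sigmoidGrad(double z) {
            double s = 1.0 / (1.0 + Math.exp(-z));
            return s * (1 - s);   // vanishes for large |z|
        }

        public static void main(String[] args) {
            for (double z : new double[]{-6, -1, 0.5, 6}) {
                System.out.printf("z=%5.1f  relu'=%.3f  sigmoid'=%.4f%n", z, reluGrad(z), sigmoidGrad(z));
            }
        }
    }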
(3) Language model
1) Word vector
Word vectors are a key technology of deep learning in the field of natural language processing. The word vector technique represents a natural-language word with a dense feature vector instead of the original one-hot vector, compressing the original high-dimensional sparse vector into a low-dimensional dense one. The invention treats an API as a word: corresponding to the vocabulary in natural language processing it proposes an API dictionary, and corresponding to word vectors it proposes API vectors.
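As an illustration of this compression, using the dictionary size n = 15 and vector dimension M = 100 of the embodiment below (the dense values are those assigned to w_1 there):

one-hot representation: v_{File.new} = (1, 0, 0, …, 0) ∈ {0, 1}^{15}
API vector: w_1 = (0.1, 0.3, 0.5, 0.5, …, 0.5) ∈ ℝ^{100}

The one-hot vector grows with the dictionary and carries no notion of similarity, whereas the learned API vector is dense and allows related APIs to end up close together.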
2) Probabilistic language model
Software has naturalness, and statistical language models have been applied to various software engineering tasks such as code recommendation and code completion. These techniques treat source code as a special kind of natural language and analyze it using statistical natural language processing techniques.
A language model is a probabilistic model of how a language is generated: it tells us the likelihood that a particular sentence is produced in the language. For a sentence y, let y = (y_1, y_2, …, y_n) be its word sequence; the role of the language model is to estimate the joint probability Pr(y_1, y_2, …, y_n). By the known formula

Pr(y_1, y_2, …, y_n) = ∏_{t=1}^{n} Pr(y_t | y_1, …, y_{t-1}),

computing the joint probability Pr(y_1, y_2, …, y_n) can be translated into computing, for each word in the sentence, the conditional probability of that word given the preceding words. However, estimating these conditional probabilities is difficult, so they are currently approximated with an "n-gram" model, as in

Pr(y_t | y_1, …, y_{t-1}) ≈ Pr(y_t | y_{t-n+1}, …, y_{t-1}).

The drawback is the assumption that the next word depends only on the previous n-1 words.
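As a concrete illustration with an assumed three-word sentence and a bigram model (n = 2):

Pr(y_1, y_2, y_3) = Pr(y_1) · Pr(y_2 | y_1) · Pr(y_3 | y_1, y_2) ≈ Pr(y_1) · Pr(y_2 | y_1) · Pr(y_3 | y_2),

so the bigram approximation conditions each word only on its single predecessor.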
A neural language model is a neural network-based language model. Unlike "n-grams" which predict the next word from a fixed length of previous words, neural language models can predict the next word from a longer sequence of previous words, while at the same time, they can learn word vectors very efficiently.
Disclosure of Invention
The invention provides a code recommendation method based on a long-short term memory network, aiming at the problems that in a code recommendation system, the existing code recommendation algorithm cannot consider time sequence information, is low in recommendation efficiency and the like.
The technical scheme provided by the invention is a code recommendation method based on a long short-term memory network, which comprises the following steps:
step 1, crawling at least ten thousand Java open source software codes from a GitHub website through a web crawler, wherein the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form a source code library, then preprocessing the source codes to form an API sequence transaction library, and generating an API dictionary and an API vector matrix, wherein the method specifically comprises the following steps of:
step 1.1, at least ten thousand Java open source software codes are crawled from a GitHub website by using a web crawler, the number of times of updating versions of each Java open source software code exceeds 1000, and the open source software codes form a source code library.
Step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains; the API sequences extracted from all methods in the source code library form the API sequence transaction library. The rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted. The API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs. The API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs.
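The extraction rule of step 1.2 can be sketched as follows. This is a deliberately naive, regex-based Java illustration — a real implementation would parse the abstract syntax tree — and every pattern, class, and method name here is an assumption for illustration only:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ApiSequenceSketch {
        // "Type var = new Type(...)": emit "Type.new" and remember var -> Type
        private static final Pattern NEW_STMT =
            Pattern.compile("(\\w+)\\s+(\\w+)\\s*=\\s*new\\s+(\\w+)\\s*\\(");
        // "var.method(...)": emit "Type.method" if var's type is known
        private static final Pattern CALL_STMT = Pattern.compile("(\\w+)\\.(\\w+)\\s*\\(");

        static List<String> extract(String methodBody) {
            List<String> apis = new ArrayList<>();
            java.util.Map<String, String> varTypes = new java.util.HashMap<>();
            for (String line : methodBody.split("\\R")) {
                Matcher m = NEW_STMT.matcher(line);
                if (m.find()) {
                    apis.add(m.group(3) + ".new");       // object-creation statement
                    varTypes.put(m.group(2), m.group(3));
                    continue;
                }
                Matcher c = CALL_STMT.matcher(line);
                if (c.find() && varTypes.containsKey(c.group(1))) {
                    apis.add(varTypes.get(c.group(1)) + "." + c.group(2)); // method-call statement
                }
            }
            return apis;
        }

        public static void main(String[] args) {
            String body = "File file = new File(fileName);\nif (file.isFile()) {\n}";
            System.out.println(extract(body)); // [File.new, File.isFile]
        }
    }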
Step 1.3, extracting the API dictionary from the API sequence transaction library and generating the API vector matrix.
The API dictionary is defined as follows: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D.
The generation process of the API dictionary and the API vector matrix is as follows: traverse the API sequence transaction library and judge whether the current API already exists in the API dictionary; if so, ignore it and continue traversing with the next API; otherwise, add the current API to the API dictionary and assign it a unique ID and a random M-dimensional API vector. The n M-dimensional API vectors of the n APIs contained in the API dictionary form the API vector matrix. The API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors are learned when the LSTM model is trained.
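A compact Java sketch of the dictionary and vector-matrix construction in step 1.3 (M = 100 follows the embodiment; all identifiers are illustrative assumptions):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class ApiDictionarySketch {
        static final int M = 100;
        final Map<String, Integer> idOf = new LinkedHashMap<>(); // API name -> 1-based unique ID
        final List<double[]> vectors = new ArrayList<>();        // row i-1 holds the vector of API i
        private final Random rng = new Random(0);

        void addSequence(List<String> apiSequence) {
            for (String api : apiSequence) {
                if (idOf.containsKey(api)) continue;   // already in the dictionary: ignore
                idOf.put(api, idOf.size() + 1);        // assign the next unique ID
                double[] w = new double[M];
                for (int k = 0; k < M; k++) w[k] = rng.nextDouble(); // random initial API vector
                vectors.add(w);                        // becomes one row of the API vector matrix
            }
        }

        public static void main(String[] args) {
            ApiDictionarySketch dict = new ApiDictionarySketch();
            dict.addSequence(List.of("File.new", "File.isFile", "File.new"));
            System.out.println(dict.idOf); // {File.new=1, File.isFile=2}
        }
    }

The LinkedHashMap preserves insertion order, so the i-th dictionary entry and the i-th row of the vector matrix stay aligned, which is the property the training step relies on.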
Step 2, constructing the API recommendation model, namely constructing the long short-term memory network. The long short-term memory network is defined to comprise an input layer, a hidden layer, a fully connected layer and an output layer; wherein,
the input layer receives a string of numerical inputs, which are propagated forward to the hidden layer; the current output of the hidden layer is also influenced by the output of the hidden layer at the previous moment. The output generated by the hidden layer is input into the fully connected layer, the output data of the fully connected layer is input into the output layer, and the Softmax classifier in the output layer outputs the final classification result.
The neural unit of the hidden layer is the long short-term memory (LSTM) unit, the dropout technique is used to prevent the long short-term memory network from overfitting, and the ReLU function is used as the neuron activation function. The number of neurons in the input layer is M, the dimension of the API vector generated in step 1.3. The number of neurons in the hidden layer is M, the number of neurons in the fully connected layer is M, and the number of neurons in the output layer is n, where n is the number of APIs contained in the API dictionary and M, n are positive integers.
Step 3, training the API recommendation model, namely training the long short-term memory network.
The input of the API recommendation model is a matrix T_input of N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, M denotes the dimension of the API vector, and the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence.
The target matrix T_target of the API recommendation model is a matrix of N_b rows and N_s columns, where the entry in row i and column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th input API of the i-th batch.
The output of the API recommendation model is an output probability matrix T_prob of N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary, and the entry in row i and column j is the probability that the next API predicted after the i-th API of the input sequence belongs to the j-th API in the API dictionary.
The method comprises the following steps:
Step 3.1, connecting all the API sequences in the API sequence transaction library end to end to generate the total API sequence.
Step 3.2, setting a pointer variable point with initial value 1. Starting from the point-th API of the total API sequence, N_s consecutive APIs are extracted at a time, for N_b batches in total. For each extracted API, its corresponding ID is read from the API dictionary, and the ID is used to fetch the corresponding vector from the API vector matrix and store it in the input matrix T_input; the vector of the j-th API of the i-th batch is stored in row (i−1)×N_s + j of T_input. For the target matrix, starting from the (point+1)-th API of the total sequence, N_s APIs are again extracted at a time for N_b batches, and for each API its corresponding ID is read from the API dictionary and stored in the target matrix. Finally, after the input matrix and the target matrix are filled, point is set to the position in the total sequence of the last API read into the target matrix. It is worth noting that after the last API of the total sequence has been extracted, extraction continues from the first API of the total sequence. (A sketch of this bookkeeping is given after step 3.6 below.)
And 3.3, sequentially extracting API vectors from the input matrix to serve as input of an API recommendation model, and regarding the moment t, sequentially taking the vector of each row of API in the input matrix as the input vector of the model, and recording the API as the APItMarking the input as xtIf the calculation result of the hidden layer input gate of the LSTM model is it=σ(wixt+uibt+vict-1) Forgetting to calculate as result ft=σ(wfxt+ufbt+vfct-1) The output gate is calculated as ot=σ(woxt+uobt+voct) Finally the output of the hidden layer is bt=ot·tanh(ct) Data is transmitted from the hidden layer to the full link layer and finally transmittedAnd using a Softmax classifier for layer outlet. The output layer is obtained:
Figure BDA0001377040230000071
wherein | VDI represents the number of APIs contained in the API dictionary, theta represents the current weight of the neural network, and theta represents the current weight of the neural network1And representing a set of weight values corresponding to the first output node of the output layer. Finally, the formula is transposed and stored in the output probability matrix. This step is repeated until all the API vectors in the input matrix are entered into the API recommendation model.
And 3.4, calculating a cross entropy loss function by using the output probability matrix and the target matrix. A cross entropy loss function of
Figure BDA0001377040230000072
Wherein l represents an indicator function, l (y)tJ) represents when ytWhen j is equal, l (y)tJ) 1, otherwise l (y)t=j)=0,ytThe ID of the target output API at time i.
Figure BDA0001377040230000073
Representing the output probability of the ith row and the jth column in the output probability matrix.
And 3.5, calculating the gradients of all weights in the network by taking the weights in the network as variables according to the cross entropy loss function. Meanwhile, based on gradient cutting, the updating of the weight value is controlled within a set range; the method comprises the following steps: firstly, a constant named gradient clipping is set, and is marked as clip _ gradient, when the backward propagation is carried out, the gradient of each parameter is obtained, and is marked as diff, at the moment, the weight is not selected to be directly updated, the sum of squares of all weight gradients is firstly solved, and is marked as sumsq _ diff, if the sum of squares of all weight gradients is greater than the clip _ gradient, the scaling factor is continuously solved, and is marked as scale _ factor which is clip _ gradient/sumsq _ diff. This scale _ factor is between (0, 1). If the sum of squares of the weight gradients is larger, the scaling factor will be smaller. Finally, all the weight gradients are multiplied by the scaling factor, and the obtained gradient is the final gradient information. The weight is updated according to the formula W ═ W ∑ J (θ), ∑ J (θ) represents the corresponding weight gradient, and η represents the learning rate.
And 3.6, repeating the steps 3.2-3.5 until convergence, namely the loss J (theta) does not rise or fall any more.
Step 4, extracting an API sequence from the code being edited by the developer, and then generating the predicted subsequence set.
Step 4.1, extracting an API sequence from the code being edited by the developer, denoted P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API in the API sequence P and P_L denotes the L-th API, i.e. the API sequence P contains L APIs. The rule for extracting the API sequence is the same as in step 1.2.
Step 4.2, taking the L-th API as the reference position and selecting backwards all subsequences of length at most the threshold γ, i.e. the subsequences Sub_i = {P_{L−i+1}, …, P_L}, where 1 ≤ i ≤ γ. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ} (see the sketch after step 5).
Step 5, inputting the sequences of the predicted subsequence set V_Sub generated in step 4 one by one into the API recommendation model trained in step 3, and outputting a probability matrix of |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1. The entry in row i and column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given that the current API sequence is the predicted subsequence Sub_i, the next API is the j-th API in the API dictionary. Taking the maximum of each column of the generated prediction probability matrix T_prediction gives a one-dimensional probability vector t; if the largest value of t lies in column m, the m-th API in the API dictionary is recommended first.
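Steps 4.2 and 5 amount to taking suffixes of the edited sequence and an argmax over the column maxima of the model's output. A Java sketch (the model itself is stubbed by a fixed probability matrix, and all names are assumptions):

    import java.util.ArrayList;
    import java.util.List;

    public class RecommendSketch {
        /** Step 4.2: Sub_i is the suffix of P consisting of its last i APIs, for i = 1..gamma. */
        static List<List<String>> suffixSubsequences(List<String> p, int gamma) {
            List<List<String>> subs = new ArrayList<>();
            int L = p.size();
            for (int i = 1; i <= Math.min(gamma, L); i++)
                subs.add(new ArrayList<>(p.subList(L - i, L)));
            return subs;
        }

        /** Step 5: each row of probs is the model output Pr(w_j | Sub_i); recommend the column
         *  whose maximum over all rows is largest. Returns a 1-based dictionary ID. */
        static int recommend(double[][] probs) {
            int n = probs[0].length;
            double[] colMax = new double[n];
            for (double[] row : probs)
                for (int j = 0; j < n; j++) colMax[j] = Math.max(colMax[j], row[j]);
            int best = 0;
            for (int j = 1; j < n; j++) if (colMax[j] > colMax[best]) best = j;
            return best + 1;
        }

        public static void main(String[] args) {
            List<String> p = List.of("File.new", "Scanner.new", "Scanner.hasNextLine", "Scanner.nextLine");
            System.out.println(suffixSubsequences(p, 3));
            double[][] probs = {{0.1, 0.2, 0.7}, {0.3, 0.3, 0.4}};  // stubbed model output, 2 subsequences x 3 APIs
            System.out.println("recommend API id " + recommend(probs)); // 3
        }
    }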
Aiming at the problems of low recommendation accuracy and low recommendation efficiency common in existing code recommendation techniques, the method extracts source code into API sequences, constructs a code recommendation model with a long short-term memory network, learns the relations between API calls, and then recommends code. Dropout is used to prevent model overfitting. Meanwhile, the ReLU function replaces the traditional saturating activation functions, alleviating the vanishing-gradient problem, speeding up model convergence, improving model performance, and fully exploiting the advantages of the neural network. The technical scheme of the invention is simple and fast, and improves both the accuracy and the efficiency of code recommendation.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows the code of the readTxtFile method of the embodiment;
FIG. 3 shows the code of the writeTxtFile method of the embodiment;
FIG. 4 illustrates the API sequence transaction library extracted in the embodiment;
FIG. 5 illustrates the extracted API dictionary;
FIG. 6 illustrates the long short-term memory network;
Detailed Description
To facilitate understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here merely illustrates and explains the invention and does not restrict it.
The flow of the code recommendation method based on the long short-term memory network provided by the invention is shown in FIG. 1; all steps can be run automatically by those skilled in the art using computer software technology. The embodiment is specifically implemented as follows:
step 1, in order to make a source code library have high reliability and practicability, at least ten thousand Java open source software codes are crawled from a GitHub website through a web crawler, the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form the source code library, then the source codes are preprocessed to form an API sequence transaction library, and an API dictionary and an API vector matrix are generated, specifically comprising:
step 1.1, at least ten thousand Java open source software codes are crawled, the times of updating versions of each Java open source software code exceed 1000, and the open source software codes form a source code library.
Step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains; the API sequences extracted from all methods in the source code library form the API sequence transaction library. The rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted. The API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs. The API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs.
In this embodiment, in the code of the readTxtFile method in FIG. 2, the first statement "File file = new File(fileName)" is an object-creation statement, and the extracted API is File.new; the second statement "if (file.isFile())" is an object method-call statement, and the extracted API is File.isFile; the third statement "FileInputStream fileInputStream = new FileInputStream(file)" is an object-creation statement, and the extracted API is FileInputStream.new; the fourth statement "InputStreamReader read = new InputStreamReader(fileInputStream)" is an object-creation statement, and the extracted API is InputStreamReader.new; the fifth statement "BufferedReader bufferedReader = new BufferedReader(read)" is an object-creation statement, and the extracted API is BufferedReader.new; the call to "bufferedReader.readLine()" is an object method-call statement, and the extracted API is BufferedReader.readLine; the call to "System.out.println(...)" yields the API System.out.println; and the ninth statement "read.close()" is an object method-call statement, and the extracted API is InputStreamReader.close. Therefore the API sequence extracted from the code of the readTxtFile method in FIG. 2 is File.new, File.isFile, FileInputStream.new, InputStreamReader.new, BufferedReader.new, BufferedReader.readLine, System.out.println, InputStreamReader.close.
In the present embodiment, in the code of the writeTxtFile() method in FIG. 3, the first statement "File bookFile = new File("page.txt")" is an object-creation statement, and the extracted API is File.new; the second statement "Scanner bookSc = new Scanner(bookFile)" is an object-creation statement, and the extracted API is Scanner.new; the third statement "File authorFile = new File("author.txt")" is an object-creation statement, and the extracted API is File.new; the fourth statement "FileWriter fw = new FileWriter(authorFile)" is an object-creation statement, and the extracted API is FileWriter.new; the call "bookSc.hasNextLine()" is an object method-call statement, and the extracted API is Scanner.hasNextLine; the call "bookSc.nextLine()" yields Scanner.nextLine; the call "fw.append(...)" yields FileWriter.append; the eighth statement "fw.close()" is an object method-call statement, and the extracted API is FileWriter.close; and the ninth statement "bookSc.close()" is an object method-call statement, and the extracted API is Scanner.close. Thus the API sequence extracted from the code of the writeTxtFile method in FIG. 3 is File.new, Scanner.new, File.new, FileWriter.new, Scanner.hasNextLine, Scanner.nextLine, FileWriter.append, FileWriter.close, Scanner.close.
Finally, all the API sequences extracted by the two methods form an API sequence transaction library shown in FIG. 4.
Step 1.3, extracting the API dictionary from the API sequence transaction library and generating the API vector matrix.
The API dictionary is defined as follows: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D.
The generation process of the API dictionary and the API vector matrix is as follows: traverse the API sequence transaction library and judge whether the current API already exists in the API dictionary; if so, ignore it and continue traversing with the next API; otherwise, add the current API to the API dictionary and assign it a unique ID and a random M-dimensional API vector. The n M-dimensional API vectors of the n APIs contained in the API dictionary form the API vector matrix. The API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors are learned when the LSTM model is trained.
In this embodiment, the API sequence transaction library of FIG. 4 is traversed. The first API of the first API sequence, File.new, does not exist in the API dictionary, so it is given the unique ID 1 and a random 100-dimensional API vector w_1 = [0.1, 0.3, 0.5, 0.5, …, 0.5] and added to the dictionary; the current dictionary is V_D = {1: File.new, w_1}. The second API, File.isFile, does not exist in the dictionary, so it is given ID 2 and a random 100-dimensional vector w_2 = [0.2, 0.5, 0.5, 0.4, …, 0.7]; the dictionary becomes V_D = {1: File.new, w_1, 2: File.isFile, w_2}. In the same way, the third API, FileInputStream.new, receives ID 3 and w_3 = [0.4, 0.2, 0.5, 0.2, …, 0.2]; the fourth, InputStreamReader.new, receives ID 4 and w_4 = [0.3, 0.3, 0.5, 0.2, …, 0.9]; the fifth, BufferedReader.new, receives ID 5 and w_5 = [0.1, 0.6, 0.5, 0.6, …, 0.5]; the sixth, BufferedReader.readLine, receives ID 6 and w_6 = [0.5, 0.3, 0.5, 0.7, …, 0.3]; the seventh, System.out.println, receives ID 7 and w_7 = [0.1, 0.3, 0.5, 0.5, …, 0.5]; and the eighth, InputStreamReader.close, receives ID 8 and w_8 = [0.7, 0.2, 0.1, 0.8, …, 0.3]. After the first sequence, the dictionary is V_D = {1: File.new, w_1, 2: File.isFile, w_2, 3: FileInputStream.new, w_3, 4: InputStreamReader.new, w_4, 5: BufferedReader.new, w_5, 6: BufferedReader.readLine, w_6, 7: System.out.println, w_7, 8: InputStreamReader.close, w_8}.
The first API of the second API sequence, File.new, already exists in the API dictionary and is ignored. The second API, Scanner.new, does not exist in the dictionary and receives ID 9 and w_9 = [0.3, 0.8, 0.2, 0.1, …, 0.7]. The third API, File.new, already exists and is ignored. The fourth API, FileWriter.new, receives ID 10 and w_10 = [0.4, 0.2, 0.8, 0.7, …, 0.3]; the fifth, Scanner.hasNextLine, receives ID 11 and w_11 = [0.1, 0.4, 0.5, 0.3, …, 0.1]; the sixth, Scanner.nextLine, receives ID 12 and w_12 = [0.5, 0.3, 0.5, 0.7, …, 0.3]; the seventh, FileWriter.append, receives ID 13 and w_13 = [0.3, 0.1, 0.7, 0.3, …, 0.6]; the eighth, FileWriter.close, receives ID 14 and w_14 = [0.4, 0.8, 0.4, 0.2, …, 0.1]; and the ninth, Scanner.close, receives ID 15 and w_15 = [0.5, 0.2, 0.3, 0.1, …, 0.2]. The finally extracted API dictionary is V_D = {1: File.new, w_1, 2: File.isFile, w_2, 3: FileInputStream.new, w_3, 4: InputStreamReader.new, w_4, 5: BufferedReader.new, w_5, 6: BufferedReader.readLine, w_6, 7: System.out.println, w_7, 8: InputStreamReader.close, w_8, 9: Scanner.new, w_9, 10: FileWriter.new, w_10, 11: Scanner.hasNextLine, w_11, 12: Scanner.nextLine, w_12, 13: FileWriter.append, w_13, 14: FileWriter.close, w_14, 15: Scanner.close, w_15}. In this embodiment, the 15 100-dimensional API vectors of the 15 APIs contained in the API dictionary form the API vector matrix shown in FIG. 5.
Step 2, constructing the API recommendation model, namely constructing the long short-term memory network. Referring to FIG. 6, the long short-term memory network consists of an input layer, a hidden layer, a fully connected layer and an output layer. The input layer receives a string of numerical inputs, which are propagated forward to the hidden layer; the current output of the hidden layer is also influenced by the output of the hidden layer at the previous moment. The output generated by the hidden layer is input into the fully connected layer, the output data of the fully connected layer is input into the output layer, and the Softmax classifier in the output layer outputs the final classification result. In the concrete implementation, the neural unit of the hidden layer is the long short-term memory (LSTM) unit, the dropout technique is used to prevent the network from overfitting, and the ReLU function is used as the neuron activation function. In this embodiment, the number of neurons in the input layer is 100, 100 being the dimension of the API vectors generated in step 1.3; the hidden layer has 100 neurons, the fully connected layer 100 neurons, and the output layer 15 neurons, 15 being the number of APIs contained in the API dictionary.
Step 3, training the API recommendation model, namely training the long short-term memory network.
The input of the API recommendation model is a matrix T_input of N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, M denotes the dimension of the API vector, and the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence.
The target matrix T_target of the API recommendation model is a matrix of N_b rows and N_s columns, where the entry in row i and column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th input API of the i-th batch.
The output of the API recommendation model is an output probability matrix T_prob of N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary, and the entry in row i and column j is the probability that the next API predicted after the i-th API of the input sequence belongs to the j-th API in the API dictionary.
The method mainly comprises the following steps:
Step 3.1, connecting all the API sequences in the API sequence transaction library end to end to generate the total API sequence.
In this embodiment, after all API sequences in the API sequence transaction library of FIG. 4 are connected end to end, the generated total API sequence is: File.new, File.isFile, FileInputStream.new, InputStreamReader.new, BufferedReader.new, BufferedReader.readLine, System.out.println, InputStreamReader.close, File.new, Scanner.new, File.new, FileWriter.new, Scanner.hasNextLine, Scanner.nextLine, FileWriter.append, FileWriter.close, Scanner.close.
Step 3.2, setting a pointer variable point (the initial value of point is 1). Starting from the point-th API of the total API sequence, N_s consecutive APIs are extracted at a time, for N_b batches in total. For each extracted API, its corresponding ID is read from the API dictionary, and the ID is used to extract the vector corresponding to the API from the API vector matrix and store it in the input matrix T_input; the vector of the j-th API of the i-th batch is stored in row (i−1)×N_s + j of T_input. For the target matrix, starting from the (point+1)-th API of the total API sequence, N_s APIs are again extracted at a time for N_b batches, and for each API its corresponding ID is read from the API dictionary and stored in the target matrix. Finally, after the input matrix and the target matrix are filled, point is set to the position in the total sequence of the last API read into the target matrix. It is worth noting that after the last API of the total sequence has been extracted, extraction continues from the first API of the total sequence.
In this embodiment, the batch size N_b is set to 2, the sequence length N_s to 2, and the API vector dimension to 100. In the initial stage point is 1, and 2 APIs are extracted at a time from the 1st API of the total API sequence, for 2 batches in total, so the extracted APIs are File.new, File.isFile (batch 1) and FileInputStream.new, InputStreamReader.new (batch 2). For each API, its corresponding ID is read from the API dictionary and used to extract the corresponding vector from the API vector matrix, which is stored in the input matrix T_input; the input matrix is therefore the 4-row, 100-column matrix whose rows are w_1, w_2, w_3, w_4.
For the target matrix, 2 APIs are extracted at a time starting from the 2nd API of the total API sequence, again for 2 batches, so the extracted APIs are File.isFile, FileInputStream.new (batch 1) and InputStreamReader.new, BufferedReader.new (batch 2); reading their corresponding IDs from the API dictionary and storing them in the target matrix gives
T_target = [[2, 3], [4, 5]].
Since the number of APIs contained in the API dictionary is 15, an input matrix of 4 (= 2 × 2) rows and 100 columns, an output probability matrix of 4 (= 2 × 2) rows and 15 columns, and a target matrix of 2 rows and 2 columns are established.
Step 3.3, sequentially extracting API vectors from the input matrix as input to the API recommendation model. For time t, the vector of the current row of the input matrix is taken as the input vector of the model; denote this API by API_t and its input vector by x_t. The hidden-layer input gate of the LSTM model is then computed as i_t = σ(w_i x_t + u_i b_{t-1} + v_i c_{t-1}), the forget gate as f_t = σ(w_f x_t + u_f b_{t-1} + v_f c_{t-1}), and the output gate as o_t = σ(w_o x_t + u_o b_{t-1} + v_o c_t); finally the output of the hidden layer is b_t = o_t · tanh(c_t). The data is passed from the hidden layer to the fully connected layer, and finally the output layer uses a Softmax classifier, which yields, for j = 1, …, |V_D|,

Pr(API_{t+1} = j | x_1, …, x_t; θ) = exp(θ_j b_t) / Σ_{k=1}^{|V_D|} exp(θ_k b_t),

where |V_D| denotes the number of APIs contained in the API dictionary, θ denotes the current weights of the neural network, and θ_1 denotes the set of weights corresponding to the first output node of the output layer. Finally this probability vector is transposed and stored in the output probability matrix. This step is repeated until all the API vectors in the input matrix have been input into the API recommendation model.
In this embodiment, the input matrix generated in step 3.2 (the 4-row, 100-column matrix whose rows are w_1, w_2, w_3, w_4) is input into the API recommendation model, and a 4-row, 15-column output probability matrix is obtained (its concrete values are shown in the figure of the original publication).
And 3.4, calculating a loss function by using the output probability matrix and the target matrix. A cross entropy loss function of
Figure BDA0001377040230000164
Wherein l represents an indicator function, l (y)tJ) represents when ytWhen j is equal, l (y)tJ) 1, otherwise l (y)t=j)=0,ytThe ID of the target output API at time i.
Figure BDA0001377040230000166
Representing the output probability of the ith row and the jth column in the output probability matrix.
The target matrix of the embodiment is T_target = [[2, 3], [4, 5]]. From the output probability matrix in step 3.3, the corresponding loss J(θ) is obtained (its numeric value is shown in the figure of the original publication).
Step 3.5, taking the weights of the network as variables, calculating the gradients of all weights in the network from the loss function. Meanwhile, the gradient clipping technique is introduced to keep the weight updates within a suitable range, better controlling the exploding-gradient problem. In the concrete implementation, a constant called the clipping threshold, denoted clip_gradient, is set first. During backpropagation the gradient of each parameter, denoted diff, is obtained; the weights are not updated immediately, but the sum of squares of all weight gradients, denoted sumsq_diff, is computed first. If sumsq_diff is greater than clip_gradient, a scaling factor scale_factor = clip_gradient / sumsq_diff is computed; this scale_factor lies in (0, 1), and the larger the sum of squared weight gradients, the smaller the scaling factor. Finally all the weight gradients are multiplied by scale_factor, and the result is the final gradient information. The weights are updated according to W = W − η∇J(θ), where ∇J(θ) denotes the corresponding weight gradient and η denotes the learning rate.
Step 3.6, repeating steps 3.2-3.5 until convergence, i.e. until the loss J(θ) no longer rises or falls.
Step 4, extracting an API sequence from the code being edited by the developer, and then generating the predicted subsequence set.
Step 4.1, extracting an API sequence from the code being edited by the developer, denoted P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API in the API sequence P and P_L denotes the L-th API, i.e. the API sequence P contains L APIs. The rule for extracting the API sequence is the same as in step 1.2.
Step 4.2, taking the L-th API as the reference position and selecting backwards all subsequences of length at most the threshold γ, i.e. the subsequences Sub_i = {P_{L−i+1}, …, P_L}, where 1 ≤ i ≤ γ. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ}.
In this embodiment, suppose the user is editing the code shown in the figures of the original publication. The API sequence extracted from the code is File.new, Scanner.new, Scanner.hasNextLine, Scanner.nextLine, and the statement whose API needs to be predicted is "noteSc.?". With the threshold γ set to 3, the predicted subsequences Sub_1 = {Scanner.nextLine}, Sub_2 = {Scanner.hasNextLine, Scanner.nextLine} and Sub_3 = {Scanner.new, Scanner.hasNextLine, Scanner.nextLine} are obtained. The set of these subsequences is the predicted subsequence set V_Sub = {Sub_1, Sub_2, Sub_3}.
Step 5, inputting the sequences of the predicted subsequence set V_Sub generated in step 4 one by one into the API recommendation model trained in step 3, and outputting a probability matrix of |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1. The entry in row i and column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given that the current API sequence is the predicted subsequence Sub_i, the next API is the j-th API in the API dictionary. Taking the maximum of each column of the generated prediction probability matrix T_prediction gives a one-dimensional probability vector t; if the largest value of t lies in column m, the m-th API in the API dictionary is recommended first.
In this embodiment, taking the predicted subsequence set obtained in step 4 as an example, a prediction probability matrix of 3 rows and 15 columns is established, and the predicted subsequences Sub_1 = {Scanner.nextLine}, Sub_2 = {Scanner.hasNextLine, Scanner.nextLine} and Sub_3 = {Scanner.new, Scanner.hasNextLine, Scanner.nextLine} are input into the trained model in turn, with the outputs stored in the prediction probability matrix (its concrete values are shown in the figure of the original publication). Taking the maximum of each column of the generated prediction probability matrix T_prediction yields the one-dimensional probability vector t = [0.6, 0.3, 0.5, 0.2, 0.3, 0.2, 0.3, 0.4, 0.3, 0.5, 0.2, 0.3, 0.4, 0.3, 0.8]. Since the value in the 15th column of t is the largest, the 15th API in the API dictionary, Scanner.close, is recommended first.
The specific embodiment described herein is merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiment, or substitute it in similar ways, without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (1)

1. A code recommendation method based on a long-term and short-term memory network is characterized by comprising the following steps:
step 1, crawling at least ten thousand Java open source software codes from a GitHub website through a web crawler, wherein the number of times of updating versions of each Java open source software code exceeds 1000, the open source software codes form a source code library, then preprocessing the source codes to form an API sequence transaction library, and generating an API dictionary and an API vector matrix, wherein the method specifically comprises the following steps of:
step 1.1, using a web crawler to crawl from the GitHub website at least ten thousand Java open-source software projects, each with more than 1000 version updates, these open-source codes forming the source code library;
step 1.2, taking each method as a unit, extracting the API sequence of the method from the code it contains, all API sequences extracted from all methods in the source code library forming the API sequence transaction library; the rule for extracting an API sequence from the code of a method is that only the APIs of object-creation statements and of object method-call statements are extracted; the API extracted from an object-creation statement is expressed as "ClassName.new", where ClassName is the name of the class to which the new object belongs; the API extracted from an object method-call statement is expressed as "ClassName.methodName", where ClassName is the name of the class to which the object belongs;
step 1.3, extracting an API dictionary from the API sequence transaction library and generating an API vector matrix;
the API dictionary is defined as: denoting the API sequence transaction library as D, the API dictionary can be written V_D = {1: API_1, w_1; 2: API_2, w_2; …; i: API_i, w_i; …; n: API_n, w_n}, where n is the number of APIs contained in the API dictionary, API_i denotes the name of the i-th API in V_D, and w_i denotes the vector of the i-th API in V_D;
the generation process of the API dictionary and the API vector matrix is: traversing the API sequence transaction library and judging whether the current API already exists in the API dictionary; if so, ignoring the current API and continuing with the next API; otherwise, adding the current API to the API dictionary and assigning it a unique ID and a random M-dimensional API vector; the n M-dimensional API vectors of the n APIs contained in the API dictionary forming the API vector matrix; the API vector matrix serves as a parameter of the Long Short-Term Memory (LSTM) network model, and the API vectors can be learned when the LSTM model is trained;
step 2, constructing an API recommendation model, namely constructing a long-term and short-term memory network; defining a long-term and short-term memory network comprising an input layer, a hidden layer, a full link layer and an output layer; wherein,
the input layer receives a string of numerical value input, the numerical value input is input into the hidden layer through forward propagation, the current output of the hidden layer is influenced by the output at the moment on the hidden layer, the output generated by the hidden layer is input into the full-link layer, the full-link layer outputs data and is input into the output layer, and a Softmax classifier in the output layer outputs the final classification result;
the neural units of the hidden layer are long short-term memory units; the dropout technique is used to prevent the network from overfitting, and the ReLU function is used as the neuron activation function; the number of neurons of the input layer is M, the dimension of the API vectors generated in step 1.3; the number of neurons of the hidden layer is M, the number of neurons of the fully connected layer is M, and the number of neurons of the output layer is n, the number of APIs contained in the API dictionary; M and n are positive integers;
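For illustration, the architecture of step 2 could be sketched in PyTorch roughly as follows; treating the API vector matrix as a trainable embedding table follows step 1.3, while the exact placement of dropout and anything beyond the layer sizes stated in the claim are assumptions.

```python
import torch
import torch.nn as nn

class APIRecommender(nn.Module):
    """Input layer (M) -> LSTM hidden layer (M) -> fully connected layer (M,
    ReLU) -> output layer (n, Softmax), with dropout against overfitting."""
    def __init__(self, n_apis, M=128, dropout=0.5):
        super().__init__()
        # The API vector matrix is a trainable parameter (embedding table).
        self.embed = nn.Embedding(n_apis, M)
        self.lstm = nn.LSTM(M, M, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(M, M)
        self.out = nn.Linear(M, n_apis)

    def forward(self, ids):             # ids: (N_b, N_s) API IDs
        x = self.embed(ids)             # (N_b, N_s, M) API vectors
        h, _ = self.lstm(x)             # hidden-layer outputs b_t
        h = self.drop(torch.relu(self.fc(h)))
        return self.out(h)              # logits; Softmax is applied in the loss
```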
step 3, training an API recommendation model, namely training a long-term and short-term memory network;
the input of the API recommendation model is a matrix T_input with N_b × N_s rows and M columns, where N_b denotes the batch size, N_s denotes the sequence length, and M denotes the dimension of the API vector; the i-th row of the matrix is the vector corresponding to the i-th API of the input sequence;
the target matrix T_target of the API recommendation model is a matrix with N_b rows and N_s columns, where the entry in row i, column j is the ID, in the API dictionary generated in step 1.3, of the target output API corresponding to the j-th API of the i-th input sequence;
the output of the API recommendation model is an output probability matrix T_prob with N_b × N_s rows and n columns, where n denotes the number of APIs contained in the API dictionary; the entry in row i, column j is the predicted probability that the API following the i-th API of the input sequence is the j-th API of the API dictionary;
the method comprises the following steps:
step 3.1, connecting all API sequences in the API sequence transaction library end to generate an API total sequence;
step 3.2, setting a pointer variable point with initial value 1; starting from the point-th API of the API total sequence, fetching N_s APIs at a time, N_b times in total; for each fetched API, reading its ID from the API dictionary, using the ID to extract the corresponding vector from the API vector matrix, and storing the vector in the input matrix T_input, so that, for example, the vector of the j-th API of the i-th fetch is stored in row i × j of T_input; for the target matrix, starting from the API immediately following the point-th API of the API total sequence, likewise fetching N_s APIs at a time, N_b times in total, reading the ID of each API from the API dictionary and storing it in the target matrix; finally, after the input matrix and the target matrix are filled, the pointer variable point is set to the position in the API total sequence of the last API read into the target matrix; it is worth noting that after the last API of the API total sequence is fetched, fetching continues from its first API;
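A sketch of the batch filling of step 3.2 under the reading reconstructed above (targets offset by one, wrap-around at the end of the total sequence); it works on API IDs and leaves the ID-to-vector lookup to the embedding layer, an implementation shortcut rather than the claim's literal (N_b × N_s)-row vector matrix.

```python
import numpy as np

def next_batch(total_seq_ids, point, Nb, Ns):
    """Fill one input matrix and one target matrix from the API total sequence
    (given as a list of API IDs); the target of each API is the API after it."""
    L = len(total_seq_ids)
    inputs  = np.empty((Nb, Ns), dtype=np.int64)
    targets = np.empty((Nb, Ns), dtype=np.int64)
    for i in range(Nb):
        for j in range(Ns):
            idx = (point - 1 + i * Ns + j) % L   # point is 1-based; wrap around
            inputs[i, j]  = total_seq_ids[idx]
            targets[i, j] = total_seq_ids[(idx + 1) % L]
    new_point = (point - 1 + Nb * Ns) % L + 1    # last API read into the targets
    return inputs, targets, new_point
```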
step 3.3, sequentially extracting API vectors from the input matrix as inputs of the API recommendation model; at time t, the vector in the corresponding row of the input matrix is taken as the input vector of the model; denoting that API as API_t and its input vector as x_t, the input gate of the LSTM hidden layer is computed as i_t = σ(w_i x_t + u_i b_{t−1} + v_i c_{t−1}), the forget gate as f_t = σ(w_f x_t + u_f b_{t−1} + v_f c_{t−1}), the cell state as c_t = f_t · c_{t−1} + i_t · tanh(w_c x_t + u_c b_{t−1}), and the output gate as o_t = σ(w_o x_t + u_o b_{t−1} + v_o c_t); the output of the hidden layer is then b_t = o_t · tanh(c_t); the data passes from the hidden layer to the fully connected layer, and finally the output layer applies a Softmax classifier, which yields:
Pr(next API = j | x_1, …, x_t; θ) = exp(θ_j^T b_t) / Σ_{k=1}^{|V_D|} exp(θ_k^T b_t)

wherein |V_D| denotes the number of APIs contained in the API dictionary, θ denotes the current weights of the neural network, and θ_j denotes the set of weights corresponding to the j-th output node of the output layer (θ_1, for example, corresponds to the first output node); finally, the resulting probability vector is transposed and stored in the output probability matrix; the above is repeated until all API vectors in the input matrix have been fed into the API recommendation model;
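The gate equations of step 3.3 can be checked with a small NumPy sketch; the diagonal (elementwise) peephole terms v_* · c and the weight shapes are assumptions consistent with the formulas above, and the parameter names in `p` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, b_prev, c_prev, p):
    """One hidden-layer step with the gates of step 3.3; `p` holds the
    weight matrices w_*, u_* and peephole vectors v_*."""
    i_t = sigmoid(p["wi"] @ x_t + p["ui"] @ b_prev + p["vi"] * c_prev)
    f_t = sigmoid(p["wf"] @ x_t + p["uf"] @ b_prev + p["vf"] * c_prev)
    c_t = f_t * c_prev + i_t * np.tanh(p["wc"] @ x_t + p["uc"] @ b_prev)
    o_t = sigmoid(p["wo"] @ x_t + p["uo"] @ b_prev + p["vo"] * c_t)
    b_t = o_t * np.tanh(c_t)
    return b_t, c_t

def softmax_output(b_t, theta):
    """Output-layer Softmax: row j of `theta` holds the weights of output node j."""
    scores = theta @ b_t                  # one score per API in V_D
    e = np.exp(scores - scores.max())     # numerically stabilized exponentials
    return e / e.sum()                    # Pr(next API = j | input so far)
```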
step 3.4, calculating the cross entropy loss function from the output probability matrix and the target matrix; the cross entropy loss function is

J(θ) = −(1/N) Σ_{t=1}^{N} Σ_{j=1}^{|V_D|} 1(y_t = j) · log ŷ_{t,j},  with N = N_b × N_s,

wherein 1(·) denotes the indicator function, with 1(y_t = j) = 1 when y_t = j and 1(y_t = j) = 0 otherwise; y_t denotes the ID of the target output API at time t; and ŷ_{t,j} denotes the output probability in row t, column j of the output probability matrix;
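The loss of step 3.4, written directly over the output probability matrix (the 1e-12 epsilon is a numerical-safety assumption, not part of the formula):

```python
import numpy as np

def cross_entropy(prob_matrix, target_ids):
    """J(theta) from step 3.4: prob_matrix is (Nb*Ns) x n; target_ids holds
    the dictionary ID y_t of the target API for each row t."""
    N = prob_matrix.shape[0]
    picked = prob_matrix[np.arange(N), target_ids]   # y-hat_{t, y_t}
    return -np.log(picked + 1e-12).sum() / N
```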
step 3.5, taking the weights W of the network as variables, calculating the gradients of all weights from the cross entropy loss function, and using gradient clipping to keep the weight updates within a set range, as follows: first, a constant named the gradient clipping threshold, denoted clip_gradient, is set; during backpropagation the gradient of each parameter, denoted diff, is obtained; instead of updating the weights directly, the sum of squares of all weight gradients, denoted sumsq_diff, is computed first; if sumsq_diff is greater than clip_gradient, a scaling factor scale_factor = clip_gradient / sumsq_diff is computed; this scale_factor lies in (0, 1), and the larger the sum of squares of the gradients, the smaller the scaling factor; finally, all weight gradients are multiplied by the scaling factor, giving the final gradient information; the weights are then updated according to W = W − η∇J(θ), where ∇J(θ) denotes the corresponding weight gradient and η denotes the learning rate;
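A sketch of the clipping-and-update rule of step 3.5, implemented exactly as the text describes it (scaling by clip_gradient / sumsq_diff; common practice clips on the L2 norm instead); clip_gradient = 5.0 and the learning rate are illustrative values.

```python
def clip_and_update(weights, grads, clip_gradient=5.0, lr=0.01):
    """Rescale all gradients when their total sum of squares exceeds
    clip_gradient, then apply W = W - lr * grad.
    `weights` and `grads` are parallel lists of NumPy arrays."""
    sumsq_diff = sum(float((g ** 2).sum()) for g in grads)
    if sumsq_diff > clip_gradient:
        scale_factor = clip_gradient / sumsq_diff   # lies in (0, 1)
        grads = [g * scale_factor for g in grads]
    return [w - lr * g for w, g in zip(weights, grads)]
```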
step 3.6, repeating steps 3.2-3.5 until convergence, i.e., until the loss J(θ) no longer changes;
step 4, extracting an API sequence from the code being edited by the developer, and then generating a prediction subsequence set;
step 4.1, extracting an API sequence from the code being edited by the developer, recorded as P = {P_1, P_2, …, P_i, …, P_L}, where P_i denotes the i-th API of the API sequence P and P_L denotes the L-th API, i.e., the API sequence P contains L APIs; the rule for extracting the API sequence is the same as in step 1.2;
step 4.2, taking the L-th API as the reference position and extending toward the front of the sequence, selecting all subsequences of length less than or equal to a threshold γ, i.e., Sub_i = {P_{L−i+1}, …, P_L} with 1 ≤ i ≤ γ; the set of these subsequences is the prediction subsequence set V_Sub = {Sub_1, Sub_2, …, Sub_γ};
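The subsequence generation of step 4.2, under the indexing reconstructed above (Sub_i is the length-i suffix of P):

```python
def prediction_subsequences(P, gamma):
    """All suffixes of P ending at the last API, of length <= gamma."""
    L = len(P)
    return [P[L - i:] for i in range(1, min(gamma, L) + 1)]

# prediction_subsequences(['A.new', 'A.open', 'A.read'], 2)
# -> [['A.read'], ['A.open', 'A.read']]
```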
step 5, sequentially inputting the sequences of the prediction subsequence set V_Sub generated in step 4 into the API recommendation model trained in step 3, and outputting a probability matrix with |V_Sub| rows and n columns, where |V_Sub| is the number of sequences in the subsequence set V_Sub and n is the number of APIs contained in the API dictionary generated in step 1; the entry in row i, column j of the probability matrix is the conditional probability Pr(w_j | Sub_i) that, given the current API sequence Sub_i, the next API is the j-th API of the API dictionary; taking the maximum of each column of the generated prediction probability matrix T_prediction yields a one-dimensional probability matrix t; if the largest value of t lies in the m-th column, the m-th API of the API dictionary is recommended first.
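And a sketch of the recommendation step 5; returning a ranked top-10 list rather than only the single m-th API is an assumption about how "recommended first" would be surfaced in a tool.

```python
import numpy as np

def recommend(model_probs, api_names, k=10):
    """model_probs is the |V_Sub| x n matrix of next-API probabilities, one row
    per predicted subsequence; take the column-wise maximum and rank APIs by it."""
    t = model_probs.max(axis=0)          # one-dimensional probability matrix t
    order = np.argsort(t)[::-1]          # best column m comes first
    return [api_names[m] for m in order[:k]]
```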
CN201710687197.4A 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network Expired - Fee Related CN107506414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710687197.4A CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710687197.4A CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Publications (2)

Publication Number Publication Date
CN107506414A CN107506414A (en) 2017-12-22
CN107506414B (en) 2020-01-07

Family

ID=60690777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710687197.4A Expired - Fee Related CN107506414B (en) 2017-08-11 2017-08-11 Code recommendation method based on long-term and short-term memory network

Country Status (1)

Country Link
CN (1) CN107506414B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053033A (en) * 2017-12-27 2018-05-18 中南大学 A kind of function calling sequence generation method and system
CN110084356B (en) * 2018-01-26 2021-02-02 赛灵思电子科技(北京)有限公司 Deep neural network data processing method and device
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
CN108710634B (en) * 2018-04-08 2023-04-18 平安科技(深圳)有限公司 Protocol file pushing method and terminal equipment
CN108717423B (en) * 2018-04-24 2020-07-07 南京航空航天大学 Code segment recommendation method based on deep semantic mining
CN110209920A (en) * 2018-05-02 2019-09-06 腾讯科技(深圳)有限公司 Treating method and apparatus, storage medium and the electronic device of media resource
CN110502226B (en) * 2018-05-16 2023-06-09 富士通株式会社 Method and device for recommending codes in programming environment
CN110569030B (en) * 2018-06-06 2023-04-07 富士通株式会社 Code recommendation method and device
CN108733359B (en) * 2018-06-14 2020-12-25 北京航空航天大学 Automatic generation method of software program
CN108717470B (en) * 2018-06-14 2020-10-23 南京航空航天大学 Code segment recommendation method with high accuracy
CN108920457B (en) * 2018-06-15 2022-01-04 腾讯大地通途(北京)科技有限公司 Address recognition method and device and storage medium
CN108682418B (en) * 2018-06-26 2022-03-04 北京理工大学 Speech recognition method based on pre-training and bidirectional LSTM
CN109144498B (en) * 2018-07-16 2021-12-03 山东师范大学 API automatic recommendation method and device for object instantiation-oriented tasks
CN109086186B (en) * 2018-07-24 2022-02-15 中国联合网络通信集团有限公司 Log detection method and device
CN109146166A (en) * 2018-08-09 2019-01-04 南京安链数据科技有限公司 A kind of personal share based on the marking of investor's content of the discussions slumps prediction model
CN109117480B (en) * 2018-08-17 2022-05-27 腾讯科技(深圳)有限公司 Word prediction method, word prediction device, computer equipment and storage medium
CN110881966A (en) * 2018-09-10 2020-03-17 深圳市游弋科技有限公司 Algorithm for processing electrocardiogram data by using LSTM network
CN109522011B (en) * 2018-10-17 2021-05-25 南京航空航天大学 Code line recommendation method based on context depth perception of programming site
CN109857459B (en) * 2018-12-27 2022-03-08 中国海洋大学 E-level super-calculation ocean mode automatic transplanting optimization method and system
US10983761B2 (en) * 2019-02-02 2021-04-20 Microsoft Technology Licensing, Llc Deep learning enhanced code completion system
CN109886021A (en) * 2019-02-19 2019-06-14 北京工业大学 A kind of malicious code detecting method based on API overall situation term vector and layered circulation neural network
CN111666404A (en) * 2019-03-05 2020-09-15 腾讯科技(深圳)有限公司 File clustering method, device and equipment
CN110554860B (en) * 2019-06-27 2021-03-12 北京大学 Construction method and code generation method of software project natural language programming interface (NLI)
CN110750240A (en) * 2019-08-28 2020-02-04 南京航空航天大学 Code segment recommendation method based on sequence-to-sequence model
CN111159223B (en) * 2019-12-31 2021-09-03 武汉大学 Interactive code searching method and device based on structured embedding
CN112036963B (en) * 2020-09-24 2023-12-08 深圳市万佳安物联科技股份有限公司 Webpage advertisement putting device and method based on multilayer random hidden feature model
CN112860879A (en) * 2021-03-08 2021-05-28 南通大学 Code recommendation method based on joint embedding model
CN113111254B (en) * 2021-03-08 2023-04-07 支付宝(杭州)信息技术有限公司 Training method, fitting method and device of recommendation model and electronic equipment
CN113076089B (en) * 2021-04-15 2023-11-21 南京大学 API (application program interface) completion method based on object type
CN113239354A (en) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on recurrent neural network
CN114357146B (en) * 2021-12-13 2024-07-23 武汉大学 Diversified API recommendation method and device based on LSTM and diversified bundle search
CN115858942B (en) * 2023-02-27 2023-05-12 西安电子科技大学 User input-oriented serialization recommendation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129463A (en) * 2011-03-11 2011-07-20 北京航空航天大学 Project correlation fused and probabilistic matrix factorization (PMF)-based collaborative filtering recommendation system
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106886846A (en) * 2017-04-26 2017-06-23 中南大学 A kind of bank outlets' excess reserve Forecasting Methodology that Recognition with Recurrent Neural Network is remembered based on shot and long term
CN106980683A (en) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Blog text snippet generation method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10453099B2 (en) * 2015-12-11 2019-10-22 Fuji Xerox Co., Ltd. Behavior prediction on social media using neural networks


Also Published As

Publication number Publication date
CN107506414A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
US11983513B2 (en) Multi-lingual code generation with zero-shot inference
US20200334326A1 (en) Architectures for modeling comment and edit relations
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN109214006A (en) The natural language inference method that the hierarchical semantic of image enhancement indicates
CN113886601B (en) Electronic text event extraction method, device, equipment and storage medium
Haije et al. Automatic comment generation using a neural translation model
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN110807335A (en) Translation method, device, equipment and storage medium based on machine learning
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN118276913B (en) Code completion method based on artificial intelligence
CN116432184A (en) Malicious software detection method based on semantic analysis and bidirectional coding characterization
CN118318222A (en) Automatic notebook completion using sequence-to-sequence converter
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
US11941360B2 (en) Acronym definition network
CN111476035B (en) Chinese open relation prediction method, device, computer equipment and storage medium
CN117436451A (en) Agricultural pest and disease damage named entity identification method based on IDCNN-Attention
CN116501864A (en) Cross embedded attention BiLSTM multi-label text classification model, method and equipment
CN114372467A (en) Named entity extraction method and device, electronic equipment and storage medium
US11983488B1 (en) Systems and methods for language model-based text editing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200107

Termination date: 20200811