CN108717423B - Code segment recommendation method based on deep semantic mining

Info

Publication number: CN108717423B
Application number: CN201810371788.5A
Authority: CN
Inventors: 陶传奇; 包盼盼; 黄志球; 周宇; 王铁鑫
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-04-24
Filing date: 2018-04-24
Publication date: 2020-07-07
Anticipated expiration: 2038-04-24
Also published as: CN108717423A

Abstract

The invention discloses a code segment recommendation method based on deep semantic mining. The method draws on the role of deep learning in natural language processing and its strengths in natural language semantic mining, and combines them with the characteristics of query-based code segment recommendation. Based on the input natural language query, the code segments themselves, and the comments carried by the code segments, the method deeply mines the natural language semantics and the specific functions of the code segments to generate sentence vectors and paragraph vectors, so that code segments and natural language queries with consistent semantic attributes are mapped close to one another in the vector space, and the N best-matching code segments, ordered from high to low similarity, are recommended for the given query. The method improves both the accuracy and the recall of recommendation, and is more tolerant of imprecise natural language queries.

Description

Code segment recommendation method based on deep semantic mining

Technical Field

The invention belongs to the technical field of query-based code recommendation, and particularly relates to a code segment recommendation method based on deep semantic mining.

Background

When writing code, developers often encounter unfamiliar programming tasks or need to implement specific functions. If they can find existing similar code segments, either to learn how such code is used or to copy it and then modify it for reuse, they save a great deal of time, effort, and meaningless repeated work. How to recommend high-quality code segments that match developers' actual needs is therefore an important issue in software reuse.

In actual development, a developer typically uses a search engine to look for the required code segments. However, because software code is an integrated whole, the keywords in a code segment cannot accurately describe the function of the segment, so the query results are usually unsatisfactory. In addition, existing recommendation methods usually focus only on the code segment itself and ignore its description information, even though the description information states the function of the code segment in natural language most simply and intuitively. In recent years, with the wide application of deep learning, the field of language processing has made breakthrough progress, so that deep semantic and information mining of both natural languages and programming languages can achieve good results. Combining language processing technology with code recommendation is therefore a new and effective approach to recommendation.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a code segment recommendation method based on deep semantic mining, which uses deep learning to support code segment recommendation for natural language queries. Based on the user's natural language query and on the comments and bodies of the code segments, the invention deeply mines the natural language semantics and the specific functions of the code segments, so that annotated code segments and natural language queries with consistent semantic attributes are mapped close to one another in the vector space, and the best-matching code segments are recommended for the given query.

To achieve this purpose, the invention adopts the following technical scheme:

The code segment recommendation method based on deep semantic mining of the invention comprises the following steps:

Step 1): constructing a large-scale code segment set S with method description information;

Step 2): constructing a method description information set D₁ and a method body set D₂, and constructing an annotation set D₁'; using the constructed data sets, training an Encoder-Decoder natural language sentence vector generator model M₁ and an Encoder-Decoder programming language paragraph vector generator model M₂;

Step 3): extracting the method name Name of each code segment in the code segment set S, and forming key-value pairs of the form <Name, α'> with the mapped vector representation α' of the code segment, as the index file used in recommendation;

Step 4): for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from the code segment set S with method description information, the N best-matching code segments in ranked order for the query.

Preferably, the step 1) specifically comprises: acquiring specific projects from an open source software platform, and cutting the source code files in the specific projects by taking a method as a unit to obtain the code segment set S with method description information, wherein each code segment is named in the form package name & class name & method name.

Preferably, in the step 1), the specific projects are Java projects, Android projects, or other projects.

Preferably, the step 2) specifically includes:

21) taking the method description information set D₁ as the training set, training the Encoder-Decoder natural language sentence vector generator model M₁ until it converges to a specified state, which completes the training of the natural language sentence vector generator; extracting from the method description information set D₁ the first sentence of the annotation of each code segment as input, and generating a natural language sentence vector α₁, which forms part of the vector representation of the corresponding annotated code segment;

22) taking the method body set D₂ as the training set, training the Encoder-Decoder programming language paragraph vector generator model M₂ until it reaches a specified convergence state, which completes the model training, and at the same time generating a paragraph vector α₂ for the body of each code segment;

23) weighting and adding the natural language sentence vector α₁ and the paragraph vector α₂ of the code segment body to obtain a vector α, which serves as the vector that finally characterizes the entire annotated code segment; then taking the set of all vectors α and the natural language sentence vector representations of the annotation set D₁' as the training set to train a neural network mapping model M₃; and, after training is completed, mapping the vector α through the neural network mapping model M₃ into the mapped vector representation α' of the annotated code segment.
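The weighted addition in item 23) can be sketched as follows. This is a minimal illustration only: the text does not state the weight values, so the weights `w1` and `w2`, the function name `combine_vectors`, and the three-element toy vectors are all assumptions standing in for vectors of the specified dimension.

```python
from typing import List, Sequence


def combine_vectors(alpha1: Sequence[float], alpha2: Sequence[float],
                    w1: float = 0.5, w2: float = 0.5) -> List[float]:
    """Return alpha = w1 * alpha1 + w2 * alpha2, element by element.

    w1 and w2 are hypothetical weights; the source does not specify them.
    """
    if len(alpha1) != len(alpha2):
        raise ValueError("alpha1 and alpha2 must have the same dimension")
    return [w1 * a + w2 * b for a, b in zip(alpha1, alpha2)]


# Toy sentence vector (alpha1) and code-body paragraph vector (alpha2).
alpha = combine_vectors([0.2, 0.4, 0.6], [0.6, 0.0, 0.2])
```

Both input vectors must have the same dimension for the addition to be defined, which is why the sketch rejects mismatched lengths.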

Preferably, the step 4) specifically includes:

41) given a natural language input, using the trained Encoder-Decoder natural language sentence vector generator model M₁ to compute a query sentence vector β of a specified dimension;

42) expressing the similarity between two vectors by the cosine of the angle θ between them, computing the similarity between the mapped vector representation α' of each code segment and the query sentence vector β, recommending for the given natural language query the N code segments whose vectors are most similar to β, and ranking the N code segments from high to low similarity.
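For reference, the similarity named in item 42) is the standard cosine of the angle between two vectors. Writing d for the vector dimension (a symbol introduced here only for illustration), it can be stated as:

```latex
\cos\theta(\alpha', \beta)
  = \frac{\alpha' \cdot \beta}{\lVert \alpha' \rVert \, \lVert \beta \rVert}
  = \frac{\sum_{k=1}^{d} \alpha'_k \beta_k}
         {\sqrt{\sum_{k=1}^{d} {\alpha'_k}^{2}} \; \sqrt{\sum_{k=1}^{d} \beta_k^{2}}}
```

A value closer to 1 means the two vectors point in more nearly the same direction.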

The beneficial effects of the invention are as follows:

The invention uses the role of deep learning in natural language processing and its advantages in language semantic mining to solve the problem of how to recommend high-quality annotated code segments for a given natural language query. It has the following advantages:

(1) Processing natural language with deep learning mines the natural language semantics in depth rather than merely matching text keywords, so that the sentence vectors of sentences with the same semantics lie closer together in the semantic space. The meaning that the query intends to express can thus be captured, matching during recommendation is more accurate, and recommendation accuracy is improved.

(2) Vectorizing the code segment body as a paragraph with deep learning mines both the structural information of the code segment and its semantic information at the programming language level, rather than merely extracting feature words. The information contained in the code segment body is thus fully mined, which can improve the recommendation of code segments.

(3) The N code segments whose annotation semantics are most similar are obtained by deep semantic matching and returned as the recommendation result, ranked from high to low semantic similarity. Even if the input query is not expressed clearly enough or deviates slightly, a suitable recommendation can still be found at a somewhat lower position in the ranking, which improves the recall and provides a degree of fault tolerance.

Drawings

FIG. 1 is a diagram of a framework model used in generating sentence vectors and paragraph vectors in the present invention.

FIG. 2 is a schematic diagram of an Encoder-Decoder model used in the present invention.

FIG. 3 is a schematic diagram of basic units in an Encoder-Decoder model used in the present invention.

FIG. 4 is a schematic diagram of the present invention.

Detailed Description

To help those skilled in the art understand the invention, it is further described below with reference to the examples and drawings, which are not intended to limit the invention.

The technical solution of the invention is described in detail below with reference to FIGS. 1-4, taking Java code segment recommendation as an example:

Step 1: constructing a large-scale code segment set S with method description information, wherein:

11) Java projects are obtained from an open source software platform (such as GitHub), and the Java files in the projects are cut by taking a method as a unit to obtain methods with method description information; each method is written into a file whose file name has the form package name & class name & method name.

12) The preliminarily obtained code segment set S with method description information is screened, and poor-quality code segments (such as those without method description information) and useless code segments (such as test methods) are deleted, to obtain a reduced, high-quality set S.
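The naming and screening in items 11) and 12) might look roughly like the sketch below. It assumes the methods have already been cut from the Java files by some parser, which is not shown. The `Method` record, the function names, and the rule "a method whose name starts with `test` is a test method" are all illustrative assumptions, not details given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Method:
    """One method cut from a Java source file (fields are illustrative)."""
    package: str
    class_name: str
    method_name: str
    description: str  # method description text, e.g. a Javadoc comment; may be empty
    body: str


def segment_name(m: Method) -> str:
    """Build the 'package name & class name & method name' identifier."""
    return f"{m.package}&{m.class_name}&{m.method_name}"


def is_useful(m: Method) -> bool:
    """Assumed screening rule: keep methods that have a description
    and whose name does not start with 'test'."""
    has_description = bool(m.description.strip())
    looks_like_test = m.method_name.lower().startswith("test")
    return has_description and not looks_like_test


def build_segment_set(methods: Iterable[Method]) -> Dict[str, Method]:
    """Return the screened set S as {segment name: Method}."""
    return {segment_name(m): m for m in methods if is_useful(m)}


# Hypothetical input: one good method, one without a description, one test.
methods = [
    Method("org.example.util", "ListUtil", "removeFirst",
           "Remove the first instance of a value.", "{ /* ... */ }"),
    Method("org.example.util", "ListUtil", "helper", "", "{ /* ... */ }"),
    Method("org.example.util", "ListUtilTest", "testRemoveFirst",
           "Checks removeFirst.", "{ /* ... */ }"),
]
S = build_segment_set(methods)
```

A real screening step would likely need richer quality criteria; the two checks here only mirror the two examples the text names.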

Step 2: constructing the method description information set D₁ used for training natural language sentence vectors and the method body set D₂ used for training programming language paragraph vectors;

The description information of all methods is extracted to obtain the method description information set D₁ used for training natural language sentence vectors; the first sentence of the description information of each method is extracted to obtain the annotation set D₁'; and the code bodies of all methods are extracted to obtain the method body set D₂ used for training the paragraph vectors of code segments.
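As a rough illustration of how the three sets could be derived, the sketch below splits (description, body) pairs into D₁, D₁' and D₂. The sentence-boundary heuristic and the function names are assumptions; the text does not say how the first sentence is detected, and abbreviations such as "e.g." would defeat this simple rule.

```python
import re
from typing import List, Sequence, Tuple


def first_sentence(description: str) -> str:
    """Return the text up to and including the first sentence-ending mark.

    Heuristic (an assumption): a sentence ends at '.', '!' or '?' followed
    by whitespace or by the end of the text.
    """
    text = " ".join(description.split())  # collapse line breaks and repeated spaces
    match = re.search(r"[.!?](?=\s|$)", text)
    return text if match is None else text[: match.end()]


def build_sets(segments: Sequence[Tuple[str, str]]
               ) -> Tuple[List[str], List[str], List[str]]:
    """Split (description, body) pairs into D1, D1' and D2."""
    d1 = [desc for desc, _ in segments]
    d1_prime = [first_sentence(desc) for desc, _ in segments]
    d2 = [body for _, body in segments]
    return d1, d1_prime, d2


# Hypothetical annotated method: description text plus method body.
segments = [
    ("Get file content. Returns an empty string if the file is missing.",
     "String read(File f) { /* ... */ }"),
]
D1, D1_prime, D2 = build_sets(segments)
```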

Step 3: constructing and training the natural language sentence vector generator and the programming language paragraph vector generator to obtain the vector representation α of each annotated code segment, and then mapping the vector α through a trained neural network mapping model to obtain the vector α', wherein:

31) for the natural language sentence vector generator, the method description information set D₁ is taken as the training set, and the Encoder-Decoder natural language sentence vector generator model M₁ is trained until it converges to a specified state, which completes the training of the natural language sentence vector generator; the first sentence of the annotation of each code segment is extracted from the method description information set D₁ as the input of M₁ to generate a natural language sentence vector α₁, which forms part of the vector representation of the corresponding annotated code segment;

32) for the programming language paragraph vector generator, the method body set D₂ is taken as input, and the Encoder-Decoder programming language paragraph vector generator model M₂ is trained until it reaches a specified convergence state, which completes the training; the paragraph vector α₂ of each code segment body is generated during the training process;

33) the natural language sentence vector α₁ and the paragraph vector α₂ of the code segment body are weighted and added to obtain a vector α, which serves as the vector that finally represents the entire annotated code segment; the vector α of each annotated code segment and the vector representation of its annotation are then taken as the training set to train the neural network mapping model M₃; after training is completed, α is mapped by the mapping model into the mapped vector representation α', which represents the semantic vector α of the annotated code segment in the natural language semantic space.
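The text does not specify the architecture of the mapping model M₃. As a stand-in, the sketch below trains a single linear layer (no hidden layers, no activation) by plain gradient descent on the squared error between the mapped vector and the annotation sentence vector. The function names, learning rate, epoch count, and toy vectors are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def apply_mapping(weights: List[Vector], alpha: Sequence[float]) -> Vector:
    """Compute alpha' = W * alpha for a weight matrix W (one row per output dim)."""
    return [sum(w * a for w, a in zip(row, alpha)) for row in weights]


def train_mapping(alphas: List[Vector], targets: List[Vector],
                  lr: float = 0.1, epochs: int = 500) -> List[Vector]:
    """Fit W so that W * alpha approximates the annotation sentence vector.

    Stand-in for M3: one linear layer trained by per-sample gradient
    descent on the squared error.
    """
    out_dim, in_dim = len(targets[0]), len(alphas[0])
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(epochs):
        for alpha, target in zip(alphas, targets):
            predicted = apply_mapping(weights, alpha)
            for i in range(out_dim):
                error = predicted[i] - target[i]
                for j in range(in_dim):
                    # Gradient of 0.5 * error**2 with respect to weights[i][j].
                    weights[i][j] -= lr * error * alpha[j]
    return weights


def squared_error(weights: List[Vector], alphas: List[Vector],
                  targets: List[Vector]) -> float:
    """Total squared error of the mapping over the training set."""
    return sum((p - t) ** 2
               for a, tgt in zip(alphas, targets)
               for p, t in zip(apply_mapping(weights, a), tgt))


# Toy training pairs: combined vectors alpha and annotation sentence vectors.
alphas = [[1.0, 0.0], [0.0, 1.0]]
targets = [[0.5, 0.2], [0.1, 0.9]]
M3 = train_mapping(alphas, targets)
alpha_prime = apply_mapping(M3, [1.0, 0.0])
```

The point of the mapping is that α' lives in the same space as the sentence vectors produced for queries, so the two can be compared directly.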

Step 4: extracting, from the code segment set S with method description information, the method name Name of each code segment, in the form package name & class name & method name, and forming key-value pairs of the form <Name, α'> with the mapped vector representation α' of the code segment, as the index file used in recommendation.
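One possible shape for such an index file is sketched below. JSON is an assumed storage format, since the text only says that the key-value pairs form an index file; the function names and the sample entry are likewise hypothetical.

```python
import json
from typing import Dict, List


def build_index(mapped_vectors: Dict[str, List[float]]) -> str:
    """Serialize <Name, alpha'> key-value pairs as a JSON object.

    JSON is an assumed format; the source does not name one.
    """
    return json.dumps(mapped_vectors, indent=2, sort_keys=True)


def load_index(text: str) -> Dict[str, List[float]]:
    """Read the index back into a {Name: alpha'} dictionary."""
    return json.loads(text)


# Hypothetical entry: the name and the short vector are made-up values.
index_text = build_index({
    "org.example.util&ListUtil&removeFirst": [0.10, 0.06, 0.07],
})
index = load_index(index_text)
```

Keying the index by name means a recommendation can be returned as a name alone, with the source code looked up only when needed.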

Step 5: for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from the code segment set S with method description information, the N best-matching code segments in ranked order for the query, wherein:

51) for the given natural language query, the query sentence vector β corresponding to the query is computed with the trained Encoder-Decoder natural language sentence vector generator model M₁;

52) the similarity between two vectors is expressed by the cosine of the angle θ between them; the similarity between the query sentence vector β computed from the given natural language query and the mapped vector representation α' of each code segment in the index file is calculated, the N code segments most similar to the vector β are recommended according to the index file, and the N code segments are ranked from high to low similarity.
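The ranking in item 52) can be sketched as a brute-force scan of the index. The function names, the in-memory dictionary, and the two-dimensional toy vectors are assumptions for illustration; the query vector β would in practice come from the trained model M₁, which is not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(index: Dict[str, List[float]], beta: Sequence[float],
              n: int) -> List[Tuple[str, float]]:
    """Return the n index entries most similar to beta, highest similarity first."""
    scored = [(name, cosine(vec, beta)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical index entries and query vector (made-up values).
index = {
    "pkg&A&removeFirst": [1.0, 0.0],
    "pkg&B&readFile": [0.0, 1.0],
    "pkg&C&removeLast": [1.0, 1.0],
}
results = recommend(index, beta=[1.0, 0.1], n=2)
```

Scanning every entry costs time proportional to the size of the index for each query; whether that is acceptable depends on how large S is.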

Example:

First, the Java projects acquired from the open source software platform GitHub are cut to obtain annotated code segments, which are written into files. Taking the project assertj-core-master as an example, the cutting results in … … of "main.

In the project assertj-core-master, 35 high-quality code segments with independent functions are obtained.

After the data set has been processed, the annotation set D₁' of the code segments is obtained; the annotation sentences in D₁' include "remove the first instance of a value if found in the list and replace it with the last item", "get file content", and so on. The method description information set D₁ is also obtained; the descriptions in D₁ include:

"Remove the first instance of a value if found in the list and replace it with the last item in the list. this shows a copy down of an update at the exception of the not previous list order", and the like, together with the code segment method body set D₂.

After all models have been trained, the corresponding sentence vector can be obtained for any natural language query. For each annotated code segment, the natural language sentence vector α₁ and the paragraph vector α₂ of the code segment body are weighted and added to compute the final α as the vector representation of the annotated code segment. For the code segment in the above example:

α = [0.0501139, 0.0799258, 0.0690878, ......]

After conversion by the mapping model:

α' = [0.1001695, 0.060278, 0.0700396, ......]

The input query y = (y₁, y₂, ..., y_t), specifically "remove the first instance of a list", is processed by the model to obtain the sentence vector of the specified dimension corresponding to y ("remove the first instance of a list": [0.0703125, 0.0869141, 0.0878906, ......]).

The index file of the N annotated code segments in the data set S is <N4S₁, α'₁>, <N4S₂, α'₂>, ......, <N4S_N, α'_N>, specifically <"Remove the first instance of a value if found in the list and replaces it with the last item in the list", M₁> ...... <"Input and output of a file", M_k> ...; the values cos θ(y, α'_i) are, for example, 0.0054, 0.062, 0.785, respectively, and are the smallest N in S, with cos θ(y, α'₁) < cos θ(y, α'₂) < ...... < cos θ(y, α'_N). The recommended result is then:

1:<main.java.org.assertj.core&Delta&fastUnorderedRemoveInt,[0.1001695,......]>

2:<……,……>

...

N:<……,……>

When actually recommended, the N ranked code segments are presented as indexes, that is, links to the code segments expressed in the form package name & class name & method name used in S. When a user wants to view a specific code segment, the user only needs to click the link to view the actual code segment in the source code. This design is chosen for user convenience and visual appeal.

While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A code segment recommendation method based on deep semantic mining is characterized by comprising the following steps:

step 4): for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from a code segment set S with method description information, the N best-matching code segments in ranked order for the query; wherein:

for the given natural language query, the query sentence vector β corresponding to the query is computed with the trained Encoder-Decoder natural language sentence vector generator model M₁;

the similarity between two vectors is expressed by the cosine of the angle θ between them; the similarity between the query sentence vector β computed from the given natural language query and the mapped vector representation α' of each code segment in the index file is calculated, the N code segments most similar to the vector β are recommended according to the index file, and the N code segments are ranked from high to low similarity.

2. The code segment recommendation method based on deep semantic mining as claimed in claim 1, wherein the step 1) specifically comprises: acquiring a specific project from the open source software platform, and cutting the source code files in the specific project by taking a method as a unit to obtain the code segment set S with method description information, wherein the name of each code segment has the form of package name, class name and method name.

3. The code segment recommendation method based on deep semantic mining as claimed in claim 2, wherein the step 1) further comprises: the specific project is a Java project or an Android project.

4. The code segment recommendation method based on deep semantic mining as claimed in claim 1, wherein the step 2) specifically comprises:

23) weighting and adding the natural language sentence vector α₁ and the paragraph vector α₂ of the code segment body to obtain a vector α, which serves as the vector that finally characterizes the entire annotated code segment; then taking the set of all vectors α and the natural language sentence vector representations of the annotation set D₁' as the training set to train a neural network mapping model M₃; and, after training is completed, mapping the vector α through the neural network mapping model M₃ into the mapped vector representation α' of the annotated code segment.