CN108717423B - Code segment recommendation method based on deep semantic mining - Google Patents

Publication number
CN108717423B
Authority
CN
China
Prior art keywords
code segment
natural language
vector
training
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810371788.5A
Other languages
Chinese (zh)
Other versions
CN108717423A (en)
Inventor
陶传奇
包盼盼
黄志球
周宇
王铁鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN201810371788.5A
Publication of CN108717423A
Application granted
Publication of CN108717423B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a code segment recommendation method based on deep semantic mining. The method draws on the strengths of deep learning in natural language processing and natural language semantic mining, combined with the characteristics of query-based code segment recommendation. Based on the input natural language query, the code segments themselves, and the comments they carry, the method deeply mines the natural language semantics and the specific functions of the code segments to generate sentence vectors and paragraph vectors, so that code segments and natural language queries with consistent semantic attributes are mapped to similar regions of the vector space, and the N best-matching code segments, ordered from highest to lowest similarity, are recommended for a given query. The method improves both the accuracy and the recall of recommendation, and tolerates errors in the input natural language query better.

Description

Code segment recommendation method based on deep semantic mining
Technical Field
The invention belongs to the technical field of query-based code recommendation, and particularly relates to a code segment recommendation method based on deep semantic mining.
Background
When writing code, developers often face unfamiliar programming tasks or need to implement specific functions. If they can find existing, similar code segments, either to learn how such code is used or to copy, modify, and refine it for reuse, they save considerable time and effort and avoid needless repeated work. How to recommend high-quality code segments that match developers' actual needs is therefore an important problem in software reuse.
In practice, a developer typically uses a search engine to find the required code segments. However, because a piece of software code works as an integrated whole, the keywords in a code segment cannot accurately describe its function, so the query results are often unsatisfactory. In addition, existing recommendation methods usually consider only the code segment itself and ignore its description information, even though that description states the function of the code segment in natural language in the simplest and most intuitive way. In recent years, with the wide application of deep learning, language processing has made breakthrough progress, so deep semantic and information mining can work well on both natural languages and programming languages. Combining language processing technology with code recommendation is therefore a new and effective approach to recommendation.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a code segment recommendation method based on deep semantic mining, which uses deep learning to support code segment recommendation for natural language queries. Based on the user's natural language query and on the comments and bodies of the code segments, the invention deeply mines the natural language semantics and the specific functions of the code segments, so that annotated code segments and natural language queries with consistent semantic attributes are mapped to similar regions of the vector space, and the best-matching code segments are recommended for a given query.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
The invention is a code segment recommendation method based on deep semantic mining, comprising the following steps:
step 1): constructing a large-scale code segment set S with method description information;
step 2): constructing a method description information set D1 and a method body set D2, constructing an annotation set D1', and using the constructed data sets to train an Encoder-Decoder natural language sentence vector generator model M1 and an Encoder-Decoder programming language paragraph vector generator model M2;
step 3): extracting the method name of each code segment in the code segment set S, and combining it with the mapped vector representation α' of the code segment to form a key-value pair <Name, α'>, which serves as the index file used in recommendation;
step 4): for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from the code segment set S with method description information, the N best-matching code segments in ranked order.
Preferably, step 1) specifically comprises: acquiring specific projects from an open source software platform, and cutting the source code files in those projects into methods to obtain the code segment set S with method description information, wherein each code segment is named in the form package name&class name&method name.
Preferably, in step 1), the specific projects are Java projects, Android projects, and other projects.
Preferably, the step 2) specifically includes:
21) using the method description information set D1 as the training set, training the Encoder-Decoder natural language sentence vector generator model M1 until it converges to a specified state, which completes the training of the natural language sentence vector generator; the first sentence of each code segment's annotation is extracted from the method description information set D1 as input to generate a natural language sentence vector α1, which forms part of the vector representation of the corresponding annotated code segment;
22) using the method body set D2 as the training set, training the Encoder-Decoder programming language paragraph vector generator model M2; training is complete when the model reaches a specified convergence state, and the segment vector α2 of each code segment body is generated at the same time;
23) weighting and adding the natural language sentence vector α1 and the segment vector α2 of the code segment body to obtain the vector α, which characterizes the whole annotated code segment; then using the set of all vectors α together with the natural language sentence vector representations of the annotation set D1' as the training set to train the neural network mapping model M3; after training is complete, the neural network mapping model M3 maps each vector α to the mapped vector representation α' of the annotated code segment.
Preferably, the step 4) specifically includes:
41) given a natural language input, using the trained Encoder-Decoder natural language sentence vector generator model M1 to compute a query sentence vector β of the specified dimension;
42) expressing the similarity between two vectors as the cosine of the angle θ between them (cos θ), computing the similarity between each mapped vector representation α' and the query sentence vector β, and recommending for the given natural language query the N code segments whose vectors are most similar to β, ranked from highest to lowest similarity.
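The similarity used in step 42) is the standard cosine similarity. Written out for a mapped code segment vector α' and a query sentence vector β, where the dimension d is a symbol introduced here only for illustration:

```latex
\cos\theta
  = \frac{\alpha' \cdot \beta}{\lVert \alpha' \rVert \, \lVert \beta \rVert}
  = \frac{\sum_{i=1}^{d} \alpha'_i \, \beta_i}
         {\sqrt{\sum_{i=1}^{d} (\alpha'_i)^2} \; \sqrt{\sum_{i=1}^{d} \beta_i^2}}
```

The value lies in [-1, 1]; a larger value corresponds to a smaller angle between the two vectors.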
The invention has the beneficial effects that:
The invention uses the strengths of deep learning in natural language processing and language semantic mining to solve the problem of recommending high-quality annotated code segments for a given natural language query. It has the following advantages:
(1) Processing natural language with deep learning mines the semantics of the natural language in depth rather than merely matching text keywords, so that sentence vectors of sentences with the same meaning lie closer together in the semantic space. The intended meaning of the query is thus captured, matching during recommendation is more precise, and recommendation accuracy improves.
(2) Paragraph vectorization of the code segment body with deep learning mines both the structural information of the code segment and its semantic information at the programming language level, rather than merely extracting feature words. The information contained in the code segment body is thus fully exploited, which improves the recommendation of code segments.
(3) Deep semantic matching yields the N code segments whose annotation semantics are most similar as the recommendation result, ranked from highest to lowest semantic similarity. Even if the input query is not expressed clearly enough or deviates slightly, a suitable result can still be found at a lower-ranked position, which improves recall and provides a degree of fault tolerance.
Drawings
FIG. 1 is a diagram of a framework model used in generating sentence vectors and paragraph vectors in the present invention.
FIG. 2 is a schematic diagram of an Encoder-Decoder model used in the present invention.
FIG. 3 is a schematic diagram of basic units in an Encoder-Decoder model used in the present invention.
FIG. 4 is a schematic diagram of the present invention.
Detailed Description
To help those skilled in the art understand the invention, it is further described below with reference to examples and the drawings, which are not intended to limit the invention.
The technical solution of the invention is described in detail below with reference to FIGS. 1-4, using Java code segment recommendation as an example:
Step 1: constructing a large-scale code segment set S with method description information, wherein:
11) Java projects are obtained from an open source software platform (such as GitHub), the Java files in the projects are cut into methods to obtain methods with method description information, and each method is written to a file named in the form package name&class name&method name.
12) The preliminary code segment set S with method description information is screened, and low-quality code segments (for example, those without method description information) or useless code segments (for example, test methods) are deleted to obtain a reduced, high-quality set S.
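The naming and screening in step 1 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the `CodeSegment` class, the `is_test_method` heuristic, and the `screen` function are hypothetical names introduced here, and the rule used to recognize test methods is an assumption (the text only says that test methods are removed).

```python
from dataclasses import dataclass


@dataclass
class CodeSegment:
    package: str
    class_name: str
    method: str
    description: str  # method description text; may be empty
    body: str         # method body source

    @property
    def name(self) -> str:
        # Name form from the text: package name & class name & method name.
        return f"{self.package}&{self.class_name}&{self.method}"


def is_test_method(seg: CodeSegment) -> bool:
    # Assumed heuristic: JUnit-style naming marks a test method.
    return seg.method.lower().startswith("test") or seg.class_name.endswith("Test")


def screen(segments):
    """Keep segments that have a description and are not test methods."""
    return [s for s in segments if s.description.strip() and not is_test_method(s)]
```

A segment with package `main.java.org.assertj.core`, class `Delta`, and method `fastUnorderedRemoveInt` gets the name `main.java.org.assertj.core&Delta&fastUnorderedRemoveInt`, the same form as the entry shown in the recommendation result of the example.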
Step 2: constructing the method description information set D1 for training natural language sentence vectors and the method body set D2 for training programming language paragraph vectors.
The description information of all methods is extracted to obtain the method description information set D1 used for training natural language sentence vectors; the first sentence of each method's description information is extracted to obtain the annotation set D1'; and the code segment bodies of all methods are extracted to obtain the method body set D2 used for training the segment vectors of code segments.
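The construction of D1, D1' and D2 can be sketched as below. The function names `first_sentence` and `build_sets` are hypothetical, and the sentence boundary rule (first `.`, `!` or `?` followed by whitespace or end of text) is an assumption; the text does not say how the first sentence is delimited.

```python
import re


def first_sentence(description: str) -> str:
    """Return the description up to its first sentence-ending mark."""
    text = " ".join(description.split())  # collapse newlines and repeated spaces
    match = re.search(r"[.!?](\s|$)", text)
    return text[: match.start() + 1] if match else text


def build_sets(segments):
    """segments: iterable of (description, body) pairs for the methods in S."""
    segments = list(segments)
    d1 = [desc for desc, _ in segments]                        # D1: full descriptions
    d1_prime = [first_sentence(desc) for desc, _ in segments]  # D1': first sentences
    d2 = [body for _, body in segments]                        # D2: method bodies
    return d1, d1_prime, d2
```

Requiring whitespace after the punctuation mark keeps dotted identifiers such as `java.util.List` from being treated as a sentence end.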
Step 3: constructing and training a natural language sentence vector generator and a programming language paragraph vector generator to obtain the vector representation α of each annotated code segment, and then mapping the vector α through a trained neural network mapping model to obtain the vector α', wherein:
31) for the natural language sentence vector generator, the method description information set D1 is used as the training set to train the Encoder-Decoder natural language sentence vector generator model M1 until it converges to a specified state, which completes the training of the natural language sentence vector generator; the first sentence of each code segment's annotation is extracted from the method description information set D1 as the input of M1 to generate a natural language sentence vector α1, which forms part of the vector representation of the corresponding annotated code segment;
32) for the programming language paragraph vector generator, the method body set D2 is used as input to train the Encoder-Decoder programming language paragraph vector generator model M2; training is complete when the model reaches a specified convergence state, and the segment vector α2 of each code segment body is generated during training;
33) the natural language sentence vector α1 and the segment vector α2 of the code segment body are weighted and added to obtain the vector α, which characterizes the whole annotated code segment. The vector α of each annotated code segment and the vector representation of its annotation then form the training set for the neural network mapping model M3. After training is complete, the mapping model maps α to the mapped vector representation α', which represents the semantic vector α of the annotated code segment in the natural language semantic space.
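The vector arithmetic in 33) can be illustrated with a short sketch. Two points are assumptions made here and are not stated in the text: the weights default to 0.5 each, and the mapping model M3 is stood in for by a single linear layer whose parameters are supplied by the caller. Training M3 is outside the scope of this sketch, and the function names are hypothetical.

```python
def weighted_sum(alpha1, alpha2, w1=0.5, w2=0.5):
    """alpha = w1 * alpha1 + w2 * alpha2 (the weights are assumed values)."""
    if len(alpha1) != len(alpha2):
        raise ValueError("alpha1 and alpha2 must have the same dimension")
    return [w1 * a + w2 * b for a, b in zip(alpha1, alpha2)]


def apply_linear_map(weights, bias, alpha):
    """Stand-in for a trained M3: alpha' = W . alpha + b, one row of W per output."""
    return [
        sum(w * x for w, x in zip(row, alpha)) + b
        for row, b in zip(weights, bias)
    ]


# Toy usage with made-up 2-dimensional vectors and an identity mapping.
alpha = weighted_sum([1.0, 2.0], [3.0, 4.0])
alpha_mapped = apply_linear_map([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], alpha)
```

In practice the sentence and paragraph vectors would have the same, much larger, dimension, and M3 could be any network trained to bring α close to the annotation's sentence vector.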
Step 4: extracting the method name of each code segment from the code segment set S with method description information, in the form package name&class name&method name, and combining it with the mapped vector representation α' of the code segment to form a key-value pair <Name, α'>, which serves as the index file used in recommendation.
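One possible shape for the index file is sketched below. The text only specifies <Name, α'> key-value pairs; storing them as a JSON object is an assumption made here, and the function names are hypothetical.

```python
import json


def build_index(names, mapped_vectors):
    """Pair each segment name with its mapped vector alpha'."""
    if len(names) != len(mapped_vectors):
        raise ValueError("need exactly one vector per name")
    return {name: list(vec) for name, vec in zip(names, mapped_vectors)}


def save_index(index, path):
    # Write the <Name, alpha'> pairs as a JSON object.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)


def load_index(path):
    # Read the pairs back as a dict of name -> list of floats.
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Keying the index by name means a recommendation can be returned as a name, with the source of the code segment looked up in S only when the user asks for it.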
Step 5: for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from the code segment set S with method description information, the N best-matching code segments in ranked order, wherein:
51) for a given natural language query, the trained Encoder-Decoder natural language sentence vector generator model M1 computes the corresponding query sentence vector β;
52) the similarity between two vectors is expressed as the cosine of the angle θ between them (cos θ); the similarity between the query sentence vector β and the mapped vector representation α' of each code segment in the index file is computed, and the N code segments most similar to β are recommended according to the index file, ranked from highest to lowest similarity.
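The ranking in 52) can be sketched in a few lines of Python. The function names and the dictionary-shaped index are assumptions carried over for illustration; only the cosine measure and the ordering from highest to lowest similarity come from the text.

```python
import heapq
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(beta, index, n):
    """Return the n (name, similarity) pairs most similar to beta, best first."""
    scored = ((cosine(beta, vec), name) for name, vec in index.items())
    return [(name, score) for score, name in heapq.nlargest(n, scored)]
```

`heapq.nlargest` returns its results in descending order, so the list is already ranked from highest to lowest similarity. This sketch scans the whole index for every query; for a large S an approximate nearest-neighbor structure may be preferable.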
Example:
First, the Java projects acquired from the open source software platform GitHub are cut to obtain annotated code segments, which are written to files. Taking the project assertj-core-master as an example, the cutting results in … … of "main.
In the project assertj-core-master, 35 high-quality code segments with independent functions are obtained.
After the data set has been processed, the annotation set D1' of the code segments is obtained. The explanatory sentences in D1' include "remove the first instance of a value if found in the list and replace it with the last item", "get file content", and so on. The descriptions in the method description information set D1 include:
"Remove the first instance of a value if found in the list and replace it with the last item in the list. this shows a copy down of an update at the exception of the not previous list order", and so on, together with the method body set D2 of the code segments.
After all models have been trained, the sentence vector of any natural language query can be obtained. For each annotated code segment, the natural language sentence vector α1 and the segment vector α2 of the code segment body are weighted and added to compute the final α as the vector representation of the annotated code segment. For the example code segment above:
α=[0.0501139,0.0799258,0.0690878,......]
After conversion by the mapping model:
α’=[0.1001695,0.060278,0.0700396,......]
The input query y = (y1, y2, ..., yt), specifically "remove the first instance of a list", is processed by the model to obtain the sentence vector of the specified dimension corresponding to y ("remove the first instance of a list": [0.0703125, 0.0869141, 0.0878906, ......]).
The index file of the N annotated code segments in the data set S is <N4S1i1>, <N4S2i2>, ......, <N4SNiN>, specifically <"Remove the first instance of a value if found in the list and replaces it with the last item in the list", M1> ...... <"Input and output of a file", Mk> ...... The values cos θ(y, αi') are, for example, 0.0054, 0.062 and 0.785, respectively, and are the smallest N in S, with cos θ(y, α'1) < cos θ(y, α'2) < ...... < cos θ(y, α'N). The recommended result is then:
1:<main.java.org.assertj.core&Delta&fastUnorderedRemoveInt,[0.1001695,......]>
2:<……,……>
...
N:<……,……>
The N ordered code segments are presented as indexes when actually recommended, that is, as links to the corresponding code segments in S, expressed in the form package name&class name&method name. When a user wants to inspect a specific code segment, clicking the link shows the actual source code. This design was chosen for ease of use and a clean presentation.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (4)

1. A code segment recommendation method based on deep semantic mining is characterized by comprising the following steps:
step 1): constructing a large-scale code segment set S with method description information;
step 2): constructing a method description information set D1 and a method body set D2, constructing an annotation set D1', and using the constructed data sets to train an Encoder-Decoder natural language sentence vector generator model M1 and an Encoder-Decoder programming language paragraph vector generator model M2;
step 3): extracting the method name of each code segment in the code segment set S, and combining it with the mapped vector representation α' of the code segment to form a key-value pair <Name, α'>, which serves as the index file used in recommendation;
step 4): for a given natural language query, obtaining the corresponding natural language sentence vector, and then recommending, from the code segment set S with method description information, the N best-matching code segments in ranked order; wherein:
for a given natural language query, a query sentence vector β corresponding to the natural language query is calculated using the trained Encoder-Decoder natural language sentence vector generator model M1;
the similarity between two vectors is expressed as the cosine of the angle θ between them (cos θ); the similarity between the query sentence vector β and the mapped vector representation α' of each code segment in the index file is calculated, and the N code segments most similar to the vector β are recommended according to the index file, ranked from highest to lowest similarity.
2. The code segment recommendation method based on deep semantic mining as claimed in claim 1, wherein step 1) specifically comprises: acquiring specific projects from an open source software platform, and cutting the source code files in the specific projects into methods to obtain the code segment set S with method description information, wherein each code segment is named in the form package name&class name&method name.
3. The code segment recommendation method based on deep semantic mining as claimed in claim 2, wherein the step 1) further comprises: the specific project is a Java project or an Android project.
4. The code segment recommendation method based on deep semantic mining as claimed in claim 1, wherein the step 2) specifically comprises:
21) using the method description information set D1 as the training set, training the Encoder-Decoder natural language sentence vector generator model M1 until it converges to a specified state, which completes the training of the natural language sentence vector generator; the first sentence of each code segment's annotation is extracted from the method description information set D1 as input to generate a natural language sentence vector α1, which forms part of the vector representation of the corresponding annotated code segment;
22) using the method body set D2 as the training set, training the Encoder-Decoder programming language paragraph vector generator model M2; model training is complete when a specified convergence state is reached, and the segment vector α2 of each code segment body is generated at the same time;
23) weighting and adding the natural language sentence vector α1 and the segment vector α2 of the code segment body to obtain the vector α, which characterizes the whole annotated code segment; then using the set of all vectors α together with the natural language sentence vector representations of the annotation set D1' as the training set to train the neural network mapping model M3; after training is complete, the neural network mapping model M3 maps each vector α to the mapped vector representation α' of the annotated code segment.
CN201810371788.5A 2018-04-24 2018-04-24 Code segment recommendation method based on deep semantic mining Active CN108717423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810371788.5A CN108717423B (en) 2018-04-24 2018-04-24 Code segment recommendation method based on deep semantic mining

Publications (2)

Publication Number Publication Date
CN108717423A (en) 2018-10-30
CN108717423B (en) 2020-07-07

Family

ID=63899075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810371788.5A Active CN108717423B (en) 2018-04-24 2018-04-24 Code segment recommendation method based on deep semantic mining

Country Status (1)

Country Link
CN (1) CN108717423B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670022B (en) * 2018-12-13 2023-09-29 南京航空航天大学 Java application program interface use mode recommendation method based on semantic similarity
CN110716749B (en) * 2019-09-03 2023-08-04 东南大学 Code searching method based on functional similarity matching
CN110806861B (en) * 2019-10-10 2021-10-08 南京航空航天大学 API recommendation method and terminal combining user feedback information
CN111061935B (en) * 2019-12-16 2022-04-12 北京理工大学 Science and technology writing recommendation method based on self-attention mechanism
CN111142850B (en) * 2019-12-23 2021-05-25 南京航空航天大学 Code segment recommendation method and device based on deep neural network
CN111191002B (en) * 2019-12-26 2023-05-23 武汉大学 Neural code searching method and device based on hierarchical embedding
CN111459491B (en) * 2020-03-17 2021-11-05 南京航空航天大学 Code recommendation method based on tree neural network
CN111522839B (en) * 2020-04-25 2023-09-01 华中科技大学 Deep learning-based natural language query method
CN111857660B (en) * 2020-07-06 2021-10-08 南京航空航天大学 Context-aware API recommendation method and terminal based on query statement
US11720346B2 (en) 2020-10-02 2023-08-08 International Business Machines Corporation Semantic code retrieval using graph matching
US11645054B2 (en) 2021-06-03 2023-05-09 International Business Machines Corporation Mapping natural language and code segments

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105190597A (en) * 2012-12-13 2015-12-23 微软技术许可有限责任公司 Social-based information recommendation system
US9557972B2 (en) * 2014-03-25 2017-01-31 Electronics And Telecommunications Research Institute System and method for code recommendation and share
CN106462399A (en) * 2014-06-30 2017-02-22 微软技术许可有限责任公司 Code recommendation
CN107506414A (en) * 2017-08-11 2017-12-22 武汉大学 A kind of code based on shot and long term memory network recommends method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于搜索的代码推荐技术研究;吕飞;《万方学位论文数据库 硕士论文》;20160603;全文 *

Also Published As

Publication number Publication date
CN108717423A (en) 2018-10-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant