CN108491208A - Code comment classification method based on a neural network model - Google Patents

- Publication number: CN108491208A
- Application number: CN201810098481.2A
- Authority: CN (China)
- Prior art keywords: annotation, word, code, classification, neural network
- Prior art date: 2018-01-31
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
- G06F8/436—Semantic checking
- G06F8/437—Type checking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
An embodiment of the invention discloses a code comment classification method based on a neural network model. The method includes: building a corpus of comments and generating a word vector for each comment word; classifying the comments according to manually defined categories; preprocessing the comments and applying word embedding to obtain a 250-dimensional word vector for each word; and feeding the 250-dimensional word vectors of each word into a classification model to obtain classification results. In the embodiment of the invention, by considering both the contextual features and the semantic features of a comment, and by allocating attention according to word importance weights, the text is characterized better and classified more accurately. Compared with other implementations, the invention is suitable for classifying comments of various types.
Description
Technical field
The present invention relates to the fields of program comprehension and neural network technology, and in particular to a code comment classification method based on a neural network model.
Background art

In recent years, with the development of the software industry, the scale and complexity of software have kept increasing, and software life cycles have grown ever longer. The large volume of source code in a software system contains comments that record the programmers' implementation and help developers understand the code. Code comments play an important role in software maintenance and program comprehension, and studies have shown that high-quality code comments can significantly improve how efficiently development and maintenance personnel comprehend a program. Improving the quality of code comments therefore effectively improves the maintainability of software, and in turn its quality.

Current methods for analyzing software quality largely ignore comments, or at most compute a quantitative comment-to-code ratio over the source code. Such a measure is far from sufficient, because many comments are unrelated to the source code (version information, for example) and should be excluded when measuring software quality. Moreover, different comments serve different purposes: some give the rationale for an implementation, some are reminders that programmers add for themselves, and others are generated automatically by programming tools. Automatically classifying code comments is therefore of great help to software quality assessment.

Common automatic comment classification methods fall into two broad categories: classification based on the position of the comment in the code, and classification based on traditional machine learning. Position-based methods are rather limited: they can typically only divide comments into class comments, method comments, and the like, and they miss comments whose positional information is more unusual. Classification based on traditional machine learning can handle comments of different types and formats, but its accuracy depends heavily on the features selected for the classifier, and this feature selection often relies too much on subjective human judgment or on feature extraction from the training set. Existing traditional machine learning approaches to code comment classification also consider only the textual features of a comment, not its semantic information, which limits classification accuracy. In addition, traditional machine learning models such as random forests are usually trained on small training sets, are overly sensitive to noise in the training data, frequently overfit to some degree, and do not generalize well to new comments.
Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a code comment classification method based on a neural network model. In the embodiments of the invention, by considering both the contextual features and the semantic features of a comment, and by allocating attention according to word importance weights, the text is characterized better and classified more accurately. Compared with other implementations, the method is better suited to classifying comments of various types.

To solve the above technical problem, the present invention provides a code comment classification method based on a neural network model, the method including:

building a corpus of comments, and generating a word vector for each comment word;

classifying the comments according to manually defined categories;

preprocessing the comments and applying word embedding to obtain a 250-dimensional word vector for each word;

feeding the 250-dimensional word vectors of each word into a classification model for classification, and obtaining classification results.
Preferably, the step of building a corpus of comments and generating a word vector for each comment word includes:

extracting comments from open-source projects and tokenizing them;

treating each comment as a document, and building a comment corpus from these documents;

training on the corpus to generate a word vector for each comment word, with the training objective expressed as:

maximize (1/n) Σ_{i=1..n} Σ_{-k≤j≤k, j≠0} log p(w_{i+j} | w_i)

where w_{i+j} denotes a word in a context window of length 2k+1, n denotes the total length of the text, and log p(w_{i+j} | w_i) is the log-probability of predicting w_{i+j} given w_i.

Preferably, when training on the corpus, k is set to 2, the learning rate is initialized to 0.001, and the word vector dimensionality is set to 250. After the word vector model has been trained, a vector representation is obtained for each word.
Preferably, classifying the comments according to manually defined categories includes:

Descriptive comments: comments describing the function or behavior of the corresponding code segment;

Suggestive comments: comments offering suggestions to the user or reader;

Warning comments: comments explicitly warning the user or reader;

Exception comments: comments explaining the reason an exception is thrown;

Exploratory comments: comments from the development phase;

Code-containing comments: commented-out code, wrapped in a comment to hide a function or some work in progress;

Formatting comments: comments used to separate or format code;

Metadata comments: comments defining metadata;

Automatically generated comments: comments generated automatically by a programming tool;

Other comments: comments belonging to none of the above nine categories, mostly hard-to-interpret symbols.
Preferably, the step of preprocessing the comments and applying word embedding includes:

tokenizing each comment to obtain a tokenized comment;

filtering stop words from the tokenized comment to obtain a filtered tokenized comment;

applying word-form normalization (lemmatization) to the filtered tokenized comment to obtain converted words;

applying word embedding to the converted words to obtain a 250-dimensional word vector for each word.

Preferably, a training set, a validation set, and a test set are also built while the step of preprocessing the comments and applying word embedding is carried out.
Preferably, before the step of feeding the 250-dimensional word vectors of each word into the classification model for classification, the classification model also needs to be trained, including:

using a bidirectional long short-term memory (BiLSTM) network to generate a new memory from the input word and the past hidden state, the new memory containing the new word;

using the input gate to judge, from the input word and the past hidden state together, whether the input is worth retaining; if so, constraining the new memory accordingly, otherwise terminating;

using the forget gate to assess whether the past memory cell is useful for computing the current memory; if so, generating the new memory according to the result of the input gate, otherwise forgetting part of the past memory;

fusing the two results to produce the final memory;

separating the final memory from the hidden state through the output gate, and on this basis introducing an attention mechanism to finally obtain an optimal classification model.
In the embodiments of the present invention, by considering both the contextual features and the semantic features of a comment, and by allocating attention according to word importance weights, the text is characterized better and classified more accurately. Compared with other implementations, the method is better suited to classifying comments of various types.
Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a schematic flowchart of a code comment classification method based on a neural network model in an embodiment of the present invention.
Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic flowchart of a code comment classification method based on a neural network model according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S1: building a corpus of comments, and generating a word vector for each comment word;

S2: classifying the comments according to manually defined categories;

S3: preprocessing the comments and applying word embedding to obtain a 250-dimensional word vector for each word;

S4: feeding the 250-dimensional word vectors of each word into a classification model for classification, and obtaining classification results.
In a particular embodiment, we first extract a large number of comments from open-source version repositories to obtain a comment dataset.
Further, the step S1 of building a corpus of comments and generating a word vector for each comment word includes:

S11: extracting comments from open-source projects and tokenizing them;

S12: treating each comment as a document, and building a comment corpus from these documents;

S13: training on the corpus to generate a word vector for each comment word, with the training objective expressed as:

maximize (1/n) Σ_{i=1..n} Σ_{-k≤j≤k, j≠0} log p(w_{i+j} | w_i)

where w_{i+j} denotes a word in a context window of length 2k+1, n denotes the total length of the text, and log p(w_{i+j} | w_i) is the log-probability of predicting w_{i+j} given w_i.

When training on the corpus, we set k to 2, initialize the learning rate to 0.001, and set the word vector dimensionality to 250. After the word vector model has been trained, we obtain a vector representation for each word.
In S2, we manually define ten comment categories:

Descriptive comments: comments describing the function or behavior of the corresponding code segment. Such comments typically describe the purpose of the code, give details of the implementation (such as details of variable declarations), or state the rationale behind the implementation, explaining the basic principles behind certain choices, patterns, or options. These comments are generally written entirely in natural language.

Suggestive comments: comments offering suggestions to the user or reader. Such comments are usually written by the developer of the commented code to give feasible advice, often interspersed with code snippets or marks such as @usage and @return.

Warning comments: comments explicitly warning the user or reader. Such comments typically appear before optional functions or classes.

Exception comments: comments explaining the reason an exception is thrown. Such comments usually carry the mark @throw or @exception and explain the cause of the exception or suggest how to avoid it.

Exploratory comments: comments from the development phase. Such comments cover current or future development work, including error analysis and solutions, notes on what has been handled and fixed, and work still to be completed in the future, such as TODO comments.

Code-containing comments: commented-out code, wrapped in a comment to hide a function or some work in progress. In general, such comments represent functionality under test or code that has been temporarily deleted.

Formatting comments: comments used to separate or format code.

Metadata comments: comments defining metadata, such as code author information and version information.

Automatically generated comments: comments generated automatically by a programming tool, such as the "Auto-generated method stub" comments produced by Eclipse for Java.

Other comments: comments belonging to none of the above nine categories, mostly hard-to-interpret symbols.
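As a concrete illustration of the taxonomy, each category could be exercised on a hypothetical comment string like the following. The example strings are invented for illustration and do not come from the patent's dataset.

```python
# One hypothetical example comment per manually defined category.
CATEGORY_EXAMPLES = {
    "descriptive":     "// Returns the number of active sessions.",
    "suggestive":      "// @usage call init() before the first request",
    "warning":         "// Do not call this method from the UI thread!",
    "exception":       "// @throws IOException if the file cannot be read",
    "exploratory":     "// TODO: rewrite this loop after the API freeze",
    "code-containing": "// int legacySum = a + b;  (disabled for now)",
    "formatting":      "// ------------------------------------------",
    "metadata":        "// Author: J. Doe, version 2.1",
    "auto-generated":  "// Auto-generated method stub",
    "other":           "// ???",
}
assert len(CATEGORY_EXAMPLES) == 10  # ten categories, as defined in S2
```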
Further, S3 includes:

S31: tokenizing each comment to obtain a tokenized comment;

S32: filtering stop words from the tokenized comment to obtain a filtered tokenized comment;

S33: applying word-form normalization (lemmatization) to the filtered tokenized comment to obtain converted words;

S34: applying word embedding to the converted words to obtain a 250-dimensional word vector for each word.
Further, S3 also builds the training set, the validation set, and the test set.
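Steps S31 through S34, together with the dataset split, might be sketched as follows. The stop-word list, the crude suffix-stripping stand-in for real lemmatization, and the 8/1/1 split ratio are all illustrative assumptions (the patent fixes none of them), and S34 is indicated only in a comment since it depends on the trained word-vector model.

```python
import random
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}  # illustrative subset

def preprocess(comment: str) -> list[str]:
    # S31: tokenize the comment into lowercase word tokens.
    tokens = re.findall(r"[a-zA-Z]+", comment.lower())
    # S32: filter out stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # S33: crude word-form normalization (a stand-in for real lemmatization).
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens
    # S34 (word embedding) would look each remaining token up in the
    # trained 250-dimensional word-vector model.

# Build the training, validation, and test sets from the comment dataset.
comments = [f"// comment number {i}" for i in range(10)]
random.seed(0)
random.shuffle(comments)
train, val, test = comments[:8], comments[8:9], comments[9:]
```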
Specifically, before S4 is carried out, the classification model needs to be built and trained, including:

using a bidirectional long short-term memory (BiLSTM) network to generate a new memory from the input word and the past hidden state, the new memory containing the new word;

using the input gate to judge, from the input word and the past hidden state together, whether the input is worth retaining; if so, constraining the new memory accordingly, otherwise terminating;

using the forget gate to assess whether the past memory cell is useful for computing the current memory; if so, generating the new memory according to the result of the input gate, otherwise forgetting part of the past memory;

fusing the two results to produce the final memory;

separating the final memory from the hidden state through the output gate, and on this basis introducing an attention mechanism to finally obtain an optimal classification model.
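The gate-by-gate narrative above corresponds to the standard LSTM cell equations. The patent does not spell these out, so the formulation below is the commonly used one, supplied for reference:

```latex
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})
    \quad \text{(new memory from the input word and past hidden state)}
i_t = \sigma(W_i x_t + U_i h_{t-1})
    \quad \text{(input gate: is the input worth retaining?)}
f_t = \sigma(W_f x_t + U_f h_{t-1})
    \quad \text{(forget gate: is the past memory cell still useful?)}
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
    \quad \text{(final memory: fusion of the two results)}
o_t = \sigma(W_o x_t + U_o h_{t-1}), \qquad
h_t = o_t \circ \tanh(c_t)
    \quad \text{(output gate exposes the memory as the hidden state)}
```

Here σ is the logistic sigmoid and ∘ denotes elementwise multiplication; in the bidirectional network these equations run once left-to-right and once right-to-left, and the two hidden states are concatenated before attention is applied.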
The main parameters of the model are set as follows: the bidirectional LSTM has an input layer size of 24 and a hidden layer size of 48, and training uses the Adam optimizer with a learning rate of 0.001.
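As a concrete sketch, the described BiLSTM-with-attention classifier might look as follows in PyTorch. Only the ingredients named in the text are taken from it (a bidirectional LSTM, hidden size 48, word-level attention, ten output classes, and Adam with learning rate 0.001); the attention formulation, the reading of the input size 24 as a sequence length, and all names are assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    """BiLSTM over 250-d word vectors, word-level attention,
    then a linear layer over the ten comment categories."""

    def __init__(self, embed_dim=250, hidden=48, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # importance score per word
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: (batch, seq_len, 250)
        h, _ = self.lstm(x)                    # (batch, seq_len, 96)
        # Allocate attention by word importance weights.
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * h).sum(dim=1)  # weighted sum
        return self.out(context)               # (batch, 10) class logits

model = BiLSTMAttentionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # as in the text
logits = model(torch.randn(1, 24, 250))  # one comment of 24 word vectors
```

A real training loop would minimize cross-entropy over the labeled training set and select the best checkpoint on the validation set.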
In the embodiments of the present invention, by considering both the contextual features and the semantic features of a comment, and by allocating attention according to word importance weights, the text is characterized better and classified more accurately. Compared with other implementations, the method is better suited to classifying comments of various types.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The code comment classification method based on a neural network model provided by the embodiments of the invention has been described in detail above. Specific examples have been used herein to explain the principles and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those of ordinary skill in the art may, according to the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (4)

1. A code comment classification method based on a neural network model, characterized in that the method includes:
building a corpus of comments, and generating a word vector for each comment word;
classifying the comments according to manually defined categories;
preprocessing the comments and applying word embedding to obtain a 250-dimensional word vector for each word;
feeding the 250-dimensional word vectors of each word into a classification model for classification, and obtaining classification results.

2. The code comment classification method based on a neural network model of claim 1, characterized in that the step of building a corpus of comments and generating a word vector for each comment word includes:
extracting comments from open-source projects and tokenizing them;
treating each comment as a document, and building a comment corpus from these documents;
training on the corpus to generate a word vector for each comment word, with the training objective expressed as:
maximize (1/n) Σ_{i=1..n} Σ_{-k≤j≤k, j≠0} log p(w_{i+j} | w_i)
where w_{i+j} denotes a word in a context window of length 2k+1, n denotes the total length of the text, and log p(w_{i+j} | w_i) is the log-probability of predicting w_{i+j} given w_i.

3. The code comment classification method based on a neural network model of claim 1, characterized in that classifying the comments according to manually defined categories includes:
descriptive comments: comments describing the function or behavior of the corresponding code segment;
suggestive comments: comments offering suggestions to the user or reader;
warning comments: comments explicitly warning the user or reader;
exception comments: comments explaining the reason an exception is thrown;
exploratory comments: comments from the development phase;
code-containing comments: commented-out code, wrapped in a comment to hide a function or some work in progress;
formatting comments: comments used to separate or format code;
metadata comments: comments defining metadata;
automatically generated comments: comments generated automatically by a programming tool;
other comments: comments belonging to none of the above nine categories, mostly hard-to-interpret symbols.

4. The code comment classification method based on a neural network model of claim 1, characterized in that a training set, a validation set, and a test set are also built while the step of preprocessing the comments and applying word embedding is carried out.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810098481.2A CN108491208A (en) | 2018-01-31 | 2018-01-31 | A kind of code annotation sorting technique based on neural network model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108491208A true CN108491208A (en) | 2018-09-04 |
Family
ID=63344046
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- RJ01: Rejection of invention patent application after publication (application publication date: 20180904)