CN108519890A - A robust code summary generation method based on a self-attention mechanism - Google Patents

A robust code summary generation method based on a self-attention mechanism

Info

Publication number
CN108519890A
CN108519890A CN201810306806.1A CN201810306806A
Authority
CN
China
Prior art keywords
code
layer
text
output
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810306806.1A
Other languages
Chinese (zh)
Other versions
CN108519890B (en)
Inventor
彭敏
胡刚
袁梦霆
王清
曲金帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN201810306806.1A
Publication of CN108519890A
Application granted
Publication of CN108519890B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/72: Code refactoring

Abstract

The invention discloses a robust code summary generation method based on a self-attention mechanism. First, high-quality code-description corpus pairs (the question's description text, the reply's code text) are extracted from the programming community; next, redundant code-description corpus pairs are filtered out; then the question description text corresponding to the code is converted into declarative sentences; finally, code summaries are generated by a sequence model based on the self-attention mechanism. The invention effectively removes redundant and noisy content, the accuracy of the generated summaries improves under both automatic and human evaluation, and the evaluation results are better than existing baseline methods.

Description

A robust code summary generation method based on a self-attention mechanism
Technical field
The invention belongs to the intersection of software engineering and natural language processing, and specifically relates to a robust code summary generation method that combines the advantages of wavelet time-frequency transformation with EM algorithms and of a sequence model based on the self-attention mechanism. It is particularly suitable for code description data in programming communities that carries noisy information.
Background technology
In the evolution of large-scale projects, code annotation is a key job in software maintenance: high-quality comments provide developers with efficient reference information for understanding and updating code. Although numerous program comprehension tools and methods exist, understanding code still takes longer than writing it. Developers tend to skim code entities such as method signatures to reduce workload, but this ignores key information. Code summarization offers a compromise: it can generate brief yet sufficient sentence descriptions that help developers understand the intent of code faster and more accurately.
Existing methods fall into two broad classes: extractive summarization and abstractive (generative) summarization. The former focuses on the lexical rules and syntactic features of code; its output is keywords extracted according to code rules and structure, and its effect is limited. For example, Sridhara proposed extracting key terms through key identifier marking (entity nouns for methods, verbs for procedures) to generate comments for Java method bodies, using TF or TF-IDF to compute weights and selecting the higher-weighted terms to form a summary, usually 5-10 words. McBurney introduced term position weights (method signature, method call, control flow) with three weight distribution schemes, EyeA, EyeB and EyeC, to improve the effect of keyword extraction. Moreno proposed a rule matching method that uses phrase generation matching templates based on 13 manually defined method classes to generate a summary for each method class. Likewise for SQL, Koutrika proposed an interactive system that converts SQL queries into natural language text using NL templates and the database schema. The latter class uses end-to-end deep learning networks, whose accuracy can be raised by training on parallel code-description corpora; their superiority has been confirmed by numerous results. For example, Allamanis applied a convolutional neural network with an attention mechanism to the code summary generation task, producing abstractive summaries for method bodies that fuse code context information. Iyer first proposed a sequence-to-sequence model using recurrent neural networks combined with an attention mechanism, trained on code-description corpora extracted from StackOverflow to generate brief code summaries.
The main problems these methods currently face are:
1) they generate summaries only for code of a specific structure (a method body, a group of method calls), and the semantic information of the code is not reflected;
2) the generative approaches rely on high-quality corpora, are hard to extend to other programming communities, and are easily affected by noise;
3) abstractive code summaries are often phrased as questions, which does not match the style of code comments;
4) even when traditional sequence models incorporate an attention mechanism, they remain fragile when handling code whose length distribution is sparse.
These problems significantly limit the effect of code summary generation, and they are the key issues the invention addresses.
Summary of the invention
The invention aims to improve the accuracy and naturalness of code summary generation. It is robust when handling noisy code-description corpus pairs (question description text, reply code snippet) extracted from programming Q&A communities, overcoming the noise introduced by directly acquiring parallel corpora to train a summary generation model. It also introduces self-attention into the sequence model so that seq2seq sequence learning can capture long-range dependencies with less encoding structure, addressing the problem of poor summary generation caused by such dependencies.
High-quality source code projects often carry detailed code comments or parameter usage guides in API documents, which provide efficient references for programmers to understand and maintain code. The IT Q&A community StackOverflow contains a large number of program-related question-and-answer posts, from which description texts and code snippets can be mined and extracted as code-summary matching pairs. However, programming communities contain a large amount of noise and redundancy, which must be filtered and screened to obtain high-quality information. To this end, the invention first proposes the DB-WTFF feature fusion framework: the attribute features of the question's description segment and the reply's code snippet are each treated as signals and subjected to wavelet time-frequency transformation to highlight fine differences in their feature values. EM algorithms estimate the weights of each description feature and each code feature separately; the fused description score and code score then undergo a second EM estimation, and further fusion yields the sample set with the Top-K combined scores. Next, a deep learning filtering algorithm is proposed: the T-SNNC algorithm, designed on the basis of a Siamese Neural Network (SNN), can classify description short texts with a small number of samples and filter out sample pairs containing redundant description text; syntax tree parsing with NLP techniques then converts the description texts into declarative sentences. Finally, the purified corpus is used to train the CODE-OAN sequence model based on the self-attention mechanism to generate summaries.
The technical solution adopted by the invention is a robust code summary generation method based on a self-attention mechanism, characterized by comprising the following steps:
Step 1: extraction of high-quality code-description corpus pairs (the question's description text, the reply's code snippet) from the programming community;
A high-quality corpus pair extraction framework, the DB-WTFF (Double Wavelet Time-Frequency Transform Feature Fusion) feature fusion framework, computes a comprehensive score for the features of the noisy corpus through wavelet time-frequency transformation (Wavelet Transform, WT) in the signal domain together with the Expectation Maximization (EM) algorithm, and extracts the Top-K corpus pairs with the highest scores.
Step 2: filtering out redundancy from the code-description corpus pairs;
The one-shot Siamese Neural Network (SNN) deep learning framework is introduced to build a de-redundancy description text filtering algorithm, T-SNNC (Title-Siamese Neural Network Classification algorithm for document categorization). The T-SNNC algorithm removes, from the high-quality corpus extracted in step 1, those corpus pairs whose description text is a redundant description.
Step 3: converting the question-form description statements corresponding to the code into declarative sentences;
Interrogative sentences in the output of step 2 are identified through a manually collected list of interrogative character strings, and the syntax tree structure of each sentence is parsed with the Stanford CoreNLP toolkit. Two Stanford Tregex subtree matching rules (VP << /VB./ and VP << /VBG./) extract the verb phrases in the interrogative sentences, and the irrelevant program type description information at the tail of the text is removed, yielding the summary description corresponding to the code.
Step 4: code summary generation based on a self-attention sequence model;
The corpus purified by step 3 forms a pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; then (c_k, n_k), k = 1, ..., K, is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T. A Seq2Seq sequence framework, the CODE-OAN (CODE Only Attention Need) summary generation model algorithm, learns the feature mapping of (c_k, n_k). Given a new code snippet c, the network generates the summary description n* corresponding to c.
The invention first extracts high-quality code-description corpus pairs through wavelet time-frequency transformation and EM algorithms (DB-WTFF), then builds a Siamese neural network classification model (T-SNNC) to filter out redundant descriptions unrelated to the code, and finally trains the sequence model based on the self-attention mechanism (CODE-OAN) on the paired training corpus to generate code summaries. The results show that the method effectively filters redundancy and content noise and can efficiently generate code summaries close to the style of comments; performance improves on both human evaluation (conciseness, informativeness, naturalness) and automatic evaluation (ROUGE-L, BLEU-2, METEOR), outperforming existing baseline methods.
Description of the drawings
Figure 1: schematic flow chart of the implementation of the embodiment of the invention;
Figure 2: schematic diagram of the feature fusion framework with double wavelet time-frequency transformation (DB-WTFF) of the embodiment of the invention;
Figure 3: schematic diagram of the text classification model based on a Siamese neural network (T-SNNC) of the embodiment of the invention;
Figure 4: schematic diagram of the bidirectional text semantic (DS-Bi-LSTM) encoding framework of the embodiment of the invention;
Figure 5: schematic diagram of the CODE-OAN model framework based on the self-attention mechanism of the embodiment of the invention.
Detailed description of the embodiments
To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the implementation examples described here are intended only to illustrate and explain the invention, not to limit it.
The implementation flow of the invention is shown in Figure 1 and mainly comprises: (1) high-quality information extraction: according to the different behavioral attributes of posts extracted from the programming community, a feature matrix about the program Q&A posts is built, and wavelet decomposition transforms the features, viewed as discrete signals of the same dimension, into the frequency domain for processing; the information score is then computed from feature contributions. (2) Low-noise information purification: some noise still remains in the corpus pairs of high-quality posts, and matches with redundant descriptions or defective code must be filtered out; the T-SNNC deep learning algorithm effectively purifies the data stream. (3) Automatic code summarization: the purified high-quality corpus pairs (code, summary) are fed into a sequence-to-sequence neural network for training, learning a code summary generation model. Inputting new code into the model yields the code's comment information.
The invention specifically comprises the following steps:
Step 1: extraction of high-quality code-description corpus pairs (the question's description text, the reply's code text) from the community;
First, noisy corpus pairs (the question's description text, the reply's code text) are extracted from the programming community, and a variety of social feature values of the code and its description text (number of replies, number of upvotes, number of views, etc.) and the relationships among them are statistically analyzed to build a feature matrix F of dimension L x (M+N), where each row vector holds the feature values of one corpus pair and each column vector holds the distribution of one feature; the feature dimension of the description part is M and the feature dimension of the code part is N;
Second, a feature fusion framework with double wavelet time-frequency transformation is built, as shown in Figure 2. The framework first normalizes the scores of the feature matrix F from step 1 over its M+N dimensions, where the feature dimension of the description is M and the feature dimension of the code is N. The M description features and the N code features are then each viewed as continuously varying L-dimensional signals and converted into the time/frequency space, so the description matrix F_T yields M signals converted into a series of wavelet trees {TTree_1, ..., TTree_M}, and the code matrix F_C yields N signals converted into a series of wavelet trees {CTree_1, ..., CTree_N}. Finally, the coefficient vectors of identical leaf nodes in the two classes of wavelet trees are spliced by rows into coefficient matrices A_i (i = 1, ..., P, with P <= log2 M + 1) and B_i (i = 1, ..., Q, with Q <= log2 N + 1), whose dimensions are determined by the corresponding leaf nodes.
The expectation maximization (EM) algorithm is used to estimate the linear fusion weights of the coefficient matrices A_i and B_i from step 1.2 separately and to reconstruct them into new signals S_T and S_C.
The EM algorithm is then applied a second time to estimate the contribution of the two new signals S_T and S_C to the signal fusion, and the fused signal is inverse-transformed to obtain a time-domain signal S, which is the comprehensive score of all corpus pairs over the M+N features.
Finally, the Top-K data pairs (the question's description text, the reply's code text) with the highest scores are extracted as the high-quality code-description corpus pairs.
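By way of illustration only, the wavelet stage of DB-WTFF might be sketched as follows in Python with the PyWavelets toolkit named later in the embodiment; the matrix sizes, the random feature data, the min-max normalization and the uniform fusion weights are placeholders (the actual weights come from the EM estimation described above):

import numpy as np
import pywt

# Hypothetical feature matrix F: L corpus pairs x (M + N) features
# (M description features first, then N code features).
L_PAIRS, M, N = 1024, 4, 6
rng = np.random.default_rng(0)
F = rng.random((L_PAIRS, M + N))

# Normalize the score of each feature column over the M + N dimensions.
F = (F - F.min(axis=0)) / (F.max(axis=0) - F.min(axis=0) + 1e-12)

def wavelet_trees(columns):
    # Each feature column is an L-dimensional signal decomposed into a
    # wavelet tree (a list of coefficient arrays, one per leaf node).
    return [pywt.wavedec(col, "db7", level=5) for col in columns.T]

t_trees = wavelet_trees(F[:, :M])   # TTree_1 .. TTree_M (description)
c_trees = wavelet_trees(F[:, M:])   # CTree_1 .. CTree_N (code)

# Splice the coefficient vectors of identical leaf nodes across the trees
# by rows into the coefficient matrices A_i (description) and B_i (code).
A = [np.vstack([t[i] for t in t_trees]) for i in range(len(t_trees[0]))]
B = [np.vstack([t[i] for t in c_trees]) for i in range(len(c_trees[0]))]

# The EM-estimated linear fusion weights would combine the rows of each
# A_i / B_i; uniform weights stand in here. Reconstruct S_T and S_C.
S_T = pywt.waverec([a.mean(axis=0) for a in A], "db7")[:L_PAIRS]
S_C = pywt.waverec([b.mean(axis=0) for b in B], "db7")[:L_PAIRS]

# Second-stage fusion (placeholder 50/50 weights) yields the comprehensive
# score S; the Top-K pairs by S are kept as the high-quality corpus.
S = 0.5 * S_T + 0.5 * S_C
top_k = np.argsort(-S)[:100]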
Step 2: filtering out redundancy from the code-description corpus pairs;
After extracting the high-quality code and its descriptions, some noisy information still exists, namely redundant descriptions unrelated to the code, which must be removed from the corpus. First, a manually labeled corpus of description texts is collected (clean positive samples: 566 items; unclean negative samples: 232 items);
First, based on the one-shot Siamese neural network (SNN) deep learning framework, the invention proposes a de-redundancy description text classification network (T-SNNC), whose flow is shown in Figure 3. In the T-SNNC framework, two identical bidirectional LSTM text semantic encoders with shared weights encode the description texts x_1 and x_2 respectively, reducing the embedded representations of the description texts to vectors s_h and s_l. The absolute difference of the two vectors is the input of a linear classifier, and the sigmoid activation function produces a probability distribution vector p(x_1, x_2) over the two classes as the predicted value:
p(x_1, x_2) = σ( Σ_j α_j |s_h(j) - s_l(j)| )    (1)
where α_j are the learned parameters and σ denotes the activation function.
Next, the network is trained with the binary cross-entropy loss of formula (2); an L2 weight loss term is added to the loss function so that the network learns smaller, smoother weights, improving the generalization ability of the model:
L(x_1, x_2) = -t log p(x_1, x_2) - (1 - t) log(1 - p(x_1, x_2)) + λ ||w||_2    (2)
where L is the loss function, ||w||_2 is the weight loss term, t = 1 when the texts x_1 and x_2 belong to the same category, and t = 0 when they belong to different categories.
Then, new paired corpora are built (positive/positive and negative/negative pairs are labeled t = 1, positive/negative pairs are labeled t = 0), and these labels are used in training to fit the parameters of the T-SNNC network.
Finally, during prediction, a pairing (text description under test, comparison sample of a known class) is fed into the network; if the predicted probability of label 1 exceeds that of label 0, the label of the description under test is consistent with the comparison sample's; if the probability of label 0 exceeds that of label 1, the label of the description under test is the opposite of the comparison sample's.
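For illustration, a minimal PyTorch sketch of the T-SNNC pairing classifier and training step described above (the embodiment itself builds the network with Tflearn); the encoder is passed in as a module, and the regularization coefficient, the optimizer interface and the convention t = 1 for same-class pairs are illustrative assumptions:

import torch
import torch.nn as nn

class TSNNC(nn.Module):
    # Siamese classifier: one weight-shared encoder applied to both texts,
    # a linear layer over the absolute difference of the two semantic
    # vectors, and a sigmoid output, per formula (1).
    def __init__(self, encoder: nn.Module, dim: int = 32):
        super().__init__()
        self.encoder = encoder           # shared bidirectional LSTM encoder
        self.alpha = nn.Linear(dim, 1)   # the learned weights alpha_j

    def forward(self, x1, x2):
        s_h, s_l = self.encoder(x1), self.encoder(x2)
        return torch.sigmoid(self.alpha(torch.abs(s_h - s_l))).squeeze(-1)

# One training step: binary cross-entropy plus an L2 weight loss term,
# per formula (2); t = 1 for same-class pairs, t = 0 for mixed pairs.
def train_step(model, opt, x1, x2, t, l2=1e-4):
    opt.zero_grad()
    p = model(x1, x2)
    loss = nn.functional.binary_cross_entropy(p, t)
    loss = loss + l2 * sum((w ** 2).sum() for w in model.parameters())
    loss.backward()
    opt.step()
    return loss.item()

# Prediction: pair the text under test with a comparison sample of known
# class; p > 0.5 means "same class", otherwise the opposite class.
def predict_label(model, x_test, x_ref, ref_label):
    same = model(x_test, x_ref).item() > 0.5
    return ref_label if same else 1 - ref_label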
The DS-Bi-LSTM encoding framework involved in the invention is shown in Figure 4. The framework is divided into 6 layers, taking the word embeddings of all words in a document as input and producing the overall semantic embedding of the text as output. The 6 levels are described in detail below:
The word embedding representation layer is the input layer; the sequence of a description text is represented as T = (v_1, ..., v_K), where v_i is the id of the corresponding word. The size of the default dictionary is D, and the dimension of the word embeddings is d. The word embedding vector e of a word v is obtained by table lookup, as shown in formula (3), where embedding denotes the lookup operation:
e = embedding(v, D, d)    (3)
Bidirectional LSTM hidden layers: these comprise a forward and a backward LSTM hidden layer. At each moment, each input word embedding is simultaneously connected to both the forward and the backward LSTM hidden layer units, and those two hidden layer units are connected to the same output. Let the input word embedding at the current moment t be e_t, the output of the forward LSTM hidden layer unit be hf_t, and the output of the backward LSTM hidden layer unit be hb_t; then the outputs of the forward and backward hidden layer units at the current moment are:
hf_t = H(e_t, hf_{t-1}, c_{t-1}, b_{t-1}),  hb_t = H(e_t, hb_{t+1}, c_{t+1}, b_{t+1})    (4)
where H(·) denotes the LSTM hidden layer operation, c_{t-1} denotes the state value of the Cell unit at the previous moment, and b_{t-1} denotes all the biases at the previous moment.
Bidirectional LSTM output layer: each output unit is simultaneously connected to the forward and backward LSTM hidden layer units of the same moment, i.e.:
g_t = σ( Wf hf_t + Wb hb_t + b_g )    (5)
where Wf and Wb are the connection weights between the forward/backward hidden layers and the bidirectional LSTM output layer respectively, b_g is the bias, and σ is the activation function. The output of this layer is a sequence of vectors, and the dimension of each vector is consistent with the input vector.
Average pooling layer: the pooling operation processes the original feature values and constructs new features, thereby reducing the dimension of the original effective features, strengthening them and filtering noise. Common pooling operations are max pooling and average pooling. Since the macroscopic semantics of a text is closely related to every word in the text, the invention uses average pooling, which averages all neuron values within a range and takes the local information of every position into account, avoiding loss of information. The average pooling operation is:
pool(g) = (1/L) Σ_{t=1}^{L} g_t    (6)
where L is the length of the input word sequence and g_t is the vector output by the previous layer.
Fully connected layer: a matrix multiplication equivalent to a feature space transformation; it converts the dimension of the previous layer's vector while preserving useful information, where W is the fully connected weight and b is the bias; the dimension of this layer's output vector is enlarged:
q = W pool(g) + b    (7)
Encoding output layer: the output of the previous layer is passed through an activation function to obtain the semantic encoding vector of the final entire document:
s = σ(q)    (8)
where σ denotes the activation function and q is the vector output by the previous layer.
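A compact PyTorch sketch of the six-layer encoder just described follows; the sizes match the settings given later in the embodiment (word embedding 128, 128 hidden units per direction, 32-dimensional output, sequence length 33), but the module is an illustrative reconstruction rather than the Tflearn implementation. It can serve as the encoder passed to the T-SNNC sketch above:

import torch
import torch.nn as nn

class DSBiLSTMEncoder(nn.Module):
    # Word embedding -> bidirectional LSTM -> average pooling ->
    # fully connected layer -> activation, one semantic vector per text.
    def __init__(self, vocab=10000, d_emb=128, d_hid=128, d_out=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)              # formula (3)
        self.bilstm = nn.LSTM(d_emb, d_hid, batch_first=True,
                              bidirectional=True)          # formulas (4)-(5)
        self.fc = nn.Linear(2 * d_hid, d_out)              # formula (7)

    def forward(self, token_ids):                          # (batch, L)
        g, _ = self.bilstm(self.emb(token_ids))            # (batch, L, 2*d_hid)
        pooled = g.mean(dim=1)                             # average pooling (6)
        return torch.sigmoid(self.fc(pooled))              # s = sigma(q), (8)

# Usage: encode a batch of four padded token-id sequences of length 33.
enc = DSBiLSTMEncoder()
s = enc(torch.randint(0, 10000, (4, 33)))                  # -> shape (4, 32)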
Step 3: sentence style conversion of the code description text;
Most description texts are interrogative sentences or sentences containing personal pronouns, and the end of the text often carries supplementary program description information; such text is not suitable as a code summary. A summary should be a declarative descriptive sentence for the code, so a statement conversion operation is necessary, together with removal of redundancy unrelated to the program type. This sentence conversion work is important: it brings the description information closer to the style of comments and at the same time determines the style of the subsequently generated summaries. First, all words in the description texts are lowercased, and interrogative sentences are identified through the manually collected list of interrogative character strings shown in Table 1. Then the declarative constituents of each sentence are extracted from the syntax tree parsed by the Stanford CoreNLP toolkit. In the parse tree, tags such as (CC: conjunction, CD: cardinal number, JJ: adjective, IN: preposition) identify parts of speech, and tags such as (NP: noun phrase, VP: verb phrase, PP: prepositional phrase, ADJP: adjective phrase) indicate syntactic structure. Then two Stanford Tregex subtree matching rules (VP << /VB./ and VP << /VBG./) extract the verb phrases in the interrogative sentences. These two rules remove constituents such as personal pronouns (PRP, e.g. "I", "you", "he") and interrogative pronouns (e.g. "how", "what") from the syntax tree, finally retaining the verb phrase constituent, that is, the VP. Then, according to the summarized list of key character strings of three kinds of program type descriptions (shown in Table 2), these redundant program type strings at the tail of the description text are directly replaced with a space character, and the result serves as the summary description corpus corresponding to the code; a sketch of this pipeline follows Table 2.
Table 1: list of character strings identifying interrogative sentences
Table 2: list of key character strings of program type descriptions
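A rough Python sketch of this conversion step, assuming a running Stanford CoreNLP server reached through the stanza client and its tregex call; the interrogative markers and program type strings below are small hypothetical stand-ins for Tables 1 and 2, and the handling of the returned match trees is an assumption about the client's output format:

import re
from stanza.server import CoreNLPClient

# Hypothetical stand-ins for Table 1 and Table 2 (the real lists are larger).
INTERROGATIVE_MARKERS = ("how ", "what ", "why ", "is there", "can i")
PROGRAM_TYPE_STRINGS = ("in c#", "in sql", "using linq")

RULES = ["VP << /VB./", "VP << /VBG./"]      # the two Tregex rules

def tree_leaves(tree: str) -> str:
    # Recover the surface words from a bracketed parse tree string.
    return " ".join(re.findall(r"\(\S+ ([^()\s]+)\)", tree))

def to_declarative(description: str, client: CoreNLPClient) -> str:
    text = description.lower()
    if text.startswith(INTERROGATIVE_MARKERS):
        for rule in RULES:                   # extract the verb phrase (VP)
            matches = client.tregex(text, rule)
            trees = [m["match"] for s in matches["sentences"]
                     for m in s.values()]
            if trees:
                text = tree_leaves(trees[0])
                break
    for marker in PROGRAM_TYPE_STRINGS:      # strip tail program-type strings
        text = text.replace(marker, " ")
    return " ".join(text.split())

# Usage (assumes a local CoreNLP installation):
# with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse"]) as c:
#     print(to_declarative("How do I sort a list in C#?", c))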
Step 4: code summary generation with self-attention;
First, the problem setting must be defined: the conversion of program language text into natural language text. The (code snippet, summary description) matches produced by the high-quality extraction and redundancy filtering above form the pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; then each (c_k, n_k), k = 1, ..., K, is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T. The task of code summary generation is then: given a new code snippet c, learn the feature mapping of (c_k, n_k) with a neural network so as to produce the summary description n* of c, which is equivalent to optimizing the scoring function shown in formula (9):
n* = argmax_n P(n | c)    (9)
In the model construction stage, the invention builds on the self-attention mechanism proposed by Vaswani to realize a seq2seq model suitable for code-to-text, CODE-OAN (CODE-Only-Attention-Need). The model has none of the convolutional or recurrent structure of the Encoder-Decoder in traditional code summarization sequence models; it retains only the attention structure and can work in parallel to accelerate training. The method uses a specific attention mechanism to train the distribution of the description text n given a code snippet c. Specifically, in the generation phase, each word of the summary is generated in sequence from the source code snippet through this self-attention pattern.
As shown in Figure 5, the CODE-OAN model is divided into an encoder and a decoder. At the Encoder end, the input embedding is added to the positional embedding and serves as the input of a stack of N identical layers. Each layer consists of a Multi-Head Attention part and a FeedForward (feed-forward) part, each directly followed by an Add (residual connection) & Norm (normalization) layer. At the Decoder end, the decoder likewise consists of N identical layers, each being the encoder layer with a Masked Multi-Head Attention + Add & Norm sublayer inserted. The sum of the output embedding and the output's positional embedding serves as the input of the Decoder; it passes through the Masked Multi-Head Attention + Add & Norm sublayer, whose output serves as the query (Q) input of the next Multi-Head Attention + Add & Norm sublayer, while the Key (K) and Value (V) inputs come from the Encoder (the output of the i-th encoder layer corresponds to the input of the i-th decoder layer, where i ∈ [1, N]). The output then goes through an Add & Norm operation and is transformed by a linear + softmax layer to obtain the probability of the desired output.
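As a rough illustration of such an attention-only seq2seq model, the following compact PyTorch sketch is built on torch.nn.Transformer; it follows the settings given later in the embodiment (N = 4 stacked layers, dimension 512), but the vocabulary sizes, head count, learned positional embedding and greedy decoding are illustrative assumptions, not the patented CODE-OAN implementation:

import math
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    # Attention-only seq2seq: token plus positional embeddings feed a
    # Transformer encoder-decoder; a linear layer (softmax over its
    # logits) produces the output distribution.
    def __init__(self, src_vocab=30000, tgt_vocab=20000, d=512, n_layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.pos = nn.Embedding(512, d)        # learned positional embedding
        self.core = nn.Transformer(d_model=d, nhead=8,
                                   num_encoder_layers=n_layers,
                                   num_decoder_layers=n_layers,
                                   batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)     # linear before softmax

    def embed(self, emb, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return emb(ids) * math.sqrt(emb.embedding_dim) + self.pos(pos)

    def forward(self, code_ids, summary_ids):
        mask = nn.Transformer.generate_square_subsequent_mask(
            summary_ids.size(1))               # masked decoder self-attention
        h = self.core(self.embed(self.src_emb, code_ids),
                      self.embed(self.tgt_emb, summary_ids), tgt_mask=mask)
        return self.out(h)                     # logits over summary tokens

# Greedy decoding sketch: generate the summary n* word by word from code c.
def greedy_decode(model, code_ids, bos=1, eos=2, max_len=20):
    out = torch.full((code_ids.size(0), 1), bos, dtype=torch.long)
    for _ in range(max_len):
        nxt = model(code_ids, out)[:, -1].argmax(-1, keepdim=True)
        out = torch.cat([out, nxt], dim=1)
        if (nxt == eos).all():
            break
    return out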
Below, using data from a real scenario, the programming community StackOverflow, the method of the invention is compared experimentally with baseline algorithms to verify its efficiency and accuracy. The experimental data comes from the StackExchange main site (https://archive.org/download/stackexchange/stackoverflow.com): the XML files, about 50 GB after decompression, cover different types of program data. Through attribute tags (c#, sql/database/oracle), 100,372 items of C# language data and 398,480 items of SQL language data were extracted; each item contains a code text (PL) and its corresponding description text (NL).
The specific implementation of the invention is as follows:
Step 1: preprocess the initially collected noisy corpus: remove data that does not conform to the format, build the feature matrix from the social attribute feature values, perform mean imputation of missing values, remove noise points whose frequency of occurrence is below 1%, and normalize all data.
Step 2: extract the high-quality (PL, NL) corpus with the DB-WTFF feature fusion framework, where the wavelet transform is realized with the PyWavelets toolkit, choosing a Daubechies-7 wavelet for a 5-level decomposition; altogether the extracted high-quality corpus comprises 88,000 C# items and 46,000 SQL items.
Step 3: build the T-SNNC network with the Tflearn framework; the text sequence length is unified to 33, and the dimension of the output fully connected layer is set to 32. In the DS-Bi-LSTM encoding module, the word embedding dimension is 128 and the number of hidden layer units of the bidirectional LSTM is 128. The manually labeled paired data is split into training, validation and test sets at a 9:1:1 ratio; during training the number of samples per batch is 1000, the initial learning rate is 0.002 with a decay rate of 0.99, and after 50 epochs of training a classification accuracy of 82% is obtained.
Step 4: the T-SNNC classification model with trained parameters performs label recognition on the description (NL) texts in the corpus; corpora whose predicted label marks them as redundant code descriptions are filtered out directly, after which the key components of the declarative sentences are extracted and the irrelevant program type fragments at the tail are rejected, retaining a purified high-quality corpus (688705 C# items, 34668 SQL items).
Step 5: remove the comment fragments from the code in all data sets, and use the antlr tool and sqlparse templates to customize parsing of the code's syntactic structure tree; the special parameters in the code (such as strings and numbers) and related variables (such as table names and column names) are uniformly replaced, yielding the key tokens of the structural information of the (C#, SQL) code description texts.
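For the SQL side, this normalization might look like the following sketch with sqlparse (antlr plays the analogous role for C#); the placeholder token names STR, NUM and ID are illustrative assumptions:

import sqlparse
from sqlparse import tokens as T

def normalize_sql(sql: str) -> list:
    # Replace literals and identifiers in a SQL statement with unified
    # placeholder tokens, keeping only the structural key tokens.
    out = []
    for tok in sqlparse.parse(sql)[0].flatten():
        if tok.ttype in T.Literal.String:
            out.append("STR")                 # string literals
        elif tok.ttype in T.Literal.Number:
            out.append("NUM")                 # numeric literals
        elif tok.ttype in T.Name:
            out.append("ID")                  # table / column names
        elif not tok.is_whitespace:
            out.append(tok.value.upper())
    return out

# e.g. normalize_sql("SELECT name FROM users WHERE age > 30")
# -> ['SELECT', 'ID', 'FROM', 'ID', 'WHERE', 'ID', '>', 'NUM']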
Step 6: train the corpus on the CODE-OAN model and adjust parameters according to performance on the validation set. The mini_batch size is 10, and N = 4, i.e. the Encoder end and the Decoder end each stack 4 identical layers. The outputs of all sublayers, the token embeddings and the summary embeddings have dimension 512. The learning rate is set at 0.05 and decays at a rate of 0.95 if accuracy on the validation set drops after 60 rounds of iteration. The gradient max-norm value is set to 0, and dropout = 0.5. Training runs for 50 iterations in total, with the best performance at about iterations 30-40. In decoding, the maximum sequence length is 20 and beam_size = 20.
This embodiment selects several baseline methods for experimental comparison: 1) IR, an information retrieval method: given a code snippet, the edit distance between it and every code snippet in the set is computed in turn, and the description text corresponding to the code with the smallest distance serves as its summary; 2) MOSES, a statistical machine translation method, which can be realized by training an n-gram language model with the Irstlm toolkit; 3) SUM-NN, i.e. the RNN-seq2seq + Attention model, a common RNN sequence model with an attention mechanism; 4) CODE-NN, a recent code summary generation model realized on the Torch deep learning framework; 5) CNN-ATT, the CNN-seq2seq + Attention model, an attention model based on a convolutional network.
This embodiment evaluates the experimental results comprehensively using machine translation metrics and manual questionnaire scoring. The evaluation metrics are ROUGE-N and BLEU-N: ROUGE-N is a recall-based similarity measure and BLEU-N is a precision-based similarity measure; the METEOR metric is therefore also adopted, which computes a harmonic average of precision and recall between the candidate set and the reference set.
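As an illustration of how such overlap metrics are computed, here is a small Python sketch using NLTK for BLEU-2 and METEOR and a direct longest-common-subsequence implementation for ROUGE-L; the tokenization and the smoothing choice are illustrative assumptions:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

def rouge_l(reference, candidate):
    # ROUGE-L F-score from the longest common subsequence (LCS) of two
    # token lists; ROUGE is the recall-oriented member of the family.
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference[i] == candidate[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

ref = "sort a list of integers in ascending order".split()
cand = "sort a list in ascending order".split()

bleu2 = sentence_bleu([ref], cand, weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([ref], cand)      # needs the NLTK wordnet data
print(bleu2, meteor, rouge_l(ref, cand))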
The automatic evaluation metrics of this embodiment are all overlap measures based on N-gram models. A code's description text is about 6-10 words; if the original description text alone is taken as the reference set and the generated summary description as the candidate set, the overlap between the two is too sparse to compute effectively, so additional summary descriptions corresponding to the code must be added. 200 code snippets were randomly selected from the test set as the final evaluation objects, and 10 volunteers were invited to annotate 3 additional corresponding description texts.
The manual questionnaire scoring of this embodiment verifies the practicality of the summaries generated by the sequence model, namely their naturalness (fluent narration) and informativeness (concise summarization). 12 professional members were invited to take part in the survey and answer 2 questions: 1) question 1, on how well naturalness is embodied; 2) question 2, on how well informativeness is embodied. Each answer offers five options ("strongly agree", "somewhat agree", "neutral", "somewhat disagree", "strongly disagree"), and each participant selects only one. In the statistics, strongly agree through strongly disagree represent score values of 1 to 5 in turn. The two aspects are combined by weighted averaging and summation to obtain a comprehensive score.
Let the manually annotated summary sets of this embodiment be hus1, hus2, hus3, the machine-generated summary set be amg, and the summary set from the description texts be tis; then the reference set for the evaluation data is {tis, hus1, hus2, hus3} and the candidate set is {amg}. The candidate and reference sets are split evenly: half is used for model parameter tuning (DEV-set) and half for automatic evaluation and the questionnaire survey (EVA-set).
Finally, the evaluation of the experimental results of the invention is shown in Table 3 and Table 4: Table 3 gives each method's scores on the automatic evaluation metrics, and Table 4 gives each method's scores under manual questionnaire scoring.
Table 3: scores under automatic machine evaluation
Table 4: scores under manual questionnaire evaluation
From the comparative experiments it can be concluded that, in automatic evaluation, the CODE-OAN model substantially outperforms the existing baseline methods on ROUGE, BLEU and METEOR, and that on all sequence models the corpus extracted and purified for high quality outperforms the unpurified corpus, which shows that the DB-WTFF framework and T-SNNC model steps are effective and can significantly improve the quality of automatic summaries.
Finally, to show the experimental effect of the invention, one C# and one SQL code snippet were randomly selected from EVA-set, ensuring that they exist in both the purified and the unpurified corpus (purification means the DB-WTFF and T-SNNC operations). The summaries generated for this code under the different methods are shown in Table 5.
Table 5: examples of code summary generation for C# and SQL
The experiments confirm that the method can efficiently handle the noisy information attached to code descriptions and can generate accurate natural language descriptions for code snippets; that is, the invention proposes a robust code summary generation method based on the self-attention mechanism.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiments is relatively detailed and should therefore not be considered a limitation on the scope of patent protection of the invention; those skilled in the art may, under the inspiration of the invention and without departing from the scope protected by the claims of the invention, make replacements or variations, all of which fall within the protection scope of the invention; the claimed scope of the invention shall be determined by the appended claims.

Claims (8)

1. A robust code summary generation method based on a self-attention mechanism, characterized by comprising the following steps:
Step 1: extraction of high-quality code-description corpus pairs from the programming community;
Step 2: filtering out redundancy from the corpus pairs of code snippets and their description texts;
Step 3: converting the description text corresponding to the code into declarative sentences;
Step 4: code summary generation based on a self-attention sequence model.
2. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 1 comprises the following sub-steps:
Step 1.1: extract noisy corpus pairs (the question's description text, the reply's code text) from the programming community, statistically analyze a variety of social feature values of the code and its description text and the correlations among them, and build a feature matrix F of dimension L x (M+N), where each row vector holds the feature values of one corpus pair and each column vector holds the distribution of one feature; the feature dimension of the description part is M and the feature dimension of the code part is N;
Step 1.2: build a feature fusion framework with double wavelet time-frequency transformation;
First, normalize the scores of the feature matrix F from step 1.1 over its M+N dimensions; then view the M description features and the N code features each as continuously varying L-dimensional signals and convert the signals into the time/frequency space, so the description matrix F_T yields M signals converted into a series of wavelet trees {TTree_1, ..., TTree_M} and the code matrix F_C yields N signals converted into a series of wavelet trees {CTree_1, ..., CTree_N}; finally, splice the coefficient vectors of identical leaf nodes in the two classes of wavelet trees by rows into coefficient matrices A_i (i = 1, ..., P, with P <= log2 M + 1) and B_i (i = 1, ..., Q, with Q <= log2 N + 1), whose dimensions are determined by the corresponding leaf nodes;
Step 1.3: use the expectation maximization (EM) algorithm to estimate the linear fusion weights of the coefficient matrices A_i and B_i from step 1.2 separately and reconstruct them into new signals S_T and S_C;
Step 1.4: use the EM algorithm a second time to estimate the contribution of the two new signals S_T and S_C to the signal fusion, and apply the inverse wavelet transform to the fused signal to obtain a time-domain signal S, which is the comprehensive score of all corpus pairs over the M+N features;
Step 1.5: extract the Top-K data pairs (the question's description text, the reply's code text) with the highest scores as the high-quality code-description corpus pairs.
3. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 2 comprises the following sub-steps:
Step 2.1: collect a manually labeled description corpus of the code, comprising several clean positive samples and several unclean negative samples;
Step 2.2: on the basis of the one-shot Siamese Neural Network (SNN) deep learning framework, design and build a de-redundancy description text filtering algorithm T-SNNC, and use the T-SNNC algorithm to remove from the high-quality corpus extracted in step 1 those corpus pairs that contain redundant descriptions.
4. The robust code summary generation method based on a self-attention mechanism according to claim 3, characterized in that: in building the de-redundancy description text filtering algorithm T-SNNC of step 2.2, the T-SNNC framework shares two identical bidirectional LSTM text semantic encoders that encode the description texts x_1 and x_2 respectively, reducing the embedded representations of the description texts to vectors s_h and s_l; the absolute difference of the two vectors serves as the input of a linear classifier, and the sigmoid activation function produces a probability distribution vector p(x_1, x_2) over the two classes as the predicted value:
p(x_1, x_2) = σ( Σ_j α_j |s_h(j) - s_l(j)| )    (1)
where α_j are the learned parameters and σ denotes the activation function;
the network is trained with the binary cross-entropy loss of formula (2), and an L2 weight loss term is added to the loss function so that the network learns smaller, smoother weights, improving the generalization ability of the model:
L(x_1, x_2) = -t log p(x_1, x_2) - (1 - t) log(1 - p(x_1, x_2)) + λ ||w||_2    (2)
where L is the loss function, ||w||_2 is the weight loss term, t = 1 when the texts x_1 and x_2 belong to the same category, and t = 0 when they belong to different categories;
several manually labeled description text corpora (positive samples, negative samples) are collected, and new paired corpora are built (positive/positive and negative/negative pairs labeled t = 1, positive/negative pairs labeled t = 0), whose labels are used in training to fit the parameters of the T-SNNC network; during prediction, a pairing (text description under test, comparison sample of a known class) is fed into the T-SNNC network; if the predicted probability of label 1 exceeds that of label 0, the label of the description under test is consistent with the comparison sample's; if the probability of label 0 exceeds that of label 1, the label of the description under test is the opposite of the comparison sample's.
5. The robust code summary generation method based on a self-attention mechanism according to claim 4, characterized in that: the bidirectional LSTM encoding framework is divided into 6 layers, comprising a word embedding representation layer, bidirectional LSTM hidden layers, a bidirectional LSTM output layer, an average pooling layer, a fully connected layer and an encoding output layer; the bidirectional LSTM encoding framework takes the word embeddings of all words of the text as input and produces the overall semantic embedding vector of the text as output;
the word embedding representation layer is the input layer; the sequence of a description text is represented as T = (v_1, ..., v_K), where v_i is the id of the corresponding word; the size of the default dictionary is D and the dimension of the word embeddings is d; the word embedding vector e of a word v is obtained by table lookup, as shown in formula (3), where embedding denotes the lookup operation:
e = embedding(v, D, d)    (3)
the bidirectional LSTM hidden layers comprise a forward and a backward LSTM hidden layer; at each moment, each input word embedding is simultaneously connected to both the forward and the backward LSTM hidden layer units, and those two hidden layer units are connected to the same output; let the input word embedding at the current moment t be e_t, the output of the forward LSTM hidden layer unit be hf_t, and the output of the backward LSTM hidden layer unit be hb_t; then the outputs of the forward and backward hidden layer units at the current moment are:
hf_t = H(e_t, hf_{t-1}, c_{t-1}, b_{t-1}),  hb_t = H(e_t, hb_{t+1}, c_{t+1}, b_{t+1})    (4)
where H(·) denotes the LSTM hidden layer operation, c_{t-1} denotes the state value of the Cell unit at the previous moment, and b_{t-1} generally denotes all the biases at the previous moment;
in the bidirectional LSTM output layer, each output unit is simultaneously connected to the forward and backward LSTM hidden layer units of the same moment, i.e.:
g_t = σ( Wf hf_t + Wb hb_t + b_g )    (5)
where Wf and Wb are the connection weights between the forward/backward hidden layers and the bidirectional LSTM output layer respectively, b_g is the bias, and σ is the activation function; the output of this layer is a sequence of vectors, each of whose dimension is consistent with the input vector;
the average pooling layer processes the original feature values through the pooling operation and constructs new features, thereby reducing the dimension of the original effective features, strengthening them and filtering noise; common pooling operations are max pooling and average pooling; average pooling averages all neuron values within a range and takes the local information of every position into account, avoiding loss of information; the average pooling operation is shown in formula (6):
pool(g) = (1/L) Σ_{t=1}^{L} g_t    (6)
where L is the length of the input word sequence and g_t is the vector output by the previous layer;
the fully connected layer performs a matrix multiplication equivalent to a feature space transformation, converting the dimension of the previous layer's vector while preserving useful information:
q = W pool(g) + b    (7)
where W is the fully connected weight, b is the bias and pool(g) is the vector output by the previous layer; the dimension of this layer's output vector is enlarged;
the encoding output layer passes the output of the previous layer through an activation function to obtain the semantic encoding vector of the final entire document:
s = σ(q)    (8)
where σ denotes the activation function and q is the vector output by the previous layer.
6. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 3 comprises the following sub-steps:
Step 3.1: first lowercase all words in all description texts, identify the interrogative sentences among them, and parse each sentence's syntax tree to extract its syntactic structure;
Step 3.2: extract the verb phrases in the interrogative sentences according to subtree matching patterns;
Step 3.3: filter out the redundant character strings describing the program type at the tail of the description text, and take the result as the summary text corpus corresponding to the code.
7. The robust code summary generation method based on a self-attention mechanism according to any one of claims 1-6, characterized in that the specific implementation process of step 4 is:
the (code text, summary text) matches produced by the high-quality extraction and redundancy filtering above form the pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; (c_k, n_k) is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T, k = 1, ..., K; the task of code summary generation is then: given a new code snippet c, learn the feature mapping of (c_k, n_k) with a neural network so as to provide the summary description n* of c, which is equivalent to optimizing the scoring function shown in formula (9):
n* = argmax_n P(n | c)    (9)
given a new code snippet c, the network generates the corresponding summary description n* of c.
8. The robust code summary generation method based on a self-attention mechanism according to claim 7, characterized in that the feature mapping of (c_k, n_k) is learned by the CODE-OAN summary generation model algorithm;
in the model construction stage, on the basis of the self-attention mechanism proposed by Vaswani, a seq2seq model suitable for code-to-text, the CODE-OAN summary generation model, is realized; the model has none of the convolutional or recurrent structure of the Encoder-Decoder in traditional code summarization sequence models and retains only the attention structure, so it can work in parallel to accelerate training; meanwhile, a specific attention mechanism trains the distribution of the description text n given a code snippet c, that is, in the generation phase each word in the summary is generated in sequence from the source code snippet through this self-attention pattern;
the CODE-OAN summary generation model is divided into an encoder and a decoder; at the Encoder end, the input embedding is added to the positional embedding as the input of a stack of N identical layers; each layer consists of a Multi-Head Attention part and a feed-forward FeedForward part, each directly followed by a residual connection layer Add and a feature normalization layer Norm; at the Decoder end, the decoder likewise consists of N identical layers, each being the encoder layer with a Masked Multi-Head Attention + Add & Norm sublayer inserted; the sum of the output embedding and the output's positional embedding serves as the input of the Decoder and passes through the Masked Multi-Head Attention + Add & Norm sublayer, whose output serves as the query input of the next Multi-Head Attention + Add & Norm sublayer, while the Key and Value inputs are the output of the Encoder (the output of the i-th encoder layer corresponds to the input of the i-th decoder layer, where i ∈ [1, N]); the output then goes through an Add & Norm operation and is transformed by a linear + softmax layer to obtain the probability of the desired output.
CN201810306806.1A 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism Active CN108519890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810306806.1A CN108519890B (en) 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism


Publications (2)

Publication Number Publication Date
CN108519890A (en) 2018-09-11
CN108519890B (en) 2021-07-20

Family

ID=63431781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810306806.1A Active CN108519890B (en) 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN108519890B (en)

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344242A (en) * 2018-09-28 2019-02-15 广东工业大学 A kind of dialogue answering method, device, equipment and storage medium
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN109543195A (en) * 2018-11-19 2019-03-29 腾讯科技(深圳)有限公司 A kind of method, the method for information processing and the device of text translation
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109634578A (en) * 2018-10-19 2019-04-16 北京大学 A kind of program creating method based on textual description
CN109739483A (en) * 2018-12-28 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generated statement
CN109784280A (en) * 2019-01-18 2019-05-21 江南大学 Human bodys' response method based on Bi-LSTM-Attention model
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 A kind of code annotation generation method based on program analysis and Recognition with Recurrent Neural Network
CN109886492A (en) * 2019-02-26 2019-06-14 浙江鑫升新能源科技有限公司 Photovoltaic power generation power prediction model and its construction method based on Attention LSTM
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN110018820A (en) * 2019-04-08 2019-07-16 浙江大学滨海产业技术研究院 A method of the Graph2Seq based on deeply study automatically generates Java code annotation
CN110031214A (en) * 2019-04-09 2019-07-19 重庆大学 Gear hobbing quality online evaluation method based on shot and long term memory network
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summarization generation model that extraction-type is combined with production

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090265330A1 (en) * 2008-04-18 2009-10-22 Wen-Huang Cheng Context-based document unit recommendation for sensemaking tasks
CN107102861A (en) * 2017-04-25 2017-08-29 中南大学 A kind of method and system for obtaining function vectors in an open-source code repository
CN107133079A (en) * 2017-05-25 2017-09-05 中国人民解放军国防科学技术大学 A kind of automatic software semantic summary generation method based on problem reports
CN107239446A (en) * 2017-05-27 2017-10-10 中国矿业大学 A kind of intelligent relation extraction method based on neural networks and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SRINIVASAN IYER et al.: "Summarizing Source Code using a Neural Attention Model", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
彭敏 et al.: "High-quality microblog extraction based on kernel principal component analysis and wavelet transform", Computer Engineering (《计算机工程》) *
彭敏 et al.: "Automatic microblog summarization based on high-quality information extraction", Computer Engineering (《计算机工程》) *

Cited By (84)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344242A (en) * 2018-09-28 2019-02-15 广东工业大学 A kind of dialogue answering method, device, equipment and storage medium
CN109634578A (en) * 2018-10-19 2019-04-16 北京大学 A kind of program creating method based on textual description
CN109634578B (en) * 2018-10-19 2021-04-02 北京大学 Program generation method based on text description
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN109543180B (en) * 2018-11-08 2020-12-04 中山大学 Text emotion analysis method based on attention mechanism
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109543667B (en) * 2018-11-14 2023-05-23 北京工业大学 Text recognition method based on attention mechanism
CN109543195A (en) * 2018-11-19 2019-03-29 腾讯科技(深圳)有限公司 A kind of method, the method for information processing and the device of text translation
CN109543195B (en) * 2018-11-19 2022-04-12 腾讯科技(深圳)有限公司 Text translation method, information processing method and device
CN109960506B (en) * 2018-12-03 2023-05-02 复旦大学 Code annotation generation method based on structure perception
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN111354354A (en) * 2018-12-20 2020-06-30 深圳市优必选科技有限公司 Training method and device based on semantic recognition and terminal equipment
CN111354354B (en) * 2018-12-20 2024-02-09 深圳市优必选科技有限公司 Training method, training device and terminal equipment based on semantic recognition
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 A kind of code annotation generation method based on program analysis and recurrent neural networks
CN111428508A (en) * 2018-12-24 2020-07-17 微软技术许可有限责任公司 Style customizable text generation
CN109739483A (en) * 2018-12-28 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generating statements
CN109739483B (en) * 2018-12-28 2022-02-01 北京百度网讯科技有限公司 Method and device for generating statement
CN109784280A (en) * 2019-01-18 2019-05-21 江南大学 Human behavior recognition method based on Bi-LSTM-Attention model
CN109886492A (en) * 2019-02-26 2019-06-14 浙江鑫升新能源科技有限公司 Photovoltaic power generation prediction model based on Attention LSTM and its construction method
CN111723194A (en) * 2019-03-18 2020-09-29 阿里巴巴集团控股有限公司 Abstract generation method, device and equipment
CN110018820A (en) * 2019-04-08 2019-07-16 浙江大学滨海产业技术研究院 A method of the Graph2Seq based on deeply study automatically generates Java code annotation
CN110018820B (en) * 2019-04-08 2022-08-23 浙江大学滨海产业技术研究院 Method for automatically generating Java code annotation based on Graph2Seq of deep reinforcement learning
CN110031214B (en) * 2019-04-09 2020-09-22 重庆大学 Hobbing quality online evaluation method based on long short-term memory network
CN110031214A (en) * 2019-04-09 2019-07-19 重庆大学 Gear hobbing quality online evaluation method based on long short-term memory network
CN110119444B (en) * 2019-04-23 2023-06-30 中电科大数据研究院有限公司 Official document summary generation model combining extractive and abstractive approaches
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summary generation model combining extractive and abstractive approaches
WO2020227970A1 (en) * 2019-05-15 2020-11-19 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for generating abstractive text summarization
CN110223324B (en) * 2019-06-05 2023-06-16 东华大学 Target tracking method of twin matching network based on robust feature representation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN110597979B (en) * 2019-06-13 2023-06-23 中山大学 Self-attention-based abstractive text summarization method
CN110597979A (en) * 2019-06-13 2019-12-20 中山大学 Self-attention-based abstractive text summarization method
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110442675A (en) * 2019-06-27 2019-11-12 平安科技(深圳)有限公司 Question-answer matching processing, model training method, device, equipment and storage medium
CN110263143B (en) * 2019-06-27 2021-06-15 苏州大学 Neural question generation method for improving relevance
CN110263143A (en) * 2019-06-27 2019-09-20 苏州大学 A kind of neural question generation method for improving relevance
CN110399162A (en) * 2019-07-09 2019-11-01 北京航空航天大学 A kind of source code annotation automatic generation method
CN110399162B (en) * 2019-07-09 2021-02-26 北京航空航天大学 Automatic generation method of source code annotation
CN110348014A (en) * 2019-07-10 2019-10-18 电子科技大学 A kind of semantic similarity calculation method based on deep learning
CN110472230A (en) * 2019-07-11 2019-11-19 平安科技(深圳)有限公司 Chinese text recognition method and device
CN110472230B (en) * 2019-07-11 2023-09-05 平安科技(深圳)有限公司 Chinese text recognition method and device
CN110390010A (en) * 2019-07-31 2019-10-29 电子科技大学 A kind of automatic text summarization method
CN110543566B (en) * 2019-09-06 2022-07-22 上海海事大学 Intention classification method based on self-attention neighbor relation coding
CN110543566A (en) * 2019-09-06 2019-12-06 上海海事大学 Intention classification method based on self-attention neighbor relation coding
CN111046907A (en) * 2019-11-02 2020-04-21 国网天津市电力公司 Semi-supervised convolutional network embedding method based on multi-head attention mechanism
CN111046907B (en) * 2019-11-02 2023-10-27 国网天津市电力公司 Semi-supervised convolutional network embedding method based on multi-head attention mechanism
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 Interactive code searching method and device based on structured embedding
CN111159223B (en) * 2019-12-31 2021-09-03 武汉大学 Interactive code searching method and device based on structured embedding
CN111355671A (en) * 2019-12-31 2020-06-30 鹏城实验室 Network traffic classification method, medium and terminal device based on self-attention mechanism
CN111222338A (en) * 2020-01-08 2020-06-02 大连理工大学 Biomedical relation extraction method based on pre-training model and self-attention mechanism
CN111400487A (en) * 2020-03-14 2020-07-10 北京工业大学 Quality evaluation method of text abstract
CN111177326B (en) * 2020-04-10 2020-08-04 深圳壹账通智能科技有限公司 Key information extraction method and device based on fine labeling text and storage medium
CN111177326A (en) * 2020-04-10 2020-05-19 深圳壹账通智能科技有限公司 Key information extraction method and device based on fine labeling text and storage medium
CN111522581A (en) * 2020-04-22 2020-08-11 山东师范大学 Enhanced code annotation automatic generation method and system
CN111522581B (en) * 2020-04-22 2021-06-25 山东师范大学 Enhanced code annotation automatic generation method and system
CN111625276A (en) * 2020-05-09 2020-09-04 山东师范大学 Code abstract generation method and system based on semantic and syntactic information fusion
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN111797242A (en) * 2020-06-29 2020-10-20 哈尔滨工业大学 Code abstract generation method based on code knowledge graph and knowledge migration
CN111797242B (en) * 2020-06-29 2023-04-07 哈尔滨工业大学 Code abstract generation method based on code knowledge graph and knowledge migration
CN112069199A (en) * 2020-08-20 2020-12-11 浙江大学 Multi-round natural language SQL conversion method based on intermediate syntax tree
US20220138425A1 (en) * 2020-11-05 2022-05-05 Adobe Inc. Acronym definition network
US11941360B2 (en) * 2020-11-05 2024-03-26 Adobe Inc. Acronym definition network
CN112562669A (en) * 2020-12-01 2021-03-26 浙江方正印务有限公司 Intelligent digital newspaper automatic summarization and voice interaction news chat method and system
CN112562669B (en) * 2020-12-01 2024-01-12 浙江方正印务有限公司 Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112800172A (en) * 2021-02-07 2021-05-14 重庆大学 Code searching method based on two-stage attention mechanism
CN113032418A (en) * 2021-02-08 2021-06-25 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model
CN113113000B (en) * 2021-04-06 2022-05-13 重庆邮电大学 Lightweight speech recognition method based on adaptive mask and grouping linear transformation
CN113113000A (en) * 2021-04-06 2021-07-13 重庆邮电大学 Lightweight speech recognition method based on adaptive mask and grouping linear transformation
CN112800777B (en) * 2021-04-14 2021-07-30 北京育学园健康管理中心有限公司 Semantic determination method
CN112800777A (en) * 2021-04-14 2021-05-14 北京育学园健康管理中心有限公司 Semantic determination method
CN113326866A (en) * 2021-04-16 2021-08-31 山西大学 Automatic abstract generation method and system fusing semantic scenes
CN113326866B (en) * 2021-04-16 2022-05-31 山西大学 Automatic abstract generation method and system fusing semantic scenes
CN113397482A (en) * 2021-05-19 2021-09-17 中国航天科工集团第二研究院 Human behavior analysis method and system
CN113282336A (en) * 2021-06-11 2021-08-20 重庆大学 Code abstract integration method based on quality assurance framework
CN113282336B (en) * 2021-06-11 2023-11-10 重庆大学 Code abstract integration method based on quality assurance framework
CN113434136A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Code generation method and device, electronic equipment and storage medium
CN113434136B (en) * 2021-06-30 2024-03-05 平安科技(深圳)有限公司 Code generation method, device, electronic equipment and storage medium
CN113609840A (en) * 2021-08-25 2021-11-05 西华大学 Method and system for generating Chinese legal judgment abstract
CN113609840B (en) * 2021-08-25 2023-06-16 西华大学 Chinese law judgment abstract generation method and system
CN114548046A (en) * 2022-04-25 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Text processing method and device
CN115442211B (en) * 2022-08-19 2023-08-04 南京邮电大学 Network log analysis method and device based on Siamese neural network and fixed parse tree
CN115442211A (en) * 2022-08-19 2022-12-06 南京邮电大学 Network log analysis method and device based on Siamese neural network and fixed parse tree
CN115408056A (en) * 2022-10-28 2022-11-29 北京航空航天大学 Code abstract automatic generation method based on information retrieval and neural network
CN117407051A (en) * 2023-12-12 2024-01-16 武汉大学 Code automatic abstracting method based on structure position sensing
CN117407051B (en) * 2023-12-12 2024-03-08 武汉大学 Code automatic abstracting method based on structure position sensing

Also Published As

Publication number Publication date
CN108519890B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN108519890A (en) A kind of robust code summary generation method based on a self-attention mechanism
CN110781680B (en) Semantic similarity matching method based on Siamese network and multi-head attention mechanism
CN109857990B (en) Financial bulletin information extraction method based on document structure and deep learning
CN108363743B (en) Intelligent question generation method and device, and computer-readable storage medium
CN110222163A (en) A kind of intelligent question answering method and system fusing CNN and bidirectional LSTM
CN109635280A (en) A kind of event extraction method based on mark
CN108733653A (en) A kind of sentiment analysis method based on a Skip-gram model fusing part-of-speech and semantic information
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN110020438A (en) Chinese entity disambiguation method and device for enterprises or organizations based on sequence recognition
CN110083710A (en) A kind of word definition generation method based on recurrent neural networks and latent variable structure
CN112836046A (en) Entity recognition method for policy and regulation texts in the four-insurances-and-one-fund domain
CN112765952A (en) Conditional probability combined event extraction method under graph convolution attention mechanism
CN111125333B (en) Generative knowledge question-answering method based on representation learning and a multi-layer coverage mechanism
CN109033073B (en) Textual entailment recognition method and device based on lexical dependency triples
CN112417134A (en) Automatic abstract generation system and method based on voice text deep fusion features
CN110009025A (en) A kind of semi-supervised additive-noise autoencoder for speech lie detection
CN115759092A (en) Network threat information named entity identification method based on ALBERT
CN114926150A (en) Digital intelligent auditing method and device for transformer technology conformance assessment
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN112200674A (en) Stock market emotion index intelligent calculation information system
CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information
CN116561251A (en) Natural language processing method
Bai et al. Gated character-aware convolutional neural network for effective automated essay scoring
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant