CN108519890A - A robust code summary generation method based on a self-attention mechanism - Google Patents

A robust code summary generation method based on a self-attention mechanism

Info

Publication number
CN108519890A
CN108519890A CN201810306806.1A CN201810306806A
Authority
CN
China
Prior art keywords
code
layer
text
output
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810306806.1A
Other languages
Chinese (zh)
Other versions
CN108519890B (en)
Inventor
彭敏
胡刚
袁梦霆
王清
曲金帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN201810306806.1A
Publication of CN108519890A
Application granted
Publication of CN108519890B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management
    • G06F8/72: Code refactoring

Abstract

The invention discloses a robust code summary generation method based on a self-attention mechanism. First, high-quality code-description corpus pairs (the question's description text, the reply's code text) are extracted from the programming community; next, redundant code-description corpus pairs are filtered out; then the question description text corresponding to the code is converted into declarative sentences; finally, code summaries are generated by a sequence model based on the self-attention mechanism. The invention effectively removes redundant and noisy content, the accuracy of the generated summaries improves under both automatic and human evaluation, and the evaluation results are better than existing baseline methods.

Description

A robust code summary generation method based on a self-attention mechanism
Technical field
The invention belongs to the intersection of software engineering and natural language processing, and specifically relates to a robust code summary generation method that combines the advantages of wavelet time-frequency transformation with EM algorithms and of a sequence model based on the self-attention mechanism. It is particularly suitable for code description data in programming communities that carries noisy information.
Background technology
In the evolution of large-scale projects, code annotation is a key job in software maintenance: high-quality comments provide developers with efficient reference information for understanding and updating code. Although numerous program comprehension tools and methods exist, understanding code still takes longer than writing it. Developers tend to skim code entities such as method signatures to reduce workload, but this ignores key information. Code summarization offers a compromise: it can generate brief yet sufficient sentence descriptions that help developers understand the intent of code faster and more accurately.
Existing methods fall into two broad classes: extractive summarization and abstractive (generative) summarization. The former focuses on the lexical rules and syntactic features of code; its output is keywords extracted according to code rules and structure, and its effect is limited. For example, Sridhara proposed extracting key terms through key identifier marking (entity nouns for methods, verbs for procedures) to generate comments for Java method bodies, using TF or TF-IDF to compute weights and selecting the higher-weighted terms to form a summary, usually 5-10 words. McBurney introduced term position weights (method signature, method call, control flow) with three weight distribution schemes, EyeA, EyeB and EyeC, to improve the effect of keyword extraction. Moreno proposed a rule matching method that uses phrase generation matching templates based on 13 manually defined method classes to generate a summary for each method class. Likewise for SQL, Koutrika proposed an interactive system that converts SQL queries into natural language text using NL templates and the database schema. The latter class uses end-to-end deep learning networks, whose accuracy can be raised by training on parallel code-description corpora; their superiority has been confirmed by numerous results. For example, Allamanis applied a convolutional neural network with an attention mechanism to the code summary generation task, producing abstractive summaries for method bodies that fuse code context information. Iyer first proposed a sequence-to-sequence model using recurrent neural networks combined with an attention mechanism, trained on code-description corpora extracted from StackOverflow to generate brief code summaries.
The main problems these methods currently face are:
1) they generate summaries only for code of a specific structure (a method body, a group of method calls), and the semantic information of the code is not reflected;
2) the generative approaches rely on high-quality corpora, are hard to extend to other programming communities, and are easily affected by noise;
3) abstractive code summaries are often phrased as questions, which does not match the style of code comments;
4) even when traditional sequence models incorporate an attention mechanism, they remain fragile when handling code whose length distribution is sparse.
These problems significantly limit the effect of code summary generation, and they are the key issues the invention addresses.
Summary of the invention
The invention aims to improve the accuracy and naturalness of code summary generation. It is robust when handling noisy code-description corpus pairs (question description text, reply code snippet) extracted from programming Q&A communities, overcoming the noise introduced by directly acquiring parallel corpora to train a summary generation model. It also introduces self-attention into the sequence model so that seq2seq sequence learning can capture long-range dependencies with less encoding structure, addressing the problem of poor summary generation caused by such dependencies.
High-quality source code projects often carry detailed code comments or parameter usage guides in API documents, which provide efficient references for programmers to understand and maintain code. The IT Q&A community StackOverflow contains a large number of program-related question-and-answer posts, from which description texts and code snippets can be mined and extracted as code-summary matching pairs. However, programming communities contain a large amount of noise and redundancy, which must be filtered and screened to obtain high-quality information. To this end, the invention first proposes the DB-WTFF feature fusion framework: the attribute features of the question's description segment and the reply's code snippet are each treated as signals and subjected to wavelet time-frequency transformation to highlight fine differences in their feature values. EM algorithms estimate the weights of each description feature and each code feature separately; the fused description score and code score then undergo a second EM estimation, and further fusion yields the sample set with the Top-K combined scores. Next, a deep learning filtering algorithm is proposed: the T-SNNC algorithm, designed on the basis of a Siamese Neural Network (SNN), can classify description short texts with a small number of samples and filter out sample pairs containing redundant description text; syntax tree parsing with NLP techniques then converts the description texts into declarative sentences. Finally, the purified corpus is used to train the CODE-OAN sequence model based on the self-attention mechanism to generate summaries.
The technical solution adopted by the invention is a robust code summary generation method based on a self-attention mechanism, characterized by comprising the following steps:
Step 1: extraction of high-quality code-description corpus pairs (the question's description text, the reply's code snippet) from the programming community;
A high-quality corpus pair extraction framework, the DB-WTFF (Double Wavelet Time-Frequency Transform Feature Fusion) feature fusion framework, computes a comprehensive score for the features of the noisy corpus through wavelet time-frequency transformation (Wavelet Transform, WT) in the signal domain together with the Expectation Maximization (EM) algorithm, and extracts the Top-K corpus pairs with the highest scores.
Step 2: filtering out redundancy from the code-description corpus pairs;
The one-shot Siamese Neural Network (SNN) deep learning framework is introduced to build a de-redundancy description text filtering algorithm, T-SNNC (Title-Siamese Neural Network Classification algorithm for document categorization). The T-SNNC algorithm removes, from the high-quality corpus extracted in step 1, those corpus pairs whose description text is a redundant description.
Step 3: converting the question-form description statements corresponding to the code into declarative sentences;
Interrogative sentences in the output of step 2 are identified through a manually collected list of interrogative character strings, and the syntax tree structure of each sentence is parsed with the Stanford CoreNLP toolkit. Two Stanford Tregex subtree matching rules (VP << /VB./ and VP << /VBG./) extract the verb phrases in the interrogative sentences, and the irrelevant program type description information at the tail of the text is removed, yielding the summary description corresponding to the code.
Step 4: code summary generation based on a self-attention sequence model;
The corpus purified by step 3 forms a pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; then (c_k, n_k), k = 1, ..., K, is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T. A Seq2Seq sequence framework, the CODE-OAN (CODE Only Attention Need) summary generation model algorithm, learns the feature mapping of (c_k, n_k). Given a new code snippet c, the network generates the summary description n* corresponding to c.
The invention first extracts high-quality code-description corpus pairs through wavelet time-frequency transformation and EM algorithms (DB-WTFF), then builds a Siamese neural network classification model (T-SNNC) to filter out redundant descriptions unrelated to the code, and finally trains the sequence model based on the self-attention mechanism (CODE-OAN) on the paired training corpus to generate code summaries. The results show that the method effectively filters redundancy and content noise and can efficiently generate code summaries close to the style of comments; performance improves on both human evaluation (conciseness, informativeness, naturalness) and automatic evaluation (ROUGE-L, BLEU-2, METEOR), outperforming existing baseline methods.
Description of the drawings
Figure 1: schematic flow chart of the implementation of the embodiment of the invention;
Figure 2: schematic diagram of the feature fusion framework with double wavelet time-frequency transformation (DB-WTFF) of the embodiment of the invention;
Figure 3: schematic diagram of the text classification model based on a Siamese neural network (T-SNNC) of the embodiment of the invention;
Figure 4: schematic diagram of the bidirectional text semantic (DS-Bi-LSTM) encoding framework of the embodiment of the invention;
Figure 5: schematic diagram of the CODE-OAN model framework based on the self-attention mechanism of the embodiment of the invention.
Detailed description of the embodiments
To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the implementation examples described here are intended only to illustrate and explain the invention, not to limit it.
The implementation flow of the invention is shown in Figure 1 and mainly comprises: (1) high-quality information extraction: according to the different behavioral attributes of posts extracted from the programming community, a feature matrix about the program Q&A posts is built, and wavelet decomposition transforms the features, viewed as discrete signals of the same dimension, into the frequency domain for processing; the information score is then computed from feature contributions. (2) Low-noise information purification: some noise still remains in the corpus pairs of high-quality posts, and matches with redundant descriptions or defective code must be filtered out; the T-SNNC deep learning algorithm effectively purifies the data stream. (3) Automatic code summarization: the purified high-quality corpus pairs (code, summary) are fed into a sequence-to-sequence neural network for training, learning a code summary generation model. Inputting new code into the model yields the code's comment information.
The invention specifically comprises the following steps:
Step 1: extraction of high-quality code-description corpus pairs (the question's description text, the reply's code text) from the community;
First, noisy corpus pairs (the question's description text, the reply's code text) are extracted from the programming community, and a variety of social feature values of the code and its description text (number of replies, number of upvotes, number of views, etc.) and the relationships among them are statistically analyzed to build a feature matrix F of dimension L x (M+N), where each row vector holds the feature values of one corpus pair and each column vector holds the distribution of one feature; the feature dimension of the description part is M and the feature dimension of the code part is N;
Second, a feature fusion framework with double wavelet time-frequency transformation is built, as shown in Figure 2. The framework first normalizes the scores of the feature matrix F from step 1 over its M+N dimensions, where the feature dimension of the description is M and the feature dimension of the code is N. The M description features and the N code features are then each viewed as continuously varying L-dimensional signals and converted into the time/frequency space, so the description matrix F_T yields M signals converted into a series of wavelet trees {TTree_1, ..., TTree_M}, and the code matrix F_C yields N signals converted into a series of wavelet trees {CTree_1, ..., CTree_N}. Finally, the coefficient vectors of identical leaf nodes in the two classes of wavelet trees are spliced by rows into coefficient matrices A_i (i = 1, ..., P, with P <= log2 M + 1) and B_i (i = 1, ..., Q, with Q <= log2 N + 1), whose dimensions are determined by the corresponding leaf nodes.
The expectation maximization (EM) algorithm is used to estimate the linear fusion weights of the coefficient matrices A_i and B_i from step 1.2 separately and to reconstruct them into new signals S_T and S_C.
The EM algorithm is then applied a second time to estimate the contribution of the two new signals S_T and S_C to the signal fusion, and the fused signal is inverse-transformed to obtain a time-domain signal S, which is the comprehensive score of all corpus pairs over the M+N features.
Finally, the Top-K data pairs (the question's description text, the reply's code text) with the highest scores are extracted as the high-quality code-description corpus pairs.
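By way of illustration only, the wavelet stage of DB-WTFF might be sketched as follows in Python with the PyWavelets toolkit named later in the embodiment; the matrix sizes, the random feature data, the min-max normalization and the uniform fusion weights are placeholders (the actual weights come from the EM estimation described above):

import numpy as np
import pywt

# Hypothetical feature matrix F: L corpus pairs x (M + N) features
# (M description features first, then N code features).
L_PAIRS, M, N = 1024, 4, 6
rng = np.random.default_rng(0)
F = rng.random((L_PAIRS, M + N))

# Normalize the score of each feature column over the M + N dimensions.
F = (F - F.min(axis=0)) / (F.max(axis=0) - F.min(axis=0) + 1e-12)

def wavelet_trees(columns):
    # Each feature column is an L-dimensional signal decomposed into a
    # wavelet tree (a list of coefficient arrays, one per leaf node).
    return [pywt.wavedec(col, "db7", level=5) for col in columns.T]

t_trees = wavelet_trees(F[:, :M])   # TTree_1 .. TTree_M (description)
c_trees = wavelet_trees(F[:, M:])   # CTree_1 .. CTree_N (code)

# Splice the coefficient vectors of identical leaf nodes across the trees
# by rows into the coefficient matrices A_i (description) and B_i (code).
A = [np.vstack([t[i] for t in t_trees]) for i in range(len(t_trees[0]))]
B = [np.vstack([t[i] for t in c_trees]) for i in range(len(c_trees[0]))]

# The EM-estimated linear fusion weights would combine the rows of each
# A_i / B_i; uniform weights stand in here. Reconstruct S_T and S_C.
S_T = pywt.waverec([a.mean(axis=0) for a in A], "db7")[:L_PAIRS]
S_C = pywt.waverec([b.mean(axis=0) for b in B], "db7")[:L_PAIRS]

# Second-stage fusion (placeholder 50/50 weights) yields the comprehensive
# score S; the Top-K pairs by S are kept as the high-quality corpus.
S = 0.5 * S_T + 0.5 * S_C
top_k = np.argsort(-S)[:100]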
Step 2: filtering out redundancy from the code-description corpus pairs;
After extracting the high-quality code and its descriptions, some noisy information still exists, namely redundant descriptions unrelated to the code, which must be removed from the corpus. First, a manually labeled corpus of description texts is collected (clean positive samples: 566 items; unclean negative samples: 232 items);
First, based on the one-shot Siamese neural network (SNN) deep learning framework, the invention proposes a de-redundancy description text classification network (T-SNNC), whose flow is shown in Figure 3. In the T-SNNC framework, two identical bidirectional LSTM text semantic encoders with shared weights encode the description texts x_1 and x_2 respectively, reducing the embedded representations of the description texts to vectors s_h and s_l. The absolute difference of the two vectors is the input of a linear classifier, and the sigmoid activation function produces a probability distribution vector p(x_1, x_2) over the two classes as the predicted value:
p(x_1, x_2) = σ( Σ_j α_j |s_h(j) - s_l(j)| )    (1)
where α_j are the learned parameters and σ denotes the activation function.
Next, the network is trained with the binary cross-entropy loss of formula (2); an L2 weight loss term is added to the loss function so that the network learns smaller, smoother weights, improving the generalization ability of the model:
L(x_1, x_2) = -t log p(x_1, x_2) - (1 - t) log(1 - p(x_1, x_2)) + λ ||w||_2    (2)
where L is the loss function, ||w||_2 is the weight loss term, t = 1 when the texts x_1 and x_2 belong to the same category, and t = 0 when they belong to different categories.
Then, new paired corpora are built (positive/positive and negative/negative pairs are labeled t = 1, positive/negative pairs are labeled t = 0), and these labels are used in training to fit the parameters of the T-SNNC network.
Finally, during prediction, a pairing (text description under test, comparison sample of a known class) is fed into the network; if the predicted probability of label 1 exceeds that of label 0, the label of the description under test is consistent with the comparison sample's; if the probability of label 0 exceeds that of label 1, the label of the description under test is the opposite of the comparison sample's.
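For illustration, a minimal PyTorch sketch of the T-SNNC pairing classifier and training step described above (the embodiment itself builds the network with Tflearn); the encoder is passed in as a module, and the regularization coefficient, the optimizer interface and the convention t = 1 for same-class pairs are illustrative assumptions:

import torch
import torch.nn as nn

class TSNNC(nn.Module):
    # Siamese classifier: one weight-shared encoder applied to both texts,
    # a linear layer over the absolute difference of the two semantic
    # vectors, and a sigmoid output, per formula (1).
    def __init__(self, encoder: nn.Module, dim: int = 32):
        super().__init__()
        self.encoder = encoder           # shared bidirectional LSTM encoder
        self.alpha = nn.Linear(dim, 1)   # the learned weights alpha_j

    def forward(self, x1, x2):
        s_h, s_l = self.encoder(x1), self.encoder(x2)
        return torch.sigmoid(self.alpha(torch.abs(s_h - s_l))).squeeze(-1)

# One training step: binary cross-entropy plus an L2 weight loss term,
# per formula (2); t = 1 for same-class pairs, t = 0 for mixed pairs.
def train_step(model, opt, x1, x2, t, l2=1e-4):
    opt.zero_grad()
    p = model(x1, x2)
    loss = nn.functional.binary_cross_entropy(p, t)
    loss = loss + l2 * sum((w ** 2).sum() for w in model.parameters())
    loss.backward()
    opt.step()
    return loss.item()

# Prediction: pair the text under test with a comparison sample of known
# class; p > 0.5 means "same class", otherwise the opposite class.
def predict_label(model, x_test, x_ref, ref_label):
    same = model(x_test, x_ref).item() > 0.5
    return ref_label if same else 1 - ref_label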
The DS-Bi-LSTM encoding framework involved in the invention is shown in Figure 4. The framework is divided into 6 layers, taking the word embeddings of all words in a document as input and producing the overall semantic embedding of the text as output. The 6 levels are described in detail below:
The word embedding representation layer is the input layer; the sequence of a description text is represented as T = (v_1, ..., v_K), where v_i is the id of the corresponding word. The size of the default dictionary is D, and the dimension of the word embeddings is d. The word embedding vector e of a word v is obtained by table lookup, as shown in formula (3), where embedding denotes the lookup operation:
e = embedding(v, D, d)    (3)
Bidirectional LSTM hidden layers: these comprise a forward and a backward LSTM hidden layer. At each moment, each input word embedding is simultaneously connected to both the forward and the backward LSTM hidden layer units, and those two hidden layer units are connected to the same output. Let the input word embedding at the current moment t be e_t, the output of the forward LSTM hidden layer unit be hf_t, and the output of the backward LSTM hidden layer unit be hb_t; then the outputs of the forward and backward hidden layer units at the current moment are:
hf_t = H(e_t, hf_{t-1}, c_{t-1}, b_{t-1}),  hb_t = H(e_t, hb_{t+1}, c_{t+1}, b_{t+1})    (4)
where H(·) denotes the LSTM hidden layer operation, c_{t-1} denotes the state value of the Cell unit at the previous moment, and b_{t-1} denotes all the biases at the previous moment.
Bidirectional LSTM output layer: each output unit is simultaneously connected to the forward and backward LSTM hidden layer units of the same moment, i.e.:
g_t = σ( Wf hf_t + Wb hb_t + b_g )    (5)
where Wf and Wb are the connection weights between the forward/backward hidden layers and the bidirectional LSTM output layer respectively, b_g is the bias, and σ is the activation function. The output of this layer is a sequence of vectors, and the dimension of each vector is consistent with the input vector.
Average pooling layer: the pooling operation processes the original feature values and constructs new features, thereby reducing the dimension of the original effective features, strengthening them and filtering noise. Common pooling operations are max pooling and average pooling. Since the macroscopic semantics of a text is closely related to every word in the text, the invention uses average pooling, which averages all neuron values within a range and takes the local information of every position into account, avoiding loss of information. The average pooling operation is:
pool(g) = (1/L) Σ_{t=1}^{L} g_t    (6)
where L is the length of the input word sequence and g_t is the vector output by the previous layer.
Fully connected layer: a matrix multiplication equivalent to a feature space transformation; it converts the dimension of the previous layer's vector while preserving useful information, where W is the fully connected weight and b is the bias; the dimension of this layer's output vector is enlarged:
q = W pool(g) + b    (7)
Encoding output layer: the output of the previous layer is passed through an activation function to obtain the semantic encoding vector of the final entire document:
s = σ(q)    (8)
where σ denotes the activation function and q is the vector output by the previous layer.
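A compact PyTorch sketch of the six-layer encoder just described follows; the sizes match the settings given later in the embodiment (word embedding 128, 128 hidden units per direction, 32-dimensional output, sequence length 33), but the module is an illustrative reconstruction rather than the Tflearn implementation. It can serve as the encoder passed to the T-SNNC sketch above:

import torch
import torch.nn as nn

class DSBiLSTMEncoder(nn.Module):
    # Word embedding -> bidirectional LSTM -> average pooling ->
    # fully connected layer -> activation, one semantic vector per text.
    def __init__(self, vocab=10000, d_emb=128, d_hid=128, d_out=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)              # formula (3)
        self.bilstm = nn.LSTM(d_emb, d_hid, batch_first=True,
                              bidirectional=True)          # formulas (4)-(5)
        self.fc = nn.Linear(2 * d_hid, d_out)              # formula (7)

    def forward(self, token_ids):                          # (batch, L)
        g, _ = self.bilstm(self.emb(token_ids))            # (batch, L, 2*d_hid)
        pooled = g.mean(dim=1)                             # average pooling (6)
        return torch.sigmoid(self.fc(pooled))              # s = sigma(q), (8)

# Usage: encode a batch of four padded token-id sequences of length 33.
enc = DSBiLSTMEncoder()
s = enc(torch.randint(0, 10000, (4, 33)))                  # -> shape (4, 32)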
Step 3: sentence style conversion of the code description text;
Most description texts are interrogative sentences or sentences containing personal pronouns, and the end of the text often carries supplementary program description information; such text is not suitable as a code summary. A summary should be a declarative descriptive sentence for the code, so a statement conversion operation is necessary, together with removal of redundancy unrelated to the program type. This sentence conversion work is important: it brings the description information closer to the style of comments and at the same time determines the style of the subsequently generated summaries. First, all words in the description texts are lowercased, and interrogative sentences are identified through the manually collected list of interrogative character strings shown in Table 1. Then the declarative constituents of each sentence are extracted from the syntax tree parsed by the Stanford CoreNLP toolkit. In the parse tree, tags such as (CC: conjunction, CD: cardinal number, JJ: adjective, IN: preposition) identify parts of speech, and tags such as (NP: noun phrase, VP: verb phrase, PP: prepositional phrase, ADJP: adjective phrase) indicate syntactic structure. Then two Stanford Tregex subtree matching rules (VP << /VB./ and VP << /VBG./) extract the verb phrases in the interrogative sentences. These two rules remove constituents such as personal pronouns (PRP, e.g. "I", "you", "he") and interrogative pronouns (e.g. "how", "what") from the syntax tree, finally retaining the verb phrase constituent, that is, the VP. Then, according to the summarized list of key character strings of three kinds of program type descriptions (shown in Table 2), these redundant program type strings at the tail of the description text are directly replaced with a space character, and the result serves as the summary description corpus corresponding to the code; a sketch of this pipeline follows Table 2.
Table 1: list of character strings identifying interrogative sentences
Table 2: list of key character strings of program type descriptions
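A rough Python sketch of this conversion step, assuming a running Stanford CoreNLP server reached through the stanza client and its tregex call; the interrogative markers and program type strings below are small hypothetical stand-ins for Tables 1 and 2, and the handling of the returned match trees is an assumption about the client's output format:

import re
from stanza.server import CoreNLPClient

# Hypothetical stand-ins for Table 1 and Table 2 (the real lists are larger).
INTERROGATIVE_MARKERS = ("how ", "what ", "why ", "is there", "can i")
PROGRAM_TYPE_STRINGS = ("in c#", "in sql", "using linq")

RULES = ["VP << /VB./", "VP << /VBG./"]      # the two Tregex rules

def tree_leaves(tree: str) -> str:
    # Recover the surface words from a bracketed parse tree string.
    return " ".join(re.findall(r"\(\S+ ([^()\s]+)\)", tree))

def to_declarative(description: str, client: CoreNLPClient) -> str:
    text = description.lower()
    if text.startswith(INTERROGATIVE_MARKERS):
        for rule in RULES:                   # extract the verb phrase (VP)
            matches = client.tregex(text, rule)
            trees = [m["match"] for s in matches["sentences"]
                     for m in s.values()]
            if trees:
                text = tree_leaves(trees[0])
                break
    for marker in PROGRAM_TYPE_STRINGS:      # strip tail program-type strings
        text = text.replace(marker, " ")
    return " ".join(text.split())

# Usage (assumes a local CoreNLP installation):
# with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse"]) as c:
#     print(to_declarative("How do I sort a list in C#?", c))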
Step 4: code summary generation with self-attention;
First, the problem setting must be defined: the conversion of program language text into natural language text. The (code snippet, summary description) matches produced by the high-quality extraction and redundancy filtering above form the pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; then each (c_k, n_k), k = 1, ..., K, is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T. The task of code summary generation is then: given a new code snippet c, learn the feature mapping of (c_k, n_k) with a neural network so as to produce the summary description n* of c, which is equivalent to optimizing the scoring function shown in formula (9):
n* = argmax_n P(n | c)    (9)
In the model construction stage, the invention builds on the self-attention mechanism proposed by Vaswani to realize a seq2seq model suitable for code-to-text, CODE-OAN (CODE-Only-Attention-Need). The model has none of the convolutional or recurrent structure of the Encoder-Decoder in traditional code summarization sequence models; it retains only the attention structure and can work in parallel to accelerate training. The method uses a specific attention mechanism to train the distribution of the description text n given a code snippet c. Specifically, in the generation phase, each word of the summary is generated in sequence from the source code snippet through this self-attention pattern.
As shown in Figure 5, the CODE-OAN model is divided into an encoder and a decoder. At the Encoder end, the input embedding is added to the positional embedding and serves as the input of a stack of N identical layers. Each layer consists of a Multi-Head Attention part and a FeedForward (feed-forward) part, each directly followed by an Add (residual connection) & Norm (normalization) layer. At the Decoder end, the decoder likewise consists of N identical layers, each being the encoder layer with a Masked Multi-Head Attention + Add & Norm sublayer inserted. The sum of the output embedding and the output's positional embedding serves as the input of the Decoder; it passes through the Masked Multi-Head Attention + Add & Norm sublayer, whose output serves as the query (Q) input of the next Multi-Head Attention + Add & Norm sublayer, while the Key (K) and Value (V) inputs come from the Encoder (the output of the i-th encoder layer corresponds to the input of the i-th decoder layer, where i ∈ [1, N]). The output then goes through an Add & Norm operation and is transformed by a linear + softmax layer to obtain the probability of the desired output.
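As a rough illustration of such an attention-only seq2seq model, the following compact PyTorch sketch is built on torch.nn.Transformer; it follows the settings given later in the embodiment (N = 4 stacked layers, dimension 512), but the vocabulary sizes, head count, learned positional embedding and greedy decoding are illustrative assumptions, not the patented CODE-OAN implementation:

import math
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    # Attention-only seq2seq: token plus positional embeddings feed a
    # Transformer encoder-decoder; a linear layer (softmax over its
    # logits) produces the output distribution.
    def __init__(self, src_vocab=30000, tgt_vocab=20000, d=512, n_layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.pos = nn.Embedding(512, d)        # learned positional embedding
        self.core = nn.Transformer(d_model=d, nhead=8,
                                   num_encoder_layers=n_layers,
                                   num_decoder_layers=n_layers,
                                   batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)     # linear before softmax

    def embed(self, emb, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return emb(ids) * math.sqrt(emb.embedding_dim) + self.pos(pos)

    def forward(self, code_ids, summary_ids):
        mask = nn.Transformer.generate_square_subsequent_mask(
            summary_ids.size(1))               # masked decoder self-attention
        h = self.core(self.embed(self.src_emb, code_ids),
                      self.embed(self.tgt_emb, summary_ids), tgt_mask=mask)
        return self.out(h)                     # logits over summary tokens

# Greedy decoding sketch: generate the summary n* word by word from code c.
def greedy_decode(model, code_ids, bos=1, eos=2, max_len=20):
    out = torch.full((code_ids.size(0), 1), bos, dtype=torch.long)
    for _ in range(max_len):
        nxt = model(code_ids, out)[:, -1].argmax(-1, keepdim=True)
        out = torch.cat([out, nxt], dim=1)
        if (nxt == eos).all():
            break
    return out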
Below, using data from a real scenario, the programming community StackOverflow, the method of the invention is compared experimentally with baseline algorithms to verify its efficiency and accuracy. The experimental data comes from the StackExchange main site (https://archive.org/download/stackexchange/stackoverflow.com): the XML files, about 50 GB after decompression, cover different types of program data. Through attribute tags (c#, sql/database/oracle), 100,372 items of C# language data and 398,480 items of SQL language data were extracted; each item contains a code text (PL) and its corresponding description text (NL).
The specific implementation of the invention is as follows:
Step 1: preprocess the initially collected noisy corpus: remove data that does not conform to the format, build the feature matrix from the social attribute feature values, perform mean imputation of missing values, remove noise points whose frequency of occurrence is below 1%, and normalize all data.
Step 2: extract the high-quality (PL, NL) corpus with the DB-WTFF feature fusion framework, where the wavelet transform is realized with the PyWavelets toolkit, choosing a Daubechies-7 wavelet for a 5-level decomposition; altogether the extracted high-quality corpus comprises 88,000 C# items and 46,000 SQL items.
Step 3: build the T-SNNC network with the Tflearn framework; the text sequence length is unified to 33, and the dimension of the output fully connected layer is set to 32. In the DS-Bi-LSTM encoding module, the word embedding dimension is 128 and the number of hidden layer units of the bidirectional LSTM is 128. The manually labeled paired data is split into training, validation and test sets at a 9:1:1 ratio; during training the number of samples per batch is 1000, the initial learning rate is 0.002 with a decay rate of 0.99, and after 50 epochs of training a classification accuracy of 82% is obtained.
Step 4: the T-SNNC classification model with trained parameters performs label recognition on the description (NL) texts in the corpus; corpora whose predicted label marks them as redundant code descriptions are filtered out directly, after which the key components of the declarative sentences are extracted and the irrelevant program type fragments at the tail are rejected, retaining a purified high-quality corpus (688705 C# items, 34668 SQL items).
Step 5: remove the comment fragments from the code in all data sets, and use the antlr tool and sqlparse templates to customize parsing of the code's syntactic structure tree; the special parameters in the code (such as strings and numbers) and related variables (such as table names and column names) are uniformly replaced, yielding the key tokens of the structural information of the (C#, SQL) code description texts.
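For the SQL side, this normalization might look like the following sketch with sqlparse (antlr plays the analogous role for C#); the placeholder token names STR, NUM and ID are illustrative assumptions:

import sqlparse
from sqlparse import tokens as T

def normalize_sql(sql: str) -> list:
    # Replace literals and identifiers in a SQL statement with unified
    # placeholder tokens, keeping only the structural key tokens.
    out = []
    for tok in sqlparse.parse(sql)[0].flatten():
        if tok.ttype in T.Literal.String:
            out.append("STR")                 # string literals
        elif tok.ttype in T.Literal.Number:
            out.append("NUM")                 # numeric literals
        elif tok.ttype in T.Name:
            out.append("ID")                  # table / column names
        elif not tok.is_whitespace:
            out.append(tok.value.upper())
    return out

# e.g. normalize_sql("SELECT name FROM users WHERE age > 30")
# -> ['SELECT', 'ID', 'FROM', 'ID', 'WHERE', 'ID', '>', 'NUM']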
Step 6: train the corpus on the CODE-OAN model and adjust parameters according to performance on the validation set. The mini_batch size is 10, and N = 4, i.e. the Encoder end and the Decoder end each stack 4 identical layers. The outputs of all sublayers, the token embeddings and the summary embeddings have dimension 512. The learning rate is set at 0.05 and decays at a rate of 0.95 if accuracy on the validation set drops after 60 rounds of iteration. The gradient max-norm value is set to 0, and dropout = 0.5. Training runs for 50 iterations in total, with the best performance at about iterations 30-40. In decoding, the maximum sequence length is 20 and beam_size = 20.
This embodiment selects several baseline methods for experimental comparison: 1) IR, an information retrieval method: given a code snippet, the edit distance between it and every code snippet in the set is computed in turn, and the description text corresponding to the code with the smallest distance serves as its summary; 2) MOSES, a statistical machine translation method, which can be realized by training an n-gram language model with the Irstlm toolkit; 3) SUM-NN, i.e. the RNN-seq2seq + Attention model, a common RNN sequence model with an attention mechanism; 4) CODE-NN, a recent code summary generation model realized on the Torch deep learning framework; 5) CNN-ATT, the CNN-seq2seq + Attention model, an attention model based on a convolutional network.
This embodiment evaluates the experimental results comprehensively using machine translation metrics and manual questionnaire scoring. The evaluation metrics are ROUGE-N and BLEU-N: ROUGE-N is a recall-based similarity measure and BLEU-N is a precision-based similarity measure; the METEOR metric is therefore also adopted, which computes a harmonic average of precision and recall between the candidate set and the reference set.
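As an illustration of how such overlap metrics are computed, here is a small Python sketch using NLTK for BLEU-2 and METEOR and a direct longest-common-subsequence implementation for ROUGE-L; the tokenization and the smoothing choice are illustrative assumptions:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

def rouge_l(reference, candidate):
    # ROUGE-L F-score from the longest common subsequence (LCS) of two
    # token lists; ROUGE is the recall-oriented member of the family.
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference[i] == candidate[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

ref = "sort a list of integers in ascending order".split()
cand = "sort a list in ascending order".split()

bleu2 = sentence_bleu([ref], cand, weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([ref], cand)      # needs the NLTK wordnet data
print(bleu2, meteor, rouge_l(ref, cand))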
The automatic evaluation metrics of this embodiment are all overlap measures based on N-gram models. A code's description text is about 6-10 words; if the original description text alone is taken as the reference set and the generated summary description as the candidate set, the overlap between the two is too sparse to compute effectively, so additional summary descriptions corresponding to the code must be added. 200 code snippets were randomly selected from the test set as the final evaluation objects, and 10 volunteers were invited to annotate 3 additional corresponding description texts.
The manual questionnaire scoring of this embodiment verifies the practicality of the summaries generated by the sequence model, namely their naturalness (fluent narration) and informativeness (concise summarization). 12 professional members were invited to take part in the survey and answer 2 questions: 1) question 1, on how well naturalness is embodied; 2) question 2, on how well informativeness is embodied. Each answer offers five options ("strongly agree", "somewhat agree", "neutral", "somewhat disagree", "strongly disagree"), and each participant selects only one. In the statistics, strongly agree through strongly disagree represent score values of 1 to 5 in turn. The two aspects are combined by weighted averaging and summation to obtain a comprehensive score.
Let the manually annotated summary sets of this embodiment be hus1, hus2, hus3, the machine-generated summary set be amg, and the summary set from the description texts be tis; then the reference set for the evaluation data is {tis, hus1, hus2, hus3} and the candidate set is {amg}. The candidate and reference sets are split evenly: half is used for model parameter tuning (DEV-set) and half for automatic evaluation and the questionnaire survey (EVA-set).
Finally, the evaluation of the experimental results of the invention is shown in Table 3 and Table 4: Table 3 gives each method's scores on the automatic evaluation metrics, and Table 4 gives each method's scores under manual questionnaire scoring.
Table 3: scores under automatic machine evaluation
Table 4: scores under manual questionnaire evaluation
From the comparative experiments it can be concluded that, in automatic evaluation, the CODE-OAN model substantially outperforms the existing baseline methods on ROUGE, BLEU and METEOR, and that on all sequence models the corpus extracted and purified for high quality outperforms the unpurified corpus, which shows that the DB-WTFF framework and T-SNNC model steps are effective and can significantly improve the quality of automatic summaries.
Finally, to show the experimental effect of the invention, one C# and one SQL code snippet were randomly selected from EVA-set, ensuring that they exist in both the purified and the unpurified corpus (purification means the DB-WTFF and T-SNNC operations). The summaries generated for this code under the different methods are shown in Table 5.
Table 5: examples of code summary generation for C# and SQL
The experiments confirm that the method can efficiently handle the noisy information attached to code descriptions and can generate accurate natural language descriptions for code snippets; that is, the invention proposes a robust code summary generation method based on the self-attention mechanism.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiments is relatively detailed and should therefore not be considered a limitation on the scope of patent protection of the invention; those skilled in the art may, under the inspiration of the invention and without departing from the scope protected by the claims of the invention, make replacements or variations, all of which fall within the protection scope of the invention; the claimed scope of the invention shall be determined by the appended claims.

Claims (8)

1. A robust code summary generation method based on a self-attention mechanism, characterized by comprising the following steps:
Step 1: extraction of high-quality code-description corpus pairs from the programming community;
Step 2: filtering out redundancy from the corpus pairs of code snippets and their description texts;
Step 3: converting the description text corresponding to the code into declarative sentences;
Step 4: code summary generation based on a self-attention sequence model.
2. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 1 comprises the following sub-steps:
Step 1.1: extract noisy corpus pairs (the question's description text, the reply's code text) from the programming community, statistically analyze a variety of social feature values of the code and its description text and the correlations among them, and build a feature matrix F of dimension L x (M+N), where each row vector holds the feature values of one corpus pair and each column vector holds the distribution of one feature; the feature dimension of the description part is M and the feature dimension of the code part is N;
Step 1.2: build a feature fusion framework with double wavelet time-frequency transformation;
First, normalize the scores of the feature matrix F from step 1.1 over its M+N dimensions; then view the M description features and the N code features each as continuously varying L-dimensional signals and convert the signals into the time/frequency space, so the description matrix F_T yields M signals converted into a series of wavelet trees {TTree_1, ..., TTree_M} and the code matrix F_C yields N signals converted into a series of wavelet trees {CTree_1, ..., CTree_N}; finally, splice the coefficient vectors of identical leaf nodes in the two classes of wavelet trees by rows into coefficient matrices A_i (i = 1, ..., P, with P <= log2 M + 1) and B_i (i = 1, ..., Q, with Q <= log2 N + 1), whose dimensions are determined by the corresponding leaf nodes;
Step 1.3: use the expectation maximization (EM) algorithm to estimate the linear fusion weights of the coefficient matrices A_i and B_i from step 1.2 separately and reconstruct them into new signals S_T and S_C;
Step 1.4: use the EM algorithm a second time to estimate the contribution of the two new signals S_T and S_C to the signal fusion, and apply the inverse wavelet transform to the fused signal to obtain a time-domain signal S, which is the comprehensive score of all corpus pairs over the M+N features;
Step 1.5: extract the Top-K data pairs (the question's description text, the reply's code text) with the highest scores as the high-quality code-description corpus pairs.
3. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 2 comprises the following sub-steps:
Step 2.1: collect a manually labeled description corpus of the code, comprising several clean positive samples and several unclean negative samples;
Step 2.2: on the basis of the one-shot Siamese Neural Network (SNN) deep learning framework, design and build a de-redundancy description text filtering algorithm T-SNNC, and use the T-SNNC algorithm to remove from the high-quality corpus extracted in step 1 those corpus pairs that contain redundant descriptions.
4. The robust code summary generation method based on a self-attention mechanism according to claim 3, characterized in that: in building the de-redundancy description text filtering algorithm T-SNNC of step 2.2, the T-SNNC framework shares two identical bidirectional LSTM text semantic encoders that encode the description texts x_1 and x_2 respectively, reducing the embedded representations of the description texts to vectors s_h and s_l; the absolute difference of the two vectors serves as the input of a linear classifier, and the sigmoid activation function produces a probability distribution vector p(x_1, x_2) over the two classes as the predicted value:
p(x_1, x_2) = σ( Σ_j α_j |s_h(j) - s_l(j)| )    (1)
where α_j are the learned parameters and σ denotes the activation function;
the network is trained with the binary cross-entropy loss of formula (2), and an L2 weight loss term is added to the loss function so that the network learns smaller, smoother weights, improving the generalization ability of the model:
L(x_1, x_2) = -t log p(x_1, x_2) - (1 - t) log(1 - p(x_1, x_2)) + λ ||w||_2    (2)
where L is the loss function, ||w||_2 is the weight loss term, t = 1 when the texts x_1 and x_2 belong to the same category, and t = 0 when they belong to different categories;
several manually labeled description text corpora (positive samples, negative samples) are collected, and new paired corpora are built (positive/positive and negative/negative pairs labeled t = 1, positive/negative pairs labeled t = 0), whose labels are used in training to fit the parameters of the T-SNNC network; during prediction, a pairing (text description under test, comparison sample of a known class) is fed into the T-SNNC network; if the predicted probability of label 1 exceeds that of label 0, the label of the description under test is consistent with the comparison sample's; if the probability of label 0 exceeds that of label 1, the label of the description under test is the opposite of the comparison sample's.
5. The robust code summary generation method based on a self-attention mechanism according to claim 4, characterized in that: the bidirectional LSTM encoding framework is divided into 6 layers, comprising a word embedding representation layer, bidirectional LSTM hidden layers, a bidirectional LSTM output layer, an average pooling layer, a fully connected layer and an encoding output layer; the bidirectional LSTM encoding framework takes the word embeddings of all words of the text as input and produces the overall semantic embedding vector of the text as output;
the word embedding representation layer is the input layer; the sequence of a description text is represented as T = (v_1, ..., v_K), where v_i is the id of the corresponding word; the size of the default dictionary is D and the dimension of the word embeddings is d; the word embedding vector e of a word v is obtained by table lookup, as shown in formula (3), where embedding denotes the lookup operation:
e = embedding(v, D, d)    (3)
the bidirectional LSTM hidden layers comprise a forward and a backward LSTM hidden layer; at each moment, each input word embedding is simultaneously connected to both the forward and the backward LSTM hidden layer units, and those two hidden layer units are connected to the same output; let the input word embedding at the current moment t be e_t, the output of the forward LSTM hidden layer unit be hf_t, and the output of the backward LSTM hidden layer unit be hb_t; then the outputs of the forward and backward hidden layer units at the current moment are:
hf_t = H(e_t, hf_{t-1}, c_{t-1}, b_{t-1}),  hb_t = H(e_t, hb_{t+1}, c_{t+1}, b_{t+1})    (4)
where H(·) denotes the LSTM hidden layer operation, c_{t-1} denotes the state value of the Cell unit at the previous moment, and b_{t-1} generally denotes all the biases at the previous moment;
in the bidirectional LSTM output layer, each output unit is simultaneously connected to the forward and backward LSTM hidden layer units of the same moment, i.e.:
g_t = σ( Wf hf_t + Wb hb_t + b_g )    (5)
where Wf and Wb are the connection weights between the forward/backward hidden layers and the bidirectional LSTM output layer respectively, b_g is the bias, and σ is the activation function; the output of this layer is a sequence of vectors, each of whose dimension is consistent with the input vector;
the average pooling layer processes the original feature values through the pooling operation and constructs new features, thereby reducing the dimension of the original effective features, strengthening them and filtering noise; common pooling operations are max pooling and average pooling; average pooling averages all neuron values within a range and takes the local information of every position into account, avoiding loss of information; the average pooling operation is shown in formula (6):
pool(g) = (1/L) Σ_{t=1}^{L} g_t    (6)
where L is the length of the input word sequence and g_t is the vector output by the previous layer;
the fully connected layer performs a matrix multiplication equivalent to a feature space transformation, converting the dimension of the previous layer's vector while preserving useful information:
q = W pool(g) + b    (7)
where W is the fully connected weight, b is the bias and pool(g) is the vector output by the previous layer; the dimension of this layer's output vector is enlarged;
the encoding output layer passes the output of the previous layer through an activation function to obtain the semantic encoding vector of the final entire document:
s = σ(q)    (8)
where σ denotes the activation function and q is the vector output by the previous layer.
6. The robust code summary generation method based on a self-attention mechanism according to claim 1, characterized in that the specific implementation of step 3 comprises the following sub-steps:
Step 3.1: first lowercase all words in all description texts, identify the interrogative sentences among them, and parse each sentence's syntax tree to extract its syntactic structure;
Step 3.2: extract the verb phrases in the interrogative sentences according to subtree matching patterns;
Step 3.3: filter out the redundant character strings describing the program type at the tail of the description text, and take the result as the summary text corpus corresponding to the code.
7. The robust code summary generation method based on a self-attention mechanism according to any one of claims 1-6, characterized in that the specific implementation process of step 4 is:
the (code text, summary text) matches produced by the high-quality extraction and redundancy filtering above form the pair set (U_T, U_C) of size K, where U_C denotes the set of code snippets and U_T the set of corresponding description texts; (c_k, n_k) is regarded as a pair of training corpora in the set, i.e. c_k ∈ U_C, n_k ∈ U_T, k = 1, ..., K; the task of code summary generation is then: given a new code snippet c, learn the feature mapping of (c_k, n_k) with a neural network so as to provide the summary description n* of c, which is equivalent to optimizing the scoring function shown in formula (9):
n* = argmax_n P(n | c)    (9)
given a new code snippet c, the network generates the corresponding summary description n* of c.
8. The robust code summary generation method based on a self-attention mechanism according to claim 7, characterized in that the feature mapping of (c_k, n_k) is learned by the CODE-OAN summary generation model algorithm;
in the model construction stage, on the basis of the self-attention mechanism proposed by Vaswani, a seq2seq model suitable for code-to-text, the CODE-OAN summary generation model, is realized; the model has none of the convolutional or recurrent structure of the Encoder-Decoder in traditional code summarization sequence models and retains only the attention structure, so it can work in parallel to accelerate training; meanwhile, a specific attention mechanism trains the distribution of the description text n given a code snippet c, that is, in the generation phase each word in the summary is generated in sequence from the source code snippet through this self-attention pattern;
the CODE-OAN summary generation model is divided into an encoder and a decoder; at the Encoder end, the input embedding is added to the positional embedding as the input of a stack of N identical layers; each layer consists of a Multi-Head Attention part and a feed-forward FeedForward part, each directly followed by a residual connection layer Add and a feature normalization layer Norm; at the Decoder end, the decoder likewise consists of N identical layers, each being the encoder layer with a Masked Multi-Head Attention + Add & Norm sublayer inserted; the sum of the output embedding and the output's positional embedding serves as the input of the Decoder and passes through the Masked Multi-Head Attention + Add & Norm sublayer, whose output serves as the query input of the next Multi-Head Attention + Add & Norm sublayer, while the Key and Value inputs are the output of the Encoder (the output of the i-th encoder layer corresponds to the input of the i-th decoder layer, where i ∈ [1, N]); the output then goes through an Add & Norm operation and is transformed by a linear + softmax layer to obtain the probability of the desired output.
CN201810306806.1A 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism Active CN108519890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810306806.1A CN108519890B (en) 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism


Publications (2)

Publication Number Publication Date
CN108519890A (en) 2018-09-11
CN108519890B (en) 2021-07-20

Family

ID=63431781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810306806.1A Active CN108519890B (en) 2018-04-08 2018-04-08 Robust code abstract generation method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN108519890B (en)

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344242A (en) * 2018-09-28 2019-02-15 广东工业大学 A kind of dialogue answering method, device, equipment and storage medium
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN109543195A (en) * 2018-11-19 2019-03-29 腾讯科技(深圳)有限公司 A kind of method, the method for information processing and the device of text translation
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109634578A (en) * 2018-10-19 2019-04-16 北京大学 A kind of program creating method based on textual description
CN109739483A (en) * 2018-12-28 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generated statement
CN109784280A (en) * 2019-01-18 2019-05-21 江南大学 Human bodys' response method based on Bi-LSTM-Attention model
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 A kind of code annotation generation method based on program analysis and Recognition with Recurrent Neural Network
CN109886492A (en) * 2019-02-26 2019-06-14 浙江鑫升新能源科技有限公司 Photovoltaic power generation power prediction model and its construction method based on Attention LSTM
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN110018820A (en) * 2019-04-08 2019-07-16 浙江大学滨海产业技术研究院 A method of the Graph2Seq based on deeply study automatically generates Java code annotation
CN110031214A (en) * 2019-04-09 2019-07-19 重庆大学 Gear hobbing quality online evaluation method based on shot and long term memory network
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summarization generation model that extraction-type is combined with production

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090265330A1 (en) * 2008-04-18 2009-10-22 Wen-Huang Cheng Context-based document unit recommendation for sensemaking tasks
CN107102861A (en) * 2017-04-25 2017-08-29 中南大学 A kind of method and system for obtaining function vectors in an open-source code repository
CN107133079A (en) * 2017-05-25 2017-09-05 中国人民解放军国防科学技术大学 A kind of automatic software semantic summary generation method based on problem reports
CN107239446A (en) * 2017-05-27 2017-10-10 中国矿业大学 A kind of intelligent relation extraction method based on neural networks and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SRINIVASAN IYER et al.: "Summarizing Source Code using a Neural Attention Model", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics *
彭敏 et al.: "High-quality microblog extraction based on kernel principal component analysis and wavelet transform", Computer Engineering (《计算机工程》) *
彭敏 et al.: "Automatic microblog summarization based on high-quality information extraction", Computer Engineering (《计算机工程》) *

Cited By (84)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344242A (en) * 2018-09-28 2019-02-15 广东工业大学 A kind of dialogue answering method, device, equipment and storage medium
CN109634578A (en) * 2018-10-19 2019-04-16 北京大学 A kind of program creating method based on textual description
CN109634578B (en) * 2018-10-19 2021-04-02 北京大学 Program generation method based on text description
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN109543180B (en) * 2018-11-08 2020-12-04 中山大学 Text emotion analysis method based on attention mechanism
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
CN109543667B (en) * 2018-11-14 2023-05-23 北京工业大学 Text recognition method based on attention mechanism
CN109543195A (en) * 2018-11-19 2019-03-29 腾讯科技(深圳)有限公司 A kind of method, the method for information processing and the device of text translation
CN109543195B (en) * 2018-11-19 2022-04-12 腾讯科技(深圳)有限公司 Text translation method, information processing method and device
CN109960506B (en) * 2018-12-03 2023-05-02 复旦大学 Code annotation generation method based on structure perception
CN109960506A (en) * 2018-12-03 2019-07-02 复旦大学 A kind of code annotation generation method based on structure perception
CN111354354A (en) * 2018-12-20 2020-06-30 深圳市优必选科技有限公司 Training method and device based on semantic recognition and terminal equipment
CN111354354B (en) * 2018-12-20 2024-02-09 深圳市优必选科技有限公司 Training method, training device and terminal equipment based on semantic recognition
CN109783079A (en) * 2018-12-21 2019-05-21 南京航空航天大学 A kind of code annotation generation method based on program analysis and recurrent neural networks
CN111428508A (en) * 2018-12-24 2020-07-17 微软技术许可有限责任公司 Style customizable text generation
CN109739483A (en) * 2018-12-28 2019-05-10 北京百度网讯科技有限公司 Method and apparatus for generating statements
CN109739483B (en) * 2018-12-28 2022-02-01 北京百度网讯科技有限公司 Method and device for generating statement
CN109784280A (en) * 2019-01-18 2019-05-21 江南大学 Human behavior recognition method based on Bi-LSTM-Attention model
CN109886492A (en) * 2019-02-26 2019-06-14 浙江鑫升新能源科技有限公司 Photovoltaic power generation prediction model based on Attention LSTM and its construction method
CN111723194A (en) * 2019-03-18 2020-09-29 阿里巴巴集团控股有限公司 Abstract generation method, device and equipment
CN110018820A (en) * 2019-04-08 2019-07-16 浙江大学滨海产业技术研究院 A method of the Graph2Seq based on deeply study automatically generates Java code annotation
CN110018820B (en) * 2019-04-08 2022-08-23 浙江大学滨海产业技术研究院 Method for automatically generating Java code annotation based on Graph2Seq of deep reinforcement learning
CN110031214B (en) * 2019-04-09 2020-09-22 重庆大学 Hobbing quality online evaluation method based on long short-term memory network
CN110031214A (en) * 2019-04-09 2019-07-19 重庆大学 Gear hobbing quality online evaluation method based on long short-term memory network
CN110119444B (en) * 2019-04-23 2023-06-30 中电科大数据研究院有限公司 Official document summary generation model combining extractive and abstractive approaches
CN110119444A (en) * 2019-04-23 2019-08-13 中电科大数据研究院有限公司 A kind of official document summary generation model combining extractive and abstractive approaches
WO2020227970A1 (en) * 2019-05-15 2020-11-19 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for generating abstractive text summarization
CN110223324B (en) * 2019-06-05 2023-06-16 东华大学 Target tracking method of twin matching network based on robust feature representation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN110597979B (en) * 2019-06-13 2023-06-23 中山大学 Self-attention-based abstractive text summarization method
CN110597979A (en) * 2019-06-13 2019-12-20 中山大学 Self-attention-based abstractive text summarization method
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110442675A (en) * 2019-06-27 2019-11-12 平安科技(深圳)有限公司 Question-answer matching processing, model training method, device, equipment and storage medium
CN110263143B (en) * 2019-06-27 2021-06-15 苏州大学 Neural question generation method for improving relevance
CN110263143A (en) * 2019-06-27 2019-09-20 苏州大学 A kind of neural question generation method for improving relevance
CN110399162A (en) * 2019-07-09 2019-11-01 北京航空航天大学 A kind of source code annotation automatic generation method
CN110399162B (en) * 2019-07-09 2021-02-26 北京航空航天大学 Automatic generation method of source code annotation
CN110348014A (en) * 2019-07-10 2019-10-18 电子科技大学 A kind of semantic similarity calculation method based on deep learning
CN110472230A (en) * 2019-07-11 2019-11-19 平安科技(深圳)有限公司 Chinese text recognition method and device
CN110472230B (en) * 2019-07-11 2023-09-05 平安科技(深圳)有限公司 Chinese text recognition method and device
CN110390010A (en) * 2019-07-31 2019-10-29 电子科技大学 A kind of automatic text summarization method
CN110543566B (en) * 2019-09-06 2022-07-22 上海海事大学 Intention classification method based on self-attention neighbor relation coding
CN110543566A (en) * 2019-09-06 2019-12-06 上海海事大学 Intention classification method based on self-attention neighbor relation coding
CN111046907A (en) * 2019-11-02 2020-04-21 国网天津市电力公司 Semi-supervised convolutional network embedding method based on multi-head attention mechanism
CN111046907B (en) * 2019-11-02 2023-10-27 国网天津市电力公司 Semi-supervised convolutional network embedding method based on multi-head attention mechanism
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 Interactive code searching method and device based on structured embedding
CN111159223B (en) * 2019-12-31 2021-09-03 武汉大学 Interactive code searching method and device based on structured embedding
CN111355671A (en) * 2019-12-31 2020-06-30 鹏城实验室 Network traffic classification method, medium and terminal device based on self-attention mechanism
CN111222338A (en) * 2020-01-08 2020-06-02 大连理工大学 Biomedical relation extraction method based on pre-training model and self-attention mechanism
CN111400487A (en) * 2020-03-14 2020-07-10 北京工业大学 Quality evaluation method of text abstract
CN111177326B (en) * 2020-04-10 2020-08-04 深圳壹账通智能科技有限公司 Key information extraction method and device based on fine labeling text and storage medium
CN111177326A (en) * 2020-04-10 2020-05-19 深圳壹账通智能科技有限公司 Key information extraction method and device based on fine labeling text and storage medium
CN111522581A (en) * 2020-04-22 2020-08-11 山东师范大学 Enhanced code annotation automatic generation method and system
CN111522581B (en) * 2020-04-22 2021-06-25 山东师范大学 Enhanced code annotation automatic generation method and system
CN111625276A (en) * 2020-05-09 2020-09-04 山东师范大学 Code abstract generation method and system based on semantic and syntactic information fusion
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN111797242A (en) * 2020-06-29 2020-10-20 哈尔滨工业大学 Code abstract generation method based on code knowledge graph and knowledge migration
CN111797242B (en) * 2020-06-29 2023-04-07 哈尔滨工业大学 Code abstract generation method based on code knowledge graph and knowledge migration
CN112069199A (en) * 2020-08-20 2020-12-11 浙江大学 Multi-round natural language SQL conversion method based on intermediate syntax tree
US20220138425A1 (en) * 2020-11-05 2022-05-05 Adobe Inc. Acronym definition network
US11941360B2 (en) * 2020-11-05 2024-03-26 Adobe Inc. Acronym definition network
CN112562669A (en) * 2020-12-01 2021-03-26 浙江方正印务有限公司 Intelligent digital newspaper automatic summarization and voice interaction news chat method and system
CN112562669B (en) * 2020-12-01 2024-01-12 浙江方正印务有限公司 Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112800172A (en) * 2021-02-07 2021-05-14 重庆大学 Code searching method based on two-stage attention mechanism
CN113032418A (en) * 2021-02-08 2021-06-25 浙江大学 Method for converting complex natural language query into SQL (structured query language) based on tree model
CN113113000B (en) * 2021-04-06 2022-05-13 重庆邮电大学 Lightweight speech recognition method based on adaptive mask and grouping linear transformation
CN113113000A (en) * 2021-04-06 2021-07-13 重庆邮电大学 Lightweight speech recognition method based on adaptive mask and grouping linear transformation
CN112800777B (en) * 2021-04-14 2021-07-30 北京育学园健康管理中心有限公司 Semantic determination method
CN112800777A (en) * 2021-04-14 2021-05-14 北京育学园健康管理中心有限公司 Semantic determination method
CN113326866A (en) * 2021-04-16 2021-08-31 山西大学 Automatic abstract generation method and system fusing semantic scenes
CN113326866B (en) * 2021-04-16 2022-05-31 山西大学 Automatic abstract generation method and system fusing semantic scenes
CN113397482A (en) * 2021-05-19 2021-09-17 中国航天科工集团第二研究院 Human behavior analysis method and system
CN113282336A (en) * 2021-06-11 2021-08-20 重庆大学 Code abstract integration method based on quality assurance framework
CN113282336B (en) * 2021-06-11 2023-11-10 重庆大学 Code abstract integration method based on quality assurance framework
CN113434136A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Code generation method and device, electronic equipment and storage medium
CN113434136B (en) * 2021-06-30 2024-03-05 平安科技(深圳)有限公司 Code generation method, device, electronic equipment and storage medium
CN113609840A (en) * 2021-08-25 2021-11-05 西华大学 Method and system for generating Chinese legal judgment abstract
CN113609840B (en) * 2021-08-25 2023-06-16 西华大学 Chinese law judgment abstract generation method and system
CN114548046A (en) * 2022-04-25 2022-05-27 阿里巴巴达摩院(杭州)科技有限公司 Text processing method and device
CN115442211B (en) * 2022-08-19 2023-08-04 南京邮电大学 Network log analysis method and device based on Siamese neural network and fixed parse tree
CN115442211A (en) * 2022-08-19 2022-12-06 南京邮电大学 Network log analysis method and device based on Siamese neural network and fixed parse tree
CN115408056A (en) * 2022-10-28 2022-11-29 北京航空航天大学 Code abstract automatic generation method based on information retrieval and neural network
CN117407051A (en) * 2023-12-12 2024-01-16 武汉大学 Code automatic abstracting method based on structure position sensing
CN117407051B (en) * 2023-12-12 2024-03-08 武汉大学 Code automatic abstracting method based on structure position sensing

Also Published As

Publication number Publication date
CN108519890B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN108519890A (en) A kind of robust code summary generation method based on a self-attention mechanism
CN110781680B (en) Semantic similarity matching method based on Siamese network and multi-head attention mechanism
CN109857990B (en) Financial bulletin information extraction method based on document structure and deep learning
CN108363743B (en) Intelligent question generation method and device, and computer-readable storage medium
CN110222163A (en) A kind of intelligent question answering method and system fusing CNN and bidirectional LSTM
CN109635280A (en) A kind of event extraction method based on mark
CN108733653A (en) A kind of sentiment analysis method based on a Skip-gram model fusing part-of-speech and semantic information
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN110020438A (en) Chinese entity disambiguation method and device for enterprises or organizations based on sequence recognition
CN110083710A (en) A kind of word definition generation method based on recurrent neural networks and latent variable structure
CN112836046A (en) Entity recognition method for policy and regulation texts in the four-insurances-and-one-fund domain
CN112765952A (en) Conditional probability combined event extraction method under graph convolution attention mechanism
CN111125333B (en) Generative knowledge question-answering method based on representation learning and a multi-layer coverage mechanism
CN109033073B (en) Textual entailment recognition method and device based on lexical dependency triples
CN112417134A (en) Automatic abstract generation system and method based on voice text deep fusion features
CN110009025A (en) A kind of semi-supervised additive-noise autoencoder for speech lie detection
CN115759092A (en) Network threat information named entity identification method based on ALBERT
CN114926150A (en) Digital intelligent auditing method and device for transformer technology conformance assessment
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN112200674A (en) Stock market emotion index intelligent calculation information system
CN117033423A (en) SQL generation method injecting optimal schema items and historical interaction information
CN116561251A (en) Natural language processing method
Bai et al. Gated character-aware convolutional neural network for effective automated essay scoring
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant