CN110532554A - Chinese abstract generation method, system and storage medium - Google Patents

Chinese abstract generation method, system and storage medium

Info

Publication number
CN110532554A
Authority
CN
China
Prior art keywords
chinese
word
vector
sequence
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910787889.5A
Other languages
Chinese (zh)
Other versions
CN110532554B (en)
Inventor
李维勇 (Li Weiyong)
柳斌 (Liu Bin)
张伟 (Zhang Wei)
李建林 (Li Jianlin)
李方方 (Li Fangfang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing College of Information Technology
Original Assignee
Nanjing College of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing College of Information Technology
Priority to CN201910787889.5A (granted as CN110532554B)
Publication of CN110532554A
Application granted
Publication of CN110532554B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese abstract generation method, system and storage medium. The method includes the steps of: obtaining a target text and determining the Chinese word vector sequence of the target text; inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors; recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder; the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text. The invention can improve the generation quality and readability of Chinese text abstracts.

Description

Chinese abstract generation method, system and storage medium
Technical field
The present invention relates to a Chinese abstract generation method, system and storage medium, and belongs to the technical field of text information processing.
Background technique
Automatic summarization is a technology that uses computers to realize automatic text analysis, content condensation and abstract generation. It is currently an auxiliary means of solving the problem of information overload: it can help people understand natural language text and obtain key information more quickly, accurately and comprehensively, and it has important practical significance in both industry and commerce.
Commonly used abstract generation methods generally suffer from low generation quality and poor readability of the generated Chinese text abstracts.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a Chinese abstract generation method, system and storage medium capable of improving the generation quality and readability of Chinese text abstracts.
In order to achieve the above object, the present invention is realized by the following technical solutions:
In a first aspect, the present invention provides a Chinese abstract generation method, the method comprising the following steps:
obtaining a target text, and splitting the Chinese characters in the target text into stroke sequences;
determining a Chinese word vector sequence of the target text according to the stroke sequences;
inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors;
recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text.
With reference to the first aspect, further, the method of determining the Chinese word vector sequence of the target text includes:
performing n-gram segmentation on the stroke sequences to obtain the n-gram information of the Chinese-character strokes;
predicting the context of a center word with a Skip-Gram model according to the n-grams, to obtain the corresponding Chinese word vector sequence.
With reference to the first aspect, further, the method of obtaining the n-gram information of the Chinese-character strokes includes:
splitting words into characters and finding the stroke sequence corresponding to each character;
converting the stroke sequences to IDs;
taking n-grams over the ID-converted stroke sequences to obtain the n-gram information.
With reference to the first aspect, further, the encoder uses a bidirectional long short-term memory neural network.
With reference to the first aspect, further, the method of generating the semantic vectors includes:
inputting the Chinese word vector sequence forward and in reverse into the bidirectional long short-term memory neural network, obtaining, for each word, the two hidden states corresponding to the two orders;
concatenating the two hidden states head to tail to generate the semantic vector.
With reference to the first aspect, further, the method of recombining, from the semantic vectors, the full-text semantics best suited to the current time step includes:
adding an attention mechanism when the encoder generates sentence semantic vectors, to calculate the influence weights of the different input words on the decoder side;
recombining the full-text semantic information best suited to the current time step according to the influence weights of the input words on the decoder side, combined with the hidden state fed back by the decoder.
With reference to the first aspect, further, the method also includes optimizing the generated word sequence using a beam search algorithm.
With reference to the first aspect, further, the method also includes preprocessing the target text, including:
removing special characters, the special characters including punctuation marks, stop-word modal particles and adversative conjunctions;
replacing all dates with TAG_DATE;
replacing hyperlink URLs with the label TAG_URL;
replacing numbers with TAG_NUMBER;
replacing English words with TAG_NAME_EN.
In a second aspect, the present invention provides a Chinese abstract generation system, including a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of any one of the aforementioned methods.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, realizing the steps of any one of the aforementioned methods.
Compared with the prior art, the beneficial effects of the invention are as follows: the Chinese word vector sequence of the target text is input into a pre-trained encoder to generate semantic vectors; the full-text semantics best suited to the current time step are recombined from the semantic vectors, and the recombined intermediate semantics summarizing the full text are sent to a pre-trained decoder; the decoder infers the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, and the word sequence finally generated is the abstract of the target text. The invention can increase the features needed for phrase understanding and capture the parts semantically similar to the original characters, helping to improve the generation quality and readability of Chinese text abstracts.
Description of the drawings
Fig. 1 is a flow chart of a Chinese abstract generation method provided according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the training stage and the test stage of the language model provided according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the method of splitting the word "university" (大学) into stroke n-grams, provided according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of the bidirectional long short-term memory neural network provided according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of the Seq2Seq model provided according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the method of choosing the best text sequence using the beam search algorithm, provided according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the method of computing semantic vectors using the attention mechanism, provided according to an embodiment of the present invention.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings. The following embodiments are only used to illustrate the technical solution of the present invention more clearly, and are not intended to limit the protection scope of the present invention.
As shown in Fig. 1, a flow chart of a Chinese abstract generation method provided according to an embodiment of the present invention, the method specifically includes the following steps:
A) Text preprocessing: after the target text is segmented into words, word vectorization is carried out and the corresponding vocabulary is constructed; the word vector sequence formed serves as the input of the next stage.
B) Semantic understanding: exploiting the memory function of recurrent neural networks, the word vector sequence of the first stage is input into the encoder (a bidirectional long short-term memory neural network, abbreviated Bi-LSTM); the encoder generates the semantic vector of each piece of text and transmits it to the next stage.
C) Word computation: the application adds an attention mechanism when the encoder generates sentence semantic vectors; the attention mechanism recombines, according to the hidden state fed back by the decoder, the full-text semantic information best suited to the current time step, and sends the recombined intermediate semantic information to the decoder for the word prediction of the current time step.
D) Abstract generation: at this stage the decoder (an RNN, Recurrent Neural Network) infers the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantic vector summarizing the full text; the word sequence finally generated is the abstract sentence.
In the four processes above, stroke vector encoding is introduced for the characteristics of Chinese, and the language model for automatic summarization is trained on that basis. To make the structure of the model of the embodiment of the present invention clearer, the model in Fig. 2 is divided into two parts, a training stage and a test stage.
The left side of Fig. 2 is the training stage of the model of the embodiment of the present invention; the arrows show the transmission direction of the data and the backpropagation of the parameters, and the underlined part is the stroke-based encoding for Chinese used by the embodiment of the present invention. The remaining parts include the encoder composed of the Bi-LSTM, the decoder formed from RNN units, and the optimization of the attention mechanism. The right side is the test stage of the model, which mainly tests the trained decoder: a piece of test text is input and, after the stroke-based encoding of the Chinese characters, the automatic abstract is generated by the trained decoder. At this stage, in order to maximize the probability of the generated sentence, beam search is added to widen the range of candidate sentences and optimize the fluency of the generated abstract.
The embodiment of the present invention splits each word into a stroke sequence and uses stroke-based n-gram information to capture components such as "知" (to know) inside "智" (wisdom), which are semantically close to the original character and can only be captured at the finer stroke granularity; this brings a clear improvement to word vector representations of Chinese.
For word vector encoding, the embodiment of the present invention uses the Skip-Gram training method of Word2vec, predicting the context through the center word. In Word2vec, each character is initialized using simple one-hot encoding; borrowing the idea of FastText, each word is instead converted into its corresponding stroke sequence. For example, "大学" (university) becomes the concatenation of the stroke sequences of "大" and "学"; n-gram segmentation is performed on such a sequence, and the segmented, initialized strokes serve as the input, so that the deep associations existing between Chinese characters can be captured. The significance of this method is that, by splitting Chinese characters into strokes or radicals, the semantic information carried by the strokes or radicals enriches the features of the text itself, so that the structural similarity between similar characters is captured and the features needed for phrase understanding are increased. From a philological point of view, words with similar meanings tend to have similar structures in Chinese character formation; this is the theoretical basis of the Chinese character encoding method of the embodiment of the present invention.
There are more than 30 subdivided kinds of Chinese-character strokes; here the strokes are grouped into five major classes: horizontal, vertical, left-falling, right-falling, and turning. As shown in Table 1, these five stroke classes are numbered so that their corresponding vectors can conveniently be looked up in the dictionary.
Table 1: Chinese stroke encoding
Stroke name               Shape     ID
横 (horizontal)           一        1
竖 (vertical)             丨, 亅    2
撇 (left-falling)         ノ        3
捺 (right-falling)        乀, 丶    4
折 (turning)              乛, 乙    5
As shown in Fig. 3, the process by which a word is split into strokes and n-gram values are taken is broadly divided into four steps (a sketch of this pipeline follows the list):
a) first, the word is split into characters;
b) the stroke sequence corresponding to each character is found;
c) the stroke sequence obtained in the previous step is converted to IDs;
d) n-grams are taken over the ID-converted stroke sequence. Each stroke n-gram represents a vector whose dimension is consistent with the word-vector dimension of the context words. Throughout the text, the same word or stroke n-gram appearing in different places shares the same vector.
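A minimal Python sketch of the four steps above. The character-to-stroke-ID table STROKES is a hypothetical stand-in: a real system would load a complete stroke dictionary covering all Chinese characters, and the stroke IDs shown for "学" are illustrative only, not authoritative.

```python
# Hypothetical character-to-stroke-ID table (a real system loads a full one)
STROKES = {
    "大": [1, 3, 4],                  # 横(1), 撇(3), 捺(4)
    "学": [4, 4, 3, 4, 5, 5, 2, 1],   # illustrative stroke IDs only
}

def stroke_ngrams(word, n_values=(3, 4, 5)):
    # steps a)-c): split the word into characters and map each character
    # to its ID-converted stroke sequence
    ids = []
    for ch in word:
        ids.extend(STROKES[ch])
    # step d): slide n-gram windows of sizes 3, 4 and 5 over the sequence
    grams = []
    for n in n_values:
        grams.extend(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))
    return grams

print(stroke_ngrams("大学")[:4])   # e.g. [(1, 3, 4), (3, 4, 4), ...]
```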
A word has a relatively high similarity with its context words. The similarity between the current word w and its context word c is expressed using the inner product of their vectors, where $\vec{w}$ and $\vec{c}$ respectively denote the vector representations of w and c. A vector is assigned to each stroke n-gram and to each context word, and the similarity is calculated from the sum of the n-gram vectors of all strokes composing the current word w and the word vector of the context word. The stroke n-gram vectors of all words in the corpus are stored in a dictionary S, where S(w) denotes the stroke n-gram set of the word w. The similarity function is:

$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c} \qquad (1)$$

where q is an element of S(w) and $\vec{q}$ is the embedding vector of q.
A language model is intended to predict the probability of a sentence occurring. The Skip-Gram model adopted by the embodiment of the present invention uses the center word w to predict the context c, that is, to calculate the probability p(c | w); this probability is calculated here using the softmax function:

$$p(c \mid w) = \frac{\exp(\mathrm{sim}(w, c))}{\sum_{c' \in V} \exp(\mathrm{sim}(w, c'))} \qquad (2)$$

where c' ranges over the words of the corpus vocabulary V. It can be seen that computing the denominator costs |V| similarity evaluations, so the computation of the denominator is accelerated using negative sampling. The idea of negative sampling is to reduce the number of negative samples so as to reduce the number of weights the model needs to update. For example, with the input "university" under one-hot encoding, the output layer is expected to output 1 at the neuron nodes of the co-occurring words "whole nation" and "ranking"; if |V| is 10000, the remaining 9999 nodes are expected to output 0, and those 9999 samples are the negative samples. Negative sampling simply draws 5 to 20 of these 9999 negative samples and updates only their weights, leaving the remaining weights unchanged, thereby reducing the amount of computation. The loss function based on negative sampling is:
$$\mathcal{L} = \sum_{w \in D} \sum_{c \in T(w)} \left[ \log \sigma(\mathrm{sim}(w, c)) + \lambda\, \mathbb{E}_{c' \sim P} \log \sigma(-\mathrm{sim}(w, c')) \right] \qquad (3)$$

In the above formula, D is the set of all training words in the corpus, T(w) is the set of context words of the current word within the window, σ is the sigmoid activation function, σ(x) = (1 + exp(−x))⁻¹, λ is the number of negative samples, and $\mathbb{E}_{c' \sim P}$ is the expectation taken so that the negative samples follow the distribution P. To describe the detailed flow of this algorithm more fully, Table 2 gives the process of stroke encoding:
Table 2: The Stroke_Embedding (stroke embedding) algorithm process
An important parameter of the Stroke_Embedding algorithm is n_gram. Considering the typical number of unit strokes in Chinese characters, its values here are 3, 4 and 5, so as to capture most of the component information contained in Chinese characters.
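As an illustration of Eqs. (1) and (3), the following sketch computes the negative-sampling loss of one training pair; the embedding containers `gram_vecs` (stroke n-gram vectors) and the context vectors are hypothetical stand-ins for the trainable parameters of the Stroke_Embedding algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sim(word_grams, ctx_vec, gram_vecs):
    # Eq. (1): sum over the word's stroke n-grams of their inner products
    # with the context word vector
    return sum(gram_vecs[q] @ ctx_vec for q in word_grams)

def pair_loss(word_grams, ctx_vec, neg_ctx_vecs, gram_vecs):
    # one (w, c) term of Eq. (3), with lambda = len(neg_ctx_vecs)
    pos = np.log(sigmoid(sim(word_grams, ctx_vec, gram_vecs)))
    neg = sum(np.log(sigmoid(-sim(word_grams, v, gram_vecs)))
              for v in neg_ctx_vecs)
    return -(pos + neg)   # minimizing this maximizes the objective of Eq. (3)
```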
In long text sequences it is critical to capture the semantics of the text as completely as possible, and the embodiment of the present invention captures bidirectional semantic dependencies through a Bi-LSTM network. Fig. 4 is a schematic diagram of the Bi-LSTM network structure. When used as the encoder or decoder of a Sequence-to-Sequence model (hereinafter abbreviated as Seq2Seq model), although it does not need to complete classification tasks such as sentiment analysis, the Bi-LSTM locates the semantic vector or text abstract more accurately and so brings an improvement in effect. The Bi-LSTM corresponds to the encoding part: it inputs the input sentence sequence forward and in reverse into the LSTM networks, obtains, for each word, the two hidden states $h_t$ and $h'_t$ corresponding to the two orders, and then concatenates the hidden states of the two orders:

$$h_{new} = \mathrm{concatenate}(h_t, h'_t) \qquad (4)$$

that is, head to tail, so that a more complete semantic vector can be obtained in the encoding part. The structure of the entire model is shown in Fig. 5.
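A minimal PyTorch sketch of this bidirectional encoding is given below; PyTorch's bidirectional LSTM already performs the head-to-tail concatenation of Eq. (4) in its per-step output. The embedding dimension here is an assumption, while 4 layers of 256 nodes and batch size 64 follow the experimental settings described later.

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=300, hidden_size=256, num_layers=4,
                  bidirectional=True, batch_first=True)

x = torch.randn(64, 120, 300)    # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = encoder(x)

# For every time step, the forward and backward hidden states are
# concatenated head to tail, i.e. Eq. (4): h_new = concatenate(h_t, h'_t)
print(outputs.shape)             # torch.Size([64, 120, 512])
```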
The Bi-LSTM serves as the main structure of the encoder part in the left half of the model diagram. After the semantic vector C is obtained, it is passed to the decoder in the right half, which gradually generates the word at each time point in the sequence; the training-stage decoder is composed of RNN units. Intuitively, the model inputs a piece of text of length n on the left into the bidirectional recurrent neural network, where $x_0, x_1, \ldots, x_t, \ldots, x_n$ respectively denote the word vectors corresponding to each word of the text. After the semantic vector C is obtained, the decoding process takes it as the input of each time step and outputs one word $y_1, y_2, \ldots, y_m$ at the corresponding time step, so that a piece of text of length m can be decoded, where there is no strict size relation between n and m. Since the task of the embodiment of the present invention is text summarization and the input text is longer than the output, n > m. The above is the training process of the model; during testing, the trained decoder model is mainly used and, by way of optimized generation, the experiments are finally completed on the test set.
After the Seq2Seq model is trained, new abstract sentences need to be generated with the model, and the generation of a sentence is a sequence problem in which the sequence is built up word by word. Beam search keeps, at each step, the sequences formed from the K words of maximum probability, where the size of K is the beam width. Fig. 6 is a schematic diagram of choosing the best text sequence with beam search.
Assume the dictionary size is 6; the detailed steps are as follows:
(1) after the probability distribution [0.1, 0.1, 0.4, 0.2, 0.1, 0.1] of the first word y₁ is generated, choose the two words with the highest probabilities ("I" and one other candidate) as the most probable choices for the first word, as shown in Fig. 6;
(2) in the second step, for precisely these two words, calculate the probabilities of the second word collocating with each of them: take the two candidates as inputs of the decoder, then reselect the two sequences of maximum probability, such as "I" followed by "watch", and so on, finally terminating when the end mark <s> is encountered;
(3) finally two sequences are obtained, "I am watching a movie" and "watching a movie I"; evidently the product of probabilities of the former is the largest, so the former is selected as the final result.
The algorithm flow of beam search is shown in Table 3:
Table 3: Beam search algorithm flow
In the beam search algorithm, the size of the parameter K strongly affects the decoding speed of the test-stage decoder; K generally takes a value between 3 and 7.
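The following Python sketch mirrors the beam-search flow of Table 3. The decoder is represented by a hypothetical `next_word_probs` function (taking the sequence generated so far and returning a word-to-probability map over the dictionary), since the trained model itself is not reproduced here.

```python
import heapq
import math

def beam_search(next_word_probs, start="<s>", end="</s>", K=3, max_len=30):
    beams = [(0.0, [start])]                      # (log-probability, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end:                    # finished beam, keep as is
                candidates.append((logp, seq))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        # keep only the K most probable sequences at each step
        beams = heapq.nlargest(K, candidates, key=lambda c: c[0])
        if all(seq[-1] == end for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```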
The semantic vector C serves as the semantic compression of the input sequence and the input of the decoder. Because of the limitation of its length, it cannot contain enough useful information; especially in the automatic summarization task, where the sequence input to the encoder is often a long piece of text, a single C cannot summarize all of the information, causing the model accuracy to decline. The embodiment of the present invention uses an attention mechanism so that the semantic vector C keeps more semantic information. The attention mechanism is introduced below using machine translation:
In machine translation, each word output by the decoder is influenced differently by the input words; therefore, the attention mechanism solves this problem by supplying a different semantic vector $C_i$ at each time step, where $C_i$ is the sum of products of the encoder hidden-layer vectors $h_j$ and their assigned weights $a_{ij}$. In an LSTM sequence, the hidden-layer vector output at each time step is determined by the current input and the current memory unit; the plain semantic vector is output only at the last unit of the encoder, whereas the attention mechanism takes a weighted sum of the hidden-layer vectors at every time step. Fig. 7 shows the calculation method of the attention mechanism.
When the semantic vector $C_1$ for the word "I" is computed, the semantic information of "I" in the input sequence should have the greatest influence on it, so the weight assigned to it should be the largest. Similarly, $C_2$ is most related to "watch", so the corresponding $a_{22}$ is the largest. How to calculate $a_{ij}$ then becomes an important problem.
In fact, the size of $a_{ij}$ is learned by the model: it is related to the hidden state of the j-th stage of the encoder and the hidden state of the i-th stage of the decoder. For example, when the semantic vector $C_2$ corresponding to the word "watch" is calculated, the similarity between the previously generated word "I" and the three hidden vectors $h_1, h_2, h_3$ of the encoder is calculated first; this also exploits the linguistic rule that adjacent words are semantically close. To better introduce how $a_{ij}$ is calculated, define the hidden state of the encoder as $h_j$ and the hidden-layer state of the decoder as $h_i$. The calculation formula of $C_i$ is:

$$C_i = \sum_{j=1}^{V} a_{ij} h_j \qquad (5)$$

where V denotes the total length of the input sequence and the $h_j$ are known. The calculation formula of $a_{ij}$ is as follows:

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{V} \exp(e_{ik})} \qquad (6)$$

$a_{ij}$ is the probability-normalized output of a softmax, and $e_{ij}$ denotes an alignment model used to measure the degree of alignment (degree of influence) of the word at position j on the encoder side on the word at position i on the decoder side; in other words, when the decoder generates the word at position i, it measures how much that word is influenced by the word at position j on the encoder side. There are many ways to calculate the alignment model $e_{ij}$, and different calculations correspond to different attention models. The simplest and most common alignment model is the dot product: the hidden state $h_t$ output by the decoder side is matrix-multiplied with the hidden state $h_s$ output by the encoder side, so the calculation is completed through the similarity between the last time step of the word about to be predicted and the hidden-state matrix of the input sequence. Common alignment calculations are as follows:

$$e_{ij} = \mathrm{score}(h_t, h_s) = \begin{cases} h_t^{\top} h_s & \text{dot product} \\ h_t^{\top} W_a h_s & \text{weight-network mapping} \\ v_a^{\top} \tanh(W_a [h_t ; h_s]) & \text{concat mapping} \end{cases} \qquad (7)$$

score$(h_t, h_s) = e_{ij}$ expresses the alignment between source and target words; common methods include the dot product above, weight-network mapping, and concat mapping.
When the alignment is calculated, every word output by the decoder side must have its alignment with the input sequence computed, and all hidden layers of the encoder are used to calculate the similarity, so the length of the resulting weight vector $a_{ij}$ equals the length of the input sequence. The embodiment of the present invention uses the more efficient soft attention mechanism, calculating the correspondence between words by dot product to determine the strength of the relevance between words.
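A sketch of this dot-product soft attention, assuming batched PyTorch tensors: the decoder hidden state at step i is scored against every encoder hidden state, softmax-normalized into the weights $a_{ij}$, and used to form the context vector $C_i$ of Eqs. (5)-(7).

```python
import torch
import torch.nn.functional as F

def soft_attention(dec_state, enc_states):
    # dec_state: (batch, hidden); enc_states: (batch, src_len, hidden)
    scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)   # e_ij
    weights = F.softmax(scores, dim=1)                                  # a_ij
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # C_i
    return context, weights
```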
To further verify the beneficial effects of the Chinese abstract generation method provided by the embodiment of the present invention, the embodiment is further described below with reference to experimental data. Likewise, the following embodiments are only used to illustrate the technical solution of the present invention clearly, and are not intended to limit the protection scope of the present invention.
0.1 Dataset and preprocessing
The experiments use LCSTS, the large-scale Chinese short text summarization dataset from Harbin Institute of Technology, drawn from Sina Weibo. The dataset contains over 2,000,000 real Chinese short texts together with the abstract provided by each text's author, and is currently the largest Chinese short text summarization dataset. It also provides a standard method for splitting the data; Table 4 shows the sizes of the three parts of the dataset.
Table 4: Composition of the LCSTS data
The dataset consists of three parts:
a) The first part is the main body of the dataset and contains 2,400,591 (short text, abstract) data pairs; this part is used to train the abstract-generation model.
b) The second part contains 10,666 manually annotated (short text, abstract) data pairs. Each sample is scored from 1 to 5, the score judging the degree of relevance between the short text and the abstract, with 1 denoting the least relevant and 5 the most relevant. This part is randomly sampled from the first part and is used to analyze the distribution of the first-part data. The samples labeled 3, 4 or 5 show much better relevance between original text and abstract; it can also be seen that many abstracts contain words that do not appear in the original text, which shows that the task differs from sentence compression. Samples labeled 1 or 2 are weakly relevant and read more like headlines or comments than abstracts. Statistics show that data scored 1 or 2 amount to less than twenty percent and can be filtered out with supervised learning methods.
c) The third part contains 1,106 (short text, abstract) data pairs: three annotators judged 2,000 abstract pairs in total, and this data is independent of the first and second parts. Data scored 3 or above are selected as the test set of the short text summarization task.
The preprocessing stage of the data is particularly important, because the format and normalization of the data on the encoder side strongly influence the entire experiment. PART 1 of LCSTS above is the training data; after the short-text inputs and the summarizing abstracts of the training data are extracted, some of the information in them needs to be replaced and processed (a sketch of these replacements follows the list):
Special characters: special characters are removed, mainly punctuation marks and common stop-word modal particles and adversative conjunctions, e.g. quotation marks, commas, "$", "…", and particles such as "ah" and "eh";
Content in brackets, e.g. "[happy]": because the data comes from Weibo, many emoticons exist in this form and are removed in preprocessing;
Date label replacement: all dates, e.g. "****年**月**日", "****年**月", and the like, are replaced with TAG_DATE;
Hyperlink URLs: replaced with the label TAG_URL;
Number replacement: TAG_NUMBER;
English label replacement: English words are replaced with the label TAG_NAME_EN.
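A hedged Python sketch of these replacements; the exact regular expressions are assumptions, since the embodiment only names the tag types and gives examples. Order matters: URLs and dates are tagged before bare English words and digits, and the lookarounds keep already-inserted TAG_ tokens from being re-matched as English words.

```python
import re

PATTERNS = [
    (re.compile(r"\[[^\]]*\]"), ""),                        # emoticons such as [happy]
    (re.compile(r"https?://\S+|www\.\S+"), " TAG_URL "),    # hyperlink URLs
    (re.compile(r"\d{4}年(\d{1,2}月)?(\d{1,2}日)?"), " TAG_DATE "),        # dates
    (re.compile(r"(?<![A-Z_])\b[A-Za-z]+\b(?![A-Z_])"), " TAG_NAME_EN "),  # English words
    (re.compile(r"\d+"), " TAG_NUMBER "),                   # numbers
    (re.compile(r"[，。！？、“”$…]"), ""),                    # punctuation / special characters
]

def preprocess(text):
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```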
0.2 Evaluation method
The evaluation methods used by the embodiment of the present invention include Rouge-1, Rouge-2 and Rouge-L, where the L in Rouge-L is the initial of LCS (longest common subsequence).
The calculation formula of Rouge-N is as follows:

$$\mathrm{Rouge\text{-}N} = \frac{\sum_{S \in \{\mathrm{Ref\ Summaries}\}} \sum_{gram_n \in S} \mathrm{Count}_{match}(gram_n)}{\sum_{S \in \{\mathrm{Ref\ Summaries}\}} \sum_{gram_n \in S} \mathrm{Count}(gram_n)} \qquad (8)$$

where $gram_n$ denotes an n-gram, {Ref Summaries} denotes the reference summaries, i.e. the standard abstracts obtained in advance, $\mathrm{Count}_{match}(gram_n)$ denotes the number of n-grams appearing simultaneously in the system summary and a reference summary, and $\mathrm{Count}(gram_n)$ denotes the number of n-grams appearing in the reference summaries.
The calculation formula of Rouge-L is as follows:

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \qquad (9)$$

where LCS(X, Y) is the length of the longest common subsequence of X and Y, m and n respectively denote the lengths (generally the word counts) of the reference summary and the automatic summary, and $R_{lcs}$ and $P_{lcs}$ respectively denote the recall rate and the accuracy rate. The final $F_{lcs}$ is the Rouge-L score.
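As an illustration of Eq. (9), a small Python sketch of Rouge-L over tokenized summaries; the value of β is an assumption, as it is not stated here.

```python
def lcs_len(x, y):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, system, beta=1.2):
    lcs = lcs_len(reference, system)
    r_lcs = lcs / len(reference)     # recall against the reference summary
    p_lcs = lcs / len(system)        # precision of the automatic summary
    if r_lcs == 0 and p_lcs == 0:
        return 0.0
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)
```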
0.3 Experimental design and analysis of results
After the first part of the LCSTS dataset is segmented with the jieba word segmentation package, 50,000 high-frequency words are chosen as the encoder vocabulary; words that do not appear in the vocabulary are represented by "UNK". When setting up the decoder, an important parameter is the size of the decoder dictionary. A comparative experiment was done on this parameter, setting five sizes, 2000, 5000, 8000, 11000 and 14000, and the best dictionary size was chosen through experiment. The encoder uses a 4-layer bidirectional LSTM with 256 nodes per layer and batch_size 64, and a Bucket mechanism is defined: buckets = [(120, 30), ...]; the sentences of the input sequence are assigned to buckets of different fixed lengths according to their lengths, sentences not long enough for their bucket are padded with the PAD character, and the length of headings is limited to 30 words (a sketch of this bucketing follows).
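A small sketch of the bucket-and-pad mechanism; only the (120, 30) bucket is stated in the text, so the other bucket sizes here are assumptions.

```python
# Hypothetical bucket sizes; only (120, 30) is given in the embodiment
BUCKETS = [(30, 10), (60, 15), (90, 25), (120, 30)]

def assign_bucket(src_len, tgt_len):
    # put the pair into the smallest bucket that fits both lengths
    for i, (s, t) in enumerate(BUCKETS):
        if src_len <= s and tgt_len <= t:
            return i
    return len(BUCKETS) - 1          # overlong pairs go to the largest bucket

def pad_to(tokens, length, pad_token="PAD"):
    # sentences shorter than the bucket length are padded with PAD
    return tokens + [pad_token] * (length - len(tokens))
```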
The several methods compared are as follows:
a) Tf-idf: a baseline method for extractive summarization.
b) ABS system: a baseline method for abstractive summarization.
c) Our+att: ordinary word vector input + attention mechanism.
d) Our (S)+att: stroke-encoded word vectors + attention mechanism.
Table 5 and Table 6 contain the experimental results and Rouge scores of the four groups of experiments with the decoder dictionary size set to 8000 (the latter two groups are the experimental results of the embodiment of the present invention).
Table 5: Comparison of the abstracts generated by the different methods
Table 6: Rouge evaluation results
Model        Rouge-1 (%)   Rouge-2 (%)   Rouge-L (%)
Tf-idf       27.30         24.30         26.76
ABS system   24.26         15.22         24.11
Our+att      24.83         15.61         22.19
Our(S)+att   25.08         17.05         22.77
In Table 5, the abstract produced by the method of the embodiment of the present invention is compared with the extractive baseline Tf-idf and the abstractive baseline ABS system on the most representative example extracted from the test set. Comparing against the target sentence, it can be found that all three models summarize this piece of text well. The sentence output by the model of the embodiment of the present invention is better than those of Tf-idf and the ABS system and is comparatively more complete semantically; moreover, it generates "North China", a new word that does not appear in the original text, which enables it to give a high-level summary of the regions enumerated in the original (northeastern Inner Mongolia, central and northern Shanxi, central Hebei, the Beijing-Tianjin area, western and southern Liaoning, central Jilin, Heilongjiang, etc.). Compared with the two control methods, the model of the embodiment of the present invention appears more complete in its description.
In Table 6, comparing the Rouge-1, Rouge-2 and Rouge-L scores of the four methods shows that the abstractive methods score lower on Rouge than the extractive method. This is mainly because the Rouge evaluation criterion is based on word similarity: an abstractive method may well receive a poor Rouge score even when its actual effect is good. Therefore, the experimental section of the embodiment of the present invention uses both sample display and Rouge scoring to characterize model performance intuitively. Table 7 shows, through experimental comparison, the influence of decoder dictionary size on the model of the embodiment of the present invention.
Table 7: Rouge scores for different decoder dictionary sizes
In Table 7, Ours (S) denotes the Seq2Seq model trained with the stroke encoding of the embodiment of the present invention. When the decoder dictionary size is 2k, training with the stroke-based encoding brings a huge improvement in model performance compared with training on ordinary word vectors, whereas as the dictionary size rises the Rouge score rises by only 4 to 5 points on average. This shows that when the dictionary is incomplete, the stroke-based encoding composes, out of the n-gram information over the smallest units of Chinese characters (the strokes), the "rare words" absent from the dictionary; after stroke encoding is added, the model's reliance on the dictionary decreases and its performance improves markedly.
In summary, the embodiment of the present invention realizes a more concise and accurate Chinese abstract generation method through a series of natural language processing techniques. First, for the structural characteristics of Chinese, a stroke-based text vector encoding is proposed: a stroke dictionary is constructed and text vectors are formed by the Skip-Gram model, accomplishing a finer representation of Chinese character component information. Second, text generation is optimized using the Seq2Seq model, including the use of a Bi-LSTM in the encoder, which to a certain extent solves the problems of information loss in long text sequences and of complementing information from back to front; an attention mechanism is used to capture the strength of association between input and output words; and beam search is used in the test-stage decoder to optimize the generation effect. Finally, the model is trained on the LCSTS dataset, and Rouge scoring together with human judgment confirm that the encoding method and model of the embodiment of the present invention clearly improve the readability of text summaries.
The embodiment of the present invention also provides a Chinese abstract generation system, including a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the preceding method.
The embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; the program, when executed by a processor, realizes the steps of the preceding method.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and variations can be made without departing from the technical principles of the present invention, and these improvements and variations should also be regarded as within the protection scope of the present invention.

Claims (10)

1. A Chinese abstract generation method, characterized in that the method comprises the following steps:
obtaining a target text, and splitting the Chinese characters in the target text into stroke sequences;
determining a Chinese word vector sequence of the target text according to the stroke sequences;
inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors;
recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text.
2. The Chinese abstract generation method according to claim 1, characterized in that the method of determining the Chinese word vector sequence of the target text comprises:
performing n-gram segmentation on the stroke sequences to obtain the n-gram information of the Chinese-character strokes;
predicting the context of a center word with a Skip-Gram model according to the n-grams to obtain the corresponding Chinese word vector sequence.
3. The Chinese abstract generation method according to claim 2, characterized in that the method of obtaining the n-gram information of the Chinese-character strokes comprises:
splitting words into characters and finding the stroke sequence corresponding to each character;
converting the stroke sequences to IDs;
taking n-grams over the ID-converted stroke sequences to obtain the n-gram information.
4. The Chinese abstract generation method according to claim 1, characterized in that the encoder uses a bidirectional long short-term memory neural network.
5. The Chinese abstract generation method according to claim 4, characterized in that the method of generating the semantic vectors comprises:
inputting the Chinese word vector sequence forward and in reverse into the bidirectional long short-term memory neural network, obtaining, for each word, the two hidden states corresponding to the two orders;
concatenating the two hidden states head to tail to generate the semantic vector.
6. The Chinese abstract generation method according to claim 1, characterized in that the method of recombining, from the semantic vectors, the full-text semantics best suited to the current time step comprises:
adding an attention mechanism when the encoder generates sentence semantic vectors, to calculate the influence weights of the different input words on the decoder side;
recombining the full-text semantic information best suited to the current time step according to the influence weights of the input words on the decoder side, combined with the hidden state fed back by the decoder.
7. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises optimizing the generated word sequence using a beam search algorithm.
8. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises preprocessing the target text, comprising:
removing special characters, the special characters including punctuation marks, stop-word modal particles and adversative conjunctions;
replacing all dates with TAG_DATE;
replacing hyperlink URLs with the label TAG_URL;
replacing numbers with TAG_NUMBER;
replacing English words with TAG_NAME_EN.
9. A Chinese abstract generation system, characterized by comprising a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, realizes the steps of the method according to any one of claims 1 to 8.
CN201910787889.5A 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium Active CN110532554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787889.5A CN110532554B (en) 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium


Publications (2)

Publication Number Publication Date
CN110532554A 2019-12-03
CN110532554B CN110532554B (en) 2023-05-05

Family

ID=68664157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787889.5A Active CN110532554B (en) 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium

Country Status (1)

Country Link
CN (1) CN110532554B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804495A (en) * 2018-04-02 2018-11-13 华南理工大学 A kind of Method for Automatic Text Summarization semantic based on enhancing
CN109885673A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of Method for Automatic Text Summarization based on pre-training language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHEN Huadong et al., "AM-BRNN: An automatic text summarization extraction model based on deep learning", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061861B (en) * 2019-12-12 2023-09-01 西安艾尔洛曼数字科技有限公司 Text abstract automatic generation method based on XLNet
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111078865A (en) * 2019-12-24 2020-04-28 北京百度网讯科技有限公司 Text title generation method and device
CN111078865B (en) * 2019-12-24 2023-02-21 北京百度网讯科技有限公司 Text title generation method and device
CN111191451B (en) * 2019-12-30 2024-02-02 思必驰科技股份有限公司 Chinese sentence simplification method and device
CN111191451A (en) * 2019-12-30 2020-05-22 苏州思必驰信息科技有限公司 Chinese sentence simplification method and device
CN113254573A (en) * 2020-02-12 2021-08-13 北京嘀嘀无限科技发展有限公司 Text abstract generation method and device, electronic equipment and readable storage medium
CN111666759A (en) * 2020-04-17 2020-09-15 北京百度网讯科技有限公司 Method and device for extracting key information of text, electronic equipment and storage medium
CN111666759B (en) * 2020-04-17 2024-03-26 北京百度网讯科技有限公司 Extraction method and device of text key information, electronic equipment and storage medium
CN111639174B (en) * 2020-05-15 2023-12-22 民生科技有限责任公司 Text abstract generation system, method, device and computer readable storage medium
CN111639174A (en) * 2020-05-15 2020-09-08 民生科技有限责任公司 Text abstract generation system, method and device and computer readable storage medium
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
CN111930940A (en) * 2020-07-30 2020-11-13 腾讯科技(深圳)有限公司 Text emotion classification method and device, electronic equipment and storage medium
CN111930940B (en) * 2020-07-30 2024-04-16 腾讯科技(深圳)有限公司 Text emotion classification method and device, electronic equipment and storage medium
CN112115256A (en) * 2020-09-15 2020-12-22 大连大学 Method and device for generating news text abstract integrated with Chinese stroke information
CN112364225A (en) * 2020-09-30 2021-02-12 昆明理工大学 Judicial public opinion text summarization method combining user comments
CN112560456A (en) * 2020-11-03 2021-03-26 重庆安石泽太科技有限公司 Generation type abstract generation method and system based on improved neural network
CN112560456B (en) * 2020-11-03 2024-04-09 重庆安石泽太科技有限公司 Method and system for generating generated abstract based on improved neural network
CN112765976A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Text similarity calculation method, device and equipment and storage medium
WO2022142121A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Abstract sentence extraction method and apparatus, and server and computer-readable storage medium
CN113609863A (en) * 2021-02-04 2021-11-05 腾讯科技(深圳)有限公司 Method, device and computer equipment for training and using data conversion model
CN113449105A (en) * 2021-06-25 2021-09-28 上海明略人工智能(集团)有限公司 Work summary generation method, system, electronic device and medium
CN114553803A (en) * 2022-01-21 2022-05-27 上海鱼尔网络科技有限公司 Quick reply method, device and system for instant messaging
CN116049385B (en) * 2023-04-03 2023-06-13 北京太极信息系统技术有限公司 Method, device, equipment and platform for generating information and create industry research report
CN116049385A (en) * 2023-04-03 2023-05-02 北京太极信息系统技术有限公司 Method, device, equipment and platform for generating information and create industry research report

Also Published As

Publication number Publication date
CN110532554B (en) 2023-05-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant