CN110532554A - Chinese abstract generation method, system and storage medium - Google Patents

Chinese abstract generation method, system and storage medium

Info

Publication number
CN110532554A
Authority
CN
China
Prior art keywords
chinese
word
vector
sequence
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910787889.5A
Other languages
Chinese (zh)
Other versions
CN110532554B (en)
Inventor
李维勇 (Li Weiyong)
柳斌 (Liu Bin)
张伟 (Zhang Wei)
李建林 (Li Jianlin)
李方方 (Li Fangfang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing College of Information Technology
Original Assignee
Nanjing College of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing College of Information Technology
Priority to CN201910787889.5A (granted as CN110532554B)
Publication of CN110532554A
Application granted
Publication of CN110532554B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese abstract generation method, system and storage medium. The method includes the steps of: obtaining a target text and determining the Chinese word vector sequence of the target text; inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors; recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder; the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text. The invention can improve the generation quality and readability of Chinese text abstracts.

Description

Chinese abstract generation method, system and storage medium
Technical field
The present invention relates to a Chinese abstract generation method, system and storage medium, and belongs to the technical field of text information processing.
Background technique
Automatic summarization is a technology that uses computers to realize automatic text analysis, content condensation and abstract generation. It is currently an auxiliary means of solving the problem of information overload: it can help people understand natural language text and obtain key information more quickly, accurately and comprehensively, and it has important practical significance in both industry and commerce.
Commonly used abstract generation methods generally suffer from low generation quality and poor readability of the generated Chinese text abstracts.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a Chinese abstract generation method, system and storage medium capable of improving the generation quality and readability of Chinese text abstracts.
In order to achieve the above object, the present invention is realized by the following technical solutions:
In a first aspect, the present invention provides a Chinese abstract generation method, the method comprising the following steps:
obtaining a target text, and splitting the Chinese characters in the target text into stroke sequences;
determining a Chinese word vector sequence of the target text according to the stroke sequences;
inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors;
recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text.
With reference to the first aspect, further, the method of determining the Chinese word vector sequence of the target text includes:
performing n-gram segmentation on the stroke sequences to obtain the n-gram information of the Chinese-character strokes;
predicting the context of a center word with a Skip-Gram model according to the n-grams, to obtain the corresponding Chinese word vector sequence.
With reference to the first aspect, further, the method of obtaining the n-gram information of the Chinese-character strokes includes:
splitting words into characters and finding the stroke sequence corresponding to each character;
converting the stroke sequences to IDs;
taking n-grams over the ID-converted stroke sequences to obtain the n-gram information.
With reference to the first aspect, further, the encoder uses a bidirectional long short-term memory neural network.
With reference to the first aspect, further, the method of generating the semantic vectors includes:
inputting the Chinese word vector sequence forward and in reverse into the bidirectional long short-term memory neural network, obtaining, for each word, the two hidden states corresponding to the two orders;
concatenating the two hidden states head to tail to generate the semantic vector.
With reference to the first aspect, further, the method of recombining, from the semantic vectors, the full-text semantics best suited to the current time step includes:
adding an attention mechanism when the encoder generates sentence semantic vectors, to calculate the influence weights of the different input words on the decoder side;
recombining the full-text semantic information best suited to the current time step according to the influence weights of the input words on the decoder side, combined with the hidden state fed back by the decoder.
With reference to the first aspect, further, the method also includes optimizing the generated word sequence using a beam search algorithm.
With reference to the first aspect, further, the method also includes preprocessing the target text, including:
removing special characters, the special characters including punctuation marks, stop-word modal particles and adversative conjunctions;
replacing all dates with TAG_DATE;
replacing hyperlink URLs with the label TAG_URL;
replacing numbers with TAG_NUMBER;
replacing English words with TAG_NAME_EN.
In a second aspect, the present invention provides a Chinese abstract generation system, including a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of any one of the aforementioned methods.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, realizing the steps of any one of the aforementioned methods.
Compared with the prior art, the beneficial effects of the invention are as follows: the Chinese word vector sequence of the target text is input into a pre-trained encoder to generate semantic vectors; the full-text semantics best suited to the current time step are recombined from the semantic vectors, and the recombined intermediate semantics summarizing the full text are sent to a pre-trained decoder; the decoder infers the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, and the word sequence finally generated is the abstract of the target text. The invention can increase the features needed for phrase understanding and capture the parts semantically similar to the original characters, helping to improve the generation quality and readability of Chinese text abstracts.
Description of the drawings
Fig. 1 is a flow chart of a Chinese abstract generation method provided according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the training stage and the test stage of the language model provided according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the method of splitting the word "university" (大学) into stroke n-grams, provided according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of the bidirectional long short-term memory neural network provided according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of the Seq2Seq model provided according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the method of choosing the best text sequence using the beam search algorithm, provided according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the method of computing semantic vectors using the attention mechanism, provided according to an embodiment of the present invention.
Specific embodiment
The invention will be further described below in conjunction with the accompanying drawings. The following embodiments are only used to illustrate the technical solution of the present invention more clearly, and are not intended to limit the protection scope of the present invention.
As shown in Fig. 1, a flow chart of a Chinese abstract generation method provided according to an embodiment of the present invention, the method specifically includes the following steps:
A) Text preprocessing: after the target text is segmented into words, word vectorization is carried out and the corresponding vocabulary is constructed; the word vector sequence formed serves as the input of the next stage.
B) Semantic understanding: exploiting the memory function of recurrent neural networks, the word vector sequence of the first stage is input into the encoder (a bidirectional long short-term memory neural network, abbreviated Bi-LSTM); the encoder generates the semantic vector of each piece of text and transmits it to the next stage.
C) Word computation: the application adds an attention mechanism when the encoder generates sentence semantic vectors; the attention mechanism recombines, according to the hidden state fed back by the decoder, the full-text semantic information best suited to the current time step, and sends the recombined intermediate semantic information to the decoder for the word prediction of the current time step.
D) Abstract generation: at this stage the decoder (an RNN, Recurrent Neural Network) infers the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantic vector summarizing the full text; the word sequence finally generated is the abstract sentence.
In the four processes above, stroke vector encoding is introduced for the characteristics of Chinese, and the language model for automatic summarization is trained on that basis. To make the structure of the model of the embodiment of the present invention clearer, the model in Fig. 2 is divided into two parts, a training stage and a test stage.
The left side of Fig. 2 is the training stage of the model of the embodiment of the present invention; the arrows show the transmission direction of the data and the backpropagation of the parameters, and the underlined part is the stroke-based encoding for Chinese used by the embodiment of the present invention. The remaining parts include the encoder composed of the Bi-LSTM, the decoder formed from RNN units, and the optimization of the attention mechanism. The right side is the test stage of the model, which mainly tests the trained decoder: a piece of test text is input and, after the stroke-based encoding of the Chinese characters, the automatic abstract is generated by the trained decoder. At this stage, in order to maximize the probability of the generated sentence, beam search is added to widen the range of candidate sentences and optimize the fluency of the generated abstract.
The embodiment of the present invention splits each word into a stroke sequence and uses stroke-based n-gram information to capture components such as "知" (to know) inside "智" (wisdom), which are semantically close to the original character and can only be captured at the finer stroke granularity; this brings a clear improvement to word vector representations of Chinese.
For word vector encoding, the embodiment of the present invention uses the Skip-Gram training method of Word2vec, predicting the context through the center word. In Word2vec, each character is initialized using simple one-hot encoding; borrowing the idea of FastText, each word is instead converted into its corresponding stroke sequence. For example, "大学" (university) becomes the concatenation of the stroke sequences of "大" and "学"; n-gram segmentation is performed on such a sequence, and the segmented, initialized strokes serve as the input, so that the deep associations existing between Chinese characters can be captured. The significance of this method is that, by splitting Chinese characters into strokes or radicals, the semantic information carried by the strokes or radicals enriches the features of the text itself, so that the structural similarity between similar characters is captured and the features needed for phrase understanding are increased. From a philological point of view, words with similar meanings tend to have similar structures in Chinese character formation; this is the theoretical basis of the Chinese character encoding method of the embodiment of the present invention.
There are more than 30 subdivided kinds of Chinese-character strokes; here the strokes are grouped into five major classes: horizontal, vertical, left-falling, right-falling, and turning. As shown in Table 1, these five stroke classes are numbered so that their corresponding vectors can conveniently be looked up in the dictionary.
Table 1: Chinese stroke encoding
Stroke name               Shape     ID
横 (horizontal)           一        1
竖 (vertical)             丨, 亅    2
撇 (left-falling)         ノ        3
捺 (right-falling)        乀, 丶    4
折 (turning)              乛, 乙    5
As shown in Fig. 3, the process by which a word is split into strokes and n-gram values are taken is broadly divided into four steps (a sketch of this pipeline follows the list):
a) first, the word is split into characters;
b) the stroke sequence corresponding to each character is found;
c) the stroke sequence obtained in the previous step is converted to IDs;
d) n-grams are taken over the ID-converted stroke sequence. Each stroke n-gram represents a vector whose dimension is consistent with the word-vector dimension of the context words. Throughout the text, the same word or stroke n-gram appearing in different places shares the same vector.
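A minimal Python sketch of the four steps above. The character-to-stroke-ID table STROKES is a hypothetical stand-in: a real system would load a complete stroke dictionary covering all Chinese characters, and the stroke IDs shown for "学" are illustrative only, not authoritative.

```python
# Hypothetical character-to-stroke-ID table (a real system loads a full one)
STROKES = {
    "大": [1, 3, 4],                  # 横(1), 撇(3), 捺(4)
    "学": [4, 4, 3, 4, 5, 5, 2, 1],   # illustrative stroke IDs only
}

def stroke_ngrams(word, n_values=(3, 4, 5)):
    # steps a)-c): split the word into characters and map each character
    # to its ID-converted stroke sequence
    ids = []
    for ch in word:
        ids.extend(STROKES[ch])
    # step d): slide n-gram windows of sizes 3, 4 and 5 over the sequence
    grams = []
    for n in n_values:
        grams.extend(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))
    return grams

print(stroke_ngrams("大学")[:4])   # e.g. [(1, 3, 4), (3, 4, 4), ...]
```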
A word has a relatively high similarity with its context words. The similarity between the current word w and its context word c is expressed using the inner product of their vectors, where $\vec{w}$ and $\vec{c}$ respectively denote the vector representations of w and c. A vector is assigned to each stroke n-gram and to each context word, and the similarity is calculated from the sum of the n-gram vectors of all strokes composing the current word w and the word vector of the context word. The stroke n-gram vectors of all words in the corpus are stored in a dictionary S, where S(w) denotes the stroke n-gram set of the word w. The similarity function is:

$$\mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c} \qquad (1)$$

where q is an element of S(w) and $\vec{q}$ is the embedding vector of q.
A language model is intended to predict the probability of a sentence occurring. The Skip-Gram model adopted by the embodiment of the present invention uses the center word w to predict the context c, that is, to calculate the probability p(c | w); this probability is calculated here using the softmax function:

$$p(c \mid w) = \frac{\exp(\mathrm{sim}(w, c))}{\sum_{c' \in V} \exp(\mathrm{sim}(w, c'))} \qquad (2)$$

where c' ranges over the words of the corpus vocabulary V. It can be seen that computing the denominator costs |V| similarity evaluations, so the computation of the denominator is accelerated using negative sampling. The idea of negative sampling is to reduce the number of negative samples so as to reduce the number of weights the model needs to update. For example, with the input "university" under one-hot encoding, the output layer is expected to output 1 at the neuron nodes of the co-occurring words "whole nation" and "ranking"; if |V| is 10000, the remaining 9999 nodes are expected to output 0, and those 9999 samples are the negative samples. Negative sampling simply draws 5 to 20 of these 9999 negative samples and updates only their weights, leaving the remaining weights unchanged, thereby reducing the amount of computation. The loss function based on negative sampling is:
$$\mathcal{L} = \sum_{w \in D} \sum_{c \in T(w)} \left[ \log \sigma(\mathrm{sim}(w, c)) + \lambda\, \mathbb{E}_{c' \sim P} \log \sigma(-\mathrm{sim}(w, c')) \right] \qquad (3)$$

In the above formula, D is the set of all training words in the corpus, T(w) is the set of context words of the current word within the window, σ is the sigmoid activation function, σ(x) = (1 + exp(−x))⁻¹, λ is the number of negative samples, and $\mathbb{E}_{c' \sim P}$ is the expectation taken so that the negative samples follow the distribution P. To describe the detailed flow of this algorithm more fully, Table 2 gives the process of stroke encoding:
Table 2: The Stroke_Embedding (stroke embedding) algorithm process
An important parameter of the Stroke_Embedding algorithm is n_gram. Considering the typical number of unit strokes in Chinese characters, its values here are 3, 4 and 5, so as to capture most of the component information contained in Chinese characters.
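As an illustration of Eqs. (1) and (3), the following sketch computes the negative-sampling loss of one training pair; the embedding containers `gram_vecs` (stroke n-gram vectors) and the context vectors are hypothetical stand-ins for the trainable parameters of the Stroke_Embedding algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sim(word_grams, ctx_vec, gram_vecs):
    # Eq. (1): sum over the word's stroke n-grams of their inner products
    # with the context word vector
    return sum(gram_vecs[q] @ ctx_vec for q in word_grams)

def pair_loss(word_grams, ctx_vec, neg_ctx_vecs, gram_vecs):
    # one (w, c) term of Eq. (3), with lambda = len(neg_ctx_vecs)
    pos = np.log(sigmoid(sim(word_grams, ctx_vec, gram_vecs)))
    neg = sum(np.log(sigmoid(-sim(word_grams, v, gram_vecs)))
              for v in neg_ctx_vecs)
    return -(pos + neg)   # minimizing this maximizes the objective of Eq. (3)
```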
In long text sequences it is critical to capture the semantics of the text as completely as possible, and the embodiment of the present invention captures bidirectional semantic dependencies through a Bi-LSTM network. Fig. 4 is a schematic diagram of the Bi-LSTM network structure. When used as the encoder or decoder of a Sequence-to-Sequence model (hereinafter abbreviated as Seq2Seq model), although it does not need to complete classification tasks such as sentiment analysis, the Bi-LSTM locates the semantic vector or text abstract more accurately and so brings an improvement in effect. The Bi-LSTM corresponds to the encoding part: it inputs the input sentence sequence forward and in reverse into the LSTM networks, obtains, for each word, the two hidden states $h_t$ and $h'_t$ corresponding to the two orders, and then concatenates the hidden states of the two orders:

$$h_{new} = \mathrm{concatenate}(h_t, h'_t) \qquad (4)$$

that is, head to tail, so that a more complete semantic vector can be obtained in the encoding part. The structure of the entire model is shown in Fig. 5.
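A minimal PyTorch sketch of this bidirectional encoding is given below; PyTorch's bidirectional LSTM already performs the head-to-tail concatenation of Eq. (4) in its per-step output. The embedding dimension here is an assumption, while 4 layers of 256 nodes and batch size 64 follow the experimental settings described later.

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=300, hidden_size=256, num_layers=4,
                  bidirectional=True, batch_first=True)

x = torch.randn(64, 120, 300)    # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = encoder(x)

# For every time step, the forward and backward hidden states are
# concatenated head to tail, i.e. Eq. (4): h_new = concatenate(h_t, h'_t)
print(outputs.shape)             # torch.Size([64, 120, 512])
```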
The Bi-LSTM serves as the main structure of the encoder part in the left half of the model diagram. After the semantic vector C is obtained, it is passed to the decoder in the right half, which gradually generates the word at each time point in the sequence; the training-stage decoder is composed of RNN units. Intuitively, the model inputs a piece of text of length n on the left into the bidirectional recurrent neural network, where $x_0, x_1, \ldots, x_t, \ldots, x_n$ respectively denote the word vectors corresponding to each word of the text. After the semantic vector C is obtained, the decoding process takes it as the input of each time step and outputs one word $y_1, y_2, \ldots, y_m$ at the corresponding time step, so that a piece of text of length m can be decoded, where there is no strict size relation between n and m. Since the task of the embodiment of the present invention is text summarization and the input text is longer than the output, n > m. The above is the training process of the model; during testing, the trained decoder model is mainly used and, by way of optimized generation, the experiments are finally completed on the test set.
After the Seq2Seq model is trained, new abstract sentences need to be generated with the model, and the generation of a sentence is a sequence problem in which the sequence is built up word by word. Beam search keeps, at each step, the sequences formed from the K words of maximum probability, where the size of K is the beam width. Fig. 6 is a schematic diagram of choosing the best text sequence with beam search.
Assume the dictionary size is 6; the detailed steps are as follows:
(1) after the probability distribution [0.1, 0.1, 0.4, 0.2, 0.1, 0.1] of the first word y₁ is generated, choose the two words with the highest probabilities ("I" and one other candidate) as the most probable choices for the first word, as shown in Fig. 6;
(2) in the second step, for precisely these two words, calculate the probabilities of the second word collocating with each of them: take the two candidates as inputs of the decoder, then reselect the two sequences of maximum probability, such as "I" followed by "watch", and so on, finally terminating when the end mark <s> is encountered;
(3) finally two sequences are obtained, "I am watching a movie" and "watching a movie I"; evidently the product of probabilities of the former is the largest, so the former is selected as the final result.
The algorithm flow of beam search is shown in Table 3:
Table 3: Beam search algorithm flow
In the beam search algorithm, the size of the parameter K strongly affects the decoding speed of the test-stage decoder; K generally takes a value between 3 and 7.
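The following Python sketch mirrors the beam-search flow of Table 3. The decoder is represented by a hypothetical `next_word_probs` function (taking the sequence generated so far and returning a word-to-probability map over the dictionary), since the trained model itself is not reproduced here.

```python
import heapq
import math

def beam_search(next_word_probs, start="<s>", end="</s>", K=3, max_len=30):
    beams = [(0.0, [start])]                      # (log-probability, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end:                    # finished beam, keep as is
                candidates.append((logp, seq))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        # keep only the K most probable sequences at each step
        beams = heapq.nlargest(K, candidates, key=lambda c: c[0])
        if all(seq[-1] == end for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```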
The semantic vector C serves as the semantic compression of the input sequence and the input of the decoder. Because of the limitation of its length, it cannot contain enough useful information; especially in the automatic summarization task, where the sequence input to the encoder is often a long piece of text, a single C cannot summarize all of the information, causing the model accuracy to decline. The embodiment of the present invention uses an attention mechanism so that the semantic vector C keeps more semantic information. The attention mechanism is introduced below using machine translation:
In machine translation, each word output by the decoder is influenced differently by the input words; therefore, the attention mechanism solves this problem by supplying a different semantic vector $C_i$ at each time step, where $C_i$ is the sum of products of the encoder hidden-layer vectors $h_j$ and their assigned weights $a_{ij}$. In an LSTM sequence, the hidden-layer vector output at each time step is determined by the current input and the current memory unit; the plain semantic vector is output only at the last unit of the encoder, whereas the attention mechanism takes a weighted sum of the hidden-layer vectors at every time step. Fig. 7 shows the calculation method of the attention mechanism.
When the semantic vector $C_1$ for the word "I" is computed, the semantic information of "I" in the input sequence should have the greatest influence on it, so the weight assigned to it should be the largest. Similarly, $C_2$ is most related to "watch", so the corresponding $a_{22}$ is the largest. How to calculate $a_{ij}$ then becomes an important problem.
In fact, the size of $a_{ij}$ is learned by the model: it is related to the hidden state of the j-th stage of the encoder and the hidden state of the i-th stage of the decoder. For example, when the semantic vector $C_2$ corresponding to the word "watch" is calculated, the similarity between the previously generated word "I" and the three hidden vectors $h_1, h_2, h_3$ of the encoder is calculated first; this also exploits the linguistic rule that adjacent words are semantically close. To better introduce how $a_{ij}$ is calculated, define the hidden state of the encoder as $h_j$ and the hidden-layer state of the decoder as $h_i$. The calculation formula of $C_i$ is:

$$C_i = \sum_{j=1}^{V} a_{ij} h_j \qquad (5)$$

where V denotes the total length of the input sequence and the $h_j$ are known. The calculation formula of $a_{ij}$ is as follows:

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{V} \exp(e_{ik})} \qquad (6)$$

$a_{ij}$ is the probability-normalized output of a softmax, and $e_{ij}$ denotes an alignment model used to measure the degree of alignment (degree of influence) of the word at position j on the encoder side on the word at position i on the decoder side; in other words, when the decoder generates the word at position i, it measures how much that word is influenced by the word at position j on the encoder side. There are many ways to calculate the alignment model $e_{ij}$, and different calculations correspond to different attention models. The simplest and most common alignment model is the dot product: the hidden state $h_t$ output by the decoder side is matrix-multiplied with the hidden state $h_s$ output by the encoder side, so the calculation is completed through the similarity between the last time step of the word about to be predicted and the hidden-state matrix of the input sequence. Common alignment calculations are as follows:

$$e_{ij} = \mathrm{score}(h_t, h_s) = \begin{cases} h_t^{\top} h_s & \text{dot product} \\ h_t^{\top} W_a h_s & \text{weight-network mapping} \\ v_a^{\top} \tanh(W_a [h_t ; h_s]) & \text{concat mapping} \end{cases} \qquad (7)$$

score$(h_t, h_s) = e_{ij}$ expresses the alignment between source and target words; common methods include the dot product above, weight-network mapping, and concat mapping.
When the alignment is calculated, every word output by the decoder side must have its alignment with the input sequence computed, and all hidden layers of the encoder are used to calculate the similarity, so the length of the resulting weight vector $a_{ij}$ equals the length of the input sequence. The embodiment of the present invention uses the more efficient soft attention mechanism, calculating the correspondence between words by dot product to determine the strength of the relevance between words.
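A sketch of this dot-product soft attention, assuming batched PyTorch tensors: the decoder hidden state at step i is scored against every encoder hidden state, softmax-normalized into the weights $a_{ij}$, and used to form the context vector $C_i$ of Eqs. (5)-(7).

```python
import torch
import torch.nn.functional as F

def soft_attention(dec_state, enc_states):
    # dec_state: (batch, hidden); enc_states: (batch, src_len, hidden)
    scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)   # e_ij
    weights = F.softmax(scores, dim=1)                                  # a_ij
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # C_i
    return context, weights
```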
To further verify the beneficial effects of the Chinese abstract generation method provided by the embodiment of the present invention, the embodiment is further described below with reference to experimental data. Likewise, the following embodiments are only used to illustrate the technical solution of the present invention clearly, and are not intended to limit the protection scope of the present invention.
0.1 Dataset and preprocessing
The experiments use LCSTS, the large-scale Chinese short text summarization dataset from Harbin Institute of Technology, drawn from Sina Weibo. The dataset contains over 2,000,000 real Chinese short texts together with the abstract provided by each text's author, and is currently the largest Chinese short text summarization dataset. It also provides a standard method for splitting the data; Table 4 shows the sizes of the three parts of the dataset.
Table 4: Composition of the LCSTS data
The dataset consists of three parts:
a) The first part is the main body of the dataset and contains 2,400,591 (short text, abstract) data pairs; this part is used to train the abstract-generation model.
b) The second part contains 10,666 manually annotated (short text, abstract) data pairs. Each sample is scored from 1 to 5, the score judging the degree of relevance between the short text and the abstract, with 1 denoting the least relevant and 5 the most relevant. This part is randomly sampled from the first part and is used to analyze the distribution of the first-part data. The samples labeled 3, 4 or 5 show much better relevance between original text and abstract; it can also be seen that many abstracts contain words that do not appear in the original text, which shows that the task differs from sentence compression. Samples labeled 1 or 2 are weakly relevant and read more like headlines or comments than abstracts. Statistics show that data scored 1 or 2 amount to less than twenty percent and can be filtered out with supervised learning methods.
c) The third part contains 1,106 (short text, abstract) data pairs: three annotators judged 2,000 abstract pairs in total, and this data is independent of the first and second parts. Data scored 3 or above are selected as the test set of the short text summarization task.
The preprocessing stage of the data is particularly important, because the format and normalization of the data on the encoder side strongly influence the entire experiment. PART 1 of LCSTS above is the training data; after the short-text inputs and the summarizing abstracts of the training data are extracted, some of the information in them needs to be replaced and processed (a sketch of these replacements follows the list):
Special characters: special characters are removed, mainly punctuation marks and common stop-word modal particles and adversative conjunctions, e.g. quotation marks, commas, "$", "…", and particles such as "ah" and "eh";
Content in brackets, e.g. "[happy]": because the data comes from Weibo, many emoticons exist in this form and are removed in preprocessing;
Date label replacement: all dates, e.g. "****年**月**日", "****年**月", and the like, are replaced with TAG_DATE;
Hyperlink URLs: replaced with the label TAG_URL;
Number replacement: TAG_NUMBER;
English label replacement: English words are replaced with the label TAG_NAME_EN.
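A hedged Python sketch of these replacements; the exact regular expressions are assumptions, since the embodiment only names the tag types and gives examples. Order matters: URLs and dates are tagged before bare English words and digits, and the lookarounds keep already-inserted TAG_ tokens from being re-matched as English words.

```python
import re

PATTERNS = [
    (re.compile(r"\[[^\]]*\]"), ""),                        # emoticons such as [happy]
    (re.compile(r"https?://\S+|www\.\S+"), " TAG_URL "),    # hyperlink URLs
    (re.compile(r"\d{4}年(\d{1,2}月)?(\d{1,2}日)?"), " TAG_DATE "),        # dates
    (re.compile(r"(?<![A-Z_])\b[A-Za-z]+\b(?![A-Z_])"), " TAG_NAME_EN "),  # English words
    (re.compile(r"\d+"), " TAG_NUMBER "),                   # numbers
    (re.compile(r"[，。！？、“”$…]"), ""),                    # punctuation / special characters
]

def preprocess(text):
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```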
0.2 Evaluation method
The evaluation methods used by the embodiment of the present invention include Rouge-1, Rouge-2 and Rouge-L, where the L in Rouge-L is the initial of LCS (longest common subsequence).
The calculation formula of Rouge-N is as follows:

$$\mathrm{Rouge\text{-}N} = \frac{\sum_{S \in \{\mathrm{Ref\ Summaries}\}} \sum_{gram_n \in S} \mathrm{Count}_{match}(gram_n)}{\sum_{S \in \{\mathrm{Ref\ Summaries}\}} \sum_{gram_n \in S} \mathrm{Count}(gram_n)} \qquad (8)$$

where $gram_n$ denotes an n-gram, {Ref Summaries} denotes the reference summaries, i.e. the standard abstracts obtained in advance, $\mathrm{Count}_{match}(gram_n)$ denotes the number of n-grams appearing simultaneously in the system summary and a reference summary, and $\mathrm{Count}(gram_n)$ denotes the number of n-grams appearing in the reference summaries.
The calculation formula of Rouge-L is as follows:

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \qquad (9)$$

where LCS(X, Y) is the length of the longest common subsequence of X and Y, m and n respectively denote the lengths (generally the word counts) of the reference summary and the automatic summary, and $R_{lcs}$ and $P_{lcs}$ respectively denote the recall rate and the accuracy rate. The final $F_{lcs}$ is the Rouge-L score.
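As an illustration of Eq. (9), a small Python sketch of Rouge-L over tokenized summaries; the value of β is an assumption, as it is not stated here.

```python
def lcs_len(x, y):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, system, beta=1.2):
    lcs = lcs_len(reference, system)
    r_lcs = lcs / len(reference)     # recall against the reference summary
    p_lcs = lcs / len(system)        # precision of the automatic summary
    if r_lcs == 0 and p_lcs == 0:
        return 0.0
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)
```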
0.3 Experimental design and analysis of results
After the first part of the LCSTS dataset is segmented with the jieba word segmentation package, 50,000 high-frequency words are chosen as the encoder vocabulary; words that do not appear in the vocabulary are represented by "UNK". When setting up the decoder, an important parameter is the size of the decoder dictionary. A comparative experiment was done on this parameter, setting five sizes, 2000, 5000, 8000, 11000 and 14000, and the best dictionary size was chosen through experiment. The encoder uses a 4-layer bidirectional LSTM with 256 nodes per layer and batch_size 64, and a Bucket mechanism is defined: buckets = [(120, 30), ...]; the sentences of the input sequence are assigned to buckets of different fixed lengths according to their lengths, sentences not long enough for their bucket are padded with the PAD character, and the length of headings is limited to 30 words (a sketch of this bucketing follows).
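A small sketch of the bucket-and-pad mechanism; only the (120, 30) bucket is stated in the text, so the other bucket sizes here are assumptions.

```python
# Hypothetical bucket sizes; only (120, 30) is given in the embodiment
BUCKETS = [(30, 10), (60, 15), (90, 25), (120, 30)]

def assign_bucket(src_len, tgt_len):
    # put the pair into the smallest bucket that fits both lengths
    for i, (s, t) in enumerate(BUCKETS):
        if src_len <= s and tgt_len <= t:
            return i
    return len(BUCKETS) - 1          # overlong pairs go to the largest bucket

def pad_to(tokens, length, pad_token="PAD"):
    # sentences shorter than the bucket length are padded with PAD
    return tokens + [pad_token] * (length - len(tokens))
```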
The several methods compared are as follows:
a) Tf-idf: a baseline method for extractive summarization.
b) ABS system: a baseline method for abstractive summarization.
c) Our+att: ordinary word vector input + attention mechanism.
d) Our (S)+att: stroke-encoded word vectors + attention mechanism.
Table 5 and Table 6 contain the experimental results and Rouge scores of the four groups of experiments with the decoder dictionary size set to 8000 (the latter two groups are the experimental results of the embodiment of the present invention).
Table 5: Comparison of the abstracts generated by the different methods
Table 6: Rouge evaluation results
Model        Rouge-1 (%)   Rouge-2 (%)   Rouge-L (%)
Tf-idf       27.30         24.30         26.76
ABS system   24.26         15.22         24.11
Our+att      24.83         15.61         22.19
Our(S)+att   25.08         17.05         22.77
In Table 5, the abstract produced by the method of the embodiment of the present invention is compared with the extractive baseline Tf-idf and the abstractive baseline ABS system on the most representative example extracted from the test set. Comparing against the target sentence, it can be found that all three models summarize this piece of text well. The sentence output by the model of the embodiment of the present invention is better than those of Tf-idf and the ABS system and is comparatively more complete semantically; moreover, it generates "North China", a new word that does not appear in the original text, which enables it to give a high-level summary of the regions enumerated in the original (northeastern Inner Mongolia, central and northern Shanxi, central Hebei, the Beijing-Tianjin area, western and southern Liaoning, central Jilin, Heilongjiang, etc.). Compared with the two control methods, the model of the embodiment of the present invention appears more complete in its description.
In Table 6, comparing the Rouge-1, Rouge-2 and Rouge-L scores of the four methods shows that the abstractive methods score lower on Rouge than the extractive method. This is mainly because the Rouge evaluation criterion is based on word similarity: an abstractive method may well receive a poor Rouge score even when its actual effect is good. Therefore, the experimental section of the embodiment of the present invention uses both sample display and Rouge scoring to characterize model performance intuitively. Table 7 shows, through experimental comparison, the influence of decoder dictionary size on the model of the embodiment of the present invention.
Table 7: Rouge scores for different decoder dictionary sizes
In Table 7, Ours (S) denotes the Seq2Seq model trained with the stroke encoding of the embodiment of the present invention. When the decoder dictionary size is 2k, training with the stroke-based encoding brings a huge improvement in model performance compared with training on ordinary word vectors, whereas as the dictionary size rises the Rouge score rises by only 4 to 5 points on average. This shows that when the dictionary is incomplete, the stroke-based encoding composes, out of the n-gram information over the smallest units of Chinese characters (the strokes), the "rare words" absent from the dictionary; after stroke encoding is added, the model's reliance on the dictionary decreases and its performance improves markedly.
In summary, the embodiment of the present invention realizes a more concise and accurate Chinese abstract generation method through a series of natural language processing techniques. First, for the structural characteristics of Chinese, a stroke-based text vector encoding is proposed: a stroke dictionary is constructed and text vectors are formed by the Skip-Gram model, accomplishing a finer representation of Chinese character component information. Second, text generation is optimized using the Seq2Seq model, including the use of a Bi-LSTM in the encoder, which to a certain extent solves the problems of information loss in long text sequences and of complementing information from back to front; an attention mechanism is used to capture the strength of association between input and output words; and beam search is used in the test-stage decoder to optimize the generation effect. Finally, the model is trained on the LCSTS dataset, and Rouge scoring together with human judgment confirm that the encoding method and model of the embodiment of the present invention clearly improve the readability of text summaries.
The embodiment of the present invention also provides a Chinese abstract generation system, including a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the preceding method.
The embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; the program, when executed by a processor, realizes the steps of the preceding method.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, several improvements and variations can be made without departing from the technical principles of the present invention, and these improvements and variations should also be regarded as within the protection scope of the present invention.

Claims (10)

1. A Chinese abstract generation method, characterized in that the method comprises the following steps:
obtaining a target text, and splitting the Chinese characters in the target text into stroke sequences;
determining a Chinese word vector sequence of the target text according to the stroke sequences;
inputting the Chinese word vector sequence into a pre-trained encoder to generate semantic vectors;
recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
the decoder inferring the distribution of the next-time-step word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text.
2. The Chinese abstract generation method according to claim 1, characterized in that the method of determining the Chinese word vector sequence of the target text comprises:
performing n-gram segmentation on the stroke sequences to obtain the n-gram information of the Chinese-character strokes;
predicting the context of a center word with a Skip-Gram model according to the n-grams to obtain the corresponding Chinese word vector sequence.
3. The Chinese abstract generation method according to claim 2, characterized in that the method of obtaining the n-gram information of the Chinese-character strokes comprises:
splitting words into characters and finding the stroke sequence corresponding to each character;
converting the stroke sequences to IDs;
taking n-grams over the ID-converted stroke sequences to obtain the n-gram information.
4. The Chinese abstract generation method according to claim 1, characterized in that the encoder uses a bidirectional long short-term memory neural network.
5. The Chinese abstract generation method according to claim 4, characterized in that the method of generating the semantic vectors comprises:
inputting the Chinese word vector sequence forward and in reverse into the bidirectional long short-term memory neural network, obtaining, for each word, the two hidden states corresponding to the two orders;
concatenating the two hidden states head to tail to generate the semantic vector.
6. The Chinese abstract generation method according to claim 1, characterized in that the method of recombining, from the semantic vectors, the full-text semantics best suited to the current time step comprises:
adding an attention mechanism when the encoder generates sentence semantic vectors, to calculate the influence weights of the different input words on the decoder side;
recombining the full-text semantic information best suited to the current time step according to the influence weights of the input words on the decoder side, combined with the hidden state fed back by the decoder.
7. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises optimizing the generated word sequence using a beam search algorithm.
8. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises preprocessing the target text, comprising:
removing special characters, the special characters including punctuation marks, stop-word modal particles and adversative conjunctions;
replacing all dates with TAG_DATE;
replacing hyperlink URLs with the label TAG_URL;
replacing numbers with TAG_NUMBER;
replacing English words with TAG_NAME_EN.
9. A Chinese abstract generation system, characterized by comprising a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, realizes the steps of the method according to any one of claims 1 to 8.
CN201910787889.5A 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium Active CN110532554B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787889.5A CN110532554B (en) 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium


Publications (2)

Publication Number Publication Date
CN110532554A 2019-12-03
CN110532554B CN110532554B (en) 2023-05-05

Family

ID=68664157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787889.5A Active CN110532554B (en) 2019-08-26 2019-08-26 Chinese abstract generation method, system and storage medium

Country Status (1)

Country Link
CN (1) CN110532554B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804495A (en) * 2018-04-02 2018-11-13 华南理工大学 A kind of Method for Automatic Text Summarization semantic based on enhancing
CN109885673A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of Method for Automatic Text Summarization based on pre-training language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHEN Huadong et al., "AM-BRNN: An automatic text summarization extraction model based on deep learning", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061861B (en) * 2019-12-12 2023-09-01 西安艾尔洛曼数字科技有限公司 Text abstract automatic generation method based on XLNet
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111078865A (en) * 2019-12-24 2020-04-28 北京百度网讯科技有限公司 Text title generation method and device
CN111078865B (en) * 2019-12-24 2023-02-21 北京百度网讯科技有限公司 Text title generation method and device
CN111191451B (en) * 2019-12-30 2024-02-02 思必驰科技股份有限公司 Chinese sentence simplification method and device
CN111191451A (en) * 2019-12-30 2020-05-22 苏州思必驰信息科技有限公司 Chinese sentence simplification method and device
CN113254573A (en) * 2020-02-12 2021-08-13 北京嘀嘀无限科技发展有限公司 Text abstract generation method and device, electronic equipment and readable storage medium
CN111666759A (en) * 2020-04-17 2020-09-15 北京百度网讯科技有限公司 Method and device for extracting key information of text, electronic equipment and storage medium
CN111666759B (en) * 2020-04-17 2024-03-26 北京百度网讯科技有限公司 Extraction method and device of text key information, electronic equipment and storage medium
CN111639174B (en) * 2020-05-15 2023-12-22 民生科技有限责任公司 Text abstract generation system, method, device and computer readable storage medium
CN111639174A (en) * 2020-05-15 2020-09-08 民生科技有限责任公司 Text abstract generation system, method and device and computer readable storage medium
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
CN111930940A (en) * 2020-07-30 2020-11-13 腾讯科技(深圳)有限公司 Text emotion classification method and device, electronic equipment and storage medium
CN111930940B (en) * 2020-07-30 2024-04-16 腾讯科技(深圳)有限公司 Text emotion classification method and device, electronic equipment and storage medium
CN112115256A (en) * 2020-09-15 2020-12-22 大连大学 Method and device for generating news text abstract integrated with Chinese stroke information
CN112364225A (en) * 2020-09-30 2021-02-12 昆明理工大学 Judicial public opinion text summarization method combining user comments
CN112560456A (en) * 2020-11-03 2021-03-26 重庆安石泽太科技有限公司 Generation type abstract generation method and system based on improved neural network
CN112560456B (en) * 2020-11-03 2024-04-09 重庆安石泽太科技有限公司 Method and system for generating generated abstract based on improved neural network
CN112765976A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Text similarity calculation method, device and equipment and storage medium
WO2022142121A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Abstract sentence extraction method and apparatus, and server and computer-readable storage medium
CN113609863A (en) * 2021-02-04 2021-11-05 腾讯科技(深圳)有限公司 Method, device and computer equipment for training and using data conversion model
CN113449105A (en) * 2021-06-25 2021-09-28 上海明略人工智能(集团)有限公司 Work summary generation method, system, electronic device and medium
CN114553803A (en) * 2022-01-21 2022-05-27 上海鱼尔网络科技有限公司 Quick reply method, device and system for instant messaging
CN116049385B (en) * 2023-04-03 2023-06-13 北京太极信息系统技术有限公司 Method, device, equipment and platform for generating information and create industry research report
CN116049385A (en) * 2023-04-03 2023-05-02 北京太极信息系统技术有限公司 Method, device, equipment and platform for generating information and create industry research report

Also Published As

Publication number Publication date
CN110532554B (en) 2023-05-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant