CN108628828A - Self-attention-based method for jointly extracting opinions and their holders - Google Patents

Self-attention-based method for jointly extracting opinions and their holders Download PDF

Info

Publication number
CN108628828A
CN108628828A CN201810347840.3A
Authority
CN
China
Prior art keywords
viewpoint
holder
sentence
word
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810347840.3A
Other languages
Chinese (zh)
Other versions
CN108628828B (en)
Inventor
李雄
刘春阳
张传新
张旭
王萌
闫昊
唐彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
National Computer Network and Information Security Management Center
Original Assignee
Beihang University
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University, National Computer Network and Information Security Management Center filed Critical Beihang University
Priority to CN201810347840.3A priority Critical patent/CN108628828B/en
Publication of CN108628828A publication Critical patent/CN108628828A/en
Application granted granted Critical
Publication of CN108628828B publication Critical patent/CN108628828B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The present invention provides a self-attention-based method for jointly extracting opinions and their holders: S1. build a corpus for extracting opinions and their holders; S2. identify sentences that contain opinions; S3. jointly extract each opinion and its holder. Advantages of the present invention: 1. the text classification model avoids extracting sentences that contain no opinion; 2. the joint extraction model for opinions and their holders dispenses with natural language processing stages such as part-of-speech tagging, named entity recognition and syntactic dependency parsing, so errors arising in those stages cannot degrade the extraction results, and the model offers high flexibility and broad coverage; 3. the present invention comprises building a corpus for extracting opinions and their holders, identifying sentences that contain opinions, and jointly extracting opinions and their holders; 4. on top of a bidirectional LSTM, the present invention uses self-attention to combine the advantages of both, so that the representation of the word sequence carries richer semantics and the trained model achieves higher accuracy.

Description

Self-attention-based method for jointly extracting opinions and their holders
Technical field
The present invention relates to natural language processing methods, and in particular to a self-attention-based method for jointly extracting opinions and their holders, which can automatically extract the opinions in Chinese news text together with the holders of those opinions. It belongs to the field of computer science and technology.
Background technology
With the development of Internet technology, the amount of text on the Internet has grown explosively; electronic media have developed rapidly, traditional print media have joined the electronic camp, and the volume of news text has exploded with them. Extracting opinions from text has therefore drawn growing attention from researchers and has become one of the most active research areas in natural language processing. Paradoxically, the explosive growth of online news now hinders access to information. When the volume of news was small, one could quickly read the news, record the opinions expressed, and form a fairly comprehensive picture of an event. Today the volume of news is enormous: reading only part of it yields limited and possibly one-sided information, while reading all of it and tallying the opinion of every expert or organization is infeasible in practice because the data volume is far too large. At present, major news portals and microblogging platforms do provide news summaries that let users grasp the gist of a story quickly and conveniently, but only a minority of trending stories receive such summaries, because they are still written manually by editorial staff. On e-commerce platforms such as Taobao, by contrast, opinion mining and sentiment analysis of product reviews have moved from research into commercial use, saving human labor while helping users obtain review information quickly. The automatic extraction of opinions and their holders from news text, however, is still at the research stage; even so, it is widely applicable and studied in many fields, such as information retrieval, data mining, text mining and Web mining, with applications ranging from computer science to management and sociology. News opinion extraction is thus steadily becoming a research hotspot.
Current work on opinion mining concentrates mainly on product reviews, a task that is in effect fine-grained, multi-aspect sentiment analysis. By granularity, sentiment analysis divides into document-level, sentence-level and phrase-level; by classification scheme, into binary, multi-class and aspect-based. The main task in product-review opinion extraction is to extract the evaluator, the evaluated object and the evaluating words, chiefly by two families of methods, supervised learning and unsupervised learning:
1. supervised learning method
The mainstream of supervised learning is sequence labeling. The best-performing methods at present are the hidden Markov model (Hidden Markov Model, HMM) and the conditional random field (Conditional Random Field, CRF), including variants such as lexicalized HMMs, Skip-CRF and Tree-CRF. Besides these mainstream approaches, syntactic dependency relations can be used to filter candidate evaluation pairs, after which a classifier decides whether a pair is a valid evaluated-object and evaluating-word pair.
2. unsupervised learning method
Unsupervised learning relies mainly on topic models; the two mainstream models are probabilistic latent semantic analysis (Probabilistic Latent Semantic Analysis, PLSA) and latent Dirichlet allocation (Latent Dirichlet Allocation, LDA). Neither was originally designed for opinion extraction, but both can be extended to model additional kinds of information. Methods with good results at present include Sentiment-LDA and MaxEnt-LDA. Some researchers have combined HMM with LDA and proposed the HMM-LDA model, which can discover latent evaluated objects.
Research on news opinion extraction is still relatively sparse. One existing method extracts opinion sentences from bilingual news through sentence-element association: its idea is that a cluster of sentences containing fixed sentence elements and sentiment constitutes opinion sentences. It first applies named entity recognition to label news sentences and obtain a set of sentence elements, then extracts sentiment words with a sentiment lexicon, computes sentence weights from the degree of association between the sentence elements and sentiment words of different news texts, and finally obtains sentence clusters that contain opinion sentences.
Our goal is to extract the opinions in news text together with their holders, a task that is similar to, but not identical with, the tasks above. Extracting opinions and their holders from news text has not yet become a hotspot of natural language processing, and published work is relatively scarce. Templates for opinion sentences can be obtained through named entity recognition and syntactic dependency parsing, but template matching has low coverage and little flexibility: it only captures fixed expressions and cannot adapt to flexible language variation. We therefore propose a self-attention-based method for jointly extracting opinions and their holders, solving this problem and filling a gap in the field.
Different opinion-extraction methods have different limitations. Like supervised learning for other tasks, supervised opinion extraction suffers from labeled data being hard to obtain, from the large number of classes, and from large gaps between the training corpora of different classes. Moreover, as Internet slang spreads, language itself keeps changing: earlier annotations may soon become obsolete, and annotating new data or correcting old data both demand considerable effort.
Unsupervised methods model evaluated objects and evaluating words mainly with topic models, but topic models require complex tuning of many parameters to obtain good results, so training usually converges slowly. Topic models also easily find the evaluations that occur widely across a document collection, but struggle to discover evaluations that occur rarely. In news text, universally shared evaluations, especially by organizations and experts, are in fact rare: experts and organizations tend to voice their own individual views, and such evaluations are easily drowned in the news corpus.
The existing bilingual news opinion-sentence extraction method exploits the relatedness of bilingual news, yet still uses only a basic sentiment lexicon for extracting sentiment words. What this method finally extracts is a small sentiment-bearing passage composed of several sentences, which may contain no evaluation at all, so its accuracy cannot meet the requirements.
Summary of the invention
The purpose of the present invention is to provide a self-attention-based method for jointly extracting opinions and their holders that overcomes the above defects of review opinion extraction and news opinion-sentence extraction. The text classification model of the method effectively avoids extracting sentences that contain no opinion; the joint extraction model for opinions and their holders dispenses with natural language processing stages such as part-of-speech tagging, named entity recognition and syntactic dependency parsing, so errors arising in those stages cannot degrade the extraction results; and because the model involves no manually defined templates, it gains flexibility and coverage.
The self-attention-based method for jointly extracting opinions and their holders of the present invention specifically comprises the following steps:
S1. Build a corpus for extracting opinions and their holders
The corpus comprises two parts: negative samples that contain no opinion, and positive samples that contain an opinion and its holder. Each positive sample carries an annotation of the opinion and its holder and can be expressed as an <original text, opinion holder and opinion> two-tuple, where the opinion-holder-and-opinion part has the format [opinion holder]:[opinion]. The present invention obtains such a corpus through manual annotation.
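As an illustration of the annotation format described above, the following sketch builds one positive and one negative sample and parses the [opinion holder]:[opinion] annotation. The sample sentences are invented for illustration and do not come from the patent's corpus:

```python
# Illustrative sketch of the corpus format: a positive sample is an
# <original text, annotation> two-tuple whose annotation follows the
# "[opinion holder]:[opinion]" convention; a negative sample has none.
# (The example sentences are invented, not drawn from the real corpus.)

def parse_annotation(annotation: str):
    """Split a "[holder]:[opinion]" string into its two parts."""
    holder, opinion = annotation.split(":", 1)
    return holder.strip("[]"), opinion.strip("[]")

positive_sample = (
    "Expert Zhang said the market will keep growing.",  # original text
    "[Expert Zhang]:[the market will keep growing]",    # annotation
)
negative_sample = ("The meeting took place on Tuesday.", None)

holder, opinion = parse_annotation(positive_sample[1])
```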
S2. Identify sentences that contain opinions
Identifying sentences that contain opinions is a binary text classification problem: sentences containing an opinion form the positive class, and sentences containing no opinion form the negative class. The present invention adopts a CNN-based text classification model, whose structure is shown in Fig. 2. The specific implementation steps are:
S21: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model;
S22: Segment sentence s into words and, using the word vectors, express s as a matrix C = <w1, w2, …, wn>, where w1 is the d-dimensional word vector of the first word in s;
S23: Process matrix C with k convolution kernels, each of size x*d, where x is an integer greater than 0 and less than 5; each convolution operation yields an n-dimensional vector;
S24: Apply max pooling to the k n-dimensional vectors obtained in step S23, each vector contributing its maximum value, which finally yields a k-dimensional vector;
S25: Feed the k-dimensional vector obtained in step S24 into a fully connected network for classification;
S26: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
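Steps S22-S25 can be sketched as follows. This is a minimal, dependency-free toy of the convolution-and-max-pooling forward pass with random weights, not the trained model of the invention; note that without padding each feature map has n - x + 1 entries rather than the n stated above:

```python
# Toy forward pass of the CNN classifier front end (steps S22-S24):
# word-vector matrix -> k convolutions of window x over dimension d ->
# max pooling -> a k-dimensional feature vector for the classifier (S25).
import random

random.seed(0)
d, n, k, x = 4, 6, 3, 2  # embedding dim d, sentence length n, k kernels, window x

# S22: the segmented sentence as an n x d matrix of (toy, random) word vectors
C = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
# k convolution kernels, each of size x*d, with random toy weights
kernels = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(x)]
           for _ in range(k)]

def convolve(C, kernel):
    """Slide an x*d kernel down the sentence matrix (no padding, so the
    feature map has n - x + 1 entries)."""
    win = len(kernel)
    return [sum(C[i + a][b] * kernel[a][b]
                for a in range(win) for b in range(len(kernel[0])))
            for i in range(len(C) - win + 1)]

# S23-S24: one feature map per kernel, then max pooling down to a k-dim vector
feature_maps = [convolve(C, ker) for ker in kernels]
pooled = [max(fm) for fm in feature_maps]  # the k-dim input to the classifier
```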
S3. Jointly extract opinions and their holders
Extracting opinions and their holders means extracting each opinion and its holder from sentences that contain opinions. A single sentence may contain several names and several opinions, so the key problem this task must solve is how to extract and match names and opinions accurately. The present invention uses a bidirectional LSTM to capture forward and backward information in the text, uses self-attention to establish the relationship between each word and its context words, and uses a Pointer Network to extract from the text the words that constitute the <opinion holder, opinion> two-tuple. As shown in Fig. 3, the joint extraction model for opinions and their holders comprises four parts: a word embedding layer, a bidirectional LSTM layer, a self-attention layer and a pointer network layer. The specific implementation steps of the joint extraction are:
S31: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model;
S32: Feed the vectorized sentence <w1, w2, …, wn> into the bidirectional LSTM to obtain context-fused word vectors <h1, h2, …, hn>;
S33: From the context-fused word vectors obtained in step S32, compute for each word w_i the weight α_ij between w_i and every other word w_j, form the weighted vector a'_i, and concatenate a'_i with h_i into a_i as the output of the self-attention layer. The relevant formulas are:
e_ij = W_e · tanh(W_s · h_j + W_a · a'_(i-1))
α_ij = softmax(e_ij)
a'_i = Σ_j α_ij · h_j
a_i = [a'_i ; h_i]
where a'_i denotes the result of the self-attention weighted summation for word w_i, and α_ij denotes the weight between word w_i and another word w_j. α_ij is obtained from e_ij through the softmax function; in the computation of e_ij, W_e, W_s and W_a are parameters to be learned; the last formula denotes vector concatenation.
S34: Feed the output <a1, a2, …, an> of step S33 into the encoder of the Pointer Network and denote the encoder output by <h1, h2, …, hn>; the decoder outputs the input subsequence with the highest probability, which is exactly the jointly extracted opinion and its holder. Following the constructed training corpus, the first word of the output sequence is the opinion holder and the remaining words are the opinion.
S35: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
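The self-attention computation of step S33 can be sketched as follows, assuming the BiLSTM outputs h_1…h_n are already given (random toy vectors here) and collapsing the learned matrices W_e, W_s, W_a to scalars for brevity; a real model learns full matrices:

```python
# Toy self-attention layer (step S33): for each position i, compute scores
# e_ij from h_j and the previous attended vector a'_(i-1), softmax them into
# weights alpha_ij, form the weighted sum a'_i, and concatenate a_i = [a'_i; h_i].
import math
import random

random.seed(1)
n, dim = 4, 3
H = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]  # BiLSTM outputs
w_e, w_s, w_a = 0.5, 1.0, 0.3  # toy scalar stand-ins for W_e, W_s, W_a

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    z = sum(exps)
    return [v / z for v in exps]

A = []                # concatenated outputs a_i = [a'_i; h_i]
a_prev = [0.0] * dim  # a'_0, the initial attended vector
for i in range(n):
    # e_ij = w_e * tanh(w_s * h_j + w_a * a'_(i-1)), reduced to a scalar score
    e = [w_e * math.tanh(sum(w_s * hj + w_a * ap for hj, ap in zip(H[j], a_prev)))
         for j in range(n)]
    alpha = softmax(e)                                  # alpha_ij
    a_prev = [sum(alpha[j] * H[j][t] for j in range(n))  # a'_i, weighted sum
              for t in range(dim)]
    A.append(a_prev + H[i])                             # a_i, a 2*dim vector
```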
The self-attention-based method for jointly extracting opinions and their holders of the present invention has the following advantages and effects:
1. The text classification model of the method effectively avoids extracting sentences that contain no opinion;
2. The joint extraction model for opinions and their holders dispenses with natural language processing stages such as part-of-speech tagging, named entity recognition and syntactic dependency parsing, so errors arising in those stages cannot degrade the extraction results; and because the model involves no manually defined templates, it gains flexibility and coverage;
3. Previous opinion-mining work targets mainly product reviews, so its main goals are to extract the evaluated object and the sentiment toward it; research on extracting opinions and their holders from news text is relatively scarce. Although templates for extracting opinions can be built by combining named entity recognition with syntactic dependency parsing, such templates have low coverage and poor flexibility and can hardly meet the demand. Against these limitations, the present invention proposes a new method for extracting opinions and their holders, comprising building a corpus for extracting opinions and their holders, identifying sentences that contain opinions, and jointly extracting opinions and their holders.
4. The present invention proposes a context-fusing word-sequence representation based on self-attention and a bidirectional LSTM. A bidirectional LSTM alone can fuse context information but cannot model the relations between all pairs of words; self-attention alone loses the sequential features between words. On top of the bidirectional LSTM, the present invention uses self-attention to combine the advantages of both, so that the representation of the word sequence carries richer semantics and the trained model achieves higher accuracy.
Description of the drawings
Fig. 1 is the main flow chart of the method of the present invention.
Fig. 2 is the model of the method of the present invention for identifying sentences that contain opinions.
Fig. 3 is the joint extraction model of the method of the present invention for opinions and their holders.
Detailed description of the embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings.
The method of the present invention has the characteristics that:
First, opinions of organizations or experts usually appear only in scattered statements in news text. We design a judgment method for organization and expert opinion sentences in news text that can quickly decide whether a paragraph contains an opinion sentence.
Second, to identify and extract evaluation holders and evaluation content in news text effectively, we build an end-to-end neural network model that performs joint extraction of the evaluation content and its holder based on self-attention and a Pointer Network.
In this way we realize a self-attention-based joint extraction method for opinions and their holders.
The task of the present invention comprises three parts: building a corpus for extracting opinions and their holders; training a text classification model to identify sentences that contain opinions; and training a network model that can jointly extract each opinion and its holder from sentences containing opinions. With these tasks completed, the flow for extracting opinions and their holders from a document is as follows: first split the document into sentences to obtain a sentence set; then let the text classification model judge, for every sentence in the set, whether it contains an opinion; if it does, extract the opinion and its holder with the joint extraction model. The main flow of the method is shown in Fig. 1; the specific steps are as follows:
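The document-level flow just described can be sketched as a small pipeline. In this sketch, `classify` and `extract` are trivial placeholders standing in for the trained CNN classifier and the joint extraction model; the example document is invented:

```python
# Hedged sketch of the overall pipeline: split a document into sentences,
# keep those the classifier marks as containing an opinion, then run the
# joint extractor on each to obtain (holder, opinion) pairs.
import re

def split_sentences(doc: str):
    return [s for s in re.split(r"[。！？.!?]", doc) if s.strip()]

def classify(sentence: str) -> bool:   # placeholder for the CNN classifier
    return "said" in sentence

def extract(sentence: str):            # placeholder for the joint extractor
    holder, _, opinion = sentence.partition(" said ")
    return holder.strip(), opinion.strip()

def process(doc: str):
    return [extract(s) for s in split_sentences(doc) if classify(s)]

pairs = process("The sky was clear. Expert Li said growth will continue.")
```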
S1. Build a corpus for extracting opinions and their holders.
The constructed corpus comprises two parts: negative samples that contain no opinion, and positive samples that contain an opinion and its holder. Each positive sample carries an annotation of the opinion and its holder and can be expressed as an <original text, opinion holder and opinion> two-tuple, where the opinion-holder-and-opinion part has the format [opinion holder]:[opinion]. The present invention obtains such a corpus through manual annotation.
S2. Identify sentences that contain opinions.
Identifying sentences that contain opinions is a binary text classification problem: sentences containing an opinion form the positive class, and sentences containing no opinion form the negative class. Deep learning currently achieves good results on text classification; the present invention adopts a CNN-based text classification model, which can take pre-trained word vectors as input, improving the model's portability, and can capture combined features of local words by controlling the size of the convolution window, improving classification accuracy. The structure of this model is shown in Fig. 2; its specific implementation steps are:
S21: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model.
S22: Segment sentence s into words and, using the word vectors, express s as a matrix C = <w1, w2, …, wn>, where w1 is the d-dimensional word vector of the first word in s.
S23: Process matrix C with k convolution kernels, each of size x*d, where x is an integer greater than 0 and less than 5; each convolution operation yields an n-dimensional vector.
S24: Apply max pooling to the k n-dimensional vectors obtained in step S23, each vector contributing its maximum value, which finally yields a k-dimensional vector.
S25: Feed the k-dimensional vector obtained in step S24 into a fully connected network for classification.
S26: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
S3. Jointly extract opinions and their holders.
Extracting opinions and their holders means extracting each opinion and its holder from sentences that contain opinions. A single sentence may contain several names and several opinions, so the key problem this task must solve is how to extract and match names and opinions accurately. The present invention uses a bidirectional LSTM to capture forward and backward information in the text, uses self-attention to establish the relationship between each word and its context words, and uses a Pointer Network to extract from the text the words that constitute the <opinion holder, opinion> two-tuple. As shown in Fig. 3, the joint extraction model for opinions and their holders comprises four parts: a word embedding layer, a bidirectional LSTM layer, a self-attention layer and a pointer network layer. The specific implementation steps of the joint extraction model for opinions and their holders of the present invention are:
S31: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model.
S32: Feed the vectorized sentence <w1, w2, …, wn> into the bidirectional LSTM to obtain context-fused word vectors <h1, h2, …, hn>;
S33: From the context-fused word vectors obtained in step S32, compute for each word w_i the weight α_ij between w_i and every other word w_j, form the weighted vector a'_i, and concatenate a'_i with h_i into a_i as the output of the self-attention layer. The relevant formulas are:
e_ij = W_e · tanh(W_s · h_j + W_a · a'_(i-1))
α_ij = softmax(e_ij)
a'_i = Σ_j α_ij · h_j
a_i = [a'_i ; h_i]
where a'_i denotes the result of the self-attention weighted summation for word w_i, and α_ij denotes the weight between word w_i and another word w_j. α_ij is obtained from e_ij through the softmax function; in the computation of e_ij, W_e, W_s and W_a are parameters to be learned; the last formula denotes vector concatenation.
S34: Feed the output <a1, a2, …, an> of step S33 into the encoder of the Pointer Network and denote the encoder output by <h1, h2, …, hn>; the decoder outputs the input subsequence with the highest probability, which is exactly the jointly extracted opinion and its holder. Following the constructed training corpus, the first word of the output sequence is the opinion holder and the remaining words are the opinion.
S35: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
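The decoding of step S34 can be illustrated with a toy pointer-style selection loop. A real Pointer Network uses a learned LSTM decoder with attention-based pointing and typically masks already-selected positions; this sketch omits both and scores positions with a plain dot product against a toy decoder state:

```python
# Toy pointer-style decoding (step S34): at each step, score every encoder
# position, softmax the scores into a distribution, and "point" at the
# highest-probability input position; the chosen indices form the output
# subsequence (holder first, then the opinion words, per the training corpus).
import math
import random

random.seed(2)
n, dim = 5, 3
enc = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]  # encoder outputs

def point(state, enc):
    """Return the index of the input position with the highest probability."""
    scores = [sum(s * h for s, h in zip(state, hj)) for hj in enc]
    m = max(scores)
    probs = [math.exp(v - m) for v in scores]
    z = sum(probs)
    probs = [p / z for p in probs]
    return max(range(len(enc)), key=lambda j: probs[j])

state = [0.1, -0.2, 0.4]  # toy initial decoder state
picked = []
for _ in range(3):        # emit three pointers
    j = point(state, enc)
    picked.append(j)
    state = enc[j]        # feed the chosen vector back as the next state
```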
The method proposes a novel extraction method for opinions and their holders, comprising three parts: building the corpus, identifying sentences that contain opinions, and jointly extracting opinions and their holders. The text classification model that judges whether a sentence contains an opinion effectively avoids extracting sentences that contain no opinion, and the joint extraction model for opinions and their holders dispenses with natural language processing stages such as part-of-speech tagging, named entity recognition and syntactic dependency parsing, so errors arising in those stages cannot degrade the extraction results; and because the model involves no manually defined templates, it gains flexibility and coverage.
The key points and claimed points of the present invention are the processing method for jointly extracting opinions and their holders and the context-fusing word-sequence representation based on self-attention and a bidirectional LSTM.

Claims (3)

1. A self-attention-based method for jointly extracting opinions and their holders, characterized in that the method specifically comprises the following steps:
S1. Build a corpus for extracting opinions and their holders
The corpus comprises two parts: negative samples that contain no opinion, and positive samples that contain an opinion and its holder; each positive sample carries an annotation of the opinion and its holder and can be expressed as an <original text, opinion holder and opinion> two-tuple, where the opinion-holder-and-opinion part has the format [opinion holder]:[opinion];
S2. Identify sentences that contain opinions
Identifying sentences that contain opinions is a binary text classification problem: sentences containing an opinion form the positive class, and sentences containing no opinion form the negative class;
S3. Jointly extract opinions and their holders
Use a bidirectional LSTM to capture forward and backward information in the text, use self-attention to establish the relationship between each word and its context words, and use a Pointer Network to extract from the text the words that constitute the <opinion holder, opinion> two-tuple.
2. The self-attention-based method for jointly extracting opinions and their holders according to claim 1, characterized in that step S2 specifically uses a CNN-based text classification model, with the following steps:
S21: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model;
S22: Segment sentence s into words and, using the word vectors, express s as a matrix C = <w1, w2, …, wn>, where w1 is the d-dimensional word vector of the first word in s;
S23: Process matrix C with k convolution kernels, each of size x*d, where x is an integer greater than 0 and less than 5; each convolution operation yields an n-dimensional vector;
S24: Apply max pooling to the k n-dimensional vectors obtained in step S23, each vector contributing its maximum value, which finally yields a k-dimensional vector;
S25: Feed the k-dimensional vector obtained in step S24 into a fully connected network for classification;
S26: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
3. The self-attention-based method for jointly extracting opinions and their holders according to claim 1, characterized in that step S3 is specifically implemented as:
S31: Obtain word vectors: using Chinese Wikipedia as the corpus, train d-dimensional word vectors with the word2vec model;
S32: Feed the vectorized sentence <w1, w2, …, wn> into the bidirectional LSTM to obtain context-fused word vectors <h1, h2, …, hn>;
S33: From the context-fused word vectors obtained in step S32, compute for each word w_i the weight α_ij between w_i and every other word w_j, form the weighted vector a'_i, and concatenate a'_i with h_i into a_i as the output of the self-attention layer; the relevant formulas are:
e_ij = W_e · tanh(W_s · h_j + W_a · a'_(i-1))
α_ij = softmax(e_ij)
a'_i = Σ_j α_ij · h_j
a_i = [a'_i ; h_i]
where a'_i denotes the result of the self-attention weighted summation for word w_i, and α_ij denotes the weight between word w_i and another word w_j; α_ij is obtained from e_ij through the softmax function; in the computation of e_ij, W_e, W_s and W_a are parameters to be learned; the last formula denotes vector concatenation;
S34: Feed the output <a1, a2, …, an> of step S33 into the encoder of the Pointer Network and denote the encoder output by <h1, h2, …, hn>; the decoder outputs the input subsequence with the highest probability, which is exactly the jointly extracted opinion and its holder; following the constructed training corpus, the first word of the output sequence is the opinion holder and the remaining words are the opinion;
S35: Train the model; the training and test data are obtained by randomly shuffling the original data and splitting it 80% for training and 20% for testing.
CN201810347840.3A 2018-04-18 2018-04-18 Combined extraction method based on self-attention viewpoint and holder thereof Active CN108628828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810347840.3A CN108628828B (en) 2018-04-18 2018-04-18 Combined extraction method based on self-attention viewpoint and holder thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810347840.3A CN108628828B (en) 2018-04-18 2018-04-18 Combined extraction method based on self-attention viewpoint and holder thereof

Publications (2)

Publication Number Publication Date
CN108628828A true CN108628828A (en) 2018-10-09
CN108628828B CN108628828B (en) 2022-04-01

Family

ID=63705515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810347840.3A Active CN108628828B (en) 2018-04-18 2018-04-18 Combined extraction method based on self-attention viewpoint and holder thereof

Country Status (1)

Country Link
CN (1) CN108628828B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408630A (en) * 2018-10-17 2019-03-01 杭州世平信息科技有限公司 A method for automatically generating court opinions from descriptions of crime facts
CN109446326A (en) * 2018-11-01 2019-03-08 大连理工大学 Joint extraction method for biomedical events based on a copy mechanism
CN109684449A (en) * 2018-12-20 2019-04-26 电子科技大学 A natural language semantic representation method based on an attention mechanism
CN109783812A (en) * 2018-12-28 2019-05-21 中国科学院自动化研究所 Chinese named entity recognition method and device based on a self-attention mechanism
CN109933792A (en) * 2019-03-11 2019-06-25 海南中智信信息技术有限公司 Reading comprehension method for opinion-type questions based on multi-layer bidirectional LSTM and a verification model
CN109977414A (en) * 2019-04-01 2019-07-05 中科天玑数据科技股份有限公司 A topic analysis system and method for user comments on internet finance platforms
CN110008807A (en) * 2018-12-20 2019-07-12 阿里巴巴集团控股有限公司 A training method, device and equipment for a contract content recognition model
CN110162594A (en) * 2019-01-04 2019-08-23 腾讯科技(深圳)有限公司 Viewpoint generation method and device for text data, and electronic equipment
CN110263319A (en) * 2019-03-21 2019-09-20 国家计算机网络与信息安全管理中心 A scholar viewpoint extraction method based on web page text
CN110334339A (en) * 2019-04-30 2019-10-15 华中科技大学 A sequence labelling model and labelling method based on a location-aware self-attention mechanism
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A multi-feature-fusion Chinese-Vietnamese news viewpoint sentence extraction method
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Weakly supervised learning method for coreference resolution using a language model
CN111666767A (en) * 2020-06-10 2020-09-15 创新奇智(上海)科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112328784A (en) * 2019-08-05 2021-02-05 上海智臻智能网络科技股份有限公司 Data information classification method and device
CN112667808A (en) * 2020-12-23 2021-04-16 沈阳新松机器人自动化股份有限公司 Relation extraction method and system based on the BERT model
CN113139116A (en) * 2020-01-19 2021-07-20 北京中科闻歌科技股份有限公司 BERT-based method, device, equipment and storage medium for extracting viewpoints from media information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034626A (en) * 2012-12-26 2013-04-10 上海交通大学 Emotion analyzing system and method
US20130120267A1 (en) * 2011-11-10 2013-05-16 Research In Motion Limited Methods and systems for removing or replacing on-keyboard prediction candidates
CN103678564A (en) * 2013-12-09 2014-03-26 国家计算机网络与信息安全管理中心 Internet product research system based on data mining
CN104778209A (en) * 2015-03-13 2015-07-15 国家计算机网络与信息安全管理中心 Opinion mining method for ten-million-scale news comments


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yu Qi: "Research on Sentiment Analysis Techniques for Chinese Microblogs", China Master's Theses Full-text Database, Information Science and Technology Series *
Bai Jing et al.: "Attention-based BiLSTM-CNN Stance Detection Model for Chinese Microblogs", Computer Applications and Software *


Also Published As

Publication number Publication date
CN108628828B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN108628828A A self-attention-based joint extraction method for viewpoints and their holders
CN110633409B Automobile news event extraction method integrating rules and deep learning
CN108573411B Hybrid recommendation method based on deep sentiment analysis of user comments and fusion of multi-source recommendation views
CN107729309B Deep-learning-based Chinese semantic analysis method and device
CN108595708A A knowledge-graph-based text classification method for abnormal information
CN110516067A Public opinion monitoring method, system and storage medium based on topic detection
CN103049435B Fine-grained text sentiment analysis method and device
WO2021114745A1 Named entity recognition method employing affix perception for use in social media
CN109325112B A cross-language sentiment analysis method and apparatus based on emoji
CN106951438A An open-domain event extraction system and method
CN104809176A Entity relation extraction method for Tibetan
CN104881458B A labelling method and device for web page subjects
CN106940726B Automatic creative generation method and terminal based on a knowledge network
CN110489523B Fine-grained sentiment analysis method based on online shopping reviews
CN106599032A Text event extraction method combining sparse coding and a structured perceptron
CN102609427A Public opinion vertical search analysis system and method
CN113157859B Event detection method based on superordinate concept information
CN113987104A Ontology-guided generative event extraction method
Ketmaneechairat et al. Natural language processing for disaster management using conditional random fields
Xian et al. Self-guiding multimodal LSTM—when we do not have a perfect training dataset for image captioning
CN111159412A Classification method and device, electronic equipment and readable storage medium
CN109086355A Hotspot association analysis method and system based on news topic words
CN114048340A Hierarchical-fusion combined-query image retrieval method
CN114065702A Event detection method fusing entity relations and event elements
CN116628328A Web API recommendation method and device based on functional semantics and structural interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant