CN109710946A - Joint argumentation mining system and method based on dependency parse trees - Google Patents
Joint argumentation mining system and method based on dependency parse trees
- Publication number
- CN109710946A (application CN201910034772.XA)
- Authority
- CN
- China
- Prior art keywords
- argument
- text
- vector
- argumentation
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention relates to a joint argumentation mining system based on dependency parse trees, comprising: a data preprocessing module, for preprocessing the data; a text embedding module, for extracting from the input text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types; a sequence encoding module, which learns the contextual information of the text with a bidirectional long short-term memory (BiLSTM) network, for completing the tasks of argument boundary detection and argument relation extraction; a dependency parse tree module, which builds a dependency parse tree, for finding the shortest path between two argument component entities; and an argumentation mining label output module, for completing the label prediction of the three argumentation mining tasks, namely the argument type labels and the argument relation labels. The present invention can learn high-quality text vector features from argumentation text data and ultimately detect the argumentative structure of the text.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a joint argumentation mining system and method based on dependency parse trees.
Background art
Currently, many technical methods are available for argumentation mining. Traditional argumentation mining methods mainly model each subtask independently and ignore the correlations among the three subtasks, which degrades performance. In addition, some work performs joint modeling of the three subtasks with a pipeline model, and such models suffer from error propagation during training.
Currently, there are some pipeline-based research methods. Their basic idea is to solve the three subtasks of argumentation mining in pipeline order. Pipeline techniques suffer from error propagation because mistakes in argument type identification affect the extraction of argument relations. Moreover, these methods pair up the identified arguments exhaustively before classifying the argument relations, producing redundant argument relation pairs.
However, current argumentation mining research tends to ignore the correlations among the subtasks as well as the distinct characteristics of each subtask, yet these correlations are highly significant for argumentation mining: the label predicted for one task can serve as an effective feature for predicting the labels of the other argumentation mining subtasks. In view of the above deficiencies, it is therefore desirable to find a more efficient and careful method that fully exploits both the correlations among the subtasks and the characteristics of each subtask, thereby improving the performance of the argumentation mining model.
Summary of the invention
In view of this, the purpose of the present invention is to provide a joint argumentation mining system and method based on dependency parse trees, which can learn high-quality text vector features from argumentation text data and ultimately detect the argumentative structure of the text.
To achieve the above object, the present invention adopts the following technical scheme:
A joint argumentation mining system based on dependency parse trees, comprising:
a data preprocessing module, for preprocessing the data;
a text embedding module, for extracting from the input text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types;
a sequence encoding module, which uses a bidirectional long short-term memory (BiLSTM) network to learn the contextual information of the text, for completing the tasks of argument boundary detection and argument relation extraction;
a dependency parse tree module, which builds a dependency parse tree, for finding the shortest path between two argument component entities;
an argumentation mining label output module, for completing the label prediction of the three argumentation mining tasks, namely the argument type labels and the argument relation labels.
Further, the preprocessing performed on the data by the data preprocessing module specifically includes:
(1) removing web links, special characters, and punctuation marks from the documents;
(2) performing word segmentation on the documents;
(3) performing stemming on the English data;
(4) filtering out the stop words contained in the data set according to the Chinese and English stop-word lists, respectively.
Further, the text embedding module uses a deep convolutional neural network.
Further, an analysis method based on the joint argumentation mining system based on dependency parse trees, characterized by comprising the following steps:
Step S1: the argumentative document to be mined is input to the data preprocessing module for preprocessing, and the resulting preprocessed text is input to the text embedding module;
Step S2: the text embedding module uses the deep convolutional neural network to extract from the preprocessed text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types, and inputs them to the sequence encoding module;
Step S3: the sequence encoding module, according to the text data input by the text embedding module, learns the contextual information of the text data with a bidirectional long short-term memory network, completes argument boundary detection and argument relation extraction, and obtains the type labels of the argument component entities;
Step S4: according to the type labels of the argument component entities and the dependencies between arguments, the dependency parse tree module builds a dependency parse tree and obtains the argument relation labels through training;
Step S5: the argumentation mining label output module outputs the obtained argument type labels and argument relation labels.
Further, the step S2 specifically comprises:
Step S21: the input of the deep convolutional neural network is the preprocessed text sequence $x = [x_1, x_2, \ldots, x_n]$; following the order of the words in the sentence, each row is a word represented by a $d$-dimensional vector, and the CNN outputs the sequence $C = [c_1, c_2, \ldots, c_n]$, where $C$ represents the features of each input word and $n$ is the maximum length of the input sequence;
Step S22: a narrow convolution with a convolution kernel $W \in \mathbb{R}^{d \times k}$ of width $k$ is applied over $x$, and padding vectors are filled into the head and tail of the sequence;
Step S23: the vector representations of words $V_w$, characters $V_c$, parts of speech $V_p$, inter-argument dependencies $V_d$, and argument types $V_e$ are output separately and input to the sequence encoding module.
Further, the step S3 specifically comprises:
Step S31: the input of the sequence encoding layer is the shared feature parameter vector output by the text embedding layer, including the word vectors $V_w$, character vectors $V_c$, and part-of-speech vectors $V_p$, used to learn the contextual information of the text and to identify the argument component entities;
Step S32: a bidirectional LSTM is built to compute the sentence vectors; at the $t$-th word, each LSTM unit consists of an $n$-dimensional vector, comprising: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden unit $h_t$; each LSTM unit receives an $n$-dimensional vector input, the previous hidden state being $h_{t-1}$ and the previous memory cell being $c_{t-1}$;
the parameters are updated according to the following formulas:

$$i_t = \sigma(W^{(i)} x_t + I^{(i)} h_{t-1} + b_i)$$

$$f_t = \sigma(W^{(f)} x_t + I^{(f)} h_{t-1} + b_f)$$

$$o_t = \sigma(W^{(o)} x_t + I^{(o)} h_{t-1} + b_o)$$

$$u_t = \tanh(W^{(u)} x_t + I^{(u)} h_{t-1} + b_u)$$

$$c_t = i_t \odot u_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ denotes the logistic activation function, $\odot$ denotes the element-wise product of vectors, $W$ and $I$ denote weight matrices, and $b$ denotes bias vectors; the input of the LSTM unit at the $t$-th word is the concatenation $x_t = [V_t^w; V_t^c; V_t^p]$ of the word vector $V_t^w$, character vector $V_t^c$, and part-of-speech vector $V_t^p$ of the $t$-th word, and the hidden units $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ of the two opposite-direction LSTMs are concatenated as $s_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ and taken as the output;
Step S33: a BIO label is first attached to each word of the input sentence, and its argument type is then annotated, forming the "BIO-argument type" format;
Step S34: a two-layer neural network composed of a dense layer and a Softmax layer is built:

$$h_t^{(e)} = \tanh\big(W^{(e)}[s_t; e_{i-1}] + b^{(e)}\big), \qquad y_t = \operatorname{softmax}\big(W^{(y)} h_t^{(e)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors;
Step S35: taking $s_t$ and the vector $e_{i-1}$ of the previous word as input, the type label of the argument component entity is obtained through one neural network layer followed by the Softmax layer, and the output is mapped to the vector $e_i$.
Further, the step S4 specifically comprises:
Step S41: the type labels $e_i$ of the argument component entities and the dependency vectors $V_d$ from the text embedding layer are input to the dependency parsing layer;
Step S42: a tree-structured bidirectional LSTM combined with a recursive neural network is built, and the $n$-dimensional vector of the $t$-th node is computed inside the LSTM unit by the following formula:

$$h_t = o_t \odot \tanh(c_t)$$

where $m(\cdot)$ is the mapping function and $C(t)$ denotes the child nodes of the $t$-th node, with same-type child nodes sharing parameters;
a shortest-path structure is used to represent the relation between a pair of target words, capturing the dependency path between the target word pair; the dependency parsing layer is stacked on the sequence layer, and the information of the text sequence and of the dependency parse tree is merged into the output; the LSTM input for the $t$-th word of the dependency parsing layer is the concatenation $x_t = [s_t; V_t^d; e_t]$ of the hidden unit $s_t$ of the sequence layer, the argument relation dependency type $V_t^d$, and the entity representation $e_t$ of the argument component;
Step S43: the relation between arguments is represented by the type of the relation together with its direction; each candidate dependency relation can be expressed as $d_p = [\uparrow h_{pA}; \downarrow h_{p1}; \downarrow h_{p2}]$, where $\uparrow h_{pA}$ denotes the hidden state of the bottom-up LSTM at the lowest common ancestor node of the two argument entity nodes, and $\downarrow h_{p1}$ and $\downarrow h_{p2}$ are the hidden state vectors of the two LSTM units representing the first and second target argument entity components in the top-down LSTM-RNN;
Step S44: a two-layer neural network is set up, comprising an $n$-dimensional hidden layer $h^{(r)}$ and a Softmax output layer:

$$h_p^{(r)} = \tanh\big(W^{(r)} d_p' + b^{(r)}\big), \qquad y_p = \operatorname{softmax}\big(W^{(y)} h_p^{(r)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors;
the tree-structured LSTM-RNNs are stacked on the sequence layer to construct the input $d_p$ of the argument relation classification, and the average of the hidden state vectors of each argument entity component is taken from the sequence layer and concatenated to $d_p$ for argument relation classification, giving the following formula:

$$d_p' = \Big[d_p;\ \tfrac{1}{|U_{p1}|}\textstyle\sum_{t \in U_{p1}} s_t;\ \tfrac{1}{|U_{p2}|}\textstyle\sum_{t \in U_{p2}} s_t\Big]$$

where $U_{p1}$ and $U_{p2}$ are the word index sets of the first and second argument entities;
Step S45: in prediction, each word pair is assigned two labels according to direction; when the two directional labels disagree, the relation with the higher confidence is selected as the final output, and training yields the argument relation labels.
Further, "bidirectional" in the bidirectional tree structure refers to the top-down and bottom-up directions, transmitting to each node not only information from the leaf nodes but also information from the root node.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention automatically identifies the arguments in subjective documents and extracts the relations between them, and can, in combination with a multi-task learning method, perform high-quality argumentation mining on argumentation text.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of the system of the present invention.
Specific embodiment
The present invention will be further described with reference to the accompanying drawings and embodiments.
Referring to Fig. 1, the present invention provides a joint argumentation mining system based on dependency parse trees, comprising:
a data preprocessing module, for preprocessing the data;
a text embedding module, for extracting from the input text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types;
a sequence encoding module, which uses a bidirectional long short-term memory network to learn the contextual information of the text, for completing the tasks of argument boundary detection and argument relation extraction;
a dependency parse tree module, which builds a dependency parse tree, for finding the shortest path between two argument component entities;
an argumentation mining label output module, for completing the label prediction of the three argumentation mining tasks, namely the argument type labels and the argument relation labels.
In this embodiment, the concrete functions of each module are as follows:
(1) Data preprocessing module
The online argumentative documents taken as input contain rich information but are also mixed with a certain amount of noise. Therefore, the data is first preprocessed, mainly through the following operations (an illustrative sketch follows the list):
1. removing web links, special characters, punctuation marks, etc. from the documents;
2. performing word segmentation on the documents;
3. performing stemming on the English data;
4. filtering out the stop words contained in the data set according to the Chinese and English stop-word lists, respectively.
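By way of illustration only (not part of the claimed subject matter), the four operations could be realized along the following lines in Python; the sketch assumes the NLTK toolkit and an English stop-word list, and the helper name preprocess is invented for the example. Stop-word filtering is done before stemming here so that the tokens still match the stop-word list.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # the patent also uses a Chinese list

def preprocess(document: str) -> list:
    # (1) Remove web links, special characters and punctuation marks.
    text = re.sub(r"https?://\S+|www\.\S+", " ", document)
    text = re.sub(r"[^\w\s]", " ", text)
    # (2) Word segmentation (English tokenizer; Chinese would need a segmenter).
    tokens = word_tokenize(text)
    # (4) Filter stop words against the stop-word list.
    tokens = [t for t in tokens if t.lower() not in stop_words]
    # (3) Stemming for the English data.
    return [stemmer.stem(t) for t in tokens]
```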
(2) Text embedding module
A convolutional neural network is used to extract from the input text the representations of words, characters, parts of speech (Part-of-Speech), inter-argument dependencies, and argument types. The input of the deep convolutional neural network (CNN) is the text sequence $x = [x_1, x_2, \ldots, x_n]$; following the order of the words in the sentence, each row is a word represented by a $d$-dimensional vector, and the CNN outputs the sequence $C = [c_1, c_2, \ldots, c_n]$, where $C$ represents the features of each input word and $n$ is the maximum length of the input sequence. A narrow convolution with a convolution kernel $W \in \mathbb{R}^{d \times k}$ of width $k$ is applied over $x$, and padding vectors are filled into the head and tail of the sequence, so as to guarantee that the length of the input sequence is not changed by the convolutional layer. The vectors $V_w$, $V_c$, $V_p$, $V_d$, and $V_e$ are output separately; these parameters serve as the bottom-level shared parameters of the model and are input to the subsequent text sequence layer for training.
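As a minimal sketch of this length-preserving narrow convolution (assuming PyTorch; zero padding stands in here for the filling vectors, whose exact form is not reproduced in this text, and all sizes are illustrative):

```python
import torch
import torch.nn as nn

d, k, n = 128, 3, 40  # word-vector dimension d, kernel width k, sequence length n

# Padding of (k-1)/2 on each side plays the role of the head/tail filling
# vectors, so the convolution does not change the sequence length.
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=k,
                 padding=(k - 1) // 2)

x = torch.randn(1, n, d)       # x = [x1, ..., xn], one d-dimensional row per word
c = conv(x.transpose(1, 2))    # Conv1d expects (batch, channels, length)
c = c.transpose(1, 2)          # back to (batch, n, d): the feature sequence C
assert c.shape == x.shape      # the input length is unchanged by the convolution
```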
(3) Sequence layer module
The input of the sequence encoding layer is the shared feature parameter vector output by the text embedding layer; the layer learns the contextual information of the text and identifies the argument component entities. First, a bidirectional LSTM is used to compute the sentence vectors. At the $t$-th word, each LSTM unit consists of an $n$-dimensional vector, comprising: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden unit $h_t$. Each LSTM unit receives an $n$-dimensional vector input, the previous hidden state being $h_{t-1}$ and the previous memory cell being $c_{t-1}$. The parameters are updated according to the following formulas:

$$i_t = \sigma(W^{(i)} x_t + I^{(i)} h_{t-1} + b_i)$$

$$f_t = \sigma(W^{(f)} x_t + I^{(f)} h_{t-1} + b_f)$$

$$o_t = \sigma(W^{(o)} x_t + I^{(o)} h_{t-1} + b_o)$$

$$u_t = \tanh(W^{(u)} x_t + I^{(u)} h_{t-1} + b_u)$$

$$c_t = i_t \odot u_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ denotes the logistic activation function, $\odot$ denotes the element-wise product of vectors, $W$ and $I$ denote weight matrices, and $b$ denotes bias vectors. The input of the LSTM unit at the $t$-th word is the concatenation $x_t = [V_t^w; V_t^c; V_t^p]$ of the word vector $V_t^w$, character vector $V_t^c$, and part-of-speech vector $V_t^p$ of the $t$-th word, and we concatenate the hidden units $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ of the two opposite-direction LSTMs as $s_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, which is taken as the output.
Task one of argumentation mining, argument boundary detection, and task two, argument type identification, are both treated as sequence labeling problems: we first attach a BIO label to each word of the input sentence and then annotate its argument type, so that each word carries a label of the form "BIO-argument type"; this labeling scheme covers the labels of both task one and task two. The two tasks are completed at the top of the sequence encoding layer, where we implement a two-layer neural network composed of a dense layer and a Softmax layer:

$$h_t^{(e)} = \tanh\big(W^{(e)}[s_t; e_{i-1}] + b^{(e)}\big), \qquad y_t = \operatorname{softmax}\big(W^{(y)} h_t^{(e)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors.
In the decoding process of argument entity recognition, the dependency between labels is taken into account by using the predicted value of one word to predict the value of the next. Concretely, we take $s_t$ and the vector $e_{i-1}$ of the previous word as input, obtain the type label of the argument component entity through one neural network layer followed by the Softmax layer, and map the output to the vector $e_i$.
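As an illustrative sketch only (assuming PyTorch; the class name SequenceEncoder and all dimensions are invented for the example, and a plain fully connected layer is used as the dense layer), the BiLSTM encoder with greedy label-dependent decoding could look like:

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """BiLSTM over the concatenated word/character/POS vectors, followed by a
    two-layer (dense + Softmax) tagger that also consumes the embedding of the
    previously predicted "BIO-argument type" label."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 num_labels: int, label_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.label_embed = nn.Embedding(num_labels, label_dim)  # e_{i-1}
        self.dense = nn.Linear(2 * hidden_dim + label_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, input_dim), the concatenation [V_t^w; V_t^c; V_t^p]
        s, _ = self.bilstm(x)  # s_t = [forward h_t ; backward h_t]
        prev = torch.zeros(x.size(0), dtype=torch.long,
                           device=x.device)  # assume label index 0 is "O"
        logits = []
        for t in range(x.size(1)):
            feat = torch.cat([s[:, t], self.label_embed(prev)], dim=-1)
            score = self.out(torch.tanh(self.dense(feat)))
            logits.append(score)
            prev = score.argmax(dim=-1)  # greedy left-to-right label decoding
        return torch.stack(logits, dim=1)  # Softmax is applied inside the loss
```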
(4) Dependency parsing layer module
The input of the dependency parsing layer module consists of the type labels $e_i$ of the argument component entities output by the trained sequence-layer neural network and the dependency vectors $V_d$ from the text embedding layer.
The layer is realized as a tree-structured bidirectional LSTM combined with a recursive neural network, where "bidirectional" refers to the top-down and bottom-up directions. This bidirectional structure transmits to each node not only information from the leaf nodes but also information from the root node, which is particularly important for argument relation classification, as it makes full use of the nodes near the bottom of the tree; the top-down structure sends information from the top toward the leaf nodes and can accommodate leaf nodes of different types and quantities, with child nodes of the same type sharing the weight matrices in the LSTM unit. The $n$-dimensional vector of the $t$-th node is computed inside the LSTM unit according to the following formula:

$$h_t = o_t \odot \tanh(c_t)$$

where $m(\cdot)$ is the mapping function and $C(t)$ denotes the child nodes of the $t$-th node, with same-type children sharing parameters.
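Purely as a sketch of one plausible node update of this kind (the child-sum tree-LSTM variant, assuming PyTorch; the per-type mapping function m(·) and type-shared weights of the patent are simplified away here), the update ending in $h_t = o_t \odot \tanh(c_t)$ could be written as:

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """One tree-LSTM node update ending in h_t = o_t * tanh(c_t)."""

    def __init__(self, in_dim: int, mem_dim: int):
        super().__init__()
        self.iou_x = nn.Linear(in_dim, 3 * mem_dim)           # input/output/update
        self.iou_h = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        self.f_x = nn.Linear(in_dim, mem_dim)                 # forget gate, per child
        self.f_h = nn.Linear(mem_dim, mem_dim, bias=False)

    def forward(self, x_t, child_h, child_c):
        # child_h, child_c: (num_children, mem_dim); empty tensors for leaves
        h_sum = child_h.sum(dim=0)                            # summary over C(t)
        i, o, u = torch.chunk(self.iou_x(x_t) + self.iou_h(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f_x(x_t) + self.f_h(child_h))  # per-child forget gates
        c_t = i * u + (f * child_c).sum(dim=0)
        h_t = o * torch.tanh(c_t)                             # h_t = o_t ⊙ tanh(c_t)
        return h_t, c_t

# Leaf example: no children, so the child tensors are empty.
cell = ChildSumTreeLSTMCell(in_dim=8, mem_dim=16)
h, c = cell(torch.randn(8), torch.empty(0, 16), torch.empty(0, 16))
```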
A shortest-path structure (SPTree) is used to represent the relation between a pair of target words; it captures the dependency path between the target word pair. We stack the dependency parsing layer on top of the sequence layer and merge the information of the text sequence and of the dependency parse tree into the output; the LSTM input for the $t$-th word of the dependency parsing layer is the concatenation $x_t = [s_t; V_t^d; e_t]$ of the hidden unit $s_t$ of the sequence layer, the argument relation dependency type $V_t^d$, and the entity representation $e_t$ of the argument component.
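For illustration only, the shortest dependency path between two words can be recovered from the parse by climbing to their lowest common ancestor; this plain-Python helper (shortest_dependency_path is a name invented for the example) assumes the tree is given as a head array:

```python
def shortest_dependency_path(heads, a, b):
    """Return the path a -> ... -> lowest common ancestor -> ... -> b in a
    dependency tree given as `heads` (heads[i] = parent of token i, root = -1)."""
    def ancestors(node):
        chain = [node]
        while heads[node] != -1:
            node = heads[node]
            chain.append(node)
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    # The lowest common ancestor is the first shared node on a's chain to the root.
    lca = next(node for node in up_a if node in common)
    up = up_a[:up_a.index(lca) + 1]                # a up to the LCA (↑ direction)
    down = list(reversed(up_b[:up_b.index(lca)]))  # LCA down to b (↓ direction)
    return up + down

# Example: heads for "systems mine arguments well" with root "mine" (index 1)
heads = [1, -1, 1, 1]
print(shortest_dependency_path(heads, 0, 2))       # [0, 1, 2]
```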
For all the argument entity components identified by the sequence layer, all combinations of the last word of each argument entity component are enumerated and input to the dependency parsing layer, and the relation class of the pair of argument entity components is finally output through a two-layer neural network layer. When the two extracted argument entity components are erroneous, or no relation holds between them, the relation between them is treated as a negative relation; the relation between arguments is therefore represented by the type of the relation together with its direction. Each candidate dependency relation can be expressed as $d_p = [\uparrow h_{pA}; \downarrow h_{p1}; \downarrow h_{p2}]$, where $\uparrow h_{pA}$ denotes the hidden state of the bottom-up LSTM at the lowest common ancestor node of the two argument entity nodes, and $\downarrow h_{p1}$ and $\downarrow h_{p2}$ are the hidden state vectors of the two LSTM units representing the first and second target argument entity components in the top-down LSTM-RNN.
Analogously to the entity recognition of argument types, we implement a two-layer neural network comprising an $n$-dimensional hidden layer $h^{(r)}$ and a Softmax output layer:

$$h_p^{(r)} = \tanh\big(W^{(r)} d_p' + b^{(r)}\big), \qquad y_p = \operatorname{softmax}\big(W^{(y)} h_p^{(r)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors.
The tree-structured LSTM-RNNs are stacked on the sequence layer to construct the input $d_p$ of the argument relation classification. At this point, the input from the sequence layer to the dependency parsing layer carries no direction; to make full use of the argument entity information and to resolve the undirectedness of the input $d_p$, the average of the hidden state vectors of each argument entity component is taken from the sequence layer and concatenated to $d_p$ for argument relation classification, giving the following formula:

$$d_p' = \Big[d_p;\ \tfrac{1}{|U_{p1}|}\textstyle\sum_{t \in U_{p1}} s_t;\ \tfrac{1}{|U_{p2}|}\textstyle\sum_{t \in U_{p2}} s_t\Big]$$

where $U_{p1}$ and $U_{p2}$ are the word index sets of the first and second argument entities.
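A minimal sketch of this classification input and the two-layer classifier (assuming PyTorch; relation_input, RelationClassifier, and all tensor sizes are illustrative names and values, not the patent's own):

```python
import torch
import torch.nn as nn

def relation_input(d_p, s, span1, span2):
    """Concatenate the tree-path vector d_p with the averaged sequence-layer
    hidden states s_t of the two argument entities (index sets U_p1, U_p2)."""
    avg1 = s[span1].mean(dim=0)   # (1/|U_p1|) * sum of s_t over U_p1
    avg2 = s[span2].mean(dim=0)   # (1/|U_p2|) * sum of s_t over U_p2
    return torch.cat([d_p, avg1, avg2], dim=-1)  # d_p'

class RelationClassifier(nn.Module):
    """n-dimensional hidden layer h^(r) followed by a Softmax output layer."""

    def __init__(self, in_dim: int, hidden_dim: int, num_relations: int):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_relations)

    def forward(self, d_p_prime):
        return torch.softmax(self.out(torch.tanh(self.hidden(d_p_prime))), dim=-1)

# Hypothetical sizes: a 150-dim tree vector, 100-dim sequence hidden states.
s = torch.randn(12, 100)                           # sequence-layer hidden states
d_p = torch.randn(150)
d_p_prime = relation_input(d_p, s, [2, 3], [7, 8, 9])
probs = RelationClassifier(150 + 200, 64, 4)(d_p_prime)
```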
In addition, both directions between the two argument component entities, from left to right and from right to left, are considered simultaneously: in prediction, each word pair is assigned two labels according to direction, and when the two directional labels disagree, the relation with the higher confidence is selected as the final output. Training finally yields the argument relation labels.
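Illustratively (plain Python with NumPy; resolve_direction and the label set are invented for this sketch), the direction tie-breaking could be implemented as:

```python
import numpy as np

def resolve_direction(p_l2r, p_r2l, labels):
    """Each candidate pair is scored in both directions; when the two predicted
    labels disagree, keep the prediction of the more confident direction."""
    i, j = int(np.argmax(p_l2r)), int(np.argmax(p_r2l))
    if labels[i] == labels[j]:
        return labels[i], "either"
    if p_l2r[i] >= p_r2l[j]:
        return labels[i], "left-to-right"
    return labels[j], "right-to-left"

# Example with hypothetical softmax outputs over (none, support, attack):
labels = ["none", "support", "attack"]
print(resolve_direction([0.1, 0.7, 0.2], [0.2, 0.3, 0.5], labels))
# -> ('support', 'left-to-right')
```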
The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the patent claims of the present invention shall be covered by the present invention.
Claims (8)
1. A joint argumentation mining system based on dependency parse trees, characterized by comprising:
a data preprocessing module, for preprocessing the data;
a text embedding module, for extracting from the input text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types;
a sequence encoding module, which uses a bidirectional long short-term memory network to learn the contextual information of the text, for completing the tasks of argument boundary detection and argument relation extraction;
a dependency parse tree module, which builds a dependency parse tree, for finding the shortest path between two argument component entities;
an argumentation mining label output module, for completing the label prediction of the three argumentation mining tasks, namely the argument type labels and the argument relation labels.
2. The joint argumentation mining system based on dependency parse trees according to claim 1, characterized in that the preprocessing performed on the data by the data preprocessing module specifically includes:
(1) removing web links, special characters, and punctuation marks from the documents;
(2) performing word segmentation on the documents;
(3) performing stemming on the English data;
(4) filtering out the stop words contained in the data set according to the Chinese and English stop-word lists, respectively.
3. The joint argumentation mining system based on dependency parse trees according to claim 1, characterized in that the text embedding module uses a deep convolutional neural network.
4. An analysis method based on the joint argumentation mining system based on dependency parse trees according to any one of claims 1 to 3, characterized by comprising the following steps:
Step S1: the argumentative document to be mined is input to the data preprocessing module for preprocessing, and the resulting preprocessed text is input to the text embedding module;
Step S2: the text embedding module uses the deep convolutional neural network to extract from the preprocessed text the vector representations of words, characters, parts of speech, inter-argument dependencies, and argument types, and inputs them to the sequence encoding module;
Step S3: the sequence encoding module, according to the text data input by the text embedding module, learns the contextual information of the text data with a bidirectional long short-term memory network, completes argument boundary detection and argument relation extraction, and obtains the type labels of the argument component entities;
Step S4: according to the type labels of the argument component entities and the dependencies between arguments, the dependency parse tree module builds a dependency parse tree and obtains the argument relation labels through training;
Step S5: the argumentation mining label output module outputs the obtained argument type labels and argument relation labels.
5. The analysis method based on the joint argumentation mining system based on dependency parse trees according to claim 4, characterized in that the step S2 specifically comprises:
Step S21: the input of the deep convolutional neural network is the preprocessed text sequence $x = [x_1, x_2, \ldots, x_n]$; following the order of the words in the sentence, each row is a word represented by a $d$-dimensional vector, and the CNN outputs the sequence $C = [c_1, c_2, \ldots, c_n]$, where $C$ represents the features of each input word and $n$ is the maximum length of the input sequence;
Step S22: a narrow convolution with a convolution kernel $W \in \mathbb{R}^{d \times k}$ of width $k$ is applied over $x$, and padding vectors are filled into the head and tail of the sequence;
Step S23: the vector representations of words $V_w$, characters $V_c$, parts of speech $V_p$, inter-argument dependencies $V_d$, and argument types $V_e$ are output separately and input to the sequence encoding module.
6. The analysis method based on the joint argumentation mining system based on dependency parse trees according to claim 4, characterized in that the step S3 specifically comprises:
Step S31: the input of the sequence encoding layer is the shared feature parameter vector output by the text embedding layer, including the word vectors $V_w$, character vectors $V_c$, and part-of-speech vectors $V_p$, used to learn the contextual information of the text and to identify the argument component entities;
Step S32: a bidirectional LSTM is built to compute the sentence vectors; at the $t$-th word, each LSTM unit consists of an $n$-dimensional vector, comprising: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden unit $h_t$; each LSTM unit receives an $n$-dimensional vector input, the previous hidden state being $h_{t-1}$ and the previous memory cell being $c_{t-1}$;
the parameters are updated according to the following formulas:

$$i_t = \sigma(W^{(i)} x_t + I^{(i)} h_{t-1} + b_i)$$

$$f_t = \sigma(W^{(f)} x_t + I^{(f)} h_{t-1} + b_f)$$

$$o_t = \sigma(W^{(o)} x_t + I^{(o)} h_{t-1} + b_o)$$

$$u_t = \tanh(W^{(u)} x_t + I^{(u)} h_{t-1} + b_u)$$

$$c_t = i_t \odot u_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ denotes the logistic activation function, $\odot$ denotes the element-wise product of vectors, $W$ and $I$ denote weight matrices, and $b$ denotes bias vectors; the input of the LSTM unit at the $t$-th word is the concatenation $x_t = [V_t^w; V_t^c; V_t^p]$ of the word vector $V_t^w$, character vector $V_t^c$, and part-of-speech vector $V_t^p$ of the $t$-th word, and the hidden units $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ of the two opposite-direction LSTMs are concatenated as $s_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ and taken as the output;
Step S33: a BIO label is first attached to each word of the input sentence, and its argument type is then annotated, forming the "BIO-argument type" format;
Step S34: a two-layer neural network composed of a dense layer and a Softmax layer is built:

$$h_t^{(e)} = \tanh\big(W^{(e)}[s_t; e_{i-1}] + b^{(e)}\big), \qquad y_t = \operatorname{softmax}\big(W^{(y)} h_t^{(e)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors;
Step S35: taking $s_t$ and the vector $e_{i-1}$ of the previous word as input, the type label of the argument component entity is obtained through one neural network layer followed by the Softmax layer, and the output is mapped to the vector $e_i$.
7. The analysis method based on the joint argumentation mining system based on dependency parse trees according to claim 6, characterized in that the step S4 specifically comprises:
Step S41: the type labels $e_i$ of the argument component entities and the dependency vectors $V_d$ from the text embedding layer are input to the dependency parsing layer;
Step S42: a tree-structured bidirectional LSTM combined with a recursive neural network is built, and the $n$-dimensional vector of the $t$-th node is computed inside the LSTM unit by the following formula:

$$h_t = o_t \odot \tanh(c_t)$$

where $m(\cdot)$ is the mapping function and $C(t)$ denotes the child nodes of the $t$-th node, with same-type child nodes sharing parameters;
a shortest-path structure is used to represent the relation between a pair of target words, capturing the dependency path between the target word pair; the dependency parsing layer is stacked on the sequence layer, and the information of the text sequence and of the dependency parse tree is merged into the output; the LSTM input for the $t$-th word of the dependency parsing layer is the concatenation $x_t = [s_t; V_t^d; e_t]$ of the hidden unit $s_t$ of the sequence layer, the argument relation dependency type $V_t^d$, and the entity representation $e_t$ of the argument component;
Step S43: the relation between arguments is represented by the type of the relation together with its direction; each candidate dependency relation can be expressed as $d_p = [\uparrow h_{pA}; \downarrow h_{p1}; \downarrow h_{p2}]$, where $\uparrow h_{pA}$ denotes the hidden state of the bottom-up LSTM at the lowest common ancestor node of the two argument entity nodes, and $\downarrow h_{p1}$ and $\downarrow h_{p2}$ are the hidden state vectors of the two LSTM units representing the first and second target argument entity components in the top-down LSTM-RNN;
Step S44: a two-layer neural network is set up, comprising an $n$-dimensional hidden layer $h^{(r)}$ and a Softmax output layer:

$$h_p^{(r)} = \tanh\big(W^{(r)} d_p' + b^{(r)}\big), \qquad y_p = \operatorname{softmax}\big(W^{(y)} h_p^{(r)} + b^{(y)}\big)$$

where the $W$'s are weight matrices and the $b$'s are bias vectors;
the tree-structured LSTM-RNNs are stacked on the sequence layer to construct the input $d_p$ of the argument relation classification, and the average of the hidden state vectors of each argument entity component is taken from the sequence layer and concatenated to $d_p$ for argument relation classification, giving the following formula:

$$d_p' = \Big[d_p;\ \tfrac{1}{|U_{p1}|}\textstyle\sum_{t \in U_{p1}} s_t;\ \tfrac{1}{|U_{p2}|}\textstyle\sum_{t \in U_{p2}} s_t\Big]$$

where $U_{p1}$ and $U_{p2}$ are the word index sets of the first and second argument entities;
Step S45: in prediction, each word pair is assigned two labels according to direction; when the two directional labels disagree, the relation with the higher confidence is selected as the final output, and training yields the argument relation labels.
8. The analysis method based on the joint argumentation mining system based on dependency parse trees according to claim 7, characterized in that "bidirectional" in the bidirectional tree structure refers to the top-down and bottom-up directions, transmitting to each node not only information from the leaf nodes but also information from the root node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910034772.XA (CN109710946A) | 2019-01-15 | 2019-01-15 | Joint argumentation mining system and method based on dependency parse trees
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910034772.XA (CN109710946A) | 2019-01-15 | 2019-01-15 | Joint argumentation mining system and method based on dependency parse trees
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710946A true CN109710946A (en) | 2019-05-03 |
Family
ID=66261394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910034772.XA (Pending) | Joint argumentation mining system and method based on dependency parse trees | 2019-01-15 | 2019-01-15
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710946A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100235308A1 (en) * | 2009-03-11 | 2010-09-16 | Shingo Takamatsu | Text analysis device and method and program |
CN107220300A (en) * | 2017-05-05 | 2017-09-29 | 平安科技(深圳)有限公司 | Information mining method, electronic installation and readable storage medium storing program for executing |
CN107944014A (en) * | 2017-12-11 | 2018-04-20 | 河海大学 | A kind of Chinese text sentiment analysis method based on deep learning |
CN108363695A (en) * | 2018-02-23 | 2018-08-03 | 西南交通大学 | A kind of user comment attribute extraction method based on bidirectional dependency syntax tree characterization |
Non-Patent Citations (3)
Title |
---|
MAKOTO MIWA et al.: "End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures", http://arxiv.org/abs/1601.00770 |
STEFFEN EGER et al.: "Neural End-to-End Learning for Computational Argumentation Mining", http://arxiv.org/abs/1704.06104 |
LIAO Xiangwen et al.: "Argumentation mining method based on multi-task iterative learning", Chinese Journal of Computers |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116095A (en) * | 2019-06-19 | 2020-12-22 | 北京搜狗科技发展有限公司 | Method and related device for training multi-task learning model |
CN110941700A (en) * | 2019-11-22 | 2020-03-31 | 福州大学 | Multi-task joint learning-based argument mining system and working method thereof |
CN110941700B (en) * | 2019-11-22 | 2022-08-09 | 福州大学 | Multi-task joint learning-based argument mining system and working method thereof |
CN111104797A (en) * | 2019-12-17 | 2020-05-05 | 南开大学 | Paper network representation learning method based on dual sequence-to-sequence generation |
CN111104797B (en) * | 2019-12-17 | 2023-05-02 | 南开大学 | Dual-based sequence-to-sequence generation paper network representation learning method |
CN112487812A (en) * | 2020-10-21 | 2021-03-12 | 上海旻浦科技有限公司 | Nested entity identification method and system based on boundary identification |
CN112487812B (en) * | 2020-10-21 | 2021-07-06 | 上海旻浦科技有限公司 | Nested entity identification method and system based on boundary identification |
CN112732994A (en) * | 2021-01-07 | 2021-04-30 | 上海携宁计算机科技股份有限公司 | Method, device and equipment for extracting webpage information and storage medium |
CN112732994B (en) * | 2021-01-07 | 2022-01-28 | 上海携宁计算机科技股份有限公司 | Method, device and equipment for extracting webpage information and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109710946A (en) | | Joint argumentation mining system and method based on dependency parse trees |
WO2021147726A1 (en) | Information extraction method and apparatus, electronic device and storage medium | |
CN110334339B (en) | Sequence labeling model and labeling method based on position perception self-attention mechanism | |
CN110196913A (en) | Multiple entity relationship joint abstracting method and device based on text generation formula | |
CN108846017A (en) | The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector | |
CN109800437A (en) | A kind of name entity recognition method based on Fusion Features | |
Sun et al. | Deep LSTM networks for online Chinese handwriting recognition | |
CN108363695A (en) | A kind of user comment attribute extraction method based on bidirectional dependency syntax tree characterization | |
CN109960728A (en) | A kind of open field conferencing information name entity recognition method and system | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN114065702A (en) | Event detection method fusing entity relationship and event element | |
CN110134950A (en) | A kind of text auto-collation that words combines | |
CN113836992A (en) | Method for identifying label, method, device and equipment for training label identification model | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN116245097A (en) | Method for training entity recognition model, entity recognition method and corresponding device | |
Yang et al. | Iterative class prototype calibration for transductive zero-shot learning | |
CN110941700B (en) | Multi-task joint learning-based argument mining system and working method thereof | |
CN113780059A (en) | Continuous sign language identification method based on multiple feature points | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
Zhao et al. | Domain adaptation with feature and label adversarial networks | |
CN114881038B (en) | Chinese entity and relation extraction method and device based on span and attention mechanism | |
CN116484852A (en) | Chinese patent entity relationship joint extraction method based on relationship diagram attention network | |
Shirghasemi et al. | The impact of active learning algorithm on a cross-lingual model in a Persian sentiment task | |
US20230153533A1 (en) | Pre-training techniques for entity extraction in low resource domains | |
CN115730232A (en) | Topic-correlation-based heterogeneous graph neural network cross-language text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190503 |
|
RJ01 | Rejection of invention patent application after publication |