WO2018218706A1 - Method and system for extracting news event based on neural network - Google Patents

Method and system for extracting news event based on neural network

Info

Publication number
WO2018218706A1
WO2018218706A1 · PCT/CN2017/089136
Authority
WO
WIPO (PCT)
Prior art keywords
event
sentence
word
trigger word
neural network
Prior art date
Application number
PCT/CN2017/089136
Other languages
French (fr)
Chinese (zh)
Inventor
周勇
刘兵
陈斌
王重秋
Original Assignee
中国矿业大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国矿业大学 (China University of Mining and Technology)
Publication of WO2018218706A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • The invention relates to natural language processing, and in particular to a news event extraction method and system based on the combination of a bidirectional long short-term memory (BiLSTM) network and a convolutional neural network (CNN).
  • BiLSTM: bidirectional long short-term memory network
  • CNN: convolutional neural network
  • Event extraction arose in this context. As a sub-task of information extraction, event extraction is a research hotspot; its goal is to automatically find specific types of events and their event elements in natural text.
  • Extracting an event from text is usually done by identifying the event's trigger word, so the trigger word is the key to identifying an event instance.
  • The patent document numbered CN201210321193.1 discloses an event extraction method that uses the morphological structure of trigger words together with sememe similarity to extend the set of trigger words, so that not only event instances corresponding to known trigger words but also instances corresponding to the extended, previously unknown trigger words can be extracted, which improves the recall of event extraction.
  • The patent document numbered CN201410108447.0 discloses a method for extracting atomic news events: a preliminary fusion rule base and an information-unit fusion rule base are first used to fuse part-of-speech and named-entity recognition results, and a core vocabulary and an event-extraction rule base are then used to extract events from the fused information units of the news body.
  • The object of the present invention is to overcome the deficiencies of the prior art by providing a neural-network-based method and system for news event extraction that eliminates candidate trigger-word ambiguity and can handle news events expressed in non-standard sentences.
  • A neural-network-based method for news event extraction comprises the following steps:
  • Step S1: Preprocess the raw text of the training corpus: split the raw text into sentences to obtain event sentences, then perform word segmentation and named entity recognition on each event sentence. According to manually annotated news event information, label the event sentence as a sequence: each trigger word is labeled with its type and each non-trigger word is labeled as having no category, yielding the event sentence sequence. Represent the event sentence sequence in the form of word vectors.
  • Step S2: Pass the event sentence sequence, represented by word vectors, into a bidirectional long short-term memory (BiLSTM) network, and train it to obtain the semantic feature of each candidate trigger word.
  • Step S3: Pass the event sentence sequence, represented by word vectors, into a convolutional neural network (CNN), and train it to obtain the global feature of the event sentence in which the candidate trigger words are located.
  • Step S4: According to the semantic features of the candidate trigger words obtained in step S2 and the global feature of their sentence obtained in step S3, classify each candidate trigger word with a softmax classifier to find the trigger words of news events, and determine each event's type from the type of its trigger word.
  • Step S1 is specifically:
  • Step S12: According to the segmentation and named entity recognition results, manually annotate the event sentence: mark each non-trigger word as untyped and label each trigger word with its news event category, obtaining the event sentence sequence.
  • Step S13: Train word vectors with the open-source toolkit word2vec using the Skip-gram model, and express each word in the event sentence sequence as a 300-dimensional vector according to the trained word vectors.
  • Step S2 is specifically:
  • Reversing the sequence and passing it through the LSTM yields the output BW = {bw_1, bw_2, ..., bw_i, ..., bw_n}, where bw_i is the semantic feature of the i-th candidate trigger word extracted by the backward LSTM.
  • Step S3 is specifically as follows:
  • Step S32: A convolution operation is applied to the event sentence, computed as C_i = f(w^T x_{i:i-h+1} + b), where f is the activation function, C_i is the feature obtained by the convolution, w is the weight matrix, h is the convolution kernel size, x_{i:i-h+1} denotes the window from the i-th word to the (i-h+1)-th word, and b is the bias.
  • Step S33: Max pooling is applied to the feature map to obtain the global feature C_o of the event sentence.
  • Step S4 is specifically:
  • Step S41: The semantic features O = {r_1, r_2, ..., r_i, ..., r_n} obtained from the bidirectional LSTM are concatenated with the global sentence feature C_o extracted by the CNN to form the output vector O_t = [O : C_o]; Step S42: the output vector O_t is classified using softmax to obtain the predicted news event type.
  • A neural-network-based system for news event extraction comprises a text preprocessing module, a neural network training module, and a news event prediction module, wherein:
  • the text preprocessing module performs data preprocessing on the raw text of the training corpus: it splits the raw text into sentences to obtain event sentences, performs word segmentation and named entity recognition on each event sentence, labels the event sentence as a sequence according to manually annotated news event information (each trigger word labeled with its type, each non-trigger word labeled as having no category) to obtain the event sentence sequence, and represents the event sentence sequence in the form of word vectors;
  • the neural network training module comprises a BiLSTM training module and a CNN training module; the BiLSTM training module trains on the event sentence sequence represented by word vectors to obtain the semantic feature of each candidate trigger word, and the CNN training module trains on the same sequence to obtain the global feature of the event sentence in which the candidate trigger words are located;
  • the news event prediction module classifies each candidate trigger word with a softmax classifier, according to the semantic features of the candidate trigger words and the global feature of their sentence obtained by the neural network training module, to find the trigger words of news events and determine each event's type from the type of its trigger word.
  • Advantageous effects: by adopting the above technical solution, the present invention has the following benefits compared with the prior art:
  • The invention employs a bidirectional long short-term memory (BiLSTM) network, which can disambiguate candidate trigger words from their context. For example, in "一辆车撞上了高速公路的护栏" ("A car crashed into the guardrail of the expressway") and "今天我去吃饭的时候正好撞上了好久不见的同学" ("When I went to dinner today, I ran into a classmate I had not seen for a long time"), the trigger word in both sentences is "撞上" ("crashed into / ran into"); the former is a traffic accident event while the latter is an encounter event. When the BiLSTM extracts the semantic information of a candidate trigger word, the word's actual meaning can be judged from the sentence context, which effectively avoids lexical ambiguity and improves the accuracy of news event classification.
  • The invention uses the global sentence feature extracted by the convolutional neural network (CNN); when a sentence is non-standard, the event category can still be determined accurately from the global sentence feature together with the semantic features of the candidate trigger words. The invention can therefore handle news event recognition in non-standard sentences.
  • FIG. 1 is a flowchart of the neural-network-based news event extraction method and system provided by the present invention;
  • FIG. 2 shows the workflow of the key steps of news event extraction based on the bidirectional long short-term memory (BiLSTM) network and the convolutional neural network (CNN);
  • FIG. 3 is a schematic diagram of the structure of a convolutional neural network (CNN).
  • A neural-network-based news event extraction system includes a text preprocessing module, a neural network training module, and a news event prediction module, wherein:
  • the text preprocessing module performs data preprocessing on the raw text of the training corpus: it splits the raw text into sentences to obtain event sentences, performs word segmentation and named entity recognition on each event sentence, labels the event sentence as a sequence according to manually annotated news event information (each trigger word labeled with its type, each non-trigger word labeled as having no category) to obtain the event sentence sequence, and represents the event sentence sequence in the form of word vectors;
  • the neural network training module comprises a BiLSTM training module and a CNN training module; the BiLSTM training module trains on the event sentence sequence represented by word vectors to obtain the semantic feature of each candidate trigger word, and the CNN training module trains on the same sequence to obtain the global feature of the event sentence in which the candidate trigger words are located;
  • the news event prediction module classifies each candidate trigger word with a softmax classifier, according to the semantic features of the candidate trigger words and the global feature of their sentence obtained by the neural network training module, to find the trigger words of news events and determine each event's type from the type of its trigger word.
  • For the neural-network-based news event extraction method, the example sentence is: "11时25分，S20外圈沪渝立交发生一起3车追尾事故。" ("At 11:25, a three-car rear-end collision occurred at the S20 outer-ring Huyu interchange.") In this sentence the event trigger word is known to be "追尾" ("rear-end"), and the news event category is traffic accident.
  • Step 1: Perform word segmentation and named entity recognition on the event sentence, giving:
  • For the BiLSTM, let x_t be the input word vector at time t and h_t the hidden-layer state vector storing all useful information at time t. The LSTM cell follows the standard update equations:
    i_t = σ(U_i x_t + W_i h_{t-1} + b_i)
    f_t = σ(U_f x_t + W_f h_{t-1} + b_f)
    c̃_t = tanh(U_c x_t + W_c h_{t-1} + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    o_t = σ(U_o x_t + W_o h_{t-1} + b_o)
    h_t = o_t ⊙ tanh(c_t)
    where σ is the sigmoid function; U_i, U_f, U_c, U_o are the weight matrices for the input x_t; W_i, W_f, W_c, W_o are the weight matrices for the hidden state; b_i, b_f, b_c, b_o are bias vectors; i_t is the input gate, which determines the values to be updated, and c̃_t is the candidate information for the update; f_t is the forget gate; o_t is the output gate at time t; and h_t is the hidden-layer vector at time t. The forward output for the t-th word in the sentence is fw_t = h_t.
  • For the convolutional neural network (CNN), the convolution C_i = f(w^T x_{i:i-h+1} + b) is applied with a sliding window, where f is the activation function, C_i is the convolved feature, w is the weight matrix, h is the convolution kernel size, x_{i:i-h+1} denotes the window from the i-th word to the (i-h+1)-th word, and b is the bias.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and system for extracting a news event based on a neural network. The method comprises the steps of: carrying out data pre-processing on original text of a training corpus; introducing an event sentence sequence represented by a word vector into a bidirectional long- and short-term memory network, and using the bidirectional long- and short-term memory network to carry out training so as to obtain a semantic feature of each candidate trigger word; introducing the event sentence sequence represented by a word vector into a convolutional neural network, and using the convolutional neural network to carry out training so as to obtain a global feature of an event sentence in which the candidate trigger word is; and according to the semantic feature of the candidate trigger word and the global feature of the sentence in which the candidate trigger word is, using softmax as a classifier to classify each candidate trigger word, so as to find a trigger word of a news event, and according to the type of the trigger word, determining the type to which the event belongs. The method and system can quickly and accurately extract a news event and process a news event included in a non-standard sentence, and have the characteristics of high efficiency and universal applicability.

Description

Method and system for news event extraction based on a neural network

Technical Field

The invention relates to natural language processing, and in particular to a news event extraction method and system based on the combination of a bidirectional long short-term memory (BiLSTM) network and a convolutional neural network (CNN).
Background

With the development of computers and the growing popularity of the Internet, a large amount of information appears in the form of electronic text. Finding valuable news events in this mass of web text has become an urgent problem, and event extraction arose in this context. As a sub-task of information extraction, event extraction is a research hotspot; its goal is to automatically find specific types of events and their event elements in natural text.

Extracting an event from text is usually done by identifying the event's trigger word, so the trigger word is the key to identifying an event instance.

The patent document numbered CN201210321193.1 discloses an event extraction method that uses the morphological structure of trigger words together with sememe similarity to extend the set of trigger words, so that not only event instances corresponding to known trigger words but also instances corresponding to the extended, previously unknown trigger words can be extracted, which improves the recall of event extraction. The patent document numbered CN201410108447.0 discloses a method for extracting atomic news events: a preliminary fusion rule base and an information-unit fusion rule base are first used to fuse part-of-speech and named-entity recognition results, and a core vocabulary and an event-extraction rule base are then used to extract events from the fused information units of the news body.

Against this background, news event extraction faces two main problems. First, the classification of a news event depends mainly on the trigger word itself and ignores context, so ambiguous candidate trigger words easily lead to misclassification. Second, web text, and microblog text in particular, consists largely of non-standard sentences, and current event extraction methods rarely address extracting events from such sentences.
Summary of the Invention

The object of the present invention is to overcome the deficiencies of the prior art by providing a neural-network-based method and system for news event extraction that eliminates candidate trigger-word ambiguity and can handle news events expressed in non-standard sentences.

To achieve this object, the invention adopts the following technical solution.

A neural-network-based method for news event extraction comprises the following steps:

Step S1: Preprocess the raw text of the training corpus: split the raw text into sentences to obtain event sentences, then perform word segmentation and named entity recognition on each event sentence. According to manually annotated news event information, label the event sentence as a sequence: each trigger word is labeled with its type and each non-trigger word is labeled as having no category, yielding the event sentence sequence. Represent the event sentence sequence in the form of word vectors.

Step S2: Pass the event sentence sequence, represented by word vectors, into a bidirectional long short-term memory (BiLSTM) network, and train it to obtain the semantic feature of each candidate trigger word.

Step S3: Pass the event sentence sequence, represented by word vectors, into a convolutional neural network (CNN), and train it to obtain the global feature of the event sentence in which the candidate trigger words are located.

Step S4: According to the semantic features of the candidate trigger words obtained in step S2 and the global feature of their sentence obtained in step S3, classify each candidate trigger word with a softmax classifier to find the trigger words of news events, and determine each event's type from the type of its trigger word.
Step S1 is specifically:

Step S11: Use a natural language processing tool to split the raw training text into sentences and perform word segmentation and named entity recognition, so that the corpus is presented as event sentences each containing several words. An event sentence is expressed as L = {w_1, w_2, ..., w_i, ..., w_n}, where w_i is the i-th word in the event sentence and n is the sentence length.

Step S12: According to the segmentation and named entity recognition results, manually annotate the event sentence: mark each non-trigger word as untyped and label each trigger word with its news event category, obtaining the event sentence sequence.

Step S13: Train word vectors with the open-source toolkit word2vec using the Skip-gram model, and express each word in the event sentence sequence as a 300-dimensional vector according to the trained word vectors.

Step S14: Process each event sentence into a sequence of word vectors, i.e., each candidate trigger word w_i is represented by a 300-dimensional word vector x_i, and the event sentence is expressed as L = {x_1, x_2, ..., x_i, ..., x_n}.
Step S2 is specifically:

Step S21: Suppose the event sentence is expressed as L = {x_1, x_2, ..., x_i, ..., x_n}, where x_i is the word vector of the i-th candidate trigger word and n is the sentence length.

Step S22: Pass L as a sequence into an LSTM network to obtain the output sequence FW = {fw_1, fw_2, ..., fw_i, ..., fw_n}, where fw_i is the semantic feature of the i-th candidate trigger word extracted by the forward LSTM.

Step S23: Reverse L to obtain L' = {x_n, x_n-1, ..., x_i, ..., x_1}, and pass the reversed sequence L' into an LSTM network to obtain the output sequence BW = {bw_1, bw_2, ..., bw_i, ..., bw_n}, where bw_i is the semantic feature of the i-th candidate trigger word extracted by the backward LSTM.

Step S24: Concatenate the outputs FW and BW of the bidirectional LSTM to obtain the output of sentence L, i.e., O = {r_1, r_2, ..., r_i, ..., r_n}, where r_i = [fw_i : bw_i].
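Steps S21 to S24 can be sketched as a minimal NumPy bidirectional LSTM. Everything here is illustrative only: the random parameter initialization, the helper names, and the toy dimensions (the patent uses 300-dimensional word vectors) are assumptions, and the gate equations follow the standard LSTM formulation rather than any implementation detail fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_params(d_in, d_hid):
    """Random parameters U_*, W_*, b_* for one LSTM direction (illustrative)."""
    p = {}
    for g in ("i", "f", "c", "o"):
        p["U_" + g] = rng.normal(scale=0.1, size=(d_hid, d_in))
        p["W_" + g] = rng.normal(scale=0.1, size=(d_hid, d_hid))
        p["b_" + g] = np.zeros(d_hid)
    return p

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(xs, p, d_hid):
    """Run one LSTM direction over the sequence, returning h_t for every step."""
    h, c, out = np.zeros(d_hid), np.zeros(d_hid), []
    for x in xs:
        i = sigmoid(p["U_i"] @ x + p["W_i"] @ h + p["b_i"])   # input gate i_t
        f = sigmoid(p["U_f"] @ x + p["W_f"] @ h + p["b_f"])   # forget gate f_t
        g = np.tanh(p["U_c"] @ x + p["W_c"] @ h + p["b_c"])   # candidate cell state
        o = sigmoid(p["U_o"] @ x + p["W_o"] @ h + p["b_o"])   # output gate o_t
        c = f * c + i * g
        h = o * np.tanh(c)
        out.append(h)
    return out

d_in, d_hid, n = 8, 5, 6                                 # toy sizes, not the patent's 300
sentence = [rng.normal(size=d_in) for _ in range(n)]     # word vectors x_1..x_n

FW = lstm_run(sentence, lstm_params(d_in, d_hid), d_hid)              # forward pass
BW = lstm_run(sentence[::-1], lstm_params(d_in, d_hid), d_hid)[::-1]  # backward pass, re-aligned
O = [np.concatenate([fw, bw]) for fw, bw in zip(FW, BW)]              # r_i = [fw_i : bw_i]
```

Each r_i has dimension 2·d_hid, combining what the forward and backward passes know about the i-th candidate trigger word.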
Step S3 is specifically:

Step S31: Suppose the event sentence is expressed as L = {x_1, x_2, ..., x_i, ..., x_n}, where x_i is the word vector of the i-th word and n is the sentence length.

Step S32: Apply a convolution operation to the event sentence, computed as:

C_i = f(w^T x_{i:i-h+1} + b)

where f is the activation function, C_i is the feature obtained by the convolution, w is the weight matrix, h is the convolution kernel size, x_{i:i-h+1} denotes the window from the i-th word to the (i-h+1)-th word, and b is the bias. By sliding this window over all words, a feature map is obtained.

Step S33: Apply max pooling to the feature map to obtain the global feature C_o of the event sentence.
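Steps S31 to S33 can be sketched as follows. The tanh activation, the single filter, the toy dimensions, and the window orientation (h consecutive word vectors per position) are assumptions made for illustration, since the text does not fix them.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_and_pool(X, w, b, h):
    """Compute C_i = f(w^T x_window + b) over each window of h word vectors,
    then max-pool the feature map into a single global feature C_o."""
    n, d = X.shape
    feats = []
    for i in range(n - h + 1):
        window = X[i:i + h].reshape(-1)        # h consecutive word vectors, flattened
        feats.append(np.tanh(w @ window + b))  # tanh assumed as the activation f
    C = np.array(feats)                        # the feature map
    return C, C.max()                          # max pooling -> global feature C_o

n, d, h = 6, 8, 3                  # toy sentence length, vector dim, kernel size
X = rng.normal(size=(n, d))        # word vectors of one event sentence
w = rng.normal(scale=0.1, size=h * d)
C, Co = conv_and_pool(X, w, 0.0, h)
```

A real CNN layer would use many filters, producing one pooled value per filter, so C_o would then be a vector rather than a scalar.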
Step S4 is specifically:

Step S41: Concatenate the candidate trigger-word semantic features O = {r_1, r_2, ..., r_i, ..., r_n} obtained from the bidirectional LSTM with the global sentence feature C_o extracted by the CNN, obtaining the output vector O_t = [O : C_o].

Step S42: Classify the output vector O_t with softmax to obtain the predicted news event type.
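The classification step reduces to a concatenation followed by a linear layer and softmax. A minimal sketch, with assumed toy dimensions, assumed class names, and a random (untrained) weight matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    """Numerically stable softmax over a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

r_i = rng.normal(size=10)          # BiLSTM semantic feature of one candidate trigger word
C_o = rng.normal(size=4)           # CNN global feature of its sentence (as a small vector)
O_t = np.concatenate([r_i, C_o])   # O_t = [O : C_o]

n_classes = 3                      # e.g. traffic accident, encounter, no-event (assumed)
W = rng.normal(scale=0.1, size=(n_classes, O_t.size))  # untrained classifier weights
probs = softmax(W @ O_t)           # class distribution for this candidate
pred = int(probs.argmax())         # predicted event type
```

Training would fit W (and the BiLSTM/CNN parameters) on the annotated event sentence sequences; here the weights are random, so the prediction is arbitrary.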
A neural-network-based system for news event extraction comprises a text preprocessing module, a neural network training module, and a news event prediction module, wherein:

the text preprocessing module performs data preprocessing on the raw text of the training corpus: it splits the raw text into sentences to obtain event sentences, performs word segmentation and named entity recognition on each event sentence, labels the event sentence as a sequence according to manually annotated news event information (each trigger word labeled with its type, each non-trigger word labeled as having no category) to obtain the event sentence sequence, and represents the event sentence sequence in the form of word vectors;

the neural network training module comprises a BiLSTM training module and a CNN training module; the BiLSTM training module trains on the event sentence sequence represented by word vectors to obtain the semantic feature of each candidate trigger word, and the CNN training module trains on the same sequence to obtain the global feature of the event sentence in which the candidate trigger words are located;

the news event prediction module classifies each candidate trigger word with a softmax classifier, according to the semantic features of the candidate trigger words and the global feature of their sentence obtained by the neural network training module, to find the trigger words of news events and determine each event's type from the type of its trigger word.
Advantageous Effects: By adopting the above technical solution, the present invention has the following benefits compared with the prior art:

1. The invention employs a bidirectional long short-term memory (BiLSTM) network, which can disambiguate candidate trigger words from their context. For example, in "一辆车撞上了高速公路的护栏" ("A car crashed into the guardrail of the expressway") and "今天我去吃饭的时候正好撞上了好久不见的同学" ("When I went to dinner today, I ran into a classmate I had not seen for a long time"), the trigger word in both sentences is "撞上" ("crashed into / ran into"); the former is a traffic accident event while the latter is an encounter event. When the BiLSTM extracts the semantic information of a candidate trigger word, the word's actual meaning can be judged from the sentence context, which effectively avoids lexical ambiguity and improves the accuracy of news event classification.

2. The invention uses the global sentence feature extracted by the convolutional neural network (CNN). When a sentence is non-standard, the event category can still be determined accurately from the global sentence feature together with the semantic features of the candidate trigger words. The invention can therefore handle news event recognition in non-standard sentences.
Brief Description of the Drawings

FIG. 1 is a flowchart of the neural-network-based news event extraction method and system provided by the invention;

FIG. 2 shows the workflow of the key steps of news event extraction based on the bidirectional long short-term memory (BiLSTM) network and the convolutional neural network (CNN);

FIG. 3 is a schematic diagram of the structure of the convolutional neural network (CNN).
Detailed Description

The invention is further described below through specific embodiments.

As shown in FIG. 1, a neural-network-based news event extraction system comprises a text preprocessing module, a neural network training module, and a news event prediction module, wherein:

the text preprocessing module performs data preprocessing on the raw text of the training corpus: it splits the raw text into sentences to obtain event sentences, performs word segmentation and named entity recognition on each event sentence, labels the event sentence as a sequence according to manually annotated news event information (each trigger word labeled with its type, each non-trigger word labeled as having no category) to obtain the event sentence sequence, and represents the event sentence sequence in the form of word vectors;

the neural network training module comprises a BiLSTM training module and a CNN training module; the BiLSTM training module trains on the event sentence sequence represented by word vectors to obtain the semantic feature of each candidate trigger word, and the CNN training module trains on the same sequence to obtain the global feature of the event sentence in which the candidate trigger words are located;

the news event prediction module classifies each candidate trigger word with a softmax classifier, according to the semantic features of the candidate trigger words and the global feature of their sentence obtained by the neural network training module, to find the trigger words of news events and determine each event's type from the type of its trigger word.
The present invention is further described below with a specific example.
In this example of the neural-network-based news event extraction method, the example sentence is: "11时25分,S20外圈沪渝立交发生一起3车追尾事故。" ("At 11:25, a three-car rear-end collision occurred at the Hu-Yu interchange on the S20 outer ring.") In this sentence, the event trigger word is known to be "追尾" ("rear-end collision"), and the news event category it belongs to is traffic accident.
Step 1: Perform word segmentation and named entity recognition on the event sentence, which yields:
11时25分\O S20外圈\O 沪渝立交\O 发生\O 一起\O 3车\O 追尾\Y 事故\O
The event trigger word "追尾" is labeled as belonging to a traffic accident, and the remaining candidate trigger words are labeled as having no category, giving the sequence annotation of the event sentence L={w1, w2, ..., wi, ..., wn}, where wi is the i-th word in the event sentence and n is the length of the event sentence.
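As a minimal sketch of this annotation format (assuming the `word\TAG` convention shown above, where `Y` marks a trigger word and `O` marks no category; the helper function name is illustrative), the labeled sequence can be parsed in Python:

```python
# Parse tokens annotated as "word\TAG" into parallel word and label lists.
# "Y" marks a trigger word; "O" marks a word with no event category.
def parse_labeled_sentence(annotated):
    words, labels = [], []
    for token in annotated.split():
        word, _, tag = token.rpartition("\\")  # split on the last backslash
        words.append(word)
        labels.append(tag)
    return words, labels

annotated = r"11时25分\O S20外圈\O 沪渝立交\O 发生\O 一起\O 3车\O 追尾\Y 事故\O"
words, labels = parse_labeled_sentence(annotated)
print(words[6], labels[6])  # 追尾 Y
```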
A sufficiently large corpus is selected and word vectors are trained with the open-source toolkit word2vec using the Skip-gram model, so that each word is represented as a 300-dimensional vector.
The event sentence can then finally be expressed as L={x1, x2, ..., xi, ..., xn}, where xi is the 300-dimensional vector of the i-th word in the event sentence and n is the length of the event sentence.
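A minimal sketch of this vectorization step: in practice the 300-dimensional vectors come from the trained word2vec (Skip-gram) model; here a random embedding table stands in for a trained model (the table and seed are illustrative, not trained parameters):

```python
import numpy as np

# Hypothetical stand-in for a trained word2vec (Skip-gram) model:
# a table mapping each vocabulary word to a 300-dimensional vector.
rng = np.random.default_rng(0)
sentence = ["11时25分", "S20外圈", "沪渝立交", "发生", "一起", "3车", "追尾", "事故"]
embedding = {w: rng.standard_normal(300) for w in sentence}

# The event sentence L = {x1, ..., xn} as an n x 300 matrix.
L = np.stack([embedding[w] for w in sentence])
print(L.shape)  # (8, 300)
```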
Step 2: The event sentence L={x1, x2, ..., xi, ..., xn} to be trained is fed into the bidirectional long short-term memory network (BiLSTM), which is used to obtain the semantic feature of each candidate trigger word, as shown in Figure 2.
The sentence L is passed as a sequence into the long short-term memory network (LSTM), producing the sequence output FW={fw1, fw2, ..., fwi, ..., fwn}, where fwi denotes the semantic feature of the i-th candidate trigger word extracted by the LSTM and n is the length of the event sentence. fwi is computed as follows:
Define xt as the input word vector at time t, ht as the hidden-layer state vector storing all useful information up to time t, σ as the sigmoid layer, Ui, Uf, Uc, Uo as the weight matrices applied to the input xt in the different states, Wi, Wf, Wc, Wo as the weight matrices applied to the hidden state ht, and bi, bf, bc, bo as the bias vectors.
(1) The forget gate at time t is computed as in Eq. (1):
ft = σ(Wf·[ht-1, xt] + bf)    (1)
(2) At time t, the information stored in ht-1 is updated, as in Eqs. (2) and (3):
it = σ(Wi·[ht-1, xt] + bi)    (2)
C̃t = tanh(Wc·[ht-1, xt] + bc)    (3)
where it determines which values are to be updated at time t and C̃t is the candidate update information.
(3) At time t, the information stored at time t-1 is updated to the stored information at time t, as in Eq. (4):
Ct = ft * Ct-1 + it * C̃t    (4)
(4) The output at time t is given by Eq. (5), and ht is updated as in Eq. (6):
ot = σ(Wo·[ht-1, xt] + bo)    (5)
ht = ot * tanh(Ct)    (6)
where ot is the output at time t and ht is the hidden-layer vector at time t. Finally, fwt = ot, i.e., the output for the t-th word of the sentence is fwt.
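Eqs. (1)-(6) can be sketched as one LSTM step in numpy (the dimensions, initialization, and function names below are illustrative, not the patent's trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following Eqs. (1)-(6).

    W maps a gate name to the weight matrix applied to [h_{t-1}, x_t];
    b maps a gate name to its bias vector.
    """
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])       # Eq. (1): forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])       # Eq. (2): input gate
    C_tilde = np.tanh(W["c"] @ hx + b["c"])   # Eq. (3): candidate information
    C_t = f_t * C_prev + i_t * C_tilde        # Eq. (4): cell-state update
    o_t = sigmoid(W["o"] @ hx + b["o"])       # Eq. (5): output gate
    h_t = o_t * np.tanh(C_t)                  # Eq. (6): hidden state
    return o_t, h_t, C_t

# Toy dimensions: 300-dim input word vector, 64-dim hidden state.
rng = np.random.default_rng(1)
d_in, d_h = 300, 64
W = {g: 0.01 * rng.standard_normal((d_h, d_h + d_in)) for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}
x_t = rng.standard_normal(d_in)
o_t, h_t, C_t = lstm_step(x_t, np.zeros(d_h), np.zeros(d_h), W, b)
print(o_t.shape, h_t.shape)  # (64,) (64,)
```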
Similarly, BW={bw1, bw2, ..., bwi, ..., bwn} is obtained for the reversed sequence. Concatenating the bidirectional LSTM outputs FW and BW gives the output of the sentence L after the bidirectional long short-term memory network (BiLSTM), i.e., O={r1, r2, ..., ri, ..., rn}, where ri=[fwi : bwi].
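The forward/backward concatenation ri=[fwi : bwi] can be sketched as follows; to keep the example short, a simple tanh recurrence stands in for the full LSTM cell (the cell, weights, and dimensions are illustrative):

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    """Run a tanh recurrence over a sequence, returning one feature
    vector per position (a stand-in for the LSTM outputs fw/bw)."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        outputs.append(h)
    return outputs

rng = np.random.default_rng(2)
d_in, d_h, n = 300, 64, 8
xs = [rng.standard_normal(d_in) for _ in range(n)]
W_h = 0.01 * rng.standard_normal((d_h, d_h))
W_x = 0.01 * rng.standard_normal((d_h, d_in))
b = np.zeros(d_h)

FW = run_rnn(xs, W_h, W_x, b)                 # forward pass: fw_1 .. fw_n
BW = run_rnn(xs[::-1], W_h, W_x, b)[::-1]     # backward pass: bw_1 .. bw_n
O = [np.concatenate([fw, bw]) for fw, bw in zip(FW, BW)]  # r_i = [fw_i : bw_i]
print(len(O), O[0].shape)  # 8 (128,)
```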
Step 3: The event sentence L={x1, x2, ..., xi, ..., xn} to be trained is fed into the convolutional neural network (CNN), which is used to obtain the global feature of the sentence containing the candidate trigger words, as shown in Figure 3.
(1) A convolution operation is applied to the sentence, computed as in Eq. (7):
Ci = f(wT·xi:i-h+1 + b)    (7)
where f is the activation function, Ci is the feature obtained by convolution, w is the weight matrix, h is the convolution kernel size, i:i-h+1 denotes the window from the i-th word to the (i-h+1)-th word, and b is the bias.
By sliding this window, all words are convolved to obtain the feature map.
(2) Max pooling is applied to the feature map to obtain the sentence feature Co.
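A minimal sketch of Eq. (7) followed by max pooling: a kernel covering h consecutive word vectors is slid over the sentence, and the resulting feature map is max-pooled into the sentence feature Co. This shows a single filter with ReLU as the activation f; the weights, dimensions, and function name are illustrative:

```python
import numpy as np

def conv_and_pool(X, w, b, h):
    """Convolve one kernel over h-word windows (Eq. (7)) and max-pool.

    X: n x d matrix of word vectors; w: flat weight vector of length h*d.
    Each window of h consecutive word vectors is flattened, scored with
    relu(w . window + b), and the feature map is max-pooled to a scalar.
    """
    n = X.shape[0]
    feature_map = []
    for i in range(n - h + 1):
        window = X[i:i + h].reshape(-1)               # h consecutive word vectors
        feature_map.append(max(0.0, w @ window + b))  # ReLU activation f
    return max(feature_map)                           # max pooling -> C_o

rng = np.random.default_rng(3)
n, d, h = 8, 300, 3
X = rng.standard_normal((n, d))   # the event sentence as word vectors
w = 0.01 * rng.standard_normal(h * d)
C_o = conv_and_pool(X, w, 0.0, h)
print(C_o >= 0.0)  # True: ReLU output is non-negative
```

A real CNN layer uses many such filters, so Co becomes a vector with one max-pooled value per filter; the single-filter scalar above is the simplest case.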
Step 4: The semantic features O={r1, r2, ..., ri, ..., rn} of the event sentence L={x1, x2, ..., xi, ..., xn} obtained in Step 2 (where ri is the semantic feature corresponding to candidate trigger word xi) and the global feature Co of the event sentence L obtained in Step 3 are used for classification, thereby determining the news event category to which the event belongs.
The semantic features O={r1, r2, ..., ri, ..., rn} obtained in Step 2 (where ri is the semantic feature corresponding to candidate trigger word xi) and the global feature Co of the event sentence L obtained in Step 3 are concatenated to obtain the output vector Ot=[O : Co]; softmax is then used to classify the output vector Ot, yielding the predicted news event type.
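The final step (concatenate Ot=[O : Co], then classify with softmax) can be sketched as follows for one candidate trigger word; the feature sizes, class count, and weight matrix are illustrative stand-ins for trained values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(4)
r_i = rng.standard_normal(128)    # BiLSTM semantic feature of one candidate word
C_o = rng.standard_normal(16)     # CNN global feature of its sentence
O_t = np.concatenate([r_i, C_o])  # O_t = [O : C_o]

n_classes = 34                    # e.g. event types plus a "no category" class
W = 0.01 * rng.standard_normal((n_classes, O_t.shape[0]))
p = softmax(W @ O_t)              # class distribution for this candidate
print(p.shape, round(float(p.sum()), 6))  # (34,) 1.0
```

The predicted event type is then `p.argmax()`; a candidate assigned the "no category" class is not a trigger word.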
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements should also be regarded as falling within the scope of protection of the present invention.

Claims (6)

  1. A method for neural-network-based news event extraction, characterized by comprising the following steps:
    Step S1: performing data preprocessing on the raw text of the training corpus: splitting the raw text into sentences to obtain event sentences, and then performing word segmentation and named entity recognition on the event sentences; according to manually annotated news event information, sequence-labeling the event sentences, with trigger words labeled by their event type and non-trigger words labeled as having no category, to obtain event sentence sequences; and representing the event sentence sequences in the form of word vectors;
    Step S2: feeding the event sentence sequences represented as word vectors into a bidirectional long short-term memory network, which is trained to obtain the semantic feature of each candidate trigger word;
    Step S3: feeding the event sentence sequences represented as word vectors into a convolutional neural network, which is trained to obtain the global feature of the event sentence in which each candidate trigger word occurs;
    Step S4: according to the semantic features of the candidate trigger words obtained in step S2 and the global features of the sentences containing the candidate trigger words obtained in step S3, using softmax as a classifier to classify each candidate trigger word, thereby identifying the trigger words of news events and determining the event type from the trigger word type.
  2. The method for neural-network-based news event extraction according to claim 1, characterized in that step S1 specifically comprises:
    Step S11: using a natural language processing tool to perform sentence segmentation, word segmentation and named entity recognition on the raw text of the training corpus, so that the raw text is presented as event sentences each containing several words, an event sentence being expressed as L={w1, w2, ..., wi, ..., wn}, where wi is the i-th word in the event sentence and n is the length of the event sentence;
    Step S12: manually annotating the event sentences according to the word segmentation and named entity recognition results, wherein during annotation non-trigger words are labeled as having no type and trigger words are labeled by the news event category to which they belong, yielding event sentence sequences;
    Step S13: training word vectors with the open-source toolkit word2vec using the Skip-gram model, and expressing each word in an event sentence sequence as a 300-dimensional vector according to the trained word vectors;
    Step S14: processing each event sentence into a sequence of word vectors, i.e., representing each candidate trigger word wi by a 300-dimensional word vector xi, so that the event sentence is expressed as L={x1, x2, ..., xi, ..., xn}.
  3. The method for neural-network-based news event extraction according to claim 1, characterized in that step S2 specifically comprises:
    Step S21: supposing an event sentence is expressed as L={x1, x2, ..., xi, ..., xn}, where xi is the word vector of the i-th candidate trigger word and n is the sentence length;
    Step S22: passing L as a sequence into the long short-term memory network to obtain the sequence output FW={fw1, fw2, ..., fwi, ..., fwn}, where fwi denotes the semantic feature of the i-th candidate trigger word extracted by the long short-term memory network;
    Step S23: reversing L, i.e., L′={xn, xn-1, ..., xi, ..., x1}, and passing the reversed sequence L′ into the long short-term memory network to obtain the reversed-sequence output BW={bw1, bw2, ..., bwi, ..., bwn}, where bwi denotes the semantic feature of the i-th candidate trigger word extracted by the reversed long short-term memory network;
    Step S24: concatenating the outputs FW and BW of the bidirectional long short-term memory network to obtain the output of the sentence L after the bidirectional long short-term memory network, i.e., O={r1, r2, ..., ri, ..., rn}, where ri=[fwi : bwi].
  4. The method for neural-network-based news event extraction according to claim 1, characterized in that step S3 specifically comprises:
    Step S31: supposing an event sentence is expressed as L={x1, x2, ..., xi, ..., xn}, where xi is the word vector of the i-th word and n is the sentence length;
    Step S32: applying a convolution operation to the event sentence, computed as:
    Ci = f(wT·xi:i-h+1 + b)
    where f is the activation function, Ci is the feature obtained by convolution, w is the weight matrix, h is the convolution kernel size, i:i-h+1 denotes the window from the i-th word to the (i-h+1)-th word, and b is the bias;
    by sliding this window, convolving all words to obtain the feature map;
    Step S33: applying max pooling to the feature map to obtain the global feature Co of the event sentence.
  5. The method for neural-network-based news event extraction according to claim 1, characterized in that step S4 specifically comprises:
    Step S41: concatenating the candidate trigger word semantic features O={r1, r2, ..., ri, ..., rn} obtained by the bidirectional long short-term memory network with the global sentence feature Co extracted by the convolutional neural network to obtain the output vector Ot=[O : Co];
    Step S42: classifying the output vector Ot with softmax to obtain the predicted news event type.
  6. A neural-network-based news event extraction system, characterized by comprising a text preprocessing module, a neural network training module, and a news event prediction module, wherein:
    the text preprocessing module performs data preprocessing on the raw text of the training corpus, including: splitting the raw text into sentences to obtain event sentences, and then performing word segmentation and named entity recognition on the event sentences; according to manually annotated news event information, sequence-labeling the event sentences, with trigger words labeled by their event type and non-trigger words labeled as having no category, to obtain event sentence sequences; and representing the event sentence sequences in the form of word vectors;
    the neural network training module comprises a bidirectional long short-term memory network training module and a convolutional neural network training module; the bidirectional long short-term memory network training module trains on the event sentence sequences represented as word vectors to obtain the semantic feature of each candidate trigger word; the convolutional neural network training module trains on the event sentence sequences represented as word vectors to obtain the global feature of the event sentence in which each candidate trigger word occurs;
    the news event prediction module classifies each candidate trigger word with softmax as a classifier, according to the semantic features of the candidate trigger words obtained by the neural network training module and the global features of the sentences containing them, thereby identifying the trigger words of news events and determining the event type from the trigger word type.
PCT/CN2017/089136 2017-05-27 2017-06-20 Method and system for extracting news event based on neural network WO2018218706A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710391227.7 2017-05-27
CN201710391227.7A CN107239445A (en) 2017-05-27 2017-05-27 The method and system that a kind of media event based on neutral net is extracted
