CN112580330B

CN112580330B - Vietnamese news event detection method based on Chinese trigger word guidance

Info

Publication number: CN112580330B
Application number: CN202011108823.8A
Authority: CN
Inventors: 高盛祥; 寇梦珂; 余正涛; 王振晗; 朱俊国; 朱恩昌
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2023-09-12
Anticipated expiration: 2040-10-16
Also published as: CN112580330A

Abstract

The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing. The method first uses adversarial learning to map the two languages into the same semantic space. It then fuses entity information during encoding, uses the mapped Chinese trigger word embeddings to guide the model, through an attention mechanism, toward the trigger word information in Vietnamese news, and finally performs multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events. Current event detection requires identifying the trigger words in news, but no Vietnamese corpus annotated with trigger words exists at present; the invention addresses this lack of annotated Vietnamese data by exploiting the abundant annotated Chinese corpus.

Description

Vietnamese news event detection method based on Chinese trigger word guidance

Technical Field

The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing.

Background

Event detection is an active problem in current natural language processing research, and the recognition of trigger words plays a vital role in it. At present, Vietnamese language data is scarce and no Vietnamese data annotated with trigger words exists, so events in Vietnamese news are difficult to detect. Sentences that express the same point of view in different languages generally share the same or similar semantic components. Based on this property, the abundant Chinese trigger word annotations are used to compensate for the missing Vietnamese trigger word annotations.

Disclosure of Invention

The invention provides a Vietnamese news event detection method based on Chinese trigger word guidance. It addresses three problems: Vietnamese data is scarce, no Vietnamese corpus is annotated with trigger words, and texts in different languages are difficult to represent in the same feature space.

The technical scheme of the invention is as follows: the Vietnamese news event detection method based on Chinese trigger word guidance comprises the following specific steps:

Step1: collect news texts for detecting related Chinese-Vietnamese bilingual news events, and de-duplicate and screen the news texts;

Step2: preprocess the Chinese-Vietnamese news texts, including word segmentation and entity labeling; annotate the event types and the Chinese trigger words in the Chinese-Vietnamese news texts; and divide the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;

Step3: map Chinese and Vietnamese into the same semantic space by an adversarial learning method, and extract the mapped Chinese trigger word vectors;

Step4: fuse the Vietnamese word vectors with the entity vectors of the sentence as the input of the BiLSTM layer; use the BiLSTM to obtain the semantic information of Vietnamese news sentences; and use the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in the Vietnamese news sentences;

Step5: finally, perform multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events.

As a further scheme of the invention, in Step1, Scrapy is used as the crawling tool to simulate user operations. Different templates are customized for the Chinese and Vietnamese news websites, each formulated from the XPath paths of the page data elements, to obtain the news headline, news time and news body text.

As a further scheme of the present invention, the specific steps of Step2 are as follows:

Step2.1: with reference to the ACE event annotation scheme, annotate the trigger words and event types in the Chinese news texts and the event types in the Vietnamese news texts, dividing the event types into seven categories, namely "Chuyến thăm", "Gặp", "Tiếp xúc", "Thuộc kinh tế", "Thay đổi", "Giao dịch" and "Cuộc xung đột";

Step2.2: divide the experimental data into a training corpus, a test corpus and a validation corpus.

As a further scheme of the present invention, the specific steps of Step3 are as follows:

Step3.1: use an extended skip-gram model to predict the context of each target word in Chinese and, at the same time, the context of its aligned word in Vietnamese, thereby obtaining Chinese-Vietnamese bilingual word vectors;

Step3.2: project Chinese into the same semantic space as Vietnamese with a mapping function, and train the word discriminator and the mapping function in turn by stochastic gradient descent;

Step3.3: given Chinese news text, mark the trigger words in its sentences.

As a further scheme of the invention, the specific steps for obtaining the semantic information of Vietnamese news with the BiLSTM in Step4 are as follows:

Step4.1: pre-train Vietnamese word vectors on a Vietnamese corpus to obtain a word vector vocabulary; using the entity tag types of the underthesea tool, randomly initialize one entity vector per entity tag to obtain an entity vector vocabulary; and convert all input words and entity tags into low-dimensional vectors by looking them up in the word vector vocabulary and the entity vector vocabulary;

Step4.2: concatenate the word vectors and entity vectors as the input of the BiLSTM to capture the semantic information in sentences.

As a further scheme of the present invention, the specific steps of Step5 are as follows: the trigger word information extracted from the Vietnamese sentences is input to a classification layer, and a softmax classifier classifies the event types of the Vietnamese news sentences, thereby detecting Vietnamese news events.

The beneficial effects of the invention are as follows:

1. The method maps the two languages into the same semantic space by adversarial learning: the mapping function brings the Chinese vectors ever closer to the Vietnamese vectors until the discriminator can no longer tell the two languages apart, after which the mapped Chinese trigger word vectors are extracted;

2. The method uses a BiLSTM to mine the implicit contextual semantic information of event sentences, uses the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in Vietnamese sentences, and finally uses the resulting attention context vector for multi-class classification of event types;

3. The method exploits the consistency between the two languages and uses the abundant Chinese trigger word annotations to find the trigger word information in Vietnamese news sentences, which is then classified by a softmax layer;

4. The method addresses the absence of annotated trigger words in the Vietnamese event detection task.

Drawings

FIG. 1 is a flow chart of the Vietnamese news event detection based on Chinese trigger word guidance provided by the invention;

FIG. 2 is a diagram of the model for Vietnamese news event detection based on Chinese trigger word guidance.

Detailed Description

Example 1: as shown in FIG. 1 and FIG. 2, the Vietnamese news event detection method based on Chinese trigger word guidance comprises the following specific steps:

Step1: collect news texts for detecting related Chinese-Vietnamese bilingual news events. Vietnamese news websites (Vietnam News Agency, Vietnam Economic Hours and Vietnam Doors) are crawled, and Chinese news websites (Baidu, Xinhua Net and People's Daily Online) are crawled for the same news topics as the Vietnamese crawl. In total, 813 Vietnamese news texts and 4,065 Chinese news texts are collected. Finally, the news texts are de-duplicated and screened;

In Step1, the preferred scheme of the invention uses Scrapy as the crawling tool to simulate user operations. Different templates are customized for the Chinese and Vietnamese news websites, each formulated from the XPath paths of the page data elements, to obtain data such as the news headline, news time and news body text.

The design of this preferred scheme is an important component of the invention; it mainly provides data support for corpus collection and for the recognition of event temporal relations.
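The template-driven extraction and de-duplication of Step1 can be sketched as follows. This is only an illustration under stated assumptions: the patent uses Scrapy, whereas this sketch uses the standard-library `xml.etree.ElementTree` on well-formed markup so that it runs without a crawler; the site keys, XPath expressions and sample page are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical per-site templates: field name -> XPath of the page element.
TEMPLATES = {
    "zh_site": {
        "title": ".//h1[@class='title']",
        "time": ".//span[@class='time']",
        "body": ".//div[@id='content']",
    },
    "vi_site": {
        "title": ".//h1[@class='headline']",
        "time": ".//time",
        "body": ".//div[@class='article']",
    },
}


def extract(page_markup, template):
    """Apply one site template to a page and return headline, time and body."""
    root = ET.fromstring(page_markup)
    record = {}
    for field, xpath in template.items():
        node = root.find(xpath)
        text = "".join(node.itertext()) if node is not None else ""
        record[field] = " ".join(text.split())  # normalize whitespace
    return record


def deduplicate(records):
    """Drop records with an empty body or a body already seen (by hash)."""
    seen, unique = set(), []
    for rec in records:
        if not rec["body"]:
            continue  # screening: skip pages where no body text was found
        key = hashlib.sha1(rec["body"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


page = (
    "<html><body><h1 class='title'>A</h1><span class='time'>2020-10-16</span>"
    "<div id='content'><p>Hello</p> <p>world</p></div></body></html>"
)
record = extract(page, TEMPLATES["zh_site"])
corpus = deduplicate([record, dict(record)])
```

Keeping one template per site means that a layout change on one website only requires updating that site's XPath entries.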

Step2: preprocess the Chinese-Vietnamese news texts, including word segmentation and entity labeling; annotate the event types and the Chinese trigger words in the Chinese-Vietnamese news texts; and divide the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus in the ratio 8:1:1;

Step3: map the two languages into the same semantic space by an adversarial learning method, and extract the mapped Chinese trigger word vectors;

As a preferred embodiment of the present invention, the specific steps of Step2 are:

Step2.1: an event in the invention consists of a trigger word and arguments. The trigger word clearly expresses the occurrence of one kind of event and is usually a single verb or noun; the arguments describe information such as the time, place and participants of the event. For the customized Chinese-Vietnamese bilingual related news events, the Chinese trigger words and the event types in the Chinese-Vietnamese news texts are annotated;

Step2.2: following the format of the ACE2005 dataset, 7 event types are defined, covering 25,089 news sentences in total;

Step2.3: divide the experimental data into a training corpus, a test corpus and a validation corpus.
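The 8:1:1 division of Step2 can be sketched as below. Shuffling before the split and the fixed seed are assumptions made for reproducibility; the text only specifies the ratio.

```python
import random


def split_corpus(sentences, ratios=(8, 1, 1), seed=0):
    """Split annotated sentences into training, test and validation corpora."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumption: shuffle before splitting
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    valid = items[n_train + n_test:]  # remainder goes to validation
    return train, test, valid


train, test, valid = split_corpus(range(1000))
```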

As a preferred embodiment of the present invention, in Step2 the event types are divided into seven categories, namely "Chuyến thăm", "Gặp", "Tiếp xúc", "Thuộc kinh tế", "Thay đổi", "Giao dịch" and "Cuộc xung đột".

As a preferred embodiment of the present invention, the specific steps of Step3 are:

Step3.1: pre-train the Chinese word vectors and the Vietnamese word vectors with the extended skip-gram model, where $E$ and $N$ are the respective vocabulary sizes, and $d_s$ and $d_z$ are the Chinese and Vietnamese word vector dimensions. Chinese is then projected into the same semantic space as Vietnamese by the mapping function $f$,

where $U$ is the mapping matrix and the output of $f$ is the projected Chinese word vector.

The transformation matrix $U$ is constrained to be orthogonal by means of singular value decomposition (SVD), which reduces the parameter search space:
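The formula for this constraint is not reproduced in this text. One common way to keep a mapping matrix orthogonal with SVD, given here only as an assumption about what the step may look like, is to replace $U$ after each update by its nearest orthogonal matrix:

```latex
U = A \Sigma B^{\top} \quad \text{(singular value decomposition)}, \qquad
U \leftarrow A B^{\top}, \qquad \text{so that } U^{\top} U = I .
```

With this projection the search is restricted to orthogonal matrices, which matches the stated purpose of shrinking the parameter search space.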

Step3.2: to optimize the mapping function $f$, a multi-layer perceptron is introduced as the word discriminator. It takes the Vietnamese word vectors and the mapped Chinese word vectors as input and outputs a single scalar, which represents the probability that the input vector comes from the Vietnamese vocabulary. The word discriminator uses a binary cross-entropy loss with the labels:

$$y_i = \delta_i(1 - 2\epsilon) + \epsilon \tag{4}$$

where $\delta_i = 1$ indicates that the word is from $z$ and $\delta_i = 0$ indicates that the word is from $s$. $I_{s;z}$ is the number of words sampled jointly from the vocabularies of $z$ and $s$. $\epsilon$ is the smoothing value added to the positive and negative labels.
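The smoothed labels of equation (4) and their use in a binary cross-entropy loss can be sketched as follows. The loss formula itself does not appear in this text, so the mean binary cross-entropy below is the standard form and an assumption; the value `eps=0.1` and the function names are likewise illustrative.

```python
import math


def smoothed_label(delta, eps=0.1):
    """Equation (4): y = delta * (1 - 2 * eps) + eps."""
    return delta * (1.0 - 2.0 * eps) + eps


def bce_loss(probs, deltas, eps=0.1):
    """Mean binary cross-entropy between discriminator outputs and smoothed labels.

    probs  -- discriminator outputs in (0, 1), one per sampled word
    deltas -- 1 if the word is Vietnamese (z), 0 if it is mapped Chinese (s)
    """
    total = 0.0
    for p, delta in zip(probs, deltas):
        y = smoothed_label(delta, eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


probs = [0.9, 0.1]   # discriminator is fairly sure: first word z, second word s
deltas = [1, 0]
loss_discriminator = bce_loss(probs, deltas)
# Flipping the labels gives the objective the mapping function tries to minimize.
loss_mapping = bce_loss(probs, [1 - d for d in deltas])
```

When the discriminator is right, its own loss is small and the flipped-label loss is large, which is what drives the mapping function to make the two languages indistinguishable.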

The mapping function $f$ and the word discriminator $D$ are two adversarial components. With the word labels flipped, the mapping function is optimized by minimizing the loss:

$$y_i = \delta_i(1 - 2\epsilon) + \epsilon \tag{6}$$

The two languages are mapped into the same semantic space by adversarial learning: the word discriminator and the mapping function are trained in turn with stochastic gradient descent (SGD), each minimizing its own loss.

Step3.3: given Chinese news text, mark the trigger words in its sentences. The Chinese trigger words are mapped into the same semantic space as Vietnamese through the mapping matrix, and all mapped Chinese trigger words are converted into a set of mapped vectors $G = \{g_1, g_2, \ldots, g_m\}$, which is used to capture the hidden trigger words in Vietnamese sentences.
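Applying the mapping matrix to the annotated Chinese trigger words can be sketched as below. The orientation $g = U x$ is an assumption, since the mapping formula is not reproduced in this text; the example matrix and vectors are made up.

```python
def map_vector(U, x):
    """Project one Chinese word vector x (length d_s) with U (d_z rows, d_s columns)."""
    return [sum(u * v for u, v in zip(row, x)) for row in U]


def map_triggers(U, trigger_vectors):
    """Return the set G of mapped trigger word vectors, keyed by trigger word."""
    return {word: map_vector(U, vec) for word, vec in trigger_vectors.items()}


# Toy example: a 2x2 rotation by 90 degrees, which is an orthogonal matrix.
U = [[0.0, -1.0],
     [1.0, 0.0]]
chinese_triggers = {"攻击": [1.0, 0.0], "访问": [0.0, 2.0]}
G = map_triggers(U, chinese_triggers)
```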

The design of this preferred scheme is an important component of the invention. It mainly provides the vector encoding process: the two languages are mapped into the same semantic space by means of bilingual word vectors. This lays the groundwork for the subsequent step, in which the mapped Chinese trigger words guide the model to find the trigger words in Vietnamese sentences.

As a preferred scheme of the invention, the Chinese trigger word information is used, through an attention mechanism, to strengthen the trigger word semantic information in Vietnamese news sentences, wherein:

In Step4, the specific steps for obtaining the semantic information of Vietnamese news with the BiLSTM are as follows:

Step4.1: given a Vietnamese news sentence $S = \{w_1, w_2, \ldots, w_n\}$ containing $n$ words, each word $w_i$ in $S$ is tagged by underthesea with an entity type $e_i$. The word vector corresponding to $w_i$ is then looked up in the word vector vocabulary, and the entity vector corresponding to $e_i$ is looked up in the entity vector vocabulary. Finally, the word vector and the entity vector are concatenated to form the final vector representation $v_i$ of $w_i$:

Every word $w_i$ in $S$ is represented as a vector $v_i$ in the manner described above, and the vectors are joined with a concatenation operator, so that the semantic representation matrix $M_s$ of the sentence $S$ is:
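The lookup and concatenation of Step4.1 can be sketched as follows. The tag names (`B-LOC`, `O`), the vector dimensions, the uniform initialization range and the toy word table are illustrative assumptions; in the method the tags come from underthesea and the word vectors from pre-training.

```python
import random


def build_entity_table(entity_tags, dim, seed=0):
    """Randomly initialize one entity vector per entity tag type."""
    rng = random.Random(seed)
    return {tag: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tag in entity_tags}


def encode_sentence(words, tags, word_table, entity_table):
    """Return M_s: one row v_i per word, the word vector followed by the entity vector."""
    return [word_table[w] + entity_table[t] for w, t in zip(words, tags)]


# Hypothetical pre-trained word vectors (dimension 3) for a two-word sentence.
word_table = {"Hà_Nội": [0.1, 0.2, 0.3], "họp": [0.4, 0.5, 0.6]}
entity_table = build_entity_table(["B-LOC", "O"], dim=2)
M_s = encode_sentence(["Hà_Nội", "họp"], ["B-LOC", "O"], word_table, entity_table)
```

Each row of `M_s` has the word vector dimension plus the entity vector dimension, and the rows are fed to the BiLSTM in sentence order.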

As a preferred scheme of the invention, because different languages are consistent within the same news topic, the annotated Chinese trigger word information is used to guide the search for trigger word information in Vietnamese news texts.

The BiLSTM:

Sentences are modeled with a BiLSTM that runs over the sequence of concatenated word and entity embeddings. A BiLSTM can be viewed as two unidirectional LSTMs, one forward and one backward, so that the output at the current time step is linked both to the state at the previous time step and to the state at the following time step.

The vector of each word in the Vietnamese news sentence is fed in turn into a neural network built from BiLSTM units, giving the hidden layer vectors of the sentence $h = \{h_1, h_2, \ldots, h_n\}$, where $h_i$ is the hidden layer representation of the $i$-th word in the sentence. At each step of this phase, the forward LSTM takes the input $w_t$ at time $t$ and the previous hidden state $h_{t-1}$ and computes the current forward hidden state. The LSTM is then run in reverse to generate the backward hidden layer representation.

The forward LSTM and the backward LSTM together form the BiLSTM. Unlike a unidirectional LSTM, the input data is processed in both the forward and backward directions, and the resulting hidden states are concatenated to serve as the input of the next layer.
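The hidden state formulas are not reproduced in this text. In the usual BiLSTM notation, and assuming the symbols used above, the description corresponds to:

```latex
\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}\left(w_t, \overrightarrow{h_{t-1}}\right), \qquad
\overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}\left(w_t, \overleftarrow{h_{t+1}}\right), \qquad
h_t = \left[\, \overrightarrow{h_t} \,;\, \overleftarrow{h_t} \,\right]
```

Here $[\cdot\,;\cdot]$ denotes concatenation, so each $h_t$ has twice the dimension of a single-direction hidden state.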

The attention mechanism:

Each type of event is typically triggered by a specific set of words, which are called event trigger words. For example, events of one type are typically triggered by words such as "fight" and "attack". Event trigger words are therefore important clues for the event detection task. Given a set of Chinese trigger word vectors $G = \{g_1, g_2, \ldots, g_m\}$ and the BiLSTM hidden states $h = \{h_1, h_2, \ldots, h_n\}$, the attention weights between each trigger word vector $g_i$ ($i = 1, 2, \ldots, m$) and the hidden states $h$ are computed, yielding a set of attention weight vectors $\alpha = \{\alpha_1, \ldots, \alpha_m\}$. Specifically, the attention weight between the $k$-th Chinese trigger word vector $g_k$ in $G$ and the hidden state $h_t$ at time $t$ is calculated by equation (10), under which the trigger words of the target event type in the Vietnamese news are expected to receive higher weights than other words.

Once the attention weights between $g_k$ and the hidden states at all time steps $h = \{h_1, h_2, \ldots, h_n\}$ have been calculated, an attention weight vector $\alpha_k = [\alpha^1, \alpha^2, \ldots, \alpha^n]$ is obtained. After all of $G = \{g_1, g_2, \ldots, g_m\}$ has been traversed, a set of attention weight vectors $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is obtained. The vector containing the element with the largest weight is then selected from this set as the final attention weight vector of the current input sentence, denoted $\alpha_{max} = [\alpha^1, \alpha^2, \ldots, \alpha^n]$. This selection is made because a larger attention weight identifies the Chinese trigger word vector in $G$ that is most relevant to the current input sentence.

Finally, $\alpha_{max}$ is used to form a weighted sum over $h$, giving the vector representation $S_{att}$ of the current input sentence, as in equation (11):

$$S_{att} = \sum_{i} \alpha^{i} h_i \tag{11}$$

where $i = 1, 2, \ldots, n$.
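The trigger-guided attention, the selection of $\alpha_{max}$ and the weighted sum of equation (11) can be sketched as follows. Equation (10) is not reproduced in this text, so the scoring function here (a dot product followed by a softmax over time steps) is an assumption, as is the requirement that trigger vectors and hidden states share a dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(g_k, hidden):
    """ASSUMED stand-in for equation (10): softmax of dot products g_k . h_t."""
    return softmax([sum(a * b for a, b in zip(g_k, h_t)) for h_t in hidden])


def sentence_vector(G, hidden):
    """Pick the weight vector holding the largest element, then apply equation (11)."""
    alphas = [attention_weights(g_k, hidden) for g_k in G]
    alpha_max = max(alphas, key=max)  # vector that contains the largest weight
    dim = len(hidden[0])
    s_att = [sum(a * h[j] for a, h in zip(alpha_max, hidden)) for j in range(dim)]
    return alpha_max, s_att


hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # toy hidden states h_1..h_3
G = [[5.0, 0.0], [0.0, 1.0]]                    # toy mapped trigger vectors
alpha_max, s_att = sentence_vector(G, hidden)
```

In this toy input the first trigger vector aligns strongly with the first hidden state, so its weight vector is selected and the sentence vector is dominated by $h_1$.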

This preferred scheme uses the Chinese trigger words to guide the search for trigger word information in Vietnamese, which is then used to classify event types. The BiLSTM extracts information in both the forward and backward directions, which addresses the long-distance dependency problem and mines the implicit semantic information of event sentences more effectively. The attention mechanism increases the weight of the trigger words of the current event, so that the Vietnamese event detection task achieves its best results.

As a preferred embodiment of the present invention, the specific steps of Step5 are: the extracted Vietnamese trigger word information is input to a classification layer, and a softmax classifier classifies the Vietnamese news events, thereby detecting Vietnamese news events.

As a preferred embodiment of the present invention, the event types are divided into seven categories on the basis of the related Chinese and Vietnamese news stories.

As a preferred embodiment of the present invention, the vector representation $S_{att}$ of the current input sentence is input to a softmax layer to obtain the probability distribution $P$ over the event types to be predicted:

$$P = \mathrm{softmax}(W \cdot S_{att} + b) \tag{12}$$

where $W$ and $b$ are the weight and bias of the softmax layer, respectively.
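Equation (12) can be sketched in a few lines. The weights below are made-up values chosen so that one class clearly wins; in the method, $W$ and $b$ are learned during training.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, b, s_att):
    """Equation (12): P = softmax(W . S_att + b); also return the argmax index."""
    logits = [sum(w * x for w, x in zip(row, s_att)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    return probs, probs.index(max(probs))


# Seven event types, sentence vector of dimension 2 (toy sizes).
W = [[0.0, 0.0] for _ in range(7)]
W[3] = [2.0, 1.0]
b = [0.0] * 7
probs, predicted = classify(W, b, [1.0, 0.5])
```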

The Chinese trigger words used in this preferred scheme act as a constraint and help to better identify event temporal relations.

Step6: experiments are carried out separately on the encoding features and on the presence or absence of Chinese trigger words to demonstrate that the model settings are reasonable and effective, and the model is compared with existing models to show that the method performs better on Vietnamese event detection.

The experiments use precision (P), recall (R) and the F value (F) as evaluation metrics for the comparisons.

Precision (P): the proportion of correctly predicted events among all predicted events.

Recall (R): the proportion of correctly predicted events among all real events.
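The three metrics can be computed from event counts as sketched below. The text does not define the F value, so the harmonic mean of precision and recall (F1) is assumed; the counts in the example are made up.

```python
def precision_recall_f(num_correct, num_predicted, num_gold):
    """Return (P, R, F) from counts of correct, predicted and real events.

    F is assumed to be the harmonic mean of P and R (F1).
    """
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Made-up counts: 60 correct out of 80 predicted events, with 100 real events.
p, r, f = precision_recall_f(60, 80, 100)
```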

To verify whether the proposed model improves event detection, a first set of experiments was set up. It compares the proposed model with the CNN model and the GCN model on the Vietnamese news dataset, and with a baseline model (TBNNAM) in the setting without annotated trigger words. The experimental results are shown in Table 1:

Table 1. Performance of different models

The comparison shows that the LSTM-based model outperforms the CNN, mainly because the LSTM mitigates the vanishing and exploding gradient problems during training. The global information captured by the last LSTM state in the TBNNAM model is also important to this task, and it complements the local information captured by the attention mechanism. The BiLSTM used here captures more of the semantic information in sentences than an LSTM does. The experimental results show that the proposed model performs better.

Regarding the encoding features fused in the word embedding layer, a second set of experiments was set up to verify whether fusing entity information into the word vectors improves event detection. This experiment compares the model before and after the entity vectors are added. The experimental results are shown in Table 2:

Table 2. Effect of encoding features on model performance

The comparison shows that entity annotation helps capture the semantic information of words. After the entity vectors are added, the precision, recall and F value of the model all improve over the model without them, so adding entity vectors improves event detection performance.

Because of the complexity of Vietnamese, it is difficult to annotate trigger words in its sentences, so a third set of experiments was set up to verify whether incorporating Chinese trigger words improves event detection. The experiment compares event detection with and without Chinese trigger words; the results are shown in Table 3:

Table 3. Model performance with and without Chinese trigger word guidance

The comparison shows that the model with trigger word annotations performs clearly better than the model without them. Different languages are consistent for sentences describing the same news event, so the Chinese trigger words can be used to find the trigger words in the corresponding Vietnamese sentences and thereby complete event detection for Vietnamese news.

These results show that fusing entity information captures the semantic information of sentences better, and that guiding the model with Chinese trigger words finds the trigger word information in Vietnamese sentences, thereby achieving Vietnamese news event detection.

Although the present invention has been described in detail with reference to the drawings, it is not limited to the above embodiments; those skilled in the art can make various changes within the scope of their knowledge without departing from the spirit of the invention.

Claims

1. A Vietnamese news event detection method based on Chinese trigger word guidance, characterized by comprising the following steps:

Step2: perform word segmentation and entity labeling preprocessing on the Chinese-Vietnamese news texts, annotate the event types and the Chinese trigger words in the Chinese-Vietnamese bilingual news texts, and divide the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;

Step5: finally, perform multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events;

the specific steps of Step3 are as follows:

Step3.3: given Chinese news text, mark the trigger words in its sentences;

2. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that: in Step1, Scrapy is used as the crawling tool to simulate user operations, different templates are customized for the Chinese and Vietnamese news websites, and each template is formulated from the XPath paths of the page data elements to obtain the news headline, news time and news body text.

3. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that the specific steps of Step2 are as follows:

with reference to the ACE event annotation scheme, annotate the trigger words and event types in the Chinese news texts and the event types in the Vietnamese news texts, dividing the event types into seven categories, namely "Chuyến thăm", "Gặp", "Tiếp xúc", "Thuộc kinh tế", "Thay đổi", "Giao dịch" and "Cuộc xung đột";

4. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that the specific steps of Step5 are as follows: the trigger word information extracted from the Vietnamese sentences is input to a classification layer, and a softmax classifier classifies the event types of the Vietnamese news sentences, thereby detecting Vietnamese news events.