CN112580330A

CN112580330A - Vietnamese news event detection method based on Chinese trigger word guidance

Info

Publication number: CN112580330A
Application number: CN202011108823.8A
Authority: CN
Inventors: 高盛祥; 寇梦珂; 余正涛; 王振晗; 朱俊国; 朱恩昌
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2021-03-30
Anticipated expiration: 2040-10-16
Also published as: CN112580330B

Abstract

The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing. The method first maps the two languages into the same semantic space by adversarial learning, then integrates entity information during encoding, uses the mapped Chinese trigger word embeddings to guide the model, through an attention mechanism, to attend to the trigger word information in Vietnamese news, and finally performs multi-class classification of event types using the obtained trigger word information, thereby detecting Vietnamese news events. Event detection currently requires identifying the trigger words in news, but no Vietnamese corpus annotated with trigger words exists at present; by drawing on the rich Chinese annotated corpus, the invention addresses this lack of Vietnamese annotated corpora.

Description

Vietnamese news event detection method based on Chinese trigger word guidance

Technical Field

The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing.

Background

Event detection is a hot topic in current natural language processing research, and the recognition of trigger words plays a crucial role in the event detection task. At present, Vietnamese data are scarce and no Vietnamese data annotated with trigger words are available, so events in Vietnamese news are difficult to detect. Since sentences that express the same viewpoint in different languages usually share the same or similar semantic components, using the rich Chinese trigger word annotations to compensate for the missing Vietnamese trigger word annotations is of considerable significance.

Disclosure of Invention

The invention provides a Vietnamese news event detection method based on Chinese trigger word guidance, which is used to solve the problems that Vietnamese data are currently scarce, that no corpus annotated with Vietnamese trigger words exists, and that texts in different languages are difficult to represent in the same feature space.

The technical scheme of the invention is as follows: the Vietnamese news event detection method based on the Chinese trigger word guidance comprises the following specific steps of:

Step1, collecting news texts for Chinese-Vietnamese bilingual related news event detection, and de-duplicating and screening the news texts;

Step2, preprocessing the Chinese and Vietnamese news texts by word segmentation, entity labeling and the like, labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;

Step3, mapping the two languages into the same semantic space by adopting an adversarial learning method, and extracting the mapped Chinese trigger word vectors;

Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of a BiLSTM layer; acquiring the semantic information of Vietnamese news sentences by using the BiLSTM, and using the mapped Chinese trigger words to guide the model, through an attention mechanism, in finding the trigger word information in the Vietnamese sentences;

Step5, finally, performing multi-class classification of event types by using the obtained trigger word information, thereby realizing Vietnamese news event detection.

In Step1, Scrapy is used as the crawling tool to simulate user operations; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements so as to obtain the detailed data, and the news headline, news time and news text data are obtained.

As a further scheme of the present invention, the Step2 specifically comprises the following steps:

Step2.1, annotating the trigger words and event types in the Chinese news texts and the event types in the Vietnamese news texts with reference to the ACE event annotation scheme, and dividing the event types into seven types, including "… xúc", "Giao …" and "xung …";

Step2.2, dividing the experimental data into a training corpus, a test corpus and a validation corpus.

As a further scheme of the invention, the Step3 comprises the following specific steps:

Step3.1, adopting an extended skip-gram model method in which a target word in Chinese predicts its own context and, simultaneously, the context of its aligned word in Vietnamese, so as to obtain Chinese-Vietnamese bilingual word vectors;

Step3.2, projecting Chinese into the same semantic space as Vietnamese by using a mapping function, and training a word discriminator and the mapping function alternately by stochastic gradient descent;

Step3.3, given a Chinese news text, annotating the trigger words in its sentences.
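The extended skip-gram objective of Step3.1 can be illustrated by the training pairs it would generate. The following is a minimal sketch rather than the implementation of the invention: it assumes a sentence pair whose word alignment is already available, and the function name `bilingual_skipgram_pairs`, the alignment list, the window size and the toy sentences are all hypothetical. It only enumerates (target, context) pairs and does not train any vectors.

```python
def bilingual_skipgram_pairs(zh_tokens, vi_tokens, alignment, window=1):
    """Enumerate (target, context) pairs for an extended skip-gram model.

    alignment is a list of (zh_index, vi_index) word links (assumed given).
    Each Chinese target word predicts its own Chinese context and the
    Vietnamese context around its aligned word.
    """
    def context(tokens, center):
        lo = max(0, center - window)
        hi = min(len(tokens), center + window + 1)
        return [tokens[j] for j in range(lo, hi) if j != center]

    aligned = dict(alignment)
    pairs = []
    for i, target in enumerate(zh_tokens):
        # Monolingual pairs: Chinese word -> Chinese neighbours.
        pairs += [(target, c) for c in context(zh_tokens, i)]
        # Cross-lingual pairs: Chinese word -> neighbours of its aligned word.
        if i in aligned:
            pairs += [(target, c) for c in context(vi_tokens, aligned[i])]
    return pairs


# Toy aligned sentence pair (illustrative only).
zh = ["两国", "签署", "协议"]
vi = ["hai_nước", "ký", "thỏa_thuận"]
pairs = bilingual_skipgram_pairs(zh, vi, alignment=[(0, 0), (1, 1), (2, 2)])
```

In a full system these pairs would feed an ordinary skip-gram trainer, so that words with similar bilingual contexts end up with similar vectors.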

As a further scheme of the present invention, the Step4, which adopts BiLSTM to obtain the semantic information of Vietnamese news, specifically comprises the following steps:

Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector vocabulary, randomly initializing an entity vector for each entity tag by using the entity tag types of the underthesea tool to obtain an entity vector vocabulary, and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector vocabulary and the entity vector vocabulary;

Step4.2, concatenating the word vector and the entity vector as the input of the BiLSTM, and capturing the semantic information in the sentence.

As a further scheme of the present invention, the Step5 specifically comprises the following steps: inputting the trigger word information extracted from the Vietnamese sentences into a classification layer, and classifying the event types of the Vietnamese news sentences by adopting a softmax classifier, thereby realizing the detection of Vietnamese news events.

The invention has the beneficial effects that:

1. The Vietnamese news event detection method based on Chinese trigger word guidance maps the two languages into the same semantic space by an adversarial learning method: a mapping function brings the Chinese representations ever closer to the Vietnamese ones until the discriminator can no longer distinguish the two languages, after which the mapped Chinese trigger word vectors are extracted;

2. The Vietnamese news event detection method based on Chinese trigger word guidance uses a BiLSTM to mine the implicit contextual semantic information of event sentences, uses the mapped Chinese trigger words to guide the model, through the attention mechanism, in finding the trigger word information in Vietnamese sentences, and finally uses the resulting attention context vector for multi-class classification of event types;

3. The Vietnamese news event detection method based on Chinese trigger word guidance exploits bilingual consistency, using the rich Chinese trigger word annotations to find the trigger word information in Vietnamese news sentences, and classifies the Vietnamese news sentences through a softmax layer;

4. The Vietnamese news event detection method based on Chinese trigger word guidance solves the problem of missing trigger word annotations in the Vietnamese news event detection task.

Drawings

FIG. 1 is a flow chart of Vietnamese news event detection based on Chinese trigger guidance according to the present invention;

FIG. 2 is a diagram of a Vietnamese news event detection model based on Chinese trigger guidance.

Detailed Description

Example 1: as shown in FIGS. 1-2, the Vietnamese news event detection method based on Chinese trigger word guidance specifically comprises the following steps:

Step1, collecting news texts for Chinese-Vietnamese related news event detection; Vietnamese news websites (including the Vietnam News Agency and the Vietnam Economic Times) are crawled, and Chinese news websites (Baidu, Xinhuanet and People's Daily Online) are crawled for the news topics corresponding to those crawled in Vietnamese, giving 813 Vietnamese news texts and 4065 Chinese news texts in total. Finally, the news texts are de-duplicated and screened;

In Step1, as a preferred embodiment of the present invention, Scrapy is used as the crawling tool to simulate user operations; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements so as to obtain the detailed data, and data such as news headlines, news times and news texts are obtained.
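The per-site template idea of Step1 can be sketched as follows. This is an illustration only: it uses Python's standard-library `xml.etree.ElementTree` instead of the crawling framework, so it handles only well-formed markup and a subset of XPath, and the site keys, XPath expressions and sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site templates: field name -> XPath of the page element.
TEMPLATES = {
    "zh_site": {"title": ".//h1", "time": ".//span[@class='date']",
                "text": ".//div[@class='content']/p"},
    "vi_site": {"title": ".//h2", "time": ".//time",
                "text": ".//article/p"},
}


def extract_news(page_markup, site):
    """Apply the site's XPath template and return title, time and body text."""
    root = ET.fromstring(page_markup)
    record = {}
    for field, xpath in TEMPLATES[site].items():
        nodes = root.findall(xpath)
        record[field] = " ".join("".join(n.itertext()).strip() for n in nodes)
    return record


# Made-up, well-formed sample page.
page = """<html><body>
<h2>Tiêu đề</h2><time>2020-10-16</time>
<article><p>Đoạn một.</p><p>Đoạn hai.</p></article>
</body></html>"""
record = extract_news(page, "vi_site")
```

Keeping the XPath expressions in a table, one entry per website, means a new site needs only a new template and no new extraction code.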

The design of this preferred scheme is an important component of the invention; it mainly supports the corpus collection process and provides data support for the event temporal relation recognition of the invention.

Step2, preprocessing the Chinese and Vietnamese news texts by word segmentation, entity labeling and the like, labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a validation corpus in a ratio of 8:1:1;

Step3, adopting an adversarial learning method to map the two languages into the same semantic space, and extracting the mapped Chinese trigger word vectors;

As a preferred embodiment of the present invention, the Step2 specifically comprises the following steps:

Step2.1, an event in the invention consists of a trigger word and arguments; the trigger word clearly expresses the occurrence of a type of event and is usually a single verb or noun, while the arguments describe information such as the time, place and participants of the event; the Chinese trigger words and the event types of the Chinese and Vietnamese news texts are annotated for the customized Chinese-Vietnamese bilingual related news events;

Step2.2, 7 event types are defined following the format of the ACE2005 data set, with 25089 news sentences in total;

Step2.3, dividing the experimental data into a training corpus, a test corpus and a validation corpus.
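The de-duplication of Step1 and the 8:1:1 division of Step2 could be carried out along the following lines. This is a minimal sketch under stated assumptions: duplicates are taken to be exact matches after whitespace normalisation, the split is a seeded random shuffle, and the function names and sample sentences are hypothetical.

```python
import random


def deduplicate(texts):
    """Drop exact duplicates (after whitespace normalisation), keeping order."""
    seen, unique = set(), []
    for text in texts:
        key = " ".join(text.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def split_corpus(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split samples into training, test and validation corpora."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_test = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]  # remainder goes to validation
    return train, test, valid


# 100 distinct placeholder sentences plus one repeated entry.
sentences = deduplicate([f"sentence {i}" for i in range(100)] + ["sentence 0"])
train, test, valid = split_corpus(sentences)
```

Fixing the seed keeps the three corpora identical between runs, which matters when several models are compared on the same test corpus.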

In a preferred embodiment of the present invention, in Step2 the event types are divided into seven types: "…" (visit), "…" (meeting), "… xúc" (cooperation), "kinh …" (economy), "Thay …" (transition), "Giao …" (trade) and "xung …" (conflict).

As a preferable scheme of the invention, the Step3 comprises the following specific steps:

Step3.1, the extended skip-gram model method is adopted to pre-train the Chinese word vectors and the Vietnamese word vectors, where $E$ and $N$ are the respective vocabulary sizes, and $d_s$ and $d_z$ denote the Chinese word vector dimension and the Vietnamese word vector dimension, respectively. Chinese is then projected into the same semantic space as Vietnamese using a mapping function $f$, where $U$ is the mapping matrix and the output of $f$ is the projected Chinese word vector. The transformation matrix $U$ is constrained to be orthogonal by means of singular value decomposition (SVD), in order to reduce the parameter search space.

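Step3.2, in order to optimize the mapping function $f$, a multilayer perceptron $D$ is introduced as a word discriminator, which takes the Vietnamese word vectors and the mapped Chinese word vectors as input and outputs a single scalar. For an input word vector $x_i$, $D(x_i)$ represents the probability that $x_i$ comes from the Vietnamese vocabulary. The word discriminator uses a binary cross-entropy loss:

$$L_D = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \left[ y_i \log D(x_i) + (1 - y_i) \log\bigl(1 - D(x_i)\bigr) \right] \tag{3}$$

$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \tag{4}$$

where $\delta_i = 1$ indicates that the word is from $z$ (Vietnamese), $\delta_i = 0$ indicates that the word is from $s$ (Chinese), $I_{s;z}$ is the number of words sampled together from the vocabularies of $z$ and $s$, and $\epsilon$ is a smoothing value added to the positive and negative labels.

The mapping function $f$ and the word discriminator $D$ are two adversarial players; $f$ is optimized by flipping the word labels and minimizing the loss:

$$L_f = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \left[ (1 - y_i) \log D(x_i) + y_i \log\bigl(1 - D(x_i)\bigr) \right] \tag{5}$$

$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \tag{6}$$

Using this adversarial learning approach to map the two languages into the same semantic space, the word discriminator and the mapping function are trained alternately with stochastic gradient descent (SGD) to minimize $L_D$ and $L_f$, respectively.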
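The smoothed labels of formulas (4) and (6) and the two opposing binary cross-entropy losses can be checked numerically with a short sketch. It assumes the standard form of binary cross-entropy averaged over the sampled words; the function names and the probabilities below are made up for illustration, and no discriminator network is actually trained.

```python
import math


def smoothed_label(delta, eps):
    """Formulas (4)/(6): y = delta * (1 - 2*eps) + eps."""
    return delta * (1 - 2 * eps) + eps


def discriminator_loss(probs, deltas, eps=0.1):
    """Binary cross-entropy of the word discriminator on smoothed labels.

    probs[i] is the discriminator's probability that word i is Vietnamese;
    deltas[i] is 1 for a Vietnamese word and 0 for a mapped Chinese word.
    """
    total = 0.0
    for p, d in zip(probs, deltas):
        y = smoothed_label(d, eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)


def mapping_loss(probs, deltas, eps=0.1):
    """Loss of the mapping function: the same cross-entropy with flipped labels."""
    return discriminator_loss(probs, [1 - d for d in deltas], eps)


# Made-up outputs of a discriminator that currently separates the languages well.
probs = [0.9, 0.8, 0.2, 0.3]
deltas = [1, 1, 0, 0]
loss_d = discriminator_loss(probs, deltas)
loss_f = mapping_loss(probs, deltas)
```

When the discriminator separates the two languages well, `loss_d` is small and `loss_f` is large, so a gradient step on the mapping function moves the projected Chinese vectors toward the region occupied by the Vietnamese vectors.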

Step3.3, given a Chinese news text, the trigger words in its sentences are annotated. The Chinese trigger words are mapped into the same semantic space as Vietnamese through the mapping matrix, and all the mapped Chinese trigger words are converted into a set of mapping vectors $G = \{g_1, g_2, \ldots, g_m\}$, which are used to capture the hidden trigger words in Vietnamese sentences.

The design of this preferred scheme is an important component of the invention. It mainly provides the vector encoding process for the invention, mapping the two languages into the same semantic space by combining bilingual word vectors, and lays the groundwork for the subsequent step in which the mapped Chinese trigger words guide the model to find the trigger words in Vietnamese sentences.

As a preferred scheme of the invention, the invention uses the Chinese trigger word information to enhance, through an attention mechanism, the trigger word semantic information in Vietnamese news sentences, wherein:

the Step4, which adopts BiLSTM to obtain the semantic information of Vietnamese news, comprises the following specific steps:

Step4.1, given a Vietnamese news sentence $S = \{w_1, w_2, \ldots, w_n\}$ containing $n$ words, each word $w_i$ in $S$ is labeled with its entity type $e_i$ by underthesea. The word vector corresponding to $w_i$ is then looked up in the word vector vocabulary, and the entity vector corresponding to $e_i$ is looked up in the entity vector vocabulary. Finally, the word vector and the entity vector are concatenated to give the final vector representation $v_i$ of $w_i$.

Every word $w_i$ in $S$ is represented as a vector $v_i$ in the manner described above. Joining these vectors with the concatenation operator $\oplus$, the semantic representation matrix $M_s$ of the sentence $S$ is:

$$M_s = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$
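The lookup and concatenation of Step4.1 can be sketched as follows. The sketch makes several assumptions: the word table is a toy stand-in for vectors pre-trained on a Vietnamese corpus, the entity tags (`"O"`, `"B-LOC"`) are supplied by hand rather than produced by a tagger, out-of-vocabulary words receive a zero vector, and all names are hypothetical.

```python
import random


def build_entity_table(entity_types, dim, seed=0):
    """Randomly initialise one vector per entity tag type."""
    rng = random.Random(seed)
    return {tag: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for tag in entity_types}


def encode_sentence(words, entity_tags, word_table, entity_table):
    """Return M_s: one row v_i = word vector ++ entity vector per word."""
    unk = [0.0] * len(next(iter(word_table.values())))
    rows = []
    for word, tag in zip(words, entity_tags):
        word_vec = word_table.get(word, unk)  # out-of-vocabulary -> zeros
        entity_vec = entity_table[tag]
        rows.append(word_vec + entity_vec)    # list concatenation
    return rows


# Toy tables; real word vectors would be pre-trained on a Vietnamese corpus.
word_table = {"Hà_Nội": [0.1, 0.2, 0.3], "ký": [0.4, 0.5, 0.6]}
entity_table = build_entity_table(["O", "B-LOC"], dim=2)
M_s = encode_sentence(["Hà_Nội", "ký", "hôm_nay"], ["B-LOC", "O", "O"],
                      word_table, entity_table)
```

Each row of `M_s` has the word dimension plus the entity dimension (here 3 + 2), and the rows are what the BiLSTM would consume one time step at a time.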

As a preferred scheme of the invention, based on the characteristic that different languages are consistent within the same news topic, the method uses the annotated Chinese trigger word information to guide the discovery of trigger word information in Vietnamese news texts;

The BiLSTM:

Sentences are modeled using a BiLSTM that runs over the sequence of concatenated word and entity embeddings. A BiLSTM can be regarded as two unidirectional LSTMs, a forward LSTM and a backward LSTM, so that the output at the current time step is linked to both the state at the previous time step and the state at the next time step.

The vector of each word in a Vietnamese news sentence is input in turn into a neural network composed of BiLSTM units to obtain the hidden layer vectors $h = \{h_1, h_2, \ldots, h_n\}$ of the sentence, where $h_i$ is the hidden layer vector representation of the $i$-th word in the sentence. At each step of this phase, the forward LSTM uses the input $w_t$ at time $t$ and the previous hidden state vector $h_{t-1}$ to calculate the current hidden state vector $\overrightarrow{h_t}$. The LSTM is then run in the reverse direction to generate the backward hidden vector representation $\overleftarrow{h_t}$.

The forward LSTM is combined with the backward LSTM to form the BiLSTM. Unlike a unidirectional LSTM, the data of the input layer are processed in both the forward and the backward direction, and the output hidden states of the two directions are finally concatenated and used as the input of the next layer.
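The bidirectional wiring described above (one pass left to right, one pass right to left, hidden states concatenated per position) can be sketched independently of the cell type. In the sketch below the recurrent cell is a deliberately simplified stand-in and not an LSTM, both directions share that cell whereas a real BiLSTM has separate parameters per direction, and all names are hypothetical.

```python
import math


def toy_cell(x, h):
    """Stand-in recurrent cell (NOT an LSTM): mixes input and previous state."""
    return [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, h)]


def run_direction(inputs, cell):
    """Run the cell over the sequence in order and collect every hidden state."""
    h = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        h = cell(x, h)
        states.append(h)
    return states


def bidirectional(inputs, cell=toy_cell):
    """Concatenate forward and backward hidden states for each position."""
    forward = run_direction(inputs, cell)
    # Process the reversed sequence, then restore the original order.
    backward = run_direction(inputs[::-1], cell)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Three time steps with 2-dimensional toy inputs.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = bidirectional(sentence)
```

The forward half of `hidden[0]` depends only on the first input, while its backward half has already seen the whole sentence, which is how each position gains access to both left and right context.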

The attention mechanism is as follows:

Each type of event is typically triggered by a specific set of words, referred to as event trigger words. For example, "xung …" (conflict) events are typically triggered by words such as "fight" and "attack". The event trigger word is therefore an important clue for completing the event detection task. Given the set of Chinese trigger word vectors $G = \{g_1, g_2, \ldots, g_m\}$ and the hidden states $h = \{h_1, h_2, \ldots, h_n\}$ produced by the BiLSTM, the attention weights between each trigger word vector $g_i$ ($i = 1, 2, \ldots, m$) and the hidden states $h$ are calculated, yielding a set of attention weight vectors $\alpha = \{\alpha_1, \ldots, \alpha_m\}$. Specifically, the attention weight between the $k$-th Chinese trigger word vector $g_k$ in $G$ and the hidden state $h_t$ at time $t$ is calculated by formula (10). In this model, the trigger words of the target event type in the Vietnamese news are expected to receive higher weights than the other words.

After the attention weights between $g_k$ and the hidden states $h = \{h_1, h_2, \ldots, h_n\}$ at all time steps have been calculated, an attention weight vector $\alpha_k = [\alpha^1, \alpha^2, \ldots, \alpha^n]$ is obtained. After traversing $G = \{g_1, g_2, \ldots, g_m\}$, a set of attention weight vectors $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is obtained. The vector that contains the element with the largest weight in this set is then taken as the final attention weight vector of the current input sentence, denoted $\alpha_{max} = [\alpha^1, \alpha^2, \ldots, \alpha^n]$, because among the Chinese trigger words in $G$, the word vector most relevant to the current input sentence yields the larger attention weights.

Finally, $\alpha_{max}$ is used to compute a weighted sum over $h$, which gives the vector representation $S_{att}$ of the current input sentence, as shown in formula (11):

$$S_{att} = \sum_{i} \alpha^{i} h_i \tag{11}$$

where $i = 1, 2, \ldots, n$.
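The selection of the final attention weight vector and the weighted sum of formula (11) can be sketched as follows. Formula (10) is not reproduced in this text, so the sketch assumes dot-product scores normalised with softmax, which is only one possible choice; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def trigger_guided_attention(triggers, hidden):
    """Return (alpha_max, S_att) for one sentence.

    triggers: mapped Chinese trigger vectors g_1..g_m
    hidden:   hidden states h_1..h_n (same dimension as the triggers)
    """
    # One attention weight vector alpha_k per trigger vector g_k.
    # Assumption: dot-product scores normalised with softmax.
    alphas = [softmax([dot(g, h) for h in hidden]) for g in triggers]
    # Keep the weight vector that contains the single largest weight.
    alpha_max = max(alphas, key=max)
    # Formula (11): S_att = sum_i alpha^i * h_i
    dim = len(hidden[0])
    s_att = [sum(a * h[d] for a, h in zip(alpha_max, hidden))
             for d in range(dim)]
    return alpha_max, s_att


# Toy hidden states for a 3-word sentence and two trigger vectors.
hidden = [[0.1, 0.0], [2.0, 0.1], [0.0, 0.2]]
triggers = [[1.0, 0.0], [0.0, 1.0]]
alpha_max, s_att = trigger_guided_attention(triggers, hidden)
```

In this toy input the first trigger vector aligns strongly with the second hidden state, so its weight vector is selected and `s_att` is dominated by that position.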

This preferred scheme proposes incorporating Chinese trigger words to guide the model in finding trigger word information in Vietnamese, and then classifying the event types. The BiLSTM extracts information in both the forward and the backward direction, which addresses the long-distance dependency problem and effectively mines the implicit semantic information of event sentences. The attention mechanism increases the weight of the trigger words of the current event, so that the Vietnamese event detection task achieves its best performance.

As a preferred embodiment of the present invention, the Step5 specifically comprises the following steps: inputting the extracted Vietnamese trigger word information into a classification layer, and classifying the Vietnamese news events by adopting a softmax classifier, thereby realizing the detection of Vietnamese news events.

As a preferred scheme of the present invention, the event types in the Chinese and Vietnamese related news texts are classified into seven categories.

As a preferred scheme of the invention, the vector representation $S_{att}$ of the current input sentence is input into a softmax layer to obtain the probability distribution $P$ over the event types to be predicted:

$$P = \mathrm{softmax}(W \cdot S_{att} + b) \tag{12}$$

where $W$ and $b$ are the weight and bias of the softmax layer, respectively.
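Formula (12) amounts to a linear layer followed by softmax over the seven event types. The following minimal sketch uses made-up parameters and English glosses for the type names; `classify`, `EVENT_TYPES`, `W` and `b` as defined here are illustrative assumptions, not trained values.

```python
import math

# English glosses of the seven event types (labels are illustrative).
EVENT_TYPES = ["visit", "meeting", "cooperation", "economy",
               "transition", "trade", "conflict"]


def classify(s_att, W, b):
    """Formula (12): P = softmax(W . S_att + b)."""
    logits = [sum(w * x for w, x in zip(row, s_att)) + bias
              for row, bias in zip(W, b)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters: 7 event types, 2-dimensional sentence vector.
W = [[0.1 * k, -0.1 * k] for k in range(7)]
b = [0.0] * 7
P = classify([1.0, -1.0], W, b)
predicted = EVENT_TYPES[P.index(max(P))]
```

The predicted event type is simply the index of the largest probability in `P`.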

This preferred scheme gives the Chinese trigger words a certain constraining effect, which is conducive to better recognition of the event temporal relation.

Step6, experiments are carried out on the encoding features and on the presence or absence of Chinese trigger words to demonstrate the rationality and efficiency of the model settings, and the model is compared with existing models to demonstrate that the method performs better on Vietnamese event detection.

The experiments use precision (P), recall (R) and F-value (F) as evaluation indices.

Precision (P): the proportion of correctly predicted events in the total predicted events.

Recall (R): the proportion of correctly predicted events in real events.
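The three evaluation indices can be computed from event counts as sketched below. The F-value is not defined in this text, so the sketch assumes it is the balanced F1 score, the harmonic mean of P and R; the function name and the counts are made up for illustration and are not results from the experiments.

```python
def precision_recall_f(num_correct, num_predicted, num_gold):
    """Return (P, R, F) from event counts; F is assumed to be balanced F1."""
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up counts: 80 events predicted, 60 of them correct, 100 real events.
p, r, f = precision_recall_f(num_correct=60, num_predicted=80, num_gold=100)
```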

To verify whether the model herein improves event detection, a first group of experiments is set up. The experiment compares the model herein with a CNN model and a GCN model on the Vietnamese news data set, and with a baseline model (TBNNAM) under the setting without trigger word annotation. The experimental results are shown in Table 1:

Table 1. Performance comparison of different models

The comparison experiments show that the LSTM model outperforms the CNN, mainly because LSTM mitigates the vanishing and exploding gradient problems during training. The global information captured by the last state of the LSTM in the TBNNAM model is also important for this task, and it is complementary to the local information captured by the attention mechanism. The BiLSTM used herein can capture more semantic information in a sentence than a unidirectional LSTM. The experimental results show that the model herein performs better.

A second group of experiments is set up to study the encoding features of the word embedding layer and to verify whether integrating entity information into the word vectors improves event detection. This experiment compares the performance of the model with and without the entity vectors. The experimental results are shown in Table 2:

Table 2. Effect of encoding features on model performance

The comparison experiments show that entity annotation helps capture the semantic information of words. After the entity vectors are added, the precision, recall and F-value of the model all improve over the model without entity vectors, which shows that adding entity vectors improves event detection performance.

Because of the complexity of Vietnamese, trigger words are difficult to annotate in Vietnamese sentences, so a third group of experiments is set up to verify whether incorporating Chinese trigger words improves event detection. The experiment compares event detection with and without Chinese trigger word guidance, and the experimental results are shown in Table 3:

Table 3. Comparison of model performance with and without Chinese trigger word guidance

The comparison experiments show that the model with trigger word guidance performs clearly better than the model without it. Different languages are consistent for sentences about the same news event, so Chinese trigger words can be used to find the trigger words in the corresponding Vietnamese sentences, thereby completing event detection for Vietnamese news.

The data show that integrating entity information captures the semantic information of sentences better, and that guiding the model with Chinese trigger words makes it possible to find the trigger word information in Vietnamese sentences, thereby realizing the detection of Vietnamese news events.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. The Vietnamese news event detection method based on Chinese trigger word guidance is characterized by comprising the following steps of:

Step2, preprocessing the Chinese and Vietnamese news texts by word segmentation, entity labeling and the like, labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;

Step3, mapping the two languages, Chinese and Vietnamese, into the same semantic space by adopting an adversarial learning method, and extracting the mapped Chinese trigger word vectors;

2. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: in Step1, Scrapy is used as the crawling tool to simulate user operations, different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements so as to obtain the detailed data, and the news headline, news time and news text data are obtained.

3. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: the specific steps of Step2 are as follows:

And

a relationship;

4. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: the specific Step of Step3 is as follows:

Step3.2, projecting Chinese into the same semantic space as Vietnamese by using a mapping function, and training a word discriminator and the mapping function alternately by stochastic gradient descent;

Step3.3, given a Chinese news text, annotating the trigger words in its sentences.

5. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: the Step4, which adopts BiLSTM to obtain the semantic information of Vietnamese news, comprises the following specific steps:

Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector vocabulary, randomly initializing an entity vector for each entity tag by using the entity tag types of the underthesea tool to obtain an entity vector vocabulary, and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector vocabulary and the entity vector vocabulary;

6. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: the specific steps of Step5 are as follows: inputting the trigger word information extracted from the Vietnamese sentences into a classification layer, and classifying the event types of the Vietnamese news sentences by adopting a softmax classifier, thereby realizing the detection of Vietnamese news events.