CN112580330A - Vietnamese news event detection method based on Chinese trigger word guidance - Google Patents


Info

Publication number
CN112580330A
Authority
CN
China
Prior art keywords
vietnamese
chinese
news
word
trigger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011108823.8A
Other languages
Chinese (zh)
Other versions
CN112580330B (en
Inventor
高盛祥
寇梦珂
余正涛
王振晗
朱俊国
朱恩昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202011108823.8A
Publication of CN112580330A
Application granted
Publication of CN112580330B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing. The method first maps the two languages into the same semantic space by adversarial learning, then fuses entity information during encoding, uses the mapped Chinese trigger words, through an attention mechanism, to guide the model to attend to trigger word information in Vietnamese news, and finally performs multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events. Event detection currently requires identifying the trigger words in news, yet no Vietnamese corpus annotated with trigger words exists; the invention addresses this shortage of annotated Vietnamese data by exploiting the rich annotated Chinese corpus.

Description

Vietnamese news event detection method based on Chinese trigger word guidance
Technical Field
The invention relates to a Vietnamese news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing.
Background
Event detection is an active topic in current natural language processing research, and trigger word recognition plays a crucial role in it. Vietnamese data are scarce and no Vietnamese data annotated with trigger words exist, so events in Vietnamese news are difficult to detect. Sentences that express the same viewpoint in different languages usually share the same or similar semantic components; using the rich Chinese trigger word annotations to compensate for the missing Vietnamese trigger word annotations is therefore of great significance.
Disclosure of Invention
The invention provides a Vietnamese news event detection method based on Chinese trigger word guidance, which addresses the problems that Vietnamese data are currently scarce, that no Vietnamese corpus annotated with trigger words exists, and that texts in different languages are difficult to represent in the same feature space.
The technical scheme of the invention is as follows: the Vietnamese news event detection method based on Chinese trigger word guidance comprises the following specific steps:
Step1, collecting news texts for Chinese-Vietnamese bilingual related news event detection, and de-duplicating and screening the news texts;
Step2, preprocessing the Chinese and Vietnamese news texts (word segmentation, entity labeling and the like), labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a verification corpus;
Step3, mapping the two languages into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of a BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in the Vietnamese sentences;
Step5, finally, performing multi-class classification of event types with the obtained trigger word information, thereby realizing Vietnamese news event detection.
In Step1, Scrapy is used as the crawling tool to simulate user operations; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements, to obtain the detailed data: news headlines, news time and news body text.
As a further scheme of the present invention, the Step2 specifically comprises the following steps:
Step2.1, annotating the trigger words and event types in the Chinese news texts and the event types in the Vietnamese news texts with reference to the ACE event annotation scheme, and dividing the event types into seven types (the Vietnamese type names are rendered as images in the source; only the fragments "… xúc", "Giao …" and "… xung …" survive as text);
and Step2.2, dividing the experimental data into training corpora, testing corpora and verification corpora.
As a further scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, using an extended skip-gram model to predict the context of a target word in Chinese and, at the same time, the context of its aligned word in Vietnamese, so as to obtain Chinese-Vietnamese bilingual word vectors;
Step3.2, projecting Chinese into the same semantic space as Vietnamese with a mapping function, and training the word discriminator and the mapping function in turn by stochastic gradient descent;
Step3.3, given a Chinese news text, annotating the trigger words in its sentences.
As a further scheme of the present invention, the Step4, which adopts BiLSTM to obtain semantic information of the vietnamese news, specifically comprises the following steps:
Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector vocabulary; using the entity tag types of the underthesea tool, randomly initializing an entity vector for each entity tag to obtain an entity vector vocabulary; and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector vocabulary and the entity vector vocabulary;
Step4.2, concatenating the word vectors and the entity vectors as the input of the BiLSTM to capture the semantic information in the sentence.
As a further scheme of the present invention, the Step5 specifically comprises the following steps: and inputting the extracted trigger words in the Vietnamese sentences into a classification layer, and classifying the event types of the Vietnamese news sentences by adopting a softmax classifier, thereby realizing the detection of the Vietnamese news events.
The invention has the beneficial effects that:
1. The Vietnamese news event detection method based on Chinese trigger word guidance maps the two languages into the same semantic space by adversarial learning: a mapping function moves the Chinese representations ever closer to the Vietnamese ones until the discriminator can no longer tell the two languages apart, after which the mapped Chinese trigger word vectors are extracted;
2. The method uses a BiLSTM to mine the implicit contextual semantic information of event sentences, uses the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in Vietnamese sentences, and finally uses the resulting attention context vector for multi-class classification of event types;
3. The method exploits bilingual consistency: the rich Chinese trigger word annotations are used to find the trigger word information in Vietnamese news sentences, which are then classified by a softmax layer;
4. The method solves the problem of missing trigger word annotations in the Vietnamese news event detection task.
Drawings
FIG. 1 is a flow chart of Vietnamese news event detection based on Chinese trigger guidance according to the present invention;
FIG. 2 is a diagram of a Vietnamese news event detection model based on Chinese trigger guidance.
Detailed Description
Example 1: as shown in FIGS. 1-2, the Vietnamese news event detection method based on Chinese trigger word guidance specifically comprises the following steps:
Step1, collecting news texts for Chinese-Vietnamese related news event detection: Vietnamese news websites (including the Vietnam News Agency and the Vietnam Economic Times) are crawled, and Chinese news websites (Baidu, Xinhuanet and People's Daily Online) are crawled for the news topics found on the Vietnamese side, yielding 813 Vietnamese news texts and 4065 Chinese news texts in total. Finally, the news texts are de-duplicated and screened;
As a preferred embodiment of the invention, in Step1, Scrapy is used as the crawling tool to simulate user operations; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements, to obtain detailed data such as news headlines, news time and news body text.
This preferred design is an important component of the invention; it mainly supplies the corpus collection process and the data support for the event time sequence relationship identification of the invention.
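The template-based extraction and de-duplication of Step1 can be sketched as follows. This is a minimal illustration only: it uses Python's standard `xml.etree.ElementTree` (which supports a subset of XPath) in place of a crawler, and the site keys, XPath expressions and sample page are hypothetical, not taken from the patent.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical per-site templates: field name -> XPath (ElementTree subset).
TEMPLATES = {
    "site_zh": {"title": ".//h1",
                "time": ".//span[@class='time']",
                "body": ".//div[@class='content']"},
    "site_vi": {"title": ".//h2",
                "time": ".//time",
                "body": ".//article"},
}

def extract(page_xml, site):
    """Apply the XPath template of `site` to one well-formed page."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, xpath in TEMPLATES[site].items():
        node = root.find(xpath)
        record[field] = "".join(node.itertext()).strip() if node is not None else ""
    return record

def deduplicate(records):
    """Drop records whose whitespace-normalized body text was already seen."""
    seen, unique = set(), []
    for rec in records:
        normalized = " ".join(rec["body"].split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical sample page for the "site_zh" template.
page = ("<html><body><h1>Title</h1><span class='time'>2020-10-16</span>"
        "<div class='content'>Some  text</div></body></html>")
rec = extract(page, "site_zh")
```

Keeping one template per website means that a layout change on one site only requires updating that site's XPath entries.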
Step2, preprocessing the Chinese and Vietnamese news texts (word segmentation, entity labeling and the like), labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a verification corpus at a ratio of 8:1:1;
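A minimal sketch of the 8:1:1 division into training, test and verification corpora. The shuffling, the fixed seed and the function name `split_corpus` are assumptions added for illustration; the patent only states the ratio.

```python
import random

def split_corpus(sentences, ratios=(8, 1, 1), seed=0):
    """Shuffle the labeled sentences and split them into train/test/verification."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumption: random split with a fixed seed
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    verification = items[n_train + n_test:]  # remainder goes to verification
    return train, test, verification

train, test, verification = split_corpus(range(100))
```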
Step3, mapping the two languages into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of a BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in the Vietnamese sentences;
and Step5, finally, performing multi-classification of event types by using the obtained trigger word information, and further realizing Vietnamese news event detection.
As a preferred embodiment of the present invention, the Step2 specifically comprises the following steps:
Step2.1, an event in the invention consists of a trigger word and arguments: the trigger word clearly expresses the occurrence of a type of event and is usually a single verb or noun, while the arguments describe information such as the time, place and participants of the event; the Chinese trigger words and the event types of the Chinese and Vietnamese news texts are annotated in the customized Chinese-Vietnamese bilingual related news events;
Step2.2, following the format of the ACE2005 data set, 7 event types are defined, covering 25089 news sentences in total;
and Step2.3, dividing the experimental data into training corpora, testing corpora and verification corpora.
In a preferred embodiment of the invention, in Step2 the event types are divided into seven types, glossed in the source as access, meeting, cooperation, economy, transition, trade and conflict. The Vietnamese type names are rendered as images in the source; only the fragments "… xúc" (cooperation), "… kinh …" (economy), "Thay …" (transition), "Giao …" (trade) and "… xung …" (conflict) survive as text.
As a preferable scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, using the extended skip-gram model to pre-train the Chinese word vectors $S \in \mathbb{R}^{E \times d_s}$ and the Vietnamese word vectors $Z \in \mathbb{R}^{N \times d_z}$, where $E$ and $N$ are the respective vocabulary sizes, and $d_s$ and $d_z$ denote the Chinese and Vietnamese word vector dimensions, respectively. Chinese is then projected into the same semantic space as Vietnamese with a mapping function $f$:
$$f(s_i) = \hat{s}_i = U s_i \qquad (1)$$
where $U$ is the mapping matrix and $\hat{s}_i$ is the projected Chinese word vector.
The transformation matrix $U$ is constrained to be orthogonal by means of singular value decomposition (SVD), which reduces the parameter search space:
[Equation (2): the SVD-based orthogonality constraint on $U$; the formula is an image in the source and is not recoverable from the text.]
Step3.2, to optimize the mapping function $f$, a multilayer perceptron $D$ is introduced as the word discriminator; it takes the Vietnamese word vectors and the mapped Chinese word vectors as input and outputs a single scalar. $D(x_i)$ denotes the probability that the input vector $x_i$ comes from the Vietnamese vocabulary. The word discriminator is trained with a binary cross-entropy loss:
$$\mathcal{L}_{D} = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \left[ y_i \log D(x_i) + (1 - y_i) \log\left(1 - D(x_i)\right) \right] \qquad (3)$$
$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \qquad (4)$$
where $\delta_i = 1$ indicates that the word is from $z$ and $\delta_i = 0$ that it is from $s$; $I_{s;z}$ is the number of words sampled together from the vocabularies of $z$ and $s$; and $\epsilon$ is the smoothing value added to the positive and negative labels.
The mapping function $f$ and the word discriminator $D$ are two adversarial components; $f$ is optimized by flipping the word labels and minimizing the loss:
$$\mathcal{L}_{f} = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \left[ (1 - y_i) \log D(x_i) + y_i \log\left(1 - D(x_i)\right) \right] \qquad (5)$$
$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \qquad (6)$$
To map the two languages into the same semantic space by adversarial learning, the word discriminator and the mapping function are trained in turn with stochastic gradient descent (SGD) to minimize the discriminator loss $\mathcal{L}_{D}$ and the mapping loss $\mathcal{L}_{f}$, respectively.
Step3.3, given a Chinese news text, the trigger words in its sentences are annotated. The Chinese trigger words are mapped into the same semantic space as Vietnamese through the mapping matrix, and all mapped Chinese trigger words are converted into a set of mapped vectors $G = \{g_1, g_2, \dots, g_m\}$, which is used to capture the hidden trigger words in Vietnamese sentences.
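The smoothed labels of equations (4) and (6) and the two adversarial objectives can be sketched numerically as follows. This is an illustration under stated assumptions: the loss formulas in the source are images, so the standard binary cross-entropy form and the label flip for the mapping loss are assumed here; the function names and the example probabilities are hypothetical, and the discriminator outputs are given directly rather than produced by a trained perceptron.

```python
import math

def smoothed_label(delta, eps):
    """y_i = delta_i * (1 - 2*eps) + eps, as in equations (4) and (6)."""
    return delta * (1 - 2 * eps) + eps

def bce(prob, target):
    """Binary cross-entropy of one prediction against a (soft) target."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def discriminator_loss(probs, deltas, eps=0.1):
    """Assumed form: mean cross-entropy against the smoothed labels."""
    targets = [smoothed_label(d, eps) for d in deltas]
    return sum(bce(p, y) for p, y in zip(probs, targets)) / len(probs)

def mapping_loss(probs, deltas, eps=0.1):
    """Assumed form: the same loss with the labels flipped (y -> 1 - y)."""
    targets = [1 - smoothed_label(d, eps) for d in deltas]
    return sum(bce(p, y) for p, y in zip(probs, targets)) / len(probs)

# probs[i] = D(x_i): probability that word i comes from the Vietnamese vocabulary.
probs = [0.9, 0.1]   # hypothetical outputs of a confident discriminator
deltas = [1, 0]      # 1: word from z (Vietnamese), 0: word from s (mapped Chinese)
loss_d = discriminator_loss(probs, deltas)
loss_f = mapping_loss(probs, deltas)
```

When the discriminator separates the two languages well, its own loss is small and the mapping loss is large, which is the signal that pushes the mapping to make the projected Chinese vectors harder to distinguish from Vietnamese ones.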
This preferred design is an important component of the invention: it mainly provides the vector encoding process, mapping the two languages into the same semantic space with bilingual word vectors, and lays the groundwork for the mapped Chinese trigger words to guide the model in finding the trigger words in Vietnamese sentences.
As a preferred scheme of the invention, the Chinese trigger word information is used, through an attention mechanism, to enhance the trigger word information in Vietnamese news sentences, wherein:
Step4 uses a BiLSTM to obtain the semantic information of Vietnamese news, with the following specific steps:
Step4.1, given a Vietnamese news sentence $S = \{w_1, w_2, \dots, w_n\}$ containing $n$ words, the entity type $e_i$ of each word $w_i$ in $S$ is tagged with underthesea. The word vector $x^{w}_i$ corresponding to $w_i$ is then looked up in the word vector vocabulary, and the entity vector $x^{e}_i$ corresponding to $e_i$ is looked up in the entity vector vocabulary. Finally, the word vector and the entity vector are concatenated as the final vector representation $v_i$ of $w_i$:
$$v_i = [x^{w}_i ; x^{e}_i] \qquad (7)$$
Every word $w_i$ in $S$ is represented as a vector $v_i$ in this way. With the operator $\oplus$ denoting concatenation of the vectors, the semantic representation matrix $M_s$ of the sentence $S$ is:
$$M_s = v_1 \oplus v_2 \oplus \dots \oplus v_n \qquad (8)$$
Step4.2, the concatenated word vectors and entity vectors are used as the input of the BiLSTM to capture the semantic information in the sentence.
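The lookup-and-concatenate encoding of Step4.1 and Step4.2 can be sketched as follows. The tokens, entity tags, vector sizes and table contents are hypothetical placeholders; in the method, the word vectors come from pre-training and the entity tags from the underthesea tool.

```python
import random

def build_entity_table(entity_tags, dim, seed=0):
    """Randomly initialize one vector per entity tag type."""
    rng = random.Random(seed)
    return {tag: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tag in entity_tags}

def sentence_matrix(words, tags, word_table, entity_table):
    """Row i is the word vector of w_i concatenated with the entity vector of e_i."""
    return [word_table[w] + entity_table[t] for w, t in zip(words, tags)]

# Hypothetical 3-dimensional pre-trained word vectors.
word_table = {"w1": [0.1, 0.2, 0.3], "w2": [0.4, 0.5, 0.6], "w3": [0.7, 0.8, 0.9]}
# Hypothetical entity tag set with 2-dimensional entity vectors.
entity_table = build_entity_table(["B-LOC", "I-LOC", "O"], dim=2)

M_s = sentence_matrix(["w1", "w2", "w3"], ["O", "B-LOC", "I-LOC"],
                      word_table, entity_table)
```

Each row of `M_s` has the word dimension plus the entity dimension (here 3 + 2 = 5), and the rows are fed to the BiLSTM in sentence order.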
As a preferred scheme of the invention, since different languages are consistent within the same news topic, the method uses Chinese trigger word information, which can be annotated, to guide the discovery of trigger word information in Vietnamese news texts;
The BiLSTM:
Sentences are modeled with a BiLSTM that runs over the sequence of concatenated word and entity embeddings. A BiLSTM can be regarded as two unidirectional LSTMs, a forward LSTM and a backward LSTM, so that the output at the current time step is linked to both the preceding and the following states.
The vector of each word in a Vietnamese news sentence is fed in sequence into a network of BiLSTM units to obtain the hidden layer vectors $h = \{h_1, h_2, \dots, h_n\}$ of the sentence, where $h_i$ is the hidden layer representation of the $i$-th word. At each step, the forward LSTM computes the current hidden state $\overrightarrow{h_t}$ from the input $w_t$ at time $t$ and the previous hidden state $\overrightarrow{h_{t-1}}$; the LSTM is then run in reverse to produce the backward hidden representation $\overleftarrow{h_t}$. The forward and backward LSTMs together form the BiLSTM. Unlike a plain LSTM, the input data are processed in both the forward and backward directions, and the two output hidden states are concatenated and used as the input of the next layer:
$$h_t = [\overrightarrow{h_t} ; \overleftarrow{h_t}] \qquad (9)$$
The attention mechanism is as follows:
Each type of event is typically triggered by a specific set of words, called event trigger words. For example, conflict events (Vietnamese "… xung …"; the label is partly an image in the source) are typically triggered by words such as "fight" and "attack". Event trigger words are therefore an important clue for event detection. Given the set of Chinese trigger word vectors $G = \{g_1, g_2, \dots, g_m\}$ and the hidden states $h = \{h_1, h_2, \dots, h_n\}$ produced by the BiLSTM, the attention weights between each trigger word vector $g_i$ ($i = 1, 2, \dots, m$) and the hidden states $h$ are computed, giving a set of attention weight vectors $\alpha = \{\alpha^1, \dots, \alpha^m\}$. Specifically, the attention weight between the $k$-th Chinese trigger word vector $g_k$ in $G$ and the hidden state $h_t$ at time $t$ is computed by equation (10); in this model, the trigger words of the target event type in the Vietnamese news are expected to receive higher weights than the other words.
[Equation (10): the attention weight $\alpha^k_t$ between $g_k$ and $h_t$; the formula is an image in the source and is not recoverable from the text.]
After the attention weights between $g_k$ and the hidden states at all time steps $h = \{h_1, h_2, \dots, h_n\}$ have been computed, an attention weight vector $\alpha^k = [\alpha^k_1, \alpha^k_2, \dots, \alpha^k_n]$ is obtained. After traversing $G = \{g_1, g_2, \dots, g_m\}$, a set of attention weight vectors $\alpha = \{\alpha^1, \alpha^2, \dots, \alpha^m\}$ is obtained. The vector containing the element with the largest weight in this set is then taken as the final attention weight vector of the current input sentence, denoted $\alpha_{max} = [\alpha_1, \alpha_2, \dots, \alpha_n]$, because among the Chinese trigger word vectors in $G$, the one most relevant to the current input sentence yields the larger attention weights.
Finally, the hidden states $h$ are summed with the weights $\alpha_{max}$ to obtain the vector representation $S_{att}$ of the current input sentence, as shown in equation (11):
$$S_{att} = \sum_{i} \alpha_i h_i \qquad (11)$$
where $i = 1, 2, \dots, n$.
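The trigger-guided attention described above can be sketched as follows. The scoring function of equation (10) is not available in the text, so a dot product followed by a softmax over time steps is assumed here; the function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def trigger_guided_attention(hidden, triggers):
    """Return (alpha_max, S_att) for hidden states h_1..h_n and trigger vectors g_1..g_m."""
    # One attention weight vector alpha^k per trigger vector g_k
    # (assumption: dot-product score, softmax over the n time steps).
    weight_vectors = [softmax([dot(g, h) for h in hidden]) for g in triggers]
    # Keep the weight vector that contains the single largest weight.
    alpha_max = max(weight_vectors, key=max)
    # S_att = sum_i alpha_i * h_i, as in equation (11).
    dim = len(hidden[0])
    s_att = [sum(a * h[j] for a, h in zip(alpha_max, hidden)) for j in range(dim)]
    return alpha_max, s_att

hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # toy BiLSTM hidden states
triggers = [[5.0, 0.0], [0.0, 1.0]]             # toy mapped Chinese trigger vectors
alpha_max, s_att = trigger_guided_attention(hidden, triggers)
```

In this toy input the first trigger vector aligns strongly with the first hidden state, so its weight vector is selected and the sentence representation is dominated by that hidden state.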
This preferred design incorporates Chinese trigger words to guide the discovery of trigger word information in Vietnamese, after which the event types are classified. The BiLSTM extracts information in both the forward and backward directions, which alleviates the long-distance dependency problem and effectively mines the implicit semantic information of event sentences. The attention mechanism increases the weight of the trigger words in the current event, so that the Vietnamese event detection task achieves its best results.
As a preferred embodiment of the present invention, the Step5 specifically comprises the following steps: and inputting the extracted Vietnamese trigger word information into a classification layer, and classifying the Vietnamese news events by adopting a softmax classifier, thereby realizing the detection of the Vietnamese news events.
As a preferred scheme of the invention, the event types of the Chinese and Vietnamese related news reports are divided into seven categories.
As a preferred scheme of the invention, the vector representation $S_{att}$ of the current input sentence is fed into a softmax layer to obtain the probability distribution $P$ over the event types to be predicted:
$$P = \mathrm{softmax}(W \cdot S_{att} + b) \qquad (12)$$
where $W$ and $b$ are the weight and bias of the softmax layer, respectively.
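Equation (12) can be sketched as follows for the seven event types. The weights, bias and input vector are hypothetical toy values (in the method they are learned), and the English type names follow the glosses given in the text.

```python
import math

EVENT_TYPES = ["access", "meeting", "cooperation", "economy",
               "transition", "trade", "conflict"]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify(s_att, W, b):
    """P = softmax(W . S_att + b); returns the distribution and the top event type."""
    logits = [sum(w * x for w, x in zip(row, s_att)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, EVENT_TYPES[best]

# Toy parameters: only the last row ("conflict") responds to the input.
W = [[0.0, 0.0] for _ in range(7)]
W[6] = [1.0, 1.0]
b = [0.0] * 7
probs, label = classify([2.0, 1.0], W, b)
```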
In this preferred design the Chinese trigger words act as a constraint, which helps to better identify the event time sequence relation.
Step6, experiments are carried out on the encoding features and on the presence or absence of Chinese trigger words to demonstrate that the model settings are reasonable and efficient, and the model is compared with existing models to show that the method performs better on Vietnamese event detection.
The experiments use precision (P), recall (R) and F value (F) as evaluation metrics.
Precision (P): the proportion of correctly predicted events among all predicted events.
Recall (R): the proportion of correctly predicted events among the real events.
F value (F): the harmonic mean of precision and recall:
$$F = \frac{2 \times P \times R}{P + R}$$
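The three metrics can be computed as in the sketch below, assuming the F value is the usual harmonic mean of P and R. Representing each event as a (sentence id, event type) pair is an assumption made for illustration, and the sample data are hypothetical.

```python
def precision_recall_f(gold, predicted):
    """gold and predicted are sets of (sentence_id, event_type) pairs."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

gold = {(1, "conflict"), (2, "trade"), (3, "meeting"), (4, "access")}
predicted = {(1, "conflict"), (2, "trade"), (5, "economy")}
p, r, f = precision_recall_f(gold, predicted)
```

Here two of the three predictions are correct and two of the four real events are found, so P = 2/3, R = 1/2 and F = 4/7.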
To verify whether the proposed model improves event detection, a first group of experiments is set up. On the Vietnamese news data set, the proposed model is compared with a CNN model and a GCN model, and with a baseline model (TBNNAM) that does not use annotated trigger words. The results are shown in Table 1:
Table 1. Performance comparison of different models
[Table 1 is an image in the source; its values are not recoverable from the text.]
The comparison shows that the LSTM-based models outperform the CNN, mainly because LSTM alleviates the vanishing and exploding gradient problems during training. The global information captured by the last LSTM state in the TBNNAM model is also important for this task, and it complements the local information captured by the attention mechanism. The BiLSTM used here captures more of the semantic information in a sentence than an LSTM. The experimental results show that the proposed model performs better.
A second group of experiments is set up to study the encoding features of the word embedding layer and to verify whether fusing entity information into the word vectors improves event detection. This experiment compares the model before and after the entity vectors are added. The results are shown in Table 2:
Table 2. Effect of encoding features on model performance
[Table 2 is an image in the source; its values are not recoverable from the text.]
The comparison shows that entity annotation helps capture the semantic information of words. After the entity vectors are added, the precision, recall and F value of the model all improve over the model without them, which shows that adding entity vectors improves event detection performance.
Because of the complexity of Vietnamese, its sentences are difficult to annotate with trigger words, so a third group of experiments is set up to verify whether incorporating Chinese trigger words improves event detection. The experiment compares event detection with and without Chinese trigger words; the results are shown in Table 3:
Table 3. Model performance with and without Chinese trigger word guidance
[Table 3 is an image in the source; its values are not recoverable from the text.]
The comparison shows that the model with trigger word annotation clearly outperforms the model without it. Different languages are consistent for the same news event sentence, so the Chinese trigger words can be used to find the trigger words in the corresponding Vietnamese sentence and thereby complete event detection on Vietnamese news.
These results indicate that fusing entity information captures the semantic information of sentences better, and that guiding the model with Chinese trigger words finds the trigger word information in Vietnamese sentences, thereby realizing Vietnamese news event detection.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. The Vietnamese news event detection method based on Chinese trigger word guidance is characterized by comprising the following steps of:
Step1, collecting news texts for Chinese-Vietnamese bilingual related news event detection, and de-duplicating and screening the news texts;
Step2, preprocessing the Chinese and Vietnamese news texts (word segmentation, entity labeling and the like), labeling the event types and the Chinese trigger words in the Chinese and Vietnamese news texts, and dividing the labeled Vietnamese news corpus into a training corpus, a test corpus and a verification corpus;
Step3, mapping the two languages, Chinese and Vietnamese, into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of a BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words to guide the model, through an attention mechanism, to find the trigger word information in the Vietnamese sentences;
Step5, finally, performing multi-class classification of event types with the obtained trigger word information, thereby realizing Vietnamese news event detection.
2. The method for detecting Vietnamese news events based on Chinese trigger word guidance according to claim 1, wherein: in Step1, Scrapy is used as the crawling tool to simulate user operations; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated according to the XPath paths of the page data elements, to obtain the detailed data: news headlines, news time and news body text.
3. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that the specific steps of Step2 are as follows:
Step2.1, with reference to the ACE event annotation scheme, annotating the trigger words and event types in the Chinese news text and the event types in the Vietnamese news text, the event types being divided into seven types, namely [Figure FDA0002727884390000011], [Figure FDA0002727884390000012] and [Figure FDA0002727884390000013] a relationship (the type names are rendered as images in the original);
Step2.2, dividing the experimental data into a training corpus, a test corpus and a validation corpus.
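The corpus division of Step2.2 can be sketched as a seeded shuffle followed by slicing. The 8:1:1 ratio and the function name are assumptions; the claims do not state the proportions.

```python
import random

def split_corpus(samples, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Shuffle labeled sentences and split them into train, test and validation sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train_ratio)
    n_test = int(len(items) * test_ratio)
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # whatever remains
    return train, test, validation

# Integers stand in for labeled Vietnamese news sentences.
train, test, validation = split_corpus(range(100))
```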
4. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that the specific steps of Step3 are as follows:
Step3.1, using an extended skip-gram model to predict the context of a target word in Chinese while also predicting the context of its aligned word in Vietnamese, thereby obtaining Chinese-Vietnamese bilingual word vectors;
Step3.2, projecting Chinese into the same semantic space as Vietnamese with a mapping function, and training a word discriminator and the mapping function in turn by stochastic gradient descent;
Step3.3, given a Chinese news text, marking the trigger words in its sentences.
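The alternating training of Step3.2 can be sketched in pure Python under strong simplifying assumptions: the mapping function is a linear matrix `W`, the word discriminator is a logistic regression, and each update is a single-sample gradient step. None of these choices is specified in the claims, and all names and toy vectors are hypothetical.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def adversarial_align(src_vecs, tgt_vecs, dim, steps=100, lr=0.01, seed=0):
    """Alternate discriminator and mapping updates, one sample pair per step."""
    rng = random.Random(seed)
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    d_w, d_b = [0.0] * dim, 0.0  # logistic discriminator parameters
    for _ in range(steps):
        x, y = rng.choice(src_vecs), rng.choice(tgt_vecs)
        # Discriminator step: label mapped Chinese vectors 1, Vietnamese vectors 0.
        for vec, label in ((matvec(W, x), 1.0), (y, 0.0)):
            err = sigmoid(dot(d_w, vec) + d_b) - label
            d_w = [w - lr * err * v for w, v in zip(d_w, vec)]
            d_b -= lr * err
        # Mapping step: move W so the discriminator takes W x for Vietnamese,
        # i.e. descend the loss -log(1 - D(W x)).
        p = sigmoid(dot(d_w, matvec(W, x)) + d_b)
        W = [[W[i][j] - lr * p * d_w[i] * x[j] for j in range(dim)]
             for i in range(dim)]
    return W, (d_w, d_b)

# Toy 2-d "Chinese" and "Vietnamese" word vectors (made-up numbers).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
W, (d_w, d_b) = adversarial_align(src, tgt, dim=2)
prob = sigmoid(dot(d_w, matvec(W, src[0])) + d_b)  # discriminator's belief
```

The sketch only shows the order of updates (discriminator first, then mapping); a practical system would use minibatches, a deeper discriminator and some constraint on `W`, all of which are design choices outside what the claim states.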
5. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that Step4 uses a BiLSTM to obtain the semantic information of Vietnamese news, with the following specific steps:
Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector table; randomly initializing one entity vector for each entity tag, using the entity tag types of the underthesea tool, to obtain an entity vector table; and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector table and the entity vector table;
Step4.2, concatenating the word vectors and the entity vectors as the input of the BiLSTM to capture the semantic information in the sentence.
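The lookup and concatenation of Step4.1 and Step4.2 can be sketched as follows. The tiny word table, the tag set, the vector dimensions and the unknown-word fallback are assumptions made for illustration; in practice the word vectors would come from pre-training and the tags from an entity tagger.

```python
import random

def build_entity_table(entity_tags, dim, seed=0):
    """Randomly initialize one vector per entity tag."""
    rng = random.Random(seed)
    return {tag: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for tag in entity_tags}

def embed_sentence(tokens, tags, word_table, entity_table, unk):
    """Look up each token and tag, then concatenate the two vectors."""
    return [word_table.get(tok, unk) + entity_table[tag]
            for tok, tag in zip(tokens, tags)]

# Hypothetical 4-d word vectors standing in for pre-trained Vietnamese embeddings.
word_table = {
    "Việt_Nam": [0.2, 0.1, 0.0, 0.3],
    "ký": [0.5, 0.4, 0.1, 0.0],
}
unk = [0.0, 0.0, 0.0, 0.0]  # fallback for out-of-vocabulary words (assumption)
entity_table = build_entity_table(["O", "B-LOC", "I-LOC"], dim=2)

tokens = ["Việt_Nam", "ký", "hiệp_định"]
tags = ["B-LOC", "O", "O"]
inputs = embed_sentence(tokens, tags, word_table, entity_table, unk)
```

Each row of `inputs` is one time step for the BiLSTM, with the word dimensions first and the entity dimensions appended.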
6. The Vietnamese news event detection method based on Chinese trigger word guidance according to claim 1, characterized in that the specific steps of Step5 are as follows: the trigger words extracted from the Vietnamese sentences are input into a classification layer, and a softmax classifier classifies the event type of each Vietnamese news sentence, thereby realizing Vietnamese news event detection.
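The classification layer of Step5 can be sketched as a linear layer followed by softmax over seven classes. The placeholder labels `type_0` to `type_6` stand in for the seven event types, and the weights and feature vector are made-up numbers rather than trained parameters.

```python
import math

# Placeholder labels for the seven event types.
EVENT_TYPES = [f"type_{i}" for i in range(7)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature, weights, bias):
    """Linear layer plus softmax; returns the predicted type and all probabilities."""
    logits = [sum(w * f for w, f in zip(row, feature)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EVENT_TYPES[best], probs

# Toy parameters: only the row for type_3 responds to the feature.
weights = [[0.0, 0.0] for _ in EVENT_TYPES]
weights[3] = [1.0, 1.0]
bias = [0.0] * len(EVENT_TYPES)
feature = [0.5, 0.5]  # stands in for the attention-derived trigger representation
label, probs = classify(feature, weights, bias)
```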
CN202011108823.8A 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance Active CN112580330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011108823.8A CN112580330B (en) 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance

Publications (2)

Publication Number Publication Date
CN112580330A true CN112580330A (en) 2021-03-30
CN112580330B CN112580330B (en) 2023-09-12

Family

ID=75119819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011108823.8A Active CN112580330B (en) 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance

Country Status (1)

Country Link
CN (1) CN112580330B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829801A (en) * 2018-06-06 2018-11-16 大连理工大学 A kind of event trigger word abstracting method based on documentation level attention mechanism
CN109359184A (en) * 2018-10-16 2019-02-19 苏州大学 English event synchronous anomalies method and system
CN109670172A (en) * 2018-12-06 2019-04-23 桂林电子科技大学 A kind of scenic spot anomalous event abstracting method based on complex neural network
CN110135457A (en) * 2019-04-11 2019-08-16 中国科学院计算技术研究所 Event trigger word abstracting method and system based on self-encoding encoder fusion document information
CN110334213A (en) * 2019-07-09 2019-10-15 昆明理工大学 The Chinese based on bidirectional crossed attention mechanism gets over media event sequential relationship recognition methods
CN110377738A (en) * 2019-07-15 2019-10-25 昆明理工大学 Merge the Vietnamese news event detecting method of interdependent syntactic information and convolutional neural networks
CN110633409A (en) * 2018-06-20 2019-12-31 上海财经大学 Rule and deep learning fused automobile news event extraction method
CN110941955A (en) * 2019-11-25 2020-03-31 中国科学院自动化研究所 Cross-language event classification method and device
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
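寇梦珂 et al.: "Vietnamese news event detection based on Chinese trigger word guidance", Journal of Chinese Information Processing, vol. 37, no. 4, pages 45 - 51 *
易士翔 et al.: "BiLSTM-based trigger word recognition for public security events", Chinese Journal of Engineering, vol. 41, no. 9, pages 1201 - 1207 *
王吉地 et al.: "Vietnamese news event detection fusing dependency information and convolutional neural networks", Journal of Nanjing University (Natural Science), vol. 56, no. 1, pages 125 - 131 *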

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627170A (en) * 2021-07-01 2021-11-09 昆明理工大学 Multi-feature fusion Vietnamese keyword generation method
CN113627170B (en) * 2021-07-01 2024-05-28 昆明理工大学 Multi-feature fusion Vietnam keyword generation method
CN114896394A (en) * 2022-04-18 2022-08-12 桂林电子科技大学 Event trigger detection and classification method based on multi-language pre-training model
CN114896394B (en) * 2022-04-18 2024-04-05 桂林电子科技大学 Event trigger word detection and classification method based on multilingual pre-training model
CN115759036A (en) * 2022-10-28 2023-03-07 中国矿业大学(北京) Method for constructing recommendation-based event detection model and method for detecting event by using model

Also Published As

Publication number Publication date
CN112580330B (en) 2023-09-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant