CN112580330B - Vietnam news event detection method based on Chinese trigger word guidance - Google Patents

Vietnam news event detection method based on Chinese trigger word guidance

Info

Publication number
CN112580330B
CN112580330B CN202011108823.8A CN202011108823A CN112580330B CN 112580330 B CN112580330 B CN 112580330B CN 202011108823 A CN202011108823 A CN 202011108823A CN 112580330 B CN112580330 B CN 112580330B
Authority
CN
China
Prior art keywords
vietnam
news
chinese
word
trigger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011108823.8A
Other languages
Chinese (zh)
Other versions
CN112580330A (en)
Inventor
高盛祥
寇梦珂
余正涛
王振晗
朱俊国
朱恩昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202011108823.8A
Publication of CN112580330A
Application granted
Publication of CN112580330B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Vietnam news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing. The method first uses adversarial learning to map the two languages into the same semantic space, then fuses entity information during encoding, uses the mapped Chinese trigger words, through an attention mechanism, to guide the model's attention to the trigger word information in Vietnamese news, and finally performs multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events. Current event detection needs to identify the trigger words in news, but no Vietnamese corpus annotated with trigger words exists at present; the invention addresses this lack of annotated Vietnamese data by using the rich annotated Chinese corpus.

Description

Vietnam news event detection method based on Chinese trigger word guidance
Technical Field
The invention relates to a Vietnam news event detection method based on Chinese trigger word guidance, and belongs to the technical field of natural language processing.
Background
Event detection is an active research problem in natural language processing, and the recognition of trigger words plays a vital role in the event detection task. At present, Vietnamese data are scarce and no trigger-word-annotated Vietnamese data exist, so events in Vietnamese news are difficult to detect. Because sentences that express the same point of view in different languages generally have the same or similar semantic components, the invention uses the rich Chinese trigger word annotations to compensate for the missing Vietnamese trigger word annotations.
Disclosure of Invention
The invention provides a Vietnam news event detection method based on Chinese trigger word guidance, which addresses three problems: Vietnamese data are scarce, no Vietnamese corpus annotated with trigger words exists, and texts in different languages are difficult to represent in the same feature space.
The technical scheme of the invention is as follows: the Vietnam news event detection method based on Chinese trigger word guidance comprises the following specific steps:
Step1, collecting Chinese-Vietnamese bilingual news texts on related news events for event detection, and de-duplicating and screening the news texts;
Step2, preprocessing the Chinese-Vietnamese news texts by word segmentation and entity labeling, annotating the event types and the Chinese trigger words in the texts, and dividing the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;
Step3, mapping Chinese and Vietnamese into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of the BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words, through an attention mechanism, to guide the model to find the trigger word information in the Vietnamese news sentences;
Step5, finally, performing multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events.
As a further scheme of the invention, in Step1, Scrapy is used as the crawling tool to simulate user operation; different templates are customized for the Chinese and Vietnamese news websites, each template being formulated from the XPath paths of the page data elements, to obtain news headlines, news times and news body text.
As a further scheme of the present invention, the specific steps of Step2 are as follows:
Step2.1, annotating trigger words and event types in the Chinese news texts, and event types in the Vietnamese news texts, with reference to the ACE event annotation scheme, and dividing the event types into seven types, namely "Chuyến thăm" (visit), "Gặp" (meet), "Tiếp xúc" (contact), "Thuộc kinh tế" (economic), "Thay đổi" (change), "Giao dịch" (transaction) and "Cuộc xung đột" (conflict);
Step2.2, dividing the experimental data into a training corpus, a test corpus and a validation corpus.
As a further aspect of the present invention, the specific steps of Step3 are:
Step3.1, using an extended skip-gram model to predict the context of each target word in Chinese and, at the same time, the context of its aligned word in Vietnamese, thereby obtaining Chinese-Vietnamese bilingual word vectors;
Step3.2, projecting Chinese into the same semantic space as Vietnamese with a mapping function, and training the word discriminator and the mapping function in turn by stochastic gradient descent;
Step3.3, given Chinese news texts, annotating the trigger words in their sentences.
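The training pairs implied by Step3.1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `bilingual_skipgram_pairs`, the `alignment` dictionary (Chinese word index to aligned Vietnamese word index, for instance from a word aligner) and the window size are hypothetical and are not specified in the patent.

```python
def bilingual_skipgram_pairs(zh_sent, vi_sent, alignment, window=1):
    """Return (target, context) pairs for an extended skip-gram model.

    alignment maps an index in zh_sent to the index of its aligned word
    in vi_sent (hypothetical input; the patent does not say how word
    alignments are obtained).
    """
    pairs = []
    for i, target in enumerate(zh_sent):
        # Monolingual part: Chinese context of the Chinese target word.
        for j in range(max(0, i - window), min(len(zh_sent), i + window + 1)):
            if j != i:
                pairs.append((target, zh_sent[j]))
        # Cross-lingual part: Vietnamese context of the aligned word.
        if i in alignment:
            a = alignment[i]
            for j in range(max(0, a - window), min(len(vi_sent), a + window + 1)):
                if j != a:
                    pairs.append((target, vi_sent[j]))
    return pairs


zh = ["总理", "访问", "越南"]
vi = ["Thủ_tướng", "thăm", "Việt_Nam"]
pairs = bilingual_skipgram_pairs(zh, vi, {0: 0, 1: 1, 2: 2})
```

Each Chinese target word thus contributes pairs from both languages, which is what ties the two vocabularies together before the adversarial mapping of Step3.2.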
As a further scheme of the invention, the specific steps for acquiring semantic information of Vietnamese news with the BiLSTM in Step4 are as follows:
Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector table; randomly initializing one entity vector for each entity tag type provided by the underthesea tool to obtain an entity vector table; and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector table and the entity vector table;
Step4.2, concatenating the word vectors and entity vectors as the input of the BiLSTM to capture the semantic information in sentences.
As a further scheme of the present invention, the specific steps of Step5 are as follows: the trigger word information extracted from the Vietnamese sentences is input to a classification layer, and a softmax classifier classifies the event types of the Vietnamese news sentences, thereby detecting Vietnamese news events.
The beneficial effects of the invention are as follows:
1. The method maps the two languages into the same semantic space by adversarial learning: a mapping function moves Chinese progressively closer to Vietnamese until the discriminator can no longer tell the two languages apart, after which the mapped Chinese trigger word vectors are extracted;
2. The method uses a BiLSTM to mine the implicit contextual semantic information of event sentences, uses the mapped Chinese trigger words, through an attention mechanism, to guide the model to the trigger word information in Vietnamese sentences, and finally uses the resulting attention context vector for multi-class classification of event types;
3. Exploiting bilingual consistency, the method uses the rich Chinese trigger word annotations to find trigger word information in Vietnamese news sentences, and classifies through a softmax layer;
4. The method solves the problem of missing trigger word annotations in the Vietnamese event detection task.
Drawings
FIG. 1 is a flow chart of the Vietnam news event detection method based on Chinese trigger word guidance provided by the invention;
FIG. 2 is a diagram of the Vietnam news event detection model based on Chinese trigger word guidance.
Detailed Description
Example 1: as shown in FIGS. 1-2, the Vietnam news event detection method based on Chinese trigger word guidance comprises the following specific steps:
Step1, collecting Chinese-Vietnamese bilingual news texts on related news events for event detection. Vietnamese news websites (Vietnam news agency, Vietnam economic hours and Vietnam doors) are crawled, and Chinese news websites (Baidu, Xinhua Net and people net) are then crawled for the news topics obtained from the Vietnamese sites; in total, 813 Vietnamese news texts and 4065 Chinese news texts are collected. Finally, the news texts are de-duplicated and screened;
In Step1, the preferred scheme uses Scrapy as the crawling tool to simulate user operation; different templates are customized for the Chinese and Vietnamese news websites, each formulated from the XPath paths of the page data elements, to obtain data such as news headlines, news times and news body text.
This preferred scheme is an important component of the invention and mainly provides data support for the corpus collection process and for event temporal relationship recognition.
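The template-driven extraction and de-duplication of Step1 can be sketched as follows. This is a minimal sketch that uses the Python standard library as a stand-in for Scrapy so that it runs without a network or extra packages; the site name `site_a`, the XPath expressions and the sample page are hypothetical. In an actual Scrapy spider the same XPath strings would typically be passed to `response.xpath(...)`.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical per-site templates: field name -> XPath of the page element.
TEMPLATES = {
    "site_a": {
        "title": ".//h1",
        "time": ".//span[@class='time']",
        "text": ".//div[@class='body']",
    },
}


def extract(page_xml, template):
    """Apply one site template to a well-formed page and return its fields."""
    root = ET.fromstring(page_xml)
    return {field: (root.findtext(xpath) or "").strip()
            for field, xpath in template.items()}


def deduplicate(records):
    """Drop records whose body text has already been seen."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.md5(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


page = ("<html><body><h1>Headline</h1>"
        "<span class='time'>2020-10-16</span>"
        "<div class='body'>Body text.</div></body></html>")
records = [extract(page, TEMPLATES["site_a"]) for _ in range(2)]
unique = deduplicate(records)
```

Keeping one template per site means that only the XPath table has to change when a website changes its layout; the exact-match hash used here for de-duplication is an assumption, since the patent does not state the criterion.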
Step2, preprocessing the Chinese-Vietnamese news texts by word segmentation and entity labeling, annotating the event types and the Chinese trigger words in the texts, and dividing the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus in the ratio 8:1:1;
Step3, mapping the two languages into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of the BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words, through an attention mechanism, to guide the model to find the trigger word information in the Vietnamese news sentences;
Step5, finally, performing multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events.
As a preferred embodiment of the present invention, the specific steps of Step2 are:
Step2.1, an event in the invention consists of a trigger word and arguments: the trigger word clearly expresses the occurrence of one kind of event and is usually a single verb or noun, while the arguments describe information such as the time, place and participants of the event. For the Chinese-Vietnamese bilingual related news events defined for this work, the Chinese trigger words and the event types in the Chinese-Vietnamese news texts are annotated;
Step2.2, following the format of the ACE2005 dataset, 7 event types are defined, over 25089 news sentences in total;
Step2.3, dividing the experimental data into a training corpus, a test corpus and a validation corpus.
As a preferred embodiment of the present invention, in Step2 the event types are divided into seven types, namely "Chuyến thăm" (visit), "Gặp" (meet), "Tiếp xúc" (contact), "Thuộc kinh tế" (economic), "Thay đổi" (change), "Giao dịch" (transaction) and "Cuộc xung đột" (conflict).
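The 8:1:1 division of the annotated corpus described for Step2 can be sketched as follows. The shuffling and the fixed seed are assumptions made for reproducibility; the patent only states the proportion.

```python
import random


def split_corpus(sentences, ratios=(8, 1, 1), seed=0):
    """Split sentences into training, test and validation parts."""
    items = list(sentences)
    # Shuffling before splitting is an assumption, not stated in the patent.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    dev = items[n_train + n_test:]  # the remainder is the validation corpus
    return train, test, dev


train, test, dev = split_corpus(range(100))
```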
As a preferred embodiment of the present invention, the specific steps of Step3 are:
Step3.1, pre-training Chinese word vectors $V_s \in \mathbb{R}^{E \times d_s}$ and Vietnamese word vectors $V_z \in \mathbb{R}^{N \times d_z}$ with the extended skip-gram model, where $E$ and $N$ are the respective vocabulary sizes, and $d_s$ and $d_z$ are the Chinese and Vietnamese word vector dimensions. Chinese is then projected into the same semantic space as Vietnamese with a mapping function $f$:
$$\tilde{V}_s = f(V_s) = V_s U \tag{1}$$
where $U \in \mathbb{R}^{d_s \times d_z}$ is the mapping matrix and $\tilde{V}_s \in \mathbb{R}^{E \times d_z}$ is the projected Chinese word vectors.
The transformation matrix $U$ is constrained to be orthogonal, $U^{\top} U = I$, by means of singular value decomposition (SVD), which reduces the parameter search space.
Step3.2, to optimize the mapping function $f$, a multi-layer perceptron $D$ is introduced as a word discriminator; it takes the Vietnamese word vectors and the mapped Chinese word vectors as input and outputs a single scalar. $D(x_i)$ is the probability that $x_i$ comes from the Vietnamese vocabulary. The word discriminator is trained with the binary cross-entropy loss:
$$L_{dis} = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \Big( y_i \log D(x_i) + (1 - y_i) \log\big(1 - D(x_i)\big) \Big) \tag{3}$$
$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \tag{4}$$
where $\delta_i = 1$ indicates that the word is from $z$ and $\delta_i = 0$ that the word is from $s$; $I_{s;z}$ is the number of words sampled together from the vocabularies of $z$ and $s$; and $\epsilon$ is a smoothing value added to the positive and negative labels.
The mapping function $f$ and the word discriminator $D$ are two adversarial players; $f$ is optimized by flipping the word labels and minimizing the loss:
$$L_f = -\frac{1}{I_{s;z}} \sum_{i=1}^{I_{s;z}} \Big( (1 - y_i) \log D(x_i) + y_i \log\big(1 - D(x_i)\big) \Big) \tag{5}$$
$$y_i = \delta_i (1 - 2\epsilon) + \epsilon \tag{6}$$
The two languages are mapped into the same semantic space by adversarial learning: the word discriminator and the mapping function are trained in turn with stochastic gradient descent (SGD) to minimize $L_{dis}$ and $L_f$.
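The smoothed labels of equations (4) and (6) and the two adversarial objectives can be sketched as follows. This is a minimal sketch: the function names, the value of `eps` and the toy probabilities are hypothetical, and the flipped-label form used for the mapping loss (swapping $y_i$ and $1 - y_i$) is an assumption about what "flipping word labels" means.

```python
import math


def smoothed_label(delta, eps):
    # y_i = delta_i * (1 - 2 * eps) + eps, as in equations (4) and (6).
    return delta * (1 - 2 * eps) + eps


def discriminator_loss(probs, deltas, eps=0.1):
    """Binary cross-entropy between D(x_i) and the smoothed labels."""
    total = 0.0
    for p, d in zip(probs, deltas):
        y = smoothed_label(d, eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)


def mapping_loss(probs, deltas, eps=0.1):
    """Same cross-entropy with flipped labels (assumed form)."""
    total = 0.0
    for p, d in zip(probs, deltas):
        y = smoothed_label(d, eps)
        total += (1 - y) * math.log(p) + y * math.log(1 - p)
    return -total / len(probs)


# D(x_i) for one Vietnamese word (delta = 1) and one mapped Chinese word (delta = 0).
probs = [0.9, 0.2]
deltas = [1, 0]
l_dis = discriminator_loss(probs, deltas)
l_map = mapping_loss(probs, deltas)
```

When the discriminator separates the two languages well, as in the toy values above, its own loss is small while the mapping loss is large; training the two in turn pushes the mapped Chinese vectors toward the Vietnamese distribution.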
Step3.3, given Chinese news texts, the trigger words in their sentences are annotated. The Chinese trigger words are mapped into the same semantic space as Vietnamese through the mapping matrix, and all mapped Chinese trigger words are converted into a set of mapping vectors $G = \{g_1, g_2, \ldots, g_m\}$, which is used to capture the hidden trigger words in Vietnamese sentences.
This preferred scheme is an important component of the invention: it provides the vector encoding process, mapping the two languages into the same semantic space by means of bilingual word vectors, and lays the groundwork for the subsequent step in which the mapped Chinese trigger words guide the model to find trigger words in Vietnamese sentences.
As a preferred scheme of the invention, the Chinese trigger word information is used, through an attention mechanism, to enhance the trigger word semantic information in Vietnamese news sentences, wherein:
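Projecting the annotated Chinese trigger words through the mapping matrix to obtain the set G can be sketched as follows. The row-vector convention (vector times matrix), the toy 2-dimensional vectors and the rotation matrix standing in for a learned orthogonal matrix are all assumptions for illustration.

```python
def project(vec, U):
    """Return the row vector vec multiplied by the matrix U."""
    return [sum(vec[r] * U[r][c] for r in range(len(vec)))
            for c in range(len(U[0]))]


# Toy orthogonal mapping matrix (a 90-degree rotation), not a trained one.
U = [[0.0, 1.0],
     [-1.0, 0.0]]

# Hypothetical Chinese trigger word vectors.
chinese_triggers = {"袭击": [1.0, 0.0], "访问": [0.0, 2.0]}

# G: the mapped trigger word vectors used later by the attention mechanism.
G = {word: project(vec, U) for word, vec in chinese_triggers.items()}
```

Because the matrix is orthogonal, the projection preserves vector lengths and angles, so distances between Chinese words are unchanged by the mapping.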
In Step4, the BiLSTM is used to acquire semantic information of Vietnamese news, with the following specific steps:
Step4.1, given a Vietnamese news sentence $S = \{w_1, w_2, \ldots, w_n\}$ containing $n$ words, each word $w_i$ in $S$ is tagged by underthesea with an entity type $e_i$. The word vector corresponding to $w_i$ is then looked up in the word vector table, and the entity vector corresponding to $e_i$ is looked up in the entity vector table. Finally, the word vector and the entity vector are concatenated to form the final vector representation $v_i$ of $w_i$.
Each word $w_i$ in $S$ is represented as a vector $v_i$ in this way, and the vectors are joined with the $\oplus$ concatenation operator, so that the semantic representation matrix $M_s$ of sentence $S$ is:
$$M_s = v_1 \oplus v_2 \oplus \cdots \oplus v_n$$
Step4.2, concatenating the word vectors and entity vectors as the input of the BiLSTM to capture the semantic information in sentences.
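The table lookups and the concatenation of Step4.1 and Step4.2 can be sketched as follows. The vocabulary, the tag set, the vector dimensions and the initialization range are hypothetical, and the entity tags are passed in directly rather than produced by underthesea.

```python
import random

rng = random.Random(0)
ENTITY_DIM = 3

# Hypothetical pre-trained Vietnamese word vectors (dimension 2).
word_table = {
    "Thủ_tướng": [0.1, 0.2],
    "thăm": [0.3, 0.4],
    "Hà_Nội": [0.5, 0.6],
}

# One randomly initialized vector per entity tag type (hypothetical tag set).
entity_tags = ["O", "B-PER", "B-LOC"]
entity_table = {tag: [rng.uniform(-0.1, 0.1) for _ in range(ENTITY_DIM)]
                for tag in entity_tags}


def encode(words, tags):
    """Concatenate each word vector with the vector of its entity tag."""
    return [word_table[w] + entity_table[e] for w, e in zip(words, tags)]


# M_s has one row v_i per word; each row is word vector + entity vector.
M_s = encode(["Thủ_tướng", "thăm", "Hà_Nội"], ["B-PER", "O", "B-LOC"])
```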
As a preferable scheme of the invention, because different languages are consistent within the same news topic, the annotated Chinese trigger word information is used to guide the search for trigger word information in Vietnamese news texts;
The BiLSTM:
Sentences are modeled with a BiLSTM that runs over the sequence of concatenated word and entity embeddings. A BiLSTM can be regarded as two unidirectional LSTMs, a forward LSTM and a backward LSTM, so that the output at the current time step is linked both to the state at the previous time step and to the state at the next time step.
The vectors of the words in a Vietnamese news sentence are fed in order into a network of BiLSTM units to obtain the hidden layer vectors $h = \{h_1, h_2, \ldots, h_n\}$ of the sentence, where $h_i$ is the hidden layer vector representation of the $i$-th word. At each step, the forward LSTM takes the input $w_t$ at time $t$ and the previous hidden state vector $h_{t-1}$ and computes the current hidden state vector $\overrightarrow{h}_t$; the LSTM is then run in the reverse direction to generate the backward hidden layer vector representation $\overleftarrow{h}_t$.
The forward LSTM and the backward LSTM together form the BiLSTM. Unlike a unidirectional LSTM, the input is processed in both the forward and the backward direction, and the two resulting hidden states are concatenated to serve as the input of the next layer.
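The bidirectional scheme described above can be sketched as follows. To keep the example short, a simplified elementwise tanh recurrent cell is used as a stand-in for a full LSTM cell (no gates, no cell state); the fixed weights `w_x` and `w_h` are hypothetical. Only the forward pass, the backward pass and the concatenation of their hidden states are illustrated.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.5):
    # Simplified stand-in for an LSTM cell:
    # h_t = tanh(w_x * x_t + w_h * h_{t-1}), applied elementwise.
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h_prev)]


def run(inputs):
    """Run the cell over a sequence and return every hidden state."""
    h = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


def bidirectional(inputs):
    forward = run(inputs)
    # Process the reversed sequence, then re-align to the original word order.
    backward = run(inputs[::-1])[::-1]
    # h_i is the concatenation of the forward and backward states of word i.
    return [f + b for f, b in zip(forward, backward)]


H = bidirectional([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each element of `H` therefore carries information from the words before it (first half) and the words after it (second half).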
The attention mechanism:
Each type of event is typically triggered by a specific set of words, which are called event trigger words. For example, conflict-type events are typically triggered by words such as "fight" and "attack". Event trigger words are therefore important clues for the event detection task. Given the set of Chinese trigger word vectors $G = \{g_1, g_2, \ldots, g_m\}$ and the BiLSTM hidden states $h = \{h_1, h_2, \ldots, h_n\}$, the attention weights between each trigger word vector $g_k$ ($k = 1, 2, \ldots, m$) and the hidden states are computed, giving a set of attention weight vectors $\alpha = \{\alpha_1, \ldots, \alpha_m\}$. Specifically, the attention weight between the $k$-th Chinese trigger word vector $g_k$ in $G$ and the hidden state $h_t$ at time $t$ is calculated by equation (10); the trigger words of the target event type in the Vietnamese news are expected to receive higher weights than other words.
After the attention weights between $g_k$ and the hidden states $h = \{h_1, h_2, \ldots, h_n\}$ at all time steps have been computed, an attention weight vector $\alpha_k = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is obtained. Traversing all of $G = \{g_1, g_2, \ldots, g_m\}$ yields the set of attention weight vectors $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$. The weight vector that contains the largest single weight in this set is then taken as the final attention weight vector of the current input sentence, denoted $\alpha_{max} = [\alpha_1, \alpha_2, \ldots, \alpha_n]$, because the Chinese trigger word vector in $G$ that is most relevant to the current input sentence produces the largest attention weight.
Finally, $\alpha_{max}$ and $h$ are combined by weighted summation to obtain the vector representation $S_{att}$ of the current input sentence, as in equation (11):
$$S_{att} = \sum_{i=1}^{n} \alpha_i h_i \tag{11}$$
where $\alpha_i$ is the $i$-th element of $\alpha_{max}$ and $i = 1, 2, \ldots, n$.
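The selection of the final attention weight vector and the weighted sum of equation (11) can be sketched as follows. The scoring function is an assumption: a softmax over dot products between each trigger word vector and the hidden states is used here because the text of equation (10) is not reproduced; the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def trigger_guided_attention(G, H):
    # One attention weight vector per Chinese trigger word vector g_k
    # (dot-product scoring is an assumed stand-in for equation (10)).
    alphas = [softmax([dot(g, h) for h in H]) for g in G]
    # Keep the weight vector that contains the single largest weight.
    alpha_max = max(alphas, key=max)
    # S_att = sum_i alpha_i * h_i, as in equation (11).
    dim = len(H[0])
    s_att = [sum(a * h[d] for a, h in zip(alpha_max, H)) for d in range(dim)]
    return alpha_max, s_att


H = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hidden states of three words
G = [[0.1, 0.1], [0.0, 3.0]]              # two mapped trigger word vectors
alpha_max, s_att = trigger_guided_attention(G, H)
```

In this toy case the second trigger vector matches the second word strongly, so its weight vector is selected and the sentence representation is dominated by that word's hidden state.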
This preferred scheme uses the Chinese trigger words to guide the model to the trigger word information in Vietnamese, which is then used to classify event types. The BiLSTM extracts information in both the forward and the backward direction, which alleviates the long-distance dependency problem and mines the implicit semantic information of event sentences more effectively. The attention mechanism increases the weight of the trigger words of the current event, which helps the Vietnamese event detection task achieve its best results.
As a preferred embodiment of the present invention, the specific steps of Step5 are: the extracted Vietnamese trigger word information is input to a classification layer, and a softmax classifier classifies the Vietnamese news events, thereby detecting Vietnamese news events.
As a preferred embodiment of the present invention, the Chinese-Vietnamese related news events are divided into seven event types.
As a preferred embodiment of the present invention, the vector representation $S_{att}$ of the current input sentence is fed into a softmax layer to obtain the probability distribution $P$ over the event types to be predicted:
$$P = \mathrm{softmax}(W \cdot S_{att} + b) \tag{12}$$
where $W$ and $b$ are the weight and bias of the softmax layer, respectively.
The Chinese trigger words used in this preferred scheme act as a constraint and help to better identify event temporal relations.
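Equation (12) can be sketched as follows. The weights are toy values chosen so that one class responds, and the placeholder names in `EVENT_TYPES` stand for the seven event types; neither is taken from the patent.

```python
import math

EVENT_TYPES = ["type_%d" % k for k in range(7)]  # placeholders for the seven types


def classify(s_att, W, b):
    """Return P = softmax(W . S_att + b) as a list of probabilities."""
    logits = [sum(w * x for w, x in zip(row, s_att)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters: only the row for type_2 reacts to the first feature.
W = [[float(k == 2), 0.0] for k in range(7)]
b = [0.0] * 7
P = classify([4.0, 1.0], W, b)
predicted = EVENT_TYPES[P.index(max(P))]
```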
Step6, experiments are carried out on the encoding features and on the presence or absence of Chinese trigger words to show that the model settings are reasonable and effective, and the model is compared with existing models to show that the method performs better on Vietnamese event detection.
The experiments use precision (P), recall (R) and F value (F) as evaluation metrics.
Precision (P): the proportion of correctly predicted events among all predicted events.
Recall (R): the proportion of correctly predicted events among all real events.
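The three metrics can be computed as in the following sketch. Two points are assumptions: the F value is taken to be the balanced F1 score, F = 2PR / (P + R), which the text does not spell out, and sentences without an event carry a hypothetical label "None". The gold and predicted labels are toy data.

```python
def precision_recall_f(gold, predicted, none_label="None"):
    """Return precision, recall and F1 over sentence-level event labels."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p != none_label and p == g)
    n_pred = sum(1 for p in predicted if p != none_label)
    n_gold = sum(1 for g in gold if g != none_label)
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["Gặp", "None", "Giao dịch", "Gặp"]
pred = ["Gặp", "Gặp", "None", "Gặp"]
p, r, f = precision_recall_f(gold, pred)
```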
To verify whether the proposed model improves event detection, a first set of experiments was set up. On the Vietnamese news dataset, the proposed model is compared with the CNN model and the GCN model, and with a baseline model (TBNNAM) that works without annotated trigger words. The experimental results are shown in Table 1:
Table 1 Performance of different models
The comparison shows that LSTM outperforms CNN, mainly because LSTM alleviates the vanishing and exploding gradient problems during training. The global information captured by the last state of the LSTM in the TBNNAM model is also important for this task, and it complements the local information captured by the attention mechanism. The BiLSTM used here captures more of the semantic information in a sentence than an LSTM. The experimental results show that the proposed model performs better.
A second set of experiments targets the encoding features fused in the word embedding layer, to verify whether fusing entity information into the word vectors improves event detection. This experiment compares the model before and after the entity vectors are added. The experimental results are shown in Table 2:
Table 2 Effect of encoding features on model performance
The comparison shows that entity annotations help capture the semantic information of words. After the entity vectors are added, the precision, recall and F value of the model all improve over the model without them, so adding entity vectors improves event detection performance.
Because Vietnamese is complex, it is difficult to annotate trigger words in Vietnamese sentences; a third set of experiments was therefore set up to verify whether incorporating Chinese trigger words improves event detection. The experiment compares event detection with and without Chinese trigger words, and the results are shown in Table 3:
Table 3 Model performance with and without Chinese trigger word guidance
The comparison shows that the model with trigger word annotations performs clearly better than the model without them. Different languages are consistent for sentences on the same news event, so the Chinese trigger words can be used to find the trigger words in the corresponding Vietnamese sentences, completing event detection for Vietnamese news.
The data show that fusing entity information captures the semantic information of sentences better, and that guiding the model with Chinese trigger words finds the trigger word information in Vietnamese sentences, thereby achieving Vietnamese news event detection.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. The Vietnam news event detection method based on Chinese trigger word guidance is characterized by comprising the following steps:
Step1, collecting Chinese-Vietnamese bilingual news texts on related news events for event detection, and de-duplicating and screening the news texts;
Step2, preprocessing the Chinese-Vietnamese news texts by word segmentation and entity labeling, annotating the event types and the Chinese trigger words in the Chinese-Vietnamese bilingual news texts, and dividing the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation corpus;
Step3, mapping Chinese and Vietnamese into the same semantic space by adversarial learning, and extracting the mapped Chinese trigger word vectors;
Step4, fusing the Vietnamese word vectors with the entity vectors of the sentence as the input of the BiLSTM layer; acquiring semantic information of Vietnamese news sentences with the BiLSTM, and using the mapped Chinese trigger words, through an attention mechanism, to guide the model to find the trigger word information in the Vietnamese news sentences;
Step5, finally, performing multi-class classification of event types with the obtained trigger word information, thereby detecting Vietnamese news events;
the specific steps of Step3 are as follows:
Step3.1, using an extended skip-gram model to predict the context of each target word in Chinese and, at the same time, the context of its aligned word in Vietnamese, thereby obtaining Chinese-Vietnamese bilingual word vectors;
Step3.2, projecting Chinese into the same semantic space as Vietnamese with a mapping function, and training the word discriminator and the mapping function in turn by stochastic gradient descent;
Step3.3, given Chinese news texts, annotating the trigger words in their sentences;
in Step4, the BiLSTM is used to acquire semantic information of Vietnamese news, with the following specific steps:
Step4.1, pre-training Vietnamese word vectors on a Vietnamese corpus to obtain a word vector table; randomly initializing one entity vector for each entity tag type provided by the underthesea tool to obtain an entity vector table; and converting all input words and entity tags into low-dimensional vectors by looking them up in the word vector table and the entity vector table;
Step4.2, concatenating the word vectors and entity vectors as the input of the BiLSTM to capture the semantic information in sentences.
2. The method for detecting Vietnam news events based on Chinese trigger word guidance according to claim 1, wherein the method comprises the following steps: in Step1, using the Scrapy as a crawling tool to simulate user operation, customizing different templates for Chinese and Vietnam news websites, and obtaining detailed data according to XPath path formulation templates of page data elements to obtain news headlines, news time and news text data.
3. The Vietnam news event detection method based on Chinese trigger word guidance according to claim 1, wherein the specific steps of Step2 are as follows:
step2.1, annotating the trigger words and event types in the Chinese news texts and the event types in the Vietnamese news texts with reference to the ACE event annotation scheme, the event types being divided into seven types: 'Chuyến thăm', 'Gặp', 'Tiếp xúc', 'Thuộc kinh tế', 'Thay đổi', 'Giao dịch' and 'Cuộc xung đột';
step2.2, dividing the experimental data into a training corpus, a test corpus and a validation corpus.
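The corpus split in step2.2 could be done along these lines. The claim does not state split proportions, so the 80/10/10 ratios, the fixed seed, the sample data and the `split_corpus` helper are assumptions made only for this sketch.

```python
import random


def split_corpus(samples, train_ratio=0.8, test_ratio=0.1, seed=42):
    """Shuffle samples and split them into training, test and validation sets."""
    if not 0 < train_ratio + test_ratio < 1:
        raise ValueError("train_ratio + test_ratio must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_ratio)
    n_test = int(len(shuffled) * test_ratio)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


# Hypothetical annotated sentences: (sentence id, event type).
data = [(i, "Gặp" if i % 2 else "Thay đổi") for i in range(100)]
train, test, validation = split_corpus(data)
print(len(train), len(test), len(validation))  # prints "80 10 10"
```

Shuffling before slicing keeps the three subsets disjoint while avoiding any ordering bias in the crawled data.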
4. The Vietnam news event detection method based on Chinese trigger word guidance according to claim 1, wherein the specific steps of Step5 are as follows: the trigger words extracted from the Vietnamese sentences are input to a classification layer, and a softmax classifier classifies the event type of each Vietnamese news sentence, thereby detecting Vietnamese news events.
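The classification layer of Step5 can be illustrated with a small softmax classifier. This is a sketch under assumptions rather than the patented model: the 3-dimensional feature vector, the hand-set weights and the `classify` helper are hypothetical (in practice the parameters would be learned), and the label list is a reading of the seven event types named in claim 3.

```python
import math

# Seven event type labels, as read from claim 3 (diacritics restored).
EVENT_TYPES = ["Chuyến thăm", "Gặp", "Tiếp xúc", "Thuộc kinh tế",
               "Thay đổi", "Giao dịch", "Cuộc xung đột"]


def softmax(scores):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Linear layer followed by softmax; returns (label, probabilities)."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EVENT_TYPES[best], probs


# Hypothetical trigger representation and hand-set parameters.
features = [1.0, 0.0, 2.0]
weights = [[0.0, 0.0, 0.0] for _ in EVENT_TYPES]
weights[1] = [0.5, 0.0, 1.0]  # make the second label score highest
biases = [0.0] * len(EVENT_TYPES)

label, probs = classify(features, weights, biases)
print(label)  # prints "Gặp"
```

The predicted event type is simply the label with the highest softmax probability.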
CN202011108823.8A 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance Active CN112580330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011108823.8A CN112580330B (en) 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance

Publications (2)

Publication Number Publication Date
CN112580330A CN112580330A (en) 2021-03-30
CN112580330B CN112580330B (en) 2023-09-12

Family

ID=75119819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011108823.8A Active CN112580330B (en) 2020-10-16 2020-10-16 Vietnam news event detection method based on Chinese trigger word guidance

Country Status (1)

Country Link
CN (1) CN112580330B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627170A (en) * 2021-07-01 2021-11-09 Kunming University of Science and Technology Multi-feature fusion Vietnamese keyword generation method
CN114896394B (en) * 2022-04-18 2024-04-05 Guilin University of Electronic Technology Event trigger word detection and classification method based on multilingual pre-training model
CN115759036B (en) * 2022-10-28 2023-08-04 China University of Mining and Technology (Beijing) Method for constructing event detection model based on recommendation and method for carrying out event detection by using model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829801A (en) * 2018-06-06 2018-11-16 Dalian University of Technology Event trigger word extraction method based on document-level attention mechanism
CN109359184A (en) * 2018-10-16 2019-02-19 Soochow University English event synchronous anomalies method and system
CN109670172A (en) * 2018-12-06 2019-04-23 Guilin University of Electronic Technology Scenic spot abnormal event extraction method based on composite neural network
CN110135457A (en) * 2019-04-11 2019-08-16 Institute of Computing Technology, Chinese Academy of Sciences Event trigger word extraction method and system based on autoencoder fusing document information
CN110334213A (en) * 2019-07-09 2019-10-15 Kunming University of Science and Technology Chinese-Vietnamese news event temporal relation recognition method based on bidirectional cross attention mechanism
CN110377738A (en) * 2019-07-15 2019-10-25 Kunming University of Science and Technology Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks
CN110633409A (en) * 2018-06-20 2019-12-31 Shanghai University of Finance and Economics Rule and deep learning fused automobile news event extraction method
CN110941955A (en) * 2019-11-25 2020-03-31 Institute of Automation, Chinese Academy of Sciences Cross-language event classification method and device
CN111382575A (en) * 2020-03-19 2020-07-07 University of Electronic Science and Technology of China Event extraction method based on joint labeling and entity semantic information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
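Vietnamese news event detection fusing dependency information and convolutional neural networks; Wang Jidi et al.; Journal of Nanjing University (Natural Science); Vol. 56, No. 1; pp. 125-131 *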

Also Published As

Publication number Publication date
CN112580330A (en) 2021-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant