CN110377738A

CN110377738A - Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks

Info

Publication number: CN110377738A
Application number: CN201910635489.2A
Authority: CN
Inventors: 余正涛; 刘畅; 高盛祥; 张亚飞; 王吉地; 王振晗; 郭军军
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-07-15
Filing date: 2019-07-15
Publication date: 2019-10-25

Abstract

The present invention relates to a Vietnamese news event detection method that fuses dependency syntactic information with convolutional neural networks, and belongs to the technical field of natural language processing. The invention first collects Chinese-Vietnamese bilingual news texts, defines event types and an annotation scheme for event detection according to the characteristics of events, and forms training data. A convolutional neural network fusing dependency syntactic information is then used to detect Vietnamese news events at the sentence level. During encoding, word meaning, position information, part-of-speech information and named entity information are first fused. Next, conventional convolution encodes the features between consecutive words, and convolution fusing dependency syntactic information encodes the features between non-consecutive words; the two sets of features are fused into an event encoding that is used for news event detection. The invention achieves good results in news event detection.

Description

Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks

Technical field

The present invention relates to a Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks, and belongs to the technical field of natural language processing.

Background art

Event detection is an important information extraction task in natural language processing; it aims to identify events of specified types in text. At present, event detection research is mostly carried out for Chinese and English. Because Vietnamese is a low-resource language, event detection for Vietnamese has not yet been addressed. Automatically detecting news events in Vietnamese news text by machine, using artificial intelligence techniques, is therefore both a difficulty and one of the key technologies of this task.

Event detection currently relies mainly on the following two classes of methods. (1) Machine learning methods. Zhang Xuan et al. proposed an event extraction framework built around the DPEMM model. Pei Donghui et al. proposed automatic recognition of sub-event categories based on a support vector machine model. Gao Yongbing et al. improved TF-IDF for the characteristics of microblogs to obtain event extraction results. (2) Deep learning methods. Building on existing research, Nguyen et al. proposed a joint method based on recurrent neural networks for English event extraction. Chen et al. proposed the dynamic multi-pooling convolutional neural network (DMCNN) to handle the recognition of multiple events in a sentence and the matching of shared arguments. Nguyen et al. applied a convolutional neural network to the words in a sentence to obtain the semantic information implicit in the sentence. The above detection methods all target other languages; the present invention therefore proposes a Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks.

Summary of the invention

The present invention provides a Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks, which addresses the problem of detecting and classifying Vietnamese news events and realizes Chinese-Vietnamese bilingual news event type detection.

The technical solution of the present invention is a Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks. Chinese-Vietnamese bilingual news texts are first collected; event types and an annotation scheme for event detection are defined according to the characteristics of events, and training data are formed. A convolutional neural network fusing dependency syntactic information is then used to detect Vietnamese news events at the sentence level. During encoding, word meaning, position information, part-of-speech information and named entity information are first fused. Next, conventional convolution encodes the features between consecutive words, and convolution fusing dependency syntactic information encodes the features between non-consecutive words; the two sets of features are fused into an event encoding that is used for news event detection；

The specific steps of the detection method are as follows:

Step1, corpus collection: collect news texts for Vietnamese event detection. Scrapy is used as the crawling tool to imitate user operations. A different template is customized for each Vietnamese news website; each template is formulated from the XPath paths of the page data elements and is used to obtain detailed data such as the news headline, news time and body text. The news texts are then deduplicated and screened；
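The template idea in Step1 can be sketched as follows. This is only an illustration, not the crawler of the invention: it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` instead of Scrapy, assumes a well-formed page, and the template paths, the sample page and the function names `extract` and `deduplicate` are all hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical per-site template: field name -> XPath of the page element.
TEMPLATE = {
    "title": ".//h1[@class='title']",
    "time": ".//span[@class='time']",
    "body": ".//div[@class='body']",
}

def extract(page, template):
    """Apply one site template to a well-formed page and return its fields."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in template.items():
        node = root.find(xpath)
        # Join all text inside the element; empty string if the path is absent.
        record[field] = "".join(node.itertext()).strip() if node is not None else ""
    return record

def deduplicate(records):
    """Keep the first record for each distinct body text."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha1(rec["body"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

# Made-up page used only to exercise the sketch.
PAGE = """<html><body>
<h1 class="title">Sample headline</h1>
<span class="time">2019-07-15</span>
<div class="body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

# The same page crawled twice collapses to a single record.
records = deduplicate([extract(PAGE, TEMPLATE), extract(PAGE, TEMPLATE)])
```

Real news pages are rarely well-formed XML, so a production crawler would rely on an HTML-tolerant selector such as the one Scrapy provides; the per-site dictionary of paths is the part that carries over.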

Step2, corpus construction: using the annotation scheme for Vietnamese event detection, the Vietnamese news texts are annotated according to the linguistic characteristics of Vietnamese and the needs of event detection, and the annotated Vietnamese news corpus is divided into a training corpus, a test corpus and a validation set；

As a preferred solution of the present invention, in Step2, a news event text consists of a trigger word and arguments. The trigger word clearly expresses the occurrence of an event; the main word that triggers an event is usually a single verb or noun. The arguments describe information such as the time, place and persons of the event. The annotation scheme organizes the text in XML (Extensible Markup Language) and marks the trigger word, the arguments and the event category separately. The collected Vietnamese news texts are annotated in this way to build a Vietnamese news event detection dataset. The trigger vocabulary is shown in Table 1.

Table 1. Trigger vocabulary

Step3, text vectorization: train Vietnamese word vectors, and fuse the word vector, position vector, part-of-speech vector and entity type vector of each word in the sentence as the model input；

As a preferred solution of the present invention, in Step3, the Vietnamese word vectors are trained with the skip-gram language model, and a position embedding table, a part-of-speech embedding table and an entity type embedding table are built to embed the position information, part-of-speech information and entity type information into vectors.
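A minimal sketch of how the four vectors might be fused into one input row per word is given below. It assumes fusion by concatenation and uses small random lookup tables in place of trained skip-gram vectors; the tokens, tags, dimensions and the names `make_table` and `encode_sentence` are hypothetical.

```python
import random

def make_table(keys, dim, rng):
    """Build a lookup table of random vectors (stand-in for trained embeddings)."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}

def encode_sentence(tokens, pos_tags, ent_tags, trigger_index, tables):
    """Concatenate word, position, part-of-speech and entity vectors per word."""
    rows = []
    for i, (tok, pos, ent) in enumerate(zip(tokens, pos_tags, ent_tags)):
        rel = i - trigger_index  # position relative to the trigger word
        rows.append(
            tables["word"][tok]
            + tables["position"][rel]
            + tables["pos"][pos]
            + tables["entity"][ent]
        )
    return rows

rng = random.Random(0)
tokens = ["w1", "w2", "w3", "w4"]   # placeholder tokens, not real Vietnamese
pos_tags = ["N", "V", "N", "N"]     # hypothetical part-of-speech tags
ent_tags = ["LOC", "O", "O", "LOC"]  # hypothetical entity tags
max_len = 10

tables = {
    "word": make_table(tokens, 8, rng),
    "position": make_table(range(-max_len, max_len + 1), 2, rng),
    "pos": make_table(set(pos_tags), 2, rng),
    "entity": make_table(set(ent_tags), 2, rng),
}

# One 14-dimensional row (8 + 2 + 2 + 2) per word, with the trigger at index 1.
X = encode_sentence(tokens, pos_tags, ent_tags, trigger_index=1, tables=tables)
```

In a trained system the word table would come from the skip-gram model and the other three tables would be learned together with the network.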

Step4, build the convolutional neural network model fusing dependency syntactic information (Dependency Parsing Convolutional Neural Networks, DPCNN): on the basis of Step3, a convolutional neural network and a convolutional neural network fusing dependency syntactic information are used to obtain the encoding of the news event sentence, and an event detection classification model is trained to realize Chinese-Vietnamese bilingual news event type detection；

As a preferred solution of the present invention, in Step4, conventional multi-kernel convolution encodes the semantic information between consecutive words in the sentence, while convolution fusing dependency syntactic information encodes the semantic information between non-consecutive words; the two parts are fused as the semantic information of the current sentence.

The model proposed by the present invention consists of three parts: (1) the sentence encoding layer, (2) the convolutional layer and (3) the pooling layer. Fig. 2 shows the model of this method when the event sentence S1 is input. S1: Nam,cóTrung (translation: Vietnam's million refugees, only China provides relief)；

(1) Encoding layer

First, the encoding layer converts the word-level information in the sentence into real-valued vectors that serve as the input of the neural network. Let X = {x1, x2, x3, ..., xn} be a sentence of length n, where xi is the i-th word in the sentence. In natural language processing tasks, the semantic information of a word is related to its position in the sentence, and part-of-speech and entity type information help with the recognition and semantic understanding of trigger words. Here the word vector, position vector, part-of-speech vector and entity type vector are fused as the model input.

A word vector is a real-valued vector; this method may train the Vietnamese word vectors with the word2vec training method. The method introduces position encoding as part of the encoding to capture the semantic structure information of the current word. The position vector encodes the relative position of the current word with respect to the trigger word. For example, in S1 the relative position between the word glossed "(appearing)" and the word glossed "(refugee)" is 6. Because part of speech and entity type help to obtain the semantic information of the current word, part-of-speech tagging is performed on the Vietnamese text and a part-of-speech embedding table is defined, embedding 28 part-of-speech tags into part-of-speech vectors. Named entity recognition is performed on the Vietnamese text and an entity type embedding table is defined; named entities such as person names, place names, organization names and times are recognized in the sentence, and the entity tags are embedded into entity vectors. The table contains ten entity types in total, divided into three major categories (entity, time and numeric) and seven subcategories (person name, organization name, place name, time, date, currency and percentage).

(2) Convolutional layer

The convolutional layer captures the compositional semantic information of the whole sentence and compresses this valuable semantic information into feature maps. The filter $w$ in the convolution operation extracts the features between the words inside a convolution window. When the convolution kernel size is $m$, the $m$ words $\{x_i, x_{i+1}, x_{i+2}, \ldots, x_{i+m-1}\}$ in the window are denoted $x_{i:i+m-1}$, and the resulting convolution feature is denoted $c_i$. The formula is as follows:

$c_i = f(w \cdot x_{i:i+m-1} + b)$ (1)

where $b$ ($b \in \mathbb{R}$) is the bias term, $f$ is a nonlinear activation function and $w$ is the feature weight. The filter is applied to every possible window in the sentence, $\{x_{1:m}, x_{2:m+1}, \ldots, x_{n-m+1:n}\}$. Because a sentence contains more than one kind of feature, multiple filters are used in the convolution to obtain different features. When $k$ filters $W = \{w_1, w_2, \ldots, w_k\}$ are used, the convolution operation is expressed as:

$c_{ji} = f(w_j \cdot x_{i:i+m-1} + b_j)$ (2)

where $j \in [1, k]$, $w_j$ is the feature weight and $b_j$ is the bias.
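Formulas (1) and (2) can be illustrated with a short pure-Python sketch. The activation is assumed to be tanh (the text only says "nonlinear"), each filter is stored as a flat weight vector of length m times the word-vector dimension, and the toy inputs and the name `conv1d` are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def conv1d(X, filters, biases, m, f=math.tanh):
    """Return the k x (n-m+1) feature map c[j][i] = f(w_j . x_{i:i+m-1} + b_j).

    X is a list of n word vectors; each filter is a flat vector of length m*d.
    """
    n = len(X)
    feature_map = []
    for w, b in zip(filters, biases):
        row = []
        for i in range(n - m + 1):
            # Flatten the m consecutive word vectors of the window.
            window = [value for x in X[i:i + m] for value in x]
            row.append(f(dot(w, window) + b))
        feature_map.append(row)
    return feature_map

# Toy sentence: n = 4 words, d = 2 dimensions; k = 2 filters with window m = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filters = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
biases = [0.0, 0.5]

C = conv1d(X, filters, biases, m=2)
```

With n = 4 and m = 2 each filter yields n - m + 1 = 3 features, matching the window set $\{x_{1:m}, \ldots, x_{n-m+1:n}\}$.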

A dependency syntax tree is constructed for the Vietnamese sentence, as shown in Fig. 3. Analysis shows that the syntactic relation "SBV" (subject-predicate relation) between the words glossed "(relief)" and "(refugee)" helps to determine that "Ra (appearing)" is not the trigger word of an attend-activity event.

The convolution operation captures the semantic information between consecutive words inside a window, but cannot capture the features of non-consecutive words outside the window; here, information outside the window is introduced through dependency parsing. The dependency information is represented by D = {N, E}, where N = {x1, x2, x3, ..., xp} (p ≤ n) denotes all word nodes in the sentence that take part in a dependency relation, and two word nodes linked by a dependency are denoted xs and xt. E is the set of edges between word nodes; each edge (xs, xt) points from word node xs to word node xt and carries a dependency label L(xs, xt). For example, in Fig. 2, the directed edge between the nodes glossed "(relief)" and "(refugee)" carries the dependency label of their relation. The representation follows the method proposed by Kipf et al.: since information does not flow only in the direction indicated by the label, self-loops (xs, xs) and reverse edges (xt, xs) are added. A self-loop has the label L(xs, xs), and a reverse edge has the label L^(xt, xs). Each specific dependency label has its own fixed parameters, and the dependency feature is computed as follows:

(3)

where j ranges from 1 to k, f is a nonlinear activation function, WL(xi, N) takes one of three forms, corresponding to original edges, reverse edges and self-loop edges, and bL(xi, N) is the bias term. Finally, the convolution feature and the convolution feature fusing dependency syntactic information are combined as the feature of the current sentence, as follows:

$E_{ji} = W_i c_{ji} + (1 - W_i) h_i$ (4)

where $E \in \mathbb{R}^{k \times (n-m+1)}$ is the result matrix obtained by convolution, $k$ is the number of filters, $n$ is the sentence length, $m$ is the filter window size, $(1 - W_i) h_i$ is the feature fusing dependency syntactic information, and $W_i c_{ji}$ is the convolution feature.
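The following sketch shows one way the edge set with reverse edges and self-loops could be built, and how formula (4) combines the two features. It is a scalar toy version under strong assumptions: weights are keyed only by the three edge forms (original, reverse, self-loop), the aggregation sums over incoming edges in the style of a graph convolution, tanh is the activation, and the indices, weights and function names are hypothetical. It is not a reproduction of formula (3).

```python
import math

def build_edges(dependencies):
    """Expand (source, target, label) arcs with reverse edges and self-loops."""
    edges, nodes = [], set()
    for s, t, label in dependencies:
        edges.append((s, t, label, "original"))
        edges.append((t, s, label, "reverse"))
        nodes.update((s, t))
    for v in sorted(nodes):
        edges.append((v, v, "self", "self"))
    return edges

def dependency_features(x, edges, weight, bias, f=math.tanh):
    """Assumed aggregation: h[t] = f(sum over incoming edges of W*x[s] + b)."""
    totals = {}
    for s, t, _label, kind in edges:
        totals[t] = totals.get(t, 0.0) + weight[kind] * x[s] + bias[kind]
    return {t: f(total) for t, total in totals.items()}

def fuse(c, h, gate):
    """Formula (4): E = W * c + (1 - W) * h, with W given as a scalar gate."""
    return gate * c + (1.0 - gate) * h

# One hypothetical arc between word 0 and word 2, labelled SBV.
edges = build_edges([(0, 2, "SBV")])
x = [1.0, 0.5, 2.0]  # scalar stand-ins for word features
weight = {"original": 0.5, "reverse": 0.25, "self": 1.0}
bias = {"original": 0.0, "reverse": 0.0, "self": 0.0}

h = dependency_features(x, edges, weight, bias)
fused = fuse(c=1.0, h=0.0, gate=0.25)
```

Word 1 takes part in no dependency here, so it receives no dependency feature; in a full model the weights would be matrices, possibly one set per dependency label, learned during training.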

(3) Pooling layer

The pooling layer extracts the most representative features from the convolution features. Max pooling is chosen here, with the following formula:

$E^* = \text{Each-max}(E_1, E_2, E_3, \ldots, E_k)$ (5)

For the k filters, the most valuable local feature of each filter is extracted and all other feature values are discarded; the k local features are aggregated into one vector E* that serves as the event encoding. Finally, the event encoding is fed into a fully connected layer, and a softmax activation function classifies E* to obtain the class probabilities of the event. The event type is predicted from this probability distribution, with the formula:

where $S_i$ denotes the classification probability, C denotes the number of classes and i denotes the class index. i ranges from 1 to 6 (including the non-event type).
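A small sketch of the pooling and classification steps follows. It assumes the standard softmax, $S_i = e^{z_i} / \sum_{c=1}^{C} e^{z_c}$, where $z$ is the output of the fully connected layer (a symbol introduced here for illustration); the weights, the feature values and the function names are hypothetical.

```python
import math

def max_pool(E):
    """Formula (5): keep the largest feature of each of the k filters."""
    return [max(row) for row in E]

def softmax(z):
    """Standard softmax, shifted by the maximum for numerical stability."""
    top = max(z)
    exps = [math.exp(value - top) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(E, W, b):
    """Pool the feature map, apply a fully connected layer, then softmax."""
    pooled = max_pool(E)
    logits = [dot_row(row, pooled) + b_i for row, b_i in zip(W, b)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs

def dot_row(row, vector):
    """Inner product of one weight row with the pooled vector."""
    return sum(w * v for w, v in zip(row, vector))

# k = 2 filters, 3 windows each; C = 6 classes including the non-event type.
E = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.7]]
W = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0] * 6

label, probs = classify(E, W, b)
```

Class indices are zero-based in the sketch, whereas the text counts classes from 1 to 6.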

Step5, event type detection: encode the Chinese-Vietnamese bilingual news event sentences to be identified, then take the extracted feature vectors of the news event sentences as the input vectors of the classification model, which outputs the final classification results.

The beneficial effects of the present invention are as follows: the convolutional neural network fusing dependency syntactic information can detect news events at the sentence level. Word meaning, position information, part-of-speech information and named entity information are fused during encoding. Conventional convolution then encodes the features between consecutive words, and convolution fusing dependency syntactic information encodes the features between non-consecutive words; the two sets of features are fused into an event encoding to realize event detection. The experimental results show that the method achieves good results in news event detection and classification.

Brief description of the drawings

Fig. 1 is the flow chart of the present invention；

Fig. 2 is a schematic diagram of the DPCNN model proposed in the present invention；

Fig. 3 shows the dependency parsing result for S1 in the present invention.

Detailed description of the embodiments

Embodiment 1: as shown in Figs. 1-3, a Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks, the specific steps of which are as follows:

Step1, corpus collection: Scrapy is used as the crawling tool to crawl the following news websites: Vietnam News Agency http://www.vnagency.com.vn, Vietnam's national English-language newspaper http://vietnamnews.vnagency.com.vn, the Vietnam telecommunications network http://www.vnn.vn, and Vietnam Economic Times http://www.vneconomy.com.vn. News texts for Vietnamese event detection are collected, then deduplicated and screened；

As a preferred solution of the present invention, in Step1, Scrapy is used as the crawling tool to imitate user operations. A different template is customized for each Vietnamese news website; each template is formulated from the XPath paths of the page data elements and is used to obtain detailed data such as the news headline, news time and body text.

Step2, corpus construction: using the annotation scheme for Vietnamese event detection, the Vietnamese news texts are annotated according to the linguistic characteristics of Vietnamese and the needs of event detection, and the annotated Vietnamese news corpus is divided into a training corpus, a test corpus and a validation set at a ratio of 8:1:1. After preprocessing, a total of 1233 Vietnamese news texts in the domain of leaders' travel activities were annotated, comprising 9576 event sentences；
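The 8:1:1 division can be sketched as below. The shuffling, the fixed seed and the rounding rule (integer division, with the remainder going to the validation set) are assumptions; the text gives only the ratio and the total of 9576 event sentences.

```python
import random

def split_corpus(sentences, seed=0):
    """Shuffle and split into training, test and validation sets at 8:1:1."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    valid = items[n_train + n_test:]  # remainder goes to validation
    return train, test, valid

# Integer identifiers stand in for the 9576 annotated event sentences.
train, test, valid = split_corpus(range(9576))
```

Under these assumptions the three parts contain 7660, 957 and 959 sentences.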

As a preferred solution of the present invention, in Step2, a news event text consists of a trigger word and arguments. The trigger word clearly expresses the occurrence of an event; the main word that triggers an event is usually a single verb or noun. The arguments describe information such as the time, place and persons of the event. The annotation scheme organizes the text in XML (Extensible Markup Language) and marks the trigger word, the arguments and the event category separately. The collected Vietnamese news texts are annotated in this way to build a Vietnamese news event detection dataset.
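An XML annotation of this kind might look like the sketch below. The element and attribute names (`event`, `trigger`, `argument`, `role`), the event type and the English placeholder sentence are all hypothetical; the text states only that trigger words, arguments and event categories are marked in XML.

```python
import xml.etree.ElementTree as ET

def annotate(sentence, event_type, trigger, arguments):
    """Build one annotated event element from a sentence and its labels."""
    event = ET.Element("event", {"type": event_type})
    ET.SubElement(event, "sentence").text = sentence
    ET.SubElement(event, "trigger").text = trigger
    for role, text in arguments:
        argument = ET.SubElement(event, "argument", {"role": role})
        argument.text = text
    return event

# English placeholder sentence standing in for a Vietnamese news sentence.
event = annotate(
    "The delegation visited Hanoi on Monday.",
    "Visit",
    "visited",
    [("Place", "Hanoi"), ("Time", "Monday")],
)

# Serialize and parse back to confirm the record round-trips.
xml_text = ET.tostring(event, encoding="unicode")
parsed = ET.fromstring(xml_text)
```

Keeping the trigger, the arguments and the category in separate elements lets the event detection task read only the trigger and type while leaving the arguments available for later extraction work.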

Step4, build the event category detection model: on the basis of Step3, a convolutional neural network and a convolutional neural network fusing dependency syntactic information are used to obtain the encoding of the news event sentence, and an event detection classification model is trained to realize Chinese-Vietnamese bilingual news event type detection；

To verify the effect of the invention, comparative experiments were set up, with precision (P), recall (R) and F value (F) as the evaluation indices.

Here, A is the number of correctly identified event types, B is the number of wrongly identified event types, and C is the number of correct event types that were not identified.
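Given the counts A, B and C, the three indices can be computed as in the sketch below. It assumes the standard definitions P = A / (A + B), R = A / (A + C) and F = 2PR / (P + R), which fit the descriptions of A, B and C; the function name `prf` and the sample counts are hypothetical.

```python
def prf(a, b, c):
    """Return precision, recall and F value from the counts A, B and C."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value

# Hypothetical counts: 8 correct, 2 wrong, 8 missed.
p, r, f = prf(8, 2, 8)
```

The zero checks keep the function defined when a model predicts no events or when the test set contains none.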

(1) To explore the influence of the number of model layers on the experimental results, models of the present invention with 1, 2 and 3 convolutional layers were tested to find the optimal number of layers. The experimental results are shown in Table 2:

Table 2. Influence of the number of model layers on the experimental results

Number of layers	P (%)	R (%)	F (%)
1	74.04	62.63	70.08
2	76.78	64.25	71.45
3	75.53	59.01	68.23

The analysis shows that the best performance is reached when the number of convolutional layers is 2, with precision, recall and F value of 75.78%, 64.25% and 70.45%, respectively. When the number of convolutional layers is 3, the performance of the model declines. Therefore, the model uses two convolutional layers in all subsequent experiments.

(2) Study of encoding features

This experiment examines the encoding features incorporated in the word embedding layer. One of the encoding vectors is removed, and the remaining two kinds of encoding vectors are fused with the word vector as the model input, to explore the influence of the different encoding features on the performance of the model. The experimental results are shown in Table 3:

Table 3. Influence of encoding features on the experimental results

The analysis shows that after any one encoding vector is removed, the precision, recall and F value of the model all decline compared with the full model, which demonstrates that using the three kinds of encoding vectors together improves event detection performance.

(3) Comparison of different models

To demonstrate the effect of the model on the Vietnamese event detection task, it is compared with a conventional convolutional neural network without dependency syntactic information and with a graph convolutional network fusing dependency syntactic information. The experimental results are shown in Table 4:

Table 4. Performance comparison of different models

Model	P (%)	R (%)	F (%)
CNN	73.23	66.14	69.23
GCN	75.00	63.92	70.24
DPCNN	76.78	64.25	71.45

Comparative analysis shows that DPCNN (the convolutional neural network fusing dependency syntactic analysis) and GCN both outperform CNN, so introducing dependency syntactic information captures information that CNN does not. Comparing GCN with DPCNN, the F value of DPCNN is 0.19% higher, which shows that most of the information can be captured by GCN, but that using continuous convolution together with a convolutional neural network fusing dependency syntactic information captures more of the implicit information in the sentence.

Based on the above experiments and data analysis, this method proposes a new neural network model for Vietnamese news event detection. The model fuses the word vector, position vector, part-of-speech vector and named entity vector to capture word-level semantic information, and obtains semantic information with both a conventional convolutional neural network and a convolutional neural network fusing dependency syntactic information. By testing the model under different parameter settings and comparing the best model with baseline methods, the method is shown to achieve good results on the Vietnamese news event detection task.

The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the present invention.

Claims

1. A Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks, characterized in that:

the specific steps of the detection method are as follows:

Step1, corpus collection: collect news texts for Vietnamese event detection, and deduplicate and screen the news texts；

Step2, corpus construction: using the annotation scheme for Vietnamese event detection, annotate the Vietnamese news texts according to the linguistic characteristics of Vietnamese and the needs of event detection, and divide the annotated Vietnamese news corpus into a training corpus, a test corpus and a validation set；

Step3, text vectorization: train Vietnamese word vectors, and fuse the word vector, position vector, part-of-speech vector and entity type vector of each word in the sentence as the model input；

Step4, build the event category detection model: on the basis of Step3, use a convolutional neural network and a convolutional neural network fusing dependency syntactic information to obtain the encoding of the news event sentence, and train an event detection classification model to realize Chinese-Vietnamese bilingual news event type detection；

Step5, event type detection: encode the Chinese-Vietnamese bilingual news event sentences to be identified, then take the extracted feature vectors of the news event sentences as the input vectors of the classification model, which outputs the final classification results.

2. The Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks according to claim 1, characterized in that: in Step1, Scrapy is used as the crawling tool to imitate user operations; a different template is customized for each Vietnamese news website; each template is formulated from the XPath paths of the page data elements and is used to obtain detailed data such as the news headline, news time and body text.

3. The Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks according to claim 1, characterized in that: in Step2, a news event text consists of a trigger word and arguments; the trigger word clearly expresses the occurrence of an event, and the main word that triggers an event is usually a single verb or noun; the arguments describe information such as the time, place and persons of the event; the annotation scheme organizes the text in XML (Extensible Markup Language) and marks the trigger word, the arguments and the event category separately; the collected Vietnamese news texts are annotated to build a Vietnamese news event detection dataset.

4. The Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks according to claim 1, characterized in that: in Step3, the Vietnamese word vectors are trained with the skip-gram language model, and a position embedding table, a part-of-speech embedding table and an entity type embedding table are built to embed the position information, part-of-speech information and entity type information into vectors.

5. The Vietnamese news event detection method fusing dependency syntactic information and convolutional neural networks according to claim 1, characterized in that: in Step4, conventional multi-kernel convolution encodes the semantic information between consecutive words in the sentence, while convolution fusing dependency syntactic information encodes the semantic information between non-consecutive words in the sentence, and the two parts are fused as the semantic information of the current sentence.