CN109214000A - Neural-network Khmer entity recognition method based on topic-model word vectors - Google Patents
Neural-network Khmer entity recognition method based on topic-model word vectors
- Publication number: CN109214000A (application CN201810965632.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The present invention relates to a neural-network method for Khmer entity recognition based on topic-model word vectors, and belongs to the field of natural language processing. The method first obtains a Khmer text corpus and preprocesses it; a topic model is then constructed, the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word. The preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and a skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously. The word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors; finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer. The present invention effectively alleviates the polysemy and homophone-ambiguity problems present in text, and its recognition accuracy for Khmer named entities is high.
Description
Technical field
The present invention relates to a neural-network method for Khmer entity recognition based on topic-model word vectors, and belongs to the field of natural language processing technology.
Background technique
With the rapid development of the modern economy, exchanges and cooperation between China and the countries of Southeast Asia have become increasingly frequent, and economic, cultural, and academic exchanges with the Kingdom of Cambodia in particular are growing year by year. Against this background of ever-closer ties between China and Cambodia, paying attention to and studying Cambodian culture has become especially important, yet the language barrier makes this task difficult. The demand for natural language processing techniques that address these difficulties is therefore growing ever stronger.
Cambodian, also known as Khmer, belongs to the Khmer branch of the Mon-Khmer subfamily of the Austroasiatic language family and is used as the official language of Cambodia. Borrowing of foreign words is very common in Khmer: the language developed on the basis of Old Khmer, absorbed many Pali and Sanskrit words, and was also influenced by the languages of neighboring countries such as Thai, Chinese, Vietnamese, and Lao. As a result, Khmer exhibits a wide variety of word-formation patterns. Since Khmer has the oldest written history among the languages of Southeast Asia, it has high research value. At present, however, research on Khmer at home and abroad focuses mainly on culture; owing to the particularities of the language, work on morphological analysis of low-resource languages such as Khmer, and on named entity recognition in particular, remains extremely limited. This research is therefore of great significance for political and economic analysis of Cambodia and for monitoring public opinion.
Named entity recognition (NER) is a basic task in natural language processing and a precondition for research in many natural language application fields. NER was first proposed as a subtask at MUC-6 (the Message Understanding Conference). The task is mainly to identify the proper names and meaningful numeral phrases that appear in text and to classify them. Its scope has expanded from the earliest entity recognition (person names, place names, organization names) to the present refinement of entity recognition in text and the recognition of temporal expressions (dates, times) and numerical expressions (currency values, percentages, etc.). Since entities such as quantities, times, dates, and currencies can usually be recognized well with pattern matching, while person names, place names, and organization names are comparatively complex, research in recent years has focused mainly on these latter types. NER is an important research topic in information extraction and has wide applications in natural language processing fields such as information retrieval, machine translation, and question answering systems.
Summary of the invention
The present invention provides a neural-network Khmer entity recognition method based on topic-model word vectors, aiming to solve the low recognition accuracy of Khmer named entities and the polysemy and homophone-ambiguity problems encountered in Khmer entity recognition.
The technical scheme of the present invention is a neural-network Khmer entity recognition method based on topic-model word vectors. A Khmer text corpus is first obtained and preprocessed; a topic model is then constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word. The preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and a skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously. The word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors; finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
The specific steps of the method are as follows:
Step1: a Khmer text corpus is first obtained from paper texts and Khmer websites using a web crawler; the text is then segmented, punctuation marks are filtered out, and stop words are removed, yielding a monolingual Khmer text corpus ready for use;
Step2: an HDP (Hierarchical Dirichlet Process) topic model is constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word;
Step3: a skip-gram model is constructed over the preprocessed text; the preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and the skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously;
Step4: the word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors;
Step5: finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
The specific sub-steps of Step2 are as follows:
Step2.1: the preprocessed text is divided into N documents, each document d ∈ {1, 2, …, N};
Step2.2: to construct the HDP topic model, it is assumed that the topics of all documents come from a common base distribution H; a Dirichlet process with concentration parameter α and base distribution H then serves as the prior;
Step2.3: a distribution G0 is first drawn from this prior and serves as the prior over the topic distributions of the individual documents, i.e. G0 ~ DP(α, H);
Step2.4: G0 is then used together with the concentration parameter γ to construct another Dirichlet process, from which a topic distribution Gd is drawn as the topic distribution of the d-th document, i.e. Gd ~ DP(γ, G0);
Step2.5: from the topic distribution Gd of the d-th document, the topic θdi of the i-th word is drawn, and a word xdi is finally generated from that topic; after iteration the topic assignment of each word is obtained, and this topic assignment is recorded as a pseudo-word.
The specific sub-steps of Step3 are as follows:
Step3.1: a skip-gram model is constructed over the preprocessed text; each word in the preprocessed text is denoted w, and the pseudo-word carrying the topic number obtained from the topic model is denoted z; the text words and their topic pseudo-words are placed into one text in pairs, i.e. the input is D = {wi, zi} = {w1, z1, …, wi, zi, …, wM, zM};
Step3.2: given the above input, the objective function of the skip-gram model is:
L(D) = (1/M) ∑_{i=1}^{M} ∑_{−k ≤ c ≤ k, c ≠ 0} [ log Pr(w_{i+c} | w_i) + log Pr(w_{i+c} | z_i) ]
where M is the number of words input to the model and k is the window size for predicting the context.
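To make the joint input D = {wi, zi} of Step3.1 and the windowed prediction of Step3.2 concrete, the sketch below merges a toy sentence with its topic pseudo-words and enumerates the (center, context) pairs a skip-gram trainer would consume. The tokens and topic ids are invented for illustration; in practice a library skip-gram implementation (for example gensim's Word2Vec with sg=1) would be trained on the merged corpus, yielding a vector per word and a vector per topic in one pass.

```python
# Toy preprocessed sentence (romanized placeholders) and the topic
# numbers assigned to each word by the topic model (Step2).
words = ["phnom", "penh", "bank", "river", "bank"]
topics = [4, 4, 7, 2, 2]  # hypothetical topic ids

# Merge words and topic pseudo-words into one corpus, pair by pair:
# D = {w1, z1, ..., wM, zM}
pseudo = ["TOPIC_%d" % t for t in topics]
merged = [tok for pair in zip(words, pseudo) for tok in pair]

def skipgram_pairs(tokens, k=2):
    """(center, context) pairs with window size k: each center token
    predicts every token within k positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(merged, k=2)
# Both words and pseudo-words act as centers, so training produces the
# word vectors and the topic vectors simultaneously.
print(len(merged), len(pairs))  # → 10 34
```

Note how the ambiguous word "bank" appears once next to TOPIC_7 and once next to TOPIC_2: the pseudo-words carry the sense distinction that the plain word form loses, which is the mechanism the method relies on to mitigate polysemy.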
The specific sub-steps of Step4 are as follows:
Step4.1: the word vector of each word in the text obtained above is denoted w, and the topic vector of the word obtained in Step3 is denoted z;
Step4.2: the word vector w and the topic vector z of the word are concatenated with the ⊕ operator, i.e. w_z = w ⊕ z, which yields the required topical word vector w_z.
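The cascade w_z = w ⊕ z of Step4.2 is plain vector concatenation; the following minimal sketch uses hypothetical low-dimensional vectors (real embeddings would have tens to hundreds of dimensions):

```python
# Hypothetical word vector (from skip-gram) and topic vector (of the
# word's topic pseudo-word); the dimensions are illustrative only.
w = [0.1, -0.3, 0.7, 0.2]   # word vector of some word
z = [0.9, 0.0, -0.1]        # topic vector of its pseudo-word

wz = w + z  # w ⊕ z: concatenation, dimension len(w) + len(z)
print(wz)   # → [0.1, -0.3, 0.7, 0.2, 0.9, 0.0, -0.1]
```

Concatenation (rather than, say, averaging) keeps the word identity and the topic context in separate coordinates, so the downstream model can weight them independently.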
The specific sub-steps of Step5 are as follows:
Step5.1: the topical word vector features obtained above are used as input features (x1, x2, …, xn) and fed into the CRF model, giving:
P(y | x) = (1/Z) exp( ∑_j ∑_m λ_j t_j(y_{m+1}, y_m, x, m) + ∑_k ∑_m μ_k s_k(y_m, x, m) )
where t_j(y_{m+1}, y_m, x, m) is a transition feature function defined on two adjacent label positions of the observation sequence, used to characterize the correlation between adjacent label variables and the influence of the observation sequence on them; s_k(y_m, x, m) is a state feature function defined at label position m of the observation sequence, used to characterize the influence of the observation sequence on the label variable; λ_j and μ_k are parameters, and Z is the normalization factor; the labeling probability of the sequence y is thus obtained, realizing named entity recognition for Khmer.
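The CRF probability of Step5.1 can be computed exactly on a toy example by brute-force enumeration of all label sequences. The transition feature t_j, state feature s_k, and the weights λ_j = 0.8 and μ_k = 1.5 below are hypothetical stand-ins chosen only to exercise the formula; a real implementation would learn many such features over the topical-word-vector inputs.

```python
import itertools
import math

labels = ["O", "ENT"]          # toy tag set
x = ["phnom", "penh", "bank"]  # observation sequence (topical word
                               # vectors would stand in for raw tokens)

# Hypothetical state feature s_k(y_m, x, m), weight mu_k = 1.5.
def s_entity_like(y_m, x, m):
    return 1.0 if y_m == "ENT" and x[m] != "bank" else 0.0

# Hypothetical transition feature t_j(y_{m+1}, y_m, x, m), weight 0.8.
def t_same_label(y_next, y_prev, x, m):
    return 1.0 if y_next == y_prev else 0.0

def score(y):
    s = sum(1.5 * s_entity_like(y[m], x, m) for m in range(len(x)))
    s += sum(0.8 * t_same_label(y[m + 1], y[m], x, m)
             for m in range(len(x) - 1))
    return s

# Z: normalization factor over every possible label sequence y.
seqs = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(y)) for y in seqs)

def prob(y):
    """P(y | x) = exp(score(y)) / Z, the labeling probability."""
    return math.exp(score(y)) / Z

best = max(seqs, key=prob)
print(best, round(prob(best), 3))  # → ('ENT', 'ENT', 'ENT') 0.552
```

Brute-force enumeration is exponential in sequence length and is used here only to verify the formula; practical CRF decoders use the Viterbi algorithm and compute Z with forward-backward dynamic programming.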
The beneficial effects of the present invention are:
1. The present invention provides a method suited to solving the Khmer entity recognition problem; it effectively alleviates the polysemy and homophone-ambiguity problems present in text, and its recognition accuracy for Khmer named entities is high;
2. The present invention provides strong support for subsequent Khmer work such as syntactic analysis, sentence analysis, information extraction, information retrieval, and machine translation.
Detailed description of the invention
Fig. 1 is the flowchart of the present invention.
Specific embodiment
Embodiment 1: as shown in Fig. 1, a neural-network Khmer entity recognition method based on topic-model word vectors. A Khmer text corpus is first obtained and preprocessed; a topic model is then constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word. The preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and a skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously. The word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors; finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
Further, the specific steps of the method are as follows:
Step1: a Khmer text corpus is first obtained from paper texts and Khmer websites using a web crawler; the text is then segmented, punctuation marks are filtered out, and stop words are removed, yielding a monolingual Khmer text corpus ready for use;
Step2: an HDP topic model is constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word;
Step3: a skip-gram model is constructed over the preprocessed text; the preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and the skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously;
Step4: the word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors;
Step5: finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
Further, the specific sub-steps of Step2 are as follows:
Step2.1: the preprocessed text is divided into N documents, each document d ∈ {1, 2, …, N};
Step2.2: to construct the HDP topic model, it is assumed that the topics of all documents come from a common base distribution H; a Dirichlet process with concentration parameter α and base distribution H then serves as the prior;
Step2.3: a distribution G0 is first drawn from this prior and serves as the prior over the topic distributions of the individual documents, i.e. G0 ~ DP(α, H);
Step2.4: G0 is then used together with the concentration parameter γ to construct another Dirichlet process, from which a topic distribution Gd is drawn as the topic distribution of the d-th document, i.e. Gd ~ DP(γ, G0);
Step2.5: from the topic distribution Gd of the d-th document, the topic θdi of the i-th word is drawn, and a word xdi is finally generated from that topic; after iteration the topic assignment of each word is obtained, and this topic assignment is recorded as a pseudo-word.
Further, the specific sub-steps of Step3 are as follows:
Step3.1: a skip-gram model is constructed over the preprocessed text; each word in the preprocessed text is denoted w, and the pseudo-word carrying the topic number obtained from the topic model is denoted z; the text words and their topic pseudo-words are placed into one text in pairs, i.e. the input is D = {wi, zi} = {w1, z1, …, wi, zi, …, wM, zM};
Step3.2: given the above input, the objective function of the skip-gram model is:
L(D) = (1/M) ∑_{i=1}^{M} ∑_{−k ≤ c ≤ k, c ≠ 0} [ log Pr(w_{i+c} | w_i) + log Pr(w_{i+c} | z_i) ]
where M is the number of words input to the model and k is the window size for predicting the context.
Further, the specific sub-steps of Step4 are as follows:
Step4.1: the word vector of each word in the text obtained above is denoted w, and the topic vector of the word obtained in Step3 is denoted z;
Step4.2: the word vector w and the topic vector z of the word are concatenated with the ⊕ operator, i.e. w_z = w ⊕ z, which yields the required topical word vector w_z.
Further, the specific sub-steps of Step5 are as follows:
Step5.1: the topical word vector features obtained above are used as input features (x1, x2, …, xn) and fed into the deep learning model (here the deep learning model is a CRF model), giving:
P(y | x) = (1/Z) exp( ∑_j ∑_m λ_j t_j(y_{m+1}, y_m, x, m) + ∑_k ∑_m μ_k s_k(y_m, x, m) )
where t_j(y_{m+1}, y_m, x, m) is a transition feature function defined on two adjacent label positions of the observation sequence, used to characterize the correlation between adjacent label variables and the influence of the observation sequence on them; s_k(y_m, x, m) is a state feature function defined at label position m of the observation sequence, used to characterize the influence of the observation sequence on the label variable; λ_j and μ_k are parameters, and Z is the normalization factor; the labeling probability of the sequence y is thus obtained, realizing named entity recognition for Khmer.
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiments; within the scope of knowledge possessed by those of ordinary skill in the art, various changes may also be made without departing from the concept of the present invention.
Claims (6)
1. A neural-network Khmer entity recognition method based on topic-model word vectors, characterized in that: a Khmer text corpus is first obtained and preprocessed; a topic model is then constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word; the preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and a skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously; the word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors; finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
2. The neural-network Khmer entity recognition method based on topic-model word vectors according to claim 1, characterized in that the specific steps of the method are as follows:
Step1: a Khmer text corpus is first obtained from paper texts and Khmer websites using a web crawler; the text is then segmented, punctuation marks are filtered out, and stop words are removed, yielding a monolingual Khmer text corpus ready for use;
Step2: an HDP topic model is constructed on the preprocessed text; the trained topic model assigns a topic number to each word of the text, and this topic number is treated as a pseudo-word;
Step3: a skip-gram model is constructed over the preprocessed text; the preprocessed text and the pseudo-words obtained above are placed into the same corpus file, and the skip-gram model is trained on it so that the word vector of each word in the text and the corresponding topic vector are obtained simultaneously;
Step4: the word vectors and topic vectors obtained in the above steps are concatenated to obtain topical word vectors;
Step5: finally, the topical word vectors are fed as input features into the pre-built deep learning model to perform entity recognition for Khmer.
3. The neural-network Khmer entity recognition method based on topic-model word vectors according to claim 2, characterized in that the specific sub-steps of Step2 are as follows:
Step2.1: the preprocessed text is divided into N documents, each document d ∈ {1, 2, …, N};
Step2.2: to construct the HDP topic model, it is assumed that the topics of all documents come from a common base distribution H; a Dirichlet process with concentration parameter α and base distribution H then serves as the prior;
Step2.3: a distribution G0 is first drawn from this prior and serves as the prior over the topic distributions of the individual documents, i.e. G0 ~ DP(α, H);
Step2.4: G0 is then used together with the concentration parameter γ to construct another Dirichlet process, from which a topic distribution Gd is drawn as the topic distribution of the d-th document, i.e. Gd ~ DP(γ, G0);
Step2.5: from the topic distribution Gd of the d-th document, the topic θdi of the i-th word is drawn, and a word xdi is finally generated from that topic; after iteration the topic assignment of each word is obtained, and this topic assignment is recorded as a pseudo-word.
4. The neural-network Khmer entity recognition method based on topic-model word vectors according to claim 2, characterized in that the specific sub-steps of Step3 are as follows:
Step3.1: a skip-gram model is constructed over the preprocessed text; each word in the preprocessed text is denoted w, and the pseudo-word carrying the topic number obtained from the topic model is denoted z; the text words and their topic pseudo-words are placed into one text in pairs, i.e. the input is D = {wi, zi} = {w1, z1, …, wi, zi, …, wM, zM};
Step3.2: given the above input, the objective function of the skip-gram model is:
L(D) = (1/M) ∑_{i=1}^{M} ∑_{−k ≤ c ≤ k, c ≠ 0} [ log Pr(w_{i+c} | w_i) + log Pr(w_{i+c} | z_i) ]
where M is the number of words input to the model and k is the window size for predicting the context.
5. The neural-network Khmer entity recognition method based on topic-model word vectors according to claim 2, characterized in that the specific sub-steps of Step4 are as follows:
Step4.1: the word vector of each word in the text obtained above is denoted w, and the topic vector of the word obtained in Step3 is denoted z;
Step4.2: the word vector w and the topic vector z of the word are concatenated with the ⊕ operator, i.e. w_z = w ⊕ z, which yields the required topical word vector w_z.
6. The neural-network Khmer entity recognition method based on topic-model word vectors according to claim 2, characterized in that the specific sub-steps of Step5 are as follows:
Step5.1: the topical word vector features obtained above are used as input features (x1, x2, …, xn) and fed into the CRF model, giving:
P(y | x) = (1/Z) exp( ∑_j ∑_m λ_j t_j(y_{m+1}, y_m, x, m) + ∑_k ∑_m μ_k s_k(y_m, x, m) )
where t_j(y_{m+1}, y_m, x, m) is a transition feature function defined on two adjacent label positions of the observation sequence, used to characterize the correlation between adjacent label variables and the influence of the observation sequence on them; s_k(y_m, x, m) is a state feature function defined at label position m of the observation sequence, used to characterize the influence of the observation sequence on the label variable; λ_j and μ_k are parameters, and Z is the normalization factor; the labeling probability of the sequence y is thus obtained, realizing named entity recognition for Khmer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810965632.XA CN109214000A (en) | 2018-08-23 | 2018-08-23 | A kind of neural network card language entity recognition method based on topic model term vector |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109214000A true CN109214000A (en) | 2019-01-15 |
Family
ID=64989087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810965632.XA Pending CN109214000A (en) | 2018-08-23 | 2018-08-23 | A kind of neural network card language entity recognition method based on topic model term vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214000A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110231347A1 (en) * | 2010-03-16 | 2011-09-22 | Microsoft Corporation | Named Entity Recognition in Query |
CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
CN105224521A (en) * | 2015-09-28 | 2016-01-06 | 北大方正集团有限公司 | Key phrases extraction method and use its method obtaining correlated digital resource and device |
CN106980609A (en) * | 2017-03-21 | 2017-07-25 | 大连理工大学 | A kind of name entity recognition method of the condition random field of word-based vector representation |
CN107861947A (en) * | 2017-11-07 | 2018-03-30 | 昆明理工大学 | A kind of method of the card language name Entity recognition based on across language resource |
Non-Patent Citations (2)
Title |
---|
YANG LIU et al.: "Topical Word Embeddings", AAAI'15: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence * |
LIU Shaoyu: "Research on Key Technologies of Entity Relation Extraction", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069826A (en) * | 2020-07-15 | 2020-12-11 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
CN112069826B (en) * | 2020-07-15 | 2021-12-07 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109446404B (en) | Method and device for analyzing emotion polarity of network public sentiment | |
WO2021114745A1 (en) | Named entity recognition method employing affix perception for use in social media | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN109635297B (en) | Entity disambiguation method and device, computer device and computer storage medium | |
CN110727880B (en) | Sensitive corpus detection method based on word bank and word vector model | |
CN109670041A (en) | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
WO2019228466A1 (en) | Named entity recognition method, device and apparatus, and storage medium | |
CN106095749A (en) | A kind of text key word extracting method based on degree of depth study | |
CN105095190B (en) | A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary | |
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN109800310A (en) | A kind of electric power O&M text analyzing method based on structuring expression | |
CN109960727B (en) | Personal privacy information automatic detection method and system for unstructured text | |
CN106611041A (en) | New text similarity solution method | |
CN103324626A (en) | Method for setting multi-granularity dictionary and segmenting words and device thereof | |
CN111476036A (en) | Word embedding learning method based on Chinese word feature substrings | |
CN111274814A (en) | Novel semi-supervised text entity information extraction method | |
CN111191463A (en) | Emotion analysis method and device, electronic equipment and storage medium | |
CN107894975A (en) | A kind of segmenting method based on Bi LSTM | |
CN112287240A (en) | Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network | |
CN112084308A (en) | Method, system and storage medium for text type data recognition | |
CN111061873B (en) | Multi-channel text classification method based on Attention mechanism | |
CN110502759B (en) | Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary | |
Seeha et al. | ThaiLMCut: Unsupervised pretraining for Thai word segmentation | |
Tianxiong et al. | Identifying chinese event factuality with convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190115 |