CN104615589A - Named-entity recognition model training method and named-entity recognition method and device - Google Patents

Named-entity recognition model training method and named-entity recognition method and device

Info

Publication number
CN104615589A
CN104615589A CN201510082318.3A CN201510082318A
Authority
CN
China
Prior art keywords
named entity
segmented word
rnn
mark
text string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510082318.3A
Other languages
Chinese (zh)
Inventor
Zhang Jun (张军)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510082318.3A
Publication of CN104615589A
Legal status: Pending (Current)

Landscapes

  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a named-entity recognition model training method and a named-entity recognition method and device. The method for training a recurrent neural network (RNN) named-entity recognition model includes: acquiring multiple labeled sample data, where each sample datum includes a text string and multiple word-segment labeled data, and each word-segment labeled datum includes a segmented word separated from the text string and its named-entity attribute tag in the text string; mapping the segmented words in the labeled sample data to word vectors, taking the sample data as training samples, training the RNN named-entity recognition model, and learning the parameters of the RNN named-entity recognition model. With the named-entity recognition model training method and the named-entity recognition method and device, the trained model has better generalization ability, named entities in natural language texts can be recognized rapidly, and the recognition accuracy of named entities is improved.

Description

Method for training a named-entity recognition model, named-entity recognition method, and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a method for training a named-entity recognition model, a named-entity recognition method, and a corresponding device.
Background art
Named entities (such as person names, place names, organization names, and web neologisms with a specific meaning) are an important component of natural language understanding. Establishing and maintaining a named-entity lexicon is therefore one of the core tasks in many natural language processing (Natural Language Processing, NLP) applications, such as search systems and machine translation systems. For example, if a search engine can use a named-entity lexicon to recognize that the user's query "I had never expected" is the title of a web series, it can return more accurate search results to the user.
In the prior art, the following two named-entity recognition methods are generally adopted. The first method mines named entities from the query logs of a search engine using rule-based techniques. Specifically, search terms recently entered by users are compared with search terms entered in the past. If a new search term is found, the probability that it is a named entity is computed from a designed probability formula based on the growth of the search term and its similarity to past search terms, and search terms whose probability exceeds a certain threshold are identified as named entities. Although this method can accurately identify newly emerging named entities on the Internet, its implementation depends on query-log data, and there is a delay between the time a user first searches for a term and the time it is identified as a named entity, which affects the user's query experience.
The second method starts from a pre-labeled corpus (a set of texts in which the named entities have been marked manually), builds a hidden Markov model by statistical methods, and then uses this model to label new named entities in large amounts of text data. Although this method achieves good results on small-scale data, it relies on the Markov assumption (whether the current word is part of a named entity depends only on a fixed number of preceding words, typically 2), so the model lacks generalization ability and its recognition accuracy on large-scale data is not high.
Summary of the invention
The object of the embodiments of the present invention is to provide a method for training a named-entity recognition model, a named-entity recognition method, and a device, which can quickly and automatically recognize named entities in natural language text and improve the accuracy of named-entity recognition.
To achieve the above object, an embodiment of the present invention provides a method for training a recurrent neural network (RNN) named-entity recognition model, comprising: obtaining multiple labeled sample data, each sample datum comprising a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string; mapping the segmented words in the multiple labeled sample data to word vectors, taking the sample data as training samples, and training the RNN named-entity recognition model so as to learn the parameters of the RNN named-entity recognition model.
An embodiment of the present invention further provides a device for training a recurrent neural network (RNN) named-entity recognition model, comprising: a sample-data acquisition module, configured to obtain multiple labeled sample data, each sample datum comprising a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string; and a parameter learning module, configured to map the segmented words in the multiple labeled sample data to word vectors, take the sample data as training samples, and train the RNN named-entity recognition model so as to learn the parameters of the RNN named-entity recognition model.
An embodiment of the present invention further provides a named-entity recognition method, comprising: obtaining a text string; performing word segmentation on the text string to obtain multiple segmented words; obtaining, for each segmented word, the named-entity attribute tag with the maximum probability by using the RNN named-entity recognition model trained by the method according to claim 5; and recognizing named entities in the text string according to the maximum-probability named-entity attribute tags of the segmented words.
An embodiment of the present invention further provides a named-entity recognition device, comprising: a text-string acquisition module, configured to obtain a text string; a text-string word-segmentation module, configured to perform word segmentation on the text string to obtain multiple segmented words; a named-entity attribute tag acquisition module, configured to obtain, for each segmented word, the named-entity attribute tag with the maximum probability by using the RNN named-entity recognition model trained by the device according to claim 17; and a named-entity recognition module, configured to recognize named entities in the text string according to the maximum-probability named-entity attribute tags of the segmented words.
With the method for training a named-entity recognition model, the named-entity recognition method, and the device provided by the embodiments of the present invention, multiple labeled sample data are obtained, the segmented words in the multiple labeled sample data are mapped to word vectors, the sample data are taken as training samples, and the RNN named-entity recognition model is trained so as to learn its parameters. Compared with the prior art, there is no need to rely on query logs or the hidden Markov assumption; the model has better generalization ability, can automatically and quickly recognize named entities in natural language text, and improves the accuracy of named-entity recognition.
Brief description of the drawings
Fig. 1 is a block diagram of the basic principle of an embodiment of the present invention;
Fig. 2 is a flowchart of the method for training an RNN named-entity recognition model according to Embodiment One of the present invention;
Fig. 3 is a schematic diagram of the RNN named-entity recognition model according to Embodiment One of the present invention;
Fig. 4 is a flowchart of the named-entity recognition method according to Embodiment Two of the present invention;
Fig. 5 is a logic diagram of the device for training an RNN named-entity recognition model according to Embodiment Three of the present invention;
Fig. 6 is a logic diagram of the named-entity recognition device according to Embodiment Four of the present invention.
Detailed description of the embodiments
The basic concept of the present invention is to obtain multiple labeled sample data, map the segmented words in the multiple labeled sample data to word vectors, take the sample data as training samples, and train the RNN named-entity recognition model so as to learn its parameters. On the other hand, each segmented word in an acquired text string is taken as input, and the named-entity attribute tag corresponding to the segmented word is obtained with the trained named-entity recognition model; finally, named entities can be recognized in the text string according to the tags corresponding to the segmented words. The model has better generalization ability, makes the recognition of named entities faster, and improves the accuracy of named-entity recognition.
Fig. 1 is a block diagram of the basic principle of an embodiment of the present invention. With reference to Fig. 1, in the present invention, training samples are first obtained. Specifically, weakly labeled sample data (text in which the named entities have been marked in advance) are obtained by processing text strings with heuristic rules and used as training samples, so that sample data can be obtained automatically; of course, training samples can also be obtained by manual labeling. Next, these training samples are used to train the RNN named-entity recognition model and learn its parameters, that is, the designed training algorithm is used to train the established RNN named-entity recognition model and obtain its parameters. Finally, a text string to be recognized is obtained; with these parameters, the maximum-probability named-entity attribute tag of each segmented word in the text string can be obtained, the text string can be labeled with these tags, and the named entities are finally obtained.
Through the above process, a large number of named entities can be labeled in large-scale natural language text content (such as high-quality web page libraries and forum posts). To ensure the accuracy of the named entities, the number of times each phrase (consisting of one or more words) is tagged as a named entity can also be counted, and a threshold can be set; if the term frequency of a word tagged as a named entity (term frequency refers to the number of times a given word appears in the files in question) exceeds this threshold, the word is taken as a new named entity. In this way, an automatically mined named-entity lexicon is obtained, which is mainly used in NLP applications such as search engines and machine translation.
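As a rough illustration of the frequency-threshold filtering described above, the Python sketch below counts how often each phrase is tagged as a named entity and keeps only the frequent ones; the function name, the toy data, and the threshold value are illustrative assumptions rather than values taken from the patent.

```python
from collections import Counter

def build_entity_lexicon(tagged_phrases, min_count=50):
    """Keep only phrases tagged as named entities more often than a set threshold.
    `tagged_phrases` is the list of phrases the recognizer tagged across the corpus."""
    counts = Counter(tagged_phrases)                     # phrase -> tag frequency
    return {phrase for phrase, c in counts.items() if c > min_count}

# Toy usage: a frequently tagged series title survives, a rare mistag does not.
lexicon = build_entity_lexicon(["I had never expected"] * 120 + ["noise phrase"] * 3)
print(lexicon)   # {'I had never expected'}
```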
A method for training a recurrent neural network named-entity recognition model, a named-entity recognition method, and a device according to embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment one
Fig. 2 is a flowchart of the method for training an RNN named-entity recognition model according to Embodiment One of the present invention. The RNN named-entity recognition model is used to recognize named entities in text.
With reference to Fig. 2, in step S110, multiple labeled sample data are obtained; each sample datum comprises a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string.
Specifically, according to the concept of the present invention, the named-entity attribute tag of a segmented word in the text string includes information on whether the segmented word belongs to a named entity.
In addition, the named-entity attribute tag of the segmented word in the text string may also include a position tag of the segmented word within the named entity to which it belongs.
For example, the named-entity attribute tags of segmented words in the text string may include a named-entity beginning tag, a named-entity continuation tag, and a non-named-entity tag. That is, the tag of a segmented word in the text string indicates whether it is the beginning of a named entity (tag B), a continuation of a named entity (tag I), or not part of any named entity (tag O), so that the named-entity attribute tags of all entity words in a text string can be obtained. It should be noted that tag B means Begin and marks the beginning of a named entity of a certain type, tag I means In and marks the continuation of a named entity, and tag O means Out and indicates that the word is not a named-entity word.
Preferably, the named-entity attribute tag of the segmented word in the text string may also include the type of the named entity to which the segmented word belongs. Here, the types of named entities may include, but are not limited to, person names, place names, organization names, movie and TV series titles, book titles, or web neologisms with a specific meaning. For example, the tag of a segmented word in the text string indicates whether it is the beginning of a named entity of a given type (e.g., B-DRAMA), a continuation of such a named entity (e.g., I-DRAMA), or not part of any named entity (e.g., O); DRAMA can be replaced by other predefined named-entity types (e.g., PERSON for person names, ADDR for addresses). Table 1 shows one labeled sample datum. As shown in Table 1, a labeled sample datum includes the text string "why I had never expected so fire absolutely?" and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string, for example the segmented word "absolutely" paired with the tag "B-DRAMA".
Table 1
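Table 1 itself is not reproduced in this record; the sketch below only illustrates the shape of one labeled sample datum. The segmentation shown is assumed for illustration, and only the pairing of the segmented word "absolutely" with the tag "B-DRAMA" comes from the running example.

```python
# One labeled sample datum: the text string plus (segmented word, tag) pairs,
# using the B/I/O scheme with an optional entity-type suffix.
sample = {
    "text": "why I had never expected so fire absolutely?",
    "segments": [
        ("why", "O"),
        ("I had never expected", "O"),
        ("so", "O"),
        ("fire", "O"),
        ("absolutely", "B-DRAMA"),   # beginning of a series title (from the example)
    ],
}
```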
According to the concept of the present invention, the training samples comprise, for example, M groups of <text string, multiple word-segment labeled data> sample data. Here, the value of M is generally large enough, usually above the order of ten million. The content of Table 1 above is a concrete example of one sample datum. Obviously, labeling these M groups of sample data purely by hand would be very time-consuming and labor-intensive. Therefore, the method may further comprise: obtaining multiple labeled sample data from natural language text according to heuristic rules. For example, if the natural language text contains paired title marks (in Chinese, 《 》), the text string containing the paired title marks is taken as a sample datum, and the named-entity attribute tag corresponding to each segmented word in the text string is labeled. As another example, if a text string in the natural language text contains a segmented word that exactly matches a predetermined title, the text string containing that segmented word is taken as a sample datum, and the named-entity attribute tag corresponding to each segmented word in the text string is labeled. By labeling text strings with the foregoing heuristic rules, weakly labeled sample data can be obtained automatically, improving processing efficiency.
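The title-mark heuristic can be sketched as below, assuming a word-segmentation callable `segment` is available; the regular expression, tag names, and helper function are illustrative rather than the patent's exact rule.

```python
import re

TITLE_SPAN = re.compile(r"《([^》]+)》")   # text enclosed by paired title marks

def weak_label(text, segment):
    """Tag every segmented word inside 《...》 as B/I (entity) and the rest as O."""
    entity_spans = set(TITLE_SPAN.findall(text))
    labeled = []
    for piece in re.split(r"[《》]", text):
        words = segment(piece)
        if piece in entity_spans:
            labeled += [(w, "B" if i == 0 else "I") for i, w in enumerate(words)]
        else:
            labeled += [(w, "O") for w in words]
    return labeled

# Toy usage with a whitespace segmenter standing in for a real one:
# weak_label("why is 《I had never expected》 so popular", str.split)
# -> [('why','O'), ('is','O'), ('I','B'), ('had','I'), ('never','I'), ('expected','I'),
#     ('so','O'), ('popular','O')]
```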
In step S120, the segmented words in the multiple labeled sample data are mapped to word vectors, the sample data are taken as training samples, and the RNN named-entity recognition model is trained so as to learn its parameters.
According to another embodiment of the present invention, step S120 may comprise: generating the input layer of the RNN named-entity recognition model from the segmented words of the text strings in the training samples; looking up, in a predefined vocabulary, the word vector corresponding to each segmented word in the input layer, and generating the word-vector layer of the RNN named-entity recognition model from these word vectors; applying a matrix mapping to the word-vector layer to obtain the hidden layer of the RNN named-entity recognition model; taking the word vector of each segmented word as the condition, and computing the probabilities of the multiple named-entity attribute tags corresponding to each segmented word under that condition, as the output layer of the RNN named-entity recognition model; and training the RNN named-entity recognition model with the multiple labeled sample data to obtain its parameters.
Specifically, Fig. 3 is a schematic diagram of the RNN named-entity recognition model according to Embodiment One of the present invention. With reference to Fig. 3, the text strings in the training samples are segmented; for example, suppose a text string comprises T segmented words, denoted Text = (w_1, ..., w_T). Each segmented word obtained by the segmentation is taken as input, generating the input layer of the RNN named-entity recognition model. Each segmented word w_i in the text string belongs to a word in a predefined vocabulary of size |V| (which includes a special word <OOV> for marking out-of-vocabulary words not in the dictionary). Each segmented word is mapped to its corresponding word vector by dictionary lookup, and this vector layer is called the word-vector layer of the RNN named-entity recognition model.
It should be noted here that word vectors are a way of mathematizing the words of a language: as the name suggests, a word vector represents a word as a vector. The simplest word-vector scheme represents a word with a very long vector whose length is the size of the vocabulary; only one component of the vector is "1" and all the others are "0", and the position of the "1" corresponds to the position of the word in the vocabulary. For example, "microphone" would be represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...]. However, this scheme cannot capture the similarity between words well. On this basis, another word-vector representation appeared that overcomes this drawback; its basic principle is to represent a word directly with an ordinary dense vector, for example [0.792, 0.177, 0.107, 0.109, 0.542, ...].
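The two representations described above can be contrasted in a few lines of Python with numpy; the vocabulary, dimensions, and random values below are purely illustrative.

```python
import numpy as np

vocab = ["<OOV>", "microphone", "drama", "popular"]      # toy vocabulary of size |V|

# One-hot representation: a |V|-length vector with a single 1 at the word's index.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("microphone")] = 1.0                  # [0., 1., 0., 0.]

# Dense word vector: a short real-valued vector looked up in an embedding table
# C of shape (|V|, EMBEDDING_SIZE); the values here are random placeholders.
EMBEDDING_SIZE = 5
C = np.random.randn(len(vocab), EMBEDDING_SIZE)
dense = C[vocab.index("microphone")]
```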
In practical applications, the word-vector layer of the network represents the word vector corresponding to each input word w_i, which is a column vector C(w_i) of length EMBEDDING_SIZE; the hidden layer of the network represents the state of the designed RNN named-entity recognition model at each time step i, a column vector h_i of length HIDDEN_SIZE. Here, EMBEDDING_SIZE commonly ranges from 50 to 1000, and HIDDEN_SIZE is commonly 1 to 4 times EMBEDDING_SIZE.
The hidden layer of the RNN named-entity recognition model sits on top of the word-vector layer. A characteristic of an RNN is that, when computing the value of the current hidden layer, it uses both the value of the word-vector layer and the vector value of the hidden-layer nodes at the previous time step. On top of the hidden layer is the output layer, in which each node represents a possible named-entity attribute tag (such as B, I, or O) for a given segmented word. The output layer can also be called the SoftMax layer; it computes the probability that each segmented word belongs to each named-entity attribute tag. The RNN named-entity recognition model is established from the input layer, word-vector layer, hidden layer, and output layer generated above. The starting point of this embodiment is to learn the parameters of the established RNN named-entity recognition model from the aforementioned labeled sample data, so that named entities can be recognized in other texts to which the rules cannot be generalized (for example, texts without title marks).
Preferably, the matrix mapping applied to the word-vector layer to obtain the hidden layer of the RNN named-entity recognition model is performed by the following formula:
[h_i]_j = sigmoid([W·C(w_i)]_j + [U·h_{i-1}]_j)
where [h_i]_j is the j-th element of the i-th vector of the hidden layer, W and U are transformation-matrix parameters of the RNN named-entity recognition model, C(w_i) is the i-th word vector of the word-vector layer, and h_{i-1} is the (i-1)-th vector of the hidden layer. Here, W is a matrix with HIDDEN_SIZE rows and EMBEDDING_SIZE columns, U is a matrix with HIDDEN_SIZE rows and HIDDEN_SIZE columns, and sigmoid is a nonlinear transformation function.
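A minimal numpy sketch of this hidden-layer recurrence is given below; the matrix shapes follow the text, while the concrete sizes and the random initialization are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

EMBEDDING_SIZE, HIDDEN_SIZE = 100, 200            # illustrative sizes within the stated ranges
W = np.random.randn(HIDDEN_SIZE, EMBEDDING_SIZE)  # word-vector -> hidden transform
U = np.random.randn(HIDDEN_SIZE, HIDDEN_SIZE)     # previous hidden -> hidden transform

def hidden_state(c_wi, h_prev):
    """h_i = sigmoid(W C(w_i) + U h_{i-1}), applied element-wise."""
    return sigmoid(W @ c_wi + U @ h_prev)

# e.g. h1 = hidden_state(np.random.randn(EMBEDDING_SIZE), np.zeros(HIDDEN_SIZE))
```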
Further, taking the word vector of each segmented word as the condition, the probabilities of the multiple named-entity attribute tags corresponding to each segmented word under that condition are computed by the following formula, as the output layer of the RNN named-entity recognition model:
P(label = L_i | w_i) = exp(O_{L_i} · h_i) / Σ_{k=1..K} exp(O_k · h_i)
where L_i is the i-th named-entity attribute tag, w_i is the i-th segmented word, h_i is the i-th vector of the hidden layer, O is a transformation-matrix parameter of the RNN named-entity recognition model, and K is the number of rows of the transformation matrix O. Here O is a matrix with K rows and HIDDEN_SIZE columns.
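The SoftMax output layer amounts to the following computation; subtracting the maximum score for numerical stability is an implementation detail added here and is not part of the formula above.

```python
import numpy as np

def tag_probabilities(h_i, O):
    """P(label = k | w_i) = exp(O_k · h_i) / sum_j exp(O_j · h_i).
    O has K rows (one per named-entity attribute tag) and HIDDEN_SIZE columns."""
    scores = O @ h_i                  # one score per tag
    scores = scores - scores.max()    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()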
Preferably, the process of training the RNN named-entity recognition model with the multiple labeled sample data to obtain its parameters may comprise: obtaining the conditional probabilities of the multiple named-entity attribute tags corresponding to each segmented word; establishing a loss function according to the conditional probabilities of the multiple named-entity attribute tags; and training on the loss function with the multiple labeled sample data to obtain the parameter set of the RNN named-entity recognition model that minimizes the loss function, where the parameter set comprises the word vectors and the transformation-matrix parameters.
Specifically, the RNN named-entity recognition model is trained with the multiple labeled sample data by minimizing the following loss function, thereby obtaining the parameters of the model:
J(θ) = - Σ_{<Text, Label>} Σ_i log P(label = L_i | w_i; θ)
where the sum over <Text, Label> pairs runs over all labeled sample data, θ is the parameter set of the RNN named-entity recognition model that makes J(θ) minimal, the parameter set comprises the word vectors and the transformation-matrix parameters, L_i is the i-th named-entity attribute tag, and w_i is the i-th segmented word. It should be noted here that the parameters of the RNN named-entity recognition model are: the word vector C(w) of each word w in the vocabulary, and the transformation-matrix parameters W, U, O; this set of parameters is denoted θ.
It should also be noted that the above formula is the loss function, and the RNN named-entity recognition model is trained by stochastic gradient descent. Specifically, the optimal parameters θ can be obtained with stochastic gradient descent (Stochastic Gradient Descent, SGD) and the back-propagation-through-time algorithm (Back Propagation Through Time, BPTT). The idea of SGD is to iteratively update the randomly initialized parameters by computing the gradient (the partial derivatives with respect to the parameters) on a group of training samples; each update subtracts from the parameters a set learning rate multiplied by the computed gradient, so that after many iterations the difference, under the defined loss function, between the values computed by the RNN named-entity recognition model from its parameters and the actual values is minimized. In addition, BPTT is an efficient method for computing the parameter gradients in an RNN.
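A compact sketch of the per-sample loss and the SGD update rule is shown below; the full RNN forward pass and the BPTT gradient computation are not spelled out in the patent, so they are assumed to be supplied elsewhere and only the loss and the parameter update are illustrated, with an assumed learning-rate value.

```python
import numpy as np

def sample_loss(tag_probs, gold_tags):
    """Negative log-likelihood of the gold tags for one text string:
    J = -sum_i log P(label = L_i | w_i); `tag_probs[i]` maps tags to probabilities."""
    return -sum(np.log(tag_probs[i][t]) for i, t in enumerate(gold_tags))

def sgd_step(params, grads, learning_rate=0.05):
    """One SGD update: move each parameter against its gradient (as computed by BPTT),
    scaled by a set learning rate; the rate value here is illustrative."""
    for name in params:
        params[name] -= learning_rate * grads[name]
    return params
```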
With this method for training an RNN named-entity recognition model, multiple labeled sample data are obtained, the segmented words in the multiple labeled sample data are mapped to word vectors, the sample data are taken as training samples, and the RNN named-entity recognition model is trained so as to learn its parameters. Compared with the prior art, there is no need to rely on query logs or the hidden Markov assumption; the model has better generalization ability, can be applied to recognizing named entities in natural language text, and recognizes named entities quickly and with higher accuracy.
Embodiment two
Fig. 4 is a flowchart of the named-entity recognition method according to Embodiment Two of the present invention. The method may be performed, for example, on a search engine server.
With reference to Fig. 4, in step S210, a text string is obtained.
The text string may be a search query sent from a client. For example, a user enters "why I had never expected so fire absolutely?" in the search interface of a browser and searches, and the browser application sends the search query to the search engine server.
In step S220, word segmentation is performed on the text string to obtain multiple segmented words.
For example, the search engine server may use existing word-segmentation techniques to segment the acquired text string into multiple segmented words.
In step S230, the named-entity attribute tag with the maximum probability corresponding to each segmented word is obtained with the RNN named-entity recognition model trained by the method according to Embodiment One. The method for training the RNN named-entity recognition model has been described in Embodiment One above.
In step S240, named entities are recognized in the text string according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
After the named-entity attribute tags corresponding to the segmented words have been obtained in step S230, the text string can be labeled according to these tags, and the named entities in the text string are finally recognized.
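Recovering entities from the per-word tags can be done with a simple pass over the tag sequence, sketched below; the function and the example words are illustrative, and joining with spaces is only one choice (Chinese text would typically be concatenated without separators).

```python
def decode_entities(words, tags):
    """Turn per-word B/I/O tags (optionally typed, e.g. B-DRAMA) into
    (entity_text, entity_type) pairs; the type is None for untyped tags."""
    entities, current, current_type = [], [], None

    def flush():
        if current:
            entities.append((" ".join(current), current_type))

    for word, tag in zip(words, tags):
        if tag.startswith("B"):
            flush()
            current = [word]
            current_type = tag.split("-", 1)[1] if "-" in tag else None
        elif tag.startswith("I") and current:
            current.append(word)
        else:                     # "O", or an I tag with no entity currently open
            flush()
            current, current_type = [], None
    flush()
    return entities

# e.g. decode_entities(["I had", "never expected", "is", "popular"],
#                      ["B-DRAMA", "I-DRAMA", "O", "O"])
# -> [("I had never expected", "DRAMA")]
```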
Further, as mentioned above, the maximum-probability named-entity attribute tag corresponding to a segmented word may also include the type of the named entity to which the segmented word belongs; therefore, the method may further comprise: obtaining the type of the named entity according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
With this named-entity recognition method, the acquired text string is segmented into multiple segmented words, the maximum-probability named-entity attribute tag corresponding to each segmented word is obtained with the trained RNN named-entity recognition model, and finally named entities can be recognized in the text string according to these tags. Compared with the prior art, named entities in natural language text can be recognized quickly, the accuracy of named-entity recognition is improved, and the type of the recognized named entity can also be obtained.
Embodiment three
Fig. 5 is a logic diagram of the device for training an RNN named-entity recognition model according to Embodiment Three of the present invention.
With reference to Fig. 5, the RNN named-entity recognition model is used to recognize named entities in text, and the device for training the RNN named-entity recognition model comprises a sample-data acquisition module 310 and a parameter learning module 320.
The sample-data acquisition module 310 is configured to obtain multiple labeled sample data; each sample datum comprises a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string.
Optionally, the named-entity attribute tag of the segmented word in the text string includes information on whether the segmented word belongs to a named entity. Further, the named-entity attribute tag of the segmented word in the text string may also include a position tag of the segmented word within the named entity to which it belongs.
Preferably, the named-entity attribute tags of segmented words in the text string include: a named-entity beginning tag, a named-entity continuation tag, and a non-named-entity tag.
Further, the named-entity attribute tag of the segmented word in the text string also includes the type of the named entity to which the segmented word belongs.
Optionally, the sample-data acquisition module 310 is also configured to obtain multiple labeled sample data from natural language text according to heuristic rules: if the natural language text contains paired title marks, the sample-data acquisition module 310 takes the text string containing the paired title marks as a sample datum and labels the named-entity attribute tag corresponding to each segmented word in the text string; or, if a text string in the natural language text contains a segmented word that exactly matches a predetermined title, the sample-data acquisition module 310 takes the text string containing that segmented word as a sample datum and labels the named-entity attribute tag corresponding to each segmented word in the text string.
The parameter learning module 320 is configured to map the segmented words in the multiple labeled sample data to word vectors, take the sample data as training samples, and train the RNN named-entity recognition model so as to learn its parameters.
Preferably, the parameter learning module 320 may comprise:
an input-layer generation unit, configured to generate the input layer of the RNN named-entity recognition model from the segmented words of the text strings in the training samples;
a word-vector-layer generation unit, configured to look up, in a predefined vocabulary, the word vector corresponding to each segmented word in the input layer, and generate the word-vector layer of the RNN named-entity recognition model from these word vectors;
a hidden-layer generation unit, configured to apply a matrix mapping to the word-vector layer to obtain the hidden layer of the RNN named-entity recognition model;
an output-layer generation unit, configured to take the word vector of each segmented word as the condition and compute the probabilities of the multiple named-entity attribute tags corresponding to each segmented word under that condition, as the output layer of the RNN named-entity recognition model;
a parameter learning unit, configured to train the RNN named-entity recognition model with the multiple labeled sample data to obtain its parameters.
Further, the hidden-layer generation unit is configured to apply the matrix mapping to the word-vector layer by the following formula to obtain the hidden layer:
[h_i]_j = sigmoid([W·C(w_i)]_j + [U·h_{i-1}]_j)
where [h_i]_j is the j-th element of the i-th vector of the hidden layer, W and U are transformation-matrix parameters of the RNN named-entity recognition model, C(w_i) is the i-th word vector of the word-vector layer, and h_{i-1} is the (i-1)-th vector of the hidden layer.
Optionally, the output-layer generation unit is configured to compute the probabilities of the multiple named-entity attribute tags corresponding to each segmented word by the following formula, as the output layer of the RNN named-entity recognition model:
P(label = L_i | w_i) = exp(O_{L_i} · h_i) / Σ_{k=1..K} exp(O_k · h_i)
where L_i is the i-th named-entity attribute tag, w_i is the i-th segmented word, h_i is the i-th vector of the hidden layer, O is a transformation-matrix parameter of the RNN named-entity recognition model, and K is the number of rows of the transformation matrix O.
Preferably, the parameter learning unit is configured to obtain the conditional probabilities of the multiple named-entity attribute tags corresponding to each segmented word, establish a loss function according to these conditional probabilities, and train on the loss function with the multiple labeled sample data to obtain the parameter set of the RNN named-entity recognition model that minimizes the value of the loss function, where the parameter set comprises the word vectors and the transformation-matrix parameters.
Specifically, the RNN named-entity recognition model is trained with the multiple labeled sample data by minimizing the following loss function, thereby obtaining its parameters:
J(θ) = - Σ_{<Text, Label>} Σ_i log P(label = L_i | w_i; θ)
where the sum over <Text, Label> pairs runs over all labeled sample data, θ is the parameter set of the RNN named-entity recognition model that makes J(θ) minimal, the parameter set comprises the word vectors and the transformation-matrix parameters, L_i is the i-th named-entity attribute tag, and w_i is the i-th segmented word.
With this device for training an RNN named-entity recognition model, multiple labeled sample data are obtained, the segmented words in the multiple labeled sample data are mapped to word vectors, the sample data are taken as training samples, and the RNN named-entity recognition model is trained so as to learn its parameters. Compared with the prior art, there is no need to rely on query logs or the hidden Markov assumption; the device has better generalization ability, can be applied to recognizing named entities in natural language text, and recognizes named entities quickly and with higher accuracy.
Embodiment four
Fig. 6 is a logic diagram of the named-entity recognition device according to Embodiment Four of the present invention.
With reference to Fig. 6, the named-entity recognition device comprises a text-string acquisition module 410, a text-string word-segmentation module 420, a named-entity attribute tag acquisition module 430, and a named-entity recognition module 440.
The text-string acquisition module 410 is configured to obtain a text string.
The text-string word-segmentation module 420 is configured to perform word segmentation on the text string to obtain multiple segmented words.
The named-entity attribute tag acquisition module 430 is configured to obtain the maximum-probability named-entity attribute tag corresponding to each segmented word with the RNN named-entity recognition model trained by the device according to Embodiment Three. The device for training the RNN named-entity recognition model has been described in Embodiment Three above.
The named-entity recognition module 440 is configured to recognize named entities in the text string according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
Further, the recognition device may also comprise: a named-entity type acquisition module (not shown), configured to obtain the type of the named entity according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
With this named-entity recognition device, the acquired text string is segmented into multiple segmented words, the maximum-probability named-entity attribute tag corresponding to each segmented word is obtained with the trained RNN named-entity recognition model, and finally named entities can be recognized in the text string according to these tags. Compared with the prior art, named entities in natural language text can be recognized quickly, the accuracy of named-entity recognition is improved, and the type of the recognized named entity can also be obtained.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can readily occur to those skilled in the art within the technical scope disclosed by the present invention should be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (20)

1. A method for training a recurrent neural network (RNN) named-entity recognition model, the RNN named-entity recognition model being used to recognize named entities in text, characterized in that the method comprises:
obtaining multiple labeled sample data, each sample datum comprising a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string;
mapping the segmented words in the multiple labeled sample data to word vectors, taking the sample data as training samples, and training the RNN named-entity recognition model so as to learn the parameters of the RNN named-entity recognition model.
2. The method according to claim 1, characterized in that the named-entity attribute tag of the segmented word in the text string includes information on whether the segmented word belongs to a named entity.
3. The method according to claim 2, characterized in that the named-entity attribute tag of the segmented word in the text string also includes a position tag of the segmented word within the named entity to which it belongs.
4. The method according to claim 1, characterized in that the named-entity attribute tags of segmented words in the text string include: a named-entity beginning tag, a named-entity continuation tag, and a non-named-entity tag.
5. The method according to any one of claims 1 to 4, characterized in that the named-entity attribute tag of the segmented word in the text string also includes the type of the named entity to which the segmented word belongs.
6. The method according to claim 5, characterized in that the method further comprises:
obtaining multiple labeled sample data from natural language text according to heuristic rules, wherein
if the natural language text contains paired title marks, the text string containing the paired title marks is taken as a sample datum, and the named-entity attribute tag corresponding to each segmented word in the text string is labeled, or
if a text string in the natural language text contains a segmented word that exactly matches a predetermined title, the text string containing that segmented word is taken as a sample datum, and the named-entity attribute tag corresponding to each segmented word in the text string is labeled.
7. The method according to claim 5, characterized in that the process of taking the sample data as training samples and training the RNN named-entity recognition model so as to learn its parameters comprises:
generating the input layer of the RNN named-entity recognition model from the segmented words of the text strings in the training samples,
looking up, in a predefined vocabulary, the word vector corresponding to each segmented word in the input layer, and generating the word-vector layer of the RNN named-entity recognition model from these word vectors,
applying a matrix mapping to the word-vector layer to obtain the hidden layer of the RNN named-entity recognition model,
taking the word vector of each segmented word as the condition, and computing the probabilities of the multiple named-entity attribute tags corresponding to each segmented word under that condition, as the output layer of the RNN named-entity recognition model,
training the RNN named-entity recognition model with the multiple labeled sample data to obtain the parameters of the RNN named-entity recognition model.
8. The method according to claim 7, characterized in that the process of training the RNN named-entity recognition model with the multiple labeled sample data to obtain its parameters comprises:
obtaining the conditional probabilities of the multiple named-entity attribute tags corresponding to each segmented word,
establishing a loss function according to the conditional probabilities of the multiple named-entity attribute tags,
training on the loss function with the multiple labeled sample data to obtain the parameter set of the RNN named-entity recognition model that minimizes the loss function, where the parameter set comprises the word vectors and the transformation-matrix parameters.
9. A named-entity recognition method, characterized in that the recognition method comprises:
obtaining a text string;
performing word segmentation on the text string to obtain multiple segmented words;
obtaining, for each segmented word, the named-entity attribute tag with the maximum probability by using the RNN named-entity recognition model trained by the method according to claim 5;
recognizing named entities in the text string according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
10. The method according to claim 9, characterized in that the method further comprises: obtaining the type of the named entity according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
11. A device for training a recurrent neural network (RNN) named-entity recognition model, the RNN named-entity recognition model being used to recognize named entities in text, characterized in that the device comprises:
a sample-data acquisition module, configured to obtain multiple labeled sample data, each sample datum comprising a text string and multiple word-segment labeled data, where each word-segment labeled datum comprises a segmented word separated from the text string and its named-entity attribute tag in the text string;
a parameter learning module, configured to map the segmented words in the multiple labeled sample data to word vectors, take the sample data as training samples, and train the RNN named-entity recognition model so as to learn the parameters of the RNN named-entity recognition model.
12. The device according to claim 11, characterized in that the named-entity attribute tag of the segmented word in the text string includes information on whether the segmented word belongs to a named entity.
13. The device according to claim 12, characterized in that the named-entity attribute tag of the segmented word in the text string also includes a position tag of the segmented word within the named entity to which it belongs.
14. The device according to claim 11, characterized in that the named-entity attribute tags of segmented words in the text string include: a named-entity beginning tag, a named-entity continuation tag, and a non-named-entity tag.
15. The device according to any one of claims 11 to 14, characterized in that the named-entity attribute tag of the segmented word in the text string also includes the type of the named entity to which the segmented word belongs.
16. The device according to claim 15, characterized in that the sample-data acquisition module is also configured to obtain multiple labeled sample data from natural language text according to heuristic rules, wherein
if the natural language text contains paired title marks, the sample-data acquisition module takes the text string containing the paired title marks as a sample datum, and labels the named-entity attribute tag corresponding to each segmented word in the text string, or
if a text string in the natural language text contains a segmented word that exactly matches a predetermined title, the sample-data acquisition module takes the text string containing that segmented word as a sample datum, and labels the named-entity attribute tag corresponding to each segmented word in the text string.
17. The device according to claim 15, characterized in that the parameter learning module comprises:
an input-layer generation unit, configured to generate the input layer of the RNN named-entity recognition model from the segmented words of the text strings in the training samples,
a word-vector-layer generation unit, configured to look up, in a predefined vocabulary, the word vector corresponding to each segmented word in the input layer, and generate the word-vector layer of the RNN named-entity recognition model from these word vectors,
a hidden-layer generation unit, configured to apply a matrix mapping to the word-vector layer to obtain the hidden layer of the RNN named-entity recognition model,
an output-layer generation unit, configured to take the word vector of each segmented word as the condition and compute the probabilities of the multiple named-entity attribute tags corresponding to each segmented word under that condition, as the output layer of the RNN named-entity recognition model,
a parameter learning unit, configured to train the RNN named-entity recognition model with the multiple labeled sample data to obtain the parameters of the RNN named-entity recognition model.
18. The device according to claim 17, characterized in that the parameter learning unit is configured to obtain the conditional probabilities of the multiple named-entity attribute tags corresponding to each segmented word, establish a loss function according to these conditional probabilities, and train on the loss function with the multiple labeled sample data to obtain the parameter set of the RNN named-entity recognition model that minimizes the loss function, where the parameter set comprises the word vectors and the transformation-matrix parameters.
19. A named-entity recognition device, characterized in that the recognition device comprises:
a text-string acquisition module, configured to obtain a text string;
a text-string word-segmentation module, configured to perform word segmentation on the text string to obtain multiple segmented words;
a named-entity attribute tag acquisition module, configured to obtain, for each segmented word, the named-entity attribute tag with the maximum probability by using the RNN named-entity recognition model trained by the device according to claim 17;
a named-entity recognition module, configured to recognize named entities in the text string according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
20. The device according to claim 19, characterized in that the recognition device further comprises: a named-entity type acquisition module, configured to obtain the type of the named entity according to the maximum-probability named-entity attribute tags corresponding to the segmented words.
CN201510082318.3A 2015-02-15 2015-02-15 Named-entity recognition model training method and named-entity recognition method and device Pending CN104615589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510082318.3A CN104615589A (en) 2015-02-15 2015-02-15 Named-entity recognition model training method and named-entity recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510082318.3A CN104615589A (en) 2015-02-15 2015-02-15 Named-entity recognition model training method and named-entity recognition method and device

Publications (1)

Publication Number Publication Date
CN104615589A true CN104615589A (en) 2015-05-13

Family

ID=53150041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510082318.3A Pending CN104615589A (en) 2015-02-15 2015-02-15 Named-entity recognition model training method and named-entity recognition method and device

Country Status (1)

Country Link
CN (1) CN104615589A (en)

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device
CN105183720A (en) * 2015-08-05 2015-12-23 百度在线网络技术(北京)有限公司 Machine translation method and apparatus based on RNN model
CN105320645A (en) * 2015-09-24 2016-02-10 天津海量信息技术有限公司 Recognition method for Chinese company name
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105808523A (en) * 2016-03-08 2016-07-27 浪潮软件股份有限公司 Method and apparatus for identifying document
CN105868184A (en) * 2016-05-10 2016-08-17 大连理工大学 Chinese name recognition method based on recurrent neural network
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
CN105893354A (en) * 2016-05-03 2016-08-24 成都数联铭品科技有限公司 Word segmentation method based on bidirectional recurrent neural network
CN105930413A (en) * 2016-04-18 2016-09-07 北京百度网讯科技有限公司 Training method for similarity model parameters, search processing method and corresponding apparatuses
CN105955953A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Word segmentation system
CN105955952A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Information extraction method based on bidirectional recurrent neural network
CN105955954A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 New enterprise name discovery method based on bidirectional recurrent neural network
CN105975456A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Enterprise entity name analysis and identification system
CN105975455A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 information analysis system based on bidirectional recurrent neural network
CN106095749A (en) * 2016-06-03 2016-11-09 杭州量知数据科技有限公司 A kind of text key word extracting method based on degree of depth study
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN106383816A (en) * 2016-09-26 2017-02-08 大连民族大学 Chinese minority region name identification method based on deep learning
CN106407183A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN106557563A (en) * 2016-11-15 2017-04-05 北京百度网讯科技有限公司 Query statement based on artificial intelligence recommends method and device
CN106570170A (en) * 2016-11-09 2017-04-19 武汉泰迪智慧科技有限公司 Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN106708804A (en) * 2016-12-27 2017-05-24 努比亚技术有限公司 Method and device for generating word vectors
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of keyword extracting method and extraction system
CN106815194A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and keyword recognition method and device
CN106815193A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and wrong word recognition methods and device
CN106844788A (en) * 2017-03-17 2017-06-13 重庆文理学院 A kind of library's intelligent search sort method and system
CN106970902A (en) * 2016-01-13 2017-07-21 北京国双科技有限公司 A kind of Chinese word cutting method and device
CN107704454A (en) * 2017-10-25 2018-02-16 古联(北京)数字传媒科技有限公司 The recognition methods of ancient books proper name and device
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and equipment for Extracting Information
CN107797987A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on Bi LSTM CNN
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition methods and device
CN107832303A (en) * 2017-11-22 2018-03-23 古联(北京)数字传媒科技有限公司 The recognition methods of ancient books title and device
WO2018059302A1 (en) * 2016-09-29 2018-04-05 腾讯科技(深圳)有限公司 Text recognition method and device, and storage medium
CN108074565A (en) * 2016-11-11 2018-05-25 上海诺悦智能科技有限公司 Phonetic order redirects the method and system performed with detailed instructions
CN108090044A (en) * 2017-12-05 2018-05-29 五八有限公司 Contact information identification method and device
CN108205524A (en) * 2016-12-20 2018-06-26 北京京东尚科信息技术有限公司 Text data processing method and device
CN108363701A (en) * 2018-04-13 2018-08-03 达而观信息科技(上海)有限公司 Named entity recognition method and system
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 Named entity extraction method for bid-winning webpages based on second-order HMM
CN108536733A (en) * 2017-03-02 2018-09-14 埃森哲环球解决方案有限公司 Artificial intelligence digital agent
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Flight change information extraction method and system
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 Training method and device for a multi-task deep learning model for multi-type entity recognition
CN109033427A (en) * 2018-08-10 2018-12-18 北京字节跳动网络技术有限公司 Stock screening method and device, computer equipment and readable storage medium
CN109710925A (en) * 2018-12-12 2019-05-03 新华三大数据技术有限公司 Named entity recognition method and device
CN109726398A (en) * 2018-12-27 2019-05-07 北京奇安信科技有限公司 Entity recognition and attribute determination method, system, equipment and medium
CN109740150A (en) * 2018-12-20 2019-05-10 出门问问信息科技有限公司 Address resolution method, device, computer equipment and computer readable storage medium
CN110222340A (en) * 2019-06-06 2019-09-10 掌阅科技股份有限公司 Training method, electronic device and storage medium for a book character name recognition model
CN110275953A (en) * 2019-06-21 2019-09-24 四川大学 Personality classification method and device
CN110402445A (en) * 2017-04-20 2019-11-01 谷歌有限责任公司 Processing sequence data using recurrent neural networks
CN110516228A (en) * 2019-07-04 2019-11-29 湖南星汉数智科技有限公司 Named entity recognition method and device, computer device and computer-readable storage medium
CN110598210A (en) * 2019-08-29 2019-12-20 深圳市优必选科技股份有限公司 Entity recognition model training method, entity recognition device, entity recognition equipment and medium
CN110728147A (en) * 2018-06-28 2020-01-24 阿里巴巴集团控股有限公司 Model training method and named entity recognition method
CN110889287A (en) * 2019-11-08 2020-03-17 创新工场(广州)人工智能研究有限公司 Method and device for named entity recognition
CN110929875A (en) * 2019-10-12 2020-03-27 平安国际智慧城市科技股份有限公司 Intelligent language learning method, system, device and medium based on machine learning
CN111105458A (en) * 2018-10-25 2020-05-05 深圳市深蓝牙医疗科技有限公司 Oral implant positioning method, oral tissue identification model establishing method, device, equipment and storage medium
CN111191107A (en) * 2018-10-25 2020-05-22 北京嘀嘀无限科技发展有限公司 System and method for recalling points of interest using annotation model
WO2020132985A1 (en) * 2018-12-26 2020-07-02 深圳市优必选科技有限公司 Self-training method and apparatus for model, computer device, and storage medium
CN111368036A (en) * 2020-03-05 2020-07-03 百度在线网络技术(北京)有限公司 Method and apparatus for searching information
CN111523314A (en) * 2020-07-03 2020-08-11 支付宝(杭州)信息技术有限公司 Model adversarial training and named entity recognition method and device
CN111563380A (en) * 2019-01-25 2020-08-21 浙江大学 Named entity identification method and device
US11113608B2 (en) 2017-10-30 2021-09-07 Accenture Global Solutions Limited Hybrid bot framework for enterprises

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265521A1 (en) * 2005-05-05 2012-10-18 Scott Miller Methods and systems relating to information extraction
CN101075228A (en) * 2006-05-15 2007-11-21 松下电器产业株式会社 Method and apparatus for named entity recognition in natural language
CN102314417A (en) * 2011-09-22 2012-01-11 西安电子科技大学 Method for identifying Web named entity based on statistical model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUOYU WANG et al.: "Using hybrid neural network to address Chinese named entity recognition", Proceedings of CCIS 2014 *

Cited By (87)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device
CN104899304B (en) * 2015-06-12 2018-02-16 北京京东尚科信息技术有限公司 Named entity recognition method and device
CN105183720A (en) * 2015-08-05 2015-12-23 百度在线网络技术(北京)有限公司 Machine translation method and apparatus based on RNN model
CN105183720B (en) * 2015-08-05 2019-07-09 百度在线网络技术(北京)有限公司 Machine translation method and device based on RNN model
CN105320645B (en) * 2015-09-24 2019-07-12 天津海量信息技术股份有限公司 Chinese enterprise name recognition method
CN105320645A (en) * 2015-09-24 2016-02-10 天津海量信息技术有限公司 Recognition method for Chinese company name
CN106815194A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and keyword recognition method and device
CN106815193A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and wrong word recognition method and device
CN106970902A (en) * 2016-01-13 2017-07-21 北京国双科技有限公司 Chinese word segmentation method and device
CN105808523A (en) * 2016-03-08 2016-07-27 浪潮软件股份有限公司 Method and apparatus for identifying document
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105786782B (en) * 2016-03-25 2018-10-19 北京搜狗信息服务有限公司 Word vector training method and device
CN105894088B (en) * 2016-03-25 2018-06-29 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on deep learning and distributed semantic features
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on deep learning and distributed semantic features
CN105930413A (en) * 2016-04-18 2016-09-07 北京百度网讯科技有限公司 Training method for similarity model parameters, search processing method and corresponding apparatuses
CN105955952A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Information extraction method based on bidirectional recurrent neural network
CN105893354A (en) * 2016-05-03 2016-08-24 成都数联铭品科技有限公司 Word segmentation method based on bidirectional recurrent neural network
CN105955953A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 Word segmentation system
CN105955954A (en) * 2016-05-03 2016-09-21 成都数联铭品科技有限公司 New enterprise name discovery method based on bidirectional recurrent neural network
CN105975456A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Enterprise entity name analysis and identification system
CN105975455A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Information analysis system based on bidirectional recurrent neural network
CN105868184A (en) * 2016-05-10 2016-08-17 大连理工大学 Chinese name recognition method based on recurrent neural network
CN105868184B (en) * 2016-05-10 2018-06-08 大连理工大学 Chinese personal name recognition method based on recurrent neural network
CN106095749A (en) * 2016-06-03 2016-11-09 杭州量知数据科技有限公司 Text keyword extraction method based on deep learning
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 Evaluation method and device for microblog topic recommendation
CN106383816A (en) * 2016-09-26 2017-02-08 大连民族大学 Chinese minority region name identification method based on deep learning
CN106383816B (en) * 2016-09-26 2018-11-30 大连民族大学 Recognition method for place names in Chinese minority areas based on deep learning
CN106407183A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Method and device for generating medical named entity recognition system
CN106407183B (en) * 2016-09-28 2019-06-28 医渡云(北京)技术有限公司 Medical named entity recognition system generation method and device
CN107885716B (en) * 2016-09-29 2020-02-11 腾讯科技(深圳)有限公司 Text recognition method and device
US11068655B2 (en) 2016-09-29 2021-07-20 Tencent Technology (Shenzhen) Company Limited Text recognition based on training of models at a plurality of training nodes
WO2018059302A1 (en) * 2016-09-29 2018-04-05 腾讯科技(深圳)有限公司 Text recognition method and device, and storage medium
CN107885716A (en) * 2016-09-29 2018-04-06 腾讯科技(深圳)有限公司 Text recognition method and device
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Named entity recognition method and system
CN106570170A (en) * 2016-11-09 2017-04-19 武汉泰迪智慧科技有限公司 Integrated text classification and named entity recognition method and system based on deep recurrent neural network
CN108074565A (en) * 2016-11-11 2018-05-25 上海诺悦智能科技有限公司 Method and system for voice instruction jump execution and detailed instruction execution
CN106557563B (en) * 2016-11-15 2020-09-25 北京百度网讯科技有限公司 Query statement recommendation method and device based on artificial intelligence
CN106557563A (en) * 2016-11-15 2017-04-05 北京百度网讯科技有限公司 Query statement recommendation method and device based on artificial intelligence
CN108205524B (en) * 2016-12-20 2022-01-07 北京京东尚科信息技术有限公司 Text data processing method and device
CN108205524A (en) * 2016-12-20 2018-06-26 北京京东尚科信息技术有限公司 Text data processing method and device
CN106776562B (en) * 2016-12-20 2020-07-28 上海智臻智能网络科技股份有限公司 Keyword extraction method and extraction system
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 Keyword extraction method and extraction system
CN106708804A (en) * 2016-12-27 2017-05-24 努比亚技术有限公司 Method and device for generating word vectors
CN108536733A (en) * 2017-03-02 2018-09-14 埃森哲环球解决方案有限公司 Artificial intelligence digital agent
CN106844788B (en) * 2017-03-17 2020-02-18 重庆文理学院 Library intelligent search sorting method and system
CN106844788A (en) * 2017-03-17 2017-06-13 重庆文理学院 Library intelligent search sorting method and system
CN110402445B (en) * 2017-04-20 2023-07-11 谷歌有限责任公司 Method and system for browsing sequence data using recurrent neural network
CN110402445A (en) * 2017-04-20 2019-11-01 谷歌有限责任公司 Processing sequence data using recurrent neural networks
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition method and device
CN107797987A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on Bi-LSTM-CNN
CN107797987B (en) * 2017-10-12 2021-02-09 北京知道未来信息技术有限公司 Bi-LSTM-CNN-based mixed corpus named entity identification method
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and device for extracting information
US11288593B2 (en) 2017-10-23 2022-03-29 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information
CN107704454A (en) * 2017-10-25 2018-02-16 古联(北京)数字传媒科技有限公司 Method and device for recognizing proper names in ancient books
US11113608B2 (en) 2017-10-30 2021-09-07 Accenture Global Solutions Limited Hybrid bot framework for enterprises
CN107832303A (en) * 2017-11-22 2018-03-23 古联(北京)数字传媒科技有限公司 Method and device for recognizing ancient book titles
CN108090044A (en) * 2017-12-05 2018-05-29 五八有限公司 Contact information identification method and device
CN108090044B (en) * 2017-12-05 2022-03-15 五八有限公司 Contact information identification method and device
CN108509423A (en) * 2018-04-04 2018-09-07 福州大学 Named entity extraction method for bid-winning webpages based on second-order HMM
CN108363701A (en) * 2018-04-13 2018-08-03 达而观信息科技(上海)有限公司 Named entity recognition method and system
CN108595430B (en) * 2018-04-26 2022-02-22 携程旅游网络技术(上海)有限公司 Flight change information extraction method and system
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Flight change information extraction method and system
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 Training method and device for a multi-task deep learning model for multi-type entity recognition
CN108920460B (en) * 2018-06-26 2022-03-11 武大吉奥信息技术有限公司 Training method of multi-task deep learning model for multi-type entity recognition
CN110728147B (en) * 2018-06-28 2023-04-28 阿里巴巴集团控股有限公司 Model training method and named entity recognition method
CN110728147A (en) * 2018-06-28 2020-01-24 阿里巴巴集团控股有限公司 Model training method and named entity recognition method
CN109033427A (en) * 2018-08-10 2018-12-18 北京字节跳动网络技术有限公司 Stock screening method and device, computer equipment and readable storage medium
CN111191107A (en) * 2018-10-25 2020-05-22 北京嘀嘀无限科技发展有限公司 System and method for recalling points of interest using annotation model
CN111105458A (en) * 2018-10-25 2020-05-05 深圳市深蓝牙医疗科技有限公司 Oral implant positioning method, oral tissue identification model establishing method, device, equipment and storage medium
US11093531B2 (en) 2018-10-25 2021-08-17 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for recalling points of interest using a tagging model
CN111191107B (en) * 2018-10-25 2023-06-30 北京嘀嘀无限科技发展有限公司 System and method for recalling points of interest using annotation model
CN109710925A (en) * 2018-12-12 2019-05-03 新华三大数据技术有限公司 Named entity recognition method and device
CN109740150A (en) * 2018-12-20 2019-05-10 出门问问信息科技有限公司 Address resolution method, device, computer equipment and computer readable storage medium
WO2020132985A1 (en) * 2018-12-26 2020-07-02 深圳市优必选科技有限公司 Self-training method and apparatus for model, computer device, and storage medium
CN109726398A (en) * 2018-12-27 2019-05-07 北京奇安信科技有限公司 Entity recognition and attribute determination method, system, equipment and medium
CN109726398B (en) * 2018-12-27 2023-07-07 奇安信科技集团股份有限公司 Entity identification and attribute judgment method, system, equipment and medium
CN111563380A (en) * 2019-01-25 2020-08-21 浙江大学 Named entity identification method and device
CN110222340A (en) * 2019-06-06 2019-09-10 掌阅科技股份有限公司 Training method, electronic device and storage medium for a book character name recognition model
CN110275953A (en) * 2019-06-21 2019-09-24 四川大学 Personality classification method and device
CN110516228A (en) * 2019-07-04 2019-11-29 湖南星汉数智科技有限公司 Named entity recognition method and device, computer device and computer-readable storage medium
CN110598210B (en) * 2019-08-29 2023-08-04 深圳市优必选科技股份有限公司 Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
CN110598210A (en) * 2019-08-29 2019-12-20 深圳市优必选科技股份有限公司 Entity recognition model training method, entity recognition device, entity recognition equipment and medium
CN110929875A (en) * 2019-10-12 2020-03-27 平安国际智慧城市科技股份有限公司 Intelligent language learning method, system, device and medium based on machine learning
CN110889287A (en) * 2019-11-08 2020-03-17 创新工场(广州)人工智能研究有限公司 Method and device for named entity recognition
CN111368036A (en) * 2020-03-05 2020-07-03 百度在线网络技术(北京)有限公司 Method and apparatus for searching information
CN111368036B (en) * 2020-03-05 2023-09-26 百度在线网络技术(北京)有限公司 Method and device for searching information
CN111523314A (en) * 2020-07-03 2020-08-11 支付宝(杭州)信息技术有限公司 Model adversarial training and named entity recognition method and device

Similar Documents

Publication Publication Date Title
CN104615589A (en) Named-entity recognition model training method and named-entity recognition method and device
CN109145153B (en) Intention category identification method and device
CN107679039B (en) Method and device for determining statement intention
CN110019471B (en) Generating text from structured data
Nguyen et al. Relation extraction: Perspective from convolutional neural networks
US20230169270A1 (en) Entity linking method and apparatus
CN108962224B (en) Joint modeling method, dialogue method and system for spoken language understanding and language model
CN112711948B (en) Named entity recognition method and device for Chinese sentences
CN104615767B (en) Training method of search ranking model, search processing method and device
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110795911B (en) Real-time adding method and device for online text labels and related equipment
US20180053107A1 (en) Aspect-based sentiment analysis
CN111160031A (en) Social media named entity identification method based on affix perception
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN109284397A (en) Domain lexicon construction method, device, equipment and storage medium
Chrupała Text segmentation with character-level text embeddings
CN111611452B (en) Method, system, equipment and storage medium for identifying ambiguity of search text
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
CN104699797A (en) Webpage data structured analytic method and device
CN103823857A (en) Space information searching method based on natural language processing
KR20220120545A (en) Method and apparatus for obtaining PIO status information
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
CN113761868B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN107656921A (en) Short text dependency analysis method based on deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150513