CN107967251A - A named entity recognition method based on Bi-LSTM-CNN - Google Patents

A named entity recognition method based on Bi-LSTM-CNN

Info

Publication number
CN107967251A
CN107967251A
Authority
CN
China
Prior art keywords
data
character
lstm
label
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710946531.3A
Other languages
Chinese (zh)
Inventor
唐华阳
岳永鹏
刘林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Know Future Information Technology Co ltd
Original Assignee
Beijing Know Future Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Know Future Information Technology Co ltd
Priority to CN201710946531.3A
Publication of CN107967251A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a named entity recognition method based on Bi-LSTM-CNN. In the training stage, the method converts labeled training corpus data into character-level corpus data and then trains a deep learning model based on Bi-LSTM-CNN; in the prediction stage, it converts unlabeled test corpus data into character-level corpus data and then predicts with the deep learning model trained in the training stage. By using character-level rather than word-level vectors, the present invention is immune to the influence of word segmentation precision and also avoids the unregistered word (out-of-vocabulary) problem; in addition, the combined model of the bidirectional long short-term memory neural network Bi-LSTM and the convolutional neural network CNN significantly improves precision compared with traditional algorithms.

Description

A named entity recognition method based on Bi-LSTM-CNN
Technical field
The invention belongs to the field of information technology, and in particular relates to a named entity recognition method based on Bi-LSTM-CNN.
Background technology
Named entity recognition (NER) refers to the process of identifying, in a given data set, the substantive nouns of specified categories that carry particular meaning. Practical scenarios for named entity recognition include:
Scene 1: event detection. Place, time and person are the basic components of an event. When constructing an event summary, the related persons, places, organizations and so on can be highlighted, and in an event search system they can serve as index keywords. The relations between the components of an event describe the event in more detail at the semantic level.
Scene 2: information retrieval. Named entities can be used to improve the effectiveness of search systems. For example, when a user inputs "重大" (an abbreviation of 重庆大学, Chongqing University), it can be inferred that the user more likely intends to retrieve "Chongqing University" than the corresponding adjective sense ("major"). Moreover, when building an inverted index, cutting a named entity into several words reduces retrieval efficiency. Search engines are also evolving toward semantic understanding and computing answers directly.
Scene 3: semantic networks. A semantic network generally contains concepts and instances and the relations between them. For example, "country" is a concept, "China" is an instance, and an instance-of relation holds between the entity "China" and the concept "country". A large proportion of the instances in a semantic network are named entities.
Scene 4: machine translation. The translation of named entities often follows special rules. For example, when Chinese person names are translated into English they are rendered in pinyin, following the rule that the surname comes first and the given name after, whereas ordinary words are translated into the corresponding English words. Accurately recognizing the named entities in a text is therefore important for improving the quality of machine translation.
Scene 5: question answering systems. Accurately identifying the parts of a question, its related domain and the related concepts is especially important. At present, most question answering systems can only search for answers rather than compute them: keywords are matched and the user extracts the answer manually from the search results, whereas the friendlier way is to compute the answer and present it to the user. Some questions require considering the relations between entities; for example, for "the 45th President of the United States", current search engines can return the answer "Donald Trump" in a specific format.
Traditional named entity recognition methods can be divided into dictionary-based methods, methods based on word-frequency statistics, and methods based on artificial neural network models. A dictionary-based method works by collecting as many entity words of different categories as possible into a dictionary; at recognition time the text is matched against the words in the dictionary, and the matched spans are labeled with the corresponding entity categories. Statistics-based methods, such as CRF (conditional random fields), learn the semantic information of the surrounding words and then make a classification decision.
Dictionary-based named entity recognition depends heavily on the dictionary and cannot recognize unregistered words. Statistics-based methods such as HMM (hidden Markov models) and CRF (conditional random fields) can only associate the semantics of the word immediately preceding the current word, so their recognition precision is not high enough, and the recognition rate for unregistered words in particular is low. Methods based on artificial neural network models suffer from the vanishing gradient problem during training, so in practice the number of network layers is small, and the final advantage in named entity recognition results is not obvious.
Summary of the invention
In view of the above problems, the present invention provides a named entity recognition method based on Bi-LSTM-CNN that can effectively improve the precision of named entity recognition. Here Bi-LSTM stands for Bi-directional Long Short-Term Memory, i.e. a bidirectional long short-term memory neural network, and CNN stands for Convolutional Neural Network.
In the present invention, a registered word is a word that already exists in the vocabulary, and an unregistered word is a word that does not appear in the vocabulary.
The technical solution adopted by the present invention is as follows:
A named entity recognition method based on Bi-LSTM-CNN comprises the following steps:
1) converting the original corpus data OrgData into character-level corpus data NewData;
2) counting the characters in NewData to obtain a character set CharSet, and numbering each character to obtain the character number set CharID corresponding to CharSet; counting the labels of the characters in NewData to obtain a label set LabelSet, and numbering each label to obtain the label number set LabelID corresponding to LabelSet;
3) grouping the sentences of NewData by sentence length to obtain a data set GroupData containing n groups of sentences;
4) randomly drawing, without replacement, BatchSize data items w and the corresponding labels y from one group of GroupData, converting the drawn data w into fixed-length data BatchData through CharID, and converting the corresponding labels into fixed-length labels y_ID through LabelID;
5) feeding the data BatchData and the labels y_ID into the deep learning model based on Bi-LSTM-CNN and training the parameters of the model; when the loss value produced by the deep learning model satisfies a set condition or the maximum number of iterations N is reached, terminating the training of the model; otherwise returning to step 4) to generate new data and continue training the model;
6) converting the data to be predicted, PreData, into data PreMData matching the deep learning model, feeding it into the trained model, and obtaining the named entity recognition result OrgResult.
Further, step 1) marks each character using the BMESO tagging scheme: if the label of a word is Label, the first character of the word is marked Label_B, the characters in the middle of the word are marked Label_M, the character at the end of the word is marked Label_E, a single-character word is marked Label_S, and a character is marked o if the word has no label or does not belong to an entity tag.
Further, in step 3), if l_i denotes the sentence length of the i-th sentence, then sentences with |l_i - l_j| < δ are put into the same group, where δ denotes the sentence length interval.
Further, step 4) includes:
4-1) converting the drawn data w into numbers, i.e. converting each character in w into the corresponding number through the correspondence between CharSet and CharID;
4-2) converting the labels y corresponding to the drawn data w into numbers, i.e. converting each character label in y into the corresponding number through the correspondence between LabelSet and LabelID;
4-3) assuming the fixed length is maxLen: when the sentence length l of a drawn data item satisfies l < maxLen, padding maxLen - l zeros after the sentence to obtain BatchData, and padding maxLen - l zeros after the corresponding label y of w to obtain y_ID.
Further, the deep learning model based on Bi-LSTM-CNN in step 5) includes:
an Embedding layer for converting the input character data into vectors;
a Bi-LSTM layer containing a number of forward and backward LSTM units, for extracting the semantic relations between characters;
a Concatenate layer for splicing together the semantic information extracted by the forward and backward LSTM units;
a first DropOut layer for preventing model over-fitting;
a Conv layer for extracting word features from the semantic information of whole words extracted by the LSTM and the current single character;
a second DropOut layer for preventing model over-fitting;
a SoftMax layer for classifying each character.
The named entity recognition method based on Bi-LSTM-CNN of the present invention uses character-level rather than word-level vectors, so it is immune to the influence of word segmentation precision and also avoids the unregistered word problem; in addition, it uses the combined model of the bidirectional long short-term memory network Bi-LSTM and the convolutional neural network CNN, which greatly improves precision compared with traditional algorithms.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method of the present invention.
Fig. 2 is a schematic diagram of the deep learning model.
Fig. 3 is a schematic diagram of an LSTM unit.
Detailed description of the embodiments
In order to make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below through specific embodiments and the accompanying drawings.
The invention discloses a named entity recognition method based on Bi-LSTM-CNN, for example for recognizing named entities such as person names, place names and organization names in corpus data. The key problems addressed by the present invention are three: 1) the efficiency of named entity recognition, 2) the precision of named entity recognition, and 3) the recognition precision for unregistered words.
To solve the unregistered word problem, the present invention abandons the traditional vocabulary method and instead adopts the idea of embedding vectors, based on characters rather than words. To solve the low precision of traditional named entity recognition methods, the present invention adopts deep learning, combining a bidirectional long short-term memory neural network (Bi-LSTM) model with a convolutional neural network (CNN) model for named entity recognition. To solve the low efficiency of named entity recognition, the present invention avoids word-frequency statistics and string matching, and instead performs entity recognition in the manner of a function-like mapping.
The flow chart of the named entity recognition method of the present invention is shown in Fig. 1. The method is divided into two stages: a training stage and a prediction stage.
(1) Training stage (left dashed box of the flow chart):
Step 1: convert the labeled training corpus data into character-level corpus data.
Step 2: train the deep learning model using the Adam gradient descent algorithm. Other training algorithms, such as SGD (stochastic gradient descent), can also be used.
(2) Prediction stage (right dashed box of the flow chart):
Step 1: convert the unlabeled test corpus data into character-level corpus data.
Step 2: predict using the deep learning model trained in the training stage.
The specific implementation of the two stages is described below.
(1) training stage:
Step 1-1: convert the original data OrgData into character-level data NewData.
Specifically: using the BMESO (Begin, Middle, End, Single, Other) tagging scheme (other tagging schemes can also be used), each labeled word in the original corpus data is cut at the character level. If the label of a word is Label, the first character of the word is marked Label_B, the characters in the middle of the word are marked Label_M, the character at the end of the word is marked Label_E, a single-character word is marked Label_S, and a character is marked Other if the word has no label or does not belong to an entity tag.
For example, if the original data is "[张三]/pre [毕业]/o [于]/o [哈佛大学]/org [。]/o" (i.e. "[Zhang San]/pre [graduated]/o [from]/o [Harvard University]/org [.]/o"), then after conversion the character-level data is: "张/pre_B 三/pre_E 毕/o_B 业/o_E 于/o_S 哈/org_B 佛/org_M 大/org_M 学/org_E 。/o_S".
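For illustration only (this is not the patentee's code), the character-level BMESO conversion above might be sketched as follows; the (word, label) input format is an assumption based on the example.

```python
# A sketch (not the patent's code) of the character-level BMESO conversion.
# Input: (word, label) pairs, assumed parsed from "[word]/label" corpus lines.
def to_char_level(tagged_words):
    chars, tags = [], []
    for word, label in tagged_words:
        if not label:                          # untagged word: "o" for each character
            chars.extend(word)
            tags.extend(["o"] * len(word))
        elif len(word) == 1:                   # single-character word: Label_S
            chars.append(word)
            tags.append(label + "_S")
        else:                                  # Begin / Middle(s) / End
            chars.extend(word)
            tags.append(label + "_B")
            tags.extend([label + "_M"] * (len(word) - 2))
            tags.append(label + "_E")
    return chars, tags

chars, tags = to_char_level(
    [("张三", "pre"), ("毕业", "o"), ("于", "o"), ("哈佛大学", "org"), ("。", "o")])
# chars: ['张', '三', '毕', '业', '于', '哈', '佛', '大', '学', '。']
# tags:  ['pre_B', 'pre_E', 'o_B', 'o_E', 'o_S', 'org_B', 'org_M', 'org_M', 'org_E', 'o_S']
```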
Step 1-2: count the character set CharSet of NewData. To avoid encountering unknown characters at prediction time, a special symbol "null" is added to CharSet. Each character is then numbered with increasing natural numbers, obtaining the character number set CharID corresponding to CharSet.
Following the example in step 1-1, the CharSet after counting is {null, 张, 三, 毕, 业, 于, 哈, 佛, 大, 学, 。} (punctuation marks are also counted), and the CharID is {null:0, 张:1, 三:2, 毕:3, 业:4, 于:5, 哈:6, 佛:7, 大:8, 学:9, 。:10}.
The label set LabelSet is counted and each label is numbered, producing the corresponding label number set LabelID.
Following the example in step 1-1, the LabelSet after counting is {pre_B, pre_E, o_B, o_E, o_S, org_B, org_M, org_E}, and the LabelID is {pre_B:0, pre_E:1, o_B:2, o_E:3, o_S:4, org_B:5, org_M:6, org_E:7}.
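Continuing the sketch above (again an assumption, not the patent's code), CharID and LabelID can be built by numbering items with increasing natural numbers, reserving index 0 of CharID for the special symbol "null":

```python
# A sketch of building CharID / LabelID by numbering with natural numbers;
# "null" at index 0 covers characters unseen at prediction time.
def build_ids(items, reserve_null=False):
    ids = {"null": 0} if reserve_null else {}
    for item in items:
        if item not in ids:
            ids[item] = len(ids)
    return ids

# Reusing chars/tags from the sketch above:
char_id = build_ids(chars, reserve_null=True)   # {'null': 0, '张': 1, '三': 2, ...}
label_id = build_ids(tags)                      # {'pre_B': 0, 'pre_E': 1, 'o_B': 2, ...}
```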
Step 1-3: divide NewData into groups according to sentence length.
If l_i denotes the sentence length of the i-th sentence, then sentences with |l_i - l_j| < δ are put into the same group, where δ denotes the sentence length interval. Let the grouped data be GroupData, with n groups in total.
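The patent states only the |l_i - l_j| < δ criterion; one possible realization (an assumption) is to bucket samples by len // delta, which guarantees that any two sentences in the same group differ in length by less than delta:

```python
from collections import defaultdict

# An assumed realization of the grouping criterion: bucketing by len // delta.
def group_by_length(samples, delta):
    groups = defaultdict(list)
    for chars, tags in samples:          # each sample: (characters, labels)
        groups[len(chars) // delta].append((chars, tags))
    return list(groups.values())         # GroupData: n groups of samples
```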
Step 1-4: randomly draw, without replacement, BatchSize data items w and the corresponding labels y from one group of GroupData; the drawn data are converted into fixed-length data BatchData through CharID, and the corresponding labels are converted into fixed-length labels y_ID through LabelID.
Converting the drawn data into the fixed-length data BatchData through CharID and the corresponding labels into the fixed-length labels y_ID through LabelID is done as follows:
Step 1-4-1: convert the drawn data w into numbers, i.e. convert each character in w into the corresponding number through the correspondence between CharSet and CharID.
For example, the data in step 1-1, after conversion through CharID, becomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Step 1-4-2: convert the labels y corresponding to the drawn data w into numbers, i.e. convert each character label in y into the corresponding number through the correspondence between LabelSet and LabelID.
For example, the labels in step 1-1, after conversion through LabelID, become: [0, 1, 2, 3, 4, 5, 6, 6, 7, 4].
Step 1-4-3: assuming the fixed length is maxLen, when the sentence length l of a drawn data item satisfies l < maxLen, pad maxLen - l zeros after the sentence to obtain BatchData, and pad maxLen - l zeros after the corresponding label y of w to obtain y_ID.
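A sketch of this batching step, reusing the char_id and label_id maps from the sketches above; that each group element is a (characters, labels) pair is an assumption:

```python
import random

# A sketch of step 1-4: draw BatchSize samples without replacement from one
# group, map characters and labels to their numbers, and right-pad with zeros.
def make_batch(group, char_id, label_id, batch_size, max_len):
    batch = random.sample(group, batch_size)        # sampling without replacement
    batch_data, y_id = [], []
    for chars, tags in batch:
        x = [char_id.get(c, char_id["null"]) for c in chars]  # unknown -> "null"
        y = [label_id[t] for t in tags]
        pad = max_len - len(x)                      # pad maxLen - l zeros
        batch_data.append(x + [0] * pad)
        y_id.append(y + [0] * pad)
    return batch_data, y_id                         # BatchData and y_ID
```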
Step 1-5: feed the data BatchData of step 1-4 into the deep learning model, producing the loss function Cost(y′, y_ID).
The deep learning model of the named entity recognition method of the present invention is shown in Fig. 2. The meaning of each part is explained as follows:
w_1~w_n: intuitively, the characters of a sentence, i.e. the data w in step 1-4; step 1-4 must be completed before they are passed into the Embedding layer.
y_1~y_n: intuitively, the predicted label corresponding to each character of the sentence, used together with the true labels y_ID to compute the loss value.
Embedding layer: the embedding layer, i.e. the vectorization step, which converts the input character data into vectors.
Bi-LSTM layer: contains a number of forward and backward LSTM units, extracting the semantic relations between characters.
Concatenate layer: splices together the semantic information extracted by the forward and backward LSTM units.
First DropOut layer: a filter layer for preventing model over-fitting.
Conv layer: the convolutional layer, which extracts word features from the semantic information of whole words extracted by the LSTM and the current single character.
Second DropOut layer: a filter layer for preventing model over-fitting.
SoftMax layer: the classification layer, which performs the final classification of each character; an illustrative sketch of this layer stack is given below.
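As a concrete illustration of this layer stack (an assumption, not the patentee's implementation, which the patent does not disclose as code), the model of Fig. 2 could be assembled with the Keras API; all hyper-parameter values are illustrative:

```python
import tensorflow as tf

# An assumed Keras sketch of the Fig. 2 stack: Embedding -> Bi-LSTM (forward
# and backward outputs concatenated) -> DropOut -> Conv + ReLU -> DropOut ->
# SoftMax. Hyper-parameters are illustrative; the patent does not fix them.
def build_model(vocab_size, num_labels, max_len,
                emb_dim=64, lstm_units=128, filters=128, eta=0.5):
    inputs = tf.keras.Input(shape=(max_len,))
    x = tf.keras.layers.Embedding(vocab_size, emb_dim)(inputs)       # Embedding layer
    x = tf.keras.layers.Bidirectional(                               # Bi-LSTM layer;
        tf.keras.layers.LSTM(lstm_units, return_sequences=True),     # merge_mode="concat"
        merge_mode="concat")(x)                                      # acts as Concatenate
    x = tf.keras.layers.Dropout(eta)(x)                              # first DropOut layer
    x = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="same",
                               activation="relu")(x)                 # Conv layer + ReLU
    x = tf.keras.layers.Dropout(eta)(x)                              # second DropOut layer
    outputs = tf.keras.layers.Dense(num_labels,
                                    activation="softmax")(x)         # SoftMax layer
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",                                  # Adam gradient descent
                  loss="sparse_categorical_crossentropy")
    return model
```

With integer-encoded labels y_ID, the sparse categorical cross-entropy here plays the role of the loss formula given further below, generalized to multiple label classes.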
Training the deep learning model proceeds as follows:
Step 1-5-1: vectorize the incoming data BatchData at the Embedding layer, i.e. convert each character of each data item in BatchData into a vector through a vector table Char2Vec, obtaining BatchVec.
Step 1-5-2: pass BatchVec into the Bi-LSTM layer. In detail: the first vector of each data item is passed into the first forward LSTM unit, the second vector into the second forward LSTM unit, and so on; besides the i-th vector of each data item, the input of the i-th forward LSTM unit also includes the output of the (i-1)-th forward LSTM unit. Likewise, the first vector of each data item is passed into the first backward LSTM unit, the second vector into the second backward LSTM unit, and so on; besides the i-th vector of each data item, the input of the i-th backward LSTM unit also includes the output of the (i-1)-th backward LSTM unit. Note that each LSTM unit receives not just one vector at a time but BatchSize vectors.
A more detailed description of an LSTM unit is shown in Fig. 3. The meaning of each symbol in Fig. 3 is as follows:
w: a character of the input data (e.g. of a sentence).
C_{i-1}, C_i: the semantic information accumulated over the first i-1 characters and over the first i characters, respectively.
h_{i-1}, h_i: the feature information of the (i-1)-th character and of the i-th character, respectively.
f: the forget gate, controlling how much of the accumulated semantic information of the first i-1 characters (C_{i-1}) is retained.
i: the input gate, controlling how much of the input data (w and h_{i-1}) is retained.
o: the output gate, controlling how much feature information is emitted in the output feature of the i-th character.
tanh: the hyperbolic tangent function.
u: the candidate state produced by tanh, which together with the input gate i controls how much information of the i-th character is retained into C_i.
*, +: element-wise multiplication and element-wise addition, respectively.
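For concreteness, one LSTM step under these definitions might be sketched in numpy as follows; the weight matrices W[...] and biases b[...] are illustrative assumptions, not symbols from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A numpy sketch of one LSTM step using the Fig. 3 symbols.
def lstm_step(x, C_prev, h_prev, W, b):
    z = np.concatenate([x, h_prev])    # input: character vector w and h_{i-1}
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: how much of C_{i-1} to keep
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: how much of the input to keep
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: how much feature info to emit
    u = np.tanh(W["u"] @ z + b["u"])   # candidate written into the cell state
    C = f * C_prev + i * u             # "*" and "+" are element-wise
    h = o * np.tanh(C)                 # h_i: feature information of character i
    return C, h                        # C_i and h_i
```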
Step 1-5-3: pass the outputs of the forward and backward LSTM units into the Concatenate layer, i.e. splice the output results of the forward and backward LSTM units together into the combined output h_i.
Step 1-5-4: pass the output of the Concatenate layer into the first DropOut layer, i.e. randomly hide a fraction η (0 ≤ η ≤ 1) of the data in h_i so that it is not transmitted further backward.
Step 1-5-5: pass the output of DropOut into the Conv convolutional layer for convolution, applying the ReLU activation function f(x) = max(0, x); let the output of the convolutional layer be c_i.
Step 1-5-6: similarly to step 1-5-4, pass the output c_i of the Conv layer into the second DropOut layer, i.e. randomly hide a fraction η (0 ≤ η ≤ 1) of the data in c_i so that it is not transmitted further backward.
Step 1-5-7: pass the output of DropOut into the SoftMax layer and produce the final loss value Cost(y′, y_ID). The specific calculation formula is as follows:
Cost(y′, y_ID) = -[y_ID log(y′) + (1 - y_ID) log(1 - y′)]    (Formula 1)
where y′ denotes the output of BatchData after the classification layer (SoftMax layer) of the deep learning model, corresponding to y_1, y_2, ..., y_n in Fig. 2, and y_ID denotes the corresponding true labels.
Step 1-6: train the parameters of the deep learning model using the Adam gradient descent algorithm.
Step 1-7: if the Cost(y′, y_ID) produced by the deep learning model no longer decreases, or the maximum number of iterations N is reached, terminate the training of the deep learning model; otherwise jump to step 1-4.
Here Cost′_i(y′, y_ID) denotes the loss value i iterations before the current one and Cost(y′, y_ID) denotes the loss value produced by the current iteration; "no longer decreases" means that the difference between the current loss value and the average of the previous M loss values is smaller than a threshold θ.
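A sketch of this stopping test (function and variable names are illustrative):

```python
# Stop when the current loss differs from the mean of the previous M losses
# by less than theta, or when the maximum number of iterations N is reached.
def should_stop(history, cost, M, theta, step, N):
    if step >= N:
        return True
    if len(history) < M:               # not enough previous losses yet
        return False
    return abs(cost - sum(history[-M:]) / M) < theta
```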
(2) forecast period:
Step 2-1: convert the data to be predicted, PreData, into the data format PreMData matching the deep learning model. Specifically: convert the data to be predicted into character-level numeric data.
Step 2-2: feed PreMData into the deep learning model trained in the training stage and obtain the prediction result OrgResult.
The deep learning model in prediction-stage step 2-2 is the model trained in the training stage, except that in prediction the DropOut layers use the parameter η = 1, meaning that no data is hidden and everything is passed on to the next layer.
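For illustration, prediction with the sketch model defined earlier (names reused from those sketches); Keras applies Dropout only during training, which matches the "no data is hidden" behaviour described here:

```python
import numpy as np

# Dropout layers are inactive in model.predict, so nothing is hidden.
probs = model.predict(np.array(batch_data))   # shape: (BatchSize, maxLen, num_labels)
pred_ids = probs.argmax(axis=-1)              # numbered labels; decode via LabelID
```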
In the prior art, dictionary-based methods have no way to handle unregistered words at all, i.e. their recognition rate for unregistered words is 0, and the accuracy of statistics-based methods and of methods based on traditional artificial neural networks is around 90%. The present invention achieves an accuracy of about 99.2% on test data, a significant improvement.
The above embodiments are intended only to illustrate rather than limit the technical solution of the present invention. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention; the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A named entity recognition method based on Bi-LSTM-CNN, characterized by comprising the following steps:
1) converting the original corpus data OrgData into character-level corpus data NewData;
2) counting the characters in NewData to obtain a character set CharSet, and numbering each character to obtain the character number set CharID corresponding to CharSet; counting the labels of the characters in NewData to obtain a label set LabelSet, and numbering each label to obtain the label number set LabelID corresponding to LabelSet;
3) grouping the sentences of NewData by sentence length to obtain a data set GroupData containing n groups of sentences;
4) randomly drawing, without replacement, BatchSize data items w and the corresponding labels y from one group of GroupData, converting the drawn data w into fixed-length data BatchData through CharID, and converting the corresponding labels into fixed-length labels y_ID through LabelID;
5) feeding the data BatchData and the labels y_ID into the deep learning model based on Bi-LSTM-CNN and training the parameters of the model; when the loss value produced by the deep learning model satisfies a set condition or the maximum number of iterations N is reached, terminating the training of the model; otherwise returning to step 4) to generate new data and continue training the model;
6) converting the data to be predicted, PreData, into data PreMData matching the deep learning model, feeding it into the trained model, and obtaining the named entity recognition result OrgResult.
2. The method of claim 1, characterized in that step 1) marks each character using the BMESO tagging scheme: if the label of a word is Label, the first character of the word is marked Label_B, the characters in the middle of the word are marked Label_M, the character at the end of the word is marked Label_E, a single-character word is marked Label_S, and a character is marked o if the word has no label or does not belong to an entity tag.
3. The method of claim 1, characterized in that in step 3), if l_i denotes the sentence length of the i-th sentence, sentences with |l_i - l_j| < δ are put into the same group, where δ denotes the sentence length interval.
4. The method of claim 1, characterized in that step 4) includes:
4-1) converting the drawn data w into numbers, i.e. converting each character in w into the corresponding number through the correspondence between CharSet and CharID;
4-2) converting the labels y corresponding to the drawn data w into numbers, i.e. converting each character label in y into the corresponding number through the correspondence between LabelSet and LabelID;
4-3) assuming the fixed length is maxLen: when the sentence length l of a drawn data item satisfies l < maxLen, padding maxLen - l zeros after the sentence to obtain BatchData, and padding maxLen - l zeros after the corresponding label y of w to obtain y_ID.
5. The method of claim 1, characterized in that the deep learning model based on Bi-LSTM-CNN in step 5) includes:
an Embedding layer for converting the input character data into vectors;
a Bi-LSTM layer containing a number of forward and backward LSTM units, for extracting the semantic relations between characters;
a Concatenate layer for splicing together the semantic information extracted by the forward and backward LSTM units;
a first DropOut layer for preventing model over-fitting;
a Conv layer for extracting word features from the semantic information of whole words extracted by the LSTM and the current single character;
a second DropOut layer for preventing model over-fitting;
a SoftMax layer for classifying each character.
6. The method of claim 5, characterized in that training the deep learning model in step 5) includes:
5-1) vectorizing the incoming data BatchData at the Embedding layer, i.e. converting each character of each data item in BatchData into a vector through a vector table Char2Vec, obtaining BatchVec;
5-2) passing BatchVec into the Bi-LSTM layer;
5-3) passing the outputs of the forward and backward LSTM units into the Concatenate layer;
5-4) passing the output of the Concatenate layer into the first DropOut layer;
5-5) passing the output of the first DropOut layer into the Conv layer;
5-6) passing the output c_i of the Conv layer into the second DropOut layer;
5-7) passing the output of the second DropOut layer into the SoftMax layer and producing the final loss value.
7. The method of claim 6, characterized in that in step 5-2) the first vector of each data item is passed into the first forward LSTM unit, the second vector into the second forward LSTM unit, and so on, the input of the i-th forward LSTM unit including, besides the i-th vector of each data item, the output of the (i-1)-th forward LSTM unit; likewise, the first vector of each data item is passed into the first backward LSTM unit, the second vector into the second backward LSTM unit, and so on, the input of the i-th backward LSTM unit including, besides the i-th vector of each data item, the output of the (i-1)-th backward LSTM unit; each LSTM unit receives BatchSize vectors at a time.
8. The method of claim 6, characterized in that the calculation formula of the loss value is:
Cost(y′, y_ID) = -[y_ID log(y′) + (1 - y_ID) log(1 - y′)],
where y′ denotes the output of BatchData after the SoftMax layer of the deep learning model and y_ID denotes the corresponding true labels.
9. The method of claim 8, characterized in that the training of the deep learning model is terminated if the loss value Cost(y′, y_ID) no longer decreases, which is judged by the following formula:
|Cost(y′, y_ID) - (Σ_{i=-M}^{-1} Cost′_i(y′, y_ID)) / M| < θ,
where Cost′_i(y′, y_ID) denotes the loss value i iterations before the current one and Cost(y′, y_ID) denotes the loss value produced by the current iteration; if the difference between the current loss value and the average of the previous M loss values is smaller than the threshold θ, the loss value is considered to no longer decrease.
10. The method of claim 1, characterized in that step 5) trains the parameters of the deep learning model using the Adam gradient descent algorithm.
CN201710946531.3A 2017-10-12 2017-10-12 A named entity recognition method based on Bi-LSTM-CNN Withdrawn CN107967251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946531.3A CN107967251A (en) 2017-10-12 2017-10-12 A named entity recognition method based on Bi-LSTM-CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710946531.3A CN107967251A (en) 2017-10-12 2017-10-12 A named entity recognition method based on Bi-LSTM-CNN

Publications (1)

Publication Number Publication Date
CN107967251A true CN107967251A (en) 2018-04-27

Family

ID=61997607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946531.3A Withdrawn CN107967251A (en) A named entity recognition method based on Bi-LSTM-CNN

Country Status (1)

Country Link
CN (1) CN107967251A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236578A1 (en) * 2013-02-15 2014-08-21 Nec Laboratories America, Inc. Question-Answering by Recursive Parse Tree Descent
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN107203511A (en) * 2017-05-27 2017-09-26 中国矿业大学 A kind of network text name entity recognition method based on neutral net probability disambiguation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JASON P.C. CHIU et al.: "Named Entity Recognition with Bidirectional LSTM-CNNs", Transactions of the Association for Computational Linguistics *
ONUR KURU et al.: "CharNER: Character-Level Named Entity Recognition", The 26th International Conference on Computational Linguistics: Technical Papers *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829681B (en) * 2018-06-28 2022-11-11 鼎富智能科技有限公司 Named entity extraction method and device
CN108829681A (en) * 2018-06-28 2018-11-16 北京神州泰岳软件股份有限公司 A named entity extraction method and device
CN109657230A (en) * 2018-11-06 2019-04-19 众安信息技术服务有限公司 Named entity recognition method and device fusing word vectors and part-of-speech vectors
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A named entity recognition method based on Lattice LSTM and a language model
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A novel deep network system and method for behavior recognition
CN110197195B (en) * 2019-04-15 2022-12-23 深圳大学 Novel deep network system and method for behavior recognition
CN110737952A (en) * 2019-09-17 2020-01-31 太原理工大学 Method for predicting the remaining life of key parts of mechanical equipment combining AE and Bi-LSTM
CN112052852A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Character recognition method of handwritten meteorological archive data based on deep learning
CN112052852B (en) * 2020-09-09 2023-12-29 国家气象信息中心 Character recognition method of handwriting meteorological archive data based on deep learning
CN112508441A (en) * 2020-12-18 2021-03-16 哈尔滨工业大学 Urban high-density outdoor thermal comfort evaluation method based on deep learning three-dimensional reconstruction
CN112508441B (en) * 2020-12-18 2022-04-29 哈尔滨工业大学 Urban high-density outdoor thermal comfort evaluation method based on deep learning three-dimensional reconstruction
CN113255342A (en) * 2021-06-11 2021-08-13 云南大学 Method and system for identifying product name of 5G mobile service
CN114267337A (en) * 2022-03-02 2022-04-01 合肥讯飞数码科技有限公司 Voice recognition system and method for realizing forward operation

Similar Documents

Publication Publication Date Title
CN107967251A (en) A kind of name entity recognition method based on Bi-LSTM-CNN
CN107832289A (en) A kind of name entity recognition method based on LSTM CNN
CN107908614A (en) A kind of name entity recognition method based on Bi LSTM
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN107885721A (en) A kind of name entity recognition method based on LSTM
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN109543178B (en) Method and system for constructing judicial text label system
CN110134771A (en) A kind of implementation method based on more attention mechanism converged network question answering systems
CN106844741A (en) A kind of answer method towards specific area
CN107818164A (en) A kind of intelligent answer method and its system
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN109284400A (en) A kind of name entity recognition method based on Lattice LSTM and language model
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN111274804A (en) Case information extraction method based on named entity recognition
CN113869053A (en) Method and system for recognizing named entities oriented to judicial texts
CN107894975A (en) A kind of segmenting method based on Bi LSTM
CN106445917A (en) Bootstrap Chinese entity extracting method based on modes
CN107894976A (en) A kind of mixing language material segmenting method based on Bi LSTM
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180427