CN107992468A - LSTM-based mixed corpus named entity recognition method - Google Patents

LSTM-based mixed corpus named entity recognition method

Info

Publication number
CN107992468A
Authority
CN
China
Prior art keywords
data
label
character
lstm
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710947005.9A
Other languages
Chinese (zh)
Inventor
唐华阳
岳永鹏
刘林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Future Information Technology Co Ltd
Original Assignee
Beijing Future Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Future Information Technology Co Ltd filed Critical Beijing Future Information Technology Co Ltd
Priority to CN201710947005.9A priority Critical patent/CN107992468A/en
Publication of CN107992468A publication Critical patent/CN107992468A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The present invention relates to an LSTM-based mixed corpus named entity recognition method. In the training stage, the method converts labeled training mixed corpus data into character-level mixed corpus data and then trains an LSTM-based deep learning model; in the prediction stage, it converts unlabeled test mixed corpus data into character-level mixed corpus data and then predicts with the deep learning model trained in the training stage. By using character-level rather than word-level vectors, the invention is immune to the influence of word segmentation accuracy and also evades the problem of unregistered (out-of-vocabulary) words; by using the long short-term memory neural network LSTM, it can greatly improve the accuracy of named entity recognition compared with traditional algorithms; and by training the model directly on the mixed corpus, it needs no detection and separation of the individual languages of the mixed corpus, finally achieving the goal of recognizing mixed corpora.

Description

LSTM-based mixed corpus named entity recognition method
Technical Field
The invention belongs to the field of information technology, and particularly relates to an LSTM-based mixed corpus named entity recognition method.
Background
Named Entity Recognition (NER) refers to identifying entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and the like.
Practical application scenarios of named entity recognition include the following:
Scene 1: event detection. Place, time, and person are several basic components of an event, and when constructing an event summary, the relevant persons, places, organizations, and so on can be highlighted. In an event search system, the related persons, times, and places can serve as index keywords, and the relations between the components of an event describe the event in more detail at the semantic level.
Scene 2: and (5) information retrieval. Named entities can be used to enhance and improve the effectiveness of the search system, and when a user enters "significant," it can be found that the user prefers to search for "Chongqing university," rather than its corresponding adjective meaning. In addition, when the inverted index is built, if the named entity is cut into multiple words, the query efficiency is reduced. In addition, search engines are evolving towards semantic understanding, computing answers.
Scene 3: and (5) semantic network. Concepts and instances and their corresponding relationships are generally included in a semantic network, for example, "country" is a concept, china is an instance, and "china" is a "country" that expresses the relationship between entities and concepts. A large part of the instances in a semantic network are named entities.
Scene 4: and (4) machine translation. The translation of a named entity often has some special translation rules, for example, chinese people translate to English by using Pinyin of first names, first and last names, and common words translate to corresponding English words. The named entities in the text are accurately identified, and the method has important significance for improving the effect of machine translation.
Scene 5: a question-answering system. It is particularly important to accurately identify the various components of the problem, the relevant domain of the problem, and the relevant concepts. At present, most of the question-answering systems can only search answers and cannot calculate the answers. The search answers are matched with keywords, the user manually extracts answers according to the search results, and a more friendly mode is to calculate and present the answers to the user. Some questions in the question-answering system need to consider the relationship between entities, such as "the forty-five president" in the united states, and the current search engine returns the answer "terlangpu" in a special format.
The conventional entity recognition method for mixed texts containing multiple languages proceeds as follows: the multilingual input text is first subjected to language detection and separated by language, and entity recognition is then performed for each language individually. Per-language entity recognition can employ dictionary-based, statistics-based, and artificial-neural-network-based approaches. Dictionary-based named entity recognition works roughly as follows: as many entity vocabulary items of different categories as possible are placed in a dictionary; at recognition time the text is matched against the words in the dictionary, and matched entity words are marked with the corresponding entity category. Statistics-based methods such as the CRF (conditional random field) learn the semantic information of the words before and after the current word and then make a classification decision.
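For illustration only, a minimal sketch of the dictionary-matching principle (the dictionary contents, category names, and function name are assumptions, not part of the patent):

```python
# Toy entity dictionary: surface form -> entity category (illustrative values).
entity_dict = {"哈佛大学": "org", "张三": "pre"}

def dict_ner(text: str) -> list:
    """Return (entity, category) pairs for dictionary entries found in the text."""
    hits = []
    for surface, category in entity_dict.items():
        if surface in text:                  # simple substring matching
            hits.append((surface, category))
    return hits

print(dict_ner("张三 graduated from 哈佛大学."))
# [('哈佛大学', 'org'), ('张三', 'pre')]
```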
The above methods have the following disadvantages:
disadvantage 1: the granularity of detection for multiple languages is not well differentiated, and there is a loss of word segmentation accuracy because a certain language is not detected. For the case that a document contains multiple languages, firstly segmentation processing is needed, then language type detection is carried out on each paragraph, however, if the paragraph also contains multiple languages, sentence segmentation processing is needed, and the sentence containing multiple languages cannot be segmented. Because the models and the linguistic data of the participles are heavily dependent, the result is that information of the participles is lost because a certain language is not detected.
Disadvantage 2: methods based on word frequency statistics, such as the HMM (hidden Markov model) and the CRF (conditional random field), can only relate the current word to the semantics of the immediately preceding word; their recognition accuracy is not high enough, and the recognition rate for unknown words is especially low.
disadvantage 3: the method based on the artificial neural network model has the problem of gradient disappearance during training, the number of network layers is small in practical application, and the advantages of the final named entity recognition result are not obvious.
Disclosure of Invention
In view of these problems, the invention provides an LSTM (long short-term memory neural network) based mixed corpus named entity recognition method, which can effectively improve the recognition accuracy of named entities in mixed corpora.
In the invention, a mixed corpus means that the training or prediction data contains corpus data in at least two languages; a registered word is a word that has appeared in the corpus vocabulary; an unregistered (unknown) word is a word that has not appeared in the corpus vocabulary.
The technical scheme adopted by the invention is as follows:
a mixed corpus named entity identification method based on LSTM is characterized by comprising the following steps:
1) Converting the original mixed corpus data OrgData into character-level mixed corpus data NewData;
2) Counting the characters in NewData to obtain a character set CharSet, and numbering each character to obtain the character number set CharID corresponding to CharSet; counting the labels of the characters in NewData to obtain a label set LabelSet, and numbering each label to obtain the label number set LabelID corresponding to LabelSet;
3) Grouping the sentences of NewData by sentence length to obtain a data set GroupData comprising n groups of sentences;
4) Randomly extracting, without replacement, BatchSize sentences of data w and the corresponding labels y from a certain group of GroupData, converting the extracted data w into fixed-length data BatchData through CharID, and converting the corresponding labels into fixed-length labels y_ID through LabelID;
5) Feeding the data BatchData and the labels y_ID into the LSTM-based deep learning model and training the parameters of the deep learning model; terminating the training of the deep learning model when the loss value it produces meets a set condition or the maximum iteration number N is reached; otherwise, returning to step 4) to regenerate data and continue training the deep learning model;
6) Converting the data PreData to be predicted into data PreMData matched to the deep learning model, and feeding PreMData into the trained deep learning model to obtain the named entity recognition result OrgResult.
Further, step 1) comprises:
1-1) separating data from tags in original mixed corpus data, and performing character-level segmentation on each word of the data;
1-2) marking each character using the BMESO labeling scheme: if the label corresponding to a word is Label, the character at the beginning of the word is labeled Label_B, the characters in the middle of the word are labeled Label_M, and the character at the end of the word is labeled Label_E; if the word has only one character, it is labeled Label_S; and if a word is unlabeled or does not belong to an entity label, it is labeled o.
Further, in step 3), letting l_i denote the sentence length of the i-th sentence, sentences with |l_i − l_j| < δ are grouped together, where δ denotes the sentence length interval.
Further, step 4) comprises:
4-1) converting the extracted data w into numbers, namely converting each character in w into a corresponding number through the corresponding relation between CharSet and CharID;
4-2) converting the label y corresponding to the extracted data w into a number, namely converting each character in y into a corresponding number through the corresponding relation between LabelSet and LabelID;
4-3) assuming the specified length is maxLen, when the length l of an extracted data sentence satisfies l < maxLen, padding the end of the sentence with maxLen − l zeros to obtain BatchData, and padding the end of the label y corresponding to w with maxLen − l zeros to obtain y_ID.
Further, the LSTM-based deep learning model of step 5) includes:
the Embedding layer is used for converting input character data into vectors;
the LSTM layer comprises a plurality of LSTM units and is used for extracting semantic relations among characters;
a DropOut layer to prevent over-fitting of the model;
and a SoftMax layer for classifying each character.
The LSTM-based mixed corpus named entity recognition method adopts character-level rather than word-level vectors, so it is immune to the influence of word segmentation accuracy and also evades the problem of unregistered words; by adopting the long short-term memory neural network LSTM, it can greatly improve the accuracy of named entity recognition compared with traditional algorithms; and because the mixed corpus is used directly for model training, the individual languages of the mixed corpus need not be detected and separated, finally achieving the goal of recognizing mixed corpora.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention.
FIG. 2 is a schematic diagram of a deep learning model.
Fig. 3 is a schematic diagram of an LSTM unit.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in further detail with reference to the following specific embodiments and the accompanying drawings.
The invention discloses an LSTM-based mixed corpus named entity recognition method, which identifies named entities such as person names, place names, and organization names in corpus data that mixes multiple languages. The invention addresses three core problems: 1. the efficiency of mixed corpus recognition; 2. the accuracy of named entity recognition; 3. the accuracy of unknown word recognition.
To solve the unknown word problem, the invention abandons the traditional vocabulary method and instead adopts an embedding approach in which the vectors are based on characters rather than words. To address the low recognition accuracy of traditional named entity methods, it adopts deep learning and uses a long short-term memory neural network model (LSTM) for recognition. To avoid the low efficiency of mixed corpus recognition and the need to detect the language of every character, the mixed corpus is fed into the deep learning model together for training.
The flow chart of the mixed corpus named entity recognition method of the invention is shown in Fig. 1. The method is divided into two stages: a training stage and a prediction stage.
(I) Training stage (left dashed box of the flow chart):
Step 1: convert the labeled training mixed corpus data into character-level mixed corpus data.
Step 2: train the deep learning model using the Adam gradient descent algorithm. Other algorithms, such as SGD (stochastic gradient descent), can also be used to train the deep learning model.
(II) Prediction stage (right dashed box of the flow chart):
Step 1: convert the unlabeled test mixed corpus data into character-level mixed corpus data.
Step 2: predict using the deep learning model trained in the training stage.
The specific implementation of the two stages is described in detail below.
(I) Training stage:
step 1-1: the original corpus data OrgData is converted into the character-level corpus data NewData. The method specifically comprises the following steps:
step 1-1-1: separating the data from the labels in the original corpus data, and performing character-level segmentation on each word of the data.
For example, the raw data is "[张三]/pre [graduated]/o [from]/o [哈佛大学]/org [.]/o"; after separating data and labels:
The data are: "[张三] [graduated] [from] [哈佛大学] [.]"
The labels are: "pre o o org o"
After segmenting the data at the character level: "[张 三] [g r a d u a t e d] [f r o m] [哈 佛 大 学] [.]"
Step 1-1-2: each character is marked using the BMESO (Begin, Middle, End, Single, Other) scheme (other marking schemes may also be used). If the label corresponding to a word is Label, the character at the beginning of the word is labeled Label_B, the characters in the middle of the word are labeled Label_M, and the character at the end of the word is labeled Label_E; if the word has only one character, it is labeled Label_S; and if a word is unlabeled or does not belong to an entity label, it is labeled o (in the example below the o label likewise carries the positional suffixes).
For example, the character-level labels corresponding to the data converted in step 1-1-1 are: "pre_B pre_E o_B o_M o_M o_M o_M o_M o_M o_M o_E o_B o_M o_M o_E org_B org_M org_M org_E o_S".
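A minimal sketch of this word-to-character BMESO conversion (the function and variable names are our own; following the worked example above, the o label also receives the positional suffixes):

```python
def to_bmeso(words, labels):
    """Convert word-level labels to character-level BMESO tags (step 1-1-2)."""
    chars, tags = [], []
    for word, label in zip(words, labels):
        cs = list(word)
        chars.extend(cs)
        if len(cs) == 1:
            tags.append(label + "_S")          # single-character word
        else:                                   # Begin / Middle / End
            tags.extend([label + "_B"]
                        + [label + "_M"] * (len(cs) - 2)
                        + [label + "_E"])
    return chars, tags

chars, tags = to_bmeso(["张三", "graduated", "from", "哈佛大学", "."],
                       ["pre", "o", "o", "org", "o"])
# tags == ['pre_B', 'pre_E', 'o_B', 'o_M', ..., 'o_E', 'org_B', 'org_M', 'org_M', 'org_E', 'o_S']
```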
Step 1-2: the character set CharSet of New Data is counted, and in order to avoid encountering an unknown character in prediction, a special symbol 'null' is added in the CharSet. And numbering each character in an increasing order according to the natural number to obtain a character number set CharID corresponding to the character set CharSet.
For example, the CharSet counted from the data of step 1-1 is: {null, 张, 三, g, r, a, d, u, t, e, f, o, m, 哈, 佛, 大, 学, .} (the punctuation mark is also counted); CharID is: {null: 0, 张: 1, 三: 2, g: 3, …, .: 17}.
And counting the label sets LabelSet, numbering each label, and generating a corresponding label number set LabelID.
For example, the LabelSet counted from the data of step 1-1 is: {pre_B, pre_M, pre_E, o_B, o_M, o_E, o_S, org_B, org_M, org_E}; LabelID is: {pre_B: 0, pre_M: 1, pre_E: 2, o_B: 3, o_M: 4, o_E: 5, o_S: 6, org_B: 7, org_M: 8, org_E: 9}.
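A sketch of the numbering of steps 1-2 and the conversions used later in steps 1-4 (numbering here follows order of first appearance; the patent's example numbers the full label set, so the label order may differ):

```python
def build_vocab(symbols, reserve_null=False):
    """Number each distinct symbol in order of first appearance (step 1-2)."""
    vocab = {"null": 0} if reserve_null else {}  # 'null' guards against unknown characters
    for sym in symbols:
        if sym not in vocab:
            vocab[sym] = len(vocab)
    return vocab

# Reusing chars and tags from the BMESO sketch above.
char_id = build_vocab(chars, reserve_null=True)  # e.g. {'null': 0, '张': 1, '三': 2, 'g': 3, ...}
label_id = build_vocab(tags)                     # e.g. {'pre_B': 0, 'pre_E': 1, 'o_B': 2, ...}

x_ids = [char_id[c] for c in chars]   # data converted through CharID (step 1-4-1)
y_ids = [label_id[t] for t in tags]   # labels converted through LabelID (step 1-4-2)
```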
Step 1-3: the NewData is divided by sentence length.
Let l_i denote the length of the i-th sentence; sentences with |l_i − l_j| < δ are grouped together, where δ denotes the sentence length interval. Let the grouped data be GroupData, comprising n groups in total.
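One possible realization of this grouping, as a sketch (bucketing by ⌊l/δ⌋ guarantees that lengths within a bucket differ by less than δ; names are assumptions):

```python
def group_by_length(sentences, delta):
    """Bucket sentences so lengths within a bucket differ by less than delta (step 1-3)."""
    buckets = {}
    for sent in sentences:
        buckets.setdefault(len(sent) // delta, []).append(sent)
    return list(buckets.values())   # GroupData: n groups of sentences
```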
Step 1-4: randomly extract, without replacement, BatchSize sentences of data w and the corresponding labels y from a certain group of GroupData; convert the extracted data into fixed-length data BatchData through CharID, and convert the corresponding labels into fixed-length labels y_ID through LabelID. The specific steps are as follows:
step 1-4-1: and converting the extracted data w into a number, namely converting each character in w into a corresponding number through the corresponding relation between the CharSet and the CharID.
For example, the data of step 1-1 converted through CharID becomes: [1, 2, 3, 4, 5, 6, 7, 5, 8, 9, 6, 10, 4, 11, 12, 13, 14, 15, 16, 17]
Step 1-4-2: and converting the label y corresponding to the extracted data w into a number, namely converting each character in y into a corresponding number through the corresponding relation between the LabelSet and the LabelID.
For example, the labels of step 1-1 converted through LabelID become: [0, 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 3, 4, 4, 5, 7, 8, 8, 9, 6]
Step 1-4-3: assuming the specified length is maxLen, when the length l of an extracted data sentence satisfies l < maxLen, the sentence is followed by maxLen − l zeros to obtain BatchData, and the label y corresponding to w is likewise followed by maxLen − l zeros to obtain y_ID.
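A sketch of steps 1-4 as a whole: sampling BatchSize sentences without replacement and zero-padding both data and labels to maxLen (function and variable names are assumptions):

```python
import random

def make_batch(group, batch_size, max_len, pad_id=0):
    """Sample without replacement and pad data/labels to max_len with 0 (steps 1-4)."""
    sampled = random.sample(group, min(batch_size, len(group)))
    batch_data, batch_labels = [], []
    for x_ids, y_ids in sampled:                 # one (data, label) pair per sentence
        batch_data.append(x_ids + [pad_id] * (max_len - len(x_ids)))
        batch_labels.append(y_ids + [pad_id] * (max_len - len(y_ids)))
    return batch_data, batch_labels              # BatchData, y_ID
```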
Step 1-5: the data BatchData of steps 1-4 is fed into a deep learning model to generate a loss function Cost (y', y) ID )。
The deep learning model used in the mixed corpus named entity recognition method is shown in Fig. 2. The meaning of each part is as follows:
w_1 ~ w_n: intuitively, the characters of a sentence, i.e., the data w of step 1-4; steps 1-4 must, however, be completed before they are passed to the Embedding layer.
y_1 ~ y_n: intuitively, the predicted label corresponding to each character of a sentence, used together with the actual labels y_ID to calculate the loss value.
Embedding layer: the embedding (vectorization) layer, which converts input character data into vectors.
LSTM layer: comprises a number of LSTM units and extracts the semantic relations between characters.
DropOut layer: a filtering layer that prevents overfitting of the model.
SoftMax layer: the classification layer that finally classifies each character.
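A sketch of this four-layer model in PyTorch, under stated assumptions: embed_dim, hidden_dim, and the default η are illustrative, and the patent's η is taken as the kept fraction, so the drop probability is 1 − η:

```python
import torch
import torch.nn as nn

class CharLSTMTagger(nn.Module):
    """Embedding -> LSTM -> DropOut -> SoftMax, as in Fig. 2 (a sketch)."""
    def __init__(self, n_chars, n_labels, embed_dim=100, hidden_dim=128, eta=0.5):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, embed_dim, padding_idx=0)  # char id -> vector
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)      # semantic relations
        self.dropout = nn.Dropout(p=1.0 - eta)                            # hide a (1 - η) fraction
        self.classify = nn.Linear(hidden_dim, n_labels)                   # per-character scores

    def forward(self, x):                    # x: (BatchSize, maxLen) character ids
        h, _ = self.lstm(self.embedding(x))  # h: (BatchSize, maxLen, hidden_dim)
        return torch.log_softmax(self.classify(self.dropout(h)), dim=-1)
```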
The specific steps for training the deep learning model are as follows:
step 1-5-1: vectorizing the incoming data batchData at the Embedding layer, that is, converting each character in each piece of data in the data batchData into the batchVec through a vector table Char2 Vec.
Step 1-5-2: pass BatchVec into the LSTM layer. In detail, the first vector of each piece of data is passed into the first LSTM unit, the second vector into the second LSTM unit, and so on. Meanwhile, the input of the i-th LSTM unit comprises, besides the i-th vector of each piece of data, the output of the (i−1)-th LSTM unit. Note that each LSTM unit receives not a single vector at a time but BatchSize vectors.
A more detailed depiction of the LSTM unit is shown in Fig. 3. The symbols in Fig. 3 have the following meanings:
w: a character of the input data (e.g., a sentence).
C_{i−1}, C_i: the semantic information accumulated over the first i−1 characters and over the first i characters, respectively.
h_{i−1}, h_i: the feature information of the (i−1)-th character and of the i-th character, respectively.
f: the forget gate, controlling how much of the semantic information accumulated over the first i−1 characters (C_{i−1}) is retained.
i: the input gate, controlling how much of the input data (w and h_{i−1}) is retained.
o: the output gate, controlling how much feature information is emitted when outputting the features of the i-th character.
tanh: the hyperbolic tangent function.
u: the tanh-produced candidate that, together with the input gate i, controls how much of the i-th character's feature information is retained in C_i.
*, +: element-wise multiplication and element-wise addition, respectively.
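The gate computations of Fig. 3, written out as a sketch (the weight names and shapes in params are assumptions; the gates use the standard sigmoid, and torch.tanh realizes the tanh blocks):

```python
import torch

def lstm_unit(w, h_prev, c_prev, params):
    """One LSTM unit of Fig. 3 (w: input character vector; h/c: previous feature/semantic state)."""
    x = torch.cat([w, h_prev], dim=-1)
    f = torch.sigmoid(x @ params["W_f"] + params["b_f"])  # forget gate: keep how much of C_{i-1}
    i = torch.sigmoid(x @ params["W_i"] + params["b_i"])  # input gate: keep how much of (w, h_{i-1})
    o = torch.sigmoid(x @ params["W_o"] + params["b_o"])  # output gate: emit how much feature info
    u = torch.tanh(x @ params["W_u"] + params["b_u"])     # candidate information
    c = f * c_prev + i * u                                # C_i: element-wise * and +
    h = o * torch.tanh(c)                                 # h_i: feature information of character i
    return h, c
```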
Step 1-5-3: the output h_i of each LSTM unit enters the DropOut layer, which randomly retains a fraction η (0 ≤ η ≤ 1) of the data in h_i and hides the rest, so that the hidden part is not passed further backward.
Step 1-5-4: pass the DropOut output into the SoftMax layer and produce the final loss value Cost(y′, y_ID). The specific calculation formula is:
Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′)   (Equation 1)
where y′ denotes the output of BatchData after the classification layer (SoftMax layer) of the deep learning model, corresponding to y_1, y_2, …, y_n in Fig. 2, and y_ID denotes the corresponding true labels.
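Equation 1 as a sketch, assuming y′ holds SoftMax probabilities, y_ID is one-hot encoded, and a small ε is added for numerical stability (the mean reduction is also an assumption):

```python
import torch

def cost(y_pred, y_onehot, eps=1e-8):
    """Equation 1: Cost(y', y_ID) = -y_ID*log(y') - (1 - y_ID)*log(1 - y')."""
    return -(y_onehot * torch.log(y_pred + eps)
             + (1 - y_onehot) * torch.log(1 - y_pred + eps)).mean()
```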
1-6: parameters of the deep learning model were trained using Adam gradient descent algorithm.
Step 1-7: if the deep learning model generates Cost (y', y) ID ) If the number of times of iteration is not reduced (see formula 2), or the maximum number of times of iteration N is reached, the training of the deep learning model is terminated; otherwise, jumping to the step 1-4.
Of these, cost' i (y′,y ID ) Represents the loss value, cost (y', y) at the first i iterations ID ) Representing the loss value produced by the current iteration. The meaning of this formula is if the current loss value is compared to the previous M loss valuesIs less than the threshold theta, it is considered no longer decreasing.
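A training-loop sketch combining steps 1-4 through 1-7: Adam updates with the Equation 2 stopping rule (M, θ, the learning rate, and the regenerate_batch helper are assumptions):

```python
import torch
import torch.nn.functional as F

def train(model, regenerate_batch, n_max, m=10, theta=1e-4):
    """Adam training (step 1-6) with the Equation 2 stopping rule (step 1-7)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for _ in range(n_max):                                # at most N iterations
        x, y = regenerate_batch()                         # steps 1-4: new BatchData, y_ID tensors
        loss = F.nll_loss(model(x).transpose(1, 2), y)    # model emits log-probabilities
        opt.zero_grad()
        loss.backward()
        opt.step()
        history.append(loss.item())
        if len(history) > m and abs(
                history[-1] - sum(history[-m - 1:-1]) / m) < theta:
            break                                         # Equation 2: loss no longer decreases
    return model
```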
(II) Prediction stage:
Step 2-1: convert the data PreData to be predicted into the data format PreMData matched to the deep learning model; specifically, convert the data to be predicted into character-level numeric data.
Step 2-2: feed PreMData into the deep learning model trained in the training stage to obtain the prediction result OrgResult.
The deep learning model here is the one trained in the training stage, except that during prediction the parameter of its DropOut layer is set to η = 1, indicating that no data is hidden and everything is passed to the next layer.
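A prediction-stage sketch of steps 2-1 and 2-2; in PyTorch, model.eval() disables the DropOut layer, which corresponds to the patent's η = 1 (char_id maps characters to numbers as in step 1-2, id_label inverts LabelID, and all names are assumptions):

```python
import torch

def predict(model, text, char_id, id_label):
    """Convert text to character ids (PreMData) and decode per-character labels (OrgResult)."""
    model.eval()                                             # η = 1: nothing is hidden
    ids = [char_id.get(c, char_id["null"]) for c in text]    # unseen characters map to 'null'
    with torch.no_grad():
        scores = model(torch.tensor([ids]))                  # (1, len(text), n_labels)
    preds = scores.argmax(dim=-1)[0].tolist()
    return [(c, id_label[p]) for c, p in zip(text, preds)]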
The accuracy of the invention on the test data is about 89.3%. Among prior methods, the dictionary-based approach has no way to handle unknown words (i.e., its recognition rate for unknown words is 0), while statistics-based methods and conventional artificial-neural-network-based methods reach an accuracy of about 90%. These figures, however, all concern single-language corpora, whereas the invention's figure is computed on a multilingual mixed corpus: compared with separating the languages and then processing each language individually, the invention achieves unified processing and, within an acceptable loss of precision, improves processing efficiency considerably.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention; a person skilled in the art may modify the technical solution or substitute equivalents without departing from its spirit and scope, and the protection scope of the present invention should be determined by the claims.

Claims (10)

1. An LSTM-based mixed corpus named entity recognition method, characterized by comprising the following steps:
1) Converting the original mixed corpus data OrgData into character-level mixed corpus data NewData;
2) Counting the characters in NewData to obtain a character set CharSet, and numbering each character to obtain the character number set CharID corresponding to CharSet; counting the labels of the characters in NewData to obtain a label set LabelSet, and numbering each label to obtain the label number set LabelID corresponding to LabelSet;
3) Grouping the sentences of NewData by sentence length to obtain a data set GroupData comprising n groups of sentences;
4) Randomly extracting, without replacement, BatchSize sentences of data w and the corresponding labels y from a certain group of GroupData, converting the extracted data w into fixed-length data BatchData through CharID, and converting the corresponding labels into fixed-length labels y_ID through LabelID;
5) Feeding the data BatchData and the labels y_ID into the LSTM-based deep learning model and training the parameters of the deep learning model; terminating the training of the deep learning model when the loss value it produces meets a set condition or the maximum iteration number N is reached; otherwise, returning to step 4) to regenerate data and continue training the deep learning model;
6) Converting the data PreData to be predicted into data PreMData matched to the deep learning model, and feeding PreMData into the trained deep learning model to obtain the named entity recognition result OrgResult.
2. The method of claim 1, wherein step 1) comprises:
1-1) separating data from labels in original mixed corpus data, and performing character level segmentation on each word of the data;
1-2) marking each character using the BMESO labeling scheme: if the label corresponding to a word is Label, the character at the beginning of the word is labeled Label_B, the characters in the middle of the word are labeled Label_M, and the character at the end of the word is labeled Label_E; if the word has only one character, it is labeled Label_S; and if a word is unlabeled or does not belong to an entity label, it is labeled o.
3. The method of claim 1, wherein in step 3), letting l_i denote the sentence length of the i-th sentence, sentences with |l_i − l_j| < δ are grouped together, where δ denotes the sentence length interval.
4. The method of claim 1, wherein step 4) comprises:
4-1) converting the extracted data w into numbers, namely converting each character in w into a corresponding number through the corresponding relation between CharSet and CharID;
4-2) converting the label y corresponding to the extracted data w into a number, namely converting each character in y into a corresponding number through the corresponding relation between the LabelSet and the LabelID;
4-3) assuming the specified length is maxLen, when the length l of an extracted data sentence satisfies l < maxLen, padding the end of the sentence with maxLen − l zeros to obtain BatchData, and padding the end of the label y corresponding to w with maxLen − l zeros to obtain y_ID.
5. The method of claim 1, wherein the LSTM-based deep learning model of step 5) comprises:
an Embedding layer for converting input character data into vectors;
an LSTM layer comprising a plurality of LSTM units for extracting semantic relations between characters;
a DropOut layer to prevent model overfitting;
and a SoftMax layer for classifying each character.
6. The method of claim 5, wherein the step of training the deep learning model of step 5) comprises:
5-1) vectorizing the incoming data BatchData at an Embedding layer, namely converting each character in each piece of data in the data BatchData into BatchVec through a vector table Char2 Vec;
5-2) transferring the BatchVec into the LSTM layer;
5-3) passing the output h_i of each LSTM unit into the Dropout layer;
5-4) pass the output of Dropout into the SoftMax layer and produce the final loss value.
7. The method of claim 6, wherein in step 5-2) the first vector of each piece of data is passed into the first LSTM unit, the second vector into the second LSTM unit, and so on, and the input of the i-th LSTM unit comprises, besides the i-th vector of each piece of data, the output of the (i−1)-th LSTM unit; each LSTM unit receives BatchSize vectors at a time.
8. The method of claim 6, wherein the loss value is calculated by the formula:
Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′),
where y′ denotes the output of BatchData after passing through the SoftMax layer of the deep learning model, and y_ID denotes the corresponding true labels.
9. The method of claim 8, wherein training of the deep learning model is stopped if the loss value Cost(y′, y_ID) no longer decreases, which is judged by the following formula:
|Cost(y′, y_ID) − (1/M)·Σ_{i=1}^{M} Cost′_i(y′, y_ID)| < θ,
where Cost′_i(y′, y_ID) denotes the loss value produced i iterations earlier and Cost(y′, y_ID) denotes the loss value produced by the current iteration; if the difference between the current loss value and the mean of the previous M loss values is less than the threshold θ, the loss value is considered to no longer decrease.
10. The method of claim 1, wherein step 5) trains parameters of a deep learning model using an Adam gradient descent algorithm.
CN201710947005.9A 2017-10-12 2017-10-12 LSTM-based mixed corpus named entity recognition method Withdrawn CN107992468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710947005.9A CN107992468A (en) 2017-10-12 2017-10-12 LSTM-based mixed corpus named entity recognition method


Publications (1)

Publication Number Publication Date
CN107992468A true CN107992468A (en) 2018-05-04

Family

ID=62029655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710947005.9A Withdrawn CN107992468A (en) 2017-10-12 2017-10-12 A kind of mixing language material name entity recognition method based on LSTM

Country Status (1)

Country Link
CN (1) CN107992468A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763368A (en) * 2018-05-17 2018-11-06 爱因互动科技发展(北京)有限公司 The method for extracting new knowledge point
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A kind of name entity recognition method based on Lattice LSTM and language model
CN109815268A (en) * 2018-12-21 2019-05-28 上海诺悦智能科技有限公司 A kind of transaction sanction list matching system
CN111783436A (en) * 2020-06-03 2020-10-16 广州云趣信息科技有限公司 Deep learning-based method for automatically extracting merchant information


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236578A1 (en) * 2013-02-15 2014-08-21 Nec Laboratories America, Inc. Question-Answering by Recursive Parse Tree Descent
US20140278951A1 (en) * 2013-03-15 2014-09-18 Avaya Inc. System and method for identifying and engaging collaboration opportunities
CN103853710A (en) * 2013-11-21 2014-06-11 北京理工大学 Coordinated training-based dual-language named entity identification method
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106598950A (en) * 2016-12-23 2017-04-26 东北大学 Method for recognizing named entity based on mixing stacking model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ONUR KURU et al.: "CharNER: Character-Level Named Entity Recognition", The 26th International Conference on Computational Linguistics: Technical Papers *


Similar Documents

Publication Publication Date Title
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN107977353A (en) A kind of mixing language material name entity recognition method based on LSTM-CNN
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN107908614A (en) A kind of name entity recognition method based on Bi LSTM
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN107885721A (en) A kind of name entity recognition method based on LSTM
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN112906397B (en) Short text entity disambiguation method
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN110134950B (en) Automatic text proofreading method combining words
CN113178193A (en) Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN108536781B (en) Social network emotion focus mining method and system
CN109766523A (en) Part-of-speech tagging method and labeling system
CN109543036A (en) Text Clustering Method based on semantic similarity
CN111191452A (en) Railway text named entity recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180504