CN109858041B - Named entity recognition method combining semi-supervised learning with user-defined dictionary
- Publication number
- CN109858041B (application CN201910172675.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- training
- data
- lstm
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Character Discrimination (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a named entity recognition method combining semi-supervised learning with a custom dictionary, comprising the following steps: S1, pre-training a Bi-LSTM language model on unlabeled data; S2, vectorizing each character with a word vector model in the Embedding layer; S3, using two layers of bidirectional LSTM as the sequence labeling model and training it on labeled data; S4, adding a user-defined dictionary; and S5, obtaining the maximum-probability path of the sequence by Viterbi decoding. The invention splices the output of the pre-trained language model with the output of the first bidirectional LSTM layer and uses the result as the input of the second bidirectional LSTM layer, which reduces the amount of labeled corpus required; when switching domains, only the labeled corpus of the new domain needs to be replaced. In addition, the emission matrix entering Viterbi decoding can be modified at prediction time by configuring the user-defined dictionary, so the dictionary takes effect.
Description
Technical Field
The invention relates to the field of data processing, concerns the application of named entity recognition technology, and particularly relates to a named entity recognition method combining semi-supervised learning with a custom dictionary.
Background
Named Entity Recognition (hereinafter NER) refers to recognizing, in text, entities (usually nouns) that belong to a specific category, such as person names, place names, organization names, and proper nouns. NER is a basic task underlying information retrieval, query classification, automatic question answering, and the like, and its quality directly affects downstream processing, making NER a fundamental problem in natural language processing research.
Semi-Supervised Learning (SSL) is a key problem in the fields of pattern recognition and machine learning, combining supervised and unsupervised learning. Semi-supervised learning uses a large amount of unlabeled data together with labeled data to perform pattern recognition. Its basic idea is to build a learner under a model assumption on the data distribution and use it to label the unlabeled samples. Formally, given a sample set S = L ∪ U drawn from some unknown distribution, where L = {(x1, y1), (x2, y2), …, (x|L|, y|L|)} is the labeled set and U = {x'1, x'2, …, x'|U|} is the unlabeled set, the goal is a function f: X → Y that accurately predicts the label y for a sample x. Here xi and x'i are d-dimensional vectors, yi ∈ Y is the label of sample xi, and |L| and |U| are the sizes of L and U, i.e., the numbers of samples they contain; semi-supervised learning seeks an optimal learner on the sample set S. If S = L, the problem reduces to traditional supervised learning; conversely, if S = U, it reduces to traditional unsupervised learning. How to jointly exploit labeled and unlabeled samples is the problem semi-supervised learning must solve.
Custom dictionaries are a product of user needs: users in different fields and industries define and understand entities differently, so a word that is an entity to one user may not be one to another. It is therefore necessary to let users define dictionaries; such dictionaries improve the accuracy of named entity recognition and better meet user requirements.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a named entity recognition method combining semi-supervised learning and a user-defined dictionary.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a named entity recognition method combining semi-supervised learning with a custom dictionary comprises the following steps:
s1, pre-training a Bi-LSTM language model by using unlabeled data;
s2, vectorizing each character by adopting a word vector model in an Embedding layer;
s3, two layers of bidirectional LSTM are used as the sequence labeling model, and the sequence labeling model is trained with labeled data;
in the training process of the sequence labeling model, the output vector of the first bidirectional LSTM layer of the sequence labeling model is spliced with the output of the Bi-LSTM language model pre-trained in step S1; the spliced vector then passes through a fully connected layer and serves as the input of the second bidirectional LSTM layer of the sequence labeling model;
s4, adding a user-defined dictionary:
obtaining an emission matrix X after the two bidirectional LSTM layers of the sequence labeling model, obtaining a transition matrix Y via maximum likelihood in the CRF layer, and then adjusting the probabilities of the emission matrix according to the user-defined dictionary to obtain the adjusted emission matrix X;
s5, obtaining a maximum probability path in the sequence by using Viterbi decoding:
inputting the emission matrix X obtained in step S4 and adjusted according to the user-defined dictionary, together with the transition matrix Y, into the Viterbi decoding of the CRF layer to obtain the sequence labels, i.e., the correct named entity recognition result.
Further, in step S2, the word vector model is a word2vec model.
Furthermore, in step S2, a Skip-gram method is specifically adopted for word vector model training.
Further, the specific steps of training the word vector model by adopting the Skip-gram method are as follows:
(1) Firstly, collecting balanced corpora related to an application field;
(2) Preprocessing the corpus data collected in step (1): filtering out junk data, removing low-frequency characters and meaningless symbols, and arranging the result into the format of the training data to obtain the training data;
(3) Sending the training data to a Skip-gram model, and training to obtain a word vector model.
The invention has the following beneficial effects: NER training with semi-supervised learning is realized based on a pre-trained language model, character embeddings, a custom dictionary technique, semi-supervised training, a Long Short-Term Memory (LSTM) network, a CRF (Conditional Random Field) model, and the like. Through this method and its special network structure, the output of the pre-trained language model is spliced with the output of the first bidirectional LSTM layer and used as the input of the second bidirectional LSTM layer. This reduces the amount of labeled corpus required, and when switching domains only the labeled corpus of the new domain needs to be replaced. In addition, the emission matrix entering Viterbi decoding can be modified at prediction time by configuring the user-defined dictionary, so the dictionary takes effect.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a network diagram of a Bi-LSTM language model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a word2vec training model CBOW commonly used in the embodiments of the present invention;
FIG. 4 is a schematic diagram of a word2vec training model skip-gram model commonly used in the embodiments of the present invention;
FIG. 5 is a schematic flow chart of word vector model training according to an embodiment of the present invention;
FIG. 6 is a diagram of a sequence annotation model according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical scheme and provides a detailed implementation and a specific operation process, but the protection scope of the present invention is not limited to this embodiment.
The following is a brief explanation of the terminology related to this embodiment:
named entity recognition: specific proper nouns such as names of people, places, organizations, time words, product names, etc. are identified from given text data.
Word2vec: an algorithm developed by Google that, through unsupervised training, turns each word into a vector of several hundred dimensions capturing the semantic relatedness between words. Also called word vectors or word embeddings.
Tensorflow: Google's open-source deep learning platform, providing rich interfaces, multi-platform support (CPU, GPU, HADOOP), distributed training, and visual monitoring.
Skip-gram: a method Google used to train word2vec on large datasets; it predicts the surrounding words from the current word to form the training objective.
LSTM: the Long Short-Term Memory network is a recurrent neural network suited to processing and predicting events separated by relatively long intervals and delays in a time series. Its memory and forget gates control how much historical information is retained, effectively alleviating the long-range dependency problem of traditional recurrent neural networks.
CRF: the Conditional Random Field is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and the like. A CRF uses a Markov chain as the probability transition model of the hidden variables and infers the hidden variables from the observable states; it is a discriminative model.
Semi-supervised learning: semi-Supervised Learning (SSL) is a key problem in the field of pattern recognition and machine Learning, and is a Learning method combining Supervised Learning and unsupervised Learning. Semi-supervised learning uses large amounts of unlabeled data, and simultaneously labeled data, for pattern recognition work.
Custom dictionary: the special entities a user wants to extract when performing NER; by configuring the dictionary, these entities can be extracted.
The embodiment provides a named entity recognition method combining semi-supervised learning and a custom dictionary, which comprises the following steps:
s1, pre-training a Bi-LSTM language model by using unlabeled data;
the adoption of the pre-trained Bi-LSTM language model has the following advantages:
1) It reduces the amount of labeled corpus required: the main function of the language model is automatic feature extraction, and pre-training on unlabeled data captures the semantic information of each character in advance.
2) It reduces model training time: thanks to the pre-training, training on the labeled data takes less time.
The invention trains the language model with a Bi-LSTM model; this is unsupervised learning, so the model can be trained without a manually labeled corpus. The network structure of the model is shown in fig. 2.
Bi-LSTM (bidirectional LSTM) combines information from the preceding and following characters to produce the semantic information of each character: the forward LSTM yields one semantic vector per character, and the backward LSTM yields another; the two semantic vectors are spliced at the output layer to obtain the final output. Since the language model is trained in an unsupervised manner, the more data, the better.
As can be seen from fig. 2, the forward and backward directions of the Bi-LSTM used in this embodiment do not share parameters; the two LSTMs are trained with different parameters, i.e., they are independent.
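As an illustration, the following is a minimal sketch of such a pre-training setup in TensorFlow/Keras (the embodiment mentions Tensorflow, but this code, including the vocabulary size, the dimensions, and the next/previous-character training objectives, is an assumed reconstruction rather than the patent's implementation):

```python
import tensorflow as tf

VOCAB_SIZE = 5000   # assumed; set from the character vocabulary of the corpus
EMBED_DIM = 128     # assumed
HIDDEN_DIM = 256    # assumed

chars = tf.keras.Input(shape=(None,), dtype=tf.int32)
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(chars)

# Forward and backward LSTMs with independent (non-shared) parameters.
fwd = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(emb)
bwd_rev = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True,
                               go_backwards=True)(emb)
# Re-align the backward outputs with the input positions.
bwd = tf.keras.layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(bwd_rev)

# Unsupervised objectives: the forward stream predicts the next character,
# the backward stream predicts the previous character.
next_char = tf.keras.layers.Dense(VOCAB_SIZE, name="next_char")(fwd)
prev_char = tf.keras.layers.Dense(VOCAB_SIZE, name="prev_char")(bwd)

lm = tf.keras.Model(chars, [next_char, prev_char])
lm.compile(optimizer="adam",
           loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# After pre-training, the concatenation [fwd, bwd] is exported as h_lm for S3.
```

Note that the forward and backward LSTMs are created as separate layers with their own weights, matching the independence described above.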
S2, vectorizing each character with the word2vec model in the Embedding layer.
In this embodiment, a Skip-gram method is specifically adopted to train and obtain a word vector model.
The word2vec model can turn each token into a vector in a low-dimensional space, typically several hundred dimensions, so the semantic relatedness between characters can be approximately described by the distance between their vectors. Compared with the commonly used word vectors, character-based vectorization brings the following advantages:
1) Character features of finer granularity can be represented;
2) Because the number of characters is far smaller than the number of words, the obtained model occupies extremely small space, and the loading speed of the model is greatly improved;
3) New words keep emerging over time, so previously trained word vector models suffer an increasingly severe drop in feature hit rate; character-based vectors effectively avoid this problem, because relatively few new characters are created each year.
The present embodiment therefore selects a character-based vectorization technique.
The word2vec model adopted in this embodiment is an unsupervised learning method, that is, the model can be trained without manually labeling corpora, and two common training methods are CBOW and Skip-gram, as shown in fig. 3 to 4.
CBOW predicts the center word from its context: the vectors of the surrounding characters w(t-2), w(t-1), w(t+1) and w(t+2) are combined to predict the character w(t), which fully preserves the contextual information, as shown in fig. 3. The Skip-gram method is the opposite: it uses w(t) to predict the surrounding words w(t-2), w(t-1), w(t+1) and w(t+2), as shown in fig. 4. With large amounts of data, the Skip-gram method is the suitable choice.
As shown in fig. 5, in this embodiment, the specific steps of training the model by using the Skip-gram method are as follows:
(1) First, collect balanced corpora related to the application field (the more data the better; no labeling is needed because the training is unsupervised), covering as many data types of the target scene as possible;
(2) Preprocess the corpus data collected in step (1): filter out junk data, remove low-frequency characters and meaningless symbols, and arrange the result into the training-data format, i.e., define the inputs and outputs, to obtain the training data;
(3) Send the training data to the Skip-gram model, and train to obtain the word vector model.
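These three steps map onto a few lines of code; the following is a minimal sketch using the gensim library (an assumed toolkit; the patent does not name one), where sg=1 selects Skip-gram and all hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

# Step (2) output: preprocessed sentences split into characters
# (character-level vectorization, as chosen in this embodiment).
sentences = [list("张三在北京工作"), list("李四去上海出差")]  # illustrative corpus

# Step (3): train the Skip-gram model (sg=1). vector_size, window,
# min_count and epochs are illustrative settings.
w2v = Word2Vec(sentences, sg=1, vector_size=128, window=5,
               min_count=1, epochs=10)

vec = w2v.wv["京"]  # a 128-dimensional character vector
```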
S3, adopting two layers of bidirectional LSTM as the sequence labeling model, and training the sequence labeling model with labeled data;
In this embodiment, the training data is labeled with the BIO scheme. For example, the label B-PER marks the beginning of a person name, I-ORG marks the inside of an organization name, and O marks everything else.
The sequence labeling model of this embodiment uses two layers of bidirectional LSTM: since only a small amount of labeled data is used for training, the model complexity is increased to fit the data better. Meanwhile, to reduce the required amount of labeled data, a pre-trained language model vector is introduced between the two bidirectional LSTM layers of the sequence labeling model; the specific model is shown in fig. 6.
Specifically, in the training process of the sequence labeling model, the output vector of the first bidirectional LSTM layer is spliced with the output of the Bi-LSTM language model; the spliced vector then passes through a fully connected layer and serves as the input of the second bidirectional LSTM layer.
In concrete terms, the input first passes through the first bidirectional LSTM layer of the sequence labeling model, i.e., a forward LSTM and a backward LSTM. The forward LSTM outputs h_ft and the backward LSTM outputs h_bt; concatenating the two gives h_t1 = [h_ft, h_bt], where the forward output h_ft characterizes the historical context and the backward output h_bt characterizes the future context. The output h_lm of the Bi-LSTM language model is then spliced with the output of the first bidirectional LSTM layer to give h_t = [h_lm, h_t1], which, after a fully connected layer, is input to the second bidirectional LSTM layer of the sequence labeling model.
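In code, the splicing could look like the following sketch (a minimal illustration assuming TensorFlow/Keras; the hidden size, tag-set size, and tanh activation of the fully connected layer are assumptions, and h_lm is the output of the pre-trained language model from step S1):

```python
import tensorflow as tf

HIDDEN_DIM = 256  # assumed
NUM_TAGS = 7      # assumed tag-set size (e.g., B-/I- for three types plus O)

def tagger_emissions(char_emb, h_lm):
    """char_emb: word2vec character vectors; h_lm: pre-trained LM output."""
    # First bidirectional LSTM layer: h_t1 = [h_ft, h_bt].
    h_t1 = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True))(char_emb)
    # Splice with the language model output: h_t = [h_lm, h_t1].
    h_t = tf.keras.layers.Concatenate()([h_lm, h_t1])
    # Fully connected layer before the second bidirectional LSTM.
    h_t = tf.keras.layers.Dense(HIDDEN_DIM, activation="tanh")(h_t)
    h_t2 = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True))(h_t)
    # Per-character scores over the tag set: the emission matrix X of S4.
    return tf.keras.layers.Dense(NUM_TAGS)(h_t2)
```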
Recurrent Neural Networks (RNNs) are currently widely used in natural language processing. For an arbitrary input text sequence (x1, x2, …, xn), an RNN returns a set of output values (h1, h2, …, hn) for the sequence. Because traditional RNNs suffer from vanishing gradients during optimization, they cannot record long-distance semantic information when predicting over long texts. The LSTM model uses different gates to control the input and output of historical information, and a bidirectional LSTM can draw not only on past history but also on future semantic information.
S4, adding a user-defined dictionary;
An emission matrix X is obtained after the two bidirectional LSTM layers of the sequence labeling model; the CRF layer then yields a transition matrix Y via maximum likelihood; finally, the probabilities in the emission matrix are adjusted according to the user-defined dictionary to obtain the adjusted emission matrix X.
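The patent does not specify the exact adjustment formula for the emission matrix. The sketch below shows one plausible realization in which the emission scores of characters matched by a dictionary entry are boosted toward that entry's B-/I- tags; the function name, the dictionary format, and the additive boost value are all assumptions:

```python
import numpy as np

def apply_custom_dict(emissions, chars, custom_dict, tag_index, boost=5.0):
    """Boost emission scores so dictionary entries decode to their entity type.

    emissions  : (seq_len, num_tags) emission matrix X from the two BiLSTM layers
    chars      : list of input characters
    custom_dict: {entity_string: entity_type}, e.g. {"石墨烯": "MAT"} (hypothetical)
    tag_index  : {tag_name: column index}, e.g. {"B-MAT": 3, "I-MAT": 4}
    boost      : additive score bonus (illustrative value)
    """
    text = "".join(chars)
    adjusted = np.array(emissions, copy=True)
    for entity, etype in custom_dict.items():
        start = text.find(entity)
        while start != -1:
            # Raise the B- score on the first character and I- on the rest.
            adjusted[start, tag_index["B-" + etype]] += boost
            for i in range(start + 1, start + len(entity)):
                adjusted[i, tag_index["I-" + etype]] += boost
            start = text.find(entity, start + 1)
    return adjusted
```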
S5, obtaining the maximum-probability path of the sequence by Viterbi decoding.
Specifically, the emission matrix X adjusted according to the user-defined dictionary and the transition matrix Y are input into the Viterbi decoding of the CRF layer to obtain the sequence labels, i.e., the correct named entity recognition result.
As the last layer of the invention, the main role of the CRF layer is to perform Viterbi decoding and find the optimal path. Conditional random fields (CRFs) are a discriminative probabilistic model over random fields, commonly used to label or analyze sequence data such as natural language text or biological sequences. The conditional random field model has the advantages of a discriminative model and, like a generative model, takes the transition probabilities between context labels into account, performing global parameter optimization and decoding over the whole sequence; this solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find hard to avoid.
The conditional random field uses a probabilistic graphical model, can express long-distance dependencies and overlapping features, and handles labeling (classification) bias well, because all features can be globally normalized to obtain a globally optimal solution. What is mainly used here is the CRF prediction algorithm: the Viterbi algorithm, a dynamic programming algorithm.
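A minimal, self-contained sketch of Viterbi decoding over the adjusted emission matrix X and the transition matrix Y (scores are treated additively, as with log-probabilities; this is a generic implementation written for illustration, not code from the patent):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions  : (seq_len, num_tags) adjusted emission matrix X
    transitions: (num_tags, num_tags) transition matrix Y, Y[i, j] being the
                 score of moving from tag i to tag j
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()              # best score ending in each tag
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # cand[i, j] = score[i] + Y[i, j] + X[t, j]; maximize over previous tag i.
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Trace back the maximum-probability path.
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

The returned tag indices are mapped back to BIO labels, which is exactly the sequence-label output of step S5.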
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.
Claims (1)
1. A named entity recognition method combining semi-supervised learning with a custom dictionary is characterized by comprising the following steps:
s1, pre-training a Bi-LSTM language model by using unlabeled data; wherein the forward and backward directions of the Bi-LSTM do not share parameters, and the two LSTMs are trained with different parameters, i.e., the two LSTMs are independent;
s2, vectorizing each character by adopting a word2vec model in an Embedding layer; specifically, a Skip-gram method is adopted for word vector model training, and the method specifically comprises the following steps:
(1) Firstly, collecting balanced corpora related to an application field;
(2) Preprocessing the corpus data collected in step (1): filtering out junk data, removing low-frequency characters and meaningless symbols, and arranging the result into the format of the training data to obtain the training data;
(3) Sending the training data to a Skip-gram model, and training to obtain a word vector model;
s3, two layers of bidirectional LSTM are used as the sequence labeling model, and the sequence labeling model is trained with labeled data;
in the training process of the sequence labeling model, the output vector of the first bidirectional LSTM layer of the sequence labeling model is spliced with the output of the Bi-LSTM language model pre-trained in step S1; the spliced vector then passes through a fully connected layer and serves as the input of the second bidirectional LSTM layer of the sequence labeling model;
s4, adding a user-defined dictionary:
obtaining an emission matrix X after the two bidirectional LSTM layers of the sequence labeling model, obtaining a transition matrix Y via maximum likelihood in the CRF layer, and then adjusting the probabilities of the emission matrix according to the user-defined dictionary to obtain the adjusted emission matrix X;
s5, obtaining a maximum probability path in the sequence by using Viterbi decoding:
inputting the emission matrix X obtained in step S4 and adjusted according to the user-defined dictionary, together with the transition matrix Y, into the Viterbi decoding of the CRF layer to obtain the sequence labels, i.e., the correct named entity recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910172675.7A CN109858041B (en) | 2019-03-07 | 2019-03-07 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109858041A CN109858041A (en) | 2019-06-07 |
CN109858041B true CN109858041B (en) | 2023-02-17 |
Family
ID=66900199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910172675.7A Active CN109858041B (en) | 2019-03-07 | 2019-03-07 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858041B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598203B (en) * | 2019-07-19 | 2023-08-01 | 中国人民解放军国防科技大学 | Method and device for extracting entity information of military design document combined with dictionary |
CN111079418B (en) * | 2019-11-06 | 2023-12-05 | 科大讯飞股份有限公司 | Named entity recognition method, device, electronic equipment and storage medium |
CN111079405A (en) * | 2019-11-29 | 2020-04-28 | 微民保险代理有限公司 | Text information identification method and device, storage medium and computer equipment |
CN111062215B (en) * | 2019-12-10 | 2024-02-13 | 金蝶软件(中国)有限公司 | Named entity recognition method and device based on semi-supervised learning training |
CN111079437B (en) * | 2019-12-20 | 2023-07-07 | 达闼机器人股份有限公司 | Entity identification method, electronic equipment and storage medium |
CN111274814B (en) * | 2019-12-26 | 2021-09-24 | 浙江大学 | Novel semi-supervised text entity information extraction method |
CN111274817A (en) * | 2020-01-16 | 2020-06-12 | 北京航空航天大学 | Intelligent software cost measurement method based on natural language processing technology |
CN111291550B (en) * | 2020-01-17 | 2021-09-03 | 北方工业大学 | Chinese entity extraction method and device |
CN111914551B (en) * | 2020-07-29 | 2022-05-20 | 北京字节跳动网络技术有限公司 | Natural language processing method, device, electronic equipment and storage medium |
CN111985240B (en) * | 2020-08-19 | 2024-02-27 | 腾讯云计算(长沙)有限责任公司 | Named entity recognition model training method, named entity recognition method and named entity recognition device |
CN112464645A (en) * | 2020-10-30 | 2021-03-09 | 中国电力科学研究院有限公司 | Semi-supervised learning method, system, equipment, storage medium and semantic analysis method |
CN113761215A (en) * | 2021-03-25 | 2021-12-07 | 中科天玑数据科技股份有限公司 | Feedback self-learning-based dynamic dictionary base generation method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133220B (en) * | 2017-06-07 | 2020-11-24 | 东南大学 | Geographic science field named entity identification method |
CN108388560B (en) * | 2018-03-17 | 2021-08-20 | 北京工业大学 | GRU-CRF conference name identification method based on language model |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018218705A1 (en) * | 2017-05-27 | 2018-12-06 | 中国矿业大学 | Method for recognizing network text named entity based on neural network probability disambiguation |
CN107797992A (en) * | 2017-11-10 | 2018-03-13 | 北京百分点信息科技有限公司 | Name entity recognition method and device |
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
CN109284400A (en) * | 2018-11-28 | 2019-01-29 | 电子科技大学 | A kind of name entity recognition method based on Lattice LSTM and language model |
Non-Patent Citations (1)
Title |
---|
Deep contextualized word representations; Christopher Clark et al.; https://arxiv.org/pdf/1802.05365.pdf; 2018-03-22; pp. 1-9 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109858041B (en) | Named entity recognition method combining semi-supervised learning with user-defined dictionary | |
CN108460013B (en) | Sequence labeling model and method based on fine-grained word representation model | |
CN110110335B (en) | Named entity identification method based on stack model | |
CN109800437B (en) | Named entity recognition method based on feature fusion | |
CN111046179B (en) | Text classification method for open network question in specific field | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
Lin et al. | Automatic translation of spoken English based on improved machine learning algorithm | |
CN110263325B (en) | Chinese word segmentation system | |
CN110532557B (en) | Unsupervised text similarity calculation method | |
CN112270379A (en) | Training method of classification model, sample classification method, device and equipment | |
CN111563143B (en) | Method and device for determining new words | |
Li et al. | UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning | |
Wang et al. | Dialogue intent classification with character-CNN-BGRU networks | |
CN111078833A (en) | Text classification method based on neural network | |
Anbukkarasi et al. | Analyzing sentiment in Tamil tweets using deep neural network | |
CN111914556A (en) | Emotion guiding method and system based on emotion semantic transfer map | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
Wang et al. | Learning morpheme representation for mongolian named entity recognition | |
CN110516035A (en) | A kind of man-machine interaction method and system of mixing module | |
CN114781375A (en) | Military equipment relation extraction method based on BERT and attention mechanism | |
CN111435375A (en) | Threat information automatic labeling method based on FastText | |
Xu et al. | Text sentiment analysis and classification based on bidirectional Gated Recurrent Units (GRUs) model | |
US11941360B2 (en) | Acronym definition network | |
Zhu et al. | A named entity recognition model based on ensemble learning | |
Qi et al. | Semi-supervised sequence labeling with self-learned features |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
CB02 | Change of applicant information | Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing; Applicant after: Beijing PERCENT Technology Group Co.,Ltd. Address before: 100081 16/F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing; Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.
GR01 | Patent grant |