CN111310471B - Travel named entity identification method based on BBLC model - Google Patents

Travel named entity identification method based on BBLC model

Info

Publication number
CN111310471B
Authority
CN
China
Prior art keywords
word
model
sequence
sentence
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010059415.1A
Other languages
Chinese (zh)
Other versions
CN111310471A (en)
Inventor
薛乐义
曹菡
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202010059415.1A priority Critical patent/CN111310471B/en
Publication of CN111310471A publication Critical patent/CN111310471A/en
Application granted granted Critical
Publication of CN111310471B publication Critical patent/CN111310471B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a travel named entity recognition method based on a BBLC model, which comprises: step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set; step two, inputting the BIO label set into a BERT pre-trained language model, which outputs the vector representation of each word in a sentence, namely the word embedding sequence of each sentence; step three, using the word embedding sequence as the input of each time step of a bidirectional LSTM and performing further semantic encoding to obtain a sentence feature matrix; and step four, taking the sentence feature matrix as the input of a CRF model, labeling and decoding a sentence x to obtain its word label sequence, scoring the probability that the label of sentence x equals y, solving the optimal path with the dynamic-programming Viterbi algorithm, and outputting the maximum-probability label sequence. By adding the BERT pre-trained language model, the invention obtains local context information, achieves higher accuracy, recall and F value, has stronger generalization ability and robustness, and helps overcome the shortcomings of traditional models.

Description

Travel named entity identification method based on BBLC model
Technical Field
The invention belongs to the technical field of semantic recognition, and relates to a travel named entity identification method based on a BBLC (BERT-BiLSTM-CRF) model.
Background
With the rise of the tourism industry, the volume of tourism data has grown ever larger. While this enriches the field, the sheer mass of data makes information acquisition complex and greatly reduces the efficiency with which people obtain information. Acquiring more useful travel information in a short time has become an important demand of travel in the big-data era. The large amount of structured information on existing tourism websites provides great convenience, but information that better reflects user preferences resides in texts such as travel notes, guides and comments, so extracting useful information from unstructured text is the focus of research; in essence, this requires improving the efficiency of named entity recognition in the Chinese tourism domain. Named Entity Recognition (NER) is a basic task of natural language processing whose goal is to accurately recognize person names, location names, organization names, dates and similar information in text, providing support for natural language processing tasks such as information extraction, information retrieval, machine translation, entity coreference resolution, question-answering systems, topic discovery and topic tracking. The Message Understanding Conference (MUC), first sponsored by the United States Defense Advanced Research Projects Agency (DARPA), introduced the named entity evaluation task at its sixth meeting (MUC-6), held in September 1995, covering organization names, person names and location names, with time identifiers including date phrases and time phrases. After years of development, the performance of English named entity recognition has approached that of manual recognition; because English named entity recognition does not need to consider the word segmentation problem, it is less difficult to implement. According to the test evaluation results of ACE (Automatic Content Extraction) and MUC, the recall, accuracy and F values obtained by most current mainstream methods are basically above 90%.
The inherent specificity of Chinese makes lexical analysis necessary before text processing, which makes Chinese named entity recognition much more difficult than English. Early on, researchers mostly adopted methods based on statistics and rules, such as manually written rules, semi-supervised learning algorithms for named entity recognition based on conditional random field models, and fast recognition algorithms for Chinese named entities based on deep learning. However, several problems remain, mainly the following:
(1) Entities are continuously updated and out-of-vocabulary words keep appearing, so a dictionary can hardly enumerate them. For example, putting all city names and all person names in a dictionary is impractical at present, and as new vocabulary is continuously produced over time, named entity recognition only becomes harder;
(2) Chinese words lack obvious surface markers; unlike English, they are not delimited by spaces or distinguished by capitalization;
(3) Chinese named entities exhibit extensive nesting, such as compound place names and organization names like "a certain district of a certain city" or "the computer science institute of a certain university";
(4) Chinese has many polysemous words; for example, "apple" may be a fruit name, a film title or a company name depending on context, causing a large number of recognition errors due to ambiguity;
(5) A named entity appearing for the first time in a Chinese text is often referred to by an abbreviation afterwards; for example, a full name such as "a certain normal university" is later shortened to an abbreviated form, and so on.
In current named entity recognition, most researchers use Word2Vec to obtain the word embedding of each word, but such embeddings are static: at encoding time they cannot distinguish the different senses that the same word takes in different contexts, which causes a large number of recognition errors.
Therefore, providing a BBLC model-based travel named entity identification method with high accuracy, recall and F value is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Aiming at the current state of research and the existing problems, the invention provides a travel named entity identification method based on a BBLC (BERT-BiLSTM-CRF) model. The Transformer in BERT adjusts the expression of a word embedding according to the semantics of the context words, thereby eliminating ambiguity. Addressing the lack of a labeled data set in the tourism field, BIO labeling is performed on travel text data, and NER experiments with the proposed BBLC model on the resulting label set show that the method is strongly competitive on the named entity recognition problem of travel data.
The specific scheme for realizing the purpose is as follows:
a tourism named entity identification method based on a BBLC model comprises the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model, and outputting the vector representation of each word in the sentence, namely the word embedding sequence (x_1, x_2, …, x_n) of each sentence;
step three, using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, performing further semantic encoding to output a hidden state sequence, then connecting a linear layer that maps the hidden state sequence to k dimensions, and obtaining the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}, where k is the number of labels in the BIO label set and each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label;
step four, taking the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k} as the input of the CRF model, labeling and decoding the sentence x to obtain the word label sequence y = (y_1, y_2, ..., y_n) of the sentence x, and outputting the score that the label of sentence x equals y:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

wherein A is a (k+2)×(k+2)-dimensional transition matrix, A_{ij} being the score of a word transitioning from the i-th label to the j-th label and representing the transition probabilities of all word labels in sentence x; the optimal path is then solved using the dynamic-programming Viterbi algorithm, and the output word label sequence y = (y_1, y_2, ..., y_n) is the word label sequence with the maximum probability for the predicted sequence.
Preferably, step three specifically comprises: using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, and splicing position by position the hidden states \overrightarrow{h_i} output at each position by the forward LSTM with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m},

and then connecting a linear layer that maps the hidden state vectors from m dimensions to k dimensions, thereby obtaining the automatically extracted sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
Preferably, after outputting the score that the label of the sentence x equals y, the method further comprises:

obtaining the normalized probability that the label of sentence x equals y using Softmax:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair (x, y), the log-likelihood L is:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y'))

where the sum runs over all candidate word label sequences y' of the sentence x.
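For illustration only, the following minimal sketch (not the patent's own code; the PyTorch framework and all tensor names are assumptions) computes this log-likelihood for one sentence, using the dynamic-programming forward algorithm for the log-sum term over all candidate sequences; the extra start and end states of the (k+2)×(k+2) formulation are omitted for brevity:

    import torch

    def crf_log_likelihood(P, A, y):
        # P: (n, k) emission scores from the BiLSTM linear layer.
        # A: (k, k) label-to-label transition scores (start/end states
        #    of the patent's (k+2)x(k+2) matrix omitted for brevity).
        # y: (n,) gold label indices as a LongTensor.
        n, k = P.shape
        # score(x, y): transition scores plus emission scores
        gold = P[torch.arange(n), y].sum() + A[y[:-1], y[1:]].sum()
        # Forward algorithm: alpha[j] is the log-sum of the scores of all
        # label sequences ending at the current position with label j.
        alpha = P[0]
        for i in range(1, n):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + P[i]
        log_Z = torch.logsumexp(alpha, dim=0)  # log of the partition sum
        return gold - log_Z                    # L(x, y), maximized in training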
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, local context information can be obtained by adding the BERT pre-training language model, the BERT pre-training language model transfers a large amount of traditional operations in a downstream specific NLP task to the pre-training word vector, and finally, after the used word embedding sequence is obtained, the obtained word embedding sequence is only required to be subjected to feature extraction and classification marking, so that the accuracy, recall rate and F value are higher, the generalization capability and robustness are stronger, the defects of the traditional model are favorably overcome, and a better effect can be obtained in the work of identifying the tour named entity.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the technical solutions in the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the invention, and that for a person skilled in the art, other drawings can be obtained from the provided drawings without inventive effort.
FIG. 1 is a framework diagram of the BBLC model provided by the present invention;
FIG. 2 is a change curve of accuracy in a BBLC model training process provided by an embodiment of the present invention;
FIG. 3 is a Loss function curve during BBLC model training provided by the embodiments of the present invention;
FIG. 4 is a comparison graph of the person name recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 5 is a comparison graph of the place name entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 6 is a comparison graph of the organization entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 7 is a comparison graph of the time entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 8 is a comparison graph of the event entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, which is the framework diagram of the BBLC model: first, the BERT pre-trained language model is used to obtain the input semantic representation, giving the vector representation of each word in a sentence; the word embedding sequence is then input into a BiLSTM (bidirectional LSTM) for further semantic encoding; finally, decoding is performed by a CRF layer, which outputs the label sequence with the maximum probability.
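As a reading aid, the framework of FIG. 1 can be sketched in PyTorch-style code as below; this is an illustrative reconstruction under assumed dimensions (bert_dim, m) and class names, not the patented implementation, and a Hugging Face-style BERT encoder is assumed to be passed in:

    import torch
    import torch.nn as nn

    class BBLC(nn.Module):
        # Illustrative BERT -> BiLSTM -> linear -> CRF composition (cf. FIG. 1).
        def __init__(self, bert, k, bert_dim=768, m=256):
            super().__init__()
            self.bert = bert                          # pre-trained language model
            self.bilstm = nn.LSTM(bert_dim, m // 2,
                                  bidirectional=True, batch_first=True)
            self.linear = nn.Linear(m, k)             # map hidden states to k labels
            self.A = nn.Parameter(torch.zeros(k, k))  # CRF transition matrix

        def forward(self, input_ids, attention_mask):
            # word embedding sequence (x1, ..., xn) from BERT
            x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(x)   # complete hidden state sequence, (batch, n, m)
            P = self.linear(h)      # sentence feature matrix, (batch, n, k)
            return P                # scored against self.A by the CRF layer

During training, P and the transition matrix A enter the CRF log-likelihood, and at prediction time the Viterbi algorithm decodes the maximum-probability label sequence, as detailed in the following steps.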
The BERT pre-trained language model is a large-scale pre-trained language model based on the bidirectional Transformer. Such a pre-trained model can efficiently extract text information and be applied to various NLP tasks, and the Transformer encoder in BERT adjusts the expression of a word embedding according to the semantics of the context words, thereby eliminating ambiguity.
BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. The pre-trained BERT representation can therefore be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modification. Earlier techniques severely restricted the power of pre-trained representations; the main limitation is that standard language models are unidirectional, which greatly limits the architectures usable during pre-training. Unlike left-to-right language model pre-training, the BERT model combines the advantages of both the OpenAI GPT and ELMo models.
The method comprises the following specific steps:
s1, performing BIO labeling on the sentences in the material library to obtain a BIO labeling set. All data is encapsulated into record form in the code.
S2, input the BIO label set into the BERT pre-trained language model and output the vector representation of each word in the sentence, i.e. the word embedding sequence (x_1, x_2, …, x_n) of each sentence. The input of the BERT pre-trained language model is a linear sequence: two sentences are divided by a separator, and identifier marks are added at the beginning and the end. Each word has three embeddings: a token embedding, a segment embedding and a position embedding. The token embedding represents the current word; the segment embedding indicates which sentence the current word belongs to; the position embedding encodes the index of the position where the current word is located. The three embeddings are summed to obtain the input sequence, which is then fed to the Transformer. The Transformer prototype includes two independent mechanisms: an encoder that receives text as input and a decoder that predicts the results of the task. Since the goal of BERT is to generate a language model, only the encoder mechanism is required. The encoder reads the entire text sequence at once rather than sequentially from left to right or right to left; this feature enables the model to learn from both sides of a word, which is equivalent to bidirectionality.
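A minimal sketch of this input construction follows (the vocabulary size, sequence length and token ids are illustrative values, not taken from the patent):

    import torch
    import torch.nn as nn

    vocab_size, max_len, d = 21128, 512, 768      # bert-base-chinese-like sizes
    token_emb = nn.Embedding(vocab_size, d)       # embedding of the current word
    segment_emb = nn.Embedding(2, d)              # which of the two sentences
    position_emb = nn.Embedding(max_len, d)       # index of the word's position

    ids = torch.tensor([[101, 1920, 7425, 1848, 102]])  # made-up token ids
    seg = torch.zeros_like(ids)                         # all from sentence 0
    pos = torch.arange(ids.size(1)).unsqueeze(0)
    x = token_emb(ids) + segment_emb(seg) + position_emb(pos)  # input sequence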
After the Transformer is called, it outputs a vector sequence in which each vector corresponds to the token with the same index; this vector sequence then serves as the input of the next layer of the neural network.
In this step, the invention fine-tunes the BERT pre-trained language model on the downstream task and then uses it to produce the input semantic representation: the mutual relation of each word in a sentence to all other words in the sentence is computed, and these relations are taken to reflect, to a certain extent, the relevance and importance of the different words in the sentence. The embedding of each word is then obtained by adjusting the importance (weight) of each word according to these relations. This embedding, as the output of the BERT pre-trained language model, encodes not only the word itself but also the relations of the other words to it. The word embedding sequence obtained from the BERT pre-trained language model then enters the BiLSTM-CRF model.
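For instance, with the Hugging Face transformers library and the bert-base-chinese checkpoint (one possible toolkit; the patent does not name a specific implementation), the word embedding sequence can be obtained as:

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("我去大雁塔", return_tensors="pt")  # sample travel sentence
    outputs = model(**inputs)
    # One context-dependent vector per token: the word embedding
    # sequence (x1, ..., xn) that enters the BiLSTM-CRF model.
    word_embeddings = outputs.last_hidden_state          # shape (1, n, 768)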
S3, the first layer of the BiLSTM-CRF model is a bidirectional LSTM layer, which automatically extracts sentence features. The word embedding sequence (x_1, x_2, …, x_n) of each word in a sentence obtained by BERT is used as the input of each time step of the bidirectional LSTM; the hidden states \overrightarrow{h_i} output at each position by the forward LSTM are spliced position by position with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m}.

After dropout is set, a linear layer is connected, mapping the hidden state vectors from m dimensions to k dimensions, where k is the number of labels in the label set; this gives the automatically extracted sentence features, recorded as the matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
Each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label; applying Softmax to P would then be equivalent to performing an independent k-class classification at each position. However, labeling the positions this way cannot use the already-labeled information, so a CRF layer is connected next to perform the labeling.
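A shape-level sketch of this layer follows, with illustrative values (not from the patent) for the sentence length n, BERT dimension d, hidden dimension m and label count k:

    import torch
    import torch.nn as nn

    n, d, m, k = 10, 768, 256, 11      # sentence length, BERT dim, hidden dim, labels
    x = torch.randn(1, n, d)           # word embedding sequence from BERT

    bilstm = nn.LSTM(d, m // 2, bidirectional=True, batch_first=True)
    h, _ = bilstm(x)                   # each h_i concatenates the forward and
                                       # backward hidden states: shape (1, n, m)
    P = nn.Linear(m, k)(h)             # sentence feature matrix P, shape (1, n, k)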
S4, the second layer of the BiLSTM-CRF model is the CRF layer, which performs sentence-level sequence labeling. The parameter of the CRF layer is a (k+2)×(k+2) matrix A, where A_{ij} is the transition score from the i-th label to the j-th label, representing the one-step transition probabilities between the word label states in sentence x; in this way, labels that have already been assigned can be used when labeling a new position. The 2 is added because a start state is appended to the head of the sentence and an end state to its tail. For a tag sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model scores the label of sentence x being equal to y as:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

It can be seen that the score of the entire sequence equals the sum of the scores at the individual positions, and that the score at each position comes from two parts: one part is the p_i output by the LSTM, the other is determined by the transition matrix A of the CRF. Furthermore, Softmax can be used to obtain the normalized probability that the label of sentence x equals y:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair (x, y), the log-likelihood L is:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y')),

where the sum runs over all candidate word label sequences y' of the sentence x. In the prediction process (decoding), the model uses the dynamic-programming Viterbi algorithm to solve the optimal path and outputs the word label sequence y* with the maximum probability for the predicted sequence:

y* = argmax_{y'} score(x, y')
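A minimal Viterbi decoder consistent with the scoring above might look as follows (an illustrative sketch, not the patent's code; the start and end states are again omitted for brevity):

    import torch

    def viterbi_decode(P, A):
        # P: (n, k) emission scores; A: (k, k) transition scores.
        # Returns y* = argmax over y' of score(x, y').
        n, k = P.shape
        score = P[0]                   # best score ending in each label so far
        back = []                      # back-pointers for path recovery
        for i in range(1, n):
            total = score.unsqueeze(1) + A + P[i]   # [prev, curr] path scores
            score, idx = total.max(dim=0)           # best previous label per curr
            back.append(idx)
        best = int(score.argmax())
        path = [best]
        for idx in reversed(back):     # follow back-pointers right to left
            best = int(idx[best])
            path.append(best)
        return path[::-1]              # maximum-probability label sequence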
examples
This embodiment uses BIO tagging to mark entities. Under the BIO rule, B marks the beginning of an entity, I marks a position inside an entity, and O marks other, non-entity words.
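For example (an illustrative, hypothetical sentence; LOC is used as the sample entity type):

    # BIO tagging of "我去大雁塔" ("I went to the Big Wild Goose Pagoda"):
    chars = ["我", "去", "大", "雁", "塔"]
    tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC"]
    # B-LOC marks the first character of the place-name entity,
    # I-LOC the characters inside it, and O every non-entity character.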
The authoritative SIGHAN 2006 Bakeoff-3 corpus contains labels for only three entity types: PER, LOC and ORG, representing person names, location names and organization names respectively. To represent the various entities of travel data comprehensively, time and other labeled entity types are also needed, so text data such as travel notes, guides and comments were crawled from travel websites such as Ctrip and Mafengwo; after BIO labeling of the 15431 entities of five types contained in 13464 Chinese sentences, these texts were used as the experimental data set. The correspondence between tags and entities is shown in Table 1:
TABLE 1 entity Tab set
(Table 1, presented as a figure in the original document, lists the BIO tags corresponding to the five entity types: person name, place name, organization name, time and event.)
In this embodiment, the BBLC model was run on the label set, and the BiLSTM-CRF and CRF models were run on the same label set as comparison models.
The experimental parameter settings are shown in Table 2:
TABLE 2 Experimental parameter settings
Parameter      Value
Batch_size     64
Learning_rate  1e-5
Train_epochs   10
Dropout_rate   0.5
For each entity type, the extraction results are generally measured by Precision, Recall and F value. Let NS be the total number of entities in the annotation set, NE the total number of entities identified by the model, and NT the number of entities the model identified correctly; then:

Precision = NT / NE × 100%

Recall = NT / NS × 100%

F = 2 × Precision × Recall / (Precision + Recall)
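Equivalently, in code (a direct transcription of the three formulas above):

    def precision_recall_f(NS, NE, NT):
        # NS: entities in the annotation set; NE: entities identified
        # by the model; NT: entities identified correctly.
        precision = NT / NE * 100
        recall = NT / NS * 100
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f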
the accompanying figures 2-3 of the specification show a correct rate change curve and a Loss function curve in the BBLC model training process.
Identification of person names:
In person name recognition, a CRF model with a feature template outperforms the directly applied BiLSTM-CRF and BBLC models: the latter two models are more complex, while person names use flexible characters and are short in length, so complex models are prone to overfitting.
Identification of place names and organization names:
Comparing the experimental data with FIGS. 5-6 of the specification shows that the accuracy of BBLC and BiLSTM-CRF on place and organization names is much higher than that of CRF, owing to the complexity and length of such names (for example, specially abbreviated place names and country names). Moreover, a place name and an organization name often share the same surface expression, with the difference depending on context, so the BBLC model with BERT achieves much better recognition results than the other two models.
Identification of time and event entities:
Likewise, FIGS. 7-8 of the specification show that time is expressed in varied forms, for example both "today" and "3/22/2019" are time entities, so the results of BBLC and BiLSTM-CRF are clearly better than those of CRF. Moreover, since time identification often needs to be combined with context, the BBLC model gives the best recognition results.
Aiming at the problem that manually constructed features are insufficient for named entities in the tourism field, this embodiment performs BIO labeling of five entity types on crawled travel text data, then proposes the BBLC model to address word-sense ambiguity in Chinese named entities, and carries out experimental verification on the labeled travel data set. The results show that, compared with BiLSTM-CRF and CRF, adding the BERT module allows local context information to be obtained, so the accuracy, recall and F value are higher, the generalization ability and robustness are stronger, the shortcomings of the traditional models are overcome, and better results can be obtained in travel named entity recognition work.
The travel named entity identification method based on a BBLC model provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

Claims (3)

1. A travel named entity identification method based on a BBLC model, characterized by comprising the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model, and outputting the vector representation of each word in the sentence, namely the word embedding sequence (x_1, x_2, …, x_n) of each sentence;
step three, using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, performing further semantic encoding to output a hidden state sequence, then connecting a linear layer that maps the hidden state sequence to k dimensions, and obtaining the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}, where k is the number of labels in the BIO label set and each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label;
step four, taking the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k} as the input of the CRF model, labeling and decoding the sentence x to obtain the word label sequence y = (y_1, y_2, ..., y_n) of the sentence x, and outputting the score that the label of sentence x equals y:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

wherein A is a (k+2)×(k+2)-dimensional transition matrix, A_{ij} being the score of a word transitioning from the i-th label to the j-th label and representing the transition probabilities of all word labels in sentence x; and solving the optimal path using the dynamic-programming Viterbi algorithm, outputting the word label sequence y = (y_1, y_2, ..., y_n) with the maximum probability for the predicted sequence.
2. The BBLC model-based travel named entity identification method according to claim 1, wherein step three specifically comprises: using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, and splicing position by position the hidden states \overrightarrow{h_i} output at each position by the forward LSTM with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m},

and then connecting a linear layer that maps the hidden state vectors from m dimensions to k dimensions, thereby obtaining the automatically extracted sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
3. The BBLC model-based travel named entity identification method according to claim 1, wherein after outputting the score that the label of the sentence x equals y, the method further comprises:

obtaining the normalized probability using Softmax:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

and training the CRF model by maximizing the log-likelihood, the log-likelihood L for the word-embedding and word-label sequence pair (x, y) being:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y'))

where the sum runs over all candidate word label sequences y' of the sentence x.
CN202010059415.1A 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model Active CN111310471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010059415.1A CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010059415.1A CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Publications (2)

Publication Number Publication Date
CN111310471A CN111310471A (en) 2020-06-19
CN111310471B true CN111310471B (en) 2023-03-10

Family

ID=71156490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010059415.1A Active CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Country Status (1)

Country Link
CN (1) CN111310471B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737996B (en) * 2020-05-29 2024-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for obtaining word vector based on language model
CN111797624A (en) * 2020-06-30 2020-10-20 厦门大学附属第一医院 NPL-based automatic medicine business card extraction method
CN111785368A (en) * 2020-06-30 2020-10-16 平安科技(深圳)有限公司 Triage method, device, equipment and storage medium based on medical knowledge map
RU2760637C1 * 2020-08-31 2021-11-29 Public Joint Stock Company "Sberbank of Russia" (PJSC Sberbank) Method and system for retrieving named entities
CN112084769B (en) * 2020-09-14 2024-07-05 深圳前海微众银行股份有限公司 Dependency syntax model optimization method, apparatus, device and readable storage medium
CN112347240A (en) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and device, readable storage medium and electronic equipment
CN112733543A (en) * 2021-01-26 2021-04-30 上海交通大学 Organization named entity normalization method and system based on text editing generation model
CN112733526B (en) * 2021-01-28 2023-11-17 成都不问科技有限公司 Extraction method for automatically identifying tax collection object in financial file
CN112861537A (en) * 2021-02-02 2021-05-28 浪潮云信息技术股份公司 License entity extraction method and system based on deep learning
CN112800768A (en) * 2021-02-03 2021-05-14 北京金山数字娱乐科技有限公司 Training method and device for nested named entity recognition model
CN113158671B (en) * 2021-03-25 2023-08-11 胡明昊 Open domain information extraction method combined with named entity identification
CN113392659A (en) * 2021-06-25 2021-09-14 携程旅游信息技术(上海)有限公司 Machine translation method, device, electronic equipment and storage medium
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113657103B (en) * 2021-08-18 2023-05-12 哈尔滨工业大学 Non-standard Chinese express mail information identification method and system based on NER
CN114580424B (en) * 2022-04-24 2022-08-05 之江实验室 Labeling method and device for named entity identification of legal document
CN117236338B (en) * 2023-08-29 2024-05-28 北京工商大学 Named entity recognition model of dense entity text and training method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018052446A1 (en) * 2016-09-19 2018-03-22 Siemens Aktiengesellschaft Critical infrastructure forensics
CN108628823B (en) * 2018-03-14 2022-07-01 中山大学 Named entity recognition method combining attention mechanism and multi-task collaborative training
CN110083831B (en) * 2019-04-16 2023-04-18 武汉大学 Chinese named entity identification method based on BERT-BiGRU-CRF
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model

Also Published As

Publication number Publication date
CN111310471A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111310471B (en) Travel named entity identification method based on BBLC model
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112836046A (en) Four-risk one-gold-field policy and regulation text entity identification method
CN111209412A (en) Method for building knowledge graph of periodical literature by cyclic updating iteration
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN112528649B (en) English pinyin identification method and system for multi-language mixed text
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
He English grammar error detection using recurrent neural networks
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN112257442A (en) Policy document information extraction method based on corpus expansion neural network
CN113609840B (en) Chinese law judgment abstract generation method and system
Xue et al. A method of chinese tourism named entity recognition based on bblc model
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN116860959A (en) Extraction type abstract method and system combining local topic and hierarchical structure information
Hu et al. Corpus of Carbonate Platforms with Lexical Annotations for Named Entity Recognition.
Zhang et al. Multitask learning for chinese named entity recognition
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant