CN111310471B - Travel named entity identification method based on BBLC model - Google Patents

Travel named entity identification method based on BBLC model

Info

Publication number
CN111310471B
Authority
CN
China
Prior art keywords
word
model
sequence
sentence
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010059415.1A
Other languages
Chinese (zh)
Other versions
CN111310471A (en)
Inventor
薛乐义
曹菡
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN202010059415.1A priority Critical patent/CN111310471B/en
Publication of CN111310471A publication Critical patent/CN111310471A/en
Application granted granted Critical
Publication of CN111310471B publication Critical patent/CN111310471B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a travel named entity recognition method based on a BBLC model, which comprises: step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set; step two, inputting the BIO label set into a BERT pre-trained language model, which outputs the vector representation of each word in a sentence, namely the word embedding sequence of each sentence; step three, using the word embedding sequence as the input of each time step of a bidirectional LSTM and performing further semantic encoding to obtain a sentence feature matrix; and step four, taking the sentence feature matrix as the input of a CRF model, labeling and decoding a sentence x to obtain its word label sequence, scoring the probability that the label of sentence x equals y, solving the optimal path with the dynamic-programming Viterbi algorithm, and outputting the maximum-probability label sequence. By adding the BERT pre-trained language model, the invention obtains local context information, achieves higher accuracy, recall and F value, has stronger generalization ability and robustness, and helps overcome the shortcomings of traditional models.

Description

Travel named entity identification method based on BBLC model
Technical Field
The invention belongs to the technical field of semantic recognition, and relates to a travel named entity identification method based on a BBLC (BERT-BiLSTM-CRF) model.
Background
With the rise of the tourism industry, the volume of tourism data has grown ever larger. While this enriches the field, the sheer mass of data makes information acquisition complex and greatly reduces the efficiency with which people obtain information. Acquiring more useful travel information in a short time has become an important demand of travel in the big-data era. The large amount of structured information on existing tourism websites provides great convenience, but information that better reflects user preferences resides in texts such as travel notes, guides and comments, so extracting useful information from unstructured text is the focus of research; in essence, this requires improving the efficiency of named entity recognition in the Chinese tourism domain. Named Entity Recognition (NER) is a basic task of natural language processing whose goal is to accurately recognize person names, location names, organization names, dates and similar information in text, providing support for natural language processing tasks such as information extraction, information retrieval, machine translation, entity coreference resolution, question-answering systems, topic discovery and topic tracking. The Message Understanding Conference (MUC), first sponsored by the United States Defense Advanced Research Projects Agency (DARPA), introduced the named entity evaluation task at its sixth meeting (MUC-6), held in September 1995, covering organization names, person names and location names, with time identifiers including date phrases and time phrases. After years of development, the performance of English named entity recognition has approached that of manual recognition; because English named entity recognition does not need to consider the word segmentation problem, it is less difficult to implement. According to the test evaluation results of ACE (Automatic Content Extraction) and MUC, the recall, accuracy and F values obtained by most current mainstream methods are basically above 90%.
The inherent specificity of Chinese makes lexical analysis necessary before text processing, which makes Chinese named entity recognition much more difficult than English. Early on, researchers mostly adopted methods based on statistics and rules, such as manually written rules, semi-supervised learning algorithms for named entity recognition based on conditional random field models, and fast recognition algorithms for Chinese named entities based on deep learning. However, several problems remain, mainly the following:
(1) Entities are continuously updated and out-of-vocabulary words keep appearing, so a dictionary can hardly enumerate them. For example, putting all city names and all person names in a dictionary is impractical at present, and as new vocabulary is continuously produced over time, named entity recognition only becomes harder;
(2) Chinese words lack obvious surface markers; unlike English, they are not delimited by spaces or distinguished by capitalization;
(3) Chinese named entities exhibit extensive nesting, such as compound place names and organization names like "a certain district of a certain city" or "the computer science institute of a certain university";
(4) Chinese has many polysemous words; for example, "apple" may be a fruit name, a film title or a company name depending on context, causing a large number of recognition errors due to ambiguity;
(5) A named entity appearing for the first time in a Chinese text is often referred to by an abbreviation afterwards; for example, a full name such as "a certain normal university" is later shortened to an abbreviated form, and so on.
In current named entity recognition, most researchers use Word2Vec to obtain the word embedding of each word, but such embeddings are static: at encoding time they cannot distinguish the different senses that the same word takes in different contexts, which causes a large number of recognition errors.
Therefore, providing a BBLC model-based travel named entity identification method with high accuracy, recall and F value is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Aiming at the current state of research and the existing problems, the invention provides a travel named entity identification method based on a BBLC (BERT-BiLSTM-CRF) model. The Transformer in BERT adjusts the expression of a word embedding according to the semantics of the context words, thereby eliminating ambiguity. Addressing the lack of a labeled data set in the tourism field, BIO labeling is performed on travel text data, and NER experiments with the proposed BBLC model on the resulting label set show that the method is strongly competitive on the named entity recognition problem of travel data.
The specific scheme for realizing the purpose is as follows:
a tourism named entity identification method based on a BBLC model comprises the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model, and outputting the vector representation of each word in the sentence, namely the word embedding sequence (x_1, x_2, …, x_n) of each sentence;
step three, using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, performing further semantic encoding to output a hidden state sequence, then connecting a linear layer that maps the hidden state sequence to k dimensions, and obtaining the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}, where k is the number of labels in the BIO label set and each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label;
step four, taking the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k} as the input of the CRF model, labeling and decoding the sentence x to obtain the word label sequence y = (y_1, y_2, ..., y_n) of the sentence x, and outputting the score that the label of sentence x equals y:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

wherein A is a (k+2)×(k+2)-dimensional transition matrix, A_{ij} being the score of a word transitioning from the i-th label to the j-th label and representing the transition probabilities of all word labels in sentence x; the optimal path is then solved using the dynamic-programming Viterbi algorithm, and the output word label sequence y = (y_1, y_2, ..., y_n) is the word label sequence with the maximum probability for the predicted sequence.
Preferably, step three specifically comprises: using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, and splicing position by position the hidden states \overrightarrow{h_i} output at each position by the forward LSTM with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m},

and then connecting a linear layer that maps the hidden state vectors from m dimensions to k dimensions, thereby obtaining the automatically extracted sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
Preferably, after outputting the score that the label of the sentence x equals y, the method further comprises:

obtaining the normalized probability that the label of sentence x equals y using Softmax:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair (x, y), the log-likelihood L is:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y'))

where the sum runs over all candidate word label sequences y' of the sentence x.
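For illustration only, the following minimal sketch (not the patent's own code; the PyTorch framework and all tensor names are assumptions) computes this log-likelihood for one sentence, using the dynamic-programming forward algorithm for the log-sum term over all candidate sequences; the extra start and end states of the (k+2)×(k+2) formulation are omitted for brevity:

    import torch

    def crf_log_likelihood(P, A, y):
        # P: (n, k) emission scores from the BiLSTM linear layer.
        # A: (k, k) label-to-label transition scores (start/end states
        #    of the patent's (k+2)x(k+2) matrix omitted for brevity).
        # y: (n,) gold label indices as a LongTensor.
        n, k = P.shape
        # score(x, y): transition scores plus emission scores
        gold = P[torch.arange(n), y].sum() + A[y[:-1], y[1:]].sum()
        # Forward algorithm: alpha[j] is the log-sum of the scores of all
        # label sequences ending at the current position with label j.
        alpha = P[0]
        for i in range(1, n):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + P[i]
        log_Z = torch.logsumexp(alpha, dim=0)  # log of the partition sum
        return gold - log_Z                    # L(x, y), maximized in training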
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, local context information can be obtained by adding the BERT pre-training language model, the BERT pre-training language model transfers a large amount of traditional operations in a downstream specific NLP task to the pre-training word vector, and finally, after the used word embedding sequence is obtained, the obtained word embedding sequence is only required to be subjected to feature extraction and classification marking, so that the accuracy, recall rate and F value are higher, the generalization capability and robustness are stronger, the defects of the traditional model are favorably overcome, and a better effect can be obtained in the work of identifying the tour named entity.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the technical solutions in the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the invention, and that for a person skilled in the art, other drawings can be obtained from the provided drawings without inventive effort.
FIG. 1 is a framework diagram of the BBLC model provided by the present invention;
FIG. 2 is a change curve of accuracy in a BBLC model training process provided by an embodiment of the present invention;
FIG. 3 is a Loss function curve during BBLC model training provided by the embodiments of the present invention;
FIG. 4 is a comparison graph of the person name recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 5 is a comparison graph of the place name entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 6 is a comparison graph of the organization entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 7 is a comparison graph of the time entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention;
FIG. 8 is a comparison graph of the event entity recognition accuracy of the CRF model, the BiLSTM-CRF model and the BBLC model provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, which is the framework diagram of the BBLC model: first, the BERT pre-trained language model is used to obtain the input semantic representation, giving the vector representation of each word in a sentence; the word embedding sequence is then input into a BiLSTM (bidirectional LSTM) for further semantic encoding; finally, decoding is performed by a CRF layer, which outputs the label sequence with the maximum probability.
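As a reading aid, the framework of FIG. 1 can be sketched in PyTorch-style code as below; this is an illustrative reconstruction under assumed dimensions (bert_dim, m) and class names, not the patented implementation, and a Hugging Face-style BERT encoder is assumed to be passed in:

    import torch
    import torch.nn as nn

    class BBLC(nn.Module):
        # Illustrative BERT -> BiLSTM -> linear -> CRF composition (cf. FIG. 1).
        def __init__(self, bert, k, bert_dim=768, m=256):
            super().__init__()
            self.bert = bert                          # pre-trained language model
            self.bilstm = nn.LSTM(bert_dim, m // 2,
                                  bidirectional=True, batch_first=True)
            self.linear = nn.Linear(m, k)             # map hidden states to k labels
            self.A = nn.Parameter(torch.zeros(k, k))  # CRF transition matrix

        def forward(self, input_ids, attention_mask):
            # word embedding sequence (x1, ..., xn) from BERT
            x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(x)   # complete hidden state sequence, (batch, n, m)
            P = self.linear(h)      # sentence feature matrix, (batch, n, k)
            return P                # scored against self.A by the CRF layer

During training, P and the transition matrix A enter the CRF log-likelihood, and at prediction time the Viterbi algorithm decodes the maximum-probability label sequence, as detailed in the following steps.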
The BERT pre-trained language model is a large-scale pre-trained language model based on the bidirectional Transformer. Such a pre-trained model can efficiently extract text information and be applied to various NLP tasks, and the Transformer encoder in BERT adjusts the expression of a word embedding according to the semantics of the context words, thereby eliminating ambiguity.
BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. The pre-trained BERT representation can therefore be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modification. Earlier techniques severely restricted the power of pre-trained representations; the main limitation is that standard language models are unidirectional, which greatly limits the architectures usable during pre-training. Unlike left-to-right language model pre-training, the BERT model combines the advantages of both the OpenAI GPT and ELMo models.
The method comprises the following specific steps:
s1, performing BIO labeling on the sentences in the material library to obtain a BIO labeling set. All data is encapsulated into record form in the code.
S2, input the BIO label set into the BERT pre-trained language model and output the vector representation of each word in the sentence, i.e. the word embedding sequence (x_1, x_2, …, x_n) of each sentence. The input of the BERT pre-trained language model is a linear sequence: two sentences are divided by a separator, and identifier marks are added at the beginning and the end. Each word has three embeddings: a token embedding, a segment embedding and a position embedding. The token embedding represents the current word; the segment embedding indicates which sentence the current word belongs to; the position embedding encodes the index of the position where the current word is located. The three embeddings are summed to obtain the input sequence, which is then fed to the Transformer. The Transformer prototype includes two independent mechanisms: an encoder that receives text as input and a decoder that predicts the results of the task. Since the goal of BERT is to generate a language model, only the encoder mechanism is required. The encoder reads the entire text sequence at once rather than sequentially from left to right or right to left; this feature enables the model to learn from both sides of a word, which is equivalent to bidirectionality.
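A minimal sketch of this input construction follows (the vocabulary size, sequence length and token ids are illustrative values, not taken from the patent):

    import torch
    import torch.nn as nn

    vocab_size, max_len, d = 21128, 512, 768      # bert-base-chinese-like sizes
    token_emb = nn.Embedding(vocab_size, d)       # embedding of the current word
    segment_emb = nn.Embedding(2, d)              # which of the two sentences
    position_emb = nn.Embedding(max_len, d)       # index of the word's position

    ids = torch.tensor([[101, 1920, 7425, 1848, 102]])  # made-up token ids
    seg = torch.zeros_like(ids)                         # all from sentence 0
    pos = torch.arange(ids.size(1)).unsqueeze(0)
    x = token_emb(ids) + segment_emb(seg) + position_emb(pos)  # input sequence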
After the Transformer is called, it outputs a vector sequence in which each vector corresponds to the token with the same index; this vector sequence then serves as the input of the next layer of the neural network.
In this step, the invention fine-tunes the BERT pre-trained language model on the downstream task and then uses it to produce the input semantic representation: the mutual relation of each word in a sentence to all other words in the sentence is computed, and these relations are taken to reflect, to a certain extent, the relevance and importance of the different words in the sentence. The embedding of each word is then obtained by adjusting the importance (weight) of each word according to these relations. This embedding, as the output of the BERT pre-trained language model, encodes not only the word itself but also the relations of the other words to it. The word embedding sequence obtained from the BERT pre-trained language model then enters the BiLSTM-CRF model.
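For instance, with the Hugging Face transformers library and the bert-base-chinese checkpoint (one possible toolkit; the patent does not name a specific implementation), the word embedding sequence can be obtained as:

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("我去大雁塔", return_tensors="pt")  # sample travel sentence
    outputs = model(**inputs)
    # One context-dependent vector per token: the word embedding
    # sequence (x1, ..., xn) that enters the BiLSTM-CRF model.
    word_embeddings = outputs.last_hidden_state          # shape (1, n, 768)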
S3, the first layer of the BiLSTM-CRF model is a bidirectional LSTM layer, which automatically extracts sentence features. The word embedding sequence (x_1, x_2, …, x_n) of each word in a sentence obtained by BERT is used as the input of each time step of the bidirectional LSTM; the hidden states \overrightarrow{h_i} output at each position by the forward LSTM are spliced position by position with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m}.

After dropout is set, a linear layer is connected, mapping the hidden state vectors from m dimensions to k dimensions, where k is the number of labels in the label set; this gives the automatically extracted sentence features, recorded as the matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
Each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label; applying Softmax to P would then be equivalent to performing an independent k-class classification at each position. However, labeling the positions this way cannot use the already-labeled information, so a CRF layer is connected next to perform the labeling.
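A shape-level sketch of this layer follows, with illustrative values (not from the patent) for the sentence length n, BERT dimension d, hidden dimension m and label count k:

    import torch
    import torch.nn as nn

    n, d, m, k = 10, 768, 256, 11      # sentence length, BERT dim, hidden dim, labels
    x = torch.randn(1, n, d)           # word embedding sequence from BERT

    bilstm = nn.LSTM(d, m // 2, bidirectional=True, batch_first=True)
    h, _ = bilstm(x)                   # each h_i concatenates the forward and
                                       # backward hidden states: shape (1, n, m)
    P = nn.Linear(m, k)(h)             # sentence feature matrix P, shape (1, n, k)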
S4, the second layer of the BiLSTM-CRF model is the CRF layer, which performs sentence-level sequence labeling. The parameter of the CRF layer is a (k+2)×(k+2) matrix A, where A_{ij} is the transition score from the i-th label to the j-th label, representing the one-step transition probabilities between the word label states in sentence x; in this way, labels that have already been assigned can be used when labeling a new position. The 2 is added because a start state is appended to the head of the sentence and an end state to its tail. For a tag sequence y = (y_1, y_2, ..., y_n) whose length equals the sentence length, the model scores the label of sentence x being equal to y as:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

It can be seen that the score of the entire sequence equals the sum of the scores at the individual positions, and that the score at each position comes from two parts: one part is the p_i output by the LSTM, the other is determined by the transition matrix A of the CRF. Furthermore, Softmax can be used to obtain the normalized probability that the label of sentence x equals y:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair (x, y), the log-likelihood L is:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y')),

where the sum runs over all candidate word label sequences y' of the sentence x. In the prediction process (decoding), the model uses the dynamic-programming Viterbi algorithm to solve the optimal path and outputs the word label sequence y* with the maximum probability for the predicted sequence:

y* = argmax_{y'} score(x, y')
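A minimal Viterbi decoder consistent with the scoring above might look as follows (an illustrative sketch, not the patent's code; the start and end states are again omitted for brevity):

    import torch

    def viterbi_decode(P, A):
        # P: (n, k) emission scores; A: (k, k) transition scores.
        # Returns y* = argmax over y' of score(x, y').
        n, k = P.shape
        score = P[0]                   # best score ending in each label so far
        back = []                      # back-pointers for path recovery
        for i in range(1, n):
            total = score.unsqueeze(1) + A + P[i]   # [prev, curr] path scores
            score, idx = total.max(dim=0)           # best previous label per curr
            back.append(idx)
        best = int(score.argmax())
        path = [best]
        for idx in reversed(back):     # follow back-pointers right to left
            best = int(idx[best])
            path.append(best)
        return path[::-1]              # maximum-probability label sequence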
examples
This embodiment uses BIO tagging to mark entities. Under the BIO rule, B marks the beginning of an entity, I marks a position inside an entity, and O marks other, non-entity words.
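For example (an illustrative, hypothetical sentence; LOC is used as the sample entity type):

    # BIO tagging of "我去大雁塔" ("I went to the Big Wild Goose Pagoda"):
    chars = ["我", "去", "大", "雁", "塔"]
    tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC"]
    # B-LOC marks the first character of the place-name entity,
    # I-LOC the characters inside it, and O every non-entity character.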
The authoritative SIGHAN 2006 Bakeoff-3 corpus contains labels for only three entity types: PER, LOC and ORG, representing person names, location names and organization names respectively. To represent the various entities of travel data comprehensively, time and other labeled entity types are also needed, so text data such as travel notes, guides and comments were crawled from travel websites such as Ctrip and Mafengwo; after BIO labeling of the 15431 entities of five types contained in 13464 Chinese sentences, these texts were used as the experimental data set. The correspondence between tags and entities is shown in Table 1:
TABLE 1 entity Tab set
(Table 1, presented as a figure in the original document, lists the BIO tags corresponding to the five entity types: person name, place name, organization name, time and event.)
In this embodiment, the BBLC model was run on the label set, and the BiLSTM-CRF and CRF models were run on the same label set as comparison models.
The experimental parameter settings are shown in Table 2:
TABLE 2 Experimental parameter settings
Parameter      Value
Batch_size     64
Learning_rate  1e-5
Train_epochs   10
Dropout_rate   0.5
For each entity type, the extraction results are generally measured by Precision, Recall and F value. Let NS be the total number of entities in the annotation set, NE the total number of entities identified by the model, and NT the number of entities the model identified correctly; then:

Precision = NT / NE × 100%

Recall = NT / NS × 100%

F = 2 × Precision × Recall / (Precision + Recall)
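Equivalently, in code (a direct transcription of the three formulas above):

    def precision_recall_f(NS, NE, NT):
        # NS: entities in the annotation set; NE: entities identified
        # by the model; NT: entities identified correctly.
        precision = NT / NE * 100
        recall = NT / NS * 100
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f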
the accompanying figures 2-3 of the specification show a correct rate change curve and a Loss function curve in the BBLC model training process.
Identification of person names:
In person name recognition, a CRF model with a feature template outperforms the directly applied BiLSTM-CRF and BBLC models: the latter two models are more complex, while person names use flexible characters and are short in length, so complex models are prone to overfitting.
Identification of place names and organization names:
Comparing the experimental data with FIGS. 5-6 of the specification shows that the accuracy of BBLC and BiLSTM-CRF on place and organization names is much higher than that of CRF, owing to the complexity and length of such names (for example, specially abbreviated place names and country names). Moreover, a place name and an organization name often share the same surface expression, with the difference depending on context, so the BBLC model with BERT achieves much better recognition results than the other two models.
Identification of time and event entities:
Likewise, FIGS. 7-8 of the specification show that time is expressed in varied forms, for example both "today" and "3/22/2019" are time entities, so the results of BBLC and BiLSTM-CRF are clearly better than those of CRF. Moreover, since time identification often needs to be combined with context, the BBLC model gives the best recognition results.
Aiming at the problem that manually constructed features are insufficient for named entities in the tourism field, this embodiment performs BIO labeling of five entity types on crawled travel text data, then proposes the BBLC model to address word-sense ambiguity in Chinese named entities, and carries out experimental verification on the labeled travel data set. The results show that, compared with BiLSTM-CRF and CRF, adding the BERT module allows local context information to be obtained, so the accuracy, recall and F value are higher, the generalization ability and robustness are stronger, the shortcomings of the traditional models are overcome, and better results can be obtained in travel named entity recognition work.
The travel named entity identification method based on a BBLC model provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

Claims (3)

1. A travel named entity identification method based on a BBLC model, characterized by comprising the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model, and outputting the vector representation of each word in the sentence, namely the word embedding sequence (x_1, x_2, …, x_n) of each sentence;
step three, using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, performing further semantic encoding to output a hidden state sequence, then connecting a linear layer that maps the hidden state sequence to k dimensions, and obtaining the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}, where k is the number of labels in the BIO label set and each dimension p_{ij} of p_i ∈ R^k is the probability that the word x_i is classified to the j-th label;
step four, taking the sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k} as the input of the CRF model, labeling and decoding the sentence x to obtain the word label sequence y = (y_1, y_2, ..., y_n) of the sentence x, and outputting the score that the label of sentence x equals y:

score(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

wherein A is a (k+2)×(k+2)-dimensional transition matrix, A_{ij} being the score of a word transitioning from the i-th label to the j-th label and representing the transition probabilities of all word labels in sentence x; and solving the optimal path using the dynamic-programming Viterbi algorithm, outputting the word label sequence y = (y_1, y_2, ..., y_n) with the maximum probability for the predicted sequence.
2. The BBLC model-based travel named entity identification method according to claim 1, wherein step three specifically comprises: using the word embedding sequence (x_1, x_2, …, x_n) as the input of each time step of the bidirectional LSTM, and splicing position by position the hidden states \overrightarrow{h_i} output at each position by the forward LSTM with the hidden states \overleftarrow{h_i} output by the backward LSTM, h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], to obtain the complete hidden state sequence

(h_1, h_2, …, h_n) ∈ R^{n×m},

and then connecting a linear layer that maps the hidden state vectors from m dimensions to k dimensions, thereby obtaining the automatically extracted sentence feature matrix P = (p_1, p_2, ..., p_n) ∈ R^{n×k}.
3. The BBLC model-based travel named entity identification method according to claim 1, wherein after outputting the score that the label of the sentence x equals y, the method further comprises:

obtaining the normalized probability using Softmax:

P(y | x) = exp(score(x, y)) / \sum_{y'} exp(score(x, y'))

and training the CRF model by maximizing the log-likelihood, the log-likelihood L for the word-embedding and word-label sequence pair (x, y) being:

L(x, y) = score(x, y) − log \sum_{y'} exp(score(x, y'))

where the sum runs over all candidate word label sequences y' of the sentence x.
CN202010059415.1A 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model Active CN111310471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010059415.1A CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010059415.1A CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Publications (2)

Publication Number Publication Date
CN111310471A CN111310471A (en) 2020-06-19
CN111310471B true CN111310471B (en) 2023-03-10

Family

ID=71156490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010059415.1A Active CN111310471B (en) 2020-01-19 2020-01-19 Travel named entity identification method based on BBLC model

Country Status (1)

Country Link
CN (1) CN111310471B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737996B (en) * 2020-05-29 2024-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for obtaining word vector based on language model
CN111797624A (en) * 2020-06-30 2020-10-20 厦门大学附属第一医院 NPL-based automatic medicine business card extraction method
CN111785368A (en) * 2020-06-30 2020-10-16 平安科技(深圳)有限公司 Triage method, device, equipment and storage medium based on medical knowledge map
RU2760637C1 * 2020-08-31 2021-11-29 Public Joint Stock Company "Sberbank of Russia" (PJSC Sberbank) Method and system for retrieving named entities
CN112084769B (en) * 2020-09-14 2024-07-05 深圳前海微众银行股份有限公司 Dependency syntax model optimization method, apparatus, device and readable storage medium
CN112347240A (en) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and device, readable storage medium and electronic equipment
CN112733543A (en) * 2021-01-26 2021-04-30 上海交通大学 Organization named entity normalization method and system based on text editing generation model
CN112733526B (en) * 2021-01-28 2023-11-17 成都不问科技有限公司 Extraction method for automatically identifying tax collection object in financial file
CN112861537A (en) * 2021-02-02 2021-05-28 浪潮云信息技术股份公司 License entity extraction method and system based on deep learning
CN112800768A (en) * 2021-02-03 2021-05-14 北京金山数字娱乐科技有限公司 Training method and device for nested named entity recognition model
CN113158671B (en) * 2021-03-25 2023-08-11 胡明昊 Open domain information extraction method combined with named entity identification
CN113392659A (en) * 2021-06-25 2021-09-14 携程旅游信息技术(上海)有限公司 Machine translation method, device, electronic equipment and storage medium
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113657103B (en) * 2021-08-18 2023-05-12 哈尔滨工业大学 Non-standard Chinese express mail information identification method and system based on NER
CN114580424B (en) * 2022-04-24 2022-08-05 之江实验室 Labeling method and device for named entity identification of legal document
CN117236338B (en) * 2023-08-29 2024-05-28 北京工商大学 Named entity recognition model of dense entity text and training method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018052446A1 (en) * 2016-09-19 2018-03-22 Siemens Aktiengesellschaft Critical infrastructure forensics
CN108628823B (en) * 2018-03-14 2022-07-01 中山大学 Named entity recognition method combining attention mechanism and multi-task collaborative training
CN110083831B (en) * 2019-04-16 2023-04-18 武汉大学 Chinese named entity identification method based on BERT-BiGRU-CRF
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model

Also Published As

Publication number Publication date
CN111310471A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111310471B (en) Travel named entity identification method based on BBLC model
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112836046A (en) Four-risk one-gold-field policy and regulation text entity identification method
CN111209412A (en) Method for building knowledge graph of periodical literature by cyclic updating iteration
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN112528649B (en) English pinyin identification method and system for multi-language mixed text
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
He English grammar error detection using recurrent neural networks
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
CN112257442A (en) Policy document information extraction method based on corpus expansion neural network
CN113609840B (en) Chinese law judgment abstract generation method and system
Xue et al. A method of chinese tourism named entity recognition based on bblc model
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN116860959A (en) Extraction type abstract method and system combining local topic and hierarchical structure information
Hu et al. Corpus of Carbonate Platforms with Lexical Annotations for Named Entity Recognition.
Zhang et al. Multitask learning for chinese named entity recognition
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant