CN111310471A - Travel named entity identification method based on BBLC model


Info

Publication number: CN111310471A
Authority: CN (China)
Prior art keywords: word, model, sentence, sequence, label
Legal status: Granted
Application number: CN202010059415.1A
Other languages: Chinese (zh)
Other versions: CN111310471B (en)
Inventors: 薛乐义, 曹菡, 李鹏
Current Assignee: Shaanxi Normal University
Original Assignee: Shaanxi Normal University
Application filed by Shaanxi Normal University on 2020-01-19 (priority to CN202010059415.1A)
Publication of CN111310471A: 2020-06-19
Application granted; publication of CN111310471B: 2023-03-10
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks


Abstract

The invention discloses a tourism named entity recognition method based on a BBLC (BERT-BiLSTM-CRF) model, which comprises: step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set; step two, inputting the BIO label set into a BERT pre-trained language model and outputting the vector representation of each word in a sentence, i.e. the word embedding sequence of each sentence; step three, using the word embedding sequence as the input of each time step of a bidirectional LSTM for further semantic encoding to obtain a sentence feature matrix; and step four, taking the sentence feature matrix as the input of a CRF model, decoding the labels of a sentence x to obtain its word label sequence, outputting the probability that the label sequence of x equals y, solving the optimal path with the dynamic-programming Viterbi algorithm, and outputting the label sequence of maximum probability. By adding the BERT pre-trained language model, the invention can capture local context information; precision, recall and F value are higher, generalization ability and robustness are stronger, and the shortcomings of traditional models are overcome.

Description

Travel named entity identification method based on BBLC model
Technical Field
The invention belongs to the technical field of semantic recognition, and relates to a travel named entity recognition method based on a BBLC (BERT-BiLSTM-CRF) model.
Background
With the rise of the tourism industry, the volume of tourism data has become larger and larger. While this enriches the field, the complexity of information acquisition caused by mass data greatly reduces the efficiency with which people obtain information, so acquiring more useful travel information in a short time has become an important demand of travel in the big-data era. The large amount of structured information on existing travel websites provides great convenience, but the information that better reflects user preferences lies in texts such as travel notes, guides and comments, so extracting useful information from unstructured text is the focus of research; in essence, this requires improving the efficiency of named entity recognition in the Chinese tourism domain. Named Entity Recognition (NER) is a basic task of natural language processing whose goal is to accurately recognize person names, location names, organization names, dates and other information of interest in text, so as to provide useful information for natural language processing tasks such as information extraction, information retrieval, machine translation, entity coreference resolution, question answering, topic discovery and topic tracking. The Message Understanding Conference (MUC), first sponsored by the United States Defense Advanced Research Projects Agency (DARPA), introduced the named entity evaluation task at its sixth conference, held in September 1995; it covered organization names, person names and location names, while time identifiers included date phrases and time phrases. After many years of development, the effect of English named entity recognition has approached that of manual recognition: because English named entity recognition does not need to consider word segmentation, it is comparatively easy to implement. According to the evaluation results of ACE (Automatic Content Extraction) and MUC, the recall, precision and F value obtained by most current mainstream methods are basically above 90 percent.
The inherent particularity of Chinese makes lexical analysis necessary before text processing, which makes Chinese named entity recognition much more difficult than English. Early on, researchers mostly adopted methods based on statistics and rules, such as manually written rules, semi-supervised learning algorithms for named entity recognition based on conditional random field models, and fast deep-learning-based recognition algorithms for Chinese named entities. However, several problems remain, mainly the following:
(1) Entities are continuously updated and out-of-vocabulary words continuously appear, so a dictionary can hardly enumerate them. For example, it is impractical to put all city names and all person names in a dictionary, and as the times develop, the constant creation of new vocabulary makes named entity recognition even harder;
(2) Chinese words lack obvious feature markers; unlike English, there are no spaces or capitalization to distinguish them;
(3) Chinese named entities exhibit a large amount of nesting, such as compound place and organization names like "a certain district of a certain city" or "the computer science school of a certain university";
(4) A large number of polysemous words exist in Chinese; for example, "apple" may be a fruit name, a film title or a company name in different contexts, and such ambiguity causes many recognition errors;
(5) A named entity appearing for the first time in a Chinese text often appears later in abbreviated form; for example, the full name "a certain normal university" may subsequently be written simply as its abbreviation.
In current named entity recognition, most researchers use Word2Vec to obtain the word embedding of each word, but such embeddings cannot distinguish the context-dependent senses of the same word at encoding time, which causes a large number of recognition errors.
Therefore, providing a BBLC-model-based tourism named entity recognition method with high precision, recall and F value is the technical problem to be solved by those skilled in the art.
Disclosure of Invention
Aiming at the current state of research and its problems, the invention provides a tourism named entity recognition method based on a BBLC (BERT-BiLSTM-CRF) model. The Transformer in BERT adjusts the representation of each word embedding according to the semantics of its context words, thereby eliminating ambiguity; and, addressing the lack of a labeled dataset in the tourism field, BIO labeling is performed on tourism text data and NER experiments are run on the labeled set with the proposed BBLC model, which is strongly competitive on the named entity recognition problem for tourism data.
The specific scheme for achieving the purpose is as follows:
a tourism named entity identification method based on a BBLC model comprises the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model and outputting the vector representation of each word in the sentence, i.e. the word embedding sequence $(x_1, x_2, \ldots, x_n)$ of each sentence;
step three, using the word embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of each time step of a bidirectional LSTM for further semantic encoding, outputting the hidden state sequence, then attaching a linear layer that maps the hidden states to $k$ dimensions, obtaining the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, where $k$ is the number of labels in the BIO label set and each component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability that word $x_i$ is classified to the $j$-th label;
step four, taking the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$ as the input of a CRF model and performing label decoding on a sentence $x$ to obtain its word label sequence $y = (y_1, y_2, \ldots, y_n)$; the score of the label sequence $y$ of sentence $x$ is

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $A$ is a $(k+2) \times (k+2)$-dimensional transition matrix whose entry $A_{ij}$ scores the transition from the $i$-th label to the $j$-th label, representing the transition probabilities of all word labels in sentence $x$; the optimal path is solved with the dynamic-programming Viterbi algorithm, and the word label sequence $y = (y_1, y_2, \ldots, y_n)$ finally output is the one with the highest probability for the predicted sequence.
Preferably, step three specifically comprises: using the word embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of each time step of the bidirectional LSTM; splicing, position by position, the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden states $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the reverse LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence

$$(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m};$$

then attaching a linear layer that maps the hidden state vectors from $m$ dimensions to $k$ dimensions, thereby obtaining the automatically extracted sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
Preferably, after outputting the probability that the label sequence of sentence $x$ equals $y$, the method further comprises obtaining its normalized probability with Softmax:

$$P(y \mid x) = \frac{e^{s(x, y)}}{\sum_{y' \in Y_x} e^{s(x, y')}}$$

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair $(x, y)$, the log-likelihood $L$ is:

$$L = \log P(y \mid x) = s(x, y) - \log \sum_{y' \in Y_x} e^{s(x, y')}$$

where $Y_x$ denotes all candidate label sequences of sentence $x$, whose length $n$ is the number of words in $x$.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, local context information can be obtained by adding the BERT pre-training language model, the BERT pre-training language model transfers a large amount of traditional operations in a downstream specific NLP task to the pre-training word vector, and finally, after the used word embedding sequence is obtained, the obtained word embedding sequence is only required to be subjected to feature extraction and classification marking, so that the accuracy, recall rate and F value are higher, the generalization capability and robustness are stronger, the defects of the traditional model are favorably overcome, and a better effect can be obtained in the work of identifying the tour named entity.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the invention, and that for a person skilled in the art, other drawings can be obtained from the provided drawings without inventive effort.
FIG. 1 is a framework diagram of the BBLC model provided by the invention;
FIG. 2 is the accuracy curve during BBLC model training provided by an embodiment of the invention;
FIG. 3 is the loss function curve during BBLC model training provided by an embodiment of the invention;
FIG. 4 is a comparison of the person name recognition accuracy of the CRF, BiLSTM-CRF and BBLC models provided by an embodiment of the invention;
FIG. 5 is a comparison of the place name entity recognition accuracy of the CRF, BiLSTM-CRF and BBLC models;
FIG. 6 is a comparison of the organization entity recognition accuracy of the CRF, BiLSTM-CRF and BBLC models;
FIG. 7 is a comparison of the time entity recognition accuracy of the CRF, BiLSTM-CRF and BBLC models;
FIG. 8 is a comparison of the event entity recognition accuracy of the CRF, BiLSTM-CRF and BBLC models.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the framework of the BBLC model is as follows: the BERT pre-trained language model is first used to obtain the semantic representation of the input, yielding a vector representation of each word in a sentence; the word embedding sequence is then fed into a BiLSTM (bidirectional LSTM) for further semantic encoding; finally, decoding is performed by a CRF layer, and the label sequence of maximum probability is output.
The BERT pre-trained language model is a large-scale pre-trained language model based on a bidirectional Transformer. Such a pre-trained model can extract text information efficiently and be applied to various NLP tasks, and the Transformer in BERT adjusts the representation of each word embedding according to the semantics of the context words, thereby eliminating ambiguity.
BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. The pre-trained BERT representation can therefore be fine-tuned with just one additional output layer, and is suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference, without heavy task-specific architectural modification. The prior art severely restricted the ability to pre-train representations: its main limitation is that standard language models are unidirectional, which narrows the types of architecture usable during pre-training. Unlike left-to-right language model pre-training, the BERT model combines the advantages of both the OpenAI GPT and ELMo models.
The method comprises the following specific steps:
S1, perform BIO labeling on the sentences in the corpus to obtain the BIO label set. In the code, all data are packed into record form.
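As an illustration, a BIO-labeled sample might look like the following minimal sketch; the sentence, the LOC tag name and the file layout are invented for illustration and are not taken from the patent:

```python
# Minimal sketch of BIO labeling for one sentence (illustrative only).
# B- marks the first character of an entity, I- a continuation, O a non-entity word.
sentence = ["西", "安", "钟", "楼", "值", "得", "一", "去"]   # "The Xi'an Bell Tower is worth a visit"
labels   = ["B-LOC", "I-LOC", "B-LOC", "I-LOC", "O", "O", "O", "O"]

# Character/label pairs as they might appear in a training file, one pair per line.
for char, tag in zip(sentence, labels):
    print(f"{char}\t{tag}")
```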
S2, input the BIO label set into the BERT pre-trained language model and output the vector representation of each word in the sentence, i.e. the word embedding sequence $(x_1, x_2, \ldots, x_n)$ of each sentence. The input of the BERT pre-trained language model is a linear sequence: two sentences are divided by a separator, and two identifier tokens are added at the beginning and the end. Each word carries three embeddings: a token embedding, a segment embedding and a position embedding. The token embedding represents the current word; the segment embedding indicates which sentence the current word belongs to; the position embedding is the index embedding of the position of the current word. The three embeddings are summed to obtain the input sequence, which is then passed to the Transformer. The prototype of the Transformer consists of two independent mechanisms: an encoder responsible for receiving the text as input, and a decoder responsible for predicting the result of the task. Since the goal of BERT is to generate a language model, only the encoder mechanism is required. The encoder reads the entire text sequence at once, rather than sequentially from left to right or right to left; this feature enables the model to learn from both sides of each word, which amounts to a bidirectional function.
After the Transformer is invoked, it outputs a vector sequence in which each vector corresponds to the token with the same index; this sequence is then used as the input of the next layer of the network.
In this step, the invention fine-tunes the BERT pre-trained language model on the downstream task and then uses it to produce the semantic representation of the input: the correlation of each word in a sentence with all the words of that sentence is computed, and these correlations are taken to reflect, to some extent, the relevance and importance between the different words of the sentence. The embedding of each word is then obtained by adjusting the importance (weight) of each word using these correlations. This embedding, the output of the BERT pre-trained language model, encodes not only the word itself but also the relations of the other words to it. The word embedding sequence obtained with the BERT pre-trained language model then enters the BiLSTM-CRF model.
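As a concrete sketch of this step, the following code obtains a contextual embedding sequence from a pre-trained BERT model; the Hugging Face transformers package and the bert-base-chinese checkpoint are assumptions of this illustration, and the patent does not name a particular implementation:

```python
# Sketch: obtaining the word embedding sequence (x1, ..., xn) from pre-trained BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

sentence = "西安钟楼值得一去"
inputs = tokenizer(sentence, return_tensors="pt")  # adds the [CLS]/[SEP] identifiers

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, 768): one contextual
# vector per token, which serves as the embedding sequence fed to the BiLSTM.
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```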
S3, the first layer of the BiLSTM-CRF model is a bidirectional LSTM layer, which automatically extracts sentence features. The word embedding sequence $(x_1, x_2, \ldots, x_n)$ of a sentence obtained from BERT is used as the input of each time step of the bidirectional LSTM; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden states $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the reverse LSTM are spliced position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence

$$(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$$
after the setting is skipped, a linear layer is accessed, the hidden state vector is mapped to the dimension k from the dimension m, and k is the label number of the label set, so that the sentence characteristics which are automatically extracted are obtained and recorded as a matrix: p ═ P (P)1,p2,...,pn)∈Rn*k
pi∈RkFor each dimension pijIs a word xiThe probability value of classifying to the jth label is equivalent to classifying each position by k classes independently if the probability value of P is subjected to Softmax. However, when the positions are labeled, the labeled information cannot be used, so that a CRF layer is accessed for labeling next time.
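A minimal PyTorch sketch of this BiLSTM-plus-linear layer is given below; the dimensions (768-dimensional BERT embeddings, 128 hidden units, k = 11 labels) are illustrative assumptions, and the dropout rate of 0.5 matches Table 2 below:

```python
# Sketch of S3: bidirectional LSTM over BERT embeddings, then a linear layer
# mapping the m = 2*hidden-dimensional states to k label scores (illustrative dims).
import torch
import torch.nn as nn

n, emb_dim, hidden, k = 8, 768, 128, 11   # sentence length, embedding dim, hidden dim, labels

bilstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                 batch_first=True, bidirectional=True)
dropout = nn.Dropout(0.5)                  # Dropout_rate from Table 2
linear = nn.Linear(2 * hidden, k)          # maps m = 2*hidden dimensions to k

x = torch.randn(1, n, emb_dim)             # stands in for the BERT output
h, _ = bilstm(x)                           # (1, n, 2*hidden): forward and backward states concatenated
P = linear(dropout(h))                     # (1, n, k): the sentence feature matrix P
print(P.shape)
```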
S4, the second layer of the BiLSTM-CRF model is a CRF layer, which performs sentence-level sequence labeling. The parameter of the CRF layer is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ scores the transition from the $i$-th label to the $j$-th label and represents the one-step transition probabilities between all states of the word labels in sentence $x$; the labels already assigned can thus be used when labeling a position. The dimension is $k+2$ because a start state is added for the beginning of the sentence and an end state for its end. Writing a label sequence of length equal to the sentence length as $y = (y_1, y_2, \ldots, y_n)$, the model's score for the label sequence $y$ of sentence $x$ is:

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

It can be seen that the score of the whole sequence equals the sum of the scores of the individual positions, and that the score of each position comes from two parts: one part is the $p_i$ output by the LSTM, the other is determined by the transition matrix $A$ of the CRF. The normalized probability that the label sequence of sentence $x$ equals $y$ is then obtained with Softmax:

$$P(y \mid x) = \frac{e^{s(x, y)}}{\sum_{y' \in Y_x} e^{s(x, y')}}$$

The CRF model is trained by maximizing the log-likelihood; for a word-embedding and word-label sequence pair $(x, y)$, the log-likelihood $L$ is:

$$L = \log P(y \mid x) = s(x, y) - \log \sum_{y' \in Y_x} e^{s(x, y')}$$

where $Y_x$ denotes all candidate label sequences of sentence $x$, whose length $n$ is the number of words in $x$.
During prediction (decoding), the model solves for the optimal path with the dynamic-programming Viterbi algorithm and outputs the word label sequence with the highest probability for the predicted sequence:

$$y^* = \arg\max_{y' \in Y_x} s(x, y')$$
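The toy sketch below implements this decoding step, running the Viterbi recursion over an emission matrix P and a transition matrix A with the added start and end states; the random parameter values are placeholders for illustration only:

```python
# Sketch: Viterbi decoding of y* = argmax_y s(x, y) by dynamic programming.
import torch

n, k = 8, 11                    # sentence length, number of labels
P = torch.randn(n, k)           # emission scores from the BiLSTM + linear layer
A = torch.randn(k + 2, k + 2)   # (k+2) x (k+2) transitions incl. start/end states
START, END = k, k + 1

def viterbi_decode(P, A):
    """Return the label sequence with the highest score s(x, y)."""
    n, k = P.shape
    # delta[j]: best score of any label path ending in label j at the current position
    delta = A[START, :k] + P[0]
    backpointers = []
    for i in range(1, n):
        # scores[prev, cur] = best path to prev + transition prev->cur + emission of cur
        scores = delta.unsqueeze(1) + A[:k, :k] + P[i].unsqueeze(0)
        delta, best_prev = scores.max(dim=0)
        backpointers.append(best_prev)
    delta = delta + A[:k, END]            # transition into the end state
    best_last = delta.argmax().item()
    path = [best_last]
    for bp in reversed(backpointers):     # follow the backpointers to recover the path
        path.append(bp[path[-1]].item())
    return list(reversed(path))

print(viterbi_decode(P, A))               # one label index per word
```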
examples
This embodiment uses BIO tagging to mark entities. Under the BIO scheme, B marks the beginning of an entity, I marks the inside of an entity, and O marks other, non-entity words.
The authoritative SIGHAN 2006 Bakeoff-3 corpus contains labels for only three entity types, PER, LOC and ORG, denoting person names, location names and organization names respectively. To represent the various entities of tourism data comprehensively, time and other entities also need to be labeled. Therefore, text data such as travel notes, guides and comments were crawled from travel websites such as Ctrip and Mafengwo, totaling 13464 Chinese sentences; after BIO labeling of the 15431 entities of five categories in these texts, they were used as the experimental dataset. The correspondence between labels and entities is shown in Table 1:
TABLE 1 entity tag set
(Table 1, which maps the BIO labels to the five entity categories of person, location, organization, time and event, is reproduced as an image in the original publication.)
This example was run on the label set using the BBLC model, with the BiLSTM-CRF and CRF models run on the same label set as comparison models.
The experimental parameter settings are shown in Table 2:
TABLE 2 Experimental parameter settings

Parameter       Value
Batch_size      64
Learning_rate   1e-5
Train_epochs    10
Dropout_rate    0.5
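For reference, these Table 2 settings could be passed to a training routine as a simple configuration mapping; the surrounding training loop is assumed and not taken from the patent:

```python
# The Table 2 hyperparameters as a training configuration (illustrative packaging).
train_config = {
    "batch_size": 64,
    "learning_rate": 1e-5,
    "train_epochs": 10,
    "dropout_rate": 0.5,
}
```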
For each entity type, the extraction results are generally measured by Precision, Recall and F value. Let NS be the total number of entities in the annotation set, NE the total number of entities identified by the model, and NT the number of entities the model identified correctly. Then:

$$Precision = \frac{NT}{NE}$$

$$Recall = \frac{NT}{NS}$$

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
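A small helper computing these three measures from the counts defined above; the NE and NT values in the example call are invented for illustration (NS = 15431 is the dataset total reported earlier):

```python
# Precision = NT/NE, Recall = NT/NS, F = 2PR/(P+R), from the counts defined above.
def precision_recall_f(NS, NE, NT):
    precision = NT / NE
    recall = NT / NS
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Illustrative counts only: NS from the dataset above, NE and NT made up.
print(precision_recall_f(NS=15431, NE=14800, NT=13900))
```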
the accompanying figures 2-3 of the specification show a correct rate change curve and a Loss function curve in the BBLC model training process.
Identification of person names
In person name recognition, the CRF model with well-chosen feature templates outperforms the BiLSTM-CRF and BBLC models used directly, because BiLSTM-CRF and BBLC are more complex while the characters of person names are flexible and the names are short, so the complex models overfit easily.
Identification of place names and organization names
Comparing the experimental data with FIGS. 5-6, the accuracy of BBLC and BiLSTM-CRF on place and organization names is much higher than that of CRF, owing to the complexity and length of place and organization names (e.g., specially abbreviated place names and country names). Moreover, a place name and an organization name are often the same expression whose difference depends on the context, so the BBLC model with BERT recognizes them much better than the other two models.
Identification of time and event entities
Likewise, FIGS. 7-8 show that time is expressed in varied ways; for example, "today" and "March 22, 2019" are both time entities, so the results of BBLC and BiLSTM-CRF are clearly better than those of CRF. Moreover, since time recognition often needs to be combined with context, the BBLC model gives the best recognition results.
Aiming at the inadequacy of manually constructed features for named entities in the tourism field, this embodiment performs BIO labeling of five entity types on crawled tourism text data, then proposes the BBLC model for the word-sense problem in Chinese named entities, and verifies it experimentally on the labeled tourism dataset. The results show that, compared with BiLSTM-CRF and CRF, adding the BERT module captures local context information, so precision, recall and F value are higher, generalization ability and robustness are stronger, the shortcomings of traditional models are overcome, and better results can be obtained in tourism named entity recognition.
The tourism named entity recognition method based on a BBLC model provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and application scope according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (3)

1. A tourism named entity recognition method based on a BBLC model, characterized by comprising the following steps:
step one, performing BIO labeling on the sentences in a corpus to obtain a BIO label set;
step two, inputting the BIO label set into a BERT pre-trained language model and outputting the vector representation of each word in the sentence, i.e. the word embedding sequence $(x_1, x_2, \ldots, x_n)$ of each sentence;
step three, using the word embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of each time step of a bidirectional LSTM for further semantic encoding, outputting the hidden state sequence, then attaching a linear layer that maps the hidden states to $k$ dimensions, obtaining the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, where $k$ is the number of labels in the BIO label set and each component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability that word $x_i$ is classified to the $j$-th label;
step four, taking the sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$ as the input of a CRF model and performing label decoding on a sentence $x$ to obtain its word label sequence $y = (y_1, y_2, \ldots, y_n)$; the score of the label sequence $y$ of sentence $x$ is

$$s(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $A$ is a $(k+2) \times (k+2)$-dimensional transition matrix whose entry $A_{ij}$ scores the transition from the $i$-th label to the $j$-th label, representing the transition probabilities of all word labels in sentence $x$; the optimal path is solved with the dynamic-programming Viterbi algorithm, and the word label sequence $y = (y_1, y_2, \ldots, y_n)$ output is the one with the highest probability for the predicted sequence.
2. The BBLC-model-based tourism named entity recognition method as claimed in claim 1, characterized in that step three specifically comprises: using the word embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of each time step of the bidirectional LSTM; splicing, position by position, the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden states $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the reverse LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence

$$(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m};$$

then attaching a linear layer that maps the hidden state vectors from $m$ dimensions to $k$ dimensions, thereby obtaining the automatically extracted sentence feature matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$.
3. The BBLC-model-based tourism named entity recognition method as claimed in claim 1, characterized by further comprising, after outputting the probability that the label sequence of sentence $x$ equals $y$, obtaining its normalized probability with Softmax:

$$P(y \mid x) = \frac{e^{s(x, y)}}{\sum_{y' \in Y_x} e^{s(x, y')}}$$

and training the CRF model by maximizing the log-likelihood; for a word-embedding and word-label sequence pair $(x, y)$, the log-likelihood $L$ is:

$$L = \log P(y \mid x) = s(x, y) - \log \sum_{y' \in Y_x} e^{s(x, y')}$$
Priority Applications (1)

CN202010059415.1A, filed 2020-01-19 (priority date 2020-01-19): Travel named entity identification method based on BBLC model. Status: Active. Granted as CN111310471B.

Publications (2)

CN111310471A, published 2020-06-19
CN111310471B, granted and published 2023-03-10

Family ID: 71156490 (CN)



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant