CN114638227A - Named entity identification method, device and storage medium - Google Patents

Named entity identification method, device and storage medium

Info

Publication number
CN114638227A
Authority
CN
China
Prior art keywords
token
words
word
training
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011477961.3A
Other languages
Chinese (zh)
Inventor
王惠欣
胡珉
高扬
李飞
黄河燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Beijing Institute of Technology BIT
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
Beijing Institute of Technology BIT
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, Beijing Institute of Technology BIT, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN202011477961.3A priority Critical patent/CN114638227A/en
Publication of CN114638227A publication Critical patent/CN114638227A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The invention discloses a named entity recognition method, device and storage medium. The named entity recognition method comprises the following steps: pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively; after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer; connecting a softmax classification layer in series above the last Transformer layer; after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities; and, according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value. The invention improves model performance, captures truly bidirectional context information, and supplements and encodes the entity slot so that this information is reasonably utilized.

Description

Named entity identification method, device and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a named entity identification method, apparatus, and storage medium.
Background
NER (Named Entity Recognition) refers to recognizing entities with specific meaning in text or character strings, mainly names of people, places and organizations, proper nouns, and the like. Judging whether a named entity is correctly recognized involves two aspects: whether the entity's boundary is correct, and whether the entity's type is correctly labeled. English named entities carry a fairly obvious formal marker (the first letter of each word in the entity is capitalized), so entity boundary recognition is relatively easy and the focus of the task is determining the entity's category. Compared with English, the Chinese named entity recognition task is more complex, and recognizing entity boundaries is harder than the entity-class labeling subtask.
The existing named entity recognition methods mainly include rule-based methods and statistics-based methods.
Early named entity recognition was mostly rule-based, largely relying on rule templates constructed by linguists, choosing features such as statistical information, punctuation marks, keywords, indicator words, direction words, position words (e.g., word endings) and head words, and taking pattern-to-string matching as the main means. Such methods are mainly used in specific settings where the features generalize easily.
Statistics-based methods are, at root, classification methods: several classes of named entities are defined, and a model classifies the entities in the text. Two lines of thinking can be distinguished. One recognizes the boundaries of all named entities in the text and then classifies them. The other is sequence labeling: each word in the text may carry several candidate category labels, which correspond to the word's possible positions within various named entities; the NER task then becomes automatically labeling the sequence of words in the text, after which the labels are assembled to obtain the named entities, composed of several words, together with their categories. Among these, sequence labeling is the most effective and most widely used NER method. Typical methods include SVM (Support Vector Machine), ME (Maximum Entropy), HMM (Hidden Markov Model), CRF (Conditional Random Field), neural networks, and the like.
The prior art has the following defect: owing to inherent limitations in the principles of the various named entity recognition methods, existing recognition models suffer from an entity boundary problem.
Disclosure of Invention
The invention provides a named entity recognition method, device and storage medium, which are used to solve the entity boundary problem present in word-based named entity recognition models.
The invention provides the following technical scheme:
a named entity recognition method, comprising:
pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
connecting a softmax classification layer in series above the last Transformer layer;
after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value.
In an implementation, the Token supplementary information is an average of corresponding word vectors of one or a combination of the following information that can be collected: known entity definitions, known entity description information, structured knowledge-graph information corresponding to known entities.
In implementation, the embedding of the BERT model input is the sum of the following three representations: word representation, positional representation, segment representation.
In implementation, the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
In implementation, the pre-training is the pre-training of the Masked Language Model.
In the implementation, before the pre-training using the BERT model, the method further includes:
a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (i.e., whether the second is the next sentence) is trained in advance.
In an implementation, the method further comprises the following steps:
jointly retraining the parameters obtained by the pre-training.
A named entity recognition apparatus comprising:
a processor for reading the program in the memory, performing the following processes:
pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
connecting a softmax classification layer in series above the last Transformer layer;
after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value;
a transceiver for receiving and transmitting data under the control of the processor.
In an implementation, the Token supplementary information is an average of corresponding word vectors of one or a combination of the following information that can be collected: known entity definitions, known entity description information, structured knowledge graph information corresponding to known entities.
In implementation, the embedding of the BERT model input is the sum of the following three representations: word representation, positional representation, segment representation.
In implementation, the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
In implementation, the pre-training is the pre-training of the Masked Language Model.
In the implementation, before the pre-training using the BERT model, the method further includes:
a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (i.e., whether the second is the next sentence) is trained in advance.
In an implementation, the method further comprises the following steps:
jointly retraining the parameters obtained by the pre-training.
A named entity recognition apparatus comprising:
the pre-training module, configured to pre-train with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
the Transformer module, configured to, after the pre-training is completed, use the last Transformer layer of the output part to concatenate token supplementary information to each hidden-layer token at the last hidden layer;
the softmax module, configured to connect a softmax classification layer in series above the last Transformer layer;
the probability module, configured to convert each token's word-based classification probability into character-based label probabilities after obtaining the character-based and word-based classification probabilities of each token;
and the label module, configured to take, for each token, the highest of the character-based and word-based classification probabilities as the token's label value.
In an implementation, the Transformer module is further configured to take, as the token supplementary information, the average of the word vectors corresponding to one or a combination of the following collectable information: known entity definitions, known entity description information, and structured knowledge-graph information corresponding to known entities.
In an implementation, the pre-training module is further configured such that the embedding of the BERT model input is the sum of the following representations: word representation, positional representation, segment representation.
In an implementation, the pre-training module is further configured such that, in the embedding of the BERT model input:
the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
In implementation, the pre-training module is further used for pre-training with the Masked Language Model.
In an implementation, the pre-training module is further configured to train in advance, before pre-training with the BERT model, a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (next-sentence prediction).
In implementation, the pre-training module is further configured to perform joint retraining on the parameters obtained by the pre-training.
A computer-readable storage medium having stored thereon a computer program for executing the above named entity recognition method.
The invention has the following beneficial effects:
in the technical scheme provided by the embodiments of the invention, because the token is pre-trained with the BERT model in two ways, as characters and as segmented words, the character- and word-based BERT model reduces the forward propagation of word-segmentation errors on unregistered (out-of-vocabulary) entities and addresses the entity boundary problem of character-based named entity recognition models, thereby improving model performance;
because a Transformer is used, the method is more efficient and captures longer-range dependencies than an RNN, and, compared with previous pre-training models, it captures truly bidirectional context information;
for the entity boundary problem of the character-based BI-LSTM-CRF model, determining the entity boundary through character-word fusion and adjustment further improves model performance;
because token supplementary information can be concatenated to each hidden-layer token at the last hidden layer, external supplementary information such as entity definitions, entity description information, and structured knowledge-graph information corresponding to the entity can be used to supplement and encode the entity slot, so this information is reasonably utilized;
furthermore, because the parameters obtained by pre-training are jointly retrained, the model can be adapted from the pre-trained model, which mitigates the difficulty existing entity recognition methods have in obtaining a good model when training data are scarce.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart illustrating an implementation of a named entity recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a BERT model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a pre-training model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a named entity recognition apparatus 1 according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a named entity recognition apparatus 2 according to an embodiment of the present invention.
Detailed Description
The inventor notices in the process of invention that:
the existing named entity identification method mainly comprises the following steps: rule-based methods, statistical-based methods. The early named entity recognition is mostly based on a rule method, most of which adopts a rule template constructed by linguists, selects methods with characteristics including statistical information, punctuation marks, keywords, indicator words, direction words, position words (such as tails), central words and the like, and takes matching of a mode and a character string as a main means. While the extracted rules reflect linguistic phenomena more accurately, rule-based methods perform better than statistical-based methods. However, these rules often depend on specific languages, domains and text styles, are time-consuming in programming, are difficult to cover all linguistic phenomena, are particularly prone to errors, are not well portable, and require linguistic experts to rewrite the rules for different systems. Therefore, the method is mainly used in specific occasions where some features are easy to generalize.
Statistics-based methods are, at root, classification methods: several classes of named entities are defined, and a model classifies the entities in the text. Two lines of thinking can be distinguished. One recognizes the boundaries of all named entities in the text and then classifies them. The other is sequence labeling: each word in the text may carry several candidate category labels, which correspond to the word's possible positions within various named entities; the NER task then becomes automatically labeling the sequence of words in the text, after which the labels are assembled to obtain the named entities, composed of several words, together with their categories. Among these, sequence labeling is the most effective and most widely used NER method. Typical methods include SVM (Support Vector Machine), ME (Maximum Entropy), HMM (Hidden Markov Model), CRF (Conditional Random Field), neural networks, and the like.
The HMM is a probabilistic directed graph model built on two assumptions, the first-order Markov assumption and the observation-independence assumption. The first-order Markov assumption states that y_t at the current time is generated from y_{t-1} at the previous time, so the model can only exploit the preceding context, not the following context, which limits its expressive power. The CRF is a probabilistic undirected graph model; it requires defining feature templates, which are matched by scanning the whole sentence, and the features of the whole sequence are obtained as a linear weighted combination of local features. It cannot flexibly exploit long-range context, though it can flexibly use local context features; moreover, its model form can cover the parameters of the HMM, so it is more expressive than the HMM. Among neural network methods, the bidirectional Long Short-Term Memory network (BI-LSTM) is very strong at sequence modeling: it captures long-range context and has the nonlinear fitting capacity of a neural network, which conventional methods cannot match. However, it lacks entity boundary features, errs easily when determining entity boundaries, and does not reasonably exploit external supplementary information such as entity definitions. In addition, such methods need more labeled training data, and with little training data the network is hard to train well.
The drawback of the prior art is that, owing to inherent limitations in the principles of these methods, the existing recognition methods are prone to recognition errors.
As mentioned above, the existing Chinese named entity recognition methods mainly have the following problems:
1. Rule-based methods depend on the specific language, domain and text style; writing the rules is time-consuming and can hardly cover all linguistic phenomena; they are particularly error-prone, the system is poorly portable, and linguistic experts must rewrite the rules for different systems.
2. A word-based BI-LSTM-CRF model must first segment the text or character string with a word segmentation tool. Segmentation tends to err on out-of-vocabulary words, and names of people, places and organizations and proper nouns are typically out-of-vocabulary, so the segmentation errors propagate forward and harm the model's performance.
3. Although the character-based BI-LSTM-CRF model performs better on Chinese named entity recognition than the word-based BI-LSTM-CRF model, it lacks entity boundary information and is very prone to entity-boundary errors.
4. External supplementary information such as entity definitions, entity description information and knowledge-graph information corresponding to the entity is not reasonably utilized; the structured entity information in the knowledge graph is not used.
5. Existing entity recognition methods need a relatively large amount of labeled training data; with little training data the network is hard to train well and an ideal effect is hard to obtain.
Based on this, the technical solutions provided in the embodiments of the present invention will solve at least one of the above problems, and the following describes a specific embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flow chart of an implementation of the named entity identification method, as shown in the figure, the implementation may include:
Step 101: pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
Step 102: after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
Step 103: connecting a softmax classification layer in series above the last Transformer layer;
Step 104: after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
Step 105: according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value.
First, the implementation of the BERT model structure will be explained.
Fig. 2 is a schematic structural diagram of the BERT model. As shown in the figure, the model is pre-trained with BERT in two ways, with tokens taken as characters and as segmented words; after the pre-training is completed, using the last Transformer layer of the output part, token supplementary information is concatenated to each hidden-layer token at the last hidden layer. A softmax (soft maximum) classification layer is connected in series above the positions of the last layer. After the character-based and word-based classification probabilities of each token are obtained, each token's word-based classification probability is converted into character-based label probabilities; finally, the character-based and word-based classification probabilities are compared, and the highest value is taken as the token's label value.
BERT is a pre-trained model. Suppose a training set A exists: the network is pre-trained on A, and the network parameters learned on task A are saved for later use. When a new task B arrives, the same network structure is adopted; when initializing the network, the parameters learned on A can be loaded while the other, higher layers are initialized randomly, and the network is then trained with task B's training data. If the loaded parameters are kept unchanged during this training, the approach is called "freeze"; if the loaded parameters keep changing as task B trains, it is called "fine-tuning", i.e., the parameters are further adjusted to better suit the current task B.
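As a minimal illustrative sketch (not part of the patent text; the Hugging Face transformers library and the public bert-base-chinese checkpoint are assumptions made here), the freeze/fine-tuning distinction can be expressed as follows:

    from transformers import BertModel

    # Load the parameters learned during pre-training (task A).
    bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint

    # "freeze": the loaded parameters stay fixed while only task B's new layers train.
    for p in bert.parameters():
        p.requires_grad = False

    # "fine-tuning": the loaded parameters keep updating on task B's training data,
    # adjusting them to better suit the current task B.
    for p in bert.parameters():
        p.requires_grad = True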
The core of this model is the attention mechanism: for a sentence, multiple points of focus can be active at the same time, without being limited to front-to-back or back-to-front sequential processing. Not only must the structure of the model be chosen correctly, its parameters must also be trained correctly so that the model can accurately understand the semantics of a sentence. BERT uses two steps to train the model's parameters correctly.
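A minimal sketch of the scaled dot-product attention that underlies this mechanism (the standard Transformer formulation; the implementation below is illustrative and not taken from the patent):

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (batch, seq_len, d_k). Every position attends to every other
        # position in parallel, so context flows in both directions at once.
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
        weights = F.softmax(scores, dim=-1)   # one attention distribution per position
        return torch.matmul(weights, v)       # weighted sum of the value vectors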
The first step covers 15% of the tokens in an article and asks the model to predict the covered words from the full surrounding context. Suppose there are 10,000 articles with an average of 100 words each, and 15% of the words are randomly covered; the model's task is then to correctly predict these 150,000 covered words. By predicting the occluded vocabulary from both directions, the parameters of the Transformer model are preliminarily trained.
The second step then continues to train the model's parameters. For example, from those 10,000 articles, 200,000 pairs of sentences, 400,000 sentences in total, are selected. When choosing the sentence pairs, half of them are two consecutive context sentences and the other half are not consecutive. The Transformer model is then asked to judge, for these 200,000 pairs, which are consecutive and which are not.
These two training steps together are called pre-training. After training, the Transformer model, including its parameters, constitutes the general-purpose language representation model.
The token is pre-trained in two modes, characters and segmented words; after the pre-training is completed, using the last Transformer layer of the output part, token supplementary information is concatenated to each hidden-layer token at the last hidden layer. A softmax classification layer is connected in series above the positions of the last layer. After the character-based and word-based classification probabilities of each token are obtained, each token's word-based classification probability is converted into character-based label probabilities; finally, the character-based and word-based classification probabilities are compared, and the highest value is taken as the token's label value.
In implementation, the Token supplementary information is an average of corresponding word vectors of one or a combination of the following information that can be collected: known entity definitions, known entity description information, structured knowledge-graph information corresponding to known entities.
Specifically, the Token supplementary information is the average of word vectors corresponding to the collected known entity definitions, entity description information, and structured knowledge graph information corresponding to the entities.
The following describes an input implementation of the model.
In implementation, the embedding of the BERT model input is the sum of the following three representations: word representation, positional representation, segment representation.
In a specific implementation, the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
The token is pre-trained in both the character mode and the segmented-word mode. All available Chinese corpora are used for pre-training.
FIG. 3 is a schematic structural diagram of the pre-training model. As shown in the figure, the embedding of the model input is the sum of three representations, detailed below (a code sketch follows the list): word representation, position representation, segment representation.
1) Word representation (word/character vectorization): the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors, i.e., word embeddings (converting the numerically indexed token into a fixed-size vector), are trained on the Chinese corpus to obtain the token's vectorized representation.
2) Positional representation (position vector): the position information is likewise embedded to obtain a vector representation of the position. The sequence length is at most 512.
3) Segment representation (sentence vector): for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
4) The beginning of a sentence is marked with [CLS], the end with [SEP], and the boundary between the two sentences in a sentence pair is also marked with [SEP].
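A minimal sketch of this three-way summation (the vocabulary size, hidden size and segment count below are placeholders, not values given in the patent):

    import torch
    import torch.nn as nn

    class BertInputEmbedding(nn.Module):
        def __init__(self, vocab_size=21128, hidden=768, max_len=512, n_segments=2):
            super().__init__()
            self.token = nn.Embedding(vocab_size, hidden)     # word/character representation
            self.position = nn.Embedding(max_len, hidden)     # position representation (sequence length <= 512)
            self.segment = nn.Embedding(n_segments, hidden)   # sentence A / sentence B

        def forward(self, token_ids, segment_ids):
            # token_ids, segment_ids: (batch, seq_len); the three embeddings are summed element-wise.
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            return (self.token(token_ids)
                    + self.position(positions).unsqueeze(0)
                    + self.segment(segment_ids))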
The implementation of model pre-training is described below.
In implementation, the pre-training is the pre-training of the Masked Language Model.
Specifically, the following may be mentioned:
Task #1: Masked LM
To train bidirectional features, the Masked Language Model pre-training method randomly masks some tokens in a sentence (e.g., 15%) and then trains the model to predict the removed tokens. The vector produced for the hidden-layer token at the last hidden layer is fed into softmax to compute a probability over every word in the dictionary.
The specific operation is as follows:
15% of the tokens in the corpus are randomly masked, and the final hidden vectors at the masked positions are fed into softmax to predict the original tokens.
Always replacing the selected token with the marker [MASK] would bias the model, so the following strategy is adopted when masking at random (a sketch follows the examples):
1) 80% of the words are replaced by the [MASK] token:
yesterday saw the movie Avatar → yesterday saw the movie [MASK].
2) 10% of the words are replaced by an arbitrary word:
yesterday saw the movie Avatar → yesterday saw the movie Green Book.
3) 10% of the words are left unchanged:
yesterday saw the movie Avatar → yesterday saw the movie Avatar.
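A minimal sketch of this masking strategy (the token list, vocabulary and [MASK] symbol handling are assumptions; the patent gives no implementation):

    import random

    def mask_tokens(tokens, vocab, mask_rate=0.15, mask_symbol="[MASK]"):
        # Randomly select about 15% of positions; of those, replace 80% with [MASK],
        # 10% with an arbitrary word, and leave 10% unchanged. The model must predict
        # the original token at every selected position.
        targets = {}
        for i, tok in enumerate(tokens):
            if random.random() >= mask_rate:
                continue
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                tokens[i] = mask_symbol            # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(vocab)   # 10%: replace with an arbitrary word
            # remaining 10%: keep the original token
        return tokens, targets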
Task 2 #: prediction of next sentence
In implementation, before the pre-training with the BERT model, the method may further include:
training in advance a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (i.e., whether the second is the next sentence).
Specifically, in order to train a model that understands long-range context and the relationship between sentences, such a binary next-sentence classification model over sentences generated from a Chinese corpus may be trained in advance, for example as follows:
First prepare a training set of pairs (sentence A, sentence B), where in 50% of cases B is the next sentence of A and in 50% of cases B is another sentence drawn at random from the rest of the corpus. The label is the relationship between the two sentences (next / not next); the beginning of a sentence is marked with [CLS], the end with [SEP], and the boundary between the two sentences in a pair is also marked with [SEP] (a construction sketch follows the examples). For example:
Input = [CLS] yesterday watched a movie [MASK] [SEP]
Microsoft released the robot XiaoIce [MASK] [SEP]
Label = NotNext
Input = [CLS] yesterday watched a movie [MASK] [SEP]
[MASK] really liked it [SEP]
Label = IsNext
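A minimal sketch of constructing such training pairs (the corpus layout and character-level tokenization are assumptions):

    import random

    def make_nsp_example(sentences, i):
        # sentences: ordered list of sentences from one document.
        # 50% of the time B is the true next sentence (IsNext); otherwise B is a
        # sentence drawn at random from the corpus (a stricter version would exclude
        # the true next sentence), labeled NotNext.
        a = sentences[i]
        if random.random() < 0.5 and i + 1 < len(sentences):
            b, label = sentences[i + 1], "IsNext"
        else:
            b, label = random.choice(sentences), "NotNext"
        tokens = ["[CLS]"] + list(a) + ["[SEP]"] + list(b) + ["[SEP]"]
        return tokens, label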
The following describes the implementation of the fine-tuning model.
The slot information, i.e., the token supplementary information, may be the average of the word vectors corresponding to the known entity definitions, entity description information, and structured knowledge-graph information corresponding to the entity that can be collected. For example, for "time to watch a movie", the token supplementary information feature is the coding vector for the token obtained by averaging the word vectors of all words in the describing sentence.
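A minimal sketch of computing the supplementary vector by averaging word vectors (the word-vector lookup table and its dimension are placeholders):

    import numpy as np

    def supplementary_vector(description_words, word_vectors, dim=300):
        # Average the vectors of all words in the collected description text
        # (entity definition, entity description, or knowledge-graph facts about the entity).
        vecs = [word_vectors[w] for w in description_words if w in word_vectors]
        if not vecs:
            return np.zeros(dim)          # nothing collected: fall back to a zero vector
        return np.mean(vecs, axis=0)      # this average is the token supplementary information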
Each word's corresponding position in the last Transformer layer of the output part is classified. For the sequence-level classification problem, a layer W (K × H) is added on top of the original BERT model, where K is the number of classes to be predicted and H is the output dimension of the Transformer's last layer; a softmax layer then predicts the class probability P (K-dimensional) = softmax(C W^T).
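A minimal sketch of the added layer (the concatenation with the supplementary vector and the dimensions below are assumptions made for illustration):

    import torch
    import torch.nn as nn

    class TokenClassifier(nn.Module):
        # W has shape (K, H'): K label classes, H' = Transformer output dimension
        # plus the size of the spliced token supplementary vector.
        def __init__(self, hidden=768, supp_dim=300, num_labels=7):
            super().__init__()
            self.w = nn.Linear(hidden + supp_dim, num_labels, bias=False)

        def forward(self, last_hidden, supp):
            c = torch.cat([last_hidden, supp], dim=-1)    # splice supplementary info onto the last hidden layer
            return torch.softmax(self.w(c), dim=-1)       # P = softmax(C W^T), per token

The joint retraining described next would then optimize the pre-trained BERT parameters and this new W together, e.g. with a single optimizer over all parameters.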
In the implementation, the method can further comprise the following steps:
jointly retraining the parameters obtained by the pre-training.
All parameters, including BERT's original pre-trained parameters and the new W parameters, are jointly retrained, with the goal of reducing the distance between the probability predicted by the model and the true probability.
The following describes the implementation of label adjustment.
After the character-based and word-based classification probabilities of each token are obtained, each token's word-based classification probability is converted into character-based label probabilities. For example, the word 小明 labeled B-PER with probability p1 is converted to 小: B-PER with probability p1 and 明: I-PER with probability p1.
Finally, the character-based and word-based classification probabilities are compared, and the highest value is taken as the token's label value.
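A minimal sketch of this conversion and fusion (the BIO label scheme and the (character, label, probability) layout are assumptions):

    def word_to_char_probs(word, label, prob):
        # Spread a word-level label probability onto the word's characters: the first
        # character keeps the B-* label, the rest become I-*, all with the same probability.
        out = []
        for i, ch in enumerate(word):
            tag = "I-" + label[2:] if label.startswith("B-") and i > 0 else label
            out.append((ch, tag, prob))
        return out

    def fuse(char_pred, word_pred):
        # Per character, keep whichever of the character-based and (converted) word-based
        # predictions has the higher probability.
        return [c if c[2] >= w[2] else w for c, w in zip(char_pred, word_pred)]

    # Example: word_to_char_probs("小明", "B-PER", 0.9)
    # -> [("小", "B-PER", 0.9), ("明", "I-PER", 0.9)]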
TABLE 1 Part-of-speech category comparison
Part of speech    Meaning
n                 Noun
nr                Person name
ns                Place name
nt                Organization name
nz                Other proper noun
Based on the same inventive concept, the embodiment of the present invention further provides a named entity recognition apparatus and a computer-readable storage medium, and because the principles of these apparatuses for solving the problems are similar to the named entity recognition method, the implementation of these apparatuses can refer to the implementation of the method, and the repeated details are not repeated.
When the technical scheme provided by the embodiment of the invention is implemented, the implementation can be carried out as follows.
Fig. 4 is a schematic diagram of a named entity recognition apparatus 1, as shown in the figure, the apparatus includes:
the processor 400, which is used to read the program in the memory 420, executes the following processes:
pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
connecting a softmax classification layer in series above the last Transformer layer;
after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value;
a transceiver 410 for receiving and transmitting data under the control of the processor 400.
In an implementation, the Token supplementary information is an average of corresponding word vectors of one or a combination of the following information that can be collected: known entity definitions, known entity description information, structured knowledge-graph information corresponding to known entities.
In implementation, the embedding of the BERT model input is the sum of the following three representations: word representation, positional representation, segment representation.
In implementation, the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
In implementation, the pre-training is the pre-training of the Masked Language Model.
In the implementation, before the pre-training using the BERT model, the method further includes:
a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (i.e., whether the second is the next sentence) is trained in advance.
In an implementation, the method further comprises the following steps:
jointly retraining the parameters obtained by the pre-training.
Where in fig. 4, the bus architecture may include any number of interconnected buses and bridges, with various circuits of one or more processors, represented by processor 400, and memory, represented by memory 420, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 410 may be a number of elements including a transmitter and a receiver that provide a means for communicating with various other apparatus over a transmission medium. The processor 400 is responsible for managing the bus architecture and general processing, and the memory 420 may store data used by the processor 400 in performing operations.
Fig. 5 is a schematic diagram of a named entity recognition apparatus 2, as shown in the figure, the apparatus includes:
the pre-training module 501, configured to pre-train with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
the Transformer module 502, configured to, after the pre-training is completed, use the last Transformer layer of the output part to concatenate token supplementary information to each hidden-layer token at the last hidden layer;
the softmax module 503, configured to connect a softmax classification layer in series above the last Transformer layer;
the probability module 504, configured to convert each token's word-based classification probability into character-based label probabilities after obtaining the character-based and word-based classification probabilities of each token;
and the label module 505, configured to take, for each token, the highest of the character-based and word-based classification probabilities as the token's label value.
In an implementation, the Transformer module is further configured to take, as the token supplementary information, the average of the word vectors corresponding to one or a combination of the following collectable information: known entity definitions, known entity description information, and structured knowledge-graph information corresponding to known entities.
In an implementation, the pre-training module is further configured such that the embedding of the BERT model input is the sum of the following representations: word representation, positional representation, segment representation.
In an implementation, the pre-training module is further configured such that, in the embedding of the BERT model input:
the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
In implementation, the pre-training module is further used for pre-training with the Masked Language Model.
In an implementation, the pre-training module is further configured to train in advance, before pre-training with the BERT model, a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (next-sentence prediction).
In implementation, the pre-training module is further configured to perform joint retraining on the parameters obtained by the pre-training.
For convenience of description, each part of the above-described apparatus is separately described as being functionally divided into various modules or units. Of course, the functionality of the various modules or units may be implemented in the same one or more pieces of software or hardware in the practice of the invention.
The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program for executing the named entity identification method.
The specific implementation can be seen in the implementation of the named entity recognition method.
In summary, in the technical solution provided by the embodiments of the present invention, for the entity boundary problem of the character-based BI-LSTM-CRF model, a character-word fusion and adjustment approach is adopted to help optimize the entity boundary, thereby further improving model performance.
External supplementary information such as entity definitions, entity description information, and structured knowledge-graph information corresponding to the entity is used to supplement and encode the entity slot so that it is reasonably utilized.
The method is more efficient, captures longer-range dependencies, and captures truly bidirectional context information.
The scheme of pre-training the model and then adjusting (fine-tuning) it addresses the difficulty existing entity recognition methods have in obtaining a good model when training data are scarce.
Through the character- and word-based BERT (Bidirectional Encoder Representation based on Transformer) model, the forward propagation of word segmentation errors on unregistered entities is reduced, improving model performance; and because the Transformer is used, the method is more efficient and captures longer-range dependencies than an RNN (Recurrent Neural Network).
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A named entity recognition method, comprising:
pre-training with a Bidirectional Encoder Representation based on Transformer (BERT) model in two ways, with tokens taken as characters and as segmented words respectively;
after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
connecting a soft maximum (softmax) classification layer in series above the last Transformer layer;
after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value.
2. The method of claim 1, wherein the Token supplemental information is an average of corresponding word vectors that can be collected of one or a combination of the following: known entity definitions, known entity description information, structured knowledge graph information corresponding to known entities.
3. The method of claim 1, wherein the embedding of the BERT model input is the sum of the following representations: word representation, positional representation, and segment representation.
4. The method of claim 3, wherein the vectorized representation of a word or character in the word representation is obtained as follows: the words or characters of the segmented corpus are used as a dictionary, and the corresponding word or character vectors are trained on the Chinese corpus to obtain the token's vectorized representation;
the position vector of the positional representation is obtained by embedding the position information to obtain a vector representation of the position;
the sentence vector of the segment representation is obtained as follows: for sentence-pair data, the Embedding of sentence A is added to every word of the first sentence and the Embedding of sentence B to every word of the second sentence.
5. The method of claim 1, wherein the pre-training is pre-training of a Masked Language Model.
6. The method of claim 1, wherein prior to pre-training using the BERT model, further comprising:
a binary classification model that judges whether two sentences generated from a Chinese corpus are consecutive (i.e., whether the second is the next sentence) is trained in advance.
7. The method of claim 1, further comprising:
jointly retraining the parameters obtained by the pre-training.
8. A named entity recognition apparatus, comprising:
a processor for reading the program in the memory, performing the following processes:
pre-training with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
after the pre-training is completed, using the last Transformer layer of the output part, concatenating token supplementary information to each hidden-layer token at the last hidden layer;
connecting a softmax classification layer in series above the last Transformer layer;
after obtaining the character-based and word-based classification probabilities of each token, converting each token's word-based classification probability into character-based label probabilities;
according to the character-based and word-based classification probabilities, taking, for each token, the highest value as the token's label value;
a transceiver for receiving and transmitting data under the control of the processor.
9. A named entity recognition apparatus, comprising:
the pre-training module, configured to pre-train with a BERT model in two ways, with tokens taken as characters and as segmented words respectively;
the Transformer module, configured to, after the pre-training is completed, use the last Transformer layer of the output part to concatenate token supplementary information to each hidden-layer token at the last hidden layer;
the softmax module, configured to connect a softmax classification layer in series above the last Transformer layer;
the probability module, configured to convert each token's word-based classification probability into character-based label probabilities after obtaining the character-based and word-based classification probabilities of each token;
and the label module, configured to take, for each token, the highest of the character-based and word-based classification probabilities as the token's label value.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 7.
CN202011477961.3A 2020-12-15 2020-12-15 Named entity identification method, device and storage medium Pending CN114638227A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011477961.3A CN114638227A (en) 2020-12-15 2020-12-15 Named entity identification method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011477961.3A CN114638227A (en) 2020-12-15 2020-12-15 Named entity identification method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114638227A true CN114638227A (en) 2022-06-17

Family

ID=81945365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011477961.3A Pending CN114638227A (en) 2020-12-15 2020-12-15 Named entity identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114638227A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859989A (en) * 2023-02-13 2023-03-28 神州医疗科技股份有限公司 Entity identification method and system based on remote supervision


Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN110334354B (en) Chinese relation extraction method
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN109918681B (en) Chinese character-pinyin-based fusion problem semantic matching method
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN110795945A (en) Semantic understanding model training method, semantic understanding device and storage medium
CN110826335A (en) Named entity identification method and device
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN110807333A (en) Semantic processing method and device of semantic understanding model and storage medium
CN111695341A (en) Implicit discourse relation analysis method and system based on discourse structure diagram convolution
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN112597306A (en) Travel comment suggestion mining method based on BERT
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN113239694B (en) Argument role identification method based on argument phrase
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN114638227A (en) Named entity identification method, device and storage medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN113901210B (en) Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN114611521A (en) Entity identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination