CN113177412A - Named entity identification method and system based on bert, electronic equipment and storage medium - Google Patents

Named entity identification method and system based on bert, electronic equipment and storage medium

Info

Publication number
CN113177412A
CN113177412A
Authority
CN
China
Prior art keywords
dictionary
layer
named entity
recognition result
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110364506.0A
Other languages
Chinese (zh)
Inventor
郑才松
李青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202110364506.0A
Publication of CN113177412A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a named entity identification method, a named entity identification system, electronic equipment and a storage medium based on bert, wherein the method comprises the following steps: determining a named entity label according to the identification requirement; labeling the training set according to the named entity label; respectively segmenting each training text in the training set to obtain a corresponding word sequence; inputting the word sequence into a bert feature representation layer to obtain word vectors; inputting the word vectors into a BiLSTM model and a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting the text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result. Word vectors pre-trained by BERT on large-scale data are used as input, so that the model can fully learn text features and the entity recognition effect is greatly improved; the recognition result is normalized through a constructed normalization dictionary, which removes repetition and redundancy and improves the accuracy of the recognition result.

Description

Named entity identification method and system based on bert, electronic equipment and storage medium
Technical Field
The invention relates to the field of natural language processing, in particular to a named entity identification method and system based on bert, electronic equipment and a storage medium.
Background
Named entity recognition is a key step of information extraction technology and one of the fundamental yet difficult problems of the information extraction field. As the name implies, named entity recognition extracts important entities from text. For ordinary news text, the goal is to identify entities of interest to the user, such as person names, place names and organization names. At present, named entity recognition mainly adopts rule-based methods and methods based on statistical machine learning.
Rule-based methods summarize the internal characteristics of named entities through expert domain knowledge, manually design rule templates, and extract the corresponding entities according to the templates. Such methods tend to depend on a specific language and domain; manually writing rules is time-consuming and can hardly cover all linguistic phenomena, and the rules must be modified frequently to maintain performance.
In the field of named entity recognition, statistical machine learning methods are currently widely used. Typically, the named entity recognition task is treated as a sequence labeling task. The main models fall into two types: traditional statistical machine learning models and neural network models. Common statistical machine learning models include shallow models such as support vector machines and conditional random fields, among which the conditional random field model is widely applied to named entity recognition tasks and exhibits excellent performance. In recent years, with the development of word vector technology, deep learning methods based on neural networks have made significant progress in the field of natural language processing. Compared with traditional statistical machine learning methods, neural network models perform better on entity recognition tasks.
Neural network methods usually use large-scale unlabeled corpora to train word vectors and input the pre-trained word vectors into models such as convolutional neural networks and recurrent neural networks for entity recognition training. Because the feature extraction capability of the pre-training model is insufficient and the internal features of Chinese are not sufficiently represented, the accuracy of the recognition result is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an electronic device, and a storage medium for identifying a named entity based on bert, so as to solve the problem in the prior art that the accuracy of the recognition result is low.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides a method for identifying a named entity based on bert, including: determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text; acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; respectively segmenting each training text to obtain a word sequence corresponding to each training text; inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text; inputting the word vector into a BilSTM model to obtain the output of the BilSTM model; inputting the output of the BilSTM model into a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting a text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the step of obtaining the normalized dictionary includes: acquiring encyclopedic corpus; matching the names of the pre-stored named entities in encyclopedic linguistic data to obtain a first dictionary; acquiring a full name-abbreviation dictionary on the Internet; obtaining a second dictionary according to the full name-abbreviation dictionary; acquiring company data on the Internet; obtaining a third dictionary according to the company data; and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the step of matching the names of the pre-stored named entities in the encyclopedia corpus to obtain the first dictionary includes: carrying out full-name matching on the name of the pre-stored named entity in encyclopedia corpus to obtain a full name corresponding to the name; searching a guide word for short in encyclopedic corpus, and matching the corresponding abbreviation of the name according to the guide word for short; and forming a first dictionary by using the full names and the short names corresponding to the names.
Optionally, after the step of obtaining the normalized dictionary according to the first dictionary, the second dictionary, and the third dictionary, the method further includes: and carrying out duplication removal processing on the normalization dictionary to obtain the duplication-removed normalization dictionary.
Optionally, the step of matching the recognition result with the normalized dictionary to obtain a normalized recognition result includes: completely matching the recognition result in the normalization dictionary to obtain a matching result; if the matching result is successful, taking the full name corresponding to the identification result as a normalized identification result; if the matching result is matching failure, calculating the similarity of the recognition result and the character string in the normalized dictionary; judging whether the similarity is greater than a preset value; if the similarity is larger than a preset value, taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result; and if the similarity is smaller than or equal to a preset value, taking the recognition result as a normalized recognition result.
Optionally, the step of calculating the similarity between the recognition result and the character string in the normalized dictionary includes: calculating the intersection and union of the recognition result and the character strings in the normalized dictionary; and taking the proportion of the intersection set and the union set as the similarity of the recognition result and the character strings in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence $y$ is as follows:
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
wherein $Y_X$ denotes the set of output tag sequences;
the formula for maximizing the log probability of the correct tag sequence $y^*$ is as follows:
$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
according to a second aspect, an embodiment of the present invention provides a bert-based named entity recognition system, including: the system comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for determining a named entity label according to identification requirements, and the named entity label is used for identifying a named entity of a text; the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training set, the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; the second processing module is used for respectively segmenting each training text to obtain a word sequence corresponding to each training text; the third processing module is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text; the fourth processing module is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model; the fifth processing module is used for inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model; the second acquisition module is used for acquiring the text to be recognized; the sixth processing module is used for inputting the text to be recognized into the entity recognition model to obtain a recognition result; the third acquisition module is used for acquiring a normalization dictionary, and the normalization dictionary comprises the full names and the short names of all named entities; and the seventh processing module is used for matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the third obtaining module includes: the first acquisition unit is used for acquiring encyclopedic corpus; the first processing unit is used for matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary; the second acquisition unit is used for acquiring a full name-abbreviation dictionary on the Internet; the second processing unit is used for obtaining a second dictionary according to the full name-abbreviation dictionary; a third acquisition unit for acquiring company data on the internet; the third processing unit is used for obtaining a third dictionary according to the company data; and the fourth processing unit is used for obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the first processing unit includes: the first processing subunit is used for carrying out full-name matching on the name of the named entity stored in advance in the encyclopedic corpus to obtain a full name corresponding to the name; the second processing subunit is used for searching the guide word for short in the encyclopedic corpus and matching the corresponding abbreviation of the name according to the guide word for short; and the third processing subunit is used for forming the first dictionary by the full names and the short names corresponding to the names.
Optionally, the third obtaining module further includes: and the fifth processing unit is used for carrying out duplication elimination processing on the normalization dictionary to obtain the duplication eliminated normalization dictionary.
Optionally, the seventh processing module includes: the sixth processing unit is used for completely matching the recognition result in the normalized dictionary to obtain a matching result; the seventh processing unit is used for taking the full name corresponding to the identification result as the normalized identification result if the matching result is successful; the eighth processing unit is used for calculating the similarity of the recognition result and the character string in the normalized dictionary if the matching result is matching failure; the judging unit is used for judging whether the similarity is greater than a preset value or not; the ninth processing unit is used for taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result if the similarity is larger than a preset value; and the tenth processing unit is used for taking the recognition result as a normalized recognition result if the similarity is less than or equal to a preset value.
Optionally, the eighth processing unit includes: the fourth processing subunit is used for calculating the intersection and the union of the recognition result and the character strings in the normalized dictionary; and the fifth processing subunit is used for taking the proportion of the intersection and the union as the similarity of the recognition result and the character string in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence $y$ is as follows:
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
wherein $Y_X$ denotes the set of output tag sequences;
the formula for maximizing the log probability of the correct tag sequence $y^*$ is as follows:
$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
according to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method for bert based named entity recognition as described in any of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which computer instructions are stored, the computer instructions being configured to cause a computer to execute the bert-based named entity recognition method described in any of the first aspects.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides a named entity identification method, a named entity identification system, electronic equipment and a storage medium based on bert, wherein the method comprises the following steps: determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text; acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; respectively segmenting each training text to obtain a word sequence corresponding to each training text; inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text; inputting the word vector into a BilSTM model to obtain the output of the BilSTM model; inputting the output of the BilSTM model into a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting a text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result. The method does not need to manually construct features, is completely realized end to end, and can enable the model to fully learn the text features by inputting the word vector model trained by the Bert data, thereby greatly improving the effect of entity recognition; meanwhile, for the recognition result, the normalization dictionary is constructed and unified to an entity, so that the subsequent consistency processing is facilitated, and the accuracy of the recognition result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a specific example of a bert-based named entity recognition method according to an embodiment of the present invention;
FIG. 2 is a block diagram of one particular example of a bert based named entity recognition system in accordance with embodiments of the present invention;
fig. 3 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a named entity recognition method based on bert, and as shown in FIG. 1, the method comprises steps S1-S10.
Step S1: and determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying the named entity of the text.
In this embodiment, the identification requirement includes a named entity to be identified, and the type of the named entity can be obtained according to the identification requirement, so as to obtain a named entity tag.
The named entity problem is formalized as a sequence labeling problem, with BIO labels representing entity boundaries, where B marks the first word of an entity, I a word inside an entity, and O a non-entity word. Specifically, for news text, three types of entities are mainly identified: person names, place names and organization names. A person name is labeled PER: its beginning is B-PER and its interior I-PER; a place name is labeled LOC: its beginning is B-LOC and its interior I-LOC; an organization name is labeled ORG: its beginning is B-ORG and its interior I-ORG. This is only illustrated schematically in the present embodiment and is not limiting. In practical application, the categories of named entities and the label of each entity are determined according to actual needs.
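As an illustration, a labeled sentence under this scheme might look as follows; the sentence and its tags are an invented example, not taken from the patent:

```python
# Hypothetical BIO labeling of the sentence "张三在北京大学工作"
# ("Zhang San works at Peking University"); one tag per character.
tokens = ["张", "三", "在", "北", "京", "大", "学", "工", "作"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```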
Step S2: and acquiring a training set, wherein the training set is labeled according to the named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence.
In this embodiment, the text data is labeled according to the named entity tag, that is, the named entity corresponding to the named entity tag in the text data is labeled. The labeled text data is used as a training text, and one training text is an input sentence, namely one training text corresponds to one sentence. And forming a training set by the marked training texts.
Specifically, in this embodiment, the daily news corpus is selected and labeled to obtain a training set.
Step S3: and respectively segmenting each training text to obtain a word sequence corresponding to each training text.
In this embodiment, each training text (input sentence) is first segmented, and the sentence is represented as a word sequence. Segmentation here means a sequence of single characters for Chinese and a sequence of single words for English; it does not involve any specific word segmentation algorithm, and the result is uniformly called a word sequence.
Step S4: and inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text.
In this embodiment, the word sequence is input to the bert feature representation layer to obtain a vector representation of each word in the sentence.
BERT stands for Bidirectional Encoder Representations from Transformers. Unlike other recent language representation models, BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. Consequently, the pre-trained BERT representation can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modifications. Using word vectors trained by BERT on large-scale data as the model input allows the model to fully learn text features and greatly improves the effect of entity recognition.
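A minimal sketch of this step using the open-source Hugging Face transformers library is shown below; the patent does not name a toolkit or checkpoint, so the library and the bert-base-chinese model are assumptions:

```python
# Obtain one contextual vector per token from a pre-trained BERT encoder.
# Assumption: the Hugging Face "transformers" library and the
# "bert-base-chinese" checkpoint; the patent specifies neither.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "张三在北京大学工作"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

word_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
print(word_vectors.shape)
```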
Step S5: and inputting the word vector into a BiLSTM model to obtain the output of the BiLSTM model.
In this embodiment, each word vector in a sentence is input into the BiLSTM network, and the value of the hidden layer in the current state is calculated. Here, a forward information vector is obtained for the sentence through the forward LSTM, and a backward information vector through the backward LSTM. The forward and backward information vectors are concatenated position by position to obtain the BiLSTM model output corresponding to the sentence.
Step S6: and inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model.
In this embodiment, the output of the BiLSTM layer passes through the CRF layer to obtain a label for each word, and the model is trained on the training set to obtain the entity recognition model; a sketch of these two steps follows.
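The following PyTorch sketch shows one possible shape of steps S5 and S6; the pytorch-crf package stands in for the CRF layer, and the dimensions (768-dimensional BERT vectors, 7 BIO tags for three entity types) are illustrative assumptions, not values fixed by the patent:

```python
# Minimal BiLSTM-CRF sketch (assumed libraries: torch, pytorch-crf).
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRF(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, num_tags=7):
        super().__init__()
        # Forward and backward LSTMs; their outputs are concatenated per position.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_tags)  # per-tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, word_vectors, tags, mask):
        feats, _ = self.bilstm(word_vectors)
        emissions = self.emission(feats)
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(emissions, tags, mask=mask)

    def predict(self, word_vectors, mask):
        feats, _ = self.bilstm(word_vectors)
        emissions = self.emission(feats)
        # Viterbi decoding of the highest-scoring tag sequence.
        return self.crf.decode(emissions, mask=mask)
```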
Step S7: and acquiring a text to be recognized.
In this embodiment, the text to be recognized is news data of some specific websites downloaded by a crawler.
Step S8: and inputting the text to be recognized into the entity recognition model to obtain a recognition result.
In this embodiment, the text to be recognized is input into the entity recognition model to recognize the named entity, so as to obtain a recognition result of the text to be recognized, that is, the named entity corresponding to the text to be recognized.
Step S9: and acquiring a normalization dictionary, wherein the normalization dictionary comprises the full names and the short names of all named entities.
In this embodiment, there are often multiple descriptions of the same named entity, such as full name and abbreviation. For the situation, a normalization dictionary is constructed, the dictionary comprises various descriptions corresponding to the named entities, the normalization dictionary is a large database, and the normalization dictionary comprises the full names and various short names of all the named entities.
Step S10: and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
In this embodiment, since some entities have multiple expressions, such as company names and acronyms, the identified entities have duplication and redundancy. In order to remove the duplicate, the identified entity needs to be normalized, which facilitates subsequent processing. And unifying the identified named entities to a unique name to obtain a normalized identification result. The unique name in this embodiment is a full name corresponding to the named entity.
Firstly, labels of the named entities are determined according to the identification requirement, the training set is labeled according to the labels, and the labeled training set is segmented to obtain the word sequence of each training text; the word sequence is input into the bert feature representation layer to obtain the word vector of each word; the word vectors are input into the BiLSTM model and the CRF model, and the training set is used to train the entity recognition model; the text to be recognized is acquired and input into the entity recognition model to obtain a recognition result, and the recognition result is normalized through the normalization dictionary to obtain a normalized recognition result. The method requires no manually constructed features and is implemented entirely end to end; using word vectors pre-trained by BERT on large-scale data as input allows the model to fully learn text features, greatly improving the effect of entity recognition; meanwhile, the recognition result output by the model is normalized through the constructed normalization dictionary, which removes repetition and redundancy, facilitates subsequent consistency processing, and improves the accuracy of the recognition result.
As an exemplary embodiment, the step of acquiring the normalized dictionary at step S9 includes steps S91-S97.
Step S91: obtaining encyclopedia corpus.
In this embodiment, the encyclopedia corpus is obtained through a web crawler; specifically, it may be a Baidu Encyclopedia corpus or a Wikipedia corpus.
Step S92: and matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary.
In this embodiment, the names of the named entities stored in advance may be name dictionaries of named entities accumulated through history, for example, a previously accumulated dictionary of more than 30 million company names.
In this embodiment, step S92 includes steps S921 to S923.
Step S921: and carrying out full name matching on the name of the pre-stored named entity in encyclopedia linguistic data to obtain a full name corresponding to the name.
Specifically, the names of the named entities stored in advance are subjected to full-name matching in encyclopedia corpus, and the full names corresponding to the names are matched.
Step S922: searching for abbreviation guide words in the encyclopedia corpus and matching the abbreviation corresponding to the name according to the guide words.
Specifically, an abbreviation guide word is a word used to introduce an abbreviation, such as "abbreviated as" or "also named"; in other embodiments the guide words may also include other expressions such as aliases, and can be set reasonably according to actual needs.
In the encyclopedia corpus, guide words such as "abbreviated as" and "also named" are located, and the abbreviation corresponding to the name is matched.
Step S923: and forming a first dictionary by using the full names and the short names corresponding to the names.
Specifically, the full names and the abbreviations of the names are paired; one full name may correspond to one or more abbreviations, and the matched pairs of full names and abbreviations constitute the first dictionary, as sketched below.
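A hedged sketch of this matching, assuming simple regular expressions over the crawled corpus and the guide words 简称 ("abbreviated as") and 又名 ("also named"); the pattern and sample text are illustrative only:

```python
import re

# Assumed abbreviation guide words: 简称 ("abbreviated as"), 又名 ("also named").
GUIDE_WORDS = ["简称", "又名"]


def extract_abbreviations(full_name: str, corpus_text: str) -> list:
    """Return abbreviations of full_name found after guide words in the text."""
    if full_name not in corpus_text:   # step S921: full-name match first
        return []
    abbreviations = []
    for guide in GUIDE_WORDS:          # step S922: match after each guide word
        pattern = re.escape(guide) + r"([\u4e00-\u9fa5A-Za-z]{2,10})"
        abbreviations += re.findall(pattern, corpus_text)
    return abbreviations


corpus = "北京智慧星光信息技术有限公司，简称智慧星光，是一家科技公司。"
name = "北京智慧星光信息技术有限公司"
first_dictionary = {name: extract_abbreviations(name, corpus)}
print(first_dictionary)  # {'北京智慧星光信息技术有限公司': ['智慧星光']}
```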
Step S93: and acquiring a full name-abbreviation dictionary on the Internet.
In this embodiment, published full name-abbreviation dictionaries, for example a dictionary of national institution full names and abbreviations, the full names of the countries of the world, and the like, are downloaded over the Internet.
Step S94: and obtaining a second dictionary according to the full name-abbreviation dictionary.
In this embodiment, the multiple full-name/abbreviation dictionaries downloaded from the internet form a full-name/abbreviation dictionary set, and the set is used as the second dictionary.
Step S95: company data on the internet is acquired.
In this embodiment, the company data disclosed on the internet is obtained through the web crawler, and may specifically be the listed company data downloaded from the stock website.
Step S96: and obtaining a third dictionary according to the company data.
In this embodiment, the company name corresponding to a stock name is extracted automatically through a rule. The rule can be understood as follows: the stock profile mentions the name of the company, so the corresponding company name is extracted by locating a leading phrase of the company name, such as "the company name" or "the name used by the company"; the company names obtained in this way form the third dictionary, as sketched below.
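One possible reading of this rule, sketched in Python; the leading phrase 公司名称 ("company name") and the sample profile are assumptions for illustration, since the patent only gives English glosses of its guide words:

```python
import re
from typing import Optional


def company_name_from_profile(profile: str) -> Optional[str]:
    # Capture the text after the assumed leading phrase "公司名称" up to the
    # next punctuation mark or whitespace.
    match = re.search(r"公司名称[:：]\s*([^，。；\s]+)", profile)
    return match.group(1) if match else None


profile = "证券简称：智慧星光；公司名称：北京智慧星光信息技术有限公司。"
print(company_name_from_profile(profile))  # 北京智慧星光信息技术有限公司
```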
Step S97: and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
In this embodiment, the first, second and third dictionaries obtained from different channels are merged to collect the full names and all abbreviations corresponding to them; after each full name has been paired with all of its abbreviations, the normalization dictionary is obtained, so that recognition results can subsequently be normalized to the unique full name corresponding to the named entity.
This step builds the normalization dictionary from encyclopedia corpora, full name-abbreviation dictionaries on the Internet, and company data on the Internet, making the normalization dictionary more comprehensive and accurate.
As an exemplary embodiment, step S98 is further included after the step of obtaining the normalized dictionary from the first dictionary, the second dictionary, and the third dictionary at step S97.
Step S98: and carrying out duplication removal processing on the normalization dictionary to obtain the duplication-removed normalization dictionary.
In this embodiment, different dictionaries obtained from different channels may have the same content, and these duplicate contents are removed, only one is retained, and the de-duplicated normalization dictionary is obtained after the de-duplication process.
For example, the first dictionary yields the pair "Beijing Smart Starlight Information Technology Co., Ltd. - Smart Starlight" and the second dictionary yields the same pair; since the full name and the abbreviation are duplicated, only one entry is kept in the normalization dictionary.
The step is to carry out duplication removal on the normalization dictionary, so that the redundancy of the normalization dictionary is reduced, the storage space is reduced, and the normalization efficiency is improved.
As an exemplary embodiment, the step of matching the recognition result with the normalized dictionary to obtain the normalized recognition result in step S10 includes steps S101-S106.
Step S101: and completely matching the recognition result in the normalization dictionary to obtain a matching result.
In this embodiment, the recognition result is looked up in the normalized dictionary to obtain a matching result. When the recognition result can be completely matched with the corresponding name in the normalized dictionary, namely, the recognition result and the name are completely the same, the matching result is a matching success, and the step S102 is executed; when the recognition result cannot find a completely corresponding name in the normalized dictionary, i.e. the two names are not completely the same, the matching result is a matching failure, and step S103 is executed.
Step S102: and if the matching result is successful, taking the full name corresponding to the identification result as the normalized identification result.
In this embodiment, the matching result is a successful matching, which indicates that the recognition result can find the corresponding name in the normalized dictionary, that is, the recognition result exists in the normalized dictionary, so that the full name corresponding to the name is used as the normalized recognition result.
Step S103: and if the matching result is matching failure, calculating the similarity between the recognition result and the character string in the normalized dictionary.
In this embodiment, the matching result is a matching failure, which indicates that the recognition result cannot completely correspond to the name in the normalized dictionary, and at this time, similarity comparison needs to be performed between the recognition result and the character string in the normalized dictionary to determine the closest name.
In this embodiment, step S103 may specifically include steps S1031 to S1032.
Step S1031: and calculating the intersection and union of the recognition result and the character strings in the normalized dictionary.
In this embodiment, the recognition result is compared with each character string in the normalized dictionary, and the intersection and union of the recognition result and each character string are obtained respectively.
The intersection is the number of distinct characters that the two character strings have in common, and the union is the number of all distinct characters contained in the two character strings. The calculation proceeds as follows:
1. Deduplicate each of the two character strings, removing its repeated characters.
2. Select one character string and traverse it, comparing character by character against the other string; count the number of shared characters (the intersection) while accumulating the total number of distinct characters (the union).
Step S1032: and taking the proportion of the intersection set and the union set as the similarity of the recognition result and the character strings in the normalized dictionary.
In this embodiment, the similarity calculation uses the Jaccard similarity coefficient, i.e., the ratio of the intersection to the union of the two character strings: similarity = intersection / union, as sketched below.
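A direct Python rendering of this computation; the pair of strings in the example is invented:

```python
# Character-level Jaccard similarity: deduplicate both strings, then divide the
# number of shared distinct characters by the number of all distinct characters.
def jaccard_similarity(a: str, b: str) -> float:
    chars_a, chars_b = set(a), set(b)   # step 1: remove repeated characters
    intersection = chars_a & chars_b    # distinct characters shared by both
    union = chars_a | chars_b           # all distinct characters of the two
    return len(intersection) / len(union) if union else 0.0


# 5 shared characters out of 15 distinct ones: similarity = 1/3
print(jaccard_similarity("智慧星光科技", "北京智慧星光信息技术有限公司"))
```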
Step S104: and judging whether the similarity is greater than a preset value. If the similarity is greater than the preset value, executing step S105; if the similarity is less than or equal to the preset value, step S106 is executed.
In this embodiment, the preset value is determined according to the recognition accuracy; the higher the identification precision requirement is, the larger the numerical value of the preset value is; conversely, the lower the identification accuracy requirement, the smaller the value of the preset value. The preset value may range anywhere from 0 to 1. Specifically, the preset value may be set to 0.8; of course, in other embodiments, the preset value may also be set to other values, and may be reasonably set in practical application according to needs.
Step S105: and if the similarity is greater than the preset value, taking the full name corresponding to the character string in the normalized dictionary as the normalized recognition result.
In this embodiment, when the similarity is greater than the preset value, it indicates that the similarity between the recognition result and the character string in the normalized dictionary is extremely high, the recognition result and the character string in the normalized dictionary are defaulted to be the same character, and the full name corresponding to the character string is taken as the normalized recognition result.
Step S106: and if the similarity is smaller than or equal to a preset value, taking the recognition result as a normalized recognition result.
In this embodiment, when the similarity is smaller than or equal to the preset value, it indicates that the similarity between the recognition result and the character string in the normalized dictionary is low, and the recognition result does not exist in the normalized dictionary, so the recognition result is directly used as the normalized recognition result.
In this step, the recognition result is matched against the normalization dictionary: if the match succeeds, the full name corresponding to the recognition result is taken as the normalized recognition result; if the match fails, the similarity between the recognition result and the character strings in the normalization dictionary is compared, and the normalized recognition result is obtained from the similarity. Through these steps the recognition results are normalized and unified to a single entity, which facilitates subsequent consistency processing; a combined sketch follows.
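Putting steps S101 to S106 together, a minimal sketch of the matching flow might read as follows; the dictionary contents and the 0.8 threshold follow the examples given above and are otherwise assumptions:

```python
# Combined sketch of steps S101-S106 (illustrative only).
def jaccard(a: str, b: str) -> float:
    # Character-level Jaccard coefficient, as in the previous sketch.
    return len(set(a) & set(b)) / len(set(a) | set(b))


def normalize(entity: str, norm_dict: dict, threshold: float = 0.8) -> str:
    if entity in norm_dict:              # S101/S102: exact match succeeded
        return norm_dict[entity]
    # S103: exact match failed; compare against every known surface form.
    best_key = max(norm_dict, key=lambda key: jaccard(entity, key))
    if jaccard(entity, best_key) > threshold:  # S104/S105: close enough
        return norm_dict[best_key]
    return entity                        # S106: keep the recognition result


# Maps every known surface form (full name or abbreviation) to the full name.
norm_dict = {
    "北京智慧星光信息技术有限公司": "北京智慧星光信息技术有限公司",
    "智慧星光": "北京智慧星光信息技术有限公司",
}
print(normalize("智慧星光", norm_dict))              # exact match
print(normalize("北京智慧星光信息技术公司", norm_dict))  # similarity > 0.8
```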
BiLSTM structure: given an input sequence $(x_1, x_2, \ldots, x_n)$, a recurrent neural network model returns a sequence of representations $(h_1, h_2, \ldots, h_n)$ for the input sequence. The RNN model can dynamically capture the information of sequence data and has the capacity to memorize and store information, but in practice it readily suffers from vanishing or exploding gradients. The LSTM model introduces a memory cell and a gating mechanism, making effective use of long-distance information and overcoming the vanishing-gradient problem. At time $t$, given input $x_t$, the hidden-layer output representation of the LSTM is computed as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively. The gating mechanism can effectively filter and memorize the information of the memory cell, thereby overcoming the problems of the RNN.
In the prediction phase, a softmax classifier is usually adopted for multi-class problems, but the softmax classifier does not take the dependency between tags into account in a sequence labeling problem. This embodiment therefore uses a CRF model, which considers the global information of the tag sequence and can thus predict the tags better. The details are as follows. Suppose a transition score matrix $A$ is introduced, whose element $A_{i,j}$ represents the score of the transition from label $i$ to label $j$; let $y_0$ and $y_{n+1}$ be the start and end labels of the sentence. With $k$ label types, $A \in \mathbb{R}^{(k+2) \times (k+2)}$. If the sentence length is $n$, the score matrix of the output layer is $P \in \mathbb{R}^{n \times k}$, whose element $P_{i,j}$ represents the output score of the $i$-th word under the $j$-th label. Given an input sentence $X = (x_1, x_2, \ldots, x_n)$ and an output tag sequence $Y = (y_1, y_2, \ldots, y_n)$, the total score of the tag sequence is:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label.

Normalizing over all possible sequence paths yields a probability distribution for the output sequence $y$:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

wherein $Y_X$ denotes the set of output tag sequences.

During training, the log probability of the correct tag sequence $y^*$ is maximized:

$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

As the log-probability formula shows, the purpose of the sentence-level likelihood function is to encourage the model to generate the correct tag sequence. In the decoding stage, the sequence with the highest total score is predicted as the optimal sequence:

$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
in the prediction stage, a dynamic programming algorithm Viterbi is adopted to solve the optimal sequence.
This embodiment also provides a bert-based named entity recognition system, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The embodiment also provides a named entity recognition system based on bert, as shown in fig. 2, including:
the system comprises a first processing module 1, a second processing module and a third processing module, wherein the first processing module is used for determining a named entity label according to an identification requirement, and the named entity label is used for identifying a named entity of a text;
the first acquisition module 2 is used for acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
the second processing module 3 is used for performing word segmentation on each training text to obtain a word sequence corresponding to each training text;
the third processing module 4 is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text;
the fourth processing module 5 is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model;
a fifth processing module 6, configured to input the output of the BiLSTM model to the CRF model for training, so as to obtain an entity identification model;
the second obtaining module 7 is used for obtaining a text to be recognized;
the sixth processing module 8 is configured to input the text to be recognized into the entity recognition model to obtain a recognition result;
a third obtaining module 9, configured to obtain a normalization dictionary, where the normalization dictionary includes full names and short names of all named entities;
and a seventh processing module 10, configured to match the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the third obtaining module includes: the first acquisition unit is used for acquiring encyclopedic corpus; the first processing unit is used for matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary; the second acquisition unit is used for acquiring a full name-abbreviation dictionary on the Internet; the second processing unit is used for obtaining a second dictionary according to the full name-abbreviation dictionary; a third acquisition unit for acquiring company data on the internet; the third processing unit is used for obtaining a third dictionary according to the company data; and the fourth processing unit is used for obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the first processing unit includes: the first processing subunit is used for carrying out full-name matching on the name of the named entity stored in advance in the encyclopedic corpus to obtain a full name corresponding to the name; the second processing subunit is used for searching the guide word for short in the encyclopedic corpus and matching the corresponding abbreviation of the name according to the guide word for short; and the third processing subunit is used for forming the first dictionary by the full names and the short names corresponding to the names.
Optionally, the third obtaining module further includes: and the fifth processing unit is used for carrying out duplication elimination processing on the normalization dictionary to obtain the duplication eliminated normalization dictionary.
Optionally, the seventh processing module includes: the sixth processing unit is used for completely matching the recognition result in the normalized dictionary to obtain a matching result; the seventh processing unit is used for taking the full name corresponding to the identification result as the normalized identification result if the matching result is successful; the eighth processing unit is used for calculating the similarity of the recognition result and the character string in the normalized dictionary if the matching result is matching failure; the judging unit is used for judging whether the similarity is greater than a preset value or not; the ninth processing unit is used for taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result if the similarity is larger than a preset value; and the tenth processing unit is used for taking the recognition result as a normalized recognition result if the similarity is less than or equal to a preset value.
Optionally, the eighth processing unit includes: the fourth processing subunit is used for calculating the intersection and the union of the recognition result and the character strings in the normalized dictionary; and the fifth processing subunit is used for taking the proportion of the intersection and the union as the similarity of the recognition result and the character string in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of the transition from label y_i to label y_{i+1}, and P_{i, y_i} denotes the output score of the i-th word on the y_i-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence y is as follows:
p(y | X) = e^{s(X, y)} / Σ_{y′ ∈ Y_X} e^{s(X, y′)}

where Y_X denotes the set of all possible output tag sequences;
the formula for maximizing the log probability for the correct tag sequence y is as follows:
log p(y | X) = s(X, y) - log Σ_{y′ ∈ Y_X} e^{s(X, y′)}
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
y* = argmax_{y′ ∈ Y_X} s(X, y′)
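The four CRF formulas above can likewise be sketched in numpy. The code below is an illustrative simplification (integer label indices, no explicit start and end tags), not the patent's implementation: P is the n×K matrix of per-word label scores produced by the BiLSTM, and A is the K×K transition matrix.

    import numpy as np

    def logsumexp(x, axis=None):
        """Numerically stable log of the sum of exponentials."""
        m = np.max(x, axis=axis, keepdims=True)
        s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
        return np.squeeze(s, axis=axis) if axis is not None else s.item()

    def path_score(P, A, y):
        """s(X, y): emission scores plus label-to-label transition scores."""
        y = np.asarray(y)
        return P[np.arange(len(y)), y].sum() + A[y[:-1], y[1:]].sum()

    def log_prob(P, A, y):
        """log p(y|X): path score minus the log-normalizer over all paths."""
        alpha = P[0].astype(float)
        for t in range(1, len(P)):                 # forward algorithm
            alpha = P[t] + logsumexp(alpha[:, None] + A, axis=0)
        return path_score(P, A, y) - logsumexp(alpha)

    def viterbi(P, A):
        """Return the label sequence with the highest total score."""
        n, K = P.shape
        delta, back = P[0].astype(float), np.zeros((n, K), dtype=int)
        for t in range(1, n):
            scores = delta[:, None] + A + P[t]     # scores[i, j]: prev label i -> j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0)
        y = [int(delta.argmax())]
        for t in range(n - 1, 0, -1):              # trace the best path backwards
            y.append(int(back[t][y[-1]]))
        return y[::-1]

During training, -log_prob would be minimized for the correct tag sequence; viterbi implements the prediction formula for the optimal sequence.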
the bert based named entity recognition system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory that execute one or more software or fixed programs, and/or other devices that may provide the functionality described above.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device. As shown in fig. 3, the electronic device includes one or more processors 71 and a memory 72; one processor 71 is taken as an example in fig. 3.
The electronic device may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 3.
The processor 71 may be a Central Processing Unit (CPU). The processor 71 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 72, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the bert-based named entity recognition method in the embodiments of the present application. The processor 71 executes the various functional applications and data processing of the server, i.e. implements the bert-based named entity recognition method of the above method embodiments, by running the non-transitory software programs, instructions and modules stored in the memory 72.
The memory 72 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the processing device of the server, and the like. Further, the memory 72 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, and such remote memory may be connected to the processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 and, when executed by the one or more processors 71, perform the method shown in fig. 1.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments of the bert-based named entity recognition method. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also include a combination of the above kinds of memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A named entity recognition method based on bert is characterized by comprising the following steps:
determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text;
acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
respectively segmenting each training text to obtain a word sequence corresponding to each training text;
inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text;
inputting the word vectors into a BiLSTM model to obtain the output of the BiLSTM model;
inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model;
acquiring a text to be identified;
inputting a text to be recognized into the entity recognition model to obtain a recognition result;
acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities;
and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
2. The bert-based named entity recognition method of claim 1, wherein the step of obtaining a normalized dictionary comprises:
acquiring an encyclopedia corpus;
matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary;
acquiring a full name-abbreviation dictionary on the Internet;
obtaining a second dictionary according to the full name-abbreviation dictionary;
acquiring company data on the Internet;
obtaining a third dictionary according to the company data;
and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
3. The bert-based named entity recognition method of claim 2, wherein the step of matching pre-stored named entity names in an encyclopedia corpus to obtain the first dictionary comprises:
performing full-name matching of the pre-stored named entity names in the encyclopedia corpus to obtain the full name corresponding to each name;
searching the encyclopedia corpus for an abbreviation guide word, and matching the abbreviation corresponding to the name according to the guide word;
and forming the first dictionary from the full names and the abbreviations corresponding to the names.
4. The bert-based named entity recognition method of claim 2, wherein after the step of deriving a normalized dictionary from the first dictionary, the second dictionary, and the third dictionary, further comprising:
and performing deduplication processing on the normalization dictionary to obtain the deduplicated normalization dictionary.
5. The bert-based named entity recognition method of claim 1, wherein the step of matching the recognition result with the normalization dictionary to obtain a normalized recognition result comprises:
completely matching the recognition result in the normalization dictionary to obtain a matching result;
if the matching succeeds, taking the full name corresponding to the recognition result as the normalized recognition result;
if the matching fails, calculating the similarity between the recognition result and the character strings in the normalization dictionary;
judging whether the similarity is greater than a preset value;
if the similarity is greater than the preset value, taking the full name corresponding to the character string in the normalization dictionary as the normalized recognition result;
and if the similarity is less than or equal to the preset value, taking the recognition result as the normalized recognition result.
6. The bert-based named entity recognition method of claim 5, wherein the step of calculating the similarity between the recognition result and the character strings in the normalization dictionary comprises:
calculating the intersection and the union of the recognition result and the character string in the normalization dictionary;
and taking the ratio of the intersection to the union as the similarity between the recognition result and the character string in the normalization dictionary.
7. The bert-based named entity recognition method of claim 1,
the output value of the hidden state layer in the BiLSTM model is calculated as follows:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)

h_t = o_t tanh(c_t)
where W_{xi}, W_{hi} and W_{ci} denote the weight matrices from the input layer, the hidden layer and the memory cell to the input gate, and b_i denotes the bias vector of the input gate; W_{xf}, W_{hf} and W_{cf} denote the weight matrices from the input layer, the hidden layer and the memory cell to the forget gate, and b_f denotes the bias vector of the forget gate; W_{xc} and W_{hc} denote the weight matrices from the input layer and the hidden layer to the memory cell, and b_c denotes the bias vector of the memory cell; W_{xo}, W_{ho} and W_{co} denote the weight matrices from the input layer, the hidden layer and the memory cell to the output gate, and b_o denotes the bias vector of the output gate; h_{t-1} denotes the value of the hidden layer at time t-1 and c_{t-1} the value of the memory cell at time t-1; c denotes the state of the memory cell; σ and tanh denote two different neuron activation functions; and i_t, f_t and o_t denote the input gate, the forget gate and the output gate, respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of the transition from label y_i to label y_{i+1}, and P_{i, y_i} denotes the output score of the i-th word on the y_i-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence y is as follows:
p(y | X) = e^{s(X, y)} / Σ_{y′ ∈ Y_X} e^{s(X, y′)}

where Y_X denotes the set of all possible output tag sequences;
the formula for maximizing the log probability for the correct tag sequence y is as follows:
log p(y | X) = s(X, y) - log Σ_{y′ ∈ Y_X} e^{s(X, y′)}
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
y* = argmax_{y′ ∈ Y_X} s(X, y′)
8. A bert-based named entity recognition system, comprising:
the first processing module is used for determining a named entity label according to the recognition requirement, wherein the named entity label is used for identifying a named entity of a text;
the first acquisition module is used for acquiring a training set, wherein the training set is labeled according to the named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
the second processing module is used for respectively segmenting each training text to obtain a word sequence corresponding to each training text;
the third processing module is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text;
the fourth processing module is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model;
the fifth processing module is used for inputting the output of the BiLSTM model into the CRF model for training to obtain an entity recognition model;
the second acquisition module is used for acquiring the text to be recognized;
the sixth processing module is used for inputting the text to be recognized into the entity recognition model to obtain a recognition result;
the third acquisition module is used for acquiring a normalization dictionary, and the normalization dictionary comprises the full names and the short names of all named entities;
and the seventh processing module is used for matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to cause the at least one processor to perform the bert-based named entity recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the bert-based named entity recognition method of any one of claims 1 to 7.
CN202110364506.0A 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium Pending CN113177412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364506.0A CN113177412A (en) 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113177412A true CN113177412A (en) 2021-07-27

Family

ID=76923039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110364506.0A Pending CN113177412A (en) 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177412A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN111859963A (en) * 2019-04-08 2020-10-30 中移(苏州)软件技术有限公司 Named entity recognition method, equipment, device and computer readable storage medium
CN110738055A (en) * 2019-10-23 2020-01-31 北京字节跳动网络技术有限公司 Text entity identification method, text entity identification equipment and storage medium
CN111597304A (en) * 2020-05-15 2020-08-28 上海财经大学 Secondary matching method for accurately identifying Chinese enterprise name entity
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN112257446A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and readable storage medium
CN112257422A (en) * 2020-10-22 2021-01-22 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627185A (en) * 2021-07-29 2021-11-09 重庆邮电大学 Entity identification method for liver cancer pathological text naming
CN113722476A (en) * 2021-07-30 2021-11-30 的卢技术有限公司 Resume information extraction method and system based on deep learning
CN113627183A (en) * 2021-08-12 2021-11-09 平安国际智慧城市科技股份有限公司 Method, device and equipment for standardizing department name text and storage medium
CN113704480A (en) * 2021-11-01 2021-11-26 成都我行我数科技有限公司 Intelligent minimum stock unit matching method
CN113704480B (en) * 2021-11-01 2022-01-25 成都我行我数科技有限公司 Intelligent minimum stock unit matching method
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN116384515A (en) * 2023-06-06 2023-07-04 之江实验室 Model training method and device, storage medium and electronic equipment
CN116384515B (en) * 2023-06-06 2023-09-01 之江实验室 Model training method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination