CN113177412A - Named entity identification method and system based on bert, electronic equipment and storage medium - Google Patents

Named entity identification method and system based on bert, electronic equipment and storage medium

Info

Publication number
CN113177412A
CN113177412A
Authority
CN
China
Prior art keywords
dictionary
layer
named entity
recognition result
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110364506.0A
Other languages
Chinese (zh)
Inventor
郑才松
李青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202110364506.0A
Publication of CN113177412A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a named entity identification method, a named entity identification system, electronic equipment and a storage medium based on bert, wherein the method comprises the following steps: determining a named entity label according to the identification requirement; labeling the training set according to the named entity label; respectively segmenting each training text in the training set to obtain a corresponding word sequence; inputting the word sequence into a bert feature representation layer to obtain word vectors; inputting the word vectors into a BiLSTM model and a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting the text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result. Word vectors pre-trained by BERT on large-scale data are used as input, so that the model can fully learn text features and the entity recognition effect is greatly improved; the recognition result is normalized through a constructed normalization dictionary, which removes repetition and redundancy and improves the accuracy of the recognition result.

Description

Named entity identification method and system based on bert, electronic equipment and storage medium
Technical Field
The invention relates to the field of natural language processing, in particular to a named entity identification method and system based on bert, electronic equipment and a storage medium.
Background
Named entity recognition is a key step of information extraction technology and one of the fundamental yet difficult problems of the information extraction field. As the name implies, named entity recognition extracts important entities from text. For ordinary news text, the goal is to identify entities of interest to the user, such as person names, place names and organization names. At present, named entity recognition mainly adopts rule-based methods and methods based on statistical machine learning.
Rule-based methods summarize the internal characteristics of named entities through expert domain knowledge, manually design rule templates, and extract the corresponding entities according to the templates. Such methods tend to depend on a specific language and domain; manually writing rules is time-consuming and can hardly cover all linguistic phenomena, and the rules must be modified frequently to maintain performance.
In the field of named entity recognition, statistical machine learning methods are currently widely used. Typically, the named entity recognition task is treated as a sequence labeling task. The main models fall into two types: traditional statistical machine learning models and neural network models. Common statistical machine learning models include shallow models such as support vector machines and conditional random fields, among which the conditional random field model is widely applied to named entity recognition tasks and exhibits excellent performance. In recent years, with the development of word vector technology, deep learning methods based on neural networks have made significant progress in the field of natural language processing. Compared with traditional statistical machine learning methods, neural network models perform better on entity recognition tasks.
Neural network methods usually use large-scale unlabeled corpora to train word vectors and input the pre-trained word vectors into models such as convolutional neural networks and recurrent neural networks for entity recognition training. Because the feature extraction capability of the pre-training model is insufficient and the internal features of Chinese are not sufficiently represented, the accuracy of the recognition result is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an electronic device, and a storage medium for identifying a named entity based on bert, so as to solve the problem in the prior art that the accuracy of the recognition result is low.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides a method for identifying a named entity based on bert, including: determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text; acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; respectively segmenting each training text to obtain a word sequence corresponding to each training text; inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text; inputting the word vector into a BilSTM model to obtain the output of the BilSTM model; inputting the output of the BilSTM model into a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting a text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the step of obtaining the normalized dictionary includes: acquiring encyclopedic corpus; matching the names of the pre-stored named entities in encyclopedic linguistic data to obtain a first dictionary; acquiring a full name-abbreviation dictionary on the Internet; obtaining a second dictionary according to the full name-abbreviation dictionary; acquiring company data on the Internet; obtaining a third dictionary according to the company data; and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the step of matching the names of the pre-stored named entities in the encyclopedia corpus to obtain the first dictionary includes: carrying out full-name matching on the name of the pre-stored named entity in encyclopedia corpus to obtain a full name corresponding to the name; searching a guide word for short in encyclopedic corpus, and matching the corresponding abbreviation of the name according to the guide word for short; and forming a first dictionary by using the full names and the short names corresponding to the names.
Optionally, after the step of obtaining the normalized dictionary according to the first dictionary, the second dictionary, and the third dictionary, the method further includes: and carrying out duplication removal processing on the normalization dictionary to obtain the duplication-removed normalization dictionary.
Optionally, the step of matching the recognition result with the normalized dictionary to obtain a normalized recognition result includes: completely matching the recognition result in the normalization dictionary to obtain a matching result; if the matching result is successful, taking the full name corresponding to the identification result as a normalized identification result; if the matching result is matching failure, calculating the similarity of the recognition result and the character string in the normalized dictionary; judging whether the similarity is greater than a preset value; if the similarity is larger than a preset value, taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result; and if the similarity is smaller than or equal to a preset value, taking the recognition result as a normalized recognition result.
Optionally, the step of calculating the similarity between the recognition result and the character string in the normalized dictionary includes: calculating the intersection and union of the recognition result and the character strings in the normalized dictionary; and taking the proportion of the intersection set and the union set as the similarity of the recognition result and the character strings in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence $y$ is as follows:
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
wherein $Y_X$ denotes the set of output tag sequences;
the formula for maximizing the log probability of the correct tag sequence $y^*$ is as follows:
$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
according to a second aspect, an embodiment of the present invention provides a bert-based named entity recognition system, including: the system comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for determining a named entity label according to identification requirements, and the named entity label is used for identifying a named entity of a text; the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training set, the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; the second processing module is used for respectively segmenting each training text to obtain a word sequence corresponding to each training text; the third processing module is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text; the fourth processing module is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model; the fifth processing module is used for inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model; the second acquisition module is used for acquiring the text to be recognized; the sixth processing module is used for inputting the text to be recognized into the entity recognition model to obtain a recognition result; the third acquisition module is used for acquiring a normalization dictionary, and the normalization dictionary comprises the full names and the short names of all named entities; and the seventh processing module is used for matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the third obtaining module includes: the first acquisition unit is used for acquiring encyclopedic corpus; the first processing unit is used for matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary; the second acquisition unit is used for acquiring a full name-abbreviation dictionary on the Internet; the second processing unit is used for obtaining a second dictionary according to the full name-abbreviation dictionary; a third acquisition unit for acquiring company data on the internet; the third processing unit is used for obtaining a third dictionary according to the company data; and the fourth processing unit is used for obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the first processing unit includes: the first processing subunit is used for carrying out full-name matching on the name of the named entity stored in advance in the encyclopedic corpus to obtain a full name corresponding to the name; the second processing subunit is used for searching the guide word for short in the encyclopedic corpus and matching the corresponding abbreviation of the name according to the guide word for short; and the third processing subunit is used for forming the first dictionary by the full names and the short names corresponding to the names.
Optionally, the third obtaining module further includes: and the fifth processing unit is used for carrying out duplication elimination processing on the normalization dictionary to obtain the duplication eliminated normalization dictionary.
Optionally, the seventh processing module includes: the sixth processing unit is used for completely matching the recognition result in the normalized dictionary to obtain a matching result; the seventh processing unit is used for taking the full name corresponding to the identification result as the normalized identification result if the matching result is successful; the eighth processing unit is used for calculating the similarity of the recognition result and the character string in the normalized dictionary if the matching result is matching failure; the judging unit is used for judging whether the similarity is greater than a preset value or not; the ninth processing unit is used for taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result if the similarity is larger than a preset value; and the tenth processing unit is used for taking the recognition result as a normalized recognition result if the similarity is less than or equal to a preset value.
Optionally, the eighth processing unit includes: the fourth processing subunit is used for calculating the intersection and the union of the recognition result and the character strings in the normalized dictionary; and the fifth processing subunit is used for taking the proportion of the intersection and the union as the similarity of the recognition result and the character string in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence $y$ is as follows:
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
wherein $Y_X$ denotes the set of output tag sequences;
the formula for maximizing the log probability of the correct tag sequence $y^*$ is as follows:
$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
according to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method for bert based named entity recognition as described in any of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which computer instructions are stored, the computer instructions being configured to cause a computer to execute the bert-based named entity recognition method described in any of the first aspects.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides a named entity identification method, a named entity identification system, electronic equipment and a storage medium based on bert, wherein the method comprises the following steps: determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text; acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence; respectively segmenting each training text to obtain a word sequence corresponding to each training text; inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text; inputting the word vector into a BilSTM model to obtain the output of the BilSTM model; inputting the output of the BilSTM model into a CRF model for training to obtain an entity recognition model; acquiring a text to be identified; inputting a text to be recognized into the entity recognition model to obtain a recognition result; acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities; and matching the recognition result with the normalization dictionary to obtain a normalized recognition result. The method does not need to manually construct features, is completely realized end to end, and can enable the model to fully learn the text features by inputting the word vector model trained by the Bert data, thereby greatly improving the effect of entity recognition; meanwhile, for the recognition result, the normalization dictionary is constructed and unified to an entity, so that the subsequent consistency processing is facilitated, and the accuracy of the recognition result is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a specific example of a bert-based named entity recognition method according to an embodiment of the present invention;
FIG. 2 is a block diagram of one particular example of a bert based named entity recognition system in accordance with embodiments of the present invention;
fig. 3 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a named entity recognition method based on bert, and as shown in FIG. 1, the method comprises steps S1-S10.
Step S1: and determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying the named entity of the text.
In this embodiment, the identification requirement includes a named entity to be identified, and the type of the named entity can be obtained according to the identification requirement, so as to obtain a named entity tag.
The named entity problem is formalized as a sequence labeling problem, with BIO labels representing entity boundaries, where B marks the first word of an entity, I a word inside an entity, and O a non-entity word. Specifically, for news text, three types of entities are mainly identified: person names, place names and organization names. A person name is labeled PER: its beginning is B-PER and its interior I-PER; a place name is labeled LOC: its beginning is B-LOC and its interior I-LOC; an organization name is labeled ORG: its beginning is B-ORG and its interior I-ORG. This is only illustrated schematically in the present embodiment and is not limiting. In practical application, the categories of named entities and the label of each entity are determined according to actual needs.
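As an illustration, a labeled sentence under this scheme might look as follows; the sentence and its tags are an invented example, not taken from the patent:

```python
# Hypothetical BIO labeling of the sentence "张三在北京大学工作"
# ("Zhang San works at Peking University"); one tag per character.
tokens = ["张", "三", "在", "北", "京", "大", "学", "工", "作"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```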
Step S2: and acquiring a training set, wherein the training set is labeled according to the named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence.
In this embodiment, the text data is labeled according to the named entity tag, that is, the named entity corresponding to the named entity tag in the text data is labeled. The labeled text data is used as a training text, and one training text is an input sentence, namely one training text corresponds to one sentence. And forming a training set by the marked training texts.
Specifically, in this embodiment, the daily news corpus is selected and labeled to obtain a training set.
Step S3: and respectively segmenting each training text to obtain a word sequence corresponding to each training text.
In this embodiment, each training text (input sentence) is first segmented, and the sentence is represented as a word sequence. Segmentation here means a sequence of single characters for Chinese and a sequence of single words for English; it does not involve any specific word segmentation algorithm, and the result is uniformly called a word sequence.
Step S4: and inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text.
In this embodiment, the word sequence is input to the bert feature representation layer to obtain a vector representation of each word in the sentence.
BERT stands for Bidirectional Encoder Representations from Transformers. Unlike other recent language representation models, BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. Consequently, the pre-trained BERT representation can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modifications. Using word vectors trained by BERT on large-scale data as the model input allows the model to fully learn text features and greatly improves the effect of entity recognition.
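A minimal sketch of this step using the open-source Hugging Face transformers library is shown below; the patent does not name a toolkit or checkpoint, so the library and the bert-base-chinese model are assumptions:

```python
# Obtain one contextual vector per token from a pre-trained BERT encoder.
# Assumption: the Hugging Face "transformers" library and the
# "bert-base-chinese" checkpoint; the patent specifies neither.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "张三在北京大学工作"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

word_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
print(word_vectors.shape)
```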
Step S5: and inputting the word vector into a BiLSTM model to obtain the output of the BiLSTM model.
In this embodiment, each word vector in a sentence is input into the BiLSTM network, and the value of the hidden layer in the current state is calculated. Here, a forward information vector is obtained for the sentence through the forward LSTM, and a backward information vector through the backward LSTM. The forward and backward information vectors are concatenated position by position to obtain the BiLSTM model output corresponding to the sentence.
Step S6: and inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model.
In this embodiment, the output of the BiLSTM layer passes through the CRF layer to obtain a label for each word, and the model is trained on the training set to obtain the entity recognition model; a sketch of these two steps follows.
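The following PyTorch sketch shows one possible shape of steps S5 and S6; the pytorch-crf package stands in for the CRF layer, and the dimensions (768-dimensional BERT vectors, 7 BIO tags for three entity types) are illustrative assumptions, not values fixed by the patent:

```python
# Minimal BiLSTM-CRF sketch (assumed libraries: torch, pytorch-crf).
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRF(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, num_tags=7):
        super().__init__()
        # Forward and backward LSTMs; their outputs are concatenated per position.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_tags)  # per-tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, word_vectors, tags, mask):
        feats, _ = self.bilstm(word_vectors)
        emissions = self.emission(feats)
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(emissions, tags, mask=mask)

    def predict(self, word_vectors, mask):
        feats, _ = self.bilstm(word_vectors)
        emissions = self.emission(feats)
        # Viterbi decoding of the highest-scoring tag sequence.
        return self.crf.decode(emissions, mask=mask)
```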
Step S7: and acquiring a text to be recognized.
In this embodiment, the text to be recognized is news data of some specific websites downloaded by a crawler.
Step S8: and inputting the text to be recognized into the entity recognition model to obtain a recognition result.
In this embodiment, the text to be recognized is input into the entity recognition model to recognize the named entity, so as to obtain a recognition result of the text to be recognized, that is, the named entity corresponding to the text to be recognized.
Step S9: and acquiring a normalization dictionary, wherein the normalization dictionary comprises the full names and the short names of all named entities.
In this embodiment, there are often multiple descriptions of the same named entity, such as full name and abbreviation. For the situation, a normalization dictionary is constructed, the dictionary comprises various descriptions corresponding to the named entities, the normalization dictionary is a large database, and the normalization dictionary comprises the full names and various short names of all the named entities.
Step S10: and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
In this embodiment, since some entities have multiple expressions, such as company names and acronyms, the identified entities have duplication and redundancy. In order to remove the duplicate, the identified entity needs to be normalized, which facilitates subsequent processing. And unifying the identified named entities to a unique name to obtain a normalized identification result. The unique name in this embodiment is a full name corresponding to the named entity.
Firstly, labels of the named entities are determined according to the identification requirement, the training set is labeled according to the labels, and the labeled training set is segmented to obtain the word sequence of each training text; the word sequence is input into the bert feature representation layer to obtain the word vector of each word; the word vectors are input into the BiLSTM model and the CRF model, and the training set is used to train the entity recognition model; the text to be recognized is acquired and input into the entity recognition model to obtain a recognition result, and the recognition result is normalized through the normalization dictionary to obtain a normalized recognition result. The method requires no manually constructed features and is implemented entirely end to end; using word vectors pre-trained by BERT on large-scale data as input allows the model to fully learn text features, greatly improving the effect of entity recognition; meanwhile, the recognition result output by the model is normalized through the constructed normalization dictionary, which removes repetition and redundancy, facilitates subsequent consistency processing, and improves the accuracy of the recognition result.
As an exemplary embodiment, the step of acquiring the normalized dictionary at step S9 includes steps S91-S97.
Step S91: obtaining encyclopedia corpus.
In this embodiment, the encyclopedia corpus is obtained through a web crawler; specifically, it may be a Baidu Encyclopedia corpus or a Wikipedia corpus.
Step S92: and matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary.
In this embodiment, the names of the named entities stored in advance may be name dictionaries of named entities accumulated through history, for example, a previously accumulated dictionary of more than 30 million company names.
In this embodiment, step S92 includes steps S921 to S923.
Step S921: and carrying out full name matching on the name of the pre-stored named entity in encyclopedia linguistic data to obtain a full name corresponding to the name.
Specifically, the names of the named entities stored in advance are subjected to full-name matching in encyclopedia corpus, and the full names corresponding to the names are matched.
Step S922: searching for abbreviation guide words in the encyclopedia corpus and matching the abbreviation corresponding to the name according to the guide words.
Specifically, an abbreviation guide word is a word used to introduce an abbreviation, such as "abbreviated as" or "also named"; in other embodiments the guide words may also include other expressions such as aliases, and can be set reasonably according to actual needs.
In the encyclopedia corpus, guide words such as "abbreviated as" and "also named" are located, and the abbreviation corresponding to the name is matched.
Step S923: and forming a first dictionary by using the full names and the short names corresponding to the names.
Specifically, the full names and the abbreviations of the names are paired; one full name may correspond to one or more abbreviations, and the matched pairs of full names and abbreviations constitute the first dictionary, as sketched below.
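A hedged sketch of this matching, assuming simple regular expressions over the crawled corpus and the guide words 简称 ("abbreviated as") and 又名 ("also named"); the pattern and sample text are illustrative only:

```python
import re

# Assumed abbreviation guide words: 简称 ("abbreviated as"), 又名 ("also named").
GUIDE_WORDS = ["简称", "又名"]


def extract_abbreviations(full_name: str, corpus_text: str) -> list:
    """Return abbreviations of full_name found after guide words in the text."""
    if full_name not in corpus_text:   # step S921: full-name match first
        return []
    abbreviations = []
    for guide in GUIDE_WORDS:          # step S922: match after each guide word
        pattern = re.escape(guide) + r"([\u4e00-\u9fa5A-Za-z]{2,10})"
        abbreviations += re.findall(pattern, corpus_text)
    return abbreviations


corpus = "北京智慧星光信息技术有限公司，简称智慧星光，是一家科技公司。"
name = "北京智慧星光信息技术有限公司"
first_dictionary = {name: extract_abbreviations(name, corpus)}
print(first_dictionary)  # {'北京智慧星光信息技术有限公司': ['智慧星光']}
```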
Step S93: and acquiring a full name-abbreviation dictionary on the Internet.
In this embodiment, published full name-abbreviation dictionaries, for example a dictionary of national institution full names and abbreviations, the full names of the countries of the world, and the like, are downloaded over the Internet.
Step S94: and obtaining a second dictionary according to the full name-abbreviation dictionary.
In this embodiment, the multiple full-name/abbreviation dictionaries downloaded from the internet form a full-name/abbreviation dictionary set, and the set is used as the second dictionary.
Step S95: company data on the internet is acquired.
In this embodiment, the company data disclosed on the internet is obtained through the web crawler, and may specifically be the listed company data downloaded from the stock website.
Step S96: and obtaining a third dictionary according to the company data.
In this embodiment, the company name corresponding to a stock name is extracted automatically through a rule. The rule can be understood as follows: the stock profile mentions the name of the company, so the corresponding company name is extracted by locating a leading phrase of the company name, such as "the company name" or "the name used by the company"; the company names obtained in this way form the third dictionary, as sketched below.
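One possible reading of this rule, sketched in Python; the leading phrase 公司名称 ("company name") and the sample profile are assumptions for illustration, since the patent only gives English glosses of its guide words:

```python
import re
from typing import Optional


def company_name_from_profile(profile: str) -> Optional[str]:
    # Capture the text after the assumed leading phrase "公司名称" up to the
    # next punctuation mark or whitespace.
    match = re.search(r"公司名称[:：]\s*([^，。；\s]+)", profile)
    return match.group(1) if match else None


profile = "证券简称：智慧星光；公司名称：北京智慧星光信息技术有限公司。"
print(company_name_from_profile(profile))  # 北京智慧星光信息技术有限公司
```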
Step S97: and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
In this embodiment, the first, second and third dictionaries obtained from different channels are merged to collect the full names and all abbreviations corresponding to them; after each full name has been paired with all of its abbreviations, the normalization dictionary is obtained, so that recognition results can subsequently be normalized to the unique full name corresponding to the named entity.
This step builds the normalization dictionary from encyclopedia corpora, full name-abbreviation dictionaries on the Internet, and company data on the Internet, making the normalization dictionary more comprehensive and accurate.
As an exemplary embodiment, step S98 is further included after the step of obtaining the normalized dictionary from the first dictionary, the second dictionary, and the third dictionary at step S97.
Step S98: and carrying out duplication removal processing on the normalization dictionary to obtain the duplication-removed normalization dictionary.
In this embodiment, different dictionaries obtained from different channels may have the same content, and these duplicate contents are removed, only one is retained, and the de-duplicated normalization dictionary is obtained after the de-duplication process.
For example, the first dictionary yields the pair "Beijing Smart Starlight Information Technology Co., Ltd. - Smart Starlight" and the second dictionary yields the same pair; since the full name and the abbreviation are duplicated, only one entry is kept in the normalization dictionary.
The step is to carry out duplication removal on the normalization dictionary, so that the redundancy of the normalization dictionary is reduced, the storage space is reduced, and the normalization efficiency is improved.
As an exemplary embodiment, the step of matching the recognition result with the normalized dictionary to obtain the normalized recognition result in step S10 includes steps S101-S106.
Step S101: and completely matching the recognition result in the normalization dictionary to obtain a matching result.
In this embodiment, the recognition result is looked up in the normalized dictionary to obtain a matching result. When the recognition result can be completely matched with the corresponding name in the normalized dictionary, namely, the recognition result and the name are completely the same, the matching result is a matching success, and the step S102 is executed; when the recognition result cannot find a completely corresponding name in the normalized dictionary, i.e. the two names are not completely the same, the matching result is a matching failure, and step S103 is executed.
Step S102: and if the matching result is successful, taking the full name corresponding to the identification result as the normalized identification result.
In this embodiment, the matching result is a successful matching, which indicates that the recognition result can find the corresponding name in the normalized dictionary, that is, the recognition result exists in the normalized dictionary, so that the full name corresponding to the name is used as the normalized recognition result.
Step S103: and if the matching result is matching failure, calculating the similarity between the recognition result and the character string in the normalized dictionary.
In this embodiment, the matching result is a matching failure, which indicates that the recognition result cannot completely correspond to the name in the normalized dictionary, and at this time, similarity comparison needs to be performed between the recognition result and the character string in the normalized dictionary to determine the closest name.
In this embodiment, step S103 may specifically include steps S1031 to S1032.
Step S1031: and calculating the intersection and union of the recognition result and the character strings in the normalized dictionary.
In this embodiment, the recognition result is compared with each character string in the normalized dictionary, and the intersection and union of the recognition result and each character string are obtained respectively.
The intersection is the number of distinct characters that the two character strings have in common, and the union is the number of all distinct characters contained in the two character strings. The calculation proceeds as follows:
1. Deduplicate each of the two character strings, removing its repeated characters.
2. Select one character string and traverse it, comparing character by character against the other string; count the number of shared characters (the intersection) while accumulating the total number of distinct characters (the union).
Step S1032: and taking the proportion of the intersection set and the union set as the similarity of the recognition result and the character strings in the normalized dictionary.
In this embodiment, the similarity calculation uses the Jaccard similarity coefficient, i.e., the ratio of the intersection to the union of the two character strings: similarity = intersection / union, as sketched below.
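A direct Python rendering of this computation; the pair of strings in the example is invented:

```python
# Character-level Jaccard similarity: deduplicate both strings, then divide the
# number of shared distinct characters by the number of all distinct characters.
def jaccard_similarity(a: str, b: str) -> float:
    chars_a, chars_b = set(a), set(b)   # step 1: remove repeated characters
    intersection = chars_a & chars_b    # distinct characters shared by both
    union = chars_a | chars_b           # all distinct characters of the two
    return len(intersection) / len(union) if union else 0.0


# 5 shared characters out of 15 distinct ones: similarity = 1/3
print(jaccard_similarity("智慧星光科技", "北京智慧星光信息技术有限公司"))
```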
Step S104: and judging whether the similarity is greater than a preset value. If the similarity is greater than the preset value, executing step S105; if the similarity is less than or equal to the preset value, step S106 is executed.
In this embodiment, the preset value is determined according to the recognition accuracy; the higher the identification precision requirement is, the larger the numerical value of the preset value is; conversely, the lower the identification accuracy requirement, the smaller the value of the preset value. The preset value may range anywhere from 0 to 1. Specifically, the preset value may be set to 0.8; of course, in other embodiments, the preset value may also be set to other values, and may be reasonably set in practical application according to needs.
Step S105: and if the similarity is greater than the preset value, taking the full name corresponding to the character string in the normalized dictionary as the normalized recognition result.
In this embodiment, when the similarity is greater than the preset value, it indicates that the similarity between the recognition result and the character string in the normalized dictionary is extremely high, the recognition result and the character string in the normalized dictionary are defaulted to be the same character, and the full name corresponding to the character string is taken as the normalized recognition result.
Step S106: and if the similarity is smaller than or equal to a preset value, taking the recognition result as a normalized recognition result.
In this embodiment, when the similarity is smaller than or equal to the preset value, it indicates that the similarity between the recognition result and the character string in the normalized dictionary is low, and the recognition result does not exist in the normalized dictionary, so the recognition result is directly used as the normalized recognition result.
In this step, the recognition result is matched against the normalization dictionary: if the match succeeds, the full name corresponding to the recognition result is taken as the normalized recognition result; if the match fails, the similarity between the recognition result and the character strings in the normalization dictionary is compared, and the normalized recognition result is obtained from the similarity. Through these steps the recognition results are normalized and unified to a single entity, which facilitates subsequent consistency processing; a combined sketch follows.
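Putting steps S101 to S106 together, a minimal sketch of the matching flow might read as follows; the dictionary contents and the 0.8 threshold follow the examples given above and are otherwise assumptions:

```python
# Combined sketch of steps S101-S106 (illustrative only).
def jaccard(a: str, b: str) -> float:
    # Character-level Jaccard coefficient, as in the previous sketch.
    return len(set(a) & set(b)) / len(set(a) | set(b))


def normalize(entity: str, norm_dict: dict, threshold: float = 0.8) -> str:
    if entity in norm_dict:              # S101/S102: exact match succeeded
        return norm_dict[entity]
    # S103: exact match failed; compare against every known surface form.
    best_key = max(norm_dict, key=lambda key: jaccard(entity, key))
    if jaccard(entity, best_key) > threshold:  # S104/S105: close enough
        return norm_dict[best_key]
    return entity                        # S106: keep the recognition result


# Maps every known surface form (full name or abbreviation) to the full name.
norm_dict = {
    "北京智慧星光信息技术有限公司": "北京智慧星光信息技术有限公司",
    "智慧星光": "北京智慧星光信息技术有限公司",
}
print(normalize("智慧星光", norm_dict))              # exact match
print(normalize("北京智慧星光信息技术公司", norm_dict))  # similarity > 0.8
```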
BiLSTM structure: given an input sequence $(x_1, x_2, \ldots, x_n)$, a recurrent neural network model returns a sequence of representations $(h_1, h_2, \ldots, h_n)$ for the input sequence. The RNN model can dynamically capture the information of sequence data and has the capacity to memorize and store information, but in practice it readily suffers from vanishing or exploding gradients. The LSTM model introduces a memory cell and a gating mechanism, making effective use of long-distance information and overcoming the vanishing-gradient problem. At time $t$, given input $x_t$, the hidden-layer output representation of the LSTM is computed as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively. The gating mechanism can effectively filter and memorize the information of the memory cell, thereby overcoming the problems of the RNN.
In the prediction phase, a softmax classifier is usually adopted for multi-class problems, but the softmax classifier does not take the dependency between tags into account in a sequence labeling problem. This embodiment therefore uses a CRF model, which considers the global information of the tag sequence and can thus predict the tags better. The details are as follows. Suppose a transition score matrix $A$ is introduced, whose element $A_{i,j}$ represents the score of the transition from label $i$ to label $j$; let $y_0$ and $y_{n+1}$ be the start and end labels of the sentence. With $k$ label types, $A \in \mathbb{R}^{(k+2) \times (k+2)}$. If the sentence length is $n$, the score matrix of the output layer is $P \in \mathbb{R}^{n \times k}$, whose element $P_{i,j}$ represents the output score of the $i$-th word under the $j$-th label. Given an input sentence $X = (x_1, x_2, \ldots, x_n)$ and an output tag sequence $Y = (y_1, y_2, \ldots, y_n)$, the total score of the tag sequence is:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $A_{y_i, y_{i+1}}$ denotes the score of the transition from label $y_i$ to label $y_{i+1}$, and $P_{i, y_i}$ denotes the output score of the $i$-th word on the $y_i$-th label.

Normalizing over all possible sequence paths yields a probability distribution for the output sequence $y$:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

wherein $Y_X$ denotes the set of output tag sequences.

During training, the log probability of the correct tag sequence $y^*$ is maximized:

$$\log p(y^* \mid X) = s(X, y^*) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

As the log-probability formula shows, the purpose of the sentence-level likelihood function is to encourage the model to generate the correct tag sequence. In the decoding stage, the sequence with the highest total score is predicted as the optimal sequence:

$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
in the prediction stage, a dynamic programming algorithm Viterbi is adopted to solve the optimal sequence.
This embodiment also provides a bert-based named entity recognition system, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The embodiment also provides a named entity recognition system based on bert, as shown in fig. 2, including:
the system comprises a first processing module 1, a second processing module and a third processing module, wherein the first processing module is used for determining a named entity label according to an identification requirement, and the named entity label is used for identifying a named entity of a text;
the first acquisition module 2 is used for acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
the second processing module 3 is used for performing word segmentation on each training text to obtain a word sequence corresponding to each training text;
the third processing module 4 is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text;
the fourth processing module 5 is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model;
a fifth processing module 6, configured to input the output of the BiLSTM model to the CRF model for training, so as to obtain an entity identification model;
the second obtaining module 7 is used for obtaining a text to be recognized;
the sixth processing module 8 is configured to input the text to be recognized into the entity recognition model to obtain a recognition result;
a third obtaining module 9, configured to obtain a normalization dictionary, where the normalization dictionary includes full names and short names of all named entities;
and a seventh processing module 10, configured to match the recognition result with the normalization dictionary to obtain a normalized recognition result.
Optionally, the third obtaining module includes: the first acquisition unit is used for acquiring encyclopedic corpus; the first processing unit is used for matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary; the second acquisition unit is used for acquiring a full name-abbreviation dictionary on the Internet; the second processing unit is used for obtaining a second dictionary according to the full name-abbreviation dictionary; a third acquisition unit for acquiring company data on the internet; the third processing unit is used for obtaining a third dictionary according to the company data; and the fourth processing unit is used for obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
Optionally, the first processing unit includes: the first processing subunit is used for carrying out full-name matching on the name of the named entity stored in advance in the encyclopedic corpus to obtain a full name corresponding to the name; the second processing subunit is used for searching the guide word for short in the encyclopedic corpus and matching the corresponding abbreviation of the name according to the guide word for short; and the third processing subunit is used for forming the first dictionary by the full names and the short names corresponding to the names.
Optionally, the third obtaining module further includes: and the fifth processing unit is used for carrying out duplication elimination processing on the normalization dictionary to obtain the duplication eliminated normalization dictionary.
Optionally, the seventh processing module includes: the sixth processing unit is used for completely matching the recognition result in the normalized dictionary to obtain a matching result; the seventh processing unit is used for taking the full name corresponding to the identification result as the normalized identification result if the matching result is successful; the eighth processing unit is used for calculating the similarity of the recognition result and the character string in the normalized dictionary if the matching result is matching failure; the judging unit is used for judging whether the similarity is greater than a preset value or not; the ninth processing unit is used for taking the full name corresponding to the character string in the normalized dictionary as a normalized recognition result if the similarity is larger than a preset value; and the tenth processing unit is used for taking the recognition result as a normalized recognition result if the similarity is less than or equal to a preset value.
Optionally, the eighth processing unit includes: the fourth processing subunit is used for calculating the intersection and the union of the recognition result and the character strings in the normalized dictionary; and the fifth processing subunit is used for taking the proportion of the intersection and the union as the similarity of the recognition result and the character string in the normalized dictionary.
Optionally, the calculation formula of the output value of the state hiding layer in the BiLSTM model is as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
wherein $W_{xi}$ denotes the weight matrix of the input gate from the input layer to the hidden layer, $W_{hi}$ the weight matrix of the input gate from the output layer to the hidden layer, $W_{ci}$ the weight matrix of the input gate from the memory layer to the hidden layer, $h_{t-1}$ the value of the hidden layer at time $t-1$, $c_{t-1}$ the value of the memory layer at time $t-1$, $b_i$ the bias vector of the input gate of the hidden layer, $W_{xf}$ the weight matrix of the input gate from the input layer to the forget layer, $W_{hf}$ the weight matrix of the input gate from the output layer to the forget layer, $W_{cf}$ the weight matrix of the input gate from the memory layer to the forget layer, $b_f$ the bias vector of the input gate of the forget layer, $W_{xc}$ the weight matrix of the input gate from the input layer to the memory layer, $W_{hc}$ the weight matrix of the input gate from the forget layer to the memory layer, $b_c$ the bias vector of the input gate of the memory layer, $W_{xo}$ the weight matrix of the input gate from the input layer to the output layer, $W_{ho}$ the weight matrix of the input gate from the forget layer to the output layer, $W_{co}$ the weight matrix of the input gate from the memory layer to the output layer, $b_o$ the bias vector of the input gate of the output layer, $c$ the state of the memory cell, $\sigma$ and $\tanh$ two different neuron activation functions, and $i_t$, $f_t$ and $o_t$ the input gate, the forget gate and the output gate respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of the transition from label y_i to label y_{i+1}, and P_{i, y_i} denotes the output score of the i-th word on the y_i-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence y is as follows:
p(y | X) = e^{s(X, y)} / Σ_{y′ ∈ Y_X} e^{s(X, y′)}

where Y_X denotes the set of all possible output tag sequences;
the formula for maximizing the log probability for the correct tag sequence y is as follows:
log p(y | X) = s(X, y) - log Σ_{y′ ∈ Y_X} e^{s(X, y′)}
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
y* = argmax_{y′ ∈ Y_X} s(X, y′)
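The four CRF formulas above can likewise be sketched in numpy. The code below is an illustrative simplification (integer label indices, no explicit start and end tags), not the patent's implementation: P is the n×K matrix of per-word label scores produced by the BiLSTM, and A is the K×K transition matrix.

    import numpy as np

    def logsumexp(x, axis=None):
        """Numerically stable log of the sum of exponentials."""
        m = np.max(x, axis=axis, keepdims=True)
        s = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
        return np.squeeze(s, axis=axis) if axis is not None else s.item()

    def path_score(P, A, y):
        """s(X, y): emission scores plus label-to-label transition scores."""
        y = np.asarray(y)
        return P[np.arange(len(y)), y].sum() + A[y[:-1], y[1:]].sum()

    def log_prob(P, A, y):
        """log p(y|X): path score minus the log-normalizer over all paths."""
        alpha = P[0].astype(float)
        for t in range(1, len(P)):                 # forward algorithm
            alpha = P[t] + logsumexp(alpha[:, None] + A, axis=0)
        return path_score(P, A, y) - logsumexp(alpha)

    def viterbi(P, A):
        """Return the label sequence with the highest total score."""
        n, K = P.shape
        delta, back = P[0].astype(float), np.zeros((n, K), dtype=int)
        for t in range(1, n):
            scores = delta[:, None] + A + P[t]     # scores[i, j]: prev label i -> j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0)
        y = [int(delta.argmax())]
        for t in range(n - 1, 0, -1):              # trace the best path backwards
            y.append(int(back[t][y[-1]]))
        return y[::-1]

During training, -log_prob would be minimized for the correct tag sequence; viterbi implements the prediction formula for the optimal sequence.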
the bert based named entity recognition system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory that execute one or more software or fixed programs, and/or other devices that may provide the functionality described above.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device. As shown in fig. 3, the electronic device includes one or more processors 71 and a memory 72; one processor 71 is taken as an example in fig. 3.
The electronic device may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 3.
The processor 71 may be a Central Processing Unit (CPU). The processor 71 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 72, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the bert-based named entity recognition method in the embodiments of the present application. The processor 71 executes the various functional applications and data processing of the server, i.e. implements the bert-based named entity recognition method of the above method embodiments, by running the non-transitory software programs, instructions and modules stored in the memory 72.
The memory 72 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the processing device of the server, and the like. Further, the memory 72 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, and such remote memory may be connected to the processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 and, when executed by the one or more processors 71, perform the method shown in fig. 1.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments of the bert-based named entity recognition method. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also include a combination of the above kinds of memories.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A named entity recognition method based on bert is characterized by comprising the following steps:
determining a named entity tag according to the identification requirement, wherein the named entity tag is used for identifying a named entity of the text;
acquiring a training set, wherein the training set is labeled according to named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
respectively segmenting each training text to obtain a word sequence corresponding to each training text;
inputting the word sequence into a bert feature representation layer to obtain a word vector of each word in the training text;
inputting the word vectors into a BiLSTM model to obtain the output of the BiLSTM model;
inputting the output of the BiLSTM model into a CRF model for training to obtain an entity recognition model;
acquiring a text to be identified;
inputting a text to be recognized into the entity recognition model to obtain a recognition result;
acquiring a normalization dictionary, wherein the normalization dictionary comprises full names and short names of all named entities;
and matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
2. The bert-based named entity recognition method of claim 1, wherein the step of obtaining a normalized dictionary comprises:
acquiring an encyclopedia corpus;
matching the names of the pre-stored named entities in the encyclopedia corpus to obtain a first dictionary;
acquiring a full name-abbreviation dictionary on the Internet;
obtaining a second dictionary according to the full name-abbreviation dictionary;
acquiring company data on the Internet;
obtaining a third dictionary according to the company data;
and obtaining a normalized dictionary according to the first dictionary, the second dictionary and the third dictionary.
3. The bert-based named entity recognition method of claim 2, wherein the step of matching pre-stored named entity names in an encyclopedia corpus to obtain the first dictionary comprises:
performing full-name matching of the pre-stored named entity names in the encyclopedia corpus to obtain the full name corresponding to each name;
searching the encyclopedia corpus for an abbreviation guide word, and matching the abbreviation corresponding to the name according to the guide word;
and forming the first dictionary from the full names and the abbreviations corresponding to the names.
4. The bert-based named entity recognition method of claim 2, wherein after the step of deriving a normalized dictionary from the first dictionary, the second dictionary, and the third dictionary, further comprising:
and performing deduplication processing on the normalization dictionary to obtain the deduplicated normalization dictionary.
5. The bert-based named entity recognition method of claim 1, wherein the step of matching the recognition result with the normalization dictionary to obtain a normalized recognition result comprises:
completely matching the recognition result in the normalization dictionary to obtain a matching result;
if the matching succeeds, taking the full name corresponding to the recognition result as the normalized recognition result;
if the matching fails, calculating the similarity between the recognition result and the character strings in the normalization dictionary;
judging whether the similarity is greater than a preset value;
if the similarity is greater than the preset value, taking the full name corresponding to the character string in the normalization dictionary as the normalized recognition result;
and if the similarity is less than or equal to the preset value, taking the recognition result as the normalized recognition result.
6. The bert-based named entity recognition method of claim 5, wherein the step of calculating the similarity between the recognition result and the character strings in the normalization dictionary comprises:
calculating the intersection and the union of the recognition result and the character string in the normalization dictionary;
and taking the ratio of the intersection to the union as the similarity between the recognition result and the character string in the normalization dictionary.
7. The bert-based named entity recognition method of claim 1,
the output value of the hidden state layer in the BiLSTM model is calculated as follows:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)

h_t = o_t tanh(c_t)
where W_{xi}, W_{hi} and W_{ci} denote the weight matrices from the input layer, the hidden layer and the memory cell to the input gate, and b_i denotes the bias vector of the input gate; W_{xf}, W_{hf} and W_{cf} denote the weight matrices from the input layer, the hidden layer and the memory cell to the forget gate, and b_f denotes the bias vector of the forget gate; W_{xc} and W_{hc} denote the weight matrices from the input layer and the hidden layer to the memory cell, and b_c denotes the bias vector of the memory cell; W_{xo}, W_{ho} and W_{co} denote the weight matrices from the input layer, the hidden layer and the memory cell to the output gate, and b_o denotes the bias vector of the output gate; h_{t-1} denotes the value of the hidden layer at time t-1 and c_{t-1} the value of the memory cell at time t-1; c denotes the state of the memory cell; σ and tanh denote two different neuron activation functions; and i_t, f_t and o_t denote the input gate, the forget gate and the output gate, respectively;
the overall score for the tag sequence in the CRF model is calculated as follows:
s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where A_{y_i, y_{i+1}} denotes the score of the transition from label y_i to label y_{i+1}, and P_{i, y_i} denotes the output score of the i-th word on the y_i-th label;
the calculation formula for normalizing all possible sequence paths to produce a probability distribution for the output sequence y is as follows:
p(y | X) = e^{s(X, y)} / Σ_{y′ ∈ Y_X} e^{s(X, y′)}

where Y_X denotes the set of all possible output tag sequences;
the formula for maximizing the log probability for the correct tag sequence y is as follows:
log p(y | X) = s(X, y) - log Σ_{y′ ∈ Y_X} e^{s(X, y′)}
the calculation formula for predicting the sequence with the highest total score as the optimal sequence is as follows:
y* = argmax_{y′ ∈ Y_X} s(X, y′)
8. A bert-based named entity recognition system, comprising:
the first processing module is used for determining a named entity label according to the recognition requirement, wherein the named entity label is used for identifying a named entity of a text;
the first acquisition module is used for acquiring a training set, wherein the training set is labeled according to the named entity labels and comprises a plurality of labeled training texts, and one training text corresponds to one input sentence;
the second processing module is used for respectively segmenting each training text to obtain a word sequence corresponding to each training text;
the third processing module is used for inputting the word sequence into the bert feature representation layer to obtain a word vector of each word in the training text;
the fourth processing module is used for inputting the word vectors into the BiLSTM model to obtain the output of the BiLSTM model;
the fifth processing module is used for inputting the output of the BiLSTM model into the CRF model for training to obtain an entity recognition model;
the second acquisition module is used for acquiring the text to be recognized;
the sixth processing module is used for inputting the text to be recognized into the entity recognition model to obtain a recognition result;
the third acquisition module is used for acquiring a normalization dictionary, and the normalization dictionary comprises the full names and the short names of all named entities;
and the seventh processing module is used for matching the recognition result with the normalization dictionary to obtain a normalized recognition result.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to cause the at least one processor to perform the bert-based named entity recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the bert-based named entity recognition method of any one of claims 1 to 7.
CN202110364506.0A 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium Pending CN113177412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110364506.0A CN113177412A (en) 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113177412A true CN113177412A (en) 2021-07-27

Family

ID=76923039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110364506.0A Pending CN113177412A (en) 2021-04-05 2021-04-05 Named entity identification method and system based on bert, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177412A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN111859963A (en) * 2019-04-08 2020-10-30 中移(苏州)软件技术有限公司 Named entity recognition method, equipment, device and computer readable storage medium
CN110738055A (en) * 2019-10-23 2020-01-31 北京字节跳动网络技术有限公司 Text entity identification method, text entity identification equipment and storage medium
CN111597304A (en) * 2020-05-15 2020-08-28 上海财经大学 Secondary matching method for accurately identifying Chinese enterprise name entity
CN111723575A (en) * 2020-06-12 2020-09-29 杭州未名信科科技有限公司 Method, device, electronic equipment and medium for recognizing text
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN112257446A (en) * 2020-10-20 2021-01-22 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and readable storage medium
CN112257422A (en) * 2020-10-22 2021-01-22 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627185A (en) * 2021-07-29 2021-11-09 重庆邮电大学 Entity identification method for liver cancer pathological text naming
CN113722476A (en) * 2021-07-30 2021-11-30 的卢技术有限公司 Resume information extraction method and system based on deep learning
CN113627183A (en) * 2021-08-12 2021-11-09 平安国际智慧城市科技股份有限公司 Method, device and equipment for standardizing department name text and storage medium
CN113704480A (en) * 2021-11-01 2021-11-26 成都我行我数科技有限公司 Intelligent minimum stock unit matching method
CN113704480B (en) * 2021-11-01 2022-01-25 成都我行我数科技有限公司 Intelligent minimum stock unit matching method
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN115221882A (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN115221882B (en) * 2022-07-28 2023-06-20 平安科技(深圳)有限公司 Named entity identification method, device, equipment and medium
CN116384515A (en) * 2023-06-06 2023-07-04 之江实验室 Model training method and device, storage medium and electronic equipment
CN116384515B (en) * 2023-06-06 2023-09-01 之江实验室 Model training method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination