CN112749562A - Named entity identification method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN112749562A (application number CN202011636806.1A)
- Authority
- CN
- China
- Prior art keywords
- bilstm
- bert
- model
- data
- crf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, and relates to the technical field of natural language processing. Acquired raw data of the professional field is preprocessed and a data set is constructed; a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer is then constructed and trained with the data set; finally, named entity recognition is performed with the trained BERT-BiLSTM-CRF model. According to the technical scheme, the named entity recognition model constructed on the basis of the BERT model well solves the problems of difficult entity recognition and low precision when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a named entity identification method, a named entity identification device, a storage medium and electronic equipment.
Background
The rapid development of the internet has made text data grow explosively, and this text data contains much valuable information, so how to extract useful information from such massive text data has become a current research focus. The task of information extraction is to automatically or semi-automatically extract useful information from unstructured text data and convert it into structured or semi-structured data; as one of the subtasks of information extraction, named entity recognition technology has been greatly improved and developed in both industry and academia.
However, in professional fields (such as the automobile field), because no established data set exists, labeled data is insufficient, entity boundaries are fuzzy, and related research literature is scarce, existing machine learning and deep learning models cannot achieve good results, and research on named entity recognition in professional fields is especially lacking.
Therefore, the existing named entity recognition technology suffers from difficult recognition and low recognition accuracy when labeled data in a professional field is insufficient and entity boundaries are fuzzy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, solving the problems in the prior art that named entity recognition is difficult and recognition accuracy is low when labeled data in a professional field is insufficient and entity boundaries are fuzzy.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
in a first aspect, the present invention provides a named entity identification method, where the method includes:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Preferably, the acquiring raw data of the professional field and constructing the data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Preferably, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Preferably, the method further comprises:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In a second aspect, the present invention provides a named entity recognition apparatus, including:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and the named entity recognition module is used for recognizing named entities by utilizing the trained BERT-BiLSTM-CRF model.
Preferably, the acquiring raw data of the professional field and constructing the data set by the data acquiring module includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Preferably, the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Preferably, the apparatus further comprises: a model performance evaluation module, which is used for inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In a third aspect, the present invention proposes a computer-readable storage medium storing a computer program for named entity recognition, wherein the computer program causes a computer to perform the named entity recognition method as described above.
In a fourth aspect, the present invention provides an electronic device, including:
one or more processors;
a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the named entity identification method as described above.
(III) advantageous effects
The invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment. Compared with the prior art, the method has the following beneficial effects:
the invention discloses a named entity recognition method, a device, a storage medium and electronic equipment, which preprocess acquired raw data in the professional field and construct a data set, then construct a BERT-BilSTM-CRF model comprising a BERT pre-training model layer, a BilSTM network layer and a CRF reasoning layer, train the BERT-BilSTM-CRF model by using the data set, and finally recognize the named entity by using the trained BERT-BilSTM-CRF model. According to the technical scheme, the named entity recognition model constructed based on the BERT model well solves the problems of difficult entity recognition and low precision when the labeling data in the professional field is insufficient and the entity boundary is fuzzy, and improves the performance and the recognition accuracy of the entity recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a named entity recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram of an overall framework of a named entity recognition model in an embodiment of the present invention;
FIG. 3 is a BERT pre-training model framework in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The embodiment of the application provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, solves the problems of difficulty in named entity identification and low identification precision when labeling data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and identification accuracy of an entity identification model.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
aiming at the problems that labeled data in the professional field is insufficient and that entities are difficult to recognize with low precision when entity boundaries are fuzzy, the embodiment of the invention preprocesses the acquired original data of the professional field and constructs a data set; it then uses a BERT model to enhance the semantic representation of words, dynamically generating word vectors according to contextual features, and constructs a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer; the BERT-BiLSTM-CRF model is trained with the data set, and finally the trained BERT-BiLSTM-CRF model is used to recognize named entities, thereby greatly improving the performance and recognition accuracy of the entity recognition model.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
in a first aspect, the present invention provides a named entity identification method, where the method includes:
s1, acquiring original data of the professional field and constructing a data set;
s2, constructing a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and S3, carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
It can be seen that the method, apparatus, storage medium and electronic device for named entity recognition according to the embodiments of the present invention preprocess acquired raw data of the professional field and construct a data set, construct a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, train the BERT-BiLSTM-CRF model with the data set, and finally use the trained BERT-BiLSTM-CRF model to recognize named entities. The named entity recognition model constructed on the basis of the BERT model well solves the problems of difficult entity recognition and low precision when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model.
In the above method of the embodiment of the present invention, in order to obtain more, higher quality, and effective data, a preferred processing method is that when acquiring raw data in a professional field and constructing a data set, the method includes:
acquiring raw data of the professional field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
In addition, in the method of the embodiment of the present invention, in order to solve the problems of insufficient labeled data and low recognition accuracy when entity boundaries in the professional field are fuzzy and hard to recognize, and to improve the performance and recognition accuracy of the entity recognition model, a preferred processing mode is that, in the constructed BERT-BiLSTM-CRF model, the BERT pre-training model layer is used to encode each character to obtain the word vector of the corresponding character; the BiLSTM network layer is used to bidirectionally encode the sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used to output the label sequence with the maximum probability based on the new feature vectors.
In practice, since the recognition performance of the entity recognition model needs to be evaluated and adjusted in advance according to the actual application situation, a preferred processing manner is that the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
The professional fields include the automobile field, the engine field, the e-commerce field and the like. Taking the automobile field as an example, the concrete implementation of one embodiment of the invention is described in detail below step by step.
Referring to fig. 1, a named entity recognition method of the present invention specifically includes the following steps:
and S1, acquiring the raw data of the professional field and constructing a data set.
First, consumer comment data is crawled from social media platforms such as Autohome and Yiche and preprocessed. Specifically, the method comprises the following steps:
and (5) data crawling. The method is characterized in that a lightweight crawler frame script based on Python is used as a base, webpage data are extracted and analyzed through XPath and CSS expressions, a Redis database is used as a distributed shared crawler queue, a MongoDB database is used as a data storage library, a Selenium automated testing tool is integrated, middleware such as a random User-Agent, an Agilent Agent IP and a self-built Agent IP pool are used at the same time, and the middleware is deployed to a cloud server, so that large-scale real-time incremental crawling of product comment data of a plurality of social media platforms is realized.
Data preprocessing. After the original corpus is crawled and before it is input into the model, the data is preprocessed through data cleaning, data standardization, text word segmentation, sequence labeling and data set construction, obtaining data of higher quality and effectiveness.
Data cleaning. Cleaning meaningless comments (mainly comments that contribute nothing to model training and the task, such as spam comments and repeated comments) comprises the following steps. Spam comment removal: comments that conflict with core social values or contain insults and malicious vocabulary are removed by matching a summarized list of keywords such as 'brain residue', 'rotten goods', 'http' and 'Yuan'; since overly long or short comments are often spam, the comment length is limited to 50-200 words, and comments whose length falls outside this range are removed directly. Text deduplication: while inspecting the comment corpus, some comment contents were found to be highly similar and some even repeated, so the texts are deduplicated with the SimHash method.
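As a minimal illustration of the length and keyword rules described above, the following sketch drops comments outside the 50-200 character window or containing a known spam keyword. The helper name, thresholds as character counts, and the keyword list are placeholders, not the patent's actual filters:

```python
def keep_comment(text, min_len=50, max_len=200, spam_words=("http",)):
    """Return True if a comment survives the cleaning rules:
    length within [min_len, max_len] and no spam keyword present."""
    if not (min_len <= len(text) <= max_len):
        return False
    return not any(word in text for word in spam_words)

# Illustrative checks: a normal-length comment passes, a short one
# and one containing a spam keyword are dropped.
print(keep_comment("a" * 100), keep_comment("a" * 10))
```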
Data standardization. The cleaned data is processed further, mainly comprising: text correction, since some erroneous characters inevitably exist in the comment corpus, using an intelligent text error-correction interface; stop-word removal, stripping meaningless symbols and special symbols such as emoticons from the comment data by regular-expression matching; and traditional-to-simplified conversion, converting traditional Chinese characters in the data into simplified characters to ease word-vector training.
Text word segmentation. jieba word segmentation is used together with a constructed automobile-field entity dictionary with definite word boundaries, and the following automobile-field named entity classes are defined: brand name, model name, structure name and attribute name. For example: brand names such as Haval and BMW; model names such as 650EV and RAV4; structure names such as steering wheel and engine; attribute names such as power, fuel consumption and displacement.
Sequence labeling. Sequence labeling is, simply put, taking a string of characters, tagging the elements present in the sequence with relevant labels, and analyzing the sequence in depth through those labels. The data is labeled manually with the BIO tagging scheme, in which 'B' marks the first character of an entity, 'I' marks a non-initial character, and 'O' marks a non-entity character. For example: B-BRA denotes the first character of a brand name; I-BRA a middle character of a brand name; B-MOD the first character of a model name; I-MOD a middle character of a model name; B-STR the first character of a structure name; I-STR a middle character of a structure name; B-ATT the first character of an attribute name; I-ATT a middle character of an attribute name; and O a non-named entity.
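The BIO scheme above can be sketched in a few lines of Python, using the patent's tag set (BRA = brand, MOD = model, STR = structure, ATT = attribute). The sentence, entity spans and helper function are illustrative, not taken from the patent's corpus:

```python
def bio_tags(tokens, entities):
    """entities: list of (start, end, type) token spans, end exclusive.
    Produces one BIO tag per token: B-* at span start, I-* inside, O elsewhere."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["哈", "弗", "的", "油", "耗", "很", "低"]   # "Haval's fuel consumption is low"
entities = [(0, 2, "BRA"), (3, 5, "ATT")]             # brand span, attribute span
print(list(zip(tokens, bio_tags(tokens, entities))))
```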
Constructing the data set. After manual labeling, the BIO-labeled data set is generated automatically with a Python script, and the data set is then divided into a training set, a test set and a verification set in the ratio 6:2:2.
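A minimal sketch of the 6:2:2 split; the shuffle seed and helper name are assumptions added for reproducibility, not specified by the patent:

```python
import random

def split_622(samples, seed=42):
    """Shuffle labeled samples and split them 6:2:2 into
    training, test and verification sets."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_test = int(n * 0.6), int(n * 0.2)
    return data[:n_train], data[n_train:n_train + n_test], data[n_train + n_test:]

train, test, val = split_622(range(100))
print(len(train), len(test), len(val))   # 60 20 20
```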
S2, constructing a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer.
A BERT-BiLSTM-CRF model is constructed for Chinese named entity recognition, see FIG. 2; it comprises: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer. The text data of the training set is input into the BERT pre-training model layer, where each character is encoded to obtain the word-vector representation of that character; the BiLSTM layer then bidirectionally encodes the word-vector sequence, constructing a new feature vector for each word; finally, the CRF reasoning layer outputs the tag sequence with the maximum probability as the model's final predicted labels. Specifically, the method comprises the following steps:
The BERT pre-training model layer. The BERT model uses stacked bidirectional Transformer encoders to encode characters, so that the prediction of each character can refer to character information in both the forward and backward directions; each encoder unit consists mainly of a feed-forward neural network and a self-attention mechanism. See FIG. 3, where E1, E2, ..., EN are the input vectors of the model and T1, T2, ..., TN are its output vectors. The Transformer performs an attention computation between each word of the input sentence and all words of the sentence to obtain the mutual relations between words and capture the internal relations of the sentence, weighting words according to the attention computation so that important words receive higher weight. The attention computation is defined as:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the Query, Key and Value matrices respectively. Three weight matrices $W^{Q}$, $W^{K}$, $W^{V}$ are required in the computation, where $W^{Q}$ and $W^{K}$ have size $k\times d_k$ and $W^{V}$ has size $k\times d_v$; all three are obtained through model training, with:

$$Q=AW^{Q};\quad K=AW^{K};\quad V=AW^{V}$$

where $A$ is an $n\times k$ matrix whose rows are the vector representations of the words in the input sentence, and each row of $Q$, $K$, $V$ corresponds to the Query, Key and Value vector representation of each word in the input sentence, respectively.
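The scaled dot-product attention computation described above can be sketched in plain Python with tiny matrices for illustration; a real implementation would use a tensor library, and the matrices here are arbitrary examples:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])          # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
```

Each output row is a weighted mixture of the Value rows, with more weight on the position whose Key matches the Query.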
The BiLSTM network layer. The LSTM is used to solve the gradient explosion and gradient vanishing problems of the recurrent neural network (RNN). In addition, it alleviates the long-sequence forgetting problem through three computational gates, namely a forget gate f, an input gate i and an output gate o. The specific calculation formulas are as follows:
$$f_t=\sigma(W_f h_{t-1}+U_f x_t+b_f)$$

$$i_t=\sigma(W_i h_{t-1}+U_i x_t+b_i)$$

$$o_t=\sigma(W_o h_{t-1}+U_o x_t+b_o)$$

$$\tilde{c}_t=\tanh(W_c h_{t-1}+U_c x_t+b_c)$$

$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$$

$$h_t=o_t\odot\tanh(c_t)$$

where $W$ and $U$ denote the weight matrices and $b$ the bias vectors connecting two layers within each gate, $\sigma$ is the sigmoid activation function, $\odot$ denotes the element-wise (dot) product, $\tilde{c}_t$ is the candidate state at time $t$, $x_t$ is the input vector, and $h_t$ is the output at time $t$. The BiLSTM is a bidirectional long short-term memory network composed of a forward LSTM and a backward LSTM, whose outputs are computed separately and then combined: at a certain moment $i$, the hidden state $\overrightarrow{h_i}$ output by the forward LSTM is concatenated with the hidden state $\overleftarrow{h_i}$ output by the backward LSTM, yielding the complete hidden state sequence $(t_1,t_2,\ldots,t_n)$. In this way the neural network captures bidirectional semantic information well and learns contextual relations, effectively improving the effect of named entity recognition.
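A single LSTM time step following the gate equations above can be sketched with scalar weights for readability; real layers use weight matrices, and the parameter values below are illustrative:

```python
import math

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: forget gate f, input gate i, output gate o,
    candidate state c_tilde, then the new cell state c and output h."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    W, U, b = params  # dicts keyed by gate name: "f", "i", "o", "c"
    f = sigmoid(W["f"] * h_prev + U["f"] * x_t + b["f"])     # forget gate
    i = sigmoid(W["i"] * h_prev + U["i"] * x_t + b["i"])     # input gate
    o = sigmoid(W["o"] * h_prev + U["o"] * x_t + b["o"])     # output gate
    c_tilde = math.tanh(W["c"] * h_prev + U["c"] * x_t + b["c"])
    c = f * c_prev + i * c_tilde                             # new cell state
    h = o * math.tanh(c)                                     # new output
    return h, c

params = ({g: 0.5 for g in "fioc"}, {g: 0.5 for g in "fioc"}, {g: 0.0 for g in "fioc"})
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, params=params)
```

A BiLSTM runs this recurrence once left-to-right and once right-to-left and concatenates the two hidden states at each position.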
The CRF reasoning layer. The CRF algorithm constrains the output of the BiLSTM layer by taking the relationship between adjacent labels into account, guaranteeing the plausibility of the predicted labels.
Taking the output of the BiLSTM as the input sequence $X=(x_1,x_2,\ldots,x_n)$ of the CRF layer, the score of the corresponding tag sequence $Y=(y_1,y_2,\ldots,y_n)$ is:

$$s(X,Y)=\sum_{i=0}^{n}A_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i}$$

where $n$ is the sequence length, $k$ is the number of tags, $A$ is the transition score matrix, and $A_{y_i,y_{i+1}}$ denotes the score of transitioning from label $y_i$ to label $y_{i+1}$. The score is finally normalized with the Softmax function to obtain the maximum probability of the tag sequence $Y$:

$$P(Y|X)=\frac{e^{s(X,Y)}}{\sum_{\tilde{Y}\in Y_X}e^{s(X,\tilde{Y})}}$$

where $\tilde{Y}$ denotes a candidate tag sequence and $Y_X$ the set of all possible tag sequences. During training, the maximum likelihood probability of the correct tag sequence is computed according to:

$$\log P(Y|X)=s(X,Y)-\log\sum_{\tilde{Y}\in Y_X}e^{s(X,\tilde{Y})}$$

Finally, the Viterbi algorithm obtains the sequence with the highest predicted total score over all sequences, which serves as the labeling result of automobile-field named entity recognition:

$$Y^{*}=\underset{\tilde{Y}\in Y_X}{\arg\max}\; s(X,\tilde{Y})$$
in the embodiment of the invention, the word vector output by the BERT model layer is used as the input of the BilSTM, the BilSTM model layer gives a prediction score of a label to each input data by learning the input forward and backward information, and the vector P (P) is output1,P2,…,Pn) Represents sentence X (X)1,x2,…,xn) X ofiCorresponding to the Tag (Tag) defined by the BIO labeling system1,tag2,…,tagn) And j represents the dimension of the mark, corresponding to the output matrix P of BilSTM, where PiJ denotes the sentence X (X)1,x2,…,xn) X ofiMapping to tagjIs measured.
And S3, carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
And carrying out named entity recognition on the automobile field by using the trained BERT-BilSTM-CRF model. Certainly, in order to ensure the recognition accuracy of the BERT-BiLSTM-CRF model, before the model is used for recognizing the named entity in the automobile field, the performance of the model may be evaluated by using a test set and a verification set, specifically, the test set and the verification set are input into the complete named entity recognition model obtained after training for testing, and the entity result in the automobile field is evaluated by using Precision (P), Recall (Recall, R) and F1 (F1-score, F1) as evaluation indexes of the model performance, and the specific formula is as follows:
wherein TP represents the number of correctly identified named entities; FP represents the number of incorrectly identified named entities; and FN represents the number of named entities that were not identified. If the performance of the BERT-BiLSTM-CRF model does not meet expectations, a BERT-BiLSTM-CRF model that meets the expected requirements can be obtained by adjusting the model parameters.
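The three evaluation indexes can be computed directly from the entity-level counts. A small sketch (the counts in the usage example are invented, not results from the patent):

```python
def ner_metrics(tp, fp, fn):
    """Precision, recall and F1 from entity-level counts:
    P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 80 correct entities with 20 false positives and 20 misses give P = R = F1 = 0.8; the zero-guards keep the metrics defined when a model predicts no entities at all.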
To verify the validity of the embodiments of the present invention, the processed data set is divided into a training set, a test set and a verification set at a ratio of 6:2:2, and the model is built with TensorFlow. In the verification process of the embodiment of the invention, the experimental effect of the proposed BERT-BiLSTM-CRF model on the recognition of named entities with different labels is shown in Table 1.
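The 6:2:2 division can be sketched as follows (the seed and the tuple-based ratio parameter are illustrative assumptions, not values specified by the patent):

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle and split samples into train/test/verification sets at 6:2:2."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)     # deterministic shuffle for reproducibility
    n = len(samples)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_test],
            samples[n_train + n_test:])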
TABLE 1 Recognition effect of the BERT-BiLSTM-CRF model on different labels
In addition, three deep neural network models related to the proposed BERT-BiLSTM-CRF model are selected and tested on the same data set, and the recognition results obtained are compared with those of the proposed model; the experimental effects of the four models on the data set are shown in Table 2.
TABLE 2 Comparison of the different models on the data set
As can be seen from the experimental results in Tables 1 and 2, the BERT-BiLSTM-CRF model provided by the invention achieves a good effect on the recognition of named entities with different labels in the professional field, and a better recognition effect than other existing named entity recognition models. Therefore, with the named entity recognition method provided by the invention, the entity recognition effect remains good and the precision remains high even when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and the performance and recognition accuracy of the entity recognition model are effectively improved.
Thus, the whole process of the named entity identification method is completed.
Example 2:
In a second aspect, the present invention provides a named entity recognition apparatus, comprising:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and the named entity recognition module is used for recognizing the named entity by utilizing the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing the data set by the data acquiring module includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
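The sequence-labeling step above can be illustrated with the BIO scheme. This sketch is not the patent's code; the character-level tokens, the `CAR` entity type and the (start, end) span format are hypothetical choices for demonstration:

```python
def bio_tags(tokens, entities):
    """Produce BIO tags for a token sequence.

    `entities` maps half-open (start, end) token spans to entity types,
    e.g. {(0, 4): "CAR"} marks tokens 0-3 as one CAR entity (hypothetical schema).
    """
    tags = ["O"] * len(tokens)               # default: outside any entity
    for (start, end), etype in entities.items():
        tags[start] = f"B-{etype}"           # B- marks the entity's first token
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"           # I- marks the continuation tokens
    return tags
```

For a character-tokenized comment mentioning a car model, the first character of the entity span gets a B- tag and the rest get I- tags, while all other characters stay O.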
Optionally, the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
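The data flow through the three layers can be illustrated at the level of tensor shapes. The sketch below is only a stand-in: the matrix products are not a real LSTM, and the dimensions 768 / 128 / 9 are assumptions chosen for illustration. It shows how an n x 768 BERT output becomes n x 2h bidirectional features, which are projected to n x k emission scores for the CRF layer:

```python
import numpy as np

n, d_bert, h, k = 7, 768, 128, 9          # tokens, BERT dim, hidden size, tag count
rng = np.random.default_rng(0)

char_vectors = rng.normal(size=(n, d_bert))        # BERT model layer output
# Stand-in for the BiLSTM: one "forward" and one "backward" projection,
# concatenated the way forward/backward hidden states are in a real BiLSTM.
w_fwd = rng.normal(size=(d_bert, h))
w_bwd = rng.normal(size=(d_bert, h))
features = np.concatenate([np.tanh(char_vectors @ w_fwd),
                           np.tanh(char_vectors @ w_bwd)], axis=1)   # (n, 2h)
w_out = rng.normal(size=(2 * h, k))
emissions = features @ w_out               # (n, k): P[i, j] scores fed to the CRF
```

In a real model the two projections would be replaced by forward and backward LSTM passes over the sequence, but the shapes at each boundary between layers are the same.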
Optionally, the apparatus further comprises: a model performance evaluation module for inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
It can be understood that the explanations, examples and beneficial effects of the named entity recognition apparatus provided in the embodiment of the present invention correspond to the above named entity recognition method; for its explanations, examples and beneficial effects, reference may be made to the corresponding contents of the named entity recognition method, which are not repeated here.
Example 3:
In a third aspect, the present invention provides a computer-readable storage medium storing a computer program for named entity recognition, wherein the computer program causes a computer to perform the following steps:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing a data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Optionally, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vector.
Optionally, the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
Example 4:
In a fourth aspect, the present invention provides an electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the steps of:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing a data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Optionally, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vector.
Optionally, the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In summary, compared with the prior art, the method has the following beneficial effects:
1. The named entity recognition method, apparatus, storage medium and electronic device provided by the embodiments of the invention preprocess the acquired automobile comment data and construct a data set, then construct a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer, train the BERT-BiLSTM-CRF model with the data set, and finally perform named entity recognition with the trained BERT-BiLSTM-CRF model. The named entity recognition model constructed on the basis of the BERT model well solves the problems of difficult, low-precision entity recognition when labeled data in the automobile field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model;
2. The BERT model in the named entity recognition model can learn high-quality word embedding representations from the corpus through pre-training and fine-tuning, enhance the semantic representation of words, and dynamically generate word vectors according to context features, improving the performance and recognition accuracy of the entity recognition model;
3. The embodiment of the invention integrates external knowledge of products in the professional field (for example, the standard names and alternative names of vehicle types and vehicle brands) into the data processing, so that alternative names mentioned in the original data (user comment data) can be linked to their standard names during processing, which obviously improves the recognition effect on entities in the professional field;
4. The embodiment of the invention uses the trained model to identify entities automatically without building an entity dictionary for matching, which effectively alleviates the problems of insufficient training data and poor recognition performance in professional-field named entity recognition tasks.
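The linking of alternative names to standard names described in point 3 can be sketched as a simple lookup pass over tokenized comment data. The alias table below is entirely hypothetical (the patent does not publish its alias list):

```python
# Hypothetical alias table linking alternative product names to standard names.
ALIASES = {
    "beemer": "BMW",
    "bimmer": "BMW",
    "vw": "Volkswagen",
}

def normalize_mentions(tokens):
    """Replace alternative product names in user-comment tokens with
    their standard names; tokens without an alias entry pass through."""
    return [ALIASES.get(t.lower(), t) for t in tokens]
```

Normalizing mentions before sequence labeling means the model sees one canonical surface form per product, so sparse training examples for each alias are pooled onto the standard name.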
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A named entity recognition method, comprising:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
2. The method of claim 1, wherein the obtaining of domain of expertise raw data and constructing a data set comprises:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
3. The method of claim 1, wherein the BERT pre-training model layer is configured to encode each character to obtain a word vector for the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vector.
4. The method of claim 2, wherein the method further comprises:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
5. An apparatus for named entity recognition, the apparatus comprising:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF reasoning layer;
and the named entity recognition module is used for recognizing the named entity by utilizing the trained BERT-BiLSTM-CRF model.
6. The apparatus of claim 5, wherein the data acquisition module to acquire domain of expertise raw data and construct a data set comprises:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
7. The apparatus of claim 5, wherein the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector corresponding to the character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF reasoning layer is used for outputting a label sequence with the maximum probability based on the new feature vector.
8. The apparatus of claim 6, wherein the apparatus further comprises: a model performance evaluation module for inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
9. A computer-readable storage medium, characterized in that it stores a computer program for named entity recognition, wherein the computer program causes a computer to perform the named entity recognition method according to any one of claims 1-4.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the named entity identification method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011636806.1A CN112749562A (en) | 2020-12-31 | 2020-12-31 | Named entity identification method, device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011636806.1A CN112749562A (en) | 2020-12-31 | 2020-12-31 | Named entity identification method, device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112749562A true CN112749562A (en) | 2021-05-04 |
Family
ID=75651091
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011636806.1A Pending CN112749562A (en) | 2020-12-31 | 2020-12-31 | Named entity identification method, device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749562A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113486153A (en) * | 2021-07-20 | 2021-10-08 | 上海明略人工智能(集团)有限公司 | Domain knowledge extraction method, system, electronic device and medium |
CN113723104A (en) * | 2021-09-15 | 2021-11-30 | 云知声智能科技股份有限公司 | Method and device for entity extraction under noisy data |
CN113779992A (en) * | 2021-07-19 | 2021-12-10 | 西安理工大学 | Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training |
CN114169966A (en) * | 2021-12-08 | 2022-03-11 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
CN114648029A (en) * | 2022-03-31 | 2022-06-21 | 河海大学 | Electric power field named entity identification method based on BiLSTM-CRF model |
CN115759097A (en) * | 2022-11-08 | 2023-03-07 | 广东数鼎科技有限公司 | Vehicle type name recognition method |
CN116501884A (en) * | 2023-03-31 | 2023-07-28 | 重庆大学 | Medical entity identification method based on BERT-BiLSTM-CRF |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019071661A1 (en) * | 2017-10-09 | 2019-04-18 | 平安科技(深圳)有限公司 | Electronic apparatus, medical text entity name identification method, system, and storage medium |
CN111967266A (en) * | 2020-09-09 | 2020-11-20 | 中国人民解放军国防科技大学 | Chinese named entity recognition model and construction method and application thereof |
CN112001177A (en) * | 2020-08-24 | 2020-11-27 | 浪潮云信息技术股份公司 | Electronic medical record named entity identification method and system integrating deep learning and rules |
2020
- 2020-12-31 CN CN202011636806.1A patent/CN112749562A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019071661A1 (en) * | 2017-10-09 | 2019-04-18 | 平安科技(深圳)有限公司 | Electronic apparatus, medical text entity name identification method, system, and storage medium |
CN112001177A (en) * | 2020-08-24 | 2020-11-27 | 浪潮云信息技术股份公司 | Electronic medical record named entity identification method and system integrating deep learning and rules |
CN111967266A (en) * | 2020-09-09 | 2020-11-20 | 中国人民解放军国防科技大学 | Chinese named entity recognition model and construction method and application thereof |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779992A (en) * | 2021-07-19 | 2021-12-10 | 西安理工大学 | Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training |
CN113486153A (en) * | 2021-07-20 | 2021-10-08 | 上海明略人工智能(集团)有限公司 | Domain knowledge extraction method, system, electronic device and medium |
CN113723104A (en) * | 2021-09-15 | 2021-11-30 | 云知声智能科技股份有限公司 | Method and device for entity extraction under noisy data |
CN114169966A (en) * | 2021-12-08 | 2022-03-11 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
CN114169966B (en) * | 2021-12-08 | 2022-08-05 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
CN114648029A (en) * | 2022-03-31 | 2022-06-21 | 河海大学 | Electric power field named entity identification method based on BiLSTM-CRF model |
CN115759097A (en) * | 2022-11-08 | 2023-03-07 | 广东数鼎科技有限公司 | Vehicle type name recognition method |
CN116501884A (en) * | 2023-03-31 | 2023-07-28 | 重庆大学 | Medical entity identification method based on BERT-BiLSTM-CRF |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112749562A (en) | Named entity identification method, device, storage medium and electronic equipment | |
CN111079985B (en) | Criminal case criminal period prediction method based on BERT and fused with distinguishable attribute features | |
CN111783394B (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN111159485B (en) | Tail entity linking method, device, server and storage medium | |
CN106569998A (en) | Text named entity recognition method based on Bi-LSTM, CNN and CRF | |
CN110298043B (en) | Vehicle named entity identification method and system | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN113626589B (en) | Multi-label text classification method based on mixed attention mechanism | |
CN113297360B (en) | Law question-answering method and device based on weak supervised learning and joint learning mechanism | |
CN113743119B (en) | Chinese named entity recognition module, method and device and electronic equipment | |
CN113822026A (en) | Multi-label entity labeling method | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN113987167A (en) | Dependency perception graph convolutional network-based aspect-level emotion classification method and system | |
CN113742733A (en) | Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN115599899A (en) | Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph | |
CN116029305A (en) | Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning | |
CN112989830B (en) | Named entity identification method based on multiple features and machine learning | |
CN114356990A (en) | Base named entity recognition system and method based on transfer learning | |
CN117094325B (en) | Named entity identification method in rice pest field | |
CN113641809A (en) | XLNET-BiGRU-CRF-based intelligent question answering method | |
CN117056451A (en) | New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement | |
CN114372454A (en) | Text information extraction method, model training method, device and storage medium | |
CN116127954A (en) | Dictionary-based new work specialized Chinese knowledge concept extraction method | |
CN114911940A (en) | Text emotion recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |