CN112749562A - Named entity identification method, device, storage medium and electronic equipment - Google Patents

Named entity identification method, device, storage medium and electronic equipment

Info

Publication number
CN112749562A
Authority
CN
China
Prior art keywords
bilstm
bert
model
data
crf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011636806.1A
Other languages
Chinese (zh)
Inventor
张强
丁贾明
方钊
王安宁
杨善林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202011636806.1A priority Critical patent/CN112749562A/en
Publication of CN112749562A publication Critical patent/CN112749562A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Abstract

The invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, and relates to the technical field of natural language processing. Acquired raw data of a professional field is preprocessed and used to construct a data set; a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer is then constructed and trained with the data set; finally, named entity recognition is performed with the trained BERT-BiLSTM-CRF model. With this technical scheme, the named entity recognition model constructed on the basis of the BERT model effectively addresses the difficulty and low precision of entity recognition when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model.

Description

Named entity identification method, device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of natural language processing, in particular to a named entity identification method, a named entity identification device, a storage medium and electronic equipment.
Background
The rapid development of the internet has caused text data to grow rapidly, and this explosively growing text data contains a great deal of valuable information, so how to extract useful information from massive text data has become a current research focus. The task of information extraction is to automatically or semi-automatically extract useful information from unstructured text data and convert it into structured or semi-structured data. As one of the subtasks of information extraction, named entity recognition technology has seen great improvement and development in both industry and academia.
However, in professional fields (such as the automobile field), because there is no established data set, labeled data is insufficient, entity boundaries are fuzzy, and related research literature is scarce, existing machine learning and deep learning models cannot achieve good results, and research results are especially lacking for the task of named entity recognition in professional fields.
Therefore, the existing named entity recognition technology suffers from difficult recognition and low recognition accuracy when labeled data in a professional field is insufficient and entity boundaries are fuzzy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, and solves the problems in the prior art that named entity recognition is difficult and recognition precision is low when labeled data in a professional field is insufficient and entity boundaries are fuzzy.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
In a first aspect, the present invention provides a named entity identification method, where the method includes:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Preferably, the acquiring raw data of the professional field and constructing the data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Preferably, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Preferably, the method further comprises:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In a second aspect, the present invention provides a named entity recognition apparatus, including:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and the named entity recognition module is used for recognizing the named entity by utilizing the trained BERT-BiLSTM-CRF model.
Preferably, the acquiring raw data of the professional field and constructing the data set by the data acquiring module includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Preferably, the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Preferably, the apparatus further comprises: a model performance evaluation module, which is used for inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In a third aspect, the present invention proposes a computer-readable storage medium storing a computer program for named entity recognition, wherein the computer program causes a computer to perform the named entity recognition method as described above.
In a fourth aspect, the present invention provides an electronic device, including:
one or more processors;
a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the named entity identification method as described above.
(III) advantageous effects
The invention provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment. Compared with the prior art, the method has the following beneficial effects:
the invention discloses a named entity recognition method, a device, a storage medium and electronic equipment, which preprocess acquired raw data in the professional field and construct a data set, then construct a BERT-BilSTM-CRF model comprising a BERT pre-training model layer, a BilSTM network layer and a CRF reasoning layer, train the BERT-BilSTM-CRF model by using the data set, and finally recognize the named entity by using the trained BERT-BilSTM-CRF model. According to the technical scheme, the named entity recognition model constructed based on the BERT model well solves the problems of difficult entity recognition and low precision when the labeling data in the professional field is insufficient and the entity boundary is fuzzy, and improves the performance and the recognition accuracy of the entity recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a named entity recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram of an overall framework of a named entity recognition model in an embodiment of the present invention;
FIG. 3 is a BERT pre-training model framework in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a named entity identification method, a named entity identification device, a storage medium and electronic equipment, solves the problems of difficulty in named entity identification and low identification precision when labeling data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and identification accuracy of an entity identification model.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
aiming at the problems of insufficient labeling data in the professional field and difficult and low identification precision of entities when the entity boundaries are fuzzy, the embodiment of the invention preprocesses the acquired original data in the professional field and constructs a data set, then utilizes a BERT model to enhance the semantic representation of the words, and can dynamically generate word vectors according to the characteristics of context characteristics to construct a BERT-BilSTM-CRF model comprising a BERT pre-training model layer, a BilSTM network layer and a CRF reasoning layer, and trains the BERT-BilSTM-CRF model by using the data set, and finally utilizes the trained BERT-BilSTM-CRF model to identify the named entities, thereby improving the performance and the identification accuracy of the entity identification model to a great extent.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
In a first aspect, the present invention provides a named entity identification method, where the method includes:
S1, acquiring raw data of the professional field and constructing a data set;
S2, constructing a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and S3, carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
It can be seen that the named entity recognition method, apparatus, storage medium and electronic device according to the embodiments of the present invention preprocess acquired raw data of a professional field and construct a data set, construct a BERT-BiLSTM-CRF model including a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, train the BERT-BiLSTM-CRF model using the data set, and finally use the trained BERT-BiLSTM-CRF model to identify named entities. The named entity recognition model constructed on the basis of the BERT model effectively addresses the difficulty and low precision of entity recognition when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model.
In the above method of the embodiment of the present invention, in order to obtain more, higher quality, and effective data, a preferred processing method is that when acquiring raw data in a professional field and constructing a data set, the method includes:
acquiring raw data of the professional field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
In addition, in the method of the embodiment of the present invention, in order to solve the problems that entity recognition is difficult and recognition accuracy is low when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and to improve the performance and recognition accuracy of the entity recognition model, a preferred processing mode is that, in the constructed BERT-BiLSTM-CRF model, the BERT pre-training model layer is used to encode each character to obtain the word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding the sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
In practice, since the recognition performance of the entity recognition model needs to be evaluated and adjusted in advance according to the actual application situation, a preferred processing manner is that the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
The professional fields include the automobile field, the engine field, the e-commerce field and the like. The specific implementation process of one embodiment of the invention is described in detail below, taking the automobile field as an example and explaining the specific steps.
Referring to fig. 1, a named entity recognition method of the present invention specifically includes the following steps:
and S1, acquiring the raw data of the professional field and constructing a data set.
First, consumer comment data is crawled from social media platforms such as Autohome and Yiche and then preprocessed. Specifically, the method comprises the following steps:
Data crawling. Based on a lightweight Python crawler framework, webpage data are extracted and parsed through XPath and CSS expressions, a Redis database is used as a distributed shared crawl queue, a MongoDB database is used as the data store, the Selenium automated testing tool is integrated, and middleware such as random User-Agent rotation, proxy IPs and a self-built proxy IP pool is used; the crawler is deployed on a cloud server, so that large-scale, real-time, incremental crawling of product comment data from multiple social media platforms is realized.
Data preprocessing. After the original corpus is crawled, and before it is fed into the model, the data is preprocessed through data cleaning, data standardization, text word segmentation, sequence labeling and data set construction, so as to obtain higher-quality, more effective data.
Data cleaning. Meaningless comments (mainly comments that contribute nothing to model training and the task, such as spam comments and duplicate comments) are cleaned out. This mainly comprises: removing spam comments, i.e. comments that conflict with core social values or contain insulting or malicious vocabulary, filtered by summarizing keywords such as 'brain residue', 'rotten goods', 'http' and 'Yuan'; in addition, comments that are too long or too short may also be spam, so the comment length is limited to 50-200 characters and comments whose length falls outside this range are removed directly. Text deduplication: while observing the comment corpus it was found that some comments are highly similar and some are even repeated, so the Simhash method is used to deduplicate the text.
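For illustration, a minimal sketch of this cleaning step is given below; the spam keyword list, the length bounds and the lightweight Simhash implementation are assumptions for demonstration rather than the exact configuration used in the embodiment.

```python
import hashlib

SPAM_KEYWORDS = ["http", "yuan"]  # hypothetical keyword list; the embodiment summarizes its own keywords

def simhash(text, n=2, bits=64):
    """Tiny Simhash: hash character n-grams and combine their bits by majority vote."""
    v = [0] * bits
    for i in range(max(len(text) - n + 1, 1)):
        h = int(hashlib.md5(text[i:i + n].encode("utf-8")).hexdigest(), 16)
        for b in range(bits):
            v[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if v[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def clean(comments, min_len=50, max_len=200, dup_threshold=3):
    kept, fingerprints = [], []
    for c in comments:
        c = c.strip()
        if not (min_len <= len(c) <= max_len):        # drop too-short / too-long comments
            continue
        if any(k in c for k in SPAM_KEYWORDS):        # drop comments containing spam keywords
            continue
        fp = simhash(c)
        if any(hamming(fp, f) <= dup_threshold for f in fingerprints):  # near-duplicate comment
            continue
        fingerprints.append(fp)
        kept.append(c)
    return kept
```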
Data standardization. The cleaned data is processed further, which mainly comprises: text correction, since some erroneous characters inevitably exist in the comment corpus, an intelligent text error correction interface is used to correct the text; stop-word removal, where meaningless symbols and special symbols such as emoticons are removed from the comment data by regular expression matching; and traditional-to-simplified conversion, where traditional Chinese characters in the data are converted into simplified characters to make word vector training more convenient.
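A hedged sketch of this standardization step follows; the regular expression and the use of the OpenCC library for traditional-to-simplified conversion are assumptions for illustration, and the intelligent text error correction interface mentioned above is not reproduced here.

```python
import re
from opencc import OpenCC  # assumed dependency for traditional-to-simplified Chinese conversion

cc = OpenCC("t2s")  # "t2s": Traditional Chinese -> Simplified Chinese

# Keep CJK characters, Latin letters, digits and common punctuation; drop emoticons and other symbols.
KEEP = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、；：,.!?]")

def normalize(text: str) -> str:
    text = KEEP.sub("", text)   # remove meaningless and special symbols by regular expression matching
    return cc.convert(text)     # convert traditional characters into simplified characters
```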
Text word segmentation. Using jieba word segmentation, an automobile-domain entity dictionary with clear word boundaries is constructed, and the following automobile-domain named entity categories are defined: brand name, model name, structure name and attribute name. For example: brand names such as Haval and BMW; model names such as 650EV and RAV4; structure names such as steering wheel and engine; attribute names such as power, fuel consumption and displacement.
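As an illustration of this step, the sketch below loads a user dictionary into jieba and segments a comment; the dictionary file name and its entries are hypothetical.

```python
import jieba

# Hypothetical automobile-domain dictionary, one entry per line, e.g.:
#   哈弗 10 nz
#   发动机 10 n
#   油耗 10 n
jieba.load_userdict("car_domain_dict.txt")

tokens = jieba.lcut("发动机动力很强，油耗也不高")
print(tokens)  # segmentation now respects the domain entity boundaries defined in the dictionary
```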
Sequence labeling. Sequence labeling simply means that, given a string of characters, the elements present in the sequence are marked with relevant tags, and the sequence is then analyzed in depth through these tags. The data is labeled manually with the BIO labeling scheme, where 'B' denotes the beginning character of an entity, 'I' denotes a non-initial character of an entity, and 'O' denotes a non-entity. For example: B-BRA denotes the beginning character of a brand name; I-BRA denotes a middle character of a brand name; B-MOD denotes the beginning character of a model name; I-MOD denotes a middle character of a model name; B-STR denotes the beginning character of a structure name; I-STR denotes a middle character of a structure name; B-ATT denotes the beginning character of an attribute name; I-ATT denotes a middle character of an attribute name; O denotes a non-named entity.
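The following small example shows what a character-level BIO annotation might look like under the tag set defined above; the sentence and its labels are hypothetical.

```python
# Hypothetical comment "哈弗发动机动力强" with brand "哈弗", structure "发动机" and attribute "动力":
labeled = [
    ("哈", "B-BRA"), ("弗", "I-BRA"),
    ("发", "B-STR"), ("动", "I-STR"), ("机", "I-STR"),
    ("动", "B-ATT"), ("力", "I-ATT"),
    ("强", "O"),
]
```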
Data set construction. After manual labeling, a data set with BIO labels is generated automatically using Python scripts, and the data set is then divided into a training set, a test set and a verification set in a 6:2:2 ratio.
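A minimal sketch of the 6:2:2 split might look as follows; the function name and the fixed random seed are illustrative choices.

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle the labeled sentences and split them into training, test and verification sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * ratios[0])
    n_test = int(len(samples) * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_test],
            samples[n_train + n_test:])
```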
S2, constructing a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer.
A BERT-BiLSTM-CRF model for Chinese named entity recognition is constructed, see FIG. 2, which comprises: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer. The text data of the training set is input into the BERT pre-training model layer, where each character is encoded to obtain the word vector representation of the corresponding character; the BiLSTM layer then bidirectionally encodes the word vector sequence, constructing a new feature vector for each character; finally, the CRF inference layer outputs the label sequence with the maximum probability, which serves as the final predicted labels of the model. The three layers are described in detail below.
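Before going layer by layer, one possible end-to-end assembly of the model is sketched here. The embodiment reports building its model with TensorFlow; the PyTorch-style code below, the 'bert-base-chinese' checkpoint, the hidden size, and the use of the HuggingFace transformers and pytorch-crf libraries are all assumptions made purely for illustration.

```python
import torch.nn as nn
from transformers import BertModel   # assumed: HuggingFace transformers
from torchcrf import CRF             # assumed: pytorch-crf package

class BertBiLstmCrf(nn.Module):
    """Sketch of a BERT + BiLSTM + CRF sequence tagger (dimensions and checkpoint are illustrative)."""

    def __init__(self, num_tags, lstm_hidden=128, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)           # character-level encoder
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)  # bidirectional context features
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)      # per-character tag scores P
        self.crf = CRF(num_tags, batch_first=True)                 # transition matrix A + Viterbi decoding

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)
        e = self.emissions(x)
        mask = attention_mask.bool()
        if tags is not None:                  # training: negative log-likelihood of the gold tag sequence
            return -self.crf(e, tags, mask=mask)
        return self.crf.decode(e, mask=mask)  # inference: most probable tag sequence per sentence
```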
BERT pre-training model layer. The BERT model uses multiple Transformer bidirectional encoders to encode characters, so that the prediction of each character can refer to character information in both the forward and backward directions; each unit consists mainly of a feed-forward neural network and a self-attention mechanism. See FIG. 3, where E1, E2, ..., EN are the input vectors of the model and T1, T2, ..., TN are the output vectors of the model. The Transformer performs attention calculation between each word in the input sentence and all words in the sentence to obtain the relationships between words and capture the internal structure of the sentence, weighting words according to the attention calculation so that important words obtain higher weights. The attention calculation is defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
q, K and V are respectively Query, Key and Value matrixes, and three weight matrixes are needed to be used in a calculation formula and are respectively set as WQ,WK,WVWherein W isQ,WKIs set to k x dk,WVIs set to k x dvAnd if the matrix needs to be obtained through model training, the following steps are provided:
Q=AWQ;K=AWK;V=AWV
where a is a matrix of n × K, each row corresponds to a vector representation of a word in the input sentence, and each row of Q, K, V corresponds to a Query, Key, Value vector representation of each word in the input sentence, respectively.
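The scaled dot-product attention defined above can be sketched in a few lines of NumPy; the matrix sizes below are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise word-to-word relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

# Illustrative projection of an input sentence matrix A (n x k) with trained W_Q, W_K, W_V
n, k, d_k, d_v = 6, 32, 16, 16
A = np.random.randn(n, k)
W_Q, W_K, W_V = np.random.randn(k, d_k), np.random.randn(k, d_k), np.random.randn(k, d_v)
out = scaled_dot_product_attention(A @ W_Q, A @ W_K, A @ W_V)   # shape (n, d_v)
```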
BiLSTM network layer. BiLSTM is used to address the gradient explosion and gradient vanishing problems of the recurrent neural network (RNN). In addition, the long-sequence forgetting problem is alleviated through three computation gates, namely the forget gate f, the input gate i and the output gate o. The specific calculation formulas are as follows:
$$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$$
$$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$$
$$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$$
$$\tilde{c}_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where W, U and b denote the weight matrices and bias vectors connecting two layers within each gate, $\sigma$ is the sigmoid activation function, $\odot$ is the element-wise (dot) product, $\tilde{c}_t$ denotes the candidate cell state at time t, $x_t$ is the input vector, and $h_t$ is the output at time t. BiLSTM is a bidirectional long short-term memory network composed of a forward LSTM and a backward LSTM, which compute separately and are then combined for output: at each time step i, the hidden state sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$ output by the backward LSTM are spliced together, giving the complete hidden state sequence $(t_1, t_2, \ldots, t_n)$. In this way the neural network captures bidirectional semantic information and learns contextual relationships, effectively improving the effect of named entity recognition.
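A NumPy sketch of one LSTM step following the gate equations above, together with the forward/backward concatenation, is given below; the parameter dictionary layout and dimensions are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with forget, input and output gates and candidate state, as in the equations above."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def bilstm(xs, p_fwd, p_bwd, hidden):
    """Run a forward and a backward LSTM over the sequence and concatenate their hidden states."""
    def run(seq, p):
        h, c, outs = np.zeros(hidden), np.zeros(hidden), []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            outs.append(h)
        return outs
    fwd = run(xs, p_fwd)
    bwd = run(xs[::-1], p_bwd)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```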
CRF inference layer. The CRF algorithm constrains the output of the BiLSTM layer by considering the relationships between adjacent labels, which guarantees the rationality of the predicted labels.
Taking the output of the BiLSTM as the input sequence $X = (x_1, x_2, \ldots, x_n)$ of the CRF layer, the score of the corresponding tag sequence $y = (y_1, y_2, \ldots, y_n)$ is:
$$S(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where n is the sequence length, k is the number of tags, and A is the transition score matrix; $A_{y_i, y_{i+1}}$ denotes the transition score from tag $y_i$ to tag $y_{i+1}$. The score is then normalized by a Softmax function to obtain the probability of tag sequence y:
$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}}$$
where $\tilde{y}$ denotes a candidate tag sequence and $Y_X$ denotes the set of all possible tag sequences. During training, the log-likelihood of the correct tag sequence is maximized according to the following formula:
$$\log\big(p(y \mid X)\big) = S(X, y) - \log\Big(\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}\Big)$$
Finally, the sequence with the highest predicted total score over all sequences is obtained through the Viterbi algorithm and taken as the labeling result of automobile-domain named entity recognition:
$$y^{*} = \arg\max_{\tilde{y} \in Y_X} S(X, \tilde{y})$$
in the embodiment of the invention, the word vector output by the BERT model layer is used as the input of the BilSTM, the BilSTM model layer gives a prediction score of a label to each input data by learning the input forward and backward information, and the vector P (P) is output1,P2,…,Pn) Represents sentence X (X)1,x2,…,xn) X ofiCorresponding to the Tag (Tag) defined by the BIO labeling system1,tag2,…,tagn) And j represents the dimension of the mark, corresponding to the output matrix P of BilSTM, where PiJ denotes the sentence X (X)1,x2,…,xn) X ofiMapping to tagjIs measured.
And S3, carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Named entity recognition in the automobile field is carried out by using the trained BERT-BiLSTM-CRF model. Of course, to ensure the recognition accuracy of the BERT-BiLSTM-CRF model, before the model is used to recognize named entities in the automobile field, the performance of the model may be evaluated with the test set and the verification set. Specifically, the test set and verification set are input into the complete named entity recognition model obtained after training for testing, and the automobile-field entity results are evaluated using precision (P), recall (R) and F1-score (F1) as the evaluation indexes of model performance. The specific formulas are as follows:
$$P = \frac{T_P}{T_P + F_P} \times 100\%$$
$$R = \frac{T_P}{T_P + F_N} \times 100\%$$
$$F1 = \frac{2 \times P \times R}{P + R} \times 100\%$$
where $T_P$ denotes the number of correctly identified named entities, $F_P$ denotes the number of incorrectly identified named entities, and $F_N$ denotes the number of named entities that were not identified. If the performance of the BERT-BiLSTM-CRF model does not meet expectations, the model parameters can be adjusted to obtain a BERT-BiLSTM-CRF model that meets the expected requirements.
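A small sketch of computing these evaluation indexes from entity-level counts is shown below; the counts in the usage example are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Entity-level precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 180 correctly recognized entities, 20 spurious predictions, 30 missed entities (illustrative numbers)
print(precision_recall_f1(180, 20, 30))   # -> (0.9, 0.857..., 0.878...)
```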
To verify the effectiveness of the embodiment of the present invention, the processed data set was divided into a training set, a test set and a verification set in a 6:2:2 ratio, and the model was built with TensorFlow. In the verification process of the embodiment of the invention, the experimental effect of the proposed BERT-BiLSTM-CRF model on the recognition of named entities with different labels is shown in Table 1.
TABLE 1 Recognition effect of the BERT-BiLSTM-CRF model on different labels
In addition, three deep neural network models related to the proposed BERT-BiLSTM-CRF model were selected and experiments were carried out on the same data set; the recognition results obtained were compared with those of the proposed model, and the experimental effects of the four models on the data set are shown in Table 2.
TABLE 2 comparison of data sets on different models
As can be seen from the experimental results in Tables 1 and 2, the proposed BERT-BiLSTM-CRF model achieves good results in recognizing named entities with different labels in the professional field, and its recognition effect is better than that of other existing named entity recognition models. Therefore, with the named entity recognition method provided by the invention, the entity recognition effect remains good and the precision remains high even when labeled data in the professional field is insufficient and entity boundaries are fuzzy, and the performance and recognition accuracy of the entity recognition model are well improved.
Thus, the whole process of the named entity identification method is completed.
Example 2:
in a second aspect, the present invention provides a named entity recognition apparatus, comprising:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and the named entity recognition module is used for recognizing the named entity by utilizing the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing the data set by the data acquiring module includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Optionally, the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Optionally, the apparatus further comprises: a model performance evaluation module, which is used for inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
It can be understood that, the explanation, examples, and beneficial effects of the named entity identification apparatus provided in the embodiment of the present invention correspond to the above named entity identification method, and reference may be made to corresponding contents in a named entity identification method for explanation, examples, and beneficial effects of the named entity identification apparatus, which are not described herein again.
Example 3:
in a third aspect, the present invention provides a computer readable storage medium storing a computer program for named entity identification, wherein the computer program causes a computer to perform the steps of:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing a data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Optionally, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Optionally, the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
Example 4:
in a fourth aspect, the present invention provides an electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the steps of:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
Optionally, the acquiring raw data of the professional field and constructing a data set includes:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
Optionally, the BERT pre-training model layer is configured to encode each character to obtain a word vector of the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
Optionally, the method further includes:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
In summary, compared with the prior art, the method has the following beneficial effects:
1. The named entity recognition method, apparatus, storage medium and electronic device provided by the embodiment of the invention preprocess the acquired automobile comment data and construct a data set, then construct a BERT-BiLSTM-CRF model comprising a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer, train the BERT-BiLSTM-CRF model with the data set, and finally perform named entity recognition with the trained BERT-BiLSTM-CRF model. The named entity recognition model constructed on the basis of the BERT model effectively addresses the difficulty and low precision of entity recognition when labeled data in the automobile field is insufficient and entity boundaries are fuzzy, and improves the performance and recognition accuracy of the entity recognition model;
2. The BERT model in the named entity recognition model can learn high-quality word embedding representations from the corpus through pre-training and fine-tuning, enhance the semantic representation of characters, and dynamically generate word vectors according to contextual features, which improves the performance and recognition accuracy of the entity recognition model;
3. The embodiment of the invention integrates external knowledge of products in the professional field (for example, vehicle-model and vehicle-brand knowledge, including the standard names and alternative names of vehicle models and brands) into the data processing, so that alternative names mentioned in the raw data (user comment data) can be linked to their standard names during processing, which significantly improves the recognition of entities in the professional field;
4. the embodiment of the invention uses the trained model to automatically identify the entity without establishing entity dictionary matching, thereby effectively relieving the problems of insufficient training data and poor identification performance when the named entity identification task is carried out in the professional field.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A named entity recognition method, comprising:
acquiring raw data of the professional field and constructing a data set;
building a BERT-BiLSTM-CRF model, and training the BERT-BiLSTM-CRF model by using the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and carrying out named entity recognition by using the trained BERT-BiLSTM-CRF model.
2. The method of claim 1, wherein the obtaining of domain of expertise raw data and constructing a data set comprises:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
3. The method of claim 1, wherein the BERT pre-training model layer is configured to encode each character to obtain a word vector for the corresponding character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
4. The method of claim 2, wherein the method further comprises:
inputting the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
5. An apparatus for named entity recognition, the apparatus comprising:
the data acquisition module is used for acquiring raw data in the professional field and constructing a data set;
the model training module is used for constructing a BERT-BiLSTM-CRF model and training the BERT-BiLSTM-CRF model by utilizing the data set; the BERT-BiLSTM-CRF model comprises the following components: a BERT pre-training model layer, a BiLSTM network layer and a CRF inference layer;
and the named entity recognition module is used for recognizing the named entity by utilizing the trained BERT-BiLSTM-CRF model.
6. The apparatus of claim 5, wherein the data acquisition module to acquire domain of expertise raw data and construct a data set comprises:
acquiring professional field original data based on social media; the professional field includes the automotive field;
carrying out data cleaning, data standardization, text word segmentation and sequence labeling on the original data to obtain a data set;
and dividing the data set into a training set, a testing set and a verification set according to a certain proportion.
7. The apparatus of claim 5, wherein the BERT pre-training model layer in the model training module is configured to encode each character to obtain a word vector corresponding to the character; the BiLSTM network layer is used for bidirectionally encoding a sequence formed by the word vectors to obtain new feature vectors; and the CRF inference layer is used for outputting a label sequence with the maximum probability based on the new feature vectors.
8. The apparatus of claim 6, wherein the apparatus further comprises: a model performance evaluation module, configured to input the test set and the verification set into the trained BERT-BiLSTM-CRF model for testing so as to evaluate the performance of the BERT-BiLSTM-CRF model.
9. A computer-readable storage medium, characterized in that it stores a computer program for named entity recognition, wherein the computer program causes a computer to perform the named entity recognition method according to any one of claims 1-4.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the named entity identification method of any of claims 1-4.
CN202011636806.1A 2020-12-31 2020-12-31 Named entity identification method, device, storage medium and electronic equipment Pending CN112749562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011636806.1A CN112749562A (en) 2020-12-31 2020-12-31 Named entity identification method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011636806.1A CN112749562A (en) 2020-12-31 2020-12-31 Named entity identification method, device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112749562A (en) 2021-05-04

Family

ID=75651091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011636806.1A Pending CN112749562A (en) 2020-12-31 2020-12-31 Named entity identification method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112749562A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486153A (en) * 2021-07-20 2021-10-08 上海明略人工智能(集团)有限公司 Domain knowledge extraction method, system, electronic device and medium
CN113723104A (en) * 2021-09-15 2021-11-30 云知声智能科技股份有限公司 Method and device for entity extraction under noisy data
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN114169966A (en) * 2021-12-08 2022-03-11 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor
CN114648029A (en) * 2022-03-31 2022-06-21 河海大学 Electric power field named entity identification method based on BiLSTM-CRF model
CN115759097A (en) * 2022-11-08 2023-03-07 广东数鼎科技有限公司 Vehicle type name recognition method
CN116501884A (en) * 2023-03-31 2023-07-28 重庆大学 Medical entity identification method based on BERT-BiLSTM-CRF

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019071661A1 (en) * 2017-10-09 2019-04-18 平安科技(深圳)有限公司 Electronic apparatus, medical text entity name identification method, system, and storage medium
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN111967266A (en) * 2020-09-09 2020-11-20 中国人民解放军国防科技大学 Chinese named entity recognition model and construction method and application thereof

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training
CN113486153A (en) * 2021-07-20 2021-10-08 上海明略人工智能(集团)有限公司 Domain knowledge extraction method, system, electronic device and medium
CN113723104A (en) * 2021-09-15 2021-11-30 云知声智能科技股份有限公司 Method and device for entity extraction under noisy data
CN114169966A (en) * 2021-12-08 2022-03-11 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor
CN114169966B (en) * 2021-12-08 2022-08-05 海南港航控股有限公司 Method and system for extracting unit data of goods by tensor
CN114648029A (en) * 2022-03-31 2022-06-21 河海大学 Electric power field named entity identification method based on BiLSTM-CRF model
CN115759097A (en) * 2022-11-08 2023-03-07 广东数鼎科技有限公司 Vehicle type name recognition method
CN116501884A (en) * 2023-03-31 2023-07-28 重庆大学 Medical entity identification method based on BERT-BiLSTM-CRF

Similar Documents

Publication Publication Date Title
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN111079985B (en) Criminal case criminal period prediction method based on BERT and fused with distinguishable attribute features
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN106569998A (en) Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN110298043B (en) Vehicle named entity identification method and system
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN113297360B (en) Law question-answering method and device based on weak supervised learning and joint learning mechanism
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN113822026A (en) Multi-label entity labeling method
CN111582506A (en) Multi-label learning method based on global and local label relation
CN113987167A (en) Dependency perception graph convolutional network-based aspect-level emotion classification method and system
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115599899A (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
CN112989830B (en) Named entity identification method based on multiple features and machine learning
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN117094325B (en) Named entity identification method in rice pest field
CN113641809A (en) XLNET-BiGRU-CRF-based intelligent question answering method
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
CN114372454A (en) Text information extraction method, model training method, device and storage medium
CN116127954A (en) Dictionary-based new work specialized Chinese knowledge concept extraction method
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination