CN113408288A

CN113408288A - Named entity identification method based on BERT and BiGRU-CRF

Info

Publication number: CN113408288A
Application number: CN202110725919.7A
Authority: CN
Inventors: 陈鹏杰; 谢胜利; 孙为军
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-17

Abstract

The invention discloses a named entity identification method based on BERT and BiGRU-CRF, relating to the technical field of computers. The method comprises the following steps: obtaining comment text data of the e-commerce industry through a web crawler and labeling the text data; preprocessing the labeled text data and constructing a training data set and a verification data set; and training a BiGRU-CRF algorithm model and a BERT algorithm model on the training data set and the verification data set. In the method, the word vector representation of a sentence is trained through a BERT pre-trained language model, where the traditional BERT training task is improved by using fragment masking in place of traditional word masking; semantic information of the current word and its context is further extracted through a BiGRU model; and the optimal label sequence that conforms to the context logic is extracted through a CRF algorithm.

Description

Named entity identification method based on BERT and BiGRU-CRF

Technical Field

The invention relates to the technical field of computers, in particular to a named entity identification method based on BERT and BiGRU-CRF.

Background

With the rapid development of internet technology, electronic commerce has become a familiar part of daily life. E-commerce platforms generally provide a review area in which consumers can comment on commodities online. Consumers can choose satisfactory commodities by reading the reviews, and merchants can learn from the same reviews how satisfied consumers are with their commodities. By checking the review area promptly, merchants can find deficiencies in the transaction process and correct them in time, which is of great significance for the continued development of a shop.

However, the rapid development of the mobile internet has left e-commerce platforms with a large and disorderly accumulation of comments, which makes it difficult for consumers to obtain accurate commodity information in a short time. Merchants likewise have difficulty finding useful consumer reviews among them. Efficiently mining the information contained in these comments therefore helps guide consumer behavior, prompts merchants to improve their service or product quality, and directly affects the economic benefit of the e-commerce platform.

Because comment texts arrive continuously in large volumes and users publish them freely, comment formats are not uniform and grammar rules are difficult to capture. If natural language processing relies on manual work, experts cannot build rules and corpora as fast as the comment data grows; the requirements cannot be met, the workload is huge, and human resources are wasted.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

Aiming at the problems in the related art, the invention provides a named entity identification method based on BERT and BiGRU-CRF, so as to overcome the technical problems in the prior related art.

The technical scheme of the invention is realized as follows:

a named entity identification method based on BERT and BiGRU-CRF comprises the following steps:

obtaining comment text data of the E-commerce industry through a web crawler, and labeling the text data;

preprocessing the labeled text data, constructing a training data set and a verification data set, and training a BiGRU-CRF algorithm model and a BERT algorithm model according to the training data set and the verification data set;

training the word vector representation of a sentence through a BERT pre-trained language model, wherein the traditional BERT training task is improved by using fragment masking in place of traditional word masking;

further extracting semantic information of the current word and its context through a BiGRU model;

and extracting the optimal label sequence that conforms to the context logic through a CRF algorithm.

Further, labeling the text data comprises the following steps:

training a model by using part of the labeled text data in an incremental learning mode, and predicting the rest unlabeled text data according to the trained model;

and directly taking the prediction result with the confidence coefficient higher than the preset threshold value as a mark of the text data, and manually marking the text data with the confidence coefficient lower than the preset threshold value.
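The confidence-based routing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `route_predictions`, the tuple layout of a prediction and the default threshold of 0.9 are all assumptions introduced here, since the source only speaks of a "preset threshold".

```python
from typing import List, Tuple

Prediction = Tuple[str, List[str], float]  # (text, predicted tags, confidence)


def route_predictions(
    predictions: List[Prediction],
    threshold: float = 0.9,  # illustrative value; the source only says "preset"
) -> Tuple[List[Tuple[str, List[str]]], List[str]]:
    """Split predictions into auto-accepted labels and a manual-labeling queue."""
    auto_labeled: List[Tuple[str, List[str]]] = []
    needs_manual: List[str] = []
    for text, tags, confidence in predictions:
        if confidence > threshold:
            # Confidence above the threshold: keep the predicted tags as the label.
            auto_labeled.append((text, tags))
        else:
            # Otherwise send the text to a human annotator. The source does not
            # say what happens at exactly the threshold; this sketch treats that
            # case conservatively as "manual".
            needs_manual.append(text)
    return auto_labeled, needs_manual
```

In an incremental loop, the auto-accepted samples and the manually labeled ones would both be added to the training set before the next round of training.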

Further, using fragment masking in place of traditional word masking comprises the following steps:

randomly selecting a mask length according to a geometric distribution, then randomly selecting a starting position according to a uniform distribution, and finally masking a run of characters of that length in the sentence.
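The sampling procedure above can be sketched in a few lines. The success probability `p`, the cap `max_len` and the function names are assumptions made for illustration; the source specifies only that the length follows a geometric distribution and the start a uniform distribution.

```python
import random
from typing import List, Tuple


def sample_span(seq_len: int, p: float = 0.2, max_len: int = 10,
                rng: random.Random = None) -> Tuple[int, int]:
    """Return (start, length) of one mask fragment.

    length ~ Geometric(p), truncated to [1, min(max_len, seq_len)];
    start  ~ Uniform over all positions where the fragment still fits.
    p and max_len are illustrative values, not taken from the source.
    """
    if seq_len < 1:
        raise ValueError("sequence must contain at least one token")
    rng = rng or random.Random()
    limit = min(max_len, seq_len)
    length = 1
    # Geometric draw: extend the fragment with probability (1 - p) each time.
    while length < limit and rng.random() > p:
        length += 1
    start = rng.randint(0, seq_len - length)  # inclusive on both ends
    return start, length


def mask_span(tokens: List[str], start: int, length: int,
              mask_token: str = "[MASK]") -> List[str]:
    """Replace tokens[start:start + length] with mask tokens."""
    return tokens[:start] + [mask_token] * length + tokens[start + length:]
```

For example, `mask_span(list("电池续航很好"), *sample_span(6))` masks one contiguous run of characters instead of isolated ones.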

Further, extracting the semantic information of the current word and its context through the BiGRU model comprises extracting the entities contained in the text data to be predicted through the trained BiGRU-CRF algorithm model, as follows:

modeling the text data sequence in a forward direction and a backward direction by utilizing a bidirectional GRU model;

and scoring the whole prediction path with a conditional random field (CRF), which constrains the relations between label results, and extracting the entities contained in the text data.
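To make "scoring the whole prediction path" concrete, the sketch below finds the highest-scoring tag path given per-position scores and tag-to-tag transition scores. The source does not name a decoding algorithm; Viterbi decoding is assumed here because it is the usual choice for linear-chain CRFs. The inputs `emissions` (standing in for the BiGRU output) and `transitions` (standing in for learned CRF parameters) are hypothetical.

```python
from typing import List, Tuple


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> Tuple[float, List[int]]:
    """Return (best_score, best_tag_path) for one sentence.

    emissions[t][j]   : score of tag j at position t (e.g. from the BiGRU)
    transitions[i][j] : score of moving from tag i to tag j (CRF constraint)
    A path's score is the sum of its emission and transition scores.
    """
    n_tags = len(transitions)
    score = list(emissions[0])          # best score of a path ending in each tag
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(range(n_tags), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

A strongly negative transition score (for instance from an "outside" tag directly to an "inside" tag) is how the CRF rules out label sequences that a per-position classifier could still produce.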

Further, the method also comprises the following steps:

if the text data is a long text, predicting the relationship between the text data to be predicted and the entities through the trained BERT algorithm model, comprising the following steps:

acquiring a BERT original model, which uses the [CLS] token to represent the overall type characteristics of a sentence in the article and uses [SEP] to separate the multiple sentences of the input article;

combining the BERT input with the entities extracted upstream, and encoding with the structure {[CLS] article sentence [SEP] subject [object] [SEP]};

connecting the subject entity vector, the sentence vector and the object entity vector, and predicting the relationship type through a fully connected layer and softmax; wherein a = [a1, a2, ..., an] represents the subject entity vector, and b = [b1, b2, ..., bn] represents the object entity vector.

The invention has the beneficial effects that:

The invention provides a named entity identification method based on BERT and BiGRU-CRF, which comprises: obtaining comment text data of the e-commerce industry through a web crawler and labeling the text data; preprocessing the labeled text data and constructing a training data set and a verification data set; training a BiGRU-CRF algorithm model and a BERT algorithm model on the training data set and the verification data set; extracting the entities contained in the text data to be predicted through the trained BiGRU-CRF algorithm model; predicting the relationships between the entities from the text data and the extracted entities through the trained BERT algorithm model, and establishing entity connection relationships; and further analyzing semantic information according to the connection relationships. Built on the named entity recognition task and on deep-learning natural language processing, the method thus extracts the required entities from text with the BiGRU-CRF model and predicts the relationships among them with the BERT model.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a named entity identification method based on BERT and BiGRU-CRF according to an embodiment of the present invention;

FIG. 2 is a schematic block diagram of a named entity identification method based on BERT and BiGRU-CRF according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

According to the embodiment of the invention, a named entity identification method based on BERT and BiGRU-CRF is provided.

As shown in fig. 1-2, the method for identifying a named entity based on BERT and BiGRU-CRF according to an embodiment of the present invention includes the following steps:

step one, obtaining comment text data of the e-commerce industry through a web crawler, and labeling the text data;

step two, preprocessing the labeled text data to construct a training data set and a verification data set, and training a BiGRU-CRF algorithm model and a BERT algorithm model according to the training data set and the verification data set;

step three, extracting the entities contained in the text data to be predicted through the trained BiGRU-CRF algorithm model;

step four, predicting the relationship between the text data to be predicted and the entities through the trained BERT algorithm model, thereby establishing the entity connection relationship;

and step five, further analyzing the semantic information according to the connection relationship.

By means of the above technical scheme, and starting from the problem of named entity recognition in e-commerce comments, the invention provides an automatic information extraction method based on deep-learning natural language processing. Building on named entity recognition in comment texts, it provides a high-precision named entity extraction method for extracting information from unstructured comment texts.

In addition, labeling the text data includes: training a model on part of the labeled text data in an incremental learning mode, and predicting the remaining unlabeled text data with the trained model; directly taking prediction results whose confidence is higher than the preset threshold as labels of the text data, and manually re-labeling text data whose confidence is lower than the preset threshold. This greatly improves the efficiency of data labeling and reduces labor cost.

The labeled data set is also augmented. A commodity comment entity library for the e-commerce industry is constructed, and entities in the labeled text are randomly replaced by entities of the same type from the library. For general entity types such as time and amount, random generators covering different ways of expressing numbers are constructed, and the times, amounts and similar values in the original labeled text data are replaced by randomly generated ones. This expands the labeled data set and lets the model learn the linguistic knowledge of the context better, improving the robustness and accuracy of the model compared with common methods.
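The entity-replacement augmentation described in this passage can be sketched as below. Everything concrete in the sketch is an assumption made for illustration: the span representation `(start, end, type)`, the type names `PRODUCT`, `TIME` and `AMOUNT`, the surface formats produced by the two generators, and the function names.

```python
import random
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type), end exclusive


def random_amount(rng: random.Random) -> str:
    """Generate an amount in one of several surface forms (illustrative)."""
    value = rng.randint(1, 9999)
    return rng.choice([f"{value}元", f"¥{value}", f"{value}块钱"])


def random_time(rng: random.Random) -> str:
    """Generate a date in one of several surface forms (illustrative)."""
    year, month, day = rng.randint(2015, 2021), rng.randint(1, 12), rng.randint(1, 28)
    return rng.choice([f"{month}月{day}日", f"{year}-{month:02d}-{day:02d}"])


def augment(text: str, spans: List[Span], entity_library: Dict[str, List[str]],
            rng: random.Random) -> Tuple[str, List[Span]]:
    """Replace every labeled span with a same-type substitute.

    Returns the new text and the spans re-aligned to it. Spans are assumed
    not to overlap.
    """
    generators = {"TIME": random_time, "AMOUNT": random_amount}
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor, offset = 0, 0  # offset = accumulated change in text length
    for start, end, etype in sorted(spans):
        if etype in generators:
            substitute = generators[etype](rng)
        elif entity_library.get(etype):
            substitute = rng.choice(entity_library[etype])
        else:
            substitute = text[start:end]  # nothing to swap in: keep the original
        pieces.append(text[cursor:start])
        pieces.append(substitute)
        new_start = start + offset
        new_spans.append((new_start, new_start + len(substitute), etype))
        offset += len(substitute) - (end - start)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans
```

Because the labels travel with the replaced spans, each augmented sentence is a new fully labeled training sample at no extra annotation cost.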

In addition, extracting the entities contained in the text data to be predicted through the trained BiGRU-CRF algorithm model includes the following steps: modeling the text data sequence in the forward and backward directions with a bidirectional GRU model; and scoring the whole prediction path with a conditional random field (CRF), which constrains the relations between label results, so as to extract the entities contained in the text data. In this embodiment, a deep learning sequence labeling model labels each sentence and thereby recognizes the entities present in the text.
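Once the sequence labeling model has assigned one tag per character, the entities still have to be read off the tag sequence. The source does not state which tagging scheme is used; the sketch below assumes the common BIO scheme (`B-TYPE` opens an entity, `I-TYPE` continues it, `O` is outside), and the function name and type labels are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[str, str, int, int]  # (surface text, type, start, end exclusive)


def extract_entities(chars: List[str], tags: List[str]) -> List[Entity]:
    """Collect entities from a BIO tag sequence (assumed scheme)."""
    entities: List[Entity] = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (tag.startswith("I-") and start is not None
                     and tag[2:] == etype)
        if continues:
            continue
        if start is not None:
            entities.append(("".join(chars[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # A stray "I-" tag with no open entity is ignored.
    return entities
```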

In addition, if the text data is a long text, predicting the relationship between the text data to be predicted and the entities through the trained BERT algorithm model, and establishing relationship connections between related entities, includes the following steps:

connecting the subject entity vector, the sentence vector and the object entity vector, and predicting the relationship type through a fully connected layer and softmax;

wherein a = [a1, a2, ..., an] represents the subject entity vector, and b = [b1, b2, ..., bn] represents the object entity vector.

The method trains an entity recognition model and a relation prediction model separately on the labeled data, and then applies them in sequence:

constructing a BiGRU-CRF model and training it for entity recognition, modeling sequences in the forward and backward directions with a bidirectional GRU model and constraining the relations between label results with a conditional random field (CRF);

constructing a BERT model as the relation analysis model, processing the input data of the model into the structure '{[CLS] article sentence [SEP] subject [object] [SEP]}', and training the BERT model;

predicting the entities that appear in the data to be predicted with the entity recognition model BiGRU-CRF;

and pairing the entities extracted in the previous step, combining each pair with the original text in the format of the training data, and feeding the result to the model to predict the relationships among the entities.
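The pairing step above can be sketched as follows. The translated template leaves the exact layout between subject and object unclear, so the sketch exposes it as a parameter and, as an assumption, defaults to placing a `[SEP]` between them; the function name and the choice of ordered pairs are likewise illustrative.

```python
from itertools import permutations
from typing import List, Tuple


def build_relation_inputs(sentence: str, entities: List[str],
                          pair_separator: str = " [SEP] ",
                          ) -> List[Tuple[str, str, str]]:
    """Return (subject, object, model_input) for every ordered entity pair.

    pair_separator is an assumption: the source template does not make clear
    how subject and object are delimited inside the input.
    """
    inputs = []
    for subject, obj in permutations(entities, 2):
        model_input = (f"[CLS] {sentence} [SEP] "
                       f"{subject}{pair_separator}{obj} [SEP]")
        inputs.append((subject, obj, model_input))
    return inputs
```

In practice a BERT tokenizer would insert the special tokens itself; the literal strings are used here only to show the layout.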

The invention provides a BERT model with an improved masking strategy to analyze the semantic relationships between entities. BERT uses a multi-layer self-attention mechanism to encode text bidirectionally, extracting semantic and syntactic information at different levels from low to high. The traditional BERT model trains a language model with a masked language model (MLM) objective: 15% of the words in a text are masked and the model predicts them. The invention replaces this traditional word masking with fragment masking. Specifically, a mask length is randomly selected according to a geometric distribution, a starting position is then randomly selected according to a uniform distribution, and finally a run of characters of that length in the sentence is masked. Transfer learning with BERT generally provides strong support for downstream tasks.

The BERT original model uses the [CLS] token to represent the overall type characteristics of a sentence and uses [SEP] to separate multiple input sentences. Exploiting this special input structure, the invention proposes a BERT structure for semantic relationship analysis: the BERT input is combined with the entities extracted upstream and encoded with the structure {[CLS] article sentence [SEP] subject [object] [SEP]}. Let a = [a1, a2, ..., an] denote the subject entity vector and b = [b1, b2, ..., bn] denote the object entity vector. The sentence vector, the subject entity vector and the object entity vector are connected, and the relationship type is finally predicted through a fully connected layer and softmax, expressed as:

V′ = W[concat(a, b)] + λ;

p = softmax(V′).
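The two formulas can be traced with a small numeric sketch. It follows the formula literally and concatenates only a and b. The source does not define W or λ; the sketch assumes W is the weight matrix of the fully connected layer (one row per relation type) and λ is its bias vector.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relation_probabilities(a: List[float], b: List[float],
                           W: List[List[float]], lam: List[float]) -> List[float]:
    """Compute p = softmax(W[concat(a, b)] + lam).

    a, b : subject and object entity vectors
    W    : assumed weight matrix, one row per relation type
    lam  : assumed bias vector, one entry per relation type
    """
    features = a + b  # concat(a, b)
    v_prime = [sum(w * x for w, x in zip(row, features)) + bias
               for row, bias in zip(W, lam)]
    return softmax(v_prime)
```

The predicted relation type is the index of the largest entry of `p`.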

To address the problem that word masking in the traditional BERT masking strategy splits the information of semantically close words, a fragment masking mode is adopted: a mask length is randomly selected according to a geometric distribution, a starting position is then randomly selected according to a uniform distribution, and finally a run of characters of that length in the sentence is masked. The invention also removes text noise by syntactic relation analysis. To address the problem of fuzzy entity regions in the text, the extracted entity is taken as the semantic head word, the text within a limited number of steps that has a dependency relationship with the entity is retained, and unnecessary text noise caused by fuzzy entity distribution is removed. This resolves the ambiguity of entity regions and improves the accuracy of entity relationship analysis. The entity pairs are then added to the model and features are extracted to predict the relationship between the entities. Meanwhile, to prevent overfitting, the entity pair whose relationship is to be predicted is replaced by [MASK] in the sentence. Compared with existing relation extraction approaches, this gives a clear improvement.
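Masking the entity pair in the sentence, as described above, can be sketched as follows. Whether each entity becomes a single `[MASK]` or one `[MASK]` per character is not stated in the source; the sketch assumes a single token per entity, and the function name and span format are hypothetical.

```python
from typing import Tuple


def mask_entity_pair(sentence: str, subject_span: Tuple[int, int],
                     object_span: Tuple[int, int],
                     mask_token: str = "[MASK]") -> str:
    """Replace the subject and object spans (start, end exclusive) by mask tokens.

    One mask token per entity is an assumption of this sketch.
    """
    first, second = sorted([subject_span, object_span])
    if first[1] > second[0]:
        raise ValueError("entity spans must not overlap")
    return (sentence[:first[0]] + mask_token
            + sentence[first[1]:second[0]] + mask_token
            + sentence[second[1]:])
```

With the surface forms hidden, the model has to rely on the surrounding context rather than memorizing which entity strings tend to co-occur with which relation.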

In summary, by means of the technical scheme of the present invention, a semantic relationship extraction method is provided for extracting information from unstructured text, based on deep-learning natural language processing. First, entities contained in a text, such as product entities, amounts, times, places and organizations, are extracted by the BiGRU-CRF algorithm model. The text and the extracted entities are then passed to the BERT model to predict the relationships between the entities, relationship connections between related entities are established, and semantic analysis of the text is completed according to these connections. The invention combines natural language processing with a deep-learning pre-trained model; the process was arrived at through practice, exploration and research, predicts well, and is efficient and well targeted. In engineering practice, compared with the rule-based extraction commonly adopted in text data mining projects and with general technical methods, it achieves higher accuracy and higher processing speed. Moreover, the invention has the following three advantages:

1) An entity library of e-commerce comment data is constructed, and entities of the same type are randomly replaced from the library. For general entity types such as time and amount, random generators covering different ways of expressing numbers are constructed, and the times, amounts and similar values in the original labeled text data are replaced by randomly generated ones, so that the model learns the linguistic knowledge of the context better.

2) An incremental learning mode is adopted: a model trained on part of the labeled data predicts another batch of unlabeled text data, prediction results whose confidence is higher than a threshold are used directly as labels, and data with low confidence is manually re-labeled. This greatly improves the efficiency of data labeling and reduces labor cost.

3) To address the fuzziness of entity regions in e-commerce comment text, irrelevant information in the text is removed as noise by syntactic relation analysis. The extracted entity is taken as the semantic head word and denoising is performed according to the syntactic relations: the text within a limited number of steps that has a dependency relationship with the entity is retained, which shortens the analyzed text as far as possible while preserving the original context structure, and removes unnecessary text noise caused by fuzzy entity distribution. This greatly improves the training speed and prediction accuracy of the model and alleviates the speed disadvantage of pre-trained models when training and inferring on long sentences.
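The dependency-based denoising in advantage 3) can be sketched as a bounded breadth-first search over a dependency graph. The sketch assumes the dependency edges come from an external parser, which the source does not name; the edge format `(head index, dependent index)`, the default of two steps and the function name are all illustrative.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def prune_by_dependency(tokens: List[str], edges: List[Tuple[int, int]],
                        entity_indices: List[int],
                        max_steps: int = 2) -> List[str]:
    """Keep tokens within max_steps dependency hops of any entity token.

    edges holds (head, dependent) token-index pairs from an assumed external
    dependency parser; direction is ignored when measuring distance.
    """
    neighbors: Dict[int, Set[int]] = {i: set() for i in range(len(tokens))}
    for head, dependent in edges:
        neighbors[head].add(dependent)
        neighbors[dependent].add(head)
    # Breadth-first search starting from all entity tokens at distance 0.
    distance = {i: 0 for i in entity_indices}
    queue = deque(entity_indices)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the step limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    # Preserve the original token order of the retained text.
    return [tok for i, tok in enumerate(tokens) if i in distance]
```

Keeping the surviving tokens in their original order is what preserves the context structure while shortening the input passed to the relation model.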

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A named entity identification method based on BERT and BiGRU-CRF is characterized by comprising the following steps:

2. The method for identifying named entities based on BERT and BiGRU-CRF as claimed in claim 1, wherein the text data is labeled, comprising the following steps:

3. The BERT and BiGRU-CRF based named entity recognition method of claim 2, wherein the use of fragment masking instead of traditional word masking comprises the steps of:

4. The method of claim 3, wherein the extracting semantic information of the current word and context through the BiGRU model further comprises extracting entities included in text data to be predicted through a trained BiGRU-CRF algorithm model, and represents the following steps:

5. The method for identifying named entities based on BERT and BiGRU-CRF as claimed in claim 4, further comprising the steps of: