CN117236338A - Named entity recognition model of dense entity text and training method thereof - Google Patents

Named entity recognition model of dense entity text and training method thereof

Info

Publication number
CN117236338A
Authority
CN
China
Prior art keywords
training
model
entity
text
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311095973.3A
Other languages
Chinese (zh)
Inventor
李静远
殷大虎
王元卓
孙诗奇
岑建何
郑耀
吴琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science And Technology Big Data Research Institute
Beijing Technology and Business University
Original Assignee
China Science And Technology Big Data Research Institute
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Science And Technology Big Data Research Institute, Beijing Technology and Business University filed Critical China Science And Technology Big Data Research Institute
Priority to CN202311095973.3A priority Critical patent/CN117236338A/en
Publication of CN117236338A publication Critical patent/CN117236338A/en
Pending legal-status Critical Current

Abstract

The application discloses a named entity recognition model for dense entity text and a training method thereof. The model comprises a pre-training language model, a BiLSTM fine-granularity capturing layer, an entity tag semantic network layer and a CRF decoding layer. For a given dense entity title text, data preprocessing is performed and each character is then encoded by the pre-training language model to obtain a high-dimensional semantic representation of the word; the sentence-level text feature cls is spliced onto each character using a mixed multi-granularity feature method; BiLSTM provides finer-grained sequence modeling; the CRF layer models the labels in the sequence; the total model loss is obtained to yield a trained comprehensive model; and the comprehensive model is used to perform named entity recognition on dense entity text to obtain a recognition result. The application addresses the problem that existing named entity recognition techniques do not consider high entity density and fine type granularity at the same time, comprehensively captures the internal structure and category information of words and texts, and improves the accuracy of entity recognition.

Description

Named entity recognition model of dense entity text and training method thereof
Technical Field
The application belongs to the technical field of natural language processing and deep learning, and particularly relates to a named entity recognition model of dense entity text and a training method thereof.
Background
Named entity recognition (Named Entity Recognition, NER) is a natural language processing technique that aims to automatically recognize named entities in text and classify them into predefined categories such as person names, place names and organization names. NER is widely used in information extraction, semantic analysis, machine translation, text classification and other applications. In early studies, named entity recognition was accomplished primarily with manual rule-based and vocabulary-based methods, which rely on domain experts to construct pattern-matching rules for detecting and extracting named entities from data. However, rule-based and dictionary-based methods can only be used in specific fields with specified rules, have low recognition accuracy, and are time-consuming and laborious.
As the amount of data grows exponentially, rule extraction becomes increasingly difficult, and rule-based and dictionary-based approaches struggle to cope with the heterogeneity and complexity of text. Methods based on statistical machine learning greatly reduce the reliance of feature engineering on expert-constructed rules, and their recognition accuracy for common entity types can exceed 80%. Such methods learn hidden sequences from large amounts of text data and build text-sequence observation models, such as hidden Markov models and conditional random fields, from labeled sequences to identify named entities. Because they model statistical features of text, statistical machine learning methods have a certain degree of generality.
In recent years, with the rapid development of deep learning and the emergence of distributed word representations, pre-training language models represented by BERT have gradually become the mainstream method for named entity recognition in many fields, owing to their excellent performance and their ability to model polysemous words and integrate syntactic and semantic information, and they have brought substantial performance improvements to many natural language processing tasks.
Dense entity identification is a core basic task in NLP applications. It can be reused for various downstream scenarios and is of great importance for search, recommendation, inferring user interests and social relations, finer-grained understanding and analysis, and improving the efficiency and accuracy of reasoning. Entity identification in dense entity text is the key concern of this application; texts such as medicine titles, social media posts, commodity titles, scientific literature and legal documents have the characteristics of high entity density and fine type granularity.
For domain-specific Named Entity Recognition (NER), the BERT-CRF model combined with domain-specific feature engineering is a common approach. However, in the task of dense entity identification, at least two challenges need to be considered. Challenge 1: most dense entity text consists of many entities that are not closely related and lacks context and grammar information. There is no semantic or syntactic association between entities, and since title formulation follows no uniform rules, the distribution of entities in dense entity text is arbitrary; the model therefore cannot learn useful information about entity types or boundaries from sequential relationships, especially for classifying different entities. Challenge 2: the entity types in dense entity text are fine-grained, which makes predicting entity types more difficult than for other kinds of entities. In some downstream tasks of Chinese entity recognition, such as recommendation and retrieval, fine-grained attributes mean more accurate matching, which leads to a better user experience.
In summary, most current techniques mainly focus on only one of the two characteristics of dense entity text, high entity density or fine type granularity, rather than addressing both challenges at the same time.
Disclosure of Invention
Aiming at the problem in the prior art that the results of named entity recognition models may be ambiguous or vague, or that similar entities may be classified into different types, leading to poor recognition performance, the application provides a named entity recognition model for dense entity text and a training method thereof.
The application solves the technical problems by adopting the scheme that: a named entity recognition model of dense entity text and a training method thereof are provided, wherein the named entity recognition model comprises a pre-training language model, a BiLSTM feature extraction layer, an entity tag semantic network layer and a CRF decoding layer.
The training method of the named entity recognition model comprises the following steps:
s1, inputting label-free data into a pre-training language model, and performing incremental training by adopting a hierarchical learning rate optimization strategy to obtain a trained pre-training language model PLM;
s2, dividing the data with the labels for training into a training set and a test set, and adopting data enhancement to expand the training set;
s3, encoding each character by the data in the training set through PLM to obtain high-dimensional semantic representation of the word;
S4, performing adversarial learning during training by adding small perturbations to the embedding layer using PGD, so as to improve the generalization capability of the model;
s5, splicing sentence-level text features cls for each character by adopting a method of mixing multi-granularity features to obtain more robust character representation;
s6, providing finer granularity sequence modeling by using BiLSTM, and obtaining comprehensive characteristics F;
s7, inputting the comprehensive characteristics F into a CRF decoding layer for forward propagation, and updating model parameters to obtain a trained comprehensive model.
The step of incremental training by adopting the hierarchical learning rate optimization strategy in the step S1 comprises the following steps:
s1-1, adopting a dynamic MASK strategy: generating a new mask text randomly in each iteration, and enhancing the generalization capability of the model;
s1-2, mask probability is set to be 0.5, so that training difficulty is increased;
S1-3, adopting an N-gram mask strategy: tokens are selected according to the mask probability, and the selected positions are masked as 1-gram, 2-gram and 3-gram fragments with probabilities of 70%, 20% and 10% respectively, to increase training difficulty; the probabilities with which a selected token is replaced by [MASK], replaced by a random word, or kept unchanged are consistent with the original BERT;
s1-4, using R-drop to improve the stability of model output;
s1-5, the weight of the model at different times in the training stage is averaged by using EMA, so that the model weight is smoother, and the performance and generalization capability are improved.
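For illustration only, the following is a minimal PyTorch sketch of the EMA weight averaging described in S1-5; the class name, the decay value of 0.999 and the method names are assumptions rather than part of the disclosed implementation.

```python
import torch


class EMA:
    """Sketch of exponential moving average weight smoothing (step S1-5)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.model = model
        self.decay = decay
        # Shadow copy of every trainable parameter.
        self.shadow = {n: p.detach().clone()
                       for n, p in model.named_parameters() if p.requires_grad}
        self.backup = {}

    def update(self):
        # Called after each optimizer step: shadow = decay * shadow + (1 - decay) * param.
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)

    def apply_shadow(self):
        # Swap the smoothed weights in (e.g. for evaluation or saving a checkpoint).
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                self.backup[n] = p.detach().clone()
                p.data.copy_(self.shadow[n])

    def restore(self):
        # Swap the original training weights back after evaluation.
        for n, p in self.model.named_parameters():
            if p.requires_grad:
                p.data.copy_(self.backup[n])
        self.backup = {}
```

In use, update() would be called after every optimizer step during incremental training, and apply_shadow()/restore() would bracket evaluation or checkpoint saving so that the smoothed weights are the ones evaluated.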
The hierarchical learning rate optimization strategy treats the layers closer to the output as more specialized for the incremental training task during incremental training of the pre-training language model: when fine-tuning downstream tasks, layers closer to the output learn the new task with a larger learning rate, while layers closer to the input use a smaller learning rate to retain more general knowledge, so incremental training is performed on the different attention layers of the encoder of the pre-training language model with the learning rate increasing from the input-side layers to the output-side layers.
The method for expanding the training set by adopting data enhancement in the step S2 is to extract and splice the entities with the same entity type in the training set into sentences to be added into the training set, so that the model learns the classification information in the same entity type.
The method for encoding each character by the pre-trained language model in step S3 comprises the steps of:
S3-1, the original multi-head attention is implemented based on scaled dot-product attention: the input sequence x = (x_1, x_2, ..., x_n) is multiplied by different weight matrices W^Q, W^K and W^V to obtain Q, K and V respectively, and the output sequence (z_1, z_2, ..., z_n) has the same length as the input sequence. The output z_i is calculated as follows:
where α_ij is the attention weight between the hidden states at positions i and j, obtained after softmax;
S3-2, relative position encoding is adopted: the interdependence between hidden states, i.e. the outputs and attention scores, is calculated with a sinusoidal function of the relative position, so that relative position information between the two positions is obtained at each layer, as shown in the following formula:
where i and j are index positions, and their difference takes the place of the absolute position index;
S3-3, the relative position information is superimposed on the token input at each position, as shown in the following formula:
S3-4, given a dense entity text X = (x_1, x_2, ..., x_n), each character x_i is encoded by the pre-training language model to obtain a high-dimensional semantic representation e_i, as shown below:
where the first symbol denotes the embedding mapping of the pre-training language model, d_1 is the word vector dimension, and v is the vocabulary size.
The method in step S4 of performing adversarial learning by adding small perturbations to the embedding layer in the gradient direction using PGD includes the following steps:
S4-1, adversarial training is formulated as a min-max optimization problem, i.e. inner-layer maximization and outer-layer minimization, with the following formula:
where L denotes the loss function, f_θ denotes the model, x_i denotes an original data sample, y_i denotes its label, x'_i denotes the adversarial sample, and ε denotes the constraint on the added noise;
S4-2, the formula for the adversarial sample is as follows:
where α is a relatively small step size, and the gradient is then normalized with a sign function.
The method of mixing multi-granularity features described in step S5 includes: for each character x_i, the sentence-level text feature cls is spliced on to obtain a more robust character representation s_i, and S = (s_1, s_2, ..., s_i, ..., s_n) denotes the corresponding vector representation of each sentence, as shown below:
In step S6, more comprehensive context information and dependency relationships between words are captured through bidirectional BiLSTM modeling, and the embedding mapping of the pre-training language model, the CLS feature, the character-type features produced by feature engineering and the fine-grained features extracted by BiLSTM are spliced to obtain all extracted features F.
In step S7, all the extracted features F are spliced with the predicted entity tag information and the real tag information respectively, the results are input into the CRF decoding layer for forward propagation, and the total loss for backward propagation is finally obtained; the loss function is as follows:
where m is the number of samples, and the remaining two symbols denote the prediction result and the labeled result of the i-th sample for the j-th class, respectively.
The CRF decoding layer includes the following:
S7-1, the input vector sequence is X = {x_1, x_2, ..., x_n}, the tag sequence is Y = {y_1, y_2, ..., y_n}, and the output of BiLSTM is P = {p_1, p_2, ..., p_n}, where p_i is a vector of tag scores for the i-th position. The score function S(X, Y) of the CRF represents the score of a given input sequence X and tag sequence Y and consists of two parts, the emission score and the transition score; the score function is as follows:
where i runs from 1 to n, n is the sequence length, and T_{i,j} denotes the score of transitioning from tag i to tag j; special START and END tags need to be added for the start and end of the sequence;
S7-2, the normalization function Z(X) of the CRF represents the sum, over all possible tag sequences, of the exponentiated scores for a given input sequence X, as shown in the following equation:
Z(X) = ∑_{Y′} exp(S(X, Y′))   (eleven)
Wherein Y' represents all possible tag sequences;
s7-3, the negative log likelihood loss of the CRF is defined as follows:
the application has the beneficial effects that: compared with the prior art, the method and the device not only adopt the data enhancement technology in the data preprocessing, but also extract the entities with the same entity type in the training set, splice the entities into sentences, add the sentences into the training set, and enable the model to learn the internal classification information with the same entity type; the characteristic engineering technology is adopted to distinguish the characters such as numbers, letters, characters and the like and carry out additional coding; the application also provides a label semantic network layer, which is a new strategy for integrating entity label information, and can effectively extract and integrate the entity label information.
Drawings
FIG. 1 is a block diagram of a named entity recognition model training method of the present application.
FIG. 2 is a diagram of the incremental training framework using the hierarchical learning rate optimization strategy.
FIG. 3 is a diagram of the character encoding framework in the pre-training language model.
FIG. 4 is a block diagram of a named entity recognition model architecture of the present application.
Detailed Description
Example 1: in order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail by means of specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Dense entity text has the characteristics of high entity density and fine type granularity. Most current techniques pay attention to only one of these two characteristics rather than considering both challenges at the same time, so the recognition results of prior-art named entity recognition models may be ambiguous, or similar entities may be classified into different types, resulting in poor recognition performance.
As shown in FIG. 4, the named entity recognition model of the dense entity text comprises four core components of a pre-training language model, a BiLSTM fine granularity capturing layer, an entity tag semantic network layer and a CRF decoding layer and other components, wherein the other components comprise an input layer, an embedding layer, a Dropout layer and the like, and are used for tasks such as data preprocessing, feature extraction and model regularization. These components together form a named entity recognition model of dense entity text, which can be used to recognize entity information such as person names, place names, organization names, etc. in the text.
The purpose of these components is to identify named entities in the text more accurately and thereby improve model performance. The pre-training language model helps the model better understand the semantics of the text, the BiLSTM fine-granularity capturing layer better captures semantic information in the text, the entity tag semantic network layer models the relationships between different entities, and the CRF decoding layer adjusts the final output so that it matches reality. Their synergy addresses the problem that dense entity texts consist of many loosely related entities and lack context and grammar information: the pre-training language model and the BiLSTM fine-granularity capturing layer recognize and understand the semantic or syntactic structure within the dense entity text, the entity tag semantic network layer models and associates related entities according to that structure, and finally the CRF decoding layer fine-tunes the associated output, ensuring that the recognized named entities are more realistic and that the model performs better on the named entity recognition task.
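To make the cooperation of these components concrete, the following PyTorch sketch shows one possible arrangement of the pipeline (pre-training language model, CLS splicing, BiLSTM fine-granularity capture, and an emission layer feeding a CRF decoder). Class names, hidden sizes and the dropout rate are illustrative assumptions, and the entity tag semantic network layer and the CRF decoding layer are left as external components.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class DenseEntityNER(nn.Module):
    """Illustrative skeleton: PLM encoder + CLS splicing + BiLSTM + emission scores for a CRF."""

    def __init__(self, plm_name: str, num_labels: int, lstm_hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)      # pre-training language model
        hidden = self.encoder.config.hidden_size
        # Each character representation is spliced with the sentence-level [CLS] vector.
        self.bilstm = nn.LSTM(hidden * 2, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.1)
        self.emission = nn.Linear(lstm_hidden * 2, num_labels)  # scores handed to the CRF layer

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_repr = out.last_hidden_state                      # per-character representations e_i
        cls = token_repr[:, 0:1, :].expand_as(token_repr)       # sentence-level cls feature
        mixed = torch.cat([token_repr, cls], dim=-1)            # multi-granularity splicing s_i
        lstm_out, _ = self.bilstm(mixed)                        # finer-grained sequence modeling
        return self.emission(self.dropout(lstm_out))            # emission scores for CRF decoding
```

A CRF layer (for example the score and loss functions sketched later in this embodiment) would consume the returned emission scores to produce the final label sequence.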
As shown in fig. 1, the named entity recognition training method based on the model includes:
s1, inputting label-free data into a pre-training language model, and performing incremental training by adopting a hierarchical learning rate optimization strategy to obtain a trained pre-training language model PLM;
S2, dividing the data with the labels for training into a training set and a test set, and adopting data enhancement to expand the training set;
S3, encoding each character by the data in the training set through PLM to obtain high-dimensional semantic representation of the word;
S4, performing adversarial learning during training by adding small perturbations to the embedding layer using PGD, so as to improve the generalization capability of the model;
s5, splicing sentence-level text features cls for each character by adopting a method of mixing multi-granularity features to obtain more robust character representation;
s6, providing finer granularity sequence modeling by using BiLSTM, and obtaining comprehensive characteristics F;
S7, inputting the comprehensive features F into a CRF decoding layer for forward propagation, and updating the model parameters to obtain a trained comprehensive model.
As shown in fig. 2, in the step S1, unlabeled data is input into a pre-training language model, and incremental training is performed by adopting an optimization strategy such as a hierarchical learning rate, and the steps include:
s1-1, adopting a dynamic MASK strategy: the new mask text can be randomly generated in each iteration, and the generalization capability of the model is enhanced.
S1-2, the mask probability is set to 0.5 to increase training difficulty.
S1-3, adopting an N-gram mask strategy: tokens are selected according to the mask probability, and the selected positions are masked as 1-gram, 2-gram and 3-gram fragments with probabilities of 70%, 20% and 10% respectively, to increase training difficulty; the probabilities with which a selected token is replaced by [MASK], replaced by a random word, or kept unchanged are consistent with the original BERT (a code sketch of this masking is given after this list).
S1-4, the stability of model output is improved by using R-drop.
S1-5, the weight of the model at different times in the training stage is averaged by using EMA, so that the model weight is smoother, and the performance and generalization capability are improved.
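The dynamic N-gram masking of S1-1 to S1-3 can be sketched as follows: the mask is regenerated on every call, span lengths of 1, 2 and 3 are drawn with probabilities of 70%, 20% and 10%, and the 80%/10%/10% [MASK]/random/keep split follows the original BERT convention referred to above. Function and argument names are assumptions, not part of the disclosure.

```python
import random


def dynamic_ngram_mask(token_ids, mask_id, vocab_size, mask_prob=0.5):
    """Regenerate an N-gram mask for one sequence; called anew at every training iteration."""
    ids = list(token_ids)
    labels = [-100] * len(ids)   # -100 marks positions ignored by the MLM loss (PyTorch convention)
    i = 0
    while i < len(ids):
        if random.random() < mask_prob:
            # Span length 1 / 2 / 3 with probability 70% / 20% / 10%.
            n = random.choices([1, 2, 3], weights=[0.7, 0.2, 0.1])[0]
            for j in range(i, min(i + n, len(ids))):
                labels[j] = ids[j]
                r = random.random()
                if r < 0.8:                        # 80%: replace with [MASK]
                    ids[j] = mask_id
                elif r < 0.9:                      # 10%: replace with a random token
                    ids[j] = random.randrange(vocab_size)
                # remaining 10%: keep the original token unchanged
            i += n
        else:
            i += 1
    return ids, labels
```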
In step S1, the hierarchical learning rate specifically includes: when the pre-training language model undergoes incremental training, layers closer to the output are regarded as more specialized for the incremental training task; when fine-tuning downstream tasks, a larger learning rate is used for layers closer to the output to learn the new task, and a smaller learning rate is used for layers closer to the input to preserve more general knowledge; incremental training is thus performed with the learning rate increasing from the input-side layers to the output-side layers across the different attention layers of the encoder of the pre-training language model.
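One way to realize this hierarchical learning rate is to give each encoder layer its own optimizer parameter group, with the rate shrinking geometrically toward the input-side layers; the base rate, the decay factor and the BERT-style parameter naming (encoder.layer.<k>.) used below are assumptions for illustration.

```python
import torch


def layerwise_param_groups(model, num_layers=12, top_lr=5e-5, decay=0.95):
    """Optimizer parameter groups whose learning rate shrinks toward the input-side layers."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        lr = top_lr                                              # default: output-side rate
        if "embeddings" in name:
            lr = top_lr * (decay ** num_layers)                  # closest to the input: smallest rate
        else:
            for k in range(num_layers):
                if f"encoder.layer.{k}." in name:
                    lr = top_lr * (decay ** (num_layers - 1 - k))  # higher layers get larger rates
                    break
        groups.append({"params": [param], "lr": lr})
    return groups


# Example: optimizer = torch.optim.AdamW(layerwise_param_groups(plm), lr=5e-5)
```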
In the step S2, the training set is enhanced and expanded by adopting data, and the specific method comprises the following steps: extracting the entities of the same entity type in the training set, splicing the entities into sentences, adding the sentences into the training set, and allowing the model to learn the internal classification information of the same entity type.
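A minimal sketch of this data enhancement step is given below: entity mentions of one type are collected from the training samples and spliced into synthetic sentences whose characters all carry B-/I- labels of that type. The sample dictionary layout and the group size are assumptions.

```python
import random
from collections import defaultdict


def augment_same_type_entities(samples, group_size=5):
    """samples: list of dicts such as {'text': str, 'entities': [(start, end, type, mention), ...]}."""
    by_type = defaultdict(list)
    for sample in samples:
        for _, _, etype, mention in sample["entities"]:
            by_type[etype].append(mention)

    augmented = []
    for etype, mentions in by_type.items():
        random.shuffle(mentions)
        for i in range(0, len(mentions), group_size):
            chunk = mentions[i:i + group_size]
            text = "".join(chunk)                   # splice same-type entities into one sentence
            labels = []
            for m in chunk:                         # first character B-<type>, the rest I-<type>
                labels += [f"B-{etype}"] + [f"I-{etype}"] * (len(m) - 1)
            augmented.append({"text": text, "labels": labels})
    return augmented
```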
As shown in fig. 3, in step S3, each character is encoded by a pre-trained language model, the encoding steps of which are as follows:
S3-1, the original multi-head attention is implemented based on scaled dot-product attention: the input sequence x = (x_1, x_2, ..., x_n) is multiplied by different weight matrices W^Q, W^K and W^V to obtain Q, K and V respectively, and the output sequence (z_1, z_2, ..., z_n) has the same length as the input sequence. The output z_i is calculated as follows:
where α_ij is the attention weight between the hidden states at positions i and j, obtained after softmax.
S3-2, relative position encoding is adopted: the interdependence between hidden states, i.e. the outputs and attention scores, is calculated with a sinusoidal function of the relative position, so that relative position information between the two positions is obtained at each layer, as shown in the following formula:
where i and j are index positions, and their difference takes the place of the absolute position index.
S3-3, the relative position information is superimposed on the token input at each position, as shown in the following formula:
S3-4, given a dense entity text X = (x_1, x_2, ..., x_n), each character x_i is encoded by the pre-training language model to obtain a high-dimensional semantic representation e_i, as shown below:
where the first symbol denotes the embedding mapping of the pre-training language model, d_1 is the word vector dimension, and v is the vocabulary size.
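As an illustration of S3-4, the per-character representations e_i can be obtained with the Hugging Face transformers API as sketched below; bert-base-chinese is used only as a stand-in for the UER-Large checkpoint mentioned later in this embodiment, and the sketch assumes each Chinese character maps to a single token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-chinese" is only a stand-in for the pre-training language model of the embodiment.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
plm = AutoModel.from_pretrained("bert-base-chinese")

text = "超长待机蓝牙耳机"                      # a fragment of a dense entity title
chars = list(text)                              # character-level input x_1 ... x_n
enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    out = plm(**enc)

hidden = out.last_hidden_state                  # shape (1, n + 2, d_1) including [CLS] and [SEP]
cls_vec = hidden[:, 0, :]                       # sentence-level feature later spliced onto each character
char_repr = hidden[:, 1:-1, :]                  # high-dimensional representations e_1 ... e_n
```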
In step S4, the method for performing adversarial learning by adding small perturbations to the embedding layer in the gradient direction using PGD includes:
S4-1, adversarial training is formulated as a min-max optimization problem, i.e. inner-layer maximization and outer-layer minimization, with the following formula:
where L denotes the loss function, f_θ denotes the model, x_i denotes an original data sample, y_i denotes its label, x'_i denotes the adversarial sample, and ε denotes the constraint on the added noise.
S4-2, the formula for the adversarial sample is as follows:
where α is a relatively small step size, and the gradient is then normalized with a sign function.
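A sketch of PGD adversarial training on the embedding layer, following the min-max formulation and sign-normalized step described above: the inner loop takes several small steps of size α along the sign of the gradient and projects the accumulated perturbation back into an ε-ball around the original embedding weights. The attribute name word_embeddings assumes a BERT-style encoder, and all hyperparameters are illustrative.

```python
import torch


class PGD:
    """Sketch of projected gradient descent perturbation on the embedding weights."""

    def __init__(self, model, emb_name="word_embeddings", epsilon=0.1, alpha=0.01):
        self.model, self.emb_name = model, emb_name
        self.epsilon, self.alpha = epsilon, alpha
        self.emb_backup = {}

    def attack(self, is_first_attack=False):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                if is_first_attack:
                    self.emb_backup[name] = param.data.clone()
                # Small step of size alpha along the sign of the gradient.
                param.data.add_(self.alpha * torch.sign(param.grad))
                param.data = self.project(name, param.data)

    def project(self, name, data):
        # Keep the accumulated perturbation inside an epsilon ball around the original weights.
        r = data - self.emb_backup[name]
        norm = torch.norm(r)
        if norm > self.epsilon:
            r = self.epsilon * r / norm
        return self.emb_backup[name] + r

    def restore(self):
        # Put the unperturbed embedding weights back before the optimizer step.
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and name in self.emb_backup:
                param.data = self.emb_backup[name]
        self.emb_backup = {}
```

In a typical training loop, the normal loss is back-propagated first, attack() plus an extra forward/backward pass are repeated a few times to accumulate adversarial gradients, restore() puts the original embedding weights back, and only then is the optimizer stepped.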
In step S5, the method of mixing multi-granularity features includes: for each character x_i, the sentence-level text feature cls is spliced on to obtain a more robust character representation s_i, and S = (s_1, s_2, ..., s_i, ..., s_n) denotes the corresponding vector representation of each sentence, as shown below:
in said step S6, a sequence modeling with finer granularity is provided using BiLSTM and a composite feature F is obtained, which specifically comprises: more comprehensive context information and word dependency relations are captured through BiLSTM bidirectional modeling, and then all extracted features F are obtained by splicing embedding mapping of a pre-training language model, CLS, character type features subjected to feature engineering processing and extracted fine granularity features subjected to BiLSTM.
In step S7, the comprehensive features F are input into the CRF decoding layer for forward propagation and the model parameters are updated to obtain a trained comprehensive model, which specifically includes the following steps:
splicing all the extracted features F with predicted entity tag information and real tag information respectively, inputting the predicted entity tag information and the real tag information into a CRF decoding layer for forward propagation, and finally obtaining total loss of backward propagation, wherein the loss function is as follows:
where m is the number of samples, and the remaining two symbols denote the prediction result and the labeled result of the i-th sample for the j-th class, respectively.
In the step S7, the CRF decoding layer specific method includes:
S7-1, the input vector sequence is X = {x_1, x_2, ..., x_n}, the tag sequence is Y = {y_1, y_2, ..., y_n}, and the output of BiLSTM is P = {p_1, p_2, ..., p_n}, where p_i is a vector of tag scores for the i-th position. The score function S(X, Y) of the CRF represents the score of a given input sequence X and tag sequence Y and consists of two parts, the emission score and the transition score; the score function is as follows:
where i runs from 1 to n, n is the sequence length, and T_{i,j} denotes the score of transitioning from tag i to tag j; special START and END tags need to be added for the start and end of the sequence.
S7-2, the normalization function Z(X) of the CRF represents the sum, over all possible tag sequences, of the exponentiated scores for a given input sequence X, as shown in the following equation:
Z(X) = ∑_{Y′} exp(S(X, Y′))   (eleven)
Wherein Y' represents all possible tag sequences.
S7-3, the negative log likelihood loss of the CRF is defined as follows:
in specific applications, the BiLSTM fine grain capture layer and the CRF decoding layer may be implemented by existing techniques.
When the method is specifically applied, the entity type and position label of each character in this embodiment can be obtained with the BIO labeling scheme: if a character is the first character of a named entity, it is labeled B; if the character is in any other position of a named entity, it is labeled I; and O indicates that the character has no entity meaning and is an ordinary character outside any named entity.
In a specific application, this embodiment uses dense commodity text as a named entity recognition example, specifically taking the following title as an example: ultra-long standby Bluetooth headset, wireless, for driving and sports, ear-hanging headset for men, suitable for oppo, Apple, vivo and Huawei mobile phones, M11 Chinese red (a pair of earbuds included), standard edition. Each character is labeled with the BIO labeling scheme as follows:
the notation of the character "super" is: b-11
The label of the character "long" is: i-11
The notation of the character "to wait" is: i-11
The notation of the character "machine" is: i-11
The notation of the character "blue" is: the notation of the B-4 character "tooth" is: the label of the character "ear" of I-4 is: the label of the I-4 character "machine" is: the "none" notation of the I-4 character is: the label of the B-13 character "line" is: the label of the "on" character of I-13 is: the label of the B-5 character "car" is: the label of the I-5 character "fortune" is: the "dynamic" notation of the B-5 character is: the label of the I-5 character "hanging" is: the notation of the "formula" of the B-13 character is: the label of the character "ear" of I-13 is: the label of the B-4 character "wheat" is: the notation of the character I-4 "Man" is: the label of the B-8 character "o" is: the notation of the character "p" of B-37 is: the notation of the character "p" for I-37 is: the notation of the I-37 character "o" is: the notation of the character I-37 "apple" is: the label of the character "fruit" of B-37 is: the notation of the character "v" for I-37 is: the notation of the character "i" of B-37 is: i-37
The notation of the character "v" is: i-37
The notation of the character "o" is: i-37
The label of the character "hua" is: b-37
The notation of the character "yes" is: i-37
The notation of the character "hand" is: b-40
The notation of the character "machine" is: i-40
The label of the character "middle" is: b-16
The notation of the character "country" is: i-16
The label of the character "red" is: i-16
The label of the character "ear" is: b-4
The notation of the character "machine" is: i-4
The label of the character "label" is: b-13
The notation of the character "quasi" is: i-13
The notation of the character "version" is: i-13.
Taking the labeling result B-11 of "super" as an example, the 11 in B-11 denotes the entity class, and the B indicates that the character "super" is the first character of the class-11 entity "super-long standby".
A large number of BIO marked characters are used as data, the data are processed, the data are divided according to rows, the text and marked information in the data are analyzed, and the data are assembled into a data list, so that the subsequent operations such as word segmentation and encoding are facilitated. Meanwhile, entity type information in the data is extracted and stored in a set for subsequent use.
After further processing, the data take the form: 'label': [{'start_idx': ..., 'end_idx': ..., 'type': ..., 'entity': ...}, ..., {'start_idx': 51, 'end_idx': 53, 'type': '13', 'entity': 'standard edition'}].
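For reference, the conversion from character-level BIO labels to the span format shown above can be sketched as follows; the dictionary keys mirror the example ('start_idx', 'end_idx', 'type' and the entity text), and end_idx is treated as inclusive, both being assumptions about the exact format.

```python
def bio_to_spans(text, labels):
    """Convert character-level BIO tags into {'start_idx', 'end_idx', 'type', 'entity'} dicts."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(labels):
        if tag.startswith("B-"):
            if start is not None:                      # close the previous entity
                spans.append({"start_idx": start, "end_idx": i - 1,
                              "type": etype, "entity": text[start:i]})
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                                   # still inside the current entity
        else:                                          # "O" or an inconsistent tag
            if start is not None:
                spans.append({"start_idx": start, "end_idx": i - 1,
                              "type": etype, "entity": text[start:i]})
            start, etype = None, None
    if start is not None:                              # entity that runs to the end of the text
        spans.append({"start_idx": start, "end_idx": len(text) - 1,
                      "type": etype, "entity": text[start:]})
    return spans


# bio_to_spans("超长待机", ["B-11", "I-11", "I-11", "I-11"])
# -> [{'start_idx': 0, 'end_idx': 3, 'type': '11', 'entity': '超长待机'}]
```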
The training and evaluation data are split into a training set and a testing set, and then data enhancement is carried out.
The training set data is fed into a pre-trained UER-Large model, then an embedded vector is obtained, and relative position information is superimposed on token inputs at each position. Wherein the expression of the relative position is:
the expression for superimposing the relative information on the token input at each location is further:
cls is a semantic feature vector representing the entire text and x_i is the vector representing each character; the sentence-level text feature cls is then spliced onto each character x_i to obtain a more robust character representation s_i.
The text representation S of each sentence collected by the feature extraction layer is then fed into the BiLSTM layer; the character vectors, the text classification feature cls and the character-type features produced by feature engineering are spliced to obtain the comprehensive features F; all extracted features F are spliced with the predicted entity tag information and the real tag information respectively and input into the CRF decoding layer for forward propagation, finally yielding the total loss for backward propagation, with the following loss function:
further, the model is optimized through continuous training and parameter updating. After training, the optimal model is stored.
In this embodiment, finer-grained feature extraction and the capture of contextual dependencies can be realized with an existing bidirectional long short-term memory network (BiLSTM), which is simple and convenient to implement.
In specific application, take input sentence as an example: the mobile phone eating chicken magic device and flat elite keyboard mouse automatic pressing gun apple ipad android flat cf hand swimming queen peripheral auxiliary suit converter wrapist game handle, and the label result of each character after model prediction can be expressed as: hand B-40, machine I-40, eating B-4, chicken I-4, god I-4, machine I-4, and B-5, flat I-5, fine I-5, english I-5, key B-4, disk I-4, mouse B-4, mark I-4, self O, dynamic O, pressure O, gun O, apple B-40, fruit I-40, I I-40, p I-40, a I-40, d I-40, amp B-47, tall I-47, flat B-40, board I-40, c B-5, f I-5, hand I-5, game I-5, king O, seat O, outer B-4, setting I-4, auxiliary B-11, auxiliary I-11, sleeve B-13, dress I-13, transfer B-4, change I-4, machine I-4, pastoral O, horse O, human O, game B-4, game I-4, hand I-4, handle I-4.
The application adopts precision (P), recall (R) and the F1 value as the performance evaluation criteria for the named entity recognition model of dense entity text. The experimental results comparing the model of the application with existing named entity recognition models are shown in Table 1.
Table 1, comparison of experimental results of different deep learning models on the Resume NER dataset.
From the comparison on the Resume NER dataset, the model of the present application is superior to the other baseline models in Recall and F1 score, while its Precision is slightly lower than that of BERT-IDCNN-CRF. The learned information may be used to enhance the recognition ability of the model. The superior precision of BERT-IDCNN-CRF benefits from dilated convolution, which adds a dilation width to the classical convolutional neural network and skips data between dilation steps when the convolution kernels compute features, so that a wider input matrix is covered and the receptive field is increased for the same kernel size. The experimental results show that integrating global and local context information at different scales together with label semantic information helps to alleviate the problems caused by high entity density and fine type granularity and improves the recognition precision for dense entities.
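The entity-level precision, recall and F1 used in these comparisons can be computed by matching predicted and gold (start, end, type) spans, as in the generic sketch below; this is not the evaluation script actually used for the tables.

```python
def entity_prf(gold_spans, pred_spans):
    """gold_spans / pred_spans: one list of (start, end, type) tuples per sentence."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)                 # exact match of span boundaries and type
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```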
Pre-trained language models based on representation learning can improve entity recognition in various respects, as shown in Table 1. Several popular pre-trained language models were selected for comparison on the crawled commodity dataset containing dense entities, with the expectation of obtaining strong contextual representations; the results are shown in Table 2.
Table 2, performance comparisons of pre-trained language models.
All of these models are based on BERT, which is good at capturing context information and can achieve significant performance. It can be seen that UER-Large is better than the other models in the overall evaluation, although the gap is not large. In particular, its Recall score is inferior to that of W2NER, probably because W2NER models entity boundaries and entity-word relationships and can fully consider the relationships between entity boundaries and internal words. ERNIE 2.0 incorporates more syntactic information into the self-encoding of the model in the form of multi-task learning, but it is less suitable for the present scenario.
In order to analyze the contribution of the convolutional neural network and the label semantic network to the effectiveness of product title entity recognition, four models were established and an ablation experiment was constructed on the crawled commodity dataset containing dense entities. Table 3 gives the results of the ablation experiment; the models were constructed as described below, and the contribution of each module to the overall effect of the model was analyzed.
Table 3 ablation experiments
w/o BL means the BiLSTM bidirectional long short-term memory network is removed; w/o FE means the feature engineering module in the label semantic network is removed, i.e. the influence of Chinese text statistical features on item title recognition is ignored; w/o AL means the adversarial learning module is removed from the label semantic network and all obtained features are fed directly to the CRF decoding layer for recognition; w/o OS means that optimization strategies such as the hierarchical learning rate, EMA and R-drop are not used during incremental training. Advanced word embeddings are obtained through the UER-Large model, the cls global text classification information is superimposed, and the result is then input directly into the label semantic network layer.
From the experimental results, the model of the present application performs best, indicating that each component contributes to the improved recognition effect. Adopting multiple optimization strategies during incremental training improves the performance and generalization capability of the whole framework. The performance gain brought by the label semantic network also comes from the cooperation of all its modules: different kinds of entity label information are learned separately, and the real label information is fed to the decoding layer for training, maximizing performance. The BiLSTM and feature engineering modules also make a significant contribution to the effectiveness of entity recognition.
Text information is adequately captured at multiple levels and from multiple angles using the UER-Large and BiLSTM models; label semantic information is incorporated into the feature representation using the label semantic network module; finally, the labeling result is output by CRF decoding. Compared with the traditional named entity recognition model, the precision, recall and F1 value of the model are improved by 6.32%, 5.64% and 5.98% respectively, which shows that the model can effectively alleviate the problems caused by high entity density and fine type granularity. It also provides a new way to recognize named entities with similar data characteristics in other fields.
The above detailed description of the present application is merely illustrative or explanatory of the principles of the application and is not necessarily intended to limit the application. Accordingly, any modification, equivalent replacement, improvement, etc. made without departing from the spirit and scope of the present application should be included in the scope of the present application.

Claims (8)

1. The named entity recognition model of the dense entity text and the training method thereof are characterized in that the named entity recognition model comprises a pre-training language model, a BiLSTM feature extraction layer, an entity tag semantic network layer and a CRF decoding layer; the training method of the named entity recognition model comprises the following steps:
S1, inputting unlabeled data into a pre-training language model and performing incremental training with a hierarchical learning rate optimization strategy to obtain a trained pre-training language model PLM, wherein the hierarchical learning rate optimization strategy comprises: during incremental training of the pre-training language model, treating layers closer to the output as more specialized for the incremental training task; when fine-tuning downstream tasks, learning the new task with a larger learning rate for layers closer to the output and using a smaller learning rate for layers closer to the input to retain more general knowledge; and performing incremental training on the different attention layers of the encoder of the pre-training language model with the learning rate increasing from the input-side layers to the output-side layers,
the step of incremental training by adopting the hierarchical learning rate optimization strategy comprises the following steps:
s1-1, adopting a dynamic MASK strategy: generating a new mask text randomly in each iteration, and enhancing the generalization capability of the model;
s1-2, mask probability is set to be 0.5, so that training difficulty is increased;
S1-3, adopting an N-gram mask strategy: tokens are selected according to the mask probability, and the selected positions are masked as 1-gram, 2-gram and 3-gram fragments with probabilities of 70%, 20% and 10% respectively, to increase training difficulty; the probabilities with which a selected token is replaced by [MASK], replaced by a random word, or kept unchanged are consistent with the original BERT;
s1-4, using R-drop to improve the stability of model output;
s1-5, using EMA to average the weights of the model at different times in a training stage, so that the model weights are smoother, and the performance and generalization capability are improved;
s2, dividing the data with the labels for training into a training set and a test set, and adopting data enhancement to expand the training set;
s3, encoding each character by the data in the training set through PLM to obtain high-dimensional semantic representation of the word;
S4, performing adversarial learning during training by adding small perturbations to the embedding layer using PGD, so as to improve the generalization capability of the model;
s5, splicing sentence-level text features cls for each character by adopting a method of mixing multi-granularity features to obtain more robust character representation;
s6, providing finer granularity sequence modeling by using BiLSTM, and obtaining comprehensive characteristics F;
s7, inputting the comprehensive characteristics F into a CRF decoding layer for forward propagation, and updating model parameters to obtain a trained comprehensive model.
2. The method for training a named entity recognition model of dense entity text according to claim 1, wherein in step S2, the method for extending the training set by data enhancement is to extract and splice entities of the same entity type in the training set into sentences and add the sentences into the training set, so that the model learns classification information in the same entity type.
3. The named entity recognition model of dense entity text and the training method thereof according to claim 1, wherein the method of encoding each character through the pre-training language model in step S3 comprises the steps of:
S3-1, the original multi-head attention is implemented based on scaled dot-product attention: the input sequence x = (x_1, x_2, ..., x_n) is multiplied by different weight matrices W^Q, W^K and W^V to obtain Q, K and V respectively, and the output sequence (z_1, z_2, ..., z_n) has the same length as the input sequence. The output z_i is calculated as follows:
where α_ij is the attention weight between the hidden states at positions i and j, obtained after softmax;
S3-2, relative position encoding is adopted: the interdependence between hidden states, i.e. the outputs and attention scores, is calculated with a sinusoidal function of the relative position, so that relative position information between the two positions is obtained at each layer, as shown in the following formula:
where i and j are index positions, and their difference takes the place of the absolute position index;
S3-3, the relative position information is superimposed on the token input at each position, as shown in the following formula:
S3-4, given a dense entity text X = (x_1, x_2, ..., x_n), each character x_i is encoded by the pre-training language model to obtain a high-dimensional semantic representation e_i, as shown below:
where the first symbol denotes the embedding mapping of the pre-training language model, d_1 is the word vector dimension, and v is the vocabulary size.
4. The named entity recognition model of dense entity text and the training method thereof according to claim 1, wherein the method in step S4 of performing adversarial learning by adding small perturbations to the embedding layer in the gradient direction using PGD comprises the following steps: S4-1, adversarial training is formulated as a min-max optimization problem, i.e. inner-layer maximization and outer-layer minimization, with the following formula:
where L denotes the loss function, f_θ denotes the model, x_i denotes an original data sample, y_i denotes its label, x'_i denotes the adversarial sample, and ε denotes the constraint on the added noise;
S4-2, the formula for the adversarial sample is as follows:
where α is a relatively small step size, and the gradient is then normalized with a sign function.
5. The method for training the named entity recognition model of dense entity text according to claim 1, wherein the method of mixing multi-granularity features in step S5 comprises: for each character x_i, the sentence-level text feature cls is spliced on to obtain a more robust character representation s_i, and S = (s_1, s_2, ..., s_i, ..., s_n) denotes the corresponding vector representation of each sentence, as shown below:
6. The named entity recognition model of dense entity text and the training method thereof according to claim 1, wherein in step S6, more comprehensive context information and dependency relationships between words are captured through bidirectional modeling by the BiLSTM, and all extracted features F are then obtained by splicing the embedding mapping of the pre-trained language model, the CLS feature, the character-type features processed through feature engineering, and the fine-grained features extracted by BiLSTM.
7. The named entity recognition model of dense entity text and the training method thereof according to claim 1, wherein in step S7, all extracted features F are spliced with predicted entity tag information and real tag information, respectively, and then input into a CRF decoding layer for forward propagation, and finally, total loss of backward propagation is obtained, and a loss function is as follows:
where m is the number of samples, and the remaining two symbols denote the prediction result and the labeled result of the i-th sample for the j-th class, respectively.
8. The named entity recognition model of dense entity text and training method thereof according to claim 1, wherein the CRF decoding layer comprises:
S7-1, the input vector sequence is X = {x_1, x_2, ..., x_n}, the tag sequence is Y = {y_1, y_2, ..., y_n}, and the output of BiLSTM is P = {p_1, p_2, ..., p_n}, where p_i is a vector of tag scores for the i-th position. The score function S(X, Y) of the CRF represents the score of a given input sequence X and tag sequence Y and consists of two parts, the emission score and the transition score; the score function is as follows:
where i runs from 1 to n, n is the sequence length, and T_{i,j} denotes the score of transitioning from tag i to tag j; special START and END tags need to be added for the start and end of the sequence;
S7-2, the normalization function Z(X) of the CRF represents the sum, over all possible tag sequences, of the exponentiated scores for a given input sequence X, as shown in the following equation:
Z(X) = ∑_{Y′} exp(S(X, Y′))   (eleven)
Wherein Y' represents all possible tag sequences;
s7-3, the negative log likelihood loss of the CRF is defined as follows:
CN202311095973.3A 2023-08-29 2023-08-29 Named entity recognition model of dense entity text and training method thereof Pending CN117236338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311095973.3A CN117236338A (en) 2023-08-29 2023-08-29 Named entity recognition model of dense entity text and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311095973.3A CN117236338A (en) 2023-08-29 2023-08-29 Named entity recognition model of dense entity text and training method thereof

Publications (1)

Publication Number Publication Date
CN117236338A true CN117236338A (en) 2023-12-15

Family

ID=89081662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311095973.3A Pending CN117236338A (en) 2023-08-29 2023-08-29 Named entity recognition model of dense entity text and training method thereof

Country Status (1)

Country Link
CN (1) CN117236338A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104526A (en) * 2019-11-21 2020-05-05 新华智云科技有限公司 Financial label extraction method and system based on keyword semantics
CN111310471A (en) * 2020-01-19 2020-06-19 陕西师范大学 Travel named entity identification method based on BBLC model
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN112883732A (en) * 2020-11-26 2021-06-01 中国电子科技网络信息安全有限公司 Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN114912423A (en) * 2022-03-24 2022-08-16 燕山大学 Method and device for analyzing aspect level emotion based on transfer learning
CN115238690A (en) * 2021-11-26 2022-10-25 一拓通信集团股份有限公司 Military field composite named entity identification method based on BERT
CN115422939A (en) * 2022-10-14 2022-12-02 重庆邮电大学 Fine-grained commodity named entity identification method based on big data
US20230267273A1 (en) * 2022-02-22 2023-08-24 TAO Automation Services Private Limited Machine learning methods and systems for extracting entities from semi-structured enterprise documents


Similar Documents

Publication Publication Date Title
Abdullah et al. SEDAT: sentiment and emotion detection in Arabic text using CNN-LSTM deep learning
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN111581401B (en) Local citation recommendation system and method based on depth correlation matching
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110472003B (en) Social network text emotion fine-grained classification method based on graph convolution network
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN111046179B (en) Text classification method for open network question in specific field
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN111325029A (en) Text similarity calculation method based on deep learning integration model
Mohamad Nezami et al. Face-cap: Image captioning using facial expression analysis
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN111858940A (en) Multi-head attention-based legal case similarity calculation method and system
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
Sifa et al. Towards contradiction detection in german: a translation-driven approach
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN113434698B (en) Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115455144A (en) Data enhancement method of completion type space filling type for small sample intention recognition
CN117236338A (en) Named entity recognition model of dense entity text and training method thereof
CN115169429A (en) Lightweight aspect-level text emotion analysis method
Dixit et al. A customizable framework for multimodal emotion recognition using ensemble of deep neural network models
Kim Research on Text Classification Based on Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination