CN116522945A - Model and method for identifying named entities in the food safety field - Google Patents
Model and method for identifying named entities in the food safety field
- Publication number: CN116522945A
- Application number: CN202310613994.3A
- Authority: CN (China)
- Prior art keywords: model, layer, training, food safety, attention
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295 — Named entity recognition (under G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
- G06F16/35 — Clustering; Classification (under G06F16/30 Information retrieval of unstructured textual data)
- G06F16/367 — Ontology (under G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
- G06F40/30 — Semantic analysis
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/094 — Adversarial learning
Abstract
The invention relates to a model and a method for identifying named entities in the field of food safety. The identification model of named entities in the field of food safety comprises: an embedding layer, a feature extraction layer, an attention layer, an entity classification layer and adversarial training. The embedding layer maps natural text into a machine-readable representation. The feature extraction layer uses BiLSTM as a sequence encoder for feature extraction. The attention layer takes h as input and assigns larger weights to the features with the greatest influence on entity classification. The entity classification layer uses the standard Viterbi algorithm to obtain the globally optimal label sequence in the final decoding stage. Adversarial training strengthens the regularization of the classification model. In the model and method for identifying named entities in the food safety field, adversarial training is added as a regularization method to mitigate the influence of noise on the model, and an attention mechanism is added to capture the features that significantly affect entity classification, improving the accuracy of entity classification.
Description
Technical Field
The invention belongs to the technical field of food safety, and particularly relates to a model and a method for identifying named entities in the field of food safety.
Background
NER technology identifies specified named entities in text, such as person names, place names, organization names and proper nouns. Named entities fall into general classes (e.g., person names, place names) and domain-specific classes (e.g., proteins, drugs, diseases). In the general domain there are many public datasets for named entity recognition tasks at home and abroad, and the related technology is mature. However, named entity recognition in the food safety field is still at an early research stage, and no public dataset is yet available. With the rapid development of the internet age, people have become accustomed to expressing their views through social media and e-commerce platforms. The internet holds vast amounts of information, and the non-standard short texts generated by user reviews contain a large amount of useful, still underexplored information, some of which helps manage food safety risks. By using named entity recognition to identify preset entity categories in this food-safety-related key information, a food safety risk supervision knowledge graph can be constructed to assist intelligent, big-data supervision of food safety. However, in contrast to standard, well-formed text (e.g., news articles, case judgments), named entity recognition on the non-standard short texts of internet user comments faces the following major challenges:
(1) The corpus is small in size and has a plurality of entity types.
(2) Users comment casually, so the text often contains neologisms and errors, as well as noise such as internet slang and emoticons.
(3) Text does not follow strict grammatical rules.
These problems make named entity recognition even more difficult in the food safety field, where corpora are severely lacking. As a result, many named entity recognition models that achieve good results in the general domain cannot be applied directly in the food safety field.
To solve these problems, the invention provides a model and a method for identifying named entities in the food safety field, based on an ERNIE-BiLSTM-CRF adversarial training and attention mechanism model.
Disclosure of Invention
The invention aims to provide a model and a method for identifying named entities in the field of food safety: word embeddings with contextual semantic information, better suited to the characteristics of non-standard text data, are trained with ERNIE; adversarial training is added to model training as a regularization method to mitigate the influence of noise on the model; and an attention mechanism is added to the BiLSTM-CRF model to capture the features that significantly affect entity classification and improve the accuracy of entity classification.
In order to achieve the above purpose, the technical scheme adopted is as follows:
an identification model of named entities in the field of food safety, comprising: an embedding layer, a feature extraction layer, an attention layer, an entity classification layer and adversarial training;
the embedding layer: mapping the natural text into a machine-readable representation;
the feature extraction layer: using BiLSTM as the sequence encoder of the NER tagger, feeding the word vector sequence into the BiLSTM network for feature extraction;
the attention layer: feeding h into the attention layer and, via the attention mechanism, assigning larger weights to the features with the greatest influence on entity classification;
the entity classification layer: obtaining the globally optimal label sequence in the final decoding stage with the standard Viterbi algorithm;
the adversarial training: used to strengthen the regularization of the classification model.
Further, the embedding layer: the Enhanced Representation through Knowledge Integration (ERNIE) model proposed by Baidu is used as the word embedding model, and a multi-layer Transformer is used as the primary encoder.
Still further, the embedding layer: rich text features are extracted with the pre-trained ERNIE model to obtain an output tensor of shape batch_size × max_seq_len × emb_size, used as the sequence representation for the classification task; here batch_size is the batch size of the processed data, max_seq_len is the maximum length of the input sentence, and emb_size is the embedding dimension of each token.
Further, the feature extraction layer: the word vector sequence from ERNIE is encoded with a BiLSTM network, with a forward LSTM network producing the forward hidden states and a backward LSTM network producing the backward hidden states; the corresponding values of the forward and backward hidden states produced by the BiLSTM are summed to obtain h, and the output of the BiLSTM hidden layer is expressed as h = {h_1, h_2, …, h_t}, where h_i is the hidden state representation corresponding to the i-th token and x = {x_1, x_2, …, x_t} is the word vector sequence.
Further, in the attention layer: the matrix H is composed of the hidden state vectors h output by the BiLSTM network; let w be the matrix parameter to be trained and w^T the transpose of w, satisfying the following equations:
M = relu(H), α = softmax(w^T M), r = Hα^T,
where d_w is the dimension of the word vectors in the sentence and α is the attention weight coefficient;
the outputs h of the BiLSTM are weighted and summed to obtain r; the dimensions of w, α and r are d_w, T and d_w respectively. After passing through the attention layer, the sentence representation vector containing the most critical information is obtained as:
h* = relu(r).
Further, in the entity classification layer: the tag sequence with the highest prediction score is output as the best answer:
y* = argmax_y P(y|x, w),
where x = {x_1, x_2, …, x_t} is the vector sequence, w is the parameter vector, and y = {y_0, y_1, …, y_{n+1}} is the tag sequence.
Further, in the adversarial training: an adversarial perturbation Δ_x is added to the word embeddings to adversarially train the NER model, where Δ_x is defined as:
Δ_x = ε·(g/||g||_2),
where g = ∇_x L(x, y; θ) is the gradient with respect to the input sample and ||g||_2 is the L2 norm of g;
Δ_x is computed from the gradient of the word embedding matrix and added to the current word embedding, i.e.:
x_adv = x + Δ_x.
Still further, in the adversarial training: adversarial training is formulated as the following min-max optimization:
min_θ E_{(x,y)~D} [ max_{Δ_x∈Ω} L(x + Δ_x, y; θ) ],
where D, x and y denote the training set, the input sample and the sample label respectively; θ, L(x, y; θ) and Δ_x denote the model parameters, the loss of a single sample and the adversarial perturbation superimposed on the input respectively; Ω is the perturbation space and max(L) is the optimization objective; the outer expectation E_{(x,y)} is used to optimize the model parameters θ of the neural network so as to minimize the loss.
The invention also aims to provide a method for identifying the named entity in the field of food safety, which adopts the identification model and has high accuracy.
In order to achieve the above purpose, the technical scheme adopted is as follows:
a method for identifying named entities in the field of food safety comprises the following steps:
(1) Acquiring food safety corpus;
(2) After manual labeling, converting into a BIO format;
(3) And (3) inputting the data processed in the step (2) into the recognition model for recognition.
Further, in step (2), YEDDA is selected for manual labeling;
in the identification model of step (3), the LSTM is set as a bidirectional network with a hidden layer size of 768, 1 layer and a dropout rate of 0.5; its learning rate is 2×10^-5, and the learning rate of the entity classification layer is 2×10^-2.
Compared with the prior art, the invention has the following beneficial effects:
The invention constructs a dataset for the food safety domain named entity recognition task from food-safety-related web comment text, and proposes a food safety domain named entity recognition model based on an ERNIE-BiLSTM-CRF adversarial training and attention mechanism.
The pre-trained ERNIE model is used to train on non-standard Chinese text and generate context-dependent word vectors that retain rich implicit information; self-attention is added to the BiLSTM network so that the model automatically focuses on the features that have a decisive influence on entity classification while sequence-encoding them; finally, the CRF layer performs sequence decoding to obtain the optimal label for each token. Experimental results show that the model achieves SOTA performance on the public Weibo NER dataset with an F1 value of 72.64%, and also performs well on the self-built domain dataset Food with an F1 value of 69.68%. In addition, ablation experiments examine the performance of different pre-trained models and adversarial training methods. These experiments demonstrate that the proposed model can extract specific entity information from large amounts of non-standard web text, which greatly helps named entity recognition in corpus-scarce areas such as food safety. Specified food-safety-related entity information can be extracted from web text to construct a food safety supervision knowledge graph and assist big-data food safety supervision.
Drawings
FIG. 1 is the named entity recognition model based on adversarial training and the attention mechanism;
FIG. 2 is a comparison of the mask learning methods of the BERT model and the ERNIE model;
FIG. 3 is the ERNIE embedding layer;
FIG. 4 is the corpus labeling workflow;
FIG. 5 is the YEDDA data annotation interface;
FIG. 6 shows the labeling results;
FIG. 7 is the BIO format conversion;
FIG. 8 is a Food dataset model comparison;
FIG. 9 is a comparison of adding adversarial training and the attention mechanism;
FIG. 10 is a comparison of named entity recognition performance of different pre-trained language models;
FIG. 11 is a comparison of named entity recognition model performance for different adversarial training methods.
Detailed Description
To further describe the model and method for identifying named entities in the food safety field of the present invention, the following describes their specific implementation, structure, characteristics and efficacy in combination with the preferred embodiments. In the following description, different references to "an embodiment" or "one embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures or characteristics of one or more embodiments may be combined in any suitable manner.
The following describes in further detail a model and method for identifying named entities in the field of food safety according to the present invention, in conjunction with specific embodiments:
food safety is closely related to human health. Therefore, the named entity recognition technology is used for extracting the named entity related to the food safety, and the supervision knowledge graph in the food safety field can be constructed to help relevant departments to supervise food safety problems and reduce the harm caused by the food safety problems. However, there are few named entity recognition studies currently in the field of food safety, and no publicly available named entity recognition datasets are available.
With the rapid development of the information age, massive network data exist in the Internet, and users can help the specific field (such as the food safety field) with scarce corpus to identify named entities by containing rich hidden information in non-standardized Chinese short texts generated by various social media or online shopping platform comments. Therefore, the invention gathers network data related to food safety, constructs a food safety named entity identification data set for model training through manual annotation, and extracts specific named entities from food safety related corpus.
However, existing methods of identifying chinese named entities are mainly directed to standardized text corpora like news stories. Meanwhile, the non-standardized corpus has the following problems: (1) small corpus size; (2) there are various new mistakes and words and noise; (3) does not follow strict grammar rules. These problems make recognition of named entities in web text more challenging. The present invention therefore proposes an ERNIE-BiLSTM-CRF based challenge-attention model to improve the recognition of food safety domain entities in non-standardized text. Specifically, the resistance training is added in the model training as a regularization method to relieve the influence of noise on the model, and the attention mechanism is added in the BiLSTM-CRF model to capture the characteristics which have obvious influence on the entity classification, so that the accuracy of the entity classification is improved.
The embodiment of the invention performs experiments on a self-built Food field data set Food and a public data set Weibo NER which is also marked by web text corpus. Experimental results show that SOTA performance of the model on the self-built data set and the public data set is 69.68% and 72.64%, respectively. The validity, rationality and reliability of the model presented herein was verified. The research of the invention has practical significance in the food safety field and also provides an demonstration for the named entity recognition task of other fields with the same corpus resource deficiency. Specific examples are as follows:
example 1.
The specific operation steps are as follows:
A. Named entity recognition model based on adversarial training and attention mechanism
The food safety risk named entity recognition model based on adversarial training and the attention mechanism is shown in fig. 1:
(1) An embedding layer:
In natural language processing tasks, natural text must be mapped into a machine-readable representation before it can be used for model training. A word embedding model therefore maps each token of natural language into a high-dimensional vector space, so that tokens with similar semantics lie closer together and tokens with dissimilar semantics lie farther apart. This allows the model to better understand the relationships between tokens. Word embedding models inside pre-trained language models generate better character representations that not only contain rich syntactic and semantic information but also allow modeling of ambiguous words. In the named entity recognition model proposed by the invention, the Enhanced Representation through Knowledge Integration (ERNIE) model proposed by Baidu is used as the word embedding model.
Because BERT's modeling focuses mainly on the raw language signal and makes little use of semantic knowledge units, the problem is especially evident on Chinese corpora. For example, BERT models Chinese by predicting individual characters, which makes it difficult for the model to learn complete semantic representations of larger semantic units. ERNIE improves on BERT's handling of Chinese, enabling the model to learn the hidden knowledge implicit in large amounts of text. As shown in FIG. 2, ERNIE lets the model learn the semantic representation of complete concepts by masking semantic units such as words and entities. In contrast to BERT, ERNIE enhances the model's semantic representation capability by directly modeling prior semantic knowledge units.
Like BERT, ERNIE uses a multi-layer Transformer as the primary encoder. In ERNIE, a sentence s = {w_1, w_2, …, w_t} containing t tokens is concatenated with the special identifiers CLS and SEP, where CLS marks the beginning of the sentence and SEP marks its end. As shown in FIG. 3, each character in the sentence is represented by the ERNIE pre-trained language model as the sum of three vectors: a token embedding, a segment embedding and a position embedding. The vectorized sentence is then fed into the bidirectional Transformer for feature extraction. The Transformer's self-attention mechanism captures the context of each token in the sentence and generates a sequence of vectors x = {x_1, x_2, …, x_t} containing rich semantic features. In other words, the pre-trained ERNIE model extracts rich text features to obtain an output tensor of shape batch_size × max_seq_len × emb_size, which is used as the sequence representation for the classification task. Here batch_size is the batch size of the processed data, max_seq_len is the maximum length of the input sentence, and emb_size is the embedding dimension of each token.
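As a concrete illustration, the sketch below obtains this batch_size × max_seq_len × emb_size tensor from a pre-trained ERNIE checkpoint through the Hugging Face transformers API. The checkpoint name "nghuyong/ernie-1.0-base-zh" (a community PyTorch port of Baidu's ERNIE 1.0) and the example comments are assumptions for illustration, not part of the patent.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed community port of Baidu's ERNIE 1.0; the patent only names ERNIE 1.0.
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh")
encoder = AutoModel.from_pretrained("nghuyong/ernie-1.0-base-zh")

# Hypothetical food-safety comments, purely illustrative
sentences = ["这家店卖的苹果发霉了", "吃完之后肚子一直疼"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    out = encoder(**batch)

# Shape (batch_size, max_seq_len, emb_size) = (2, L, 768), matching the text
x = out.last_hidden_state
print(x.shape)
```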
(2) Feature extraction layer
A Bidirectional Long Short-Term Memory (BiLSTM) network is an improvement of the RNN, comprising two subnetworks: a forward LSTM and a backward LSTM. It has good sequence modeling capability, models long sequences well, largely resolves the vanishing and exploding gradient problems of traditional RNNs during training, and can process contextual information.
The invention uses BiLSTM as the sequence encoder of the NER tagger, feeding the word vector sequence x = {x_1, x_2, …, x_t} into the BiLSTM network for feature extraction and outputting, for each token, the probabilities of the label sequence y = {y_1, y_2, …, y_n}, where n is the number of labels. Specifically, the word vector sequence from ERNIE is encoded with a BiLSTM network: a forward LSTM obtains the forward hidden states (history features) and a backward LSTM obtains the backward hidden states (future features). The output of the BiLSTM hidden layer is expressed as:
h = {h_1, h_2, …, h_t}, with h_i = h_i^f + h_i^b (1)
where the corresponding values of the forward hidden state h_i^f and the backward hidden state h_i^b produced by the BiLSTM are summed to obtain h, and h_i is the hidden state representation of the i-th token.
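A minimal PyTorch sketch of this feature extraction layer follows; summing (rather than concatenating) the two directional hidden states mirrors the description above, and the 768-dimensional sizes match the defaults used later in the experiments.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, emb_size: int = 768, hidden_size: int = 768):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, emb_size) word vectors from ERNIE
        out, _ = self.lstm(x)            # (batch, seq_len, 2 * hidden_size)
        fwd, bwd = out.chunk(2, dim=-1)  # forward / backward hidden states
        return fwd + bwd                 # Eq. (1): h_i = h_i^f + h_i^b

h = BiLSTMEncoder()(torch.randn(2, 20, 768))
print(h.shape)  # torch.Size([2, 20, 768])
```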
(3) Attention layer
ERNIE uses a multi-layer Transformer as the primary encoder. Although the multi-head attention in the Transformer can extract features of the text itself from multiple angles and levels, the degree of "influence" of the output information the LSTM produces at each time step is the same, so features with greater influence on named entity recognition cannot be distinguished. To highlight the parts of the BiLSTM network's output that are most critical for entity classification, the invention adds an attention mechanism after the BiLSTM network: h is fed into the attention layer, and the attention mechanism assigns larger weights to the features that most influence entity classification, capturing the most critical semantic-level information in the sentence and automatically focusing on the features with a decisive influence on entity classification.
The matrix H is composed of the hidden state vectors h output by the BiLSTM network. Let w be the matrix parameter to be trained and w^T the transpose of w, satisfying the following equations:
M = relu(H) (2)
α = softmax(w^T M) (3)
r = Hα^T (4)
where d_w is the dimension of the word vectors in the sentence and α is the attention weight coefficient. The outputs h of the BiLSTM are weighted and summed to obtain r; the dimensions of w, α and r are d_w, T and d_w respectively. After passing through the attention layer, the sentence representation vector containing the most critical information is obtained as:
h* = relu(r) (5)
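A minimal PyTorch sketch of Eqs. (2)-(5), assuming H is the batch of BiLSTM outputs and w a learned vector; it produces the sentence-level representation h* described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenAttention(nn.Module):
    def __init__(self, d_w: int = 768):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_w) * 0.02)  # trainable parameter w

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, T, d_w) BiLSTM outputs
        M = F.relu(H)                             # Eq. (2): M = relu(H)
        alpha = F.softmax(M @ self.w, dim=-1)     # Eq. (3): α = softmax(w^T M)
        r = torch.einsum("btd,bt->bd", H, alpha)  # Eq. (4): r = Hα^T
        return F.relu(r)                          # Eq. (5): h* = relu(r)

h_star = TokenAttention()(torch.randn(2, 20, 768))
print(h_star.shape)  # torch.Size([2, 768])
```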
(4) Entity classification layer
The conditional random field (CRF) is a discriminative model well suited to sequence prediction tasks and widely used for sequence labeling problems. Although the BiLSTM-Attention network can process long-distance text information and obtain the features most critical for entity classification, it does not effectively handle the dependencies between adjacent tags. The CRF layer therefore serves as the sequence decoder of the NER tagger, and the globally optimal label sequence is obtained in the final decoding stage using the standard Viterbi algorithm.
All outputs of the BiLSTM-Attention network are fed into the CRF layer as a score matrix P. For a predicted tag sequence y = {y_0, y_1, …, y_{n+1}}, the score is defined as:
score(x, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i} (6)
where P is a matrix of size t × n, and P_{i,j} is the score of the j-th label for the i-th token of the input sentence's vector sequence x = {x_1, x_2, …, x_t}. A is the transition score matrix, with A_{i,j} the score of transitioning from tag i to tag j. The predicted labels y_0 and y_{n+1} represent the two symbols CLS and SEP that mark the beginning and end of a sentence, so A is a square matrix of size n+2.
The CRF layer uses potential functions to estimate, for the sequence x = {x_1, x_2, …, x_t}, the conditional probability distribution P(y|x, w) of the output tag sequence y = {y_0, y_1, …, y_{n+1}}:
P(y|x, w) = exp(w·F(y, x)) / Z(w, x) (7)
where F(y, x) is the feature vector and w is the parameter vector. Z(w, x) is the sum of the unnormalized conditional probabilities P(y|x, w) over all possible tag sequences y.
The model is trained using maximum conditional likelihood:
w* = argmax_w P(y|x, w) (8)
The CRF layer learns constraints from the training dataset, reduces invalid tags in the predicted sequence, and ensures that the finally output tag sequence is valid. During sequence decoding, the tag sequence with the highest prediction score is selected and output as the best answer:
y* = argmax_y P(y|x, w) (9)
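To make the decoding in Eq. (9) concrete, here is a minimal NumPy sketch of Viterbi decoding over emission scores P and transition scores A; the handling of the CLS/SEP start and end symbols in the (n+2)-sized transition matrix is omitted for brevity, and the random scores are purely illustrative.

```python
import numpy as np

def viterbi_decode(P: np.ndarray, A: np.ndarray) -> list[int]:
    """P: (T, n) emission scores; A: (n, n) transition scores A[i, j] = i -> j."""
    T, n = P.shape
    score = P[0].copy()                 # best path score ending in each tag at t=0
    back = np.zeros((T, n), dtype=int)  # backpointers
    for t in range(1, T):
        # cand[i, j]: best score of a path ending in tag i at t-1, then i -> j
        cand = score[:, None] + A + P[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # trace back the globally optimal tag sequence
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

P = np.random.randn(6, 5)  # 6 tokens, 5 tags (hypothetical sizes)
A = np.random.randn(5, 5)
print(viterbi_decode(P, A))
```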
(5) Adversarial training
In NLP tasks, adversarial training is no longer used to defend against gradient-based malicious attacks; instead, it is used more to strengthen the regularization of the classification model. Adversarial training can be formulated as the following min-max optimization:
min_θ E_{(x,y)~D} [ max_{Δ_x∈Ω} L(x + Δ_x, y; θ) ] (10)
In Eq. (10), the inner bracket is a maximization. D, x and y denote the training set, the input sample and the sample label respectively; θ, L(x, y; θ) and Δ_x denote the model parameters, the loss of a single sample and the adversarial perturbation superimposed on the input respectively; Ω is the perturbation space, which is usually kept small to avoid corrupting the original input samples. max(L) is the optimization objective, i.e., finding the perturbation that maximizes the loss of a single sample. At the same time, the outer expectation E_{(x,y)} is used to optimize the model parameters θ of the neural network so as to minimize the expected loss: when the perturbation Δ_x is fixed, the model's loss on the training data should be minimal.
In brief, once the adversarial perturbation Δ_x is added, the sample loss should grow as much as possible; conversely, the model loss should remain as small as possible, making the model more robust and preventing small perturbations of a sample from skewing the model's inference results.
The invention borrows the Fast Gradient Method (FGM) from text classification tasks, adversarially training the NER model by adding an adversarial perturbation Δ_x to the word embeddings, where Δ_x is defined as:
Δ_x = ε·(g/||g||_2) (11)
where g = ∇_x L(x, y; θ) is the gradient with respect to the input sample and ||g||_2 is the L2 norm of g.
Δ_x is computed from the gradient of the word embedding matrix and added to the current word embedding, i.e.:
x_adv = x + Δ_x (12)
In each step, the forward loss is computed, the adversarial gradient is obtained by backpropagation on the perturbed input, this adversarial gradient is accumulated onto the original gradient, the word embeddings are restored, and the parameters are then updated with the accumulated gradient.
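The per-batch procedure just described maps onto the widely used FGM training loop; the sketch below is one such implementation in PyTorch, where the embedding parameter name filter "word_embeddings" is an assumption about how the embedding matrix is named inside the encoder.

```python
import torch

class FGM:
    def __init__(self, model: torch.nn.Module, epsilon: float = 1.0):
        self.model, self.epsilon, self.backup = model, epsilon, {}

    def attack(self, emb_name: str = "word_embeddings") -> None:
        # Perturb only the embedding matrix, per Eqs. (11)-(12)
        for name, p in self.model.named_parameters():
            if p.requires_grad and emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)  # ||g||_2
                if norm != 0 and not torch.isnan(norm):
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self, emb_name: str = "word_embeddings") -> None:
        # Recover the original embeddings after the adversarial backward pass
        for name, p in self.model.named_parameters():
            if emb_name in name and name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# Per batch: loss.backward(); fgm.attack(); loss_adv = compute_loss(model, batch);
# loss_adv.backward(); fgm.restore(); optimizer.step(); optimizer.zero_grad()
```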
B. Food safety corpus acquisition and preprocessing
(1) Food safety corpus acquisition
Since the corpus in the food safety field is severely lacking and no named entity recognition dataset is currently publicly available, the invention uses the independently constructed food safety domain named entity recognition dataset Food as the dataset required for model training in its research experiments. The corpus of the Food dataset is derived from the sentiment/opinion/review-tendency analysis dataset online_shopping_10_cats of the Chinese natural language processing language resource project (https://github.com/liuhuanyong/ChineseNLPCorpus); part of the food safety corpus is shown in the table below:
according to the method, negative comments of consumers on foods (fruits) purchased on the XX E-commerce platform are screened out from the comments, named entities in the corpus are defined, manual label marking is carried out, and a named entity identification data set in the food safety field is constructed and used for model training.
The Food dataset contains 1914 pieces of comment information. Under the guidance of food safety related specialists, the invention manages the content according to the food safety risk: the consumer's assessment of a particular type of fruit sold by a store on an e-commerce platform, and the particular symptoms that the consumer presents after consumption, categorizes the entity of the dataset into five named entities, fruit type (FRU), food safety risk description (SIG) that may be present, sales e-commerce platform (ECP), sales store (MER), and symptoms description (SYM) that appear after consumption.
(2) Manual labeling
The invention selects YEDDA, a simple and efficient named entity annotation tool, for manual labeling. The labeling workflow is shown in fig. 4: the corpus to be labeled is loaded into YEDDA and the pre-defined named entities are annotated. YEDDA is an open-source text annotation tool for entity classes that provides sequence-label annotation functionality. It is a lightweight but efficient open-source tool for text span annotation. It overcomes the inefficiency of traditional text annotation tools by annotating entities through the command line and shortcut keys, with user-configurable entity labels. The advantages of this annotation tool are:
(1) Efficiency: it supports shortcuts and a command-line mode to speed up the annotation process.
(2) Intelligence: it provides real-time system suggestions to users, reducing repeated labeling.
(3) It is a client program, which lowers the difficulty of deploying the system.
Fig. 5 shows the YEDDA annotation interface; the comment corpus of fig. 4 is manually annotated sentence by sentence on the annotation page to obtain the food safety domain named entity recognition corpus required for the experiments of the invention.
The functions of the buttons on the right are roughly as follows: the "Open" key opens the file to be annotated; the "RMOn" key turns the automatic labeling function on; the "RMOff" key turns it off; the "Map" key sets the mapping relation, i.e., shortcut keys are mapped one-to-one to entity labels (for example, "A" is mapped to the entity label FRU); the "Export" key exports the currently annotated text; "Quit" exits the current annotation. "Cursor" indicates the current position of the annotation cursor in the text, and "RModel" indicates whether automatic annotation is enabled.
The text exported from YEDDA after the labeling work is completed is shown in FIG. 6, but such data cannot be fed directly into model training and requires further processing.
As shown in fig. 7, the annotated data is converted into BIO format, where a named entity becomes a span beginning with a "B-" tag and continued with "I-" tags, and all non-entity tokens are labeled "O" (Other).
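A minimal sketch of this conversion step, assuming the annotations have already been parsed into character-offset spans (the exact YEDDA export layout is tool-specific, so the span structure here is an assumption):

```python
def to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Convert (start, end, label) character spans, end exclusive, to BIO tags."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"              # entity start
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"              # entity continuation
    return list(zip(text, tags))

# Hypothetical example: "苹果发霉了" with "苹果" annotated as fruit type (FRU)
for char, tag in to_bio("苹果发霉了", [(0, 2, "FRU")]):
    print(char, tag)
```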
C. Analysis of experimental results
(1) Evaluation index setting
The invention evaluates model performance comprehensively with three metrics: precision (P), recall (R) and F1 value. They are defined as:
P_i = TP_i / (TP_i + FP_i) (13)
R_i = TP_i / (TP_i + FN_i) (14)
F1 = 2PR / (P + R) (15)
In the precision formula (13), TP_i is the number of positive instances the model predicts correctly and FP_i is the number of negative instances the model predicts as positive.
In the recall formula (14), TP_i is as above and FN_i is the number of positive instances the model predicts as negative.
Since precision and recall are a pair of conflicting metrics, to better evaluate the performance of the classifier the invention uses the F1 score, the harmonic mean of precision and recall, as the criterion for evaluating the overall performance of the model.
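A minimal sketch of these metrics at the entity level, assuming exact-match evaluation over (start, end, label) spans recovered from the BIO tags:

```python
def prf1(gold: set[tuple], pred: set[tuple]) -> tuple[float, float, float]:
    tp = len(gold & pred)                    # correctly predicted entities
    p = tp / len(pred) if pred else 0.0      # Eq. (13)
    r = tp / len(gold) if gold else 0.0      # Eq. (14)
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # Eq. (15)
    return p, r, f1

gold = {(0, 2, "FRU"), (3, 6, "SYM")}        # hypothetical gold spans
pred = {(0, 2, "FRU"), (4, 6, "SYM")}        # hypothetical predictions
print(prf1(gold, pred))                      # (0.5, 0.5, 0.5)
```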
(2) Experimental data
The dataset required for training the model of the invention is the independently constructed food safety domain named entity recognition dataset Food; the BIO label description of each entity type is shown in Table 1.
Table 1 food safety data set BIO tag description
Meanwhile, to make the experimental results fairer and more reliable, the invention also selects the public named entity recognition dataset Weibo NER, whose data characteristics are similar to those of the Food dataset, for a series of experiments verifying the effectiveness and rationality of the proposed model.
The Weibo NER dataset was built by filtering Sina Weibo historical data from November 2013 to December 2014; it contains 1,890 microblog messages annotated according to the LDC2014 DEFT ERE annotation standard. The entity categories in the dataset fall into four classes, each subdivided into specific entities (NAM, e.g., a person name such as Zhang San labeled "PER.NAM") and generic entities (NOM, e.g., NOM stands for "person"), as shown in Table 2.
Table 2 Weibo NER dataset tag description
The invention divides the training, validation and test sets required for model training according to the sentence counts of the two datasets, as shown in Table 3. The Food dataset is split in the ratio 7:1.5:1.5, while the Weibo NER dataset keeps its original split, which the invention does not modify.
Table 3 dataset partitioning statistics
(3) Experimental parameter setting
The experiments of the invention use ERNIE 1.0, Baidu's published knowledge-enhanced Chinese pre-trained language model, to train word vectors. This pre-trained model improves on Google's BERT for Chinese and can be handled identically to BERT in downstream tasks and model adaptations. Following the default configuration, the output vector size of each character is set to 768 dimensions, and ERNIE's dropout rate is 0.1.
In the model, the LSTM is set as a bidirectional network with a hidden layer size of 768, 1 layer and a dropout rate of 0.5. The initial learning rate is a key parameter that must be tuned to the target task. The invention uses AdamW as the optimizer for training both the pre-trained model and the named entity recognition model, with different learning rates for the two.
The optimal learning rates of ERNIE and BERT differ considerably, with ERNIE requiring a higher initial learning rate. For pre-trained model fine-tuning, the initial learning rate is 8×10^-5 on the Food dataset and 3×10^-5 on the Weibo NER dataset. For named entity recognition model training, the LSTM learning rate on both datasets is 2×10^-5, and the CRF learning rate is 2×10^-2.
Since the model's weights are randomly initialized at the start of training, the choice of learning rate can easily destabilize (oscillate) the model. To make the model more stable, the invention uses learning rate warm-up so that the model converges faster and better, with the number of warm-up steps set to 80.
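A minimal sketch of this optimizer configuration, assuming PyTorch's AdamW and the warm-up scheduler from Hugging Face transformers; the submodule names (encoder, lstm, crf) are assumptions about how the model is organized, not names given in the patent.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model: torch.nn.Module, num_training_steps: int):
    # Per-component learning rates from the experimental settings above
    groups = [
        {"params": model.encoder.parameters(), "lr": 8e-5},  # ERNIE fine-tuning (Food)
        {"params": model.lstm.parameters(), "lr": 2e-5},     # BiLSTM encoder
        {"params": model.crf.parameters(), "lr": 2e-2},      # CRF / classification layer
    ]
    optimizer = torch.optim.AdamW(groups)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=80, num_training_steps=num_training_steps)
    return optimizer, scheduler
```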
(4) Experimental analysis
The invention uses the improved adversarial-attention mechanism model based on ERNIE+BiLSTM+CRF to run experiments on the food safety domain named entity recognition dataset Food and on the public dataset Weibo NER.
On the Food dataset, the invention trains named entity recognition (fruit type FRU, possible food safety risk description SIG, sales e-commerce platform ECP, sales store MER and post-consumption symptom description SYM) with the ERNIE+Softmax model, the ERNIE+BiLSTM+CRF model, and the improved adversarial-attention mechanism model based on ERNIE+BiLSTM+CRF. The experimental results are shown in Table 4, which compares in detail the precision P, recall R and F1 values of the improved adversarial-attention model against the ERNIE+Softmax and ERNIE+BiLSTM+CRF models, both for overall named entity recognition and for the five classes of named entities.
Table 4 comparison of Food dataset experimental results
Building on Table 4, tabulating the precision, recall and F1 values of overall named entity recognition for the improved adversarial-attention mechanism model based on ERNIE+BiLSTM+CRF, the ERNIE+Softmax model and the ERNIE+BiLSTM+CRF model gives a more intuitive view of how the different models recognize the named entities in the Food dataset. As shown in FIG. 8, the overall precision, recall and F1 values of the improved adversarial-self-attention model based on ERNIE+BiLSTM+CRF are significantly higher than those of the comparison models. This proves that the ERNIE+BiLSTM+CRF model with the added adversarial-attention mechanism is the superior approach for extracting food-safety-related named entities from non-standardized text.
The present invention was also compared with some published SOTA models on the Weibo NER dataset. Among the SOTA models in the comparison:
(1) Locate and Label uses a two-stage entity identifier that locates entities and labels boundaries, enabling recognition of nested named entities.
(2) SA-NER adopts a semantic expansion mode to improve the recognition performance of named entities.
(3) AESINER improves named entity recognition capabilities of the model by introducing additional knowledge.
(4) The SLK-NER uses an attention mechanism to fuse lexical knowledge into a character-based model for named entity recognition.
(5) The FLAT does not use a pre-training model BERT, and makes full use of the Lattice information to identify the named entity.
(6) The BERT-LMCRF is a BERT-based model, and BiLSTM-CRF is used as a named entity recognition model encoder-decoder.
(7) FLAT+BERT utilizes the lattice information while also introducing a BERT pre-training model to train word vectors carrying contextual language information.
(8) FGN is a BERT-based fusion glyph network.
Chinese characters, as pictographs, contain latent glyph information, so Xuan et al. proposed the fusion glyph network FGN, which extracts interaction information between character distributed representations and glyph representations through a fusion mechanism. Li et al. proposed the FLAT model, which converts the lattice structure into a flat structure composed of spans; the model makes full use of lattice information and, building on the powerful performance of the Transformer and well-designed positional encodings, has good parallelization capability. The SLK-NER model proposed by Hu et al. uses global semantic information and integrates lexical knowledge through an attention mechanism, fusing more informative words into a character-based model and alleviating lexical boundary conflicts. In NER research, introducing additional knowledge is a common way to improve model performance; Nie et al. proposed the AESINER model, which uses attention ensembles to encode and fuse different types of syntactic information (e.g., lexical annotations, constituency syntax and dependency syntax) to help the model identify named entities. In social media such as Weibo and Twitter, much user-generated short text contains various types of entities, some of which are not written according to standard grammatical conventions (e.g., arbitrary user abbreviations), so these entities appear sparsely and are harder to identify. To address this problem, previous studies used domain information (e.g., gazetteers and embeddings trained on prominent social media text) and external features (e.g., lexical tags) to help improve NER performance on social media. However, these methods require extra work to obtain this information, and the resulting data contains noise; for example, embeddings trained on the social media domain may bring many idiosyncratic expressions into the vocabulary. Tian et al. therefore proposed the SA-NER model, which enhances named entity recognition through semantic expansion; however, this model's F1 value on the microblog dataset was only 69.8%.
The model proposed by the invention is compared with the models above; the experimental results are shown in Table 5, where each SOTA model's figures are the results reported in its original paper.
Table 5 Weibo NER dataset comparison of experimental results
As shown in Table 5, the proposed model outperforms the other SOTA models on the Weibo NER dataset. Its F1 value is 12.32%, 8.64%, 3.48% and 2.86% higher than the FLAT, SLK-NER, Locate and Label, and AESINER models respectively; compared with BERT-LMCRF, FLAT+BERT, SA-NER and FGN, the F1 value also improves markedly, by 5.52%, 4.09%, 2.84% and 1.39% respectively.
It follows that adding adversarial training and an attention mechanism to the ERNIE+BiLSTM+CRF model better extracts named entities from non-standard text. Adding adversarial training mitigates the influence of noise on NER performance for the two small-scale datasets Weibo NER and Food, and the attention mechanism gives greater weight to the features in the BiLSTM network's output that most influence entity classification, yielding better entity recognition. The comparison results of adding adversarial training and the attention mechanism are shown in fig. 9.
As can be seen from fig. 9, the adversarial-attention model based on ERNIE-BiLSTM-CRF improves the F1 value by 2.64%, 2.29% and 2.04% respectively over the three models above. The F1 value of the base ERNIE-BiLSTM-CRF model is 70.00%; after adding the attention mechanism, the F1 value is 70.35%, an increase of 0.35%. Thus, adding an attention mechanism to the BiLSTM-CRF model and giving greater weight to the features that affect entity classification improves the model's named entity recognition capability to a certain extent. Similarly, adding adversarial training to the ERNIE-BiLSTM-CRF model yields an F1 value of 70.60%, an increase of 0.60%, showing that adversarial training can mitigate the influence of noise on named entity recognition during model training. When both are added at the same time, the influence of noise on the model is mitigated while important features are captured, good entity labeling performance is obtained, and the model performance improves significantly, with an F1 value of 72.64%.
The experimental results show that adding the attention mechanism and adversarial training together effectively improves the performance of small-scale, noisy, non-standardized Chinese NER: it reduces the influence of noise and gives more weight to the features conducive to entity recognition, ultimately proving the effectiveness of the model on non-standardized Chinese NER.
In addition, the embodiment of the invention compares the influence of different pre-trained language models and different adversarial training methods on named entity recognition performance.
With the other settings of the model unchanged, FIG. 10 analyzes the named entity recognition performance of different pre-trained language models. BERT-base, published by Google (https://github.com/google-research/bert), and RoBERTa-wwm-ext, released by the Joint Laboratory of Harbin Institute of Technology and iFLYTEK Research (https://github.com/ymcui/Chinese-BERT-wwm), are Chinese pre-trained models. Note that RoBERTa-wwm-ext differs from the original RoBERTa model: it is a Chinese BERT model trained in a RoBERTa-like manner, i.e., a BERT that behaves like RoBERTa.
It can be seen that the BERT-base based NER model performs worst, with an F1 value of 68.88%, because Chinese in BERT-base is segmented at character granularity without considering the Chinese Word Segmentation (CWS) of traditional NLP tasks. In contrast, RoBERTa-wwm-ext is trained on Chinese Wikipedia (simplified and traditional) and applies the Whole Word Masking (WWM) technique to Chinese; it uses the LTP toolkit of Harbin Institute of Technology as the word segmenter so that all the Chinese characters composing the same word are masked together, rather than masking single characters as in BERT-base. Its F1 value is 70.69%, 1.81% higher than BERT-base. The ERNIE-based named entity recognition model achieves an F1 value of 72.64%, which is 3.76% and 1.95% higher than the BERT-base and RoBERTa-wwm-ext based models respectively. Because BERT-base and RoBERTa-wwm-ext are trained on Wikipedia data, they are better at modeling standard, well-formed text, whereas ERNIE additionally uses web data such as Baidu Baike, Baidu News and Baidu Tieba, giving it an advantage in modeling non-standardized text. The experimental datasets Food and Weibo NER of the invention consist of non-standardized text generated by user comments on social media and e-commerce platforms; for these dataset characteristics, the named entity recognition performance of BERT-base and RoBERTa-wwm-ext is therefore inferior to ERNIE.
FIG. 11 shows how different adversarial training methods behave in the named entity recognition model. The Fast Gradient Method (FGM), Projected Gradient Descent (PGD) and Free Large-Batch Adversarial Training (FreeLB) are three different adversarial training methods, i.e., three different ways of generating adversarial perturbations. The experimental results show that FGM's F1 value on the named entity recognition task is 72.64%, which is 1.78% and 1.14% higher than FreeLB and PGD respectively, outperforming the other two adversarial training approaches. In other words, adding the perturbation generated by FGM to the word embeddings achieves higher entity classification accuracy, and FGM is better suited to named entity recognition models trained on smaller datasets.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the embodiments of the present invention in any way; any simple modification, equivalent variation or alteration of the above embodiment according to the technical substance of the embodiments of the present invention still falls within the scope of the technical solution of the embodiments of the present invention.
Claims (10)
1. A model for identifying named entities in the field of food safety is characterized in that:
the model comprises: an embedding layer, a feature extraction layer, an attention layer, an entity classification layer and adversarial training;
the embedding layer: mapping the natural text into a machine-readable representation;
the feature extraction layer: using BiLSTM as the sequence encoder of the NER tagger, feeding the word vector sequence into the BiLSTM network for feature extraction;
the attention layer: feeding h into the attention layer and, via the attention mechanism, assigning larger weights to the features with the greatest influence on entity classification;
the entity classification layer: obtaining the globally optimal label sequence in the final decoding stage with the standard Viterbi algorithm;
the adversarial training: used to strengthen the regularization of the classification model.
2. The recognition model of claim 1, wherein,
the embedding layer: the Enhanced Representation through Knowledge Integration (ERNIE) model proposed by Baidu is used as the word embedding model, and a multi-layer Transformer is used as the primary encoder.
3. The recognition model of claim 2, wherein,
the embedding layer: rich text features are extracted with the pre-trained ERNIE model to obtain an output tensor of shape batch_size × max_seq_len × emb_size, used as the sequence representation for the classification task; where batch_size is the batch size of the processed data, max_seq_len is the maximum length of the input sentence, and emb_size is the embedding dimension of each token.
4. The recognition model of claim 1, wherein,
the feature extraction layer comprises: encoding the character vector sequence output by ERNIE with a BiLSTM network, obtaining the forward hidden state with a forward LSTM and the backward hidden state with a backward LSTM; the corresponding values of the forward and backward hidden states obtained by the BiLSTM are summed to obtain h, and the output of the BiLSTM hidden layer is expressed as:
h_i = h_i^fw + h_i^bw,
wherein h = {h_1, h_2, …, h_t} is the hidden state representation corresponding to each token, h_i^fw and h_i^bw are the forward and backward hidden states at position i, and x = {x_1, x_2, …, x_t} is the word vector sequence.
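A minimal PyTorch sketch of this encoding, assuming batch-first tensors and emb_size = 768 from claim 3 (the batch and sequence sizes are illustrative):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the ERNIE character vectors.
bilstm = nn.LSTM(input_size=768, hidden_size=768, num_layers=1,
                 bidirectional=True, batch_first=True)

x = torch.randn(2, 128, 768)        # (batch, max_seq_len, emb_size) from ERNIE
out, _ = bilstm(x)                  # (batch, T, 2*hidden): [forward; backward]
h_fw, h_bw = out.chunk(2, dim=-1)   # split the two directions
h = h_fw + h_bw                     # h_i = h_i^fw + h_i^bw, per the claim
```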
5. The recognition model of claim 1, wherein,
the attention layer is as follows: the matrix H consists of the hidden state vectors h output by the BiLSTM network; let w be the matrix parameter to be trained and w^T its transpose, satisfying the following equations:
M = relu(H), α = softmax(w^T M), r = Hα^T,
wherein d_w is the dimension of the word vectors in the sentence and α is the attention weight coefficient;
the BiLSTM output h is weighted and summed to obtain r, with w, α and r of dimensions d_w, T and d_w respectively; after the attention layer, the sentence representation vector containing the most critical information is obtained as:
h* = relu(r).
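The equations above can be implemented directly; the following is a minimal PyTorch sketch, under the assumption that H stacks the hidden states column-wise as (batch, d_w, T):

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """M = relu(H), alpha = softmax(w^T M), r = H alpha^T, h* = relu(r)."""

    def __init__(self, d_w: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_w))  # trainable vector w

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, d_w, T) -- BiLSTM hidden states stacked column-wise
        M = torch.relu(H)                              # (batch, d_w, T)
        scores = torch.einsum("d,bdt->bt", self.w, M)  # w^T M -> (batch, T)
        alpha = torch.softmax(scores, dim=-1)          # attention weights
        r = torch.einsum("bdt,bt->bd", H, alpha)       # H alpha^T -> (batch, d_w)
        return torch.relu(r)                           # h*
```

Applied to the (transposed) BiLSTM output of claim 4, this yields the sentence vector h* that is passed on to the entity classification layer.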
6. the recognition model of claim 1, wherein,
the entity classification layer comprises: outputting the tag sequence with the highest prediction score as the best answer:
y* = argmax_y P(y | x, w),
wherein x = {x_1, x_2, …, x_t} is the word vector sequence, w is the parameter vector, and y = {y_0, y_1, …, y_{n+1}} is the tag sequence.
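A minimal sketch of the standard Viterbi decoding named in this claim; the emission and transition scores here are hypothetical placeholders, not the trained model's parameters:

```python
import torch

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list:
    # emissions: (T, num_tags) per-token tag scores
    # transitions: (num_tags, num_tags) score of moving from tag i to tag j
    T, num_tags = emissions.shape
    score = emissions[0]          # best score of a path ending in each tag
    backpointers = []
    for t in range(1, T):
        # score[i] + transitions[i, j] + emission[j] for every (prev i, next j)
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final tag to recover the optimal path.
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))
```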
7. The recognition model of claim 1, wherein,
in the adversarial training: an adversarial perturbation Δx is added to the word embedding x for adversarial training of the NER model; Δx is defined as:
Δx = ε·(g/||g||₂),
wherein g = ∇_x L(x, y; θ) is the gradient of the loss with respect to the input sample, and ||g||₂ is the L2 norm of g;
Δx is computed from the gradient of the word embedding matrix and added to the current word embedding, which corresponds to:
x_adv = x + Δx.
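A minimal PyTorch sketch of this perturbation step in the style of the usual FGM implementation; the embedding parameter filter "word_embeddings" is an assumption about the model's module naming:

```python
import torch

class FGM:
    """Back up the embedding weights, add eps * g / ||g||_2 along the
    gradient, then restore after the adversarial backward pass."""

    def __init__(self, model: torch.nn.Module, eps: float = 1.0):
        self.model, self.eps, self.backup = model, eps, {}

    def attack(self, emb_name: str = "word_embeddings"):
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)  # ||g||_2
                if norm != 0:
                    param.data.add_(self.eps * param.grad / norm)  # x + Δx

    def restore(self, emb_name: str = "word_embeddings"):
        for name, param in self.model.named_parameters():
            if name in self.backup and emb_name in name:
                param.data = self.backup[name]
        self.backup = {}
```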
8. the recognition model of claim 7, wherein,
in the adversarial training: the adversarial training is generalized to the following min-max formulation:
min_θ E_(x,y)∼D [ max_(Δx∈Ω) L(x + Δx, y; θ) ],
wherein D, x and y denote the training set, the input sample and the sample label respectively; θ, L(x, y; θ) and Δx denote the model parameters, the loss of a single sample and the adversarial perturbation superimposed on the input respectively; Ω is the perturbation space and the inner max(L) is the optimization objective of the perturbation; the outer expectation E_(x,y) is minimized over the model parameters θ of the neural network.
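One training step under this min-max objective might look as follows, reusing the FGM helper sketched above; model, loader, optimizer and loss_fn are hypothetical placeholders for the actual training code:

```python
fgm = FGM(model, eps=1.0)

for batch in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch["labels"])
    loss.backward()                 # gradients for the clean sample

    fgm.attack()                    # inner max: perturb the embeddings
    loss_adv = loss_fn(model(batch), batch["labels"])
    loss_adv.backward()             # accumulate adversarial gradients
    fgm.restore()                   # undo the perturbation

    optimizer.step()                # outer min over the parameters θ
```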
9. A method for identifying named entities in the field of food safety, characterized by comprising the following steps:
(1) Acquiring a food safety corpus;
(2) Converting the corpus into the BIO format after manual labeling;
(3) Inputting the data processed in step (2) into the recognition model of any one of claims 1-8 for recognition.
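As a hypothetical illustration of the BIO labeling in step (2): the sentence and the entity types below are invented for illustration only and are not taken from the patent's Food dataset.

```python
# B- marks the beginning of an entity, I- its continuation, O non-entity tokens.
example = [
    ("奶", "B-FOOD"), ("粉", "I-FOOD"),        # "milk powder" as a food entity
    ("导", "O"), ("致", "O"),                   # "causes"
    ("腹", "B-SYMPTOM"), ("泻", "I-SYMPTOM"),   # "diarrhea" as a symptom entity
]
```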
10. The method of claim 9, wherein,
in the step (2), YEDDA is selected for the manual labeling;
in the recognition model of step (3), the LSTM is set as a bidirectional network, the hidden layer size is 768, the number of layers is 1, the dropout rate is 0.5, the learning rate is 2×10⁻⁵, and the learning rate of the entity classification layer is 2×10⁻².
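These hyperparameters can be read as the following minimal PyTorch sketch; the optimizer choice (Adam) and the tag count are assumptions not stated in the claim:

```python
import torch
import torch.nn as nn

num_tags = 7  # hypothetical BIO tag count
lstm = nn.LSTM(input_size=768, hidden_size=768, num_layers=1,
               bidirectional=True, batch_first=True)
dropout = nn.Dropout(p=0.5)
classifier = nn.Linear(768, num_tags)  # forward/backward states are summed (claim 4)

optimizer = torch.optim.Adam([
    {"params": lstm.parameters(), "lr": 2e-5},        # learning rate 2x10^-5
    {"params": classifier.parameters(), "lr": 2e-2},  # entity layer 2x10^-2
])
```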
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310613994.3A CN116522945A (en) | 2023-05-29 | 2023-05-29 | Model and method for identifying named entities in food safety field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116522945A (en) | 2023-08-01
Family
ID=87401148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310613994.3A Pending CN116522945A (en) | 2023-05-29 | 2023-05-29 | Model and method for identifying named entities in food safety field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116522945A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756624A (en) * | 2023-08-17 | 2023-09-15 | 中国民用航空飞行学院 | Text classification method for civil aviation supervision item inspection record processing |
CN116756624B (en) * | 2023-08-17 | 2023-12-12 | 中国民用航空飞行学院 | Text classification method for civil aviation supervision item inspection record processing |
CN117807999A (en) * | 2024-02-29 | 2024-04-02 | 武汉科技大学 | Domain self-adaptive named entity recognition method based on countermeasure learning |
CN117807999B (en) * | 2024-02-29 | 2024-05-10 | 武汉科技大学 | Domain self-adaptive named entity recognition method based on countermeasure learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||