CN114386412A - Multi-modal named entity recognition method based on uncertainty perception - Google Patents

Multi-modal named entity recognition method based on uncertainty perception

Info

Publication number
CN114386412A
CN114386412A (application number CN202011140620.7A)
Authority
CN
China
Prior art keywords
label
feature
text
modal
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011140620.7A
Other languages
Chinese (zh)
Other versions
CN114386412B (en)
Inventor
何小海
刘露平
王美玲
卿粼波
吴小强
陈洪刚
滕奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011140620.7A priority Critical patent/CN114386412B/en
Publication of CN114386412A publication Critical patent/CN114386412A/en
Application granted granted Critical
Publication of CN114386412B publication Critical patent/CN114386412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-modal named entity recognition method based on uncertainty perception. The method comprises two steps: candidate label generation and label correction. In candidate label generation, a pre-trained model first extracts features from the input text to obtain representations rich in contextual information; these features are then fed into a Bayesian neural network, which outputs candidate labels and their corresponding uncertainties. In the label correction stage, pre-trained models extract feature representations from the text and the image; a multi-modal fusion framework is then proposed that fuses the text and image features through a multi-head attention mechanism. Finally, the fused features are fed into a conditional random field to output correction labels, which are used to correct the candidate labels. Compared with existing methods, the disclosed method effectively suppresses the noise introduced by irrelevant images and has broad application prospects in fields such as social media information mining and information extraction.

Description

Multi-modal named entity recognition method based on uncertainty perception
Technical Field
The invention relates to a multi-modal named entity recognition method based on uncertainty perception, and belongs to the intersection of the fields of natural language processing and computer vision.
Background
With the rapid development of the mobile internet and smart terminals, social media platforms (such as Facebook and Twitter) have grown rapidly and become a main channel for people to stay in touch and express personal emotions. These platforms generate massive numbers of messages every day, which can be used for tasks such as cyber-attack detection, natural disaster early warning, and disease outbreak prediction. Because this information is unstructured, it is not amenable to direct processing by computers, so automatically extracting important information from social media has become urgent and important. As a fundamental task, named entity recognition on social media has attracted the attention of many researchers in recent years. Through named entity recognition, important information such as persons, organizations, and places can be extracted from massive data, and the extracted information can serve as input to higher-level tasks such as event detection and hot topic analysis.
At present, named entity recognition is becoming mature on relatively well-formed data such as news, but recognizing named entities on social media remains very challenging. This is mainly reflected in two aspects: (1) compared with relatively well-formed news text, text on social media is short and structurally incomplete, so the context needed for named entity recognition is often missing; (2) in addition, social media contains a great deal of colloquial expression, so the data are generally noisier.
In response to these challenges, many researchers have conducted in-depth studies and proposed corresponding solutions. In earlier approaches, researchers explored using characteristics of social media data to aid named entity recognition, for example Twitter stream information (Li C, Weng J, He Q, et al. TwiNER: named entity recognition in targeted Twitter stream [C] // Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012). In recent years, some researchers have explored using the rich visual information on social media to assist the named entity recognition task, with corresponding success. Most messages published on social media contain accompanying pictures, and these pictures contain rich visual information that can help the text be understood and provide partial context for named entity recognition. In these methods, researchers capture feature information from the text and the picture with feature extraction networks, design different feature fusion frameworks to obtain a multi-modal fused representation of the text and image features, and finally use the fused representation for the named entity recognition task. Fusing visual information can alleviate, to a certain extent, the lack of context on social media, so methods based on multi-modal feature fusion effectively improve the performance of named entity recognition on social media.
However, existing methods focus only on feature fusion and ignore the image-text mismatch phenomenon on social media, that is, the picture uploaded by a user and the published message may express different semantic scenes. This phenomenon is common on social media, and the large number of image-text mismatches poses a challenge to existing multi-modal fusion methods. If such irrelevant visual information is incorporated into the text features, additional noise is introduced into the model, so the model may produce wrong predictions, ultimately harming the performance of the named entity recognition task.
To address these problems, the invention provides a social media named entity recognition method based on uncertainty perception. In this method, the named entity recognition task is decomposed into two steps: candidate label generation and label correction. In candidate label generation, the model uses only text information as input, and the model's predictions and corresponding uncertainties are obtained with a Bayesian neural network. The invention takes the output of this first stage as the candidate labels, and the model uncertainty describes whether a candidate label is sufficiently trustworthy. In the label correction stage, a multi-modal fusion framework based on a multi-head attention mechanism is constructed; it fuses the features of the text and the image through multi-head attention, and the fused features, after feature dimension conversion by a linear layer, are sent to a conditional random field to obtain correction labels. Finally, the correction labels are used to correct those candidate labels with high uncertainty.
The method innovatively introduces model uncertainty to decide whether multi-modal feature fusion should take effect, so that visual information from the picture is fused only when the text information alone is insufficient. The noise introduced by irrelevant pictures can therefore be suppressed to a certain extent, further improving the performance of named entity recognition on social media.
Disclosure of Invention
The invention provides a multi-modal named entity recognition method based on uncertainty perception for the named entity recognition task on social media. The method decomposes the task into two steps: candidate label generation and label correction. In the candidate label generation stage, a named entity recognition framework based on a Bayesian bidirectional long short-term memory (BiLSTM) network is constructed; this framework uses only text information as input, encodes it with the Bayesian BiLSTM, and feeds the result to a multi-class classification network to obtain the predicted label information, while the uncertainty of the predicted label is obtained by computing the entropy of the label probabilities. The uncertainty indicates whether the model's output is sufficiently reliable. In the label correction stage, the invention constructs a multi-modal fusion framework based on a multi-head attention mechanism: two self-attention networks first capture the intra-modal attention within the text and within the picture, an inter-modal interaction network then captures the attention between the two modalities, and finally multi-modal feature fusion is performed through a visual gating network. The fused features are sent to a conditional random field for decoding to obtain the correction labels, which are finally used to correct the candidate labels with high uncertainty.
The invention realizes the purpose through the following technical scheme:
1. The social media multi-modal named entity recognition framework disclosed by the invention is shown in Fig. 1 and comprises two parts: a Bayesian neural network and a multi-modal fusion network. The method comprises a training stage and an inference stage. The training stage proceeds as follows:
(1) In candidate label generation, the pre-trained language model BERT is first used to extract features from the input text, yielding word feature representations that contain contextual semantic information.
(2) The word feature representations are fed into the Bayesian bidirectional LSTM, which encodes the sentence into higher-level semantic features; these features are then passed through a fully connected layer that converts the feature dimension of each word to the number of entity label categories.
(3) The feature vectors obtained in step (2) are fed into a Softmax classifier, which outputs the class probabilities of each word; the class with the highest probability is taken as the word's label, i.e., the candidate label.
(4) In the second stage, features are extracted from the input text and image by the pre-trained language model BERT and the pre-trained image model ResNet, respectively. To match the feature dimensions of text and image, the image feature vectors are converted by a linear layer to the same dimension as the text feature vectors.
(5) The text and image feature vectors obtained in step (4) are fed into two multi-head self-attention networks (A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017, pp. 5998-6008), respectively, to capture the intra-modal feature correlations within the text and within the image.
(6) The text and image feature vectors output in step (5) are fed into a multi-modal feature fusion network based on multi-head attention, in which the text feature vectors serve as the query vectors and the image feature vectors serve as the key and value vectors; after the multi-modal fusion computation, the fused multi-modal features are obtained.
(7) The fused multi-modal feature vectors obtained in step (6) and the text feature vectors output in step (5) are fed into a visual gating network, which computes a correlation coefficient between the visual features and each word, called the visual intensity coefficient; this coefficient is then multiplied by the corresponding multi-modal feature vector to produce the output feature representation.
(8) The feature vectors obtained in step (7) are concatenated with the text feature vectors obtained in step (5) and fed into a linear layer for feature dimension conversion, mapping the feature vector of each word to the number of entity label categories; the label probabilities are then obtained by decoding with a conditional random field network.
(9) Losses are computed between the label probabilities obtained in steps (3) and (8) and the ground-truth labels, and the two losses are then used to optimize the parameters of the Bayesian neural network and the multi-modal network, respectively (see the sketch below).
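The following is a minimal PyTorch-style sketch of how the two losses in step (9) could drive two separate optimizers. The module and attribute names (bayesian_net, fusion_net, .loss) are hypothetical placeholders and not part of the original disclosure.

```python
# Hypothetical sketch of the dual-loss optimization in training step (9).
# `bayesian_net` and `fusion_net` are assumed to be nn.Module objects whose
# `.loss(...)` methods return the negative ELBO and the multi-modal loss.
def training_step(bayesian_net, fusion_net, batch, opt_bayes, opt_fusion):
    # Candidate-label branch: text only, negative-ELBO loss.
    elbo_loss = bayesian_net.loss(batch["text_feats"], batch["labels"])
    opt_bayes.zero_grad()
    elbo_loss.backward()
    opt_bayes.step()

    # Correction branch: text + image, multi-modal fusion loss.
    fusion_loss = fusion_net.loss(batch["text_feats"], batch["image_feats"], batch["labels"])
    opt_fusion.zero_grad()
    fusion_loss.backward()
    opt_fusion.step()
    return elbo_loss.item(), fusion_loss.item()
```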
In the inference stage, entity label prediction likewise consists of two stages, candidate label generation and label correction, as follows:
(1) In candidate label generation, features are first extracted from the input sentence with the feature extraction model of training step (1). The extracted features are then fed into the Bayesian neural network T times; each time, a set of parameters is sampled from the posterior probability distribution and a probability output is obtained by forward propagation. After T samplings, T probability outputs are obtained.
(2) The T probability outputs are averaged to obtain the label probabilities, and for each word the label with the highest probability is taken. The invention obtains the corresponding label uncertainty by computing the entropy of the output label probabilities: the higher the uncertainty, the more likely the predicted label of the word is wrong.
(3) In the label correction stage, the text and the image are fed into the multi-modal feature fusion network to obtain the fused features, which, after feature dimension conversion by the linear layer, are decoded by the conditional random field network to obtain the corresponding correction labels.
(4) Finally, the candidate labels are corrected with the correction labels. An appropriate threshold is set: if the uncertainty of a candidate label generated in the first stage is greater than the threshold, the label is corrected; otherwise, the label generated in the first stage is kept.
Specifically, in step (1), the BERT-base-uncased version of the pre-trained BERT model is used to initialize the word vectors of the input sentence, yielding word feature vectors C = [c_0, c_1, ..., c_n], where the feature vector of each word has 768 dimensions.
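A minimal sketch of this feature extraction step, assuming the HuggingFace transformers library and first-sub-word pooling (the pooling strategy and the example sentence are assumptions, not stated in the original text):

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Contextual word features from bert-base-uncased (768 dimensions per token).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

sentence = ["Kevin", "Durant", "joins", "the", "Warriors"]   # example input
enc = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state        # [1, num_subwords, 768]

# Use the first sub-word piece as the representation of each word.
word_ids = enc.word_ids(0)
first_piece = [word_ids.index(i) for i in range(len(sentence))]
C = hidden[0, first_piece]                        # C = [c_0, ..., c_n], each 768-dim
print(C.shape)                                    # torch.Size([5, 768])
```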
In step (2), the text feature vectors are fed into the Bayesian bidirectional LSTM. The parameters of the Bayesian neural network form a random variable ω, and the posterior distribution is approximated by a Gaussian q_θ(ω) = N(ω | μ, σ²), where μ is the mean and σ² is the variance. The Bayesian neural network has 1 layer and 768 hidden-layer neurons. After encoding by the Bayesian neural network, the feature vectors are fed into a linear layer for feature dimension conversion to obtain a new feature vector T; the input dimension of the linear layer is 768 and the output dimension is the number of entity label categories, which is 11 in the method of the invention. The computation is as follows:
h_i^f = LSTM_f(c_i, h_{i-1}^f)    (1)
h_i^b = LSTM_b(c_i, h_{i+1}^b)    (2)
h_i = [h_i^f ; h_i^b]    (3)
T = Linear(h)    (4)
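A minimal PyTorch sketch of the Bayesian BiLSTM tagger follows. As a simplification made here for brevity (an assumption), only the output projection carries a Gaussian variational posterior, whereas the patent places the distribution over the network parameters ω as a whole; the class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with Gaussian variational posterior q(w) = N(mu, sigma^2);
    a new weight sample is drawn on every forward pass (reparameterization)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))  # sigma = softplus(rho)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)      # one posterior sample
        return F.linear(x, w, self.bias)

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights.
        sigma = F.softplus(self.rho)
        return (0.5 * (sigma ** 2 + self.mu ** 2 - 1) - torch.log(sigma)).sum()

class BayesianBiLSTMTagger(nn.Module):
    def __init__(self, d_model=768, hidden=768, num_labels=11):
        super().__init__()
        self.bilstm = nn.LSTM(d_model, hidden // 2, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.out = BayesianLinear(hidden, num_labels)

    def forward(self, C):              # C: [batch, seq_len, 768] BERT word features
        h, _ = self.bilstm(C)          # Eqs. (1)-(3): BiLSTM encoding
        return self.out(h)             # Eq. (4): per-word label logits T

tagger = BayesianBiLSTMTagger()
logits = tagger(torch.randn(2, 20, 768))
print(logits.shape)                    # torch.Size([2, 20, 11])
```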
In step (3), the feature vector t_i of each word in the sentence is fed into a softmax layer to obtain the class probability p(i) of each word, where softmax is computed as:
p(i) = softmax(t_i), i.e., the probability of class c is exp(t_i,c) / Σ_{c'} exp(t_i,c')    (5)
In step (4), features are extracted from the text and the image with the pre-trained networks BERT and ResNet, respectively. BERT uses the BERT-base-uncased version, and the feature vector of each word has 768 dimensions; ResNet uses ResNet-152, taking the last convolutional layer as output, so each picture is represented by 7 × 7 = 49 region feature vectors, each of dimension 2048. In the feature conversion, the linear conversion layer has an input dimension of 2048 and an output dimension of 768.
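A sketch of this extraction and projection, assuming torchvision's pre-trained ResNet-152 with standard ImageNet preprocessing (the preprocessing values and the file name are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# 7x7 = 49 region features of dimension 2048 from the last convolutional stage
# of ResNet-152, projected to 768 to match the BERT word features.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()   # drop avgpool + fc
project = nn.Linear(2048, 768)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("tweet_image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(img)                       # [1, 2048, 7, 7]
V = project(fmap.flatten(2).transpose(1, 2))   # [1, 49, 768] region feature vectors
print(V.shape)
```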
In step (5), a multi-head attention mechanism is used to capture the correlations among the words in the sentence and among the region blocks in the image. The invention uses 12 attention heads in total, with a hidden feature dimension of 64 per head. In each attention head, a new feature representation of a word or image region is first obtained through the attention mechanism, computed as follows:
head_t = Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t    (6)
head_v = Attention(Q_v, K_v, V_v) = softmax(Q_v K_v^T / √d_k) V_v    (7)
where Q_t, K_t, V_t are obtained by transforming the word vector representations through three fully connected layers, Q_v, K_v, V_v are obtained from the image region feature vectors through three other fully connected layers, and d_k equals 64. After the single-head attention is obtained, the outputs of the multiple heads are concatenated and passed through a fully connected layer to obtain the encoded representations of the words and of the visual region blocks, computed as follows:
m_t = MultiHead(Q_t, K_t, V_t) = concat(head_t1, ..., head_th) W_t    (8)
m_v = MultiHead(Q_v, K_v, V_v) = concat(head_v1, ..., head_vh) W_v    (9)
To prevent vanishing gradients, the output of the multi-head attention network is further passed through a residual connection and a normalization layer to obtain the output of the network, computed as follows:
h_mt = LayerNorm(m_t + C)    (10)
h_mv = LayerNorm(m_v + V)    (11)
where C is the text feature representation output in step (4) and V is the image feature representation output in step (4).
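A minimal sketch of the intra-modal encoder of step (5) and Eqs. (10)-(11), built on PyTorch's nn.MultiheadAttention rather than the patent's own implementation (the class name is illustrative):

```python
import torch
import torch.nn as nn

class IntraModalEncoder(nn.Module):
    """Multi-head self-attention (12 heads of 64 dims) + residual + LayerNorm."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: [batch, len, 768]
        m, _ = self.attn(x, x, x)          # queries, keys, values all from x
        return self.norm(m + x)            # h = LayerNorm(m + x), Eqs. (10)-(11)

text_encoder, image_encoder = IntraModalEncoder(), IntraModalEncoder()
h_mt = text_encoder(torch.randn(2, 20, 768))    # word features C
h_mv = image_encoder(torch.randn(2, 49, 768))   # region features V
print(h_mt.shape, h_mv.shape)
```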
In step (6), the intra-modal text and image features extracted by the self-attention networks are fed into the multi-modal fusion network to capture the correlations between the modalities. This network again uses the multi-head attention mechanism of step (5), with the text feature vectors as the query vectors and the image features as the key and value vectors; the computation is analogous to step (5) and is not repeated here. The feature vector output by this step is denoted P_mv (a sketch follows below).
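A minimal sketch of this cross-modal attention, again using nn.MultiheadAttention (an assumption about the concrete implementation):

```python
import torch
import torch.nn as nn

# Text features act as queries, image features as keys and values,
# producing one visually-attended vector P_mv per word.
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

h_mt = torch.randn(2, 20, 768)   # text output of step (5)
h_mv = torch.randn(2, 49, 768)   # image output of step (5)
P_mv, attn_weights = cross_attn(query=h_mt, key=h_mv, value=h_mv)
print(P_mv.shape)                # torch.Size([2, 20, 768])
```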
In step (7), the multi-modal feature vectors and the text feature vectors output in step (5) are fed into a visual gating network, which is mainly used to compute the association strength between the visual information and each word. Some words in a sentence, such as 'a' and 'the', are rarely associated with the visual information in the image and do not need a corresponding visual representation. The gating network computes an intensity coefficient, called the visual intensity coefficient in the present invention, which represents the degree to which the visual features contribute to the text features; it is computed as follows:
g = σ((W_t)^T h_mt + (W_v)^T P_mv)    (12)
After the visual intensity coefficient is obtained, it is multiplied by the corresponding multi-modal visual feature representation to obtain the final multi-modal visual feature representation B = g ⊙ P_mv.
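A minimal sketch of the visual gate of Eq. (12); scaling the fused features P_mv follows the formulation of claim 4, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class VisualGate(nn.Module):
    """Element-wise gate g = sigmoid(W_t^T h_mt + W_v^T P_mv) applied to P_mv."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_t = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h_mt, P_mv):
        g = torch.sigmoid(self.w_t(h_mt) + self.w_v(P_mv))   # visual intensity, Eq. (12)
        return g * P_mv                                       # gated multi-modal feature B

gate = VisualGate()
B = gate(torch.randn(2, 20, 768), torch.randn(2, 20, 768))
print(B.shape)
```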
In step (8), the multi-modal visual feature representation B obtained in step (7) is concatenated with the text feature representation h_mt obtained in step (5), and feature dimension conversion is performed through a linear layer to obtain the final feature vector representation H:
H = Linear([B; h_mt])    (13)
The feature vectors are then decoded by the conditional random field to output the label probability information.
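A minimal sketch of the CRF training loss and decoding, assuming the third-party pytorch-crf package (the patent does not name a specific implementation):

```python
import torch
from torchcrf import CRF   # pip install pytorch-crf

num_labels = 11
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(2, 20, num_labels)        # H after the linear layer
tags = torch.randint(0, num_labels, (2, 20))      # gold label ids
mask = torch.ones(2, 20, dtype=torch.bool)

nll = -crf(emissions, tags, mask=mask)            # negative log-likelihood (training)
best_paths = crf.decode(emissions, mask=mask)     # most likely label sequences
print(nll.item(), best_paths[0][:5])
```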
For the losses in step (9), the Bayesian neural network is optimized with the negative evidence lower bound (ELBO) loss, while the multi-modal fusion network is optimized with the cross-entropy loss; the two loss functions are defined as follows:
L_ELBO = KL(q_θ(ω) || p(ω)) − E_{q_θ(ω)}[log p(D|ω)]    (14)
L_CE = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{N} y_i log y_i'    (15)
In Equation (14), log p(D|ω) is the log-likelihood term, q_θ(ω) is the posterior distribution of the parameters, p(ω) is the prior distribution of the parameters, and KL is the relative entropy between the two distributions, also called the Kullback-Leibler (KL) divergence. In the cross-entropy loss, y_i is the true label of word i, y_i' is the predicted probability output for word i, T is the batch size during training, and N is the maximum number of words in each sentence.
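A minimal sketch of the two losses; the standard-normal prior and the fully factorized Gaussian posterior are assumptions used here to make the KL term closed-form:

```python
import torch
import torch.nn.functional as F

def negative_elbo(logits, labels, mu, sigma):
    """Eq. (14): KL(q||p) minus the log-likelihood, with q = N(mu, sigma^2), p = N(0, 1)."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    kl = (0.5 * (sigma ** 2 + mu ** 2 - 1) - torch.log(sigma)).sum()
    return kl + nll

def cross_entropy_loss(logits, labels):
    """Eq. (15): token-level cross-entropy for the multi-modal branch."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

logits = torch.randn(2, 20, 11)
labels = torch.randint(0, 11, (2, 20))
mu, sigma = torch.zeros(100), torch.full((100,), 0.1)
print(negative_elbo(logits, labels, mu, sigma).item(), cross_entropy_loss(logits, labels).item())
```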
In step (1) of the inference stage, the number of samples T is set to 64; that is, the same sentence is fed into the network 64 times, so each word obtains 64 probability outputs.
In step (2) of the inference stage, the predicted probability output of each word is the average of the 64 samples, computed as follows:
p_i = (1/T) Σ_{t=1}^{T} p_i^(t)    (16)
The uncertainty of each label is the entropy of the averaged class probabilities, computed as follows:
u_i = −Σ_{c=1}^{C} p_{i,c} log p_{i,c}    (17)
if the entropy is larger, it indicates that the prediction is less reliable.
In step (3) of the inference stage, a new input sentence and the corresponding picture are fed into the multi-modal network for feature extraction; finally, feature dimension conversion is performed by the linear layer and the correction labels are obtained through the conditional random field.
In step (4) of the inference stage, the candidate labels generated in stage 1 are corrected with the correction labels output by the multi-modal network. In the concrete correction process, a threshold is set to indicate whether a label should be corrected: if the uncertainty of a label generated in stage 1 is greater than the threshold, the label is corrected; otherwise, the label generated in the first stage is kept as the final predicted label. The choice of the uncertainty threshold depends on the data set; concretely, the threshold is chosen such that, after correction, the model attains the maximum F1 value.
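A minimal sketch of the correction rule; the threshold value used in the example is arbitrary and would in practice be tuned on each data set as described above:

```python
def correct_labels(candidate, uncertainty, corrected, threshold=0.5):
    """Keep the candidate label unless its uncertainty exceeds the threshold,
    in which case the correction label from the multi-modal network is used."""
    return [c if u <= threshold else r
            for c, u, r in zip(candidate, uncertainty, corrected)]

# Example: the third word is uncertain, so its correction label replaces the candidate.
print(correct_labels(["O", "B-PER", "O"], [0.1, 0.2, 1.3], ["O", "B-PER", "B-LOC"]))
```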
Drawings
Fig. 1 is the main framework of the network model proposed by the present invention.
Fig. 2 is a structure of a multimodal fusion network.
FIG. 3 is a graph of the performance variation of the model at different thresholds on two data sets, Twitter-2015 and Twitter-2017.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
Fig. 1 shows the structure of the whole network, which consists of two parts: the Bayesian neural network and the multi-modal fusion network. The Bayesian neural network accepts text data as input and outputs a predicted label for each word together with the corresponding uncertainty. In the Bayesian neural network, the input sentence is first encoded by the pre-trained language model BERT to obtain an initialized vector representation. This representation is then input to the Bayesian bidirectional LSTM, whose parameters are random variables and whose posterior distribution is approximated by variables following a Gaussian distribution. The vectors output by the Bayesian neural network pass through a linear layer and are fed into the softmax classification network to obtain the probability information. Since the parameters of the network are random variables, its outputs also follow a probability distribution; to obtain the probability output of each label, the model is sampled T times, and the label probabilities are obtained by averaging the results of the multiple samplings. The uncertainty value of the model is obtained by computing the entropy of the label probabilities. In the multi-modal network, initial features are extracted from the text and the corresponding image by the pre-trained text model BERT and the pre-trained image model ResNet; the extracted features are fed into two self-attention-based modules that extract the intra-modal correlations, the text and image features are then fed into the multi-modal fusion network with the visual gating mechanism to obtain a fused feature representation, and this representation, after dimension conversion by a linear layer, is decoded by the conditional random field to output the correction labels. Finally, these labels are used to correct the candidate labels: with an appropriate threshold, labels whose uncertainty is greater than the threshold are corrected, and labels whose uncertainty is smaller than the threshold are not.
Fig. 2 shows the multi-modal fusion framework. Its inputs are the text and image features output by the two self-attention networks. Feature fusion is first performed by the multi-head attention mechanism, with the text features as the query vectors and the image features as the key and value vectors; the fused features are then fed into the visual gating network to obtain the visual intensity coefficient, and finally the intensity coefficient is multiplied by the corresponding features to obtain the final feature representation.
Fig. 3 shows how the F1 value of the model varies with the threshold on the two public datasets Twitter-2015 and Twitter-2017. When the threshold is 0, the model uses the correction labels output by the multi-modal framework as its output. As the figure shows, the F1 value first increases as the threshold grows, because part of the low-uncertainty labels generated in the first stage are retained. As the threshold increases further, the model mainly uses the candidate labels as output, and the accuracy drops rapidly because these labels lack visual information.
Tables 1 and 2 show the experimental results of the invention on the public datasets Twitter-2015 and Twitter-2017. The experiments show that, compared with the best existing models, the proposed model achieves the best results on the comprehensive evaluation metric F1.
TABLE 1 Experimental comparison of the inventive network model on the Twitter-2015 dataset with other existing models
(Table 1 is provided as an image in the original publication.)
TABLE 2 Experimental comparison of the inventive network model on the Twitter-2017 dataset with other existing models
(Table 2 is provided as an image in the original publication.)
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions. Any technical solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.

Claims (5)

1. A multi-modal named entity recognition method based on uncertainty perception is characterized by comprising the following steps:
(1) feature extraction is performed on the input text and input image using the text pre-trained model BERT-base-uncased and the image pre-trained model ResNet152, respectively, where the image feature vectors are taken from the output of the last convolutional layer of ResNet152;
(2) a Bayesian bidirectional long short-term memory neural network is constructed; the text feature vectors are input into the Bayesian neural network, which outputs the candidate labels and the corresponding label uncertainties;
(3) a multi-modal interaction model (MIM) is constructed; the text features and image features are fed into the MIM to output the fused multi-modal features, which, after feature dimension conversion by a linear layer, are input into a conditional random field (CRF) decoding network to output the correction labels;
(4) the candidate labels are corrected with the correction labels.
2. The method according to claim 1, wherein the Bayesian neural network construction and training method in (2) comprises the following steps:
the Bayesian neural network is constructed on the basis of a bidirectional long short-term memory (BiLSTM) network; the parameter ω of the Bayesian bidirectional LSTM is a random variable, and the posterior probability p(ω|D) is approximated by a Gaussian distribution q_θ(ω), i.e., q_θ(ω) = N(μ, σ²); during training of the Bayesian neural network, the parameters are optimized with the negative evidence lower bound (ELBO) loss, whose calculation formula is shown below:
L = KL(q_θ(ω) || p(ω)) − E_{q_θ(ω)}[log p(D|ω)]    (1)
where log p(D|ω) is the log-likelihood term, p(ω) is the prior distribution of the parameters, and KL is the relative entropy between the two distributions, also known as the Kullback-Leibler (KL) divergence.
3. The method according to claim 1, wherein the candidate label generation and the corresponding label uncertainty calculation in (2) specifically comprise the following steps:
a new input sentence is repeatedly fed into the Bayesian neural network T times; each time, model parameters ω̂_t are sampled from the posterior probability distribution, and forward propagation yields T feature vectors; after feature dimension conversion by a linear layer, T probability outputs are obtained through softmax; the average of the T probability outputs is taken as the final probability output, and the class with the maximum probability is taken as the label category, calculated as shown below:
p_i = (1/T) Σ_{t=1}^{T} softmax(W_i h_i)    (2)
where W_i is a parameter of the model to be trained and h_i is the feature vector obtained after multi-modal fusion; the uncertainty of the label is obtained by computing the entropy of the probability of each category, calculated as shown below:
u = −Σ_{c=1}^{C} p_c log p_c    (3)
where C is the number of entity classes.
4. The method according to claim 1, wherein the multi-modal feature fusion in (3) proceeds as follows:
given a text feature vector C and an image feature vector V, they are first fed into two self-attention networks to obtain the intra-modal feature representations C_i and T_i, respectively; C_i and T_i then enter a fusion framework based on the multi-head attention mechanism, in which the text feature vector C_i serves as the query vector and the image feature vector serves as the key and value vectors; the feature fusion is calculated as shown below:
T_MV = softmax(Q K^T / √d_k) V, where Q = W'_q C_i, K = W'_k T_i, V = W'_v T_i    (4)
where W'_q, W'_k, W'_v are parameters of the model to be trained and d_k equals 64; after the multi-modal feature fusion, an intensity coefficient is calculated by the visual gating network, called the visual intensity coefficient, which represents the degree to which the visual features contribute to the text features; it is calculated as shown below:
g = σ((W_T)^T C_i + (W_V)^T T_MV)    (5)
then the visual intensity coefficient is multiplied by the corresponding multi-modal feature vector to obtain the visually guided text feature representation B = g ⊙ T_MV; finally, this feature vector and the text feature vector C_i are concatenated (concat) to obtain the final multi-modal feature representation.
5. The method according to claim 1, wherein the label correction process in (4) is as follows:
when the labels are corrected, an appropriate uncertainty threshold needs to be selected; if the uncertainty of a candidate label is higher than the threshold, the correction label is used for correction; otherwise, the candidate label is kept; the uncertainty threshold is selected per data set, by setting the threshold so that, after the above correction, the model achieves the maximum F1 value on the data set.
CN202011140620.7A 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception Active CN114386412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011140620.7A CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011140620.7A CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Publications (2)

Publication Number Publication Date
CN114386412A true CN114386412A (en) 2022-04-22
CN114386412B CN114386412B (en) 2023-10-13

Family

ID=81194739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011140620.7A Active CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Country Status (1)

Country Link
CN (1) CN114386412B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744505A (en) * 2024-02-20 2024-03-22 电子科技大学 Deep learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling
CN117744505B (en) * 2024-02-20 2024-04-26 电子科技大学 Depth learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060088207A1 (en) * 2004-10-22 2006-04-27 Henry Schneiderman Object recognizer and detector for two-dimensional images using bayesian network based classifier
CN104008208A (en) * 2014-06-19 2014-08-27 北京大学 Situation recognition system and method based on opportunity perception
CN107563418A (en) * 2017-08-19 2018-01-09 四川大学 A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings
CN110929714A (en) * 2019-11-22 2020-03-27 北京航空航天大学 Information extraction method of intensive text pictures based on deep learning
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060088207A1 (en) * 2004-10-22 2006-04-27 Henry Schneiderman Object recognizer and detector for two-dimensional images using bayesian network based classifier
CN104008208A (en) * 2014-06-19 2014-08-27 北京大学 Situation recognition system and method based on opportunity perception
CN107563418A (en) * 2017-08-19 2018-01-09 四川大学 A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings
CN110929714A (en) * 2019-11-22 2020-03-27 北京航空航天大学 Information extraction method of intensive text pictures based on deep learning
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN ZHOU等: "Joint Entity and Relation Extraction Based on Reinforcement Learning" *
魏萍 等: "基于触发词语义选择的Twitter事件共指消解研究" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744505A (en) * 2024-02-20 2024-03-22 电子科技大学 Deep learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling
CN117744505B (en) * 2024-02-20 2024-04-26 电子科技大学 Depth learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling

Also Published As

Publication number Publication date
CN114386412B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110427617B (en) Push information generation method and device
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
Zhang et al. Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
CN109829499B (en) Image-text data fusion emotion classification method and device based on same feature space
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN113705315A (en) Video processing method, device, equipment and storage medium
Khan et al. A deep neural framework for image caption generation using gru-based attention mechanism
Luo et al. A thorough review of models, evaluation metrics, and datasets on image captioning
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
Qi et al. Video captioning via a symmetric bidirectional decoder
CN114386412B (en) Multi-mode named entity recognition method based on uncertainty perception
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN113553445A (en) Method for generating video description
CN112287690A (en) Sign language translation method based on conditional sentence generation and cross-modal rearrangement
CN116702094B (en) Group application preference feature representation method
Ouenniche et al. Vision-text cross-modal fusion for accurate video captioning
Jia et al. Speaker-Aware Interactive Graph Attention Network for Emotion Recognition in Conversation
CN117094291B (en) Automatic news generation system based on intelligent writing
Preethi et al. Video Captioning using Pre-Trained CNN and LSTM
CN116955699B (en) Video cross-mode search model training method, searching method and device
Aafaq ‘Deep learning for natural language description of videos
Kong Research Advanced in Multimodal Emotion Recognition Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant