CN114386412A - Multi-modal named entity recognition method based on uncertainty perception - Google Patents

Multi-modal named entity recognition method based on uncertainty perception

Info

Publication number
CN114386412A
CN114386412A (application number CN202011140620.7A)
Authority
CN
China
Prior art keywords
label
feature
text
modal
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011140620.7A
Other languages
Chinese (zh)
Other versions
CN114386412B (en)
Inventor
何小海
刘露平
王美玲
卿粼波
吴小强
陈洪刚
滕奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011140620.7A priority Critical patent/CN114386412B/en
Publication of CN114386412A publication Critical patent/CN114386412A/en
Application granted granted Critical
Publication of CN114386412B publication Critical patent/CN114386412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-modal named entity recognition method based on uncertainty perception. The method comprises two steps: candidate label generation and label correction. In candidate label generation, a pre-trained model first extracts features from the input text to obtain representations rich in contextual information; these features are then fed into a Bayesian neural network, which outputs candidate labels and their corresponding uncertainties. In the label correction stage, pre-trained models extract feature representations from the text and the image; a multi-modal fusion framework is then proposed that fuses the text and image features through a multi-head attention mechanism. Finally, the fused features are fed into a conditional random field to output correction labels, which are used to correct the candidate labels. Compared with existing methods, the disclosed method effectively suppresses the noise introduced by irrelevant images and has broad application prospects in fields such as social media information mining and information extraction.

Description

Multi-modal named entity recognition method based on uncertainty perception
Technical Field
The invention relates to a multi-modal named entity recognition method based on uncertainty perception, and belongs to the intersection of the fields of natural language processing and computer vision.
Background
With the rapid development of the mobile internet and smart terminals, social media platforms (such as Facebook and Twitter) have grown rapidly and become a main channel for people to stay in touch and express personal emotions. These platforms generate massive numbers of messages every day, which can be used for tasks such as cyber-attack detection, natural disaster early warning, and disease outbreak prediction. Because this information is unstructured, it is not amenable to direct processing by computers, so automatically extracting important information from social media has become urgent and important. As a fundamental task, named entity recognition on social media has attracted the attention of many researchers in recent years. Through named entity recognition, important information such as persons, organizations, and places can be extracted from massive data, and the extracted information can serve as input to higher-level tasks such as event detection and hot topic analysis.
At present, named entity recognition is becoming mature on relatively well-formed data such as news, but recognizing named entities on social media remains very challenging. This is mainly reflected in two aspects: (1) compared with relatively well-formed news text, text on social media is short and structurally incomplete, so the context needed for named entity recognition is often missing; (2) in addition, social media contains a great deal of colloquial expression, so the data are generally noisier.
In response to these challenges, many researchers have conducted in-depth studies and proposed corresponding solutions. In earlier approaches, researchers explored using characteristics of social media data to aid named entity recognition, for example Twitter stream information (Li C, Weng J, He Q, et al. TwiNER: named entity recognition in targeted Twitter stream [C] // Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012). In recent years, some researchers have explored using the rich visual information on social media to assist the named entity recognition task, with corresponding success. Most messages published on social media contain accompanying pictures, and these pictures contain rich visual information that can help the text be understood and provide partial context for named entity recognition. In these methods, researchers capture feature information from the text and the picture with feature extraction networks, design different feature fusion frameworks to obtain a multi-modal fused representation of the text and image features, and finally use the fused representation for the named entity recognition task. Fusing visual information can alleviate, to a certain extent, the lack of context on social media, so methods based on multi-modal feature fusion effectively improve the performance of named entity recognition on social media.
However, existing methods focus only on feature fusion and ignore the image-text mismatch phenomenon on social media, that is, the picture uploaded by a user and the published message may express different semantic scenes. This phenomenon is common on social media, and the large number of image-text mismatches poses a challenge to existing multi-modal fusion methods. If such irrelevant visual information is incorporated into the text features, additional noise is introduced into the model, so the model may produce wrong predictions, ultimately harming the performance of the named entity recognition task.
To address these problems, the invention provides a social media named entity recognition method based on uncertainty perception. In this method, the named entity recognition task is decomposed into two steps: candidate label generation and label correction. In candidate label generation, the model uses only text information as input, and the model's predictions and corresponding uncertainties are obtained with a Bayesian neural network. The invention takes the output of this first stage as the candidate labels, and the model uncertainty describes whether a candidate label is sufficiently trustworthy. In the label correction stage, a multi-modal fusion framework based on a multi-head attention mechanism is constructed; it fuses the features of the text and the image through multi-head attention, and the fused features, after feature dimension conversion by a linear layer, are sent to a conditional random field to obtain correction labels. Finally, the correction labels are used to correct those candidate labels with high uncertainty.
The method innovatively introduces model uncertainty to decide whether multi-modal feature fusion should take effect, so that visual information from the picture is fused only when the text information alone is insufficient. The noise introduced by irrelevant pictures can therefore be suppressed to a certain extent, further improving the performance of named entity recognition on social media.
Disclosure of Invention
The invention provides a multi-modal named entity recognition method based on uncertainty perception for the named entity recognition task on social media. The method decomposes the task into two steps: candidate label generation and label correction. In the candidate label generation stage, a named entity recognition framework based on a Bayesian bidirectional long short-term memory (BiLSTM) network is constructed; this framework uses only text information as input, encodes it with the Bayesian BiLSTM, and feeds the result to a multi-class classification network to obtain the predicted label information, while the uncertainty of the predicted label is obtained by computing the entropy of the label probabilities. The uncertainty indicates whether the model's output is sufficiently reliable. In the label correction stage, the invention constructs a multi-modal fusion framework based on a multi-head attention mechanism: two self-attention networks first capture the intra-modal attention within the text and within the picture, an inter-modal interaction network then captures the attention between the two modalities, and finally multi-modal feature fusion is performed through a visual gating network. The fused features are sent to a conditional random field for decoding to obtain the correction labels, which are finally used to correct the candidate labels with high uncertainty.
The invention realizes the purpose through the following technical scheme:
1. The social media multi-modal named entity recognition framework disclosed by the invention is shown in Fig. 1 and comprises two parts: a Bayesian neural network and a multi-modal fusion network. The method comprises a training stage and an inference stage. The training stage proceeds as follows:
(1) In candidate label generation, the pre-trained language model BERT is first used to extract features from the input text, yielding word feature representations that contain contextual semantic information.
(2) The word feature representations are fed into the Bayesian bidirectional LSTM, which encodes the sentence into higher-level semantic features; these features are then passed through a fully connected layer that converts the feature dimension of each word to the number of entity label categories.
(3) The feature vectors obtained in step (2) are fed into a Softmax classifier, which outputs the class probabilities of each word; the class with the highest probability is taken as the word's label, i.e., the candidate label.
(4) In the second stage, features are extracted from the input text and image by the pre-trained language model BERT and the pre-trained image model ResNet, respectively. To match the feature dimensions of text and image, the image feature vectors are converted by a linear layer to the same dimension as the text feature vectors.
(5) The text and image feature vectors obtained in step (4) are fed into two multi-head self-attention networks (A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017, pp. 5998-6008), respectively, to capture the intra-modal feature correlations within the text and within the image.
(6) The text and image feature vectors output in step (5) are fed into a multi-modal feature fusion network based on multi-head attention, in which the text feature vectors serve as the query vectors and the image feature vectors serve as the key and value vectors; after the multi-modal fusion computation, the fused multi-modal features are obtained.
(7) The fused multi-modal feature vectors obtained in step (6) and the text feature vectors output in step (5) are fed into a visual gating network, which computes a correlation coefficient between the visual features and each word, called the visual intensity coefficient; this coefficient is then multiplied by the corresponding multi-modal feature vector to produce the output feature representation.
(8) The feature vectors obtained in step (7) are concatenated with the text feature vectors obtained in step (5) and fed into a linear layer for feature dimension conversion, mapping the feature vector of each word to the number of entity label categories; the label probabilities are then obtained by decoding with a conditional random field network.
(9) Losses are computed between the label probabilities obtained in steps (3) and (8) and the ground-truth labels, and the two losses are then used to optimize the parameters of the Bayesian neural network and the multi-modal network, respectively (see the sketch below).
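The following is a minimal PyTorch-style sketch of how the two losses in step (9) could drive two separate optimizers. The module and attribute names (bayesian_net, fusion_net, .loss) are hypothetical placeholders and not part of the original disclosure.

```python
# Hypothetical sketch of the dual-loss optimization in training step (9).
# `bayesian_net` and `fusion_net` are assumed to be nn.Module objects whose
# `.loss(...)` methods return the negative ELBO and the multi-modal loss.
def training_step(bayesian_net, fusion_net, batch, opt_bayes, opt_fusion):
    # Candidate-label branch: text only, negative-ELBO loss.
    elbo_loss = bayesian_net.loss(batch["text_feats"], batch["labels"])
    opt_bayes.zero_grad()
    elbo_loss.backward()
    opt_bayes.step()

    # Correction branch: text + image, multi-modal fusion loss.
    fusion_loss = fusion_net.loss(batch["text_feats"], batch["image_feats"], batch["labels"])
    opt_fusion.zero_grad()
    fusion_loss.backward()
    opt_fusion.step()
    return elbo_loss.item(), fusion_loss.item()
```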
In the inference stage, entity label prediction likewise consists of two stages, candidate label generation and label correction, as follows:
(1) In candidate label generation, features are first extracted from the input sentence with the feature extraction model of training step (1). The extracted features are then fed into the Bayesian neural network T times; each time, a set of parameters is sampled from the posterior probability distribution and a probability output is obtained by forward propagation. After T samplings, T probability outputs are obtained.
(2) The T probability outputs are averaged to obtain the label probabilities, and for each word the label with the highest probability is taken. The invention obtains the corresponding label uncertainty by computing the entropy of the output label probabilities: the higher the uncertainty, the more likely the predicted label of the word is wrong.
(3) In the label correction stage, the text and the image are fed into the multi-modal feature fusion network to obtain the fused features, which, after feature dimension conversion by the linear layer, are decoded by the conditional random field network to obtain the corresponding correction labels.
(4) Finally, the candidate labels are corrected with the correction labels. An appropriate threshold is set: if the uncertainty of a candidate label generated in the first stage is greater than the threshold, the label is corrected; otherwise, the label generated in the first stage is kept.
Specifically, in step (1), the BERT-base-uncased version of the pre-trained BERT model is used to initialize the word vectors of the input sentence, yielding word feature vectors C = [c_0, c_1, ..., c_n], where the feature vector of each word has 768 dimensions.
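A minimal sketch of this feature extraction step, assuming the HuggingFace transformers library and first-sub-word pooling (the pooling strategy and the example sentence are assumptions, not stated in the original text):

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Contextual word features from bert-base-uncased (768 dimensions per token).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

sentence = ["Kevin", "Durant", "joins", "the", "Warriors"]   # example input
enc = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state        # [1, num_subwords, 768]

# Use the first sub-word piece as the representation of each word.
word_ids = enc.word_ids(0)
first_piece = [word_ids.index(i) for i in range(len(sentence))]
C = hidden[0, first_piece]                        # C = [c_0, ..., c_n], each 768-dim
print(C.shape)                                    # torch.Size([5, 768])
```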
In step (2), the text feature vectors are fed into the Bayesian bidirectional LSTM. The parameters of the Bayesian neural network form a random variable ω, and the posterior distribution is approximated by a Gaussian q_θ(ω) = N(ω | μ, σ²), where μ is the mean and σ² is the variance. The Bayesian neural network has 1 layer and 768 hidden-layer neurons. After encoding by the Bayesian neural network, the feature vectors are fed into a linear layer for feature dimension conversion to obtain a new feature vector T; the input dimension of the linear layer is 768 and the output dimension is the number of entity label categories, which is 11 in the method of the invention. The computation is as follows:
h_i^f = LSTM_f(c_i, h_{i-1}^f)    (1)
h_i^b = LSTM_b(c_i, h_{i+1}^b)    (2)
h_i = [h_i^f ; h_i^b]    (3)
T = Linear(h)    (4)
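A minimal PyTorch sketch of the Bayesian BiLSTM tagger follows. As a simplification made here for brevity (an assumption), only the output projection carries a Gaussian variational posterior, whereas the patent places the distribution over the network parameters ω as a whole; the class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with Gaussian variational posterior q(w) = N(mu, sigma^2);
    a new weight sample is drawn on every forward pass (reparameterization)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))  # sigma = softplus(rho)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)      # one posterior sample
        return F.linear(x, w, self.bias)

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights.
        sigma = F.softplus(self.rho)
        return (0.5 * (sigma ** 2 + self.mu ** 2 - 1) - torch.log(sigma)).sum()

class BayesianBiLSTMTagger(nn.Module):
    def __init__(self, d_model=768, hidden=768, num_labels=11):
        super().__init__()
        self.bilstm = nn.LSTM(d_model, hidden // 2, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.out = BayesianLinear(hidden, num_labels)

    def forward(self, C):              # C: [batch, seq_len, 768] BERT word features
        h, _ = self.bilstm(C)          # Eqs. (1)-(3): BiLSTM encoding
        return self.out(h)             # Eq. (4): per-word label logits T

tagger = BayesianBiLSTMTagger()
logits = tagger(torch.randn(2, 20, 768))
print(logits.shape)                    # torch.Size([2, 20, 11])
```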
In step (3), the feature vector t_i of each word in the sentence is fed into a softmax layer to obtain the class probability p(i) of each word, where softmax is computed as:
p(i) = softmax(t_i), i.e., the probability of class c is exp(t_i,c) / Σ_{c'} exp(t_i,c')    (5)
In step (4), features are extracted from the text and the image with the pre-trained networks BERT and ResNet, respectively. BERT uses the BERT-base-uncased version, and the feature vector of each word has 768 dimensions; ResNet uses ResNet-152, taking the last convolutional layer as output, so each picture is represented by 7 × 7 = 49 region feature vectors, each of dimension 2048. In the feature conversion, the linear conversion layer has an input dimension of 2048 and an output dimension of 768.
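A sketch of this extraction and projection, assuming torchvision's pre-trained ResNet-152 with standard ImageNet preprocessing (the preprocessing values and the file name are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# 7x7 = 49 region features of dimension 2048 from the last convolutional stage
# of ResNet-152, projected to 768 to match the BERT word features.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()   # drop avgpool + fc
project = nn.Linear(2048, 768)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("tweet_image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(img)                       # [1, 2048, 7, 7]
V = project(fmap.flatten(2).transpose(1, 2))   # [1, 49, 768] region feature vectors
print(V.shape)
```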
In step (5), a multi-head attention mechanism is used to capture the correlations among the words in the sentence and among the region blocks in the image. The invention uses 12 attention heads in total, with a hidden feature dimension of 64 per head. In each attention head, a new feature representation of a word or image region is first obtained through the attention mechanism, computed as follows:
head_t = Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t    (6)
head_v = Attention(Q_v, K_v, V_v) = softmax(Q_v K_v^T / √d_k) V_v    (7)
where Q_t, K_t, V_t are obtained by transforming the word vector representations through three fully connected layers, Q_v, K_v, V_v are obtained from the image region feature vectors through three other fully connected layers, and d_k equals 64. After the single-head attention is obtained, the outputs of the multiple heads are concatenated and passed through a fully connected layer to obtain the encoded representations of the words and of the visual region blocks, computed as follows:
m_t = MultiHead(Q_t, K_t, V_t) = concat(head_t1, ..., head_th) W_t    (8)
m_v = MultiHead(Q_v, K_v, V_v) = concat(head_v1, ..., head_vh) W_v    (9)
To prevent vanishing gradients, the output of the multi-head attention network is further passed through a residual connection and a normalization layer to obtain the output of the network, computed as follows:
h_mt = LayerNorm(m_t + C)    (10)
h_mv = LayerNorm(m_v + V)    (11)
where C is the text feature representation output in step (4) and V is the image feature representation output in step (4).
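A minimal sketch of the intra-modal encoder of step (5) and Eqs. (10)-(11), built on PyTorch's nn.MultiheadAttention rather than the patent's own implementation (the class name is illustrative):

```python
import torch
import torch.nn as nn

class IntraModalEncoder(nn.Module):
    """Multi-head self-attention (12 heads of 64 dims) + residual + LayerNorm."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: [batch, len, 768]
        m, _ = self.attn(x, x, x)          # queries, keys, values all from x
        return self.norm(m + x)            # h = LayerNorm(m + x), Eqs. (10)-(11)

text_encoder, image_encoder = IntraModalEncoder(), IntraModalEncoder()
h_mt = text_encoder(torch.randn(2, 20, 768))    # word features C
h_mv = image_encoder(torch.randn(2, 49, 768))   # region features V
print(h_mt.shape, h_mv.shape)
```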
In step (6), the intra-modal text and image features extracted by the self-attention networks are fed into the multi-modal fusion network to capture the correlations between the modalities. This network again uses the multi-head attention mechanism of step (5), with the text feature vectors as the query vectors and the image features as the key and value vectors; the computation is analogous to step (5) and is not repeated here. The feature vector output by this step is denoted P_mv (a sketch follows below).
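A minimal sketch of this cross-modal attention, again using nn.MultiheadAttention (an assumption about the concrete implementation):

```python
import torch
import torch.nn as nn

# Text features act as queries, image features as keys and values,
# producing one visually-attended vector P_mv per word.
cross_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

h_mt = torch.randn(2, 20, 768)   # text output of step (5)
h_mv = torch.randn(2, 49, 768)   # image output of step (5)
P_mv, attn_weights = cross_attn(query=h_mt, key=h_mv, value=h_mv)
print(P_mv.shape)                # torch.Size([2, 20, 768])
```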
In step (7), the multi-modal feature vectors and the text feature vectors output in step (5) are fed into a visual gating network, which is mainly used to compute the association strength between the visual information and each word. Some words in a sentence, such as 'a' and 'the', are rarely associated with the visual information in the image and do not need a corresponding visual representation. The gating network computes an intensity coefficient, called the visual intensity coefficient in the present invention, which represents the degree to which the visual features contribute to the text features; it is computed as follows:
g = σ((W_t)^T h_mt + (W_v)^T P_mv)    (12)
After the visual intensity coefficient is obtained, it is multiplied by the corresponding multi-modal visual feature representation to obtain the final multi-modal visual feature representation B = g ⊙ P_mv.
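A minimal sketch of the visual gate of Eq. (12); scaling the fused features P_mv follows the formulation of claim 4, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class VisualGate(nn.Module):
    """Element-wise gate g = sigmoid(W_t^T h_mt + W_v^T P_mv) applied to P_mv."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_t = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h_mt, P_mv):
        g = torch.sigmoid(self.w_t(h_mt) + self.w_v(P_mv))   # visual intensity, Eq. (12)
        return g * P_mv                                       # gated multi-modal feature B

gate = VisualGate()
B = gate(torch.randn(2, 20, 768), torch.randn(2, 20, 768))
print(B.shape)
```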
In step (8), the multi-modal visual feature representation B obtained in step (7) is concatenated with the text feature representation h_mt obtained in step (5), and feature dimension conversion is performed through a linear layer to obtain the final feature vector representation H:
H = Linear([B; h_mt])    (13)
The feature vectors are then decoded by the conditional random field to output the label probability information.
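A minimal sketch of the CRF training loss and decoding, assuming the third-party pytorch-crf package (the patent does not name a specific implementation):

```python
import torch
from torchcrf import CRF   # pip install pytorch-crf

num_labels = 11
crf = CRF(num_labels, batch_first=True)

emissions = torch.randn(2, 20, num_labels)        # H after the linear layer
tags = torch.randint(0, num_labels, (2, 20))      # gold label ids
mask = torch.ones(2, 20, dtype=torch.bool)

nll = -crf(emissions, tags, mask=mask)            # negative log-likelihood (training)
best_paths = crf.decode(emissions, mask=mask)     # most likely label sequences
print(nll.item(), best_paths[0][:5])
```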
For the losses in step (9), the Bayesian neural network is optimized with the negative evidence lower bound (ELBO) loss, while the multi-modal fusion network is optimized with the cross-entropy loss; the two loss functions are defined as follows:
L_ELBO = KL(q_θ(ω) || p(ω)) − E_{q_θ(ω)}[log p(D|ω)]    (14)
L_CE = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{N} y_i log y_i'    (15)
In Equation (14), log p(D|ω) is the log-likelihood term, q_θ(ω) is the posterior distribution of the parameters, p(ω) is the prior distribution of the parameters, and KL is the relative entropy between the two distributions, also called the Kullback-Leibler (KL) divergence. In the cross-entropy loss, y_i is the true label of word i, y_i' is the predicted probability output for word i, T is the batch size during training, and N is the maximum number of words in each sentence.
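A minimal sketch of the two losses; the standard-normal prior and the fully factorized Gaussian posterior are assumptions used here to make the KL term closed-form:

```python
import torch
import torch.nn.functional as F

def negative_elbo(logits, labels, mu, sigma):
    """Eq. (14): KL(q||p) minus the log-likelihood, with q = N(mu, sigma^2), p = N(0, 1)."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    kl = (0.5 * (sigma ** 2 + mu ** 2 - 1) - torch.log(sigma)).sum()
    return kl + nll

def cross_entropy_loss(logits, labels):
    """Eq. (15): token-level cross-entropy for the multi-modal branch."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

logits = torch.randn(2, 20, 11)
labels = torch.randint(0, 11, (2, 20))
mu, sigma = torch.zeros(100), torch.full((100,), 0.1)
print(negative_elbo(logits, labels, mu, sigma).item(), cross_entropy_loss(logits, labels).item())
```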
In step (1) of the inference stage, the number of samples T is set to 64; that is, the same sentence is fed into the network 64 times, so each word obtains 64 probability outputs.
In step (2) of the inference stage, the predicted probability output of each word is the average of the 64 samples, computed as follows:
p_i = (1/T) Σ_{t=1}^{T} p_i^(t)    (16)
The uncertainty of each label is the entropy of the averaged class probabilities, computed as follows:
u_i = −Σ_{c=1}^{C} p_{i,c} log p_{i,c}    (17)
if the entropy is larger, it indicates that the prediction is less reliable.
In step (3) of the inference stage, a new input sentence and the corresponding picture are fed into the multi-modal network for feature extraction; finally, feature dimension conversion is performed by the linear layer and the correction labels are obtained through the conditional random field.
In step (4) of the inference stage, the candidate labels generated in stage 1 are corrected with the correction labels output by the multi-modal network. In the concrete correction process, a threshold is set to indicate whether a label should be corrected: if the uncertainty of a label generated in stage 1 is greater than the threshold, the label is corrected; otherwise, the label generated in the first stage is kept as the final predicted label. The choice of the uncertainty threshold depends on the data set; concretely, the threshold is chosen such that, after correction, the model attains the maximum F1 value.
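A minimal sketch of the correction rule; the threshold value used in the example is arbitrary and would in practice be tuned on each data set as described above:

```python
def correct_labels(candidate, uncertainty, corrected, threshold=0.5):
    """Keep the candidate label unless its uncertainty exceeds the threshold,
    in which case the correction label from the multi-modal network is used."""
    return [c if u <= threshold else r
            for c, u, r in zip(candidate, uncertainty, corrected)]

# Example: the third word is uncertain, so its correction label replaces the candidate.
print(correct_labels(["O", "B-PER", "O"], [0.1, 0.2, 1.3], ["O", "B-PER", "B-LOC"]))
```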
Drawings
Fig. 1 is the main framework of the network model proposed by the present invention.
Fig. 2 is a structure of a multimodal fusion network.
FIG. 3 is a graph of the performance variation of the model at different thresholds on two data sets, Twitter-2015 and Twitter-2017.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
Fig. 1 shows the structure of the whole network, which consists of two parts: the Bayesian neural network and the multi-modal fusion network. The Bayesian neural network accepts text data as input and outputs a predicted label for each word together with the corresponding uncertainty. In the Bayesian neural network, the input sentence is first encoded by the pre-trained language model BERT to obtain an initialized vector representation. This representation is then input to the Bayesian bidirectional LSTM, whose parameters are random variables and whose posterior distribution is approximated by variables following a Gaussian distribution. The vectors output by the Bayesian neural network pass through a linear layer and are fed into the softmax classification network to obtain the probability information. Since the parameters of the network are random variables, its outputs also follow a probability distribution; to obtain the probability output of each label, the model is sampled T times, and the label probabilities are obtained by averaging the results of the multiple samplings. The uncertainty value of the model is obtained by computing the entropy of the label probabilities. In the multi-modal network, initial features are extracted from the text and the corresponding image by the pre-trained text model BERT and the pre-trained image model ResNet; the extracted features are fed into two self-attention-based modules that extract the intra-modal correlations, the text and image features are then fed into the multi-modal fusion network with the visual gating mechanism to obtain a fused feature representation, and this representation, after dimension conversion by a linear layer, is decoded by the conditional random field to output the correction labels. Finally, these labels are used to correct the candidate labels: with an appropriate threshold, labels whose uncertainty is greater than the threshold are corrected, and labels whose uncertainty is smaller than the threshold are not.
Fig. 2 shows the multi-modal fusion framework. Its inputs are the text and image features output by the two self-attention networks. Feature fusion is first performed by the multi-head attention mechanism, with the text features as the query vectors and the image features as the key and value vectors; the fused features are then fed into the visual gating network to obtain the visual intensity coefficient, and finally the intensity coefficient is multiplied by the corresponding features to obtain the final feature representation.
Fig. 3 shows how the F1 value of the model varies with the threshold on the two public datasets Twitter-2015 and Twitter-2017. When the threshold is 0, the model uses the correction labels output by the multi-modal framework as its output. As the figure shows, the F1 value first increases as the threshold grows, because part of the low-uncertainty labels generated in the first stage are retained. As the threshold increases further, the model mainly uses the candidate labels as output, and the accuracy drops rapidly because these labels lack visual information.
Tables 1 and 2 show the experimental results of the invention on the public datasets Twitter-2015 and Twitter-2017. The experiments show that, compared with the best existing models, the proposed model achieves the best results on the comprehensive evaluation metric F1.
TABLE 1 Experimental comparison of the inventive network model on the Twitter-2015 dataset with other existing models
(Table 1 is provided as an image in the original publication.)
TABLE 2 Experimental comparison of the inventive network model on the Twitter-2017 dataset with other existing models
(Table 2 is provided as an image in the original publication.)
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions. Any technical solution that can be realized on the basis of the above embodiments without creative effort shall be considered to fall within the protection scope of this patent.

Claims (5)

1. A multi-modal named entity recognition method based on uncertainty perception is characterized by comprising the following steps:
(1) feature extraction is performed on the input text and input image using the text pre-trained model BERT-base-uncased and the image pre-trained model ResNet152, respectively, where the image feature vectors are taken from the output of the last convolutional layer of ResNet152;
(2) a Bayesian bidirectional long short-term memory neural network is constructed; the text feature vectors are input into the Bayesian neural network, which outputs the candidate labels and the corresponding label uncertainties;
(3) a multi-modal interaction model (MIM) is constructed; the text features and image features are fed into the MIM to output the fused multi-modal features, which, after feature dimension conversion by a linear layer, are input into a conditional random field (CRF) decoding network to output the correction labels;
(4) the candidate labels are corrected with the correction labels.
2. The method according to claim 1, wherein the Bayesian neural network construction and training method in (2) comprises the following steps:
the Bayesian neural network is constructed on the basis of a bidirectional long short-term memory (BiLSTM) network; the parameter ω of the Bayesian bidirectional LSTM is a random variable, and the posterior probability p(ω|D) is approximated by a Gaussian distribution q_θ(ω), i.e., q_θ(ω) = N(μ, σ²); during training of the Bayesian neural network, the parameters are optimized with the negative evidence lower bound (ELBO) loss, whose calculation formula is shown below:
L = KL(q_θ(ω) || p(ω)) − E_{q_θ(ω)}[log p(D|ω)]    (1)
where log p(D|ω) is the log-likelihood term, p(ω) is the prior distribution of the parameters, and KL is the relative entropy between the two distributions, also known as the Kullback-Leibler (KL) divergence.
3. The method according to claim 1, wherein the candidate label generation and the corresponding label uncertainty calculation in (2) specifically comprise the following steps:
a new input sentence is repeatedly fed into the Bayesian neural network T times; each time, model parameters ω̂_t are sampled from the posterior probability distribution, and forward propagation yields T feature vectors; after feature dimension conversion by a linear layer, T probability outputs are obtained through softmax; the average of the T probability outputs is taken as the final probability output, and the class with the maximum probability is taken as the label category, calculated as shown below:
p_i = (1/T) Σ_{t=1}^{T} softmax(W_i h_i)    (2)
where W_i is a parameter of the model to be trained and h_i is the feature vector obtained after multi-modal fusion; the uncertainty of the label is obtained by computing the entropy of the probability of each category, calculated as shown below:
u = −Σ_{c=1}^{C} p_c log p_c    (3)
where C is the number of entity classes.
4. The method according to claim 1, wherein the multi-modal feature fusion in (3) proceeds as follows:
given a text feature vector C and an image feature vector V, they are first fed into two self-attention networks to obtain the intra-modal feature representations C_i and T_i, respectively; C_i and T_i then enter a fusion framework based on the multi-head attention mechanism, in which the text feature vector C_i serves as the query vector and the image feature vector serves as the key and value vectors; the feature fusion is calculated as shown below:
T_MV = softmax(Q K^T / √d_k) V, where Q = W'_q C_i, K = W'_k T_i, V = W'_v T_i    (4)
where W'_q, W'_k, W'_v are parameters of the model to be trained and d_k equals 64; after the multi-modal feature fusion, an intensity coefficient is calculated by the visual gating network, called the visual intensity coefficient, which represents the degree to which the visual features contribute to the text features; it is calculated as shown below:
g = σ((W_T)^T C_i + (W_V)^T T_MV)    (5)
then the visual intensity coefficient is multiplied by the corresponding multi-modal feature vector to obtain the visually guided text feature representation B = g ⊙ T_MV; finally, this feature vector and the text feature vector C_i are concatenated (concat) to obtain the final multi-modal feature representation.
5. The method according to claim 1, wherein the label correction process in (4) is as follows:
when the labels are corrected, an appropriate uncertainty threshold needs to be selected; if the uncertainty of a candidate label is higher than the threshold, the correction label is used for correction; otherwise, the candidate label is kept; the uncertainty threshold is selected per data set, by setting the threshold so that, after the above correction, the model achieves the maximum F1 value on the data set.
CN202011140620.7A 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception Active CN114386412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011140620.7A CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011140620.7A CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Publications (2)

Publication Number Publication Date
CN114386412A true CN114386412A (en) 2022-04-22
CN114386412B CN114386412B (en) 2023-10-13

Family

ID=81194739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011140620.7A Active CN114386412B (en) 2020-10-22 2020-10-22 Multi-mode named entity recognition method based on uncertainty perception

Country Status (1)

Country Link
CN (1) CN114386412B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744505A (en) * 2024-02-20 2024-03-22 电子科技大学 Deep learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling
CN117744505B (en) * 2024-02-20 2024-04-26 电子科技大学 Depth learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060088207A1 (en) * 2004-10-22 2006-04-27 Henry Schneiderman Object recognizer and detector for two-dimensional images using bayesian network based classifier
CN104008208A (en) * 2014-06-19 2014-08-27 北京大学 Situation recognition system and method based on opportunity perception
CN107563418A (en) * 2017-08-19 2018-01-09 四川大学 A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings
CN110929714A (en) * 2019-11-22 2020-03-27 北京航空航天大学 Information extraction method of intensive text pictures based on deep learning
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060088207A1 (en) * 2004-10-22 2006-04-27 Henry Schneiderman Object recognizer and detector for two-dimensional images using bayesian network based classifier
CN104008208A (en) * 2014-06-19 2014-08-27 北京大学 Situation recognition system and method based on opportunity perception
CN107563418A (en) * 2017-08-19 2018-01-09 四川大学 A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings
CN110929714A (en) * 2019-11-22 2020-03-27 北京航空航天大学 Information extraction method of intensive text pictures based on deep learning
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN ZHOU等: "Joint Entity and Relation Extraction Based on Reinforcement Learning" *
魏萍 等: "基于触发词语义选择的Twitter事件共指消解研究" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744505A (en) * 2024-02-20 2024-03-22 电子科技大学 Deep learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling
CN117744505B (en) * 2024-02-20 2024-04-26 电子科技大学 Depth learning-based inversion method for electromagnetic wave resistivity of azimuth while drilling

Also Published As

Publication number Publication date
CN114386412B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110427617B (en) Push information generation method and device
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
Zhang et al. Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
CN109829499B (en) Image-text data fusion emotion classification method and device based on same feature space
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN113705315A (en) Video processing method, device, equipment and storage medium
Khan et al. A deep neural framework for image caption generation using gru-based attention mechanism
Luo et al. A thorough review of models, evaluation metrics, and datasets on image captioning
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
Qi et al. Video captioning via a symmetric bidirectional decoder
CN114386412B (en) Multi-mode named entity recognition method based on uncertainty perception
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN113553445A (en) Method for generating video description
CN112287690A (en) Sign language translation method based on conditional sentence generation and cross-modal rearrangement
CN116702094B (en) Group application preference feature representation method
Ouenniche et al. Vision-text cross-modal fusion for accurate video captioning
Jia et al. Speaker-Aware Interactive Graph Attention Network for Emotion Recognition in Conversation
CN117094291B (en) Automatic news generation system based on intelligent writing
Preethi et al. Video Captioning using Pre-Trained CNN and LSTM
CN116955699B (en) Video cross-mode search model training method, searching method and device
Aafaq ‘Deep learning for natural language description of videos
Kong Research Advanced in Multimodal Emotion Recognition Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant