CN116956920A - Multi-modal named entity recognition method with multi-task collaborative characterization - Google Patents


Info

Publication number
CN116956920A
CN116956920A (application CN202310752673.1A)
Authority
CN
China
Prior art keywords: representation, text, modal, visual, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310752673.1A
Other languages
Chinese (zh)
Inventor
王海荣
徐玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North Minzu University
Original Assignee
North Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North Minzu University
Priority to CN202310752673.1A
Publication of CN116956920A

Classifications

    • G06F40/295 Natural language analysis: Named entity recognition
    • G06F18/21324 Feature extraction based on discrimination criteria: projections, e.g. Fisherface techniques
    • G06F18/21328 Feature extraction based on discrimination criteria: subspace restrictions, e.g. nullspace techniques
    • G06F18/24 Pattern recognition: Classification techniques
    • G06F18/253 Pattern recognition: Fusion techniques of extracted features
    • G06F40/30 Handling natural language data: Semantic analysis
    • G06N3/0455 Neural networks: Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Neural networks: Convolutional networks [CNN, ConvNet]
    • G06N3/047 Neural networks: Probabilistic or stochastic networks
    • G06N3/0475 Neural networks: Generative networks
    • G06N3/048 Neural networks: Activation functions
    • G06N3/09 Learning methods: Supervised learning
    • G06N3/094 Learning methods: Adversarial learning
    • G06V10/82 Image or video recognition or understanding using neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a multi-modal named entity recognition method based on multi-task collaborative characterization. A multi-modal feature fusion layer is called to encode and fuse the text representation, character representation, object-level visual labels, visual description keywords and regional visual features, obtaining a multi-modal characterization; a text feature enhancement layer is called to fuse the text representation and the character representation, enhancing the text semantics and obtaining a text characterization; a multi-task label decoder is called to decode the text characterization and the multi-modal characterization, obtaining a text-characterization prediction sequence and a multi-modal-characterization prediction sequence; and a label fusion task fuses the two predicted label sequences to obtain the final predicted named entity labels. The method markedly improves multi-modal named entity recognition performance on image-text data and has good application prospects.

Description

Multi-modal named entity recognition method with multi-task collaborative characterization
Technical Field
The invention relates to the technical field of multi-modal information extraction, and in particular to a multi-modal named entity recognition method with multi-task collaborative characterization.
Background
With the wide application of multimedia technology, multi-modal data such as text, images and audio are continuously emerging. These data contain rich semantic information, yet existing mining techniques cannot adequately support mining the semantics of multi-modal data. Multi-modal mining methods are therefore attracting increasing attention, and multi-modal named entity recognition (Multimodal Named Entity Recognition, MNER), as one of their key tasks, has become a research hotspot.
MNER aims to identify named entities, such as person names, place names and organization names, from multi-modal data. Since Moon et al. first proposed MNER in 2018, methods such as MA, VAM and CWI have been developed in succession; they use attention mechanisms to fuse text representations with regional visual features, addressing multi-modal information fusion and filtering. However, the semantics of the text representation and the visual features are asymmetric, and GloVe word representations yield weak entity semantics. To address these problems, BiLSTM-based MNER methods such as ACN, BA-GAN and DCN were proposed; they process text features with BiLSTM and enhance the entity semantics of the text by aggregating context information before fusing with image features. To further reduce the semantic gap between text and image features, the first Transformer-based MNER method was proposed in 2020: Chen et al. used BERT for text representation and verified the importance of word semantics. Subsequent methods such as UMGF, ITJNER and MAF use Transformers to encode, fuse or align the modal features, easing the interaction between text and image features, but the generated multi-modal features still deviate from the overall semantics of the text representation. To this end, methods that jointly process the two subtasks of text representation and multi-modal representation, such as UMT, ITA and UAMNER, assist named entity recognition with text features to counter visual bias; others train the multi-modal representation or unify the feature spaces with auxiliary tasks, as in HVPNET and MLMNER, to obtain fully fused features.
In summary, multi-modal named entity recognition for image-text data has achieved certain results in feature fusion and joint task training. Building on joint encoding and cross-modal attention feature fusion, the invention addresses three remaining problems: semantic deviation of text features and insufficient image feature description; insufficient inter-modal feature fusion caused by relying on a single joint-encoding or cross-modal-attention fusion module; and label decoders that extract only the overall entity information of words without fully mining the entity boundary information and word category information hidden in the features, which prevents better decoding.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a multi-modal named entity recognition method with multi-task collaborative characterization. A text feature enhancement layer is constructed to use the character representation to complement the semantics of the text representation and obtain the text characterization; visual description keywords, object-level visual labels and regional visual features are introduced to cooperatively express the image semantics; multi-modal feature fusion is realized by combining image-text joint encoding with cross-modal attention to obtain the multi-modal characterization; a multi-task label decoder is constructed to decode the text characterization and the multi-modal characterization into a text-characterization prediction sequence and a multi-modal-characterization named entity prediction sequence; and the two prediction sequences are aligned through a label fusion layer, thereby realizing effective recognition of named entities in image-text data.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a multi-modal named entity recognition method with multi-task collaborative characterization, comprising the following steps:
1) Collecting image-text data of different modalities, including text and images;
2) Performing feature representation on the collected image-text data, including the text representation and character representation obtained by encoding the text with BERT and CNN respectively, and the visual description keywords, object-level visual labels and regional visual features obtained by extracting image features with Image Captioning, Mask R-CNN and ResNet respectively;
3) Constructing a text feature enhancement layer with cross-modal attention to fuse the text representation and the character representation, enhancing the semantics of the text representation and obtaining the text characterization;
4) Constructing a multi-modal fusion layer, in which the text, the visual description keywords and the object-level visual labels are jointly encoded with BERT to obtain the multi-modal representation; constructing a text feature enhancement layer with cross-modal attention to fuse the character representation and the multi-modal representation and obtain the multi-modal text enhancement representation; and constructing a visual feature enhancement layer with cross-modal attention to fuse the multi-modal text enhancement representation and the regional visual features and obtain the multi-modal characterization, where an adversarial classification network guides the regional visual features so that their feature-space distribution is consistent with that of the multi-modal text enhancement representation, which benefits their fusion;
5) Decoding the text characterization and the multi-modal characterization with a multi-task label decoder to obtain the prediction sequence of the text characterization and the prediction sequence of the multi-modal characterization; the multi-task label decoder integrates three detection tasks, namely entity boundary detection, word entity category detection and entity labelling detection, together with an entity-boundary-detection conversion matrix and a word-entity-category-detection conversion matrix, and realizes named entity recognition through multi-task collaborative learning;
6) Constructing a label fusion layer with a KL divergence loss function; by constraining the consistency of the prediction sequence of the text characterization and the prediction sequence of the multi-modal characterization, visual bias is avoided and the multi-modal prediction sequence becomes more accurate, yielding the final predicted named entity labels and realizing effective recognition of named entities.
Further, in step 1), the collected image-text data are expressed as:
D′ = {(E_i, I_i)}_{i=1}^{n′}
where E_i represents a text instance, I_i represents an image instance, i = 1 denotes the first image-text pair, and D′ denotes the whole set of n′ image-text pairs.
Further, in step 2), the collected image-text data are respectively represented as:
D = {(B_i, C_i, K_i, L_i, R_i)}_{i=1}^{n′}
where B denotes the text representation obtained by BERT encoding, C the character representation obtained by CNN encoding, K the visual description keywords extracted with Image Captioning, L the object-level visual labels extracted with Mask R-CNN, and R the regional visual features extracted with ResNet; i = 1 denotes the first image-text pair and D contains n′ image-text pairs. The individual features are obtained as follows:
a. text representation: given a sentence E, namely a text, the sentence E is input into a pre-training language model BERT as a text coding layer of the whole model, and the calculation formula is as follows:
B=BERT(E)
wherein E is a sentence containing n words, and the text representation B represents a feature vector obtained by inputting the sentence E into the BERT;
b. character representation: given a sentence E, it is input into CNN, and its calculation formula is:
C=CNN(E)
wherein, the character representation C represents a feature vector obtained after the sentence E is input into the CNN;
c. regional visual characteristics: giving an image I, inputting the image I into a pre-trained ResNet, extracting regional image characteristics, and calculating the formula as follows:
R_V = ResNet(I)
wherein I is a colour image and R_V is the feature vector of the regional visual features of the picture;
d. visual description keywords: given an image I, it is input into a pre-trained Image Captioning model to extract the visual description keywords, with the calculation formula:
K=Image-Captioning(I)
Wherein K represents a visual description keyword related to the picture;
e. object-level visual label: given an image I, the image I is input into a pretrained Mask R-CNN, an object-level visual tag is extracted, and a calculation formula is as follows:
L=MaskR-CNN(I)
where L is the extracted object-level visual tag.
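As a concrete illustration of step 2), the sketch below shows one way the five feature extractors could be wired up with common PyTorch and HuggingFace components. The checkpoint names, the character-CNN hyper-parameters and the captioning interface are illustrative assumptions and are not prescribed by the patent text.

```python
# Illustrative feature extraction for step 2); model checkpoints and
# hyper-parameters below are assumptions, not values fixed by the patent.
import torch
import torch.nn as nn
import torchvision
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

def text_representation(sentence: str) -> torch.Tensor:
    """B = BERT(E): contextual token vectors of the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    return bert(**enc).last_hidden_state            # (1, n, 768)

class CharCNN(nn.Module):
    """C = CNN(E): embed characters, convolve, max-pool one vector per word."""
    def __init__(self, n_chars=128, char_dim=32, out_dim=768, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=kernel, padding=1)

    def forward(self, char_ids):                     # (n_words, max_chars)
        x = self.embed(char_ids).transpose(1, 2)     # (n_words, char_dim, max_chars)
        return torch.relu(self.conv(x)).max(dim=2).values   # (n_words, out_dim)

# R_V = ResNet(I): keep the last convolutional feature map as regional features.
resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1").eval()
region_backbone = nn.Sequential(*list(resnet.children())[:-2])

def regional_features(image: torch.Tensor) -> torch.Tensor:   # (1, 3, 224, 224)
    fmap = region_backbone(image)                    # (1, 2048, 7, 7)
    return fmap.flatten(2).transpose(1, 2)           # (1, 49, 2048) region vectors

# L = Mask R-CNN(I): object-level visual labels from a pre-trained detector.
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def object_labels(image: torch.Tensor, score_thr=0.7):
    with torch.no_grad():
        pred = detector([image.squeeze(0)])[0]
    return pred["labels"][pred["scores"] > score_thr].tolist()

# K = Image-Captioning(I): the captioner is treated as an opaque callable here,
# since the patent does not fix a particular captioning model.
def caption_keywords(image, captioner) -> list:
    return captioner(image)
```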
Further, in step 3), the text feature enhancement layer passes the text representation B and the character representation C through a multi-head cross-modal attention network CMT to generate the semantically enhanced text characterization, calculated as:
U = CMT(B, C, C_m)
where C_m is the attention mask matrix of C, U = [u_1, u_2, …, u_n] ∈ R^{n×d}, u_1, u_2, …, u_n are the feature vectors of the text characterization corresponding to the 1st, 2nd, …, nth words of the text E, R^{n×d} indicates that the resulting text characterization is an n×d vector in the real feature space, n is the length of the text representation B, and d is the feature dimension of the text representation B;
the multi-head cross-modal attention network CMT is a multi-head attention network built on co-attention, calculated as:
CA = σ( (qW_q)(kW_k)^T / √d ) (vW_v)
MH-CA = [CA_1, CA_2, …, CA_g]
Ĝ = LN(q + MH-CA)
G = LN(Ĝ + FFN(Ĝ))
where q is the attention query in the CMT, k the attention key, v the attention value, W_q the projection weight of the query, W_k the projection weight of the key, W_v the projection weight of the value, σ() the softmax activation function, CA a co-attention head, and MH-CA the multi-head attention composed of g co-attention heads, with CA_1, CA_2, …, CA_g the 1st, 2nd, …, gth co-attention heads; LN() is the normalization layer, FFN() the feed-forward network layer, Ĝ the normalized feature, and G the final fused feature of the cross-modal input features q and k;
for ease of invocation, the multi-head cross-modal attention network CMT is written as:
H = CMT(q, k, v)
where q and k are the two input modal features, v is the attention value in the CMT and also serves as the attention mask of k, and H is the enhanced representation of q under the perception of k, i.e. the fused feature of the cross-modal inputs q and k.
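The CMT block defined above can be realised, for instance, with PyTorch's built-in multi-head attention. The sketch below follows the co-attention, residual LayerNorm and feed-forward structure described by the formulas; the head count, hidden size and mask handling are illustrative assumptions.

```python
# Minimal cross-modal attention block H = CMT(q, k, v); a sketch, not the
# patented implementation. q attends over k, then residual LayerNorm + FFN.
import torch.nn as nn

class CMT(nn.Module):
    def __init__(self, dim=768, heads=8, ff_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ff_mult * dim),
                                 nn.GELU(),
                                 nn.Linear(ff_mult * dim, dim))

    def forward(self, q, k, k_mask=None):
        # k_mask marks padded positions of k (True = ignore), playing the role
        # of the attention mask C_m / R_Vm in the formulas above.
        attended, _ = self.attn(q, k, k, key_padding_mask=k_mask)  # MH-CA(q, k)
        g_hat = self.ln1(q + attended)                             # Ĝ = LN(q + MH-CA)
        return self.ln2(g_hat + self.ffn(g_hat))                   # G = LN(Ĝ + FFN(Ĝ))

# Example: U = CMT(dim=768)(B, C, C_pad_mask) fuses the BERT text representation B
# with the character representation C to give the enhanced text characterization.
```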
Further, the step 4) includes the steps of:
4.1) Concatenating the visual description keywords K, the object-level visual labels L and the text E into [K; L; E], with K and L used as a text prefix, and jointly encoding [K; L; E] with BERT:
[M_KL; M_U] = BERT([K; L; E])
where M_KL is the multi-modal visual representation, M_U is the multi-modal representation, and e is the length of K and L;
4.2) To complement the semantic deficiency of the multi-modal representation, the text feature enhancement layer fuses the character representation for semantic complementation, obtaining the multi-modal text enhancement representation:
M_Uc = CMT(M_U, C, C_m)
where C_m is the attention mask representation of the character representation C and M_Uc is the multi-modal text enhancement representation in the real feature space;
4.3) M_Uc fuses the global and fine-grained visual entity semantics but does not yet capture the latent information about relations between objects; therefore cross-modal attention is used to fuse the regional visual features, complementing the image features and optimizing the multi-modal text enhancement representation to obtain the multi-modal characterization M:
M = CMT(M_Uc, FC(R_V), R_Vm)
where R_Vm is the attention mask of the projected regional visual features R_V, M_Uc is the multi-modal text enhancement representation, and FC() is the regional visual feature projection function;
4.4) To better project the regional visual features so that their feature-space distribution is aligned with that of M_Uc, which benefits the fusion of the regional visual features and the multi-modal text enhancement representation, a modal classification network is used to form an adversarial learning task that optimizes the projection linear layer, as sketched below; the adversarial task loss L_GAN is the loss of the modal classification network MLP(), a multi-layer perceptron.
Further, the step 5) includes the steps of:
5.1) To make label tagging more accurate, a multi-task label decoder is constructed that combines the prediction information of the entity boundary labels and the entity category labels through the conversion relations between labels to obtain the final prediction sequence. Projection functions first map the hidden vector H into the subspaces of the three tasks:
H_bio = FC_bio(H), H_plo = FC_plo(H), H_ner = FC_ner(H)
where c_bio, c_plo and c_ner are the projection dimensions of the three task projection functions FC_bio(), FC_plo() and FC_ner(); H_bio is the projection of the hidden vector in the entity boundary detection space, H_plo its projection in the word entity category detection space, and H_ner its projection in the entity labelling detection space;
H_bio and H_plo are then multiplied with the corresponding conversion matrices to compute the prediction vectors Ŷ_bio and Ŷ_plo:
Ŷ_bio = H_bio · W_bio→ner, Ŷ_plo = H_plo · W_plo→ner
and the prediction vectors Ŷ_bio, Ŷ_plo and H_ner are fused (weighted and summed) to obtain the predicted label representation Y;
taking the dependency between labels into account, a conditional random field is used to tag the prediction sequence y in Y:
score(Y, y) = Σ_i ( A_{y_i, y_{i+1}} + P_{i, y_i} )
P(y | Y) = exp(score(Y, y)) / Σ_{y′} exp(score(Y, y′))
where the predicted label representation Y is the label input vector, A_{y_i, y_{i+1}} is the transition score from label y_i to label y_{i+1}, P_{i, y_i} is the emission score of label y_i, exp() is the exponential function, P(y | Y) is the conditional probability distribution given Y, y′ ranges over all possible prediction vectors, and score() computes the score that the prediction vector is y;
the conditional maximum-likelihood function is used as the loss of the multi-task label decoder:
L_MLD = -log P(y | Y)
Denoting the multi-task label decoder by the function MLD(), the above computation can be written as:
(Y_H, L_MLD) = MLD(H)
where L_MLD is the loss of the multi-task label decoder and Y_H is the named entity label predicted by the multi-task label decoder from the hidden representation H;
5.2) The multi-task label decoder is used to decode the text characterization U obtained in step 3): with hidden representation H = U, the prediction sequence of the text characterization is Y_U and the loss is L_U, i.e. (Y_U, L_U) = MLD(U), where MLD() denotes the multi-task label decoder defined in step 5.1);
5.3) The multi-task label decoder is used to decode the multi-modal characterization M obtained in step 4.3): with hidden representation H = M, the prediction sequence of the multi-modal characterization is Y_M and the loss is L_M, i.e. (Y_M, L_M) = MLD(M), where MLD() denotes the multi-task label decoder defined in step 5.1).
Further, the step 6) includes the steps of:
6.1) A label fusion layer is constructed that uses the KL divergence to minimize the difference between the output distributions Y_U and Y_M of steps 5.2) and 5.3), thereby supervising the learning of the multi-modal characterization obtained in step 4.3) and of the predicted label Y_M in step 5.3):
L_KL = Σ P_θ(Y_U) log( P_θ(Y_U) / P_θ(Y_M) )
where L_KL is the KL divergence loss, P_θ() is the probability corresponding to the input vector, and log is the logarithmic function.
6.2) By summing the losses of all tasks, the tasks are fused to jointly optimize the text characterization and the multi-modal characterization and to obtain the final predicted label Y_M; the overall multi-task loss L is expressed as:
L = L_U + L_M + L_KL + L_GAN
where L_U is the loss function of the text-characterization named entity recognition task, L_M the loss function of the multi-modal-characterization named entity recognition task, L_KL the cross-view alignment loss function, and L_GAN the loss function of the modal classification network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. On the basis of the text representation obtained by BERT encoding, a CNN is used to encode the text into a character representation, complementing the semantics missing from the text representation and yielding a text characterization with richer and more comprehensive semantics.
2. Instead of the traditional use of a single image feature, three pre-trained models, Image Captioning, Mask R-CNN and ResNet, are used to extract the visual description keywords, object-level visual labels and regional visual features, which express the fine-grained and coarse-grained entity semantics and the hidden object features of the image, yielding a more comprehensive learning of the image semantics.
3. The image-text joint encoding layer produces a multi-modal representation in which the text, visual description keywords and object-level visual labels are fully fused, and the text feature enhancement layer and visual feature enhancement layer produce a multi-modal characterization in which the image-text features are fully fused, complementing the semantics missing from text and image.
4. To obtain the best named entity recognition decoding from the characterizations and to train them to carry named entity semantics, a multi-task label decoder is constructed so that the named entities, the named entity boundary information and the named entity category information can be decoded from the text characterization and the multi-modal characterization, which also means that the characterizations contain the semantic information of named entity boundaries, named entity categories and the named entities themselves.
5. To avoid multi-modal characterization bias caused by semantic conflicts between images and text, a label fusion layer is constructed; by constraining the consistency of the text-characterization prediction sequence and the multi-modal-characterization prediction sequence, the multi-modal prediction sequence becomes more accurate, yielding the final predicted named entity labels and realizing effective recognition of named entities.
6. To verify the effectiveness of the method, experiments were carried out on the public Twitter-2015 and Twitter-2017 multi-modal named entity recognition datasets. The text-only variant of the method was compared with 9 models such as BERT-CRF, UMT-T and MNER-QG-T, and the average F1 value improved by 1.20% and 1.94%, respectively; the full method was compared with 8 mainstream MNER models such as MAF, UMT and MNER-QG, and the average F1 value improved by 1.00% and 1.41%, respectively.
In summary, the multi-modal named entity recognition performance of the invention on image-text data is significantly improved, and the invention has good application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a multi-modal feature fusion graph.
Fig. 3 is a diagram of a multi-tasking tag decoder.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in fig. 1, this embodiment discloses a multi-modal named entity recognition method with multi-task collaborative characterization, which fuses multiple features from multiple modalities to cooperatively represent the semantics of named entities, thereby realizing named entity recognition in multi-modal data scenarios; it uses image and text feature extraction techniques and an image-text fusion method, and comprises the following steps:
1) Collecting image-text data of different modalities, including text and images; the collected image-text data are expressed as:
D′ = {(E_i, I_i)}_{i=1}^{n′}
where E_i represents a text instance, I_i represents an image instance, i = 1 denotes the first image-text pair, and D′ denotes the whole set of n′ image-text pairs.
2) The collected image-text data are respectively represented by features, including the text representation and character representation obtained by encoding the text with BERT and CNN respectively, and the visual description keywords, object-level visual labels and regional visual features obtained by extracting image features with Image Captioning, Mask R-CNN and ResNet respectively, as shown in the following formula:
D = {(B_i, C_i, K_i, L_i, R_i)}_{i=1}^{n′}
where B denotes the text representation obtained by BERT encoding, C the character representation obtained by CNN encoding, K the visual description keywords extracted with Image Captioning, L the object-level visual labels extracted with Mask R-CNN, and R the regional visual features extracted with ResNet; i = 1 denotes the first image-text pair and D contains n′ image-text pairs. The individual features are obtained as follows:
a. text representation: given a sentence E, namely a text, the sentence E is input into a pre-training language model BERT as a text coding layer of the whole model, and the calculation formula is as follows:
B=BERT(E)
wherein E is a sentence containing n words, and the text representation B represents a feature vector obtained by inputting the sentence E into the BERT;
b. character representation: given a sentence E, it is input into CNN, and its calculation formula is:
C=CNN(E)
wherein, the character representation C represents a feature vector obtained after the sentence E is input into the CNN;
c. regional visual characteristics: giving an image I, inputting the image I into a pre-trained ResNet model, extracting regional image characteristics, and adopting a calculation formula as follows:
R_V = ResNet(I)
wherein I is a colour image and R_V is the feature vector of the regional visual features of the picture;
d. visual description keywords: given an image I, it is input into a pre-trained Image Captioning model to extract the visual description keywords, with the calculation formula:
K=Image-Captioning(I)
Wherein K represents a visual description keyword related to the picture;
e. object-level visual label: given an image I, the image I is input into a pretrained Mask R-CNN, an object-level visual tag is extracted, and a calculation formula is as follows:
L=MaskR-CNN(I)
where L is the extracted object-level visual tag.
The five representation methods applied to the image-text data are illustrated in fig. 1; they serve to obtain the text semantics and a full description of the features in the image, respectively.
As can be seen from fig. 1, BERT and CNN are used as text encoders to obtain the text representation and the character representation, which are the inputs of the text feature enhancement layer and the multi-modal fusion layer.
3) After obtaining the text representation and the character representation, a text feature enhancement layer is constructed with cross-modal attention to fuse them, enhancing the semantics of the text representation and obtaining the text characterization. The text feature enhancement layer works as follows:
text feature enhancement layer: the text representation B and the character representation C are passed through a multi-head cross-modal attention network CMT to generate the semantically enhanced text characterization, calculated as:
U = CMT(B, C, C_m)
where C_m is the attention mask matrix of C, U = [u_1, u_2, …, u_n] ∈ R^{n×d}, u_1, u_2, …, u_n are the feature vectors of the text characterization corresponding to the 1st, 2nd, …, nth words of the text E, R^{n×d} indicates that the resulting text characterization is an n×d vector in the real feature space, n is the length of the text representation B, and d is the feature dimension of the text representation B;
the multi-head cross-modal attention network CMT is a multi-head attention network built on co-attention, calculated as:
CA = σ( (qW_q)(kW_k)^T / √d ) (vW_v)
MH-CA = [CA_1, CA_2, …, CA_g]
Ĝ = LN(q + MH-CA)
G = LN(Ĝ + FFN(Ĝ))
where q is the attention query in the CMT, k the attention key, v the attention value, W_q the projection weight of the query, W_k the projection weight of the key, W_v the projection weight of the value, σ() the softmax activation function, CA a co-attention head, and MH-CA the multi-head attention composed of g co-attention heads, with CA_1, CA_2, …, CA_g the 1st, 2nd, …, gth co-attention heads; LN() is the normalization layer, FFN() the feed-forward network layer, Ĝ the normalized feature, and G the final fused feature of the cross-modal input features q and k;
for ease of invocation, the multi-head cross-modal attention network CMT is written as:
H = CMT(q, k, v)
where q and k are the two input modal features, v is the attention value in the CMT and also serves as the attention mask of k, and H is the enhanced representation of q under the perception of k, i.e. the fused feature of the cross-modal inputs q and k.
The lower left part of fig. 1 shows that, after the text representation and the character representation are extracted, they are fed into the text feature enhancement layer to complete the text semantics and obtain the text characterization.
4) A multi-modal fusion layer is constructed, in which the text, the visual description keywords and the object-level visual labels are jointly encoded with BERT to obtain the multi-modal representation; a text feature enhancement layer is constructed with cross-modal attention to fuse the character representation and the multi-modal representation and obtain the multi-modal text enhancement representation; and a visual feature enhancement layer is constructed with cross-modal attention to fuse the multi-modal text enhancement representation and the regional visual features and obtain the multi-modal characterization, where an adversarial classification network guides the regional visual features so that their feature-space distribution is consistent with that of the multi-modal text enhancement representation, which benefits their fusion. This comprises the following parts:
4.1) Concatenating the visual description keywords K, the object-level visual labels L and the text E into [K; L; E], with K and L used as a text prefix, and jointly encoding [K; L; E] with BERT:
[M_KL; M_U] = BERT([K; L; E])
where M_KL is the multi-modal visual representation, M_U is the multi-modal representation, and e is the length of K and L;
4.2) To complement the semantic deficiency of the multi-modal representation, the text feature enhancement layer fuses the character representation for semantic complementation, obtaining the multi-modal text enhancement representation:
M_Uc = CMT(M_U, C, C_m)
where C_m is the attention mask representation of the character representation C and M_Uc is the multi-modal text enhancement representation in the real feature space;
4.3) M_Uc fuses the global and fine-grained visual entity semantics but does not yet capture the latent information about relations between objects; therefore cross-modal attention is used to fuse the regional visual features, complementing the image features and optimizing the multi-modal text enhancement representation to obtain the multi-modal characterization M:
M = CMT(M_Uc, FC(R_V), R_Vm)
where R_Vm is the attention mask of the projected regional visual features R_V, M_Uc is the multi-modal text enhancement representation, and FC() is the regional visual feature projection function;
4.4) To better project the regional visual features so that their distribution is similar to that of M_Uc, which benefits the fusion of the regional visual features and the multi-modal text enhancement representation, a modal classification network is used to form an adversarial learning task that optimizes the projection linear layer; the adversarial task loss L_GAN is the loss of the modal classification network MLP(), a multi-layer perceptron.
Fig. 2 shows that the visual description keywords K, the object-level visual labels L and the text sentence E are concatenated and then jointly encoded with BERT as the image-text joint encoding layer, i.e. [K; L; E] is jointly represented to obtain the multi-modal visual representation and the multi-modal text representation M_U. Cross-modal attention is then used as the fusion method: fusion with the character representation complements the missing text semantics, and fusion with the regional visual features complements the missing image-modality semantics, yielding the multi-modal characterization. In addition, three linear layers serve as the modal classification network of an adversarial learning task; by discriminating the modality labels of the regional visual features and the text representation, the feature-space distributions of the two are made similar, so that a better multi-modal characterization is obtained.
5) Decoding the text representation and the multi-modal representation by using a multi-task tag decoder respectively to obtain a predicted sequence of the text representation and a predicted sequence of the multi-modal representation; the multi-task label decoder is a decoder integrating three detection tasks of entity boundary detection, word entity category detection and entity labeling detection, an entity boundary detection conversion matrix and a word entity category detection conversion matrix, and can realize named entity identification through multi-task collaborative learning; the method comprises the following three parts:
5.1) To make label tagging more accurate, a multi-task label decoder is constructed that combines the prediction information of the entity boundary labels and the entity category labels through the conversion relations between labels to obtain the final prediction sequence. Projection functions first map the hidden vector H into the subspaces of the three tasks:
H_bio = FC_bio(H), H_plo = FC_plo(H), H_ner = FC_ner(H)
where c_bio, c_plo and c_ner are the projection dimensions of the three task projection functions FC_bio(), FC_plo() and FC_ner(); H_bio is the projection of the hidden vector in the entity boundary detection space, H_plo its projection in the word entity category detection space, and H_ner its projection in the entity labelling detection space;
H_bio and H_plo are then multiplied with the corresponding conversion matrices to compute the prediction vectors Ŷ_bio and Ŷ_plo:
Ŷ_bio = H_bio · W_bio→ner, Ŷ_plo = H_plo · W_plo→ner
and the prediction vectors Ŷ_bio, Ŷ_plo and H_ner are fused (weighted and summed) to obtain the predicted label representation Y;
taking the dependency between labels into account, a conditional random field is used to tag the prediction sequence y in Y:
score(Y, y) = Σ_i ( A_{y_i, y_{i+1}} + P_{i, y_i} )
P(y | Y) = exp(score(Y, y)) / Σ_{y′} exp(score(Y, y′))
where the predicted label representation Y is the label input vector, A_{y_i, y_{i+1}} is the transition score from label y_i to label y_{i+1}, P_{i, y_i} is the emission score of label y_i, exp() is the exponential function, P(y | Y) is the conditional probability distribution given Y, y′ ranges over all possible prediction vectors, and score() computes the score that the prediction vector is y;
the conditional maximum-likelihood function is used as the loss of the multi-task label decoder:
L_MLD = -log P(y | Y)
Denoting the multi-task label decoder by the function MLD(), the above computation can be written as:
(Y_H, L_MLD) = MLD(H)
where L_MLD is the loss of the multi-task label decoder and Y_H is the named entity label predicted by the multi-task label decoder from the hidden representation H;
FIG. 3 shows the processing of the multi-task label decoder. An input hidden variable H is projected through three linear layers into the entity boundary detection space, the entity category detection space and the entity labelling detection space to obtain its entity boundary information, entity category information and entity labelling information, respectively. Based on the dependency between the tasks, conversion matrices are constructed that map features from the entity boundary feature space and from the entity category feature space into the entity labelling feature space; the first two projections are converted by matrix multiplication into representations in the entity labelling feature space, which are weighted and summed with the entity labelling features to obtain the loss and the prediction sequence of the hidden representation.
5.2) The multi-task label decoder is used to decode the text characterization U obtained in step 3): with hidden representation H = U, the prediction sequence of the text characterization is Y_U and the loss is L_U, i.e. (Y_U, L_U) = MLD(U), where MLD() denotes the multi-task label decoder defined in step 5.1);
5.3) The multi-task label decoder is used to decode the multi-modal characterization M obtained in step 4.3): with hidden representation H = M, the prediction sequence of the multi-modal characterization is Y_M and the loss is L_M, i.e. (Y_M, L_M) = MLD(M), where MLD() denotes the multi-task label decoder defined in step 5.1).
Fig. 1 shows the procedure of invoking the multi-task label decoder: it is called separately to decode the text characterization obtained in step 3) and the multi-modal characterization obtained in step 4.3), yielding the prediction sequence of the text characterization and the prediction sequence of the multi-modal characterization.
6) A label fusion layer is constructed with a KL divergence loss function; by constraining the consistency of the prediction sequence of the text characterization and the prediction sequence of the multi-modal characterization, visual bias is avoided and the multi-modal prediction sequence becomes more accurate, yielding the final predicted named entity labels and realizing effective recognition of named entities.
6.1) A label fusion layer is constructed that uses the KL divergence to minimize the difference between the output distributions Y_U and Y_M of steps 5.2) and 5.3), thereby supervising the learning of the multi-modal characterization obtained in step 4.3) and of the predicted label Y_M in step 5.3):
L_KL = Σ P_θ(Y_U) log( P_θ(Y_U) / P_θ(Y_M) )
where L_KL is the KL divergence loss, P_θ() is the probability corresponding to the input vector, and log is the logarithmic function.
6.2) By summing the losses of all tasks, the tasks are fused to jointly optimize the text characterization and the multi-modal characterization and to obtain the final predicted label Y_M; the overall multi-task loss L is expressed as:
L = L_U + L_M + L_KL + L_GAN
where L_U is the loss function of the text-characterization named entity recognition task, L_M the loss function of the multi-modal-characterization named entity recognition task, L_KL the cross-view alignment loss function, and L_GAN the loss function of the modal classification network.
This process is partially illustrated in FIG. 1: a label fusion module is constructed with the KL divergence loss function, taking the two predicted label results Y_U and Y_M from step 5) as input, and supervised learning is applied to alleviate visual bias and obtain better fused labels.
The experiments were run on the Ubuntu operating system, using tools such as Python, PyTorch, CUDA and GCC, on the Twitter-2015 and Twitter-2017 datasets.
1) Experimental setup and evaluation index
The Twitter-2015 and Twitter-2017 datasets are split into training, validation and test sets to evaluate the model; details are shown in Table 1.
Recall (REC), the F1 value and the per-class F1 value are used as evaluation indices.
Table 1 Dataset statistics
2) Text modal result analysis
Based on the environment described above, the effectiveness of the proposed method is verified with the corresponding evaluation indices. The text-only variant of the proposed multi-modal named entity recognition method (denoted MTCR-T) is compared with 9 models such as T-NER, UMT-T and MNER-QG-T; details are shown in Table 2.
Table 2 Text-modality comparison with reference models (%)
Using BERT gives the text representation a more complete entity semantic representation than using GloVe, because BERT carries the background knowledge of a language model: on both datasets the BERT-CRF indices are higher than those of the BiLSTM-CRF model. MTCR-T is higher than the BERT-CRF model; on Twitter-2017 its REC and F1 values improve by 3.48% and 2.75%, respectively, which verifies that the text feature fusion layer in MTCR-T obtains a text characterization with word entity semantics by complementing the semantics missing from the text representation.
Using a CRF model as the decoder is effective; for example, the BERT-CRF model performs better than the BERT model. The multi-task label decoding module further fuses the entity boundary detection and word category detection tasks for decoding, so that on Twitter-2017 the per-class F1 values of MTCR-T improve over BERT-CRF by 2.28% (PER), 2.50% (LOC), 2.00% (ORG) and 7.02% (MISC); compared with the UMT-T model, the F1 value of MTCR-T improves on both datasets. These experimental results verify the effectiveness of the multi-task label decoder.
Compared with the best models in the comparison, MAF-T and MNER-QG-T, the F1 value of MTCR-T improves by 0.26% and 0.64% on the Twitter-2015 and Twitter-2017 datasets, respectively; furthermore, compared with the best model MNER-QG-T, the per-class F1 values of MTCR-T improve by 1.2% (PER), 0.32% (LOC), 1.38% (ORG) and 0.82% (MISC) on the Twitter-2017 dataset and by 1.19% (PER) and 1.5% (ORG) on the Twitter-2015 dataset, verifying the effectiveness of the text feature fusion layer and the multi-task label decoder.
MTCR-T also remains ahead of some of the multi-modal MNER models in Table 3. On the Twitter-2015 dataset, its F1 value is 0.88% and 0.36% higher than that of the MT and UAMNER models, respectively, and on par (within ±0.05%) with the MSB, UMT and MAF models based on image-text joint encoding; on the Twitter-2017 dataset, the F1 value of MTCR-T is higher than that of 5 models, with improvements of 1.77% (MT), 1.87% (MSB), 1.29% (UAMNER), 0.88% (UMT) and 0.68% (UMGF), and on par with the MAF model (-0.06%), verifying the effectiveness of the synergy between the text feature fusion layer and the multi-task label decoder.
3) Multi-modal result analysis
A comparative analysis was performed against 8 mainstream MNER models of the last three years, such as MT, UMT and MNER-QG, where w/o GAN denotes the model obtained after removing the adversarial learning task; the experimental results are shown in Table 3.
Table 3 Multi-modal comparison with reference models (%)
Compared with the MSB model, which takes the picture labels as a text suffix, every index of the proposed model improves; compared with the UMT and UAMNER models, which fuse regional visual features with cross-modal attention and use a joint named entity recognition structure, similar improvements are observed: on Twitter-2015 the F1 value improves by 1.29%, 1.35% and 1.66%, respectively, and on Twitter-2017 it improves by 2.87%, 1.88% and 2.29%, respectively. These results show that, compared with a single modal feature, cooperatively expressing the modal semantics with multiple features per modality effectively enhances the multi-modal characterization and improves named entity recognition.
Furthermore, compared with the per-class F1 averages of the 8 mainstream MNER models, the proposed model improves PER (1.21%), LOC (0.47%) and ORG (1.91%) on the Twitter-2015 data, and PER (1.26%), LOC (0.29%), ORG (1.7%) and MISC (4.47%) on the Twitter-2017 data, while the overall F1 average improves by 1.09% and 1.64% on the two datasets, respectively, verifying that the multi-task label decoder is effective.
On the Twitter-2017 dataset, the REC of the proposed model is 88.38% and its F1 value is 87.19%, exceeding 7 of these models; compared with MNER-QG, its REC is 2.42% higher while its F1 value is 0.06% lower, which still verifies its effectiveness. On the Twitter-2015 dataset, its F1 value is higher than that of 5 reference models and 0.09% and 0.18% lower than UMGF and MNER-QG, respectively, although it obtains better results on the F1 of PER and LOC. The likely reason is that the image-text correlation in the Twitter-2015 dataset is weaker, while the MNER-QG method performs fine-grained annotation of the visual objects in the dataset and thus obtains more accurate visual features, improving the quality of the multi-modal representation and the performance.
4) Ablation experimental analysis
To verify the effectiveness of the critical parts of the proposed model, a controlled-variable method was used: individual modules were removed from the model to observe whether its performance changed. As shown in Table 4, ablation experiments were performed on the Twitter-2015 and Twitter-2017 datasets and the results were analysed.
Table 4 Ablation experiments
To verify the effectiveness of the multi-modal fusion layer, the following ablations were performed: in the multi-modal fusion layer, the adversarial learning task (Gan) was removed, and on the basis of w/o Gan the object-level labels and image keywords (KL), the character representation (Char) and the regional visual features (Reg) were removed, respectively.
In Table 4, on the Twitter-2015 and Twitter-2017 datasets, the F1 value of w/o Gan drops by 0.32% and 0.43%, respectively, showing that adding the adversarial learning task makes the semantic distributions of the regional visual features and the text features more similar and easier to fuse, which further improves the entity recognition capability of the model. When the regional visual features (Reg) or the object-level visual labels and image keywords (KL) are further removed, i.e. w/o Reg and w/o KL, the multi-modal fusion layer can fuse only one kind of image feature; the experiments show that on the Twitter-2015 dataset the F1 values of w/o Reg and w/o KL drop by 0.27% and 1.99%, respectively, and on the Twitter-2017 dataset they drop by 0.94% and 1.02%, respectively, verifying that each kind of image feature represents only part of the semantics of an image and that the features within the image modality complement each other to give a more comprehensive image description. In addition, an ablation on the text characterization was performed on the basis of removing the adversarial learning task (Gan): the F1 value of w/o Char drops by 1.41% and 0.63% on the Twitter-2015 and Twitter-2017 datasets, respectively, showing that using characters to complement the semantics missing from the text also improves the quality of the multi-modal characterization and the performance of the model.
To verify the effectiveness of the multi-task label decoder, the following ablations were performed: the entity boundary detection projection task (bio), the word entity category detection projection task (plo) and both tasks together (plo bio) were removed, respectively.
In Table 4, removing the entity boundary detection projection task (bio) or the word entity category detection projection task (plo) means that multi-task learning no longer drives the multi-modal characterization to actively learn entity boundary information or entity category information. The experimental results show that on the Twitter-2015 dataset the F1 values of w/o plo and w/o bio drop by 0.78% and 0.37%, respectively, and on the Twitter-2017 dataset they drop by 1.05% and 0.75%, respectively, verifying that a multi-modal characterization with entity boundary semantics or entity category semantics aids named entity recognition. A further ablation, w/o plo bio, removes both tasks at the same time: on the Twitter-2015 and Twitter-2017 datasets its F1 value drops by 0.55% and 1.29%, respectively, and the other indices drop further, showing that the multi-task label decoder aggregates entity boundary information and entity category information to help obtain more accurate named entity labels, which verifies the effectiveness of the multi-task label decoder.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (7)

1. The multi-mode named entity identification method for the multi-task collaborative characterization is characterized by comprising the following steps of:
1) Collecting image-text data of different modes, including texts and images;
2) Respectively carrying out feature representation on the obtained image-text data, wherein the feature representation comprises a text representation and a character representation obtained by encoding the text with BERT and CNN respectively, and visual description keywords, object-level visual labels and regional visual features obtained by extracting image features with Image Captioning, Mask R-CNN and ResNet respectively;
3) Constructing a text feature enhancement layer by using cross-modal attention to fuse the text representation and the character representation, enhancing the semantics of the text representation and obtaining the text characterization;
4) Constructing a multimodal fusion layer, wherein the text, the visual description keywords and the object-level visual labels are jointly encoded by BERT to obtain a multimodal representation; a text feature enhancement layer is constructed with cross-modal attention to fuse the character representation and the multimodal representation and obtain a multimodal text-enhanced representation; a visual feature enhancement layer is constructed with cross-modal attention to fuse the multimodal text-enhanced representation and the regional visual features and obtain the multimodal representation, wherein the regional visual features are guided by an adversarial classification network so that their feature-space distribution is consistent with that of the multimodal text-enhanced representation, which facilitates the fusion of the regional visual features and the multimodal text-enhanced representation;
5) Decoding the text characterization and the multimodal representation respectively with a multi-task label decoder to obtain the predicted sequence of the text representation and the predicted sequence of the multimodal representation; the multi-task label decoder is a decoder that integrates three detection tasks, namely entity boundary detection, word entity category detection and entity labeling detection, together with an entity boundary detection conversion matrix and a word entity category detection conversion matrix, and realizes named entity recognition through multi-task collaborative learning;
6) Constructing a label fusion layer with a KL divergence loss function; by constraining the consistency of the predicted sequence of the text representation and the predicted sequence of the multimodal representation, visual bias is avoided and the predicted sequence of the multimodal representation becomes more accurate, i.e. the final predicted named entity label is obtained, realizing effective recognition of named entities.
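By way of illustration only and not as part of the claim, the following minimal Python sketch shows one way the pipeline of steps 1)-6) could be wired together; every name in it (mtcr_t_forward, encoders, fusion, decoder, label_fusion) is a hypothetical placeholder introduced here, not the claimed implementation.

```python
# Hypothetical end-to-end sketch of the claimed pipeline; all callables are
# placeholders supplied by the caller, not part of the patent.
def mtcr_t_forward(text, image, encoders, fusion, decoder, label_fusion):
    # step 2) feature representation of both modalities
    B = encoders["bert"](text)          # text representation
    C = encoders["cnn"](text)           # character representation
    K = encoders["captioner"](image)    # visual description keywords
    L = encoders["mask_rcnn"](image)    # object-level visual labels
    R_v = encoders["resnet"](image)     # regional visual features

    # step 3) text feature enhancement layer (cross-modal attention over B and C)
    U = fusion.text_enhance(B, C)

    # step 4) multimodal fusion layer
    M_kl, M_u = fusion.joint_encode(K, L, text)   # BERT joint encoding of [K; L; E]
    M_uc = fusion.text_enhance(M_u, C)            # complement missing semantics with C
    M = fusion.visual_enhance(M_uc, R_v)          # fuse regional visual features

    # step 5) multi-task label decoding of both characterizations
    Y_u, loss_u = decoder(U)
    Y_m, loss_m = decoder(M)

    # step 6) label fusion layer constrains the two predicted sequences to agree
    loss_kl = label_fusion(Y_u, Y_m)
    return Y_m, loss_u + loss_m + loss_kl
```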
2. The multi-modal named entity identification method of multi-task collaborative characterization according to claim 1, wherein in step 1), the collected image-text data is represented as:
D = {(E_i, I_i)}, i = 1, 2, …, n'
where E_i denotes a text instance, I_i denotes an image instance, i = 1 denotes the first pair of image-text data, and the whole set D contains n' pairs of image-text data.
3. The multi-modal named entity identification method of multi-task collaborative characterization according to claim 2, wherein in step 2), the obtained image-text data is respectively characterized as follows:
D = {(B_i, C_i, K_i, L_i, R_i)}, i = 1, 2, …, n'
where B denotes the text representation obtained by BERT encoding, C denotes the character representation obtained by CNN encoding, K denotes the visual description keywords extracted with Image Captioning, L denotes the object-level visual labels extracted with Mask R-CNN, R denotes the regional visual features extracted with ResNet, i = 1 denotes the first pair of image-text data, and the whole set D contains n' pairs of image-text data; the individual features are obtained as follows:
a. text representation: given a sentence E, namely a text, the sentence E is input into a pre-training language model BERT as a text coding layer of the whole model, and the calculation formula is as follows:
B=BERT(E)
wherein E is a sentence containing n words, and the text representation B represents a feature vector obtained by inputting the sentence E into the BERT;
b. character representation: given a sentence E, it is input into CNN, and its calculation formula is:
C=CNN(E)
wherein, the character representation C represents a feature vector obtained after the sentence E is input into the CNN;
c. regional visual characteristics: giving an image I, inputting the image I into a pre-trained ResNet, extracting regional image characteristics, and calculating the formula as follows:
R_V = ResNet(I)
where I is a color image and R_V is the feature vector of the regional visual features in the picture;
d. visual description keywords: given an image I, the image I is input into a pre-trained Image Captioning model to extract the visual description keywords, and the calculation formula is as follows:
K=Image-Captioning(I)
wherein K represents a visual description keyword related to the picture;
e. object-level visual label: given an image I, the image I is input into a pretrained Mask R-CNN, an object-level visual tag is extracted, and a calculation formula is as follows:
L=MaskR-CNN(I)
where L is the extracted object-level visual tag.
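For illustration, a hedged sketch of how the feature representations of claim 3 might be extracted with common open-source tooling (HuggingFace transformers and torchvision) is given below; the chosen checkpoints, the region-feature layout and the caption_image placeholder are assumptions, any image-captioning model could be substituted, and the character-level CNN encoder is omitted for brevity.

```python
# Illustrative feature extraction for step 2); library and checkpoint choices are
# assumptions, not the patent's prescribed configuration.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from torchvision.models import resnet50
from torchvision.models.detection import maskrcnn_resnet50_fpn

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_representation(sentence: str) -> torch.Tensor:
    """B = BERT(E): token-level feature vectors."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return bert(**inputs).last_hidden_state          # shape (1, n, 768)

# R_V = ResNet(I): keep the convolutional feature map (drop pooling and classifier).
resnet_backbone = nn.Sequential(*list(resnet50(weights="DEFAULT").children())[:-2]).eval()

def regional_visual_features(image: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        fmap = resnet_backbone(image.unsqueeze(0))       # (1, 2048, 7, 7) for 224x224 input
    return fmap.flatten(2).transpose(1, 2)               # (1, 49, 2048): 49 regions

# L = MaskR-CNN(I): object-level visual labels from detected instances.
mask_rcnn = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def object_level_labels(image: torch.Tensor, score_thr: float = 0.5):
    with torch.no_grad():
        det = mask_rcnn([image])[0]
    keep = det["scores"] > score_thr
    return det["labels"][keep].tolist()                  # class indices of kept objects

def caption_image(image):
    # K = Image-Captioning(I): any pretrained captioner (e.g. a BLIP-style model)
    # could be plugged in here; left as a placeholder assumption.
    raise NotImplementedError
```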
4. The multi-modal named entity identification method of multi-task collaborative characterization according to claim 3, wherein in step 3), the text feature enhancement layer generates the semantically enhanced text characterization by passing the text representation B and the character representation C through a multi-head cross-modal attention network CMT, the calculation formula of which is:
U = CMT(B, C, C_m)
where C_m is the attention mask matrix of C; U = [u_1, u_2, …, u_n], in which u_1, u_2, …, u_n are the feature vectors of the text characterization corresponding to the 1st, 2nd, …, n-th words of text E; the resulting text characterization U lies in the real feature space of dimension n × d, n being the length of the text representation B and d being the feature dimension of the text representation B;
The multi-head cross-modal attention network CMT is a multi-head attention network based on co-attention, and its calculation formulas are as follows:
CA_i = σ((q·W_q)(k·W_k)^T)·(v·W_v)
MH-CA = [CA_1, CA_2, …, CA_g]
G~ = LN(q + MH-CA)
G = LN(G~ + FFN(G~))
where q is the attention query value in CMT, k is the attention key value in CMT, v is the attention value in CMT, W_q is the projection weight of the query value, W_k is the projection weight of the key value, W_v is the projection weight of the value, σ() is the softmax activation function, CA is co-attention, and MH-CA is the multi-head attention composed of g co-attention heads, where CA_1, CA_2, …, CA_g are the 1st, 2nd, …, g-th co-attention heads in MH-CA (g heads in total); LN() is the normalization layer, FFN() is the feed-forward network layer, G~ is the normalized feature, and G is the final fusion feature of the cross-modal input features q and k;
For ease of reference, the multi-head cross-modal attention network CMT is written as:
H=CMT(q,k,v)
where q and k are the modal features of the two inputs, v is the attention value in CMT and also serves as the attention mask of k, and H is the enhanced representation of q under the perception of k, i.e. the fusion feature of the cross-modal input features q and k.
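As an illustrative sketch only, the cross-modal block of claim 4 could be realized in PyTorch as below; PyTorch's built-in multi-head attention stands in for the MH-CA computation, and the hyper-parameters are assumptions.

```python
# Minimal cross-modal attention block following the CMT formulas:
# G~ = LN(q + MH-CA), G = LN(G~ + FFN(G~)). Hyper-parameters are illustrative.
import torch.nn as nn

class CMT(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, q, k, k_mask=None):
        # multi-head co-attention: query from one modality, key/value from the other
        mh_ca, _ = self.attn(q, k, k, key_padding_mask=k_mask)
        g_tilde = self.ln1(q + mh_ca)                    # G~ = LN(q + MH-CA)
        return self.ln2(g_tilde + self.ffn(g_tilde))     # G  = LN(G~ + FFN(G~))

# Usage, e.g. the text feature enhancement layer U = CMT(B, C, C_m):
#   cmt = CMT(); U = cmt(B, C, k_mask=C_padding_mask)
```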
5. The method for identifying multi-modal named entities with collaborative characterization according to claim 4, wherein step 4) includes the steps of:
4.1) Splicing the visual description keywords K, the object-level visual labels L and the text E as [K; L; E], with K and L used as a text prefix, and jointly encoding [K; L; E] with BERT, the calculation formula being:
[M_KL; M_U] = BERT([K; L; E])
where M_KL is the multimodal visual representation, M_U is the multimodal representation, and e is the combined length of K and L;
4.2) To complement the missing semantics in the multimodal representation, the text feature enhancement layer is used to fuse the character representation for semantic complementation, obtaining the multimodal text-enhanced representation, the calculation formula being:
M_Uc = CMT(M_U, C, C_m)
where C_m is the attention mask representation of the character representation C, and M_Uc is the multimodal text-enhanced representation in the real feature space;
4.3) M_Uc fuses the visual global and fine-grained entity semantics but does not capture the latent information about relations between objects; therefore, cross-modal attention is used to fuse the regional visual features, complementing the image features and optimizing the multimodal text-enhanced representation to obtain the multimodal representation M, the calculation formula being:
M = CMT(M_Uc, FC(R_V), R_Vm)
where R_Vm is the attention mask representation of the regional visual features R_V after projection, M_Uc is the multimodal text-enhanced representation, and FC() is the regional visual feature projection function;
4.4) To better project the regional visual features so that their feature-space distribution is similar to that of M_Uc, thereby facilitating the fusion of the regional visual features and the multimodal text-enhanced representation, a modal classification network is used to construct an adversarial learning task that optimizes the projection linear layer; the loss of the adversarial learning task is calculated as follows:
where MLP() is the modal classification network composed of a multi-layer perceptron, and ℒ_gan is the loss of the modal classification network.
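Purely as an illustration of step 4.4), one common way to realize such an adversarial objective is a gradient-reversal layer in front of an MLP modal classifier, as sketched below; the patent only specifies a modal classification network, so the gradient-reversal trick and the two-class setup are assumptions introduced here.

```python
# Hedged sketch of the adversarial modal-classification objective of step 4.4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out          # reversed gradient flows into the projection layer

class ModalClassifier(nn.Module):
    """MLP that tries to distinguish projected visual features from text-side features."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 2))

    def forward(self, feats):
        return self.mlp(GradReverse.apply(feats))

def adversarial_loss(classifier, visual_proj, text_feats):
    # label 0 = visual modality, label 1 = text modality
    logits = torch.cat([classifier(visual_proj), classifier(text_feats)], dim=0)
    labels = torch.cat([torch.zeros(visual_proj.size(0), dtype=torch.long),
                        torch.ones(text_feats.size(0), dtype=torch.long)])
    return F.cross_entropy(logits, labels)
```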
6. The method for identifying multi-modal named entities with collaborative characterization according to claim 5, wherein step 5) includes the steps of:
5.1) To make label tagging more accurate, a multi-task label decoder is constructed, which combines the prediction information of the entity boundary labels and the entity category labels through the conversion relations among labels to obtain the final prediction sequence; the hidden vector H is projected into the subspaces of the three tasks with projection functions, calculated as follows:
H_bio = FC_bio(H)
H_plo = FC_plo(H)
H_ner = FC_ner(H)
where c_bio, c_plo and c_ner are the projection dimensions of the three task projection functions FC_bio(), FC_plo() and FC_ner() respectively, H_bio denotes the projection feature of the hidden vector in the entity boundary detection space, H_plo denotes the projection feature of the hidden vector in the word entity category detection space, and H_ner denotes the projection feature of the hidden vector in the entity labeling detection space;
H_bio and H_plo are respectively multiplied with the corresponding conversion matrices to calculate the prediction vectors, the calculation formula being:
The prediction vectors and H_ner are fused to obtain the predictive label representation Y, the calculation formula being:
Considering the dependency relations between tags, a conditional random field is used to mark the predicted sequence y in Y:
score(Y, y) = Σ_i T_(y_i, y_i+1) + Σ_i P_(i, y_i)
P(y|Y) = exp(score(Y, y)) / Σ_y' exp(score(Y, y'))
where the predictive label representation Y is the label input vector, T_(y_i, y_i+1) is the transition score from label y_i to label y_i+1, P_(i, y_i) is the score of label y_i at position i, exp() is the exponential function, P(y|Y) is the conditional probability distribution given Y, y' ranges over all possible prediction sequences, and score() calculates the score of the prediction sequence y;
calculation of loss as a multi-tasking tag decoder using loss function, i.e. conditional maximum likelihood functionThe calculation formula is as follows:
Denoting the multi-task label decoder by the function MLD(), the above calculation can be expressed as:
(Y_H, ℒ_H) = MLD(H)
where ℒ_H is the loss of the multi-task label decoder, and Y_H is the named-entity prediction label output by the multi-task label decoder;
5.2) The multi-task label decoder is used to decode the text characterization U obtained in step 3), i.e. the hidden representation H = U; the predicted sequence of the text representation is Y_U and the loss function is ℒ_U:
(Y_U, ℒ_U) = MLD(U)
where MLD() represents the multi-task label decoder defined in step 5.1);
5.3) The multi-task label decoder is used to decode the multimodal representation M obtained in step 4.3), i.e. the hidden representation H = M; the predicted sequence of the multimodal representation is Y_M and the loss function is ℒ_M:
(Y_M, ℒ_M) = MLD(M)
where MLD() represents the multi-task label decoder defined in step 5.1).
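For illustration only, the multi-task label decoder MLD() of claim 6 could be sketched as below, using the third-party pytorch-crf package for the CRF layer; the label-space sizes, the additive fusion of the converted auxiliary predictions and the random initialization of the conversion matrices are assumptions made for this sketch.

```python
# Simplified sketch of the multi-task label decoder: three projection heads,
# conversion matrices into the NER label space, and a CRF for sequence labelling.
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party: pip install pytorch-crf

class MultiTaskLabelDecoder(nn.Module):
    def __init__(self, d_model=768, c_bio=3, c_plo=5, c_ner=9):
        super().__init__()
        self.fc_bio = nn.Linear(d_model, c_bio)   # entity boundary detection head
        self.fc_plo = nn.Linear(d_model, c_plo)   # word entity category detection head
        self.fc_ner = nn.Linear(d_model, c_ner)   # entity labelling detection head
        # conversion matrices from the auxiliary label spaces into the NER label space
        self.w_bio = nn.Parameter(torch.randn(c_bio, c_ner))
        self.w_plo = nn.Parameter(torch.randn(c_plo, c_ner))
        self.crf = CRF(c_ner, batch_first=True)

    def forward(self, H, tags=None, mask=None):
        h_bio, h_plo, h_ner = self.fc_bio(H), self.fc_plo(H), self.fc_ner(H)
        # fuse the converted auxiliary predictions with the NER projection
        emissions = h_ner + h_bio @ self.w_bio + h_plo @ self.w_plo
        y_pred = self.crf.decode(emissions, mask=mask)
        loss = -self.crf(emissions, tags, mask=mask) if tags is not None else None
        return y_pred, loss
```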
7. The method for identifying multi-modal named entities with collaborative characterizations according to claim 6, wherein step 6) includes the steps of:
6.1) A label fusion layer is constructed; the KL divergence between the output distributions Y_U and Y_M of steps 5.2) and 5.3) is minimized to supervise the learning of the multimodal characterization obtained in step 4.3), making the prediction label Y_M of step 5.3) more accurate, the calculation formula being:
where ℒ_KL is the KL divergence loss, P_θ() is the probability corresponding to the input vector, and log is the logarithmic function.
6.2) The losses of all tasks are summed so that the tasks are fused, realizing multi-task collaborative optimization of the text characterization and the multimodal characterization and obtaining the final prediction label Y_M; the overall loss, i.e. the multi-task loss ℒ, is expressed as follows:
ℒ = ℒ_U + ℒ_M + ℒ_KL + ℒ_gan
where ℒ_U is the loss function of the text-characterization named entity recognition task, ℒ_M is the loss function of the multimodal-characterization named entity recognition task, ℒ_KL is the loss function of cross-view alignment, and ℒ_gan is the loss function of the modal classification network.
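As a final illustrative sketch of claim 7, the label fusion layer and the summed multi-task loss might look as follows; the KL direction (treating the text view as the target) and the unweighted sum of losses are assumptions consistent with, but not dictated by, the claim text.

```python
# Hedged sketch of the label fusion layer and overall multi-task loss of step 6).
import torch.nn.functional as F

def label_fusion_kl(logits_text, logits_multi):
    """KL term pulling the multimodal prediction distribution toward the text one."""
    p_text = F.softmax(logits_text.detach(), dim=-1)     # text view taken as the target
    log_p_multi = F.log_softmax(logits_multi, dim=-1)
    return F.kl_div(log_p_multi, p_text, reduction="batchmean")

def total_loss(loss_u, loss_m, loss_kl, loss_gan):
    # multi-task collaborative optimization: sum of all task losses
    return loss_u + loss_m + loss_kl + loss_gan
```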
CN202310752673.1A 2023-06-25 2023-06-25 Multi-mode named entity identification method for multi-task collaborative characterization Pending CN116956920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310752673.1A CN116956920A (en) 2023-06-25 2023-06-25 Multi-mode named entity identification method for multi-task collaborative characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310752673.1A CN116956920A (en) 2023-06-25 2023-06-25 Multi-mode named entity identification method for multi-task collaborative characterization

Publications (1)

Publication Number Publication Date
CN116956920A true CN116956920A (en) 2023-10-27

Family

ID=88446980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310752673.1A Pending CN116956920A (en) 2023-06-25 2023-06-25 Multi-mode named entity identification method for multi-task collaborative characterization

Country Status (1)

Country Link
CN (1) CN116956920A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573870A (en) * 2023-11-20 2024-02-20 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data
CN117573870B (en) * 2023-11-20 2024-05-07 中国人民解放军国防科技大学 Text label extraction method, device, equipment and medium for multi-mode data

Similar Documents

Publication Publication Date Title
Huang et al. Image captioning with end-to-end attribute detection and subsequent attributes prediction
CN112818906A (en) Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN112541501A (en) Scene character recognition method based on visual language modeling network
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114339450A (en) Video comment generation method, system, device and storage medium
CN116956920A (en) Multi-mode named entity identification method for multi-task collaborative characterization
CN111860237A (en) Video emotion fragment identification method and device
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN115129934A (en) Multi-mode video understanding method
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
CN117391051B (en) Emotion-fused common attention network multi-modal false news detection method
CN114398505A (en) Target word determining method, model training method and device and electronic equipment
Chaudhary et al. Signnet ii: A transformer-based two-way sign language translation model
CN114386412B (en) Multi-mode named entity recognition method based on uncertainty perception
Pan et al. Quality-Aware CLIP for Blind Image Quality Assessment
Cheng et al. Video reasoning for conflict events through feature extraction
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion
Xie et al. Global-shared Text Representation based Multi-Stage Fusion Transformer Network for Multi-modal Dense Video Captioning
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
CN116152118B (en) Image description method based on contour feature enhancement
CN117789076A (en) Video description generation method and system oriented to semantic characteristic selection and attention fusion
Wang et al. RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination