CN115659279A - Multi-mode data fusion method based on image-text interaction - Google Patents


Info

Publication number
CN115659279A
CN115659279A (application CN202211392871.3A)
Authority
CN
China
Prior art keywords
text
data
image
attention
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211392871.3A
Other languages
Chinese (zh)
Inventor
赵宗罗
赵志新
许毅
李强强
蒋良
罗良
周波
王立森
吕捷
帅万高
孙潇哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co and Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202211392871.3A
Publication of CN115659279A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal data fusion method based on image-text interaction, which comprises the following steps: S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data; S2, obtaining an image feature map of the inspection image through a convolutional neural network; S3, preprocessing the equipment state data through a text extractor to obtain a text feature map; S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight; S5, obtaining image-text mixed features based on the text attention weight and the image attention weight; S6, obtaining bidirectional interaction information between the training data and the target through a multi-head cross attention module; and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result. The scheme improves recognition accuracy by performing fusion learning and analysis on the multi-modal features.

Description

Multi-mode data fusion method based on image-text interaction
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-mode data fusion method based on image-text interaction.
Background
Currently, artificial intelligence has broad application space in many fields. As forms of information propagation diversify, the types of data available for data mining grow richer. In general, these data can be divided into two categories: structured data, such as text and numerical tables, and unstructured data, including images and audio/video. Given the interactivity and redundancy between the two, fusing different types of data with multi-modal techniques can reduce the required storage space and add feature information of multiple dimensions to the same described object; applying the processed data to data analysis and prediction can effectively improve the accuracy of the results.
In traditional power equipment fault analysis, one is often given data of a single modality (for example, a picture or video describing a problem) and must obtain the regular data of other modalities (an objective textual solution) through manual analysis; this is time-consuming, and because analysts differ in expertise, the results fit the problem to varying degrees. Furthermore, most methods focus on extracting features from individual modalities separately, without fusion learning of the features. Therefore, a multi-modal data fusion algorithm is crucial for solving the prediction problem for a given set of input and output data types.
The above information disclosed in this background section is only for enhancement of understanding of the background of the application, and therefore it may contain information that does not form prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to provide a multi-modal data fusion method based on image-text interaction, addressing the problem that existing prediction methods lack fusion learning and analysis of multi-modal features. Two multi-head self-attention (MSA) modules are constructed to learn image and text features respectively; a bidirectional cross-attention method is proposed for cross-information learning; two decoders, each comprising an MSA block and a multi-head cross-attention (MCA) block, are constructed for cross-information learning; the obtained feature map is input into a feature map mixing module for feature extraction, and a prediction result is output, effectively improving result accuracy.
In a first aspect, an embodiment of the present invention provides a multi-modal data fusion method based on image-text interaction, including the following steps:
S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data;
S2, obtaining an image feature map of the inspection image through a convolutional neural network;
S3, preprocessing the equipment state data through a text extractor to obtain a text feature map;
S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight;
S5, obtaining image-text mixed features based on the text attention weight and the image attention weight;
S6, obtaining bidirectional interaction information between training data and a target through a multi-head cross attention module;
and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
Preferably, in step S2, the local features of the image are learned by a convolutional neural network to obtain an image feature map of the inspection image.
Preferably, the convolutional neural network includes: convolutional layers, batch normalization layers, activation layers, and max pooling layers.
Preferably, step S3 includes the steps of:
s31, acquiring text information representing equipment state data;
S32, segmenting the text information into words and dividing it into phrases of length k;
s33, recording the sequence number of each phrase through a dictionary;
and S34, representing the whole text information sequence through the sequence number.
Preferably, before step S4, an encoder is constructed to learn the global features of the inspection image and the text.
Preferably, the data processing step of the encoder includes:
normalizing the data with different lengths through a normalization layer;
performing multi-head attention calculation on the normalized data characteristics;
addressing the non-linearity of the data features through a feedforward neural network;
and adding the original data to the output data of the multi-head attention module.
Preferably, the feedforward neural network includes: a linear layer, a GELU activation function layer and a DropPath layer.
Preferably, step S5 includes:
S51, the inspection image passes sequentially through the convolutional neural network and the multi-head attention module to obtain an image-modality attention weight feature map;
S52, the text data pass sequentially through the text extractor and the multi-head attention module to obtain a text-modality attention weight feature map;
and S53, fusing the image-modality attention weight feature map with the text-modality attention weight feature map.
Preferably, S6 includes the following steps:
in the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
Preferably, in step S7, the feature map mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron layer, and a fully connected layer.
The invention has the beneficial effects that: a multi-modal data fusion method based on image-text interaction is provided, in which multi-modal information and a bidirectional cross-attention method are used to form a multi-modal data set from images and their corresponding text information. Two multi-head self-attention (MSA) modules are constructed to learn image and text features respectively; a bidirectional cross-attention method is proposed for cross-information learning; two decoders, each comprising an MSA block and a multi-head cross-attention (MCA) block, are constructed for cross-information learning; the obtained feature map is input into a feature map mixing module for feature extraction, and a prediction result is output, effectively improving result accuracy.
The above summary of the present invention is merely an overview of its technical solutions. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the description, and in order that the above and other objects, features and advantages of the present invention may be more readily apparent, preferred embodiments are described below.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings.
Fig. 1 is a flow chart of a multimodal data fusion method based on image-text interaction according to the invention.
Detailed Description
For a better understanding of the objects, technical solutions and advantages of the present invention, the invention is described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only preferred embodiments of the present invention, used to explain rather than to limit the invention; all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations (or steps) can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure; the processes may correspond to methods, functions, procedures, subroutines, and the like.
The embodiment is as follows: as shown in fig. 1, a multimodal data fusion algorithm based on image-text interaction includes the following steps:
S1, multi-modal data are obtained, wherein the multi-modal data comprise inspection image data and equipment state data.
And S2, acquiring an image feature map of the inspection image through a convolutional neural network.
Specifically, the method comprises the following steps:
s21, acquiring a structural image from an image tool library;
and S22, constructing a convolutional neural network to learn the local characteristics of the image. The convolutional neural network mainly comprises a convolutional layer, a batch normalization layer, an activation layer and a maximum pooling layer;
S23, the operation of the convolutional layer can be represented as P_conv = f(F_in * W) + b, where F_in denotes the feature input, W and b are the parameter matrix and bias of the convolution kernel, and f denotes the convolution operation;
S24, batch normalization is applied to the feature map after the convolution operation, which can be expressed as P_BN = BN(P_conv);
S25, increasing the nonlinearity of the model through a GELU activation function;
s26, compressing data by using the maximum pooling layer, and extracting main features;
S27, through the constructed convolutional neural network, a local feature map of the image is obtained, expressed as F_local = Conv(F_in),
where Conv denotes the convolutional neural network constructed by the present invention.
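For illustration only, the following sketch shows how such a convolutional block might be assembled. It assumes a PyTorch implementation (the invention does not name a framework), and the channel counts, kernel sizes and pooling stride are illustrative assumptions rather than values specified by the invention.

```python
# Minimal sketch of the convolutional feature extractor in S21-S27 (assumption: PyTorch).
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),  # P_conv = f(F_in * W) + b
            nn.BatchNorm2d(out_channels),                                     # P_BN = BN(P_conv)
            nn.GELU(),                                                        # non-linearity (S25)
            nn.MaxPool2d(kernel_size=2),                                      # compress data, keep main features (S26)
        )

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # F_local = Conv(F_in)
        return self.block(f_in)

# Example: a batch of 4 RGB inspection images of size 224x224
# features = ConvFeatureExtractor()(torch.randn(4, 3, 224, 224))  # -> shape (4, 64, 112, 112)
```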
And S3, preprocessing the equipment state data through a text extractor to obtain a text feature map.
The method specifically comprises the following steps:
S31, segmenting the text sequence into words and dividing it into phrases of length k;
S32, constructing a dictionary to record the order of appearance of the phrases;
and S33, replacing each original word with its numeric index in the dictionary, so that the whole text is encoded as a sequence of numbers.
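As an illustration, a minimal sketch of this text pre-processing is given below. It assumes whitespace-separated words, and the phrase length k and the dictionary construction shown here are hypothetical choices, since the text does not fix them.

```python
# Minimal sketch of the text extractor pre-processing in S31-S33 (assumptions noted above).
from typing import Dict, List

def encode_text(text: str, k: int, vocab: Dict[str, int]) -> List[int]:
    """Split the text into words, group them into phrases of length k,
    register each unseen phrase in the dictionary in order of appearance,
    and return the sequence of dictionary indices representing the text."""
    words = text.split()
    phrases = [" ".join(words[i:i + k]) for i in range(0, len(words), k)]
    ids = []
    for phrase in phrases:
        if phrase not in vocab:
            vocab[phrase] = len(vocab)   # record the order of appearance
        ids.append(vocab[phrase])
    return ids

# Example (hypothetical device-state text):
# vocab: Dict[str, int] = {}
# encode_text("breaker temperature high oil level normal", k=2, vocab=vocab)  # -> [0, 1, 2]
```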
And S4, constructing a multi-head attention module to obtain the text attention weight and the image attention weight.
The method specifically comprises the following steps:
s41, constructing an encoder of a self-attention mechanism to learn global features of the image/text. The encoder mainly comprises a normalization layer, a multi-head attention calculation layer, a feedforward neural network and residual connection;
S42, the normalization layer processes data of different lengths through linear normalization, which can be represented as P_LN = Norm(F_in), where F_in denotes the feature input and Norm denotes LayerNorm normalization;
s43, since the attention mechanism does not consider the position information, adding the position embedding information into the initial data;
And S44, calculating multi-head attention on the normalized features. First, for an input x, three projection matrices M_Q, M_K and M_V are constructed; multiplying x by each of them yields three different matrices Q, K and V. The attention calculation can therefore be expressed as Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k denotes the dimension of the matrix K. To prevent overfitting, a multi-head attention calculation is used, which can be represented as MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q_i, K_i, V_i) and W^O is a linear projection matrix;
S45, the non-linearity that a single-layer perceptron cannot capture is addressed through a feedforward neural network, which comprises a linear layer, a GELU activation function layer and a DropPath layer; the DropPath layer improves the generalization capability of the network;
S46, residual connections are used: the output of the multi-head attention module is added to the original data, which reduces model complexity and prevents gradient vanishing.
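For illustration, the following sketch assembles one such encoder block in PyTorch (an assumed framework). The embedding dimension, number of heads and DropPath rate are illustrative assumptions, and DropPath (stochastic depth) is implemented inline since it is not part of core PyTorch.

```python
# Minimal sketch of the self-attention encoder in S41-S46: pre-norm, multi-head
# attention, feed-forward network with GELU and DropPath, residual connections.
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly skip the residual branch for whole samples."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = (torch.rand(shape, device=x.device) < keep).to(x.dtype)
        return x * mask / keep

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, drop_path: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # P_LN = Norm(F_in)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                      # linear + GELU + DropPath (S45)
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
            DropPath(drop_path),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Position embeddings (S43) are assumed to have been added to x before the first block.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)               # Q, K, V all come from the same sequence
        x = x + attn_out                               # residual connection (S46)
        x = x + self.ffn(self.norm2(x))                # residual connection around the FFN
        return x

# Example: tokens = torch.randn(4, 196, 256); EncoderBlock()(tokens).shape -> (4, 196, 256)
```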
And S5, acquiring image-text mixed features based on the text attention weight and the image attention weight.
The method specifically comprises the following steps:
S51, the image-modality features pass through the convolutional neural network and the multi-head attention mechanism to obtain the image-modality attention weight feature map X_img;
S52, the text-modality features pass through the text extractor and the multi-head attention mechanism to obtain the text-modality attention weight feature map X_txt;
S53, the embedded representations from the two modalities are added together, denoted X_emb = α·X_img + (1-α)·X_txt.
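A minimal sketch of this fusion step follows; the weighting factor α is treated here as a hyper-parameter, since the text does not state how it is chosen, and both feature maps are assumed to have already been projected to the same shape.

```python
# Minimal sketch of the modality fusion in S53 (assumption: PyTorch tensors of equal shape).
import torch

def fuse_embeddings(x_img: torch.Tensor, x_txt: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """X_emb = alpha * X_img + (1 - alpha) * X_txt"""
    assert x_img.shape == x_txt.shape, "both modalities must share shape (B, T, D)"
    return alpha * x_img + (1.0 - alpha) * x_txt
```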
And S6, acquiring bidirectional interaction information between the training data and the target through a multi-head cross attention module.
The bidirectional multi-head cross attention module has a structure similar to that of the multi-head attention mechanism, but differs in its calculation mode and in the objects it considers. Given the joint embedded representation X_α ∈ R^(T×D) obtained in step S4 and the target embedded representation X_β ∈ R^(T×D) obtained by the same procedure, where T and D denote the sequence length and the feature map dimension respectively, the cross-attention Q, K, V matrices are obtained as Q_α = X_α W_qα, K_β = X_β W_kβ, V_β = X_β W_vβ. The implicit dependency between the training data and the target can then be expressed as CA(Z_β→α) = softmax(Q_α K_β^T / sqrt(d_k)) V_β. The above is a single-head cross attention mechanism; the multi-head cross attention mechanism extends it in the same way: MCA(Z_β→α) = [CA_1(Z_β→α); CA_2(Z_β→α); …; CA_k(Z_β→α)] U_mca, where Z_β→α denotes the feature map with position encoding, CA_k denotes a single-head cross attention head, and U_mca denotes a linear mapping matrix.
In the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
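For illustration, the sketch below composes an N-layer bidirectional decoder in PyTorch (an assumed framework), with each layer containing an MSA block and an MCA block for each stream, the two streams exchanging Keys/Values as described above. All dimensions, and the omission of normalization inside each block, are illustrative simplifications.

```python
# Minimal sketch of the bidirectional multi-head cross-attention decoder in S6.
import torch
import torch.nn as nn

class BiCrossAttentionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.msa_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.msa_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca_a = nn.MultiheadAttention(dim, heads, batch_first=True)  # Queries from alpha, Keys/Values from beta
        self.mca_b = nn.MultiheadAttention(dim, heads, batch_first=True)  # Queries from beta, Keys/Values from alpha

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # MSA block: each stream first attends to itself
        x_a = x_a + self.msa_a(x_a, x_a, x_a)[0]
        x_b = x_b + self.msa_b(x_b, x_b, x_b)[0]
        # MCA block: Q_alpha with K_beta/V_beta, and symmetrically Q_beta with K_alpha/V_alpha
        upd_a = x_a + self.mca_a(x_a, x_b, x_b)[0]
        upd_b = x_b + self.mca_b(x_b, x_a, x_a)[0]
        return upd_a, upd_b

class BiCrossAttentionDecoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([BiCrossAttentionLayer(dim, heads) for _ in range(n_layers)])

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        for layer in self.layers:
            x_a, x_b = layer(x_a, x_b)   # both sequences are updated at every layer
        return x_a, x_b
```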
And S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
The method specifically comprises the following steps:
S71, the mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron and a fully connected layer;
S72, the feature maps of the training data and the target are concatenated, and features are extracted through the convolution layers;
S73, finally, the feature map is input into the fully connected layer to obtain the final prediction result P.
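For illustration, a minimal sketch of such a mixing module follows, assuming PyTorch; the way the two feature maps are stacked, the layer sizes and the number of output classes are illustrative assumptions, since the text does not fix them.

```python
# Minimal sketch of the feature-map mixing module in S71-S73 (assumptions noted above).
import torch
import torch.nn as nn

class FeatureMixModule(nn.Module):
    def __init__(self, dim: int = 256, seq_len: int = 196, n_classes: int = 10):
        super().__init__()
        # Training-data and target feature maps are stacked on a channel axis before 2-D convolution.
        self.conv2d = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.conv1d = nn.Conv1d(seq_len, seq_len, kernel_size=3, padding=1)   # 1-D conv over the sequence axis
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fc = nn.Linear(seq_len * dim, n_classes)                         # final prediction head

    def forward(self, f_train: torch.Tensor, f_target: torch.Tensor) -> torch.Tensor:
        # f_train, f_target: (B, T, D) feature maps produced by the decoder
        x = torch.stack([f_train, f_target], dim=1)      # (B, 2, T, D)
        x = self.conv2d(x).squeeze(1)                    # (B, T, D)
        x = self.conv1d(x)
        x = self.mlp(x)
        return self.fc(x.flatten(1))                     # prediction result P
```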
The above embodiments are preferred embodiments of the multi-modal data fusion method based on image-text interaction; the scope of the invention is not limited thereto, and all equivalent changes in shape and structure made according to the invention fall within the protection scope of the invention.

Claims (10)

1. A multi-modal data fusion method based on image-text interaction, characterized by comprising the following steps:
S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data;
S2, obtaining an image feature map of the inspection image through a convolutional neural network;
S3, preprocessing the equipment state data through a text extractor to obtain a text feature map;
S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight;
S5, obtaining image-text mixed features based on the text attention weight and the image attention weight;
S6, obtaining bidirectional interaction information between training data and a target through a multi-head cross attention module;
and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
2. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein in step S2, the local features of the image are learned through a convolutional neural network to obtain the image feature map of the inspection image.
3. The multi-modal data fusion method based on image-text interaction as claimed in claim 2, wherein the convolutional neural network comprises: convolutional layers, batch normalization layers, activation layers, and max pooling layers.
4. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S3 comprises the following steps:
S31, acquiring text information representing the equipment state data;
S32, segmenting the text information into words and dividing it into phrases of length k;
S33, recording the sequence number of each phrase through a dictionary;
and S34, representing the whole text information sequence through the sequence numbers.
5. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein before step S4, an encoder is further constructed to learn global features of the inspection image and the text.
6. The method as claimed in claim 4, wherein the data processing step of the encoder comprises:
normalizing the data with different lengths through a normalization layer;
performing multi-head attention calculation on the normalized data characteristics;
addressing the non-linearity of the data features through a feedforward neural network;
and adding the original data to the output data of the multi-head attention module.
7. The method as claimed in claim 5, wherein the feedforward neural network comprises: a linear layer, a GELU activation function layer and a DropPath layer.
8. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S5 comprises:
S51, the inspection image passes sequentially through a convolutional neural network and a multi-head attention module to obtain an image-modality attention weight feature map;
S52, the text data pass sequentially through the text extractor and the multi-head attention module to obtain a text-modality attention weight feature map;
and S53, fusing the image-modality attention weight feature map with the text-modality attention weight feature map.
9. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S6 comprises the following steps:
in the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
10. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein in step S7, the feature map mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron layer, and a fully connected layer.
CN202211392871.3A 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction Pending CN115659279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211392871.3A CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211392871.3A CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Publications (1)

Publication Number Publication Date
CN115659279A (en) 2023-01-31

Family

ID=85015188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211392871.3A Pending CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Country Status (1)

Country Link
CN (1) CN115659279A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114657A (en) * 2023-10-23 2023-11-24 国网江西省电力有限公司超高压分公司 Fault information early warning system and method based on power equipment inspection knowledge graph
CN117150381A (en) * 2023-08-07 2023-12-01 中国船舶集团有限公司第七〇九研究所 Target function group identification and model training method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination