CN115659279A - Multi-mode data fusion method based on image-text interaction - Google Patents


Info

Publication number
CN115659279A
CN115659279A (application CN202211392871.3A)
Authority
CN
China
Prior art keywords
text
data
image
attention
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211392871.3A
Other languages
Chinese (zh)
Inventor
赵宗罗
赵志新
许毅
李强强
蒋良
罗良
周波
王立森
吕捷
帅万高
孙潇哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd Hangzhou Fuyang District Power Supply Co and Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202211392871.3A
Publication of CN115659279A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal data fusion method based on image-text interaction, which comprises the following steps: S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data; S2, obtaining an image feature map of the inspection image through a convolutional neural network; S3, preprocessing the equipment state data through a text extractor to obtain a text feature map; S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight; S5, obtaining image-text mixed features based on the text attention weight and the image attention weight; S6, obtaining bidirectional interaction information between the training data and the target through a multi-head cross attention module; and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result. The scheme improves recognition accuracy by performing fusion learning and analysis on the multi-modal features.

Description

Multi-mode data fusion method based on image-text interaction
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-mode data fusion method based on image-text interaction.
Background
Currently, artificial intelligence has broad application space in many fields. As forms of information propagation diversify, the types of data available for data mining grow richer. In general, these data can be divided into two categories: structured data, such as text and numerical tables, and unstructured data, including images and audio/video. Given the interactivity and redundancy between the two, fusing different types of data with multi-modal techniques can reduce the required storage space and add feature information of multiple dimensions to the same described object; applying the processed data to data analysis and prediction can effectively improve the accuracy of the results.
In traditional power equipment fault analysis, one is often given data of a single modality (for example, a picture or video describing a problem) and must obtain the regular data of other modalities (an objective textual solution) through manual analysis; this is time-consuming, and because analysts differ in expertise, the results fit the problem to varying degrees. Furthermore, most methods focus on extracting features from individual modalities separately, without fusion learning of the features. Therefore, a multi-modal data fusion algorithm is crucial for solving the prediction problem for a given set of input and output data types.
The above information disclosed in this background section is only for enhancement of understanding of the background of the application, and therefore it may contain information that does not form prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to provide a multi-modal data fusion method based on image-text interaction, addressing the problem that existing prediction methods lack fusion learning and analysis of multi-modal features. Two multi-head self-attention (MSA) modules are constructed to learn image and text features respectively; a bidirectional cross-attention method is proposed for cross-information learning; two decoders, each comprising an MSA block and a multi-head cross-attention (MCA) block, are constructed for cross-information learning; the obtained feature map is input into a feature map mixing module for feature extraction, and a prediction result is output, effectively improving result accuracy.
In a first aspect, an embodiment of the present invention provides a multi-modal data fusion method based on image-text interaction, including the following steps:
S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data;
S2, obtaining an image feature map of the inspection image through a convolutional neural network;
S3, preprocessing the equipment state data through a text extractor to obtain a text feature map;
S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight;
S5, obtaining image-text mixed features based on the text attention weight and the image attention weight;
S6, obtaining bidirectional interaction information between training data and a target through a multi-head cross attention module;
and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
Preferably, in step S2, the local features of the image are learned by a convolutional neural network to obtain an image feature map of the inspection image.
Preferably, the convolutional neural network includes: convolutional layers, batch normalization layers, activation layers, and max pooling layers.
Preferably, step S3 includes the steps of:
s31, acquiring text information representing equipment state data;
S32, segmenting the text information into words and dividing it into phrases of length k;
s33, recording the sequence number of each phrase through a dictionary;
and S34, representing the whole text information sequence through the sequence number.
Preferably, before step S4, an encoder is constructed to learn the global features of the inspection image and the text.
Preferably, the data processing step of the encoder includes:
normalizing the data with different lengths through a normalization layer;
performing multi-head attention calculation on the normalized data characteristics;
addressing the non-linearity of the data features through a feedforward neural network;
and adding the original data to the output data of the multi-head attention module.
Preferably, the feedforward neural network includes: a linear layer, a GELU activation function layer and a DropPath layer.
Preferably, step S5 includes:
S51, the inspection image passes sequentially through the convolutional neural network and the multi-head attention module to obtain an image-modality attention weight feature map;
S52, the text data pass sequentially through the text extractor and the multi-head attention module to obtain a text-modality attention weight feature map;
and S53, fusing the image-modality attention weight feature map with the text-modality attention weight feature map.
Preferably, S6 includes the following steps:
in the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
Preferably, in step S7, the feature map mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron layer, and a fully connected layer.
The invention has the beneficial effects that: a multi-modal data fusion method based on image-text interaction is provided, in which multi-modal information and a bidirectional cross-attention method are used to form a multi-modal data set from images and their corresponding text information. Two multi-head self-attention (MSA) modules are constructed to learn image and text features respectively; a bidirectional cross-attention method is proposed for cross-information learning; two decoders, each comprising an MSA block and a multi-head cross-attention (MCA) block, are constructed for cross-information learning; the obtained feature map is input into a feature map mixing module for feature extraction, and a prediction result is output, effectively improving result accuracy.
The above summary of the present invention is merely an overview of its technical solutions. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the description, and in order that the above and other objects, features and advantages of the present invention may be more readily apparent, preferred embodiments are described below.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings.
Fig. 1 is a flow chart of a multimodal data fusion method based on image-text interaction according to the invention.
Detailed Description
For a better understanding of the objects, technical solutions and advantages of the present invention, the invention is described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only preferred embodiments of the present invention, used to explain rather than to limit the invention; all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations (or steps) can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure; the processes may correspond to methods, functions, procedures, subroutines, and the like.
The embodiment is as follows: as shown in fig. 1, a multimodal data fusion algorithm based on image-text interaction includes the following steps:
S1, multi-modal data are obtained, wherein the multi-modal data comprise inspection image data and equipment state data.
And S2, acquiring an image feature map of the inspection image through a convolutional neural network.
Specifically, the method comprises the following steps:
s21, acquiring a structural image from an image tool library;
and S22, constructing a convolutional neural network to learn the local characteristics of the image. The convolutional neural network mainly comprises a convolutional layer, a batch normalization layer, an activation layer and a maximum pooling layer;
S23, the operation of the convolutional layer can be represented as P_conv = f(F_in * W) + b, where F_in denotes the feature input, W and b are the parameter matrix and bias of the convolution kernel, and f denotes the convolution operation;
S24, batch normalization is applied to the feature map after the convolution operation, which can be expressed as P_BN = BN(P_conv);
S25, increasing the nonlinearity of the model through a GELU activation function;
s26, compressing data by using the maximum pooling layer, and extracting main features;
S27, through the constructed convolutional neural network, a local feature map of the image is obtained, expressed as F_local = Conv(F_in),
where Conv denotes the convolutional neural network constructed by the present invention.
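For illustration only, the following sketch shows how such a convolutional block might be assembled. It assumes a PyTorch implementation (the invention does not name a framework), and the channel counts, kernel sizes and pooling stride are illustrative assumptions rather than values specified by the invention.

```python
# Minimal sketch of the convolutional feature extractor in S21-S27 (assumption: PyTorch).
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),  # P_conv = f(F_in * W) + b
            nn.BatchNorm2d(out_channels),                                     # P_BN = BN(P_conv)
            nn.GELU(),                                                        # non-linearity (S25)
            nn.MaxPool2d(kernel_size=2),                                      # compress data, keep main features (S26)
        )

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # F_local = Conv(F_in)
        return self.block(f_in)

# Example: a batch of 4 RGB inspection images of size 224x224
# features = ConvFeatureExtractor()(torch.randn(4, 3, 224, 224))  # -> shape (4, 64, 112, 112)
```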
And S3, preprocessing the equipment state data through a text extractor to obtain a text feature map.
The method specifically comprises the following steps:
S31, segmenting the text sequence into words and dividing it into phrases of length k;
S32, constructing a dictionary to record the order of appearance of the phrases;
and S33, replacing each original word with its numeric index in the dictionary, so that the whole text is encoded as a sequence of numbers.
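As an illustration, a minimal sketch of this text pre-processing is given below. It assumes whitespace-separated words, and the phrase length k and the dictionary construction shown here are hypothetical choices, since the text does not fix them.

```python
# Minimal sketch of the text extractor pre-processing in S31-S33 (assumptions noted above).
from typing import Dict, List

def encode_text(text: str, k: int, vocab: Dict[str, int]) -> List[int]:
    """Split the text into words, group them into phrases of length k,
    register each unseen phrase in the dictionary in order of appearance,
    and return the sequence of dictionary indices representing the text."""
    words = text.split()
    phrases = [" ".join(words[i:i + k]) for i in range(0, len(words), k)]
    ids = []
    for phrase in phrases:
        if phrase not in vocab:
            vocab[phrase] = len(vocab)   # record the order of appearance
        ids.append(vocab[phrase])
    return ids

# Example (hypothetical device-state text):
# vocab: Dict[str, int] = {}
# encode_text("breaker temperature high oil level normal", k=2, vocab=vocab)  # -> [0, 1, 2]
```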
And S4, constructing a multi-head attention module to obtain the text attention weight and the image attention weight.
The method specifically comprises the following steps:
s41, constructing an encoder of a self-attention mechanism to learn global features of the image/text. The encoder mainly comprises a normalization layer, a multi-head attention calculation layer, a feedforward neural network and residual connection;
S42, the normalization layer processes data of different lengths through linear normalization, which can be represented as P_LN = Norm(F_in), where F_in denotes the feature input and Norm denotes LayerNorm normalization;
s43, since the attention mechanism does not consider the position information, adding the position embedding information into the initial data;
And S44, calculating multi-head attention on the normalized features. First, for an input x, three projection matrices M_Q, M_K and M_V are constructed; multiplying x by each of them yields three different matrices Q, K and V. The attention calculation can therefore be expressed as Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k denotes the dimension of the matrix K. To prevent overfitting, a multi-head attention calculation is used, which can be represented as MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q_i, K_i, V_i) and W^O is a linear projection matrix;
S45, the non-linearity that a single-layer perceptron cannot capture is addressed through a feedforward neural network, which comprises a linear layer, a GELU activation function layer and a DropPath layer; the DropPath layer improves the generalization capability of the network;
S46, residual connections are used: the output of the multi-head attention module is added to the original data, which reduces model complexity and prevents gradient vanishing.
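For illustration, the following sketch assembles one such encoder block in PyTorch (an assumed framework). The embedding dimension, number of heads and DropPath rate are illustrative assumptions, and DropPath (stochastic depth) is implemented inline since it is not part of core PyTorch.

```python
# Minimal sketch of the self-attention encoder in S41-S46: pre-norm, multi-head
# attention, feed-forward network with GELU and DropPath, residual connections.
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly skip the residual branch for whole samples."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = (torch.rand(shape, device=x.device) < keep).to(x.dtype)
        return x * mask / keep

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, drop_path: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # P_LN = Norm(F_in)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                      # linear + GELU + DropPath (S45)
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
            DropPath(drop_path),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Position embeddings (S43) are assumed to have been added to x before the first block.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)               # Q, K, V all come from the same sequence
        x = x + attn_out                               # residual connection (S46)
        x = x + self.ffn(self.norm2(x))                # residual connection around the FFN
        return x

# Example: tokens = torch.randn(4, 196, 256); EncoderBlock()(tokens).shape -> (4, 196, 256)
```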
And S5, acquiring image-text mixed features based on the text attention weight and the image attention weight.
The method specifically comprises the following steps:
S51, the image-modality features pass through the convolutional neural network and the multi-head attention mechanism to obtain the image-modality attention weight feature map X_img;
S52, the text-modality features pass through the text extractor and the multi-head attention mechanism to obtain the text-modality attention weight feature map X_txt;
S53, the embedded representations from the two modalities are added together, denoted X_emb = α·X_img + (1-α)·X_txt.
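A minimal sketch of this fusion step follows; the weighting factor α is treated here as a hyper-parameter, since the text does not state how it is chosen, and both feature maps are assumed to have already been projected to the same shape.

```python
# Minimal sketch of the modality fusion in S53 (assumption: PyTorch tensors of equal shape).
import torch

def fuse_embeddings(x_img: torch.Tensor, x_txt: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """X_emb = alpha * X_img + (1 - alpha) * X_txt"""
    assert x_img.shape == x_txt.shape, "both modalities must share shape (B, T, D)"
    return alpha * x_img + (1.0 - alpha) * x_txt
```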
And S6, acquiring bidirectional interaction information between the training data and the target through a multi-head cross attention module.
The bidirectional multi-head cross attention module has a structure similar to that of the multi-head attention mechanism, but differs in its calculation mode and in the objects it considers. Given the joint embedded representation X_α ∈ R^(T×D) obtained in step S4 and the target embedded representation X_β ∈ R^(T×D) obtained by the same procedure, where T and D denote the sequence length and the feature map dimension respectively, the cross-attention Q, K, V matrices are obtained as Q_α = X_α W_qα, K_β = X_β W_kβ, V_β = X_β W_vβ. The implicit dependency between the training data and the target can then be expressed as CA(Z_β→α) = softmax(Q_α K_β^T / sqrt(d_k)) V_β. The above is a single-head cross attention mechanism; the multi-head cross attention mechanism extends it in the same way: MCA(Z_β→α) = [CA_1(Z_β→α); CA_2(Z_β→α); …; CA_k(Z_β→α)] U_mca, where Z_β→α denotes the feature map with position encoding, CA_k denotes a single-head cross attention head, and U_mca denotes a linear mapping matrix.
In the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
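For illustration, the sketch below composes an N-layer bidirectional decoder in PyTorch (an assumed framework), with each layer containing an MSA block and an MCA block for each stream, the two streams exchanging Keys/Values as described above. All dimensions, and the omission of normalization inside each block, are illustrative simplifications.

```python
# Minimal sketch of the bidirectional multi-head cross-attention decoder in S6.
import torch
import torch.nn as nn

class BiCrossAttentionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.msa_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.msa_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca_a = nn.MultiheadAttention(dim, heads, batch_first=True)  # Queries from alpha, Keys/Values from beta
        self.mca_b = nn.MultiheadAttention(dim, heads, batch_first=True)  # Queries from beta, Keys/Values from alpha

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # MSA block: each stream first attends to itself
        x_a = x_a + self.msa_a(x_a, x_a, x_a)[0]
        x_b = x_b + self.msa_b(x_b, x_b, x_b)[0]
        # MCA block: Q_alpha with K_beta/V_beta, and symmetrically Q_beta with K_alpha/V_alpha
        upd_a = x_a + self.mca_a(x_a, x_b, x_b)[0]
        upd_b = x_b + self.mca_b(x_b, x_a, x_a)[0]
        return upd_a, upd_b

class BiCrossAttentionDecoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([BiCrossAttentionLayer(dim, heads) for _ in range(n_layers)])

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        for layer in self.layers:
            x_a, x_b = layer(x_a, x_b)   # both sequences are updated at every layer
        return x_a, x_b
```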
And S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
The method specifically comprises the following steps:
S71, the mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron and a fully connected layer;
S72, the feature maps of the training data and the target are concatenated, and features are extracted through the convolution layers;
S73, finally, the feature map is input into the fully connected layer to obtain the final prediction result P.
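For illustration, a minimal sketch of such a mixing module follows, assuming PyTorch; the way the two feature maps are stacked, the layer sizes and the number of output classes are illustrative assumptions, since the text does not fix them.

```python
# Minimal sketch of the feature-map mixing module in S71-S73 (assumptions noted above).
import torch
import torch.nn as nn

class FeatureMixModule(nn.Module):
    def __init__(self, dim: int = 256, seq_len: int = 196, n_classes: int = 10):
        super().__init__()
        # Training-data and target feature maps are stacked on a channel axis before 2-D convolution.
        self.conv2d = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.conv1d = nn.Conv1d(seq_len, seq_len, kernel_size=3, padding=1)   # 1-D conv over the sequence axis
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fc = nn.Linear(seq_len * dim, n_classes)                         # final prediction head

    def forward(self, f_train: torch.Tensor, f_target: torch.Tensor) -> torch.Tensor:
        # f_train, f_target: (B, T, D) feature maps produced by the decoder
        x = torch.stack([f_train, f_target], dim=1)      # (B, 2, T, D)
        x = self.conv2d(x).squeeze(1)                    # (B, T, D)
        x = self.conv1d(x)
        x = self.mlp(x)
        return self.fc(x.flatten(1))                     # prediction result P
```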
The above embodiments are preferred embodiments of the multi-modal data fusion method based on image-text interaction; the scope of the invention is not limited thereto, and all equivalent changes in shape and structure made according to the invention fall within the protection scope of the invention.

Claims (10)

1. A multi-modal data fusion method based on image-text interaction, characterized by comprising the following steps:
S1, obtaining multi-modal data, wherein the multi-modal data comprise inspection image data and equipment state data;
S2, obtaining an image feature map of the inspection image through a convolutional neural network;
S3, preprocessing the equipment state data through a text extractor to obtain a text feature map;
S4, constructing a multi-head attention module to obtain a text attention weight and an image attention weight;
S5, obtaining image-text mixed features based on the text attention weight and the image attention weight;
S6, obtaining bidirectional interaction information between training data and a target through a multi-head cross attention module;
and S7, obtaining mixed feature map information through a feature map mixing module and outputting a prediction result.
2. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein in step S2, the local features of the image are learned through a convolutional neural network to obtain the image feature map of the inspection image.
3. The multi-modal data fusion method based on image-text interaction as claimed in claim 2, wherein the convolutional neural network comprises: convolutional layers, batch normalization layers, activation layers, and max pooling layers.
4. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S3 comprises the following steps:
S31, acquiring text information representing the equipment state data;
S32, segmenting the text information into words and dividing it into phrases of length k;
S33, recording the sequence number of each phrase through a dictionary;
and S34, representing the whole text information sequence through the sequence numbers.
5. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein before step S4, an encoder is further constructed to learn global features of the inspection image and the text.
6. The method as claimed in claim 4, wherein the data processing step of the encoder comprises:
normalizing the data with different lengths through a normalization layer;
performing multi-head attention calculation on the normalized data characteristics;
addressing the non-linearity of the data features through a feedforward neural network;
and adding the original data to the output data of the multi-head attention module.
7. The method as claimed in claim 5, wherein the feedforward neural network comprises: a linear layer, a GELU activation function layer and a DropPath layer.
8. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S5 comprises:
S51, the inspection image passes sequentially through a convolutional neural network and a multi-head attention module to obtain an image-modality attention weight feature map;
S52, the text data pass sequentially through the text extractor and the multi-head attention module to obtain a text-modality attention weight feature map;
and S53, fusing the image-modality attention weight feature map with the text-modality attention weight feature map.
9. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein step S6 comprises the following steps:
in the N-layer decoder, the training data continuously update their feature sequence with feature information from the target data via the multi-head cross attention module;
synchronously, the target data continuously update their feature sequence with feature information from the training data via the multi-head cross attention module;
the feature information of the target data is converted into different Keys/Values that interact with the Queries of the training data, thereby achieving bidirectional interaction.
10. The multi-modal data fusion method based on image-text interaction as claimed in claim 1, wherein in step S7, the feature map mixing module comprises a two-dimensional convolution layer, a one-dimensional convolution layer, a multi-layer perceptron layer, and a fully connected layer.
CN202211392871.3A 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction Pending CN115659279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211392871.3A CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211392871.3A CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Publications (1)

Publication Number Publication Date
CN115659279A (en) 2023-01-31

Family

ID=85015188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211392871.3A Pending CN115659279A (en) 2022-11-08 2022-11-08 Multi-mode data fusion method based on image-text interaction

Country Status (1)

Country Link
CN (1) CN115659279A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114657A (en) * 2023-10-23 2023-11-24 国网江西省电力有限公司超高压分公司 Fault information early warning system and method based on power equipment inspection knowledge graph
CN117150381A (en) * 2023-08-07 2023-12-01 中国船舶集团有限公司第七〇九研究所 Target function group identification and model training method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination