CN113887585A - Image-text multi-mode fusion method based on coding and decoding network - Google Patents


Info

Publication number
CN113887585A
Authority
CN
China
Prior art keywords
text
image
training
data set
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111087906.8A
Other languages
Chinese (zh)
Inventor
陈咪咪
陈思华
刘平英
高昂昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202111087906.8A
Publication of CN113887585A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image-text multimodal fusion method based on an encoder-decoder network, belonging to the technical fields of computer vision, natural language processing, and pattern recognition. The method comprises the following steps. S1: manually annotate an existing object detection data set to generate text information, construct a new image-text data set, and divide the data set into a training set, a validation set, and a test set. S2: select a suitable optimization method, set the relevant hyper-parameters, and train the encoder-decoder network model on the training set and validation set. S3: after training, select a picture from the test set, input it into the encoder-decoder network model, load the trained model weights, and detect the corresponding target. The invention adopts an image-text fusion approach: two different types of data describing the same thing are fused, so the network trains to higher accuracy and the required targets are identified.

Description

Image-text multimodal fusion method based on an encoder-decoder network
Technical Field
The invention relates to an image-text multimodal fusion method based on an encoder-decoder network, belonging to the technical fields of computer vision, natural language processing, and pattern recognition.
Background
In recent years, with the rapid development of artificial intelligence technology, a large number of deep-learning-based object detection algorithms have emerged. Object detection finds all objects of interest in an image; it comprises the two subtasks of object localization and object classification, determining both the category and the position of each object. Current deep-learning object detection models mainly include YOLO, ResNet, SSD, and other convolutional neural network (CNN) series models. Classical deep-learning object detection algorithms usually operate on the image dimension alone, so researchers in related fields continually refine the networks to obtain higher accuracy. Such refinements usually make the deep network deeper, and continually increasing the number of layers causes problems such as vanishing and exploding gradients. To solve these problems, researchers have proposed many improved network architectures, but these make the network more complex.
Disclosure of Invention
In view of these problems, the invention combines the idea of multi-task joint processing and provides an image-text multimodal fusion method based on an encoder-decoder network. The feature matrices obtained from an image and its corresponding text are fused, so that the text information and the image information complement each other and a more accurate processing result is obtained.
To solve the above technical problems, the invention adopts the following technical scheme:
An image-text multimodal fusion method based on an encoder-decoder network comprises the following steps:
S1: manually annotate an existing object detection data set to generate text information, construct a new image-text data set, and divide the data set into a training set, a validation set, and a test set at a ratio of 6:2:2;
S2: select a suitable optimization method, set the relevant hyper-parameters, and train the encoder-decoder network model on the training set and validation set from S1;
S3: after training, select a picture from the test set, input it into the encoder-decoder network model, load the trained model weights, and detect the corresponding target.
In step S2, the encoder-decoder network model comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the size of the input matrix.
There are four encoders and four decoders. Each encoder block contains two convolutional layers with 3x3 kernels and one max pooling layer with a 2x2 window, and each decoder block contains two deconvolution layers with 3x3 kernels and one 2x2 up-sampling layer.
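For illustration, these encoder and decoder blocks could be sketched in PyTorch as follows; this is a minimal sketch under the 3x3/2x2 configuration stated above, not the patent's disclosed code, and the channel arguments (in_ch, out_ch) are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions (each followed by ReLU) and 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),  # halves the spatial scale
        )

    def forward(self, x):
        return self.block(x)

class DecoderBlock(nn.Module):
    """Two 3x3 transposed convolutions (each followed by ReLU) and 2x2 up-sampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),  # doubles the spatial scale
        )

    def forward(self, x):
        return self.block(x)
```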
The attention layer processes features in parallel through atrous spatial pyramid pooling (ASPP) and a global average pooling layer.
The atrous spatial pyramid pooling uses dilated convolutions with 3x3 kernels.
The suitable optimization method of step S2 is a stochastic gradient descent (SGD) optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum, and weight decay coefficient.
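A hedged sketch of this optimizer setup follows; torch.optim.SGD exposes exactly these hyper-parameters, but the numeric values shown are illustrative assumptions, since the patent does not disclose them.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3)  # stand-in for the encoder-decoder network model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate (illustrative value)
    momentum=0.9,       # momentum (illustrative value)
    weight_decay=1e-4,  # weight decay coefficient (illustrative value)
)
batch_size = 16         # batch size, set alongside the optimizer hyper-parameters
```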
The invention has the following beneficial effects:
the invention adopts an image-text fusion processing method, and utilizes two different types of data of the same thing to perform fusion processing, so that the accuracy is higher during network training, and further, a related required target is identified.
Drawings
Fig. 1 is a diagram of the network structure.
Fig. 2 is a diagram of the attention module structure.
FIG. 3 is a schematic diagram of the training set, in which (a1), (a2), and (a3) are original images of the image channel; (b1), (b2), and (b3) are image labels; and (c1), (c2), and (c3) are the text information corresponding to the images.
FIG. 4 shows segmentation prediction results, wherein (a) is the aircraft segmentation prediction result; (b) is the motorcycle segmentation prediction result; and (c) is the person-and-horse segmentation prediction result.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides an image-text multimodal fusion method based on an encoder-decoder network. By processing the image information and the text information, the invention obtains a feature matrix for each modality. Passing these through the encoder-decoder network fuses the text and image features; meanwhile, to focus better on useful feature information, an attention mechanism is added in the middle of the encoder-decoder network, using spatial pyramid pooling and global average pooling in parallel. Fig. 1 shows the network structure and Fig. 2 the attention module.
Multimodal processing first requires extracting a feature matrix from each modality. For the image channel, the invention uses a 3D-ResNet; the network does not perform a final classification of the image but directly learns its feature matrix and weight ratio. The text channel uses a long short-term memory network (LSTM), which learns the contextual information of the text and thus understands its content accurately. Like the image channel, it only produces a feature matrix and weight ratio, with no classification step.
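A minimal sketch of the two channels is given below; torchvision's r3d_18 stands in for the 3D-ResNet, the vocabulary and hidden sizes of the LSTM are assumptions, and both channels stop at a feature matrix rather than a classification head, as described above.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ImageChannel(nn.Module):
    """3D-ResNet backbone producing an image feature matrix; no classifier head."""
    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights=None)
        # Keep everything up to (and including) the global pooling; drop the fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        return self.features(clip).flatten(1)   # (B, 512) feature matrix

class TextChannel(nn.Module):
    """LSTM over token embeddings; the final hidden state is the text feature."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, L) integer ids
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return h_n[-1]                          # (B, hidden_dim) feature matrix
```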
After the feature information of the image and the text is obtained, cross-modality fusion is required. The invention adopts the method that the encoding and decoding network directly carries out feature fusion on the encoding and decoding network, and carries out convolution coding on the encoding and decoding network through the feature matrix of the text and the image information, thereby obtaining a more accurate feature map (feature matrix), carrying out deconvolution on the feature map, and finally obtaining a final result through the classification of a classifier.
In the encoder-decoder network, the encoder uses 3x3 convolutions, each followed by a ReLU activation, with 2x2 max pooling after every two convolutions. The decoder uses 3x3 convolutions with ReLU activations, each followed by a 2x2 up-sampling deconvolution.
The invention is used as follows: first, an image and a text are input. The image is processed by the 3D-ResNet to learn its feature matrix and weight ratio, and the text is processed by the long short-term memory network to obtain its feature matrix and weight ratio.
Then the image features and the text features are fused through a pre-trained encoder-decoder network. During fusion, the feature matrices of the text and image information are convolutionally encoded, yielding a single, accurate feature map; the feature map is then deconvolved, and the final result is obtained through a classifier.
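As a concrete illustration, the fusion step could look like the following sketch; combining the text feature with the image feature map by broadcast addition is an assumption (the patent does not specify the combination operator), and the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class FusionCodec(nn.Module):
    """Fuse text and image features, then encode, attend, decode, and classify."""
    def __init__(self, encoder=None, attention=None, decoder=None,
                 feat_ch=64, num_classes=20):
        super().__init__()
        self.encoder = encoder or nn.Identity()
        self.attention = attention or nn.Identity()
        self.decoder = decoder or nn.Identity()
        self.classifier = nn.Conv2d(feat_ch, num_classes, kernel_size=1)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W); txt_feat: (B, C), assumed already projected
        # to the same channel dimension C and broadcast onto every pixel.
        fused = img_feat + txt_feat[:, :, None, None]
        x = self.encoder(fused)     # convolutional encoding of the fused features
        x = self.attention(x)       # keep primary information, suppress noise
        x = self.decoder(x)         # deconvolution back to the input scale
        return self.classifier(x)   # per-pixel class scores for the classifier step
```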
To better learn the features of the fused information, an attention module is added at the center of the encoder-decoder network, processing in parallel with spatial pyramid pooling and global average pooling. The spatial pyramid pooling uses dilated (atrous) convolutions, which enlarge the receptive field of the convolution so that each output covers a wider context; a 1x1 convolution then reduces the number of channels to the expected value. In parallel with the pyramid pooling, global average pooling accumulates all pixel values of each feature map and averages them. After spatial pyramid pooling and global average pooling, the features pass through a 1x1 convolution to obtain a feature map from which unimportant noise interference has largely been filtered out. Finally, a Sigmoid activation yields a new feature matrix; the enlarged receptive field captures high-order information.
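A sketch of this attention module follows; the parallel ASPP and global-average-pooling branches, the 1x1 channel reduction, and the Sigmoid activation follow the description above, while the dilation rates, channel counts, and the final gating multiplication are assumptions.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """ASPP and global average pooling in parallel, merged by a 1x1 convolution."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # ASPP: parallel 3x3 dilated convolutions widen the receptive field.
        self.aspp = nn.ModuleList(
            nn.Conv2d(ch, ch, kernel_size=3, padding=r, dilation=r) for r in rates
        )
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling branch
        self.reduce = nn.Conv2d(ch * (len(rates) + 1), ch, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [conv(x) for conv in self.aspp]
        # Broadcast the pooled global context back to the full spatial size.
        branches.append(self.gap(x).expand(-1, -1, h, w))
        att = self.gate(self.reduce(torch.cat(branches, dim=1)))
        return x * att   # re-weight features, suppressing unimportant noise
```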
In addition, the invention introduces two loss functions to constrain the model: binary cross-entropy and the Dice coefficient function.
The total loss of the model is expressed as

L = L_B + L_D

where L_B is the binary cross-entropy loss function and L_D is the Dice coefficient loss function:

L_B = -\frac{1}{\mathrm{output\_size}} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]

L_D = 1 - \frac{2 \sum_{i=1}^{n} y_i \hat{y}_i}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i}

where x_i is the image in the i-th image-text pair, y_i is the text in the i-th image-text pair, \hat{y}_i is the predicted text for the i-th image-text pair, n is the number of image-text samples, and output_size represents the output data size.
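Under the standard definitions of binary cross-entropy and the Dice coefficient, the combined loss could be implemented as in the sketch below; the smoothing constant eps is an assumption added for numerical stability.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, eps=1e-6):
    """Combined loss L = L_B + L_D; pred holds sigmoid probabilities in [0, 1]."""
    l_b = F.binary_cross_entropy(pred, target)  # binary cross-entropy term
    inter = (pred * target).sum()
    l_d = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)  # Dice term
    return l_b + l_d
```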
The invention adds text information to an existing object detection data set to form a new data set. A total of 1000 object detection pictures are selected, covering 20 classes: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, tv. Each picture is manually labeled and its text information is manually generated; the text is a short phrase containing the relevant information in the picture. The data set is divided into a training set, a validation set, and a test set at a ratio of 6:2:2.
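The 6:2:2 split could be performed as in the following sketch; the dummy tensors standing in for the image-text data set and the fixed random seed are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Dummy stand-in for the 1000-sample image-text data set described above.
dataset = TensorDataset(torch.zeros(1000, 3, 224, 224),
                        torch.zeros(1000, dtype=torch.long))
train_set, val_set, test_set = random_split(
    dataset, [600, 200, 200],                     # 6:2:2 of 1000 samples
    generator=torch.Generator().manual_seed(42),  # fixed seed, an assumption
)
```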
The network model is trained on the training set by stochastic gradient descent (SGD), with the hyper-parameters set, to obtain the weight matrix. The data in the test set are then used to measure the model's accuracy.
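A minimal training-and-saving loop consistent with this procedure is sketched below; the epoch count and output path are assumptions, and the model, data loader, optimizer, and loss function are the objects from the earlier sketches.

```python
import torch

def train_and_save(model, train_loader, optimizer, loss_fn,
                   epochs=50, out_path="codec_fusion.pth"):
    """SGD training loop; epoch count and output path are illustrative."""
    model.train()
    for _ in range(epochs):
        for images, texts, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images, texts), labels)  # forward pass + loss
            loss.backward()                               # back-propagation
            optimizer.step()                              # SGD update
    torch.save(model.state_dict(), out_path)  # weight matrix used in testing (S3)
```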
Fig. 3 is a schematic diagram of the training set, showing three groups of data selected from it: (a1), (a2), and (a3) are original images of the image channel; (b1), (b2), and (b3) are image labels; and (c1), (c2), and (c3) are the text information corresponding to the images.
Fig. 4 shows prediction segmentation results. It can be seen clearly that, after prediction by the network of the present invention, the objects in the images are identified accurately, framed, and labeled with their names. In (a) the network detects the plane, frames it, and labels it aeroplane. In (b) the network frames the motorcycle and labels it motorbike. In (c) the network detects the person and the horse, frames each separately, and labels them person and horse, showing that the method also applies to multi-target detection and classification.

Claims (6)

1. An image-text multimodal fusion method based on an encoder-decoder network, characterized by comprising the following steps:
S1: manually annotating an existing object detection data set to generate text information, constructing a new image-text data set, and dividing the data set into a training set, a validation set, and a test set at a ratio of 6:2:2;
S2: selecting a suitable optimization method, setting the relevant hyper-parameters, and training the encoder-decoder network model on the training set and validation set from S1;
S3: after training, selecting a picture from the test set, inputting it into the encoder-decoder network model, loading the trained model weights, and detecting the corresponding target.
2. The image-text multimodal fusion method based on an encoder-decoder network according to claim 1, wherein the encoder-decoder network model in step S2 comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the size of the input matrix.
3. The image-text multimodal fusion method based on an encoder-decoder network according to claim 2, wherein there are four encoders and four decoders; each encoder block comprises two convolutional layers with 3x3 kernels and one max pooling layer with a 2x2 window, and each decoder block comprises two deconvolution layers with 3x3 kernels and one 2x2 up-sampling layer.
4. The image-text multimodal fusion method based on an encoder-decoder network according to claim 2, wherein the attention layer processes features in parallel through atrous spatial pyramid pooling and a global average pooling layer.
5. The image-text multimodal fusion method based on an encoder-decoder network according to claim 4, wherein the atrous spatial pyramid pooling uses dilated convolutions with 3x3 kernels.
6. The image-text multimodal fusion method based on an encoder-decoder network according to claim 1, wherein the suitable optimization method of step S2 is a stochastic gradient descent optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum, and weight decay coefficient.
CN202111087906.8A 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network Pending CN113887585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111087906.8A CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111087906.8A CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Publications (1)

Publication Number Publication Date
CN113887585A (en) 2022-01-04

Family

ID=79009294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111087906.8A Pending CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Country Status (1)

Country Link
CN (1) CN113887585A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712108A (en) * 2018-11-05 2019-05-03 Hangzhou Dianzi University A visual localization method based on a network generating diverse discriminative candidate boxes
CN112308080A (en) * 2020-11-05 2021-02-02 Nanqiang Zhishi (Xiamen) Technology Co., Ltd. Image description prediction method for directional visual understanding and segmentation
CN113362332A (en) * 2021-06-08 2021-09-07 Nanjing University of Information Science and Technology Deep network segmentation method for coronary artery lumen contours in OCT images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAIYONG WANG ET AL.: "Joint Iris Segmentation and Localization Using Deep Multi-task Learning Framework", arXiv, 19 September 2019 (2019-09-19), pages 1-13 *
YIYI ZHOU ET AL.: "A Real-time Global Inference Network for One-stage Referring Expression Comprehension", arXiv, 7 December 2019 (2019-12-07), pages 1-10 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114847963A (en) * 2022-05-06 2022-08-05 Guangdong University of Technology High-precision electrocardiogram feature point detection method
CN116563707A (en) * 2023-05-08 2023-08-08 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Lycium chinense insect pest identification method based on image-text multimodal feature fusion
CN116563707B (en) * 2023-05-08 2024-02-27 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Lycium chinense insect pest identification method based on image-text multimodal feature fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination