CN113887585A - Image-text multi-mode fusion method based on coding and decoding network - Google Patents


Info

Publication number
CN113887585A
Authority
CN
China
Prior art keywords
text
image
training
data set
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111087906.8A
Other languages
Chinese (zh)
Inventor
陈咪咪
陈思华
刘平英
高昂昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202111087906.8A
Publication of CN113887585A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image-text multimodal fusion method based on an encoder-decoder network, belonging to the technical fields of computer vision, natural language processing, and pattern recognition. The method comprises the following steps. S1: manually annotate an existing object detection data set to generate text information, construct a new image-text data set, and divide the data set into a training set, a validation set, and a test set. S2: select a suitable optimization method, set the relevant hyper-parameters, and train the encoder-decoder network model on the training set and validation set. S3: after training, select a picture from the test set, input it into the encoder-decoder network model, load the trained model weights, and detect the corresponding target. The invention adopts an image-text fusion approach: two different types of data describing the same thing are fused, so the network trains to higher accuracy and the required targets are identified.

Description

Image-text multimodal fusion method based on an encoder-decoder network
Technical Field
The invention relates to an image-text multimodal fusion method based on an encoder-decoder network, belonging to the technical fields of computer vision, natural language processing, and pattern recognition.
Background
In recent years, with the rapid development of artificial intelligence technology, a large number of deep-learning-based object detection algorithms have emerged. Object detection finds all objects of interest in an image; it comprises the two subtasks of object localization and object classification, determining both the category and the position of each object. Current deep-learning object detection models mainly include YOLO, ResNet, SSD, and other convolutional neural network (CNN) series models. Classical deep-learning object detection algorithms usually operate on the image dimension alone, so researchers in related fields continually refine the networks to obtain higher accuracy. Such refinements usually make the deep network deeper, and continually increasing the number of layers causes problems such as vanishing and exploding gradients. To solve these problems, researchers have proposed many improved network architectures, but these make the network more complex.
Disclosure of Invention
In view of these problems, the invention combines the idea of multi-task joint processing and provides an image-text multimodal fusion method based on an encoder-decoder network. The feature matrices obtained from an image and its corresponding text are fused, so that the text information and the image information complement each other and a more accurate processing result is obtained.
To solve the above technical problems, the invention adopts the following technical scheme:
An image-text multimodal fusion method based on an encoder-decoder network comprises the following steps:
S1: manually annotate an existing object detection data set to generate text information, construct a new image-text data set, and divide the data set into a training set, a validation set, and a test set at a ratio of 6:2:2;
S2: select a suitable optimization method, set the relevant hyper-parameters, and train the encoder-decoder network model on the training set and validation set from S1;
S3: after training, select a picture from the test set, input it into the encoder-decoder network model, load the trained model weights, and detect the corresponding target.
In step S2, the encoder-decoder network model comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the size of the input matrix.
There are four encoders and four decoders. Each encoder block contains two convolutional layers with 3x3 kernels and one max pooling layer with a 2x2 window, and each decoder block contains two deconvolution layers with 3x3 kernels and one 2x2 up-sampling layer.
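For illustration, these encoder and decoder blocks could be sketched in PyTorch as follows; this is a minimal sketch under the 3x3/2x2 configuration stated above, not the patent's disclosed code, and the channel arguments (in_ch, out_ch) are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions (each followed by ReLU) and 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),  # halves the spatial scale
        )

    def forward(self, x):
        return self.block(x)

class DecoderBlock(nn.Module):
    """Two 3x3 transposed convolutions (each followed by ReLU) and 2x2 up-sampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),  # doubles the spatial scale
        )

    def forward(self, x):
        return self.block(x)
```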
The attention layer processes features in parallel through atrous spatial pyramid pooling (ASPP) and a global average pooling layer.
The atrous spatial pyramid pooling uses dilated convolutions with 3x3 kernels.
The suitable optimization method of step S2 is a stochastic gradient descent (SGD) optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum, and weight decay coefficient.
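A hedged sketch of this optimizer setup follows; torch.optim.SGD exposes exactly these hyper-parameters, but the numeric values shown are illustrative assumptions, since the patent does not disclose them.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3)  # stand-in for the encoder-decoder network model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate (illustrative value)
    momentum=0.9,       # momentum (illustrative value)
    weight_decay=1e-4,  # weight decay coefficient (illustrative value)
)
batch_size = 16         # batch size, set alongside the optimizer hyper-parameters
```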
The invention has the following beneficial effects:
the invention adopts an image-text fusion processing method, and utilizes two different types of data of the same thing to perform fusion processing, so that the accuracy is higher during network training, and further, a related required target is identified.
Drawings
Fig. 1 is a diagram of the network structure.
Fig. 2 is a diagram of the attention module structure.
FIG. 3 is a schematic diagram of the training set, in which (a1), (a2), and (a3) are original images of the image channel; (b1), (b2), and (b3) are image labels; and (c1), (c2), and (c3) are the text information corresponding to the images.
FIG. 4 shows segmentation prediction results, wherein (a) is the aircraft segmentation prediction result; (b) is the motorcycle segmentation prediction result; and (c) is the person-and-horse segmentation prediction result.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides an image-text multimodal fusion method based on an encoder-decoder network. By processing the image information and the text information, the invention obtains a feature matrix for each modality. Passing these through the encoder-decoder network fuses the text and image features; meanwhile, to focus better on useful feature information, an attention mechanism is added in the middle of the encoder-decoder network, using spatial pyramid pooling and global average pooling in parallel. Fig. 1 shows the network structure and Fig. 2 the attention module.
Multimodal processing first requires extracting a feature matrix from each modality. For the image channel, the invention uses a 3D-ResNet; the network does not perform a final classification of the image but directly learns its feature matrix and weight ratio. The text channel uses a long short-term memory network (LSTM), which learns the contextual information of the text and thus understands its content accurately. Like the image channel, it only produces a feature matrix and weight ratio, with no classification step.
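A minimal sketch of the two channels is given below; torchvision's r3d_18 stands in for the 3D-ResNet, the vocabulary and hidden sizes of the LSTM are assumptions, and both channels stop at a feature matrix rather than a classification head, as described above.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ImageChannel(nn.Module):
    """3D-ResNet backbone producing an image feature matrix; no classifier head."""
    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights=None)
        # Keep everything up to (and including) the global pooling; drop the fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        return self.features(clip).flatten(1)   # (B, 512) feature matrix

class TextChannel(nn.Module):
    """LSTM over token embeddings; the final hidden state is the text feature."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, L) integer ids
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return h_n[-1]                          # (B, hidden_dim) feature matrix
```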
After the feature information of the image and the text is obtained, cross-modality fusion is required. The invention adopts the method that the encoding and decoding network directly carries out feature fusion on the encoding and decoding network, and carries out convolution coding on the encoding and decoding network through the feature matrix of the text and the image information, thereby obtaining a more accurate feature map (feature matrix), carrying out deconvolution on the feature map, and finally obtaining a final result through the classification of a classifier.
In the encoder-decoder network, the encoder uses 3x3 convolutions, each followed by a ReLU activation, with 2x2 max pooling after every two convolutions. The decoder uses 3x3 convolutions with ReLU activations, each followed by a 2x2 up-sampling deconvolution.
The invention is used as follows: first, an image and a text are input. The image is processed by the 3D-ResNet to learn its feature matrix and weight ratio, and the text is processed by the long short-term memory network to obtain its feature matrix and weight ratio.
Then the image features and the text features are fused through a pre-trained encoder-decoder network. During fusion, the feature matrices of the text and image information are convolutionally encoded, yielding a single, accurate feature map; the feature map is then deconvolved, and the final result is obtained through a classifier.
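As a concrete illustration, the fusion step could look like the following sketch; combining the text feature with the image feature map by broadcast addition is an assumption (the patent does not specify the combination operator), and the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class FusionCodec(nn.Module):
    """Fuse text and image features, then encode, attend, decode, and classify."""
    def __init__(self, encoder=None, attention=None, decoder=None,
                 feat_ch=64, num_classes=20):
        super().__init__()
        self.encoder = encoder or nn.Identity()
        self.attention = attention or nn.Identity()
        self.decoder = decoder or nn.Identity()
        self.classifier = nn.Conv2d(feat_ch, num_classes, kernel_size=1)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W); txt_feat: (B, C), assumed already projected
        # to the same channel dimension C and broadcast onto every pixel.
        fused = img_feat + txt_feat[:, :, None, None]
        x = self.encoder(fused)     # convolutional encoding of the fused features
        x = self.attention(x)       # keep primary information, suppress noise
        x = self.decoder(x)         # deconvolution back to the input scale
        return self.classifier(x)   # per-pixel class scores for the classifier step
```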
To better learn the features of the fused information, an attention module is added at the center of the encoder-decoder network, processing in parallel with spatial pyramid pooling and global average pooling. The spatial pyramid pooling uses dilated (atrous) convolutions, which enlarge the receptive field of the convolution so that each output covers a wider context; a 1x1 convolution then reduces the number of channels to the expected value. In parallel with the pyramid pooling, global average pooling accumulates all pixel values of each feature map and averages them. After spatial pyramid pooling and global average pooling, the features pass through a 1x1 convolution to obtain a feature map from which unimportant noise interference has largely been filtered out. Finally, a Sigmoid activation yields a new feature matrix; the enlarged receptive field captures high-order information.
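A sketch of this attention module follows; the parallel ASPP and global-average-pooling branches, the 1x1 channel reduction, and the Sigmoid activation follow the description above, while the dilation rates, channel counts, and the final gating multiplication are assumptions.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """ASPP and global average pooling in parallel, merged by a 1x1 convolution."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # ASPP: parallel 3x3 dilated convolutions widen the receptive field.
        self.aspp = nn.ModuleList(
            nn.Conv2d(ch, ch, kernel_size=3, padding=r, dilation=r) for r in rates
        )
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling branch
        self.reduce = nn.Conv2d(ch * (len(rates) + 1), ch, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [conv(x) for conv in self.aspp]
        # Broadcast the pooled global context back to the full spatial size.
        branches.append(self.gap(x).expand(-1, -1, h, w))
        att = self.gate(self.reduce(torch.cat(branches, dim=1)))
        return x * att   # re-weight features, suppressing unimportant noise
```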
In addition, the invention introduces two loss functions to constrain the model: binary cross-entropy and the Dice coefficient function.
The total loss of the model is expressed as

L = L_B + L_D

where L_B is the binary cross-entropy loss function and L_D is the Dice coefficient loss function:

L_B = -\frac{1}{\mathrm{output\_size}} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]

L_D = 1 - \frac{2 \sum_{i=1}^{n} y_i \hat{y}_i}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i}

where x_i is the image in the i-th image-text pair, y_i is the text in the i-th image-text pair, \hat{y}_i is the predicted text for the i-th image-text pair, n is the number of image-text samples, and output_size represents the output data size.
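Under the standard definitions of binary cross-entropy and the Dice coefficient, the combined loss could be implemented as in the sketch below; the smoothing constant eps is an assumption added for numerical stability.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, eps=1e-6):
    """Combined loss L = L_B + L_D; pred holds sigmoid probabilities in [0, 1]."""
    l_b = F.binary_cross_entropy(pred, target)  # binary cross-entropy term
    inter = (pred * target).sum()
    l_d = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)  # Dice term
    return l_b + l_d
```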
The invention adds text information to an existing object detection data set to form a new data set. A total of 1000 object detection pictures are selected, covering 20 classes: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, tv. Each picture is manually labeled and its text information is manually generated; the text is a short phrase containing the relevant information in the picture. The data set is divided into a training set, a validation set, and a test set at a ratio of 6:2:2.
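The 6:2:2 split could be performed as in the following sketch; the dummy tensors standing in for the image-text data set and the fixed random seed are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Dummy stand-in for the 1000-sample image-text data set described above.
dataset = TensorDataset(torch.zeros(1000, 3, 224, 224),
                        torch.zeros(1000, dtype=torch.long))
train_set, val_set, test_set = random_split(
    dataset, [600, 200, 200],                     # 6:2:2 of 1000 samples
    generator=torch.Generator().manual_seed(42),  # fixed seed, an assumption
)
```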
The network model is trained on the training set by stochastic gradient descent (SGD), with the hyper-parameters set, to obtain the weight matrix. The data in the test set are then used to measure the model's accuracy.
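A minimal training-and-saving loop consistent with this procedure is sketched below; the epoch count and output path are assumptions, and the model, data loader, optimizer, and loss function are the objects from the earlier sketches.

```python
import torch

def train_and_save(model, train_loader, optimizer, loss_fn,
                   epochs=50, out_path="codec_fusion.pth"):
    """SGD training loop; epoch count and output path are illustrative."""
    model.train()
    for _ in range(epochs):
        for images, texts, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images, texts), labels)  # forward pass + loss
            loss.backward()                               # back-propagation
            optimizer.step()                              # SGD update
    torch.save(model.state_dict(), out_path)  # weight matrix used in testing (S3)
```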
Fig. 3 is a schematic diagram of the training set, showing three groups of data selected from it: (a1), (a2), and (a3) are original images of the image channel; (b1), (b2), and (b3) are image labels; and (c1), (c2), and (c3) are the text information corresponding to the images.
Fig. 4 shows prediction segmentation results. It can be seen clearly that, after prediction by the network of the present invention, the objects in the images are identified accurately, framed, and labeled with their names. In (a) the network detects the plane, frames it, and labels it aeroplane. In (b) the network frames the motorcycle and labels it motorbike. In (c) the network detects the person and the horse, frames each separately, and labels them person and horse, showing that the method also applies to multi-target detection and classification.

Claims (6)

1. An image-text multimodal fusion method based on an encoder-decoder network, characterized by comprising the following steps:
S1: manually annotating an existing object detection data set to generate text information, constructing a new image-text data set, and dividing the data set into a training set, a validation set, and a test set at a ratio of 6:2:2;
S2: selecting a suitable optimization method, setting the relevant hyper-parameters, and training the encoder-decoder network model on the training set and validation set from S1;
S3: after training, selecting a picture from the test set, inputting it into the encoder-decoder network model, loading the trained model weights, and detecting the corresponding target.
2. The image-text multimodal fusion method based on an encoder-decoder network according to claim 1, wherein the encoder-decoder network model in step S2 comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the size of the input matrix.
3. The image-text multimodal fusion method based on an encoder-decoder network according to claim 2, wherein there are four encoders and four decoders; each encoder block comprises two convolutional layers with 3x3 kernels and one max pooling layer with a 2x2 window, and each decoder block comprises two deconvolution layers with 3x3 kernels and one 2x2 up-sampling layer.
4. The image-text multimodal fusion method based on an encoder-decoder network according to claim 2, wherein the attention layer processes features in parallel through atrous spatial pyramid pooling and a global average pooling layer.
5. The image-text multimodal fusion method based on an encoder-decoder network according to claim 4, wherein the atrous spatial pyramid pooling uses dilated convolutions with 3x3 kernels.
6. The image-text multimodal fusion method based on an encoder-decoder network according to claim 1, wherein the suitable optimization method of step S2 is a stochastic gradient descent optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum, and weight decay coefficient.
CN202111087906.8A 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network Pending CN113887585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111087906.8A CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111087906.8A CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Publications (1)

Publication Number Publication Date
CN113887585A (en) 2022-01-04

Family

ID=79009294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111087906.8A Pending CN113887585A (en) 2021-09-16 2021-09-16 Image-text multi-mode fusion method based on coding and decoding network

Country Status (1)

Country Link
CN (1) CN113887585A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109712108A (en) * 2018-11-05 2019-05-03 Hangzhou Dianzi University A visual localization method based on a network generating diverse discriminative candidate boxes
CN112308080A (en) * 2020-11-05 2021-02-02 Nanqiang Zhishi (Xiamen) Technology Co., Ltd. Image description prediction method for directional visual understanding and segmentation
CN113362332A (en) * 2021-06-08 2021-09-07 Nanjing University of Information Science and Technology Deep network segmentation method for coronary artery lumen contours in OCT images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAIYONG WANG ET AL.: "Joint Iris Segmentation and Localization Using Deep Multi-task Learning Framework", arXiv, 19 September 2019 (2019-09-19), pages 1-13 *
YIYI ZHOU ET AL.: "A Real-time Global Inference Network for One-stage Referring Expression Comprehension", arXiv, 7 December 2019 (2019-12-07), pages 1-10 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114847963A (en) * 2022-05-06 2022-08-05 Guangdong University of Technology High-precision electrocardiogram feature point detection method
CN116563707A (en) * 2023-05-08 2023-08-08 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Lycium chinense insect pest identification method based on image-text multimodal feature fusion
CN116563707B (en) * 2023-05-08 2024-02-27 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Lycium chinense insect pest identification method based on image-text multimodal feature fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination