CN113887585A - Image-text multi-mode fusion method based on coding and decoding network - Google Patents
- Publication number
- CN113887585A (application CN202111087906.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- training
- data set
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to an image-text multimodal fusion method based on an encoding and decoding network, and belongs to the technical fields of computer vision, natural language processing and pattern recognition. The method comprises the following steps. S1: manually annotate an existing target detection data set to generate text information, construct a new image-text data set, and divide the data set into a training set, a validation set and a test set. S2: select a suitable optimization learning method, set the relevant hyper-parameters, and train the encoding and decoding network model on the training set and validation set. S3: after training, select a picture from the test set, input it into the encoding and decoding network model, load the trained model weights, and finally obtain the corresponding target detection result. The invention adopts an image-text fusion processing method that fuses two different types of data describing the same object, so that the network achieves higher accuracy during training and the required targets are identified.
Description
Technical Field
The invention relates to an image-text multimodal fusion method based on an encoding and decoding network, and belongs to the technical fields of computer vision, natural language processing and pattern recognition.
Background
In recent years, with the rapid development of artificial intelligence technology, a large number of deep-learning-based target detection algorithms have emerged. Target detection aims to find all objects of interest in an image; it comprises the two subtasks of object localization and object classification, determining both the category and the position of each object. At present, deep-learning-based target detection models mainly include YOLO, ResNet, SSD and other convolutional neural network (CNN) based models. Classical deep-learning target detection algorithms usually operate on the image modality alone, so researchers in related fields keep improving the networks to obtain higher accuracy, typically by making the networks deeper; however, continually increasing the number of layers causes problems such as vanishing and exploding gradients. To solve these problems, researchers have proposed many improved network architectures, but such architectures make the network more complex.
Disclosure of Invention
In view of the above problems, and combining the idea of multi-task joint processing, the invention provides an image-text multimodal fusion method based on an encoding and decoding network. The feature matrices obtained from an image and from its corresponding text are fused, so that the text information and the image information complement each other and a more accurate processing result is obtained.
The invention adopts the following technical scheme to solve the above technical problems:
An image-text multimodal fusion method based on an encoding and decoding network comprises the following steps:
S1: manually annotating an existing target detection data set to generate text information, constructing a new image-text data set, and dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2;
S2: selecting a suitable optimization learning method, setting the relevant hyper-parameters, and training the encoding and decoding network model on the training set and validation set from S1;
S3: after training, selecting a picture from the test set, inputting it into the encoding and decoding network model, loading the trained model weights, and finally obtaining the corresponding target detection result.
In step S2, the encoding and decoding network model comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the relevant primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the same size as the input matrix.
There are four encoder blocks and four decoder blocks. Each encoder block contains two convolutional layers with 3x3 kernels and one max-pooling layer with a 2x2 kernel, and each decoder block contains two deconvolution layers with 3x3 kernels and one max-pooling layer with a 2x2 kernel.
The attention layer processes the encoded features in parallel through atrous spatial pyramid pooling (ASPP) and a global average pooling layer.
The atrous spatial pyramid pooling uses atrous (dilated) convolutions with 3x3 kernels.
The suitable optimization learning method in step S2 is a stochastic gradient descent optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum and weight decay coefficient.
The invention has the following beneficial effects:
the invention adopts an image-text fusion processing method, and utilizes two different types of data of the same thing to perform fusion processing, so that the accuracy is higher during network training, and further, a related required target is identified.
Drawings
Fig. 1 is a diagram of a network architecture.
Fig. 2 is a view of an attention module structure.
FIG. 3 is a schematic diagram of the training set, in which (a1), (a2) and (a3) are original images of the image channel; (b1), (b2) and (b3) are image labels; (c1), (c2) and (c3) are the text information corresponding to the images.
FIG. 4 shows segmentation prediction results, wherein (a) is the segmentation prediction result for an aeroplane; (b) is the segmentation prediction result for a motorcycle; (c) is the segmentation prediction result for a person and a horse.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides an image-text multimodal fusion method based on an encoding and decoding network. By processing the image information and the text information, a feature matrix is obtained for each modality. These feature matrices are then fused by the encoding and decoding network; in addition, to focus better on useful feature information, an attention mechanism is added in the middle of the encoding and decoding network, in which spatial pyramid pooling and global average pooling are applied in parallel. Fig. 1 is a block diagram of the network, and Fig. 2 is a schematic diagram of the attention module.
Processing multimodal information first requires processing each modality to obtain a feature matrix. For the image channel, the invention adopts a 3D-ResNet network; the network does not need to classify the images at the end and directly learns the feature matrix and weight ratio of the images. The text channel adopts a long short-term memory (LSTM) network, which learns the contextual information of the text well so that the text content can be understood accurately. This channel is similar to the image channel in that it only produces a feature matrix and a weight ratio, without a classification step.
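As an illustration only, the following PyTorch sketch shows the two modality channels producing feature matrices without a classification head. The vocabulary size, layer dimensions, and the 2D convolutional stand-in for the 3D-ResNet backbone are assumptions, since the patent does not specify them.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """LSTM text channel: token ids -> pooled text feature (no classifier)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                     # (B, T)
        x = self.embed(token_ids)                     # (B, T, E)
        out, _ = self.lstm(x)                         # (B, T, 2*H)
        return out.mean(dim=1)                        # (B, 2*H) pooled text feature

class ImageEncoder(nn.Module):
    """Stand-in for the 3D-ResNet image channel: image -> spatial feature matrix."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, img):                           # (B, 3, H, W)
        return self.backbone(img)                     # (B, C, H/8, W/8)

if __name__ == "__main__":
    txt_feat = TextEncoder()(torch.randint(0, 10000, (2, 12)))
    img_feat = ImageEncoder()(torch.randn(2, 3, 256, 256))
    print(txt_feat.shape, img_feat.shape)             # (2, 512) and (2, 256, 32, 32)
```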
After the feature information of the image and the text has been obtained, cross-modal fusion is required. In the invention, the encoding and decoding network performs the feature fusion directly: the feature matrices of the text and image information are convolutionally encoded to obtain a more accurate feature map (feature matrix), the feature map is then deconvolved, and the final result is obtained through a classifier.
In the encoding and decoding network, the encoder uses 3x3 convolutions, each followed by a ReLU activation function, with 2x2 max pooling applied after every two convolutions. The decoder uses 3x3 convolutions with ReLU activation, each followed by a 2x2 up-sampling deconvolution.
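A minimal sketch of one encoder block and one decoder block following the above description (two 3x3 convolutions with ReLU plus 2x2 max pooling in the encoder; 3x3 convolutions with ReLU followed by a 2x2 up-sampling deconvolution in the decoder); the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions with ReLU, followed by 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(self.conv(x))

class DecoderBlock(nn.Module):
    """3x3 convolutions with ReLU, followed by a 2x2 up-sampling deconvolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        return self.up(self.conv(x))

if __name__ == "__main__":
    x = torch.randn(1, 64, 128, 128)
    y = EncoderBlock(64, 128)(x)       # (1, 128, 64, 64) after pooling
    z = DecoderBlock(128, 64)(y)       # (1, 64, 128, 128) after up-sampling
    print(y.shape, z.shape)
```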
The invention is used as follows. First, an image and a text are input. The image is processed by the 3D-ResNet network, which learns the feature matrix and weight ratio of the image; the text is processed by the long short-term memory network to obtain the feature matrix and weight ratio of the text.
Then, the image features and text features are fused by the pre-trained encoding and decoding network. In the fusion process, the feature matrices of the text and image information are convolutionally encoded to obtain a single, accurate feature map; the feature map is then deconvolved, and the final result is obtained through a classifier.
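The patent does not spell out how the text and image feature matrices are joined before the convolutional encoding; the sketch below assumes channel-wise concatenation after broadcasting the pooled text feature over the spatial grid, which is one plausible reading rather than the definitive fusion scheme.

```python
import torch

def fuse_features(img_feat, txt_feat):
    """Broadcast the pooled text feature over the image feature map and
    concatenate along the channel dimension (an assumed fusion scheme)."""
    b, c, h, w = img_feat.shape
    txt_map = txt_feat.view(b, -1, 1, 1).expand(b, txt_feat.size(1), h, w)
    return torch.cat([img_feat, txt_map], dim=1)      # (B, C_img + C_txt, H, W)

if __name__ == "__main__":
    img_feat = torch.randn(2, 256, 32, 32)            # image-channel feature matrix
    txt_feat = torch.randn(2, 512)                    # text-channel feature vector
    print(fuse_features(img_feat, txt_feat).shape)    # torch.Size([2, 768, 32, 32])
```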
To better learn the features of the fused information, an attention module is added at the center of the encoding and decoding network. It processes the features in parallel through spatial pyramid pooling and global average pooling. The spatial pyramid pooling uses atrous (dilated) convolutions, which enlarge the receptive field of the convolution so that each convolution output covers information from a larger range; a 1x1 convolution then reduces the number of channels to the desired value. In parallel with the pyramid pooling, global average pooling is applied, i.e., all pixel values in each feature map are accumulated and averaged. After the spatial pyramid pooling and global average pooling, the features are processed by a 1x1 convolution to obtain a feature map from which unimportant noise interference has largely been filtered out. Finally, a Sigmoid activation function produces a new feature matrix, which enlarges the receptive field and captures higher-order information.
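A sketch of such an attention module, with an ASPP branch and global average pooling in parallel, a 1x1 projection, and a Sigmoid output; the dilation rates (1, 6, 12, 18) and channel counts are assumptions not given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """ASPP branch (3x3 atrous convolutions) in parallel with global average
    pooling, merged by a 1x1 convolution and a Sigmoid activation."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.aspp = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling branch
        self.gap_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)  # channel reduction

    def forward(self, x):                             # x: fused feature map (B, C, H, W)
        h, w = x.shape[2:]
        branches = [conv(x) for conv in self.aspp]    # atrous pyramid outputs
        g = self.gap_conv(self.gap(x))                # (B, out_ch, 1, 1) global context
        g = F.interpolate(g, size=(h, w), mode="bilinear", align_corners=False)
        fused = self.project(torch.cat(branches + [g], dim=1))
        return torch.sigmoid(fused)                   # new feature matrix

if __name__ == "__main__":
    att = AttentionModule(in_ch=256, out_ch=256)
    print(att(torch.randn(1, 256, 32, 32)).shape)     # torch.Size([1, 256, 32, 32])
```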
In addition, the invention introduces two loss functions to constrain the model, namely the binary cross-entropy function and the Dice coefficient function.
The total loss of the model is expressed as

L = L_B + L_D

wherein L_B is the binary cross-entropy (BCE) loss function and L_D is the Dice coefficient loss function, with the following forms:

L_B = -\frac{1}{output\_size} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]

L_D = 1 - \frac{2 \sum_{i=1}^{n} y_i \hat{y}_i}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i}

wherein x_i is the image in the i-th image-text pair, y_i is the ground-truth text (label) of the i-th image-text pair, \hat{y}_i is the predicted text of the i-th image-text pair, n is the number of image-text samples, and output_size represents the output data size.
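A short sketch of the combined loss L = L_B + L_D in PyTorch; the smoothing constant and the mean reduction over the output size are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1e-6):
    """Dice coefficient loss computed on probabilities in [0, 1]."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def total_loss(logits, target):
    """L = L_B + L_D: binary cross-entropy plus Dice coefficient loss."""
    l_b = F.binary_cross_entropy_with_logits(logits, target)   # mean-reduced BCE
    l_d = dice_loss(torch.sigmoid(logits), target)
    return l_b + l_d

if __name__ == "__main__":
    logits = torch.randn(2, 1, 64, 64)
    target = torch.randint(0, 2, (2, 1, 64, 64)).float()
    print(total_loss(logits, target).item())
```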
The invention adds text information to an existing target detection data set to form a new data set. A total of 1000 different target detection pictures are selected, covering 20 classes: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, tv. The pictures are manually labelled and their text information is manually generated; the text information is a short phrase that mainly describes the relevant content of the picture. The data set is divided into a training set, a validation set and a test set at a ratio of 6:2:2.
The network model is trained on the training set by stochastic gradient descent (SGD), with the hyper-parameters set accordingly, to obtain the weight matrix. The data in the test set are then used to measure the accuracy of the model.
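A sketch of the 6:2:2 split and the SGD training loop, as described above; the concrete hyper-parameter values, the placeholder data set, and the stand-in model are assumptions, since the patent names the hyper-parameters (learning rate, batch size, momentum, weight decay) but not their values.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset, random_split

# Illustrative hyper-parameter values (assumed, not given in the patent).
LR, BATCH, MOMENTUM, WEIGHT_DECAY, EPOCHS = 1e-2, 8, 0.9, 5e-4, 10

# Placeholder data standing in for the 1000-sample image-text data set.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64),
                        torch.randint(0, 2, (1000, 1, 64, 64)).float())
train_set, val_set, test_set = random_split(dataset, [600, 200, 200])  # 6:2:2 split

model = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))  # stand-in for the codec network
optimizer = optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM,
                      weight_decay=WEIGHT_DECAY)
criterion = nn.BCEWithLogitsLoss()
train_loader = DataLoader(train_set, batch_size=BATCH, shuffle=True)

for epoch in range(EPOCHS):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# The saved weights would then be loaded for evaluation on the test set (step S3).
torch.save(model.state_dict(), "codec_weights.pt")
```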
Fig. 3 is a schematic diagram of the training set, showing three groups of data selected from it: (a1), (a2) and (a3) are original images of the image channel; (b1), (b2) and (b3) are image labels; (c1), (c2) and (c3) are the text information corresponding to the images.
Fig. 4 shows the predicted segmentation results, from which it can be seen that, after prediction by the network of the invention, the objects in the images are identified accurately, outlined and labelled with their class names. In (a), the aeroplane is outlined and labelled plane. In (b), the motorcycle is outlined and labelled motorbike. In (c), the person and the horse are both detected, outlined separately and labelled person and horse, which shows that the method is also applicable to multi-target detection and classification.
Claims (6)
1. An image-text multimodal fusion method based on an encoding and decoding (codec) network, characterized by comprising the following steps:
S1: manually annotating an existing target detection data set to generate text information, constructing a new image-text data set, and dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2;
S2: selecting a suitable optimization learning method, setting the relevant hyper-parameters, and training the encoding and decoding network model on the training set and validation set from S1;
S3: after training, selecting a picture from the test set, inputting it into the encoding and decoding network model, loading the trained model weights, and finally obtaining the corresponding target detection result.
2. The codec network-based image-text multimodal fusion method according to claim 1, wherein the codec network model in step S2 comprises:
an encoder, which reduces the scale of the input image feature matrix;
an attention layer, which extracts the relevant primary information from the encoded feature matrix and suppresses secondary, interfering information;
a decoder, which expands the feature matrix output by the attention layer back to the same size as the input matrix.
3. The codec network-based image-text multimodal fusion method according to claim 2, wherein there are four encoder blocks and four decoder blocks, each encoder block comprises two convolutional layers with 3x3 kernels and one max-pooling layer with a 2x2 kernel, and each decoder block comprises two deconvolution layers with 3x3 kernels and one max-pooling layer with a 2x2 kernel.
4. The codec network-based image-text multimodal fusion method according to claim 2, wherein the attention layer processes the features in parallel through atrous spatial pyramid pooling and a global average pooling layer.
5. The codec network-based image-text multimodal fusion method according to claim 4, wherein the atrous spatial pyramid pooling uses atrous convolutions with 3x3 kernels.
6. The codec network-based image-text multimodal fusion method according to claim 1, wherein the suitable optimization learning method in step S2 is a stochastic gradient descent optimizer, and the relevant hyper-parameters are the learning rate, batch size, momentum and weight decay coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111087906.8A CN113887585A (en) | 2021-09-16 | 2021-09-16 | Image-text multi-mode fusion method based on coding and decoding network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111087906.8A CN113887585A (en) | 2021-09-16 | 2021-09-16 | Image-text multi-mode fusion method based on coding and decoding network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113887585A true CN113887585A (en) | 2022-01-04 |
Family
ID=79009294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111087906.8A Pending CN113887585A (en) | 2021-09-16 | 2021-09-16 | Image-text multi-mode fusion method based on coding and decoding network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113887585A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109712108A (en) * | 2018-11-05 | 2019-05-03 | 杭州电子科技大学 | It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network |
CN112308080A (en) * | 2020-11-05 | 2021-02-02 | 南强智视(厦门)科技有限公司 | Image description prediction method for directional visual understanding and segmentation |
CN113362332A (en) * | 2021-06-08 | 2021-09-07 | 南京信息工程大学 | Depth network segmentation method for coronary artery lumen contour under OCT image |
Non-Patent Citations (2)
Title |
---|
CAIYONG WANG ET AL.: "Joint Iris Segmentation and Localization Using Deep Multi-task Learning Framework", arXiv, 19 September 2019 (2019-09-19), pages 1-13 *
YIYI ZHOU ET AL.: "A Real-time Global Inference Network for One-stage Referring Expression Comprehension", arXiv, 7 December 2019 (2019-12-07), pages 1-10 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114847963A (en) * | 2022-05-06 | 2022-08-05 | 广东工业大学 | High-precision electrocardiogram characteristic point detection method |
CN116563707A (en) * | 2023-05-08 | 2023-08-08 | 中国农业科学院农业信息研究所 | Lycium chinense insect pest identification method based on image-text multi-mode feature fusion |
CN116563707B (en) * | 2023-05-08 | 2024-02-27 | 中国农业科学院农业信息研究所 | Lycium chinense insect pest identification method based on image-text multi-mode feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |