CN113516133A - Multi-modal image classification method and system

Multi-modal image classification method and system

Info

Publication number
CN113516133A
Authority
CN
China
Prior art keywords
mode
network model
fusion
modal
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110355430.5A
Other languages
Chinese (zh)
Other versions
CN113516133B (en)
Inventor
王勇
袁狄
何小宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202110355430.5A
Publication of CN113516133A
Application granted
Publication of CN113516133B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal image classification method and system. A plurality of single-modal network feature extraction modules are established, each comprising multiple cascaded residual modules; the last-order residual module is connected in sequence to a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel, and all single-modal network channels together form the single-modal network model. The residual modules of each order in the single-modal network model are fused by a cooperative attention mechanism to obtain a fused-modal network model, and the single-modal network model and the fused-modal network model are then fused to obtain the classification model. The invention improves classification accuracy.

Description

Multi-modal image classification method and system
Technical Field
The invention relates to the field of image processing, in particular to a multi-modal image classification method and system.
Background
Image classification based on deep learning is now widely applied: a deep learning model processes a captured image to determine the category of the object it contains. Existing image classification methods operate on a single-modality image, but a single modality cannot fully cover the characteristics of the target object, which limits classification accuracy.
Disclosure of Invention
The invention addresses the deficiencies of the prior art by providing a multi-modal image classification method and system that improve classification accuracy.
To this end, the invention adopts the following technical scheme: a multi-modal image classification method comprising the steps of:
S1, establishing a plurality of single-modal network feature extraction modules, wherein each single-modal network feature extraction module comprises multiple cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel; all the single-modal network channels together form the single-modal network model;
S2, fusing the residual modules of each order in the single-modal network model by means of a cooperative attention mechanism to obtain a fused-modal network model;
S3, fusing the single-modal network model and the fused-modal network model to obtain a classification model.
The invention provides a multi-level fusion learning network structure that processes each single-modal feature through deep residual modules and propagates semantic information layer by layer. As the network structure deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-modal features supervise and fuse one another, continuously enriching and refining the single-modal feature information and improving the spatial resolution of the feature map, thereby improving image classification accuracy.
In step S1, the fully connected layer and classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules form the single-modal network channel. This structure improves the expressive power of the network through sufficient depth and layer-by-layer feature learning: the cascaded residual modules extract higher-level image features, and from stage 1 to stage n the learned feature representations grow more complex as the network deepens, so the output feature maps carry richer semantic information.
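As an illustration, a minimal sketch of one such single-modal channel follows, assuming PyTorch and torchvision (the patent names no framework): the ResNet50 backbone keeps its cascaded residual stages, its original classifier head is dropped, and the pooling, fully connected and softmax layers described above are re-attached.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class SingleModalChannel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep the stem and the four cascaded residual stages; drop the
        # original average-pooling and fully connected classifier head.
        self.stages = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stages(x)                 # [B, 2048, H/32, W/32]
        logits = self.fc(self.pool(feat).flatten(1))
        return torch.softmax(logits, dim=1)   # per-class probabilities

Two such channels run in parallel, one per modality, and together they form the single-modal network model.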
The specific implementation of step S2 is as follows. For the 1st-order residual modules of all single-modal channels in the single-modal network model, a convolution operation is applied in turn to each extracted single-modal image feature, and each convolution result is passed through a Softmax activation function to obtain a spatial attention vector. Each spatial attention vector is multiplied by the corresponding single-modal image feature to obtain the 1st-order fusion feature; a convolution operation is applied to this fusion feature and the result is passed through a Sigmoid activation function to obtain the 1st-order fused-modal network model. For the nth-order residual modules (n greater than 1), the same convolution and Softmax operations yield spatial attention vectors, which are multiplied by the corresponding nth-order single-modal image features to obtain the nth-order fusion feature. The nth-order fusion feature is concatenated with the (n-1)th-order fusion feature, a convolution operation is applied to the concatenated feature, and the result is passed through a Sigmoid activation function to obtain the nth-order fused-modal network model. The final-order fused-modal network model is the fused-modal network model. This model efficiently fuses the crossed and complementary information among the single modalities at each stage, reduces the computational load of the network through information compression, emphasizes key image information through the attention mechanism, and suppresses noise irrelevant to image classification.
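The following sketch shows one fusion order under this scheme, assuming PyTorch and two modalities. Normalizing the attention logits across the two modalities at each spatial position is one plausible reading of the Softmax step (consistent with Eq. (6) in the embodiment below); kernel sizes, channel counts, and the resizing of the previous order's fusion are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderFusion(nn.Module):
    def __init__(self, in_ch: int, prev_ch: int = 0):
        super().__init__()
        self.attn_a = nn.Conv2d(in_ch, 1, kernel_size=1)  # modality A logits
        self.attn_b = nn.Conv2d(in_ch, 1, kernel_size=1)  # modality B logits
        self.mix = nn.Conv2d(in_ch + prev_ch, in_ch, kernel_size=3, padding=1)

    def forward(self, feat_a, feat_b, prev_fused=None):
        # Softmax across the two modalities at every spatial position gives
        # the spatial attention weights multiplied back onto each modality.
        w = torch.softmax(torch.cat([self.attn_a(feat_a),
                                     self.attn_b(feat_b)], dim=1), dim=1)
        fused = w[:, 0:1] * feat_a + w[:, 1:2] * feat_b   # order-n fusion
        if prev_fused is not None:
            # Resize the previous order's fusion to the current scale before
            # concatenation (an assumption; feature maps shrink per stage).
            prev_fused = F.adaptive_avg_pool2d(prev_fused, fused.shape[-2:])
            fused = torch.cat([fused, prev_fused], dim=1)
        return torch.sigmoid(self.mix(fused))

The first order would be built as OrderFusion(256) and called without prev_fused; later orders (e.g. OrderFusion(512, prev_ch=256)) concatenate the previous order's fusion before the convolution and Sigmoid.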
The specific implementation of step S3 is as follows. Difference scores between the output of the single-modal network model, respectively the output of the fused-modal network model, and the actual sample are computed through cosine-distance similarity; the scores are converted into a consistency weight distribution by a Tanh activation function and then normalized by a SoftMax function, giving the consistency weight parameters of the different models in the fusion process. Under these weight parameters, a first difference between the single-modal image features output by the single-modal network model and the actual label category, and a second difference between the fused-modal feature output of the fused-modal network model and the actual label category, are computed; the sum of their squares is taken as the loss function of network training. The loss function is minimized by an adaptive gradient descent method to update the network model parameters, yielding the trained network model, i.e. the classification model. This step integrates the output information of the single-modal and fused-modal network models, and the consistency weights dynamically adjust the importance of each model so as to capture better global information and improve the performance and generalization of the model on image classification tasks.
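A minimal sketch of this consistency-weighted loss, assuming PyTorch and one-hot targets; Adam stands in for the unspecified adaptive gradient descent method.

import torch
import torch.nn.functional as F

def consistency_loss(v_out, m_out, target_onehot):
    # Difference scores via cosine distance (1 - cosine similarity).
    d_v = 1.0 - F.cosine_similarity(v_out, target_onehot, dim=1).mean()
    d_m = 1.0 - F.cosine_similarity(m_out, target_onehot, dim=1).mean()
    # Tanh, then SoftMax normalization -> consistency weight parameters.
    lam = torch.softmax(torch.tanh(torch.stack([d_v, d_m])), dim=0)
    # Weighted sum of squared differences against the actual label category.
    return (lam[0] * (v_out - target_onehot).pow(2).sum(dim=1).mean()
            + lam[1] * (m_out - target_onehot).pow(2).sum(dim=1).mean())

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)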
In the invention, to simplify the model and improve classification efficiency, the number of single-modal network channels is 2.
The invention also provides a multi-modal image classification system comprising a computer device; the computer device is configured or programmed to perform the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
1. Addressing the limitations of shallow image features, the invention provides a multi-level fusion learning network structure based on convolutional neural networks. The framework processes each single-modal feature through deep residual modules and propagates semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-modal features supervise and fuse one another, continuously enriching and refining the single-modal feature information so as to improve the spatial resolution of the feature map.
2. On top of the multi-modal fusion, a fusion learning framework based on a cooperative attention mechanism is provided. The framework is independent of any specific single-modality network and can readily be embedded into mainstream backbone networks, which helps preserve the unique, exclusive characteristics of each modality. In addition, the framework maintains the similarity structures between and within modalities while accounting for modality cooperation and feature fusion, maximizes the consistency of the representations of the different modalities, and allows data to be propagated and shared across modalities, showing excellent performance.
Drawings
FIG. 1 is a block diagram of the single-modal network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of the fused-modal network model according to an embodiment of the present invention;
FIG. 3 is a block diagram of the classification model according to an embodiment of the present invention.
Detailed Description
A multi-modal image classification method based on multi-level fusion learning and a cooperative attention mechanism comprises the following steps.
Taking the skin disease classification task as an example, classification and diagnosis of skin diseases are performed using information from two modalities: clinical images and dermoscopic images.
Step 1: a multi-level feature extraction network is used to obtain hierarchical single-modal features of the clinical image and of the dermoscopic image.
First, the two input single-modal images are encoded by feature extraction modules based on deep convolutional neural networks to generate deep image features. Specifically, two single-modal network feature extraction modules are established to extract clinical image features and dermoscopic image features respectively; each consists of a ResNet50 network with its fully connected layer and classification layer removed, and the two modules run in parallel. Each input single-modal image undergoes convolution and batch normalization, nonlinear mapping with the ReLU function (max(0, x)), and further compression encoding through a pooling layer to obtain preliminary image features, after which multi-level single-modal image features are extracted by a series of deep residual modules. The last deep residual block is sequentially connected to a pooling layer, a fully connected layer and a softmax layer for feature dimensionality reduction and compression, yielding a single-modal network model with parallel inputs.
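For illustration, a hedged sketch of this stage-wise extraction, assuming torchvision's ResNet50 layout (conv1/bn1/relu/maxpool stem followed by stages layer1-layer4); the per-stage outputs are what the later fusion steps consume order by order.

import torch
from torchvision.models import resnet50

def stagewise_features(backbone, x: torch.Tensor):
    # Stem: convolution, batch normalization, ReLU and pooling give the
    # preliminary (compression-encoded) image features.
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    feats = []
    for stage in (backbone.layer1, backbone.layer2,
                  backbone.layer3, backbone.layer4):
        x = stage(x)        # one order of cascaded deep residual modules
        feats.append(x)     # keep the multi-level single-modal features
    return feats            # channel widths: 256, 512, 1024, 2048

backbone = resnet50(weights=None)
levels = stagewise_features(backbone, torch.randn(1, 3, 224, 224))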
Step 2: a cooperative attention mechanism is used to obtain supervised fusion features.
For the single-modal image features extracted by the deep residual modules in Step 1, multi-modal feature fusion is realized with a cooperative attention mechanism. First, the extracted single-modal image features are concatenated, a spatial attention vector is computed by a scale perception module consisting of a convolution operation and a Softmax activation function, and the spatial attention vector is multiplied, in the concatenated spatial order, by each input single-modal feature, fusing the multi-modal features of the current stage. Then the fusion features extracted at the current deep residual module are concatenated with the fusion features of the previous deep residual module; an adaptive calibration module computes the fused channel feature map, a channel attention vector is generated through convolution and a Sigmoid activation function, and this vector is multiplied channel-wise with the fusion feature matrix, fusing the multi-modal features of the current stage with those of the previous stage. Carried out across all stages, this realizes the fusion learning of the single-modal features and yields the fused-modal network model.
Step 3: multi-modal skin disease image classification is realized by combining the multi-level feature extractor and the cooperative attention mechanism.
Based on the parallel single-modal network model of Step 1 and the fused-modal network model of Step 2, the clinical image and the dermoscopic image are first input into the parallel single-modal network model; the output single-modal image features are extracted by the single-modal network model of Step 1, and the output fused-modal image features are extracted by the fused-modal network model of Step 2. Difference scores between the model outputs and the actual sample are then computed through cosine-distance similarity and processed with Tanh and SoftMax functions to obtain the consistency weight parameters of the different models. Under these weight parameters, the sums of the squared differences between the single-modal image features, respectively the fused-modal feature outputs, and the actual label category are computed as the training loss, which is minimized by an adaptive gradient descent method to update the network model parameters and obtain the trained network model. In actual use, one only needs to input the several single-modal data to obtain the classification result output by the model.
Step 1 specifically comprises the following process:
First, the single-modal models are trained with the input single-modal images: each image is compression-encoded by a feature extractor based on a convolutional neural network to generate deep features. In the training stage, the input feature map of each layer is a three-dimensional array [h, w, d], where h and w are the feature map dimensions and d is the number of channels. The feature maps of adjacent layers are connected by a receptive field of size (L, H). For the convolution operation, let x_{ij} be the pixel value at position (i, j) of the previous layer and y_{ij} the pixel value at the corresponding position of the next layer; then

y_{ij} = Σ_{u=1}^{L} Σ_{v=1}^{H} w_{uv} x_{i+u-1, j+v-1} + b   (1)

where w_{uv} are the convolution kernel weights, b is the shared bias, and L = H = 3. After the convolution, nonlinear mapping is performed by a batch normalization layer followed by a ReLU function.
For the pooling operation,

y_{ij} = (1 / (L·H)) Σ_{u=1}^{L} Σ_{v=1}^{H} x_{i+u-1, j+v-1}   (2)

To reduce the feature loss caused by the pooling layer, the compression-encoding stage adopts average pooling rather than maximum pooling to obtain the initial single-modal image features. Multi-level image features are then extracted through a series of deep residual modules, where a residual block is defined as:
y = F(x, {W_i}) + x   (3)

where x and y denote the input and output of the current residual block, respectively, and F(x, {W_i}) is the residual mapping to be learned. When the dimensions of x and F differ, a linear projection W_s is applied to match them, as follows:

y = F(x, {W_i}) + W_s x   (4)
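A short sketch of the residual block of Eqs. (3) and (4), assuming PyTorch: the identity shortcut is kept when the dimensions match, and replaced by a 1x1 projection W_s otherwise.

import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(              # F(x, {W_i})
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.project = None                         # W_s, only if needed
        if stride != 1 or in_ch != out_ch:
            self.project = nn.Conv2d(in_ch, out_ch, 1, stride, bias=False)

    def forward(self, x):
        shortcut = x if self.project is None else self.project(x)
        return F.relu(self.residual(x) + shortcut)  # y = F(x) + (W_s)x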
As the network structure deepens, the scale of the feature map gradually shrinks; extracting multi-level single-modal image features through the combination of deep residual modules integrates global and local detail information and provides richer semantic information. The extracted multi-level single-modal image features then undergo global average pooling and a fully connected layer, after which a softmax mapping yields the output of the single-modal network model:

S_i = e^{z_i} / Σ_j e^{z_j}   (5)

where z_i denotes the ith element of the output matrix and S_i is the probability of mapping to class i.
The cooperative attention mechanism of Step 2 is implemented as follows:
First, the multi-level single-modal image features A and B extracted by the current deep residual module are concatenated along the spatial dimension to obtain a preliminary fusion feature. A spatial attention mechanism based on the scale perception module, consisting of a convolution operation and a Softmax activation function, then dynamically selects suitably proportioned features from the preliminary fusion result and fuses them through self-learning, as follows:

U_i = (e^{A_i} / (e^{A_i} + e^{B_i})) · A_i + (e^{B_i} / (e^{A_i} + e^{B_i})) · B_i   (6)

where A_i and B_i denote the ith feature values of the respective single-modal image features. The Softmax weights are multiplied element-wise (dot product) with the original single-modal features to obtain the scale-aware (effective) features, which are finally summed into the supervised fusion feature map, realizing the fusion of the single-modal features at the current stage.
Then the fusion features of the current stage and of the previous stage are concatenated along the channel dimension, and a channel attention mechanism based on the adaptive calibration module generates one-dimensional excitation weights from the concatenation result to activate each channel, strengthening attention in the channel domain. The mechanism has three parts: a squeeze function, an excitation function, and a scale function. The squeeze function is

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)   (7)

where u_c is the cth channel of the H × W concatenated feature map; this sums and averages the feature values in each channel, realizing global average pooling. The excitation function is:
s = σ(g(z, W)) = σ(W_2 δ(W_1 z))   (8)
where σ denotes the Sigmoid activation function, δ denotes the ReLU function, and W_1 and W_2 have dimensions (C/r) × C and C × (C/r) respectively, with C the number of channels and r a scaling parameter; the adaptive calibration module thereby computes a one-dimensional channel attention vector. The scale function is:
x̃_c = s_c · u_c   (9)
This is essentially a rescaling process: each channel u_c is multiplied by its channel attention weight s_c, strengthening attention on the key channel domains and realizing the fusion of the current-stage and previous-stage multi-modal features.
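A compact sketch of the adaptive calibration module of Eqs. (7)-(9), assuming PyTorch; this is the standard squeeze-excitation-scale pattern the text describes, with r the scaling parameter.

import torch
import torch.nn as nn

class ChannelCalibration(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // r)  # W_1: (C/r) x C
        self.w2 = nn.Linear(channels // r, channels)  # W_2: C x (C/r)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = u.mean(dim=(2, 3))                              # squeeze, Eq. (7)
        s = torch.sigmoid(self.w2(torch.relu(self.w1(z))))  # excite, Eq. (8)
        return u * s[:, :, None, None]                      # scale, Eq. (9)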
Step 3 comprises the following process:
Based on the single-modal network model and the fused-modal network model obtained in Steps 1 and 2, the final output features of the two models are flattened and concatenated so that the characteristics of both models are taken into account, and consistency weight parameters are designed to maintain semantic consistency between the modalities. A difference score between each model output and the actual sample is first computed through cosine-distance similarity, converted into a consistency weight distribution with a Tanh activation function, and then normalized with a SoftMax function, giving the consistency weight parameters of the different models in the fusion process.
During the training process, the specific loss function is as follows:
L(v, m, S) = λ_v L(v, S) + λ_m L(m, S)   (10)
where L(v, S) and L(m, S) denote the consistency losses between the output features of the single-modal model v, respectively the fused-modal model m, and the actual sample information S, computed as follows:
L(v, S) = Σ_i (v_i - S_i)^2   (11)

L(m, S) = Σ_i (m_i - S_i)^2   (12)
λ_v and λ_m denote the consistency loss weight parameters of the corresponding modalities. The resulting loss function is therefore:
L(v, m, S) = λ_v Σ_i (v_i - S_i)^2 + λ_m Σ_i (m_i - S_i)^2   (13)
In the training phase, the network parameters are updated by minimizing this loss with an adaptive gradient descent method. In the prediction phase, multi-modal images of the same target to be classified are input into the trained network, the single-modal and fused-modal features extracted by the network are weighted, and classification prediction is performed, completing the assisted classification of the multi-modal images.
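Tying the pieces together, a hedged sketch of one training step under the embodiment's two-modality setup; clinical_net, dermoscopy_net, fusion_net and consistency_loss refer to the hypothetical sketches above, and averaging the two single-modal outputs into a single prediction v is an illustrative assumption.

import torch
import torch.nn.functional as F

def train_step(clinical_net, dermoscopy_net, fusion_net, optimizer, batch):
    clin, derm, labels = batch                 # two modalities plus labels
    target = F.one_hot(labels, num_classes=5).float()
    v_out = 0.5 * (clinical_net(clin) + dermoscopy_net(derm))  # single-modal
    m_out = fusion_net(clin, derm)             # fused-modal prediction
    loss = consistency_loss(v_out, m_out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()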
The embodiment of the invention also provides a multi-modal image classification system comprising a computer device; the computer device is configured or programmed to perform the steps of the method of the above embodiment.
In the invention, the computer device may be a microprocessor, a host computer, or similar equipment.

Claims (6)

1. A method of multi-modal image classification, comprising the steps of:
S1, establishing a plurality of single-modal network feature extraction modules, wherein each single-modal network feature extraction module comprises multiple cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel; all the single-modal network channels together form the single-modal network model;
S2, fusing the residual modules of each order in the single-modal network model by means of a cooperative attention mechanism to obtain a fused-modal network model;
S3, fusing the single-modal network model and the fused-modal network model to obtain a classification model.
2. The multi-modal image classification method according to claim 1, wherein in step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules constitute the single-modal network channel.
3. The multi-modal image classification method according to claim 1, wherein step S2 is implemented as follows:
applying a convolution operation in turn to the single-modal image features extracted by the 1st-order residual modules of all single-modal channels in the single-modal network model, and inputting each convolution result into a Softmax activation function to obtain a spatial attention vector; multiplying each spatial attention vector by the single-modal image features extracted by the corresponding 1st-order residual module to obtain the 1st-order fusion feature; and applying a convolution operation to the fusion feature and inputting the result into a Sigmoid activation function to obtain the 1st-order fused-modal network model;
for the nth-order residual modules, applying a convolution operation in turn to the single-modal image features extracted by each nth-order residual module of all single-modal channels in the single-modal network model, and inputting each convolution result into a Softmax activation function to obtain a spatial attention vector; multiplying each spatial attention vector by the single-modal image features extracted by the corresponding nth-order residual module to obtain the nth-order fusion feature; and concatenating the nth-order fusion feature with the (n-1)th-order fusion feature, applying a convolution operation to the concatenated fusion feature, and inputting the convolution result into a Sigmoid activation function to obtain the nth-order fused-modal network model, wherein n is greater than 1; the final-order fused-modal network model being the fused-modal network model.
4. The multi-modal image classification method according to any one of claims 1 to 3, wherein step S3 is implemented as follows: computing, through cosine-distance similarity, the difference scores between the output of the single-modal network model, respectively the output of the fused-modal network model, and the actual sample; converting the difference scores into a consistency weight distribution with a Tanh activation function and then normalizing the distribution with a SoftMax function, so as to obtain the consistency weight parameters of the different models in the fusion process; computing, according to the consistency weight parameters, a first difference between the single-modal image features output by the single-modal network model and the actual label category and a second difference between the fused-modal feature output of the fused-modal network model and the actual label category; taking the sum of the squares of the first difference and the second difference as the loss function of network training; and minimizing the loss function by an adaptive gradient descent method to update the network model parameters, thereby obtaining the trained network model as the classification model.
5. The multi-modal image classification method according to any one of claims 1 to 3, wherein the number of the single-modal network channels is 2.
6. A multi-modal image classification system comprising a computer device, wherein the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.
CN202110355430.5A 2021-04-01 2021-04-01 Multi-modal image classification method and system Active CN113516133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Publications (2)

Publication Number Publication Date
CN113516133A true CN113516133A (en) 2021-10-19
CN113516133B CN113516133B (en) 2022-06-17

Family

ID=78062230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355430.5A Active CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Country Status (1)

Country Link
CN (1) CN113516133B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332592A (en) * 2022-03-11 2022-04-12 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994A (en) * 2022-05-18 2022-06-17 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217A (en) * 2022-12-02 2022-12-30 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data
WO2023098636A1 (en) * 2021-11-30 2023-06-08 Huawei Technologies Co., Ltd. Method, device, and medium for adaptive inference in compressed video domain

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
US20200364870A1 (en) * 2019-05-14 2020-11-19 University-Industry Cooperation Group Of Kyung Hee University Image segmentation method and apparatus, and computer program thereof

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
US20200089755A1 (en) * 2017-05-19 2020-03-19 Google Llc Multi-task multi-modal machine learning system
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
US20200364870A1 (en) * 2019-05-14 2020-11-19 University-Industry Cooperation Group Of Kyung Hee University Image segmentation method and apparatus, and computer program thereof
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
X. He, Y. Deng, L. Fang and Q. Peng: "Multi-Modal Retinal Image Classification With Modality-Specific Attention Network", IEEE *
Chen Sijia: "Research on Multi-modal Data Modeling and Retrieval for Common Space Learning", China Master's Theses Full-text Database (Basic Sciences) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023098636A1 (en) * 2021-11-30 2023-06-08 Huawei Technologies Co., Ltd. Method, device, and medium for adaptive inference in compressed video domain
CN114332592A (en) * 2022-03-11 2022-04-12 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994A (en) * 2022-05-18 2022-06-17 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217A (en) * 2022-12-02 2022-12-30 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data

Also Published As

Publication number Publication date
CN113516133B (en) 2022-06-17

Similar Documents

Publication Publication Date Title
CN113516133B (en) Multi-modal image classification method and system
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN113221969A (en) Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN111507521A (en) Method and device for predicting power load of transformer area
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN111832637B (en) Distributed deep learning classification method based on alternating direction multiplier method ADMM
CN114418030A (en) Image classification method, and training method and device of image classification model
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114581502A (en) Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
CN114821050A (en) Named image segmentation method based on transformer
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
CN116977631A (en) Streetscape semantic segmentation method based on DeepLabV3+
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN112561050A (en) Neural network model training method and device
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
Ren The advance of generative model and variational autoencoder
CN112990041B (en) Remote sensing image building extraction method based on improved U-net
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant