CN113516133B - Multi-modal image classification method and system - Google Patents

Multi-modal image classification method and system

Info

Publication number
CN113516133B
CN113516133B (Application No. CN202110355430.5A)
Authority
CN
China
Prior art keywords
mode
network model
fusion
order
mode network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110355430.5A
Other languages
Chinese (zh)
Other versions
CN113516133A (en)
Inventor
王勇
袁狄
何小宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110355430.5A priority Critical patent/CN113516133B/en
Publication of CN113516133A publication Critical patent/CN113516133A/en
Application granted granted Critical
Publication of CN113516133B publication Critical patent/CN113516133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/047 - Probabilistic or stochastic networks
    • G06N3/048 - Activation functions
    • G06N3/08 - Learning methods

Abstract

The invention discloses a multi-modal image classification method and system. A plurality of single-mode network feature extraction modules are established, each comprising multiple cascaded residual modules; the last-order residual module is connected in sequence to a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel, and all single-mode network channels together form the single-mode network model. The residual modules of each order in the single-mode network model are fused by a cooperative attention mechanism to obtain a fusion-mode network model. Finally, the single-mode network model and the fusion-mode network model are fused to obtain the classification model. The invention can improve classification accuracy.

Description

Multi-modal image classification method and system
Technical Field
The invention relates to the field of image processing, in particular to a multi-modal image classification method and system.
Background
Currently, image classification based on deep learning is widely applied: a deep learning model processes a captured image to determine the category of the object it contains. Existing image classification methods process a single-mode image, but a single-mode image cannot fully cover the characteristics of the target object, which limits classification accuracy.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art by providing a multi-modal image classification method and system that improve classification accuracy.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a method of multi-modal image classification comprising the steps of:
S1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises multi-order cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
S2, fusing the residual modules of each order in the single-mode network model by utilizing a cooperative attention mechanism to obtain a fusion-mode network model;
S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
The invention provides a multi-level fusion learning network structure that processes each single-mode feature with deep residual modules and transmits semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-mode features supervise and fuse each other, continuously enriching and refining the single-mode feature information and improving the spatial resolution of the feature map, thereby improving image classification accuracy.
In step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules form the single-mode network channel. This structure improves the expressive power of the network through sufficient depth and layer-by-layer feature learning; the cascaded residual modules extract higher-level image features, and from stage 1 to stage n the learned feature representation becomes more complex as the network deepens, so the semantic information contained in the output feature map becomes richer.
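By way of illustration, one single-mode network channel could be sketched in PyTorch roughly as follows. The class and variable names (e.g. SingleModalChannel) are placeholders rather than identifiers from the patent, a recent torchvision is assumed, and the stem follows the later note that average pooling replaces max pooling in the compression-encoding stage.

```python
# Illustrative sketch of one single-mode network channel built from a ResNet50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SingleModalChannel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)  # fully connected / classification head is not reused
        # Stem: convolution, batch norm, ReLU; average pooling replaces max pooling
        # in the compression-encoding stage, as described in the specification.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  nn.AvgPool2d(kernel_size=3, stride=2, padding=1))
        # Four cascaded residual stages (the multi-order cascaded residual modules).
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # Last-order residual module followed by pooling, a fully connected layer and softmax.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.stem(x)
        stage_features = []           # per-stage features, later consumed by the fusion branch
        for stage in self.stages:
            x = stage(x)
            stage_features.append(x)
        logits = self.fc(self.pool(x).flatten(1))
        return torch.softmax(logits, dim=1), stage_features
```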
The specific implementation process of step S2 includes: for the 1st order, the single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model. For the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, where n is greater than 1. The final-order fusion-mode network model is the fusion-mode network model. This model efficiently fuses the cross and complementary information among the single modes at each stage, reduces the computational load of the network model through information compression, emphasizes key image information through the attention mechanism, and suppresses noise information irrelevant to image classification.
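A minimal sketch of one order of the fusion branch is given below. Since the original equation images are not reproduced, the choice of sigmoid for the spatial-attention activation, of 1x1 convolutions, and of channel concatenation for combining modalities are all assumptions; the class name FusionStage is a placeholder.

```python
# Illustrative sketch of one order of the fusion-mode branch (two single-mode channels).
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    def __init__(self, in_channels: int, fused_channels: int, prev_channels: int = 0):
        super().__init__()
        # One 1x1 convolution per modality to produce a spatial attention map (assumed form).
        self.attn_conv_a = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.attn_conv_b = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.out_conv = nn.Conv2d(2 * in_channels + prev_channels, fused_channels, kernel_size=1)

    def forward(self, feat_a, feat_b, prev_fused=None):
        # Spatial attention vectors for each single-mode feature.
        attn_a = torch.sigmoid(self.attn_conv_a(feat_a))
        attn_b = torch.sigmoid(self.attn_conv_b(feat_b))
        fused = torch.cat([attn_a * feat_a, attn_b * feat_b], dim=1)
        if prev_fused is not None:
            # Splice the current-order fusion features with the (n-1)th-order fusion features.
            prev_fused = nn.functional.interpolate(prev_fused, size=fused.shape[-2:])
            fused = torch.cat([fused, prev_fused], dim=1)
        # Convolution followed by Sigmoid yields the fusion-mode output of this order.
        return torch.sigmoid(self.out_conv(fused))
```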
The specific implementation process of step S3 includes: the difference scores between the output of the single-mode network model, the output of the fusion-mode network model, and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, which is then normalized to obtain the consistency weight parameters of the different models in the fusion process. According to the consistency weight parameters, a first difference between the single-mode image features output by the single-mode network model and the actual label category, and a second difference between the fusion-mode feature output of the fusion-mode network model and the actual label category, are computed; the sum of squares of the first and second differences is used as the loss function for network training, the loss function is minimized by an adaptive gradient descent method, and the network model parameters are updated to obtain the trained network model, i.e., the classification model. This step integrates the output information of the single-mode network model and the fusion-mode network model and dynamically adjusts the importance of each model through the consistency weights, thereby obtaining better global information and improving the performance and generalization ability of the model on image classification tasks.
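The consistency-weighted loss could be sketched as below. The unspecified activation and normalization functions (their formula images are not reproduced) are assumed here to be tanh and softmax, and the function and variable names are illustrative only.

```python
# Sketch of the consistency-weighted training loss described in step S3 (assumptions noted above).
import torch
import torch.nn.functional as F

def consistency_weighted_loss(out_single, out_fused, target_onehot):
    # Difference scores between each model output and the actual sample, via cosine similarity.
    score_single = 1.0 - F.cosine_similarity(out_single, target_onehot, dim=1).mean()
    score_fused = 1.0 - F.cosine_similarity(out_fused, target_onehot, dim=1).mean()
    scores = torch.stack([score_single, score_fused])
    # Convert the scores into a consistency weight distribution, then normalize it.
    weights = torch.softmax(torch.tanh(scores), dim=0)
    # Weighted sum of squared differences between each output and the actual label category.
    loss_single = weights[0] * ((out_single - target_onehot) ** 2).sum(dim=1).mean()
    loss_fused = weights[1] * ((out_fused - target_onehot) ** 2).sum(dim=1).mean()
    return loss_single + loss_fused
```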
In the invention, in order to simplify the model and improve the classification efficiency, the number of the single-mode network channels is 2.
The invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
1. Aiming at the limitation of shallow image features, the invention provides a multi-level fusion learning network structure based on a convolutional neural network. The framework processes each single-mode feature with deep residual modules and transmits semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-mode features supervise and fuse each other, continuously enriching and refining the single-mode feature information and improving the spatial resolution of the feature map.
2. A fusion learning framework based on a cooperative attention mechanism is built on top of the multi-mode fusion. The framework is independent of the specific single-mode networks and can be readily embedded into mainstream backbone networks, which helps preserve the unique characteristics and exclusivity of each single mode. In addition, the framework maintains the similarity structure between and within modalities while accounting for modality cooperation and feature fusion, maximizes the consistency of the representations of the different modalities, and allows data to be transmitted and shared among modalities, showing excellent performance.
Drawings
FIG. 1 is a block diagram of a monomodal network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of a converged modal network model architecture according to an embodiment of the present invention;
FIG. 3 is a block diagram of a classification model according to an embodiment of the present invention.
Detailed Description
A multi-modal image classification method based on multi-level fusion learning and cooperative attention mechanism comprises the following steps:
Taking the skin disease classification task as an example, classification diagnosis of skin diseases is performed using information from two modalities: a clinical image and a dermoscopy image.
Step one: a multi-level feature extraction network is used to obtain hierarchical clinical-image single-mode features and dermoscopy-image single-mode features.
First, the two input single-mode images are encoded by feature extraction modules based on deep convolutional neural networks to generate deep image features. Specifically, two single-mode network feature extraction modules are established, used respectively to extract clinical image features and dermoscopy image features. Each feature extraction module consists of a ResNet50 network with its fully connected layer and classification layer removed, and the two modules operate in parallel. Each input single-mode image first undergoes convolution and batch normalization, followed by a nonlinear mapping through an activation function (a ReLU function, per equation (1) below); the single-mode image information is then further compressed and encoded by a pooling layer to obtain the initial image features. Multi-level single-mode image features are extracted through a series of cascaded deep residual modules, and the last deep residual block is connected in sequence to a pooling layer, a fully connected layer and a softmax layer for feature dimensionality reduction and compression, yielding a single-mode network model with parallel inputs.
Step two: a cooperative attention mechanism is used to obtain supervised fusion features.
The single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model.
For the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, where n is greater than 1. The final-order fusion-mode network model is the fusion-mode network model.
Step three: multi-modal skin disease image classification is realized by combining the multi-level feature extractor with the cooperative attention mechanism.
Based on the parallel single-mode network model and the fusion-mode network model obtained in steps one and two, the clinical image and the dermoscopy image are first fed in sequence into the parallel single-mode network model; the single-mode image features are extracted by the single-mode network model obtained in step one, and the fusion-mode image features are extracted by the fusion-mode network model obtained in step two. The difference scores between the model outputs and the actual sample are then computed via cosine distance similarity, and the difference scores are processed by an activation function and a normalization function to obtain the consistency weight parameters of the different models. According to the consistency weight parameters, the sums of squared differences between the single-mode image features and the actual label category and between the fusion-mode feature output and the actual label category are computed as the training loss, which is minimized by an adaptive gradient descent method to update the network model parameters and obtain the trained network model. In actual use, only the several single-mode data need to be input to obtain the classification result output by the model.
The first step specifically comprises the following processes:
First, a single-mode model is trained with the multiple input single-mode images; the images are compression-encoded by a feature extractor based on a convolutional neural network to generate deep features. In the training stage, the input feature map of each layer is represented by a three-dimensional array of size h × w × d, where h and w are the spatial size of the feature map and d is the number of channels; adjacent layers are connected through convolution kernels of size L × H. For the convolution operation, let x_{i,j} be the pixel value at position (i, j) of the previous layer and y_{i,j} the pixel value at the corresponding position of the next layer; then

y_{i,j} = (w * x)_{i,j} + b = Σ_{l=1}^{L} Σ_{h=1}^{H} w_{l,h} · x_{i+l-1, j+h-1} + b    (1)

where * denotes convolution, b is the shared bias, and L = H = 3. The convolution is followed by a batch normalization layer and then a ReLU function for nonlinear mapping.
For the pooling operation,

y_{i,j} = (1 / (L·H)) Σ_{l=1}^{L} Σ_{h=1}^{H} x_{(i-1)·L+l, (j-1)·H+h}    (2)

where L = H = 2. To reduce the feature loss caused by the pooling layer, the compression-encoding stage uses average pooling instead of max pooling to obtain the preliminary features of the single-mode image. Multi-level image features are then extracted through a series of deep residual modules, where a residual block is defined as

y = F(x, {W_i}) + x    (3)

where x and y respectively denote the input and output of the current residual block and F(x, {W_i}) is the residual mapping to be learned. When the dimensions of x and F differ, a linear projection W_s is used to match the dimensions, as follows:

y = F(x, {W_i}) + W_s · x    (4)
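A minimal residual block matching equations (3) and (4) is sketched below; the layer arrangement and names are illustrative rather than the patent's own code.

```python
# Residual block: y = F(x) + x, with a 1x1 projection W_s when dimensions differ (eq. 4).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Linear projection W_s, used only when input and residual dimensions differ (eq. 4).
        self.projection = None
        if stride != 1 or in_channels != out_channels:
            self.projection = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.projection is None else self.projection(x)
        return self.relu(self.residual(x) + identity)
```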
As the network structure deepens, the scale of the feature map gradually decreases; extracting multi-level single-mode image features through the combination of deep residual modules integrates global and local details and provides richer semantic information. The extracted multi-level single-mode image features then undergo global average pooling and pass through a fully connected layer, after which the output of the single-mode network model is obtained by a softmax mapping:

p_i = e^{z_i} / Σ_j e^{z_j}    (5)

where z_i denotes the i-th element of the output matrix and p_i is the probability of mapping to class i.
The cooperative attention mechanism in step two is implemented as follows:
First, the multi-level single-mode image features extracted by the current deep residual modules are spliced along the spatial dimension to obtain preliminary fusion features. Based on the preliminary fusion result, a spatial attention mechanism built on a scale-aware module dynamically selects features in appropriate proportions and fuses them through self-learning:

F_fuse = Σ_k α_k ⊙ F_k    (6)

where α_k denotes the attention weight assigned to the k-th single-mode image feature F_k. Each attention weight is applied to the original single-mode feature by element-wise (dot) product to obtain the scale-aware (effective) feature, and the results are finally summed to obtain the supervised-fusion feature map, realizing the fusion of the multiple single-mode features at the current stage.
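The scale-aware fusion of equation (6) might be sketched as follows. The way the attention maps α_k are computed is an assumption (the equation image is lost), and concatenation is done along the channel dimension for simplicity rather than the spatial splicing described above; ScaleAwareFusion is a placeholder name.

```python
# Sketch of equation (6): per-modality attention maps applied by element-wise product, then summed.
import torch
import torch.nn as nn

class ScaleAwareFusion(nn.Module):
    def __init__(self, channels: int, num_modalities: int = 2):
        super().__init__()
        # One attention map per modality, computed from the concatenated preliminary fusion.
        self.attn = nn.Conv2d(num_modalities * channels, num_modalities, kernel_size=1)

    def forward(self, feats):                             # feats: list of (B, C, H, W) tensors
        concat = torch.cat(feats, dim=1)                  # preliminary fusion by concatenation
        alpha = torch.softmax(self.attn(concat), dim=1)   # weights alpha_k over the modalities
        fused = sum(alpha[:, k:k + 1] * feats[k] for k in range(len(feats)))
        return fused                                      # supervised-fusion feature map
```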
Then, the fusion features of the current stage and the previous stage are spliced along the channel dimension, and a channel attention mechanism based on an adaptive calibration module generates a one-dimensional excitation weight from the splicing result to activate each channel, strengthening attention on the channel domain. The mechanism consists of three parts: a squeeze function, an excitation function and a scale function. The squeeze function is

z_c = (1 / (H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (7)

which sums and averages the feature values within each channel of the spliced features, i.e., global average pooling. The excitation function is

s = σ( W_2 · δ( W_1 · z ) )    (8)

where σ denotes the sigmoid activation function, δ denotes the ReLU function, and W_1 and W_2 are weight matrices of dimensions (C/r) × C and C × (C/r) respectively, with C the number of channels and r a scaling parameter; the adaptive calibration module thus computes a one-dimensional channel attention vector. The scale function is

x̃_c = s_c · u_c    (9)

which is essentially a rescaling: each channel u_c is multiplied by its channel attention weight s_c, strengthening attention on the key channel domain and realizing the fusion of the multi-modal features of the current stage and the previous stage.
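A compact sketch of the squeeze / excitation / scale steps in equations (7)-(9) follows; the class name and reduction ratio default are illustrative choices, not values from the patent.

```python
# Channel attention matching equations (7)-(9): squeeze, excitation, scale.
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1: (C/r) x C
        self.fc2 = nn.Linear(channels // r, channels)   # W2: C x (C/r)

    def forward(self, x):                    # x: (B, C, H, W) spliced fusion features
        z = x.mean(dim=(2, 3))               # squeeze: global average pooling, eq. (7)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation, eq. (8)
        return x * s.unsqueeze(-1).unsqueeze(-1)               # scale each channel, eq. (9)
```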
The third step comprises the following processes:
Based on the single-mode network model and the fusion-mode network model obtained in steps one and two, the final output features of the two models are flattened and concatenated, integrating the features of both models, and consistency weight parameters are designed to maintain semantic consistency between the models. First, the difference scores between the model outputs and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, which is then normalized to obtain the consistency weight parameters of the different models in the fusion process.
During training, the loss function is as follows:

L = L_v + L_m    (10)

where the consistency losses of the single-mode branch v and the fusion-mode branch m are computed from their output features y_v and y_m and the actual sample information S:

L_v = λ_v · ‖y_v − S‖²,   L_m = λ_m · ‖y_m − S‖²

where λ_v and λ_m denote the consistency-loss weight parameters of the corresponding modalities. The resulting total loss is therefore

L = λ_v · ‖y_v − S‖² + λ_m · ‖y_m − S‖²
in the model training phase, the network parameters are updated using an adaptive gradient descent method by minimizing this loss. And in the model prediction stage, multi-modal images of the same target to be classified are input into a trained complete network, and classification prediction is carried out after single-modal features and fusion modal features extracted by a network model are weighted, so that auxiliary classification of the multi-modal images is completed.
The embodiment of the invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed to perform the steps of the method of the above-described embodiment.
In the invention, the computer device may be a microprocessor, a host computer, or other equipment.

Claims (5)

1. A method of multi-modal image classification, comprising the steps of:
S1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises multi-order cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
S2, fusing the residual modules of each order in the single-mode network model by utilizing a cooperative attention mechanism to obtain a fusion-mode network model; the specific implementation process of step S2 includes:
the single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model;
for the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, wherein n is greater than 1; the final-order fusion-mode network model is the fusion-mode network model;
and S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
2. The multi-modal image classification method according to claim 1, wherein in step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules constitute the single-mode network channel.
3. The multi-modal image classification method according to claim 1 or 2, wherein the specific implementation process of step S3 includes: the difference scores between the output of the single-mode network model, the output of the fusion-mode network model, and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, and the consistency weight distribution is normalized to obtain the consistency weight parameters of the different models in the fusion process; according to the consistency weight parameters, a first difference between the single-mode image features output by the single-mode network model and the actual label category and a second difference between the fusion-mode feature output of the fusion-mode network model and the actual label category are computed; the sum of squares of the first difference and the second difference is used as the loss function of network training, the loss function is minimized by an adaptive gradient descent method, the network model parameters are updated, and the trained network model, i.e., the classification model, is obtained.
4. The method of multi-modal image classification according to claim 1 or 2, characterized in that the number of single-modal network channels is 2.
5. A multi-modal image classification system comprising a computer device; the computer device is configured or programmed for carrying out the steps of the method according to one of claims 1 to 4.
CN202110355430.5A 2021-04-01 2021-04-01 Multi-modal image classification method and system Active CN113516133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Publications (2)

Publication Number Publication Date
CN113516133A CN113516133A (en) 2021-10-19
CN113516133B (en) 2022-06-17

Family

ID=78062230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355430.5A Active CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Country Status (1)

Country Link
CN (1) CN113516133B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230169794A1 (en) * 2021-11-30 2023-06-01 Irina Kezele Method, device, and medium for adaptive inference in compressed video domain
CN114332592B (en) * 2022-03-11 2022-06-21 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994B (en) * 2022-05-18 2022-08-19 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217B (en) * 2022-12-02 2023-04-07 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102215757B1 (en) * 2019-05-14 2021-02-15 경희대학교 산학협력단 Method, apparatus and computer program for image segmentation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Modal Retinal Image Classification With Modality-Specific Attention Network; X. He, Y. Deng, L. Fang and Q. Peng; IEEE; 2021-02-24; Vol. 40, No. 6; full text *
Research on Multi-modal Data Modeling and Retrieval for Common Space Learning; 陈思佳; China Master's Theses Full-text Database (Basic Sciences); 2019-02-28 (No. 02); full text *

Also Published As

Publication number Publication date
CN113516133A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN113516133B (en) Multi-modal image classification method and system
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN114418030B (en) Image classification method, training method and device for image classification model
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN112308081A (en) Attention mechanism-based image target prediction method
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN115017178A (en) Training method and device for data-to-text generation model
CN114821050A (en) Named image segmentation method based on transformer
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN112532251A (en) Data processing method and device
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN113628107B (en) Face image super-resolution method and system
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation
CN114581789A (en) Hyperspectral image classification method and system
CN113989566A (en) Image classification method and device, computer equipment and storage medium
CN113538199B (en) Image steganography detection method based on multi-layer perception convolution and channel weighting
CN115936073B (en) Language-oriented convolutional neural network and visual question-answering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant