CN113516133B - Multi-modal image classification method and system - Google Patents
- Publication number: CN113516133B (application CN202110355430.5A)
- Authority
- CN
- China
- Prior art keywords
- mode
- network model
- fusion
- order
- mode network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Abstract
The invention discloses a multi-modal image classification method and system. A plurality of single-modal network feature extraction modules are established, each comprising multi-order cascaded residual modules; the last-order residual module is connected in sequence to a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel, and all single-modal network channels together form the single-modal network model. The residual modules of each order in the single-modal network model are fused using a cooperative attention mechanism to obtain a fused-modal network model. Finally, the single-modal network model and the fused-modal network model are fused to obtain the classification model. The invention can improve classification precision.
Description
Technical Field
The invention relates to the field of image processing, in particular to a multi-modal image classification method and system.
Background
Currently, image classification technology based on deep learning is widely applied: a deep learning model processes a captured image to judge the category of the object in the image. Existing image classification methods process a single-modal image, but a single-modal image cannot fully cover the characteristics of the target object, which limits classification precision.
Disclosure of Invention
The technical problem to be solved by the invention is the above insufficiency of the prior art; to this end, a multi-modal image classification method and system are provided to improve classification accuracy.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a method of multi-modal image classification comprising the steps of:
s1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises a multi-order cascaded residual module, and the last-order residual module is sequentially connected with a pooling layer, a full connection layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
s2, fusing each order residual error module in the single mode network model by utilizing a cooperative attention mechanism to obtain a fused mode network model;
and S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
The invention provides a multi-level fusion learning network structure that processes each single-modal feature through depth residual modules and transmits semantic information layer by layer. As the network structure deepens, the scale of the feature map gradually shrinks, global and local detail information is integrated, and richer semantic information is provided. Meanwhile, the single-modal features mutually supervise and fuse each other, continuously enriching and perfecting the single-modal feature information and improving the effective spatial resolution of the feature maps, thereby improving image classification precision.
In step S1, the fully connected layer and the classification layer of a ResNet50 network are removed, and the remaining cascaded residual modules form the single-mode network channel. This structure improves the expressive capability of the network through sufficient depth and layer-by-layer feature learning; higher-level image features are extracted by the cascaded residual modules, and from stage 1 to stage n the learned feature representation grows more complex as the network deepens, so the semantic information contained in the output feature map becomes richer.
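As a rough illustration of this channel structure, the following NumPy sketch mimics a single-mode channel: a cascade of toy residual stages (each shrinking the feature-map scale), followed by pooling, a fully connected layer and softmax. All function names and sizes here are illustrative stand-ins, not the patent's actual ResNet50 layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_stage(x):
    """Toy stand-in for one cascaded residual stage: a learned transform plus
    an identity shortcut, then a stride-2 downsample so the feature-map scale
    shrinks from stage 1 to stage n, as described above."""
    return (x + np.tanh(x))[:, ::2, ::2]

def single_modal_channel(img, n_stages, w, b):
    """One single-mode network channel: cascaded residual stages, then global
    average pooling, a fully connected layer, and a softmax layer."""
    feats, x = [], img
    for _ in range(n_stages):
        x = residual_stage(x)
        feats.append(x)                      # per-stage features, later fused
    v = x.mean(axis=(1, 2))                  # pooling layer (global average)
    z = v @ w + b                            # fully connected layer
    p = np.exp(z - z.max())
    p /= p.sum()                             # softmax layer
    return feats, p

img = rng.normal(size=(8, 32, 32))           # one modality: (channels, H, W)
w, b = rng.normal(size=(8, 5)), np.zeros(5)  # 5 hypothetical classes
feats, probs = single_modal_channel(img, 3, w, b)
print([f.shape for f in feats])              # spatial size halves each stage
```

Two such channels run in parallel, one per modality, forming the single-mode network model.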
The specific implementation process of step S2 includes: respectively and sequentially performing a convolution operation on the single-mode image features extracted by each 1st-order residual module of all single-mode channels in the single-mode network model, and inputting each convolution result into an activation function to obtain a spatial attention vector; multiplying the spatial attention vector by the single-mode image features extracted by each 1st-order residual module to obtain 1st-order fusion features; performing a convolution operation on the fusion features and inputting the result into a Sigmoid activation function to obtain a 1st-order fusion-modal network model. For the nth-order residual modules, a convolution operation is likewise performed on the single-mode image features extracted by each nth-order residual module of all single-mode channels, and each convolution result is input into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain nth-order fusion features; the nth-order fusion features are concatenated with the (n−1)th-order fusion features, a convolution operation is performed on the concatenated features, and the result is input into a Sigmoid activation function to obtain an nth-order fusion-modal network model, where n is greater than 1. The final-order fusion-modal network model is the fusion-modal network model.
The model can efficiently fuse the cross and complementary information among the single modes of each stage, reduce the operation amount of the network model in an information compression mode, emphasize key image information through an attention mechanism, and simultaneously inhibit noise information irrelevant to image classification.
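The first-order fusion step above can be sketched as follows in NumPy. The patent elides the name of the activation that produces the spatial attention vector, so a Sigmoid is assumed here; the 1×1-convolution stand-ins and array shapes are likewise illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_first_order(feats, kernels, k_out):
    """First-order fusion: per modality, a 1x1-convolution stand-in scores
    each spatial position, an activation turns the scores into a spatial
    attention map, and the map re-weights that modality's features; the
    re-weighted features are then combined by a final conv + Sigmoid."""
    attended = []
    for f, k in zip(feats, kernels):
        scores = np.tensordot(k, f, axes=([0], [0]))  # (H, W) score map
        a = sigmoid(scores)                           # spatial attention vector
        attended.append(a[None] * f)                  # multiply with features
    stacked = np.concatenate(attended, axis=0)        # (M*C, H, W)
    fused = sigmoid(np.tensordot(k_out, stacked, axes=([0], [0])))
    return fused                                      # (H, W), values in (0, 1)

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
feats = [rng.normal(size=(C, H, W)) for _ in range(2)]  # two modalities
kernels = [rng.normal(size=C) for _ in range(2)]        # 1x1-conv weights
fused = fuse_first_order(feats, kernels, rng.normal(size=2 * C))
print(fused.shape)
```

For orders n > 1, the same routine would additionally concatenate the previous order's fusion features before the final convolution.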
The specific implementation process of step S3 includes: respectively calculating the difference scores between the outputs of the single-mode network model and of the fusion-modal network model and the actual sample through cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, which is then normalized, thereby obtaining the consistency weight parameters of the different models in the fusion process. According to the consistency weight parameters, a first difference value between the single-mode image features output by the single-mode network model and the actual label category, and a second difference value between the fusion-modal feature output of the fusion-modal network model and the actual label category, are respectively calculated; the sum of squares of the first and second difference values serves as the loss function of network training, which is minimized by an adaptive gradient descent method to update the network model parameters, yielding the trained network model, i.e. the classification model. This step integrates the output information of the single-mode network model and the fusion-modal network model, and dynamically adjusts the importance of each model through the consistency weights, so as to obtain better global information and improve the performance and generalization ability of the model on image classification tasks.
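A minimal sketch of the consistency-weighted fusion loss follows. The patent does not name the squashing activation or the normalization function, so tanh and a softmax-style normalization are assumptions, as is the convention that the branch closer to the target receives the larger weight.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def consistency_weights(outputs, target):
    """Score each model branch by the cosine distance between its output and
    the actual sample, squash with an activation (tanh is an assumption), and
    normalize so a branch closer to the target gets a larger fusion weight."""
    d = np.array([cosine_distance(o, target) for o in outputs])
    e = np.exp(-np.tanh(d))          # smaller distance -> larger weight
    return e / e.sum()               # normalization step

def fusion_loss(single_out, fused_out, target, w):
    """Sum of squared differences of both branches against the label,
    weighted by the consistency weight parameters."""
    d1 = np.sum((single_out - target) ** 2)   # first difference value
    d2 = np.sum((fused_out - target) ** 2)    # second difference value
    return w[0] * d1 + w[1] * d2

target = np.array([1.0, 0.0, 0.0])
single = np.array([0.8, 0.1, 0.1])
fused = np.array([0.9, 0.05, 0.05])
w = consistency_weights([single, fused], target)
print(w)
```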
In the invention, in order to simplify the model and improve the classification efficiency, the number of the single-mode network channels is 2.
The invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
1. Aiming at the limitations of shallow image features, the invention provides a multi-level fusion learning network structure based on a convolutional neural network. The framework processes each single-modal feature through depth residual modules and transmits semantic information layer by layer. As the network structure deepens, the scale of the feature map gradually shrinks, global and local detail information is integrated, and richer semantic information is provided. Meanwhile, the single-modal features mutually supervise and fuse each other, continuously enriching and perfecting the single-modal feature information and improving the effective spatial resolution of the feature maps.
2. A fusion learning framework based on a cooperative attention mechanism is provided on top of the multi-modal fusion. The framework is independent of any particular single-modal network, can be readily embedded into mainstream backbone networks, and helps preserve the unique characteristics and exclusivity of each single modality. In addition, the framework maintains the similarity structures between and within modalities while accounting for modality cooperation and feature fusion, realizes consistency maximization across the representation sets of different modalities, allows data to be transmitted and shared among the different modalities, and shows excellent performance.
Drawings
FIG. 1 is a block diagram of a monomodal network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of a converged modal network model architecture according to an embodiment of the present invention;
FIG. 3 is a block diagram of a classification model according to an embodiment of the present invention.
Detailed Description
A multi-modal image classification method based on multi-level fusion learning and cooperative attention mechanism comprises the following steps:
taking the skin disease classification task as an example, the classification diagnosis of the skin disease is performed by using information of two modalities, namely a clinical image and a skin mirror image.
Step one, a multi-level feature extraction network is adopted to obtain hierarchical clinical image single-mode features and skin mirror image single-mode features.
Firstly, the two input single-mode images are encoded by a feature extraction module based on a deep convolutional neural network to generate depth image features. Specifically, two single-mode network feature extraction modules are established, used respectively for extracting clinical image features and dermoscopy image features; each feature extraction module consists of a ResNet50 network with its fully connected layer and classification layer removed, and the two modules are in a parallel relation. Convolution and batch normalization operations are performed on each input single-mode image, followed by a ReLU function for nonlinear mapping; the single-mode image information is then further compression-encoded through a pooling layer to obtain initial image features, and multi-level single-mode image features are extracted through several serially connected depth residual modules. The last depth residual block is connected in sequence to a pooling layer, a fully connected layer and a softmax layer for feature dimension reduction and compression, yielding a single-mode network model with parallel inputs.
Step two, adopting a cooperative attention mechanism to acquire supervised fusion features.
Convolution operations are performed in turn on the single-mode image features extracted by the 1st-order residual modules of all single-mode channels in the single-mode network model, and each convolution result is input into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is performed on the fusion features, and the result is input into a Sigmoid activation function to obtain the 1st-order fusion-modal network model.
For the nth-order residual modules, convolution operations are likewise performed on the single-mode image features extracted by each nth-order residual module of all single-mode channels, and each convolution result is input into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features. The nth-order fusion features are concatenated with the (n−1)th-order fusion features, a convolution operation is performed on the concatenated features, and the result is input into a Sigmoid activation function to obtain the nth-order fusion-modal network model, where n is greater than 1. The final-order fusion-modal network model is the fusion-modal network model.
Step three, realizing multi-modal skin disease image classification by combining the multi-level feature extractor and the cooperative attention mechanism.
Based on the parallel single-mode network model and the fusion-modal network model obtained in steps one and two, the clinical image and the dermoscopy image are first input in turn into the parallel single-mode network model. The output single-mode image features are extracted by the single-mode network model obtained in step one, and the output fusion-modal image features are extracted by the fusion-modal network model obtained in step two. The difference scores between each model output and the actual sample are then computed via cosine distance similarity, and the scores are processed by an activation function followed by a normalization function to obtain consistency weight parameters for the different models. According to the consistency weight parameters, the sums of squares of the differences between the single-mode image features (and the fusion-modal feature output) and the actual label category are computed as the network training loss, which is minimized by an adaptive gradient descent method to update the network model parameters and obtain the trained network model. In actual use, only the several single-mode data need to be input to obtain the classification result output by the model.
The first step specifically comprises the following processes:
firstly, a single-mode model is trained by utilizing a plurality of input single-mode images, the images are compressed by encoding through a feature extractor based on a convolutional neural network to generate depth features, and in a training stage, the input features of each layer are mapped by a three-dimensional arrayIs shown in whichhAndwis the size of the feature map and,dis the number of channels of the feature map, the feature map pass size of the adjacent layer is largeIs small asAre connected, for convolution operations, setIs the previous layerThe value of the pixel of the location is,is the pixel value of the corresponding position of the next layer, has
Where denotes convolution, b is the biased shared value, L = H =3, and the convolution is followed by a batch normalization layer and then by a Relu function for non-linear mapping.
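The conv + batch-norm + ReLU pipeline described above can be written out directly; this toy NumPy version uses the patent's 3×3 kernel and a per-map normalization as a simplified stand-in for batch normalization.

```python
import numpy as np

def conv2d_single(x, w, b):
    """Valid convolution of one channel with a shared kernel w and bias b,
    matching the patent's formula with L = H = 3."""
    L, Hk = w.shape
    out_h, out_w = x.shape[0] - L + 1, x.shape[1] - Hk + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(w * x[i:i + L, j:j + Hk]) + b
    return y

def batch_norm(y, eps=1e-5):
    """Simplified normalization over the single feature map."""
    return (y - y.mean()) / np.sqrt(y.var() + eps)

def relu(y):
    return np.maximum(y, 0.0)

x = np.arange(25.0).reshape(5, 5)
out = relu(batch_norm(conv2d_single(x, np.ones((3, 3)) / 9.0, 0.0)))
print(out.shape)   # a 5x5 input with a 3x3 valid kernel gives a 3x3 map
```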
For the pooling operation,

$$y_{i,j} = \frac{1}{L \cdot H} \sum_{l=1}^{L} \sum_{h=1}^{H} x_{(i-1)L+l,\,(j-1)H+h}$$

where $L = H = 2$. To reduce the feature loss caused by the pooling layer, the compression-encoding stage uses average pooling instead of maximum pooling to obtain the preliminary features of the single-mode image. Multi-level image features are then extracted through several depth residual modules, where a residual block is defined as:

$$y = F(x) + x$$

where $x$ and $y$ respectively represent the input and output of the current residual block, and $F(x)$ represents the residual mapping that needs to be learned. When the dimensions of $x$ and $F(x)$ differ, a linear projection $W_s$ is applied to match the dimensions, as follows:

$$y = F(x) + W_s x$$
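The residual-block definition, including the projected shortcut for mismatched dimensions, is small enough to sketch directly; the toy residual mappings below are placeholders for learned convolutions.

```python
import numpy as np

def residual_block(x, F, Ws=None):
    """y = F(x) + x; when the dimensions of x and F(x) differ, the shortcut
    is projected with a linear map Ws so that y = F(x) + Ws @ x."""
    shortcut = x if Ws is None else Ws @ x
    return F(x) + shortcut

x = np.ones(4)
same_dim = residual_block(x, lambda v: 0.5 * v)        # identity shortcut
Ws = np.full((2, 4), 0.25)                             # project 4 -> 2 dims
proj = residual_block(x, lambda v: np.zeros(2), Ws)    # projected shortcut
print(same_dim, proj)
```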
along with the deepening of a network structure, the scale of the feature map is gradually reduced, multi-level single-mode image features are extracted through the combination of the depth residual error modules, global and local details can be integrated, and richer semantic information is provided. And then carrying out global average pooling on the extracted multi-level single-mode image characteristics, and after passing through a full connection layer, mapping through a softmax function to obtain an output result of the single-mode network model:
whereinTo represent the output matrixiThe value of each of the elements is,to be mapped toiThe probability of a class.
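The softmax mapping at the head of each channel is standard; the version below subtracts the maximum logit first, a common numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax(z):
    """p_i = exp(z_i) / sum_j exp(z_j); shifting by max(z) keeps the
    exponentials numerically stable without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(int(p.argmax()))   # the largest logit gets the highest probability
```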
The specific implementation process of the cooperative attention mechanism in the step two comprises the following steps:
firstly, extracting multi-level single-mode image characteristics from current depth residual error moduleThe method comprises the following steps of obtaining preliminary fusion characteristics through spatial dimension splicing, dynamically selecting appropriate proportion characteristics according to a preliminary fusion result through a spatial attention mechanism based on a scale perception module, and fusing through self-learning, wherein the proportion characteristics comprise the following steps:
whereinRespectively representing the second of the features of a single-mode imageAnd respectively carrying out dot product on the characteristic values and the original single-mode characteristics to obtain the characteristics (effective characteristics) after scale perception, and finally summing to obtain a feature graph after supervision fusion, so that the fusion of a plurality of single-mode characteristics at the current stage is realized.
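A NumPy sketch of this scale-aware fusion follows. The per-position scoring (a channel mean) and the softmax over modalities are assumptions standing in for the learned scoring that the source elides; the point is the structure: normalized proportions per position, dot-multiplied with the original features, then summed.

```python
import numpy as np

def scale_aware_fusion(feats):
    """Stack the single-modal feature maps, derive a per-position score for
    each modality (channel mean as a stand-in for learned scoring), softmax
    over modalities so proportions sum to 1 at every position, re-weight each
    modality's original features, and sum into the supervised-fusion map."""
    stack = np.stack(feats)                      # (M, C, H, W)
    scores = stack.mean(axis=1)                  # (M, H, W)
    e = np.exp(scores - scores.max(axis=0))
    a = e / e.sum(axis=0)                        # proportions per position
    fused = (a[:, None] * stack).sum(axis=0)     # (C, H, W)
    return fused, a

rng = np.random.default_rng(3)
feats = [rng.normal(size=(4, 6, 6)) for _ in range(2)]  # two modalities
fused, a = scale_aware_fusion(feats)
print(fused.shape)
```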
Then, the fusion features of the current stage and the previous stage are concatenated along the channel dimension, and the concatenation result is passed through a channel attention mechanism based on an adaptive calibration module, which generates a one-dimensional excitation weight to activate each channel, strengthening attention to the channel domain. The mechanism consists of three parts: a squeeze function, an excitation function, and a scale function. The squeeze function is:

$$z_c = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} u_c(i, j)$$

which sums and averages the feature values within each channel $u_c$ of the concatenated features, realizing global average pooling. The excitation function is:

$$s = \sigma\!\left(W_2\, \delta\!\left(W_1 z\right)\right)$$

where $\sigma$ represents a Sigmoid activation function, $\delta$ denotes the ReLU function, and $W_1$ and $W_2$ are matrices of dimensions $\frac{C}{r} \times C$ and $C \times \frac{C}{r}$ respectively, with $C$ the number of channels and $r$ a scaling parameter; the adaptive calibration module thus computes a one-dimensional channel attention vector. The scale function is:

$$\tilde{x}_c = s_c \cdot u_c$$

essentially a scaling process that multiplies each channel $u_c$ by its channel attention weight $s_c$, strengthening attention to the key channel domain and realizing the fusion of the multi-modal features of the current and previous stages.
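The squeeze, excitation, and scale functions map directly onto a few lines of NumPy; the random weight matrices here merely stand in for the learned parameters of the adaptive calibration module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze(u):
    """z_c: global average pooling of each channel of the concatenated features."""
    return u.mean(axis=(1, 2))

def excite(z, W1, W2):
    """s = sigmoid(W2 @ relu(W1 @ z)): a bottleneck of ratio r that yields a
    one-dimensional channel attention vector with entries in (0, 1)."""
    return sigmoid(W2 @ np.maximum(W1 @ z, 0.0))

def scale(u, s):
    """Multiply every channel by its attention weight."""
    return u * s[:, None, None]

rng = np.random.default_rng(4)
C, r = 8, 2
u = rng.normal(size=(C, 6, 6))               # concatenated fusion features
W1 = rng.normal(size=(C // r, C))            # dims C/r x C
W2 = rng.normal(size=(C, C // r))            # dims C x C/r
s = excite(squeeze(u), W1, W2)
out = scale(u, s)
print(out.shape, s.shape)
```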
The third step comprises the following processes:
based on the single mode network model and the fusion mode network model respectively obtained in the first step and the second step, the final output characteristics of the two models are expanded and connected together, the characteristics of the two models are integrated, consistency weight parameters are designed to keep semantic consistency between the models, firstly, difference scores between model outputs and actual samples are obtained through cosine distance similarity calculation, and the difference scores are usedThe activation function converts the difference score into a consistent weight distribution for reuseAnd the function carries out normalization processing on the difference scores so as to obtain consistency weight parameters of different models in the fusion process.
During the training process, the specific loss function is as follows. Let $y_v$ and $y_m$ respectively denote the output features of the single-mode branch $v$ and the fusion-modal branch $m$, and let $S$ denote the actual sample information; the consistency loss of each branch is computed as its squared error against $S$:

$$\mathcal{L}_v = \left\| y_v - S \right\|^2, \qquad \mathcal{L}_m = \left\| y_m - S \right\|^2$$

With $\alpha_v$ and $\alpha_m$ representing the consistency-loss weight parameters of the corresponding modalities, the resulting loss function is:

$$\mathcal{L} = \alpha_v \left\| y_v - S \right\|^2 + \alpha_m \left\| y_m - S \right\|^2$$
in the model training phase, the network parameters are updated using an adaptive gradient descent method by minimizing this loss. And in the model prediction stage, multi-modal images of the same target to be classified are input into a trained complete network, and classification prediction is carried out after single-modal features and fusion modal features extracted by a network model are weighted, so that auxiliary classification of the multi-modal images is completed.
The embodiment of the invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed to perform the steps of the method of the above-described embodiment.
In the invention, the computer device can be a microprocessor, a host computer, or similar equipment.
Claims (5)
1. A method of multi-modal image classification, comprising the steps of:
s1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises a multi-order cascaded residual module, and the last-order residual module is sequentially connected with a pooling layer, a full connection layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
s2, fusing each order residual error module in the single mode network model by utilizing a cooperative attention mechanism to obtain a fused mode network model; the specific implementation process of step S2 includes:
respectively and sequentially performing a convolution operation on the single-mode image features extracted by the 1st-order residual modules of all single-mode channels in the single-mode network model, and inputting each convolution result into an activation function to obtain a spatial attention vector; multiplying the spatial attention vector by the single-mode image features extracted by each 1st-order residual module to obtain 1st-order fusion features, performing a convolution operation on the fusion features, and inputting the result of the convolution operation into a Sigmoid activation function to obtain a 1st-order fusion-modal network model;
for the nth-order residual modules, respectively and sequentially performing a convolution operation on the single-mode image features extracted by each nth-order residual module of all single-mode channels in the single-mode network model, and inputting each convolution result into an activation function to obtain a spatial attention vector; multiplying the spatial attention vector by the single-mode image features extracted by each nth-order residual module to obtain nth-order fusion features; concatenating the nth-order fusion features with the (n−1)th-order fusion features, performing a convolution operation on the concatenated features, and inputting the result of the convolution operation into a Sigmoid activation function to obtain an nth-order fusion-modal network model, wherein n is greater than 1; the final-order fusion-modal network model being the fusion-modal network model;
and S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
2. The multi-modal image classification method according to claim 1, wherein in step S1, the fully connected layer and the classification layer of a ResNet50 network are removed, and the remaining cascaded residual modules constitute the single-mode network channel.
3. The multi-modal image classification method according to claim 1 or 2, wherein step S3 is implemented as follows: respectively calculating the difference scores between the outputs of the single-mode network model and of the fusion-modal network model and the actual sample through cosine distance similarity; converting the difference scores into a consistency weight distribution with an activation function and normalizing the distribution, thereby obtaining the consistency weight parameters of the different models in the fusion process; according to the consistency weight parameters, respectively calculating a first difference value between the single-mode image features output by the single-mode network model and the actual label category, and a second difference value between the fusion-modal feature output of the fusion-modal network model and the actual label category; taking the sum of squares of the first and second difference values as the loss function of network training, minimizing the loss function by an adaptive gradient descent method, and updating the network model parameters to obtain the trained network model, i.e. the classification model.
4. The method of multi-modal image classification according to claim 1 or 2, characterized in that the number of single-modal network channels is 2.
5. A multi-modal image classification system comprising a computer device; the computer device is configured or programmed for carrying out the steps of the method according to one of claims 1 to 4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110355430.5A (CN113516133B) | 2021-04-01 | 2021-04-01 | Multi-modal image classification method and system |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113516133A | 2021-10-19 |
| CN113516133B | 2022-06-17 |
Family ID: 78062230

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110355430.5A | Multi-modal image classification method and system | 2021-04-01 | 2021-04-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN113516133B (en) |
Families Citing this family (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230169794A1 | 2021-11-30 | 2023-06-01 | | Method, device, and medium for adaptive inference in compressed video domain |
| CN114332592B * | 2022-03-11 | 2022-06-21 | 中国海洋大学 | Ocean environment data fusion method and system based on attention mechanism |
| CN114638994B * | 2022-05-18 | 2022-08-19 | 山东建筑大学 | Multi-modal image classification system and method based on attention multi-interaction network |
| CN115546217B * | 2022-12-02 | 2023-04-07 | 中南大学 | Multi-level fusion skin disease diagnosis system based on multi-modal image data |
Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018213841A1 | 2017-05-19 | 2018-11-22 | Google Llc | Multi-task multi-modal machine learning model |
| CN108710830A | 2018-04-20 | 2018-10-26 | 浙江工商大学 | A dense human-body 3D pose estimation method combining a connected attention pyramid residual network with equidistant constraints |
| CN110674677A | 2019-08-06 | 2020-01-10 | 厦门大学 | Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face |
| CN111192200A | 2020-01-02 | 2020-05-22 | 南京邮电大学 | Image super-resolution reconstruction method based on fusion attention mechanism residual network |
| CN111325155A | 2020-02-21 | 2020-06-23 | 重庆邮电大学 | Video motion recognition method based on residual-type 3D CNN and a multi-modal feature fusion strategy |
| CN111709265A | 2019-12-11 | 2020-09-25 | 深学科技(杭州)有限公司 | Camera monitoring state classification method based on attention mechanism residual network |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102215757B1 * | 2019-05-14 | 2021-02-15 | 경희대학교 산학협력단 | Method, apparatus and computer program for image segmentation |

- 2021-04-01: application CN202110355430.5A filed; granted as patent CN113516133B (status: active)
Non-Patent Citations (2)
Title |
---|
Multi-Modal Retinal Image Classification With Modality-Specific Attention Network; X. He, Y. Deng, L. Fang and Q. Peng; IEEE; 2021-02-24; Vol. 40, No. 6; full text *
Research on Multi-modal Data Modeling and Retrieval for Common-Space Learning; Chen Sijia; China Master's Theses Full-text Database (Basic Sciences); February 2019 (No. 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113516133A (en) | 2021-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113516133B (en) | Multi-modal image classification method and system | |
CN110490946B (en) | Text-to-image generation method based on cross-modal similarity and generative adversarial networks | |
CN107480206B (en) | Image content question-answering method based on multi-modal low-rank bilinear pooling | |
CN112381097A (en) | Scene semantic segmentation method based on deep learning | |
CN110929080B (en) | Optical remote sensing image retrieval method based on attention and generative adversarial networks | |
CN110175248B (en) | Face image retrieval method and device based on deep learning and Hash coding | |
CN114418030B (en) | Image classification method, training method and device for image classification model | |
CN112489164B (en) | Image coloring method based on improved depth separable convolutional neural network | |
CN113239897B (en) | Human body action evaluation method based on space-time characteristic combination regression | |
CN112308081A (en) | Attention mechanism-based image target prediction method | |
CN113628059A (en) | Associated user identification method and device based on multilayer graph attention network | |
CN115017178A (en) | Training method and device for data-to-text generation model | |
CN114821050A (en) | Named image segmentation method based on transformer | |
Qi et al. | Learning low resource consumption cnn through pruning and quantization | |
CN116030537B (en) | Three-dimensional human body pose estimation method based on multi-branch attention graph convolution | |
CN112532251A (en) | Data processing method and device | |
WO2023173552A1 (en) | Establishment method for target detection model, application method for target detection model, and device, apparatus and medium | |
CN116189306A (en) | Human behavior recognition method based on joint attention mechanism | |
CN113628107B (en) | Face image super-resolution method and system | |
CN115063374A (en) | Model training method, face image quality scoring method, electronic device and storage medium | |
CN114494284A (en) | Scene analysis model and method based on explicit supervision area relation | |
CN114581789A (en) | Hyperspectral image classification method and system | |
CN113989566A (en) | Image classification method and device, computer equipment and storage medium | |
CN113538199B (en) | Image steganography detection method based on multi-layer perception convolution and channel weighting | |
CN115936073B (en) | Language-oriented convolutional neural network and visual question-answering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||