CN114821206A - Multi-modal image fusion classification method and system based on confrontation complementary features - Google Patents

Multi-modal image fusion classification method and system based on confrontation complementary features

Info

Publication number
CN114821206A
Authority
CN
China
Prior art keywords
features
image
level
fusion
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210755253.4A
Other languages
Chinese (zh)
Other versions
CN114821206B (en)
Inventor
袭肖明
王可崧
聂秀山
尹义龙
张光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University filed Critical Shandong Jianzhu University
Priority to CN202210755253.4A priority Critical patent/CN114821206B/en
Publication of CN114821206A publication Critical patent/CN114821206A/en
Application granted granted Critical
Publication of CN114821206B publication Critical patent/CN114821206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The invention provides a multi-modal image fusion classification method and system based on confrontation (adversarial) complementary features, belonging to the technical field of image classification. The modalities to be fused are selected from a plurality of modalities; low-level feature extraction yields the key feature information vectors of the images, and it is judged whether a first channel fusion and a first similarity calculation can be carried out. High-level feature extraction is then performed, and it is judged whether a second channel fusion and a second similarity calculation can be carried out. Clustering and contrastive learning are carried out on the feature maps extracted at the low and high levels, so that complementary information is effectively mined and fused, the complementarity among the features is enhanced, and the image fusion classification accuracy is improved.

Description

Multi-modal image fusion classification method and system based on confrontation complementary features
Technical Field
The disclosure relates to the technical field of image classification, in particular to a multi-modal image fusion classification method and system based on confrontation complementary features.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Image classification is an important research direction in computer vision and is widely applied in numerous tasks such as object recognition, face recognition, video analysis, and disease diagnosis. Although existing image classification methods can achieve good performance when large amounts of data are available, they perform poorly on classification tasks with few images. In addition, using only single-modality information has certain limitations; for example, in a task of classifying with multi-view images, a single view cannot completely describe a scene, which results in poor classification performance.
Deep learning has been widely applied to fields such as natural language processing and image processing because of its excellent feature extraction and learning ability. However, in some multi-modal classification tasks the data are scarce, and deep learning tends to overfit. In addition, existing deep-learning-based multi-modal fusion methods ignore the effective mining and fusion of complementary information during fusion, which limits the improvement of fusion classification accuracy.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a multi-modal image fusion classification method and system based on confrontation (adversarial) complementary features. It adopts a dual-branch contrastive learning network structure and introduces a prototype learning module based on coarse-grained density clustering to learn the typical features of each category, representing each class center with multiple points so that the learned typical features generalize better. A channel fusion module based on adversarial learning is also introduced: by mining the correlations of the features within channels, the channels that contribute most to the model are selected and fused with the other modalities, thereby enhancing the complementarity between features.
According to some embodiments, the present disclosure adopts the following technical solution:
a multi-modal image fusion classification method based on confrontation complementary features, comprising the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and performing clustering and contrastive learning respectively on the feature maps extracted at the low and high levels, calculating the classification loss, and finally predicting the image to obtain the corresponding category scores, the category corresponding to the maximum category score being taken as the prediction result.
According to some other embodiments, the present disclosure adopts the following technical solution:
a multi-modal image fusion classification system based on confrontation complementary features, comprising:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the feature extraction module is used for selecting the modalities to be fused from the plurality of modalities, inputting the image data of each modality into the neural network model in groups, and extracting low-level features to obtain the image key feature information vectors; it also loads the features obtained from the low-level feature extraction into another convolutional neural network for convolution operations to perform high-level feature extraction, extracting the image key feature information vectors to obtain the feature map groups of the image groups;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed using the BN layer and a decision threshold is set; the influence factor measures the influence of a channel on the final prediction, and when it is higher than the set threshold the channel is fused with the other modalities in proportion across the sub-networks of the different modalities;
the modal similarity calculation module is used for comparing, after the low-level and high-level feature extraction, the feature map groups obtained from the shallow-level and high-level features of the different modalities of a scene or object, and calculating the similarity;
and the calculation and prediction module is used for calculating the mean square error loss and outputting the category with the maximum similarity score as the prediction result of the image.
Compared with the prior art, the beneficial effects of the present disclosure are:
compared with the prior method which has small data volume and only uses a single mode, the method shows excellence in image data classification, on one hand, the method introduces a prototype learning module based on coarse-grained density clustering to learn typical characteristics of categories by using a contrast learning network structure, and expresses class centers of the categories by using multi-cluster points to solve the problem of small sample distribution imbalance, and on the other hand, the method introduces a channel fusion module based on contrast learning to strengthen information interaction between the modes by mining the correlation of the characteristics in the channels.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram of a network learning process for implementing image classification by the multi-modal image fusion algorithm of the present disclosure;
fig. 2 is a schematic diagram of a model framework of an image classification system provided by the present disclosure.
Detailed Description of Embodiments
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The present disclosure provides a multi-modal image fusion classification method based on confrontation complementary features, which is shown in fig. 1 and comprises the following steps:
step 1: acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
step 2: extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
step 3: extracting high-level features from the features produced by the low-level feature extraction, extracting the image key feature information vectors again, judging whether the second channel fusion can be carried out, and carrying out the second similarity calculation on the obtained high-level image key feature information vectors;
step 4: clustering and contrastive learning are respectively carried out on the feature maps extracted at the low and high levels, the classification loss is calculated, and finally the image is predicted to obtain the corresponding category scores; the category corresponding to the maximum category score is taken as the prediction result.
A dual-branch contrastive learning network structure is adopted, and a prototype learning module based on coarse-grained density clustering is introduced to learn the typical features of each category; the class center of each category is represented by multiple cluster points, so that the learned typical features generalize better. To fully fuse multi-modal complementary information, a channel fusion module based on adversarial learning is introduced: by mining the correlations of the features within channels, the channels that contribute most to the model are selected and fused with the other modalities, thereby enhancing the complementarity between features.
Specifically, multi-modal image data are first acquired and then preprocessed. Because the images in the data set may contain extraneous information irrelevant to the classification task, the objects or scenes to be classified are annotated and the annotated regions are extracted. The same data enhancement is applied to the image group of one scene or one object in the multi-modal data set, and different data enhancements are applied to different image groups; the main data enhancement methods include random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping, Gaussian noise addition, and the like. The enhanced images are then rescaled to a uniform size.
Because not all regions of the original images are needed (owing to acquisition equipment or manual annotation issues), the required parts are selected with bounding boxes and extracted. The image sizes may also be inconsistent because the data come from different sources, while the images must be input to the neural network in a uniform format during training; therefore the original data are rescaled with a transform operation in python and converted to the required size, which is 224 x 224 in the present disclosure. Since the tasks targeted by the present disclosure involve small data, the same data enhancement is applied to each grouped image pair in the original data, and different data enhancements are applied to different image pairs.
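For illustration only, the following is a minimal preprocessing sketch in Python, assuming torchvision is used for the transform operations mentioned above; the helper name `augment_group`, the augmentation parameters, and the per-group seeding scheme are illustrative assumptions, not taken from the patent.

```python
import torch
from torchvision import transforms

# Augmentations named in the description (random cropping, horizontal/vertical
# flipping, random rotation), followed by rescaling to the uniform 224 x 224 size.
# The probabilities and angles below are illustrative values.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

def augment_group(pil_images, group_seed):
    """Apply the *same* random enhancement to every image of one scene/object
    group by re-seeding the RNG with the group's seed before each image;
    different groups use different seeds and therefore different enhancements."""
    out = []
    for img in pil_images:
        torch.manual_seed(group_seed)  # identical random draws within the group
        out.append(augment(img))
    return out
```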
After preprocessing the data, features are extracted from the images. The modalities to be fused are selected from the plurality of modalities, the image data of each modality are input into the neural network model in groups, and low-level features are extracted to obtain the key feature information vectors of the images. Specifically, N images are randomly selected per batch and M modalities are selected, so that N x M images are input; for example, if one batch contains 16 images and 2 modalities are selected, 32 images are input to the model at a time.
After the image data are processed, the image groups are loaded simultaneously, according to the batch size and the selected modalities, into a custom neural network for low-level feature extraction; the key feature information vectors of the images are extracted after convolution, yielding the feature map groups of the image groups.
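As a rough illustration of the two-stage extraction described here and in the high-level stage below (one custom network for low-level features, a second convolutional network for high-level features), the sketch below uses PyTorch; the layer widths, strides, and class names are assumptions, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class LowLevelExtractor(nn.Module):
    """First custom CNN: produces the low-level feature map group of an image group."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):          # x: (N*M, 3, 224, 224) images of all modalities
        return self.net(x)         # low-level feature maps

class HighLevelExtractor(nn.Module):
    """Second CNN: further convolutions on the (possibly channel-fused) low-level
    features, ending in the key-feature information vectors used later."""
    def __init__(self, in_ch=64, out_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.net(x).flatten(1)   # (N*M, out_ch) key feature vectors
```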
After the low-level feature extraction, whether the first channel fusion can be carried out is judged. Specifically, an influence factor is designed using the BN (batch normalization) layer: the BN layer normalizes the features along the batch dimension and then applies a scaling-and-shifting (affine) transformation, for which the method introduces the two trainable parameters γ (scale) and β (shift). A decision threshold is set, the importance of a channel to the model is calculated as the influence factor γ, and β measures the bias.
If the influence factor γ of a channel is lower than the threshold (0.3), only normalization and the affine transformation are performed:
x̂ = γ · (x − μ) / √(σ² + ε) + β
If the influence factor γ is higher than the threshold (0.3), channel fusion between the modalities is performed, the channel being fused with the other modalities in proportion across the sub-networks of the different modalities.
Here x̂ denotes the features after the affine transformation; x denotes the original features; σ² denotes the variance of the original features; μ and β are the mean and the bias; γ measures the degree of influence of the channel on the model; ε is a small constant that avoids division by zero; the two modality sub-networks are trained separately; l denotes the l-th feature map (layer) in the model; m denotes the m-th modality and c denotes the c-th category.
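A minimal sketch of this channel-fusion step, assuming PyTorch: the BN layer's learned scale parameters serve as the per-channel influence factors, channels below the 0.3 threshold keep their affine-transformed features, and channels above it are fused across the two modality sub-networks. The proportional weighting used here follows the "fused in proportion" wording above, but the exact fusion formula is an assumption.

```python
import torch
import torch.nn as nn

def channel_fusion(feat_a, feat_b, bn_a: nn.BatchNorm2d, bn_b: nn.BatchNorm2d, tau: float = 0.3):
    """feat_a, feat_b: BN-normalized feature maps of the two modality sub-networks, (B, C, H, W).
    bn_a, bn_b:        the BN layers whose scale parameters (gamma) act as influence factors.
    tau:               decision threshold (0.3 in the description)."""
    gamma_a = bn_a.weight.detach().abs()              # influence factors of modality A, shape (C,)
    gamma_b = bn_b.weight.detach().abs()              # influence factors of modality B
    mask = (gamma_a > tau).float().view(1, -1, 1, 1)  # channels selected for cross-modal fusion

    # Proportional weighting between the two modality sub-networks (assumed rule).
    w_a = (gamma_a / (gamma_a + gamma_b + 1e-8)).view(1, -1, 1, 1)
    fused = w_a * feat_a + (1.0 - w_a) * feat_b

    # Below-threshold channels keep the plain affine-transformed features of modality A.
    return mask * fused + (1.0 - mask) * feat_a
```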
After the low-level feature extraction and channel fusion are finished, the features produced by the low-level feature extraction are input into another neural network for high-level feature extraction: they are loaded into the other network for convolution operations, and the key feature information vectors of the images are extracted to obtain the feature map groups of the image groups. Whether a second channel fusion can be performed is then judged again; if so, it is realized in the same way as the first channel fusion, with a preset decision threshold, and when the calculated influence factor is higher than the set threshold the second channel fusion is performed to obtain the finally fused feature map groups.
In step 2, after the low-level feature extraction and the first channel fusion, a first similarity calculation is performed on the feature map groups obtained by fusing the low-level features of the different modalities of a scene or object. The features after shallow feature extraction should be separated into different categories as far as possible, i.e. the features learned by the model at this stage are required to be dissimilar.
Similarly, in step 3, after the high-level feature extraction, the second channel fusion is performed; the modality features extracted from the same scene or object are compared once with the features stored in the typical queue, the similarity between modalities is calculated with a classifier, and a second similarity calculation is performed on the feature map groups obtained by high-level feature extraction from the different modalities of a scene or object. The features after high-level feature extraction should be grouped into the same category as far as possible, i.e. the features learned by the model at this stage are required to be similar.
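The two similarity calculations can thus be read as opposite objectives at the two levels: low-level features of different modalities should remain dissimilar, high-level features should become similar. Below is a small sketch of such a pair of similarity losses, assuming cosine similarity; the concrete similarity measure and the classifier-based formulation used in the patent are not reproduced here.

```python
import torch
import torch.nn.functional as F

def modality_similarity(feat_m1, feat_m2):
    """Mean cosine similarity between the flattened feature maps of two modalities."""
    v1 = F.normalize(feat_m1.flatten(1), dim=1)
    v2 = F.normalize(feat_m2.flatten(1), dim=1)
    return (v1 * v2).sum(dim=1).mean()

def similarity_losses(low_m1, low_m2, high_m1, high_m2):
    # First similarity calculation: penalize similarity of low-level features
    # (they should stay dissimilar across modalities).
    loss_low = modality_similarity(low_m1, low_m2)
    # Second similarity calculation: reward similarity of high-level features
    # (they should become similar across modalities).
    loss_high = 1.0 - modality_similarity(high_m1, high_m2)
    return loss_low, loss_high
```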
The fused feature maps obtained above then need to be classified and subjected to contrastive learning to obtain the classification prediction. Specifically, after the original images pass through the low-level feature extraction, the high-level feature extraction, and the first and second channel fusions, the fused feature maps are obtained and denoted X_1, X_2, ..., X_N. These feature maps are compared with the features in the typical queue to calculate the classification loss, where X_i denotes the bottom-layer features of the i-th sample and X_j denotes the bottom-layer features of the j-th sample.
The above-mentioned typical queue is obtained as follows: after the low-level and high-level feature extraction and the two channel fusions, coarse-grained density clustering is performed on the features of the same class within a batch_size. Each feature is first regarded as an independent class, the pairwise distances are calculated, and whenever the minimum distance is lower than a threshold the two closest features are merged into one class and the distances between the new class and all other classes are recalculated, until the minimum pairwise distance is higher than the threshold. Density clustering is then performed on the resulting classes, and the features are stored in the typical queue; new typical features are continuously generated during training, and the typical queue is updated accordingly.
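A sketch of this typical-queue construction under the description above, assuming PyTorch tensors; merging the two closest clusters by averaging and the threshold value are assumptions.

```python
import torch

def coarse_grained_cluster(features, dist_threshold):
    """Coarse-grained density clustering of same-class features from one batch.
    Each feature starts as its own class; the two closest clusters are repeatedly
    merged (here by averaging, an assumed rule) until the minimum pairwise
    distance exceeds the threshold.  The surviving cluster centers are the
    'typical' multi-point class centers."""
    clusters = [f.clone() for f in features]           # each feature is an independent class
    while len(clusters) > 1:
        stacked = torch.stack(clusters)
        dists = torch.cdist(stacked, stacked)           # pairwise distances
        dists.fill_diagonal_(float("inf"))
        if dists.min() > dist_threshold:
            break                                       # minimum distance above threshold: stop
        idx = int(dists.argmin())
        i, j = divmod(idx, len(clusters))
        merged = (clusters[i] + clusters[j]) / 2.0      # merge the closest pair into one class
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return torch.stack(clusters)

def update_typical_queue(queue, features_per_class, dist_threshold=1.0):
    """Refresh the typical queue with newly clustered centers for each class."""
    for cls, feats in features_per_class.items():
        queue[cls] = coarse_grained_cluster(feats, dist_threshold)
    return queue
```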
The loss functions measuring the modal similarity after the low-level feature extraction and after the high-level feature extraction are denoted here as L_low and L_high, respectively. They are computed from the outputs of an N-way classifier: the classifier predicts, for the j-th image of modality i, one of N categories, and N x M is the number of input images; i indexes the i-th modality and j indexes the j-th image.
The mean square error loss is calculated as the Euclidean distance between the predicted data and the real data; the closer the predicted value is to the true value, the smaller the mean square error. The category corresponding to the maximum score is the prediction category. The mean square error loss is computed between the currently output prediction and the historically weighted prediction output, according to the formula
L_mse = (1 / (N x M)) Σ (ŷ − y)²,
where ŷ is the predicted image class, y is the true image class, and N x M is the number of input images.
The mean square error loss and the two modal similarity losses L_low and L_high are used in the small-sample network as the total loss function:
L = L_mse + λ1 · L_low + λ2 · L_high,
where λ1 and λ2 are hyper-parameters used to balance the importance of the two modal similarity losses. Back-propagation training is repeated with this loss until the set number of training rounds is reached, and the model with the minimum loss or the best validation-set performance is saved. Prediction is then carried out with the trained network model to obtain the corresponding category scores, and the category corresponding to the maximum category score is the prediction result.
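A compact sketch of the total objective as read from the description (mean square error plus the two weighted modal-similarity losses), assuming PyTorch and one-hot targets for the MSE term; the weights λ1, λ2 and the helper name are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_scores, target_classes, loss_low, loss_high,
               lambda1=0.5, lambda2=0.5, num_classes=10):
    """L = L_mse + lambda1 * L_low + lambda2 * L_high (weights are illustrative)."""
    target_onehot = F.one_hot(target_classes, num_classes).float()
    l_mse = F.mse_loss(pred_scores, target_onehot)   # mean square error between predicted
                                                     # category scores and the true classes
    return l_mse + lambda1 * loss_low + lambda2 * loss_high

# One back-propagation step (model/optimizer are placeholders for the trained networks):
#   loss = total_loss(scores, labels, loss_low, loss_high)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
# At test time, the predicted category is scores.argmax(dim=1), i.e. the class
# with the maximum category score.
```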
Example 2
The present disclosure provides a multi-modal image fusion classification system based on confrontation complementary features, which includes:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the feature extraction module is used for selecting the modalities to be fused from the plurality of modalities, inputting the image data of each modality into the neural network model in groups, and extracting low-level features to obtain the image key feature information vectors; it also loads the features obtained from the low-level feature extraction into another convolutional neural network for convolution operations to perform high-level feature extraction, extracting the image key feature information vectors to obtain the feature map groups of the image groups;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed using the BN layer and a decision threshold is set; the influence factor is used to calculate the influence of a channel on the final prediction, and when it is higher than the set threshold the channel is fused with the other modalities in proportion across the sub-networks of the different modalities, so as to enhance the complementarity between features;
the modal similarity calculation module is used for comparing, after the low-level and high-level feature extraction, the feature map groups obtained by extracting the shallow-level and high-level features of the two modalities of a scene or object, and calculating the similarity;
and the calculation and prediction module is used for calculating the mean square error loss and outputting the category with the maximum similarity score as the prediction result of the image.
As shown in the image classification system model framework of fig. 2, the part inside the dashed frame in fig. 2 is the system module that mainly performs the classification function; it uses the custom neural networks together with the channel fusion module and the modal similarity calculation module, the model is trained to determine suitable network parameters, and finally the prediction stage is carried out to obtain the required result.
The user inputs the image data to be tested into the classification system; the five processes of feature vector extraction, channel fusion, modal similarity calculation, contrastive learning and prediction-category calculation are carried out automatically within the system, and finally the predicted category is output to the user.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A multi-modal image fusion classification method based on confrontation complementary features, characterized in that the training step comprises the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and clustering and comparative learning are respectively carried out on the feature maps extracted by the low-level feature and the high-level feature, classification loss is calculated, finally, prediction of the image is carried out to obtain a corresponding category score, and the category corresponding to the maximum value of the category score is used as a prediction result.
2. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein the preprocessing of the image data comprises: annotating the image data in the acquired data set, removing extraneous information irrelevant to the classification task, marking the objects and scenes to be classified in the images, extracting the annotated regions, performing the same data enhancement on the image group of one scene or one object in the multi-modal data set, performing different data enhancements on different image groups, and rescaling the enhanced images to a uniform size.
3. The multi-modal image fusion classification method based on confrontation complementary features according to claim 2, wherein the data enhancement comprises random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping, and Gaussian noise addition.
4. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein N images are randomly selected per batch, M modalities are selected, and N x M images are input; the image groups are loaded simultaneously, according to the batch size and the selected modalities, into the neural network for low-level feature extraction, and the image key feature information vectors are extracted after convolution to obtain the feature map groups of the image groups.
5. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein after the low-level feature extraction, whether the first channel fusion can be carried out is judged, specifically: an influence factor is designed using the BN layer and a decision threshold is set; the influence factor is used to calculate the influence of the channel on the final prediction, and when the influence factor is higher than the set threshold the channel is fused with the other modalities in proportion across the sub-networks of the different modalities.
6. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 5, wherein the influence factor is the scale parameter γ of the BN-layer affine transformation
x̂ = γ · (x − μ) / √(σ² + ε) + β,
wherein x̂ represents the features after the affine transformation; x represents the original features; σ² represents the variance of the original features; μ and β are the mean and the bias; ε is a small constant that avoids division by zero; the two modality sub-networks are trained separately; l denotes the l-th feature map in the model, m denotes the m-th modality, and c denotes the c-th category; γ measures the degree of influence of the channel on the model, and if it is higher than the threshold, channel fusion between the modalities is carried out to obtain the fused features.
7. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein the features obtained from the low-level feature extraction are loaded into another convolutional neural network for convolution operations to extract the high-level features, the image key feature information vectors are extracted to obtain the feature map groups of the image groups, whether the second channel fusion can be carried out is judged again, and if so, the second channel fusion is carried out to obtain the finally fused feature map groups.
8. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein after the low-level and high-level feature extraction, the feature map groups obtained by extracting shallow-level and high-level features from the different modalities of a scene or object are subjected to one modal feature comparison, a classifier is used to calculate the similarity between the modalities, the features after the low-level feature extraction are required to be dissimilar features, and the features after the high-level feature extraction are required to be similar features.
9. The multi-modal image fusion classification method based on confrontation complementary features as claimed in claim 1, wherein the features obtained from the original images in each batch_size through shallow-level and high-level feature extraction and channel fusion are subjected to coarse-grained density clustering by category, relatively coarse-grained sub-classes being divided within each category; the classes are density-clustered and stored in a typical queue, and the typical queue is continuously updated as training proceeds; after the original images pass through the shallow-level feature extraction module, the high-level feature extraction module and the two channel fusion modules, the obtained feature maps, denoted X1 and X2, are respectively concatenated with the features in the typical queue, and the classification loss is calculated.
10. A multi-modal image fusion classification system based on confrontation complementary features, characterized by comprising:
the data acquisition and processing module is used for acquiring and preprocessing multi-modal image data;
the feature extraction module is used for selecting the modalities to be fused from the plurality of modalities, inputting the image data of each modality into the neural network model in groups, and extracting low-level features to obtain the image key feature information vectors; it also loads the features obtained from the low-level feature extraction into another convolutional neural network for convolution operations to perform high-level feature extraction, extracting the image key feature information vectors to obtain the feature map groups of the image groups;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed using the BN layer and a decision threshold is set, the influence factor is used to calculate the influence of the channel on the final prediction, and when the influence factor is higher than the set threshold the channel is fused with the other modalities in proportion across the sub-networks of the different modalities;
the modal similarity calculation module is used for comparing, after the low-level and high-level feature extraction, the feature map groups obtained by extracting the shallow-level and high-level features of the different modalities of a scene or object, and calculating the similarity;
and the calculation and prediction module is used for calculating the mean square error loss and outputting the category with the maximum similarity score as the prediction result of the image.
CN202210755253.4A 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features Active CN114821206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Publications (2)

Publication Number Publication Date
CN114821206A true CN114821206A (en) 2022-07-29
CN114821206B CN114821206B (en) 2022-09-13

Family

ID=82523286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210755253.4A Active CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Country Status (1)

Country Link
CN (1) CN114821206B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504255A (en) * 2016-11-02 2017-03-15 南京大学 A kind of multi-Target Image joint dividing method based on multi-tag multi-instance learning
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
WO2021022752A1 (en) * 2019-08-07 2021-02-11 深圳先进技术研究院 Multimodal three-dimensional medical image fusion method and system, and electronic device
CN112215262A (en) * 2020-09-21 2021-01-12 清华大学 Image depth clustering method and system based on self-supervision contrast learning
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MATHILDE CARON等: "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments", 《ARXIV》 *
YANLING WANG等: "ClusterSCL: Cluster-Aware Supervised Contrastive Learning on Graphs", 《IN PROCEEDINGS OF THE ACM WEB》 *
ZHEN LI等: "CLMLF:A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection", 《ARXIV》 *
ZENG HUA et al.: "Multi-granularity self-learning algorithm for clustering non-uniform-density clusters", Systems Engineering and Electronics (系统工程与电子技术) *
YANG XIAO et al.: "Multi-modal breast image classification based on hierarchical dual attention network", Journal of Shandong University (Engineering Science) (山东大学学报(工学版)) *

Also Published As

Publication number Publication date
CN114821206B (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN109934293B (en) Image recognition method, device, medium and confusion perception convolutional neural network
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
CN106372624B (en) Face recognition method and system
CN112434732A (en) Deep learning classification method based on feature screening
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
Lomio et al. Classification of building information model (BIM) structures with deep learning
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
CN115578248B (en) Generalized enhanced image classification algorithm based on style guidance
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114863464A (en) Second-order identification method for PID drawing picture information
CN114627424A (en) Gait recognition method and system based on visual angle transformation
CN106980878B (en) Method and device for determining geometric style of three-dimensional model
CN112991281A (en) Visual detection method, system, electronic device and medium
CN116645562A (en) Detection method for fine-grained fake image and model training method thereof
CN114821206B (en) Multi-modal image fusion classification method and system based on confrontation complementary features
CN114511745B (en) Three-dimensional point cloud classification and rotation gesture prediction method and system
CN113887509B (en) Rapid multi-modal video face recognition method based on image set
CN113450344B (en) Strip steel surface defect detection method and system
CN115620083A (en) Model training method, face image quality evaluation method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant