CN114821206B - Multi-modal image fusion classification method and system based on confrontation complementary features - Google Patents

Multi-modal image fusion classification method and system based on confrontation complementary features Download PDF

Info

Publication number
CN114821206B
Authority
CN
China
Prior art keywords
features
image
level
fusion
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210755253.4A
Other languages
Chinese (zh)
Other versions
CN114821206A (en)
Inventor
袭肖明
王可崧
聂秀山
尹义龙
张光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University filed Critical Shandong Jianzhu University
Priority to CN202210755253.4A priority Critical patent/CN114821206B/en
Publication of CN114821206A publication Critical patent/CN114821206A/en
Application granted granted Critical
Publication of CN114821206B publication Critical patent/CN114821206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention provides a multi-modal image fusion classification method and system based on antagonistic complementary features, belonging to the technical field of image classification. The method comprises the steps of selecting the modalities to be fused from a plurality of modalities, extracting low-level features to obtain the key feature information vectors of the images, and judging whether a first channel fusion and a first similarity calculation can be carried out; high-level feature extraction is then carried out, and whether a second channel fusion and a second similarity calculation can be carried out is judged; clustering and contrastive learning are carried out on the feature maps extracted from the low-level and high-level features, so that complementary information is effectively mined and fused, the complementarity among the features is enhanced, and the image fusion and classification precision is improved.

Description

Multi-modal image fusion classification method and system based on confrontation complementary features
Technical Field
The disclosure relates to the technical field of image classification, in particular to a multi-modal image fusion classification method and system based on confrontation complementary features.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Image classification is an important research direction of computer vision and has wide application in numerous tasks such as object recognition, face recognition, video analysis and disease diagnosis. Although existing image classification methods can achieve good performance when large amounts of data are available, they perform poorly on classification tasks with few images. In addition, using only single-modality information has certain limitations; for example, in a task of classifying with multi-view images, a single view cannot completely describe the scene, which results in poor classification performance.
Deep learning has been widely applied to fields such as natural language processing and image processing because of its excellent feature extraction and learning ability. However, some multi-modal classification tasks have little data, and deep learning then tends to fall into overfitting. In addition, existing multi-modal fusion methods based on deep learning ignore the effective mining and fusion of complementary information during fusion, which limits the improvement of the fusion classification precision.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a multi-modal image fusion classification method and system based on antagonistic complementary features. It adopts a dual-branch contrastive learning network structure and introduces a prototype learning module based on coarse-grained density clustering to learn the typical features of each category, representing each class center with multiple points, so that the learned typical features generalize better. A channel fusion module based on adversarial learning is also introduced: by mining the correlation of the features within a channel, the channels that bring a large improvement to the model are selected and fused with the other modalities, which strengthens the complementarity between the features.
According to the implementation of some embodiments, the following technical scheme is adopted in the present disclosure:
the multi-modal image fusion classification method based on the antagonistic complementary features comprises the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and clustering and contrastive learning are respectively carried out on the feature maps extracted from the low-level and high-level features, the classification loss is calculated, prediction of the image is finally carried out to obtain the corresponding category scores, and the category corresponding to the maximum category score is used as the prediction result.
According to other embodiments, the following technical scheme is adopted in the present disclosure:
a multi-modal image fusion classification system based on antagonistic complementary features, comprising:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the characteristic extraction module is used for selecting a mode to be fused from a plurality of modes, inputting image data in each mode into the neural network model according to groups, and extracting low-level characteristics to obtain image key characteristic information vectors; loading the features obtained from the low-level feature extraction into another convolutional neural network for convolutional operation to perform high-level feature extraction, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion among the sub-networks of the different modalities.
The modal similarity calculation module is used for performing, after the low-level and high-level feature extraction, a feature comparison between the feature map groups obtained from the shallow and high-level features of the different modalities of one scene or object, and calculating the similarity.
And the calculation and prediction module is used for calculating the mean square error loss and outputting the category where the maximum similarity score is located as the prediction result of the image.
Compared with the prior art, the beneficial effects of the present disclosure are:
Compared with existing methods that work with small data volumes and use only a single modality, the disclosed method performs well in image data classification. On one hand, using a contrastive learning network structure, it introduces a prototype learning module based on coarse-grained density clustering to learn the typical features of each category and represents each class center with multiple cluster points, which alleviates the distribution imbalance of small samples; on the other hand, it introduces a channel fusion module based on adversarial learning that strengthens the information interaction between the modalities by mining the correlation of the features within the channels.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram of a network learning process for implementing image classification by the multi-modal image fusion algorithm of the present disclosure;
fig. 2 is a schematic diagram of a model framework of an image classification system provided by the present disclosure.
Detailed Description of Embodiments
the present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example 1
The present disclosure provides a multi-modal image fusion classification method based on antagonistic complementary features, which is shown in fig. 1 and comprises the following steps:
step 1: acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
step 2: extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
and step 3: extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and 4, step 4: and clustering and contrast learning are respectively carried out on the feature maps extracted by the low-level feature and the high-level feature, classification loss is calculated, finally, prediction of the image is carried out to obtain a corresponding category score, and the category corresponding to the maximum value of the similarity category score is used as a prediction result.
A dual-branch contrastive learning network structure is adopted, and a prototype learning module based on coarse-grained density clustering is introduced to learn the typical features of each category, with the class center of each category represented by multiple cluster points, so that the learned typical features generalize better. In order to fully fuse the multi-modal complementary information, a channel fusion module based on adversarial learning is introduced: channels that bring a large improvement to the model are selected and fused with the other modalities by mining the correlation of the features within the channels, so that the complementarity among the features is enhanced.
Specifically, multi-modal image data is first acquired and then preprocessed. Because the images in the data set may contain external information irrelevant to the classification task, the objects or scenes to be classified are annotated and the annotated regions are extracted. The same data enhancement is applied to the image group of one scene or the same object in the multi-modal data set, and different data enhancement is applied to different image groups; the main data enhancement methods include random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping and Gaussian noise addition. The enhanced images are then converted to a uniform size.
Because not all regions of the original images are needed (owing to problems of the acquisition machines, manual handling and the like), the required parts are selected with bounding boxes and extracted. Since the data come from different sources, the image sizes may be inconsistent, and the images must be input to the neural network in a unified format during training; the original data are therefore rescaled with a transform routine in Python and converted to the required size, which in this disclosure is 224 × 224. Because the disclosure targets small-data tasks, the same data enhancement is applied to the grouped image pairs in the original data, and different data enhancement is applied to different image pairs.
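As one possible illustration (not the patent's own code), the preprocessing described above can be sketched with torchvision transforms; the specific transform parameters, the Gaussian-noise level, and the per-group seeding trick used to keep the augmentation identical inside one multi-modal group are assumptions.

    import random
    import torch
    from torchvision import transforms

    # Augmentations named in the description: random cropping, horizontal/vertical
    # flipping, random rotation, Gaussian noise, then resizing to a uniform 224 x 224.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomRotation(15),            # rotation range is an assumed value
        transforms.ToTensor(),
    ])

    def add_gaussian_noise(x, std=0.01):          # noise level is an assumed value
        return x + torch.randn_like(x) * std

    def preprocess_group(pil_images, group_seed):
        """Apply the same random augmentation to every modality of one scene/object
        by fixing the RNG seed per group; different groups get different seeds."""
        out = []
        for img in pil_images:
            random.seed(group_seed)
            torch.manual_seed(group_seed)
            out.append(add_gaussian_noise(augment(img)))
        return torch.stack(out)                   # shape: (num_modalities, 3, 224, 224)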
After preprocessing the data, features are extracted from the images. A modality to be fused is selected from the plurality of modalities, and the image data in each modality are input into the neural network model in groups to extract low-level features and obtain the image key feature information vectors. Specifically, N images are randomly selected per batch and M modalities are selected, so N × M images are input; for example, if a batch contains 16 images and 2 modalities are selected, 32 images are input to the model at a time.
After image data are processed, image groups are simultaneously loaded and input into a user-defined neural network for low-level feature extraction according to the number of image input batches and selected modes, and key feature information vectors of the images are extracted after convolution to obtain feature map groups of the image groups.
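A minimal sketch of the grouped input and the per-modality low-level extraction stage; the patent does not specify the user-defined network, so the small convolutional stem, the channel width and the tensor layout below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ShallowExtractor(nn.Module):
        """Illustrative low-level feature stem; the actual user-defined network is not specified."""
        def __init__(self, in_ch=3, width=64):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_ch, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),            # the bn layer later reused for the influence factor
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        def forward(self, x):
            return self.stem(x)

    N, M = 16, 2                                   # 16 scenes per batch, 2 modalities -> 32 images
    batch = torch.randn(N, M, 3, 224, 224)         # one image group per scene/object
    extractors = nn.ModuleList([ShallowExtractor() for _ in range(M)])  # one sub-network per modality
    low_feats = [extractors[m](batch[:, m]) for m in range(M)]          # per-modality feature map groups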
After the low-level feature extraction, it is determined whether the first channel fusion can be performed. Specifically, an influence factor is designed using the bn (batch normalization) layer, which normalizes the features over the batch dimension and then scales and shifts them; the method introduces the two trainable parameters $\gamma$ (scale) and $\beta$ (shift) and trains them. A judgment threshold is set, and $\gamma$ is used as the influence factor that measures the importance of a channel to the model, while $\beta$ measures the bias.

If $\gamma$ is below the threshold 0.3, only the normalization and the subsequent affine transformation are performed:

$\hat{x}_{m}^{l} = \gamma_{m}^{l}\,\dfrac{x_{m}^{l}-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta_{m}^{l}$

If $\gamma$ is above the threshold 0.3, inter-modal channel fusion is performed, i.e. the channel is fused with the corresponding channel of the sub-network of the other modality (the fusion formula itself is given only as an image in the original publication).

Here $\hat{x}_{m}^{l}$ denotes the feature after the affine transformation; $x_{m}^{l}$ denotes the original feature; $\sigma^{2}$ denotes the variance and $\mu$ the mean of the original features; $\epsilon$ is a small constant that avoids division by zero; $\gamma$ measures the degree of influence of the channel on the model and $\beta$ measures the bias; $f_{m}$ and $f_{m'}$ are the training networks of the two modalities; $l$ indexes the $l$-th feature map (layer) in the model; $m$ represents the $m$-th modality and $c$ the $c$-th category.
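The channel-selection rule can be illustrated as follows, assuming (as the description suggests) that the bn scale $\gamma$ serves as the influence factor and that "fusion" means blending in the corresponding channel of the other modality; the blending proportion alpha and the threshold value are assumptions, since the patent gives the fusion formula only as an image.

    import torch
    import torch.nn as nn

    def channel_fuse(feat_a, feat_b, bn_a: nn.BatchNorm2d, threshold=0.3, alpha=0.5):
        """Adversarial-style channel fusion sketch.

        feat_a, feat_b: (B, C, H, W) feature maps of the same layer from two modality sub-networks.
        bn_a:           bn layer of modality A whose scale gamma acts as the influence factor.
        Channels whose |gamma| exceeds the threshold are blended with the other modality;
        the remaining channels keep the plain normalized-and-affine-transformed value.
        """
        normed = bn_a(feat_a)                          # gamma * (x - mu) / sqrt(var + eps) + beta
        gamma = bn_a.weight.detach().abs()             # per-channel influence factor
        important = gamma > threshold                  # channels worth fusing across modalities
        fused = normed.clone()
        # alpha is an assumed fusion proportion between the two modality sub-networks
        fused[:, important] = alpha * normed[:, important] + (1 - alpha) * feat_b[:, important]
        return fused

    # usage sketch
    bn = nn.BatchNorm2d(64)
    a, b = torch.randn(8, 64, 56, 56), torch.randn(8, 64, 56, 56)
    out = channel_fuse(a, b, bn)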
After the low-level feature extraction and the first channel fusion are finished, the result is input into another neural network for high-level feature extraction: the features processed by the low-level extraction are loaded into another convolutional neural network for further convolution operations, and the image key feature information vectors are extracted to obtain the feature map group of the image group. Whether a second channel fusion can be performed is then judged again; if so, this fusion is realized in the same way as the first channel fusion, with a preset judgment threshold, and if the calculated influence factor is higher than the set threshold, the second channel fusion is performed to obtain the finally fused feature map group.
In step 2, after the low-level feature extraction and the first channel fusion, a first similarity calculation is performed on the feature map groups obtained by low-level feature fusion of the different modalities of one scene or object. The features after shallow feature extraction should be separated into different categories as far as possible, that is, the features learned by the model at this stage are required to be dissimilar.
Similarly, in step 3, after the high-level feature extraction, the second channel fusion is performed, the modal features extracted from the same scene or object are compared once with the features in the typical queue, the similarity between modalities is calculated with a classifier, and a second similarity calculation is performed on the feature map groups obtained by high-level feature extraction of the different modalities of one scene or object. The features after high-level feature extraction should be grouped into the same category as far as possible, that is, the features learned by the model at this stage are required to be similar.
The fused feature maps obtained above now need to be classified to obtain the classification prediction result, in combination with contrastive learning. Specifically,
after the original images are subjected to low-level and high-level feature extraction and to the first and second channel fusion, the fused feature maps are obtained and recorded as $X_{1}, X_{2}, \dots, X_{n}$. Each of $X_{1}, X_{2}, \dots, X_{n}$ is compared with the features in the typical queue to calculate the classification loss, where $X_{i}$ denotes the bottom-layer features of the $i$-th sample and $X_{j}$ the bottom-layer features of the $j$-th sample.
The above-mentioned typical queue is obtained as follows: after low-level and high-level feature extraction and the two channel fusions, coarse-grained density clustering is performed on the features of the same category within a batch_size. Each feature is first regarded as an independent class and the distance between every pair of features is calculated; whenever the minimum distance is below a threshold, the two closest features are merged into one class and the distances between the new class and all other classes are recalculated, until the minimum pairwise distance is above the threshold. Density clustering is then performed on the resulting classes and the features are stored in the typical queue; new typical features are continuously generated during training, and the typical queue is updated accordingly.
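The merging rule described above amounts to threshold-stopped agglomerative clustering; the sketch below is one way to realize it, where the Euclidean distance metric, the threshold value, and representing a merged class by the mean of its members are assumptions.

    import torch

    def coarse_grained_clusters(features: torch.Tensor, threshold: float = 1.0):
        """Merge features of one category into coarse clusters.

        features: (K, D) feature vectors of the same category within a batch_size.
        Each feature starts as its own class; the two closest classes are merged while
        their distance is below the threshold, and the new class is represented by the
        mean of its members (an assumption about how the distances are recalculated).
        Returns the cluster centres, which are pushed into the typical queue.
        """
        clusters = [f.clone() for f in features]       # every feature is an independent class
        while len(clusters) > 1:
            centres = torch.stack(clusters)
            dists = torch.cdist(centres, centres)      # pairwise Euclidean distances
            dists.fill_diagonal_(float("inf"))
            if dists.min() >= threshold:               # stop once all classes are far apart
                break
            i, j = divmod(int(dists.argmin()), len(clusters))
            merged = (clusters[i] + clusters[j]) / 2   # representative of the new class
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return torch.stack(clusters)

    # usage sketch: update the typical queue for one category
    typical_queue = {}
    feats = torch.randn(16, 128)                       # fused features of one category in a batch
    typical_queue["class_0"] = coarse_grained_clusters(feats)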
The loss functions used to measure the modal similarity after low-level feature extraction and after high-level feature extraction are defined as $L_{low}$ and $L_{high}$, respectively. In the calculation of $L_{low}$ (whose closed-form expression is given only as an image in the original publication), $\hat{y}_{ij}$ is the class of the $j$-th image of modality $i$ predicted by the N-classifier, and $N \times M$ is the number of input images; the N-classifier can distinguish the input images into $N$ categories, $i$ denotes the $i$-th modality and $j$ the $j$-th image.
The mean square error loss is then calculated as the Euclidean distance between the predicted data and the real data; the closer the predicted value is to the true value, the smaller the mean square error. The category corresponding to the maximum score is the predicted category. The mean square error loss is calculated between the currently output prediction and the prediction output by historical weighting:

$L_{mse} = \dfrac{1}{N \times M}\sum(\hat{y}-y)^{2}$

where $\hat{y}$ is the predicted image class, $y$ is the true image class, and $N \times M$ is the number of input images.
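The "historical weighting" of predictions can be read as an exponential moving average of past outputs; the sketch below is one possible interpretation, with the EMA form and the momentum value being assumptions rather than the patent's stated formula.

    import torch
    import torch.nn.functional as F

    def historical_mse(current_pred, history, momentum=0.9):
        """MSE between the current prediction and an exponentially weighted
        historical prediction; momentum and the EMA form are assumptions."""
        history = momentum * history + (1 - momentum) * current_pred.detach()
        return F.mse_loss(current_pred, history), history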
In the small-sample network, the mean square error loss and the two modal similarity losses $L_{low}$ and $L_{high}$ are combined into the total loss function, which can be written as a weighted sum:

$L = L_{mse} + \lambda_{1} L_{low} + \lambda_{2} L_{high}$

where $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters used to balance the importance of the two modal similarity losses. Back-propagation training with this loss is repeated until the set number of training rounds is reached, the result with the smallest loss function value (or the best validation-set performance) is saved, and the trained network model is used for prediction to obtain the corresponding category scores; the category corresponding to the maximum category score is the prediction result.
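A hedged sketch of how the total loss could be assembled and back-propagated; the model's output format, the data loader, the one-hot MSE target and the lambda values are placeholders rather than the patent's actual implementation.

    import torch

    def total_loss(mse_loss, loss_low, loss_high, lambda1=0.5, lambda2=0.5):
        # lambda1/lambda2 balance the two modal similarity losses (values are assumptions)
        return mse_loss + lambda1 * loss_low + lambda2 * loss_high

    def train(model, loader, optimizer, epochs=100):
        """Illustrative training loop; the model is assumed to return class scores
        and the two modal similarity losses for each multi-modal batch."""
        best = float("inf")
        for epoch in range(epochs):
            for images, labels in loader:
                scores, loss_low, loss_high = model(images)        # assumed model outputs
                target = torch.nn.functional.one_hot(labels, scores.size(1)).float()
                mse = torch.nn.functional.mse_loss(scores, target)
                loss = total_loss(mse, loss_low, loss_high)
                optimizer.zero_grad()
                loss.backward()                                    # back-propagation training
                optimizer.step()
                if loss.item() < best:                             # keep the weights with minimum loss
                    best = loss.item()
                    torch.save(model.state_dict(), "best_model.pt")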
Example 2
The present disclosure provides a multi-modal image fusion classification system based on confrontation complementary features, comprising:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the characteristic extraction module is used for selecting a modality to be fused from a plurality of modalities, inputting image data in each modality into the neural network model according to groups, and extracting low-level characteristics to obtain an image key characteristic information vector; loading the features obtained from the low-level feature extraction into another convolutional neural network for convolutional operation to perform high-level feature extraction, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion between the sub-networks of the different modalities, so as to enhance the complementarity between the features.
The modal similarity calculation module is used for comparing, after the low-level and high-level feature extraction, the feature map groups obtained by extracting the shallow and high-level features of the two modalities of one scene or object, and calculating the similarity.
And the calculation and prediction module is used for calculating the mean square error loss and outputting the category where the maximum similarity score is located as the prediction result of the image.
As shown in the model framework of Fig. 2, the modules inside the dashed box are the system modules that perform the classification function; they include the custom neural network, the channel fusion module and the modal similarity calculation module. The model is trained to determine appropriate network parameters, and the prediction stage is then carried out to obtain the required result.
The user inputs the image data to be tested into the classification system; the five stages of feature vector extraction, channel fusion, modal similarity calculation, contrastive learning and prediction category calculation are carried out automatically inside the system, and the predicted category is finally output to the user.
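At inference time the category with the maximum class score is returned; a minimal sketch, where the trained model and its output format are assumptions.

    import torch

    @torch.no_grad()
    def predict(model, image_group):
        """image_group: preprocessed multi-modal images of one scene, shape (M, 3, 224, 224)."""
        model.eval()
        scores, _, _ = model(image_group.unsqueeze(0))   # class scores for the fused features
        return int(scores.argmax(dim=1))                 # index of the maximum category score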
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. The multi-modal image fusion classification method based on antagonistic complementary features, characterized in that the training step comprises the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether a first channel fusion can be carried out, and simultaneously carrying out a first similarity calculation on the obtained image key feature vectors; after extracting the low-level features, whether the channel fusion can be performed for the first time is judged, specifically: an influence factor is designed by utilizing a bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to a proportion between the sub-networks of the different modalities;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
clustering and contrastive learning are respectively carried out on the feature maps extracted from the low-level and high-level features, the classification loss is calculated, prediction of the image is finally carried out to obtain the corresponding category scores, and the category corresponding to the maximum category score is used as the prediction result; the features obtained after the original images in each batch_size pass through shallow and high-level feature extraction and channel fusion are subjected to coarse-grained density clustering according to categories, classes of relatively coarse granularity are divided within one category, density clustering is carried out on these classes and they are stored in a typical queue, and the typical queue is continuously updated along with the training process; after the original image passes through the shallow feature extraction module, the high-level feature extraction module and the two channel fusion modules, the obtained feature maps are recorded as $X_{1}$ and $X_{2}$, $X_{1}$ and $X_{2}$ are respectively concatenated (connected in series) with the features in the typical queue, and the classification loss is calculated.
2. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the preprocessing of the image data is: annotating the image data in the acquired data set to remove external information irrelevant to the classification task, annotating the objects and scenes to be classified in the images and extracting the annotated regions, performing the same data enhancement on the image group of one scene or the same object in the multi-modal data set and different data enhancement on different image groups, and converting the scale of the enhanced images into a uniform size.
3. The multi-modal image fusion classification method based on countering complementary features according to claim 2, wherein the data enhancement comprises random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping, and gaussian noise increase.
4. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein, a batch of N images is randomly selected, M modes are selected, and N x M images are input; and simultaneously loading the image groups according to the number of the input batches of the images and the number of the selected modes, inputting the image groups into a neural network for low-level feature extraction, extracting key feature information vectors of the images after convolution, and obtaining the feature map groups of the image groups.
5. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the influence factor is the scale parameter $\gamma$ of the bn layer in the affine transformation $\hat{x} = \gamma\,(x-\mu)/\sqrt{\sigma^{2}+\epsilon}+\beta$, wherein $\hat{x}$ represents the features after the affine transformation; $x$ represents the original features; $\sigma^{2}$ represents the variance and $\mu$ the mean of the original features; $\epsilon$ is a small constant that avoids division by zero; $f_{m}$ and $f_{m'}$ are the training networks of the two modalities; $l$ is the $l$-th feature map in the model; $m$ represents the $m$-th modality and $c$ represents the $c$-th category; $\gamma$ is used for measuring the degree of influence of the channel on the model, and if it is higher than the threshold, inter-modal channel fusion is carried out to obtain the fused features; $\beta$ is used for measuring the bias.
6. The multimodal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the features extracted from the low-level feature extraction are loaded and input into another convolutional neural network for convolutional operation to extract the high-level features, the key feature information vectors of the images are extracted to obtain the feature map group of the image group, whether the second channel fusion can be performed is judged again, and if so, the second channel fusion is performed to obtain the final fused feature map group.
7. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein after the low-level feature extraction and the high-level feature extraction, the feature map groups obtained by extracting the shallow and high-level features of the different modalities of one scene or object are subjected to one modal feature comparison, a classifier is used to calculate the similarity between the modalities, the features after the low-level feature extraction are dissimilar features, and the features after the high-level feature extraction are similar features.
8. The multi-modal image fusion classification system based on antagonistic complementary features, characterized by comprising:
the data acquisition and processing module is used for acquiring and preprocessing multi-modal image data;
the characteristic extraction module is used for selecting a modality to be fused from a plurality of modalities, inputting image data in each modality into the neural network model according to groups, and extracting low-level characteristics to obtain an image key characteristic information vector; loading the features extracted from the low-level feature extraction into another convolutional neural network for convolution operation to extract high-level features, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion among the sub-networks of the different modalities;
the modal similarity calculation module is used for performing, after the low-level and high-level feature extraction, one feature comparison between the feature map groups obtained by extracting the shallow and high-level features of the different modalities of one scene or object, and calculating the similarity;
the calculation and prediction module is used for calculating the mean square error loss and outputting the category with the maximum similarity score as the prediction result of the image; the features obtained after the original images in each batch_size pass through shallow and high-level feature extraction and channel fusion are subjected to coarse-grained density clustering according to categories, classes of relatively coarse granularity are divided within one category, density clustering is carried out on these classes and they are stored in a typical queue, and the typical queue is continuously updated along with the training process; after the original image passes through the shallow feature extraction module, the high-level feature extraction module and the two channel fusion modules, the obtained feature maps are recorded as $X_{1}$ and $X_{2}$, $X_{1}$ and $X_{2}$ are respectively concatenated (connected in series) with the features in the typical queue, and the classification loss is calculated.
CN202210755253.4A 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features Active CN114821206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Publications (2)

Publication Number Publication Date
CN114821206A CN114821206A (en) 2022-07-29
CN114821206B (en) 2022-09-13

Family

ID=82523286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210755253.4A Active CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Country Status (1)

Country Link
CN (1) CN114821206B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504255A (en) * 2016-11-02 2017-03-15 南京大学 A kind of multi-Target Image joint dividing method based on multi-tag multi-instance learning
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN112215262A (en) * 2020-09-21 2021-01-12 清华大学 Image depth clustering method and system based on self-supervision contrast learning
WO2021022752A1 (en) * 2019-08-07 2021-02-11 深圳先进技术研究院 Multimodal three-dimensional medical image fusion method and system, and electronic device
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504255A (en) * 2016-11-02 2017-03-15 南京大学 A kind of multi-Target Image joint dividing method based on multi-tag multi-instance learning
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
WO2021022752A1 (en) * 2019-08-07 2021-02-11 深圳先进技术研究院 Multimodal three-dimensional medical image fusion method and system, and electronic device
CN112215262A (en) * 2020-09-21 2021-01-12 清华大学 Image depth clustering method and system based on self-supervision contrast learning
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection; Zhen Li et al.; arXiv; 2022-04-14; pp. 1-13 *
ClusterSCL: Cluster-Aware Supervised Contrastive Learning on Graphs; Yanling Wang et al.; In Proceedings of the ACM Web; 2022-04-25; pp. 1-10 *
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments; Mathilde Caron et al.; arXiv; 2021-01-08; pp. 1-23 *
Multi-modal breast image classification based on a hierarchical dual attention network (基于层次化双重注意力网络的乳腺多模态图像分类); Yang Xiao et al.; Journal of Shandong University (Engineering Science) 《山东大学学报(工学版)》; 2022-06-01; Vol. 52, No. 3; pp. 34-41 *
Multi-granularity self-learning algorithm for density clustering of non-uniform clusters (非均匀类簇密度聚类的多粒度自学习算法); Zeng Hua et al.; Systems Engineering and Electronics 《系统工程与电子技术》; 2010-08-15 (No. 08); pp. 210-215 *

Also Published As

Publication number Publication date
CN114821206A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN109934293B (en) Image recognition method, device, medium and confusion perception convolutional neural network
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
US10719780B2 (en) Efficient machine learning method
Ibrahim et al. Cluster representation of the structural description of images for effective classification
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN106372624B (en) Face recognition method and system
CN111832608B (en) Iron spectrum image multi-abrasive particle identification method based on single-stage detection model yolov3
CN112434732A (en) Deep learning classification method based on feature screening
Wang et al. Towards realistic predictors
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
CN115578248B (en) Generalized enhanced image classification algorithm based on style guidance
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
Lee et al. Reinforced adaboost learning for object detection with local pattern representations
CN106980878B (en) Method and device for determining geometric style of three-dimensional model
CN116645562A (en) Detection method for fine-grained fake image and model training method thereof
CN114821206B (en) Multi-modal image fusion classification method and system based on confrontation complementary features
CN113887509B (en) Rapid multi-modal video face recognition method based on image set
CN114511745B (en) Three-dimensional point cloud classification and rotation gesture prediction method and system
CN115795355A (en) Classification model training method, device and equipment
CN113177599B (en) Reinforced sample generation method based on GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant