CN114821206B - Multi-modal image fusion classification method and system based on confrontation complementary features - Google Patents

Multi-modal image fusion classification method and system based on confrontation complementary features Download PDF

Info

Publication number
CN114821206B
Authority
CN
China
Prior art keywords
features
image
level
fusion
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210755253.4A
Other languages
Chinese (zh)
Other versions
CN114821206A (en)
Inventor
袭肖明
王可崧
聂秀山
尹义龙
张光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University filed Critical Shandong Jianzhu University
Priority to CN202210755253.4A priority Critical patent/CN114821206B/en
Publication of CN114821206A publication Critical patent/CN114821206A/en
Application granted granted Critical
Publication of CN114821206B publication Critical patent/CN114821206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention provides a multi-modal image fusion classification method and system based on antagonistic complementary features, belonging to the technical field of image classification. The method comprises the steps of selecting the modalities to be fused from a plurality of modalities, extracting low-level features to obtain the key feature information vectors of the images, and judging whether a first channel fusion and a first similarity calculation can be carried out; high-level feature extraction is then carried out, and whether a second channel fusion and a second similarity calculation can be carried out is judged; clustering and contrastive learning are carried out on the feature maps extracted from the low-level and high-level features, so that complementary information is effectively mined and fused, the complementarity among the features is enhanced, and the image fusion and classification precision is improved.

Description

Multi-modal image fusion classification method and system based on confrontation complementary features
Technical Field
The disclosure relates to the technical field of image classification, in particular to a multi-modal image fusion classification method and system based on confrontation complementary features.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Image classification is an important research direction of computer vision and has wide application in numerous tasks such as object recognition, face recognition, video analysis and disease diagnosis. Although existing image classification methods can achieve good performance when large amounts of data are available, they perform poorly on classification tasks with few images. In addition, using only single-modality information has certain limitations; for example, in a task of classifying with multi-view images, a single view cannot completely describe the scene, which results in poor classification performance.
Deep learning has been widely applied to fields such as natural language processing and image processing because of its excellent feature extraction and learning ability. However, some multi-modal classification tasks have little data, and deep learning then tends to fall into overfitting. In addition, existing multi-modal fusion methods based on deep learning ignore the effective mining and fusion of complementary information during fusion, which limits the improvement of the fusion classification precision.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a multi-modal image fusion classification method and system based on antagonistic complementary features. It adopts a dual-branch contrastive learning network structure and introduces a prototype learning module based on coarse-grained density clustering to learn the typical features of each category, representing each class center with multiple points, so that the learned typical features generalize better. A channel fusion module based on adversarial learning is also introduced: by mining the correlation of the features within a channel, the channels that bring a large improvement to the model are selected and fused with the other modalities, which strengthens the complementarity between the features.
According to the implementation of some embodiments, the following technical scheme is adopted in the present disclosure:
the multi-modal image fusion classification method based on the antagonistic complementary features comprises the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and clustering and contrastive learning are respectively carried out on the feature maps extracted from the low-level and high-level features, the classification loss is calculated, prediction of the image is finally carried out to obtain the corresponding category scores, and the category corresponding to the maximum category score is used as the prediction result.
According to other embodiments, the following technical scheme is adopted in the present disclosure:
a multi-modal image fusion classification system based on antagonistic complementary features, comprising:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the characteristic extraction module is used for selecting a mode to be fused from a plurality of modes, inputting image data in each mode into the neural network model according to groups, and extracting low-level characteristics to obtain image key characteristic information vectors; loading the features obtained from the low-level feature extraction into another convolutional neural network for convolutional operation to perform high-level feature extraction, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion among the sub-networks of the different modalities.
The modal similarity calculation module is used for performing, after the low-level and high-level feature extraction, a feature comparison between the feature map groups obtained from the shallow and high-level features of the different modalities of one scene or object, and calculating the similarity.
And the calculation and prediction module is used for calculating the mean square error loss and outputting the category where the maximum similarity score is located as the prediction result of the image.
Compared with the prior art, the beneficial effects of the present disclosure are:
Compared with existing methods that work with small data volumes and use only a single modality, the disclosed method performs well in image data classification. On one hand, using a contrastive learning network structure, it introduces a prototype learning module based on coarse-grained density clustering to learn the typical features of each category and represents each class center with multiple cluster points, which alleviates the distribution imbalance of small samples; on the other hand, it introduces a channel fusion module based on adversarial learning that strengthens the information interaction between the modalities by mining the correlation of the features within the channels.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram of a network learning process for implementing image classification by the multi-modal image fusion algorithm of the present disclosure;
fig. 2 is a schematic diagram of a model framework of an image classification system provided by the present disclosure.
Detailed Description of Embodiments
the present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example 1
The present disclosure provides a multi-modal image fusion classification method based on antagonistic complementary features, which is shown in fig. 1 and comprises the following steps:
step 1: acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
step 2: extracting low-level features to obtain image key feature information vectors, judging whether first channel fusion can be carried out or not, and simultaneously carrying out first similarity calculation on the obtained image key feature vectors;
and step 3: extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
and 4, step 4: and clustering and contrast learning are respectively carried out on the feature maps extracted by the low-level feature and the high-level feature, classification loss is calculated, finally, prediction of the image is carried out to obtain a corresponding category score, and the category corresponding to the maximum value of the similarity category score is used as a prediction result.
A dual-branch contrastive learning network structure is adopted, and a prototype learning module based on coarse-grained density clustering is introduced to learn the typical features of each category, with the class center of each category represented by multiple cluster points, so that the learned typical features generalize better. In order to fully fuse the multi-modal complementary information, a channel fusion module based on adversarial learning is introduced: channels that bring a large improvement to the model are selected and fused with the other modalities by mining the correlation of the features within the channels, so that the complementarity among the features is enhanced.
Specifically, multi-modal image data is first acquired and then preprocessed. Because the images in the data set may contain external information irrelevant to the classification task, the objects or scenes to be classified are annotated and the annotated regions are extracted. The same data enhancement is applied to the image group of one scene or the same object in the multi-modal data set, and different data enhancement is applied to different image groups; the main data enhancement methods include random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping and Gaussian noise addition. The enhanced images are then converted to a uniform size.
Because not all regions of the original images are needed (owing to problems of the acquisition machines, manual handling and the like), the required parts are selected with bounding boxes and extracted. Since the data come from different sources, the image sizes may be inconsistent, and the images must be input to the neural network in a unified format during training; the original data are therefore rescaled with a transform routine in Python and converted to the required size, which in this disclosure is 224 × 224. Because the disclosure targets small-data tasks, the same data enhancement is applied to the grouped image pairs in the original data, and different data enhancement is applied to different image pairs.
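As one possible illustration (not the patent's own code), the preprocessing described above can be sketched with torchvision transforms; the specific transform parameters, the Gaussian-noise level, and the per-group seeding trick used to keep the augmentation identical inside one multi-modal group are assumptions.

    import random
    import torch
    from torchvision import transforms

    # Augmentations named in the description: random cropping, horizontal/vertical
    # flipping, random rotation, Gaussian noise, then resizing to a uniform 224 x 224.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomRotation(15),            # rotation range is an assumed value
        transforms.ToTensor(),
    ])

    def add_gaussian_noise(x, std=0.01):          # noise level is an assumed value
        return x + torch.randn_like(x) * std

    def preprocess_group(pil_images, group_seed):
        """Apply the same random augmentation to every modality of one scene/object
        by fixing the RNG seed per group; different groups get different seeds."""
        out = []
        for img in pil_images:
            random.seed(group_seed)
            torch.manual_seed(group_seed)
            out.append(add_gaussian_noise(augment(img)))
        return torch.stack(out)                   # shape: (num_modalities, 3, 224, 224)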
After preprocessing the data, features are extracted from the images. A modality to be fused is selected from the plurality of modalities, and the image data in each modality are input into the neural network model in groups to extract low-level features and obtain the image key feature information vectors. Specifically, N images are randomly selected per batch and M modalities are selected, so N × M images are input; for example, if a batch contains 16 images and 2 modalities are selected, 32 images are input to the model at a time.
After image data are processed, image groups are simultaneously loaded and input into a user-defined neural network for low-level feature extraction according to the number of image input batches and selected modes, and key feature information vectors of the images are extracted after convolution to obtain feature map groups of the image groups.
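A minimal sketch of the grouped input and the per-modality low-level extraction stage; the patent does not specify the user-defined network, so the small convolutional stem, the channel width and the tensor layout below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ShallowExtractor(nn.Module):
        """Illustrative low-level feature stem; the actual user-defined network is not specified."""
        def __init__(self, in_ch=3, width=64):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_ch, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),            # the bn layer later reused for the influence factor
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        def forward(self, x):
            return self.stem(x)

    N, M = 16, 2                                   # 16 scenes per batch, 2 modalities -> 32 images
    batch = torch.randn(N, M, 3, 224, 224)         # one image group per scene/object
    extractors = nn.ModuleList([ShallowExtractor() for _ in range(M)])  # one sub-network per modality
    low_feats = [extractors[m](batch[:, m]) for m in range(M)]          # per-modality feature map groups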
After the low-level feature extraction, it is determined whether the first channel fusion can be performed. Specifically, an influence factor is designed using the bn (batch normalization) layer, which normalizes the features over the batch dimension and then scales and shifts them; the method introduces the two trainable parameters $\gamma$ (scale) and $\beta$ (shift) and trains them. A judgment threshold is set, and $\gamma$ is used as the influence factor that measures the importance of a channel to the model, while $\beta$ measures the bias.

If $\gamma$ is below the threshold 0.3, only the normalization and the subsequent affine transformation are performed:

$\hat{x}_{m}^{l} = \gamma_{m}^{l}\,\dfrac{x_{m}^{l}-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta_{m}^{l}$

If $\gamma$ is above the threshold 0.3, inter-modal channel fusion is performed, i.e. the channel is fused with the corresponding channel of the sub-network of the other modality (the fusion formula itself is given only as an image in the original publication).

Here $\hat{x}_{m}^{l}$ denotes the feature after the affine transformation; $x_{m}^{l}$ denotes the original feature; $\sigma^{2}$ denotes the variance and $\mu$ the mean of the original features; $\epsilon$ is a small constant that avoids division by zero; $\gamma$ measures the degree of influence of the channel on the model and $\beta$ measures the bias; $f_{m}$ and $f_{m'}$ are the training networks of the two modalities; $l$ indexes the $l$-th feature map (layer) in the model; $m$ represents the $m$-th modality and $c$ the $c$-th category.
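The channel-selection rule can be illustrated as follows, assuming (as the description suggests) that the bn scale $\gamma$ serves as the influence factor and that "fusion" means blending in the corresponding channel of the other modality; the blending proportion alpha and the threshold value are assumptions, since the patent gives the fusion formula only as an image.

    import torch
    import torch.nn as nn

    def channel_fuse(feat_a, feat_b, bn_a: nn.BatchNorm2d, threshold=0.3, alpha=0.5):
        """Adversarial-style channel fusion sketch.

        feat_a, feat_b: (B, C, H, W) feature maps of the same layer from two modality sub-networks.
        bn_a:           bn layer of modality A whose scale gamma acts as the influence factor.
        Channels whose |gamma| exceeds the threshold are blended with the other modality;
        the remaining channels keep the plain normalized-and-affine-transformed value.
        """
        normed = bn_a(feat_a)                          # gamma * (x - mu) / sqrt(var + eps) + beta
        gamma = bn_a.weight.detach().abs()             # per-channel influence factor
        important = gamma > threshold                  # channels worth fusing across modalities
        fused = normed.clone()
        # alpha is an assumed fusion proportion between the two modality sub-networks
        fused[:, important] = alpha * normed[:, important] + (1 - alpha) * feat_b[:, important]
        return fused

    # usage sketch
    bn = nn.BatchNorm2d(64)
    a, b = torch.randn(8, 64, 56, 56), torch.randn(8, 64, 56, 56)
    out = channel_fuse(a, b, bn)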
After the low-level feature extraction and the first channel fusion are finished, the result is input into another neural network for high-level feature extraction: the features processed by the low-level extraction are loaded into another convolutional neural network for further convolution operations, and the image key feature information vectors are extracted to obtain the feature map group of the image group. Whether a second channel fusion can be performed is then judged again; if so, this fusion is realized in the same way as the first channel fusion, with a preset judgment threshold, and if the calculated influence factor is higher than the set threshold, the second channel fusion is performed to obtain the finally fused feature map group.
In step 2, after the low-level feature extraction and the first channel fusion, a first similarity calculation is performed on the feature map groups obtained by low-level feature fusion of the different modalities of one scene or object. The features after shallow feature extraction should be separated into different categories as far as possible, that is, the features learned by the model at this stage are required to be dissimilar.
Similarly, in step 3, after the high-level feature extraction, the second channel fusion is performed, the modal features extracted from the same scene or object are compared once with the features in the typical queue, the similarity between modalities is calculated with a classifier, and a second similarity calculation is performed on the feature map groups obtained by high-level feature extraction of the different modalities of one scene or object. The features after high-level feature extraction should be grouped into the same category as far as possible, that is, the features learned by the model at this stage are required to be similar.
The fused feature maps obtained above now need to be classified to obtain the classification prediction result, in combination with contrastive learning. Specifically,
after the original images are subjected to low-level and high-level feature extraction and to the first and second channel fusion, the fused feature maps are obtained and recorded as $X_{1}, X_{2}, \dots, X_{n}$. Each of $X_{1}, X_{2}, \dots, X_{n}$ is compared with the features in the typical queue to calculate the classification loss, where $X_{i}$ denotes the bottom-layer features of the $i$-th sample and $X_{j}$ the bottom-layer features of the $j$-th sample.
The above-mentioned typical queue is obtained as follows: after low-level and high-level feature extraction and the two channel fusions, coarse-grained density clustering is performed on the features of the same category within a batch_size. Each feature is first regarded as an independent class and the distance between every pair of features is calculated; whenever the minimum distance is below a threshold, the two closest features are merged into one class and the distances between the new class and all other classes are recalculated, until the minimum pairwise distance is above the threshold. Density clustering is then performed on the resulting classes and the features are stored in the typical queue; new typical features are continuously generated during training, and the typical queue is updated accordingly.
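The merging rule described above amounts to threshold-stopped agglomerative clustering; the sketch below is one way to realize it, where the Euclidean distance metric, the threshold value, and representing a merged class by the mean of its members are assumptions.

    import torch

    def coarse_grained_clusters(features: torch.Tensor, threshold: float = 1.0):
        """Merge features of one category into coarse clusters.

        features: (K, D) feature vectors of the same category within a batch_size.
        Each feature starts as its own class; the two closest classes are merged while
        their distance is below the threshold, and the new class is represented by the
        mean of its members (an assumption about how the distances are recalculated).
        Returns the cluster centres, which are pushed into the typical queue.
        """
        clusters = [f.clone() for f in features]       # every feature is an independent class
        while len(clusters) > 1:
            centres = torch.stack(clusters)
            dists = torch.cdist(centres, centres)      # pairwise Euclidean distances
            dists.fill_diagonal_(float("inf"))
            if dists.min() >= threshold:               # stop once all classes are far apart
                break
            i, j = divmod(int(dists.argmin()), len(clusters))
            merged = (clusters[i] + clusters[j]) / 2   # representative of the new class
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return torch.stack(clusters)

    # usage sketch: update the typical queue for one category
    typical_queue = {}
    feats = torch.randn(16, 128)                       # fused features of one category in a batch
    typical_queue["class_0"] = coarse_grained_clusters(feats)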
The loss functions used to measure the modal similarity after low-level feature extraction and after high-level feature extraction are defined as $L_{low}$ and $L_{high}$, respectively. In the calculation of $L_{low}$ (whose closed-form expression is given only as an image in the original publication), $\hat{y}_{ij}$ is the class of the $j$-th image of modality $i$ predicted by the N-classifier, and $N \times M$ is the number of input images; the N-classifier can distinguish the input images into $N$ categories, $i$ denotes the $i$-th modality and $j$ the $j$-th image.
The mean square error loss is then calculated as the Euclidean distance between the predicted data and the real data; the closer the predicted value is to the true value, the smaller the mean square error. The category corresponding to the maximum score is the predicted category. The mean square error loss is calculated between the currently output prediction and the prediction output by historical weighting:

$L_{mse} = \dfrac{1}{N \times M}\sum(\hat{y}-y)^{2}$

where $\hat{y}$ is the predicted image class, $y$ is the true image class, and $N \times M$ is the number of input images.
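The "historical weighting" of predictions can be read as an exponential moving average of past outputs; the sketch below is one possible interpretation, with the EMA form and the momentum value being assumptions rather than the patent's stated formula.

    import torch
    import torch.nn.functional as F

    def historical_mse(current_pred, history, momentum=0.9):
        """MSE between the current prediction and an exponentially weighted
        historical prediction; momentum and the EMA form are assumptions."""
        history = momentum * history + (1 - momentum) * current_pred.detach()
        return F.mse_loss(current_pred, history), history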
In the small-sample network, the mean square error loss and the two modal similarity losses $L_{low}$ and $L_{high}$ are combined into the total loss function, which can be written as a weighted sum:

$L = L_{mse} + \lambda_{1} L_{low} + \lambda_{2} L_{high}$

where $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters used to balance the importance of the two modal similarity losses. Back-propagation training with this loss is repeated until the set number of training rounds is reached, the result with the smallest loss function value (or the best validation-set performance) is saved, and the trained network model is used for prediction to obtain the corresponding category scores; the category corresponding to the maximum category score is the prediction result.
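A hedged sketch of how the total loss could be assembled and back-propagated; the model's output format, the data loader, the one-hot MSE target and the lambda values are placeholders rather than the patent's actual implementation.

    import torch

    def total_loss(mse_loss, loss_low, loss_high, lambda1=0.5, lambda2=0.5):
        # lambda1/lambda2 balance the two modal similarity losses (values are assumptions)
        return mse_loss + lambda1 * loss_low + lambda2 * loss_high

    def train(model, loader, optimizer, epochs=100):
        """Illustrative training loop; the model is assumed to return class scores
        and the two modal similarity losses for each multi-modal batch."""
        best = float("inf")
        for epoch in range(epochs):
            for images, labels in loader:
                scores, loss_low, loss_high = model(images)        # assumed model outputs
                target = torch.nn.functional.one_hot(labels, scores.size(1)).float()
                mse = torch.nn.functional.mse_loss(scores, target)
                loss = total_loss(mse, loss_low, loss_high)
                optimizer.zero_grad()
                loss.backward()                                    # back-propagation training
                optimizer.step()
                if loss.item() < best:                             # keep the weights with minimum loss
                    best = loss.item()
                    torch.save(model.state_dict(), "best_model.pt")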
Example 2
The present disclosure provides a multi-modal image fusion classification system based on confrontation complementary features, comprising:
the data acquisition and processing module is used for acquiring multi-modal image data and preprocessing the image data;
the characteristic extraction module is used for selecting a modality to be fused from a plurality of modalities, inputting image data in each modality into the neural network model according to groups, and extracting low-level characteristics to obtain an image key characteristic information vector; loading the features obtained from the low-level feature extraction into another convolutional neural network for convolutional operation to perform high-level feature extraction, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion between the sub-networks of the different modalities, so as to enhance the complementarity between the features.
The modal similarity calculation module is used for comparing, after the low-level and high-level feature extraction, the feature map groups obtained by extracting the shallow and high-level features of the two modalities of one scene or object, and calculating the similarity.
And the calculation and prediction module is used for calculating the mean square error loss and outputting the category where the maximum similarity score is located as the prediction result of the image.
As shown in the model framework of Fig. 2, the modules inside the dashed box are the system modules that perform the classification function; they include the custom neural network, the channel fusion module and the modal similarity calculation module. The model is trained to determine appropriate network parameters, and the prediction stage is then carried out to obtain the required result.
The user inputs the image data to be tested into the classification system; the five stages of feature vector extraction, channel fusion, modal similarity calculation, contrastive learning and prediction category calculation are carried out automatically inside the system, and the predicted category is finally output to the user.
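At inference time the category with the maximum class score is returned; a minimal sketch, where the trained model and its output format are assumptions.

    import torch

    @torch.no_grad()
    def predict(model, image_group):
        """image_group: preprocessed multi-modal images of one scene, shape (M, 3, 224, 224)."""
        model.eval()
        scores, _, _ = model(image_group.unsqueeze(0))   # class scores for the fused features
        return int(scores.argmax(dim=1))                 # index of the maximum category score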
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. The multi-modal image fusion classification method based on antagonistic complementary features, characterized in that the training step comprises the following steps:
acquiring multi-modal image data, preprocessing the image data, selecting a modality to be fused from the multiple modalities, and inputting the image data in each modality into a neural network model in groups;
extracting low-level features to obtain image key feature information vectors, judging whether a first channel fusion can be carried out, and simultaneously carrying out a first similarity calculation on the obtained image key feature vectors; after extracting the low-level features, whether the channel fusion can be performed for the first time is judged, specifically: an influence factor is designed by utilizing a bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to a proportion between the sub-networks of the different modalities;
extracting high-level features from the features extracted from the low-level features, extracting image key feature information vectors again, judging whether second channel fusion can be performed or not, and performing second similarity calculation on the obtained high-level image key feature information vectors;
clustering and contrastive learning are respectively carried out on the feature maps extracted from the low-level and high-level features, the classification loss is calculated, prediction of the image is finally carried out to obtain the corresponding category scores, and the category corresponding to the maximum category score is used as the prediction result; the features obtained after the original images in each batch_size pass through shallow and high-level feature extraction and channel fusion are subjected to coarse-grained density clustering according to categories, classes of relatively coarse granularity are divided within one category, density clustering is carried out on these classes and they are stored in a typical queue, and the typical queue is continuously updated along with the training process; after the original image passes through the shallow feature extraction module, the high-level feature extraction module and the two channel fusion modules, the obtained feature maps are recorded as $X_{1}$ and $X_{2}$, $X_{1}$ and $X_{2}$ are respectively concatenated (connected in series) with the features in the typical queue, and the classification loss is calculated.
2. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the preprocessing of the image data is: annotating the image data in the acquired data set to remove external information irrelevant to the classification task, annotating the objects and scenes to be classified in the images and extracting the annotated regions, performing the same data enhancement on the image group of one scene or the same object in the multi-modal data set and different data enhancement on different image groups, and converting the scale of the enhanced images into a uniform size.
3. The multi-modal image fusion classification method based on countering complementary features according to claim 2, wherein the data enhancement comprises random cropping, horizontal flipping, vertical flipping, random rotation, random multi-cropping, and gaussian noise increase.
4. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein, a batch of N images is randomly selected, M modes are selected, and N x M images are input; and simultaneously loading the image groups according to the number of the input batches of the images and the number of the selected modes, inputting the image groups into a neural network for low-level feature extraction, extracting key feature information vectors of the images after convolution, and obtaining the feature map groups of the image groups.
5. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the influence factor is the scale parameter $\gamma$ of the bn layer in the affine transformation $\hat{x} = \gamma\,(x-\mu)/\sqrt{\sigma^{2}+\epsilon}+\beta$, wherein $\hat{x}$ represents the features after the affine transformation; $x$ represents the original features; $\sigma^{2}$ represents the variance and $\mu$ the mean of the original features; $\epsilon$ is a small constant that avoids division by zero; $f_{m}$ and $f_{m'}$ are the training networks of the two modalities; $l$ is the $l$-th feature map in the model; $m$ represents the $m$-th modality and $c$ represents the $c$-th category; $\gamma$ is used for measuring the degree of influence of the channel on the model, and if it is higher than the threshold, inter-modal channel fusion is carried out to obtain the fused features; $\beta$ is used for measuring the bias.
6. The multimodal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein the features extracted from the low-level feature extraction are loaded and input into another convolutional neural network for convolutional operation to extract the high-level features, the key feature information vectors of the images are extracted to obtain the feature map group of the image group, whether the second channel fusion can be performed is judged again, and if so, the second channel fusion is performed to obtain the final fused feature map group.
7. The multi-modal image fusion classification method based on antagonistic complementary features as claimed in claim 1, wherein after the low-level feature extraction and the high-level feature extraction, the feature map groups obtained by extracting the shallow and high-level features of the different modalities of one scene or object are subjected to one modal feature comparison, a classifier is used to calculate the similarity between the modalities, the features after the low-level feature extraction are dissimilar features, and the features after the high-level feature extraction are similar features.
8. The multi-modal image fusion classification system based on antagonistic complementary features, characterized by comprising:
the data acquisition and processing module is used for acquiring and preprocessing multi-modal image data;
the characteristic extraction module is used for selecting a modality to be fused from a plurality of modalities, inputting image data in each modality into the neural network model according to groups, and extracting low-level characteristics to obtain an image key characteristic information vector; loading the features extracted from the low-level feature extraction into another convolutional neural network for convolution operation to extract high-level features, and extracting key feature information vectors of the images to obtain a feature map group of the image group;
the channel fusion module is used for judging whether the first channel fusion and the second channel fusion can be carried out; if so, an influence factor is designed by utilizing the bn layer to calculate the influence of each channel on the final prediction, a judgment threshold is set, and when the influence factor is higher than the set threshold, the channel is fused with the other modalities according to the proportion among the sub-networks of the different modalities;
the modal similarity calculation module is used for performing, after the low-level and high-level feature extraction, one feature comparison between the feature map groups obtained by extracting the shallow and high-level features of the different modalities of one scene or object, and calculating the similarity;
the calculation and prediction module is used for calculating the mean square error loss and outputting the category with the maximum similarity score as the prediction result of the image; the features obtained after the original images in each batch_size pass through shallow and high-level feature extraction and channel fusion are subjected to coarse-grained density clustering according to categories, classes of relatively coarse granularity are divided within one category, density clustering is carried out on these classes and they are stored in a typical queue, and the typical queue is continuously updated along with the training process; after the original image passes through the shallow feature extraction module, the high-level feature extraction module and the two channel fusion modules, the obtained feature maps are recorded as $X_{1}$ and $X_{2}$, $X_{1}$ and $X_{2}$ are respectively concatenated (connected in series) with the features in the typical queue, and the classification loss is calculated.
CN202210755253.4A 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features Active CN114821206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210755253.4A CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Publications (2)

Publication Number Publication Date
CN114821206A CN114821206A (en) 2022-07-29
CN114821206B (en) 2022-09-13

Family

ID=82523286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210755253.4A Active CN114821206B (en) 2022-06-30 2022-06-30 Multi-modal image fusion classification method and system based on confrontation complementary features

Country Status (1)

Country Link
CN (1) CN114821206B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504255A (en) * 2016-11-02 2017-03-15 南京大学 A kind of multi-Target Image joint dividing method based on multi-tag multi-instance learning
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN112215262A (en) * 2020-09-21 2021-01-12 清华大学 Image depth clustering method and system based on self-supervision contrast learning
WO2021022752A1 (en) * 2019-08-07 2021-02-11 深圳先进技术研究院 Multimodal three-dimensional medical image fusion method and system, and electronic device
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106504255A (en) * 2016-11-02 2017-03-15 南京大学 A kind of multi-Target Image joint dividing method based on multi-tag multi-instance learning
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
WO2021022752A1 (en) * 2019-08-07 2021-02-11 深圳先进技术研究院 Multimodal three-dimensional medical image fusion method and system, and electronic device
CN112215262A (en) * 2020-09-21 2021-01-12 清华大学 Image depth clustering method and system based on self-supervision contrast learning
CN112836734A (en) * 2021-01-27 2021-05-25 深圳市华汉伟业科技有限公司 Heterogeneous data fusion method and device and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection; Zhen Li et al.; arXiv; 2022-04-14; pp. 1-13 *
ClusterSCL: Cluster-Aware Supervised Contrastive Learning on Graphs; Yanling Wang et al.; In Proceedings of the ACM Web; 2022-04-25; pp. 1-10 *
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments; Mathilde Caron et al.; arXiv; 2021-01-08; pp. 1-23 *
Multi-modal breast image classification based on a hierarchical dual attention network (基于层次化双重注意力网络的乳腺多模态图像分类); Yang Xiao et al.; Journal of Shandong University (Engineering Science) 《山东大学学报(工学版)》; 2022-06-01; Vol. 52, No. 3; pp. 34-41 *
Multi-granularity self-learning algorithm for density clustering of non-uniform clusters (非均匀类簇密度聚类的多粒度自学习算法); Zeng Hua et al.; Systems Engineering and Electronics 《系统工程与电子技术》; 2010-08-15 (No. 08); pp. 210-215 *

Also Published As

Publication number Publication date
CN114821206A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN109934293B (en) Image recognition method, device, medium and confusion perception convolutional neural network
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
US10719780B2 (en) Efficient machine learning method
Ibrahim et al. Cluster representation of the structural description of images for effective classification
CN110929848B (en) Training and tracking method based on multi-challenge perception learning model
CN106372624B (en) Face recognition method and system
CN111832608B (en) Iron spectrum image multi-abrasive particle identification method based on single-stage detection model yolov3
CN112434732A (en) Deep learning classification method based on feature screening
Wang et al. Towards realistic predictors
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
CN115578248B (en) Generalized enhanced image classification algorithm based on style guidance
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
CN105809113A (en) Three-dimensional human face identification method and data processing apparatus using the same
Lee et al. Reinforced adaboost learning for object detection with local pattern representations
CN106980878B (en) Method and device for determining geometric style of three-dimensional model
CN116645562A (en) Detection method for fine-grained fake image and model training method thereof
CN114821206B (en) Multi-modal image fusion classification method and system based on confrontation complementary features
CN113887509B (en) Rapid multi-modal video face recognition method based on image set
CN114511745B (en) Three-dimensional point cloud classification and rotation gesture prediction method and system
CN115795355A (en) Classification model training method, device and equipment
CN113177599B (en) Reinforced sample generation method based on GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant