Disclosure of Invention
In order to overcome the defects of the prior art, the method carries out adaptive feature refinement based on an improved CBAM attention mechanism: a channel attention model and a spatial attention model are combined transversely, effective information is aggregated along the spatial and channel dimensions, and local semantic features are reinforced. On this basis, a semantic approximation model based on the inter-modality semantic distance is constructed, introducing an explicit measure for judging semantic consistency between the modalities, which reduces the distribution distance between feature pairs with the same semantics and expands the distribution distance between feature pairs with different semantics. Finally, the model classification performance target and the model semantic approximation target are linearly fused under multi-modal information, so that the model can better search a common feature subspace and the diagnostic efficiency of the multi-modal fusion model is improved.
In order to achieve the purpose, the solution adopted by the invention is as follows:
a multi-modal fusion classification optimization method considering inter-modal semantic distance measurement comprises the following steps:
step 1: dividing data into a training set and a testing set, preprocessing the training set to obtain preprocessed data, and extracting data features of the preprocessed data by using deep neural networks, wherein the data features comprise image features F_1 and text features F_2;
step 2: transversely merging the channel attention model and the spatial attention model in the CBAM attention mechanism to obtain an improved CBAM attention mechanism, and inputting the data features obtained in step 1 into the improved CBAM attention mechanism to obtain a locally locked feature space, which comprises the locally locked image features F'_1 and the locally locked text features F'_2;
step 3: constructing a semantic approximation model based on the inter-modality semantic distance, which specifically comprises the following steps:
step 31: constructing triplets according to the training set obtained in step 1, wherein each triplet comprises a positive reference sample group, an anchor sample group, and a negative reference sample group;
step 32: inputting the triplets established in step 31 into the locally locked feature space obtained in step 2 to obtain locked image-text pair features;
step 33: according to the locked image-text pair features obtained in step 32, increasing the semantic space distance between locked image-text pair features with different semantics and reducing the semantic space distance between locked image-text pair features with the same semantics, thereby establishing the semantic approximation model based on the inter-modality semantic distance;
step 34: constraining the semantic approximation model based on the inter-modality semantic distance established in step 33 to obtain an objective function;
step 4: designing an overall model fusion algorithm under multi-modal information in the common feature subspace according to the data features obtained in step 1, the locally locked feature space obtained in step 2, and the semantic approximation model based on the inter-modality semantic distance obtained in step 3, so as to obtain a fusion loss function, wherein the fusion loss function comprises the asymmetric fusion loss function Loss_n and the symmetric fusion loss function Loss_y, and carrying out training iterations of the model by using the fusion loss function.
Preferably, the semantic approximation model based on the inter-modality semantic distance established in step 3 is specifically:

L_p = (1/N) Σ_{i=1}^{N} max( d(p'_a^i, p'_p^i) − d(p'_a^i, p'_n^i) + α, 0 )

L_t = (1/N) Σ_{i=1}^{N} max( d(t'_a^i, t'_p^i) − d(t'_a^i, t'_n^i) + α, 0 )

in the formulae, d(·,·) is the Euclidean distance measure; (p'_a, t'_a) are the locked image-text pair features obtained by inputting the anchor sample group into the locally locked feature space; (p'_p, t'_p) are the locked image-text pair features obtained by inputting the positive reference sample group into the locally locked feature space; (p'_n, t'_n) are the locked image-text pair features obtained by inputting the negative reference sample group into the locally locked feature space; α is a specific threshold (margin); N is the batch size; τ is the sample feature space; L_p is the picture ternary (triplet) loss; L_t is the text ternary loss.
Preferably, the improved CBAM attention mechanism is established in step 2 to obtain the locally locked image features F'_1, specifically comprising the following steps:
step 21: constructing the channel attention model: inputting the image features F_1 obtained in step 1, and aggregating the spatial information of the feature map over the feature space by max pooling and average pooling respectively, to obtain the image channel descriptors V_max1 and V_avg1 of the image features F_1, each expressed as a one-dimensional vector:

V_max1 = MaxPool(F_1)
V_avg1 = AvgPool(F_1)

in the formulae, V_max1 is the image channel descriptor obtained by max pooling and V_avg1 is the image channel descriptor obtained by average pooling;

two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood of the image features F_1; the image channel descriptors V_max1 and V_avg1 are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max1 and VMLP_avg1 of the image features F_1; the feature vectors VMLP_max1 and VMLP_avg1 are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F_1) of the image features F_1:

CA(F_1) = relu( MLP(V_max1) + MLP(V_avg1) )

in the formula, MLP(·) is the shared-weight feature layer function;
step 22: constructing the spatial attention model: global mean pooling and max pooling are performed along the channel axis of the image features F_1 obtained in step 1 to obtain the image spatial context descriptors T_avg1 and T_max1 of the image features F_1:

T_max1 = MaxPool(F_1)
T_avg1 = AvgPool(F_1)

in the formulae, T_max1 is the image spatial context descriptor obtained by max pooling and T_avg1 is the image spatial context descriptor obtained by global mean pooling;

the image spatial context descriptors T_max1 and T_avg1 are concatenated along the channel axis of the image features F_1 to obtain an effective spatial feature descriptor of the image features F_1; the information of the regions that need to be emphasized or suppressed in space is encoded and mapped with a dilated (hole) convolution, and the convolved features are passed through a relu activation function to obtain the spatial-dimension image attention vector SA(F_1) of the image features F_1:

SA(F_1) = relu( f_dc( [T_max1 ; T_avg1] ) )

in the formula, [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function;

step 23: transversely combining the channel attention model of step 21 and the spatial attention model of step 22 to obtain the mixed attention vector HYB(F_1) of the image features F_1:

HYB(F_1) = CA(F_1) ⊗ SA(F_1)

the mixed attention vector HYB(F_1) is injected into the image features F_1 to realize local semantic feature reinforcement over space and channels, obtaining the locally locked image features F'_1:

F'_1 = HYB(F_1) ⊗ F_1
Preferably, step 2 establishes the improved CBAM attention mechanism to obtain the locally locked text features F'_2, specifically comprising the following steps:
step 21': constructing the channel attention model: inputting the text features F_2 obtained in step 1, and aggregating the spatial information of the feature map over the feature space by max pooling and average pooling respectively, to obtain the text channel descriptors V_max2 and V_avg2 of the text features F_2, each expressed as a one-dimensional vector:

V_max2 = MaxPool(F_2)
V_avg2 = AvgPool(F_2)

in the formulae, V_max2 is the text channel descriptor obtained by max pooling and V_avg2 is the text channel descriptor obtained by average pooling;

two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood of the text features F_2; the text channel descriptors V_max2 and V_avg2 are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max2 and VMLP_avg2 of the text features F_2; the feature vectors VMLP_max2 and VMLP_avg2 are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension text attention vector CA(F_2) of the text features F_2:

CA(F_2) = relu( MLP(V_max2) + MLP(V_avg2) )

in the formula, MLP(·) is the shared-weight feature layer function;
step 22': constructing the spatial attention model: global mean pooling and max pooling are performed along the channel axis of the text features F_2 obtained in step 1 to obtain the text spatial context descriptors T_avg2 and T_max2 of the text features F_2:

T_max2 = MaxPool(F_2)
T_avg2 = AvgPool(F_2)

in the formulae, T_max2 is the text spatial context descriptor obtained by max pooling and T_avg2 is the text spatial context descriptor obtained by global mean pooling;

the text spatial context descriptors T_max2 and T_avg2 are concatenated along the channel axis of the text features F_2 to generate an effective spatial feature descriptor of the text features F_2; then the information of the regions that need to be emphasized or suppressed in space is encoded and mapped with a dilated (hole) convolution, and the convolved features are passed through a relu activation function to obtain the spatial-dimension text attention vector SA(F_2) of the text features F_2:

SA(F_2) = relu( f_dc( [T_max2 ; T_avg2] ) )

in the formula, [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function;

step 23': transversely combining the channel attention model of step 21' and the spatial attention model of step 22' to obtain the mixed attention vector HYB(F_2) of the text features F_2:

HYB(F_2) = CA(F_2) ⊗ SA(F_2)

the mixed attention vector HYB(F_2) is injected into the text features F_2 to realize local semantic feature reinforcement over space and channels, obtaining the locally locked text features F'_2:

F'_2 = HYB(F_2) ⊗ F_2
Preferably, the anchor sample group obtained in step 31 is an image-text pair example sample Samp_anc(p_a, t_a) randomly extracted from the training set; the positive reference sample group is a sample Samp_pos(p_p, t_p) randomly selected from the training set that has the same semantics as the anchor sample group; the negative reference sample group is a sample Samp_neg(p_n, t_n) randomly extracted from the training set that has different semantics from the anchor sample group.
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4, and the asymmetric fusion loss function Loss_n is obtained, specifically comprising the following steps:
step 41: the text features F_2 obtained in step 1 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with a cross-entropy function to quantify the model classification performance target, obtaining the asymmetric classification loss function L_Catog(F'_1, F_2):

L_Catog = − Σ_i Σ_c y_ic log(p_ic)

in the formula, L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: y_ic is 1 if class c is the true class of sample i, otherwise y_ic is 0;
step 42: considering the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed; specifically, the asymmetric classification loss function L_Catog(F'_1, F_2) obtained in step 41 is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the asymmetric fusion loss function Loss_n:

Loss_n = L_Catog(F'_1, F_2) + β(L_p + L_t)

in the formula, β is a linear proportionality coefficient;
step 43: carrying out training iterations of the model according to the asymmetric fusion loss function Loss_n.
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4, and the symmetric fusion loss function Loss_y is obtained, specifically comprising the following steps:
step 41': the locally locked text features F'_2 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with a cross-entropy function to quantify the model classification performance target, obtaining the symmetric classification loss L_Catog(F'_1, F'_2):

L_Catog = − Σ_i Σ_c y_ic log(p_ic)

in the formula, L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: y_ic is 1 if class c is the true class of sample i, otherwise y_ic is 0;
step 42': considering the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed; specifically, the symmetric classification loss function L_Catog(F'_1, F'_2) obtained in step 41' is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:

Loss_y = L_Catog(F'_1, F'_2) + β(L_p + L_t)

in the formula, β is a linear proportionality coefficient;
step 43': carrying out training iterations of the model according to the symmetric fusion loss function Loss_y.
Preferably, the preprocessing of the training set in step 1 specifically includes:
step 11: separating image-text pairs in the training set data sample to obtain image data and text data;
step 12: encoding the text data obtained in step 11 with Word2Vec to obtain the preprocessed text data.
Preferably, the deep neural network in step 1 includes an image depth feature extraction network and a text depth feature extraction network.
Preferably, the objective function in step 34 is min(L_p, L_t).
Compared with the prior art, the invention has the beneficial effects that:
adaptive feature refinement is carried out by establishing an improved CBAM attention mechanism, reinforcing local semantic features; a semantic approximation model based on the inter-modality semantic distance is constructed, and an explicit measure for judging semantic consistency between the modalities is introduced, so that the robustness and accuracy of the multi-modal fusion model are better improved.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Taking the Pascal Sentence dataset as an example, a 20-class object recognition task is carried out over airplanes, bicycles, birds, boats, bottles, buses, cars, cats, chairs, cows, dining tables, dogs, horses, motorcycles, people, potted plants, sheep, sofas, trains, and television monitors. Each class has 200 image-text pairs; the image data is a real photograph of the object, the text data is a textual description of the object, and the image and text share a label.
As shown in fig. 1, image-text data is first input and divided into a training set and a test set; image-text features are obtained through deep neural networks and input into the improved CBAM attention model for adaptive feature refinement, and the refined image-text features are fused for classification performance evaluation. Meanwhile, triplet samples are randomly extracted to construct the semantic approximation model based on the inter-modality semantic distance, increasing the semantic space distance between feature pairs with different semantics and reducing it between feature pairs with the same semantics. Finally, the multi-modal common feature subspace is constrained by considering both the model classification performance target and the semantic consistency target. The specific steps are as follows:
step 1: dividing the dataset into a training set and a test set, separating the image-text pairs in the training set samples, and obtaining image-text features through deep neural networks; an image depth feature extraction network and a text depth feature extraction network are constructed to extract the image features F_1 and the text features F_2 respectively, specifically as follows:
First, a 14-layer RGB image depth feature extraction network is constructed, whose structure is, in order: two convolutional layers, one pooling layer, two more convolutional layers, one pooling layer, three more convolutional layers, and one pooling layer.
The parameters of the RGB image feature extraction network are set as follows: the convolution kernel size of the convolutional layers is 3×3, the convolution stride is [1, 1], and the numbers of convolution kernels are 64, 128, 256, and 512 in turn. The stride of the pooling layers is set to [2, 2], using max pooling.
Secondly, a 6-layer text feature extraction network is built, whose structure is, in order: an embedding layer, a convolutional layer, a BN layer, a global pooling layer, a dropout layer, and a fully connected layer.
The parameters of each layer of the text feature extraction network are set as follows: before the text is fed into the network, it is encoded by Word2Vec into a 150-dimensional matrix. The text then passes through the embedding layer; the convolutional layer uses 1-D convolution with kernel size 3, stride 1, and 256 convolution kernels, and the dropout probability is 0.5. The number of neurons in the final fully connected layer is set to 256.
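The text branch just described can be sketched numerically. The following is a minimal NumPy sketch, assuming a 150-token sentence with a hypothetical 64-dimensional embedding width (the embedding size is not stated in the text) and random stand-in weights; it applies a 1-D convolution with 256 kernels of size 3 and stride 1, global max pooling, and a 256-neuron fully connected layer, omitting BN and dropout for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Word2Vec output: 150 tokens, each with a 64-dim embedding
# (the embedding width is an assumption, not stated in the text).
sentence = rng.standard_normal((150, 64))

def conv1d(x, kernels):
    """Valid 1-D convolution, kernel size 3, stride 1, relu activation."""
    k = kernels.shape[1]                       # kernel size (3)
    n_out = kernels.shape[0]                   # number of kernels (256)
    steps = x.shape[0] - k + 1
    out = np.empty((steps, n_out))
    for i in range(steps):
        window = x[i:i + k].reshape(-1)        # flatten the (3, 64) window
        out[i] = kernels.reshape(n_out, -1) @ window
    return np.maximum(out, 0.0)                # relu

kernels = rng.standard_normal((256, 3, 64)) * 0.05
feat_map = conv1d(sentence, kernels)           # (148, 256)
pooled = feat_map.max(axis=0)                  # global max pooling -> (256,)
W_fc = rng.standard_normal((256, 256)) * 0.05
F2 = np.maximum(pooled @ W_fc, 0.0)            # fully connected layer -> (256,)
```

The resulting 256-dimensional vector F2 plays the role of the text features fed into the later fusion steps.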
Step 2: adaptive feature refinement based on the improved CBAM attention mechanism, locally locking the image and text feature spaces that carry certain semantics, specifically as follows:
a. for an image feature space, firstly, a Channel Attention model (CA) is constructed, and a specific modeling method is as follows:
(1) The extracted image features F_1 are input, and the spatial information of the feature map is aggregated over the feature space by max pooling and average pooling, generating two different image channel descriptors V_max1 and V_avg1 expressed as one-dimensional vectors, representing the max-pooled and mean-pooled features respectively, namely:
V_max1 = MaxPool(F_1)    (1)
V_avg1 = AvgPool(F_1)    (2)
(2) Two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max1 and VMLP_avg1 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F_1):

CA(F_1) = relu( MLP(V_max1) + MLP(V_avg1) )    (3)

where MLP(·) is the shared-weight feature layer function;
The specific operations in the embodiment are as follows: an image of size (224, 224, 3) is taken as input, and after 14 layers of feature extraction the depth image features F_1 of size (7, 7, 256) are obtained. Channel compression is then carried out, compressing the size to (1, 1, 256): the (7, 7, 256) features are taken, global pooling and max pooling are performed respectively, and the two pooled feature layers are fed into the shared feature layer network. The shared feature layer performs two convolutions with 512 and 256 kernels respectively, stride 1, and relu activation. After the shared feature layer, the output max layer and avg layer are Add-fused, i.e. their values are added, yielding the channel attention vector CA(F_1) with dimension (7, 7, 256).
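The channel-attention computation above can be sketched as follows. This is a minimal NumPy illustration, assuming a (7, 7, 256) feature map and a 256→512→256 shared-weight layer; since the 1×1 convolutions act on a (1, 1, 256) descriptor, they reduce to matrix multiplications, and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical depth feature map of size (7, 7, 256), as in the embodiment.
F1 = np.abs(rng.standard_normal((7, 7, 256)))

# Shared-weight feature layer: two convolutions (256 -> 512 -> 256) on a
# (1, 1, 256) descriptor reduce to two matrix multiplications with relu.
W1 = rng.standard_normal((256, 512)) * 0.05
W2 = rng.standard_normal((512, 256)) * 0.05

def shared_mlp(v):
    return np.maximum(v @ W1, 0.0) @ W2

v_max = F1.max(axis=(0, 1))    # max pooling over the 7x7 grid -> (256,)
v_avg = F1.mean(axis=(0, 1))   # average pooling -> (256,)

# Add-fuse the two branches, then relu: CA(F1), eq. (3).
ca = np.maximum(shared_mlp(v_max) + shared_mlp(v_avg), 0.0)   # (256,)
```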
Secondly, a Spatial Attention model (SA) is constructed; the specific modeling is as follows:
(1) Global mean pooling and max pooling are first performed along the channel axis of the image features F_1, generating two different image spatial context descriptors T_avg1 and T_max1, namely:

T_max1 = MaxPool(F_1)    (4)
T_avg1 = AvgPool(F_1)    (5)
(2) The two image spatial descriptors are then concatenated along the channel axis to generate an effective spatial feature descriptor. The information of the regions that need to be emphasized or suppressed in space is then encoded and mapped with a dilated (hole) convolution, which aggregates spatial context information more efficiently, and the convolved features are passed through a relu activation function to obtain the spatial-dimension image attention vector SA(F_1):

SA(F_1) = relu( f_dc( [T_max1 ; T_avg1] ) )    (6)

where [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function.
The specific operations in the embodiment are as follows: likewise taking the image features F_1 as input, the channel-wise maximum (size (7, 7, 1)) and channel-wise mean (size (7, 7, 1)) are computed and fused into a concatenate layer, i.e. the two results are spliced along the channel, giving size (7, 7, 2). The new (7, 7, 2) tensor is fed to a convolutional layer with kernel size 1×1 and one convolution kernel, changing it to (7, 7, 1), and the output (7, 7, 1) is then dot-multiplied with the input. This yields the spatial attention vector SA(F_1) with dimension (7, 7, 256).
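The spatial-attention computation can be sketched likewise. In this minimal NumPy sketch a plain 1×1 convolution stands in for the dilated convolution named in the text, and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
F1 = rng.standard_normal((7, 7, 256))    # hypothetical depth feature map

t_max = F1.max(axis=2, keepdims=True)    # (7, 7, 1) max over the channel axis
t_avg = F1.mean(axis=2, keepdims=True)   # (7, 7, 1) mean over the channel axis
stacked = np.concatenate([t_max, t_avg], axis=2)   # concatenate -> (7, 7, 2)

# A 1x1 convolution with one kernel collapses the 2 channels to 1; it stands
# in for the dilated convolution f_dc of eq. (6). relu keeps it non-negative.
w = rng.standard_normal(2) * 0.5
sa = np.maximum(stacked @ w, 0.0)[..., None]       # relu -> (7, 7, 1)
```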
Finally, the channel attention model and the spatial attention model are combined. The invention abandons the longitudinal attention stacking used in the common CBAM module and improves it to lateral attention: the channel attention and the spatial attention take the same input, and the channel-dimension attention vector of formula (3) and the spatial-dimension attention vector of formula (6) are multiplied element-wise to obtain the mixed attention vector HYB(F_1):

HYB(F_1) = CA(F_1) ⊗ SA(F_1)

The mixed attention vector HYB(F_1) is injected into the original feature map to realize local semantic feature reinforcement over space and channels, obtaining the locked image features F'_1, namely:

F'_1 = HYB(F_1) ⊗ F_1
The specific operations in the embodiment are as follows: the channel-dimension attention vector CA(F_1) and the spatial-dimension attention vector SA(F_1) obtained above are dot-multiplied to obtain the mixed attention vector, which is then dot-multiplied with the original depth image features F_1 to obtain the new features F'_1 with dimension (7, 7, 256). The structure of the fusion classification part is, in order: a concatenate layer, a first fully connected layer, and a second fully connected layer; the splicing dimension of the concatenate layer is set to 256, the number of output neurons of the first fully connected layer is set to 256, and the number of output neurons of the second fully connected layer is set to 128. The third layer is the softmax classification layer, with 20 neurons.
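Putting the two attention branches and the fusion head together, a minimal NumPy sketch follows. The attention vectors and all weights are random stand-ins, and pooling the locked image features to 256 dimensions before concatenating with the 256-dimensional text features is an assumption, since the text only states the layer sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
F1 = rng.standard_normal((7, 7, 256))
ca = np.abs(rng.standard_normal(256))         # stand-in channel attention (256,)
sa = np.abs(rng.standard_normal((7, 7, 1)))   # stand-in spatial attention

hyb = ca * sa                # broadcasting gives the (7, 7, 256) mixed vector
F1_locked = hyb * F1         # inject into the original feature map -> F'_1

# Fusion/classification head: concatenate with a 256-dim text feature,
# then FC 256 -> FC 128 -> softmax over the 20 classes.
F2 = rng.standard_normal(256)
joint = np.concatenate([F1_locked.mean(axis=(0, 1)), F2])   # (512,) assumed pooling
W1 = rng.standard_normal((512, 256)) * 0.05
W2 = rng.standard_normal((256, 128)) * 0.05
W3 = rng.standard_normal((128, 20)) * 0.05
h = np.maximum(joint @ W1, 0.0)
h = np.maximum(h @ W2, 0.0)
logits = h @ W3
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax over the 20 object classes
```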
b. For a text feature space, firstly, a Channel Attention model (CA) is constructed, and a specific modeling method is as follows:
The extracted text features F_2 are input, and the spatial information of the feature map is aggregated over the feature space by max pooling and average pooling, generating two different text channel descriptors V_max2 and V_avg2 expressed as one-dimensional vectors, representing the max-pooled and mean-pooled features respectively, namely:
V_max2 = MaxPool(F_2)    (1)'
V_avg2 = AvgPool(F_2)    (2)'
Two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max2 and VMLP_avg2 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension text attention vector CA(F_2):

CA(F_2) = relu( MLP(V_max2) + MLP(V_avg2) )    (3)'

where MLP(·) is the shared-weight feature layer function;
Secondly, a Spatial Attention model (SA) is constructed; the specific modeling is as follows:
(1) Global mean pooling and max pooling are first performed along the channel axis of the text features F_2, generating two different text spatial context descriptors T_avg2 and T_max2, namely:

T_max2 = MaxPool(F_2)    (4)'
T_avg2 = AvgPool(F_2)    (5)'
(2) The two text spatial descriptors are then concatenated along the channel axis to generate an effective spatial feature descriptor. The information of the regions that need to be emphasized or suppressed in space is then encoded and mapped with a dilated (hole) convolution, which aggregates spatial context information more efficiently, and the convolved features are passed through a relu activation function to obtain the spatial-dimension text attention vector SA(F_2):

SA(F_2) = relu( f_dc( [T_max2 ; T_avg2] ) )    (6)'

where [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function.
Finally, the channel attention model and the space attention model are combined, the longitudinal attention superposition mode in the common CBAM module is abandoned, the mode is improved to be transverse attention, namely the channel attention is consistent with the input of the space attention, corresponding elements are multiplied for the obtained attention vector formula (3) 'based on the channel dimension and the attention vector formula (6)' based on the space dimension, and a mixed attention vector HYB (F) is obtained2):
Mixing attention vectors HYB (F)2) Injecting an original feature map to realize local semantic feature reinforcement on space and channels to obtain a locking text feature F'2Namely:
Step 3: establishing the semantic approximation model based on the inter-modality semantic distance, increasing the semantic space distance between feature pairs with different semantics while reducing the semantic space distance between feature pairs with the same semantics, specifically as follows:
First, triplets of three elements are constructed from the training set samples: an image-text pair example sample is drawn at random, denoted Samp_anc(p_a, t_a); then a different sample Samp_pos(p_p, t_p) with the same class label as Samp_anc(p_a, t_a) is randomly selected, and at the same time a sample of a different class is selected as Samp_neg(p_n, t_n). The triplet may be represented as (Samp_anc, Samp_neg, Samp_pos).
Secondly, putting the three groups of data into the locked local feature space obtained in the step 2, and recording the obtained locked image-text pair as the following features:
wherein f is
pic、f
textAn image feature extraction function and a text feature extraction function,
inputting a locking image-text pair characteristic obtained by locally locking a characteristic space for an anchor sample group;
inputting a locked image-text pair characteristic obtained by locally locking a characteristic space for a forward reference sample group;
inputting a locking image-text pair characteristic obtained by locally locking a characteristic space for a negative contrast sample group; the optimization goal of the model is to make the same-class semantics approach in the feature space, the inter-class semantics are far away in the feature space, that is, the distance between Samp _ anc and Samp _ pos feature expression is reduced as much as possible, the distance between Samp _ anc and Samp _ neg feature expression is increased, and a specific threshold value alpha exists to measure the minimum interval between the two distances, and the model is built as follows:
wherein N is the batch size;
in order to measure Euclidean distance, tau is a sample feature space; l is
pIs the picture ternary loss; l is
tIs a text ternary loss.
Finally, the model is constrained with an objective function of: min (L)p,Lt) When training iteration is carried out, a feature space with the same semantic feature pair close and different semantic feature pairs far tends to be learned.
Step 4: designing the overall model fusion algorithm under multi-modal information in the common feature subspace, specifically as follows:
a. fusing in an asymmetric mode:
step 41: the locked image features F'_1 obtained in step 2 and the text features F_2 obtained in step 1 are spliced and fused to obtain a combined modal representation, which is input into the multilayer perceptron and finally into the softmax classifier. For the model classification performance target, the error between the predicted labels and the real labels is measured with a cross-entropy function, quantifying the target and giving the asymmetric classification loss function L_Catog(F'_1, F_2), where the cross-entropy function L_Catog is:

L_Catog = − Σ_i Σ_c y_ic log(p_ic)    (14)

where p_ic is the predicted probability that sample i belongs to class c, and y_ic is an indicator variable that is 1 if class c is the true class of sample i and 0 otherwise.
Step 42: combining the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed: the asymmetric classification loss L_Catog(F'_1, F_2) is linearly fused with the picture ternary loss L_p and the text ternary loss L_t to obtain the asymmetric fusion loss function Loss_n:

Loss_n = L_Catog(F'_1, F_2) + β(L_p + L_t)    (15)

where β is a linear proportionality coefficient. The training iterations of the model are performed according to this loss.
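The fusion loss can be sketched as follows; the softmax probabilities, labels, ternary-loss values, and the β value are illustrative stand-ins:

```python
import numpy as np

def cross_entropy(probs, labels):
    """L_Catog: mean of -log p_ic over the batch, with one-hot labels y_ic."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fused_loss(probs, labels, L_p, L_t, beta=0.1):
    """Loss_n = L_Catog + beta * (L_p + L_t); beta is illustrative."""
    return cross_entropy(probs, labels) + beta * (L_p + L_t)

# Stand-in classifier outputs for a batch of two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = fused_loss(probs, labels, L_p=0.3, L_t=0.5)
```

The symmetric variant Loss_y is identical in form, only with the classification loss computed on (F'_1, F'_2) instead of (F'_1, F_2).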
b. Fusing in a symmetric mode:
step 41': the locally locked text features F'_2 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with the cross-entropy function of formula (14), quantifying the model classification performance target and giving the symmetric classification loss L_Catog(F'_1, F'_2).
step 42': combining the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed: the symmetric classification loss L_Catog(F'_1, F'_2) obtained in step 41' is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:

Loss_y = L_Catog(F'_1, F'_2) + β(L_p + L_t)    (15)'

where β is a linear proportionality coefficient.
In the embodiment, the training set data are input into the text-image multi-modal fusion neural network, each layer of parameters of the multi-modal fusion neural network is updated by the gradient descent method, and the updated parameter values are assigned to the corresponding layer parameters to obtain the trained multi-modal fusion neural network. Updating the parameters of each layer of the multi-modal fusion neural network by the gradient descent method comprises the following steps: (a) setting the learning rate of the multi-modal fusion neural network to 0.001; (b) computing the gradient values from the output values of the multi-modal fusion neural network and the label values of the human body action categories in the text-image samples; (c) updating the parameters of each layer of the multi-modal fusion neural network by the following formula:
θ′ ← θ − η∇θ
wherein θ′ represents the parameter value of the multi-modal fusion neural network after updating, ← represents the assignment operation, θ represents the parameter value of the multi-modal fusion neural network before updating, η represents the learning rate, and ∇θ represents the gradient value of the multi-modal fusion neural network. Finally, the 20 object categories are identified.
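The parameter update of step (c) can be sketched as follows. This is a generic stochastic gradient descent step applied to a toy quadratic objective, with the learning rate fixed at 0.001 as in the embodiment; the objective and variable names are hypothetical, not the patent's actual network.

```python
import numpy as np

def sgd_update(theta, grad, lr=1e-3):
    # theta' <- theta - eta * grad_theta, with learning rate eta = 0.001
    return theta - lr * grad

# Toy objective f(theta) = ||theta - target||^2 / 2, whose gradient
# is (theta - target); repeated updates drive theta toward target.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(5000):
    theta = sgd_update(theta, theta - target, lr=1e-3)
```

In the embodiment the same update is applied layer by layer, with the gradient of Lossy with respect to each layer's parameters in place of the toy gradient here.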
As shown in FIGS. 2-5, FIG. 2 shows the single-modal model that classifies using only the text feature F2 extracted in step 1: the classification accuracy on the test set climbs rapidly in the first few iterations, but the training effect declines after the seventh epoch, and the test-set classification accuracy finally stabilizes at 91.33%;
FIG. 3 shows the single-modal model that classifies pictures using only the image feature F1 extracted in step 1: the training curve shows that the model quickly reaches a stable state over the whole training iteration and fluctuates within an acceptably small range, and the test-set classification accuracy finally reaches 84.57%;
FIG. 4 shows the general multi-modal model in which the image feature F1 and the text feature F2 extracted in step 1 are spliced, fused, and then classified: the training curve shows that the convergence trend of the model is smooth over the whole training process, the accuracy hardly changes after the sixth epoch, the performance cannot be further improved, and the test-set classification accuracy is 91.82%;
FIG. 5 shows the model of the invention, in which the locally locked image feature F′1 obtained in step 2 and the text feature F2 obtained in step 1 are spliced and fused, and classification is performed while taking the semantic consistency target into account: the training curve shows that the model converges after the sixteenth epoch and reaches a stable state, achieving better performance and the optimal classification effect, with a test-set classification accuracy of 95.14%.
Compared with recognition using only a single text modality or image modality, the improvement from simply splicing and fusing the image feature and the text feature before classification is very limited: it is only 0.49% higher than the better of the two single-modal models (in this embodiment, the text single-modal model). The model of the invention aggregates effective feature information based on the improved CBAM attention mechanism and at the same time introduces a quantitative judgment of inter-modal semantic consistency, so that the model can better search a common feature subspace, perform inter-modal complementation, and improve the efficiency of multi-modal fusion model diagnosis. From the results of this embodiment, the accuracy of the model is 3.32% higher than that of the splicing-fusion model, demonstrating the effectiveness of the fusion method provided by the invention.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention shall fall within the protection scope defined by the claims of the present invention.