Disclosure of Invention
In order to overcome the defects of the prior art, the method carries out adaptive feature refinement based on an improved CBAM attention mechanism: a channel attention model and a spatial attention model are combined transversely, effective information is aggregated along the spatial and channel dimensions, and local semantic features are reinforced. On this basis, a semantic approximation model based on the inter-modality semantic distance is constructed, introducing an explicit measure for judging semantic consistency between the modalities, which reduces the distribution distance between feature pairs with the same semantics and expands the distribution distance between feature pairs with different semantics. Finally, the model classification performance target and the model semantic approximation target are linearly fused under multi-modal information, so that the model can better search a common feature subspace and the diagnostic efficiency of the multi-modal fusion model is improved.
In order to achieve the purpose, the solution adopted by the invention is as follows:
a multi-modal fusion classification optimization method considering inter-modal semantic distance measurement comprises the following steps:
step 1: dividing data into a training set and a testing set, preprocessing the training set to obtain preprocessed data, and extracting data features of the preprocessed data by using deep neural networks, wherein the data features comprise image features F_1 and text features F_2;
step 2: transversely merging the channel attention model and the spatial attention model in the CBAM attention mechanism to obtain an improved CBAM attention mechanism, and inputting the data features obtained in step 1 into the improved CBAM attention mechanism to obtain a locally locked feature space, which comprises the locally locked image features F'_1 and the locally locked text features F'_2;
step 3: constructing a semantic approximation model based on the inter-modality semantic distance, which specifically comprises the following steps:
step 31: constructing triplets according to the training set obtained in step 1, wherein each triplet comprises a positive reference sample group, an anchor sample group, and a negative reference sample group;
step 32: inputting the triplets established in step 31 into the locally locked feature space obtained in step 2 to obtain locked image-text pair features;
step 33: according to the locked image-text pair features obtained in step 32, increasing the semantic space distance between locked image-text pair features with different semantics and reducing the semantic space distance between locked image-text pair features with the same semantics, thereby establishing the semantic approximation model based on the inter-modality semantic distance;
step 34: constraining the semantic approximation model based on the inter-modality semantic distance established in step 33 to obtain an objective function;
step 4: designing an overall model fusion algorithm under multi-modal information in the common feature subspace according to the data features obtained in step 1, the locally locked feature space obtained in step 2, and the semantic approximation model based on the inter-modality semantic distance obtained in step 3, so as to obtain a fusion loss function, wherein the fusion loss function comprises the asymmetric fusion loss function Loss_n and the symmetric fusion loss function Loss_y, and carrying out training iterations of the model by using the fusion loss function.
Preferably, the semantic approximation model based on the inter-modality semantic distance established in step 3 is specifically:

L_p = (1/N) Σ_{i=1}^{N} max( d(p'_a^i, p'_p^i) − d(p'_a^i, p'_n^i) + α, 0 )

L_t = (1/N) Σ_{i=1}^{N} max( d(t'_a^i, t'_p^i) − d(t'_a^i, t'_n^i) + α, 0 )

in the formulae, d(·,·) is the Euclidean distance measure; (p'_a, t'_a) are the locked image-text pair features obtained by inputting the anchor sample group into the locally locked feature space; (p'_p, t'_p) are the locked image-text pair features obtained by inputting the positive reference sample group into the locally locked feature space; (p'_n, t'_n) are the locked image-text pair features obtained by inputting the negative reference sample group into the locally locked feature space; α is a specific threshold (margin); N is the batch size; τ is the sample feature space; L_p is the picture ternary (triplet) loss; L_t is the text ternary loss.
Preferably, the improved CBAM attention mechanism is established in step 2 to obtain the locally locked image features F'_1, specifically comprising the following steps:
step 21: constructing the channel attention model: inputting the image features F_1 obtained in step 1, and aggregating the spatial information of the feature map over the feature space by max pooling and average pooling respectively, to obtain the image channel descriptors V_max1 and V_avg1 of the image features F_1, each expressed as a one-dimensional vector:

V_max1 = MaxPool(F_1)
V_avg1 = AvgPool(F_1)

in the formulae, V_max1 is the image channel descriptor obtained by max pooling and V_avg1 is the image channel descriptor obtained by average pooling;

two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood of the image features F_1; the image channel descriptors V_max1 and V_avg1 are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max1 and VMLP_avg1 of the image features F_1; the feature vectors VMLP_max1 and VMLP_avg1 are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F_1) of the image features F_1:

CA(F_1) = relu( MLP(V_max1) + MLP(V_avg1) )

in the formula, MLP(·) is the shared-weight feature layer function;
step 22: constructing the spatial attention model: global mean pooling and max pooling are performed along the channel axis of the image features F_1 obtained in step 1 to obtain the image spatial context descriptors T_avg1 and T_max1 of the image features F_1:

T_max1 = MaxPool(F_1)
T_avg1 = AvgPool(F_1)

in the formulae, T_max1 is the image spatial context descriptor obtained by max pooling and T_avg1 is the image spatial context descriptor obtained by global mean pooling;

the image spatial context descriptors T_max1 and T_avg1 are concatenated along the channel axis of the image features F_1 to obtain an effective spatial feature descriptor of the image features F_1; the information of the regions that need to be emphasized or suppressed in space is encoded and mapped with a dilated (hole) convolution, and the convolved features are passed through a relu activation function to obtain the spatial-dimension image attention vector SA(F_1) of the image features F_1:

SA(F_1) = relu( f_dc( [T_max1 ; T_avg1] ) )

in the formula, [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function;

step 23: transversely combining the channel attention model of step 21 and the spatial attention model of step 22 to obtain the mixed attention vector HYB(F_1) of the image features F_1:

HYB(F_1) = CA(F_1) ⊗ SA(F_1)

the mixed attention vector HYB(F_1) is injected into the image features F_1 to realize local semantic feature reinforcement over space and channels, obtaining the locally locked image features F'_1:

F'_1 = HYB(F_1) ⊗ F_1
Preferably, step 2 establishes the improved CBAM attention mechanism to obtain the locally locked text features F'_2, specifically comprising the following steps:
step 21': constructing the channel attention model: inputting the text features F_2 obtained in step 1, and aggregating the spatial information of the feature map over the feature space by max pooling and average pooling respectively, to obtain the text channel descriptors V_max2 and V_avg2 of the text features F_2, each expressed as a one-dimensional vector:

V_max2 = MaxPool(F_2)
V_avg2 = AvgPool(F_2)

in the formulae, V_max2 is the text channel descriptor obtained by max pooling and V_avg2 is the text channel descriptor obtained by average pooling;

two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood of the text features F_2; the text channel descriptors V_max2 and V_avg2 are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max2 and VMLP_avg2 of the text features F_2; the feature vectors VMLP_max2 and VMLP_avg2 are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension text attention vector CA(F_2) of the text features F_2:

CA(F_2) = relu( MLP(V_max2) + MLP(V_avg2) )

in the formula, MLP(·) is the shared-weight feature layer function;
step 22': constructing the spatial attention model: global mean pooling and max pooling are performed along the channel axis of the text features F_2 obtained in step 1 to obtain the text spatial context descriptors T_avg2 and T_max2 of the text features F_2:

T_max2 = MaxPool(F_2)
T_avg2 = AvgPool(F_2)

in the formulae, T_max2 is the text spatial context descriptor obtained by max pooling and T_avg2 is the text spatial context descriptor obtained by global mean pooling;

the text spatial context descriptors T_max2 and T_avg2 are concatenated along the channel axis of the text features F_2 to generate an effective spatial feature descriptor of the text features F_2; then the information of the regions that need to be emphasized or suppressed in space is encoded and mapped with a dilated (hole) convolution, and the convolved features are passed through a relu activation function to obtain the spatial-dimension text attention vector SA(F_2) of the text features F_2:

SA(F_2) = relu( f_dc( [T_max2 ; T_avg2] ) )

in the formula, [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function;

step 23': transversely combining the channel attention model of step 21' and the spatial attention model of step 22' to obtain the mixed attention vector HYB(F_2) of the text features F_2:

HYB(F_2) = CA(F_2) ⊗ SA(F_2)

the mixed attention vector HYB(F_2) is injected into the text features F_2 to realize local semantic feature reinforcement over space and channels, obtaining the locally locked text features F'_2:

F'_2 = HYB(F_2) ⊗ F_2
Preferably, the anchor sample group obtained in step 31 is an image-text pair example sample Samp_anc(p_a, t_a) randomly extracted from the training set; the positive reference sample group is a sample Samp_pos(p_p, t_p) randomly selected from the training set that has the same semantics as the anchor sample group; the negative reference sample group is a sample Samp_neg(p_n, t_n) randomly extracted from the training set that has different semantics from the anchor sample group.
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4, and the asymmetric fusion loss function Loss_n is obtained, specifically comprising the following steps:
step 41: the text features F_2 obtained in step 1 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with a cross-entropy function to quantify the model classification performance target, obtaining the asymmetric classification loss function L_Catog(F'_1, F_2):

L_Catog = − Σ_i Σ_c y_ic log(p_ic)

in the formula, L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: y_ic is 1 if class c is the true class of sample i, otherwise y_ic is 0;
step 42: considering the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed; specifically, the asymmetric classification loss function L_Catog(F'_1, F_2) obtained in step 41 is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the asymmetric fusion loss function Loss_n:

Loss_n = L_Catog(F'_1, F_2) + β(L_p + L_t)

in the formula, β is a linear proportionality coefficient;
step 43: carrying out training iterations of the model according to the asymmetric fusion loss function Loss_n.
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4, and the symmetric fusion loss function Loss_y is obtained, specifically comprising the following steps:
step 41': the locally locked text features F'_2 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with a cross-entropy function to quantify the model classification performance target, obtaining the symmetric classification loss L_Catog(F'_1, F'_2):

L_Catog = − Σ_i Σ_c y_ic log(p_ic)

in the formula, L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: y_ic is 1 if class c is the true class of sample i, otherwise y_ic is 0;
step 42': considering the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed; specifically, the symmetric classification loss function L_Catog(F'_1, F'_2) obtained in step 41' is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:

Loss_y = L_Catog(F'_1, F'_2) + β(L_p + L_t)

in the formula, β is a linear proportionality coefficient;
step 43': carrying out training iterations of the model according to the symmetric fusion loss function Loss_y.
Preferably, the preprocessing of the training set in step 1 specifically includes:
step 11: separating image-text pairs in the training set data sample to obtain image data and text data;
step 12: encoding the text data obtained in step 11 with Word2Vec to obtain the preprocessed text data.
Preferably, the deep neural network in step 1 includes an image depth feature extraction network and a text depth feature extraction network.
Preferably, the objective function in step 34 is min(L_p, L_t).
Compared with the prior art, the invention has the beneficial effects that:
adaptive feature refinement is carried out by establishing an improved CBAM attention mechanism, reinforcing local semantic features; a semantic approximation model based on the inter-modality semantic distance is constructed, and an explicit measure for judging semantic consistency between the modalities is introduced, so that the robustness and accuracy of the multi-modal fusion model are better improved.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Taking the Pascal Sentence dataset as an example, a 20-class object recognition task is carried out over airplanes, bicycles, birds, boats, bottles, buses, cars, cats, chairs, cows, dining tables, dogs, horses, motorcycles, people, potted plants, sheep, sofas, trains, and television monitors. Each class has 200 image-text pairs; the image data is a real photograph of the object, the text data is a textual description of the object, and the image and text share a label.
As shown in fig. 1, image-text data is first input and divided into a training set and a test set; image-text features are obtained through deep neural networks and input into the improved CBAM attention model for adaptive feature refinement, and the refined image-text features are fused for classification performance evaluation. Meanwhile, triplet samples are randomly extracted to construct the semantic approximation model based on the inter-modality semantic distance, increasing the semantic space distance between feature pairs with different semantics and reducing it between feature pairs with the same semantics. Finally, the multi-modal common feature subspace is constrained by considering both the model classification performance target and the semantic consistency target. The specific steps are as follows:
step 1: dividing the dataset into a training set and a test set, separating the image-text pairs in the training set samples, and obtaining image-text features through deep neural networks; an image depth feature extraction network and a text depth feature extraction network are constructed to extract the image features F_1 and the text features F_2 respectively, specifically as follows:
First, a 14-layer RGB image depth feature extraction network is constructed, whose structure is, in order: two convolutional layers, one pooling layer, two more convolutional layers, one pooling layer, three more convolutional layers, and one pooling layer.
The parameters of the RGB image feature extraction network are set as follows: the convolution kernel size of the convolutional layers is 3×3, the convolution stride is [1, 1], and the numbers of convolution kernels are 64, 128, 256, and 512 in turn. The stride of the pooling layers is set to [2, 2], using max pooling.
Secondly, a 6-layer text feature extraction network is built, whose structure is, in order: an embedding layer, a convolutional layer, a BN layer, a global pooling layer, a dropout layer, and a fully connected layer.
The parameters of each layer of the text feature extraction network are set as follows: before the text is fed into the network, it is encoded by Word2Vec into a 150-dimensional matrix. The text then passes through the embedding layer; the convolutional layer uses 1-D convolution with kernel size 3, stride 1, and 256 convolution kernels, and the dropout probability is 0.5. The number of neurons in the final fully connected layer is set to 256.
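The text branch just described can be sketched numerically. The following is a minimal NumPy sketch, assuming a 150-token sentence with a hypothetical 64-dimensional embedding width (the embedding size is not stated in the text) and random stand-in weights; it applies a 1-D convolution with 256 kernels of size 3 and stride 1, global max pooling, and a 256-neuron fully connected layer, omitting BN and dropout for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Word2Vec output: 150 tokens, each with a 64-dim embedding
# (the embedding width is an assumption, not stated in the text).
sentence = rng.standard_normal((150, 64))

def conv1d(x, kernels):
    """Valid 1-D convolution, kernel size 3, stride 1, relu activation."""
    k = kernels.shape[1]                       # kernel size (3)
    n_out = kernels.shape[0]                   # number of kernels (256)
    steps = x.shape[0] - k + 1
    out = np.empty((steps, n_out))
    for i in range(steps):
        window = x[i:i + k].reshape(-1)        # flatten the (3, 64) window
        out[i] = kernels.reshape(n_out, -1) @ window
    return np.maximum(out, 0.0)                # relu

kernels = rng.standard_normal((256, 3, 64)) * 0.05
feat_map = conv1d(sentence, kernels)           # (148, 256)
pooled = feat_map.max(axis=0)                  # global max pooling -> (256,)
W_fc = rng.standard_normal((256, 256)) * 0.05
F2 = np.maximum(pooled @ W_fc, 0.0)            # fully connected layer -> (256,)
```

The resulting 256-dimensional vector F2 plays the role of the text features fed into the later fusion steps.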
Step 2: adaptive feature refinement based on the improved CBAM attention mechanism, locally locking the image and text feature spaces that carry certain semantics, specifically as follows:
a. for an image feature space, firstly, a Channel Attention model (CA) is constructed, and a specific modeling method is as follows:
(1) The extracted image features F_1 are input, and the spatial information of the feature map is aggregated over the feature space by max pooling and average pooling, generating two different image channel descriptors V_max1 and V_avg1 expressed as one-dimensional vectors, representing the max-pooled and mean-pooled features respectively, namely:
V_max1 = MaxPool(F_1)    (1)
V_avg1 = AvgPool(F_1)    (2)
(2) Two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max1 and VMLP_avg1 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F_1):

CA(F_1) = relu( MLP(V_max1) + MLP(V_avg1) )    (3)

where MLP(·) is the shared-weight feature layer function;
The specific operations in the embodiment are as follows: an image of size (224, 224, 3) is taken as input, and after 14 layers of feature extraction the depth image features F_1 of size (7, 7, 256) are obtained. Channel compression is then carried out, compressing the size to (1, 1, 256): the (7, 7, 256) features are taken, global pooling and max pooling are performed respectively, and the two pooled feature layers are fed into the shared feature layer network. The shared feature layer performs two convolutions with 512 and 256 kernels respectively, stride 1, and relu activation. After the shared feature layer, the output max layer and avg layer are Add-fused, i.e. their values are added, yielding the channel attention vector CA(F_1) with dimension (7, 7, 256).
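The channel-attention computation above can be sketched as follows. This is a minimal NumPy illustration, assuming a (7, 7, 256) feature map and a 256→512→256 shared-weight layer; since the 1×1 convolutions act on a (1, 1, 256) descriptor, they reduce to matrix multiplications, and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical depth feature map of size (7, 7, 256), as in the embodiment.
F1 = np.abs(rng.standard_normal((7, 7, 256)))

# Shared-weight feature layer: two convolutions (256 -> 512 -> 256) on a
# (1, 1, 256) descriptor reduce to two matrix multiplications with relu.
W1 = rng.standard_normal((256, 512)) * 0.05
W2 = rng.standard_normal((512, 256)) * 0.05

def shared_mlp(v):
    return np.maximum(v @ W1, 0.0) @ W2

v_max = F1.max(axis=(0, 1))    # max pooling over the 7x7 grid -> (256,)
v_avg = F1.mean(axis=(0, 1))   # average pooling -> (256,)

# Add-fuse the two branches, then relu: CA(F1), eq. (3).
ca = np.maximum(shared_mlp(v_max) + shared_mlp(v_avg), 0.0)   # (256,)
```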
Secondly, a Spatial Attention model (SA) is constructed; the specific modeling is as follows:
(1) Global mean pooling and max pooling are first performed along the channel axis of the image features F_1, generating two different image spatial context descriptors T_avg1 and T_max1, namely:

T_max1 = MaxPool(F_1)    (4)
T_avg1 = AvgPool(F_1)    (5)
(2) The two image spatial descriptors are then concatenated along the channel axis to generate an effective spatial feature descriptor. The information of the regions that need to be emphasized or suppressed in space is then encoded and mapped with a dilated (hole) convolution, which aggregates spatial context information more efficiently, and the convolved features are passed through a relu activation function to obtain the spatial-dimension image attention vector SA(F_1):

SA(F_1) = relu( f_dc( [T_max1 ; T_avg1] ) )    (6)

where [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function.
The specific operations in the embodiment are as follows: likewise taking the image features F_1 as input, the channel-wise maximum (size (7, 7, 1)) and channel-wise mean (size (7, 7, 1)) are computed and fused into a concatenate layer, i.e. the two results are spliced along the channel, giving size (7, 7, 2). The new (7, 7, 2) tensor is fed to a convolutional layer with kernel size 1×1 and one convolution kernel, changing it to (7, 7, 1), and the output (7, 7, 1) is then dot-multiplied with the input. This yields the spatial attention vector SA(F_1) with dimension (7, 7, 256).
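The spatial-attention computation can be sketched likewise. In this minimal NumPy sketch a plain 1×1 convolution stands in for the dilated convolution named in the text, and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
F1 = rng.standard_normal((7, 7, 256))    # hypothetical depth feature map

t_max = F1.max(axis=2, keepdims=True)    # (7, 7, 1) max over the channel axis
t_avg = F1.mean(axis=2, keepdims=True)   # (7, 7, 1) mean over the channel axis
stacked = np.concatenate([t_max, t_avg], axis=2)   # concatenate -> (7, 7, 2)

# A 1x1 convolution with one kernel collapses the 2 channels to 1; it stands
# in for the dilated convolution f_dc of eq. (6). relu keeps it non-negative.
w = rng.standard_normal(2) * 0.5
sa = np.maximum(stacked @ w, 0.0)[..., None]       # relu -> (7, 7, 1)
```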
Finally, the channel attention model and the spatial attention model are combined. The invention abandons the longitudinal attention stacking used in the common CBAM module and improves it to lateral attention: the channel attention and the spatial attention take the same input, and the channel-dimension attention vector of formula (3) and the spatial-dimension attention vector of formula (6) are multiplied element-wise to obtain the mixed attention vector HYB(F_1):

HYB(F_1) = CA(F_1) ⊗ SA(F_1)

The mixed attention vector HYB(F_1) is injected into the original feature map to realize local semantic feature reinforcement over space and channels, obtaining the locked image features F'_1, namely:

F'_1 = HYB(F_1) ⊗ F_1
The specific operations in the embodiment are as follows: the channel-dimension attention vector CA(F_1) and the spatial-dimension attention vector SA(F_1) obtained above are dot-multiplied to obtain the mixed attention vector, which is then dot-multiplied with the original depth image features F_1 to obtain the new features F'_1 with dimension (7, 7, 256). The structure of the fusion classification part is, in order: a concatenate layer, a first fully connected layer, and a second fully connected layer; the splicing dimension of the concatenate layer is set to 256, the number of output neurons of the first fully connected layer is set to 256, and the number of output neurons of the second fully connected layer is set to 128. The third layer is the softmax classification layer, with 20 neurons.
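Putting the two attention branches and the fusion head together, a minimal NumPy sketch follows. The attention vectors and all weights are random stand-ins, and pooling the locked image features to 256 dimensions before concatenating with the 256-dimensional text features is an assumption, since the text only states the layer sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
F1 = rng.standard_normal((7, 7, 256))
ca = np.abs(rng.standard_normal(256))         # stand-in channel attention (256,)
sa = np.abs(rng.standard_normal((7, 7, 1)))   # stand-in spatial attention

hyb = ca * sa                # broadcasting gives the (7, 7, 256) mixed vector
F1_locked = hyb * F1         # inject into the original feature map -> F'_1

# Fusion/classification head: concatenate with a 256-dim text feature,
# then FC 256 -> FC 128 -> softmax over the 20 classes.
F2 = rng.standard_normal(256)
joint = np.concatenate([F1_locked.mean(axis=(0, 1)), F2])   # (512,) assumed pooling
W1 = rng.standard_normal((512, 256)) * 0.05
W2 = rng.standard_normal((256, 128)) * 0.05
W3 = rng.standard_normal((128, 20)) * 0.05
h = np.maximum(joint @ W1, 0.0)
h = np.maximum(h @ W2, 0.0)
logits = h @ W3
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax over the 20 object classes
```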
b. For a text feature space, firstly, a Channel Attention model (CA) is constructed, and a specific modeling method is as follows:
The extracted text features F_2 are input, and the spatial information of the feature map is aggregated over the feature space by max pooling and average pooling, generating two different text channel descriptors V_max2 and V_avg2 expressed as one-dimensional vectors, representing the max-pooled and mean-pooled features respectively, namely:
V_max2 = MaxPool(F_2)    (1)'
V_avg2 = AvgPool(F_2)    (2)'
Two convolutional layers are used as a shared-weight feature layer to aggregate the features of a certain area within the channel neighborhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLP_max2 and VMLP_avg2 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension text attention vector CA(F_2):

CA(F_2) = relu( MLP(V_max2) + MLP(V_avg2) )    (3)'

where MLP(·) is the shared-weight feature layer function;
Secondly, a Spatial Attention model (SA) is constructed; the specific modeling is as follows:
(1) Global mean pooling and max pooling are first performed along the channel axis of the text features F_2, generating two different text spatial context descriptors T_avg2 and T_max2, namely:

T_max2 = MaxPool(F_2)    (4)'
T_avg2 = AvgPool(F_2)    (5)'
(2) The two text spatial descriptors are then concatenated along the channel axis to generate an effective spatial feature descriptor. The information of the regions that need to be emphasized or suppressed in space is then encoded and mapped with a dilated (hole) convolution, which aggregates spatial context information more efficiently, and the convolved features are passed through a relu activation function to obtain the spatial-dimension text attention vector SA(F_2):

SA(F_2) = relu( f_dc( [T_max2 ; T_avg2] ) )    (6)'

where [· ; ·] is the concatenation operation and f_dc(·) is the dilated convolution layer function.
Finally, the channel attention model and the space attention model are combined, the longitudinal attention superposition mode in the common CBAM module is abandoned, the mode is improved to be transverse attention, namely the channel attention is consistent with the input of the space attention, corresponding elements are multiplied for the obtained attention vector formula (3) 'based on the channel dimension and the attention vector formula (6)' based on the space dimension, and a mixed attention vector HYB (F) is obtained2):
Mixing attention vectors HYB (F)2) Injecting an original feature map to realize local semantic feature reinforcement on space and channels to obtain a locking text feature F'2Namely:
Step 3: establishing the semantic approximation model based on the inter-modality semantic distance, increasing the semantic space distance between feature pairs with different semantics while reducing the semantic space distance between feature pairs with the same semantics, specifically as follows:
First, triplets of three elements are constructed from the training set samples: an image-text pair example sample is drawn at random, denoted Samp_anc(p_a, t_a); then a different sample Samp_pos(p_p, t_p) with the same class label as Samp_anc(p_a, t_a) is randomly selected, and at the same time a sample of a different class is selected as Samp_neg(p_n, t_n). The triplet may be represented as (Samp_anc, Samp_neg, Samp_pos).
Secondly, putting the three groups of data into the locked local feature space obtained in the step 2, and recording the obtained locked image-text pair as the following features:
wherein f is
pic、f
textAn image feature extraction function and a text feature extraction function,
inputting a locking image-text pair characteristic obtained by locally locking a characteristic space for an anchor sample group;
inputting a locked image-text pair characteristic obtained by locally locking a characteristic space for a forward reference sample group;
inputting a locking image-text pair characteristic obtained by locally locking a characteristic space for a negative contrast sample group; the optimization goal of the model is to make the same-class semantics approach in the feature space, the inter-class semantics are far away in the feature space, that is, the distance between Samp _ anc and Samp _ pos feature expression is reduced as much as possible, the distance between Samp _ anc and Samp _ neg feature expression is increased, and a specific threshold value alpha exists to measure the minimum interval between the two distances, and the model is built as follows:
wherein N is the batch size;
in order to measure Euclidean distance, tau is a sample feature space; l is
pIs the picture ternary loss; l is
tIs a text ternary loss.
Finally, the model is constrained with an objective function of: min (L)p,Lt) When training iteration is carried out, a feature space with the same semantic feature pair close and different semantic feature pairs far tends to be learned.
Step 4: designing the overall model fusion algorithm under multi-modal information in the common feature subspace, specifically as follows:
a. fusing in an asymmetric mode:
step 41: the locked image features F'_1 obtained in step 2 and the text features F_2 obtained in step 1 are spliced and fused to obtain a combined modal representation, which is input into the multilayer perceptron and finally into the softmax classifier. For the model classification performance target, the error between the predicted labels and the real labels is measured with a cross-entropy function, quantifying the target and giving the asymmetric classification loss function L_Catog(F'_1, F_2), where the cross-entropy function L_Catog is:

L_Catog = − Σ_i Σ_c y_ic log(p_ic)    (14)

where p_ic is the predicted probability that sample i belongs to class c, and y_ic is an indicator variable that is 1 if class c is the true class of sample i and 0 otherwise.
Step 42: combining the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed: the asymmetric classification loss L_Catog(F'_1, F_2) is linearly fused with the picture ternary loss L_p and the text ternary loss L_t to obtain the asymmetric fusion loss function Loss_n:

Loss_n = L_Catog(F'_1, F_2) + β(L_p + L_t)    (15)

where β is a linear proportionality coefficient. The training iterations of the model are performed according to this loss.
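The fusion loss can be sketched as follows; the softmax probabilities, labels, ternary-loss values, and the β value are illustrative stand-ins:

```python
import numpy as np

def cross_entropy(probs, labels):
    """L_Catog: mean of -log p_ic over the batch, with one-hot labels y_ic."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fused_loss(probs, labels, L_p, L_t, beta=0.1):
    """Loss_n = L_Catog + beta * (L_p + L_t); beta is illustrative."""
    return cross_entropy(probs, labels) + beta * (L_p + L_t)

# Stand-in classifier outputs for a batch of two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
loss = fused_loss(probs, labels, L_p=0.3, L_t=0.5)
```

The symmetric variant Loss_y is identical in form, only with the classification loss computed on (F'_1, F'_2) instead of (F'_1, F_2).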
b. Fusing in a symmetric mode:
step 41': the locally locked text features F'_2 and the locally locked image features F'_1 obtained in step 2 are spliced and fused to obtain a combined modal representation, which is input into a multilayer perceptron and then into a softmax classifier; the error between the predicted labels and the real labels is measured with the cross-entropy function of formula (14), quantifying the model classification performance target and giving the symmetric classification loss L_Catog(F'_1, F'_2).
step 42': combining the model classification performance target and the model semantic approximation target, the overall model fusion algorithm under multi-modal information is designed: the symmetric classification loss L_Catog(F'_1, F'_2) obtained in step 41' is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:

Loss_y = L_Catog(F'_1, F'_2) + β(L_p + L_t)    (15)'

where β is a linear proportionality coefficient.
In the embodiment, the training set data are input into the text-image multi-modal fusion neural network, each layer of parameters of the multi-modal fusion neural network is updated by the gradient descent method, and the updated parameter values are assigned to the corresponding layer parameters to obtain the trained multi-modal fusion neural network. Updating the parameters of each layer of the multi-modal fusion neural network by the gradient descent method comprises the following steps: (a) setting the learning rate of the multi-modal fusion neural network to 0.001; (b) computing the gradient values from the output values of the multi-modal fusion neural network and the label values of the human body action categories in the text-image samples; (c) updating the parameters of each layer of the multi-modal fusion neural network by the following formula:
θ′ ← θ − η∇θ
wherein θ′ represents the parameter value of the multi-modal fusion neural network after updating, ← represents the assignment operation, θ represents the parameter value of the multi-modal fusion neural network before updating, η represents the learning rate, and ∇θ represents the gradient value of the multi-modal fusion neural network. Finally, the 20 object categories are identified.
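The parameter update of step (c) can be sketched as follows. This is a generic stochastic gradient descent step applied to a toy quadratic objective, with the learning rate fixed at 0.001 as in the embodiment; the objective and variable names are hypothetical, not the patent's actual network.

```python
import numpy as np

def sgd_update(theta, grad, lr=1e-3):
    # theta' <- theta - eta * grad_theta, with learning rate eta = 0.001
    return theta - lr * grad

# Toy objective f(theta) = ||theta - target||^2 / 2, whose gradient
# is (theta - target); repeated updates drive theta toward target.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(5000):
    theta = sgd_update(theta, theta - target, lr=1e-3)
```

In the embodiment the same update is applied layer by layer, with the gradient of Lossy with respect to each layer's parameters in place of the toy gradient here.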
As shown in FIGS. 2-5, FIG. 2 shows the single-modal model that classifies using only the text feature F2 extracted in step 1: the classification accuracy on the test set climbs rapidly in the first few iterations, but the training effect declines after the seventh epoch, and the test-set classification accuracy finally stabilizes at 91.33%;
FIG. 3 shows the single-modal model that classifies pictures using only the image feature F1 extracted in step 1: the training curve shows that the model quickly reaches a stable state over the whole training iteration and fluctuates within an acceptably small range, and the test-set classification accuracy finally reaches 84.57%;
FIG. 4 shows the general multi-modal model in which the image feature F1 and the text feature F2 extracted in step 1 are spliced, fused, and then classified: the training curve shows that the convergence trend of the model is smooth over the whole training process, the accuracy hardly changes after the sixth epoch, the performance cannot be further improved, and the test-set classification accuracy is 91.82%;
FIG. 5 shows the model of the invention, in which the locally locked image feature F′1 obtained in step 2 and the text feature F2 obtained in step 1 are spliced and fused, and classification is performed while taking the semantic consistency target into account: the training curve shows that the model converges after the sixteenth epoch and reaches a stable state, achieving better performance and the optimal classification effect, with a test-set classification accuracy of 95.14%.
Compared with recognition using only a single text modality or image modality, the improvement from simply splicing and fusing the image feature and the text feature before classification is very limited: it is only 0.49% higher than the better of the two single-modal models (in this embodiment, the text single-modal model). The model of the invention aggregates effective feature information based on the improved CBAM attention mechanism and at the same time introduces a quantitative judgment of inter-modal semantic consistency, so that the model can better search a common feature subspace, perform inter-modal complementation, and improve the efficiency of multi-modal fusion model diagnosis. From the results of this embodiment, the accuracy of the model is 3.32% higher than that of the splicing-fusion model, demonstrating the effectiveness of the fusion method provided by the invention.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention shall fall within the protection scope defined by the claims of the present invention.