CN113343974A - Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement

Info

Publication number: CN113343974A (application); CN113343974B (granted publication)
Application number: CN202110770185.4A
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Granted; Active
Inventors: 王剑锋, 马世乾, 余金沄, 王坤, 赵晨阳, 吴文炤, 刘剑, 秦亮, 刘开培
Current assignees: Wuhan University (WHU); State Grid Information and Telecommunication Co Ltd; State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Original assignees: Wuhan University (WHU); State Grid Information and Telecommunication Co Ltd; State Grid Tianjin Electric Power Co Ltd

Classifications

    • G06F18/214 - Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 - Pattern recognition; classification techniques
    • G06F18/253 - Pattern recognition; fusion techniques of extracted features
    • G06N3/045 - Neural networks; combinations of networks
    • G06N3/08 - Neural networks; learning methods


Abstract

The invention provides a multi-modal fusion classification optimization method that accounts for an inter-modal semantic distance metric. To address the unstable fusion effect and limited performance gains that arise when constructing a feature subspace with unified semantics for modal information fusion, the invention performs adaptive feature refinement based on an improved CBAM attention mechanism: a lateral structure strengthens local semantic features and aggregates effective information across the spatial and channel dimensions. On this basis, a semantic approximation model based on the inter-modal semantic distance is constructed, introducing an explicit metric for judging semantic consistency between modalities, reducing the distribution distance between feature pairs with the same semantics and enlarging the distribution distance between feature pairs with different semantics. Finally, linear fusion of the multi-modal information is carried out under the joint consideration of the model classification performance objective and the model semantic approximation objective, so that the model can better search the common feature subspace and the diagnostic efficiency of the multi-modal fusion model is improved.

Description

Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
Technical Field
The application relates to the field of multi-modal information fusion, in particular to a multi-modal fusion classification optimization method considering inter-modal semantic distance measurement.
Background
A modality is the form in which an event occurs or an objective object exists. To better exploit artificial intelligence in perceiving and understanding the world, it is necessary to interpret and reason over the useful information and features contained in multi-modal data. Multi-modal fusion technology aims to achieve heterogeneous complementarity of data from multiple fields and to establish a framework that can process and associate the interactive information among multiple modalities; from early research on audio-visual speech recognition to recent applications spanning semantics and vision, it has gradually developed into a new research direction with great mining potential and research value. However, although deep-learning-based multi-modal fusion can learn deep feature expressions for data of different modalities, differences between modalities and their differing influencing factors leave many problems to be solved and broken through in the prior art. On the one hand, common feature fusion methods such as direct feature splicing, dot product and addition are simple to implement, but the semantic gap between multi-modal data makes the fusion effect unstable and the improvement limited. On the other hand, because features of different modalities carry different meanings and the modal information interferes with one another in a common feature space, it is difficult to establish a feature subspace with a unified semantic representation, and the prior art lacks an explicit metric for selecting effective fusion information and judging semantic consistency between modalities.
Disclosure of Invention
To overcome the defects of the prior art, the method performs adaptive feature refinement based on an improved CBAM attention mechanism, laterally combining a channel attention model and a spatial attention model so as to aggregate effective information across the spatial and channel dimensions and strengthen local semantic features. On this basis, a semantic approximation model based on the inter-modal semantic distance is constructed, introducing an explicit metric for judging semantic consistency between modalities, reducing the distribution distance between feature pairs with the same semantics and enlarging the distance between feature pairs with different semantics. Linear fusion of the multi-modal information is then performed by combining the model classification performance objective with the model semantic approximation objective, so that the model can better search the common feature subspace and the diagnostic efficiency of the multi-modal fusion model is improved.
To achieve this purpose, the solution adopted by the invention is as follows:
A multi-modal fusion classification optimization method considering inter-modal semantic distance measurement comprises the following steps:
Step 1: divide the data into a training set and a test set, preprocess the training set to obtain preprocessed data, and extract data features from the preprocessed data with deep neural networks, the data features comprising image features F1 and text features F2;
Step 2: laterally merge the channel attention model and the spatial attention model of the CBAM attention mechanism to obtain an improved CBAM attention mechanism, and input the data features obtained in step 1 into the improved CBAM attention mechanism to obtain a locally locked feature space comprising the locally locked image features F'1 and the locally locked text features F'2;
Step 3: construct a semantic approximation model based on the inter-modal semantic distance, specifically comprising the following steps:
Step 31: construct triplets from the training set obtained in step 1, each triplet comprising a positive control sample group, an anchor sample group and a negative control sample group;
Step 32: input the triplets established in step 31 into the locally locked feature space obtained in step 2 to obtain locked image-text pair features;
Step 33: according to the locked image-text pair features obtained in step 32, increase the semantic space distance between locked image-text pair features with different semantics, reduce the semantic space distance between locked image-text pair features with the same semantics, and establish the semantic approximation model based on the inter-modal semantic distance;
Step 34: constrain the semantic approximation model based on the inter-modal semantic distance established in step 33 to obtain an objective function;
Step 4: design an overall model fusion algorithm under multi-modal information in the common feature subspace according to the data features obtained in step 1, the locally locked feature space obtained in step 2 and the semantic approximation model based on the inter-modal semantic distance obtained in step 3, to obtain a fusion loss function comprising the asymmetric fusion loss function Loss_n and the symmetric fusion loss function Loss_y, and carry out training iterations of the model with the fusion loss function.
Preferably, the semantic approximation model based on the inter-modal semantic distance established in step 3 is specifically:
(Fp_a, Ft_a) = (f_pic(p_a), f_text(t_a))
(Fp_pos, Ft_pos) = (f_pic(p_p), f_text(t_p))
(Fp_neg, Ft_neg) = (f_pic(p_n), f_text(t_n))
L_p = Σ_(i=1..N) max( d(Fp_a, Fp_pos) - d(Fp_a, Fp_neg) + α, 0 )
L_t = Σ_(i=1..N) max( d(Ft_a, Ft_pos) - d(Ft_a, Ft_neg) + α, 0 )
where d(·,·) is the Euclidean distance metric; f_pic and f_text are the image and text feature extraction functions; (Fp_a, Ft_a) are the locked image-text pair features obtained by inputting the anchor sample group into the locally locked feature space; (Fp_pos, Ft_pos) are the locked image-text pair features obtained by inputting the positive control sample group into the locally locked feature space; (Fp_neg, Ft_neg) are the locked image-text pair features obtained by inputting the negative control sample group into the locally locked feature space; α is a specific threshold; N is the batch size; τ is the sample feature space; L_p is the picture ternary loss; L_t is the text ternary loss.
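As an illustration only, the picture and text ternary losses defined above can be computed as in the following sketch (PyTorch assumed; the margin value of 0.2, the flattening of feature maps into vectors and the batch-summed hinge form are assumptions consistent with the definitions, not a statement of the patented implementation):

```python
import torch
import torch.nn.functional as F

def ternary_losses(Fp_a, Fp_pos, Fp_neg, Ft_a, Ft_pos, Ft_neg, alpha=0.2):
    """Picture ternary loss L_p and text ternary loss L_t over a batch of locked
    image-text pair features; d(.,.) is the Euclidean distance, alpha the margin."""
    flat = lambda x: x.flatten(1)        # treat each locked feature map as a vector
    d = F.pairwise_distance              # Euclidean distance metric
    Lp = torch.clamp(d(flat(Fp_a), flat(Fp_pos)) - d(flat(Fp_a), flat(Fp_neg)) + alpha,
                     min=0).sum()
    Lt = torch.clamp(d(flat(Ft_a), flat(Ft_pos)) - d(flat(Ft_a), flat(Ft_neg)) + alpha,
                     min=0).sum()
    return Lp, Lt
```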
Preferably, step 2 establishes the improved CBAM attention mechanism to obtain the locally locked image features F'1 through the following specific steps:
Step 21: construct the channel attention model; input the image features F1 obtained in step 1 and aggregate the spatial information of the feature map over the feature space with a maximum pooling operation and an average pooling operation respectively, to obtain the image channel descriptors Vmax1 and Vavg1 of the image features F1, each expressed as a one-dimensional vector:
Vmax1 = MaxPool(F1)
Vavg1 = AvgPool(F1)
where Vmax1 is the image channel descriptor obtained by maximum pooling and Vavg1 is the image channel descriptor obtained by average pooling;
two convolution layers are used as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood of the image features F1; the image channel descriptors Vmax1 and Vavg1 are passed through the shared-weight feature layer to obtain the feature vectors VMLPmax1 and VMLPavg1 of the image features F1, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F1) of the image features F1:
CA(F1) = relu( MLP_s(Vmax1) + MLP_s(Vavg1) )
where MLP_s(·) is the shared-weight feature layer function;
Step 22: construct the spatial attention model; perform global mean pooling and maximum pooling along the channel axis of the image features F1 obtained in step 1 to obtain the image spatial context descriptors Tavg1 and Tmax1 of the image features F1:
Tmax1 = MaxPool(F1)
Tavg1 = AvgPool(F1)
where Tmax1 is the image spatial context descriptor obtained by maximum pooling and Tavg1 is the image spatial context descriptor obtained by global mean pooling;
splice the image spatial context descriptors Tmax1 and Tavg1 along the channel axis of the image features F1 to obtain an effective spatial feature descriptor of the image features F1; encode and map the information of the regions in space that need to be emphasised or suppressed with a hole (dilated) convolution to obtain the convolved feature, and pass the convolved feature through a relu activation function to obtain the spatial-dimension image attention vector SA(F1) of the image features F1:
SA(F1) = relu( f_d( [Tmax1 ; Tavg1] ) )
where [· ; ·] is the splicing (concatenation) operation and f_d(·) is the hole convolution layer function;
Step 23: laterally combine the channel attention model and the spatial attention model of steps 21 and 22, multiplying the corresponding elements, to obtain the mixed attention vector HYB(F1) of the image features F1:
HYB(F1) = CA(F1) ⊗ SA(F1)
inject the mixed attention vector HYB(F1) into the image features F1 to realise local semantic feature reinforcement in space and channel, obtaining the locally locked image features F'1:
F'1 = HYB(F1) ⊗ F1
Preferably, step 2 establishes the improved CBAM attention mechanism to obtain the locally locked text features F'2 through the following specific steps:
Step 21': construct the channel attention model; input the text features F2 obtained in step 1 and aggregate the spatial information of the feature map over the feature space with a maximum pooling operation and an average pooling operation respectively, to obtain the text channel descriptors Vmax2 and Vavg2 of the text features F2, each expressed as a one-dimensional vector:
Vmax2 = MaxPool(F2)
Vavg2 = AvgPool(F2)
where Vmax2 is the text channel descriptor obtained by maximum pooling and Vavg2 is the text channel descriptor obtained by average pooling;
two convolution layers are used as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood of the text features F2; the text channel descriptors Vmax2 and Vavg2 are passed through the shared-weight feature layer to obtain the feature vectors VMLPmax2 and VMLPavg2 of the text features F2, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension text attention vector CA(F2) of the text features F2:
CA(F2) = relu( MLP_s(Vmax2) + MLP_s(Vavg2) )
where MLP_s(·) is the shared-weight feature layer function;
Step 22': construct the spatial attention model; perform global mean pooling and maximum pooling along the channel axis of the text features F2 obtained in step 1 to obtain the text spatial context descriptors Tavg2 and Tmax2 of the text features F2:
Tmax2 = MaxPool(F2)
Tavg2 = AvgPool(F2)
where Tmax2 is the text spatial context descriptor obtained by maximum pooling and Tavg2 is the text spatial context descriptor obtained by global mean pooling;
splice the text spatial context descriptors Tmax2 and Tavg2 along the channel axis of the text features F2 to generate an effective spatial feature descriptor of the text features F2; then encode and map the information of the regions in space that need to be emphasised or suppressed with a hole (dilated) convolution, and pass the convolved feature through a relu activation function to obtain the spatial-dimension text attention vector SA(F2) of the text features F2:
SA(F2) = relu( f_d( [Tmax2 ; Tavg2] ) )
where [· ; ·] is the splicing (concatenation) operation and f_d(·) is the hole convolution layer function;
Step 23': laterally combine the channel attention model and the spatial attention model of steps 21' and 22', multiplying the corresponding elements, to obtain the mixed attention vector HYB(F2) of the text features F2:
HYB(F2) = CA(F2) ⊗ SA(F2)
inject the mixed attention vector HYB(F2) into the text features F2 to realise local semantic feature reinforcement in space and channel, obtaining the locally locked text features F'2:
F'2 = HYB(F2) ⊗ F2
Preferably, the anchor sample group obtained in step 31 is an image-text pair example sample Samp_anc(p_a, t_a) extracted at random from the training set; the positive control sample group is a sample Samp_pos(p_p, t_p) extracted at random from the training set with the same semantics as the anchor sample group; the negative control sample group is a sample Samp_neg(p_n, t_n) extracted at random from the training set with semantics different from the anchor sample group.
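A minimal sampling sketch for this triplet construction is given below (plain Python; the (image, text, label) tuple structure assumed for the training set is illustrative and not part of the claimed method):

```python
import random

def sample_triplet(train_set):
    """Draw Samp_anc(p_a, t_a), then Samp_pos(p_p, t_p) sharing its class label and
    Samp_neg(p_n, t_n) from a different class, all at random."""
    anc = random.choice(train_set)
    pos = random.choice([s for s in train_set if s[2] == anc[2] and s is not anc])
    neg = random.choice([s for s in train_set if s[2] != anc[2]])
    return anc, pos, neg
```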
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4 to obtain the asymmetric fusion loss function Loss_n through the following specific steps:
Step 41: splice and fuse the text features F2 obtained in step 1 with the locally locked image features F'1 obtained in step 2 to obtain a joint modal representation, input the joint modal representation into a multilayer perceptron and then into a softmax classifier, measure the error between the predicted label and the real label with a cross-entropy function, and quantify the model classification performance objective to obtain the asymmetric classification loss function L_Catog(F2, F'1):
L_Catog = - Σ_i Σ_c y_ic · log(p_ic)
where L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: if the predicted class is the same as the real class of sample i, y_ic is 1, otherwise y_ic is 0;
Step 42: considering the model classification performance objective and the model semantic approximation objective, design the overall model fusion algorithm under multi-modal information; specifically, linearly fuse the asymmetric classification loss function L_Catog(F2, F'1) obtained in step 41 with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the asymmetric fusion loss function Loss_n:
Loss_n = L_Catog(F'1, F2) + β(L_p + L_t)
where β is a linear proportionality coefficient;
Step 43: carry out training iterations of the model according to the asymmetric fusion loss function Loss_n.
Preferably, the overall model fusion algorithm under multi-modal information is designed in step 4 to obtain the symmetric fusion loss function Loss_y through the following specific steps:
Step 41': splice and fuse the locally locked text features F'2 and the locally locked image features F'1 obtained in step 2 to obtain a joint modal representation, input the joint modal representation into a multilayer perceptron and then into a softmax classifier, measure the error between the predicted label and the real label with a cross-entropy function, and quantify the model classification performance objective to obtain the symmetric classification loss L_Catog(F'1, F'2):
L_Catog = - Σ_i Σ_c y_ic · log(p_ic)
where L_Catog is the cross-entropy function; p_ic is the predicted probability that sample i belongs to class c; y_ic is an indicator variable: if the predicted class is the same as the real class of sample i, y_ic is 1, otherwise y_ic is 0;
Step 42': considering the model classification performance objective and the model semantic approximation objective, design the overall model fusion algorithm under multi-modal information; specifically, linearly fuse the symmetric classification loss function L_Catog(F'1, F'2) obtained in step 41' with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:
Loss_y = L_Catog(F'1, F'2) + β(L_p + L_t)
where β is a linear proportionality coefficient;
Step 43': carry out training iterations of the model according to the symmetric fusion loss function Loss_y.
Preferably, the preprocessing of the training set in step 1 specifically includes:
Step 11: separate the image-text pairs in the training set data samples to obtain image data and text data;
Step 12: encode the text data obtained in step 11 with Word2Vec to obtain the preprocessed text data.
Preferably, the deep neural network in step 1 includes an image depth feature extraction network and a text depth feature extraction network.
Preferably, the objective function in step 34 is min(L_p, L_t).
Compared with the prior art, the invention has the following beneficial effects:
Adaptive feature refinement is carried out by establishing an improved CBAM attention mechanism, strengthening local semantic features; a semantic approximation model based on the inter-modal semantic distance is constructed and an explicit metric for judging semantic consistency between modalities is introduced, so that the robustness and accuracy of the multi-modal fusion model are better improved.
Drawings
FIG. 1 is a schematic diagram of a general technical route of an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of classification accuracy of a test set of a text monomodal model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of classification accuracy of a test set of an image single-mode model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an example of the classification accuracy of a test set of a fusion model under a common feature stitching operation in the embodiment of the present invention;
FIG. 5 is a diagram illustrating an example of the test set classification accuracy of the multi-modal fusion model in an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Taking the Pascal Sentence data set as an example, a 20-class object recognition task is carried out, covering airplanes, bicycles, birds, ships, bottles, buses, automobiles, cats, chairs, cows, dining tables, dogs, horses, motorcycles, people, potted plants, sheep, sofas, trains and television monitors. Each class has 200 image-text pairs; the image data are real photographs of the objects, the text data are textual descriptions of the objects, and the image and text share the class label.
As shown in FIG. 1, the image-text data are first input and divided into a training set and a test set; image-text features are obtained through deep neural networks and input into the improved CBAM attention model for adaptive feature refinement, after which the image-text features are fused and the classification performance is evaluated. At the same time, triplet samples are randomly drawn to construct the semantic approximation model based on the inter-modal semantic distance, increasing the semantic space distance between feature pairs with different semantics and reducing the semantic space distance between feature pairs with the same semantics. Finally, the common multi-modal feature subspace is constrained by considering the model classification performance objective and the semantic consistency objective together. The specific steps are as follows:
Step 1: divide the data set into a training set and a test set, separate the image-text pairs in the training set data samples, and obtain image-text features through deep neural networks; an image depth feature extraction network and a text depth feature extraction network are constructed to extract the image features F1 and the text features F2 respectively. The specific steps are as follows:
First, a 14-layer RGB image depth feature extraction network is constructed, whose structure is, in order: two convolution layers, one pooling layer, two further convolution layers, one pooling layer, three further convolution layers and one pooling layer.
The parameters of the RGB image feature extraction network are set as follows: the convolution kernel size of the input convolution layers is 3×3, the convolution stride is [1, 1], and the numbers of convolution kernels are set to 64, 128, 256 and 512 in turn. The stride of the pooling layers is set to [2, 2], and maximum pooling layers are used.
Second, a 6-layer text feature extraction network is built, whose structure is, in order: an embedding layer, a convolution layer, a BN layer, a global pooling layer, a dropout layer and a fully connected layer.
The parameters of each layer of the text feature extraction network are set as follows: before being fed into the network, the text is encoded by Word2Vec into a 150-dimensional matrix. The text then passes through the embedding layer; the convolution layer uses 1-dimensional convolution with convolution size 3, stride 1 and 256 convolution kernels, and the dropout layer probability is 0.5. The number of neurons in the final fully connected layer is set to 256.
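A sketch of the two feature extraction branches is given below (PyTorch assumed). The grouping of the 64/128/256/512-filter convolutions over the pooled blocks and the exact downsampling are not fully pinned down by the text, so the layout shown is one plausible reading, with the output kept at 256 channels to match the (7, 7, 256) feature size quoted later; it is a sketch, not the patented network:

```python
import torch
import torch.nn as nn

class ImageFeatureNet(nn.Module):
    """VGG-style image branch: 3x3 convolutions (stride 1) with 2x2 max pooling (stride 2)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, n_convs):
            layers = []
            for _ in range(n_convs):
                layers += [nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
                cin = cout
            return layers + [nn.MaxPool2d(2, stride=2)]
        self.features = nn.Sequential(*block(3, 64, 2), *block(64, 128, 2), *block(128, 256, 3))

    def forward(self, x):            # x: (B, 3, 224, 224)
        return self.features(x)      # image features F1

class TextFeatureNet(nn.Module):
    """Six-layer text branch: embedding, 1-D convolution (kernel 3, 256 filters),
    batch normalisation, global max pooling, dropout 0.5, fully connected layer (256)."""
    def __init__(self, vocab_size, embed_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 256, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm1d(256)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(256, 256)

    def forward(self, tokens):                      # tokens: (B, seq_len) integer ids
        x = self.embed(tokens).transpose(1, 2)      # (B, embed_dim, seq_len)
        x = torch.relu(self.bn(self.conv(x)))       # (B, 256, seq_len)
        x = x.max(dim=2).values                     # global max pooling -> (B, 256)
        return self.fc(self.drop(x))                # text features F2
```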
Step 2: carry out adaptive feature refinement based on the improved CBAM attention mechanism and locally lock the image and text feature spaces containing certain semantics. The specific steps are as follows:
a. For the image feature space, a Channel Attention model (CA) is first constructed; the specific modelling method is as follows:
(1) Input the extracted image features F1 and aggregate the spatial information of the feature map over the feature space with maximum pooling and average pooling, generating two different image channel descriptors Vmax1 and Vavg1 expressed as one-dimensional vectors, which denote the maximum-pooled feature and the mean-pooled feature respectively, namely:
Vmax1 = MaxPool(F1) (1)
Vavg1 = AvgPool(F1) (2)
(2) Two convolution layers are used as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLPmax1 and VMLPavg1 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension image attention vector CA(F1):
CA(F1) = relu( MLP_s(Vmax1) + MLP_s(Vavg1) ) (3)
where MLP_s(·) is the shared-weight feature layer function.
The specific operations in the embodiment are as follows: an image of size (224, 224, 3) is taken as input, and after 14 layers of feature extraction the depth image features F1 of size (7, 7, 256) are obtained. Channel compression is then carried out: the (7, 7, 256) features are globally pooled and max pooled respectively, compressing the size to (1, 1, 256); the two pooled feature layers are fed into the shared feature layer network, whose structure is two convolutions with 512 and 256 kernels respectively, stride 1 and relu activation. After the shared feature layer, the output max layer and avg layer are fused by Add, i.e. their values are added, giving the channel attention vector CA(F1) of dimension (7, 7, 256).
Secondly, a Spatial Attention model (SA) is constructed; the specific modelling is as follows:
(1) Global mean pooling and maximum pooling are first performed along the channel axis of the image features F1 to generate two different image spatial context descriptors Tavg1 and Tmax1, namely:
Tmax1 = MaxPool(F1) (4)
Tavg1 = AvgPool(F1) (5)
(2) The two image spatial descriptors are then spliced along the channel axis to generate an effective spatial feature descriptor. The information of the regions in space that need to be emphasised or suppressed is encoded and mapped with a hole (dilated) convolution, which aggregates the spatial context information more efficiently, and the convolved feature is passed through a relu activation function to obtain the spatial-dimension image attention vector SA(F1):
SA(F1) = relu( f_d( [Tmax1 ; Tavg1] ) ) (6)
where [· ; ·] is the splicing (concatenation) operation and f_d(·) is the hole convolution layer function.
The specific operations in the embodiment are as follows: the channel-wise maximum of the image features F1, of size (7, 7, 1), and the channel-wise mean, of size (7, 7, 1), are likewise taken as input and fused into a concatenate layer, i.e. the two results are spliced along the channel to a size of (7, 7, 2). The new (7, 7, 2) tensor is sent to a convolution layer with convolution size 1×1 and one convolution kernel, changing it to (7, 7, 1); the output (7, 7, 1) is then dot-multiplied with the input features, giving the spatial attention vector SA(F1) of dimension (7, 7, 256).
Finally, the channel attention model and the spatial attention model are combined. The invention abandons the longitudinal (sequential) attention-stacking mode of the ordinary CBAM module and improves it to lateral attention, i.e. the channel attention and the spatial attention share the same input; the channel-dimension attention vector of formula (3) and the spatial-dimension attention vector of formula (6) are then multiplied element by element to obtain the mixed attention vector HYB(F1):
HYB(F1) = CA(F1) ⊗ SA(F1) (7)
The mixed attention vector HYB(F1) is injected into the original feature map to realise local semantic feature reinforcement in space and channel, obtaining the locked image features F'1, namely:
F'1 = HYB(F1) ⊗ F1 (8)
The specific operations in the embodiment are as follows: the channel-dimension attention vector CA(F1) and the spatial-dimension attention vector SA(F1) obtained above are dot-multiplied to give the mixed attention vector, which is then dot-multiplied with the original depth image features F1 to obtain the new features F'1 of dimension (7, 7, 256). The structure of the fusion classification part is, in order: a concatenate layer, a first fully connected layer and a second fully connected layer; the splicing dimension of the concatenate layer is set to 256, the number of output neurons of the first fully connected layer is set to 256, and the number of output neurons of the second fully connected layer is set to 128. The third layer is the softmax classification layer, with 20 neurons.
b. For the text feature space, a Channel Attention model (CA) is first constructed; the specific modelling method is as follows:
Input the extracted text features F2 and aggregate the spatial information of the feature map over the feature space with maximum pooling and average pooling, generating two different text channel descriptors Vmax2 and Vavg2 expressed as one-dimensional vectors, which denote the maximum-pooled feature and the mean-pooled feature respectively, namely:
Vmax2 = MaxPool(F2) (1)'
Vavg2 = AvgPool(F2) (2)'
Two convolution layers are used as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood; the two channel descriptors are passed through the shared-weight feature layer to obtain the feature vectors VMLPmax2 and VMLPavg2 respectively, which are added pixel by pixel and passed through a relu activation function to obtain the channel-dimension attention vector CA(F2):
CA(F2) = relu( MLP_s(Vmax2) + MLP_s(Vavg2) ) (3)'
where MLP_s(·) is the shared-weight feature layer function.
Secondly, a Spatial Attention model (SA) is constructed; the specific modelling is as follows:
(1) Global mean pooling and maximum pooling are first performed along the channel axis of the text features F2 to generate two different text spatial context descriptors Tavg2 and Tmax2, namely:
Tmax2 = MaxPool(F2) (4)'
Tavg2 = AvgPool(F2) (5)'
(2) The two text spatial descriptors are then spliced along the channel axis to generate an effective spatial feature descriptor. The information of the regions in space that need to be emphasised or suppressed is encoded and mapped with a hole (dilated) convolution, which aggregates the spatial context information more efficiently, and the convolved feature is passed through a relu activation function to obtain the spatial-dimension text attention vector SA(F2):
SA(F2) = relu( f_d( [Tmax2 ; Tavg2] ) ) (6)'
where [· ; ·] is the splicing (concatenation) operation and f_d(·) is the hole convolution layer function.
Finally, the channel attention model and the spatial attention model are combined. The longitudinal attention-stacking mode of the ordinary CBAM module is abandoned and improved to lateral attention, i.e. the channel attention and the spatial attention share the same input; the channel-dimension attention vector of formula (3)' and the spatial-dimension attention vector of formula (6)' are multiplied element by element to obtain the mixed attention vector HYB(F2):
HYB(F2) = CA(F2) ⊗ SA(F2) (7)'
The mixed attention vector HYB(F2) is injected into the original feature map to realise local semantic feature reinforcement in space and channel, obtaining the locked text features F'2, namely:
F'2 = HYB(F2) ⊗ F2 (8)'
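The lateral attention refinement applied to both branches above can be read, for the image branch, as the following sketch (PyTorch assumed; the 1×1 convolutions used for the shared-weight layer and the dilation rate of the hole convolution are illustrative assumptions, and the text branch is analogous with 1-D feature maps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateralCBAM(nn.Module):
    """Lateral (improved) CBAM: channel attention CA and spatial attention SA are
    computed from the same input and multiplied element-wise, HYB = CA * SA, then
    injected into the input feature map, F' = HYB * F; relu activations follow the
    text above."""
    def __init__(self, channels=256, hidden=512, dilation=2):
        super().__init__()
        # shared-weight feature layer MLP_s: two convolutions (512 then 256 kernels)
        self.shared_mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1))
        # hole (dilated) convolution f_d over the spliced max/mean channel descriptors
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=3, padding=dilation, dilation=dilation)

    def forward(self, x):                                            # x: (B, C, H, W)
        # channel attention CA(F) = relu(MLP_s(MaxPool(F)) + MLP_s(AvgPool(F)))
        ca = F.relu(self.shared_mlp(F.adaptive_max_pool2d(x, 1)) +
                    self.shared_mlp(F.adaptive_avg_pool2d(x, 1)))    # (B, C, 1, 1)
        # spatial attention SA(F) = relu(f_d([Tmax ; Tavg]))
        sa_in = torch.cat([x.max(dim=1, keepdim=True).values,
                           x.mean(dim=1, keepdim=True)], dim=1)      # (B, 2, H, W)
        sa = F.relu(self.spatial_conv(sa_in))                        # (B, 1, H, W)
        hyb = ca * sa                          # HYB(F), broadcast to (B, C, H, W)
        return hyb * x                         # locally locked features F'
```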
and step 3: establishing a semantic approximation model based on the semantic distance between modalities, increasing the semantic space distance of the feature pairs under different semantics, and simultaneously reducing the semantic space distance of the feature pairs under the same semantics, wherein the method specifically comprises the following steps:
firstly, constructing a triple consisting of three elements in a graph and training data set sample: an example sample of a graphic pair is drawn in a random manner, denoted as Samp _ anc (p)a、ta) Then randomly selecting a sum Samp _ anc (p)a、ta) Have the sameDifferent samples Samp _ pos (t) of a class labelp、tp) Simultaneously selecting a sample of different class as Samp _ neg (p)n、tn) The triplet may be represented as (Samp _ anc, Samp _ neg, Samp _ pos).
Secondly, putting the three groups of data into the locked local feature space obtained in the step 2, and recording the obtained locked image-text pair as the following features:
(Fp_a, Ft_a) = (f_pic(p_a), f_text(t_a)), (Fp_pos, Ft_pos) = (f_pic(p_p), f_text(t_p)), (Fp_neg, Ft_neg) = (f_pic(p_n), f_text(t_n)) (9)
where f_pic and f_text are the image feature extraction function and the text feature extraction function; (Fp_a, Ft_a) are the locked image-text pair features obtained by inputting the anchor sample group into the locally locked feature space; (Fp_pos, Ft_pos) are the locked image-text pair features obtained by inputting the positive control sample group into the locally locked feature space; (Fp_neg, Ft_neg) are the locked image-text pair features obtained by inputting the negative control sample group into the locally locked feature space. The optimisation goal of the model is to make same-class semantics approach each other in the feature space and inter-class semantics move apart, i.e. to reduce the distance between the feature expressions of Samp_anc and Samp_pos as much as possible while increasing the distance between the feature expressions of Samp_anc and Samp_neg, with a specific threshold α measuring the minimum interval between the two distances. The model is built as follows:
d(Fp_a, Fp_pos) + α < d(Fp_a, Fp_neg), for all triplets in τ (10)
d(Ft_a, Ft_pos) + α < d(Ft_a, Ft_neg), for all triplets in τ (11)
L_p = Σ_(i=1..N) max( d(Fp_a, Fp_pos) - d(Fp_a, Fp_neg) + α, 0 ) (12)
L_t = Σ_(i=1..N) max( d(Ft_a, Ft_pos) - d(Ft_a, Ft_neg) + α, 0 ) (13)
where N is the batch size, d(·,·) is the Euclidean distance metric, τ is the sample feature space, L_p is the picture ternary loss and L_t is the text ternary loss.
Finally, the model is constrained with the objective function min(L_p, L_t); during training iterations it therefore tends to learn a feature space in which feature pairs with the same semantics are close and feature pairs with different semantics are far apart.
Step 4: design the overall model fusion algorithm under multi-modal information in the common feature subspace. The specific steps are as follows:
a. Fusion in the asymmetric mode:
Step 41: the locked image features F'1 obtained in step 2 and the text features F2 obtained in step 1 are spliced and fused to obtain a joint modal representation, which is input into the multilayer perceptron and finally into the softmax classifier. For the model classification performance objective, the error between the predicted label and the real label is measured with a cross-entropy function, thereby quantifying the model classification performance objective and obtaining the asymmetric classification loss function L_Catog(F2, F'1), where the cross-entropy function L_Catog is:
L_Catog = - Σ_i Σ_c y_ic · log(p_ic) (14)
where p_ic is the predicted probability that sample i belongs to class c, and y_ic is an indicator variable which is 1 if the predicted class is the same as the real class of sample i and 0 otherwise.
Step 42: combining the model classification performance objective and the model semantic approximation objective, the overall model fusion algorithm under multi-modal information is designed: the asymmetric classification loss L_Catog(F'1, F2) is linearly fused with the picture ternary loss L_p and the text ternary loss L_t to obtain the asymmetric fusion loss function Loss_n:
Loss_n = L_Catog(F'1, F2) + β(L_p + L_t) (15)
where β is a linear proportionality coefficient. The training iterations of the model are performed according to this loss.
b. Fusion in the symmetric mode:
Step 41': the locally locked text features F'2 and the locally locked image features F'1 obtained in step 2 are spliced and fused to obtain a joint modal representation, which is input into the multilayer perceptron and then into the softmax classifier; the error between the predicted label and the real label is measured with the cross-entropy function of formula (14), thereby quantifying the model classification performance objective and obtaining the symmetric classification loss L_Catog(F'1, F'2).
Step 42': combining the model classification performance objective and the model semantic approximation objective, the overall model fusion algorithm under multi-modal information is designed: the symmetric classification loss L_Catog(F'1, F'2) obtained in step 41' is linearly fused with the picture ternary loss L_p and the text ternary loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:
Loss_y = L_Catog(F'1, F'2) + β(L_p + L_t) (15)'
where β is a linear proportionality coefficient.
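For illustration, the fusion classification head into which the joint modal representation is fed, and the symmetric fusion loss built on top of it, can be sketched as follows (PyTorch assumed; the layer sizes follow the embodiment above, while the relu activations of the fully connected layers, the 256-dimensional feature inputs and the value of β are assumptions; the softmax of the final layer is folded into the cross-entropy loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Joint modal representation -> MLP -> 20-class scores."""
    def __init__(self, img_dim=256, txt_dim=256, num_classes=20):
        super().__init__()
        self.fc1 = nn.Linear(img_dim + txt_dim, 256)   # first fully connected layer
        self.fc2 = nn.Linear(256, 128)                 # second fully connected layer
        self.out = nn.Linear(128, num_classes)         # softmax classification layer

    def forward(self, img_feat, txt_feat):
        joint = torch.cat([img_feat, txt_feat], dim=1) # splicing/fusion of the modalities
        h = F.relu(self.fc1(joint))
        h = F.relu(self.fc2(h))
        return self.out(h)                             # class logits

def loss_y(classifier, F1_locked, F2_locked, labels, Lp, Lt, beta=0.1):
    """Symmetric fusion loss Loss_y = L_Catog(F'1, F'2) + beta * (Lp + Lt)."""
    logits = classifier(F1_locked, F2_locked)
    return F.cross_entropy(logits, labels) + beta * (Lp + Lt)
```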
in the examples, the training will beInputting the collected data into the text image multi-mode fusion neural network, updating each layer of parameters of the multi-mode fusion neural network by using a gradient descent method, and assigning the updated parameter values to each layer of parameters of the multi-mode fusion neural network to obtain the trained multi-mode fusion neural network. The method for multi-modal fusion of parameters of each layer of the neural network by adopting the gradient descent method comprises the following steps: (a) the learning rate of the multi-modal fusion neural network is set to 0.001. (b) And taking the output value of the multi-mode fusion neural network and the label value of the human body action category in the text image sample as gradient values. (c) Utilizing the following formula:
Figure BDA0003150360530000131
Figure BDA0003150360530000132
and updating parameters of each layer of the skeleton-guided multi-mode fusion neural network. Wherein the content of the first and second substances,
Figure BDA0003150360530000133
representing the parameter value of the multi-mode fusion neural network after updating, going to represent the assignment operation, theta representing the parameter value of the multi-mode fusion neural network before updating,
Figure BDA0003150360530000134
representing gradient values of a multi-modal converged neural network. Finally, the 20 classified objects are identified.
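A minimal training-loop sketch for this procedure is shown below (PyTorch assumed; lr = 0.001 follows the embodiment, while the plain SGD optimizer and the epoch count are assumptions):

```python
import torch

def train(model, train_loader, compute_fused_loss, epochs=20, lr=0.001):
    """Gradient-descent training of the multi-modal fusion network with the fused loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            loss = compute_fused_loss(model, batch)   # Loss_n or Loss_y
            optimizer.zero_grad()
            loss.backward()                           # gradients of the fused loss
            optimizer.step()                          # theta <- theta - lr * grad
    return model
```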
As shown in FIGS. 2-5: FIG. 2 corresponds to the single-modal model classified using only the text features F2 extracted in step 1. Its test-set classification accuracy climbs rapidly in the first few iterations, but the training effect declines after the seventh generation, and the test-set classification accuracy finally stabilises at 91.33%.
FIG. 3 corresponds to the single-modal model classified using only the image features F1 extracted in step 1. The training curve shows that it quickly reaches a steady state over the whole training iteration and fluctuates within an acceptably small range; the final test-set classification accuracy of the model is 84.57%.
FIG. 4 corresponds to the ordinary multi-modal model classified after splicing and fusing the image features F1 and text features F2 extracted in step 1. The training curve shows that the convergence trend over the whole training process is smooth, the accuracy hardly changes after the model is trained to the sixth generation and the performance cannot be improved further; the test-set classification accuracy of the model is 91.82%.
FIG. 5 corresponds to the model of the invention, i.e. classification after splicing and fusing the locally locked image features F'1 obtained in step 2 with the text features F2 obtained in step 1 while taking the semantic consistency objective into account. The training curve shows that the model converges after iterating to the sixteenth generation and settles into a steady state with better performance, achieving the best classification effect with a test-set classification accuracy of 95.14%.
Compared with recognition using only the single text modality or the single image modality, the improvement obtained by simply splicing the image and text features and then classifying is very limited: it is only 0.49% higher than the better-performing single-modal model (in this embodiment, the text single-modal model). The proposed model aggregates effective feature information based on the improved CBAM attention mechanism and at the same time introduces a quantitative judgement of semantic consistency between modalities, so that the model can better search the common feature subspace, complement the modalities with each other and improve the diagnostic efficiency of the multi-modal fusion model. From the results of this embodiment, the accuracy of the model is 3.32% higher than that of the splice-and-fuse model, which demonstrates the effectiveness of the proposed fusion method.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention shall fall within the protection scope defined by the claims of the present invention.

Claims (10)

1. A multi-modal fusion classification optimization method considering inter-modal semantic distance measurement, characterized by comprising the following steps:
step 1: dividing the data into a training set and a test set, preprocessing the training set to obtain preprocessed data, and extracting data features from the preprocessed data with deep neural networks, the data features comprising image features F1 and text features F2;
step 2: laterally merging the channel attention model and the spatial attention model of the CBAM attention mechanism to obtain an improved CBAM attention mechanism, and inputting the data features obtained in step 1 into the improved CBAM attention mechanism to obtain a locally locked feature space comprising locally locked image features F'1 and locally locked text features F'2;
step 3: constructing a semantic approximation model based on the inter-modal semantic distance, specifically comprising the following steps:
step 31: constructing triplets from the training set obtained in step 1, each triplet comprising a positive control sample group, an anchor sample group and a negative control sample group;
step 32: inputting the triplets established in step 31 into the locally locked feature space obtained in step 2 to obtain locked image-text pair features;
step 33: according to the locked image-text pair features obtained in step 32, increasing the semantic space distance between locked image-text pair features with different semantics, reducing the semantic space distance between locked image-text pair features with the same semantics, and establishing the semantic approximation model based on the inter-modal semantic distance;
step 34: constraining the semantic approximation model based on the inter-modal semantic distance established in step 33 to obtain an objective function;
step 4: designing an overall model fusion algorithm under multi-modal information in the common feature subspace according to the data features obtained in step 1, the locally locked feature space obtained in step 2 and the semantic approximation model based on the inter-modal semantic distance obtained in step 3, to obtain a fusion loss function comprising the asymmetric fusion loss function Loss_n and the symmetric fusion loss function Loss_y, and carrying out training iterations of the model with the fusion loss function.
2. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric according to claim 1, wherein the semantic approximation model based on the inter-modal semantic distance established in step 33 is specifically:
(Fp_a, Ft_a) = (f_pic(p_a), f_text(t_a))
(Fp_pos, Ft_pos) = (f_pic(p_p), f_text(t_p))
(Fp_neg, Ft_neg) = (f_pic(p_n), f_text(t_n))
L_p = Σ_(i=1..N) max( d(Fp_a, Fp_pos) - d(Fp_a, Fp_neg) + α, 0 )
L_t = Σ_(i=1..N) max( d(Ft_a, Ft_pos) - d(Ft_a, Ft_neg) + α, 0 )
where d(·,·) is the Euclidean distance metric; f_pic and f_text are the image and text feature extraction functions; (Fp_a, Ft_a) are the locked image-text pair features obtained by inputting the anchor sample group into the locally locked feature space; (Fp_pos, Ft_pos) are the locked image-text pair features obtained by inputting the positive control sample group into the locally locked feature space; (Fp_neg, Ft_neg) are the locked image-text pair features obtained by inputting the negative control sample group into the locally locked feature space; α is a specific threshold; N is the batch size; τ is the sample feature space; L_p is the picture ternary loss; L_t is the text ternary loss.
3. The method of claim 1, wherein step 2 establishes the improved CBAM attention mechanism to obtain the locally locked image features F'1 through the following specific steps:
step 21: constructing the channel attention model; inputting the image features F1 obtained in step 1 and aggregating the spatial information of the feature map over the feature space with a maximum pooling operation and an average pooling operation respectively, to obtain the image channel descriptors Vmax1 and Vavg1 of the image features F1, each expressed as a one-dimensional vector:
Vmax1 = MaxPool(F1)
Vavg1 = AvgPool(F1)
where Vmax1 is the image channel descriptor obtained by maximum pooling and Vavg1 is the image channel descriptor obtained by average pooling;
using two convolution layers as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood of the image features F1; passing the image channel descriptors Vmax1 and Vavg1 through the shared-weight feature layer to obtain the feature vectors VMLPmax1 and VMLPavg1 of the image features F1; adding the feature vectors VMLPmax1 and VMLPavg1 pixel by pixel and passing the result through a relu activation function to obtain the channel-dimension image attention vector CA(F1) of the image features F1:
CA(F1) = relu( MLP_s(Vmax1) + MLP_s(Vavg1) )
where MLP_s(·) is the shared-weight feature layer function;
step 22: constructing the spatial attention model; performing global mean pooling and maximum pooling along the channel axis of the image features F1 obtained in step 1 to obtain the image spatial context descriptors Tavg1 and Tmax1 of the image features F1:
Tmax1 = MaxPool(F1)
Tavg1 = AvgPool(F1)
where Tmax1 is the image spatial context descriptor obtained by maximum pooling and Tavg1 is the image spatial context descriptor obtained by global mean pooling;
splicing the image spatial context descriptors Tmax1 and Tavg1 along the channel axis of the image features F1 to obtain an effective spatial feature descriptor of the image features F1; encoding and mapping the information of the regions in space that need to be emphasised or suppressed with a hole convolution to obtain the convolved feature, and passing the convolved feature through a relu activation function to obtain the spatial-dimension image attention vector SA(F1) of the image features F1:
SA(F1) = relu( f_d( [Tmax1 ; Tavg1] ) )
where [· ; ·] is the splicing operation and f_d(·) is the hole convolution layer function;
step 23: laterally combining the channel attention model and the spatial attention model of steps 21 and 22, multiplying the corresponding elements, to obtain the mixed attention vector HYB(F1) of the image features F1:
HYB(F1) = CA(F1) ⊗ SA(F1)
injecting the mixed attention vector HYB(F1) into the image features F1 to realise local semantic feature reinforcement in space and channel, obtaining the locally locked image features F'1:
F'1 = HYB(F1) ⊗ F1
4. The method of claim 1, wherein step 2 establishes the improved CBAM attention mechanism to obtain the locally locked text features F'2 through the following specific steps:
step 21': constructing the channel attention model; inputting the text features F2 obtained in step 1 and aggregating the spatial information of the feature map over the feature space with a maximum pooling operation and an average pooling operation respectively, to obtain the text channel descriptors Vmax2 and Vavg2 of the text features F2, each expressed as a one-dimensional vector:
Vmax2 = MaxPool(F2)
Vavg2 = AvgPool(F2)
where Vmax2 is the text channel descriptor obtained by maximum pooling and Vavg2 is the text channel descriptor obtained by average pooling;
using two convolution layers as a shared-weight feature layer to aggregate the features of a local region within the channel neighbourhood of the text features F2; passing the text channel descriptors Vmax2 and Vavg2 through the shared-weight feature layer to obtain the feature vectors VMLPmax2 and VMLPavg2 of the text features F2; adding the feature vectors VMLPmax2 and VMLPavg2 pixel by pixel and passing the result through a relu activation function to obtain the channel-dimension text attention vector CA(F2) of the text features F2:
CA(F2) = relu( MLP_s(Vmax2) + MLP_s(Vavg2) )
where MLP_s(·) is the shared-weight feature layer function;
step 22': constructing the spatial attention model; performing global mean pooling and maximum pooling along the channel axis of the text features F2 obtained in step 1 to obtain the text spatial context descriptors Tavg2 and Tmax2 of the text features F2:
Tmax2 = MaxPool(F2)
Tavg2 = AvgPool(F2)
where Tmax2 is the text spatial context descriptor obtained by maximum pooling and Tavg2 is the text spatial context descriptor obtained by global mean pooling;
splicing the text spatial context descriptors Tmax2 and Tavg2 along the channel axis of the text features F2 to generate an effective spatial feature descriptor of the text features F2; then encoding and mapping the information of the regions in space that need to be emphasised or suppressed with a hole convolution, and passing the convolved feature through a relu activation function to obtain the spatial-dimension text attention vector SA(F2) of the text features F2:
SA(F2) = relu( f_d( [Tmax2 ; Tavg2] ) )
where [· ; ·] is the splicing operation and f_d(·) is the hole convolution layer function;
step 23': laterally combining the channel attention model and the spatial attention model of steps 21' and 22', multiplying the corresponding elements, to obtain the mixed attention vector HYB(F2) of the text features F2:
HYB(F2) = CA(F2) ⊗ SA(F2)
injecting the mixed attention vector HYB(F2) into the text features F2 to realise local semantic feature reinforcement in space and channel, obtaining the locally locked text features F'2:
F'2 = HYB(F2) ⊗ F2
5. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric according to claim 1, wherein the anchor sample group obtained in step 31 is an image-text pair example sample Samp_anc(p_a, t_a) extracted at random from the training set obtained in step 1; the positive control sample group is a sample Samp_pos(p_p, t_p) extracted at random from the training set with the same semantics as the anchor sample group; and the negative control sample group is a sample Samp_neg(p_n, t_n) extracted at random from the training set with semantics different from the anchor sample group.
6. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric according to claim 1, wherein designing the overall model fusion algorithm under multi-modal information in step 4 to obtain the asymmetric fusion loss function Loss_n specifically comprises the following steps:
step 41: splice and fuse the text feature F_2 obtained in step 1 with the locally locked image feature F'_1 obtained in step 2 to obtain a joint modal representation; input the joint modal representation into a multilayer perceptron followed by a softmax classifier; measure the error between the predicted label and the true label with a cross-entropy function to quantify the model classification performance objective, obtaining the asymmetric classification loss function L_Catog(F_2, F'_1):
L_Catog(F_2, F'_1) = −Σ_i Σ_c y_ic · log(p_ic)
where L_Catog is the cross-entropy function, p_ic is the predicted probability that sample i belongs to class c, and y_ic is an indicator variable that equals 1 if class c is the true class of sample i and 0 otherwise;
step 42: considering the model classification performance objective and the model semantic approximation objective, design the overall model fusion algorithm under multi-modal information; specifically, linearly fuse the asymmetric classification loss function L_Catog(F_2, F'_1) obtained in step 41 with the image triplet loss L_p and the text triplet loss L_t obtained in step 3 to obtain the asymmetric fusion loss function Loss_n:
Loss_n = L_Catog(F'_1, F_2) + β(L_p + L_t)
where β is a linear proportionality coefficient;
step 43: perform training iterations of the model according to the asymmetric fusion loss function Loss_n.
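A minimal sketch of the fusion in steps 41–43, assuming PyTorch: the joint representation of F_2 and F'_1 has already been passed through the multilayer perceptron and classifier to produce raw logits, and the triplet losses L_p and L_t come from step 3. The helper name and the value of β are assumptions.

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()   # L_Catog: log-softmax + negative log-likelihood

def asymmetric_fusion_loss(logits: torch.Tensor, labels: torch.Tensor,
                           loss_p: torch.Tensor, loss_t: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """Loss_n = L_Catog(F2, F1') + beta * (L_p + L_t)."""
    # `logits` are the classifier outputs for the spliced (F2, F1') representation;
    # `labels` are the integer class indices of the samples in the batch.
    return cross_entropy(logits, labels) + beta * (loss_p + loss_t)
```

Step 43 would then back-propagate this scalar and step the optimiser in the usual training loop.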
7. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric according to claim 1, wherein designing the overall model fusion algorithm under multi-modal information in step 4 to obtain the symmetric fusion loss function Loss_y specifically comprises the following steps:
step 41': splice and fuse the locally locked text feature F'_2 and the locally locked image feature F'_1 obtained in step 2 to obtain a joint modal representation; input the joint modal representation into a multilayer perceptron followed by a softmax classifier; measure the error between the predicted label and the true label with a cross-entropy function to quantify the model classification performance objective, obtaining the symmetric classification loss function L_Catog(F'_1, F'_2):
L_Catog(F'_1, F'_2) = −Σ_i Σ_c y_ic · log(p_ic)
where L_Catog is the cross-entropy function, p_ic is the predicted probability that sample i belongs to class c, and y_ic is an indicator variable that equals 1 if class c is the true class of sample i and 0 otherwise;
step 42': considering the model classification performance objective and the model semantic approximation objective, design the overall model fusion algorithm under multi-modal information; specifically, linearly fuse the symmetric classification loss function L_Catog(F'_1, F'_2) obtained in step 41' with the image triplet loss L_p and the text triplet loss L_t obtained in step 3 to obtain the symmetric fusion loss function Loss_y:
Loss_y = L_Catog(F'_1, F'_2) + β(L_p + L_t)
where β is a linear proportionality coefficient;
step 43': perform training iterations of the model according to the symmetric fusion loss function Loss_y.
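The symmetric variant of claim 7 uses the same linear fusion formula; only the spliced features change, so the hypothetical helper sketched under claim 6 can compute Loss_y as well, given logits computed from the concatenation of F'_1 and F'_2:

```python
# logits_sym: classifier outputs for the spliced (F1', F2') representation (hypothetical name).
loss_y = asymmetric_fusion_loss(logits_sym, labels, loss_p, loss_t, beta=0.1)  # Loss_y
```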
8. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric according to claim 1, wherein the preprocessing of the training set in step 1 specifically comprises:
step 11: separating the image-text pairs in the training set data samples to obtain image data and text data;
step 12: encoding the text data obtained in step 11 with Word2Vec to obtain the preprocessed text data.
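A minimal sketch of the preprocessing in claim 8, assuming the gensim implementation of Word2Vec (the claim only names Word2Vec), whitespace tokenisation, and illustrative hyper-parameter values; all names are hypothetical.

```python
from gensim.models import Word2Vec

def preprocess_training_set(samples):
    """Step 11: separate image-text pairs; step 12: encode the text with Word2Vec."""
    images = [image for image, _ in samples]            # image data
    corpus = [text.split() for _, text in samples]      # tokenised text data
    w2v = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1)
    encoded = [[w2v.wv[token] for token in sentence] for sentence in corpus]
    return images, encoded
```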
9. The multi-modal fusion classification optimization method considering the inter-modal semantic distance metric as claimed in claim 1, wherein the deep neural network in step 1 comprises an image deep feature extraction network and a text deep feature extraction network.
10. The method according to claim 2, wherein the objective function in step 34 is min(L_p, L_t).
CN202110770185.4A 2021-07-06 2021-07-06 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement Active CN113343974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110770185.4A CN113343974B (en) 2021-07-06 2021-07-06 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement

Publications (2)

Publication Number Publication Date
CN113343974A true CN113343974A (en) 2021-09-03
CN113343974B CN113343974B (en) 2022-10-11

Family

ID=77483059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110770185.4A Active CN113343974B (en) 2021-07-06 2021-07-06 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement

Country Status (1)

Country Link
CN (1) CN113343974B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222101A (en) * 2011-06-22 2011-10-19 北方工业大学 Method for video semantic mining
CN111488474A (en) * 2020-03-21 2020-08-04 复旦大学 Fine-grained freehand sketch image retrieval method based on attention enhancement
CN111723220A (en) * 2020-06-18 2020-09-29 中南大学 Image retrieval method and device based on attention mechanism and Hash and storage medium
CN111985538A (en) * 2020-07-27 2020-11-24 成都考拉悠然科技有限公司 Small sample picture classification model and method based on semantic auxiliary attention mechanism
CN112101043A (en) * 2020-09-22 2020-12-18 浙江理工大学 Attention-based semantic text similarity calculation method
CN112365514A (en) * 2020-12-09 2021-02-12 辽宁科技大学 Semantic segmentation method based on improved PSPNet
CN112860888A (en) * 2021-01-26 2021-05-28 中山大学 Attention mechanism-based bimodal emotion analysis method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIORI HORI ET AL.: ""Attention-Based Multimodal Fusion for Video Description"", 《ARXIV》 *
LONG CHEN ET AL.: ""SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning"", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
张玉珍 ET AL.: ""Multi-modal Fusion Based Semantic Analysis of Soccer Video"", 《计算机科学》 (COMPUTER SCIENCE) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723356A (en) * 2021-09-15 2021-11-30 北京航空航天大学 Heterogeneous characteristic relation complementary vehicle weight recognition method and device
CN113723356B (en) * 2021-09-15 2023-09-19 北京航空航天大学 Vehicle re-identification method and device with complementary heterogeneous characteristic relationships
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN117195891A (en) * 2023-11-07 2023-12-08 成都航空职业技术学院 Engineering construction material supply chain management system based on data analysis
CN117195891B (en) * 2023-11-07 2024-01-23 成都航空职业技术学院 Engineering construction material supply chain management system based on data analysis
CN117542121A (en) * 2023-12-06 2024-02-09 河北双学教育科技有限公司 Computer vision-based intelligent training and checking system and method
CN117636074A (en) * 2024-01-25 2024-03-01 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion
CN117636074B (en) * 2024-01-25 2024-04-26 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion

Also Published As

Publication number Publication date
CN113343974B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN113343974B (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN109409222B (en) Multi-view facial expression recognition method based on mobile terminal
CN109859190B (en) Target area detection method based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
WO2020238293A1 (en) Image classification method, and neural network training method and apparatus
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN111754596B (en) Editing model generation method, device, equipment and medium for editing face image
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN109273054B (en) Protein subcellular interval prediction method based on relational graph
CN109063719B (en) Image classification method combining structure similarity and class information
CN107251059A (en) Sparse reasoning module for deep learning
US11816841B2 (en) Method and system for graph-based panoptic segmentation
CN113628294A (en) Image reconstruction method and device for cross-modal communication system
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN113139591A (en) Generalized zero sample image classification method based on enhanced multi-mode alignment
CN112434628B (en) Small sample image classification method based on active learning and collaborative representation
CN112115806B (en) Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220824

Address after: No. 8, Haitai Huake 4th Road, Huayuan Industrial Zone, Binhai High-tech Zone, Xiqing District, Tianjin 300392

Applicant after: ELECTRIC POWER SCIENCE & RESEARCH INSTITUTE OF STATE GRID TIANJIN ELECTRIC POWER Co.

Applicant after: STATE GRID TIANJIN ELECTRIC POWER Co.

Applicant after: WUHAN University

Applicant after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Address before: 300010 Tianjin city Hebei District Wujing Road No. 39

Applicant before: STATE GRID TIANJIN ELECTRIC POWER Co.

Applicant before: WUHAN University

Applicant before: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

GR01 Patent grant