CN118113888A - Cross-modal fine granularity retrieval method based on multi-channel fusion - Google Patents
Cross-modal fine granularity retrieval method based on multi-channel fusion
- Publication number
- CN118113888A
- Application number
- CN202311663327.2A
- Authority
- CN
- China
- Prior art keywords
- mode
- cross
- modal
- fine
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/245—Classification techniques relating to the decision surface
- G06F18/2451—Classification techniques relating to the decision surface linear, e.g. hyperplane
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a cross-modal fine-grained retrieval method based on multi-channel fusion. On the one hand, the method uses branch networks to extract deep feature information for each of the four modalities, which allows the features specific to each modality to be extracted as fully as possible and makes full use of the feature information of every modality. On the other hand, after the deep feature information of each modality has been extracted, it is divided into four channels and then recombined, so that each recombined group contains deep feature information from all four modalities. When the model learns, it therefore learns the information of its own modality while also absorbing information brought in by the other modalities, which greatly strengthens the information interaction between modalities, enhances the classification ability of the model, provides more accurate classification results for the subsequent retrieval task, and further improves the cross-modal retrieval ability of the model. The technology can be applied to search engines or public-security systems, effectively improving retrieval accuracy and the efficiency of criminal investigation.
Description
Technical Field
The invention belongs to the field of artificial intelligence and involves technologies such as deep learning, cross-modal retrieval, fine-grained retrieval, and channel fusion; it specifically relates to a cross-modal fine-grained retrieval method based on multi-channel fusion.
Background
With the development of society, science, and technology, video, images, text, and audio have become the main forms through which people perceive the world and communicate with one another. The rapid growth of multimodal data has created a large demand for cross-modal retrieval, whose goal is the mutual retrieval of content across modalities. Because of the large variation between modalities, cross-modal retrieval poses a higher technical challenge than single-modal retrieval. However, current cross-modal retrieval work generally focuses on coarse granularity, which is far from meeting the needs of practical applications. Fine-grained retrieval, by contrast, has greater application demand and research value in both industry and academia. Cross-modal fine-grained retrieval has therefore become an important research direction, and more cross-modal fine-grained retrieval theories and techniques need to be developed.
Existing cross-modal retrieval methods mainly focus on image-text pairs, and there is little research covering the four modalities of image, video, audio, and text. Multimodal retrieval means that the user provides a sample of any one modality as a query, and the system retrieves and returns samples of every modality that belong to the same category as the query; the number of modalities involved is at least two, which also makes experiments considerably harder. Traditional deep-learning-based cross-modal fine-grained retrieval models generally follow one of two approaches. The first uses a different neural network for each modality to extract feature vectors, for example using an image feature extractor and a linear classifier to predict labels, or jointly training an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training samples; at test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the classes of the target dataset. The second approach uses a single backbone network to extract the feature vectors of all modalities simultaneously, for example using ResNet as the base depth model with a 448x448 input size, followed after the last convolutional layer by an average pooling layer with kernel size 14 and stride 1; the aim is to extract the features of all modalities with one network. Some methods propose a dual-path attention model for feature learning, which integrates a deep convolutional neural network, an attention mechanism, and a recurrent neural network to learn cross-modal fine-grained salient features and fully mine the fine-grained semantic correlation between data of different modalities. Other methods apply strip pooling, a lightweight spatial attention mechanism, to the image and text modalities to capture their spatial semantic information, while exploring second-order covariance pooling to obtain multimodal semantic representations, capture the semantic information of modality channels, and achieve semantic alignment between the image and text modalities. Still other methods extract salient image-region features and contextual word features separately, construct a fine-grained similarity matrix between image regions and text words, and apply semantic supervision to the region features and the corresponding contextual features in attention-based image and text latent spaces. Extracting features with a single backbone network emphasizes the connections and commonality between modalities, but these account for only a small part of the data, so a large amount of useful modality-specific information is lost; extracting the features of each modality with branch networks emphasizes modality-specific information, but makes it difficult to capture the connections between modalities and the commonality of samples across modalities. Moreover, most research concentrates on finding the connection between images and text and ignores the large amount of information contained in video and audio.
Disclosure of Invention
The invention aims to solve the problems of the heterogeneity gap and the semantic gap between modalities in existing methods, and provides a cross-modal fine-grained retrieval network based on multi-channel fusion.
In order to achieve the above purpose, the technical solution adopted by the invention is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by sampling one frame out of every ten frames, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A).
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four parts.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model with a cross-entropy loss function and a noise-contrastive estimation loss function.
7) Input a sample of any category in any modality, and retrieve samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
Specifically, the step 1) includes:
For image data, since only the part of the whole image related to the retrieval object is needed, the image is cropped to eliminate the interference of background noise. Cropping is performed according to the pixel coordinates of the retrieval object so that only the part containing the object remains.
For video data, ten frames of images are extracted from each video as the input samples of the video modality.
For audio data, a short-time Fourier transform converts it into the corresponding spectrogram, which serves as the input sample of the audio modality.
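The preprocessing of step 1) can be sketched as follows. This is a minimal illustration assuming OpenCV, PIL, and librosa; the bounding-box format (x, y, width, height), the STFT parameters, and the file-handling details are assumptions, since the patent only specifies bounding-box cropping, sampling one frame out of every ten, and a short-time Fourier transform.

```python
import cv2
import librosa
import numpy as np
from PIL import Image

def crop_by_bbox(image_path, bbox):
    """Crop an image to its bounding box (x, y, width, height) to remove background noise."""
    img = Image.open(image_path).convert("RGB")
    x, y, w, h = bbox
    return img.crop((x, y, x + w, y + h))

def sample_video_frames(video_path, every_n=10):
    """Keep one frame out of every `every_n` frames as the video-modality input."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def audio_to_spectrogram(wav_path, n_fft=1024, hop_length=256):
    """Convert an audio file into a log-magnitude STFT spectrogram."""
    y, sr = librosa.load(wav_path, sr=None)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(spec, ref=np.max)
```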
Further, step 3) includes:
First, the local features of the text are obtained: N-gram information of the text is extracted with convolution kernels of different sizes, the most critical information captured by each convolution operation is then highlighted by max pooling, the pooled features are concatenated and combined through a fully connected layer, and finally the model is trained with a cross-entropy loss function.
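A minimal PyTorch sketch of such a TextCNN is given below; the vocabulary size, embedding dimension, kernel sizes, and filter count are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Multi-kernel-size 1-D convolutions + max pooling + fully connected layer."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=200,
                 kernel_sizes=(2, 3, 4, 5), num_filters=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, tokens):                      # tokens: (B, seq_len)
        x = self.embedding(tokens).transpose(1, 2)  # (B, embed_dim, seq_len)
        # extract N-gram information with different kernel sizes, then max-pool
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        feat = torch.cat(pooled, dim=1)             # modality-specific text feature Text(f_T)
        return self.fc(feat), feat                  # logits for the CE loss, plus the feature
```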
Further, step 5) includes:
In general, the biggest challenges of cross-modal fine-grained retrieval are the heterogeneity gap and the semantic gap between modalities, so the invention uses a channel-fusion method to fuse and recombine the deep features of the four modalities during training.
To realize the multi-channel-fusion cross-modal retrieval method, the first part of the image features is replaced with the first part of the video-modality features, the second part of the image-modality features is replaced with the second part of the audio-modality features, the third part of the image-modality features is replaced with the third part of the text-modality features, and the fourth part of the image-modality features is kept unchanged; the remaining modalities are processed in the same way.
Further, step 6) includes:
Since the invention deals with fine-grained cross-modal retrieval of 200 different bird species that all belong to the same general class "Bird", the classification task is very demanding, and the model therefore needs to be optimized with a noise-contrastive estimation loss function.
The cross-modal fine-grained retriever classifies samples according to the features obtained in step 5), yielding a category prediction loss that sums the cross-entropy losses of the four modalities:
L_cls = Σ_{M∈{I,V,A,T}} l(x_k^M, y_k)
where l(x_k, y_k) is the cross-entropy loss function and I, V, A, T denote the image, video, audio, and text modalities.
Because the retrieval task of the invention first classifies, then clusters, and finally retrieves, the noise-contrastive estimation loss function is used to reduce the huge computational cost caused by 200-way classification:
G(x, y) = F(x, y) - log Q(y | x)
where F(x, y) represents the degree of matching between x and y, i.e., the output of the model, x and y denote a correctly paired instance, and y' denotes a negative instance for x drawn from the whole candidate set of y. The total loss function is defined as L_total = α·L_cls + β·L_NCE.
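A hedged PyTorch sketch of the combined objective follows. The weights alpha and beta, the temperature tau, and the InfoNCE-style form of the noise-contrastive term (matched cross-modal pairs as positives, the rest of the batch as noise samples) are assumptions; the patent itself only gives G(x, y) = F(x, y) - log Q(y | x) and the weighted sum L_total = α·L_cls + β·L_NCE.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, feats_a, feats_b, alpha=1.0, beta=1.0, tau=0.07):
    """Sketch of L_total = alpha * L_cls + beta * L_NCE.
    logits/labels feed the classification term; feats_a/feats_b are L2-normalised
    features of two modalities for the same batch of categories, so the diagonal
    pairs are positives and all other pairs act as noise samples."""
    l_cls = F.cross_entropy(logits, labels)
    sim = feats_a @ feats_b.t() / tau                  # (B, B) matching scores F(x, y)
    targets = torch.arange(feats_a.size(0), device=sim.device)
    l_nce = F.cross_entropy(sim, targets)              # InfoNCE-style contrastive term
    return alpha * l_cls + beta * l_nce
```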
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category.
The beneficial effects of the invention are as follows:
1) The input of the image modality is cropped according to the bounding box, eliminating the interference of background noise.
2) The data of the image, video, and audio modalities are all converted into pictures, which makes network processing more convenient and the structure more uniform.
3) A cross-entropy loss function and a noise-contrastive estimation loss function are introduced to guide effective updating of the network parameters; the noise-contrastive estimation loss function also effectively reduces the computational cost and improves efficiency.
4) Through the multi-channel fusion technique, information interaction between modalities is strengthened, the heterogeneity gap and semantic gap between modalities are reduced, and the model learns richer information.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a model framework of the present invention.
FIG. 3 shows the dataset used in the experiments of the present invention.
FIG. 4 is a graph comparing the performance of the present invention with other methods for single-to-single-modality retrieval on the PKU FG-Xmedia dataset.
FIG. 5 is a graph comparing the performance of the present invention with other methods for single-to-multi-modality retrieval on the PKU FG-Xmedia dataset.
FIG. 6 is a graph comparing the performance of different variants of the invention in single-to-single-modality retrieval.
FIG. 7 is a graph comparing the performance of different variants of the invention in single-to-multi-modality retrieval.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention provides a cross-modal fine-grained retrieval network based on single-modality guidance and multi-channel fusion; the main flow of the method is shown in FIG. 1. The specific implementation process is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by sampling one frame out of every ten frames, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A).
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four channels.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model with a cross-entropy loss function, a noise-contrastive estimation loss function, and a fine-grained cross-modal center loss.
7) Input a sample of any category in any modality, and retrieve samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
Specifically, the step 2) includes:
Compared with a plain CNN, the more advanced ResNet network is used to process picture input. ResNet50 is a 50-layer convolutional neural network whose main characteristic is its depth, which enables it to learn more complex features and thereby improve accuracy. One problem with deep learning models is vanishing gradients, which can prevent the model from being trained; to solve this problem, ResNet uses residual learning. The idea of residual learning is that if the input and output of a layer are the same, the layer is an identity mapping; if they differ, the layer is a residual mapping. ResNet50 realizes residual learning with residual blocks, each of which contains convolutional layers and a skip connection; the skip connection passes the input directly to the output, avoiding the vanishing-gradient problem. Another feature of ResNet is its global average pooling layer, which takes the average of all pixels of each feature map as the output of that feature map; this reduces the number of model parameters and thus the risk of overfitting. Overall, ResNet is a very powerful deep learning model: its depth and residual learning enable it to learn more complex features and improve accuracy, while its global average pooling layer reduces the number of parameters and the risk of overfitting. ResNet50 has been widely used in the field of computer vision for tasks such as image classification, object detection, and semantic segmentation.
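A minimal sketch of such a ResNet50 feature extractor is given below, assuming torchvision (0.13 or later for the weights argument) and ImageNet-pretrained weights; both are assumptions not stated in the patent.

```python
import torch.nn as nn
from torchvision import models

class ResNetExtractor(nn.Module):
    """ResNet50 backbone with the classification head removed: the output of the
    global average pooling layer serves as the modality-specific feature."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # keep up to global avg pool

    def forward(self, x):                   # x: (B, 3, 448, 448) images / frames / spectrograms
        return self.features(x).flatten(1)  # (B, 2048) modality-specific feature
```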
Further, step 5) includes:
Four modalities M = {I, V, A, T} are given, where I denotes the image modality, V the video modality, A the audio modality, and T the text modality. Let the category space be K = {k_1, k_2, k_3, ..., k_n} and the sample space be S = {S^M_{k_i}}, where S^M_{k_i} is the set of instances of modality M belonging to category k_i. During training, a category k_m is selected at random, and for each modality in M one instance is randomly selected to construct an image-video-audio-text multimodal data group D_{k_m} = (x^I_{k_m}, x^V_{k_m}, x^A_{k_m}, x^T_{k_m}), where x^I_{k_m} is a randomly selected image sample of category k_m, x^V_{k_m} a randomly selected video sample, x^A_{k_m} a randomly selected audio sample, and x^T_{k_m} a randomly selected text sample. D_{k_m} is then fed into the network: the image, video, and audio modalities pass through a Vision Transformer (ViT) backbone, and the text modality passes through a BERT backbone, extracting the deep feature information f_I, f_V, f_A, f_T,
where d is the feature dimension; in the present invention the feature dimensions of all modalities are the same, i.e., d_I = d_V = d_A = d_T = d.
To realize the multi-channel-fusion cross-modal retrieval method, the extracted feature information of each modality is first divided into four parts; for the image modality, for example, f_I = [f^1_I, f^2_I, f^3_I, f^4_I].
The other three modalities are processed in the same way, where L_1, L_2, L_3, L_4 denote the lengths of the four parts and L_1 + L_2 + L_3 + L_4 = d. A multi-channel feature fusion operation is then performed, producing the fused image-video-audio-text four-modality deep feature information.
To achieve a better model, a Squeeze-and-Excitation Network (SENet) is applied to the mixed deep feature information so that informative feature channels receive large weights while uninformative or weakly informative channels receive small weights. Finally, brand-new image-video-audio-text four-modality deep feature information f'_I, f'_V, f'_A, f'_T is obtained and fed as input into a linear classifier to obtain the final prediction.
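The multi-channel fusion and SE re-weighting described above can be sketched as follows. The image-modality recombination follows the mapping given earlier (video part 1, audio part 2, text part 3, own part 4); the particular rotation used for the other three modalities, the equal part lengths (L_1 = L_2 = L_3 = L_4 = d/4), and the SE reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

def multichannel_fuse(fI, fV, fA, fT):
    """Split each (B, d) modality feature into four equal parts and swap parts
    across modalities so every fused feature contains information from all four
    modalities."""
    pI, pV, pA, pT = (torch.chunk(f, 4, dim=1) for f in (fI, fV, fA, fT))
    fI_new = torch.cat([pV[0], pA[1], pT[2], pI[3]], dim=1)  # video, audio, text, image parts
    fV_new = torch.cat([pA[0], pT[1], pI[2], pV[3]], dim=1)
    fA_new = torch.cat([pT[0], pI[1], pV[2], pA[3]], dim=1)
    fT_new = torch.cat([pI[0], pV[1], pA[2], pT[3]], dim=1)
    return fI_new, fV_new, fA_new, fT_new

class SEBlock(nn.Module):
    """Squeeze-and-Excitation re-weighting of the fused feature channels:
    informative channels receive large weights, uninformative ones small weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):          # x: (B, channels) fused feature vector
        return x * self.fc(x)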
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category.
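For illustration, the retrieval of step 7) reduces to a lookup over the classified gallery; the gallery data structure used here (a list of dictionaries) is an assumption.

```python
# Minimal retrieval sketch for step 7): after classification, the system returns
# gallery samples of the other modalities whose predicted class matches the
# query's predicted class.
def retrieve_same_class(query_pred_class, query_modality, gallery):
    """Return gallery entries from other modalities predicted to the query's class."""
    return [item for item in gallery
            if item["pred_class"] == query_pred_class
            and item["modality"] != query_modality]

# example gallery entry: {"id": "audio_0012", "modality": "A", "pred_class": 37}
```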
Experimental verification
1. Data set
The dataset used by the invention is FG-Xmedia, currently the only public dataset for fine-grained retrieval across four modalities; it contains data of the four modalities image, video, audio, and text. The image data come from CUB-200-2011, the most widely used fine-grained image classification dataset, which contains 11,788 images of 200 subcategories all belonging to the same coarse-grained category "Bird"; the training set contains 5,994 images and the test set 5,794 images, and each image is annotated with an image-level subcategory label, an object bounding box, 15 part locations, and 312 binary attributes. The video data come from YouTube Birds, a new fine-grained video dataset that uses the same category scheme as CUB-200-2011; its training set contains 12,666 videos and its test set 5,684 videos. The audio and text data were collected from professional websites according to the same categories, and together the four modalities form the public dataset FG-Xmedia.
2. Details of implementation
The invention mainly uses two backbone networks: since the data-preprocessing stage converts the inputs of the image, video, and audio modalities into pictures, these three share ResNet as the backbone, while features of the text modality are extracted with TextCNN. After preprocessing, the dimension of each data sample is fixed to 448x448x3. The whole program is written in PyTorch and runs on an RTX 3090 GPU. During training, the batch size is set to 8 and the initial learning rate to 0.00005; AdamW is selected as the optimizer, and the learning rate follows a cosine schedule with 1000 warm-up steps.
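The optimizer and learning-rate schedule can be set up as in the sketch below. The initial learning rate of 5e-5, AdamW optimizer, and 1000 warm-up steps with cosine decay follow the settings reported above; the total number of training steps, the linear form of the warm-up, and the placeholder model are assumptions.

```python
import math
import torch
import torch.nn as nn

TOTAL_STEPS = 100_000                  # assumption, not specified by the patent
model = nn.Linear(2048, 200)           # placeholder; the real model is the fused-feature classifier

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def lr_lambda(step):
    if step < 1000:
        return step / 1000.0                               # linear warm-up over 1000 steps
    progress = (step - 1000) / max(1, TOTAL_STEPS - 1000)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() after every optimizer.step() during training
```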
3. Comparative experiments
The present invention compares the proposed method with some representative models and names the proposed method as MCF.
(1) ACMR: a novel method of antagonistic cross-modal retrieval (ACMR) seeks an efficient common subspace based on antagonistic learning.
(2) CMDN: a cross-media multi-depth network utilizes complex cross-media association through hierarchical learning, learns rich cross-media correlation through two stages, and finally obtains shared characterization through a stacked network working mode.
(3) JRL: a novel cross-media data feature learning algorithm, namely joint characterization learning (JRL), can jointly explore related information and semantic information under a unified optimization framework. .
(4) GSPH: a simple and effective general hash framework is applicable to all different situations, and the semantic distance between data points is reserved.
(5) MHTN: a modality-reverse hybrid transmission network (MHTN) is directed to enabling knowledge transmission from a single-modality source domain to a cross-modality target domain and learning cross-modality co-characterization.
(6) FGCrossNet: a unified depth model that learns 4 types of media simultaneously without separate processing. Three constraints are considered together, namely classification constraint ensures the learning of the distinguishing features of fine-grained subcategories, central constraint ensures the compactness features of the same subcategory, and sorting constraint ensures the sparsity features of the features of different subcategories.
(7) DBFC-Net: a novel dual-branch fine-grained cross-media network (DBFC-Net) utilizes specific media information to build common-feature work through a unified framework. It also designed an effective distance measure for fine-grained cross-media retrieval.
(8) SAFGCM: an attention space training method learns a common characterization of different media data. In particular, a local self-attention layer is utilized to learn a common attention space between different media data.
For a fair comparison, the invention uses the unified evaluation metric of the multimodal fine-grained retrieval task, the mean average precision (mAP). In the experiments, mAP is obtained by first computing the average precision of the query samples of each category and then averaging these values over the 200 categories.
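The mAP metric can be computed as in the following sketch, where each query's ranked result list is reduced to a 0/1 relevance vector; the exact ranking and relevance construction are assumptions.

```python
import numpy as np

def average_precision(relevant):
    """AP for one query: `relevant` is a 0/1 sequence ordered by descending
    retrieval score (1 = same fine-grained category as the query)."""
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return (precision_at_k * relevant).sum() / max(1.0, relevant.sum())

def mean_average_precision(ap_per_category):
    """mAP: average of the per-category APs (200 categories in FG-Xmedia)."""
    return float(np.mean(ap_per_category))

# usage: mean_average_precision([average_precision(r) for r in ranked_relevance_lists])
```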
Meanwhile, to show the effect of cross-modal fine-grained retrieval more comprehensively, 18 evaluation scores are reported in total. These include the retrieval scores of each modality against each of the other three modalities, denoted I→V, I→A, I→T, V→I, V→A, V→T, A→I, A→V, A→T, T→I, T→V, T→A, together with their average, as well as the retrieval scores of each modality against ALL other modalities, i.e., I→ALL, V→ALL, A→ALL, T→ALL, together with their average. As can be seen from the data in FIGS. 4 and 5, the performance of our algorithm is significantly better than the other methods, improving by more than 110% over the FGCrossNet algorithm in the I→T, T→I, T→A, T→V, A→T, and T→ALL scenarios. The average retrieval mAP of one-to-one modality retrieval is 52% higher than that of the FGCrossNet algorithm, and the average retrieval mAP of one-to-multi-modality retrieval is 33% higher, which demonstrates the effectiveness of MCF in cross-modal fine-grained retrieval: it can fully fuse the information of different modalities, strengthen the information interaction of fine-grained objects across modalities, reduce the differences between modalities, make full use of the rich semantic information of the text modality, and reduce information loss, thereby improving the retrieval effect.
4. Ablation experiments
Since the MCF proposed by the invention contains a number of key components, this section compares several variants of MCF to demonstrate its effectiveness:
(1) ResNet50: after extracting features using ResNet networks, the model is optimized using only the cross entropy loss function and the noise contrast estimated loss function.
(2) ResNet +mcf: and after the characteristics of the convolutional neural network used by most networks are extracted, adding a multi-channel fusion module for subsequent processing.
(3) ResNet +mcf+fccl: the most advanced image feature extractor and the most advanced text feature extractor are used for extracting features and then classifying the features, and a single-mode guiding and multi-channel fusion module is added for improving the model effect.
FIGS. 6 and 7 show the effect of the various variants of MCF on the PKU FG-Xmedia dataset.
As can readily be seen from FIGS. 6 and 7, the performance of ResNet50+MCF+FCCL is clearly superior to that of ResNet50+MCF and ResNet50, which demonstrates the effectiveness of each sub-module for the overall MCF model. In addition, it can be observed that among the three variants, removing the MCF causes the most obvious performance drop, which shows that channel fusion has the greatest influence on the performance of the MCF model: the multi-channel fusion technique greatly strengthens the information interaction between modalities, eliminates the heterogeneity gap and the semantic gap, and improves retrieval performance.
Claims (6)
1. A cross-modal fine-grained retrieval method based on multi-channel fusion, characterized in that the method comprises the following steps: 1) cropping the image-modality data according to the existing bounding-box labels, extracting key frames from the video-modality data by sampling one frame out of every ten frames, and converting the audio-modality data into spectrograms using the short-time Fourier transform; 2) extracting features from the image-modality, video-modality, and audio-modality data that have been converted into pictures using a ResNet network, finally obtaining the modality-specific features of the three modalities, Res(f_I), Res(f_V), Res(f_A); 3) extracting features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality; 4) dividing the modality-specific features of each modality into four parts; 5) performing cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T; 6) inputting the new modality-specific features into a linear classifier for classification, and optimizing the model with a cross-entropy loss function and a noise-contrastive estimation loss function; 7) inputting a sample of any category in any modality, and retrieving samples of the other modalities in the same category according to the classification result, realizing cross-modal fine-grained retrieval.
2. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 1) comprises: cropping the input data using the existing bounding-box labels of the image modality to eliminate the interference of background noise; meanwhile, in order to process video and audio information with ResNet, the video is subjected to frame extraction and the audio is converted into pictures by short-time Fourier transform, so that pictures serve as the input of ResNet.
3. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein the step 3) comprises: to obtain local features of a text, N-Gram information of the text is extracted through different convolution kernel sizes, then the most critical information extracted by each convolution operation is highlighted through a maximum pooling operation, the features are combined through a full connection layer after splicing, and finally a model is trained through a cross entropy loss function.
4. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 5) comprises: for the extracted modality-specific features, the invention divides the features of each modality into four channels and then fuses and recombines them to obtain new modality-specific features f'_I, f'_V, f'_A, f'_T.
5. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 6) comprises: the cross-modal fine-grained retriever predicts the specific category to which the input sample belongs according to the features obtained in step 5), obtaining the category prediction loss of each modality; by minimizing the category prediction loss, the model is improved so that it classifies samples by category without considering modality, providing the precondition for the subsequent retrieval task.
6. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 7) comprises: the cross-modal fine-grained retriever performs retrieval according to the classification result obtained in step 6); since the classification in step 6) is performed by category without considering modality, samples of the same category from different modalities are grouped together, so that after a user inputs a sample of a certain category in a certain modality, the system retrieves and returns samples of the other modalities that belong to the same category as the input sample, realizing the cross-modal fine-grained retrieval task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311663327.2A CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311663327.2A CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118113888A true CN118113888A (en) | 2024-05-31 |
Family
ID=91209511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311663327.2A Pending CN118113888A (en) | 2023-12-05 | 2023-12-05 | Cross-modal fine granularity retrieval method based on multi-channel fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118113888A (en) |
-
2023
- 2023-12-05 CN CN202311663327.2A patent/CN118113888A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |