CN118113888A - Cross-modal fine granularity retrieval method based on multi-channel fusion - Google Patents

Cross-modal fine granularity retrieval method based on multi-channel fusion

Info

Publication number
CN118113888A
Authority
CN
China
Prior art keywords
mode
cross
modal
fine
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311663327.2A
Other languages
Chinese (zh)
Inventor
陈乔松
陈浩
李远路
刘峻卓
张冶
张冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202311663327.2A
Publication of CN118113888A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface
    • G06F18/2451Classification techniques relating to the decision surface linear, e.g. hyperplane
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal fine-grained retrieval method based on multi-channel fusion. On one hand, the method uses branch networks to extract the deep feature information of four modalities, which preserves the features specific to each modality and makes full use of each modality's feature information. On the other hand, after the deep feature information of each modality is extracted, it is divided into four channels and recombined, so that each recombined group contains deep feature information from all four modalities. When the model learns, it therefore learns the information of its own modality while also absorbing, in advance, the information contributed by the other modalities. This greatly strengthens information interaction between modalities, enhances the classification ability of the model, provides more accurate classification results for subsequent retrieval tasks, and further improves the cross-modal retrieval ability of the model. The technique can be applied to search engines or public-security systems, effectively improving retrieval accuracy and the efficiency of criminal investigation.

Description

Cross-modal fine granularity retrieval method based on multi-channel fusion
Technical Field
The invention belongs to the field of artificial intelligence, in particular to technologies such as deep learning, cross-modal retrieval, fine-grained retrieval, and channel fusion, and specifically relates to a cross-modal fine-grained retrieval method based on multi-channel fusion.
Background
With the development of society, science, and technology, the four modalities of video, image, text, and audio have become the main forms through which humans perceive the world and communicate with one another. The rapid growth of multimodal data has created broad application demand for cross-modal retrieval, whose goal is the mutual retrieval of content across modalities. Because of the large variation between modalities, cross-modal retrieval poses greater technical challenges than single-modal retrieval. However, current cross-modal retrieval work generally focuses on coarse granularity, which falls far short of the needs of practical applications. By contrast, fine-grained retrieval has greater application demand and research value in both industry and academia. Cross-modal fine-grained retrieval has therefore become an important research direction, and more cross-modal fine-grained retrieval theory and techniques need to be developed.
Existing cross-modal retrieval methods mainly focus on image-text pairs; research covering the four modalities of image, video, audio, and text is comparatively scarce. In multi-modal retrieval, the user supplies a sample of any one modality as a query, and the system retrieves and returns samples of every modality belonging to the same category as the query; since the number of modalities involved is two or more, this also makes experiments considerably harder.
Traditional deep-learning-based cross-modal fine-grained retrieval models generally follow one of two approaches. The first extracts feature vectors for different modalities with different neural networks: for example, using an image feature extractor and a linear classifier to predict labels, or jointly training an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training samples; at test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes. The second approach extracts the feature vectors of all modalities with a single backbone network: for example, using ResNet as the base depth model with a 448 x 448 input size and, after the last convolutional layer, an average pooling layer with kernel size 14 and stride 1, the aim being to extract the features of all modalities with one network. Some methods propose a dual-path attention model for feature learning that integrates a deep convolutional neural network, an attention mechanism, and a recurrent neural network to learn cross-modal fine-grained salient features and fully mine the fine-grained semantic correlation among data of different modalities. Other methods apply strip pooling, a lightweight spatial attention mechanism, to the image and text modalities to capture their spatial semantic information, and at the same time explore second-order covariance pooling to obtain multimodal semantic representations, capture the semantic information of modality channels, and achieve semantic alignment between the image and text modalities. Still other methods extract salient image-region features and context-aware word features, construct a fine-grained similarity matrix between image regions and text words, and apply semantic supervision to the image-region features and the corresponding context features in attention-based latent spaces for images and text.
Methods that extract features with a single backbone emphasize inter-modal connections and commonality, but such connections and commonality cover only a small part of the data, so a large amount of useful modality-specific information is lost. Methods that extract each modality's features with branch networks emphasize modality-specific information, but have difficulty capturing inter-modal connections and the commonality of samples across modalities. Moreover, most research concentrates on finding connections between images and text, ignoring the large amount of information contained in video and audio.
Disclosure of Invention
The invention aims to solve the problems of the heterogeneity gap and the semantic gap between modalities in existing methods, and provides a cross-modal fine-grained retrieval network based on multi-channel fusion.
To achieve the above purpose, the technical solution adopted by the invention is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by taking one frame out of every ten, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality data and from the video- and audio-modality data converted into pictures using a ResNet network, finally obtaining the modality-specific features Res(f_I), Res(f_V), Res(f_A) of the three modalities.
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four parts.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain the new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model using a cross-entropy loss function and a noise-contrastive-estimation loss function.
7) Given an input of any category in any modality among the classified results, retrieve samples of the same category in the other modalities according to the classification, realizing cross-modal fine-grained retrieval.
Specifically, step 1) includes:
For image data, since only the part of the image related to the retrieval object is needed, the image is cropped to eliminate the interference of background noise. Cropping is performed according to the pixel coordinates of the retrieval object so that only the part containing the object remains.
For video data, ten frames are extracted from each video as the input samples of the video modality.
For audio data, the short-time Fourier transform is used to convert it into the corresponding spectrogram, which serves as the input sample of the audio modality.
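For illustration only, the three preprocessing steps described above might be sketched in Python as follows; the bounding-box format (x, y, width, height), the frame-sampling helper, and the STFT parameters are assumptions made for this sketch rather than values prescribed by the invention.

import numpy as np
from PIL import Image
import librosa  # assumed available for the short-time Fourier transform

def crop_by_bbox(image_path, bbox):
    # Keep only the region given by an (x, y, width, height) bounding-box label.
    x, y, w, h = bbox
    return Image.open(image_path).crop((x, y, x + w, y + h))

def sample_key_frames(frames, step=10):
    # Keep one frame out of every `step` decoded video frames.
    return frames[::step]

def audio_to_spectrogram(wav_path, n_fft=1024, hop_length=512):
    # Convert an audio clip into a log-magnitude STFT spectrogram image.
    signal, sr = librosa.load(wav_path, sr=None)
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)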
Further, step 3) includes:
First the local features of the text are obtained: N-gram information of the text is extracted with convolution kernels of different sizes, the most critical information captured by each convolution is then highlighted by a max-pooling operation, the pooled features are concatenated and combined through a fully connected layer, and finally the model is trained with a cross-entropy loss function.
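A minimal TextCNN of the kind described above might look as follows; the embedding size, kernel sizes, number of filters, and output dimension are assumptions chosen only to make the sketch concrete.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Minimal sketch: N-gram convolutions, max pooling, and a fully connected layer.
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 kernel_sizes=(2, 3, 4, 5), feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size extracts N-gram information.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), feat_dim)

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # Max pooling over time keeps the most critical response of each filter.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # (batch, feat_dim)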
Further, step 5) includes:
In general, the biggest challenges of cross-modal fine-grained retrieval are the heterogeneity gap and the semantic gap between modalities, so the invention fuses and recombines the deep features of the four modalities with a channel-fusion method during training.
To realize the multi-channel-fusion cross-modal retrieval method, the first part of the image modality is replaced with the first part of the video modality, the second part of the image modality is replaced with the second part of the audio modality, the third part of the image modality is replaced with the third part of the text modality, and the fourth part of the image modality remains unchanged; the remaining modalities are recombined in the same way.
Further, step 6) includes:
Since the invention addresses fine-grained cross-modal retrieval among 200 kinds of birds that all belong to the same coarse-grained category of birds, the classification task is heavy, so the model also needs to be optimized with a noise-contrastive-estimation loss function.
The cross-modal fine-grained retriever classifies according to the features obtained in step 5), giving the category prediction loss:
L_cls = l(x_k^I, y_k) + l(x_k^V, y_k) + l(x_k^A, y_k) + l(x_k^T, y_k)
where l(x_k, y_k) is the cross-entropy loss function and I, V, A, T denote the image, video, audio, and text modalities.
Because the retrieval task of the invention first classifies, then clusters, and finally retrieves, and in order to reduce the huge amount of computation caused by 200-way classification, the invention uses the noise-contrastive-estimation loss function to reduce the computation:
G(x, y) = F(x, y) - log Q(y|x)
where F(x, y) denotes the degree of matching between x and y, i.e. the output of the model; x and y each denote an instance of a correctly predicted pair, and y' denotes a negative instance for x drawn from the whole candidate set of y. The total loss function is defined as L_total = αL_cls + βL_NCE.
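A minimal sketch of how the combined objective could be assembled is given below; the concrete form of F(x, y), the noise distribution Q, and the weights α and β are not fixed by the text, so the in-batch negatives and the default weights used here are assumptions for illustration only.

import torch
import torch.nn.functional as F

def classification_loss(logits_per_modality, labels):
    # L_cls: sum of the cross-entropy losses of the four modality classifiers.
    return sum(F.cross_entropy(logits, labels)
               for logits in logits_per_modality.values())

def nce_loss(scores, log_q):
    # scores[i, j] is the matching score F(x_i, y_j); the diagonal holds the
    # positive pairs and the remaining columns act as negatives y'.
    corrected = scores - log_q                # G(x, y) = F(x, y) - log Q(y|x)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(corrected, targets)

def total_loss(logits_per_modality, labels, scores, log_q, alpha=1.0, beta=1.0):
    # L_total = alpha * L_cls + beta * L_NCE (alpha and beta are assumed weights).
    return (alpha * classification_loss(logits_per_modality, labels)
            + beta * nce_loss(scores, log_q))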
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category as the query.
The beneficial effects of the invention are as follows:
1) The input of the image modality is cropped according to the bounding box, eliminating the interference of background noise.
2) The data of the image, video, and audio modalities are all converted into pictures, which makes network processing more convenient and the structure more uniform.
3) The cross-entropy loss function and the noise-contrastive-estimation loss function are introduced to guide the effective updating of the network parameters; the noise-contrastive-estimation loss function effectively reduces the amount of computation and improves efficiency.
4) The multi-channel fusion technique enhances information interaction between modalities, reduces the heterogeneity gap and the semantic gap between modalities, and lets the model learn richer information.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the model framework of the present invention.
FIG. 3 shows the data set used in the experiments of the present invention.
FIG. 4 compares the performance of the present invention with other methods on the PKU FG-Xmedia dataset for single-to-single-modality retrieval.
FIG. 5 compares the performance of the present invention with other methods on the PKU FG-Xmedia dataset for single-to-multi-modality retrieval.
FIG. 6 compares the performance of different variants of the invention in single-to-single-modality retrieval.
FIG. 7 compares the performance of different variants of the invention in single-to-multi-modality retrieval.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a cross-modal fine-grained retrieval network based on single-modality guidance and multi-channel fusion; the main flow of the method is shown in FIG. 1. The specific implementation process is as follows:
1) Crop the image-modality data according to the existing bounding-box labels, extract key frames from the video-modality data by taking one frame out of every ten, and convert the audio-modality data into spectrograms using the short-time Fourier transform.
2) Extract features from the image-modality data and from the video- and audio-modality data converted into pictures using a ResNet network, finally obtaining the modality-specific features Res(f_I), Res(f_V), Res(f_A) of the three modalities.
3) Extract features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality.
4) Divide the modality-specific features of each modality equally into four channels.
5) Perform cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain the new modality-specific features f'_I, f'_V, f'_A, f'_T.
6) Input the new modality-specific features into a linear classifier for classification, and optimize the model using a cross-entropy loss function, a noise-contrastive-estimation loss function, and a fine-grained cross-modal center loss.
7) Given an input of any category in any modality among the classified results, retrieve samples of the same category in the other modalities according to the classification, realizing cross-modal fine-grained retrieval.
Specifically, step 2) includes:
Compared with an ordinary CNN, we use the more advanced ResNet network to process the picture inputs. ResNet-50 is a 50-layer convolutional neural network whose defining feature is its depth, which allows it to learn more complex features and thus improves its accuracy. One problem of deep learning models is the vanishing gradient, which can prevent the model from being trained; to solve this problem, ResNet uses residual learning. The idea of residual learning is that if the input and output of a layer are the same, the layer is an identity mapping, and if they differ, the layer is a residual mapping. ResNet-50 realizes residual learning with residual blocks: each residual block contains convolutional layers and a skip connection, and the skip connection passes the input directly to the output, avoiding the vanishing-gradient problem. Another feature of ResNet is its global average pooling layer, which takes the average of all pixels of each feature map as that map's output; this reduces the number of parameters of the model and therefore the risk of overfitting. Overall, ResNet is a very powerful deep learning model: its depth and residual learning enable it to learn more complex features and improve accuracy, and its global average pooling layer reduces the number of parameters and thus the risk of overfitting. ResNet-50 has been widely used in computer vision tasks such as image classification, object detection, and semantic segmentation.
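Because the image, video-frame, and spectrogram inputs are all pictures, a single pretrained ResNet-50 with its classification head removed can serve as the shared feature extractor. The sketch below uses the standard torchvision ResNet-50; the projection dimension is an assumption.

import torch.nn as nn
from torchvision import models

class ResNetBackbone(nn.Module):
    # Shared ResNet-50 feature extractor for image, video-frame, and spectrogram inputs.
    def __init__(self, feat_dim=1024):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the final fully connected layer; keep the conv stages and global average pooling.
        self.body = nn.Sequential(*list(resnet.children())[:-1])
        self.proj = nn.Linear(resnet.fc.in_features, feat_dim)

    def forward(self, images):                  # (batch, 3, 448, 448)
        feats = self.body(images).flatten(1)    # (batch, 2048)
        return self.proj(feats)                 # (batch, feat_dim)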
Further, step 5) includes:
Four modalities M = {I, V, A, T} are given, where I denotes the image modality, V the video modality, A the audio modality, and T the text modality. Let the category space be K = {k_1, k_2, k_3, ..., k_n} and the sample space be S = {S_ki^M}, where S_ki^M is the set of instances of modality M belonging to category k_i. During training, a category k_m is selected at random, and for each modality in M one sample is drawn from S_km^M to construct an image-video-audio-text multimodal tuple D_km = {x_km^I, x_km^V, x_km^A, x_km^T}, where x_km^I is a randomly selected image sample of category k_m, x_km^V a randomly selected video sample, x_km^A a randomly selected audio sample, and x_km^T a randomly selected text sample. D_km is then taken as the input of the network: the image, video, and audio modalities pass through the Vision Transformer (ViT) backbone and the text modality passes through the BERT backbone, extracting the deep feature information f_I, f_V, f_A, f_T with f_I ∈ R^(d_I), f_V ∈ R^(d_V), f_A ∈ R^(d_A), f_T ∈ R^(d_T),
where d is the feature dimension; in the present invention the feature dimensions of the modalities are the same, i.e. d_I = d_V = d_A = d_T = d.
To realize the multi-channel-fusion cross-modal retrieval method, the extracted feature information is first written as f_M for M ∈ {I, V, A, T}, and the feature information of each modality is then divided into four parts; for the image modality, for example, f_I = [f_I^1, f_I^2, f_I^3, f_I^4].
The other three modalities are processed in the same way, where L_1, L_2, L_3, L_4 denote the lengths of the four parts and L_1 + L_2 + L_3 + L_4 = d. A multi-channel feature fusion operation is then performed, and the fused image-video-audio-text four-modality deep feature information is denoted f'_I, f'_V, f'_A, f'_T.
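As an illustration (not necessarily the exact implementation of the invention), the cross-channel exchange described above can be written as a simple tensor operation in which each modality's feature vector is split into four segments and segments are swapped so that every fused feature carries information from all four modalities; the equal segment lengths d/4 and the exchange pattern for the video, audio, and text modalities are assumptions of this sketch.

import torch

def multi_channel_fusion(f_i, f_v, f_a, f_t):
    # Each input is a (batch, d) feature; split it into four equal parts and
    # exchange parts across modalities (the pattern below is assumed).
    def parts(f):
        return torch.chunk(f, 4, dim=1)
    i1, i2, i3, i4 = parts(f_i)
    v1, v2, v3, v4 = parts(f_v)
    a1, a2, a3, a4 = parts(f_a)
    t1, t2, t3, t4 = parts(f_t)
    f_i_new = torch.cat([v1, a2, t3, i4], dim=1)   # image keeps only its fourth part
    f_v_new = torch.cat([i1, t2, a3, v4], dim=1)
    f_a_new = torch.cat([t1, i2, v3, a4], dim=1)
    f_t_new = torch.cat([a1, v2, i3, t4], dim=1)
    return f_i_new, f_v_new, f_a_new, f_t_new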
To further improve the model, a Squeeze-and-Excitation network (SENet) is applied to the mixed deep feature information, so that informative feature channels receive large weights and uninformative or weakly informative channels receive small weights. This finally yields brand-new image-video-audio-text four-modality deep feature information, which is fed as input into a linear classifier to obtain the final prediction.
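The squeeze-and-excitation reweighting over the fused feature channels could be sketched as follows; applying it to a flattened feature vector and using the usual reduction ratio of 16 are assumptions of this sketch.

import torch.nn as nn

class SEReweight(nn.Module):
    # Squeeze-and-Excitation style gating over the fused feature channels (sketch).
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):        # (batch, channels) fused feature
        # Per-channel weights emphasize informative channels and suppress weak ones.
        return x * self.gate(x)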
Through steps 1) to 7), for a query sample of a certain category in a certain modality input by the user, the system retrieves and returns samples of the other modalities that belong to the same category as the query.
Experiment verification
1. Data set
The data set used by the invention is FG-Xmedia, currently the only public cross-modal fine-grained retrieval data set covering four modalities, containing image, video, audio, and text data. The image data come from CUB-200-2011, the most widely used fine-grained image classification dataset, which contains 11788 images of 200 subcategories, all belonging to the same coarse-grained category "Bird"; the training set contains 5994 images and the test set 5794 images, and each image is annotated with an image-level subcategory label, an object bounding box, 15 part locations, and 312 binary attributes. The video data come from YouTube Birds, a new fine-grained video dataset that uses the same category division as CUB-200-2011; its training set contains 12666 videos and its test set 5684 videos. The audio and text data were collected from professional websites according to the same categories, and the four modalities together form the public dataset FG-Xmedia.
2. Details of implementation
The invention mainly uses two backbone networks. Since the data preprocessing stage converts the inputs of the image, video, and audio modalities into pictures, these three modalities share ResNet as the backbone network, while features of the text modality are extracted with TextCNN. After preprocessing, the dimension of each data sample is fixed to 448 x 448 x 3. The whole program is written in PyTorch and runs on an RTX 3090 GPU. During training, the batch size is set to 8 and the initial learning rate to 0.00005, AdamW is selected as the optimizer, and the learning rate follows a cosine schedule with a 1000-step warm-up.
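Under the hyper-parameters listed (batch size 8, initial learning rate 0.00005, AdamW, a 1000-step warm-up followed by cosine decay), the optimizer and learning-rate schedule might be set up as in the sketch below; the total step count and the exact composition of the scheduler are assumptions.

import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, total_steps, warmup_steps=1000, lr=5e-5):
    # AdamW with linear warm-up followed by cosine decay (assumed schedule shape).
    optimizer = AdamW(model.parameters(), lr=lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                 # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay

    return optimizer, LambdaLR(optimizer, lr_lambda)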
3. Comparative experiments
The present invention compares the proposed method, named MCF, with several representative models:
(1) ACMR: a novel method of antagonistic cross-modal retrieval (ACMR) seeks an efficient common subspace based on antagonistic learning.
(2) CMDN: a cross-media multi-depth network utilizes complex cross-media association through hierarchical learning, learns rich cross-media correlation through two stages, and finally obtains shared characterization through a stacked network working mode.
(3) JRL: a novel cross-media data feature learning algorithm, namely joint characterization learning (JRL), can jointly explore related information and semantic information under a unified optimization framework. .
(4) GSPH: a simple and effective general hash framework is applicable to all different situations, and the semantic distance between data points is reserved.
(5) MHTN: a modality-reverse hybrid transmission network (MHTN) is directed to enabling knowledge transmission from a single-modality source domain to a cross-modality target domain and learning cross-modality co-characterization.
(6) FGCrossNet: a unified depth model that learns 4 types of media simultaneously without separate processing. Three constraints are considered together, namely classification constraint ensures the learning of the distinguishing features of fine-grained subcategories, central constraint ensures the compactness features of the same subcategory, and sorting constraint ensures the sparsity features of the features of different subcategories.
(7) DBFC-Net: a novel dual-branch fine-grained cross-media network (DBFC-Net) utilizes specific media information to build common-feature work through a unified framework. It also designed an effective distance measure for fine-grained cross-media retrieval.
(8) SAFGCM: an attention space training method learns a common characterization of different media data. In particular, a local self-attention layer is utilized to learn a common attention space between different media data.
For fair comparison, the invention uses the unified evaluation metric of the multi-modal fine-grained retrieval task, the mean Average Precision (mAP). In the experiments, mAP is obtained by first computing the average precision of the query samples of each category and then averaging the per-category average precisions over the 200 categories.
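For reference, a simple (assumed) implementation of per-query Average Precision and of the mAP defined above is:

import numpy as np

def average_precision(relevant):
    # AP for a single query; `relevant` is a boolean array over the ranked results.
    relevant = np.asarray(relevant, dtype=bool)
    if relevant.sum() == 0:
        return 0.0
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(per_class_aps):
    # mAP: average the per-class mean APs over all (here 200) categories.
    return float(np.mean([np.mean(aps) for aps in per_class_aps.values()]))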
Meanwhile, to show the effect of cross-modal fine-grained retrieval more comprehensively, 18 evaluation indicators are set in total. They include the retrieval scores of each modality against each of the other three modalities, denoted I→V, I→A, I→T, V→I, V→A, V→T, A→I, A→V, A→T, T→I, T→V, T→A, together with the average of these scores, and the retrieval scores of each modality against all other modalities jointly, i.e. I→ALL, V→ALL, A→ALL, T→ALL, together with their average. As shown in FIG. 4 and FIG. 5, the performance of our algorithm is significantly better than that of the other methods; compared with the FGCrossNet algorithm, the improvement exceeds 110% in the I→T, T→I, T→A, T→V, A→T, and T→ALL settings. The average retrieval mAP of one-to-one modality retrieval is 52% higher than that of FGCrossNet, and the average retrieval mAP of one-to-many modality retrieval is 33% higher, which demonstrates the effectiveness of MCF for cross-modal fine-grained retrieval: it can fully fuse the information of different modalities, enhance the information interaction of fine-grained objects across modalities, reduce the differences between modalities, make full use of the rich semantic information of the text modality, and reduce information loss, thereby improving retrieval performance.
4. Ablation experiments
Since the MCF proposed by the invention contains several key components, this section compares variants of MCF from several aspects to demonstrate the effectiveness of MCF:
(1) ResNet50: after features are extracted with the ResNet network, the model is optimized using only the cross-entropy loss function and the noise-contrastive-estimation loss function.
(2) ResNet50+MCF: after features are extracted with the convolutional neural network used by most networks, the multi-channel fusion module is added for subsequent processing.
(3) ResNet50+MCF+FCCL: features are extracted with the most advanced image feature extractor and text feature extractor and then classified, and the single-modality guidance and multi-channel fusion modules are added to improve the model.
FIG. 6 and FIG. 7 show the effect of the various variants of MCF on the PKU FG-Xmedia dataset.
As can be readily seen from these figures, ResNet50+MCF+FCCL clearly outperforms ResNet50+MCF and ResNet50, which demonstrates the effectiveness of each sub-module in the overall MCF model. In addition, it can be observed that removing MCF causes the most obvious performance drop among the three variants, which shows that channel fusion has the greatest influence on the performance of the MCF model: the multi-channel fusion technique greatly enhances information interaction between modalities, thereby reducing the heterogeneity and semantic gaps and improving retrieval performance.

Claims (6)

1. A cross-modal fine-grained retrieval method based on multi-channel fusion, characterized by comprising the following steps: 1) cropping the image-modality data according to the existing bounding-box labels, extracting key frames from the video-modality data by taking one frame out of every ten, and converting the audio-modality data into spectrograms using the short-time Fourier transform; 2) extracting features from the image-modality data and from the video- and audio-modality data converted into pictures using a ResNet network, finally obtaining the modality-specific features Res(f_I), Res(f_V), Res(f_A) of the three modalities; 3) extracting features from the text-modality data using TextCNN, finally obtaining the modality-specific feature Text(f_T) of the text modality; 4) dividing the modality-specific features of each modality into four parts; 5) performing cross-channel fusion on the four features Res(f_I), Res(f_V), Res(f_A), Text(f_T) using the multi-channel fusion method to obtain the new modality-specific features f'_I, f'_V, f'_A, f'_T; 6) inputting the new modality-specific features into a linear classifier for classification, and optimizing the model using a cross-entropy loss function and a noise-contrastive-estimation loss function; 7) given an input of any category in any modality among the classified results, retrieving samples of the same category in the other modalities according to the classification, realizing cross-modal fine-grained retrieval.
2. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 1) comprises: cropping the input data using the existing bounding-box labels of the image modality to eliminate the interference of background noise; meanwhile, in order to process the video and audio information with ResNet, frames are extracted from the video and the audio is converted into pictures by the short-time Fourier transform, so that these pictures serve as the input of ResNet.
3. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 3) comprises: to obtain the local features of a text, N-gram information of the text is extracted with convolution kernels of different sizes, then the most critical information captured by each convolution is highlighted by a max-pooling operation, the pooled features are concatenated and combined through a fully connected layer, and finally the model is trained with a cross-entropy loss function.
4. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 5) comprises: for the extracted modality-specific features, the features of each modality are divided into four channels and then fused and recombined to obtain the new modality-specific features f'_I, f'_V, f'_A, f'_T.
5. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 6) comprises: the cross-modal fine-grained retriever predicts, from the features obtained in step 5), which category the input instance belongs to, obtaining the category prediction loss of each modality; by minimizing the category prediction loss, the model classifies instances by category regardless of modality, providing the precondition for subsequent retrieval tasks.
6. The cross-modal fine-grained retrieval method based on multi-channel fusion according to claim 1, wherein step 7) comprises: the cross-modal fine-grained retriever retrieves according to the classification results obtained in step 6); since the classification in step 6) is carried out by category without considering modality, instances of the same category in different modalities are grouped together, so that after a user inputs an instance of a certain category in a certain modality, the system can retrieve and return instances of the other modalities that belong to the same category as the input sample, realizing the cross-modal fine-grained retrieval task.
CN202311663327.2A 2023-12-05 2023-12-05 Cross-modal fine granularity retrieval method based on multi-channel fusion Pending CN118113888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311663327.2A CN118113888A (en) 2023-12-05 2023-12-05 Cross-modal fine granularity retrieval method based on multi-channel fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311663327.2A CN118113888A (en) 2023-12-05 2023-12-05 Cross-modal fine granularity retrieval method based on multi-channel fusion

Publications (1)

Publication Number Publication Date
CN118113888A true CN118113888A (en) 2024-05-31

Family

ID=91209511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311663327.2A Pending CN118113888A (en) 2023-12-05 2023-12-05 Cross-modal fine granularity retrieval method based on multi-channel fusion

Country Status (1)

Country Link
CN (1) CN118113888A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination