CN113656539A - Cross-modal retrieval method based on feature separation and reconstruction - Google Patents

Cross-modal retrieval method based on feature separation and reconstruction

Info

Publication number: CN113656539A
Application number: CN202110859387.6A
Authority: CN (China)
Prior art keywords: text, image, information, modal, reconstruction
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113656539B (en)
Inventors: 邬向前, 卜巍, 张力
Current Assignee: Harbin Institute of Technology
Original Assignee: Harbin Institute of Technology
Application filed by Harbin Institute of Technology
Priority/filing date: 2021-07-28 (application CN202110859387.6A)
Publication date: 2021-11-16 (CN113656539A)
Grant date: 2023-08-18 (CN113656539B)


Classifications

    • G06F16/334: Query execution (information retrieval of unstructured textual data; querying; query processing)
    • G06F16/383: Retrieval characterised by using metadata automatically derived from the content (unstructured textual data)
    • G06F16/53: Querying (information retrieval of still image data)
    • G06F16/583: Retrieval characterised by using metadata automatically derived from the content (still image data)
    • G06N20/20: Ensemble learning (machine learning)
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/08: Learning methods (neural networks)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on feature separation and reconstruction, which comprises the following steps: step one, obtaining a visual representation; step two, converting the word sequence into a text representation; step three, performing linear transformations through visual and text multilayer perceptrons to respectively obtain feature vectors of the visual space and the text space; step four, decomposing the feature vectors of the different modal spaces into modal information, semantic information and specific information; step five, separating the modal information, semantic information and specific information from the visual/text representations by using a feature separation module, thereby obtaining the modal information, semantic information and specific information of the visual representation and of the text representation; step six, combining the three different kinds of information of the image to reconstruct the image; and step seven, reconstructing the text from the three different kinds of information of the text. The method of the invention performs cross-modal retrieval well and obtains competitive results on multiple databases.

Description

Cross-modal retrieval method based on feature separation and reconstruction
Technical Field
The invention relates to a cross-modal retrieval method, in particular to a cross-modal retrieval method based on feature separation and reconstruction.
Background
With the rapid development of multimedia, there is a great deal of information on the Internet, such as images, text, video and audio. It is becoming increasingly difficult to manually obtain useful information across different modalities in such massive data. Naturally, a powerful method is needed to help us obtain the text, images or videos we need. Cross-modal retrieval takes data of one modality as a query and retrieves related data of another modality. For example, text can be used to retrieve images of interest (as in Google Image Search), or an image can be used to retrieve the corresponding text. Of course, the modalities are not limited to images and text; other modalities such as voice, physical signals and video can also take part in cross-modal retrieval.
Disclosure of Invention
In order to better perform cross-modal retrieval, the invention provides a cross-modal retrieval method based on feature separation and reconstruction.
The purpose of the invention is realized by the following technical scheme:
a cross-modal retrieval method based on feature separation and reconstruction comprises the following steps:
step one, for the image part of an image-text pair, using a ResNet152 network as the basic image network of the image branch, taking the image in the image-text pair as the input of the image branch, and extracting image features directly from the penultimate fully-connected layer to obtain a visual representation v;
step two, for the text part of the image-text pair, using word encoding to encode each token into a word vector, and then using a GRU as the basic text network of the text branch to convert the word sequence into a text representation l;
step three, after respectively obtaining the visual representation v and the text representation l, carrying out linear transformations through visual and text multilayer perceptrons to respectively obtain the feature vectors of the visual space and the text space;
step four, decomposing the feature vectors of different modal spaces into modal information, semantic information and specific information, wherein:
(1) modal information mo, characterizing the source of the feature vector;
(2) semantic information se representing high-level semantics represented by the feature vector;
(3) specific information sp representing information specific to different modal characteristics;
step five, separating the modal information, the semantic information and the specific information from the visual/text representations by using a feature separation module to obtain the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l;
step six, adopting the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively, introducing a discriminator D_2 to determine whether the generated image is consistent in content with the real image, and combining the three different kinds of information of the image (v_mo; v_se; v_sp) to carry out image reconstruction;
step seven, decoding the three different kinds of information of the text (l_mo; l_se; l_sp) by using an RNN to reconstruct the text and finally generate a complete sentence.
Compared with the prior art, the invention has the following advantages:
1. the invention decomposes feature vectors from different modal spaces into three parts: modality information, semantic information, and specific information.
2. The invention introduces feature separation in the traditional cross-modal retrieval task to deal with information asymmetry between different modalities, and supervises different parts of feature vectors by using different loss functions.
3. The invention also introduces image reconstruction and text reconstruction tasks that respectively combine the three different kinds of information of the image and of the text, improves the performance of the cross-modal retrieval task through multi-task joint learning, and has good robustness.
4. The method of the invention can well carry out cross-modal retrieval and obtain competitive results on a plurality of databases.
Drawings
FIG. 1 is a cross-modal search flow chart based on feature separation and reconstruction in accordance with the present invention;
FIG. 2 is a diagram illustrating the reconstruction of an input image and text from a given image text pair according to the present invention;
FIG. 3 is the visualized result of text retrieval for a given image query on the MS-COCO dataset by the method of the present invention;
FIG. 4 is the visualized result of image retrieval for a given text query on the MS-COCO dataset by the method of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a cross-modal retrieval method based on feature separation and reconstruction, and fig. 1 shows the overall structure of the whole network, which can be roughly divided into three parts, specifically as follows:
the first part is: a multi-modal feature extraction network.
The invention follows the common space learning approach to extract the features of the image and the text. Formally, given an arbitrary image-text pair (x, y), where x is the image and y = (w_0, ..., w_t, ..., w_{T-1}) is a text sentence, w_0, w_t and w_{T-1} respectively denote the one-hot codes (tokens) of the 1st, (t+1)-th and T-th words, and T denotes the sentence length. The text is encoded by embedding each token into a distributed representation W_e w_t ∈ R^K, where K = 300 and W_e is the word2vec embedding matrix. This variable-length word sequence is then converted into a meaningful, fixed-size text representation l ∈ R^d using a GRU as the basic text network (only the output of the last step of the GRU is needed as the text representation of the entire sentence), where d = 3072. As for image encoding, in order to accommodate variable-size images and benefit from the performance of very deep architectures, the invention relies on the full convolutional residual network ResNet-152 (pre-trained on ImageNet) as the basic image network and extracts image features directly from the penultimate fully-connected layer FC7 to obtain the visual representation v ∈ R^d. For ResNet152, the dimension of the image embedding is 2048.
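By way of a non-limiting illustration, this feature extraction can be sketched in Python with PyTorch as follows; the use of torchvision, the pooled 2048-dimensional ResNet-152 feature standing in for the "FC7" feature, the single-layer GRU (the invention uses four stacked 3072-dimensional layers), and a trainable embedding standing in for the word2vec matrix W_e are all assumptions of this sketch, and the invention's customized ResNet152 variant with additional modules is not reproduced here.

import torch
import torch.nn as nn
import torchvision.models as models

class ImageTextEncoders(nn.Module):
    """Sketch of the basic image and text networks (layer choices are assumptions)."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=3072):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the final classification layer; keep the pooled 2048-d feature.
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.img_fc = nn.Linear(2048, embed_dim)             # map the image feature to the joint dimension d = 3072
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # stands in for the word2vec matrix W_e (K = 300)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, images, token_ids):
        v = self.cnn(images).flatten(1)   # (B, 2048) image feature
        v = self.img_fc(v)                # (B, 3072) visual representation v
        w = self.word_emb(token_ids)      # (B, T, 300) embedded word sequence
        _, h = self.gru(w)                # h: (1, B, 3072), last step of the GRU
        l = h[-1]                         # (B, 3072) text representation l
        return v, l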
In the invention, the ResNet152 network comprises six convolution modules, Pool5, Rec5b, Rec4f, Rec3d, Rec2c and Pool1, six additional modules, Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A, connected after the respective convolution modules, and 3 fully-connected layers connected in series after all convolution modules, wherein Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A have the same structure: 3 convolutional layers, 1 Concate layer, 1 Correlation layer, 1 sigmoid layer, and 3 fully-connected layers.
The second part is that: the specific structure of the feature separation module.
After the visual representation and the text representation are obtained separately, the invention decomposes the feature vectors of the different modality spaces into three parts, owing to the asymmetry between different modalities: (1) modal information, which characterizes the source of the feature vector. The modal information of feature vectors from the same modality should be as close as possible; conversely, the modal information of feature vectors from different modalities should be as far apart as possible. (2) Semantic information, which characterizes the high-level semantics represented by the feature vector. The semantic information of feature vectors from different modalities that have the same or related semantics (i.e. a pair of samples in the database) should be as close to each other as possible; conversely, it should be as far apart as possible for unmatched samples. The invention uses the image semantic information vector and the text semantic information vector to carry out cross-modal retrieval. (3) Specific information, which characterizes information specific to the features of each modality. For example, the feature vectors of the image modality carry image-specific detail information (the details of different images differ significantly), which is not present in the feature vectors of the corresponding text modality. The invention uses a feature separation module to separate the modal information, semantic information and specific information from the visual/text representation.
Specifically, the invention inputs the visual representation v into MLP_v and the text representation l into MLP_l, applies linear transformations with the learnable parameters of the multilayer perceptrons followed by normalization, and finally obtains the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l, as shown in the following formulas:

(v_mo, v_se, v_sp) = Norm(MLP_v(v; θ_MLP_v))

(l_mo, l_se, l_sp) = Norm(MLP_l(l; θ_MLP_l))

where MLP_v denotes the visual multilayer perceptron with learnable parameters θ_MLP_v, whose outputs are the visual modal information, visual semantic information and visual specific information before normalization Norm(·); and MLP_l denotes the text multilayer perceptron with learnable parameters θ_MLP_l, whose outputs are the text modal information, text semantic information and text specific information before normalization.
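By way of a non-limiting illustration, a minimal feature separation module matching the formulas above can be sketched as follows; the single linear layer per branch, the equal three-way split of the output, and the choice of L2 normalization for Norm(·) are assumptions of the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSeparation(nn.Module):
    """Splits each modality's feature vector into modal / semantic / specific parts (layout assumed)."""
    def __init__(self, dim=3072):
        super().__init__()
        self.mlp_v = nn.Linear(dim, 3 * dim)   # visual multilayer perceptron MLP_v (depth assumed)
        self.mlp_l = nn.Linear(dim, 3 * dim)   # text multilayer perceptron MLP_l (depth assumed)

    @staticmethod
    def _split_norm(z):
        z_mo, z_se, z_sp = torch.chunk(z, 3, dim=-1)
        # The patent only states "normalization"; L2 normalization is assumed here.
        return F.normalize(z_mo, dim=-1), F.normalize(z_se, dim=-1), F.normalize(z_sp, dim=-1)

    def forward(self, v, l):
        v_mo, v_se, v_sp = self._split_norm(self.mlp_v(v))
        l_mo, l_se, l_sp = self._split_norm(self.mlp_l(l))
        return (v_mo, v_se, v_sp), (l_mo, l_se, l_sp)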
And a third part: image and text reconstruction.
The invention constructs image reconstruction and text reconstruction tasks that respectively combine the three different kinds of information of the image and of the text, and improves the performance of the cross-modal retrieval task through multi-task joint learning. The goal of image reconstruction is to encourage the three different kinds of information in the visual representation to generate an image that resembles the real image. A generative adversarial network (GAN), which consists of a generator and a discriminator, can be used to generate the image. The invention adopts the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively. Since D_1 can only distinguish real images from fake images, it cannot guarantee that the generated image is consistent in content with the real image. The invention therefore also introduces a discriminator D_2 to determine whether the generated image is consistent in content with the real image, as shown in the following formulas:

x' = G([v_mo; v_se; v_sp]; θ_G)

p = D_1(n; θ_D1)

q = D_2(n; θ_D2)

where x' denotes the image generated by the generator, n denotes an image input to the discriminators, p and q denote the probabilities, predicted by D_1 and D_2 respectively, that n belongs to x', and θ denotes the corresponding learnable parameters.
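By way of a non-limiting illustration, the generator/discriminator wiring can be sketched as follows; the DCGAN-style layer stacks, the 32x32 output resolution, the equal 1024-dimensional parts, and feeding D_2 the generated image concatenated with the real image as its content reference are assumptions of this sketch, not details given by the patent.

import torch
import torch.nn as nn

class ImageReconstruction(nn.Module):
    """x' = G([v_mo; v_se; v_sp]); D_1 judges real/fake, D_2 judges content consistency (wiring assumed)."""
    def __init__(self, part_dim=1024, img_channels=3):
        super().__init__()
        z_dim = 3 * part_dim
        self.G = nn.Sequential(  # stand-in for the DCGAN generator
            nn.ConvTranspose2d(z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )
        self.D1 = self._disc(img_channels)       # real vs. fake
        self.D2 = self._disc(img_channels * 2)   # content consistency: sees a (candidate, reference) pair

    @staticmethod
    def _disc(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 1, 4, 1, 0), nn.Sigmoid(),
        )

    def forward(self, v_mo, v_se, v_sp, real_img):
        # real_img is assumed to be resized to the generator's 32x32 output size.
        z = torch.cat([v_mo, v_se, v_sp], dim=-1)[:, :, None, None]   # (B, 3*part_dim, 1, 1)
        x_fake = self.G(z)                                            # reconstructed image x'
        p = self.D1(x_fake)                                           # per-patch probability of being real
        q = self.D2(torch.cat([x_fake, real_img], dim=1))             # per-patch probability of content match
        return x_fake, p, q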
The goal of text reconstruction is to encourage the three different kinds of information of the text representation to generate a real sentence. The invention uses an RNN to decode the three different kinds of information of the text and finally generates a complete sentence, as shown in the following formula:

y' = FC_2(RNN([l_mo; l_se; l_sp], W_e; θ_RNN); θ_FC2)

where y' denotes the probability distribution of the reconstructed sentence, W_e denotes the word2vec embedding matrix, θ_RNN denotes the learnable parameters of the RNN, FC_2 denotes a fully-connected layer, and θ_FC2 denotes the learnable parameters of FC_2.
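By way of a non-limiting illustration, such an RNN decoder can be sketched as follows; initializing the hidden state from [l_mo; l_se; l_sp], teacher forcing with the ground-truth sentence, and a trainable embedding standing in for W_e are assumptions of the sketch.

import torch
import torch.nn as nn

class TextReconstruction(nn.Module):
    """Decodes [l_mo; l_se; l_sp] into a per-step word probability distribution y' (conditioning assumed)."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=3072, part_dim=1024):
        super().__init__()
        self.init_fc = nn.Linear(3 * part_dim, hidden_dim)   # maps the concatenated text parts to the initial state
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # stands in for W_e
        self.rnn = nn.RNN(word_dim, hidden_dim, batch_first=True)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)         # FC_2: hidden state -> vocabulary logits

    def forward(self, l_mo, l_se, l_sp, target_tokens):
        h0 = torch.tanh(self.init_fc(torch.cat([l_mo, l_se, l_sp], dim=-1))).unsqueeze(0)
        emb = self.word_emb(target_tokens)   # teacher forcing with the ground-truth sentence
        out, _ = self.rnn(emb, h0)
        logits = self.fc2(out)               # (B, T, vocab); a softmax over the last axis gives y'
        return logits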
In the invention, a loss function L is used as the optimization objective to train the whole network (including the basic image network, the basic text network, the feature separation module, the image reconstruction network and the text reconstruction network), as shown in the following formula:

L = L_mo + L_se + L_sp + L_img-re + L_cap-re

where the modal loss L_mo, the semantic loss L_se (a triplet ranking loss with hard negative sample mining), the specific loss L_sp, the image reconstruction loss L_img-re and the text reconstruction loss L_cap-re are defined as follows.
[Equations defining the modal loss L_mo]

where N denotes the number of sample pairs in the database, v_mo^i denotes the i-th visual modal information feature, v̄_mo denotes the mean vector of the v_mo^i, l_mo^i denotes the i-th text modal information feature, l̄_mo denotes the mean vector of the l_mo^i, and ||·||_2 denotes the two-norm.
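The equations for L_mo appear only as figures in the original text; purely as an assumed illustration consistent with the symbols listed above (same-modality modal features pulled toward their mean, the two modality means kept apart by a hinge with an assumed margin), a sketch could be:

import torch

def modal_loss(v_mo, l_mo, margin=1.0):
    """Assumed form of L_mo: intra-modality compactness plus inter-modality separation of the means."""
    v_mean = v_mo.mean(dim=0, keepdim=True)   # mean vector of the visual modal features
    l_mean = l_mo.mean(dim=0, keepdim=True)   # mean vector of the text modal features
    intra = (v_mo - v_mean).norm(dim=-1).mean() + (l_mo - l_mean).norm(dim=-1).mean()
    inter = torch.relu(margin - (v_mean - l_mean).norm())   # hinge keeping the two modality means apart
    return intra + inter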
[Equations defining the semantic loss L_se, a triplet ranking loss with hard negative sample mining]

where s(·, ·) denotes the cosine similarity of two vectors, v_se^j denotes the j-th visual semantic information feature, and l_se^j denotes the j-th text semantic information feature.
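The equations for L_se likewise appear only as figures; since the text describes it as a triplet ranking loss with hard negative sample mining over cosine similarities, a sketch in that spirit (the margin value and sum reduction are assumptions) is:

import torch
import torch.nn.functional as F

def semantic_loss(v_se, l_se, margin=0.2):
    """Triplet ranking loss with hardest negatives over cosine similarities; margin value assumed."""
    v = F.normalize(v_se, dim=-1)
    t = F.normalize(l_se, dim=-1)
    scores = v @ t.t()                      # similarity matrix s(v_se^i, l_se^j)
    pos = scores.diag().view(-1, 1)         # similarities of the matching pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image query, text negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text query, image negatives
    return cost_i2t.max(dim=1)[0].sum() + cost_t2i.max(dim=0)[0].sum()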
[Equation defining the specific loss L_sp]

where v_sp^i denotes the i-th visual specific information feature, l_sp^i denotes the i-th text specific information feature, and T denotes transposition.
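The equation for L_sp also appears only as a figure; given that it involves the paired visual and text specific features and a transposition, one assumed illustration is an orthogonality-style penalty on their inner products:

import torch

def specific_loss(v_sp, l_sp):
    """Assumed form of L_sp: penalize overlap between paired visual and text specific features."""
    inner = (v_sp * l_sp).sum(dim=-1)   # (v_sp^i)^T l_sp^i for each pair i
    return (inner ** 2).mean()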
[Equations defining the image reconstruction loss L_img-re]

where x denotes an image drawn from the database, x' denotes an image produced by the image generator G, and x'' denotes an image produced by G with part of the information used for x' removed.
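The adversarial objectives are likewise only available as figures; an assumed, standard instantiation in which D_1 separates real from generated images and D_2 separates content-matched from content-mismatched pairs is:

import torch

def d_losses(p_real, p_fake, q_match, q_mismatch):
    """Assumed GAN losses: D_1 distinguishes real/fake, D_2 distinguishes matched/mismatched content."""
    eps = 1e-8
    loss_d1 = -(torch.log(p_real + eps).mean() + torch.log(1 - p_fake + eps).mean())
    loss_d2 = -(torch.log(q_match + eps).mean() + torch.log(1 - q_mismatch + eps).mean())
    return loss_d1, loss_d2

def g_loss(p_fake, q_fake_match):
    """The generator tries to fool both discriminators."""
    eps = 1e-8
    return -(torch.log(p_fake + eps).mean() + torch.log(q_fake_match + eps).mean())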
[Equation defining the text reconstruction loss L_cap-re]

where w_t is the one-hot code of a word in the sentence y, and p(w_t) is the probability distribution over w_t generated by the RNN.
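As an assumed illustration of the caption reconstruction term and of combining the five terms into the overall objective L (equal weighting assumed, since no weights are stated):

import torch
import torch.nn.functional as F

def caption_loss(logits, target_tokens):
    """Cross-entropy between the decoder's word distributions and the ground-truth words w_t."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

def total_loss(l_mo, l_se, l_sp, l_img_re, l_cap_re):
    """L = L_mo + L_se + L_sp + L_img-re + L_cap-re (equal weighting assumed)."""
    return l_mo + l_se + l_sp + l_img_re + l_cap_re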
The experimental results are as follows:
the dimension of word embedding entered into the GRU by the present invention is set to 300. The present invention does not limit sentence length because it has little impact on the memory of the graphics processor. Both the GRU and RNN of the present invention are trained ab initio with 3072-dimensional four stacked hidden layers. The invention sets the dimension of the joint embedding space to 3072. The present invention trains 30 epochs using an ADAM optimizer. The learning rate for the first 15 epochs was 0.0002, and the last 15 epochs reduced the learning rate to 0.00002. For test set evaluation, the present invention solves the overfitting problem by selecting the model checkpoint that performs best on the validation set. The best model is selected based on the total number of recalls in the validation set.
To verify the performance of the proposed cross-modal retrieval method, the invention evaluates it on two standard public databases, MS-COCO and Flickr30K.
Fig. 2 shows the reconstruction results of the method of the invention given input image-text pairs. The first and second columns show the image-text pairs input into the network, and the third and fourth columns show the corresponding reconstruction results of the invention. As can be seen from Fig. 2, the method of the invention can reconstruct similar images and texts from the input image-text pair.
Table 1 shows the quantitative evaluation results, under the evaluation criteria R@1, R@5 and R@10, of the method of the invention and the 15 best existing cross-modal retrieval methods on the MS-COCO dataset. Due to hardware limitations, the invention could not use the BatchSize adopted by most methods in their experiments, such as 100, 128 or 160; the invention selects VSE++ as the baseline and re-runs its code under the existing hardware conditions (BatchSize 24). The method of the invention yields significant improvements in image-to-text retrieval (+3.4 R@1, +1.9 R@5 and +1.0 R@10) and text-to-image retrieval (+1.7 R@1, +1.2 R@5 and +0.3 R@10). As for the remaining results in Table 1, several comparison methods give better performance because they use larger BatchSizes, whereas the invention sets BatchSize to 24 on a single RTX 2080Ti (11G); setting the same BatchSize as those methods would require more GPUs. It can be expected that the method of the invention would achieve even better performance on the cross-modal retrieval task with BatchSize set to 64, 128, 160, etc.
TABLE 1 comparison of the experimental results with the best current cross-modal search results on the MS-COCO database
Table 2 shows the quantitative evaluation results of the method of the invention on the Flickr30K dataset. The invention selects VSE++ as the baseline and re-runs its code under the existing hardware conditions (BatchSize 24). The method of the invention yields significant improvements in image-to-text retrieval (+3.3 R@1, +2.7 R@5 and +2.0 R@10) and text-to-image retrieval (+3.2 R@1, +2.5 R@5 and +1.3 R@10). It can be expected that the method of the invention would achieve even better performance on the cross-modal retrieval task with BatchSize set to 64, 128, 160, etc.
TABLE 2 comparison of the results of the experiment with the best current results of cross-modality search on the Flickr30K database
FIG. 3 shows visualized results of text retrieval by the invention for given image queries on the MS-COCO dataset, and FIG. 4 shows visualized results of image retrieval by the invention for given text queries on the MS-COCO dataset. For each image query, the top 5 retrieved texts are presented, ranked according to the similarity scores predicted by the method of the invention. For each text query, the top 3 retrieved images, ranked by similarity score, are presented. True matching samples are marked in blue and false matching samples in red.
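Since retrieval ranks candidates of the other modality by the similarity of the semantic information vectors, and cosine similarity is the similarity used for the semantic loss (its use at retrieval time is assumed here), the R@K evaluation can be sketched as follows; the sketch assumes a one-to-one pairing between queries and gallery items, whereas MS-COCO in fact provides five captions per image.

import torch
import torch.nn.functional as F

def recall_at_k(query_se, gallery_se, ks=(1, 5, 10)):
    """Ranks gallery items for each query by cosine similarity; assumes query i matches gallery item i."""
    q = F.normalize(query_se, dim=-1)
    g = F.normalize(gallery_se, dim=-1)
    ranks = (q @ g.t()).argsort(dim=1, descending=True)           # best-to-worst gallery indices per query
    target = torch.arange(q.size(0)).unsqueeze(1)                 # ground-truth index of each query
    hit_pos = (ranks == target).float().argmax(dim=1)             # position of the true match in the ranking
    return {k: (hit_pos < k).float().mean().item() for k in ks}   # fraction of queries with a hit in the top k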

Claims (4)

1. A cross-modal retrieval method based on feature separation and reconstruction is characterized by comprising the following steps:
step one, for the image part of an image-text pair, using a ResNet152 network as the basic image network of the image branch, taking the image in the image-text pair as the input of the image branch, and extracting image features directly from the penultimate fully-connected layer to obtain a visual representation v;
step two, for the text part of the image-text pair, using word encoding to encode each token into a word vector, and then using a GRU as the basic text network of the text branch to convert the word sequence into a text representation l;
step three, after respectively obtaining the visual representation v and the text representation l, carrying out linear transformations through visual and text multilayer perceptrons to respectively obtain the feature vectors of the visual space and the text space;
step four, decomposing the feature vectors of different modal spaces into modal information, semantic information and specific information, wherein:
(1) modal information mo, characterizing the source of the feature vector;
(2) semantic information se representing high-level semantics represented by the feature vector;
(3) specific information sp representing information specific to different modal characteristics;
step five, separating the modal information, the semantic information and the specific information from the visual/text representations by using a feature separation module to obtain the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l;
step six, adopting the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively, introducing a discriminator D_2 to determine whether the generated image is consistent in content with the real image, and combining the three different kinds of information of the image (v_mo; v_se; v_sp) to carry out image reconstruction;
step seven, decoding the three different kinds of information of the text (l_mo; l_se; l_sp) by using an RNN to reconstruct the text and finally generate a complete sentence.
2. The method of claim 1, wherein the ResNet152 network comprises six convolution modules, Pool5, Rec5b, Rec4f, Rec3d, Rec2c and Pool1, six additional modules, Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A, connected after the respective convolution modules, and 3 fully-connected layers connected in series after all convolution modules, wherein Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A have the same structure: 3 convolutional layers, 1 Concate layer, 1 Correlation layer, 1 sigmoid layer, and 3 fully-connected layers.
3. The cross-modal retrieval method based on feature separation and reconstruction as claimed in claim 1, wherein in the sixth step, the image reconstruction formulas are as follows:

x' = G([v_mo; v_se; v_sp]; θ_G)

p = D_1(n; θ_D1)

q = D_2(n; θ_D2)

where x' denotes the image generated by the generator, n denotes an image input to the discriminators, p and q denote the probabilities, predicted by D_1 and D_2 respectively, that n belongs to x', and θ denotes the corresponding learnable parameters.
4. The cross-modal retrieval method based on feature separation and reconstruction as claimed in claim 1, wherein in the seventh step, the formula for text reconstruction is as follows:

y' = FC_2(RNN([l_mo; l_se; l_sp], W_e; θ_RNN); θ_FC2)

where y' denotes the probability distribution of the reconstructed sentence, W_e denotes the word2vec embedding matrix, θ_RNN denotes the learnable parameters of the RNN, FC_2 denotes a fully-connected layer, and θ_FC2 denotes the learnable parameters of FC_2.
CN202110859387.6A 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction Active CN113656539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110859387.6A CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110859387.6A CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Publications (2)

Publication Number Publication Date
CN113656539A true CN113656539A (en) 2021-11-16
CN113656539B CN113656539B (en) 2023-08-18

Family

ID=78478872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110859387.6A Active CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Country Status (1)

Country Link
CN (1) CN113656539B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784118A (en) * 2017-11-14 2018-03-09 北京林业大学 A kind of Video Key information extracting system semantic for user interest
WO2020042597A1 (en) * 2018-08-31 2020-03-05 深圳大学 Cross-modal retrieval method and system
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint
CN111428071A (en) * 2020-03-26 2020-07-17 电子科技大学 Zero-sample cross-modal retrieval method based on multi-modal feature synthesis
CN112800292A (en) * 2021-01-15 2021-05-14 南京邮电大学 Cross-modal retrieval method based on modal specificity and shared feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田薪: "Research on image-text cross-modal retrieval based on deep hash learning", China Master's Theses Full-text Database (Information Science and Technology), vol. 2021, no. 2, pages 138-2644 *

Also Published As

Publication number Publication date
CN113656539B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Chen et al. Shallowing deep networks: Layer-wise pruning based on feature representations
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113312500A (en) Method for constructing event map for safe operation of dam
CN110298395B (en) Image-text matching method based on three-modal confrontation network
CN111400494B (en) Emotion analysis method based on GCN-Attention
Landthaler et al. Extending full text search for legal document collections using word embeddings
Zhang et al. Learning implicit class knowledge for RGB-D co-salient object detection with transformers
CN112687388A (en) Interpretable intelligent medical auxiliary diagnosis system based on text retrieval
JP2022547163A (en) Spatio-temporal interactions for video comprehension
Chen et al. HCP-MIC at VQA-Med 2020: Effective Visual Representation for Medical Visual Question Answering.
Li et al. Adapting clip for phrase localization without further training
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN116662502A (en) Method, equipment and storage medium for generating financial question-answer text based on retrieval enhancement
CN113807307B (en) Multi-mode joint learning method for video multi-behavior recognition
Zhao et al. Deeply supervised active learning for finger bones segmentation
Xia et al. Destruction and reconstruction learning for facial expression recognition
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Mohamed et al. ImageCLEF 2020: An approach for Visual Question Answering using VGG-LSTM for Different Datasets.
CN112329441A (en) Legal document reading model and construction method
CN116524915A (en) Weak supervision voice-video positioning method and system based on semantic interaction
CN113656539B (en) Cross-modal retrieval method based on feature separation and reconstruction
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN116227486A (en) Emotion analysis method based on retrieval and contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant