CN116821391A - Cross-modal image-text retrieval method based on multi-level semantic alignment - Google Patents
Cross-modal image-text retrieval method based on multi-level semantic alignment
- Publication number
- CN116821391A (application number CN202310855462.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- feature
- global
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/532—Query formulation, e.g. graphical querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A cross-modal image-text retrieval method based on multi-level semantic alignment, belonging to the technical fields of cross-modal retrieval and artificial intelligence. The method provides a simple, symmetric network architecture for encoding image and text features that jointly considers global-global, global-local and local-local semantic alignment. By introducing an inter-modal fine-grained feature interaction attention network and intra-modal fusion networks for features of different granularities, it realizes fusion and interaction of multi-granularity features at multiple levels, addressing two weaknesses of existing cross-modal retrieval work: weak multi-granularity feature interaction, and difficulty in distinguishing image-text pairs with similar image region features or similar text semantics. In addition, the method trains with a triplet ranking loss built on the multi-level semantic matching total score and an adaptive margin value, achieving better cross-modal semantic alignment and markedly improving the accuracy of cross-modal image-text retrieval.
Description
Technical Field
The invention relates to the technical fields of cross-modal retrieval and artificial intelligence, and in particular to a cross-modal image-text retrieval method based on multi-level semantic alignment.
Background
Cross-modal image-text retrieval is an actively developing research direction in multi-modal learning. It performs bidirectional retrieval between images and texts and is widely used in information retrieval, recommendation, and related fields. With the arrival of the big-data era and the growth of the internet, multi-modal data dominated by images and text is increasing exponentially, and effectively fusing and aligning large volumes of multi-source heterogeneous image and text data to meet users' diverse retrieval needs is a challenging task. Many research efforts have explored efficient interaction mechanisms between cross-modal image-text pairs; among them, deep-learning methods based on deep neural networks have shown great potential in cross-modal retrieval and achieved notable results. However, most cross-modal retrieval research still suffers from weak cross-modal feature interaction or a lack of intra-modal semantic alignment, and has difficulty distinguishing image-text pairs whose images share similar local regions or whose texts have similar descriptions.
Existing cross-modal retrieval research falls mainly into global-level and local-level feature alignment methods. Global-level methods learn a two-path deep neural network: a convolutional neural network (CNN) and a recurrent neural network (RNN) first extract global features of the image and the text respectively, and the global features of the two modalities are then mapped into a joint representation space to obtain cross-modal global correlation. However, such coarse-grained alignment ignores local relationships between modalities and lacks fine interaction between image regions and text words, so its retrieval results are poor. Because a text usually describes only certain locally salient regions of an image, and those regions and their corresponding descriptive words are critical to retrieval, later research has focused on local-level feature alignment. These methods extract fine-grained features of image regions and text words and model the associations between local fragments of the two modalities, realizing fusion and alignment of cross-modal fine-grained features. Local-level feature matching improves the accuracy of cross-modal retrieval to some extent, but it directly converts the global association of a whole image-text pair into local similarities between regions and words; by over-emphasizing local detail it ignores global information and therefore cannot distinguish the same word used with different meanings in different contexts. Moreover, for lack of multi-granularity feature fusion and multi-level semantic interaction, current global-level and local-level alignment methods have difficulty distinguishing image-text pairs with similar local image regions or similar text semantics. In addition, most current work uses a triplet ranking loss with a fixed margin value; this loss performs well when image visual features differ greatly and text semantics differ, but performs poorly on hard cases where image features and text semantics are very similar.
In summary, current cross-modal image-text retrieval methods mainly have the following problems: 1. existing retrieval models lack multi-granularity feature fusion and multi-level semantic alignment of images and texts, so they have difficulty distinguishing image-text pairs with similar local image regions or similar text semantics; 2. a triplet ranking loss based on a fixed margin value prevents the model from further distinguishing hard cases during training. The present invention addresses these two problems so that cross-modal features can be fully aligned and can interact at multiple levels.
Disclosure of Invention
The invention aims to solve the poor retrieval performance caused by insufficient multi-granularity feature interaction in existing image-text retrieval methods. It provides a multi-level semantic alignment cross-modal retrieval method that simultaneously considers global-global, global-local and local-local matching, and further improves the accuracy of cross-modal retrieval through a triplet ranking loss with an adaptive margin value, giving it wide application value.
In order to achieve the above purpose, the invention provides a cross-modal image-text retrieval method based on multi-level semantic alignment, which comprises the following steps:
Step one, collect a cross-modal image-text retrieval data set. Images and their corresponding text descriptions are collected as the data set; each image and one of its corresponding text descriptions form an image-text pair, and all collected pairs are divided into a training set, a validation set and a test set according to a given rule;
Step two, extract features of the image-text pairs. For the image in each pair, the object detector Faster R-CNN extracts K region features to obtain the local fine-grained image features V_l, and the convolutional neural network ResNet152 extracts the global coarse-grained feature V_g of the image. For the text in each pair, the bidirectional gated recurrent unit BiGRU extracts word features to obtain the local fine-grained text features Y_l; global average pooling of Y_l then yields the global coarse-grained text feature Y_g. Finally, the cosine similarity between V_g and Y_g gives the global-global level (GGl) feature matching score S_GGl;
Step three, build an inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-path symmetric structure; each path takes as input the local fine-grained image features V_l and local fine-grained text features Y_l obtained in step two. First, the network computes the dot-product correlation s_ij between the i-th image region and the j-th text word. Two local correlation matrices are then obtained from these correlations, one with the image region as query against the text words and one with the text word as query against the image regions, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2. A Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions. The coefficients δ_ij and γ_ij are then used to weight the local fine-grained features V_l and Y_l, producing the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction. Finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
Step four, build intra-modal fusion networks for features of different granularities. The feature fusion network comprises an intra-image-modality fusion sub-network and an intra-text-modality fusion sub-network, each formed by a multi-head self-attention module connected to a gated fusion unit. The image sub-network takes as input V_g from step two and V_l' from step three. The multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to highly similar regions; its output V_o is then globally average-pooled and sent, together with the global coarse-grained image feature V_g, to the gated fusion unit for selective gated fusion, producing the image embedding V_f that fuses image features of different granularities, which is the output of the image sub-network. Similarly, the text sub-network takes as input Y_g from step two and Y_l' from step three. The multi-head self-attention module computes the similarity between different words in Y_l' and gives higher weight to highly similar words; its output Y_o is globally average-pooled and fused with the global coarse-grained text feature Y_g in the gated fusion unit, producing the text embedding Y_f that fuses text features of different granularities, which is the output of the text sub-network. Finally, the cosine similarity between V_f and Y_f gives the global-local level (GLl) feature matching score S_GLl;
Step five, compute the multi-level semantic matching total score between image-text pairs and train the multi-level semantic alignment model with a triplet ranking loss that has an adaptive margin value. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four. The multi-level semantic matching total score is a weighted sum of the global-global score S_GGl from step two, the local-local score S_LLl from step three and the global-local score S_GLl from step four. The model is trained with a triplet ranking loss whose margin value is adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin value is adapted according to a given rule, otherwise it remains unchanged;
Step six, obtain the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval covers image retrieval and text retrieval: for image retrieval, the image to be retrieved is fed into the trained multi-level semantic alignment model to obtain its multi-level semantic matching total scores against the text descriptions, and the text description with the highest total score is returned as the retrieval result; for text retrieval, the text description to be retrieved is fed into the trained model to obtain its total scores against the images, and the image with the highest total score is returned as the retrieval result. Comparing the retrieved results with the ground truth completes the bidirectional cross-modal image-text retrieval process.
compared with the prior art, the invention has the following beneficial effects:
(1) Compared with existing cross-modal retrieval work, the invention builds a simple, symmetric multi-level semantic alignment network that accounts simultaneously for global-global, global-local and local-local matching, aligning and matching cross-modal features at different granularity levels. It obtains more robust image and text embeddings, allows images and texts in different representation spaces to fuse their features more thoroughly, and greatly improves the accuracy of cross-modal retrieval.
(2) The invention adopts a triplet ranking loss with an adaptive margin value, which helps the model discriminate hard cases, i.e. image-text pairs with similar features, during training; by enforcing stronger matching of positive samples and stronger separation of negative samples, it achieves better cross-modal semantic alignment.
Drawings
FIG. 1 is a flow chart of the cross-modal image-text retrieval method based on multi-level semantic alignment;
FIG. 2 is a block diagram of a multi-level semantic alignment model;
FIG. 3 is a block diagram of a multi-headed self-attention module;
FIG. 4 is a block diagram of a gated fusion unit;
FIG. 5 is an example of bi-directional retrieval results of a trained multi-level semantic alignment model on a Flickr30K open source dataset.
Detailed Description
The invention is described in further detail below with reference to FIG. 1 and a specific example. The cross-modal image-text retrieval method based on multi-level semantic alignment comprises the following steps:
Step one, collect a cross-modal image-text retrieval data set. Images and their corresponding text descriptions are collected as the data set; each image and one of its corresponding text descriptions form an image-text pair, and all collected pairs are divided into a training set, a validation set and a test set according to a given rule.
the collected pairs of images are derived from cross-modal retrieval of open source datasets MS-COCO and Flickr30K, where each image has a corresponding five text descriptions. For the partitioning of the dataset, MS-COCO contained 123287 images in total, using 5000 images and corresponding text descriptions as the validation set, and 5000 images and corresponding text descriptions as the test set, the remaining images and corresponding text descriptions as the training set; the Flickr30K contains 31784 images in total, using 1000 images and corresponding text descriptions as a validation set, another 1000 images and corresponding text descriptions as a test set, and the remaining images and corresponding text descriptions as a training set.
Step two, extract features of the image-text pairs. For the image in each pair, the object detector Faster R-CNN extracts K region features to obtain the local fine-grained image features V_l, and the convolutional neural network ResNet152 extracts the global coarse-grained feature V_g of each image. For the text in each pair, the bidirectional gated recurrent unit BiGRU extracts word features to obtain the local fine-grained text features Y_l; global average pooling of Y_l then yields the global coarse-grained text feature Y_g. Finally, the cosine similarity between V_g and Y_g gives the global-global level (GGl) feature matching score S_GGl.
Step two A: for each image in an image-text pair, a Faster R-CNN object detector with a ResNet101 backbone pre-trained on the open-source dataset Visual Genome extracts K region features, where K is typically 36. A single fully connected layer adjusts all region feature dimensions to d, giving the local fine-grained image features V_l = {v_1, ..., v_K} ∈ R^{K×d}, where v_i is the feature vector of the i-th image region and d is the feature dimension, set to 1024. Meanwhile, a ResNet152 network extracts the global feature of the whole image, and a single fully connected layer adjusts its dimension to d, giving the global coarse-grained image feature V_g ∈ R^d.
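By way of illustration only, the sketch below shows how the single fully connected projections of step two A could look in PyTorch, assuming the region features and the ResNet152 global feature have already been extracted; the 2048-dimensional input sizes, the class name and the random stand-in inputs are assumptions of the example, not details fixed by the invention.

```python
import torch
import torch.nn as nn

class ImageFeatureProjection(nn.Module):
    """Projects pre-extracted region and global image features to dimension d (sketch)."""
    def __init__(self, region_dim=2048, global_dim=2048, d=1024):
        super().__init__()
        self.fc_local = nn.Linear(region_dim, d)    # single fully connected layer for V_l
        self.fc_global = nn.Linear(global_dim, d)   # single fully connected layer for V_g

    def forward(self, region_feats, global_feat):
        # region_feats: (B, K, region_dim) from Faster R-CNN, K = 36 (assumed shapes)
        # global_feat:  (B, global_dim) from ResNet152
        V_l = self.fc_local(region_feats)   # (B, K, d) local fine-grained features
        V_g = self.fc_global(global_feat)   # (B, d)    global coarse-grained feature
        return V_l, V_g

# usage with random stand-ins for the detector / CNN outputs
proj = ImageFeatureProjection()
V_l, V_g = proj(torch.randn(2, 36, 2048), torch.randn(2, 2048))
```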
Step two B: each text in an image-text pair is first segmented into words, and each segmented word is encoded as a one-hot vector. The pre-trained word embedding method GloVe then maps each one-hot vector to a word embedding vector. The word embedding vectors are fed into the bidirectional recurrent network BiGRU to extract the local fine-grained text features Y_l = {y_1, ..., y_L} ∈ R^{L×d}, where L is the number of words after segmentation and y_j is the feature vector of the j-th word, obtained as:

y_j = (h_j^→ + h_j^←) / 2

where t_j is the word embedding vector of the j-th word of the text, h_j^→ and h_j^← are the hidden states of the forward and backward GRU passes over t_j respectively, y_j is the average of the two hidden states, and the feature dimension d takes the same value as for the image.
Finally, global average pooling of the local fine-grained text features Y_l gives the global coarse-grained text feature Y_g:

Y_g = AvgPool(Y_l)

where AvgPool denotes the global average pooling operation.
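By way of illustration only, a minimal PyTorch sketch of the BiGRU word encoder and global average pooling of step two B follows; the vocabulary size, the 300-dimensional GloVe embeddings and all names are assumptions of the example.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """BiGRU word encoder with global average pooling (sketch; dimensions are assumptions)."""
    def __init__(self, vocab_size=10000, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # would be initialized with GloVe vectors
        self.bigru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (B, L) word indices after segmentation
        t = self.embed(token_ids)                  # (B, L, embed_dim) word embeddings
        h, _ = self.bigru(t)                       # (B, L, 2*d) forward/backward hidden states
        half = h.size(-1) // 2
        fwd, bwd = h[..., :half], h[..., half:]
        Y_l = (fwd + bwd) / 2                      # (B, L, d) word feature y_j = mean of both directions
        Y_g = Y_l.mean(dim=1)                      # (B, d) global average pooling
        return Y_l, Y_g

enc = TextEncoder()
Y_l, Y_g = enc(torch.randint(0, 10000, (2, 12)))
```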
Step two C: the cosine similarity between the global coarse-grained image feature V_g obtained in step two A and the global coarse-grained text feature Y_g obtained in step two B gives the global-global level feature matching score S_GGl:

S_GGl = (V_g)^T Y_g / (||V_g|| · ||Y_g||)

where ||·|| denotes the L2 norm and the superscript T denotes the transpose operation.
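An illustrative one-liner for this global-global score, assuming batched feature tensors; the function name is not taken from the invention.

```python
import torch
import torch.nn.functional as F

def global_global_score(V_g, Y_g):
    """S_GGl: cosine similarity between image and text global features (sketch)."""
    return F.cosine_similarity(V_g, Y_g, dim=-1)   # (B,)

S_GGl = global_global_score(torch.randn(2, 1024), torch.randn(2, 1024))
```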
Step three, build an inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-path symmetric structure; each path takes as input the local fine-grained image features V_l and local fine-grained text features Y_l obtained in step two. First, the network computes the dot-product correlation s_ij between the i-th image region and the j-th text word. Two local correlation matrices are then obtained from these correlations, one with the image region as query against the text words and one with the text word as query against the image regions, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2. A Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions. The coefficients δ_ij and γ_ij are then used to weight the local fine-grained features V_l and Y_l, producing the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction. Finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2.
Step three A (see FIG. 1 and FIG. 2): the interaction attention network first computes the dot-product correlation between each image region and text word feature vector pair:

s_ij = (v_i)^T y_j, i ∈ [1, K], j ∈ [1, L]

Two normalized local correlation matrices s_m1 and s_m2 are then obtained from these correlations, with negative correlations clipped by [x]_+ = max(x, 0) before normalization, where s_m1 denotes the normalized correlation matrix between image regions (as query) and text words, and s_m2 denotes the normalized correlation matrix between text words (as query) and image regions.
Then, a Softmax operation over s_m1 and s_m2 yields the weight coefficients δ_ij of the text words when the image region acts as query and the weight coefficients γ_ij of the image regions when the text word acts as query, where exp(·) is the exponential operation and η_1, η_2 are temperature hyper-parameters, set to 4 and 9 respectively.
The coefficients δ_ij and γ_ij are then used to weight the local fine-grained text features Y_l and the local fine-grained image features V_l, giving the local fine-grained text features Y_l' and the local fine-grained image features V_l' after cross-modal feature interaction; these are the outputs of the interaction attention network. Here s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is as defined above.
Step three B: compute the cosine similarity S_1 between Y_l' obtained in step three A and V_l obtained in step two A, and the cosine similarity S_2 between V_l' obtained in step three A and Y_l obtained in step two B. The local-local level feature matching score S_LLl is the average of S_1 and S_2:

S_LLl = (S_1 + S_2) / 2
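By way of illustration only, the sketch below follows the flow of step three for a single image-text pair under SCAN-style assumptions: the exact normalization of s_m1 and s_m2, the assignment of the two temperatures, and the pooling of the per-region and per-word cosine similarities are not fully specified above, so those choices here are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def cross_attention_scores(V_l, Y_l, eta1=4.0, eta2=9.0):
    """Two-way fine-grained interaction attention for one pair (sketch, SCAN-style assumptions).
    V_l: (K, d) image region features, Y_l: (L, d) word features."""
    s = V_l @ Y_l.t()                              # (K, L) dot-product correlations s_ij
    s = s.clamp(min=0)                             # [x]_+ = max(x, 0)
    s_m1 = F.normalize(s, dim=0)                   # image region as query (assumed L2 normalization)
    s_m2 = F.normalize(s, dim=1)                   # text word as query
    delta = F.softmax(eta1 * s_m1, dim=1)          # (K, L) word weights per region
    gamma = F.softmax(eta2 * s_m2, dim=0)          # (K, L) region weights per word
    Y_l_prime = delta @ Y_l                        # (K, d) text features after interaction
    V_l_prime = gamma.t() @ V_l                    # (L, d) image features after interaction
    S1 = F.cosine_similarity(Y_l_prime, V_l, dim=-1).mean()   # assumed pooling over regions
    S2 = F.cosine_similarity(V_l_prime, Y_l, dim=-1).mean()   # assumed pooling over words
    return (S1 + S2) / 2, Y_l_prime, V_l_prime     # S_LLl and the two attended feature sets

S_LLl, Y_lp, V_lp = cross_attention_scores(torch.randn(36, 1024), torch.randn(12, 1024))
```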
Step four, build intra-modal fusion networks for features of different granularities. The feature fusion network comprises an intra-image-modality fusion sub-network and an intra-text-modality fusion sub-network, each formed by a multi-head self-attention module connected to a gated fusion unit. The image sub-network takes as input V_g from step two and V_l' from step three. The multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to highly similar regions; its output V_o is then globally average-pooled and sent, together with the global coarse-grained image feature V_g, to the gated fusion unit for selective gated fusion, producing the image embedding V_f that fuses image features of different granularities, which is the output of the image sub-network. Similarly, the text sub-network takes as input Y_g from step two and Y_l' from step three. The multi-head self-attention module computes the similarity between different words in Y_l' and gives higher weight to highly similar words; its output Y_o is globally average-pooled and fused with the global coarse-grained text feature Y_g in the gated fusion unit, producing the text embedding Y_f that fuses text features of different granularities, which is the output of the text sub-network. Finally, the cosine similarity between V_f and Y_f gives the global-local level (GLl) feature matching score S_GLl.
Step four A: the multi-head self-attention module is further described with reference to FIG. 1, FIG. 2 and FIG. 3. For the local fine-grained image features V_l' obtained in step three A, the module first applies layer normalization; it then computes the results of the h heads and concatenates them; finally, an output projection matrix produces the multi-head total output Multiattn_1 for V_l', where Concat(·) denotes concatenation along the channel dimension, W^O is the output projection matrix, the query, key and value projections are learnable parameter matrices, v_i' and v_j' denote the i-th and j-th feature vectors in the layer-normalized V_l', head_x denotes the result of the x-th head, the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and d_v = d/h = 64.
Similarly, for the local fine-grained text features Y_l', layer normalization is applied first; the results of the h heads are then computed and concatenated; finally, an output projection matrix produces the multi-head total output Multiattn_2 for Y_l', where W^O is the output projection matrix, the query, key and value projections are learnable parameter matrices, y_i' and y_j' denote the i-th and j-th feature vectors in the layer-normalized text representation, head_x denotes the output of the x-th head, d_v = d/h = 64, and the remaining parameters have the same meanings as above.
Finally, after the two multi-head total outputs are obtained, Multiattn_1 is summed element-wise with V_l' to obtain V_o, and Multiattn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:

V_o = V_l' + Multiattn_1
Y_o = Y_l' + Multiattn_2
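By way of illustration only, a sketch of one branch of the multi-head self-attention module (layer normalization, 16 heads with d_v = 64, element-wise residual sum) using PyTorch's built-in nn.MultiheadAttention; the class name and stand-in input are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Layer norm + 16-head self-attention + residual connection (sketch)."""
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)  # per-head dim = d/heads = 64

    def forward(self, x):
        # x: (B, N, d) -- V_l' for the image branch or Y_l' for the text branch
        h = self.norm(x)
        multiattn, _ = self.attn(h, h, h)   # multi-head total output
        return x + multiattn                # element-wise residual sum -> V_o or Y_o

block = SelfAttentionBlock()
V_o = block(torch.randn(2, 36, 1024))
```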
Step four B: the gated fusion unit is further described with reference to FIG. 1, FIG. 2 and FIG. 4. In the intra-image-modality fusion sub-network, the gated fusion unit first applies global average pooling to the V_o obtained in step four A and then gates it with the global coarse-grained image feature V_g to obtain the image embedding V_f; in the intra-text-modality fusion sub-network, the gated fusion unit first applies global average pooling to the Y_o obtained in step four A and then gates it with the global coarse-grained text feature Y_g to obtain the text embedding Y_f. The computation is:

z_v = σ(W_v Concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y Concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 − z_v) × V_o + z_v × V_g
Y_f = (1 − z_y) × Y_o + z_y × Y_g

where σ denotes the sigmoid activation function, z_v and z_y are the fusion coefficients, W_v and W_y are learnable weight matrices, b_v and b_y are learnable bias terms, V_f and Y_f are the outputs of the gated fusion units, namely the image embedding and the text embedding, and the remaining parameters have the same meanings as above.
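By way of illustration only, a sketch of the gated fusion unit follows; for shape consistency the pooled self-attention output is used on both sides of the gate, which is an assumption of the example rather than a restatement of the formula above, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of pooled fine-grained features with the global feature (sketch)."""
    def __init__(self, d=1024):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)   # W, b of the gate

    def forward(self, X_o, X_g):
        # X_o: (B, N, d) self-attention output, X_g: (B, d) global coarse-grained feature
        pooled = X_o.mean(dim=1)                                   # global average pooling
        z = torch.sigmoid(self.fc(torch.cat([pooled, X_g], -1)))   # fusion coefficient z
        return (1 - z) * pooled + z * X_g                          # fused embedding V_f or Y_f

fuse = GatedFusion()
V_f = fuse(torch.randn(2, 36, 1024), torch.randn(2, 1024))
```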
Step four C: the cosine similarity between V_f and Y_f obtained in step four B gives the global-local level feature matching score S_GLl:

S_GLl = (V_f)^T Y_f / (||V_f|| · ||Y_f||)
Step five, compute the multi-level semantic matching total score between image-text pairs and train the multi-level semantic alignment model with a triplet ranking loss that has an adaptive margin value. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four. The multi-level semantic matching total score is a weighted sum of the global-global score S_GGl from step two, the local-local score S_LLl from step three and the global-local score S_GLl from step four. The model is trained with a triplet ranking loss whose margin value is adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin value is adapted according to a given rule, otherwise it remains unchanged.
Step five A: the feature matching scores of the three levels obtained in steps two, three and four are weighted and summed to give the multi-level semantic matching total score S_total(I, T) between an image-text pair:

S_total(I, T) = w_1 S_GGl + w_2 S_GLl + w_3 S_LLl

where S_total(I, T) is the multi-level semantic matching total score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients set to 0.2, 0.6 and 0.2 respectively.
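An illustrative one-liner for the weighted total score with the weights given above; the function name is not from the invention.

```python
def total_score(S_GGl, S_GLl, S_LLl, w1=0.2, w2=0.6, w3=0.2):
    """Multi-level semantic matching total score S_total(I, T) (sketch)."""
    return w1 * S_GGl + w2 * S_GLl + w3 * S_LLl
```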
Step five B: the margin value is split into m_v for the image retrieval direction and m_y for the text retrieval direction. To adapt the two margin values, the proportion of negative samples in the batch is first computed from the multi-level semantic matching total scores obtained in step five A, where B denotes the batch size, set to 128, (I, T^+) and (T, I^+) are matched image-text pairs, (I, T^−) and (T, I^−) are unmatched image-text pairs, sum(χ > 0) denotes the number of elements greater than 0 in matrix χ, r_y is the negative-sample proportion in the text retrieval direction, r_v is the negative-sample proportion in the image retrieval direction, and the initial values of m_y and m_v are both 0.2.
Then the larger of the two negative-sample proportions is taken; when it exceeds the threshold ζ_0, the two margin values are adaptively updated according to a given rule, otherwise they remain unchanged:

r_m = max(r_v, r_y)

where r_m is the larger of the two negative-sample proportions, max denotes the maximum operation, and the threshold ζ_0 is 0.5.
Finally, the triplet ranking loss with adaptive margin values, L_total, is expressed as:

L_total = max(m_y − S_total(I, T^+) + S_total(I, T^−), 0) + max(m_v − S_total(T, I^+) + S_total(T, I^−), 0)
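By way of illustration only, the sketch below computes an adaptive-margin triplet ranking loss from a batch score matrix. The invention updates the margins "according to a given rule" that is not spelled out in this excerpt, so the small additive update, its step size, and the averaging over all negatives are assumptions of the example.

```python
import torch

def adaptive_triplet_loss(S, m_y=0.2, m_v=0.2, zeta0=0.5, step=0.05):
    """Triplet ranking loss with margins adapted to the negative-sample proportion (sketch).
    S: (B, B) matrix with S[i, j] = S_total(I_i, T_j); diagonal entries are matched pairs."""
    B = S.size(0)
    pos = S.diag()                                          # scores of matched pairs
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=S.device)
    cost_i2t = (m_y - pos.unsqueeze(1) + S).clamp(min=0)    # image query, text candidates
    cost_t2i = (m_v - pos.unsqueeze(0) + S).clamp(min=0)    # text query, image candidates
    # proportions of negatives still violating the current margins (r_y, r_v)
    r_y = (cost_i2t[neg_mask] > 0).float().mean()
    r_v = (cost_t2i[neg_mask] > 0).float().mean()
    if torch.max(r_y, r_v) > zeta0:                         # adapt margins (assumed rule: small increase)
        m_y, m_v = m_y + step, m_v + step
        cost_i2t = (m_y - pos.unsqueeze(1) + S).clamp(min=0)
        cost_t2i = (m_v - pos.unsqueeze(0) + S).clamp(min=0)
    loss = cost_i2t[neg_mask].mean() + cost_t2i[neg_mask].mean()
    return loss, (m_y, m_v)

loss, margins = adaptive_triplet_loss(torch.rand(128, 128))
```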
Step six, obtain the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval covers image retrieval and text retrieval: for image retrieval, the image to be retrieved is fed into the trained multi-level semantic alignment model to obtain its multi-level semantic matching total scores against the text descriptions, and the text description with the highest total score is returned as the retrieval result; for text retrieval, the text description to be retrieved is fed into the trained model to obtain its total scores against the images, and the image with the highest total score is returned as the retrieval result. Comparing the retrieved results with the ground truth completes the bidirectional cross-modal image-text retrieval process.
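By way of illustration only, a sketch of the bidirectional retrieval of step six given a precomputed matrix of total scores; returning the top-k candidates instead of only the top-1 result is an assumption for convenience.

```python
import torch

def retrieve(S_total, k=5):
    """Bidirectional retrieval from a total-score matrix (sketch).
    S_total: (num_images, num_texts) multi-level semantic matching total scores."""
    texts_for_images = S_total.argsort(dim=1, descending=True)[:, :k]       # image -> top-k texts
    images_for_texts = S_total.argsort(dim=0, descending=True)[:k, :].t()   # text -> top-k images
    return texts_for_images, images_for_texts

texts_for_images, images_for_texts = retrieve(torch.rand(1000, 5000))
```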
the software environment required in the experimental process of the invention comprises Ubuntu20.04 operating systems, python3.8 and Pytorch 1.10.0 deep learning frameworks; the hardware environment comprises an InterCore i9-12900K processor, a 64.0GB RAM and a display card which is a single NVIDIA GeForce RTX 3090.
The experimental setup is as follows: the model is trained for 20 epochs with a batch size of 128 and an initial learning rate of 2e-4, decayed by a factor of ten after 10 epochs. For adaptive updating of the margin values, the margins are updated every 200 iterations on MS-COCO and every 400 iterations on Flickr30K. For comparison with existing methods, two performance metrics are used: recall R@K and Rsum. R@K is the percentage of queries for which a correct result appears among the top K retrieved results with the highest scores, reported as R@1, R@5 and R@10, i.e. the proportion of correct results within the top 1, 5 and 10 retrieved results; the higher the recall, the better the model. Rsum, the sum of R@1, R@5 and R@10, reflects overall performance. All results are obtained on the test set, and 14 cross-modal retrieval methods are compared: SCAN, SCO, CAAN, VSE++, IMRAM, VSRN, SHAN, SGM, DPRNN, MLASM, Fusion layer, VCLTN, PASF and LAGSC.
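By way of illustration only, an assumed training configuration matching the reported hyper-parameters; the optimizer type is not stated in the text, so Adam and the placeholder model are assumptions of the example.

```python
import torch

# assumed training configuration matching the reported hyper-parameters
model = torch.nn.Linear(1024, 1024)   # placeholder for the multi-level semantic alignment model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # optimizer choice is an assumption
# decay the learning rate by a factor of 10 after 10 of the 20 training epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)
batch_size = 128
```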
TABLE 1 comparison of the present invention with the results of the prior art methods on the Flickr30K dataset
TABLE 2 comparison of the results of the present invention with the prior art methods on the MS-COCO dataset
As the tables show, the proposed method performs strongly on both cross-modal image-text retrieval datasets, MS-COCO and Flickr30K, achieving the best current results on several metrics. This indicates that the method largely overcomes the lack of multi-level cross-modal feature interaction in existing work by jointly considering the interaction of features of different granularities at the global-global, global-local and local-local levels. In addition, training with the adaptive-margin triplet ranking loss greatly improves the accuracy of the cross-modal retrieval task. FIG. 5 shows bidirectional retrieval results of the trained model on the Flickr30K dataset; the retrieved results match the ground truth, further illustrating the effectiveness of the invention.
Claims (6)
1. A cross-modal image-text retrieval method based on multi-level semantic alignment is characterized by comprising the following steps:
step one, collecting a cross-modal image-text retrieval data set: collecting images and their corresponding text descriptions as the cross-modal image-text retrieval data set, forming an image-text pair from each image and one of its corresponding text descriptions, and dividing all collected image-text pairs into a training set, a validation set and a test set according to a given rule;
step two, extracting features of the image-text pairs: for the image in each image-text pair, extracting K region features of each image with the object detector Faster R-CNN to obtain the local fine-grained image features V_l, and extracting the global coarse-grained feature V_g of each image with the convolutional neural network ResNet152; meanwhile, for the text in each image-text pair, extracting word features of each text with the bidirectional gated recurrent unit BiGRU to obtain the local fine-grained text features Y_l; then performing global average pooling on Y_l to obtain the global coarse-grained text feature Y_g; finally, calculating the cosine similarity between V_g and Y_g to obtain the global-global level (GGl) feature matching score S_GGl;
step three, building an inter-modal fine-grained feature interaction attention network, wherein the interaction attention network adopts a two-path symmetric structure and each path takes as input the local fine-grained image features V_l and the local fine-grained text features Y_l obtained in step two; first, the interaction attention network calculates the dot-product correlation s_ij between the i-th image region and the j-th text word; secondly, two local correlation matrices are obtained from the dot-product correlations, namely the correlation matrix between image regions and text words with the image region as query and the correlation matrix between text words and image regions with the text word as query, and the two matrices are normalized to obtain the normalized local correlation matrices s_m1 and s_m2; then, a Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions; next, the coefficients δ_ij and γ_ij are used to weight the local fine-grained features V_l and Y_l, giving the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction; finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are calculated, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
step four, building intra-modal fusion networks for features of different granularities, wherein the feature fusion network comprises an intra-image-modality feature fusion sub-network and an intra-text-modality feature fusion sub-network, and each fusion sub-network is formed by a multi-head self-attention module connected to a gated fusion unit; for the intra-image-modality fusion sub-network, the inputs are V_g obtained in step two and V_l' obtained in step three; first, the multi-head self-attention module calculates the similarity between different regions in V_l' and gives higher weight to regions with high similarity; then the output V_o of the multi-head self-attention module is obtained, V_o is globally average-pooled, and the pooled V_o is sent to the gated fusion unit to be selectively gated and fused with the global coarse-grained image feature V_g, obtaining the image embedding V_f that fuses image features of different granularities, which is the output of the intra-image-modality fusion sub-network; similarly, for the intra-text-modality fusion sub-network, the inputs are Y_g obtained in step two and Y_l' obtained in step three; first, the multi-head self-attention module calculates the similarity between different words in Y_l' and gives higher weight to words with high similarity; then the output Y_o of the multi-head self-attention module is obtained, Y_o is globally average-pooled, and the pooled Y_o is sent to the gated fusion unit to be selectively gated and fused with the global coarse-grained text feature Y_g, obtaining the text embedding Y_f that fuses text features of different granularities, which is the output of the intra-text-modality fusion sub-network; finally, the cosine similarity between V_f and Y_f is calculated to obtain the global-local level (GLl) feature matching score S_GLl;
step five, calculating the multi-level semantic matching total score between image-text pairs and training the multi-level semantic alignment model with a triplet ranking loss having an adaptive margin value, wherein the multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four; then the multi-level semantic matching total score is obtained by weighting the global-global level feature matching score S_GGl obtained in step two, the local-local level feature matching score S_LLl obtained in step three and the global-local level feature matching score S_GLl obtained in step four; finally, the multi-level semantic alignment model is trained with a triplet ranking loss having adaptive margin values, wherein the margin values are adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin values are adaptively changed according to a given rule, otherwise they remain unchanged;
step six, obtaining the bidirectional cross-modal image-text retrieval results, wherein bidirectional retrieval comprises image retrieval and text retrieval: for image retrieval, the image to be retrieved is input into the trained multi-level semantic alignment model to obtain the multi-level semantic matching total scores between the image and the text descriptions, and the text description with the highest total score is taken as the retrieval result for the image; for text retrieval, the text description to be retrieved is input into the trained multi-level semantic alignment model to obtain the multi-level semantic matching total scores between the text description and the images, and the image with the highest total score is taken as the retrieval result for the text description; whether the obtained bidirectional retrieval results match the ground truth is then checked, completing the bidirectional cross-modal image-text retrieval process.
2. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the image-text pairs in step one are drawn from the open-source cross-modal retrieval datasets MS-COCO and Flickr30K.
3. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the feature extraction of the image-text pair in the second step comprises the following steps:
step two A, for each image in the image-text pair, K region features are extracted with a Faster R-CNN object detector whose ResNet101 backbone is pre-trained on the open-source Visual Genome dataset, where K is typically 36; a single fully connected layer adjusts all region feature dimensions to d, giving the image local fine-grained features V_l = {v_1, ..., v_K} ∈ ℝ^(K×d), where v_i is the i-th region feature vector of the image, d is the feature dimension with value 1024, and ℝ denotes the vector space; meanwhile, a ResNet152 network extracts the global feature of the whole image, and a single fully connected layer adjusts the global feature dimension to d, giving the image global coarse-grained feature V_g ∈ ℝ^d;
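As an illustration of the projection in step two A, the sketch below assumes the K = 36 Faster R-CNN region features and the ResNet152 global feature have already been extracted with a typical dimensionality of 2048 (the input dimensionality is not stated in the claim and is an assumption here).

```python
import torch
import torch.nn as nn

class ImageFeatures(nn.Module):
    """Project pre-extracted region and global image features to d = 1024."""
    def __init__(self, in_dim=2048, d=1024):
        super().__init__()
        self.local_fc = nn.Linear(in_dim, d)    # single fully connected layer for V_l
        self.global_fc = nn.Linear(in_dim, d)   # single fully connected layer for V_g

    def forward(self, region_feats, global_feat):
        V_l = self.local_fc(region_feats)       # (K, d) local fine-grained features
        V_g = self.global_fc(global_feat)       # (d,)  global coarse-grained feature
        return V_l, V_g
```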
Step two B, for each text in the image-text pair, the text is first segmented into words and each word is encoded as a one-hot vector; the pre-trained word embedding method GloVe is then applied to the one-hot vectors to obtain a word embedding vector for each word; the word embedding vectors are then fed into a bidirectional recurrent neural network BiGRU to extract the text local fine-grained features Y_l = {y_1, ..., y_L} ∈ ℝ^(L×d), where L is the number of words after segmentation and y_j is the feature vector of the j-th word of the text; the extraction process is as follows:
h_j^→ = GRU^→(t_j, h_{j-1}^→), h_j^← = GRU^←(t_j, h_{j+1}^←), y_j = (h_j^→ + h_j^←) / 2
where t_j is the word embedding vector of the j-th word of the text, h_j^→ and h_j^← are the hidden states of the forward operation GRU^→ and the backward operation GRU^←, respectively, y_j takes the average of the two hidden states as the feature vector of the j-th word, and d is the feature dimension, taking the same value as for the image;
finally, global average pooling is applied to the text local fine-grained features Y_l to obtain the text global coarse-grained feature Y_g:
Y_g = AvgPool(Y_l)
where AvgPool denotes the global average pooling operation;
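A minimal sketch of the text branch in step two B, assuming 300-dimensional GloVe embeddings and a BiGRU hidden size of d = 1024; vocabulary handling and GloVe initialisation are omitted.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """GloVe-style embeddings -> BiGRU; average forward/backward states per word."""
    def __init__(self, vocab_size, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # would be initialised with GloVe
        self.bigru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # token_ids: (B, L)
        t = self.embed(token_ids)                 # (B, L, embed_dim) word embedding vectors
        h, _ = self.bigru(t)                      # (B, L, 2d) forward/backward states concatenated
        d = h.size(-1) // 2
        Y_l = (h[..., :d] + h[..., d:]) / 2       # (B, L, d) local fine-grained features
        Y_g = Y_l.mean(dim=1)                     # (B, d) global coarse-grained feature
        return Y_l, Y_g
```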
step two C, the cosine similarity between the image global coarse-grained feature V_g obtained in step two A and the text global coarse-grained feature Y_g obtained in step two B is calculated to obtain the global-global level feature matching score S_GGl:
S_GGl = (V_g^T Y_g) / (||V_g|| · ||Y_g||)
where ||·|| denotes the L2 norm and the superscript T denotes the transpose operation.
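The global-global score of step two C is simply the cosine similarity of the two pooled vectors; a one-line sketch:

```python
import torch.nn.functional as F

def global_global_score(V_g, Y_g):
    """S_GGl: cosine similarity between the global coarse-grained features."""
    return F.cosine_similarity(V_g, Y_g, dim=-1)
```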
4. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the inter-modal fine-grained feature interaction attention network in the third step comprises the following steps:
step three A, the interaction attention network firstly calculates dot product correlation between an image area and text word feature vectors:
s_ij = v_i^T y_j,  i ∈ [1, 36], j ∈ [1, L]
where s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is as defined above;
secondly, a normalized local correlation matrix can be obtained based on the dot product correlation between the image region and the text word:
where s_m1 denotes the normalized correlation matrix between image regions (as the query) and text words, s_m2 denotes the normalized correlation matrix between text words (as the query) and image regions, and [x]_+ = max(x, 0);
Then, a Softmax operation is applied to s_m1 and s_m2 to obtain the weight coefficient δ_ij of each text word when an image region serves as the query, and the weight coefficient γ_ij of each image region when a text word serves as the query:
δ_ij = exp(η_1 · s_m1,ij) / Σ_j exp(η_1 · s_m1,ij)
γ_ij = exp(η_2 · s_m2,ij) / Σ_i exp(η_2 · s_m2,ij)
where exp(·) is the exponential operation, and η_1 and η_2 are temperature hyper-parameters, set to 4 and 9 respectively;
then, the obtained coefficients δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l, respectively, yielding the text local fine-grained features Y_l' and the image local fine-grained features V_l' after cross-modal feature interaction, which are the outputs of the interaction attention network:
y'_i = Σ_j δ_ij · y_j,  v'_j = Σ_i γ_ij · v_i
where y'_i is the i-th feature vector in Y_l' and v'_j is the j-th feature vector in V_l';
step three B, the cosine similarity S_1 between V_l' obtained in step three A and Y_l obtained in step two B, and the cosine similarity S_2 between Y_l' obtained in step three A and V_l obtained in step two A, are calculated:
where S_1 is the cosine similarity between V_l' and Y_l, and S_2 is the cosine similarity between Y_l' and V_l;
the local-local level feature matching score S_LLl is the average of S_1 and S_2:
S_LLl = (S_1 + S_2) / 2.
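A sketch of steps three A and three B for a single image-text pair follows. The hinge-then-L2 normalisation of the correlation matrix and the averaging over regions/words when computing S_1 and S_2 are assumptions in the SCAN style, since the claims do not fully specify them.

```python
import torch
import torch.nn.functional as F

def interaction_attention(V_l, Y_l, eta1=4.0, eta2=9.0):
    """Cross-modal interaction attention and local-local matching score.

    V_l: (K, d) image region features; Y_l: (L, d) text word features.
    """
    s = V_l @ Y_l.t()                           # (K, L) dot-product correlations s_ij
    s_pos = s.clamp(min=0)                      # [x]_+ = max(x, 0)
    s_m1 = F.normalize(s_pos, dim=1)            # regions as query (assumed normalisation)
    s_m2 = F.normalize(s_pos, dim=0)            # words as query (assumed normalisation)

    delta = F.softmax(eta1 * s_m1, dim=1)       # delta[i, j]: weight of word j for region query i
    gamma = F.softmax(eta2 * s_m2.t(), dim=1)   # gamma[j, i]: weight of region i for word query j

    Y_l_att = delta @ Y_l                       # (K, d) attended text features Y_l'
    V_l_att = gamma @ V_l                       # (L, d) attended image features V_l'

    S1 = F.cosine_similarity(V_l_att, Y_l, dim=-1).mean()   # V_l' vs Y_l
    S2 = F.cosine_similarity(Y_l_att, V_l, dim=-1).mean()   # Y_l' vs V_l
    S_LLl = (S1 + S2) / 2
    return V_l_att, Y_l_att, S_LLl
```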
5. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the intra-modal different-granularity feature fusion network in step four comprises the following steps:
step four A, for the image local fine-grained features V_l' obtained in step three A, the multi-head self-attention module first applies layer normalization; the outputs of the h heads are then computed and concatenated; finally, an output projection matrix is applied to obtain the multi-head total output MultiAttn_1 for V_l':
where Concat(·) denotes the concatenation operation along the channel dimension, the first learned matrix is the output projection matrix and the remaining learned matrices are the per-head parameter matrices, v'_i and v'_j denote the i-th and j-th feature vectors of the layer-normalized V_l', head_x denotes the output of the x-th head, the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and d_v = d/h = 64;
Similarly, for the text local fine-grained features Y_l', layer normalization is applied first; the outputs of the h heads are then computed and concatenated; finally, an output projection matrix is applied to obtain the multi-head total output MultiAttn_2 for Y_l':
where the output projection matrix and per-head parameter matrices play the same roles as above, y'_i and y'_j denote the i-th and j-th feature vectors of the layer-normalized text representation, head_x denotes the output of the x-th head, d_v = d/h = 64, and the remaining parameters have the same meaning as before;
finally, after obtaining the two multi-head total outputs, MultiAttn_1 is summed element-wise with V_l' to obtain V_o, and MultiAttn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:
V_o = V_l' + MultiAttn_1
Y_o = Y_l' + MultiAttn_2
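A sketch of the intra-modal branch in step four A, using the standard PyTorch multi-head attention module with h = 16 heads on d = 1024 (so d_v = 64 per head); the claim's exact per-head parameterisation may differ.

```python
import torch
import torch.nn as nn

class IntraModalSelfAttention(nn.Module):
    """LayerNorm -> multi-head self-attention -> element-wise residual sum."""
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, X_l):                    # X_l: (B, N, d), i.e. V_l' or Y_l'
        x = self.norm(X_l)
        multiattn, _ = self.attn(x, x, x)      # multi-head total output
        return X_l + multiattn                 # V_o (or Y_o)
```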
step four B, the gated fusion unit in the feature fusion sub-network within the image modality first applies global average pooling to the V_o obtained in step four A and gate-fuses it with the image global coarse-grained feature V_g to obtain the image embedded representation V_f; the gated fusion unit in the feature fusion sub-network within the text modality first applies global average pooling to the Y_o obtained in step four A and gate-fuses it with the text global coarse-grained feature Y_g to obtain the text embedded representation Y_f; the calculation process is as follows:
z_v = σ(W_v · concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y · concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 - z_v) × V_o + z_v × V_g
Y_f = (1 - z_y) × Y_o + z_y × Y_g
where σ denotes the sigmoid activation function, z_v and z_y are the gating fusion coefficients, W_v and W_y are weight matrices to be learned, b_v and b_y are bias terms to be learned, V_f and Y_f are the outputs of the gated fusion units, namely the image embedded representation and the text embedded representation respectively, and the remaining parameters have the same meaning as above;
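A sketch of the gated fusion unit in step four B. The pooled self-attention output is used on both sides of the gate here for dimensional consistency, which is an assumption; the claim's fusion formula writes V_o and Y_o directly.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate the pooled self-attention output against the global coarse-grained feature."""
    def __init__(self, d=1024):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # learned weight matrix W and bias b

    def forward(self, X_o, X_g):                       # X_o: (B, N, d), X_g: (B, d)
        pooled = X_o.mean(dim=1)                       # global average pooling
        z = torch.sigmoid(self.gate(torch.cat([pooled, X_g], dim=-1)))   # gating coefficient
        return (1 - z) * pooled + z * X_g              # fused embedding V_f (or Y_f)
```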
step four C, the cosine similarity between V_f and Y_f obtained in step four B is calculated to obtain the global-local level feature matching score S_GLl:
S_GLl = (V_f^T Y_f) / (||V_f|| · ||Y_f||)
6. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the step five of calculating the total multi-level semantic matching score between image-text pairs and training the model with a triplet ranking loss with adaptive margin comprises the following steps:
step five A, the feature matching scores of the three different levels obtained in steps two, three and four are weighted and summed to obtain the total multi-level semantic matching score S_total(I, T) between an image-text pair:
S_total(I, T) = w_1 · S_GGl + w_2 · S_GLl + w_3 · S_LLl
where S_total(I, T) is the total multi-level semantic matching score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients with values 0.2, 0.6 and 0.2 respectively;
step five B, the margin is split into m_v, used during image retrieval, and m_y, used during text retrieval; to adapt the two margins, the negative-sample ratios within the batch are first computed from the total multi-level semantic matching scores obtained in step five A, as shown in the following formula:
where B denotes the batch size, set to 128, (I, T^+) and (T, I^+) are matched image-text pairs, (I, T^-) and (T, I^-) are unmatched image-text pairs, Sum(χ > 0) denotes the number of elements greater than 0 in the matrix χ, r_y denotes the negative-sample ratio during text retrieval, r_v denotes the negative-sample ratio during image retrieval, and the initial values of m_y and m_v are both 0.2;
then, the larger of the two negative-sample ratios is taken; when this value exceeds the threshold ζ_0, the two margins are updated adaptively according to the preset rule, otherwise they remain unchanged, as shown in the following formula:
r_m = max(r_v, r_y)
where r_m is the larger of the two negative-sample ratios, max denotes the maximum operation, and the threshold ζ_0 is set to 0.5;
finally, the triplet ranking loss with adaptive margin, L_total, is expressed as follows:
L_total = max(m_y - S_total(I, T^+) + S_total(I, T^-), 0) + max(m_v - S_total(T, I^+) + S_total(T, I^-), 0).
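A batch-level sketch of steps five A and five B, assuming the total scores are arranged in a B x B matrix with matched pairs on the diagonal; the choice of negatives and the margin-update rule are assumptions, since the claim only states that both margins are adapted when the larger negative-sample ratio exceeds ζ_0 = 0.5.

```python
import torch

def total_score(S_GGl, S_GLl, S_LLl, w=(0.2, 0.6, 0.2)):
    """Weighted total multi-level semantic matching score S_total."""
    return w[0] * S_GGl + w[1] * S_GLl + w[2] * S_LLl

def adaptive_margin_triplet_loss(S, m_v=0.2, m_y=0.2, zeta0=0.5):
    """Triplet ranking loss with an (assumed) adaptive margin update.

    S: (B, B) matrix of S_total scores; S[i, j] scores image i against text j.
    """
    B = S.size(0)
    pos = S.diag()                                        # scores of the matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=S.device)

    # text retrieval (image query), margin m_y: violations across texts
    cost_y = (m_y - pos.unsqueeze(1) + S).clamp(min=0).masked_fill(mask, 0)
    # image retrieval (text query), margin m_v: violations across images
    cost_v = (m_v - pos.unsqueeze(0) + S).clamp(min=0).masked_fill(mask, 0)

    # negative-sample ratios within the batch
    r_y = (cost_y > 0).float().mean().item()
    r_v = (cost_v > 0).float().mean().item()
    if max(r_v, r_y) > zeta0:                 # adapt both margins (assumed rule)
        m_y, m_v = m_y * (1 + r_y), m_v * (1 + r_v)

    loss = cost_y.max(dim=1).values.mean() + cost_v.max(dim=0).values.mean()
    return loss, m_y, m_v
```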
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310855462.0A CN116821391A (en) | 2023-07-13 | 2023-07-13 | Cross-modal image-text retrieval method based on multi-level semantic alignment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116821391A true CN116821391A (en) | 2023-09-29 |
Family
ID=88124000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310855462.0A Pending CN116821391A (en) | 2023-07-13 | 2023-07-13 | Cross-modal image-text retrieval method based on multi-level semantic alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821391A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117522479A (en) * | 2023-11-07 | 2024-02-06 | 北京创信合科技有限公司 | Accurate Internet advertisement delivery method and system |
CN117708354A (en) * | 2024-02-06 | 2024-03-15 | 湖南快乐阳光互动娱乐传媒有限公司 | Image indexing method and device, electronic equipment and storage medium |
CN117708354B (en) * | 2024-02-06 | 2024-04-30 | 湖南快乐阳光互动娱乐传媒有限公司 | Image indexing method and device, electronic equipment and storage medium |
CN117874262A (en) * | 2024-03-12 | 2024-04-12 | 北京邮电大学 | Text-dynamic picture cross-modal retrieval method based on progressive prototype matching |
CN117874262B (en) * | 2024-03-12 | 2024-06-04 | 北京邮电大学 | Text-dynamic picture cross-modal retrieval method based on progressive prototype matching |
CN118279925A (en) * | 2024-06-04 | 2024-07-02 | 鲁东大学 | Image text matching algorithm integrating local and global semantics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717431B (en) | Fine-grained visual question and answer method combined with multi-view attention mechanism | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
Abiyev et al. | Sign language translation using deep convolutional neural networks | |
CN110059217B (en) | Image text cross-media retrieval method for two-stage network | |
CN112905822B (en) | Deep supervision cross-modal counterwork learning method based on attention mechanism | |
CN116821391A (en) | Cross-modal image-text retrieval method based on multi-level semantic alignment | |
Al-Jarrah et al. | Recognition of gestures in Arabic sign language using neuro-fuzzy systems | |
Gao et al. | Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework | |
CN115033670A (en) | Cross-modal image-text retrieval method with multi-granularity feature fusion | |
CN111414461A (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN112417097A (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
CN114201592A (en) | Visual question-answering method for medical image diagnosis | |
CN114817673A (en) | Cross-modal retrieval method based on modal relation learning | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
Das et al. | A deep sign language recognition system for Indian sign language | |
CN113255602A (en) | Dynamic gesture recognition method based on multi-modal data | |
CN116561305A (en) | False news detection method based on multiple modes and transformers | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN117333908A (en) | Cross-modal pedestrian re-recognition method based on attitude feature alignment | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
Abdullahi et al. | Spatial–temporal feature-based End-to-end Fourier network for 3D sign language recognition | |
Li et al. | Egocentric action recognition by automatic relation modeling | |
CN114722798A (en) | Ironic recognition model based on convolutional neural network and attention system | |
Wang et al. | Listen, look, and find the one: Robust person search with multimodality index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||