CN116821391A - Cross-modal image-text retrieval method based on multi-level semantic alignment - Google Patents

Info

Publication number
CN116821391A
Authority
CN
China
Prior art keywords
text
image
feature
global
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310855462.0A
Other languages
Chinese (zh)
Inventor
遆晓光
王文状
刘茂振
高峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310855462.0A priority Critical patent/CN116821391A/en
Publication of CN116821391A publication Critical patent/CN116821391A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A cross-modal image-text retrieval method based on multi-level semantic alignment, belonging to the technical fields of cross-modal retrieval and artificial intelligence. The method encodes image and text features with a simple, symmetric network architecture and jointly considers global-global, global-local and local-local semantic alignment. By introducing an inter-modal fine-grained feature interaction attention network and an intra-modal different-granularity feature fusion network, features of different granularities interact and fuse at different levels, which addresses two weaknesses of existing cross-modal retrieval work: weak multi-granularity feature interaction, and difficulty distinguishing image-text pairs whose image region features or text semantics are similar. The method further adopts a triplet ranking loss built on the multi-level semantic matching total score with an adaptive margin value, achieving better cross-modal semantic alignment and greatly improving the accuracy of cross-modal image-text retrieval.

Description

Cross-modal image-text retrieval method based on multi-level semantic alignment
Technical Field
The invention relates to the technical field of cross-modal retrieval and artificial intelligence, in particular to a cross-modal image-text retrieval method based on multi-level semantic alignment.
Background
Cross-modal image-text retrieval is an actively developing research direction in multi-modal learning. It performs bidirectional retrieval between images and texts and is widely applied in information retrieval, information recommendation and related fields. With the arrival of the big-data era and the development of Internet technology, multi-modal data dominated by images and text is growing exponentially, and effectively fusing and aligning large volumes of multi-source heterogeneous image and text data to meet users' diverse retrieval needs is a challenging task. Many research efforts have explored efficient interaction between cross-modal image-text pairs; among them, deep-learning methods based on deep neural networks have shown great potential in cross-modal retrieval tasks and achieved notable results. However, most cross-modal retrieval research still suffers from weak cross-modal feature interaction or a lack of intra-modal semantic alignment, and has difficulty distinguishing different image-text pairs with similar image local regions or similar text descriptions.
Existing cross-modal retrieval research falls mainly into global-level and local-level feature alignment methods. Global-level methods learn a two-branch deep neural network: a convolutional neural network (CNN) and a recurrent neural network (RNN) first extract the global features of the image and the text respectively, and the global features of the two modalities are then mapped into a joint representation space to obtain the global correlation between modalities. This coarse-grained alignment, however, ignores local relationships between modalities and lacks fine interaction between image regions and text words, so its retrieval results are poor. Because a text usually describes only certain locally salient regions of an image, and these regions together with their corresponding descriptive words are critical to retrieval, later research has focused on local-level feature alignment. Such methods extract fine-grained features of image regions and text words and model the association between local fragments of the two modalities, realising fusion and alignment of cross-modal fine-grained features. Local-level feature matching improves the accuracy of cross-modal retrieval to some extent, but it directly converts the global association of a whole image-text pair into local similarities between regions and words; by over-attending to local details of the image and text it ignores global information, and it therefore cannot distinguish the same word used with different meanings in different contexts. Moreover, lacking multi-granularity feature fusion and multi-level semantic interaction, current global-level and local-level alignment methods have difficulty distinguishing different image-text pairs whose image local regions or text semantics are similar. In addition, the loss function adopted in most current research is a triplet ranking loss with a fixed margin value; it retrieves well when the visual features of images and the semantics of texts differ clearly, but performs poorly on hard cases where image features and text semantics are very similar.
In summary, current cross-modal image-text retrieval methods mainly suffer from two problems: 1. existing retrieval models lack multi-granularity feature fusion and multi-level semantic alignment of images and texts, so they have difficulty distinguishing different image-text pairs with similar image local regions or similar text semantics; 2. a triplet ranking loss based on a fixed margin value prevents the model from further separating hard cases during training. The present invention addresses these two problems so that cross-modal features can be fully aligned and can interact at multiple levels.
Disclosure of Invention
The invention aims to solve the poor retrieval performance caused by insufficient multi-granularity feature interaction in existing image-text retrieval methods. It provides a multi-level semantic alignment cross-modal retrieval method that simultaneously considers global-global, global-local and local-local matching, and further improves the accuracy of the cross-modal retrieval task through a triplet ranking loss with an adaptive margin value, giving it wide application value.
In order to achieve the above purpose, the invention provides a cross-modal image-text retrieval method based on multi-level semantic alignment, which comprises the following steps:
Step one, collecting a cross-modal image-text retrieval data set. Images and their corresponding text descriptions are collected as the cross-modal image-text retrieval data set; one image and one of its corresponding text descriptions form an image-text pair, and all collected image-text pairs are divided into a training set, a validation set and a test set according to a certain rule;
Step two, extracting features of the image-text pairs. For the image in an image-text pair, the object detector Faster R-CNN extracts K region features of each image to obtain the image local fine-grained features V_l, and the convolutional neural network ResNet152 extracts the image global coarse-grained feature V_g; meanwhile, for the text in the image-text pair, the bidirectional gated recurrent unit BiGRU extracts the word features of each text to obtain the text local fine-grained features Y_l; then Y_l is globally average-pooled to obtain the text global coarse-grained feature Y_g; finally, the cosine similarity between V_g and Y_g is computed to obtain the global-global level (GGl) feature matching score S_GGl;
Step three, building the inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-way symmetric structure, and each way takes as input the image local fine-grained features V_l and the text local fine-grained features Y_l obtained in step two. First, the interaction attention network computes the dot-product correlation s_ij between the i-th image region and the j-th text word; second, two local correlation matrices are derived from the dot-product correlations, namely the correlation matrix between image regions and text words when the image region is the query and the correlation matrix between text words and image regions when the text word is the query, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2; then a Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij of the text words when the image region is the query and the weight coefficients γ_ij of the image regions when the text word is the query; next, δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l respectively, giving the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the text local fine-grained features after cross-modal feature interaction and V_l' denotes the image local fine-grained features after cross-modal feature interaction; finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
Step four, building the intra-modal different-granularity feature fusion network. The feature fusion network comprises a feature fusion sub-network within the image modality and a feature fusion sub-network within the text modality, each formed by a multi-head self-attention module connected to a gated fusion unit. The feature fusion sub-network within the image modality takes as input V_g obtained in step two and V_l' obtained in step three: the multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to regions with high similarity; its output V_o is then globally average-pooled, and the pooled V_o is sent to the gated fusion unit to be selectively fused with the image global coarse-grained feature V_g, yielding the image embedded representation V_f that fuses image features of different granularities, which is the output of the feature fusion sub-network within the image modality. Similarly, the feature fusion sub-network within the text modality takes as input Y_g obtained in step two and Y_l' obtained in step three: the multi-head self-attention module first computes the similarity between different words in Y_l' and gives higher weight to words with high similarity; its output Y_o is then globally average-pooled, and the pooled Y_o is sent to the gated fusion unit to be selectively fused with the text global coarse-grained feature Y_g, yielding the text embedded representation Y_f that fuses text features of different granularities, which is the output of the feature fusion sub-network within the text modality. Finally, the cosine similarity between V_f and Y_f is computed to obtain the global-local level (GLl) feature matching score S_GLl;
Step five, computing the multi-level semantic matching total score between image-text pairs and training the multi-level semantic alignment model with a triplet ranking loss having adaptive margin values. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal different-granularity feature fusion network of step four; the multi-level semantic matching total score is obtained by weighting the global-global level feature matching score S_GGl of step two, the local-local level feature matching score S_LLl of step three and the global-local level feature matching score S_GLl of step four; finally, the model is trained with a triplet ranking loss whose margin values are adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin values are adaptively updated according to a preset rule, otherwise they remain unchanged;
Step six, obtaining the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval comprises image retrieval and text retrieval: for image retrieval, the image to be retrieved is input into the trained multi-level semantic alignment model, the multi-level semantic matching total scores between the image and the candidate text descriptions are computed, and the text description with the highest total score is taken as the retrieval result for that image; for text retrieval, the text description to be retrieved is input into the trained model, the multi-level semantic matching total scores between the text description and the candidate images are computed, and the image with the highest total score is taken as the retrieval result for that text description. Whether the retrieved results match the ground truth is then checked, completing the bidirectional cross-modal image-text retrieval process;
compared with the prior art, the invention has the following beneficial effects:
(1) Compared with existing cross-modal retrieval work, the invention aligns and matches cross-modal features at different granularity levels by constructing a simple, symmetric multi-level semantic alignment network that simultaneously accounts for global-global, global-local and local-local matching. It obtains more robust image and text embedded representations, allows images and texts in different representation spaces to interact and fuse their features more thoroughly, and greatly improves the accuracy of the cross-modal retrieval task.
(2) The invention adopts a triplet ranking loss with an adaptive margin value, which helps the model discriminate hard cases, i.e. image-text pairs with similar features, during training; by enforcing stricter matching of positive samples and stricter separation of negative samples, it achieves better cross-modal semantic alignment.
Drawings
FIG. 1 is a flow chart of the cross-modal image-text retrieval method based on multi-level semantic alignment;
FIG. 2 is a block diagram of a multi-level semantic alignment model;
FIG. 3 is a block diagram of a multi-headed self-attention module;
FIG. 4 is a block diagram of a gated fusion unit;
FIG. 5 is an example of bi-directional retrieval results of a trained multi-level semantic alignment model on a Flickr30K open source dataset.
Detailed Description
The invention is described in further detail below with reference to FIG. 1 and a specific example. The cross-modal image-text retrieval method based on multi-level semantic alignment comprises the following steps:
step one, collecting a cross-mode image-text retrieval data set. Collecting images and corresponding text descriptions thereof as a cross-modal image-text retrieval data set, forming an image-text pair by one image and a corresponding text description thereof, and dividing all the collected image-text pairs into a training set, a verification set and a test set according to a certain rule;
the collected pairs of images are derived from cross-modal retrieval of open source datasets MS-COCO and Flickr30K, where each image has a corresponding five text descriptions. For the partitioning of the dataset, MS-COCO contained 123287 images in total, using 5000 images and corresponding text descriptions as the validation set, and 5000 images and corresponding text descriptions as the test set, the remaining images and corresponding text descriptions as the training set; the Flickr30K contains 31784 images in total, using 1000 images and corresponding text descriptions as a validation set, another 1000 images and corresponding text descriptions as a test set, and the remaining images and corresponding text descriptions as a training set.
And step two, extracting the characteristics of the image-text pairs. For images in the image-text pair, the target detector FaterR-CNN extracts K regional characteristics of each image to obtain local fine-grained characteristic V of the image l Global coarse-grained feature V for each image is extracted using convolutional neural network ResNet152 g The method comprises the steps of carrying out a first treatment on the surface of the Meanwhile, for texts in the image-text pair, extracting word characteristics of each text by using a bi-directional gating circulating unit BiGRU to obtain local fine granularity characteristics Y of the text l The method comprises the steps of carrying out a first treatment on the surface of the Then, for Y l Global average pooling is carried out to obtain global coarse granularity characteristic Y of the text g The method comprises the steps of carrying out a first treatment on the surface of the Finally, calculating the global coarse granularity characteristic V of the image g And text global coarse granularity feature Y g Cosine similarity between them to obtain Global-Global level (GGl) feature matching score S GGl
Step two, for each image in the image-text pair, extracting K regional features by using a FasterR-CNN target detector with a ResNet101 as a main network pre-trained on an open source data set Visual Genome, wherein K is generally 36. All regional feature dimensions are adjusted to d by using a single-layer full-connection layer, and the local fine-grained feature of the image is obtainedWherein v is i For the i-th region feature vector of the image, d is the feature dimension, and the value is 1024,/-for the feature vector>Is vector space. Meanwhile, the ResNet152 network is used for extracting the global feature of the whole image, and the global feature dimension is adjusted to d by using a single-layer full-connection layer to obtain the global coarse-granularity feature of the image>
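For illustration, the following minimal PyTorch sketch mirrors the image-side encoding described in step two A. It assumes the region features are pre-extracted offline by Faster R-CNN (K = 36 regions, 2048-dimensional, an assumed output size); the class and variable names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ImageEncoder(nn.Module):
    """Sketch of step two A: project region features and extract a global feature."""

    def __init__(self, region_dim=2048, embed_dim=1024):
        super().__init__()
        # single fully connected layer mapping each region feature to d = 1024
        self.fc_region = nn.Linear(region_dim, embed_dim)
        # ResNet152 backbone for the global coarse-grained feature
        # (pretrained weights would be loaded in practice)
        resnet = models.resnet152()
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier
        self.fc_global = nn.Linear(2048, embed_dim)

    def forward(self, region_feats, image):
        # region_feats: (B, K, region_dim) precomputed Faster R-CNN features
        # image:        (B, 3, H, W) raw image tensor
        V_l = self.fc_region(region_feats)        # local fine-grained features (B, K, d)
        g = self.backbone(image).flatten(1)       # pooled backbone output (B, 2048)
        V_g = self.fc_global(g)                   # global coarse-grained feature (B, d)
        return V_l, V_g
```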
Step two B, each text in an image-text pair is first segmented into words, and each segmented word is encoded as a one-hot vector; meanwhile, the pre-trained word-embedding method GloVe processes the one-hot vectors to obtain the word embedding vector of each word. The word embedding vectors are then fed into the bidirectional recurrent network BiGRU to extract the text local fine-grained features Y_l = {y_1, y_2, ..., y_L} ∈ R^(L×d), where L is the number of words after segmentation and y_j is the feature vector of the j-th word of the text. The extraction process is as follows:

h→_j = GRU→(t_j), h←_j = GRU←(t_j), j ∈ [1, L]
y_j = (h→_j + h←_j) / 2

where t_j is the word embedding vector of the j-th word of the text, h→_j and h←_j are the hidden states of the forward operation GRU→ and the backward operation GRU← respectively, y_j takes the average of the two hidden states as the feature vector of the j-th word, and the feature dimension d takes the same value as for the image.
Finally, global average pooling is applied to the text local fine-grained features Y_l to obtain the text global coarse-grained feature Y_g:

Y_g = AvgPool(Y_l)

where AvgPool denotes the global average pooling operation.
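A corresponding sketch of the text-side encoding of step two B is given below; the vocabulary size and the GloVe initialisation details are assumptions (in practice the embedding matrix would be initialised from pre-trained GloVe vectors).

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Sketch of step two B: BiGRU word features plus global average pooling."""

    def __init__(self, vocab_size=20000, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # GloVe-initialised in practice
        # bidirectional GRU; forward/backward hidden states are averaged per word
        self.bigru = nn.GRU(word_dim, embed_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, L) word indices of a tokenised caption
        t = self.embed(tokens)                            # (B, L, word_dim)
        h, _ = self.bigru(t)                              # (B, L, 2 * embed_dim)
        h = h.view(h.size(0), h.size(1), 2, -1)
        Y_l = h.mean(dim=2)                               # y_j = (forward_j + backward_j) / 2
        Y_g = Y_l.mean(dim=1)                             # global average pooling -> (B, d)
        return Y_l, Y_g
```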
Step two C, the cosine similarity between the image global coarse-grained feature V_g obtained in step two A and the text global coarse-grained feature Y_g obtained in step two B is computed to obtain the global-global level feature matching score S_GGl:

S_GGl = (V_g)^T Y_g / (||V_g|| · ||Y_g||)

where ||·|| denotes the L2 norm and the superscript T denotes the transpose operation.
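The global-global score of step two C can be computed for a whole batch at once, as in the sketch below; the batched (B_img × B_txt) layout is an assumption, chosen because it is also the form needed by the triplet loss in step five.

```python
import torch
import torch.nn.functional as F


def global_global_score(V_g, Y_g):
    # V_g: (B_img, d) image global features, Y_g: (B_txt, d) text global features
    V_g = F.normalize(V_g, dim=-1)
    Y_g = F.normalize(Y_g, dim=-1)
    # S_GGl[i, j] = cosine similarity between image i and text j
    return V_g @ Y_g.t()
```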
Step three, building the inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-way symmetric structure, and each way takes as input the image local fine-grained features V_l and the text local fine-grained features Y_l obtained in step two. First, the interaction attention network computes the dot-product correlation s_ij between the i-th image region and the j-th text word; second, two local correlation matrices are derived from the dot-product correlations, namely the correlation matrix between image regions and text words when the image region is the query and the correlation matrix between text words and image regions when the text word is the query, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2; then a Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij of the text words when the image region is the query and the weight coefficients γ_ij of the image regions when the text word is the query; next, δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l respectively, giving the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the text local fine-grained features after cross-modal feature interaction and V_l' denotes the image local fine-grained features after cross-modal feature interaction; finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2.
Step three A, referring to FIG. 1 and FIG. 2, the interaction attention network first computes the dot-product correlation between each image region feature vector and each text word feature vector:

s_ij = v_i^T y_j, i ∈ [1, K], j ∈ [1, L]

where s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is defined as above.
Next, two normalized local correlation matrices are obtained from the dot-product correlations: s_m1, the normalized correlation matrix between image regions and text words when the image region is the query, and s_m2, the normalized correlation matrix between text words and image regions when the text word is the query, where the correlations are first rectified with [x]_+ = max(x, 0).
Then a Softmax operation on s_m1 and s_m2 yields δ_ij, the weight coefficient of the corresponding text word when the image region is the query, and γ_ij, the weight coefficient of the corresponding image region when the text word is the query:

δ_ij = exp(η_1 s_m1,ij) / Σ_j exp(η_1 s_m1,ij)
γ_ij = exp(η_2 s_m2,ij) / Σ_i exp(η_2 s_m2,ij)

where exp(·) is the exponential operation and η_1, η_2 are temperature hyper-parameters, set to 4 and 9 respectively.
The coefficients δ_ij and γ_ij are then used to weight the text local fine-grained features Y_l = {y_j} and the image local fine-grained features V_l = {v_i} respectively, giving the text local fine-grained features after cross-modal feature interaction Y_l' = {y'_i} and the image local fine-grained features after cross-modal feature interaction V_l' = {v'_j}, which are the outputs of the interaction attention network:

y'_i = Σ_j δ_ij y_j, i ∈ [1, K]
v'_j = Σ_i γ_ij v_i, j ∈ [1, L]

Step three B, the cosine similarity S_1 between Y_l' obtained in step three A and V_l obtained in step two A, and the cosine similarity S_2 between V_l' obtained in step three A and Y_l obtained in step two B, are computed.
The local-local level feature matching score S_LLl is the average of S_1 and S_2:

S_LLl = (S_1 + S_2) / 2
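The following sketch illustrates steps three A and three B for a single image-text pair. The normalisation axes used for s_m1 and s_m2 and the per-fragment averaging behind S_1 and S_2 are assumptions introduced to fill in formulas that are not fully reproduced in the text; η_1 = 4 and η_2 = 9 follow the values given above.

```python
import torch
import torch.nn.functional as F


def interaction_attention(V_l, Y_l, eta1=4.0, eta2=9.0):
    # V_l: (K, d) region features of one image, Y_l: (L, d) word features of one text
    s = V_l @ Y_l.t()                          # s_ij: (K, L) dot-product correlations
    s = s.clamp(min=0)                         # [x]_+ = max(x, 0)
    s_m1 = F.normalize(s, dim=1)               # region-as-query matrix (assumed L2 norm over words)
    s_m2 = F.normalize(s, dim=0)               # word-as-query matrix (assumed L2 norm over regions)
    delta = F.softmax(eta1 * s_m1, dim=1)      # delta_ij: weights over words for each region
    gamma = F.softmax(eta2 * s_m2, dim=0)      # gamma_ij: weights over regions for each word
    Y_l_att = delta @ Y_l                      # Y_l': attended text feature per region, (K, d)
    V_l_att = gamma.t() @ V_l                  # V_l': attended image feature per word, (L, d)
    S1 = F.cosine_similarity(Y_l_att, V_l, dim=-1).mean()   # region-wise average cosine
    S2 = F.cosine_similarity(V_l_att, Y_l, dim=-1).mean()   # word-wise average cosine
    S_LLl = 0.5 * (S1 + S2)
    return Y_l_att, V_l_att, S_LLl
```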
Step four, building the intra-modal different-granularity feature fusion network. The feature fusion network comprises a feature fusion sub-network within the image modality and a feature fusion sub-network within the text modality, each formed by a multi-head self-attention module connected to a gated fusion unit. The feature fusion sub-network within the image modality takes as input V_g obtained in step two and V_l' obtained in step three: the multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to regions with high similarity; its output V_o is then globally average-pooled, and the pooled V_o is sent to the gated fusion unit to be selectively fused with the image global coarse-grained feature V_g, yielding the image embedded representation V_f that fuses image features of different granularities, which is the output of the feature fusion sub-network within the image modality. Similarly, the feature fusion sub-network within the text modality takes as input Y_g obtained in step two and Y_l' obtained in step three: the multi-head self-attention module first computes the similarity between different words in Y_l' and gives higher weight to words with high similarity; its output Y_o is then globally average-pooled, and the pooled Y_o is sent to the gated fusion unit to be selectively fused with the text global coarse-grained feature Y_g, yielding the text embedded representation Y_f that fuses text features of different granularities, which is the output of the feature fusion sub-network within the text modality. Finally, the cosine similarity between V_f and Y_f is computed to obtain the global-local level (GLl) feature matching score S_GLl.
Step four A, the multi-head self-attention module is further described with reference to FIG. 1, FIG. 2 and FIG. 3. For the image local fine-grained features V_l' obtained in step three A, the module first applies layer normalization; the results of the h heads are then computed and concatenated; an output projection matrix finally gives MultiAttn_1, the multi-head total output for V_l':

MultiAttn_1 = Concat(head_1, head_2, ..., head_h) W^O
head_x = softmax(Q_x K_x^T / √d_v) V_x, with Q_x = LN(V_l') W_x^Q, K_x = LN(V_l') W_x^K, V_x = LN(V_l') W_x^V

where LN(·) denotes layer normalization, Concat(·) is the channel-dimension concatenation operation, W^O is the output projection matrix, W_x^Q, W_x^K and W_x^V are parameter matrices, v'_i and v'_j denote the i-th and j-th feature vectors of the layer-normalized V_l', head_x denotes the result of the x-th head, the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and d_v = d/h = 64.
Similarly, for the text local fine-grained features Y_l', layer normalization is applied first; the results of the h heads are computed and concatenated; an output projection matrix then gives MultiAttn_2, the multi-head total output for Y_l':

MultiAttn_2 = Concat(head_1, head_2, ..., head_h) W^O
head_x = softmax(Q_x K_x^T / √d_v) V_x, with Q_x = LN(Y_l') W_x^Q, K_x = LN(Y_l') W_x^K, V_x = LN(Y_l') W_x^V

where W^O is the output projection matrix, W_x^Q, W_x^K and W_x^V are parameter matrices, y'_i and y'_j denote the i-th and j-th feature vectors of the layer-normalized text representation, head_x denotes the output of the x-th head, d_v = d/h = 64, and the remaining parameters have the same meaning as above.
Finally, after the two multi-head total outputs are obtained, MultiAttn_1 is summed element-wise with V_l' to obtain V_o, and MultiAttn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:

V_o = V_l' + MultiAttn_1
Y_o = Y_l' + MultiAttn_2
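The multi-head self-attention module of step four A can be sketched with the standard PyTorch attention layer (h = 16 heads, d = 1024, d_v = 64); treating its internal output projection as the W^O above is an assumption of this sketch.

```python
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """Sketch of step four A: pre-norm multi-head self-attention with a residual sum."""

    def __init__(self, embed_dim=1024, num_heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, d) -- V_l' with N = K regions, or Y_l' with N = L words
        x_n = self.norm(x)                        # layer normalisation first
        attn_out, _ = self.attn(x_n, x_n, x_n)    # h heads concatenated, then output projection
        return x + attn_out                       # residual: V_o = V_l' + MultiAttn_1
```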
Step four B, the gated fusion unit is further described with reference to FIG. 1, FIG. 2 and FIG. 4. In the feature fusion sub-network within the image modality, the gated fusion unit first applies global average pooling to the V_o obtained in step four A and gate-fuses it with the image global coarse-grained feature V_g to obtain the image embedded representation V_f; in the feature fusion sub-network within the text modality, the gated fusion unit first applies global average pooling to the Y_o obtained in step four A and gate-fuses it with the text global coarse-grained feature Y_g to obtain the text embedded representation Y_f. The computation is as follows:

z_v = σ(W_v · concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y · concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 - z_v) × V_o + z_v × V_g
Y_f = (1 - z_y) × Y_o + z_y × Y_g

where σ denotes the sigmoid activation function, z_v and z_y are the fusion coefficients, W_v and W_y are the weight matrices to be learned, b_v and b_y are the bias terms to be learned, V_f and Y_f are the outputs of the gated fusion unit, namely the image embedded representation and the text embedded representation respectively, and the remaining parameters have the same meaning as above.
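A sketch of the gated fusion unit of step four B follows. Applying the convex combination to the pooled V_o (rather than to V_o itself) is an assumption of this sketch, made for dimensional consistency with the pooling described above.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Sketch of step four B: sigmoid gate between pooled fine-grained and global features."""

    def __init__(self, embed_dim=1024):
        super().__init__()
        self.gate = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, X_o, X_g):
        # X_o: (B, N, d) self-attention output, X_g: (B, d) global coarse-grained feature
        pooled = X_o.mean(dim=1)                      # global average pooling of X_o
        z = torch.sigmoid(self.gate(torch.cat([pooled, X_g], dim=-1)))
        return (1 - z) * pooled + z * X_g             # fused embedding V_f / Y_f
```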
Step four C, the cosine similarity between the V_f and Y_f obtained in step four B is computed to obtain the global-local level feature matching score S_GLl:

S_GLl = (V_f)^T Y_f / (||V_f|| · ||Y_f||)
Step five, computing the multi-level semantic matching total score between image-text pairs and training the multi-level semantic alignment model with a triplet ranking loss having adaptive margin values. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal different-granularity feature fusion network of step four; the multi-level semantic matching total score is obtained by weighting the global-global level feature matching score S_GGl of step two, the local-local level feature matching score S_LLl of step three and the global-local level feature matching score S_GLl of step four; finally, the model is trained with a triplet ranking loss whose margin values are adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin values are adaptively updated according to a preset rule, otherwise they remain unchanged.
Step five A, the feature matching scores of the three different levels obtained in step two, step three and step four are weighted and summed to obtain the multi-level semantic matching total score S_total(I, T) between an image-text pair:

S_total(I, T) = w_1 S_GGl + w_2 S_GLl + w_3 S_LLl

where S_total(I, T) is the multi-level semantic matching total score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients, set to 0.2, 0.6 and 0.2 respectively.
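The weighted total score of step five A is a one-line computation; the sketch below works on scalar scores or on batched (B × B) score matrices alike.

```python
def total_score(S_GGl, S_GLl, S_LLl, w1=0.2, w2=0.6, w3=0.2):
    # S_total = w1 * S_GGl + w2 * S_GLl + w3 * S_LLl, for scalars or tensors
    return w1 * S_GGl + w2 * S_GLl + w3 * S_LLl
```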
Step five B, the margin value is divided into m_v, used in the image retrieval process, and m_y, used in the text retrieval process. For the adaptive adjustment of the two margin values, the proportion of negative samples in the batch is first computed from the multi-level semantic matching total scores obtained in step five A, where B denotes the batch size and takes the value 128, (I, T+) and (T, I+) are matched image-text pairs, (I, T-) and (T, I-) are unmatched image-text pairs, Sum(χ > 0) denotes the number of elements greater than 0 in a matrix χ, r_y denotes the negative-sample proportion in the text retrieval process, r_v denotes the negative-sample proportion in the image retrieval process, and the initial values of m_y and m_v are both 0.2.
Then the larger of the two negative-sample proportions is taken; when it exceeds the threshold ζ_0, the two margin values are adaptively updated according to the preset rule, otherwise they remain unchanged:

r_m = max(r_v, r_y)

where r_m is the larger of the two negative-sample proportions, max denotes the maximum operation, and the threshold ζ_0 takes the value 0.5.
Finally, the triplet ranking loss L_total with adaptive margin values is expressed as follows:

L_total = max(m_y - S_total(I, T+) + S_total(I, T-), 0) + max(m_v - S_total(T, I+) + S_total(T, I-), 0)
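The sketch below illustrates the adaptive-margin triplet ranking loss of step five B on a batch score matrix S with matched pairs on the diagonal. The concrete margin-update rule is not spelled out in the text, so the 10% enlargement used here is purely illustrative; in the experiments the margins are updated only every 200 or 400 iterations rather than on every batch.

```python
import torch


def adaptive_triplet_loss(S, m_v=0.2, m_y=0.2, xi0=0.5):
    # S: (B, B) total-score matrix, S[i, j] = S_total(image i, text j); diagonal = positives
    B = S.size(0)
    pos = S.diag().view(B, 1)
    mask = ~torch.eye(B, dtype=torch.bool, device=S.device)
    # hinge terms: rows use an image as anchor (m_y), columns use a text as anchor (m_v)
    cost_txt = (m_y - pos + S).clamp(min=0)[mask]
    cost_img = (m_v - pos.t() + S).clamp(min=0)[mask]
    # fraction of negatives still violating the margin (the "negative-sample proportion")
    r_y = (cost_txt > 0).float().mean().item()
    r_v = (cost_img > 0).float().mean().item()
    if max(r_v, r_y) > xi0:
        m_v, m_y = m_v * 1.1, m_y * 1.1   # illustrative adaptive update only
    loss = cost_txt.sum() + cost_img.sum()
    return loss, m_v, m_y
```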
Step six, obtaining the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval comprises image retrieval and text retrieval: for image retrieval, the image to be retrieved is input into the trained multi-level semantic alignment model, the multi-level semantic matching total scores between the image and the candidate text descriptions are computed, and the text description with the highest total score is taken as the retrieval result for that image; for text retrieval, the text description to be retrieved is input into the trained model, the multi-level semantic matching total scores between the text description and the candidate images are computed, and the image with the highest total score is taken as the retrieval result for that text description. Whether the retrieved results match the ground truth is then checked, completing the bidirectional cross-modal image-text retrieval process.
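As an illustration of step six, retrieval reduces to ranking all candidates by the total matching score; `model.score` below is an assumed interface standing in for the full multi-level semantic alignment model, and the function names are illustrative.

```python
import torch


@torch.no_grad()
def retrieve_text(model, query_image, candidate_texts):
    # image retrieval direction: image query, text candidates
    scores = torch.tensor([float(model.score(query_image, t)) for t in candidate_texts])
    best = scores.argmax().item()
    return candidate_texts[best], scores[best].item()


@torch.no_grad()
def retrieve_image(model, query_text, candidate_images):
    # text retrieval direction: text query, image candidates
    scores = torch.tensor([float(model.score(img, query_text)) for img in candidate_images])
    best = scores.argmax().item()
    return candidate_images[best], scores[best].item()
```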
the software environment required in the experimental process of the invention comprises Ubuntu20.04 operating systems, python3.8 and Pytorch 1.10.0 deep learning frameworks; the hardware environment comprises an InterCore i9-12900K processor, a 64.0GB RAM and a display card which is a single NVIDIA GeForce RTX 3090.
The experimental settings of the invention are as follows: the training period is 20 epochs, the batch size is 128, and the initial learning rate is 2e-4, decayed by a factor of ten after 10 epochs. For the adaptive update of the margin values, the margins are updated every 200 iterations on MS-COCO and every 400 iterations on Flickr30K. To facilitate comparison with existing methods, two performance evaluation indices are adopted: recall R@K and Rsum. R@K denotes the percentage of correct results among the top K retrieved results with the highest scores and is reported as R@1, R@5 and R@10, i.e. the proportion of correct results within the top 1, 5 and 10 retrieved results; the higher the recall, the better the model. Rsum, the sum of R@1, R@5 and R@10, reflects the overall performance of the model. All experimental results are obtained on the test set, and 14 methods are used for comparison on the cross-modal retrieval task: SCAN, SCO, CAAN, VSE++, IMRAM, VSRN, SHAN, SGM, DPRNN, MLASM, Fusion layer, VCLTN, PASF and LAGSC.
TABLE 1 comparison of the present invention with the results of the prior art methods on the Flickr30K dataset
TABLE 2 comparison of the results of the present invention with the prior art methods on the MS-COCO dataset
As can be seen from the tables, the proposed method performs strongly on both cross-modal image-text retrieval datasets, MS-COCO and Flickr30K, and achieves the current best results on several indices. This shows that the method largely overcomes the lack of multi-level interaction of cross-modal features in existing research by comprehensively considering the interaction of features of different granularities at the three levels of global-global, global-local and local-local. In addition, optimization with the adaptive-margin triplet ranking loss greatly improves the accuracy of the cross-modal retrieval task. FIG. 5 shows bidirectional retrieval results of the trained model on the Flickr30K dataset; the retrieved results are consistent with the ground truth, further illustrating the effectiveness of the invention.

Claims (6)

1. A cross-modal image-text retrieval method based on multi-level semantic alignment is characterized by comprising the following steps:
step one, collecting a cross-modal image-text retrieval data set: collecting images and their corresponding text descriptions as the cross-modal image-text retrieval data set, forming an image-text pair from one image and one of its corresponding text descriptions, and dividing all collected image-text pairs into a training set, a validation set and a test set according to a certain rule;
step two, extracting features of the image-text pairs: for the image in an image-text pair, extracting K region features of each image with the object detector Faster R-CNN to obtain the image local fine-grained features V_l, and extracting the global coarse-grained feature V_g of each image with the convolutional neural network ResNet152; meanwhile, for the text in the image-text pair, extracting the word features of each text with the bidirectional gated recurrent unit BiGRU to obtain the text local fine-grained features Y_l; then applying global average pooling to Y_l to obtain the text global coarse-grained feature Y_g; finally, computing the cosine similarity between V_g and Y_g to obtain the global-global level (GGl) feature matching score S_GGl;
step three, building the inter-modal fine-grained feature interaction attention network, wherein the interaction attention network adopts a two-way symmetric structure and each way takes as input the image local fine-grained features V_l and the text local fine-grained features Y_l obtained in step two; first, the interaction attention network computes the dot-product correlation s_ij between the i-th image region and the j-th text word; second, two local correlation matrices are derived from the dot-product correlations, namely the correlation matrix between image regions and text words when the image region is the query and the correlation matrix between text words and image regions when the text word is the query, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2; then a Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij of the text words when the image region is the query and the weight coefficients γ_ij of the image regions when the text word is the query; next, δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l respectively, giving the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the text local fine-grained features after cross-modal feature interaction and V_l' denotes the image local fine-grained features after cross-modal feature interaction; finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
step four, building the intra-modal different-granularity feature fusion network, wherein the feature fusion network comprises a feature fusion sub-network within the image modality and a feature fusion sub-network within the text modality, each formed by a multi-head self-attention module connected to a gated fusion unit; the feature fusion sub-network within the image modality takes as input V_g obtained in step two and V_l' obtained in step three: the multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to regions with high similarity; its output V_o is then globally average-pooled, and the pooled V_o is sent to the gated fusion unit to be selectively fused with the image global coarse-grained feature V_g, yielding the image embedded representation V_f that fuses image features of different granularities, which is the output of the feature fusion sub-network within the image modality; similarly, the feature fusion sub-network within the text modality takes as input Y_g obtained in step two and Y_l' obtained in step three: the multi-head self-attention module first computes the similarity between different words in Y_l' and gives higher weight to words with high similarity; its output Y_o is then globally average-pooled, and the pooled Y_o is sent to the gated fusion unit to be selectively fused with the text global coarse-grained feature Y_g, yielding the text embedded representation Y_f that fuses text features of different granularities, which is the output of the feature fusion sub-network within the text modality; finally, the cosine similarity between V_f and Y_f is computed to obtain the global-local level (GLl) feature matching score S_GLl;
step five, computing the multi-level semantic matching total score between image-text pairs and training the multi-level semantic alignment model with a triplet ranking loss having adaptive margin values, wherein the multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal different-granularity feature fusion network of step four; the multi-level semantic matching total score is obtained by weighting the global-global level feature matching score S_GGl of step two, the local-local level feature matching score S_LLl of step three and the global-local level feature matching score S_GLl of step four; finally, the model is trained with a triplet ranking loss whose margin values are adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin values are adaptively updated according to a preset rule, otherwise they remain unchanged;
step six, obtaining the bidirectional cross-modal image-text retrieval results, wherein bidirectional retrieval comprises image retrieval and text retrieval: for image retrieval, the image to be retrieved is input into the trained multi-level semantic alignment model, the multi-level semantic matching total scores between the image and the candidate text descriptions are computed, and the text description with the highest total score is taken as the retrieval result for that image; for text retrieval, the text description to be retrieved is input into the trained model, the multi-level semantic matching total scores between the text description and the candidate images are computed, and the image with the highest total score is taken as the retrieval result for that text description; whether the retrieved results match the ground truth is then checked, completing the bidirectional cross-modal image-text retrieval process.
2. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the image-text pairs in step one are derived from the cross-modal retrieval open-source data sets MS-COCO and Flickr30K.
3. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the feature extraction of the image-text pair in the second step comprises the following steps:
step two, for each image in the image-text pair, extracting K regional features of the image-text pair by adopting a Faster R-CNN target detector with a ResNet101 as a main network pre-trained on an open source data set Visual Genome, wherein K is generally 36; all regional feature dimensions are adjusted to d by using a single-layer full-connection layer, and the local fine-grained feature of the image is obtainedWherein v is i For the i-th region feature vector of the image, d is the feature dimension, and the value is 1024,/-for the feature vector>Is vector space; meanwhile, the ResNet152 network is used for extracting the global feature of the whole image, and the global feature dimension is adjusted to d by using a single-layer full-connection layer to obtain the global coarse-granularity feature of the image>
Step two B, for each text in an image-text pair, the text is first segmented into words and each word is encoded as a one-hot vector; meanwhile, the pre-trained word embedding method GloVe processes the one-hot vectors to obtain the word embedding vector of each word; then, the word embedding vectors are fed into a bidirectional recurrent neural network BiGRU to extract the text local fine-grained features Y_l = {y_j | j = 1, ..., L, y_j ∈ ℝ^d}, where L is the number of words after text segmentation and y_j is the feature vector of the j-th word of the text; the extraction process is as follows:
→h_j = →GRU(t_j), ←h_j = ←GRU(t_j), y_j = (→h_j + ←h_j) / 2
where t_j is the word embedding vector of the j-th word of the text, →h_j and ←h_j are the hidden states of the forward operation →GRU and the backward operation ←GRU respectively, y_j takes the average of the two hidden states as the feature vector of the j-th word of the text, and d is the feature dimension, with the same value as for the image;
Finally, global average pooling is applied to the text local fine-grained features Y_l to obtain the text global coarse-grained feature Y_g ∈ ℝ^d:
Y_g = AvgPool(Y_l)
Wherein AvgPool represents a global average pooling operation;
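The text branch of step two B can be sketched as follows; the vocabulary size and the 300-dimensional embedding are assumptions standing in for the GloVe vectors, and the averaging of forward and backward GRU states follows the description above.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # stand-in for GloVe vectors
        self.bigru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (B, L) word indices after segmentation
        emb = self.embed(token_ids)        # (B, L, embed_dim)
        out, _ = self.bigru(emb)           # (B, L, 2*d) forward and backward hidden states
        fwd, bwd = out.chunk(2, dim=-1)
        Y_l = (fwd + bwd) / 2              # (B, L, d) text local fine-grained features
        Y_g = Y_l.mean(dim=1)              # (B, d) text global coarse-grained feature
        return Y_l, Y_g

enc = TextEncoder()
Y_l, Y_g = enc(torch.randint(0, 10000, (2, 12)))
print(Y_l.shape, Y_g.shape)  # torch.Size([2, 12, 1024]) torch.Size([2, 1024])
```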
Step two C, the cosine similarity between the image global coarse-grained feature V_g obtained in step two A and the text global coarse-grained feature Y_g obtained in step two B is computed to obtain the global-global level feature matching score S_GGl:
S_GGl = (V_g)^T Y_g / (‖V_g‖ · ‖Y_g‖)
where ‖·‖ denotes the L2 norm and the superscript T denotes the transpose operation.
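The global-global score of step two C is a plain cosine similarity between the two global features; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def global_global_score(V_g: torch.Tensor, Y_g: torch.Tensor) -> torch.Tensor:
    # V_g, Y_g: (B, d) image and text global coarse-grained features; returns (B,) scores
    return F.cosine_similarity(V_g, Y_g, dim=-1)

S_GGl = global_global_score(torch.randn(4, 1024), torch.randn(4, 1024))
print(S_GGl.shape)  # torch.Size([4])
```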
4. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the inter-modal fine-grained feature interaction attention network in the third step comprises the following steps:
Step three A, the interaction attention network first computes the dot-product correlation between image region and text word feature vectors:
s_ij = v_i^T y_j,  i ∈ [1, 36], j ∈ [1, L]
where s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is as defined above;
Next, normalized local correlation matrices are obtained from the dot-product correlations between image regions and text words: s_m1 denotes the normalized correlation matrix with image regions as the query over text words, s_m2 denotes the normalized correlation matrix with text words as the query over image regions, and negative correlations are clipped with [x]_+ = max(x, 0);
Then, a Softmax operation is applied to s_m1 and s_m2 to obtain the weight coefficient δ_ij of each corresponding text word when an image region serves as the query, and the weight coefficient γ_ij of each image region when a text word serves as the query, where exp(·) denotes the exponential operation and η_1 and η_2 are temperature hyperparameters with values 4 and 9, respectively;
Then, the obtained coefficients δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l respectively, yielding the text local fine-grained features Y_l' and the image local fine-grained features V_l' after cross-modal feature interaction, which form the output of the interaction attention network:
y'_i = Σ_{j=1..L} δ_ij · y_j
v'_j = Σ_{i=1..K} γ_ij · v_i
where y'_i is the i-th word feature vector in Y_l' and v'_j is the j-th region feature vector in V_l';
Step three B, the cosine similarity S_1 between V_l' obtained in step three A and Y_l obtained in step two, and the cosine similarity S_2 between Y_l' obtained in step three A and V_l obtained in step two, are computed;
The local-local level feature matching score S_LLl is taken as the average of S_1 and S_2:
S_LLl = (S_1 + S_2) / 2
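A sketch of the step-three interaction attention and local-local score follows; since the patent text only specifies the clipping [x]_+ = max(x, 0) and the temperatures η_1 = 4 and η_2 = 9, the SCAN-style L2 normalization of the correlation matrix and the averaging of the per-element cosine similarities are assumptions.

```python
import torch
import torch.nn.functional as F

def interaction_attention(V_l, Y_l, eta1=4.0, eta2=9.0):
    # V_l: (K, d) image region features, Y_l: (L, d) text word features
    s = V_l @ Y_l.t()                                # (K, L) dot-product correlations
    s_m1 = F.normalize(s.clamp(min=0), p=2, dim=0)   # regions as query (assumed normalization)
    s_m2 = F.normalize(s.clamp(min=0), p=2, dim=1)   # words as query  (assumed normalization)
    delta = F.softmax(eta1 * s_m1, dim=1)            # (K, L) weights over words per region
    gamma = F.softmax(eta2 * s_m2, dim=0)            # (K, L) weights over regions per word
    Y_l_prime = delta @ Y_l                          # (K, d) attended text features
    V_l_prime = gamma.t() @ V_l                      # (L, d) attended image features
    return V_l_prime, Y_l_prime

def local_local_score(V_l, Y_l):
    V_l_prime, Y_l_prime = interaction_attention(V_l, Y_l)
    S1 = F.cosine_similarity(V_l_prime, Y_l, dim=-1).mean()  # averaging is an assumption
    S2 = F.cosine_similarity(Y_l_prime, V_l, dim=-1).mean()
    return (S1 + S2) / 2

print(local_local_score(torch.randn(36, 1024), torch.randn(12, 1024)))
```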
5. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the intra-modal feature fusion networks with different granularities in step four comprises the following steps:
Step four A, for the image local fine-grained features V_l' obtained in step three A, the multi-head self-attention module first applies layer normalization; the results of the h heads are then computed and concatenated, and an output projection matrix is applied to obtain the multi-head total output MultiAttn_1 for V_l'; here concat(·) denotes concatenation along the channel dimension, the learnable parameters are the output projection matrix and the per-head parameter matrices, v'_i and v'_j denote the i-th and j-th feature vectors in the layer-normalized V_l', the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and each head has dimension d_v = d/h = 64;
Similarly, for the text local fine-grained features Y_l', layer normalization is first applied; the results of the h heads are then computed and concatenated, and an output projection matrix is applied to obtain the multi-head total output MultiAttn_2 for Y_l'; here y'_i and y'_j denote the i-th and j-th feature vectors in the layer-normalized text representation, d_v = d/h = 64, and the remaining parameters have the same meaning as before;
Finally, after the two multi-head total outputs are obtained, MultiAttn_1 is summed element-wise with V_l' to obtain V_o, and MultiAttn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:
V_o = V_l' + MultiAttn_1
Y_o = Y_l' + MultiAttn_2
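The intra-modal branch of step four A (layer normalization, h = 16 heads over d = 1024 features, output projection, residual sum) can be sketched with PyTorch's built-in multi-head attention; using nn.MultiheadAttention instead of hand-written heads is an implementation choice, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class IntraModalSelfAttention(nn.Module):
    def __init__(self, d=1024, h=16):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)

    def forward(self, X_l):
        # X_l: (B, N, d) local features (image regions or text words) after interaction
        X_n = self.norm(X_l)                    # layer-normalized features
        attn_out, _ = self.mhsa(X_n, X_n, X_n)  # multi-head total output (includes output projection)
        return X_l + attn_out                   # element-wise residual sum -> X_o

block = IntraModalSelfAttention()
V_o = block(torch.randn(2, 36, 1024))
print(V_o.shape)  # torch.Size([2, 36, 1024])
```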
Step four B, the gating fusion unit in the feature fusion sub-network within the image modality first applies global average pooling to V_o obtained in step four A and then performs gated fusion with the image global coarse-grained feature V_g to obtain the image embedded representation V_f; the gating fusion unit in the feature fusion sub-network within the text modality first applies global average pooling to Y_o obtained in step four A and then performs gated fusion with the text global coarse-grained feature Y_g to obtain the text embedded representation Y_f; the calculation process is as follows:
z_v = σ(W_v · concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y · concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 - z_v) × V_o + z_v × V_g
Y_f = (1 - z_y) × Y_o + z_y × Y_g
where σ denotes the sigmoid activation function, z_v and z_y are the gating fusion coefficients, W_v and W_y are the weight matrices to be learned, b_v and b_y are the bias terms to be learned, and V_f and Y_f are the outputs of the gating fusion units, representing the image embedded representation and the text embedded representation respectively; the remaining parameters have the same meaning as above;
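A sketch of the step-four-B gating fusion unit; mixing the average-pooled self-attention output (rather than the un-pooled V_o or Y_o) into the final embedding is an interpretation consistent with the pooling described above, since the fused embedding must match the global feature's dimension.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)  # learnable weight matrix W and bias b of the gate

    def forward(self, X_o, X_g):
        # X_o: (B, N, d) self-attention output, X_g: (B, d) global coarse-grained feature
        pooled = X_o.mean(dim=1)                                   # AvgPool over regions/words
        z = torch.sigmoid(self.fc(torch.cat([pooled, X_g], -1)))   # gating fusion coefficient
        return (1 - z) * pooled + z * X_g                          # fused embedding X_f

fuse = GatedFusion()
V_f = fuse(torch.randn(2, 36, 1024), torch.randn(2, 1024))
print(V_f.shape)  # torch.Size([2, 1024])
```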
Step four C, the cosine similarity between V_f and Y_f obtained in step four B is computed to obtain the global-local level feature matching score S_GLl:
S_GLl = (V_f)^T Y_f / (‖V_f‖ · ‖Y_f‖)
6. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the step five of calculating the multi-level semantic matching total score between image-text pairs and training the model with a triplet ranking loss with adaptive margin values comprises the following steps:
Step five A, the feature matching scores of the three different levels obtained in step two, step three and step four are weighted and summed to obtain the multi-level semantic matching total score S_total(I, T) between an image-text pair:
S_total(I, T) = w_1 · S_GGl + w_2 · S_GLl + w_3 · S_LLl
where S_total(I, T) is the multi-level semantic matching total score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients with values 0.2, 0.6 and 0.2, respectively;
Step five B, the margin value is divided into two kinds: m_v for the image retrieval process and m_y for the text retrieval process; to adaptively adjust the two margin values, the proportion of negative samples in each batch is first computed from the multi-level semantic matching total scores obtained in step five A, where B denotes the batch size with a value of 128, (I, T+) and (T, I+) are matched image-text pairs, (I, T-) and (T, I-) are unmatched image-text pairs, sum(χ > 0) denotes the number of elements greater than 0 in a matrix χ, r_y denotes the negative-sample proportion in the text retrieval process, r_v denotes the negative-sample proportion in the image retrieval process, and the initial values of m_y and m_v are both 0.2;
Then, the larger of the two negative-sample proportions is taken; when this value exceeds the threshold ζ_0, the two margin values are updated adaptively according to a certain rule, otherwise they remain unchanged, as given by:
r_m = max(r_v, r_y)
where r_m is the larger of the two negative-sample proportions, max denotes the maximum operation, and the threshold ζ_0 takes the value 0.5;
Finally, the triplet ranking loss L_total with adaptive margin values is expressed as follows:
L_total = max(m_y - S_total(I, T+) + S_total(I, T-), 0) + max(m_v - S_total(T, I+) + S_total(T, I-), 0).
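A sketch of the step-five objective; the weights 0.2/0.6/0.2 and the initial margins 0.2 come from the text, while the definition of the negative-sample proportion and the margin update rule are assumptions, because the patent only states that the margins change "according to a certain rule" once the proportion exceeds the threshold 0.5.

```python
import torch

def total_score(S_GGl, S_GLl, S_LLl, w=(0.2, 0.6, 0.2)):
    # Weighted multi-level semantic matching total score.
    return w[0] * S_GGl + w[1] * S_GLl + w[2] * S_LLl

def adaptive_triplet_loss(S_pos_i2t, S_neg_i2t, S_pos_t2i, S_neg_t2i,
                          m_y=0.2, m_v=0.2, zeta0=0.5):
    # S_*_i2t: total scores for image-to-text retrieval, S_*_t2i: for text-to-image;
    # each tensor holds scores of matched (pos) or unmatched (neg) pairs in the batch.
    r_y = (S_neg_i2t > S_pos_i2t).float().mean()  # assumed definition of the proportion
    r_v = (S_neg_t2i > S_pos_t2i).float().mean()
    r_m = torch.max(r_v, r_y)
    if r_m > zeta0:                               # illustrative update rule, not the patent's
        m_y = m_y * (1 + r_m)
        m_v = m_v * (1 + r_m)
    loss_i2t = torch.clamp(m_y - S_pos_i2t + S_neg_i2t, min=0).mean()
    loss_t2i = torch.clamp(m_v - S_pos_t2i + S_neg_t2i, min=0).mean()
    return loss_i2t + loss_t2i

print(adaptive_triplet_loss(torch.rand(8), torch.rand(8), torch.rand(8), torch.rand(8)))
```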
CN202310855462.0A 2023-07-13 2023-07-13 Cross-modal image-text retrieval method based on multi-level semantic alignment Pending CN116821391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310855462.0A CN116821391A (en) 2023-07-13 2023-07-13 Cross-modal image-text retrieval method based on multi-level semantic alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310855462.0A CN116821391A (en) 2023-07-13 2023-07-13 Cross-modal image-text retrieval method based on multi-level semantic alignment

Publications (1)

Publication Number Publication Date
CN116821391A true CN116821391A (en) 2023-09-29

Family

ID=88124000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310855462.0A Pending CN116821391A (en) 2023-07-13 2023-07-13 Cross-modal image-text retrieval method based on multi-level semantic alignment

Country Status (1)

Country Link
CN (1) CN116821391A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117522479A (en) * 2023-11-07 2024-02-06 北京创信合科技有限公司 Accurate Internet advertisement delivery method and system
CN117708354A (en) * 2024-02-06 2024-03-15 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium
CN117708354B (en) * 2024-02-06 2024-04-30 湖南快乐阳光互动娱乐传媒有限公司 Image indexing method and device, electronic equipment and storage medium
CN117874262A (en) * 2024-03-12 2024-04-12 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching
CN117874262B (en) * 2024-03-12 2024-06-04 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching
CN118279925A (en) * 2024-06-04 2024-07-02 鲁东大学 Image text matching algorithm integrating local and global semantics

Similar Documents

Publication Publication Date Title
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Abiyev et al. Sign language translation using deep convolutional neural networks
CN110059217B (en) Image text cross-media retrieval method for two-stage network
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN116821391A (en) Cross-modal image-text retrieval method based on multi-level semantic alignment
Al-Jarrah et al. Recognition of gestures in Arabic sign language using neuro-fuzzy systems
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN114817673A (en) Cross-modal retrieval method based on modal relation learning
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Das et al. A deep sign language recognition system for Indian sign language
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN116561305A (en) False news detection method based on multiple modes and transformers
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN114973305B (en) Accurate human body analysis method for crowded people
Abdullahi et al. Spatial–temporal feature-based End-to-end Fourier network for 3D sign language recognition
Li et al. Egocentric action recognition by automatic relation modeling
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system
Wang et al. Listen, look, and find the one: Robust person search with multimodality index

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication