CN116821391A - Cross-modal image-text retrieval method based on multi-level semantic alignment - Google Patents
Cross-modal image-text retrieval method based on multi-level semantic alignment
- Publication number
- CN116821391A (application number CN202310855462.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- feature
- global
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/532—Query formulation, e.g. graphical querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A cross-modal image-text retrieval method based on multi-level semantic alignment, belonging to the technical fields of cross-modal retrieval and artificial intelligence. The method provides a simple, symmetric network architecture for encoding image and text features that jointly considers global-global, global-local and local-local semantic alignment. By introducing an inter-modal fine-grained feature interaction attention network and intra-modal fusion networks for features of different granularities, it realizes fusion and interaction of multi-granularity features at multiple levels, addressing two weaknesses of existing cross-modal retrieval work: weak multi-granularity feature interaction, and difficulty in distinguishing image-text pairs with similar image region features or similar text semantics. In addition, the method trains with a triplet ranking loss built on the multi-level semantic matching total score and an adaptive margin value, achieving better cross-modal semantic alignment and markedly improving the accuracy of cross-modal image-text retrieval.
Description
Technical Field
The invention relates to the technical fields of cross-modal retrieval and artificial intelligence, and in particular to a cross-modal image-text retrieval method based on multi-level semantic alignment.
Background
Cross-modal image-text retrieval is an actively developing research direction in multi-modal learning. It performs bidirectional retrieval between images and texts and is widely used in information retrieval, recommendation, and related fields. With the arrival of the big-data era and the growth of the internet, multi-modal data dominated by images and text is increasing exponentially, and effectively fusing and aligning large volumes of multi-source heterogeneous image and text data to meet users' diverse retrieval needs is a challenging task. Many research efforts have explored efficient interaction mechanisms between cross-modal image-text pairs; among them, deep-learning methods based on deep neural networks have shown great potential in cross-modal retrieval and achieved notable results. However, most cross-modal retrieval research still suffers from weak cross-modal feature interaction or a lack of intra-modal semantic alignment, and has difficulty distinguishing image-text pairs whose images share similar local regions or whose texts have similar descriptions.
Existing cross-modal retrieval research falls mainly into global-level and local-level feature alignment methods. Global-level methods learn a two-path deep neural network: a convolutional neural network (CNN) and a recurrent neural network (RNN) first extract global features of the image and the text respectively, and the global features of the two modalities are then mapped into a joint representation space to obtain cross-modal global correlation. However, such coarse-grained alignment ignores local relationships between modalities and lacks fine interaction between image regions and text words, so its retrieval results are poor. Because a text usually describes only certain locally salient regions of an image, and those regions and their corresponding descriptive words are critical to retrieval, later research has focused on local-level feature alignment. These methods extract fine-grained features of image regions and text words and model the associations between local fragments of the two modalities, realizing fusion and alignment of cross-modal fine-grained features. Local-level feature matching improves the accuracy of cross-modal retrieval to some extent, but it directly converts the global association of a whole image-text pair into local similarities between regions and words; by over-emphasizing local detail it ignores global information and therefore cannot distinguish the same word used with different meanings in different contexts. Moreover, for lack of multi-granularity feature fusion and multi-level semantic interaction, current global-level and local-level alignment methods have difficulty distinguishing image-text pairs with similar local image regions or similar text semantics. In addition, most current work uses a triplet ranking loss with a fixed margin value; this loss performs well when image visual features differ greatly and text semantics differ, but performs poorly on hard cases where image features and text semantics are very similar.
In summary, current cross-modal image-text retrieval methods mainly have the following problems: 1. existing retrieval models lack multi-granularity feature fusion and multi-level semantic alignment of images and texts, so they have difficulty distinguishing image-text pairs with similar local image regions or similar text semantics; 2. a triplet ranking loss based on a fixed margin value prevents the model from further distinguishing hard cases during training. The present invention addresses these two problems so that cross-modal features can be fully aligned and can interact at multiple levels.
Disclosure of Invention
The invention aims to solve the poor retrieval performance caused by insufficient multi-granularity feature interaction in existing image-text retrieval methods. It provides a multi-level semantic alignment cross-modal retrieval method that simultaneously considers global-global, global-local and local-local matching, and further improves the accuracy of cross-modal retrieval through a triplet ranking loss with an adaptive margin value, giving it wide application value.
In order to achieve the above purpose, the invention provides a cross-modal image-text retrieval method based on multi-level semantic alignment, which comprises the following steps:
Step one, collect a cross-modal image-text retrieval data set. Images and their corresponding text descriptions are collected as the data set; each image and one of its corresponding text descriptions form an image-text pair, and all collected pairs are divided into a training set, a validation set and a test set according to a given rule;
Step two, extract features of the image-text pairs. For the image in each pair, the object detector Faster R-CNN extracts K region features to obtain the local fine-grained image features V_l, and the convolutional neural network ResNet152 extracts the global coarse-grained feature V_g of the image. For the text in each pair, the bidirectional gated recurrent unit BiGRU extracts word features to obtain the local fine-grained text features Y_l; global average pooling of Y_l then yields the global coarse-grained text feature Y_g. Finally, the cosine similarity between V_g and Y_g gives the global-global level (GGl) feature matching score S_GGl;
Step three, build an inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-path symmetric structure; each path takes as input the local fine-grained image features V_l and local fine-grained text features Y_l obtained in step two. First, the network computes the dot-product correlation s_ij between the i-th image region and the j-th text word. Two local correlation matrices are then obtained from these correlations, one with the image region as query against the text words and one with the text word as query against the image regions, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2. A Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions. The coefficients δ_ij and γ_ij are then used to weight the local fine-grained features V_l and Y_l, producing the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction. Finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
Step four, build intra-modal fusion networks for features of different granularities. The feature fusion network comprises an intra-image-modality fusion sub-network and an intra-text-modality fusion sub-network, each formed by a multi-head self-attention module connected to a gated fusion unit. The image sub-network takes as input V_g from step two and V_l' from step three. The multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to highly similar regions; its output V_o is then globally average-pooled and sent, together with the global coarse-grained image feature V_g, to the gated fusion unit for selective gated fusion, producing the image embedding V_f that fuses image features of different granularities, which is the output of the image sub-network. Similarly, the text sub-network takes as input Y_g from step two and Y_l' from step three. The multi-head self-attention module computes the similarity between different words in Y_l' and gives higher weight to highly similar words; its output Y_o is globally average-pooled and fused with the global coarse-grained text feature Y_g in the gated fusion unit, producing the text embedding Y_f that fuses text features of different granularities, which is the output of the text sub-network. Finally, the cosine similarity between V_f and Y_f gives the global-local level (GLl) feature matching score S_GLl;
Step five, compute the multi-level semantic matching total score between image-text pairs and train the multi-level semantic alignment model with a triplet ranking loss that has an adaptive margin value. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four. The multi-level semantic matching total score is a weighted sum of the global-global score S_GGl from step two, the local-local score S_LLl from step three and the global-local score S_GLl from step four. The model is trained with a triplet ranking loss whose margin value is adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin value is adapted according to a given rule, otherwise it remains unchanged;
Step six, obtain the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval covers image retrieval and text retrieval: for image retrieval, the image to be retrieved is fed into the trained multi-level semantic alignment model to obtain its multi-level semantic matching total scores against the text descriptions, and the text description with the highest total score is returned as the retrieval result; for text retrieval, the text description to be retrieved is fed into the trained model to obtain its total scores against the images, and the image with the highest total score is returned as the retrieval result. Comparing the retrieved results with the ground truth completes the bidirectional cross-modal image-text retrieval process.
compared with the prior art, the invention has the following beneficial effects:
(1) Compared with existing cross-modal retrieval work, the invention builds a simple, symmetric multi-level semantic alignment network that accounts simultaneously for global-global, global-local and local-local matching, aligning and matching cross-modal features at different granularity levels. It obtains more robust image and text embeddings, allows images and texts in different representation spaces to fuse their features more thoroughly, and greatly improves the accuracy of cross-modal retrieval.
(2) The invention adopts a triplet ranking loss with an adaptive margin value, which helps the model discriminate hard cases, i.e. image-text pairs with similar features, during training; by enforcing stronger matching of positive samples and stronger separation of negative samples, it achieves better cross-modal semantic alignment.
Drawings
FIG. 1 is a flow chart of the cross-modal image-text retrieval method based on multi-level semantic alignment;
FIG. 2 is a block diagram of a multi-level semantic alignment model;
FIG. 3 is a block diagram of a multi-headed self-attention module;
FIG. 4 is a block diagram of a gated fusion unit;
FIG. 5 is an example of bi-directional retrieval results of a trained multi-level semantic alignment model on a Flickr30K open source dataset.
Detailed Description
The invention is described in further detail below with reference to FIG. 1 and a specific example. The cross-modal image-text retrieval method based on multi-level semantic alignment comprises the following steps:
Step one, collect a cross-modal image-text retrieval data set. Images and their corresponding text descriptions are collected as the data set; each image and one of its corresponding text descriptions form an image-text pair, and all collected pairs are divided into a training set, a validation set and a test set according to a given rule.
the collected pairs of images are derived from cross-modal retrieval of open source datasets MS-COCO and Flickr30K, where each image has a corresponding five text descriptions. For the partitioning of the dataset, MS-COCO contained 123287 images in total, using 5000 images and corresponding text descriptions as the validation set, and 5000 images and corresponding text descriptions as the test set, the remaining images and corresponding text descriptions as the training set; the Flickr30K contains 31784 images in total, using 1000 images and corresponding text descriptions as a validation set, another 1000 images and corresponding text descriptions as a test set, and the remaining images and corresponding text descriptions as a training set.
Step two, extract features of the image-text pairs. For the image in each pair, the object detector Faster R-CNN extracts K region features to obtain the local fine-grained image features V_l, and the convolutional neural network ResNet152 extracts the global coarse-grained feature V_g of each image. For the text in each pair, the bidirectional gated recurrent unit BiGRU extracts word features to obtain the local fine-grained text features Y_l; global average pooling of Y_l then yields the global coarse-grained text feature Y_g. Finally, the cosine similarity between V_g and Y_g gives the global-global level (GGl) feature matching score S_GGl.
Step two A: for each image in an image-text pair, a Faster R-CNN object detector with a ResNet101 backbone pre-trained on the open-source dataset Visual Genome extracts K region features, where K is typically 36. A single fully connected layer adjusts all region feature dimensions to d, giving the local fine-grained image features V_l = {v_1, ..., v_K} ∈ R^{K×d}, where v_i is the feature vector of the i-th image region and d is the feature dimension, set to 1024. Meanwhile, a ResNet152 network extracts the global feature of the whole image, and a single fully connected layer adjusts its dimension to d, giving the global coarse-grained image feature V_g ∈ R^d.
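By way of illustration only, the sketch below shows how the single fully connected projections of step two A could look in PyTorch, assuming the region features and the ResNet152 global feature have already been extracted; the 2048-dimensional input sizes, the class name and the random stand-in inputs are assumptions of the example, not details fixed by the invention.

```python
import torch
import torch.nn as nn

class ImageFeatureProjection(nn.Module):
    """Projects pre-extracted region and global image features to dimension d (sketch)."""
    def __init__(self, region_dim=2048, global_dim=2048, d=1024):
        super().__init__()
        self.fc_local = nn.Linear(region_dim, d)    # single fully connected layer for V_l
        self.fc_global = nn.Linear(global_dim, d)   # single fully connected layer for V_g

    def forward(self, region_feats, global_feat):
        # region_feats: (B, K, region_dim) from Faster R-CNN, K = 36 (assumed shapes)
        # global_feat:  (B, global_dim) from ResNet152
        V_l = self.fc_local(region_feats)   # (B, K, d) local fine-grained features
        V_g = self.fc_global(global_feat)   # (B, d)    global coarse-grained feature
        return V_l, V_g

# usage with random stand-ins for the detector / CNN outputs
proj = ImageFeatureProjection()
V_l, V_g = proj(torch.randn(2, 36, 2048), torch.randn(2, 2048))
```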
Step two B: each text in an image-text pair is first segmented into words, and each segmented word is encoded as a one-hot vector. The pre-trained word embedding method GloVe then maps each one-hot vector to a word embedding vector. The word embedding vectors are fed into the bidirectional recurrent network BiGRU to extract the local fine-grained text features Y_l = {y_1, ..., y_L} ∈ R^{L×d}, where L is the number of words after segmentation and y_j is the feature vector of the j-th word, obtained as:

y_j = (h_j^→ + h_j^←) / 2

where t_j is the word embedding vector of the j-th word of the text, h_j^→ and h_j^← are the hidden states of the forward and backward GRU passes over t_j respectively, y_j is the average of the two hidden states, and the feature dimension d takes the same value as for the image.
Finally, global average pooling of the local fine-grained text features Y_l gives the global coarse-grained text feature Y_g:

Y_g = AvgPool(Y_l)

where AvgPool denotes the global average pooling operation.
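By way of illustration only, a minimal PyTorch sketch of the BiGRU word encoder and global average pooling of step two B follows; the vocabulary size, the 300-dimensional GloVe embeddings and all names are assumptions of the example.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """BiGRU word encoder with global average pooling (sketch; dimensions are assumptions)."""
    def __init__(self, vocab_size=10000, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # would be initialized with GloVe vectors
        self.bigru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (B, L) word indices after segmentation
        t = self.embed(token_ids)                  # (B, L, embed_dim) word embeddings
        h, _ = self.bigru(t)                       # (B, L, 2*d) forward/backward hidden states
        half = h.size(-1) // 2
        fwd, bwd = h[..., :half], h[..., half:]
        Y_l = (fwd + bwd) / 2                      # (B, L, d) word feature y_j = mean of both directions
        Y_g = Y_l.mean(dim=1)                      # (B, d) global average pooling
        return Y_l, Y_g

enc = TextEncoder()
Y_l, Y_g = enc(torch.randint(0, 10000, (2, 12)))
```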
Step two C: the cosine similarity between the global coarse-grained image feature V_g obtained in step two A and the global coarse-grained text feature Y_g obtained in step two B gives the global-global level feature matching score S_GGl:

S_GGl = (V_g)^T Y_g / (||V_g|| · ||Y_g||)

where ||·|| denotes the L2 norm and the superscript T denotes the transpose operation.
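An illustrative one-liner for this global-global score, assuming batched feature tensors; the function name is not taken from the invention.

```python
import torch
import torch.nn.functional as F

def global_global_score(V_g, Y_g):
    """S_GGl: cosine similarity between image and text global features (sketch)."""
    return F.cosine_similarity(V_g, Y_g, dim=-1)   # (B,)

S_GGl = global_global_score(torch.randn(2, 1024), torch.randn(2, 1024))
```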
Step three, build an inter-modal fine-grained feature interaction attention network. The interaction attention network adopts a two-path symmetric structure; each path takes as input the local fine-grained image features V_l and local fine-grained text features Y_l obtained in step two. First, the network computes the dot-product correlation s_ij between the i-th image region and the j-th text word. Two local correlation matrices are then obtained from these correlations, one with the image region as query against the text words and one with the text word as query against the image regions, and both are normalized to give the normalized local correlation matrices s_m1 and s_m2. A Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions. The coefficients δ_ij and γ_ij are then used to weight the local fine-grained features V_l and Y_l, producing the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction. Finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are computed, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2.
Step three A (see FIG. 1 and FIG. 2): the interaction attention network first computes the dot-product correlation between each image region and text word feature vector pair:

s_ij = (v_i)^T y_j, i ∈ [1, K], j ∈ [1, L]

Two normalized local correlation matrices s_m1 and s_m2 are then obtained from these correlations, with negative correlations clipped by [x]_+ = max(x, 0) before normalization, where s_m1 denotes the normalized correlation matrix between image regions (as query) and text words, and s_m2 denotes the normalized correlation matrix between text words (as query) and image regions.
Then, a Softmax operation over s_m1 and s_m2 yields the weight coefficients δ_ij of the text words when the image region acts as query and the weight coefficients γ_ij of the image regions when the text word acts as query, where exp(·) is the exponential operation and η_1, η_2 are temperature hyper-parameters, set to 4 and 9 respectively.
The coefficients δ_ij and γ_ij are then used to weight the local fine-grained text features Y_l and the local fine-grained image features V_l, giving the local fine-grained text features Y_l' and the local fine-grained image features V_l' after cross-modal feature interaction; these are the outputs of the interaction attention network. Here s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is as defined above.
Step three B: compute the cosine similarity S_1 between Y_l' obtained in step three A and V_l obtained in step two A, and the cosine similarity S_2 between V_l' obtained in step three A and Y_l obtained in step two B. The local-local level feature matching score S_LLl is the average of S_1 and S_2:

S_LLl = (S_1 + S_2) / 2
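By way of illustration only, the sketch below follows the flow of step three for a single image-text pair under SCAN-style assumptions: the exact normalization of s_m1 and s_m2, the assignment of the two temperatures, and the pooling of the per-region and per-word cosine similarities are not fully specified above, so those choices here are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def cross_attention_scores(V_l, Y_l, eta1=4.0, eta2=9.0):
    """Two-way fine-grained interaction attention for one pair (sketch, SCAN-style assumptions).
    V_l: (K, d) image region features, Y_l: (L, d) word features."""
    s = V_l @ Y_l.t()                              # (K, L) dot-product correlations s_ij
    s = s.clamp(min=0)                             # [x]_+ = max(x, 0)
    s_m1 = F.normalize(s, dim=0)                   # image region as query (assumed L2 normalization)
    s_m2 = F.normalize(s, dim=1)                   # text word as query
    delta = F.softmax(eta1 * s_m1, dim=1)          # (K, L) word weights per region
    gamma = F.softmax(eta2 * s_m2, dim=0)          # (K, L) region weights per word
    Y_l_prime = delta @ Y_l                        # (K, d) text features after interaction
    V_l_prime = gamma.t() @ V_l                    # (L, d) image features after interaction
    S1 = F.cosine_similarity(Y_l_prime, V_l, dim=-1).mean()   # assumed pooling over regions
    S2 = F.cosine_similarity(V_l_prime, Y_l, dim=-1).mean()   # assumed pooling over words
    return (S1 + S2) / 2, Y_l_prime, V_l_prime     # S_LLl and the two attended feature sets

S_LLl, Y_lp, V_lp = cross_attention_scores(torch.randn(36, 1024), torch.randn(12, 1024))
```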
Step four, build intra-modal fusion networks for features of different granularities. The feature fusion network comprises an intra-image-modality fusion sub-network and an intra-text-modality fusion sub-network, each formed by a multi-head self-attention module connected to a gated fusion unit. The image sub-network takes as input V_g from step two and V_l' from step three. The multi-head self-attention module first computes the similarity between different regions in V_l' and gives higher weight to highly similar regions; its output V_o is then globally average-pooled and sent, together with the global coarse-grained image feature V_g, to the gated fusion unit for selective gated fusion, producing the image embedding V_f that fuses image features of different granularities, which is the output of the image sub-network. Similarly, the text sub-network takes as input Y_g from step two and Y_l' from step three. The multi-head self-attention module computes the similarity between different words in Y_l' and gives higher weight to highly similar words; its output Y_o is globally average-pooled and fused with the global coarse-grained text feature Y_g in the gated fusion unit, producing the text embedding Y_f that fuses text features of different granularities, which is the output of the text sub-network. Finally, the cosine similarity between V_f and Y_f gives the global-local level (GLl) feature matching score S_GLl.
Step four A: the multi-head self-attention module is further described with reference to FIG. 1, FIG. 2 and FIG. 3. For the local fine-grained image features V_l' obtained in step three A, the module first applies layer normalization; it then computes the results of the h heads and concatenates them; finally, an output projection matrix produces the multi-head total output Multiattn_1 for V_l', where Concat(·) denotes concatenation along the channel dimension, W^O is the output projection matrix, the query, key and value projections are learnable parameter matrices, v_i' and v_j' denote the i-th and j-th feature vectors in the layer-normalized V_l', head_x denotes the result of the x-th head, the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and d_v = d/h = 64.
Similarly, for the local fine-grained text features Y_l', layer normalization is applied first; the results of the h heads are then computed and concatenated; finally, an output projection matrix produces the multi-head total output Multiattn_2 for Y_l', where W^O is the output projection matrix, the query, key and value projections are learnable parameter matrices, y_i' and y_j' denote the i-th and j-th feature vectors in the layer-normalized text representation, head_x denotes the output of the x-th head, d_v = d/h = 64, and the remaining parameters have the same meanings as above.
Finally, after the two multi-head total outputs are obtained, Multiattn_1 is summed element-wise with V_l' to obtain V_o, and Multiattn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:

V_o = V_l' + Multiattn_1
Y_o = Y_l' + Multiattn_2
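By way of illustration only, a sketch of one branch of the multi-head self-attention module (layer normalization, 16 heads with d_v = 64, element-wise residual sum) using PyTorch's built-in nn.MultiheadAttention; the class name and stand-in input are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Layer norm + 16-head self-attention + residual connection (sketch)."""
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)  # per-head dim = d/heads = 64

    def forward(self, x):
        # x: (B, N, d) -- V_l' for the image branch or Y_l' for the text branch
        h = self.norm(x)
        multiattn, _ = self.attn(h, h, h)   # multi-head total output
        return x + multiattn                # element-wise residual sum -> V_o or Y_o

block = SelfAttentionBlock()
V_o = block(torch.randn(2, 36, 1024))
```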
Step four B: the gated fusion unit is further described with reference to FIG. 1, FIG. 2 and FIG. 4. In the intra-image-modality fusion sub-network, the gated fusion unit first applies global average pooling to the V_o obtained in step four A and then gates it with the global coarse-grained image feature V_g to obtain the image embedding V_f; in the intra-text-modality fusion sub-network, the gated fusion unit first applies global average pooling to the Y_o obtained in step four A and then gates it with the global coarse-grained text feature Y_g to obtain the text embedding Y_f. The computation is:

z_v = σ(W_v Concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y Concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 − z_v) × V_o + z_v × V_g
Y_f = (1 − z_y) × Y_o + z_y × Y_g

where σ denotes the sigmoid activation function, z_v and z_y are the fusion coefficients, W_v and W_y are learnable weight matrices, b_v and b_y are learnable bias terms, V_f and Y_f are the outputs of the gated fusion units, namely the image embedding and the text embedding, and the remaining parameters have the same meanings as above.
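By way of illustration only, a sketch of the gated fusion unit follows; for shape consistency the pooled self-attention output is used on both sides of the gate, which is an assumption of the example rather than a restatement of the formula above, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of pooled fine-grained features with the global feature (sketch)."""
    def __init__(self, d=1024):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)   # W, b of the gate

    def forward(self, X_o, X_g):
        # X_o: (B, N, d) self-attention output, X_g: (B, d) global coarse-grained feature
        pooled = X_o.mean(dim=1)                                   # global average pooling
        z = torch.sigmoid(self.fc(torch.cat([pooled, X_g], -1)))   # fusion coefficient z
        return (1 - z) * pooled + z * X_g                          # fused embedding V_f or Y_f

fuse = GatedFusion()
V_f = fuse(torch.randn(2, 36, 1024), torch.randn(2, 1024))
```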
Step four C: the cosine similarity between V_f and Y_f obtained in step four B gives the global-local level feature matching score S_GLl:

S_GLl = (V_f)^T Y_f / (||V_f|| · ||Y_f||)
Step five, compute the multi-level semantic matching total score between image-text pairs and train the multi-level semantic alignment model with a triplet ranking loss that has an adaptive margin value. The multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four. The multi-level semantic matching total score is a weighted sum of the global-global score S_GGl from step two, the local-local score S_LLl from step three and the global-local score S_GLl from step four. The model is trained with a triplet ranking loss whose margin value is adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin value is adapted according to a given rule, otherwise it remains unchanged.
Step five A: the feature matching scores of the three levels obtained in steps two, three and four are weighted and summed to give the multi-level semantic matching total score S_total(I, T) between an image-text pair:

S_total(I, T) = w_1 S_GGl + w_2 S_GLl + w_3 S_LLl

where S_total(I, T) is the multi-level semantic matching total score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients set to 0.2, 0.6 and 0.2 respectively.
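An illustrative one-liner for the weighted total score with the weights given above; the function name is not from the invention.

```python
def total_score(S_GGl, S_GLl, S_LLl, w1=0.2, w2=0.6, w3=0.2):
    """Multi-level semantic matching total score S_total(I, T) (sketch)."""
    return w1 * S_GGl + w2 * S_GLl + w3 * S_LLl
```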
Step five B: the margin value is split into m_v for the image retrieval direction and m_y for the text retrieval direction. To adapt the two margin values, the proportion of negative samples in the batch is first computed from the multi-level semantic matching total scores obtained in step five A, where B denotes the batch size, set to 128, (I, T^+) and (T, I^+) are matched image-text pairs, (I, T^−) and (T, I^−) are unmatched image-text pairs, sum(χ > 0) denotes the number of elements greater than 0 in matrix χ, r_y is the negative-sample proportion in the text retrieval direction, r_v is the negative-sample proportion in the image retrieval direction, and the initial values of m_y and m_v are both 0.2.
Then the larger of the two negative-sample proportions is taken; when it exceeds the threshold ζ_0, the two margin values are adaptively updated according to a given rule, otherwise they remain unchanged:

r_m = max(r_v, r_y)

where r_m is the larger of the two negative-sample proportions, max denotes the maximum operation, and the threshold ζ_0 is 0.5.
Finally, the triplet ranking loss with adaptive margin values, L_total, is expressed as:

L_total = max(m_y − S_total(I, T^+) + S_total(I, T^−), 0) + max(m_v − S_total(T, I^+) + S_total(T, I^−), 0)
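By way of illustration only, the sketch below computes an adaptive-margin triplet ranking loss from a batch score matrix. The invention updates the margins "according to a given rule" that is not spelled out in this excerpt, so the small additive update, its step size, and the averaging over all negatives are assumptions of the example.

```python
import torch

def adaptive_triplet_loss(S, m_y=0.2, m_v=0.2, zeta0=0.5, step=0.05):
    """Triplet ranking loss with margins adapted to the negative-sample proportion (sketch).
    S: (B, B) matrix with S[i, j] = S_total(I_i, T_j); diagonal entries are matched pairs."""
    B = S.size(0)
    pos = S.diag()                                          # scores of matched pairs
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=S.device)
    cost_i2t = (m_y - pos.unsqueeze(1) + S).clamp(min=0)    # image query, text candidates
    cost_t2i = (m_v - pos.unsqueeze(0) + S).clamp(min=0)    # text query, image candidates
    # proportions of negatives still violating the current margins (r_y, r_v)
    r_y = (cost_i2t[neg_mask] > 0).float().mean()
    r_v = (cost_t2i[neg_mask] > 0).float().mean()
    if torch.max(r_y, r_v) > zeta0:                         # adapt margins (assumed rule: small increase)
        m_y, m_v = m_y + step, m_v + step
        cost_i2t = (m_y - pos.unsqueeze(1) + S).clamp(min=0)
        cost_t2i = (m_v - pos.unsqueeze(0) + S).clamp(min=0)
    loss = cost_i2t[neg_mask].mean() + cost_t2i[neg_mask].mean()
    return loss, (m_y, m_v)

loss, margins = adaptive_triplet_loss(torch.rand(128, 128))
```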
Step six, obtain the bidirectional cross-modal image-text retrieval results. Bidirectional retrieval covers image retrieval and text retrieval: for image retrieval, the image to be retrieved is fed into the trained multi-level semantic alignment model to obtain its multi-level semantic matching total scores against the text descriptions, and the text description with the highest total score is returned as the retrieval result; for text retrieval, the text description to be retrieved is fed into the trained model to obtain its total scores against the images, and the image with the highest total score is returned as the retrieval result. Comparing the retrieved results with the ground truth completes the bidirectional cross-modal image-text retrieval process.
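By way of illustration only, a sketch of the bidirectional retrieval of step six given a precomputed matrix of total scores; returning the top-k candidates instead of only the top-1 result is an assumption for convenience.

```python
import torch

def retrieve(S_total, k=5):
    """Bidirectional retrieval from a total-score matrix (sketch).
    S_total: (num_images, num_texts) multi-level semantic matching total scores."""
    texts_for_images = S_total.argsort(dim=1, descending=True)[:, :k]       # image -> top-k texts
    images_for_texts = S_total.argsort(dim=0, descending=True)[:k, :].t()   # text -> top-k images
    return texts_for_images, images_for_texts

texts_for_images, images_for_texts = retrieve(torch.rand(1000, 5000))
```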
the software environment required in the experimental process of the invention comprises Ubuntu20.04 operating systems, python3.8 and Pytorch 1.10.0 deep learning frameworks; the hardware environment comprises an InterCore i9-12900K processor, a 64.0GB RAM and a display card which is a single NVIDIA GeForce RTX 3090.
The experimental setup is as follows: the model is trained for 20 epochs with a batch size of 128 and an initial learning rate of 2e-4, decayed by a factor of ten after 10 epochs. For adaptive updating of the margin values, the margins are updated every 200 iterations on MS-COCO and every 400 iterations on Flickr30K. For comparison with existing methods, two performance metrics are used: recall R@K and Rsum. R@K is the percentage of queries for which a correct result appears among the top K retrieved results with the highest scores, reported as R@1, R@5 and R@10, i.e. the proportion of correct results within the top 1, 5 and 10 retrieved results; the higher the recall, the better the model. Rsum, the sum of R@1, R@5 and R@10, reflects overall performance. All results are obtained on the test set, and 14 cross-modal retrieval methods are compared: SCAN, SCO, CAAN, VSE++, IMRAM, VSRN, SHAN, SGM, DPRNN, MLASM, Fusion layer, VCLTN, PASF and LAGSC.
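By way of illustration only, an assumed training configuration matching the reported hyper-parameters; the optimizer type is not stated in the text, so Adam and the placeholder model are assumptions of the example.

```python
import torch

# assumed training configuration matching the reported hyper-parameters
model = torch.nn.Linear(1024, 1024)   # placeholder for the multi-level semantic alignment model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # optimizer choice is an assumption
# decay the learning rate by a factor of 10 after 10 of the 20 training epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)
batch_size = 128
```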
TABLE 1 comparison of the present invention with the results of the prior art methods on the Flickr30K dataset
TABLE 2 comparison of the results of the present invention with the prior art methods on the MS-COCO dataset
As the tables show, the proposed method performs strongly on both cross-modal image-text retrieval datasets, MS-COCO and Flickr30K, achieving the best current results on several metrics. This indicates that the method largely overcomes the lack of multi-level cross-modal feature interaction in existing work by jointly considering the interaction of features of different granularities at the global-global, global-local and local-local levels. In addition, training with the adaptive-margin triplet ranking loss greatly improves the accuracy of the cross-modal retrieval task. FIG. 5 shows bidirectional retrieval results of the trained model on the Flickr30K dataset; the retrieved results match the ground truth, further illustrating the effectiveness of the invention.
Claims (6)
1. A cross-modal image-text retrieval method based on multi-level semantic alignment is characterized by comprising the following steps:
step one, collecting a cross-modal image-text retrieval data set: collecting images and their corresponding text descriptions as the cross-modal image-text retrieval data set, forming an image-text pair from each image and one of its corresponding text descriptions, and dividing all collected image-text pairs into a training set, a validation set and a test set according to a given rule;
step two, extracting features of the image-text pairs: for the image in each image-text pair, extracting K region features of each image with the object detector Faster R-CNN to obtain the local fine-grained image features V_l, and extracting the global coarse-grained feature V_g of each image with the convolutional neural network ResNet152; meanwhile, for the text in each image-text pair, extracting word features of each text with the bidirectional gated recurrent unit BiGRU to obtain the local fine-grained text features Y_l; then performing global average pooling on Y_l to obtain the global coarse-grained text feature Y_g; finally, calculating the cosine similarity between V_g and Y_g to obtain the global-global level (GGl) feature matching score S_GGl;
step three, building an inter-modal fine-grained feature interaction attention network, wherein the interaction attention network adopts a two-path symmetric structure and each path takes as input the local fine-grained image features V_l and the local fine-grained text features Y_l obtained in step two; first, the interaction attention network calculates the dot-product correlation s_ij between the i-th image region and the j-th text word; secondly, two local correlation matrices are obtained from the dot-product correlations, namely the correlation matrix between image regions and text words with the image region as query and the correlation matrix between text words and image regions with the text word as query, and the two matrices are normalized to obtain the normalized local correlation matrices s_m1 and s_m2; then, a Softmax operation on s_m1 and s_m2 yields the weight coefficients δ_ij with the image region as query over the text words and the weight coefficients γ_ij with the text word as query over the image regions; next, the coefficients δ_ij and γ_ij are used to weight the local fine-grained features V_l and Y_l, giving the two outputs of the interaction attention network, Y_l' and V_l', where Y_l' denotes the local fine-grained text features after cross-modal feature interaction and V_l' denotes the local fine-grained image features after cross-modal feature interaction; finally, the cosine similarity S_1 between Y_l' and V_l and the cosine similarity S_2 between V_l' and Y_l are calculated, and the local-local level (LLl) feature matching score S_LLl is the average of S_1 and S_2;
step four, building intra-modal fusion networks for features of different granularities, wherein the feature fusion network comprises an intra-image-modality feature fusion sub-network and an intra-text-modality feature fusion sub-network, and each fusion sub-network is formed by a multi-head self-attention module connected to a gated fusion unit; for the intra-image-modality fusion sub-network, the inputs are V_g obtained in step two and V_l' obtained in step three; first, the multi-head self-attention module calculates the similarity between different regions in V_l' and gives higher weight to regions with high similarity; then the output V_o of the multi-head self-attention module is obtained, V_o is globally average-pooled, and the pooled V_o is sent to the gated fusion unit to be selectively gated and fused with the global coarse-grained image feature V_g, obtaining the image embedding V_f that fuses image features of different granularities, which is the output of the intra-image-modality fusion sub-network; similarly, for the intra-text-modality fusion sub-network, the inputs are Y_g obtained in step two and Y_l' obtained in step three; first, the multi-head self-attention module calculates the similarity between different words in Y_l' and gives higher weight to words with high similarity; then the output Y_o of the multi-head self-attention module is obtained, Y_o is globally average-pooled, and the pooled Y_o is sent to the gated fusion unit to be selectively gated and fused with the global coarse-grained text feature Y_g, obtaining the text embedding Y_f that fuses text features of different granularities, which is the output of the intra-text-modality fusion sub-network; finally, the cosine similarity between V_f and Y_f is calculated to obtain the global-local level (GLl) feature matching score S_GLl;
step five, calculating the multi-level semantic matching total score between image-text pairs and training the multi-level semantic alignment model with a triplet ranking loss having an adaptive margin value, wherein the multi-level semantic alignment model is formed by connecting the image-text feature extraction network of step two, the inter-modal fine-grained feature interaction attention network of step three and the intra-modal multi-granularity feature fusion network of step four; then the multi-level semantic matching total score is obtained by weighting the global-global level feature matching score S_GGl obtained in step two, the local-local level feature matching score S_LLl obtained in step three and the global-local level feature matching score S_GLl obtained in step four; finally, the multi-level semantic alignment model is trained with a triplet ranking loss having adaptive margin values, wherein the margin values are adjusted according to the proportion of negative samples in the batch: when the negative-sample proportion exceeds a threshold ζ_0, the margin values are adaptively changed according to a given rule, otherwise they remain unchanged;
step six, obtaining the bidirectional cross-modal image-text retrieval results, wherein bidirectional retrieval comprises image retrieval and text retrieval: for image retrieval, the image to be retrieved is input into the trained multi-level semantic alignment model to obtain the multi-level semantic matching total scores between the image and the text descriptions, and the text description with the highest total score is taken as the retrieval result for the image; for text retrieval, the text description to be retrieved is input into the trained multi-level semantic alignment model to obtain the multi-level semantic matching total scores between the text description and the images, and the image with the highest total score is taken as the retrieval result for the text description; whether the obtained bidirectional retrieval results match the ground truth is then checked, completing the bidirectional cross-modal image-text retrieval process.
2. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the image-text pairs in step one are drawn from the open-source cross-modal retrieval datasets MS-COCO and Flickr30K.
3. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the feature extraction of the image-text pair in the second step comprises the following steps:
step two A, for each image in the image-text pair, K region features are extracted with a Faster R-CNN object detector whose ResNet101 backbone is pre-trained on the open-source Visual Genome dataset, where K is typically 36; a single fully connected layer adjusts all region feature dimensions to d, giving the image local fine-grained features V_l = {v_1, ..., v_K} ∈ ℝ^(K×d), where v_i is the i-th region feature vector of the image, d is the feature dimension with value 1024, and ℝ denotes the vector space; meanwhile, a ResNet152 network extracts the global feature of the whole image, and a single fully connected layer adjusts the global feature dimension to d, giving the image global coarse-grained feature V_g ∈ ℝ^d;
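As an illustration of the projection in step two A, the sketch below assumes the K = 36 Faster R-CNN region features and the ResNet152 global feature have already been extracted with a typical dimensionality of 2048 (the input dimensionality is not stated in the claim and is an assumption here).

```python
import torch
import torch.nn as nn

class ImageFeatures(nn.Module):
    """Project pre-extracted region and global image features to d = 1024."""
    def __init__(self, in_dim=2048, d=1024):
        super().__init__()
        self.local_fc = nn.Linear(in_dim, d)    # single fully connected layer for V_l
        self.global_fc = nn.Linear(in_dim, d)   # single fully connected layer for V_g

    def forward(self, region_feats, global_feat):
        V_l = self.local_fc(region_feats)       # (K, d) local fine-grained features
        V_g = self.global_fc(global_feat)       # (d,)  global coarse-grained feature
        return V_l, V_g
```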
Step two B, for each text in the image-text pair, the text is first segmented into words and each word is encoded as a one-hot vector; the pre-trained word embedding method GloVe is then applied to the one-hot vectors to obtain a word embedding vector for each word; the word embedding vectors are then fed into a bidirectional recurrent neural network BiGRU to extract the text local fine-grained features Y_l = {y_1, ..., y_L} ∈ ℝ^(L×d), where L is the number of words after segmentation and y_j is the feature vector of the j-th word of the text; the extraction process is as follows:
h_j^→ = GRU^→(t_j, h_{j-1}^→), h_j^← = GRU^←(t_j, h_{j+1}^←), y_j = (h_j^→ + h_j^←) / 2
where t_j is the word embedding vector of the j-th word of the text, h_j^→ and h_j^← are the hidden states of the forward operation GRU^→ and the backward operation GRU^←, respectively, y_j takes the average of the two hidden states as the feature vector of the j-th word, and d is the feature dimension, taking the same value as for the image;
finally, global average pooling is applied to the text local fine-grained features Y_l to obtain the text global coarse-grained feature Y_g:
Y_g = AvgPool(Y_l)
where AvgPool denotes the global average pooling operation;
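A minimal sketch of the text branch in step two B, assuming 300-dimensional GloVe embeddings and a BiGRU hidden size of d = 1024; vocabulary handling and GloVe initialisation are omitted.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """GloVe-style embeddings -> BiGRU; average forward/backward states per word."""
    def __init__(self, vocab_size, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # would be initialised with GloVe
        self.bigru = nn.GRU(embed_dim, d, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # token_ids: (B, L)
        t = self.embed(token_ids)                 # (B, L, embed_dim) word embedding vectors
        h, _ = self.bigru(t)                      # (B, L, 2d) forward/backward states concatenated
        d = h.size(-1) // 2
        Y_l = (h[..., :d] + h[..., d:]) / 2       # (B, L, d) local fine-grained features
        Y_g = Y_l.mean(dim=1)                     # (B, d) global coarse-grained feature
        return Y_l, Y_g
```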
step two C, the cosine similarity between the image global coarse-grained feature V_g obtained in step two A and the text global coarse-grained feature Y_g obtained in step two B is calculated to obtain the global-global level feature matching score S_GGl:
S_GGl = (V_g^T Y_g) / (||V_g|| · ||Y_g||)
where ||·|| denotes the L2 norm and the superscript T denotes the transpose operation.
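The global-global score of step two C is simply the cosine similarity of the two pooled vectors; a one-line sketch:

```python
import torch.nn.functional as F

def global_global_score(V_g, Y_g):
    """S_GGl: cosine similarity between the global coarse-grained features."""
    return F.cosine_similarity(V_g, Y_g, dim=-1)
```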
4. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the inter-modal fine-grained feature interaction attention network in the third step comprises the following steps:
step three A, the interaction attention network firstly calculates dot product correlation between an image area and text word feature vectors:
s_ij = v_i^T y_j,  i ∈ [1, 36], j ∈ [1, L]
where s_ij denotes the dot-product correlation between the i-th region and the j-th word, and L is as defined above;
secondly, a normalized local correlation matrix can be obtained based on the dot product correlation between the image region and the text word:
where s_m1 denotes the normalized correlation matrix between image regions (as the query) and text words, s_m2 denotes the normalized correlation matrix between text words (as the query) and image regions, and [x]_+ = max(x, 0);
Then, a Softmax operation is applied to s_m1 and s_m2 to obtain the weight coefficient δ_ij of each text word when an image region serves as the query, and the weight coefficient γ_ij of each image region when a text word serves as the query:
δ_ij = exp(η_1 · s_m1,ij) / Σ_j exp(η_1 · s_m1,ij)
γ_ij = exp(η_2 · s_m2,ij) / Σ_i exp(η_2 · s_m2,ij)
where exp(·) is the exponential operation, and η_1 and η_2 are temperature hyper-parameters, set to 4 and 9 respectively;
then, the obtained coefficients δ_ij and γ_ij are used to weight the text local fine-grained features Y_l and the image local fine-grained features V_l, respectively, yielding the text local fine-grained features Y_l' and the image local fine-grained features V_l' after cross-modal feature interaction, which are the outputs of the interaction attention network:
y'_i = Σ_j δ_ij · y_j,  v'_j = Σ_i γ_ij · v_i
where y'_i is the i-th feature vector in Y_l' and v'_j is the j-th feature vector in V_l';
step three B, the cosine similarity S_1 between V_l' obtained in step three A and Y_l obtained in step two B, and the cosine similarity S_2 between Y_l' obtained in step three A and V_l obtained in step two A, are calculated:
where S_1 is the cosine similarity between V_l' and Y_l, and S_2 is the cosine similarity between Y_l' and V_l;
the local-local level feature matching score S_LLl is the average of S_1 and S_2:
S_LLl = (S_1 + S_2) / 2.
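A sketch of steps three A and three B for a single image-text pair follows. The hinge-then-L2 normalisation of the correlation matrix and the averaging over regions/words when computing S_1 and S_2 are assumptions in the SCAN style, since the claims do not fully specify them.

```python
import torch
import torch.nn.functional as F

def interaction_attention(V_l, Y_l, eta1=4.0, eta2=9.0):
    """Cross-modal interaction attention and local-local matching score.

    V_l: (K, d) image region features; Y_l: (L, d) text word features.
    """
    s = V_l @ Y_l.t()                           # (K, L) dot-product correlations s_ij
    s_pos = s.clamp(min=0)                      # [x]_+ = max(x, 0)
    s_m1 = F.normalize(s_pos, dim=1)            # regions as query (assumed normalisation)
    s_m2 = F.normalize(s_pos, dim=0)            # words as query (assumed normalisation)

    delta = F.softmax(eta1 * s_m1, dim=1)       # delta[i, j]: weight of word j for region query i
    gamma = F.softmax(eta2 * s_m2.t(), dim=1)   # gamma[j, i]: weight of region i for word query j

    Y_l_att = delta @ Y_l                       # (K, d) attended text features Y_l'
    V_l_att = gamma @ V_l                       # (L, d) attended image features V_l'

    S1 = F.cosine_similarity(V_l_att, Y_l, dim=-1).mean()   # V_l' vs Y_l
    S2 = F.cosine_similarity(Y_l_att, V_l, dim=-1).mean()   # Y_l' vs V_l
    S_LLl = (S1 + S2) / 2
    return V_l_att, Y_l_att, S_LLl
```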
5. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the building of the intra-modal different-granularity feature fusion network in step four comprises the following steps:
step four A, for the image local fine-grained features V_l' obtained in step three A, the multi-head self-attention module first applies layer normalization; the outputs of the h heads are then computed and concatenated; finally, an output projection matrix is applied to obtain the multi-head total output MultiAttn_1 for V_l':
where Concat(·) denotes the concatenation operation along the channel dimension, the first learned matrix is the output projection matrix and the remaining learned matrices are the per-head parameter matrices, v'_i and v'_j denote the i-th and j-th feature vectors of the layer-normalized V_l', head_x denotes the output of the x-th head, the dimension d takes the same value as in step two, h = 16 parallel self-attention heads are used, and d_v = d/h = 64;
Similarly, for the text local fine-grained features Y_l', layer normalization is applied first; the outputs of the h heads are then computed and concatenated; finally, an output projection matrix is applied to obtain the multi-head total output MultiAttn_2 for Y_l':
where the output projection matrix and per-head parameter matrices play the same roles as above, y'_i and y'_j denote the i-th and j-th feature vectors of the layer-normalized text representation, head_x denotes the output of the x-th head, d_v = d/h = 64, and the remaining parameters have the same meaning as before;
finally, after obtaining the two multi-head total outputs, MultiAttn_1 is summed element-wise with V_l' to obtain V_o, and MultiAttn_2 is summed element-wise with Y_l' to obtain Y_o; V_o and Y_o are the final outputs of the multi-head self-attention module:
V_o = V_l' + MultiAttn_1
Y_o = Y_l' + MultiAttn_2
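A sketch of the intra-modal branch in step four A, using the standard PyTorch multi-head attention module with h = 16 heads on d = 1024 (so d_v = 64 per head); the claim's exact per-head parameterisation may differ.

```python
import torch
import torch.nn as nn

class IntraModalSelfAttention(nn.Module):
    """LayerNorm -> multi-head self-attention -> element-wise residual sum."""
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, X_l):                    # X_l: (B, N, d), i.e. V_l' or Y_l'
        x = self.norm(X_l)
        multiattn, _ = self.attn(x, x, x)      # multi-head total output
        return X_l + multiattn                 # V_o (or Y_o)
```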
step four B, the gated fusion unit in the feature fusion sub-network within the image modality first applies global average pooling to the V_o obtained in step four A and gate-fuses it with the image global coarse-grained feature V_g to obtain the image embedded representation V_f; the gated fusion unit in the feature fusion sub-network within the text modality first applies global average pooling to the Y_o obtained in step four A and gate-fuses it with the text global coarse-grained feature Y_g to obtain the text embedded representation Y_f; the calculation process is as follows:
z_v = σ(W_v · concat(AvgPool(V_o), V_g) + b_v)
z_y = σ(W_y · concat(AvgPool(Y_o), Y_g) + b_y)
V_f = (1 - z_v) × V_o + z_v × V_g
Y_f = (1 - z_y) × Y_o + z_y × Y_g
where σ denotes the sigmoid activation function, z_v and z_y are the gating fusion coefficients, W_v and W_y are weight matrices to be learned, b_v and b_y are bias terms to be learned, V_f and Y_f are the outputs of the gated fusion units, namely the image embedded representation and the text embedded representation respectively, and the remaining parameters have the same meaning as above;
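A sketch of the gated fusion unit in step four B. The pooled self-attention output is used on both sides of the gate here for dimensional consistency, which is an assumption; the claim's fusion formula writes V_o and Y_o directly.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate the pooled self-attention output against the global coarse-grained feature."""
    def __init__(self, d=1024):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # learned weight matrix W and bias b

    def forward(self, X_o, X_g):                       # X_o: (B, N, d), X_g: (B, d)
        pooled = X_o.mean(dim=1)                       # global average pooling
        z = torch.sigmoid(self.gate(torch.cat([pooled, X_g], dim=-1)))   # gating coefficient
        return (1 - z) * pooled + z * X_g              # fused embedding V_f (or Y_f)
```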
step four C, the cosine similarity between V_f and Y_f obtained in step four B is calculated to obtain the global-local level feature matching score S_GLl:
S_GLl = (V_f^T Y_f) / (||V_f|| · ||Y_f||)
6. The multi-level semantic alignment-based cross-modal image-text retrieval method according to claim 1, wherein the step five of calculating the total multi-level semantic matching score between image-text pairs and training the model with a triplet ranking loss with adaptive margin comprises the following steps:
step five A, the feature matching scores of the three different levels obtained in steps two, three and four are weighted and summed to obtain the total multi-level semantic matching score S_total(I, T) between an image-text pair:
S_total(I, T) = w_1 · S_GGl + w_2 · S_GLl + w_3 · S_LLl
where S_total(I, T) is the total multi-level semantic matching score of image I and text T, and w_1, w_2 and w_3 are weighting coefficients with values 0.2, 0.6 and 0.2 respectively;
step five B, the margin is split into m_v, used during image retrieval, and m_y, used during text retrieval; to adapt the two margins, the negative-sample ratios within the batch are first computed from the total multi-level semantic matching scores obtained in step five A, as shown in the following formula:
where B denotes the batch size, set to 128, (I, T^+) and (T, I^+) are matched image-text pairs, (I, T^-) and (T, I^-) are unmatched image-text pairs, Sum(χ > 0) denotes the number of elements greater than 0 in the matrix χ, r_y denotes the negative-sample ratio during text retrieval, r_v denotes the negative-sample ratio during image retrieval, and the initial values of m_y and m_v are both 0.2;
then, the larger of the two negative-sample ratios is taken; when this value exceeds the threshold ζ_0, the two margins are updated adaptively according to the preset rule, otherwise they remain unchanged, as shown in the following formula:
r_m = max(r_v, r_y)
where r_m is the larger of the two negative-sample ratios, max denotes the maximum operation, and the threshold ζ_0 is set to 0.5;
finally, the triplet ranking loss with adaptive margin, L_total, is expressed as follows:
L_total = max(m_y - S_total(I, T^+) + S_total(I, T^-), 0) + max(m_v - S_total(T, I^+) + S_total(T, I^-), 0).
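A batch-level sketch of steps five A and five B, assuming the total scores are arranged in a B x B matrix with matched pairs on the diagonal; the choice of negatives and the margin-update rule are assumptions, since the claim only states that both margins are adapted when the larger negative-sample ratio exceeds ζ_0 = 0.5.

```python
import torch

def total_score(S_GGl, S_GLl, S_LLl, w=(0.2, 0.6, 0.2)):
    """Weighted total multi-level semantic matching score S_total."""
    return w[0] * S_GGl + w[1] * S_GLl + w[2] * S_LLl

def adaptive_margin_triplet_loss(S, m_v=0.2, m_y=0.2, zeta0=0.5):
    """Triplet ranking loss with an (assumed) adaptive margin update.

    S: (B, B) matrix of S_total scores; S[i, j] scores image i against text j.
    """
    B = S.size(0)
    pos = S.diag()                                        # scores of the matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=S.device)

    # text retrieval (image query), margin m_y: violations across texts
    cost_y = (m_y - pos.unsqueeze(1) + S).clamp(min=0).masked_fill(mask, 0)
    # image retrieval (text query), margin m_v: violations across images
    cost_v = (m_v - pos.unsqueeze(0) + S).clamp(min=0).masked_fill(mask, 0)

    # negative-sample ratios within the batch
    r_y = (cost_y > 0).float().mean().item()
    r_v = (cost_v > 0).float().mean().item()
    if max(r_v, r_y) > zeta0:                 # adapt both margins (assumed rule)
        m_y, m_v = m_y * (1 + r_y), m_v * (1 + r_v)

    loss = cost_y.max(dim=1).values.mean() + cost_v.max(dim=0).values.mean()
    return loss, m_y, m_v
```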
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310855462.0A CN116821391A (en) | 2023-07-13 | 2023-07-13 | Cross-modal image-text retrieval method based on multi-level semantic alignment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116821391A true CN116821391A (en) | 2023-09-29 |
Family
ID=88124000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310855462.0A Pending CN116821391A (en) | 2023-07-13 | 2023-07-13 | Cross-modal image-text retrieval method based on multi-level semantic alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116821391A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117522479A (en) * | 2023-11-07 | 2024-02-06 | 北京创信合科技有限公司 | Accurate Internet advertisement delivery method and system |
CN117708354A (en) * | 2024-02-06 | 2024-03-15 | 湖南快乐阳光互动娱乐传媒有限公司 | Image indexing method and device, electronic equipment and storage medium |
CN117708354B (en) * | 2024-02-06 | 2024-04-30 | 湖南快乐阳光互动娱乐传媒有限公司 | Image indexing method and device, electronic equipment and storage medium |
CN117874262A (en) * | 2024-03-12 | 2024-04-12 | 北京邮电大学 | Text-dynamic picture cross-modal retrieval method based on progressive prototype matching |
CN117874262B (en) * | 2024-03-12 | 2024-06-04 | 北京邮电大学 | Text-dynamic picture cross-modal retrieval method based on progressive prototype matching |
CN118279925A (en) * | 2024-06-04 | 2024-07-02 | 鲁东大学 | Image text matching algorithm integrating local and global semantics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717431B (en) | Fine-grained visual question and answer method combined with multi-view attention mechanism | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
Abiyev et al. | Sign language translation using deep convolutional neural networks | |
CN110059217B (en) | Image text cross-media retrieval method for two-stage network | |
CN112905822B (en) | Deep supervision cross-modal counterwork learning method based on attention mechanism | |
CN116821391A (en) | Cross-modal image-text retrieval method based on multi-level semantic alignment | |
Al-Jarrah et al. | Recognition of gestures in Arabic sign language using neuro-fuzzy systems | |
Gao et al. | Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework | |
CN115033670A (en) | Cross-modal image-text retrieval method with multi-granularity feature fusion | |
CN111414461A (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN112417097A (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
CN114201592A (en) | Visual question-answering method for medical image diagnosis | |
CN114817673A (en) | Cross-modal retrieval method based on modal relation learning | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
Das et al. | A deep sign language recognition system for Indian sign language | |
CN113255602A (en) | Dynamic gesture recognition method based on multi-modal data | |
CN116561305A (en) | False news detection method based on multiple modes and transformers | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN117333908A (en) | Cross-modal pedestrian re-recognition method based on attitude feature alignment | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
Abdullahi et al. | Spatial–temporal feature-based End-to-end Fourier network for 3D sign language recognition | |
Li et al. | Egocentric action recognition by automatic relation modeling | |
CN114722798A (en) | Ironic recognition model based on convolutional neural network and attention system | |
Wang et al. | Listen, look, and find the one: Robust person search with multimodality index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||