CN114969423A - Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment - Google Patents


Info

Publication number
CN114969423A
CN114969423A
Authority
CN
China
Prior art keywords
image
text
similarity
features
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210718696.6A
Other languages
Chinese (zh)
Inventor
孟铃涛
张飞飞
徐常胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202210718696.6A priority Critical patent/CN114969423A/en
Publication of CN114969423A publication Critical patent/CN114969423A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image-text cross-modal retrieval model, method, and computer device based on locally shared semantic centers. Region features of an image and word-level features of a text are first extracted. A group of trainable semantic centers shared by images and texts is then defined; the similarity between each local feature and every semantic center is calculated, and the local features are assigned to the semantic centers according to these similarities, yielding multiple semantically aligned image representations and text representations. A bi-GRU models the weights of the image region features and the text word features at multiple levels, producing multi-level global representations that integrate the local features. The local similarity between an image and a text is calculated from the semantically aligned image and text representations, and the global similarity is calculated from their multi-level global representations. The invention effectively improves the accuracy of image-text cross-modal retrieval.

Description

Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment
Technical Field
The invention belongs to the field of image text cross-modal retrieval, and particularly relates to an image text cross-modal retrieval model and method based on a local shared semantic center and computer equipment.
Background
Image-text cross-modal retrieval aims to use data in one modality to retrieve data in another modality with the same semantics. It is an important research direction spanning machine vision, natural language processing, and multi-modal learning, and has become an active research topic. In recent years, with the development of deep learning, image-text cross-modal retrieval has achieved excellent performance. The task nevertheless remains challenging, because it requires not only a deep understanding of the semantics of images and texts, but also the ability to capture semantic correspondences across the modality gap.
To address these challenges, current methods focus on fine-grained correspondence between images and texts, highlighting important semantic knowledge through local alignment so as to understand images and texts more comprehensively. However, these methods ignore the heavy computational burden that local alignment incurs. Reducing the scale of local-feature interaction while still fully understanding images and texts is therefore important for image-text cross-modal retrieval.
Recently, clustering-based learning methods have succeeded in capturing semantic representations common to features. However, most current clustering-based feature learning focuses on global representations, ignoring fine-grained local information, and therefore cannot fully meet the challenges of image-text cross-modal retrieval. The invention therefore designs cluster centers shared by images and texts and adopts a soft assignment strategy to achieve fine-grained alignment between images and texts, thereby deeply understanding their semantic correspondence and improving retrieval efficiency.
Disclosure of Invention
The invention aims to express the semantic commonality of the local features of images and texts with a group of trainable semantic centers shared by both modalities, and to achieve fine-grained image-text alignment through these semantic centers, thereby mining deep image and text semantics while avoiding direct interaction between the local features and reducing the computation scale. Global alignment serves as a complement to local alignment, establishing cross-modal semantic correspondence between images and texts from multiple perspectives and summarizing semantic information more comprehensively. The technical scheme of the invention is as follows:
An image-text cross-modal retrieval model based on locally shared semantic centers is obtained through the following steps:
S1, extracting the region features of the image and the word-level features of the text, and then obtaining, through two independent mappings for each modality, the image features and text features used for local alignment and global alignment respectively;
S2, clustering the image features and text features from step S1 to obtain k initialized shared semantic centers;
S3, calculating the similarity between the image and text features from step S1 and the shared semantic centers from step S2, and using this similarity to aggregate the image features into k image semantic representations, one per shared semantic center, and the text features into k text semantic representations, one per shared semantic center;
S4, modeling the pooling operation over the image region features and the text word-level features from step S1 to obtain the image global representation and the text global representation;
S5, calculating the local similarity between the image and the text from the image and text semantic representations that share the same semantic center in step S3, calculating the global similarity from the image and text global representations in step S4, and expressing the overall similarity of the image and the text as the weighted sum of the local similarity and the global similarity, completing the modeling.
S6, training the image-text cross-modal retrieval model with the overall similarity, and performing real-time image-text cross-modal retrieval with the trained model.
As a preferred technical solution, the specific process of extracting the image and text features in step S1 includes:
Step S1-1, extracting the region features of the image with a pre-trained Faster-RCNN, and mapping the extracted region features through two independent multilayer perceptrons to obtain two groups of image features $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$;
Step S1-2, splitting the input text sentence into words, padding with zeros to a fixed word length, feeding the split and padded text into a pre-trained Bert to obtain word-level feature representations, and mapping them through two independent multilayer perceptrons to obtain two groups of text features $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$.
as a preferred technical solution, the specific process of initializing the semantic center in step S2 includes:
step S2-1, randomly sampling image features and text features in a training data set;
step S2-2, carrying out K-means clustering on the randomly sampled image features and text features to obtain K initialized clustering centers
Figure BDA0003710485310000026
And k < n;
and step S2-3, defining the initialized clustering center C as a trainable shared semantic center, and training the clustering center C along with the model.
As a preferred technical solution, the specific process of obtaining the semantically aligned image and text representations in step S3 includes:
Step S3-1, for the image features $V^l$ from step S1-1 and the text features $T^l$ from step S1-2, computing the cosine distances to the shared semantic centers C from step S2-3 to obtain an image-to-center similarity matrix and a text-to-center similarity matrix, and applying a softmax operation to each to obtain normalized similarity matrices;
Step S3-2, using the values of the normalized similarity matrices from step S3-1 as the weights of the image features $V^l$ from step S1-1 and the text features $T^l$ from step S1-2, so that the weighted sum of features within each modality gives the image features and text features corresponding to each semantic center; since the number of semantic centers in step S2-3 is k, the number of image features and text features aligned to the semantic centers is also k.
As a preferred technical solution, the specific process of obtaining the global representations of the image and text in step S4 includes:
Step S4-1, applying different pooling operations, such as maximum pooling, second-value pooling, and minimum pooling, to the image features $V^g$ from step S1-1 and the text features $T^g$ from step S1-2 to obtain pooled image features and text features;
Step S4-2, modeling the pooled features of the image and of the text with a bi-GRU to solve for the coefficients of the optimal pooling, and then obtaining the global features of the image and the text according to the solved optimal pooling strategy.
As a preferred technical solution, the specific process of calculating the image-text similarity and training the model in step S5 includes:
S5-1, fine-grained knowledge in the image and the text has been aligned by step S3-2; the local similarity between the image and the text for a given semantic center is represented by the cosine distance between the image feature and the text feature aligned to that same semantic center in step S3-2, and the sum of the local similarities over all semantic centers is taken as the local similarity between the image and the text;
S5-2, the global similarity between the image and the text is represented by the cosine similarity of the global features of the image and the text from step S4-2;
S5-3, finally, the overall similarity is expressed as the weighted sum of the local similarity and the global similarity, and the model is trained with a triplet ranking loss based on the overall similarity.
As a preferred technical solution, the process of image-text cross-modal retrieval in step S6 includes:
For any group of image-text pairs, first extracting the image and text features with the feature-extraction method of step S1, then obtaining the local and global features of the image and of the text according to steps S3 and S4, performing local alignment and global alignment of the image and text on the extracted features according to the method of step S5, calculating the image-text similarity, and obtaining the retrieval result.
A computer device in which are provided the instructions or program of the above image-text cross-modal retrieval model based on locally shared semantic centers, or the instructions or program of the above image-text cross-modal retrieval method.
The invention has the beneficial effects that:
(1) The method solves the problems that local alignment in traditional image-text cross-modal retrieval is computationally expensive and that fine-grained relations between images and texts cannot be deeply explored.
(2) In the image-text cross-modal retrieval method based on locally shared semantic centers, by learning a group of trainable semantic centers shared by images and texts, the local features of the image and the text can be aligned indirectly through the semantic centers, deeply mining the semantic relationship between image and text while reducing the interaction cost of local alignment.
(3) The invention applies soft assignment to the matching problem of clustering; soft assignment makes the weight coefficients smooth and differentiable, so the cluster centers can be trained end-to-end with the model, producing reliable shared semantic centers.
(4) On the basis of local alignment, the alignment of global features serves as auxiliary information to promote semantic matching between images and texts; understanding the image-text relationship from both local and global perspectives improves retrieval performance.
Drawings
FIG. 1 is a flow chart of image text cross-modal retrieval based on a locally shared semantic center.
Detailed Description
The method first extracts the region features of the image and the word-level features of the text, and obtains, through two independent mappings for each modality, the image features and text features used for local alignment and global alignment. A set of initialized cluster centers is obtained by clustering; these centers are defined as trainable shared semantic centers and updated during network training. The image and text features are aligned to the corresponding shared semantic centers according to their cosine distances to the centers, yielding image local features and text local features equal in number to the shared semantic centers. The global features of the image and the text are calculated by modeling the pooling operation over the image and text features. Local alignment is performed with the local features of the image and text, global alignment with their global features, giving a multi-perspective image-text similarity, and the model is trained with a triplet ranking loss.
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a flowchart of the image-text cross-modal retrieval method based on locally shared semantic centers according to the present invention. The method first extracts the image and text features, then defines a group of trainable shared semantic centers and computes the semantically aligned local features of the image and text according to their relation to the shared centers, obtains the global features of the image and text by modeling the pooling of their features, computes the overall image-text similarity from local alignment and global alignment, and finally trains with a triplet ranking loss. The method specifically comprises the following steps:
s1, extracting image text features: and extracting the regional characteristics of the image and the word-level characteristics of the text respectively, and then obtaining the image characteristics and the text characteristics for local alignment and global alignment respectively through two layers of independent mapping.
The method specifically comprises the following steps: the region features of the image are extracted with a pre-trained Faster-RCNN and mapped by two independent multilayer perceptrons, $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$, to obtain two groups of image features $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$. The input text sentence is then split into words, padded with zeros to a fixed word length, and fed into a pre-trained Bert to obtain word-level feature representations, which are mapped by two independent multilayer perceptrons, $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$, to obtain two groups of text features $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$.
s2, initializing a shared semantic center: and performing K-Means clustering on the image features and the text features in the step S1 to obtain K initialized shared semantic centers.
The method specifically comprises the following steps: image features and text features are first randomly sampled in the training data set to obtain a number of untrained image and text features; K-means clustering is then performed on the randomly sampled image and text features to obtain k initialized cluster centers $C=\{c_i\}_{i=1}^{k}$, where $k<m$ and $k<n$; the initialized cluster centers C are then defined as trainable shared semantic centers whose parameters are updated along with network training.
S3, learning the semantically aligned representations of the image and text: calculating the similarity between the image and text features from step S1 and the shared semantic centers from step S2, and using this similarity to aggregate the image features into k image semantic representations corresponding to the shared semantic centers and the text features into k text semantic representations corresponding to the shared semantic centers.
The method specifically comprises the following steps: for the image features $V^l$ and text features $T^l$ from step S1, the cosine distances to the shared semantic centers C from step S2 are computed, giving an image-to-center similarity matrix and a text-to-center similarity matrix, and a softmax operation is applied to each to obtain normalized similarity matrices.
The values of the normalized similarity matrices are then used as the weights of the image features $V^l$ and text features $T^l$ from step S1; the weighted sum of features within each modality gives the image features and text features corresponding to each semantic center. Since the number of semantic centers in step S2 is k, the number of image features and of text features aligned to the semantic centers is also k.
S4, learning the global representations of the image and text: modeling, with a bi-GRU, the pooling of the image region features and text word-level features from step S1 to obtain the optimal image global representation and text global representation.
The method specifically comprises the following steps: different pooling operations, such as maximum pooling, second-value pooling, and minimum pooling, are applied to the image features $V^g$ and text features $T^g$ from step S1 to obtain pooled image features and text features.
The pooled features of the image and of the text are then each modeled with a bi-GRU to solve for the coefficients of the optimal pooling, and the global features of the image and the text are obtained according to the solved optimal pooling strategy.
S5, calculating the similarity of image texts: the local similarity of the image text is calculated by using the image semantic representation and the text semantic representation having the same shared semantic center in the step S3, the global similarity of the image text is calculated by using the image global representation and the text global representation in the step S4, and the overall similarity of the image and the text is expressed by a weighted sum of the local similarity and the global similarity.
The method specifically comprises the following steps: fine-grained knowledge in the image and the text has been aligned by step S3; the local similarity between the image and the text for a given semantic center is represented by the cosine distance between the image feature and the text feature aligned to that same semantic center in step S3, and the sum of the local similarities over all semantic centers is taken as the local similarity between the image and the text.
The global similarity between the image and the text is represented by the cosine similarity of the global features from step S4. The overall similarity of the image and the text is calculated as the weighted sum of the local similarity and the global similarity, and the model is finally trained with a triplet ranking loss based on the overall similarity.
The present invention will be explained below with reference to specific examples. The implementation of the present invention includes the model building and training process and the image text retrieval process, which are described in detail below.
1. The model building and training process comprises the following steps:
1.1 feature extraction Process for image text
The region features of the image and the word-level features of the text are extracted with a pre-trained Faster R-CNN and a pre-trained Bert, respectively; to support the subsequent alignment of images and texts from both local and global perspectives, the image features and text features are further mapped with two independent multilayer perceptrons.
1.1.1 feature extraction of images
Given an image I, a pre-trained Faster R-CNN is used to detect the regions $r_i$ in the image and extract the feature $f_i$ of each region $r_i$. Two independent multilayer perceptrons then map the region features $f_i$:
$v^l_i = \mathrm{MLP}_{Vl}(f_i)$ #(1)
$v^g_i = \mathrm{MLP}_{Vg}(f_i)$ #(2)
In Eqs. (1) and (2), $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$ denote two independent multilayer perceptrons, giving the image features used for local alignment and global alignment, $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$, respectively.
1.1.2 feature extraction of text
Given a text S, the text is first split into individual words with a segmentation tool and padded with zeros to a fixed length. The fixed-length word sequence $s_i$ is input to the pre-trained Bert to obtain word-level text features $z_i$. Two independent multilayer perceptrons then map the word-level features $z_i$:
$z_i = \mathrm{Bert}(s_i)$ #(3)
$t^l_i = \mathrm{MLP}_{Tl}(z_i)$ #(4)
$t^g_i = \mathrm{MLP}_{Tg}(z_i)$ #(5)
In Eqs. (4) and (5), $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$ denote two independent multilayer perceptrons, giving the text features used for local alignment and global alignment, $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$, respectively.
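To make the two-branch mapping of Eqs. (1)-(5) concrete, the following PyTorch sketch projects precomputed Faster R-CNN region features and Bert word features into the local-alignment and global-alignment spaces with two independent multilayer perceptrons. It is a minimal illustration under assumptions made here (feature dimensions, hidden sizes, and the use of precomputed region features), not the patented implementation.

```python
import torch
import torch.nn as nn

class TwoBranchProjector(nn.Module):
    """Maps backbone features into local- and global-alignment spaces
    with two independent two-layer MLPs (illustrative dimensions)."""
    def __init__(self, in_dim, embed_dim=1024):
        super().__init__()
        self.mlp_local = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.mlp_global = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, feats):                 # feats: (batch, num_tokens, in_dim)
        return self.mlp_local(feats), self.mlp_global(feats)

# Region features f_i are assumed precomputed by a Faster R-CNN detector
# (e.g. 36 regions x 2048 dims per image); word features z_i are assumed to come
# from a pre-trained Bert encoder (768 dims per padded token).
image_proj = TwoBranchProjector(in_dim=2048)
text_proj = TwoBranchProjector(in_dim=768)

region_feats = torch.randn(8, 36, 2048)       # placeholder for Faster R-CNN outputs
word_feats = torch.randn(8, 40, 768)          # placeholder for Bert outputs (padded to 40 words)

V_l, V_g = image_proj(region_feats)           # Eqs. (1)-(2)
T_l, T_g = text_proj(word_feats)              # Eqs. (4)-(5)
```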
1.2 initialization of semantic centers
The image features $V^l$ and text features $T^l$ used for local alignment are first randomly sampled in the training data set to obtain a number of untrained image features and text features. K-means clustering is then performed on the randomly sampled features to obtain k initialized cluster centers $C=\{c_i\}_{i=1}^{k}$, where $k<m$ and $k<n$. The initialized cluster centers C are then defined as trainable shared semantic centers, whose parameters are updated along with network training.
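One possible realization of this initialization is sketched below with scikit-learn's KMeans: sampled local-alignment features are clustered into k centers, and the centers are wrapped in an nn.Parameter so they are trained end-to-end with the model. The sample size and the value of k are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def init_shared_centers(sampled_feats: torch.Tensor, k: int = 16) -> nn.Parameter:
    """Cluster randomly sampled local features (images and texts pooled together)
    into k centers and expose them as trainable shared semantic centers."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(sampled_feats.detach().cpu().numpy())
    centers = torch.tensor(kmeans.cluster_centers_, dtype=torch.float32)
    return nn.Parameter(centers)              # shape (k, d), updated during training

# Example: a random mix of untrained image and text local features from the training set.
sampled_feats = torch.randn(1000, 1024)       # placeholder for sampled V_l / T_l rows
shared_centers = init_shared_centers(sampled_feats, k=16)
```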
1.3 aligned semantic representation of image text
Semantically aligned image context features and text context features are obtained from the semantic commonality between the image, the text, and the shared semantic centers. Because the local features of the image and of the text are aligned with respect to the shared semantic centers, the local similarity between an image and a text can be represented by their context features under the same shared semantic center.
1.3.1 obtaining aligned semantic representations of images
To obtain the image context features aligned with the shared semantic centers, the cosine similarity between the image features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} v_j^{l}}{\lVert c_i \rVert \, \lVert v_j^{l} \rVert}$ #(6)
In Eq. (6), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $v_j^{l}$ denotes the j-th image feature used for local alignment, and $s_{ij}$ denotes the cosine similarity between the i-th shared semantic center and the j-th image feature used for local alignment. A softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(7)
In Eq. (7), λ denotes a temperature coefficient and $a_{ij}$ denotes the normalized cosine similarity. Taking $a_{ij}$ as the weight of $v_j^{l}$, the image context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{v} = \sum_{j=1}^{m} a_{ij} v_j^{l}$ #(8)
In Eq. (8), $p_i^{v}$ denotes the image context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned image features $P^{v}=\{p_i^{v}\}_{i=1}^{k}$.
1.3.2 obtaining aligned semantic representations of text
As in step 1.3.1, to obtain the text context features aligned with the shared semantic centers, the cosine similarity between the text features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} t_j^{l}}{\lVert c_i \rVert \, \lVert t_j^{l} \rVert}$ #(9)
In Eq. (9), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $t_j^{l}$ denotes the j-th text feature used for local alignment, and $s_{ij}$ denotes the cosine similarity between the i-th shared semantic center and the j-th text feature used for local alignment. A softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(10)
In Eq. (10), λ denotes a temperature coefficient and $a_{ij}$ denotes the normalized cosine similarity. Taking $a_{ij}$ as the weight of $t_j^{l}$, the text context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{t} = \sum_{j=1}^{n} a_{ij} t_j^{l}$ #(11)
In Eq. (11), $p_i^{t}$ denotes the text context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned text features $P^{t}=\{p_i^{t}\}_{i=1}^{k}$.
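The soft-assignment aggregation of Eqs. (6)-(11) can be sketched as below. The softmax axis (over the k centers, so that each local feature is softly assigned to the centers) and the temperature value are assumptions made here; the description only fixes that a temperature-scaled softmax normalizes the cosine similarities before the weighted sum.

```python
import torch
import torch.nn.functional as F

def aggregate_by_centers(local_feats, centers, temperature=10.0):
    """Soft-assign local features to shared semantic centers and aggregate.

    local_feats: (num_tokens, d)  image region or text word features (V_l or T_l)
    centers:     (k, d)           trainable shared semantic centers
    returns:     (k, d)           one context feature per semantic center (Eq. 8 / 11)
    """
    sim = F.normalize(centers, dim=-1) @ F.normalize(local_feats, dim=-1).t()   # (k, n), Eq. (6)/(9)
    assign = torch.softmax(temperature * sim, dim=0)                            # Eq. (7)/(10), assumed over centers
    return assign @ local_feats                                                  # Eq. (8)/(11)

centers = torch.randn(16, 1024)            # k = 16 shared semantic centers
V_l = torch.randn(36, 1024)                # image local features
T_l = torch.randn(40, 1024)                # text local features
p_v = aggregate_by_centers(V_l, centers)   # (16, 1024) image context features
p_t = aggregate_by_centers(T_l, centers)   # (16, 1024) text context features
```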
1.4 Global representation of image text
Compared with local alignment, the global alignment of images and texts provides more general and comprehensive semantic information for understanding their shared semantics, so semantic alignment from the global perspective can be regarded as auxiliary information for image-text alignment.
1.4.1 extracting Global features of an image
Multiple pooling operations are applied to the image features used for global alignment from step 1.1.1, giving several pooled image representations:
$\psi_i^{v} = \mathrm{max}_i(V^{g})$ #(12)
In Eq. (12), $\psi_i^{v}$ denotes a pooling result of the image features and $\mathrm{max}_i$ denotes i-th value pooling of the features $V^{g}$; for example, when i = 1, $\mathrm{max}_1$ denotes maximum pooling of the image features and $\psi_1^{v}$ is the result of maximum pooling. To find the optimal pooling strategy, all pooling results are modeled with a bi-GRU so as to approximate maximum pooling, second-value pooling, average pooling, or more complex pooling results:
$h_i^{v} = \mathrm{biGRU}(e_i^{v})$ #(13)
In Eq. (13), $e_i^{v}$ denotes the position encoding of the image features and $h_i^{v}$ denotes the output of the bi-GRU for that position encoding. Each output $h_i^{v}$ is a d-dimensional feature whose dimension is mapped with a fully-connected layer, after which a softmax normalization is applied:
$\theta_i^{v} = \mathrm{softmax}(W_v h_i^{v} + b_v)$ #(14)
In Eq. (14), $W_v$ denotes the weight matrix of the fully-connected layer, $b_v$ its bias, and $\theta_i^{v}$ the weight coefficient corresponding to the i-th value pooling result. The global feature of the image is represented by the weighted sum of the pooling results:
$g_v = \sum_i \theta_i^{v} \psi_i^{v}$ #(15)
1.4.2 extracting Global representations of text
As in step 1.4.1, multiple pooling operations are applied to the text features used for global alignment from step 1.1.1, giving several pooled text representations:
$\psi_i^{t} = \mathrm{max}_i(T^{g})$ #(16)
In Eq. (16), $\psi_i^{t}$ denotes a pooling result of the text features and $\mathrm{max}_i$ denotes i-th value pooling of the features $T^{g}$; for example, when i = 1, $\mathrm{max}_1$ denotes maximum pooling of the text features and $\psi_1^{t}$ is the result of maximum pooling. To find the optimal pooling strategy, all pooling results are modeled with a bi-GRU so as to approximate maximum pooling, second-value pooling, average pooling, or more complex pooling results:
$h_i^{t} = \mathrm{biGRU}(e_i^{t})$ #(17)
In Eq. (17), $e_i^{t}$ denotes the position encoding of the text features and $h_i^{t}$ denotes the output of the bi-GRU for that position encoding, whose dimension is mapped with a fully-connected layer, after which a softmax normalization is applied:
$\theta_i^{t} = \mathrm{softmax}(W_t h_i^{t} + b_t)$ #(18)
In Eq. (18), $W_t$ denotes the weight matrix of the fully-connected layer, $b_t$ its bias, and $\theta_i^{t}$ the weight coefficient corresponding to the i-th value pooling result. The global feature of the text is represented by the weighted sum of the pooling results:
$g_t = \sum_i \theta_i^{t} \psi_i^{t}$ #(19)
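A sketch of the pooling described in Eqs. (12)-(19): per-dimension value pooling (the sorted values give maximum, second-value, ..., minimum pooling) combined with coefficients that a small bi-GRU predicts from position encodings. The position-encoding scheme and the dimensions are illustrative assumptions; the text only fixes the overall structure (sorted pooling results, bi-GRU, fully-connected layer, softmax, weighted sum).

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Weighted combination of i-th value pooling results, with weights
    produced by a bi-GRU over position encodings (Eqs. 12-15 / 16-19)."""
    def __init__(self, max_positions=64, pe_dim=32, hidden=32):
        super().__init__()
        self.pos_embed = nn.Embedding(max_positions, pe_dim)    # position encodings e_i
        self.bigru = nn.GRU(pe_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)                       # maps h_i to a scalar coefficient

    def forward(self, feats):                  # feats: (batch, n, d), V_g or T_g
        n = feats.size(1)
        # i-th value pooling: sort each dimension in descending order,
        # so row i holds the i-th largest values (max, second value, ..., min).
        pooled, _ = feats.sort(dim=1, descending=True)           # Eq. (12)/(16)
        pos = torch.arange(n, device=feats.device).unsqueeze(0).expand(feats.size(0), -1)
        h, _ = self.bigru(self.pos_embed(pos))                   # Eq. (13)/(17)
        theta = torch.softmax(self.fc(h), dim=1)                 # Eq. (14)/(18), weights over positions
        return (theta * pooled).sum(dim=1)                       # Eq. (15)/(19): weighted sum -> (batch, d)

pool_v, pool_t = LearnedPooling(), LearnedPooling()
g_v = pool_v(torch.randn(8, 36, 1024))   # image global features
g_t = pool_t(torch.randn(8, 40, 1024))   # text global features
```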
1.5 image text similarity calculation
Because the local features of the image and the text are aligned through the shared semantic centers, the local similarity between an image and a text can be calculated from the image context feature and the text context feature under the same shared semantic center; the global similarity between the image and the text is calculated from their global features and serves as auxiliary information to improve retrieval accuracy.
1.5.1 local similarity of image text
Steps 1.3.1 and 1.3.2 have extracted the image context features and text context features aligned through the shared semantics; the local similarity is expressed by the cosine similarity between the image and text context features:
$r_i(v,t) = \dfrac{(p_i^{v})^{\top} p_i^{t}}{\lVert p_i^{v} \rVert \, \lVert p_i^{t} \rVert}$ #(20)
In Eq. (20), $r_i(v,t)$ denotes the cosine similarity between the image context feature $p_i^{v}$ and the text context feature $p_i^{t}$ under the shared semantic center $c_i$. The sum of the similarities over all aligned semantic centers is taken as the local similarity between the image and the text:
$R_l(v,t) = \sum_{i=1}^{k} r_i(v,t)$ #(21)
1.5.2 Global similarity of image text
Steps 1.4.1 and 1.4.2 have extracted the global representations of the image and the text, respectively; the global similarity is expressed by the cosine similarity of the global representations of the image and the text:
$R_g(v,t) = \dfrac{g_v^{\top} g_t}{\lVert g_v \rVert \, \lVert g_t \rVert}$ #(22)
In Eq. (22), $R_g(v,t)$ denotes the cosine similarity between the global feature $g_v$ of the image and the global feature $g_t$ of the text.
1.5.3 Overall similarity of image text
The local similarity and global similarity between the image and the text are obtained in steps 1.5.1 and 1.5.2, and the overall similarity of the image and the text is determined by both:
$R(v,t) = \beta_1 R_l(v,t) + \beta_2 R_g(v,t)$ #(23)
In Eq. (23), $\beta_1$ and $\beta_2$ are hyper-parameters that determine the local-global ratio; in practice, setting $\beta_1$ to 0.2 and $\beta_2$ to 1 gives good results. Based on the obtained similarity, the model is trained with a triplet ranking loss:
$L = \sum_{(v,t)\in\mathcal{D}} \{[\Delta - R(v,t) + R(v,\hat{t})]_{+} + [\Delta - R(v,t) + R(\hat{v},t)]_{+}\}$ #(24)
In Eq. (24), Δ is a hyper-parameter, and in practice setting Δ to 0.15 gives good results; $(v,t)$ denotes a positive sample pair from the data set $\mathcal{D}$; $\hat{t}=\arg\max_{t'\neq t} R(v,t')$ denotes the hardest negative sample of v, i.e. the t' satisfying t' ≠ t that maximizes R(v,t'); $\hat{v}=\arg\max_{v'\neq v} R(v',t)$ denotes the hardest negative sample of t, i.e. the v' satisfying v' ≠ v that maximizes R(v',t); $[x]_{+}\equiv\max(0,x)$. The triplet ranking loss pulls positive sample pairs closer together, with t' and v' as intermediate variables.
2. Cross-modal retrieval process for image text
After the model has been fully trained on the training set, the similarity between any image to be tested and all texts in the test library is calculated with Eq. (23), and the text with the maximum similarity is retrieved as the result; given a piece of text to be tested, its similarity to all images in the test library is calculated with Eq. (23), and the image with the maximum similarity is retrieved as the result.
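For the retrieval step itself, a minimal sketch: given the precomputed similarities between a query and all candidates in the test library (computed with Eq. (23)), the candidate with the maximum similarity is returned; top-k retrieval is shown as well, since evaluation usually reports Recall@K. The function name is hypothetical.

```python
import torch

def retrieve(query_to_candidates_sim: torch.Tensor, topk: int = 1):
    """query_to_candidates_sim: (num_candidates,) similarities R(query, candidate).
    Returns the indices of the top-k most similar candidates."""
    return torch.topk(query_to_candidates_sim, k=topk).indices

# Example: an image query against a library of 1000 captions.
sims = torch.randn(1000)            # placeholder for R(v, t) over the test library
best_text = retrieve(sims, topk=1)  # text with maximum similarity
top5_texts = retrieve(sims, topk=5)
```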
In summary, the invention discloses an image-text cross-modal retrieval method based on locally shared semantic centers. The method performs cross-modal semantic alignment of images and texts from the perspectives of local alignment and global alignment. For local alignment, a set of semantic centers shared by images and texts is trained; these centers describe the semantic commonality of the local features of images and texts, so the local features of both modalities can be aligned with respect to the same semantic centers. This local-alignment scheme avoids the costly direct interaction of local features, reducing the computation of local alignment while still mining fine-grained semantic information. For global alignment, images and texts are expressed with global features, which capture more comprehensive semantic knowledge and serve as auxiliary information to improve the accuracy of cross-modal retrieval. The method thus addresses the redundant computation and low recall of local alignment in image-text cross-modal retrieval.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. An image text cross-modal retrieval model based on a local shared semantic center is characterized in that the model is obtained by the following steps:
s1, extracting the regional features of the image and the word-level features of the text respectively, and then obtaining the image features and the text features for local alignment and global alignment respectively through two layers of independent mapping;
s2, clustering the image features and the text features in S1 to obtain k initialized shared semantic centers;
s3, obtaining the image text alignment semantic representation: calculating the similarity between the image text characteristics in the S1 and the shared semantic center in the step S2, aggregating the image characteristics into k image alignment semantic representations corresponding to the shared semantic center by using the similarity, namely the image context characteristics, and aggregating the text characteristics into k text alignment semantic representations corresponding to the shared semantic center, namely the text context characteristics;
s4, modeling the pooling operation of the regional characteristics and the text word-level characteristics of the image in the step 1 to obtain image global representation and text global representation;
s5, calculating the local similarity of the image text by using the image semantic representation and the text semantic representation with the same shared semantic center in the step S3, calculating the global similarity of the image text by using the image global representation and the text global representation in the step S4, and expressing the overall similarity of the image and the text by using the weighted sum of the local similarity and the global similarity to complete the modeling.
2. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S1 includes:
S1.1 Feature extraction of the image
Given an image I, a pre-trained Faster R-CNN is used to detect the regions $r_i$ in the image and extract the feature $f_i$ of each region $r_i$; two independent multilayer perceptrons then map the region features $f_i$:
$v^l_i = \mathrm{MLP}_{Vl}(f_i)$ #(1)
$v^g_i = \mathrm{MLP}_{Vg}(f_i)$ #(2)
In Eqs. (1) and (2), $\mathrm{MLP}_{Vl}$ and $\mathrm{MLP}_{Vg}$ denote two independent multilayer perceptrons, giving the image features used for local alignment and global alignment, $V^l=\{v^l_i\}_{i=1}^{m}$ and $V^g=\{v^g_i\}_{i=1}^{m}$, respectively;
S1.2 Feature extraction of the text
Given a text S, the text is first split into individual words with a segmentation tool and padded with zeros to a fixed length; the fixed-length word sequence $s_i$ is input to a pre-trained Bert to obtain word-level text features $z_i$, which two independent multilayer perceptrons then map:
$z_i = \mathrm{Bert}(s_i)$ #(3)
$t^l_i = \mathrm{MLP}_{Tl}(z_i)$ #(4)
$t^g_i = \mathrm{MLP}_{Tg}(z_i)$ #(5)
In Eq. (3), Bert denotes the pre-trained Bert network, $s_i$ the original input text, and $z_i$ the word-level text features extracted by Bert; in Eqs. (4) and (5), $\mathrm{MLP}_{Tl}$ and $\mathrm{MLP}_{Tg}$ denote two independent multilayer perceptrons, giving the text features used for local alignment and global alignment, $T^l=\{t^l_i\}_{i=1}^{n}$ and $T^g=\{t^g_i\}_{i=1}^{n}$, respectively.
3. the model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S2 includes:
s2.1 pairing image features V for local alignment in a training dataset l And text feature T for local alignment l Random sampling is carried out to obtain a plurality of untrained samplesThe image features and the text features of (a),
s2.2, carrying out K-means clustering on the randomly sampled image features and text features to obtain K initialized clustering centers
Figure FDA0003710485300000025
k < m and k < n,
s2.3, the initialized clustering center C is defined as a trainable shared semantic center, and parameters of the shared semantic center are updated along with network training.
4. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S3 includes:
S3.1 Obtaining the aligned semantic representation of the image
To obtain the image context features aligned with the shared semantic centers, the cosine similarity between the image features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} v_j^{l}}{\lVert c_i \rVert \, \lVert v_j^{l} \rVert}$ #(6)
In Eq. (6), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $v_j^{l}$ the j-th image feature used for local alignment, and $s_{ij}$ the cosine similarity between them; a softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(7)
In Eq. (7), λ denotes a temperature coefficient and $a_{ij}$ the normalized cosine similarity; taking $a_{ij}$ as the weight of $v_j^{l}$, the image context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{v} = \sum_{j=1}^{m} a_{ij} v_j^{l}$ #(8)
In Eq. (8), $p_i^{v}$ denotes the image context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned image features $P^{v}=\{p_i^{v}\}_{i=1}^{k}$;
S3.2 Obtaining the aligned semantic representation of the text
As in step S3.1, to obtain the text context features aligned with the shared semantic centers, the cosine similarity between the text features and the shared semantic centers is calculated:
$s_{ij} = \dfrac{c_i^{\top} t_j^{l}}{\lVert c_i \rVert \, \lVert t_j^{l} \rVert}$ #(9)
In Eq. (9), $c_i^{\top}$ denotes the transpose of the i-th shared semantic center, $t_j^{l}$ the j-th text feature used for local alignment, and $s_{ij}$ the cosine similarity between them; a softmax operation is applied to the cosine similarity matrix to obtain a normalized similarity matrix:
$a_{ij} = \dfrac{\exp(\lambda s_{ij})}{\sum_{i'=1}^{k}\exp(\lambda s_{i'j})}$ #(10)
In Eq. (10), λ denotes a temperature coefficient and $a_{ij}$ the normalized cosine similarity; taking $a_{ij}$ as the weight of $t_j^{l}$, the text context feature of the corresponding semantic center $c_i$ is calculated:
$p_i^{t} = \sum_{j=1}^{n} a_{ij} t_j^{l}$ #(11)
In Eq. (11), $p_i^{t}$ denotes the text context feature corresponding to the i-th shared semantic center $c_i$, giving the shared-semantically aligned text features $P^{t}=\{p_i^{t}\}_{i=1}^{k}$.
5. The model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S4 includes:
S4.1 Extracting the global features of the image
Multiple pooling operations are applied to the image features used for global alignment in step S1, giving several pooled image representations:
$\psi_i^{v} = \mathrm{max}_i(V^{g})$ #(12)
In Eq. (12), $\mathrm{max}_i$ denotes i-th value pooling of the features $V^{g}$ and $\psi_i^{v}$ the corresponding pooling result; all pooling results are modeled with a bi-GRU to approximate different pooling results:
$h_i^{v} = \mathrm{biGRU}(e_i^{v})$ #(13)
In Eq. (13), $e_i^{v}$ denotes the position encoding of the image features and $h_i^{v}$ the corresponding output of the bi-GRU, whose dimension is mapped with a fully-connected layer before a softmax normalization:
$\theta_i^{v} = \mathrm{softmax}(W_v h_i^{v} + b_v)$ #(14)
In Eq. (14), $W_v$ denotes the weight matrix of the fully-connected layer, $b_v$ its bias, and $\theta_i^{v}$ the weight coefficient corresponding to the i-th value pooling result; the global feature of the image is represented by the weighted sum of the pooling results:
$g_v = \sum_i \theta_i^{v} \psi_i^{v}$ #(15)
S4.2 Extracting the global representation of the text
As in step S4.1, multiple pooling operations are applied to the text features used for global alignment in step S1, giving several pooled text representations:
$\psi_i^{t} = \mathrm{max}_i(T^{g})$ #(16)
In Eq. (16), $\mathrm{max}_i$ denotes i-th value pooling of the features $T^{g}$ and $\psi_i^{t}$ the corresponding pooling result; all pooling results are modeled with a bi-GRU to approximate different pooling results:
$h_i^{t} = \mathrm{biGRU}(e_i^{t})$ #(17)
In Eq. (17), $e_i^{t}$ denotes the position encoding of the text features and $h_i^{t}$ the corresponding output of the bi-GRU, whose dimension is mapped with a fully-connected layer before a softmax normalization:
$\theta_i^{t} = \mathrm{softmax}(W_t h_i^{t} + b_t)$ #(18)
In Eq. (18), $W_t$ denotes the weight matrix of the fully-connected layer, $b_t$ its bias, and $\theta_i^{t}$ the weight coefficient corresponding to the i-th value pooling result; the global feature of the text is represented by the weighted sum of the pooling results:
$g_t = \sum_i \theta_i^{t} \psi_i^{t}$ #(19)
6. the model for cross-modal image text retrieval based on locally shared semantic center as claimed in claim 1, wherein the specific implementation of S5 includes:
s5.1 local similarity of image texts
The local similarity is represented by cosine similarity of image text context features:
Figure FDA00037104853000000421
in the formula (20)
Figure FDA0003710485300000051
Represented in a shared semantic center c i Contextual features of the lower image
Figure FDA0003710485300000052
Contextual features with text
Figure FDA0003710485300000053
Cosine similarity between the images and the texts, taking the sum of the similarity of all aligned semantic centers as the local similarity of the images and the texts:
Figure FDA0003710485300000054
s5.2 Global similarity of image text
The global similarity is represented by the cosine similarity of the image global representation and the text global representation:
Figure FDA0003710485300000055
r in the formula (22) g (v, t) represents the global feature g of the image v And global features g of text t Cosine similarity between them;
s5.3 Overall similarity of image text
According to the local similarity and the global similarity between the image texts obtained in the steps S5.1 and S5.2, the overall similarity between the image and the text is determined by the local similarity and the global similarity:
R(v,t)=β 1 R l (v,t)+β 2 R g (v,t)#(23)
in the formula (23) < beta > 1 And beta 2 Is a hyper-parameter that determines the local global scale.
7. The image text cross-modal retrieval model based on the local shared semantic center according to any one of claims 1 to 6, characterized in that the image text cross-modal retrieval model is trained by using the overall similarity; specifically:
The model is trained with a triplet ranking loss based on the obtained overall similarity:
$L = \sum_{(v,t)\in\mathcal{D}} \{[\Delta - R(v,t) + R(v,\hat{t})]_{+} + [\Delta - R(v,t) + R(\hat{v},t)]_{+}\}$ #(24)
In Eq. (24), Δ is a hyper-parameter, $(v,t)$ denotes a positive sample pair from the data set $\mathcal{D}$, $\hat{t}$ denotes the hardest negative sample of v, $\hat{v}$ denotes the hardest negative sample of t, and $[x]_{+}\equiv\max(0,x)$; the triplet ranking loss pulls positive sample pairs closer together.
8. The image text cross-modal retrieval method based on the image text cross-modal retrieval model with the locally shared semantic center according to any one of claims 1 to 6, characterized in that any image to be tested is input into the model according to any one of claims 1 to 6, the overall similarity between the image and all texts in the model test library is calculated, and the text with the maximum similarity is retrieved as the retrieval result; and for any piece of text to be tested, the similarity between the text and all images in the test library is calculated, and the image with the maximum similarity is retrieved as the retrieval result.
9. A computer device, characterized in that the computer device is provided with the instruction or program of the image text cross-modal retrieval model based on the local shared semantic center according to any one of claims 1 to 6 or the instruction or program of the image text cross-modal retrieval method according to claim 8.
CN202210718696.6A 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment Pending CN114969423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210718696.6A CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210718696.6A CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Publications (1)

Publication Number Publication Date
CN114969423A true CN114969423A (en) 2022-08-30

Family

ID=82965490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210718696.6A Pending CN114969423A (en) 2022-06-23 2022-06-23 Image text cross-modal retrieval model and method based on local shared semantic center and computer equipment

Country Status (1)

Country Link
CN (1) CN114969423A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN108733742B (en) Global normalized reader system and method
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111291188B (en) Intelligent information extraction method and system
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN110598005A (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN112687388B (en) Explanatory intelligent medical auxiliary diagnosis system based on text retrieval
CN110619121B (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
CN113553440B (en) Medical entity relationship extraction method based on hierarchical reasoning
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
CN113486667A (en) Medical entity relationship joint extraction method based on entity type information
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN114780690B (en) Patent text retrieval method and device based on multi-mode matrix vector representation
CN111984791A (en) Long text classification method based on attention mechanism
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
Xiong et al. An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval
CN112860930A (en) Text-to-commodity image retrieval method based on hierarchical similarity learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination