CN117520590B

CN117520590B - Ocean cross-modal image-text retrieval method, system, equipment and storage medium

Info

Publication number: CN117520590B
Application number: CN202410009510.9A
Authority: CN
Inventors: 陈亚雄; 黄吉瑞; 龚腾飞; 熊盛武; 袁景凌
Original assignee: Sanya Science and Education Innovation Park of Wuhan University of Technology
Current assignee: Sanya Science and Education Innovation Park of Wuhan University of Technology
Priority date: 2024-01-04
Filing date: 2024-01-04
Publication date: 2024-04-26
Anticipated expiration: 2044-01-04
Also published as: CN117520590A

Abstract

The application discloses a method, a system, equipment and a storage medium for searching ocean cross-modal graphics context, wherein the method respectively carries out similar clustering on image global features and text global features through a global similarity measurement module, and extracts reconstructed image global features of images and reconstructed text global features of texts; the multi-layer guiding module is used for fusing and reconstructing the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature, so that the image local feature and the text local feature of the image and the text local feature can be processed in a targeted manner, the image local feature and the text local feature can be fused organically, and then fused and reconstructed with the reconstructed image global feature and the reconstructed text global feature to obtain an aligned image feature and an aligned text feature with multi-mode information, the information between two modes of the aligned image and the text is realized, and the retrieval precision is improved.

Description

Ocean cross-modal image-text retrieval method, system, equipment and storage medium

Technical Field

The invention relates to the technical field of multiple modes, in particular to a method, a system, equipment and a storage medium for searching ocean cross-mode graphics context.

Background

The intelligent ocean is an ocean advanced system form developed by applying an intelligent technology and an advanced equipment technology on the basis of ocean digitization. The remote sensing image-text retrieval task is an important application field in intelligent ocean construction. Remote sensing technology acquires remote marine image data by using sensors, which can be transmitted and processed to acquire information about the marine environment. By combining with intelligent technology such as image recognition, machine learning and big data analysis, the ocean image data can be automatically analyzed and searched, and the required information can be quickly obtained. The method has important significance for marine scientific researchers and related industries, and can support decisions and actions in aspects of marine resource development, environmental protection, marine safety and the like.

At present, the existing retrieval methods are mostly remote sensing ocean cross-modal image-text retrieval methods based on subtitles, the methods firstly generate text description of remote sensing images, then match and compare the generated text with query text, and finally retrieve relevant remote sensing images. The essence of the idea is still that the retrieval matching work of a single mode can not be carried out, and the direct matching between different modes can not be carried out. And information interaction between modalities is not sufficiently performed, resulting in low final retrieval accuracy.

Therefore, in the prior art, in the process of carrying out image-text retrieval, the problem of low retrieval precision caused by insufficient alignment of information among modes exists.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a method, a system, a device and a storage medium for retrieving ocean cross-modal graphics and texts, which are used for solving the problem of low retrieval precision caused by insufficient alignment of information among modes in the graphics and texts retrieving process in the prior art.

In order to solve the problems, the invention provides a marine cross-modal image-text retrieval method, which comprises the following steps:

acquiring an image text data set, wherein the image text data set comprises an image data set and a text data set;

Constructing an initial image-text retrieval multi-level guidance network model, wherein the initial image-text retrieval multi-level guidance network model comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-layer guidance module and a multi-subspace joint learning module;

Inputting the image text data set into an initial image-text retrieval multi-level guidance network model, and extracting image global features and image local features of the image data set according to an image feature extraction module; acquiring text global features and text local features of a text data set according to a text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model;

and carrying out image-text retrieval according to the target image-text retrieval multi-stage guiding network model.

Further, performing similarity clustering on the image global feature and the text global feature according to the global similarity measurement module to obtain a reconstructed image global feature and a reconstructed text global feature respectively, including:

respectively constructing an image subspace and a text subspace according to the image global features and the text global features;

Calculating an image loss function of the image subspace and a text loss function of the text subspace according to the intra-mode loss function;

and performing similar clustering on the image subspace and the text subspace according to the image loss function and the text loss function to obtain a reconstructed image global feature and a reconstructed text global feature.

Further, the multi-layer guiding module comprises an implicit local mutual guiding unit and a local enhanced global reconstruction unit; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature, wherein the method comprises the following steps:

According to the implicit local mutual guidance unit, carrying out similar fusion on the local image features and the local text features to obtain reconstructed image local features and reconstructed text local features;

And carrying out fusion reconstruction on the local features of the reconstructed image, the local features of the reconstructed text, the global features of the reconstructed image and the global features of the reconstructed text according to the locally enhanced global reconstruction unit to respectively obtain the aligned image features and the aligned text features.

Further, according to the implicit local mutual guidance unit, performing similar fusion on the image local feature and the text local feature to obtain a reconstructed image local feature and a reconstructed text local feature, including:

Calculating the similarity of the local features of the images and the local features of the texts according to the cosine similarity function to obtain a similarity matrix between the text images;

and carrying out similar fusion on the image local features and the text local features based on the similarity matrix to obtain reconstructed image local features and reconstructed text local features.

Further, according to the locally enhanced global reconstruction unit, the reconstructed image local feature, the reconstructed text local feature, the reconstructed image global feature and the reconstructed text global feature are fused and reconstructed to obtain an aligned image feature and an aligned text feature respectively, which comprises the following steps:

Calculating the similarity between the local features of the reconstructed image and the global features of the reconstructed image according to the cosine similarity function to obtain an image weight matrix containing local image information;

Performing enhancement reconstruction on global features of the reconstructed image according to the image weight matrix to obtain aligned image features;

Calculating the similarity between the local features of the reconstructed text and the global features of the reconstructed text according to the cosine similarity function to obtain a text weight matrix containing local text information;

and carrying out enhancement reconstruction on global features of the reconstructed text according to the text weight matrix to obtain aligned text features.

Further, the multi-subspace joint learning module comprises a modal cross-matching learning module and a modal countermeasure fusion learning module; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model, wherein the method comprises the following steps:

Calculating a modal cross-matching learning loss between the aligned image features and the aligned text features according to a modal cross-matching learning module;

And calculating common representation generator loss and modal discrimination loss between the aligned image features and the aligned text features according to the modal countermeasure fusion learning module, constructing an optimized objective function based on the modal cross-matching learning loss, the common representation generator loss and the modal discrimination loss, and determining a complete training objective image-text retrieval multi-level guidance network model according to the optimized objective function.

Further, the calculation formula of the modal cross-matching learning loss is as follows:

the calculation formula for the common representation generator loss is:

the calculation formula of the modal discrimination loss is as follows:

the calculation formula of the optimization objective function is as follows:

Q

wherein, Learning loss for modal cross-matches,/>As a parameter of the marginal threshold value,For the final remote sensing image feature vector,/>For the final remote sensing text feature vector,/>For similarity score between remote sensing image and text,/>For all AND/>Unmatched text feature vectors,/>For similarity score of remote sensing image and all text not matched with the remote sensing image,/>For all AND/>Unmatched remote sensing image features,/>Scoring similarity of text to all remote sensing images that do not match it,/>Loss for public representation generator,/>Loss of generator for public representation of images,/>Loss for text public representation generator,/>Representing a network of modal discriminators,For the i-th final remote sensing image feature vector,/>Loss is judged for modeLoss is discriminated for image modality,/>For discriminating loss of text mode,/>，/>Expressed as image modality at 0,/>The text mode is expressed as 1, N is the total number of sample pairs, and Q is an optimization objective function.

In order to solve the above problems, the present invention further provides a marine cross-modal image-text retrieval system, comprising:

The data set acquisition module is used for acquiring an image text data set, wherein the image text data set comprises an image data set and a text data set;

The system comprises an initial image-text retrieval multi-level guidance network model construction module, a multi-level guidance network model analysis module and a multi-subspace joint learning module, wherein the initial image-text retrieval multi-level guidance network model construction module is used for constructing an initial image-text retrieval multi-level guidance network model, and the initial image-text retrieval multi-level guidance network model comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-level guidance module and a multi-subspace joint learning module;

The target image-text retrieval multi-level guidance network model acquisition module is used for inputting the image text data set into the initial image-text retrieval multi-level guidance network model, and extracting the image global features and the image local features of the image data set according to the image feature extraction module; acquiring text global features and text local features of a text data set according to a text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model;

And the image-text retrieval module is used for carrying out image-text retrieval according to the target image-text retrieval multi-level guidance network model.

In order to solve the above problems, the present invention further provides an image-text retrieval device, which includes a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the ocean cross-modal image-text retrieval method is implemented as described above.

In order to solve the above-mentioned problems, the present invention also provides a storage medium storing computer program instructions which, when executed by a computer, cause the computer to perform the marine cross-modal image-text retrieval method as described above.

The beneficial effects of adopting the embodiment are as follows: the invention provides a method, a system, equipment and a storage medium for searching ocean cross-modal graphics context, wherein the method respectively carries out similar clustering on image global features and text global features through a global similarity measurement module, and extracts reconstructed image global features of images and reconstructed text global features of texts; the multi-layer guiding module is used for fusing and reconstructing the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature, so that the image local feature and the text local feature of the image and the text local feature can be processed in a targeted manner, the image local feature and the text local feature can be fused organically, and then fused and reconstructed with the reconstructed image global feature and the reconstructed text global feature to obtain an aligned image feature and an aligned text feature with multi-mode information, the information between two modes of the aligned image and the text is realized, and the retrieval precision is improved.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of a method for retrieving ocean cross-modal text provided by the invention;

FIG. 2 is a schematic diagram of an embodiment of an initial image-text search multi-level guidance network model according to the present invention;

FIG. 3 is a schematic flow chart of an embodiment of obtaining global features of a reconstructed image and global features of a reconstructed text according to the present invention;

FIG. 4 is a flowchart of a first embodiment of obtaining aligned image features and aligned text features according to the present invention;

FIG. 5 is a schematic diagram of a multi-layer guiding module according to an embodiment of the present invention;

FIG. 6 is a schematic flow chart diagram of an embodiment of obtaining a reconstructed image local feature and a reconstructed text local feature according to the present invention;

FIG. 7 is a flowchart of a second embodiment of obtaining aligned image features and aligned text features according to the present invention;

FIG. 8 is a schematic flow chart diagram of an embodiment of a training complete target image-text retrieval multi-level guidance network model according to the present invention;

FIG. 9 is a schematic structural diagram of an embodiment of a multi-subspace joint learning module according to the present invention;

FIG. 10 is a block diagram of an embodiment of an ocean cross-modal teletext retrieval system provided by the invention;

fig. 11 is a block diagram of an embodiment of the image-text searching apparatus provided by the present invention.

Detailed Description

The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.

In order to solve the problems, the invention provides a method, a system, a device and a storage medium for searching ocean cross-mode graphics and texts, which are respectively described in detail below.

Fig. 1 is a schematic flow chart of an embodiment of an ocean cross-modal image-text retrieval method provided by the invention, and as shown in fig. 1, the ocean cross-modal image-text retrieval method comprises:

Step S101: acquiring an image text data set, wherein the image text data set comprises an image data set and a text data set;

step S102: constructing an initial image-text retrieval multi-level guidance network model, wherein the initial image-text retrieval multi-level guidance network model comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-layer guidance module and a multi-subspace joint learning module;

Step S103: inputting the image text data set into an initial image-text retrieval multi-level guidance network model, and extracting image global features and image local features of the image data set according to an image feature extraction module; acquiring text global features and text local features of a text data set according to a text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model;

Step S104: and carrying out image-text retrieval according to the target image-text retrieval multi-stage guiding network model.

In this embodiment, first, an image text data set is acquired, the image text data set including an image data set and a text data set; secondly, an initial image-text retrieval multi-level guidance network model is constructed, wherein the initial image-text retrieval multi-level guidance network model comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-layer guidance module and a multi-subspace joint learning module; then, inputting the image text data set into an initial image-text retrieval multi-level guidance network model, and extracting image global features and image local features of the image data set according to an image feature extraction module; acquiring text global features and text local features of a text data set according to a text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model; finally, the multi-level guidance network model is searched according to the target image and text to search the image and text.

In the embodiment, the global similarity measurement module is used for carrying out similarity clustering on the global image features and the global text features respectively, and the global image features of the reconstructed images and the global text features of the reconstructed texts of the images are extracted pertinently; the multi-layer guiding module is used for respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature, so that the image local feature and the text local feature of the image and the text local feature can be processed in a targeted manner, the image local feature and the text local feature can be organically fused, and then the fusion reconstruction is carried out on the image local feature and the reconstructed image global feature and the reconstructed text global feature, and the aligned image feature and the aligned text feature with multi-mode information are obtained; the multi-subspace joint learning module is used for carrying out mode cross matching and mode countermeasure fusion control on the training process of the initial image-text retrieval multi-level guidance network model, so that a reliable target image-text retrieval multi-level guidance network model is obtained, and the reliability of subsequent image-text retrieval is ensured.

In a specific embodiment, in step S102, as shown in fig. 2, fig. 2 is a schematic structural diagram of an embodiment of an initial graph-text search multi-level guidance network model provided in the present invention.

As a preferred embodiment, in step S103, in order to obtain the image global features and the image local features, for the image dataset, a model pre-trained on the ImageNet dataset using a pyramid visual transformation model (pyramid vision transformer, PVT) and its parameters are transferred to a retrieval task to construct the image subspace.

In particular, the structure of the image subspace construction includes 4 modules. Each module comprisesAn encoder layer. For each module, the input can be divided into/>Image blocks. Inputting the flattened image block intoIn the layer, the output is reshaped into a feature map, which is the output of each module. Finally, the 4 th isThe output vector of the block is flattened and fed into a linear projection to obtain the global characteristic of each remote sensing image. This process can be expressed as:

wherein, Representing each/>The block is provided with a plurality of channels,Representing a flattening operation. /(I)Representing each/>The output of the block.

Although global featuresContains most of the information of the remote sensing image, but only uses global features/>Representing that there is a significant deficiency in picture information. Thus, the present application utilizes the 4 th/>, pyramid visual transformation modelThe output of the block provides local information for the picture. Specifically, remote sensing image local features/>Expressed as:

In a specific embodiment, in order to obtain the text global feature and the text local feature, in particular, a query text may be expressed as ，/>Representing the included/>, of each text instanceA word. Then, after word segmentation preprocessing and insertion of two special mask symbols [ cls ] and [ sep ] and so on, trm is an encoder block with a multi-headed attention converter, which can learn multiple representation features from the text, is fed as an input vector into a bi-directional encoding model (Bidirectional Encoder Representations from Transformers, bert) from a transformer. After being processed by a bidirectional coding model from a transformer, the output C of the mask symbol [ cls ] is used as the global feature/>, of the text after being processed by a full connection layer and an activation functionThe procedure is as follows:

wherein, Representing a fully connected layer,/>Representing an activation function.

And each word of the text is obtained by passing through a bidirectional coding model from a transformerSending the text into the full connection layer to obtain the local feature/>, of the textThe process can be expressed as:

wherein, Representing a fully connected layer.

Further, in order to perform similarity clustering on the image global feature and the text global feature according to the global similarity measurement module, to obtain a reconstructed image global feature and a reconstructed text global feature respectively, as shown in fig. 3, fig. 3 is a flow chart of an embodiment of obtaining the reconstructed image global feature and the reconstructed text global feature provided by the present invention, which includes:

step S131: respectively constructing an image subspace and a text subspace according to the image global features and the text global features;

step S132: calculating an image loss function of the image subspace and a text loss function of the text subspace according to the intra-mode loss function;

Step S133: and performing similar clustering on the image subspace and the text subspace according to the image loss function and the text loss function to obtain a reconstructed image global feature and a reconstructed text global feature.

In this embodiment, first, an image subspace and a text subspace are respectively constructed according to an image global feature and a text global feature; then, calculating an image loss function of the image subspace and a text loss function of the text subspace according to the intra-mode loss function; and finally, performing similar clustering on the image subspace and the text subspace according to the image loss function and the text loss function to obtain a reconstructed image global feature and a reconstructed text global feature.

In this embodiment, the image loss function of the image subspace and the text loss function of the text subspace are calculated to respectively quantify the similarity of the image global feature and the text global feature, so that the reconstructed image global feature and the reconstructed text global feature with higher similarity are directionally obtained, and the reliability of the global feature is ensured.

In a specific embodiment, the two sets of global-local features extracted generate corresponding image subspaces and text subspaces, respectively, as follows:

In order to fully exploit the feature information in the single-mode global subspace, the samples in the single-mode global subspace are constrained by intra-modal metric losses of the global similarity metric unit. In particular, intra-modal loss functions are set Then usingTo constrain the image modality and the text modality, respectively, as defined below:

wherein, Is vector/>And/>Similarity function between,/>And/>Is two thresholds defining a boundary.

The application sets the similarity functionAs a cosine distance function, the calculation process is as follows:

wherein, And d is the dimension of the global feature. Distance measures between global features of the text are also calculated as in the procedure described above.

Will beAnd/>The image and text global features in the sequence are input into the global similarity measurement unit for use/>And (5) performing constraint. The constrained single-mode global features obtained through the process are respectively recorded as/>And/>Further, zhang Chengchong intra-modal global subspaces are respectively denoted as/>And/>。

Further, the multi-layer guiding module comprises an implicit local mutual guiding unit and a local enhanced global reconstruction unit; in order to respectively perform fusion reconstruction on an image local feature, a text local feature, a reconstructed image global feature and a reconstructed text global feature according to a multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature, as shown in fig. 4, fig. 4 is a flow diagram of a first embodiment of obtaining the aligned image feature and the aligned text feature provided by the present invention, which includes:

Step S231: according to the implicit local mutual guidance unit, carrying out similar fusion on the local image features and the local text features to obtain reconstructed image local features and reconstructed text local features;

Step S232: and carrying out fusion reconstruction on the local features of the reconstructed image, the local features of the reconstructed text, the global features of the reconstructed image and the global features of the reconstructed text according to the locally enhanced global reconstruction unit to respectively obtain the aligned image features and the aligned text features.

In the embodiment, firstly, similar fusion is carried out on the image local features and the text local features according to an implicit local mutual guidance unit to obtain reconstructed image local features and reconstructed text local features; and then, carrying out fusion reconstruction on the local features of the reconstructed image, the local features of the reconstructed text, the global features of the reconstructed image and the global features of the reconstructed text according to the locally enhanced global reconstruction unit to respectively obtain the aligned image features and the aligned text features.

In the embodiment, the local features of the image and the local features of the text are subjected to targeted similar fusion respectively through the implicit local mutual guidance unit, so that the local features of the reconstructed image and the local features of the reconstructed text with higher reliability are obtained, the multi-mode combined local features are realized, namely, the features of the image and the text at two angles are subjected to adaptive fusion, and the reliability of the obtained local features is effectively ensured; and carrying out fusion reconstruction on the local features of the reconstructed image, the local features of the reconstructed text, the global features of the reconstructed image and the global features of the reconstructed text through a locally enhanced global reconstruction unit, realizing adaptive feature alignment, respectively obtaining aligned image features and aligned text features, and realizing the organic fusion of the local features and the global features.

In an embodiment, as shown in fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a multi-layer guiding module according to the present invention.

As a preferred embodiment, in step S231, in order to perform similar fusion on the image local feature and the text local feature according to the implicit local interaction unit to obtain a reconstructed image local feature and a reconstructed text local feature, as shown in fig. 6, fig. 6 is a schematic flow chart of an embodiment of obtaining the reconstructed image local feature and the reconstructed text local feature provided by the present invention, where the flow chart includes:

step S2311: calculating the similarity of the local features of the images and the local features of the texts according to the cosine similarity function to obtain a similarity matrix between the text images;

step S2312: and carrying out similar fusion on the image local features and the text local features based on the similarity matrix to obtain reconstructed image local features and reconstructed text local features.

In the embodiment, firstly, calculating the similarity of the local features of the image and the local features of the text according to a cosine similarity function to obtain a similarity matrix between the text images; and then, carrying out similar fusion on the image local features and the text local features based on the similarity matrix to obtain reconstructed image local features and reconstructed text local features.

In the embodiment, the image and the text are fused by calculating the similarity of the image local feature and the text local feature, and then the image local feature and the text local feature are fused similarly based on the similarity matrix, so that the image local feature and the text local feature are adjusted adaptively, and the reliability of the obtained reconstructed image local feature and the reliability of the reconstructed text local feature are improved.

In a particular embodiment, a set of image and text local feature pairs is givenWhereinM represents the number of image blocks per image,/>K represents the number of words each text contains.

In order to establish the local subspace alignment of the text to the image, the system designs a module for guiding the alignment of slice information in the remote sensing image by word characteristics in the query text, and fully utilizes the local attribute of the query text to locate relevant or important areas in the remote sensing image. The specific description is as follows:

① For image local subspaces The local features of each group of images are used as Source keys, and the local features need to be guided by words related to the local information of each image, so that the information of a text mode is supplemented in the subspace of the image mode.

② For text local subspacesThe local characteristics of each group of texts in the image are used as Query keys, and the local characteristics of each group of texts are used as Query keys, so that the local information of each image which is most relevant to the corresponding word vector is found, the retrieval effectiveness of the local information of the image is improved, and the connection among modalities is enhanced.

③ Calculating the similarity between vectors in the two local subspaces using a cosine similarity function. This process can be expressed as:

wherein, Represents the/>Individual regions and/>Similarity between words,/>Representing the transpose function.

④ After the similarity between all the areas and the words is calculated, a similarity matrix based on the alignment between the modes of the text guidance image can be obtained. This process can be expressed as

⑤ The similarity matrix is activated by Leakly ReLU, then subjected to L2 normalization operation, and then again subjected to operation by an activation function to obtain a weight matrix. This process can be expressed as:

wherein, Representing Leakly ReLU functions,/>The representation is given by the normalization of L2,Representing an activation function.

⑥ Features in local subspaces of imagesAnd the weight matrix/>Multiplying to obtain the local feature/>, of the reconstructed image. The specific process is as follows:

wherein, Representing features in a local subspace of the original image, τ representing the transpose function.

Further, when local subspace alignment of the image to the text is established, a module for guiding relevant word alignment in the query text by slice information in the remote sensing image is designed, and the key description in the text is positioned by fully utilizing the local semantic attribute of the local subspace of the remote sensing image. The specific description is as follows:

① For text local subspaces The local characteristics of each group of texts in the text mode are used as Source keys, and the local characteristics are guided by local semantic information in each image, so that richer information of the image mode is added in a text mode subspace.

② For image local subspacesThe local features of each group of images are used as Query keys, and word vectors which are most relevant to the local information of each image are needed to be found, so that the attention of Query text to the local information of the images is improved.

③ Calculating the similarity between vectors in the two local subspaces using a cosine similarity function. Similarly, the cosine similarity/>, between vectors in two local subspaces is calculated. The formula of this step is written as

Wherein,Represents the/>Individual words and/>Similarity between individual regions,/>Representing the transpose function.

④ After the similarity between all the areas and the words is calculated, a similarity matrix based on the alignment among the modalities of the image guidance text can be obtained. This process can be expressed as

⑤ Similar to the above, the similarity matrix is activated using Leakly ReLU, then L2 normalized, and finally again activated by the activation function to obtain the weight matrix. This process can be expressed as:

⑥ Characterizing words in a local subspace of textAnd the weight matrix/>Multiplying to obtain local features/>, of reconstructed text. The specific process is as follows:

wherein, Representing features in the original text local subspace.

Finally, after bi-directional guidance among modes, local features of all remote sensing imagesAnd local features of remote sensing text/>The tensed spaces respectively form a new image local subspace/>, which contains the information of another modeAnd text local subspace/>The expression is as follows:

Wherein the method comprises the steps of ，/>And respectively reconstructing remote sensing images and text local features.

As a preferred embodiment, in step S232, in order to perform fusion reconstruction on the local image feature, the local text feature, the global image feature, and the global text feature according to the locally enhanced global reconstruction unit, to obtain an aligned image feature and an aligned text feature, as shown in fig. 7, fig. 7 is a schematic flow chart of a second embodiment of obtaining an aligned image feature and an aligned text feature provided by the present invention, which includes:

step S2321: calculating the similarity between the local features of the reconstructed image and the global features of the reconstructed image according to the cosine similarity function to obtain an image weight matrix containing local image information;

step S2322: performing enhancement reconstruction on global features of the reconstructed image according to the image weight matrix to obtain aligned image features;

Step S2323: calculating the similarity between the local features of the reconstructed text and the global features of the reconstructed text according to the cosine similarity function to obtain a text weight matrix containing local text information;

step S2324: and carrying out enhancement reconstruction on global features of the reconstructed text according to the text weight matrix to obtain aligned text features.

In this embodiment, first, calculating the similarity between the local feature of the reconstructed image and the global feature of the reconstructed image according to the cosine similarity function to obtain an image weight matrix containing the local image information; performing enhancement reconstruction on global features of the reconstructed image according to the image weight matrix to obtain aligned image features; then, calculating the similarity between the local features of the reconstructed text and the global features of the reconstructed text according to the cosine similarity function to obtain a text weight matrix containing local text information; and carrying out enhancement reconstruction on global features of the reconstructed text according to the text weight matrix to obtain aligned text features.

In the embodiment, the similarity between the local features and the global features of the image is calculated based on the cosine similarity function, and the global features of the reconstructed image are enhanced and reconstructed according to the image weight matrix, so that the fusion of the image features is realized; and calculating the similarity between the local features and the global features of the text based on the cosine similarity function, and carrying out enhancement reconstruction on the global features of the reconstructed text according to the text weight matrix, thereby realizing the fusion of the text features.

In a specific embodiment, in order to fully utilize the important semantic information in the local subspace to supplement the local information lacking in the global subspace, the guiding information between modalities is introduced at the same time, and the specific process is as follows:

(1) Respectively reconstructing local subspaces of images Reconstructed image Global feature/>And reconstructed local subspace of text/>In reconstructed text local features/>Taking it as a Query key, it needs to supplement the global space with local alignment information between modalities.

(2) Respectively reconstructing global subspaces of imagesReconstructed image Global feature/>And a reconstructed global subspace/>, of textText global feature of medium reconstruction/>It is used as Source key.

(3) The cosine similarity function is used to calculate the similarity between vectors in the global and local subspaces. This process can be expressed as:

wherein, Representing the similarity of all local information and global features in a picture,/>Representing the similarity of all local information and global features in a text, τ represents the transpose function.

(4) For a pair ofAnd/>The following operations are performed to obtain a weight matrix containing local information:

/>

(5) To be obtainedAnd/>Global features/>, respectively with image reconstructionAnd global feature of text reconstruction/>Multiplying to obtain the image feature/>, which gives consideration to the local-global information and the inter-modal alignment informationAnd text feature/>. The specific process is as follows:

Where τ represents the transpose function.

(6) After local and global fusion, all the obtained image featuresAnd text feature/>The space is the single-mode image subspace/>, which finally needs to be subjected to joint learningAnd Shan Motai text subspace/>The expression is as follows:

After determining the aligned image features and the aligned text features, in order to effectively control the optimization direction of the image-text retrieval multi-stage guidance network model according to the multi-subspace joint learning module to obtain a well-trained target image-text retrieval multi-stage guidance network model, a multi-subspace joint learning module is specially provided, wherein the multi-subspace joint learning module comprises a modal cross-matching learning module and a modal countermeasure fusion learning module, as shown in fig. 8, fig. 8 is a flow diagram of an embodiment of the well-trained target image-text retrieval multi-stage guidance network model provided by the invention, and the flow diagram comprises:

Step S331: calculating a modal cross-matching learning loss between the aligned image features and the aligned text features according to a modal cross-matching learning module;

step S332: and calculating common representation generator loss and modal discrimination loss between the aligned image features and the aligned text features according to the modal countermeasure fusion learning module, constructing an optimized objective function based on the modal cross-matching learning loss, the common representation generator loss and the modal discrimination loss, and determining a complete training objective image-text retrieval multi-level guidance network model according to the optimized objective function.

In this embodiment, the modal cross matching learning loss between the aligned image feature and the aligned text feature is calculated as a first loss function value according to the modal cross matching learning module, then the common representation generator loss and the modal discrimination loss between the aligned image feature and the aligned text feature are calculated according to the modal countermeasure fusion learning module to obtain two other loss function values, and a final optimization objective function is constructed based on the three loss function values, so that the optimization direction of the image-text retrieval multi-level guidance network model is comprehensively determined to obtain the target image-text retrieval multi-level guidance network model which meets the requirement most.

In an embodiment, as shown in fig. 9, fig. 9 is a schematic structural diagram of an embodiment of a multi-subspace joint learning module provided by the present invention.

In one embodiment, the calculation formula of the modal cross-match learning loss is:

the calculation formula for the common representation generator loss is:

the calculation formula of the modal discrimination loss is as follows:

the calculation formula of the optimization objective function is as follows:

Q

Through the mode, the global similarity measurement module is used for carrying out similarity clustering on the global image features and the global text features respectively, and the global image features of the reconstructed images and the global text features of the reconstructed texts of the images are extracted pertinently; the multi-layer guiding module is used for respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature, so that the image local feature and the text local feature of the image and the text local feature can be processed in a targeted manner, the image local feature and the text local feature can be organically fused, and then the fusion reconstruction is carried out on the image local feature and the reconstructed image global feature and the reconstructed text global feature, and the aligned image feature and the aligned text feature with multi-mode information are obtained; the multi-subspace joint learning module is used for carrying out mode cross matching and mode countermeasure fusion control on the training process of the initial image-text retrieval multi-level guidance network model, so that a reliable target image-text retrieval multi-level guidance network model is obtained, and the reliability of subsequent image-text retrieval is ensured.

In order to solve the above-mentioned problems, the present invention further provides a system for retrieving marine cross-modal graphics and text, as shown in fig. 10, fig. 10 is a block diagram of an embodiment of the marine cross-modal graphics and text retrieving system provided by the present invention, where the marine cross-modal graphics and text retrieving system 1000 includes:

a data set obtaining module 1001, configured to obtain an image text data set, where the image text data set includes an image data set and a text data set;

The initial image-text retrieval multi-level guidance network model construction module 1002 is configured to construct an initial image-text retrieval multi-level guidance network model, where the initial image-text retrieval multi-level guidance network model includes an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-layer guidance module, and a multi-subspace joint learning module;

The target image-text retrieval multi-level guidance network model acquisition module 1003 is used for inputting the image text data set into the initial image-text retrieval multi-level guidance network model, and extracting the image global features and the image local features of the image data set according to the image feature extraction module; acquiring text global features and text local features of a text data set according to a text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; performing modal cross matching and modal countermeasure fusion on the aligned image features and the aligned text features according to the multi-subspace joint learning module to obtain a complete training target image-text retrieval multi-level guidance network model;

The image-text searching module 1004 is used for searching the image-text according to the target image-text searching multi-level guiding network model.

The invention also correspondingly provides image-text retrieval equipment, as shown in fig. 11, and fig. 11 is a structural block diagram of an embodiment of the image-text retrieval equipment provided by the invention. The image-text retrieval device 1100 may be a computing device such as a mobile terminal, desktop computer, notebook, palm top computer, and server. The teletext retrieval arrangement 1100 comprises a processor 1101 and a memory 1102, wherein the memory 1102 has a teletext retrieval program 1103 stored thereon.

Memory 1102 may be an internal storage unit of a computer device in some embodiments, such as a hard disk or memory of a computer device. The memory 1102 may also be an external storage device of the computer device in other embodiments, such as a plug-in hard disk provided on the computer device, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD), or the like. Further, the memory 1102 may also include both internal storage units and external storage devices of the computer device. The memory 1102 is used for storing application software installed on the computer device and various types of data, such as program code for installing the computer device. Memory 1102 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the teletext retrieval program 1103 can be executed by the processor 1101, thereby implementing the marine cross-modal teletext retrieval method of embodiments of the invention.

The processor 1101 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory 1102, such as executing a teletext retrieval program or the like.

The embodiment also provides a computer readable storage medium, on which a graph-text retrieval program is stored, which when executed by a processor, implements the ocean cross-modal graph-text retrieval method as described above.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims

1. The ocean cross-modal image-text retrieval method is characterized by comprising the following steps of:

Constructing an initial image-text retrieval multi-level guidance network model, wherein the initial image-text retrieval multi-level guidance network model comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-layer guidance module and a multi-subspace combined learning module, and the multi-subspace combined learning module comprises a modal cross-matching learning module and a modal countermeasure fusion learning module;

Inputting the image text data set into the initial image-text retrieval multi-level guidance network model, and extracting image global features and image local features of the image data set according to the image feature extraction module; acquiring text global features and text local features of the text data set according to the text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; calculating modal cross matching learning loss between the aligned image features and the aligned text features according to the modal cross matching learning module; calculating common representation generator loss and modal discrimination loss between the aligned image features and the aligned text features according to the modal countermeasure fusion learning module, constructing an optimized objective function based on the modal cross matching learning loss, the common representation generator loss and the modal discrimination loss, and determining a complete training objective image-text retrieval multi-level guidance network model according to the optimized objective function;

Performing image-text retrieval according to the target image-text retrieval multi-level guidance network model;

The calculation formula of the modal cross matching learning loss is as follows:

The calculation formula of the common representation generator loss is as follows:

The calculation formula of the modal discrimination loss is as follows:

The calculation formula of the optimization objective function is as follows:

Q

wherein, Learning a penalty for the modality cross-match,/>As a parameter of the marginal threshold value,For the final remote sensing image feature vector,/>For the final remote sensing text feature vector,/>For similarity score between remote sensing image and text,/>For all AND/>Unmatched text feature vectors,/>For similarity score of remote sensing image and all text not matched with the remote sensing image,/>For all AND/>Unmatched remote sensing image features,/>Scoring similarity of text to all remote sensing images that do not match it,/>Lost for the public representation generator,/>Loss of generator for public representation of images,/>Loss for text public representation generator,/>Representing a modality discriminator network,/>For the i-th final remote sensing image feature vector,/>Loss is discriminated for the modality,/>Loss is discriminated for image modality,/>For discriminating loss of text mode,/>，/>Expressed as image modality at 0,/>And 1, representing the text mode, N is the total number of sample pairs, and Q is the optimization objective function.

2. The method for searching ocean cross-modal text according to claim 1, wherein the performing similar clustering on the image global feature and the text global feature according to the global similarity measurement module respectively to obtain a reconstructed image global feature and a reconstructed text global feature respectively includes:

And carrying out similar clustering on the image subspace and the text subspace according to the image loss function and the text loss function to obtain a reconstructed image global feature and a reconstructed text global feature.

3. The marine cross-modal teletext retrieval method according to claim 1, wherein the multi-layer coaching module comprises an implicit local mutual coaching unit and a locally enhanced global reconstruction unit; the fusing and reconstructing are respectively carried out on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature, and the method comprises the following steps:

performing similar fusion on the image local features and the text local features according to the implicit local mutual guidance unit to obtain reconstructed image local features and reconstructed text local features;

4. The method for searching marine cross-modal text according to claim 3, wherein the performing similar fusion on the image local feature and the text local feature according to the implicit local mutual guidance unit to obtain a reconstructed image local feature and a reconstructed text local feature comprises:

Calculating the similarity of the image local features and the text local features according to a cosine similarity function to obtain a similarity matrix between text images;

5. The method for searching marine cross-modal text according to claim 3, wherein the performing fusion reconstruction on the local image feature, the local text feature, the global image feature and the global text feature according to the locally enhanced global reconstruction unit to obtain an aligned image feature and an aligned text feature respectively includes:

calculating the similarity between the local features of the reconstructed image and the global features of the reconstructed image according to a cosine similarity function to obtain an image weight matrix containing local image information;

performing enhancement reconstruction on the global features of the reconstructed image according to the image weight matrix to obtain aligned image features;

Calculating the similarity between the local features of the reconstructed text and the global features of the reconstructed text according to a cosine similarity function to obtain a text weight matrix containing local text information;

And carrying out enhancement reconstruction on the global features of the reconstructed text according to the text weight matrix to obtain aligned text features.

6. An ocean cross-modal teletext retrieval system, comprising:

A data set acquisition module for acquiring an image text data set, the image text data set comprising an image data set and a text data set;

The system comprises an initial image-text retrieval multi-level guidance network model construction module, a multi-level guidance network model analysis module and a multi-subspace combined learning module, wherein the initial image-text retrieval multi-level guidance network model construction module is used for constructing an initial image-text retrieval multi-level guidance network model, and comprises an image feature extraction module, a text feature extraction module, a global similarity measurement module, a multi-level guidance module and a multi-subspace combined learning module, wherein the multi-subspace combined learning module comprises a modal cross matching learning module and a modal countermeasure fusion learning module;

The target image-text retrieval multi-level guidance network model acquisition module is used for inputting the image text data set into the initial image-text retrieval multi-level guidance network model, and extracting image global features and image local features of the image data set according to the image feature extraction module; acquiring text global features and text local features of the text data set according to the text feature extraction module; respectively carrying out similar clustering on the image global features and the text global features according to the global similarity measurement module to respectively obtain reconstructed image global features and reconstructed text global features; respectively carrying out fusion reconstruction on the image local feature, the text local feature, the reconstructed image global feature and the reconstructed text global feature according to the multi-layer guiding module to respectively obtain an aligned image feature and an aligned text feature; calculating modal cross matching learning loss between the aligned image features and the aligned text features according to the modal cross matching learning module; calculating common representation generator loss and modal discrimination loss between the aligned image features and the aligned text features according to the modal countermeasure fusion learning module, constructing an optimized objective function based on the modal cross matching learning loss, the common representation generator loss and the modal discrimination loss, and determining a complete training objective image-text retrieval multi-level guidance network model according to the optimized objective function;

The image-text retrieval module is used for performing image-text retrieval according to the target image-text retrieval multi-level guidance network model;

The calculation formula of the modal discrimination loss is as follows:

The calculation formula of the optimization objective function is as follows:

Q

7. An image-text retrieval device comprising a processor and a memory, wherein the memory has a computer program stored thereon, which when executed by the processor, implements the marine cross-modal image-text retrieval method of any one of claims 1-5.

8. A storage medium storing computer program instructions which, when executed by a computer, cause the computer to perform the marine cross-modal teletext retrieval method according to any one of claims 1 to 5.