CN114942984A - Visual scene text fusion model pre-training and image-text retrieval method and device - Google Patents


Info

Publication number
CN114942984A
Authority
CN
China
Prior art keywords
text
sample
scene
coding
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210590151.1A
Other languages
Chinese (zh)
Other versions
CN114942984B (en)
Inventor
孙逸鹏
程梦钧
王龙超
朱雄威
姚锟
韩钧宇
刘经拓
丁二锐
王井东
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210590151.1A priority Critical patent/CN114942984B/en
Publication of CN114942984A publication Critical patent/CN114942984A/en
Priority to US18/192,393 priority patent/US20230386168A1/en
Application granted granted Critical
Publication of CN114942984B publication Critical patent/CN114942984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/332 Query formulation
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/532 Query formulation, e.g. graphical querying
    • G06F16/5846 Retrieval of still image data using metadata automatically derived from the content, using extracted text
    • G06F16/5866 Retrieval of still image data using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F18/253 Pattern recognition; fusion techniques of extracted features
    • G06F40/30 Handling natural language data; semantic analysis
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/63 Scene text, e.g. street names
    • H04N19/176 Adaptive coding characterised by the coding unit being an image region, the region being a block, e.g. a macroblock
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a method and a device for pre-training a visual scene text fusion model and for image-text retrieval, and relates to the field of artificial intelligence, in particular to deep learning, image processing, and computer vision. A specific implementation scheme is as follows: obtaining a sample image-text pair; extracting a sample scene text from the sample image; inputting the sample text into a text coding network to obtain a sample text feature; inputting the sample image and an initial sample fusion feature into a visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into a scene coding sub-network, to obtain a global image feature of the sample image and a learned sample fusion feature; and pre-training the visual scene text fusion model according to the sample text feature, the global image feature of the sample image, and the learned sample fusion feature. This technical scheme can improve image-text cross-modal retrieval performance.

Description

Visual scene text fusion model pre-training and image-text retrieval method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to the field of deep learning, image processing, and computer vision technology.
Background
With the development of artificial intelligence technology, retrieval takes increasingly diverse forms, among which image-text retrieval is an important one. The task of image-text retrieval may be, given a piece of search text, to select the image most relevant to it from an image library, or, given an image, to select the text most relevant to it from a text library. In image-text retrieval, accurately completing the retrieval task is of great importance.
Disclosure of Invention
The disclosure provides a visual scene text fusion model pre-training and image-text retrieval method and device.
According to an aspect of the present disclosure, there is provided a pre-training method of a visual scene text fusion model, wherein the visual scene text fusion model includes a text coding network and a visual scene coding network, the visual scene coding network includes a visual coding sub-network and a scene coding sub-network, the method includes:
obtaining a sample image-text pair; wherein the sample image-text pair comprises a sample image and a sample text;
extracting sample scene texts in the sample images;
inputting the sample text into the text coding network to obtain sample text characteristics;
inputting the sample image and the initial sample fusion feature into the visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into the scene coding sub-network to obtain a global image feature of the sample image and a learned sample fusion feature;
and pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features.
According to another aspect of the present disclosure, there is provided a training method of a visual scene text fusion model, the method including:
acquiring a service image-text pair provided by a service party; wherein the service image-text pair comprises a service image and a service text; and
fine-tuning a visual scene text fusion model by using the service image and the service text as training data; wherein the visual scene text fusion model is obtained by pre-training based on any one of the pre-training methods of the visual scene text fusion model of the present disclosure.
According to another aspect of the present disclosure, there is provided an image-text retrieval method of a visual scene text fusion model, wherein the visual scene text fusion model includes a text coding network and a visual scene coding network, the visual scene coding network includes a visual coding sub-network and a scene coding sub-network, and the method includes:
acquiring a target text to be retrieved;
extracting candidate scene texts in the candidate images;
inputting the target text into the text coding network to obtain target text characteristics;
inputting the candidate image and the initial candidate fusion feature into the visual coding sub-network, and inputting the initial candidate fusion feature and the candidate scene text into the scene coding sub-network to obtain the global image feature of the candidate image;
and determining a target image from the candidate images according to the target text characteristics and the global image characteristics of the candidate images.
According to yet another aspect of the present disclosure, there is provided a pre-training apparatus for a visual scene text fusion model, wherein the visual scene text fusion model includes a text coding network and a visual scene coding network, the visual scene coding network includes a visual coding sub-network and a scene coding sub-network, the apparatus includes:
the sample image-text pair acquisition module is used for acquiring a sample image-text pair; wherein the sample image-text pair comprises a sample image and a sample text;
the sample scene text extraction module is used for extracting sample scene texts in the sample images;
the sample text characteristic determining module is used for inputting the sample text into the text coding network to obtain sample text characteristics;
the sample global feature determining module is used for inputting the sample images and the initial sample fusion features into the visual coding sub-network, and inputting the initial sample fusion features and the sample scene texts into the scene coding sub-network to obtain the global image features of the sample images and the learned sample fusion features;
and the model pre-training module is used for pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features.
According to another aspect of the present disclosure, there is provided a training apparatus for a visual scene text fusion model, including:
the service image-text pair acquisition module is used for acquiring a service image-text pair provided by a service party; wherein the service image-text pair comprises a service image and a service text;
the fine-tuning module is used for fine-tuning a visual scene text fusion model by using the service image and the service text as training data; wherein the visual scene text fusion model is obtained by pre-training based on the pre-training apparatus of the visual scene text fusion model of any embodiment of the present disclosure.
According to still another aspect of the present disclosure, there is provided an image-text retrieval apparatus of a visual scene text fusion model, wherein the visual scene text fusion model includes a text coding network and a visual scene coding network, the visual scene coding network includes a visual coding sub-network and a scene coding sub-network, and the apparatus includes:
the target text acquisition module is used for acquiring a target text to be retrieved;
the candidate scene text extraction module is used for extracting candidate scene texts in the candidate images;
the target text characteristic determining module is used for inputting the target text into the text coding network to obtain target text characteristics;
a candidate global feature determination module, configured to input the candidate image and the initial candidate fusion feature into the visual coding sub-network, and input the initial candidate fusion feature and the candidate scene text into the scene coding sub-network, so as to obtain a global image feature of the candidate image;
and the target image determining module is used for determining a target image from the candidate images according to the target text characteristics and the global image characteristics of the candidate images.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a pre-training method, a training method, or an image-text retrieval method of a visual scene text fusion model according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a pre-training method, a training method, or an image-text retrieval method of a visual scene text fusion model according to any one of the embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
Fig. 1 is a flowchart of a pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 2A is a flowchart of another pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 2B is a schematic diagram of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 2C is a schematic diagram of a process for determining a sample fusion feature according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of a pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 4 is a flowchart of an image-text retrieval method of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a pre-training apparatus of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a training apparatus of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an image-text retrieval apparatus of a visual scene text fusion model provided according to an embodiment of the present disclosure;
Fig. 8 is a block diagram of an electronic device for implementing a pre-training method, a training method, or an image-text retrieval method of a visual scene text fusion model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the terms "first," "second," "third," "target," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "sample," "candidate," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between different phases and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In image-text cross-modal retrieval, scene text in an image has positive value for understanding the visual semantic information of the image, yet image-text cross-modal retrieval in the related art does not make full use of scene text information. However, only some images contain scene text while others do not; if scene text information is simply added to the cross-modal retrieval process, retrieval performance degrades for images without scene text.
To address the above problems, the present disclosure provides an effective visual Scene Text fusion (ViSTA) model for handling cross-modal retrieval tasks both in the presence and in the absence of scene text. The ViSTA model is designed entirely with Transformers, and scene text is introduced into the visual coding process through fusion features shared between visual coding and scene coding. To address the problem of missing scene text, the present disclosure further proposes a dual contrastive supervision that combines a fused text contrastive loss and an image-text contrastive loss. Compared with cross-modal retrieval in the related art, ViSTA can fuse scene text semantics with visual features and can alleviate the degradation of retrieval performance in the absence of scene text.
The contributions of the present disclosure include at least the following aspects: 1) a network architecture based entirely on Transformers is provided, which can effectively fuse visual images and scene text and is applicable both when scene text is present and when it is absent; 2) through the visual coding sub-network and the scene coding sub-network, visual and scene text information is exchanged by sharing the fusion feature, and the visual features are enhanced through a dual contrastive loss; 3) experiments show that the performance of the ViSTA model is far superior to that of image-text cross-modal retrieval methods in the related art.
Fig. 1 is a flowchart of a pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure. This embodiment is suitable for pre-training the visual scene text fusion model. The method may be executed by a pre-training apparatus of the visual scene text fusion model, which may be implemented in software and/or hardware and integrated into an electronic device that carries the pre-training function of the visual scene text fusion model. Optionally, the visual scene text fusion model includes a text coding network and a visual scene coding network, and the visual scene coding network includes a visual coding sub-network and a scene coding sub-network. As shown in Fig. 1, the pre-training method of the visual scene text fusion model of this embodiment may include:
and S101, obtaining a sample image-text pair.
In this embodiment, the sample image-text pair is sample data used for pre-training the visual scene text fusion model. Optionally, each sample image-text pair includes a sample image and a sample text, and the sample text describes the image content of the sample image. For example, for an image of a person eating a hamburger, the sample text may be "a young person is eating a hamburger".
Optionally, a large number of sample image-text pairs may be collected from the Internet or other sources to pre-train the visual scene text fusion model. Considering that images in real scenes may or may not contain scene text, and in order to make the visual scene text fusion model applicable both to image-text retrieval with scene text and to image-text retrieval without scene text, the sample image in this embodiment may be an image with scene text or an image without scene text.
S102, extracting a sample scene text in the sample image.
In this embodiment, the sample scene text refers to text existing in the sample image.
Specifically, the sample image may be processed based on an Optical Character Recognition (OCR) technique to obtain the sample scene text in the sample image; further, the image position coding information of the sample scene text in the sample image can also be obtained. It should be noted that a sample image may contain one or more sample scene texts, and each sample scene text may include one or more words.
For a sample image without scene text, the sample scene text extracted from it is a null text.
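As an illustration of this extraction step, the following sketch uses an off-the-shelf OCR engine to obtain the scene text words and their image position information. pytesseract, the confidence threshold, and the box format are assumptions made only for this example; the disclosure does not prescribe a particular OCR implementation.

```python
# Illustrative sketch of step S102: extracting the sample scene text and its image
# position information from a sample image with an assumed OCR engine (pytesseract).
import pytesseract
from PIL import Image

def extract_scene_text(image_path, min_conf=60):
    image = Image.open(image_path)
    ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words, boxes = [], []
    for text, conf, x, y, w, h in zip(
        ocr["text"], ocr["conf"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        if text.strip() and float(conf) > min_conf:
            words.append(text)                  # one or more words per sample scene text
            boxes.append((x, y, x + w, y + h))  # image position coding information
    # A sample image without scene text yields empty lists, i.e. a null scene text.
    return words, boxes
```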
S103, inputting the sample text into a text coding network to obtain the sample text characteristics.
In this embodiment, the sample text feature refers to a semantic feature of the sample text, and may be represented in a matrix or vector form.
Specifically, the sample text may be input to a text coding network, and processed by the text coding network to obtain sample text features, i.e., a text token.
S104, inputting the sample image and the initial sample fusion feature into a visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into a scene coding sub-network to obtain the global image feature of the sample image and the learned sample fusion feature.
In this embodiment, the sample fusion feature is used to characterize a correlation feature between the image and the scene text, that is, a fusion token, and may be in a matrix or vector form; it should be noted that the initial sample fusion characteristics may be preset, and in the model pre-training and model fine-tuning processes, the sample fusion characteristics may be dynamically learned and updated. The global image feature of the sample image is obtained by performing multi-modal feature extraction on the sample image from a global perspective, and may be referred to as an image token.
Specifically, the sample image and the initial sample fusion feature may be input into the visual coding sub-network of the visual scene coding network, and the initial sample fusion feature and the sample scene text may be input into the scene coding sub-network of the visual scene coding network; the visual coding sub-network and the scene coding sub-network cooperate to obtain the global image feature of the sample image and the learned sample fusion feature output by the visual scene coding network.
S105, pre-training the visual scene text fusion model according to the sample text characteristics, the global image characteristics of the sample image and the learned sample fusion characteristics.
In an alternative mode, the training loss may be calculated from the sample text features, the global image features of the sample images, and the learned sample fusion features based on a preset loss function; the visual scene text fusion model is then pre-trained according to the training loss until the training loss stabilizes within a set range or the number of training iterations reaches a set number, at which point pre-training stops. The set range and the set number may be determined by those skilled in the art according to actual requirements.
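A minimal pre-training loop consistent with the stopping criteria described above might look like the following sketch. PyTorch, the AdamW optimizer, and the helper compute_loss (standing in for the preset loss function over the three kinds of features) are illustrative assumptions.

```python
import torch

def pretrain(model, dataloader, compute_loss, max_steps=100_000, patience=1_000, tol=1e-4):
    """Sketch of the pre-training loop: train until the training loss stabilizes within a
    set range (no improvement larger than `tol` for `patience` steps) or until a set
    number of steps is reached."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed optimizer and learning rate
    best, stale, step = float("inf"), 0, 0
    while step < max_steps and stale < patience:
        for sample_images, sample_texts, sample_scene_texts in dataloader:
            loss = compute_loss(model, sample_images, sample_texts, sample_scene_texts)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            stale = 0 if loss.item() < best - tol else stale + 1
            best = min(best, loss.item())
            step += 1
            if step >= max_steps or stale >= patience:
                break
    return model
```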
According to the technical scheme of the embodiment of the disclosure, sample texts in sample image-text pairs are input into a text coding network to obtain sample text characteristics; and inputting the sample images in the sample image-text pairs and the initial sample fusion features into a visual coding sub-network, inputting the initial sample fusion features and the sample scene texts extracted from the sample images into the scene coding sub-network to obtain the global image features of the sample images and the learned sample fusion features, and then pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features. According to the technical scheme, the sample fusion characteristics, the global image characteristics of the sample images and the sample text characteristics are introduced to pre-train the visual scene text fusion model, so that the global image characteristics of the images can be accurately output no matter whether the input images have scene texts, and the visual scene text fusion model can support a retrieval task under a large-range image text set, namely, the visual scene text fusion model is not only suitable for cross-modal retrieval under the condition of scene texts, but also suitable for cross-modal retrieval without the scene texts; in addition, the scheme performs parallel processing on the image and the text through a double-coding structure, namely a parallel text coding network and a visual scene coding network, and can improve the cross-modal retrieval efficiency.
On the basis of the above embodiment, as an optional mode of the present disclosure, inputting the sample text into the text coding network to obtain the sample text features may include: performing word embedding on the sample text to obtain sample text word vectors; determining a word coding result of the sample text according to the modal information of the sample text, the position coding information of the sample text, and the sample text word vectors; constructing a coding sequence of the sample text according to the initial sample text feature and the word coding result of the sample text; and inputting the coding sequence of the sample text into the text coding network to obtain the learned sample text features.
The modal information of the sample text is preset coding information used to characterize the modality of the sample text, and is distinguished from the modal information of the sample image blocks and the modal information of the sample scene text described later; for example, the modal information of the sample image block, the modal information of the sample scene text, and the modal information of the sample text may be represented by m0, m1, and m2, respectively, i.e., the coding information of each modality is different. The position coding information of the sample text is the word-order coding information of the words in the sample text.
Specifically, the sample text may be processed based on a word embedding technique to obtain a vector of each word in the sample text, that is, a sample text word vector.
And for each word in the sample text, splicing the modal information of the sample text, the position coding information of the word and the sample text word vector of the word based on a certain rule to obtain a word coding result of the word.
After the word coding results of the words in the sample text are determined, the coding sequence of the sample text can be constructed according to the initial sample text characteristics and the word coding results of the words. The initial sample text features can be specifically composed of initial text codes, modal information of sample texts and position coding information of the sample texts; optionally, the initial sample text feature is located at the head of the coding sequence of the sample text.
After the coding sequence of the sample text is determined, the coding sequence of the sample text can be input into a text coding network to obtain learned sample text features.
It can be understood that the sample text features are learned by combining the modal information of the sample text, the position coding information of the sample text, the sample text word vectors and the like, so that the finally determined sample text features are more accurate.
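The construction of the coding sequence of the sample text can be sketched as follows in PyTorch (an assumed framework); the vocabulary size, hidden size, maximum length, and the realization of "splicing" as element-wise addition are illustrative choices, with m2 as the text modality id as in the example above.

```python
import torch
import torch.nn as nn

class TextSequenceBuilder(nn.Module):
    """Sketch of building the coding sequence of the sample text: word vectors plus modal
    information and word position coding, with the initial sample text feature at the head."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_modalities=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)       # sample text word vectors
        self.pos_emb = nn.Embedding(max_len, hidden)           # position coding information
        self.modal_emb = nn.Embedding(num_modalities, hidden)  # modal information (m2 for sample text)
        self.init_text_feature = nn.Parameter(torch.zeros(1, 1, hidden))  # initial text code to be learned

    def forward(self, token_ids, text_modality_id=2):
        b, _ = token_ids.shape
        modal = self.modal_emb(torch.full_like(token_ids[:, :1], text_modality_id))   # (b, 1, hidden)
        words = self.word_emb(token_ids) + modal                 # word coding results ("splicing" as addition)
        head = self.init_text_feature.expand(b, -1, -1) + modal  # initial sample text feature at the head
        seq = torch.cat([head, words], dim=1)
        pos = torch.arange(seq.size(1), device=token_ids.device).unsqueeze(0)
        return seq + self.pos_emb(pos)                           # fed into the text coding network
```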
On the basis of the above embodiment, as an optional mode of the present disclosure, inputting the sample image into the visual coding sub-network may include: partitioning the sample image to obtain a sample image block sequence; performing linear projection on the sample image blocks in the sample image block sequence to obtain coding results of the sample image blocks; processing the coding result of each sample image block according to the modal information of the sample image block and the position coding information of the sample image block to obtain a processed coding result of the sample image block; constructing a coding sequence of the sample image according to the initial global image feature and the processed coding results of the sample image blocks; and inputting the coding sequence of the sample image into the input layer of the visual coding sub-network.
In this embodiment, the mode information of the sample image block is preset encoding information for characterizing the mode of the sample image. The position coding information of the sample image block is the identification information of the sample image block.
Specifically, the sample image may be divided into a plurality of sample image blocks of the same size (for example, N blocks, where N is a positive integer) according to a certain size, so as to obtain a sample image block sequence.
After the sample image block sequence is obtained, linear projection is carried out on each sample image block in the sample image block sequence, and the coding result of the sample image block is obtained. And based on a certain rule, splicing the coding result of the sample image block, the modal information of the sample image block and the position coding information of the sample image block to obtain a processed coding result of the sample image block.
After the processed coding result of each sample image block is determined, a coding sequence of the sample image can be constructed according to the initial global image feature and the processed coding result of each sample image block. The initial global image feature may specifically be composed of an image code to be learned, the modal information of the sample image block, and the position coding information of the sample image block. Optionally, the initial global image feature is located at the head of the coding sequence of the sample image. After the coding sequence of the sample image is determined, it may be input into the input layer of the visual coding sub-network.
It can be understood that the present disclosure determines the coding results of the sample image blocks by linear projection of the image blocks; compared with obtaining the coding results of image blocks with a detector, this reduces computational complexity and improves efficiency. Meanwhile, the coding sequence of the sample image is determined by combining the modal information of the sample image blocks, the position coding information of the sample image blocks, the initial global image feature, and the like, so that the visual coding sub-network can extract richer features from the sample image.
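A corresponding sketch of building the coding sequence of the sample image is given below; the image size, patch size, hidden size, and the use of a strided convolution to realize the linear projection of the image blocks are assumptions for illustration, with m0 as the image modality id.

```python
import torch
import torch.nn as nn

class ImageSequenceBuilder(nn.Module):
    """Sketch of building the coding sequence of the sample image: the image is split into N
    same-size blocks, each block is linearly projected, and modal information and position
    coding are added, with the initial global image feature at the head."""
    def __init__(self, image_size=224, patch_size=16, hidden=768, num_modalities=3):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2                  # N sample image blocks
        self.proj = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)  # linear projection
        self.pos_emb = nn.Embedding(self.num_patches + 1, hidden)           # position coding of the blocks
        self.modal_emb = nn.Embedding(num_modalities, hidden)               # modal information (m0 for image blocks)
        self.init_global_feature = nn.Parameter(torch.zeros(1, 1, hidden))  # image code to be learned

    def forward(self, images, image_modality_id=0):
        b = images.size(0)
        blocks = self.proj(images).flatten(2).transpose(1, 2)               # (b, N, hidden): block coding results
        head = self.init_global_feature.expand(b, -1, -1)                   # initial global image feature at the head
        seq = torch.cat([head, blocks], dim=1)
        pos = torch.arange(seq.size(1), device=images.device).unsqueeze(0)
        modal = torch.full((b, seq.size(1)), image_modality_id, dtype=torch.long, device=images.device)
        return seq + self.pos_emb(pos) + self.modal_emb(modal)              # fed into the visual coding sub-network
```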
On the basis of the above embodiment, as an optional mode of the present disclosure, inputting the sample scene text into the scene coding sub-network may include: performing word embedding on the sample scene text to obtain a sample scene text vector; determining a coding result of the sample scene text according to the image position coding information of the sample scene text, the modal information of the sample scene text, the word position coding information of the sample scene text, and the sample scene text vector; constructing a coding sequence of the sample scene text according to the initial sample scene text feature and the coding result of the sample scene text; and inputting the coding sequence of the sample scene text into the input layer of the scene coding sub-network.
In this embodiment, the modal information of the sample scene text is preset coding information used to characterize the modality of the scene text in the sample image. The word position coding information of the sample scene text uniquely identifies the order of the words in the sample scene text. The image position coding information of the sample scene text is the position information of the sample scene text within the sample image.
Specifically, for each sample scene text extracted from the sample image, the sample scene text may be processed based on a word embedding technique to obtain a sample scene text vector, denoted as Embedding(o_word).
After determining the sample scene text vector, the image position coding information of the sample scene text (denoted as F_linear(o_bbox)), the modal information of the sample scene text (denoted as S_type), the word position coding information of the sample scene text (denoted as S_token_id), and the sample scene text vector (denoted as Embedding(o_word)) may be spliced based on a certain rule to obtain the coding result of the sample scene text (denoted as S_0). For example, the coding result of the sample scene text may be obtained by the following formulas:
S_init = Embedding(o_word) + F_linear(o_bbox)
S_0 = S_init + S_type + S_token_id
where S_0 represents the coding result of the sample scene text, F_linear(o_bbox) represents the image position coding information, Embedding(o_word) represents the sample scene text vector, S_type represents the modal information of the sample scene text, and S_token_id represents the word position coding information of the sample scene text.
After the coding result of each sample scene text in the sample image is determined, a coding sequence of the sample scene text can be constructed according to the initial sample scene text feature and the coding result of each sample scene text. The initial sample scene text feature may specifically be composed of a scene text code to be learned, the modal information of the sample scene text, the image position coding information, and the word position coding information of the sample scene text; optionally, the initial sample scene text feature is located at the head of the coding sequence of the sample scene text. After the coding sequence of the sample scene text is determined, it may be input into the input layer of the scene coding sub-network.
It can be understood that the coding sequence of the sample scene text is determined by combining the image position coding information, the modal information of the sample scene text, the word position coding information of the sample scene text, the initial sample scene text feature, and the like, so that the scene coding sub-network can extract richer features from the sample scene text.
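The formulas above can be sketched as follows, again in PyTorch; the vocabulary size, hidden size, the (x1, y1, x2, y2) box format fed to F_linear, and m1 as the scene text modality id are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneTextSequenceBuilder(nn.Module):
    """Sketch of S_init = Embedding(o_word) + F_linear(o_bbox) and
    S_0 = S_init + S_type + S_token_id, with the initial sample scene text feature at the head."""
    def __init__(self, vocab_size=30522, hidden=768, max_words=64, num_modalities=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)        # Embedding(o_word)
        self.box_proj = nn.Linear(4, hidden)                    # F_linear(o_bbox): image position coding
        self.token_pos_emb = nn.Embedding(max_words, hidden)    # S_token_id: word position coding
        self.modal_emb = nn.Embedding(num_modalities, hidden)   # S_type: modal information (m1 for scene text)
        self.init_scene_feature = nn.Parameter(torch.zeros(1, 1, hidden))  # scene text code to be learned

    def forward(self, word_ids, boxes, scene_modality_id=1):
        # word_ids: (b, n) word indices of the scene text; boxes: (b, n, 4) normalized coordinates
        b, n = word_ids.shape
        pos = torch.arange(n, device=word_ids.device).unsqueeze(0)
        modal = self.modal_emb(torch.full_like(word_ids, scene_modality_id))
        s_init = self.word_emb(word_ids) + self.box_proj(boxes)             # S_init
        s0 = s_init + modal + self.token_pos_emb(pos)                       # S_0
        head = self.init_scene_feature.expand(b, -1, -1)                    # initial sample scene text feature
        return torch.cat([head, s0], dim=1)                                 # fed into the scene coding sub-network
```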
Fig. 2A is a flowchart of another pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure; Fig. 2B is a schematic diagram of a visual scene text fusion model provided according to an embodiment of the present disclosure. On the basis of the above embodiments, this embodiment further optimizes "inputting the sample image and the initial sample fusion feature into the visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into the scene coding sub-network, to obtain the global image feature of the sample image and the learned sample fusion feature" and provides an alternative solution. As shown in Fig. 2A, the pre-training method of the visual scene text fusion model of this embodiment may include:
S201, obtaining a sample image-text pair.
Wherein the sample image-text pair includes a sample image and a sample text.
S202, extracting a sample scene text in the sample image.
S203, inputting the sample text into a text coding network to obtain the sample text characteristics.
S204, inputting the sample image into an input layer in the visual coding sub-network, and inputting the initial sample fusion feature into a fusion layer in the visual coding sub-network to obtain the global image feature of the sample image output by the visual coding sub-network and the visual fusion feature output by the visual coding sub-network.
S205, inputting the sample scene text into an input layer in the scene coding sub-network, and inputting the initial sample fusion characteristics into a fusion layer in the scene coding sub-network to obtain the scene fusion characteristics output by the scene coding sub-network.
S206, fusing the visual fusion characteristics output by the visual coding sub-network and the scene fusion characteristics output by the scene coding sub-network to obtain the learned sample fusion characteristics.
S207, pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features.
In this embodiment, the visual coding sub-network may be composed of a plurality of Transformer layers, and the scene coding sub-network may also be composed of a plurality of Transformer layers; preferably, the visual coding sub-network and the scene coding sub-network may have the same network structure, i.e., the same number of network layers.
Further, the visual coding subnetwork may include at least one fusion layer, and the scene coding subnetwork may also include at least one fusion layer. Optionally, the number of the fusion layers in the visual coding subnetwork is the same as the number of the fusion layers in the scene coding subnetwork; when the number of network layers of the visual coding sub-network and the scene coding sub-network is the same, the position of the fusion layer in the visual coding sub-network is the same as the position of the fusion layer in the scene coding sub-network.
For example, in the visual coding sub-network, the learning process of any Transformer layer before the fusion layers can be represented by the following formulas:
Y_l ← MHSA(LN(V_l)) + V_l
V_{l+1} ← MLP(LN(Y_l)) + Y_l
where V_l represents the image feature (i.e., image token) predicted by the l-th layer, MHSA(·) refers to a multi-head self-attention layer, MLP(·) refers to a multi-layer perceptron layer, LN(·) refers to layer normalization, and Y_l is an intermediate variable; l is a positive integer.
Similarly, in the scene coding sub-network, the learning process of any Transformer layer before the fusion layers can be represented by the following formulas:
X_l ← MHSA(LN(S_l)) + S_l
S_{l+1} ← MLP(LN(X_l)) + X_l
where S_l represents the scene text feature (i.e., scene text token) predicted by the l-th layer, and X_l is an intermediate variable.
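Both sets of formulas describe the same pre-norm Transformer layer, applied to image tokens in the visual coding sub-network and to scene text tokens in the scene coding sub-network; a sketch of such a layer is given below, with PyTorch, the hidden size, head count, and MLP ratio as illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreNormTransformerLayer(nn.Module):
    """Sketch of one pre-fusion Transformer layer:
    Y_l = MHSA(LN(V_l)) + V_l and V_{l+1} = MLP(LN(Y_l)) + Y_l
    (identically X_l / S_{l+1} in the scene coding sub-network)."""
    def __init__(self, hidden=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        self.mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden * mlp_ratio), nn.GELU(),
            nn.Linear(hidden * mlp_ratio, hidden),
        )

    def forward(self, v):                                   # v: V_l or S_l, shape (batch, sequence, hidden)
        x = self.ln1(v)
        y = self.mhsa(x, x, x, need_weights=False)[0] + v   # Y_l (or X_l)
        return self.mlp(self.ln2(y)) + y                    # V_{l+1} (or S_{l+1})
```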
In the visual coding sub-network, the first fusion layer outputs a new global image feature and a visual fusion feature according to the initial sample fusion feature and the global image feature received from the previous layer; in the scene coding sub-network, the first fusion layer outputs a new scene text feature and a scene fusion feature according to the initial sample fusion feature and the scene text feature received from the previous layer; the visual fusion feature output by the first fusion layer in the visual coding sub-network and the scene fusion feature output by the first fusion layer in the scene coding sub-network are then fused to obtain a further learned sample fusion feature.
Then, in the visual coding sub-network, the second fusion layer outputs a new global image feature and a new visual fusion feature according to the global image feature output by the first fusion layer and the further learned sample fusion feature; in the scene coding sub-network, the second fusion layer outputs a new scene text feature and a new scene fusion feature according to the scene text feature output by the first fusion layer and the further learned sample fusion feature; and the visual fusion feature output by the second fusion layer in the visual coding sub-network and the scene fusion feature output by the second fusion layer in the scene coding sub-network are fused to obtain a further learned sample fusion feature.
By analogy, in the visual coding sub-network, the global image feature and the visual fusion feature output by the last fusion layer (i.e., the output layer) can be obtained; in the scene coding sub-network, the scene fusion feature output by the last fusion layer (i.e., the output layer) can be obtained; and the visual fusion feature output by the last fusion layer in the visual coding sub-network and the scene fusion feature output by the last fusion layer in the scene coding sub-network are fused to obtain the learned sample fusion feature, which serves as the final learning result of the sample fusion feature.
For example, the processing flow of any fusion layer in the visual coding subnetwork may be:
M_l ← MHSA(LN([V_l; F_l])) + [V_l; F_l]
[V_{l+1}; V_FUS] ← MLP(LN(M_l)) + M_l
where V_FUS represents the visual fusion feature output by the visual coding sub-network, M_l is an intermediate variable, and F_l is the sample fusion feature, i.e., the sample fusion token.
Similarly, the processing flow of any fusion layer in the scene coding subnetwork may be:
N_l ← MHSA(LN([S_l; F_l])) + [S_l; F_l]
[S_{l+1}; S_FUS] ← MLP(LN(N_l)) + N_l
where S_l represents the scene text feature of the l-th layer, S_FUS represents the scene fusion feature output by the scene coding sub-network, and N_l is an intermediate variable.
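A sketch of one fusion layer implementing the formulas above, together with the fusion of V_FUS and S_FUS into the learned sample fusion feature, is given below; averaging the two fusion features is only one possible realization of the fusing step, and the layer sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Sketch of one fusion layer: the fusion token F_l is appended to the sub-network tokens,
    M_l = MHSA(LN([V_l; F_l])) + [V_l; F_l] and [V_{l+1}; V_FUS] = MLP(LN(M_l)) + M_l
    (identically for S_l / S_FUS in the scene coding sub-network)."""
    def __init__(self, hidden=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
        self.mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden * mlp_ratio), nn.GELU(),
            nn.Linear(hidden * mlp_ratio, hidden),
        )

    def forward(self, tokens, fusion):                      # tokens: V_l or S_l; fusion: F_l, shape (batch, 1, hidden)
        x = torch.cat([tokens, fusion], dim=1)              # [V_l; F_l]
        h = self.ln1(x)
        m = self.mhsa(h, h, h, need_weights=False)[0] + x   # M_l
        out = self.mlp(self.ln2(m)) + m                     # [V_{l+1}; V_FUS]
        return out[:, :-1, :], out[:, -1:, :]               # sub-network tokens, fusion feature (V_FUS or S_FUS)

def fuse_fusion_features(v_fus, s_fus):
    # One possible realization of fusing the visual and scene fusion features into the
    # learned sample fusion feature; simple averaging is an assumption, not prescribed above.
    return 0.5 * (v_fus + s_fus)
```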
As shown in Fig. 2B, it is assumed that the visual coding sub-network is composed of K Transformer layers and the scene coding sub-network is composed of K Transformer layers, where the (K-1)-th and K-th Transformer layers in the visual coding sub-network serve as fusion layers, and the (K-1)-th and K-th Transformer layers in the scene coding sub-network serve as fusion layers. K is a positive integer greater than 2.
The global image feature of the sample image and the learned sample fusion feature may be determined by the following process: in the visual coding sub-network, the sample image is input into the input layer, and the global image feature predicted by the (K-2)-th Transformer layer is obtained through learning of the first K-2 Transformer layers. Similarly, in the scene coding sub-network, the sample scene text is input into the input layer, and the scene text feature predicted by the (K-2)-th Transformer layer is obtained through learning of the first K-2 Transformer layers.
As shown in Fig. 2C, in the visual coding sub-network, the initial sample fusion feature and the global image feature predicted by the (K-2)-th Transformer layer are input into the first fusion layer of the visual coding sub-network to obtain the global image feature and the visual fusion feature output by the first fusion layer of the visual coding sub-network; in the scene coding sub-network, the initial sample fusion feature and the scene text feature predicted by the (K-2)-th Transformer layer are input into the first fusion layer of the scene coding sub-network to obtain the scene text feature and the scene fusion feature output by the first fusion layer of the scene coding sub-network; and the visual fusion feature and the scene fusion feature determined by the first fusion layers are fused to obtain a further learned sample fusion feature.
Then, in the visual coding sub-network, the further learned sample fusion feature and the global image feature output by the first fusion layer of the visual coding sub-network are input into the second fusion layer of the visual coding sub-network to obtain the learned global image feature and a new visual fusion feature; in the scene coding sub-network, the further learned sample fusion feature and the scene text feature output by the first fusion layer of the scene coding sub-network are input into the second fusion layer of the scene coding sub-network to obtain a new scene fusion feature; and the visual fusion feature and the scene fusion feature determined by the second fusion layers are fused to obtain the further learned sample fusion feature, which serves as the final sample fusion feature.
According to the technical scheme, the sample scene text in the sample image-text pair is extracted, the sample text in the sample image-text pair is input into a text coding network to obtain sample text characteristics, then the sample image is input into an input layer in a visual coding sub-network, and initial sample fusion characteristics are input into a fusion layer in the visual coding sub-network to obtain global image characteristics and visual fusion characteristics of the sample image output by the visual coding sub-network; and inputting the sample scene text into an input layer in the scene coding sub-network and inputting the initial sample fusion feature into a fusion layer in the scene coding sub-network to obtain a scene fusion feature output by the scene coding sub-network, fusing the visual fusion feature and the scene fusion feature to obtain a learned sample fusion feature, and then pre-training the visual scene text fusion model according to the sample text feature, the global image feature of the sample image and the learned sample fusion feature. According to the technical scheme, the fusion layers are respectively introduced into the visual coding sub-network and the scene coding sub-network, so that the visual scene coding network can fully learn the scene text characteristics, and the scene text characteristics are adopted to predict the global image characteristics, so that the accuracy of the global image characteristics is improved.
Fig. 3 is a flowchart of a pre-training method of a visual scene text fusion model provided according to an embodiment of the present disclosure. On the basis of the above embodiments, this embodiment further optimizes "pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images, and the learned sample fusion features" and provides an alternative solution. As shown in Fig. 3, the pre-training method of the visual scene text fusion model of this embodiment may include:
S301, obtaining a sample image-text pair.
Wherein the sample image-text pair comprises a sample image and a sample text.
S302, extracting a sample scene text in the sample image.
S303, inputting the sample text into a text coding network to obtain the sample text characteristics.
S304, inputting the sample image and the initial sample fusion feature into a visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into a scene coding sub-network to obtain the global image feature and the learned sample fusion feature of the sample image.
S305, determining the fused text contrastive loss according to the sample text features and the learned sample fusion features.
Specifically, N groups of sample image-text pairs can form N² fused-text feature pairs, i.e., N² pairs of a learned sample fusion feature and a sample text feature. The contrastive loss maximizes the similarity between the N correctly matched pairs and minimizes the similarity between the remaining N² - N pairs. The fused text contrastive loss L_ftc is defined accordingly over the learned sample fusion features, the sample text features, and a training variable σ.
Here, L_ftc represents the fused text contrastive loss; f_i represents the learned sample fusion feature of the i-th group of sample image-text pairs; t_i represents the sample text feature of the i-th group of sample image-text pairs; and σ represents a training variable, which may initially be set to 0.07. N and i are positive integers, and i ≤ N.
S306, determining the contrast loss of the image text according to the global image characteristics and the sample text characteristics of the sample image.
Specifically, N groups of sample image-text pairs can form N² global-text feature pairs. The contrast loss maximizes the similarity between the N correctly matched pairs and minimizes the similarity between the remaining N² − N mismatched pairs. The image text contrast loss may then be defined, in the same symmetric contrastive form, as:

L_{itc} = \frac{1}{2}\left(L_{v\rightarrow t} + L_{t\rightarrow v}\right)

wherein,

L_{v\rightarrow t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(v_i, t_i)/\sigma\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(v_i, t_j)/\sigma\right)}

L_{t\rightarrow v} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(t_i, v_i)/\sigma\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(t_i, v_j)/\sigma\right)}

wherein L_{itc} represents the image text contrast loss; v_i represents the global image feature of the sample image in the i-th group of sample image-text pairs; t_i represents the sample text feature of the i-th group of sample image-text pairs; σ represents a trainable temperature variable, which may be initialized to 0.07. N and i are positive integers, and i is less than or equal to N.
And S307, determining training loss according to the fusion text contrast loss and the image text contrast loss.
Specifically, the fusion text contrast loss and the image text contrast loss can be weighted and summed to obtain the training loss. The training loss can be determined, for example, by the following formula:
L_{total} = \alpha L_{itc} + (1 - \alpha) L_{ftc}

wherein L_{total} represents the training loss, L_{ftc} represents the fused text contrast loss, L_{itc} represents the image text contrast loss, and α represents a parameter for adjusting the weight between the fused text contrast loss and the image text contrast loss, which may preferably be 0.9.
And S308, pre-training the visual scene text fusion model by adopting training loss.
According to the technical scheme of the embodiment of the disclosure, the sample scene text is extracted from the sample image in the sample image-text pair, and the sample text in the sample image-text pair is input into the text coding network to obtain the sample text features; the sample image and the initial sample fusion feature are input into the visual coding sub-network, and the initial sample fusion feature and the sample scene text are input into the scene coding sub-network, so as to obtain the global image feature of the sample image and the learned sample fusion feature. The fused text contrast loss is then determined according to the sample text features and the learned sample fusion feature, the image text contrast loss is determined according to the global image feature of the sample image and the sample text features, the training loss is determined according to the fused text contrast loss and the image text contrast loss, and the visual scene text fusion model is pre-trained with the training loss. By computing this dual contrast loss, that is, the loss between the sample text features and the learned sample fusion feature together with the loss between the global image feature of the sample image and the sample text features, the image-text retrieval precision of the visual scene text fusion model is improved when scene text is present. Furthermore, because the dual contrast loss includes the loss between the global image feature of the sample image and the sample text features, namely the loss between the image and the text, the image-text retrieval accuracy of the visual scene text fusion model is also ensured when scene text is absent.
On the basis of the above embodiment, as an optional mode of the present disclosure, determining the training loss according to the fused text contrast loss and the image text contrast loss may include: determining whether the sample scene text is an empty text; if the sample scene text is an empty text, taking the image text contrast loss as the training loss; otherwise, taking the sum of the fused text contrast loss and the image text contrast loss as the training loss.
Specifically, whether the sample scene text is a null text or not can be determined based on the OCR technology, and a determination result is obtained; if the result is determined to be the empty text, determining that the fusion text contrast loss is invalid, and taking the image text contrast loss as the training loss; and if the determination result is not the empty text, determining that the fusion text contrast loss is effective, and determining the training loss according to the fusion text contrast loss and the image text contrast loss. By only adopting image text contrast loss to perform model pre-training under the condition that the sample scene text is empty text, the model training result can still keep good performance for images without scene text.
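As a sketch under the same illustrative assumptions as before, the training-loss selection can be written as follows; the helper name training_loss and the boolean flag are hypothetical, the weighted branch corresponds to the formula above, and the empty-text branch corresponds to this optional mode.

```python
def training_loss(l_itc, l_ftc, scene_text_is_empty, alpha=0.9):
    """Combine the two contrast losses into the training loss (S307).

    When the sample scene text detected by OCR is empty, the fused text
    contrast loss is treated as invalid and only the image text contrast
    loss is used; otherwise the two losses are combined with weight alpha.
    """
    if scene_text_is_empty:
        return l_itc
    return alpha * l_itc + (1.0 - alpha) * l_ftc
```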
On the basis of the above embodiment, experiments were carried out on a scene text aware retrieval task and a traditional image-text retrieval task respectively. The scene text aware retrieval task is image-text retrieval in scenarios where the image contains scene text; the traditional image-text retrieval task is image-text retrieval in scenarios where the image contains no scene text.
Illustratively, for the scene text aware retrieval task, experiments were conducted on the CTC-1K and CTC-5K data sets, and the results are shown in Table 1. Compared with existing cross-modal retrieval techniques such as SCAN, VSRN and STARNet, the visual scene text fusion model of the present disclosure (namely ViSTA-S and ViSTA-B) performs better on the CTC data sets; taking the CTC-1K text retrieval task as an example, the visual scene text fusion model of the present disclosure improves performance by 7.4%. When the model volume is increased, the visual scene text fusion model can achieve even higher performance.
TABLE 1 Performance comparison of the presently disclosed solution to other methods on CTC datasets
Illustratively, for the traditional image-text retrieval task, experiments were conducted on the Flickr30K and COCO-5K data sets, and the results are shown in Table 2. With comparable inference time, the visual scene text fusion model of the present disclosure (namely the ViSTA-B model) outperforms cross-modal retrieval techniques such as IMRAM, GSMN, SGRAF, ViLBERT, Unicoder-VL, UNITER-B, ERNIE-ViL-B, Pixel-BERT-X, SOHO, H. Xue et al., Pixel-BERT-R and ViLT-B; moreover, when the model volume is increased, the visual scene text fusion model of the present disclosure (namely the ViSTA-L model) is comparable to the more time-consuming single-tower models.
TABLE 2 Performance comparison of the presently disclosed solution to other methods on the Flickr30K and COCO-5K data sets
In conclusion, experiments on both the scene text aware data sets and the traditional image-text retrieval data sets verify the effectiveness of the visual scene text fusion model, which reaches a leading level on the evaluation indexes for image-text retrieval tasks with or without scene text.
It should be noted that the visual scene text fusion model obtained through image-text contrastive pre-training in the present disclosure can be used directly for downstream retrieval tasks, and can also be fine-tuned by any business party on the basis of the pre-trained visual scene text fusion model so as to meet actual business requirements.
Optionally, an embodiment of the present disclosure further provides a training method for a visual scene text fusion model, which specifically includes: acquiring a service image-text pair provided by a service party; and fine-tuning the visual scene text fusion model by using the service image and the service text as training data, wherein the visual scene text fusion model is obtained based on any visual scene text fusion model pre-training method of the present disclosure. The service image-text pair comprises a service image and a service text, and the service text is used for describing the image content in the service image.
Specifically, a service image-text pair of a service party is obtained and used as training data; and inputting the training data into the visual scene text fusion model, and adjusting model parameters in the visual scene text fusion model.
It can be understood that, in the embodiment, the visual scene text fusion model after pre-training is subjected to fine tuning, so that the visual scene text fusion model is more suitable for an actual service scene.
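A minimal fine-tuning sketch under assumed interfaces is shown below; the method name compute_loss, the optimizer settings and the data layout are hypothetical and only illustrate the idea of adjusting the pre-trained model parameters on business image-text pairs.

```python
import torch

def finetune(model, business_dataloader, epochs=3, lr=1e-5):
    """Fine-tune a pre-trained visual scene text fusion model on
    business image-text pairs (images, texts, scene_texts)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, texts, scene_texts in business_dataloader:
            loss = model.compute_loss(images, texts, scene_texts)  # assumed helper
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```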
Fig. 4 is a flowchart of a method for retrieving text in a visual scene text fusion model according to an embodiment of the present disclosure. The embodiment is suitable for the situation of how to perform the image-text retrieval. The method can be executed by a teletext retrieval arrangement, which can be implemented in software and/or hardware and which can be integrated in an electronic device carrying teletext retrieval functions. The visual scene text fusion model comprises a text coding network and a visual scene coding network, wherein the visual scene coding network comprises a visual coding sub-network and a scene coding sub-network. As shown in fig. 4, the method for retrieving texts in a visual scene text fusion model according to this embodiment may include:
s401, obtaining a target text to be retrieved.
Specifically, the target text input by the search demander can be acquired. For example, the search demander may input the target text through the search interactive interface.
S402, extracting candidate scene texts in the candidate images.
In this embodiment, the candidate image is an image to be selected, and may be, for example, a part of or all images in an image library to be retrieved. By candidate scene text is meant text in a candidate image, which may include one or more words.
Specifically, each candidate image may be processed based on an Optical Character Recognition (OCR) technique to obtain candidate scene texts in the candidate image, and further, may obtain image position coding information of the candidate scene texts in the candidate image. It should be noted that the candidate image may include one or more candidate scene texts; for each candidate scene text, one or more words may be included.
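The disclosure does not prescribe a particular OCR engine; purely as an illustration, the sketch below uses the open-source pytesseract wrapper to obtain candidate scene text words together with their image position information (bounding boxes).

```python
from PIL import Image
import pytesseract

def extract_scene_text(image_path):
    """Return OCR words and their bounding boxes for one candidate image."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words, boxes = [], []
    for word, left, top, width, height in zip(
            data["text"], data["left"], data["top"], data["width"], data["height"]):
        if word.strip():                      # drop empty detections
            words.append(word)
            boxes.append((left, top, left + width, top + height))
    return words, boxes                       # scene text and image positions
```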
And S403, inputting the target text into a text coding network to obtain the target text characteristics.
In this embodiment, the target text feature refers to a semantic feature of the target text, and may be represented in a matrix or vector form. Specifically, the target text may be input to a text coding network, and the target text characteristics may be obtained through processing by the text coding network.
S404, inputting the candidate image and the initial candidate fusion feature into a visual coding sub-network, and inputting the initial candidate fusion feature and the candidate scene text into a scene coding sub-network to obtain the global image feature of the candidate image.
In this embodiment, the candidate fusion feature is used to characterize a correlation feature between the candidate image and the text, and may be in a matrix or vector form; the initial candidate fusion feature is a preset correlation feature between the image and the text. The global image feature of the candidate image is a feature of the candidate image extracted from a global perspective, and may be in a matrix or vector form.
Specifically, the candidate image may be input to an input layer in the visual coding sub-network, the initial candidate fusion feature may be input to a fusion layer in the visual coding sub-network, the candidate scene text may be input to an input layer in the scene coding sub-network, the initial candidate fusion feature may be input to a fusion layer in the scene coding sub-network, and the visual coding sub-network and the scene coding sub-network cooperate to obtain the global image feature of the candidate image.
It should be noted that, in the training phase of the model, the visual scene coding network can fully learn the correlation between the vision and the scene text, and predict the global image characteristics; therefore, in the using stage of the model, the related features of the scene text are fused in the global image features output by the visual scene coding network.
S405, determining a target image from the candidate images according to the target text characteristics and the global image characteristics of the candidate images.
Specifically, for each candidate image, the similarity between the target text feature and the global image feature of the candidate image may be determined; then, the target image is determined from the candidate images based on the similarity, and for example, the candidate image having the highest similarity may be used as the target image.
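The retrieval step can be sketched as follows; cosine similarity and the function name retrieve_target_images are assumptions, since the disclosure only requires computing a similarity between the target text feature and the global image features.

```python
import torch
import torch.nn.functional as F

def retrieve_target_images(target_text_feat, candidate_image_feats, top_k=1):
    """Rank candidate images for one target text (S405).

    target_text_feat: (D,) feature of the target text.
    candidate_image_feats: (M, D) global image features of M candidates.
    Returns the indices of the top_k most similar candidate images.
    """
    t = F.normalize(target_text_feat, dim=-1)
    v = F.normalize(candidate_image_feats, dim=-1)
    similarities = v @ t                      # (M,) cosine similarities
    return torch.topk(similarities, k=top_k).indices
```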
It should be noted that the visual scene text fusion model in this embodiment may be obtained by pre-training any one of the visual scene text fusion models disclosed herein, or may be obtained by any one of the training methods for the visual scene text fusion models disclosed herein.
According to the technical scheme, the target text characteristics are obtained by extracting the candidate scene texts in the candidate images and inputting the obtained target texts into a text coding network, then the candidate images and the initial candidate fusion characteristics are input into a visual coding sub-network, the initial candidate fusion characteristics and the candidate scene texts are input into a scene coding sub-network, the global image characteristics of the candidate images are obtained, and then the target images are determined from the candidate images according to the target text characteristics and the global image characteristics of the candidate images. According to the technical scheme, the image-text retrieval is carried out through the parallel text coding network and the visual scene coding network, and the visual scene text fusion model with the double coding structure improves the efficiency of the image-text retrieval; moreover, by introducing the fusion features, the global image features of the image can be effectively learned no matter whether scene texts exist or not, so that the image-text retrieval precision is improved.
On the basis of the above embodiment, as an optional way of the present disclosure, inputting the target text into the text coding network to obtain the target text features may include: performing word embedding on the target text to obtain a target text word vector; determining a word coding result of the target text according to the modal information of the target text, the position coding information of the target text and the target text word vector; constructing a coding sequence of the target text according to the initial target text features and the word coding result of the target text; and inputting the coding sequence of the target text into the text coding network to obtain the processed target text features.
In this embodiment, the modal information of the target text is preset identification information for representing the mode of the text, and a value of the modal information is the same as that of the modal information of the sample text. The position coding information of the target text is position identification information of words in the target text.
Specifically, the target text may be processed based on a word embedding technology to obtain a vector of each word in the target text, that is, a target text word vector.
Then, for each word in the target text, the modal information of the target text, the position coding information of the word and the target text word vector of the word can be spliced based on a certain rule to obtain the coding result of the word.
After the coding result of each word in the target text is determined, a coding sequence of the target text can be constructed according to the initial target text characteristics and the coding result of each word. The initial target text features can be specifically composed of text features to be learned, modal information and position coding information of the target text; optionally, the initial target text feature is located at the head of the coding sequence of the target text. After the coding sequence of the target text is determined, the coding sequence of the target text can be input into the text coding network to obtain the target text characteristics.
It can be understood that the target text features are determined by combining the modal information, the position coding information, the target text word vectors and the like of the target text, so that the finally determined target text features are more accurate.
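The construction of the text coding sequence can be sketched as follows in PyTorch; the class name TextSequenceBuilder, the use of addition to splice the word vector with the modal and position encodings, and the dimension defaults are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextSequenceBuilder(nn.Module):
    """Build the coding sequence of a (target or sample) text."""

    def __init__(self, vocab_size, dim=768, max_len=77, num_modalities=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)        # word embedding
        self.pos_emb = nn.Embedding(max_len, dim)             # position coding
        self.modal_emb = nn.Embedding(num_modalities, dim)    # text / image / scene text
        # Initial text feature to be learned, placed at the head of the sequence.
        self.init_text_feat = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, token_ids, modal_id=0):
        n, length = token_ids.shape
        pos = torch.arange(length, device=token_ids.device)
        # Word coding result: word vector spliced (here: added) with the
        # position coding information and the modal information.
        tokens = (self.word_emb(token_ids)
                  + self.pos_emb(pos)
                  + self.modal_emb(torch.full_like(pos, modal_id)))
        head = self.init_text_feat.expand(n, -1, -1)
        return torch.cat([head, tokens], dim=1)   # coding sequence of the text
```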
On the basis of the above embodiment, as an optional way of the present disclosure, inputting the candidate image into the visual coding sub-network may include: partitioning the candidate image to obtain a candidate image block sequence; performing linear projection on the candidate image blocks in the candidate image block sequence to obtain coding results of the candidate image blocks; processing the coding result of each candidate image block according to the modal information and the position coding information of the candidate image block to obtain the processed coding result of the candidate image block; constructing a coding sequence of the candidate image according to the initial global image features and the processed coding results of the candidate image blocks; and inputting the coding sequence of the candidate image into an input layer in the visual coding sub-network.
In this embodiment, the modality information of the candidate image block is identification information that is preset to characterize the modality of the image, and a value of the identification information may be the same as the modality information of the sample image. The position coding information of the candidate image block is the identification information of the candidate image block.
Specifically, the candidate image may be divided into a plurality of candidate image blocks (for example, N blocks, where N is a positive integer) having the same size according to a certain size, so as to obtain a candidate image block sequence.
After the candidate image block sequence is obtained, performing linear projection on each candidate image block in the candidate image block sequence to obtain a coding result of the candidate image block. Moreover, based on a certain rule, the encoding result of the candidate image block, the modal information of the candidate image block, and the position encoding information of the candidate image block may be spliced to obtain the processed encoding result of the candidate image block.
After the processed coding result of each candidate image block is determined, a coding sequence of the candidate image may be constructed according to the initial global image feature and the processed coding result of each candidate image block, and the coding sequence of the candidate image may be input to an input layer in the visual coding sub-network. The initial global image feature may specifically be composed of an image code to be learned, modality information of the candidate image block, and position coding information of the candidate image block.
It can be understood that, the present disclosure uses a way of image block linear projection to determine the encoding result of the candidate image block, which reduces the computational complexity and improves the efficiency compared with the method of using a detector to obtain the encoding result of the image block in the image; meanwhile, the coding sequence of the candidate image is determined by combining the modal information of the candidate image block, the position coding information of the candidate block, the initial global image feature and the like, so that the visual coding sub-network can extract more abundant features in the candidate image.
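A minimal sketch of the image block linear projection described above is given below; the patch size, the convolutional implementation of the linear projection and the class name ImageSequenceBuilder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageSequenceBuilder(nn.Module):
    """Build the coding sequence of a (candidate or sample) image."""

    def __init__(self, image_size=224, patch_size=16, dim=768, num_modalities=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of image blocks, implemented as a strided convolution.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Embedding(num_patches, dim)          # block position coding
        self.modal_emb = nn.Embedding(num_modalities, dim)     # image modality
        # Initial global image feature (image code to be learned) at the head.
        self.init_global_feat = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, images, modal_id=1):
        n = images.size(0)
        patches = self.proj(images).flatten(2).transpose(1, 2)   # (N, P, dim)
        pos = torch.arange(patches.size(1), device=images.device)
        patches = (patches
                   + self.pos_emb(pos)
                   + self.modal_emb(torch.full_like(pos, modal_id)))
        head = self.init_global_feat.expand(n, -1, -1)
        return torch.cat([head, patches], dim=1)   # coding sequence of the image
```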
On the basis of the above embodiment, as an optional way of the present disclosure, inputting the candidate scene text into the scene coding sub-network may include: performing word embedding on the candidate scene text to obtain a candidate scene text vector; determining a coding result of the candidate scene text according to the image position coding information of the candidate scene text, the modal information of the candidate scene text, the word position coding information of the candidate scene text and the candidate scene text vector; constructing a coding sequence of the candidate scene text according to the initial candidate scene text features and the coding result of the candidate scene text; and inputting the coding sequence of the candidate scene text into an input layer in the scene coding sub-network.
In this embodiment, the modal information of the candidate scene text is preset identification information for representing the modality of the scene text, and its value may be the same as that of the modal information of the sample scene text. The word position coding information of the candidate scene text is used for uniquely identifying the order of words in the candidate scene text. The image position coding information of the candidate scene text is used for uniquely identifying the position of the candidate scene text within the candidate image.
Specifically, for each candidate scene text extracted from the candidate image, the candidate scene text may be processed based on a word embedding technique to obtain a candidate scene text vector of the candidate scene text. And based on a certain rule, splicing the image position coding information of the candidate scene text, the modal information of the candidate scene text, the word position coding information of the candidate scene text and the candidate scene text vector to obtain a coding result of the candidate scene text. And constructing a coding sequence of the candidate scene texts according to the initial candidate scene text features and the coding results of the candidate scene texts, and inputting the coding sequence of the candidate scene texts into an input layer in the scene coding sub-network. The initial candidate scene text feature may specifically be composed of a scene text feature to be learned, modal information of the candidate scene text, and image position coding information and word position coding information of the candidate scene text.
It can be understood that the coding sequence of the candidate scene text is determined by combining the image position coding information of the candidate scene text, the modal information of the candidate scene text, the word position coding information of the candidate scene text, the initial candidate scene text features and the like, so that the scene coding sub-network can extract richer features from the candidate scene text.
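Under the same illustrative assumptions, the scene text coding sequence can be sketched as follows; in particular, encoding the image position of each OCR word from its bounding box through a linear layer is an assumption, since the disclosure only states that the image position coding information is combined with the other encodings.

```python
import torch
import torch.nn as nn

class SceneTextSequenceBuilder(nn.Module):
    """Build the coding sequence of a (candidate or sample) scene text."""

    def __init__(self, vocab_size, dim=768, max_words=64, num_modalities=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)         # word embedding
        self.word_pos_emb = nn.Embedding(max_words, dim)      # word position coding
        self.box_proj = nn.Linear(4, dim)                     # image position (x1, y1, x2, y2)
        self.modal_emb = nn.Embedding(num_modalities, dim)    # scene text modality
        # Initial scene text feature to be learned, placed at the head.
        self.init_scene_feat = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, token_ids, boxes, modal_id=2):
        # token_ids: (N, L) OCR word ids; boxes: (N, L, 4) normalized boxes.
        n, length = token_ids.shape
        pos = torch.arange(length, device=token_ids.device)
        tokens = (self.word_emb(token_ids)
                  + self.word_pos_emb(pos)
                  + self.box_proj(boxes)
                  + self.modal_emb(torch.full_like(pos, modal_id)))
        head = self.init_scene_feat.expand(n, -1, -1)
        return torch.cat([head, tokens], dim=1)   # coding sequence of the scene text
```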
Fig. 5 is a schematic structural diagram of a pre-training apparatus for a visual scene text fusion model according to an embodiment of the present disclosure. The embodiment is suitable for the situation of pre-training the visual scene text fusion model. The device can be realized in a software and/or hardware mode, and can be integrated in electronic equipment with a pre-training function for bearing a visual scene text fusion model. Optionally, the visual scene text fusion model includes a text coding network and a visual scene coding network, and the visual scene coding network includes a visual coding sub-network and a scene coding sub-network. As shown in fig. 5, the pre-training apparatus 500 for the visual scene text fusion model of the present embodiment may include:
a sample image-text pair obtaining module 501, configured to obtain a sample image-text pair; wherein, the sample image-text pair comprises a sample image and a sample text;
a sample scene text extraction module 502, configured to extract a sample scene text in a sample image;
a sample text feature determining module 503, configured to input the sample text into a text coding network to obtain a sample text feature;
a sample global feature determination module 504, configured to input the sample image and the initial sample fusion feature into the visual coding sub-network, and input the initial sample fusion feature and the sample scene text into the scene coding sub-network, so as to obtain a global image feature of the sample image and a learned sample fusion feature;
and a model pre-training module 505, configured to pre-train the visual scene text fusion model according to the sample text features, the global image features of the sample images, and the learned sample fusion features.
According to the technical scheme of the embodiment of the disclosure, sample texts in sample image-text pairs are input into a text coding network to obtain sample text characteristics; and inputting the sample images in the sample image-text pairs and the initial sample fusion features into a visual coding sub-network, inputting the initial sample fusion features and the sample scene texts extracted from the sample images into the scene coding sub-network to obtain the global image features of the sample images and the learned sample fusion features, and then pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features. According to the technical scheme, the sample fusion characteristics, the global image characteristics of the sample images and the sample text characteristics are introduced to pre-train the visual scene text fusion model, so that the global image characteristics of the images can be accurately output no matter whether the input images have scene texts, and the visual scene text fusion model can support a retrieval task under a large-range image text set, namely, the visual scene text fusion model is not only suitable for cross-modal retrieval under the condition of scene texts, but also suitable for cross-modal retrieval without the scene texts; in addition, the scheme performs parallel processing on the image and the text through a double-coding structure, namely a parallel text coding network and a visual scene coding network, and can improve the cross-modal retrieval efficiency.
Further, the sample global feature determination module 504 is specifically configured to:
inputting the sample image into an input layer in a visual coding sub-network, and inputting the initial sample fusion characteristic into a fusion layer in the visual coding sub-network to obtain a global image characteristic of the sample image output by the visual coding sub-network and a visual fusion characteristic output by the visual coding sub-network;
inputting a sample scene text into an input layer in a scene coding subnetwork, and inputting an initial sample fusion characteristic into a fusion layer in the scene coding subnetwork to obtain a scene fusion characteristic output by the scene coding subnetwork;
and fusing the visual fusion features output by the visual coding sub-network and the scene fusion features output by the scene coding sub-network to obtain the learned sample fusion features.
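To make the cooperation of the two sub-networks concrete, a minimal PyTorch-style sketch is given below; the module interfaces VisualEncoder and SceneEncoder, as well as the simple averaging used to fuse the two branch fusion features, are assumptions for illustration rather than the prescribed implementation.

```python
def visual_scene_forward(visual_encoder, scene_encoder, image, scene_tokens, init_fusion):
    """Run both branches of the visual scene coding network.

    The image enters the input layer of the visual coding sub-network and
    the initial fusion feature enters its fusion layer; the scene text
    tokens enter the input layer of the scene coding sub-network and the
    same initial fusion feature enters its fusion layer.
    """
    global_image_feat, visual_fusion = visual_encoder(image, init_fusion)
    scene_fusion = scene_encoder(scene_tokens, init_fusion)
    # Fuse the visual fusion feature and the scene fusion feature into the
    # learned sample fusion feature (averaging is an illustrative choice).
    learned_fusion = 0.5 * (visual_fusion + scene_fusion)
    return global_image_feat, learned_fusion
```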
Further, the sample text feature determination module 503 is specifically configured to:
performing word embedding on the sample text to obtain a sample text word vector;
determining a word coding result of the sample text according to the modal information of the sample text, the position coding information of the sample text and the word vector of the sample text;
constructing a coding sequence of the sample text according to the initial sample text characteristics and the word coding result of the sample text;
and inputting the coding sequence of the sample text into a text coding network to obtain the learned sample text characteristics.
Further, the sample global feature determining module 504 is specifically configured to:
partitioning the sample image to obtain a sample image block sequence;
carrying out linear projection on sample image blocks in the sample image block sequence to obtain a coding result of the sample image blocks;
processing the coding result of the sample image block according to the modal information and the position coding information of the sample image block to obtain the processed coding result of the sample image block;
constructing a coding sequence of the sample image according to the initial global image characteristics and the processed coding result of the sample image block;
an encoded sequence of sample images is input to an input layer in a visual coding sub-network.
Further, the sample global feature determination module 504 is specifically configured to:
performing word embedding on the sample scene text to obtain a sample scene text vector;
determining a coding result of the sample scene text according to the image position coding information of the sample scene text, the modal information of the sample scene text, the word position coding information of the sample scene text and the sample scene text vector;
constructing a coding sequence of the sample scene text according to the initial sample scene text characteristics and the coding result of the sample scene text;
an encoded sequence of sample scene text is input into an input layer in a scene coding subnetwork.
Further, the model pre-training module 505 comprises:
the fusion contrast loss unit is used for determining the contrast loss of the fusion text according to the text characteristics of the sample and the learned fusion characteristics of the sample;
the image text contrast loss unit is used for determining the image text contrast loss according to the global image characteristics of the sample image and the sample text characteristics;
the training loss unit is used for determining the training loss according to the fusion text contrast loss and the image text contrast loss;
and the model pre-training unit is used for pre-training the visual scene text fusion model by adopting training loss.
Further, the training loss unit is specifically configured to:
determining whether the sample scene text is empty text;
if the sample scene text is empty text, the image text contrast loss is used as training loss; otherwise, the sum of the fusion text contrast loss and the image text contrast loss is used as the training loss.
Fig. 6 is a schematic structural diagram of a training apparatus for a visual scene text fusion model provided in an embodiment of the present disclosure; the device can be realized in a software and/or hardware mode, and can be integrated in electronic equipment bearing the training function of the visual scene text fusion model. As shown in fig. 6, the training apparatus 600 for the visual scene text fusion model of the present embodiment may include:
a service image-text pair obtaining module 601, configured to obtain a service image-text pair provided by a service party; the service graphic and text pair comprises a service image and a service text;
the fine tuning module 602 is configured to use the service image and the service text as training data to perform fine tuning on a visual scene text fusion model; the visual scene text fusion model is obtained based on any visual scene text fusion model pre-training method in the disclosure.
According to the technical scheme, the visual scene text fusion model after pre-training is subjected to fine tuning, so that the visual scene text fusion model is more suitable for an actual service scene.
Fig. 7 is a schematic structural diagram of an image-text retrieval device of a visual scene text fusion model according to an embodiment of the present disclosure. The embodiment is suitable for the situation of how to perform the image-text retrieval. The device can be realized in a software and/or hardware mode and can be integrated in electronic equipment bearing the image-text retrieval function. The visual scene text fusion model comprises a text coding network and a visual scene coding network, wherein the visual scene coding network comprises a visual coding sub-network and a scene coding sub-network. As shown in fig. 7, the apparatus 700 for retrieving text in a text fusion model of a visual scene according to this embodiment may include:
a target text acquisition module 701, configured to acquire a target text to be retrieved;
a candidate scene text extraction module 702, configured to extract candidate scene texts in candidate images;
a target text feature determination module 703, configured to input a target text into a text coding network to obtain a target text feature;
a candidate global feature determining module 704, configured to input the candidate image and the initial candidate fusion feature into a visual coding sub-network, and input the initial candidate fusion feature and the candidate scene text into a scene coding sub-network, to obtain a global image feature of the candidate image;
and a target image determining module 705, configured to determine a target image from the candidate images according to the target text feature and the global image feature of the candidate images.
According to the technical scheme, the target text characteristics are obtained by extracting the candidate scene texts in the candidate images and inputting the obtained target texts into a text coding network, the candidate images and the initial candidate fusion characteristics are input into a visual coding sub-network, the initial candidate fusion characteristics and the candidate scene texts are input into a scene coding sub-network, the global image characteristics of the candidate images are obtained, and the target images are determined from the candidate images according to the target text characteristics and the global image characteristics of the candidate images. According to the technical scheme, the image-text retrieval is carried out through the parallel text coding network and the visual scene coding network, and the visual scene text fusion model with the double coding structure improves the efficiency of the image-text retrieval; moreover, by introducing the candidate fusion features, the global image features of the candidate images can be effectively learned no matter whether scene texts exist or not, so that the image-text retrieval precision is improved.
Further, the target text feature determination module 703 is specifically configured to:
performing word embedding on a target text to obtain a target text word vector;
determining a word coding result of the target text according to the modal information of the target text, the position coding information of the target text and the word vector of the target text;
constructing a coding sequence of the target text according to the initial target text characteristics and the word coding result of the target text;
and inputting the coding sequence of the target text into a text coding network to obtain the processed target text characteristics.
Further, the candidate global feature determination module 704 is specifically configured to:
partitioning the candidate image to obtain a candidate image block sequence;
performing linear projection on candidate image blocks in the candidate image block sequence to obtain a coding result of the candidate image blocks;
processing the coding result of the candidate image block according to the modal information and the position coding information of the candidate image block, and determining the processed coding result of the candidate image block;
constructing a coding sequence of the candidate image according to the initial global image characteristics and the processed coding result of the candidate image block;
an encoding sequence of the candidate images is input to an input layer in the visual coding sub-network.
Further, the candidate global feature determination module 704 is further specifically configured to:
performing word embedding on the candidate scene text to obtain a candidate scene text vector;
determining a coding result of the candidate scene text according to the image position coding information of the candidate scene text, the modal information of the candidate scene text, the word position coding information of the candidate scene text and the candidate scene text vector;
constructing a coding sequence of the candidate scene text according to the initial candidate scene text characteristics and the coding result of the candidate scene text;
an encoded sequence of candidate scene text is input into an input layer in a scene coding subnetwork.
Further, the visual scene text fusion model is obtained based on a training device of any one of the visual scene text fusion models disclosed in the present disclosure.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and the like of the related sample image-text pair, the target text, the candidate image and the like all accord with the regulations of related laws and regulations, and do not violate the good custom of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 8 is a block diagram of an electronic device for implementing a training method, or a graph-text retrieval method of a visual scene text fusion model according to an embodiment of the present disclosure. FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 801 performs the various methods and processes described above, such as a pre-training method, a training method, or a teletext retrieval method of the visual scene text fusion model. For example, in some embodiments, the pre-training method, the training method, or the teletext retrieval method of the visual scene text fusion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When loaded into RAM 803 and executed by the computing unit 801, the computer program may perform one or more steps of the pre-training method, the training method or the teletext retrieval method of the visual scene text fusion model described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a pre-training method, a training method, or a teletext retrieval method of the visual scene text fusion model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
Artificial intelligence is the subject of research that makes computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology and the like.
Cloud computing (cloud computing) refers to a technology system that accesses a flexibly extensible shared physical or virtual resource pool through a network, where resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed in a self-service manner as needed. Through the cloud computing technology, high-efficiency and strong data processing capacity can be provided for technical application and model training of artificial intelligence, block chains and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (29)

1. A method of pre-training a visual scene text fusion model, wherein the visual scene text fusion model comprises a text coding network and a visual scene coding network, the visual scene coding network comprising a visual coding subnetwork and a scene coding subnetwork, the method comprising:
obtaining a sample image-text pair; wherein the sample image-text pair comprises a sample image and a sample text;
extracting sample scene texts in the sample images;
inputting the sample text into the text coding network to obtain sample text characteristics;
inputting the sample image and the initial sample fusion feature into the visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into the scene coding sub-network to obtain a global image feature of the sample image and a learned sample fusion feature;
and pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images and the learned sample fusion features.
2. The method of claim 1, wherein the inputting the sample image and the initial sample fusion feature into the visual coding sub-network, and inputting the initial sample fusion feature and the sample scene text into the scene coding sub-network to obtain the global image feature of the sample image and the learned sample fusion feature, comprises:
inputting the sample image into an input layer in the visual coding sub-network, and inputting the initial sample fusion characteristic into a fusion layer in the visual coding sub-network to obtain a global image characteristic of the sample image output by the visual coding sub-network and a visual fusion characteristic output by the visual coding sub-network;
inputting the sample scene text into an input layer in the scene coding sub-network, and inputting the initial sample fusion characteristics into a fusion layer in the scene coding sub-network to obtain the scene fusion characteristics output by the scene coding sub-network;
and fusing the visual fusion characteristics output by the visual coding sub-network and the scene fusion characteristics output by the scene coding sub-network to obtain learned sample fusion characteristics.
3. The method of claim 1, wherein the inputting the sample text into the text coding network to obtain sample text features comprises:
performing word embedding on the sample text to obtain a sample text word vector;
determining a word coding result of the sample text according to the modal information of the sample text, the position coding information of the sample text and the sample text word vector;
constructing a coding sequence of the sample text according to the initial sample text characteristics and the word coding result of the sample text;
and inputting the coding sequence of the sample text into the text coding network to obtain the learned sample text characteristics.
4. The method of claim 1, wherein inputting the sample image into the visual coding sub-network comprises:
partitioning the sample image to obtain a sample image block sequence;
carrying out linear projection on sample image blocks in the sample image block sequence to obtain a coding result of the sample image blocks;
processing the coding result of the sample image block according to the modal information of the sample image block and the position coding information of the sample image block to obtain the processed coding result of the sample image block;
constructing a coding sequence of the sample image according to the initial global image characteristics and the processed coding result of the sample image block;
inputting a coded sequence of the sample image into an input layer in the visual coding sub-network.
5. The method of claim 1, wherein the inputting the sample scene text into the scene coding sub-network comprises:
performing word embedding on the sample scene text to obtain a sample scene text vector;
determining an encoding result of the sample scene text according to the image position encoding information of the sample scene text, the modal information of the sample scene text, the word position encoding information of the sample scene text and the sample scene text vector;
constructing a coding sequence of a sample scene text according to the initial sample scene text characteristics and the coding result of the sample scene text;
inputting the coded sequence of sample scene text into an input layer in the scene coding subnetwork.
6. The method of claim 1, wherein the pre-training the visual scene text fusion model according to the sample text features, the global image features of the sample images, and the learned sample fusion features comprises:
determining fusion text contrast loss according to the sample text features and the learned sample fusion features;
determining image text contrast loss according to the global image characteristics of the sample image and the sample text characteristics;
determining training loss according to the fusion text contrast loss and the image text contrast loss;
and pre-training the visual scene text fusion model by adopting the training loss.
7. The method of claim 6, wherein said determining a training loss from the fused text contrast loss and the image text contrast loss comprises:
determining whether the sample scene text is empty text;
if the sample scene text is an empty text, taking the image text contrast loss as the training loss; and if not, taking the sum of the fusion text contrast loss and the image text contrast loss as a training loss.
8. A training method of a visual scene text fusion model comprises the following steps:
acquiring a service image-text pair provided by a service party; the service graphic and text pair comprises a service image and a service text;
fine-tuning a visual scene text fusion model by using the service image and the service text as training data; wherein the visual scene text fusion model is obtained based on the pre-training method of the visual scene text fusion model according to any one of claims 1 to 7.
9. A method for retrieving graphics and text of a visual scene text fusion model, wherein the visual scene text fusion model comprises a text coding network and a visual scene coding network, the visual scene coding network comprises a visual coding sub-network and a scene coding sub-network, and the method comprises the following steps:
acquiring a target text to be retrieved;
extracting candidate scene texts in the candidate images;
inputting the target text into the text coding network to obtain target text characteristics;
inputting the candidate image and the initial candidate fusion feature into the visual coding sub-network, and inputting the initial candidate fusion feature and the candidate scene text into the scene coding sub-network to obtain the global image feature of the candidate image;
and determining a target image from the candidate images according to the target text characteristics and the global image characteristics of the candidate images.
10. The method of claim 9, wherein the inputting the target text into the text coding network to obtain target text features comprises:
performing word embedding on the target text to obtain a target text word vector;
determining a word coding result of the target text according to the modal information of the target text, the position coding information of the target text and the word vector of the target text;
constructing a coding sequence of a target text according to the initial target text characteristics and the word coding result of the target text;
and inputting the coding sequence of the target text into the text coding network to obtain the processed target text characteristics.
11. The method of claim 9, wherein inputting the candidate image into the visual coding sub-network comprises:
partitioning the candidate image to obtain a candidate image block sequence;
performing linear projection on candidate image blocks in the candidate image block sequence to obtain a coding result of the candidate image blocks;
processing the coding result of the candidate image block according to the modal information and the position coding information of the candidate image block, and determining the processed coding result of the candidate image block;
constructing a coding sequence of the candidate image according to the initial global image characteristics and the processed coding result of the candidate image block;
inputting the encoded sequence of candidate images into an input layer in the visual coding sub-network.
12. The method of claim 9, wherein the inputting the candidate scene text into the scene coding sub-network comprises:
performing word embedding on the candidate scene text to obtain a candidate scene text vector;
determining an encoding result of the candidate scene text according to the image position encoding information of the candidate scene text, the modal information of the candidate scene text, the word position encoding information of the candidate scene text and the candidate scene text vector;
constructing a coding sequence of the candidate scene text according to the initial candidate scene text characteristics and the coding result of the candidate scene text;
and inputting the coding sequence of the candidate scene text into an input layer in the scene coding sub-network.
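Illustrative sketch (not part of the claims): one reading of claim 12, where each scene-text (OCR) token combines a word vector, a word position encoding, a modal encoding, and an encoding of its bounding box in the image; the 4-value normalized box format and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class SceneTextSequenceBuilder(nn.Module):
    """Encodes scene-text tokens extracted from an image, enriching each word
    vector with word position, modal, and image position (bounding box)
    encodings, then prepending a learnable initial scene text feature."""

    def __init__(self, vocab_size=30522, dim=768, max_words=64, num_modalities=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.word_pos_emb = nn.Embedding(max_words, dim)
        self.modal_emb = nn.Embedding(num_modalities, dim)
        self.box_proj = nn.Linear(4, dim)            # image position encoding
        self.init_scene_feature = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor,
                modal_id: int = 2) -> torch.Tensor:
        b, n = token_ids.shape
        pos = torch.arange(n, device=token_ids.device)
        tokens = (self.word_emb(token_ids)           # word vector
                  + self.word_pos_emb(pos)           # word position encoding
                  + self.modal_emb.weight[modal_id]  # modal encoding
                  + self.box_proj(boxes))            # image position encoding, boxes: (b, n, 4)
        init = self.init_scene_feature.expand(b, -1, -1)
        return torch.cat([init, tokens], dim=1)      # (b, n + 1, dim)
```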
13. The method of claim 9, wherein the visual scene text fusion model is obtained by the training method of the visual scene text fusion model according to claim 8.
14. A pre-training apparatus for a visual scene text fusion model, wherein the visual scene text fusion model comprises a text coding network and a visual scene coding network, the visual scene coding network comprises a visual coding sub-network and a scene coding sub-network, and the apparatus comprises:
a sample image-text pair acquisition module, configured to acquire a sample image-text pair, wherein the sample image-text pair comprises a sample image and a sample text;
a sample scene text extraction module, configured to extract a sample scene text from the sample image;
a sample text feature determination module, configured to input the sample text into the text coding network to obtain sample text features;
a sample global feature determination module, configured to input the sample image and an initial sample fusion feature into the visual coding sub-network, and input the initial sample fusion feature and the sample scene text into the scene coding sub-network, to obtain a global image feature of the sample image and a learned sample fusion feature;
a model pre-training module, configured to pre-train the visual scene text fusion model according to the sample text features, the global image feature of the sample image, and the learned sample fusion feature.
15. The apparatus of claim 14, wherein the sample global feature determination module is specifically configured to:
input the sample image into an input layer of the visual coding sub-network, and input the initial sample fusion feature into a fusion layer of the visual coding sub-network, to obtain the global image feature of the sample image and a visual fusion feature output by the visual coding sub-network;
input the sample scene text into an input layer of the scene coding sub-network, and input the initial sample fusion feature into a fusion layer of the scene coding sub-network, to obtain a scene fusion feature output by the scene coding sub-network;
fuse the visual fusion feature output by the visual coding sub-network and the scene fusion feature output by the scene coding sub-network to obtain the learned sample fusion feature.
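Illustrative sketch (not part of the claims): the final fusing step of claim 15 could be realized, for example, by a learned projection over the concatenated visual and scene fusion features; the concrete fusion operator and dimension are assumptions.

```python
import torch
import torch.nn as nn

class FusionCombiner(nn.Module):
    """Combines the fusion feature produced by the fusion layer of the visual
    coding sub-network with the one produced by the scene coding sub-network,
    here via a learned projection over their concatenation."""

    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual_fusion: torch.Tensor,
                scene_fusion: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([visual_fusion, scene_fusion], dim=-1))
```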
16. The apparatus of claim 14, wherein the sample text feature determination module is specifically configured to:
perform word embedding on the sample text to obtain a sample text word vector;
determine a word coding result of the sample text according to modal information of the sample text, position coding information of the sample text, and the sample text word vector;
construct a coding sequence of the sample text according to an initial sample text feature and the word coding result of the sample text;
input the coding sequence of the sample text into the text coding network to obtain the learned sample text features.
17. The apparatus of claim 14, wherein the sample global feature determination module is specifically configured to:
partition the sample image into blocks to obtain a sample image block sequence;
perform linear projection on the sample image blocks in the sample image block sequence to obtain coding results of the sample image blocks;
process the coding results of the sample image blocks according to modal information of the sample image blocks and position coding information of the sample image blocks to obtain processed coding results of the sample image blocks;
construct a coding sequence of the sample image according to an initial global image feature and the processed coding results of the sample image blocks;
input the coding sequence of the sample image into an input layer of the visual coding sub-network.
18. The apparatus of claim 14, wherein the sample global feature determination module is specifically configured to:
perform word embedding on the sample scene text to obtain a sample scene text vector;
determine a coding result of the sample scene text according to image position coding information of the sample scene text, modal information of the sample scene text, word position coding information of the sample scene text, and the sample scene text vector;
construct a coding sequence of the sample scene text according to an initial sample scene text feature and the coding result of the sample scene text;
input the coding sequence of the sample scene text into an input layer of the scene coding sub-network.
19. The apparatus of claim 14, wherein the model pre-training module comprises:
a fusion contrastive loss unit, configured to determine a fusion-text contrastive loss according to the sample text features and the learned sample fusion feature;
an image-text contrastive loss unit, configured to determine an image-text contrastive loss according to the global image feature of the sample image and the learned sample text features;
a training loss unit, configured to determine a training loss according to the fusion-text contrastive loss and the image-text contrastive loss;
a model pre-training unit, configured to pre-train the visual scene text fusion model by using the training loss.
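Illustrative sketch (not part of the claims): both the fusion-text contrastive loss and the image-text contrastive loss can be instances of a symmetric InfoNCE-style loss between two batches of features; the exact loss form and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_feats: torch.Tensor,
                     other_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of features, e.g.
    sample text features vs. global image features, or sample text features
    vs. learned sample fusion features. Matching pairs share a batch index."""
    text = F.normalize(text_feats, dim=-1)
    other = F.normalize(other_feats, dim=-1)
    logits = text @ other.t() / temperature      # (b, b) similarity matrix
    labels = torch.arange(text.size(0), device=text.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```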
20. The apparatus of claim 19, wherein the training loss unit is specifically configured to:
determine whether the sample scene text is empty text;
if the sample scene text is empty text, take the image-text contrastive loss as the training loss; otherwise, take the sum of the fusion-text contrastive loss and the image-text contrastive loss as the training loss.
21. A training apparatus for a visual scene text fusion model, comprising:
a service image-text pair acquisition module, configured to acquire a service image-text pair provided by a service party, wherein the service image-text pair comprises a service image and a service text;
a fine-tuning module, configured to fine-tune a visual scene text fusion model by using the service image and the service text as training data, wherein the visual scene text fusion model is obtained by the pre-training apparatus for the visual scene text fusion model according to any one of claims 14 to 20.
22. An image-text retrieval apparatus based on a visual scene text fusion model, wherein the visual scene text fusion model comprises a text coding network and a visual scene coding network, the visual scene coding network comprises a visual coding sub-network and a scene coding sub-network, and the apparatus comprises:
a target text acquisition module, configured to acquire a target text to be retrieved;
a candidate scene text extraction module, configured to extract candidate scene texts from candidate images;
a target text feature determination module, configured to input the target text into the text coding network to obtain target text features;
a candidate global feature determination module, configured to input the candidate image and the initial candidate fusion feature into the visual coding sub-network, and input the initial candidate fusion feature and the candidate scene text into the scene coding sub-network, to obtain a global image feature of the candidate image;
a target image determination module, configured to determine a target image from the candidate images according to the target text features and the global image features of the candidate images.
23. The apparatus of claim 22, wherein the target text feature determination module is specifically configured to:
perform word embedding on the target text to obtain a target text word vector;
determine a word coding result of the target text according to modal information of the target text, position coding information of the target text, and the target text word vector;
construct a coding sequence of the target text according to an initial target text feature and the word coding result of the target text;
input the coding sequence of the target text into the text coding network to obtain the processed target text features.
24. The apparatus of claim 22, wherein the candidate global feature determination module is specifically configured to:
partition the candidate image into blocks to obtain a candidate image block sequence;
perform linear projection on the candidate image blocks in the candidate image block sequence to obtain coding results of the candidate image blocks;
process the coding results of the candidate image blocks according to modal information and position coding information of the candidate image blocks to determine processed coding results of the candidate image blocks;
construct a coding sequence of the candidate image according to an initial global image feature and the processed coding results of the candidate image blocks;
input the coding sequence of the candidate image into an input layer of the visual coding sub-network.
25. The apparatus of claim 22, wherein the candidate global feature determination module is further specifically configured to:
perform word embedding on the candidate scene text to obtain a candidate scene text vector;
determine a coding result of the candidate scene text according to image position coding information of the candidate scene text, modal information of the candidate scene text, word position coding information of the candidate scene text, and the candidate scene text vector;
construct a coding sequence of the candidate scene text according to an initial candidate scene text feature and the coding result of the candidate scene text;
input the coding sequence of the candidate scene text into an input layer of the scene coding sub-network.
26. The apparatus of claim 22, wherein the visual scene text fusion model is obtained by the training apparatus for the visual scene text fusion model according to claim 21.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
28. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-13.
29. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-13.
CN202210590151.1A 2022-05-26 2022-05-26 Pre-training and image-text retrieval method and device for visual scene text fusion model Active CN114942984B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210590151.1A CN114942984B (en) 2022-05-26 2022-05-26 Pre-training and image-text retrieval method and device for visual scene text fusion model
US18/192,393 US20230386168A1 (en) 2022-05-26 2023-03-29 Pre-training method, image and text retrieval method for a vision and scene text aggregation model, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210590151.1A CN114942984B (en) 2022-05-26 2022-05-26 Pre-training and image-text retrieval method and device for visual scene text fusion model

Publications (2)

Publication Number Publication Date
CN114942984A true CN114942984A (en) 2022-08-26
CN114942984B CN114942984B (en) 2023-11-21

Family

ID=82908258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210590151.1A Active CN114942984B (en) 2022-05-26 2022-05-26 Pre-training and image-text retrieval method and device for visual scene text fusion model

Country Status (2)

Country Link
US (1) US20230386168A1 (en)
CN (1) CN114942984B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688936B (en) * 2024-02-04 2024-04-19 江西农业大学 Low-rank multi-mode fusion emotion analysis method for graphic fusion

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299216A (en) * 2018-10-29 2019-02-01 山东师范大学 A kind of cross-module state Hash search method and system merging supervision message
US20200142994A1 (en) * 2018-11-07 2020-05-07 Adobe Inc. Guided content discovery in visual search
WO2022068195A1 (en) * 2020-09-30 2022-04-07 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN113360700A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN114283430A (en) * 2021-12-03 2022-04-05 苏州大创科技有限公司 Cross-modal image-text matching training method and device, storage medium and electronic equipment
CN114445831A (en) * 2022-01-14 2022-05-06 北京百度网讯科技有限公司 Image-text pre-training method, device, equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024108472A1 (en) * 2022-11-24 2024-05-30 北京京东方技术开发有限公司 Model training method and apparatus, text image processing method, device, and medium
CN115982403A (en) * 2023-01-12 2023-04-18 之江实验室 Multi-mode hash retrieval method and device
CN115982403B (en) * 2023-01-12 2024-02-02 之江实验室 Multi-mode hash retrieval method and device
CN116030048A (en) * 2023-03-27 2023-04-28 山东鹰眼机械科技有限公司 Lamp inspection machine and method thereof
CN116311320A (en) * 2023-05-22 2023-06-23 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN116311320B (en) * 2023-05-22 2023-08-22 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN117077814A (en) * 2023-09-25 2023-11-17 北京百度网讯科技有限公司 Training method of picture retrieval model, picture retrieval method and device
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium

Also Published As

Publication number Publication date
CN114942984B (en) 2023-11-21
US20230386168A1 (en) 2023-11-30

Similar Documents

Publication Publication Date Title
CN114942984B (en) Pre-training and image-text retrieval method and device for visual scene text fusion model
US20220335711A1 (en) Method for generating pre-trained model, electronic device and storage medium
CN114399769B (en) Training method of text recognition model, and text recognition method and device
CN114155543A (en) Neural network training method, document image understanding method, device and equipment
CN113361578B (en) Training method and device for image processing model, electronic equipment and storage medium
CN113792854A (en) Model training and word stock establishing method, device, equipment and storage medium
CN110781413B (en) Method and device for determining interest points, storage medium and electronic equipment
CN113033622A (en) Training method, device, equipment and storage medium for cross-modal retrieval model
CN113392253B (en) Visual question-answering model training and visual question-answering method, device, equipment and medium
CN110263218B (en) Video description text generation method, device, equipment and medium
CN115376211B (en) Lip driving method, lip driving model training method, device and equipment
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN111783457A (en) Semantic visual positioning method and device based on multi-modal graph convolutional network
CN113361363A (en) Training method, device and equipment for face image recognition model and storage medium
CN113947147A (en) Training method and positioning method of target map model and related devices
CN115640520B (en) Pre-training method, device and storage medium of cross-language cross-modal model
EP4191544A1 (en) Method and apparatus for recognizing token, electronic device and storage medium
CN113407850A (en) Method and device for determining and acquiring virtual image and electronic equipment
CN115862040A (en) Text error correction method and device, computer equipment and readable storage medium
CN114863182A (en) Image classification method, and training method and device of image classification model
CN112507705B (en) Position code generation method and device and electronic equipment
CN113360683A (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN115577106B (en) Text classification method, device, equipment and medium based on artificial intelligence
CN116308738B (en) Model training method, business wind control method and device
CN115809325B (en) Document processing model training method, document processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant