CN116383428B - Image-text encoder training method, image-text matching method and device


Info

Publication number
CN116383428B
CN116383428B CN202310342377.4A CN202310342377A CN116383428B CN 116383428 B CN116383428 B CN 116383428B CN 202310342377 A CN202310342377 A CN 202310342377A CN 116383428 B CN116383428 B CN 116383428B
Authority
CN
China
Prior art keywords
text
picture
encoder
training samples
sample
Prior art date
Legal status
Active
Application number
CN202310342377.4A
Other languages
Chinese (zh)
Other versions
CN116383428A
Inventor
杨馥魁
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310342377.4A
Publication of CN116383428A
Application granted
Publication of CN116383428B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image-text encoder training method, an image-text matching method and an image-text matching device, and relates to the technical field of artificial intelligence, in particular to the technical fields of image processing and text recognition. The specific implementation scheme is as follows: multiple groups of training samples are acquired, wherein each group of training samples comprises a sample picture and a sample text, and the sample text is used for describing a target object in the sample picture. For each group of training samples, the target object in the sample picture included in the group of training samples is identified, and a position text describing the position of the target object in the sample picture is generated. Then, based on the multiple groups of training samples and their position texts, a text encoder and a picture encoder are jointly trained, wherein the text encoder is used for extracting text features and the picture encoder is used for extracting picture features. The accuracy of image-text matching is thereby improved.

Description

Image-text encoder training method, image-text matching method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of image processing and text recognition.
Background
With the development of computer vision technology and natural language processing technology, users increasingly need to match pictures with texts. For example, when a user needs to insert a picture in an article, the user wants to use the text content of the article to search for a picture related to that content. For another example, when looking up a book, a user wants to find the name of the book from a picture of its cover.
Disclosure of Invention
The disclosure provides an image-text encoder training method, an image-text matching method and corresponding devices.
In a first aspect of the embodiments of the present disclosure, an image-text encoder training method is provided, including:
obtaining a plurality of groups of training samples, wherein each group of training samples comprises a sample picture and a sample text, and the sample text is used for describing a target object in the sample picture;
for each set of training samples, identifying a target object in a sample picture included in the set of training samples, and generating a position text describing the position of the target object in the sample picture;
and based on the plurality of groups of training samples and the position texts of the plurality of groups of training samples, performing joint training on a text encoder and a picture encoder, wherein the text encoder is used for extracting text features, and the picture encoder is used for extracting picture features.
In a second aspect of the embodiments of the present disclosure, there is provided an image-text matching method, including:
acquiring a picture and a text to be matched;
extracting features of the picture by using a picture encoder to obtain picture features, wherein the picture encoder is obtained by training according to the method of any one of the first aspects;
extracting features of the text by using a text encoder to obtain text features, wherein the text encoder is obtained by training according to the method of any one of the first aspects;
and determining a matching result between the picture and the text based on the picture feature and the text feature.
In a third aspect of the embodiments of the present disclosure, there is provided an image-text encoder training apparatus, including:
an acquisition module, configured to acquire a plurality of groups of training samples, wherein each group of training samples comprises a sample picture and a sample text, and the sample text is used for describing a target object in the sample picture;
the generation module is used for identifying a target object in a sample picture included in each group of training samples acquired by the acquisition module and generating a position text for describing the position of the target object in the sample picture;
And the training module is used for carrying out joint training on a text encoder and a picture encoder based on the plurality of groups of training samples and the position texts of the plurality of groups of training samples, wherein the text encoder is used for extracting text characteristics, and the picture encoder is used for extracting picture characteristics.
In a fourth aspect of the embodiments of the present disclosure, there is provided an image-text matching apparatus, including:
the acquisition module is used for acquiring the pictures and the texts to be matched;
the feature extraction module is used for extracting features of the picture by using a picture encoder to obtain picture features, and the picture encoder is obtained by training according to the method of any one of the first aspects;
the feature extraction module is further configured to perform feature extraction on the text by using a text encoder to obtain text features, where the text encoder is obtained by training by using the method according to any one of the first aspects;
and the matching module is used for determining a matching result between the picture and the text based on the picture features and the text features extracted by the feature extraction module.
In a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first or second aspects.
A sixth aspect of embodiments of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the first or second aspects.
A seventh aspect of the disclosed embodiments provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first or second aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a first image-text encoder training method provided by an embodiment of the disclosure;
FIG. 2 is an exemplary schematic diagram of a sample picture provided by an embodiment of the present disclosure;
FIG. 3 is an exemplary schematic diagram of another sample picture provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a second image-text encoder training method provided by an embodiment of the disclosure;
FIG. 5 is a flow chart of a third image-text encoder training method provided by an embodiment of the disclosure;
FIG. 6 is an exemplary schematic diagram of an image-text encoder training process provided by an embodiment of the disclosure;
FIG. 7 is a flowchart of an image-text matching method provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an image-text encoder training device according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an image-text matching device according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device for implementing the image-text encoder training method and the image-text matching method of an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In a conventional image-text training algorithm, an image-text matching model is directly utilized to identify and match images and texts, and training is performed based on a matching result.
In order to improve accuracy of image-text matching, the embodiment of the disclosure provides an image-text encoder training method, which is applied to electronic equipment, for example, the electronic equipment can be equipment with image and text processing capability, such as a server, a desktop computer or a notebook computer. As shown in fig. 1, the method comprises the steps of:
s101, acquiring a plurality of groups of training samples.
Each group of training samples comprises a sample picture and sample text, and the sample text is used for describing a target object in the sample picture. I.e. the sample picture and the sample text are matched to each other in each set of training samples.
For example, as shown in fig. 2, the sample text matched with the sample picture is: "A boy is playing football." It should be noted that fig. 2 is only an example provided by the embodiment of the present disclosure; in practical application, the sample picture may be an actually taken picture, a screenshot from a video, a manually made picture, or an automatically generated picture, which is not specifically limited by the embodiment of the present disclosure.
The sample text matched with the sample picture may be manually marked text, or a title corresponding to the sample picture, which is not particularly limited in the embodiment of the present disclosure.
S102, identifying target objects in sample pictures included in each group of training samples, and generating position texts for describing positions of the target objects in the sample pictures.
The target object in the sample picture can be identified by the specified target identification model. For example, the type to which the target object belongs and the position of the target object in the sample picture are identified. The position of the target object in the sample picture may be the position of the minimum circumscribed rectangle of the target object in the sample picture, or the position of the outline of the target object in the sample picture, etc. The type to which each target object belongs may be a person or thing, such as a child, cat, football, tree, or the like. The type of the target object which can be identified by the target identification model can be set according to actual requirements.
Taking fig. 2 as an example, the generated position texts may be: "boy on the left side of the picture" and "football on the right side of the picture". The manner in which the position text is generated is described in detail below.
S103, performing joint training on the text encoder and the picture encoder based on a plurality of groups of training samples and the position texts of the plurality of groups of training samples.
The text encoder is used for extracting text features, and the picture encoder is used for extracting picture features. That is, after training is completed, text features may be extracted with a text encoder for the text and the picture to be matched, picture features may be extracted with a picture encoder, and then whether the text and the picture are matched may be determined based on the text features and the picture features.
Through the above method, the embodiment of the disclosure can generate a position text describing the position of the target object in the sample picture and train the text encoder and the picture encoder with it. This makes the text encoder and the picture encoder more sensitive to the position of the target object in the picture during training, deepens their understanding of the content of pictures and texts, and improves the accuracy of feature extraction, so that pictures and texts can be matched more accurately based on the extracted features.
In some embodiments of the present disclosure, the manner of generating the position text for describing the position of the target object in the sample picture for the sample picture in each set of training samples in S102 includes the following steps:
Step one, dividing the sample picture into a preset number of sub-pictures.
Wherein, no two sub-pictures intersect, and the union of all the sub-pictures is the sample picture.
For example, referring to fig. 3, a sample picture may be divided into 9 sub-pictures of the same size.
Step two, numbering each sub-picture in sequence.
Alternatively, each sub-picture may be numbered sequentially, in left-to-right, top-to-bottom order, as [0, 1, 2, …, k1*k2-1], where k1 is the number of rows of sub-pictures into which the sample picture is divided and k2 is the number of columns of sub-pictures; k1 and k2 may be equal or unequal. The sub-pictures may also be numbered in other orders, which is not specifically limited by the embodiments of the disclosure.
For example, referring to fig. 3, each sub-picture is numbered in turn [0,1,2,3,4,5,6,7,8] in a left-to-right, top-to-bottom order.
Step three, for each sub-picture, if the proportion of the target object in the sub-picture is larger than a preset proportion, generating a position text based on the word describing the target object and the number of the sub-picture.
For each sub-picture, the ratio of the area of the intersection area of the target object and the sub-picture to the area of the sub-picture can be used as the proportion of the target object in the sub-picture. And substituting the word describing the target object and the serial number of the sub-picture into a text template when the ratio is larger than a preset ratio to obtain a position text. The preset proportion can be set according to actual requirements, for example, the preset proportion is 50%.
For example, the text template is "There is a _ in position _"; assuming that the word describing the target object is "cat" and the number of the sub-picture is "one", "cat" is substituted into the first "_" and "one" into the second "_", and the obtained position text is: "There is a cat in position one." Alternatively, the text template may take other forms, which is not specifically limited by the embodiments of the present disclosure.
If the proportion of the plurality of target objects in one sub-picture is larger than the preset proportion, the target object with the largest proportion can be selected, and the position text is generated based on the word describing the target object and the number of the sub-picture.
Through the method, the embodiment of the disclosure can automatically generate the position text through the position of the target object in the sample picture, and compared with a manual labeling mode, the embodiment of the disclosure can improve the efficiency of generating the position text, thereby improving the training efficiency of the encoder.
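As an illustration of steps one to three, the Python sketch below generates position texts from detected bounding boxes; the 3x3 grid, the 50% threshold and the template "There is a _ in position _" follow the examples in this section, while the function and parameter names are assumptions for illustration only.

```python
def generate_position_texts(img_w, img_h, detections, k1=3, k2=3, min_ratio=0.5):
    """Sketch of steps one to three: split the picture into a k1 x k2 grid,
    number the sub-pictures left-to-right, top-to-bottom, and emit one position
    text per sub-picture whose overlap with a detected object exceeds min_ratio.
    `detections` is a list of (word, (x1, y1, x2, y2)) bounding boxes."""
    cell_w, cell_h = img_w / k2, img_h / k1
    texts = []
    for row in range(k1):
        for col in range(k2):
            number = row * k2 + col                 # numbering 0 .. k1*k2-1
            cx1, cy1 = col * cell_w, row * cell_h
            cx2, cy2 = cx1 + cell_w, cy1 + cell_h
            best_word, best_ratio = None, min_ratio
            for word, (x1, y1, x2, y2) in detections:
                # area of the intersection between the object box and this sub-picture
                iw = max(0.0, min(x2, cx2) - max(x1, cx1))
                ih = max(0.0, min(y2, cy2) - max(y1, cy1))
                ratio = (iw * ih) / (cell_w * cell_h)
                if ratio > best_ratio:              # keep the object with the largest share
                    best_word, best_ratio = word, ratio
            if best_word is not None:
                texts.append(f"There is a {best_word} in position {number}")
    return texts
```

For instance, generate_position_texts(900, 900, [("boy", (50, 100, 350, 800))]) would emit one position text for each grid cell the boy fills by more than half.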
In an embodiment of the present disclosure, referring to fig. 4, the method for performing joint training on a text encoder and a picture encoder based on multiple sets of training samples and position texts of the multiple sets of training samples in S103 includes the following steps:
s401, for each group of training samples, a text encoder is utilized to encode sample texts included in the group of training samples, and sample text characteristics are obtained.
The text encoder may be an encoder (Encoder) in a Transformer model.
For ease of description, the sample text feature is denoted as f_t.
S402, encoding the position text of the group of training samples by using a text encoder to obtain position text characteristics, hiding key areas in sample pictures included in the group of training samples, and encoding the sample pictures hiding the key areas by using a picture encoder to obtain picture characteristics.
The key region is a region that includes the target object. If the sample picture includes one target object, the region that the target object occupies most can be taken as the key region and hidden. If the sample picture includes a plurality of target objects, one target object can be selected at random, and the region it occupies most is taken as the key region and hidden. Each region in the sample picture corresponds to one of the sub-pictures into which the sample picture is divided.
Alternatively, the critical area may be determined by other means, which is not specifically limited by the embodiments of the present disclosure.
The picture encoder may be an encoder in a Transformer model, or a feature extraction network in a convolutional neural network (Convolutional Neural Networks, CNN), or the like.
Alternatively, when the key region is hidden, the pixel values of the key region may be set to preset values. For example, the preset value is a white pixel value. Or may otherwise conceal the critical area, as the disclosed embodiments are not specifically limited.
For convenience of description, the picture feature is denoted as f_v, and the position text feature is denoted as f_p.
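The picture-side concealment of S402 can be sketched as follows; representing the sample picture as an H x W x 3 array, filling the key region with white pixels, and the helper names and data layout are assumptions rather than details given in this disclosure.

```python
import random
import numpy as np

def hide_key_region(picture, regions, object_regions, fill_value=255):
    """Sketch of hiding a key region (S402): pick one target object at random,
    take the region (sub-picture) it occupies most, and overwrite those pixels
    with a preset value (white here). `regions` maps region number -> (x1, y1, x2, y2)
    and `object_regions` maps object word -> the region number it occupies most."""
    hidden = picture.copy()
    word = random.choice(list(object_regions))      # one target object chosen at random
    x1, y1, x2, y2 = regions[object_regions[word]]
    hidden[y1:y2, x1:x2, :] = fill_value            # conceal the key region
    return hidden, word
```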
S403, encoding sample pictures included in the training samples by using a picture encoder to obtain picture features, hiding keywords in the position texts of the training samples, and encoding the position texts of the hidden keywords by using a text encoder to obtain position text features.
The keywords are words describing the target object or words describing the position of the target object in the sample picture.
Alternatively, if the position text of the sample picture is a sentence, the word describing the target object or the word describing the position of the target object in the sample picture may be randomly selected. For example, taking fig. 3 as an example, the position text is "boy is in region 0", and "boy" or "0" can be randomly selected to be hidden.
If the position text of the sample picture is a plurality of sentences, a sentence can be randomly selected, and words for hiding the description target object or words for describing the position of the target object in the sample picture can be randomly selected from the sentences. For example, the location text is: "boy is in region 0" and "football is in region 7", and assuming "football is in region 7" is selected, either "football" or "7" can be randomly selected to be hidden.
Alternatively, when the keywords are hidden, the keywords in the position text may be replaced with a preset word. For example, the preset word is "mask". The keywords may also be concealed in other ways, which is not specifically limited by the embodiments of the present disclosure.
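The keyword concealment of S403 can be sketched in the same spirit; the template layout and the placeholder word follow the examples above, and the helper name is an assumption.

```python
import random

def hide_keyword(position_texts, mask_token="mask"):
    """Sketch of hiding a keyword (S403): pick one position text at random, then
    randomly replace either the word describing the target object or the position
    number with a preset word. Position texts are assumed to follow the template
    "There is a <object> in position <number>"."""
    idx = random.randrange(len(position_texts))
    words = position_texts[idx].split()
    target = random.choice([3, 6])      # index 3: object word, index 6: position number
    hidden_word = words[target]
    words[target] = mask_token
    hidden_texts = list(position_texts)
    hidden_texts[idx] = " ".join(words)
    return hidden_texts, hidden_word
```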
For each set of training samples, one of the hiding-and-encoding approaches of S402 and S403 may be selected at random.
S404, based on sample text features, position text features and picture features corresponding to each group of training samples, the text encoder and the picture encoder are trained in a combined mode.
The loss value may be calculated based on the sample text feature, the position text feature, and the picture feature corresponding to each set of training samples, and parameters of the text encoder and the picture encoder may be adjusted using the loss value until the text encoder and the picture encoder both converge, to determine that training is complete, and for a specific manner, reference may be made to the following description.
Through the above method, the key region in the sample picture can be hidden, and the text encoder and the picture encoder are trained on the sample text features, the position text features and the picture features of the sample picture with the key region hidden. For a sample text and a sample picture that match each other, the picture encoder thereby learns the relations among the sample text features, the position text features and the picture features, and can infer the missing region of the picture from the sample text features and the position text features, gaining a deeper understanding of the picture and improving the accuracy with which the picture encoder extracts picture features.
Similarly, the embodiment of the disclosure can hide keywords in the position text and train the text encoder and the picture encoder on the sample text features, the picture features and the position text features of the position text with the keywords hidden. For a sample text and a sample picture that match each other, the text encoder thereby learns the relations among the position text features, the sample text features and the picture features, and can infer the keywords missing from the position text from the picture features, gaining a deeper understanding of the text and improving the accuracy with which the text encoder extracts text features.
In some embodiments of the present disclosure, referring to fig. 5, the encoder training method in S404 may include the following steps:
s501, combining sample text features corresponding to each group of training samples with position text features to obtain combined text features, and determining similarity between the combined text features and picture features corresponding to each group of training samples.
For convenience of description, the merged text feature is denoted as f_c= [ f_t, f_p ].
S502, determining an alignment loss value of the training samples based on the similarity.
The alignment loss value for each set of training samples may be determined by equation (1):
loss1=-log(exp(f_v*f_c+)/∑exp(f_v*f_c)) (1)
where loss1 is the alignment loss value, f_v is the picture feature, and f_c is the merged text feature. f_v*f_c+ denotes the product of the picture feature and the merged text feature corresponding to the same set of training samples, and this product may represent the similarity between the merged text feature and the picture feature. f_v*f_c in the denominator ranges over the products of each picture feature and each merged text feature of the current round of training, and exp denotes the exponential function with base e.
It can be understood that the product of the merged text feature and the picture feature corresponding to the set of training samples can embody the similarity between the merged text feature and the picture feature, and can embody the similarity between the sample picture and the sample text in the set of training samples. Therefore, the larger the product of the merged text feature and the picture feature corresponding to the training sample, that is, the larger the numerator of the formula (1), the higher the similarity between the features corresponding to the sample text and the sample picture, which are matched with each other, which means that the higher the feature extraction accuracy of the text encoder and the picture encoder is, and therefore, the smaller the loss1 is.
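As a concrete illustration, the following PyTorch sketch computes this contrastive alignment objective over a batch; treating the diagonal pairs as the matching sets, projecting f_t, f_p and the picture features to a common dimension, and the helper name are assumptions, not details fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def alignment_loss(f_v, f_c):
    """Sketch of formula (1): f_v is a (B, D) batch of picture features, f_c a
    (B, D) batch of merged text features [f_t, f_p] (assumed projected to the
    same dimension D); row i of f_v and row i of f_c come from the same set of
    training samples. For each row the loss is
    -log(exp(f_v*f_c+) / sum over all f_c of exp(f_v*f_c)), averaged over the batch."""
    logits = f_v @ f_c.t()                        # products f_v*f_c for all pairs of the round
    labels = torch.arange(f_v.size(0), device=f_v.device)
    return F.cross_entropy(logits, labels)        # cross-entropy realises the -log softmax form
```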
S503, reconstructing the position text features corresponding to the training samples to obtain reconstructed text, and determining a text reconstruction loss value between the reconstructed text and the position text of the training samples.
The text decoder may be utilized to decode the position text features corresponding to the set of training samples to obtain the reconstructed text. The text decoder may be, for example, a decoder (Decoder) in a Transformer model.
In the embodiment of the disclosure, the text encoder can predict the position text characteristics of the complete position text based on the position text of the hidden keyword; the text decoder can decode the position text characteristics predicted by the text encoder to obtain a complete position text, so that the prediction accuracy of the text decoder is influenced by the prediction accuracy of the text encoder.
Therefore, the text decoder decodes the position text features, so that hidden keywords are predicted, so that the accuracy of decoding the position text features by the text decoder and the accuracy of extracting the text features by the text encoder can be obtained by comparing the similarity between the predicted reconstructed text and the position text before hiding.
In the disclosed embodiment, the reconstructed text may be a predicted keyword, or the reconstructed text may be a predicted pre-hidden complete location text.
Taking the reconstructed text as a predicted keyword as an example, a text reconstruction loss value can be calculated by the formula (2):
loss2=classify(decoder(f_p),word) (2)
Where loss2 is a text reconstruction loss value, decoder (f_p) is a predicted keyword, word is an actually hidden keyword, and classify represents a probability that the decoder (f_p) and word do not belong to the same type.
In addition to calculating loss2 using classify, loss2 may also be calculated by other algorithms, which are not specifically limited by the embodiments of the present disclosure.
When the reconstructed text is a predicted pre-hidden complete position text, a penalty value may be reconstructed based on the text between the predicted reconstructed text and the position text before the hidden keyword.
It will be appreciated that, for each set of training samples, when the samples are hidden and encoded in the manner of fig. 4, the sample pictures may be hidden in the key region, and the position text is not hidden, and at this time, the text encoder encodes the complete position text to obtain position text features, and the corresponding text decoder restores the complete position text based on the position text features. In this case, the reconstructed text and the position text have a high similarity, so that the calculated text reconstruction loss value is small, and therefore the influence on the coding accuracy of the encoder during training is small, and the text reconstruction loss value can be not additionally processed, i.e. the influence of the situation on training is not considered. On the other hand, if the keywords in the text of the hidden position are selected during hiding, the text reconstruction loss value can reflect the prediction accuracy of the text codec, so that the text reconstruction loss value is used for training, and the prediction accuracy of the text codec can be improved.
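As an illustration of formula (2) in the case where the reconstructed text is the predicted keyword, one plausible reading of classify is a cross-entropy over the vocabulary at the concealed slot; the sketch below is an assumption in that sense, not the exact definition used in this disclosure.

```python
import torch
import torch.nn.functional as F

def text_reconstruction_loss(keyword_logits, hidden_word_id):
    """Sketch of formula (2): keyword_logits (shape (vocab_size,)) stands for
    decoder(f_p), the text decoder's prediction for the concealed slot, and
    hidden_word_id is the vocabulary index of the actually hidden keyword (word).
    Cross-entropy penalises probability assigned away from that keyword."""
    return F.cross_entropy(keyword_logits.unsqueeze(0),
                           torch.tensor([hidden_word_id]))
```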
S504, reconstructing the picture features corresponding to the training samples to obtain a reconstructed picture, and determining a picture reconstruction loss value between the reconstructed picture and the sample pictures included in the training samples.
And decoding the picture characteristics corresponding to the training samples by using a picture decoder to obtain a reconstructed picture. The picture decoder may be a decoder in a transformer model, or may be a reconstruction network in a CNN.
The picture encoder can predict the picture characteristics of the complete sample picture based on the sample picture of the hidden key region; the picture decoder can decode the picture characteristics predicted by the picture encoder to obtain a complete sample picture, so that the prediction accuracy of the picture decoder is influenced by the prediction accuracy of the picture encoder.
Therefore, the picture characteristics are decoded through the picture decoder, so that the hidden key region is predicted, and the accuracy of the picture decoder in decoding the picture characteristics and the accuracy of the picture encoder in extracting the picture characteristics can be obtained by comparing the similarity between the predicted reconstructed picture and the sample picture before hiding.
In the embodiment of the disclosure, the reconstructed picture may be a predicted key region, or the reconstructed picture may be a predicted pre-concealment complete sample picture.
Taking the reconstructed picture as a predicted sample picture before concealment as an example, a picture reconstruction loss value can be determined through a formula (3):
loss3=L2(decoder(f_v),img) (3)
where loss3 is the picture reconstruction loss value, decoder(f_v) is the predicted sample picture before concealment, img is the actual sample picture before concealment, and L2 is the mean square error (MSE).
Calculating loss3 with L2 means computing the squared differences between the pixel values at each identical position in the reconstructed picture and the sample picture. Besides L2, loss3 may also be calculated with other algorithms, such as L1, i.e., the mean absolute error (MAE), which is not specifically limited by the embodiments of the disclosure.
When the reconstructed picture is a predicted key region, a loss value may be reconstructed based on the picture between the predicted key region and the actual hidden key region.
It will be appreciated that, for each set of training samples, when the training samples are hidden and encoded in the manner of fig. 4, the position text may be hidden by keywords, and the sample pictures are not hidden, and at this time, the picture encoder encodes the complete sample picture to obtain picture features, and the corresponding picture decoder restores the complete sample picture based on the picture features. In this case, the reconstructed picture and the sample picture have a high similarity, so that the calculated picture reconstruction loss value is small, and thus the influence on the coding accuracy of the encoder is small, and such picture reconstruction loss value may not be additionally processed, i.e., the influence of such a case on training is not considered. On the other hand, if the key region in the sample picture is hidden during hiding, the picture reconstruction loss value can reflect the prediction accuracy of the picture coder-decoder, so that the picture reconstruction loss value is used for training, and the prediction accuracy of the picture coder-decoder can be improved.
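A minimal sketch of formula (3), assuming the reconstructed picture decoder(f_v) and the sample picture img are tensors of the same shape; the helper name is illustrative.

```python
import torch.nn.functional as F

def picture_reconstruction_loss(reconstructed, original):
    """Sketch of formula (3): `reconstructed` stands for decoder(f_v) and
    `original` for img, the sample picture before concealment; mse_loss averages
    the squared differences between pixel values at identical positions."""
    return F.mse_loss(reconstructed, original)
```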
S505, performing joint training on the text encoder and the picture encoder based on the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples.
The smaller the alignment loss value is, the higher the similarity between the picture features corresponding to each group of training samples and the combined text features is, which means that the more accurate the text encoder and the features extracted by the picture encoder are; the smaller the text reconstruction loss value is, the smaller the error between the reconstructed text and the position text is, which indicates that the more accurate the extracted characteristics of the text encoder are; the smaller the picture reconstruction loss is, the smaller the error between the reconstructed picture and the sample picture is, which means that the more accurate the extracted features of the picture encoder are. Therefore, the text encoder and the picture encoder can be trained through the three loss values, so that the accuracy of feature extraction of the encoder is improved.
In some embodiments of the present disclosure, the encoder training method of S505 may include the steps of:
and step 1, summing the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples to obtain a total loss value.
For convenience of description, the total loss value is denoted as LOSS = L1 + L2 + L3, where L1 is the sum of the alignment loss values of all the groups of training samples, L2 is the sum of the text reconstruction loss values, and L3 is the sum of the picture reconstruction loss values; L1, L2 and L3 are each scalars.
And 2, adjusting parameters of the picture decoder and the picture encoder by using the total loss value.
The parameters of the network layers in the picture decoder and the picture encoder may be adjusted in a counter-propagating manner, i.e. in the order from the last network layer to the first network layer of the picture decoder and then from the last network layer to the first network layer of the picture encoder.
And 3, adjusting parameters of the text decoder and the text encoder by using the total loss value.
The parameters of each network layer in the text decoder and the text encoder may be adjusted in a counter-propagating manner, i.e. in the order from the last network layer to the first network layer of the text decoder and then from the last network layer to the first network layer of the text encoder.
And 4, if the text encoder and the picture encoder are converged, determining that the training is completed, otherwise, performing the next training round.
Alternatively, it may be determined that both the text encoder and the picture encoder converge when the number of training times reaches a preset number, or the total loss value calculated this time is smaller than a preset value, or the difference between the total loss value calculated this time and the total loss value calculated last time is smaller than a preset difference. Or may determine whether the text encoder and the picture encoder converge in other manners, which is not particularly limited by the embodiments of the present disclosure.
Through the method, the embodiment of the disclosure can use three loss values to perform joint training on the text coder and the picture coder, so that the feature extraction accuracy of the picture coder and the text coder is improved, and the reconstruction accuracy of the picture decoder and the text decoder is improved.
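Combining steps 1 to 4, one training procedure might look like the sketch below; the optimizer, the fixed number of rounds as the stopping test, and the helper compute_losses (standing in for S501 to S504) are assumptions rather than details given in this disclosure, and the two separate back-propagation passes of steps 2 and 3 are folded into a single backward call over the summed loss.

```python
import torch

def joint_train(batches, text_enc, pic_enc, text_dec, pic_dec,
                compute_losses, lr=1e-4, rounds=100):
    """Sketch of steps 1-4: sum the alignment, text reconstruction and picture
    reconstruction loss values of every group of training samples into a total
    loss and adjust the parameters of the picture codec and the text codec with
    it. compute_losses is a hypothetical helper returning (loss1, loss2, loss3)
    for one batch, wrapping S501-S504."""
    params = (list(text_enc.parameters()) + list(pic_enc.parameters()) +
              list(text_dec.parameters()) + list(pic_dec.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)    # optimizer choice is an assumption
    for _ in range(rounds):                        # convergence test here: preset number of rounds
        total = 0.0
        for batch in batches:
            loss1, loss2, loss3 = compute_losses(batch, text_enc, pic_enc, text_dec, pic_dec)
            total = total + loss1 + loss2 + loss3  # LOSS = L1 + L2 + L3
        optimizer.zero_grad()
        total.backward()                           # one backward pass updates both codecs
        optimizer.step()
```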
Referring to fig. 6, the overall flow of the image-text encoder training provided by the embodiment of the present disclosure is described below in combination with an actual application scenario:
and acquiring multiple groups of training samples, wherein each group of training samples comprises sample text and sample pictures. And encoding sample text included in each group of training samples by using a text encoder to obtain sample text characteristics.
For each group of training samples, performing position coding on a sample picture included in the group of training samples, namely dividing the sample picture into a plurality of sub-pictures, and numbering each sub-picture in sequence. And for each sub-picture, if the proportion of the target object in the sub-picture is larger than a preset proportion, generating a position text based on the word describing the target object and the serial number of the sub-picture.
The method comprises the steps of encoding position texts of the training samples by using a text encoder to obtain position text characteristics, hiding key areas in sample pictures included in the training samples, and encoding sample pictures hiding the key areas by using a picture encoder to obtain picture characteristics; or, encoding sample pictures included in the training samples by using a picture encoder to obtain picture characteristics, hiding keywords in the position texts of the training samples, and encoding the position texts of the hidden keywords by using a text encoder to obtain position text characteristics.
Decoding the position text features of the training samples by using a text decoder to obtain a reconstructed text; a text reconstruction penalty value is determined between the reconstructed text and the positional text of the set of training samples.
And decoding the picture characteristics of the training samples by using a picture decoder to obtain a reconstructed picture, and determining a picture reconstruction loss value between the reconstructed picture and the sample pictures of the training samples.
And determining the alignment loss value of the training samples based on the sample text features, the position text features and the picture features corresponding to the training samples.
The text encoder and the picture encoder are jointly trained based on the alignment loss value, the text reconstruction loss value, and the picture reconstruction loss value for each set of training samples.
The specific implementation of each step in fig. 6 may refer to the above description, and will not be repeated here.
Based on the same inventive concept, the embodiment of the disclosure also provides an image-text matching method, which is applied to an electronic device; for example, the electronic device can be a device with picture and text processing capability, such as a server, a desktop computer or a notebook computer. Moreover, the electronic device that performs the image-text encoder training method and the electronic device that performs the image-text matching method may be the same device or different devices. As shown in fig. 7, the image-text matching method provided by the embodiment of the disclosure includes the following steps:
S701, obtaining a picture and a text to be matched.
In the scenario of searching for text in a picture, the picture to be matched may be a picture uploaded by the user or a picture selected by the user, etc., and the text to be matched may be each text in the search library.
In the context of searching for pictures, the text to be matched may be the text uploaded by the user or the text selected by the user, and the picture to be matched may be each picture in the search library.
S702, extracting features of the picture by using a picture encoder to obtain picture features.
The picture encoder is obtained by training with the image-text encoder training method described above.
If a plurality of pictures to be matched exist, each picture can be respectively input into a picture encoder to obtain the picture characteristics of each picture output by the picture encoder.
S703, extracting characteristics of the text by using a text encoder to obtain text characteristics.
The text encoder is obtained by training with the image-text encoder training method described above.
If a plurality of texts to be matched exist, each text can be respectively input into the text encoder, and the text characteristics of each text output by the text encoder are obtained.
S704, determining a matching result between the picture and the text based on the picture characteristics and the text characteristics.
The similarity between the picture feature and the text feature is calculated with a preset similarity algorithm; if the calculated similarity is smaller than a preset similarity, it is determined that the picture to be matched does not match the text. Otherwise, if the calculated similarity is greater than or equal to the preset similarity, it is determined that the picture to be matched matches the text.
The preset similarity algorithm may be cosine similarity, or may also be other algorithms, which is not specifically limited in the embodiments of the present disclosure.
The preset similarity can be set according to actual requirements. For example, in the case where the similarity value range is [0,1], the preset similarity is 0.8.
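A minimal sketch of S704 under the cosine-similarity example above; the function name and the default threshold of 0.8 follow the examples here and are not prescribed by this disclosure.

```python
import torch.nn.functional as F

def is_match(picture_feature, text_feature, preset_similarity=0.8):
    """Sketch of S704: compare the cosine similarity between the picture feature
    and the text feature against the preset similarity; a value below the preset
    similarity means the picture and the text are judged not to match."""
    similarity = F.cosine_similarity(picture_feature, text_feature, dim=-1)
    return similarity >= preset_similarity, similarity
```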
By the method, the image-text matching can be performed through the image encoder and the text encoder which are more sensitive to the position of the target object in the image, so that the accuracy of the image-text matching is improved.
In the scenario of searching for text with a picture, after the matching results between the picture to be matched and each text are determined, N texts can be selected from the texts that match the picture, in descending order of similarity, and fed back to the user.
In the scenario of searching for pictures with text, after the matching results between the text to be matched and each picture are determined, M pictures can be selected from the pictures that match the text, in descending order of similarity, and fed back to the user.
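In both search scenarios the feedback step reduces to sorting the matched candidates by similarity and keeping the first N (or M); a small sketch, with illustrative names:

```python
def top_results(scored_candidates, n):
    """Sketch of the feedback step: scored_candidates is a list of
    (candidate, similarity) pairs already judged to match; keep the n with the
    highest similarity, in descending order."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:n]
```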
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of texts and pictures accord with the regulations of related laws and regulations, and the public order is not violated.
Based on the same inventive concept, the embodiment of the present disclosure further provides an apparatus for training an image-text encoder, as shown in fig. 8, where the apparatus includes: an acquisition module 801, a generation module 802, and a training module 803;
an obtaining module 801, configured to obtain multiple sets of training samples, where each set of training samples includes a sample picture and a sample text, and the sample text is used to describe a target object in the sample picture;
a generating module 802, configured to identify, for each set of training samples acquired by the acquiring module 801, a target object in a sample picture included in the set of training samples, and generate a position text for describing a position of the target object in the sample picture;
training module 803 is configured to jointly train a text encoder and a picture encoder based on the multiple sets of training samples and the position texts of the multiple sets of training samples, where the text encoder is used to extract text features, and the picture encoder is used to extract picture features.
In some embodiments of the present disclosure, the training module 803 is specifically configured to:
For each group of training samples, a text encoder is utilized to encode sample texts included in the group of training samples, so as to obtain sample text characteristics;
the method comprises the steps of encoding position texts of the training samples by using a text encoder to obtain position text characteristics, hiding key areas in sample pictures included in the training samples, and encoding sample pictures hiding the key areas by using a picture encoder to obtain picture characteristics; or, encoding sample pictures included in the group of training samples by using a picture encoder to obtain picture characteristics, hiding keywords in the position texts of the group of training samples, and encoding the position texts of the hidden keywords by using a text encoder to obtain position text characteristics; the key area is an area comprising a target object, and the key words are words describing the target object or words describing the position of the target object in the sample picture;
and performing joint training on the text encoder and the picture encoder based on the sample text characteristics, the position text characteristics and the picture characteristics corresponding to each group of training samples.
In some embodiments of the present disclosure, the training module 803 is specifically configured to:
Combining sample text features and position text features corresponding to each group of training samples to obtain combined text features, and determining similarity between the combined text features and picture features corresponding to the group of training samples;
determining an alignment loss value for the set of training samples based on the similarity;
reconstructing the position text features corresponding to the training samples to obtain reconstructed texts, and determining text reconstruction loss values between the reconstructed texts and the position texts of the training samples;
reconstructing the picture features corresponding to the training samples to obtain reconstructed pictures, and determining picture reconstruction loss values between the reconstructed pictures and sample pictures included in the training samples;
the text encoder and the picture encoder are jointly trained based on the alignment loss value, the text reconstruction loss value, and the picture reconstruction loss value for each set of training samples.
In some embodiments of the present disclosure, the training module 803 is specifically configured to:
decoding the position text features corresponding to the set of training samples by using a text decoder to obtain a reconstructed text;
training module 803, specifically for:
and decoding the picture features corresponding to the training samples by using a picture decoder to obtain a reconstructed picture.
In some embodiments of the present disclosure, the training module 803 is specifically configured to:
summing the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples to obtain a total loss value;
adjusting parameters of the picture decoder and the picture encoder by using the total loss value;
adjusting parameters of the text decoder and the text encoder by using the total loss value;
if the text encoder and the picture encoder are converged, the training is determined to be completed, otherwise, the next training is carried out.
In some embodiments of the present disclosure, the generating module 802 is specifically configured to:
dividing the sample picture into a preset number of sub-pictures; wherein, there is no intersection between every two sub-pictures, and the union of every sub-picture is the sample picture;
numbering each sub-picture in turn;
and for each sub-picture, if the proportion of the target object in the sub-picture is larger than a preset proportion, generating a position text based on the word describing the target object and the serial number of the sub-picture.
Based on the same inventive concept, the embodiment of the present disclosure further provides an image-text matching device, as shown in fig. 9, where the device includes: an acquisition module 901, a feature extraction module 902 and a matching module 903;
An obtaining module 901, configured to obtain a picture and a text to be matched;
the feature extraction module 902 is configured to perform feature extraction on a picture by using a picture encoder to obtain a picture feature, where the picture encoder is obtained by training by using the above-mentioned image-text encoder training method;
the feature extraction module 902 is further configured to perform feature extraction on the text by using a text encoder to obtain text features, where the text encoder is obtained by training the image-text encoder training method;
the matching module 903 is configured to determine a matching result between the picture and the text based on the picture feature and the text feature extracted by the feature extracting module 902.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the electronic apparatus 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows electronic device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as the image-text encoder training method and the image-text matching method. For example, in some embodiments, the image-text encoder training method and the image-text matching method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more of the steps of the image-text encoder training method and the image-text matching method described above can be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the image-text encoder training method and the image-text matching method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A method of graphic encoder training comprising:
obtaining a plurality of groups of training samples, wherein each group of training samples comprises a sample picture and a sample text, the sample text is used for describing a target object in the sample picture, and the type of the target object comprises a person or an object;
for each group of training samples, identifying a target object in the sample picture included in the group of training samples, and generating a position text describing the position of the target object in the sample picture;
based on the plurality of groups of training samples and the position texts of the plurality of groups of training samples, performing joint training on a text encoder and a picture encoder, wherein the text encoder is used for extracting text features, and the picture encoder is used for extracting picture features;
the performing joint training on the text encoder and the picture encoder based on the plurality of groups of training samples and the position texts of the plurality of groups of training samples includes:
for each group of training samples, encoding the sample text included in the group of training samples by using the text encoder to obtain sample text features;
encoding the position text of the group of training samples by using the text encoder to obtain position text features, hiding a key area in the sample picture included in the group of training samples, and encoding the sample picture with the key area hidden by using the picture encoder to obtain picture features; or, encoding the sample picture included in the group of training samples by using the picture encoder to obtain picture features, hiding keywords in the position text of the group of training samples, and encoding the position text with the keywords hidden by using the text encoder to obtain position text features; wherein the key area is an area comprising the target object, and the keywords are words describing the target object or words describing the position of the target object in the sample picture;
and performing joint training on the text encoder and the picture encoder based on the sample text features, position text features and picture features corresponding to each group of training samples.
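By way of illustration only (this is not part of the claims): a minimal PyTorch-style sketch of the masking-and-encoding step described above, assuming hypothetical text_encoder and picture_encoder modules that map token-id tensors and image tensors to feature vectors; the key-area coordinates, keyword token positions and mask value below are placeholder assumptions.

    import torch

    def encode_training_sample(sample_text, position_text, picture,
                               text_encoder, picture_encoder, mask_picture=True):
        # Encode the sample text that describes the target object.
        sample_text_feat = text_encoder(sample_text)

        if mask_picture:
            # Branch 1: keep the position text intact and hide the key area
            # (the region containing the target object) in the sample picture.
            position_text_feat = text_encoder(position_text)
            masked_picture = picture.clone()
            masked_picture[:, :, 64:128, 64:128] = 0.0   # hypothetical key-area box
            picture_feat = picture_encoder(masked_picture)
        else:
            # Branch 2: keep the picture intact and hide the keywords
            # (object word or position word) in the position text.
            picture_feat = picture_encoder(picture)
            masked_text = position_text.clone()
            masked_text[:, 1:3] = 0   # hypothetical keyword positions and mask id
            position_text_feat = text_encoder(masked_text)

        return sample_text_feat, position_text_feat, picture_feat

The three returned features correspond to the sample text features, position text features and picture features that feed the joint training step of the claim.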
2. The method of claim 1, wherein the performing joint training on the text encoder and the picture encoder based on the sample text features, position text features and picture features corresponding to each group of training samples comprises:
combining sample text features and position text features corresponding to each group of training samples to obtain combined text features, and determining similarity between the combined text features and picture features corresponding to the group of training samples;
determining an alignment loss value for the group of training samples based on the similarity;
reconstructing the position text features corresponding to the group of training samples to obtain a reconstructed text, and determining a text reconstruction loss value between the reconstructed text and the position text of the group of training samples;
reconstructing the picture features corresponding to the group of training samples to obtain a reconstructed picture, and determining a picture reconstruction loss value between the reconstructed picture and the sample picture included in the group of training samples;
and performing joint training on the text encoder and the picture encoder based on the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples.
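Again purely as an illustration, one way the per-sample losses of claim 2 might be written, assuming the features from the sketch above, element-wise addition as the (unspecified) combination operator, a cosine-similarity-based alignment loss, and the text decoder and picture decoder of claim 3 as placeholder modules:

    import torch.nn.functional as F

    def sample_losses(sample_text_feat, position_text_feat, picture_feat,
                      position_text_tokens, picture, text_decoder, picture_decoder):
        # Combine sample text features and position text features into combined text features.
        combined_text_feat = sample_text_feat + position_text_feat

        # Alignment loss derived from the similarity between the combined text
        # features and the picture features (here, 1 minus cosine similarity).
        similarity = F.cosine_similarity(combined_text_feat, picture_feat, dim=-1)
        align_loss = (1.0 - similarity).mean()

        # Text reconstruction loss between the reconstructed text and the position text.
        token_logits = text_decoder(position_text_feat)      # (batch, seq_len, vocab)
        text_rec_loss = F.cross_entropy(token_logits.flatten(0, 1),
                                        position_text_tokens.flatten())

        # Picture reconstruction loss between the reconstructed picture and the sample picture.
        reconstructed_picture = picture_decoder(picture_feat)
        pic_rec_loss = F.mse_loss(reconstructed_picture, picture)

        return align_loss, text_rec_loss, pic_rec_loss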
3. The method of claim 2, wherein the reconstructing the position text features corresponding to the group of training samples to obtain a reconstructed text comprises:
decoding the position text features corresponding to the group of training samples by using a text decoder to obtain the reconstructed text;
and wherein the reconstructing the picture features corresponding to the group of training samples to obtain a reconstructed picture comprises:
decoding the picture features corresponding to the group of training samples by using a picture decoder to obtain the reconstructed picture.
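For completeness, a rough idea of what such decoders could look like, again only as a sketch: the layer sizes, sequence length and vocabulary size below are arbitrary placeholders, and the real decoder architectures are not fixed by the claims.

    import torch.nn as nn

    class TextDecoder(nn.Module):
        # Maps a position text feature back to per-token vocabulary logits.
        def __init__(self, feat_dim=512, seq_len=16, vocab_size=30000):
            super().__init__()
            self.seq_len, self.vocab_size = seq_len, vocab_size
            self.proj = nn.Linear(feat_dim, seq_len * vocab_size)

        def forward(self, feat):
            return self.proj(feat).view(-1, self.seq_len, self.vocab_size)

    class PictureDecoder(nn.Module):
        # Maps a picture feature back to a small reconstructed image.
        def __init__(self, feat_dim=512, channels=3, size=32):
            super().__init__()
            self.channels, self.size = channels, size
            self.proj = nn.Linear(feat_dim, channels * size * size)

        def forward(self, feat):
            return self.proj(feat).view(-1, self.channels, self.size, self.size)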
4. The method of claim 3, wherein the performing joint training on the text encoder and the picture encoder based on the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples comprises:
summing the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples to obtain a total loss value;
adjusting parameters of the picture decoder and the picture encoder by using the total loss value;
adjusting parameters of the text decoder and the text encoder by using the total loss value;
and if the text encoder and the picture encoder have converged, determining that training is completed; otherwise, performing a next round of training.
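A sketch of the joint update of claim 4, assuming the per-sample losses above are PyTorch tensors and that the text encoder/decoder and picture encoder/decoder parameters are held by two standard optimizers; the convergence test is left to the caller, since the claim does not prescribe one.

    def joint_training_round(per_sample_losses, text_optimizer, picture_optimizer):
        # per_sample_losses: one (align_loss, text_rec_loss, pic_rec_loss) tuple
        # per group of training samples.
        total_loss = sum(align + text_rec + pic_rec
                         for align, text_rec, pic_rec in per_sample_losses)

        text_optimizer.zero_grad()
        picture_optimizer.zero_grad()
        total_loss.backward()
        text_optimizer.step()     # adjusts text encoder and text decoder parameters
        picture_optimizer.step()  # adjusts picture encoder and picture decoder parameters
        return total_loss.item()

Rounds would then be repeated until the text encoder and picture encoder converge, for example until the total loss stops decreasing by more than a chosen tolerance.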
5. The method of any one of claims 1-4, wherein the generating a position text describing the position of the target object in the sample picture comprises:
dividing the sample picture into a preset number of sub-pictures, wherein there is no intersection between any two sub-pictures, and the union of all the sub-pictures is the sample picture;
numbering each sub-picture in turn;
and for each sub-picture, if the proportion of the target object in the sub-picture is larger than a preset proportion, generating the position text based on the word describing the target object and the number of the sub-picture.
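One possible reading of this claim, sketched with NumPy: the sample picture is split into a grid of non-overlapping sub-pictures, the sub-pictures are numbered in order, and a position text is emitted for every sub-picture in which the target object occupies more than the preset proportion. The grid size, threshold, use of a binary object mask and the exact phrasing of the position text are all illustrative assumptions.

    import numpy as np

    def generate_position_text(object_mask, object_word, grid=3, min_ratio=0.3):
        # object_mask: HxW boolean array marking pixels of the target object.
        h, w = object_mask.shape
        texts, number = [], 0
        for row in range(grid):
            for col in range(grid):
                number += 1  # number each sub-picture in turn
                sub = object_mask[row * h // grid:(row + 1) * h // grid,
                                  col * w // grid:(col + 1) * w // grid]
                # Proportion of this sub-picture covered by the target object.
                if sub.size and sub.mean() > min_ratio:
                    texts.append(f"{object_word} in region {number}")
        return "; ".join(texts)

For instance, generate_position_text(mask, "dog") might return "dog in region 5" when the object sits in the centre cell of a 3x3 grid.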
7. A picture and text matching method, comprising:
acquiring a picture and a text to be matched;
extracting features of the picture by using a picture encoder to obtain picture features, wherein the picture encoder is obtained by training with the method of any one of claims 1-5;
extracting features of the text by using a text encoder to obtain text features, wherein the text encoder is obtained by training with the method of any one of claims 1-5;
and determining a matching result between the picture and the text based on the picture features and the text features.
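A matching step in the spirit of this claim, assuming the two encoders trained as above and a thresholded cosine similarity as the matching criterion; the claim itself does not prescribe the similarity measure or the threshold.

    import torch
    import torch.nn.functional as F

    def match_picture_and_text(picture, text_tokens, picture_encoder, text_encoder,
                               threshold=0.5):
        with torch.no_grad():
            picture_feat = F.normalize(picture_encoder(picture), dim=-1)
            text_feat = F.normalize(text_encoder(text_tokens), dim=-1)
        similarity = (picture_feat * text_feat).sum(dim=-1)
        return similarity, similarity > threshold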
7. A graphic encoder training apparatus comprising:
an acquisition module, which is used for acquiring a plurality of groups of training samples, wherein each group of training samples comprises a sample picture and a sample text, the sample text is used for describing a target object in the sample picture, and the type of the target object comprises a person or an object;
a generation module, which is used for identifying a target object in the sample picture included in each group of training samples acquired by the acquisition module, and for generating a position text describing the position of the target object in the sample picture;
a training module, which is used for performing joint training on a text encoder and a picture encoder based on the plurality of groups of training samples and the position texts of the plurality of groups of training samples, wherein the text encoder is used for extracting text features, and the picture encoder is used for extracting picture features;
the training module is specifically configured to:
for each group of training samples, encoding the sample text included in the group of training samples by using the text encoder to obtain sample text features;
encoding the position text of the group of training samples by using the text encoder to obtain position text features, hiding a key area in the sample picture included in the group of training samples, and encoding the sample picture with the key area hidden by using the picture encoder to obtain picture features; or, encoding the sample picture included in the group of training samples by using the picture encoder to obtain picture features, hiding keywords in the position text of the group of training samples, and encoding the position text with the keywords hidden by using the text encoder to obtain position text features; wherein the key area is an area comprising the target object, and the keywords are words describing the target object or words describing the position of the target object in the sample picture;
and performing joint training on the text encoder and the picture encoder based on sample text features, position text features and picture features corresponding to each group of training samples.
8. The apparatus of claim 7, wherein the training module is specifically configured to:
combining sample text features and position text features corresponding to each group of training samples to obtain combined text features, and determining similarity between the combined text features and picture features corresponding to the group of training samples;
determining an alignment loss value for the group of training samples based on the similarity;
reconstructing the position text features corresponding to the group of training samples to obtain a reconstructed text, and determining a text reconstruction loss value between the reconstructed text and the position text of the group of training samples;
reconstructing the picture features corresponding to the group of training samples to obtain a reconstructed picture, and determining a picture reconstruction loss value between the reconstructed picture and the sample picture included in the group of training samples;
and performing joint training on the text encoder and the picture encoder based on the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples.
9. The apparatus of claim 8, wherein the training module is specifically configured to:
decoding the position text features corresponding to the group of training samples by using a text decoder to obtain the reconstructed text;
the training module is specifically configured to:
and decoding the picture features corresponding to the group of training samples by using a picture decoder to obtain the reconstructed picture.
10. The apparatus of claim 9, wherein the training module is specifically configured to:
summing the alignment loss value, the text reconstruction loss value and the picture reconstruction loss value of each group of training samples to obtain a total loss value;
adjusting parameters of the picture decoder and the picture encoder by using the total loss value;
adjusting parameters of the text decoder and the text encoder by using the total loss value;
and if the text encoder and the picture encoder have converged, determining that training is completed; otherwise, performing a next round of training.
11. The apparatus according to any one of claims 7-10, wherein the generation module is specifically configured to:
dividing the sample picture into a preset number of sub-pictures, wherein there is no intersection between any two sub-pictures, and the union of all the sub-pictures is the sample picture;
numbering each sub-picture in turn;
and for each sub-picture, if the proportion of the target object in the sub-picture is larger than a preset proportion, generating the position text based on the word describing the target object and the number of the sub-picture.
12. A picture and text matching apparatus, comprising:
an acquisition module, which is used for acquiring a picture and a text to be matched;
a feature extraction module, which is used for performing feature extraction on the picture by using a picture encoder to obtain picture features, wherein the picture encoder is obtained by training with the method according to any one of claims 1-5;
the feature extraction module is further used for extracting features of the text by using a text encoder to obtain text features, wherein the text encoder is obtained by training with the method of any one of claims 1-5;
and a matching module, which is used for determining a matching result between the picture and the text based on the picture features and the text features extracted by the feature extraction module.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5 or 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5 or 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-5 or 6.
CN202310342377.4A 2023-03-31 2023-03-31 Graphic encoder training method, graphic matching method and device Active CN116383428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310342377.4A CN116383428B (en) 2023-03-31 2023-03-31 Graphic encoder training method, graphic matching method and device

Publications (2)

Publication Number Publication Date
CN116383428A CN116383428A (en) 2023-07-04
CN116383428B true CN116383428B (en) 2024-04-05

Family

ID=86970582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310342377.4A Active CN116383428B (en) 2023-03-31 2023-03-31 Graphic encoder training method, graphic matching method and device

Country Status (1)

Country Link
CN (1) CN116383428B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933802A (en) * 2019-03-25 2019-06-25 腾讯科技(深圳)有限公司 Picture and text matching process, device and storage medium
CN111191715A (en) * 2019-12-27 2020-05-22 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN114445831A (en) * 2022-01-14 2022-05-06 北京百度网讯科技有限公司 Image-text pre-training method, device, equipment and storage medium
CN114723996A (en) * 2022-04-20 2022-07-08 平安科技(深圳)有限公司 Model training method, image description generation method and device, equipment and medium
CN115100472A (en) * 2022-06-20 2022-09-23 北京达佳互联信息技术有限公司 Training method and device for display object recognition model and electronic equipment
CN115331150A (en) * 2022-08-29 2022-11-11 北京达佳互联信息技术有限公司 Image recognition method, image recognition device, electronic equipment and storage medium
CN115620304A (en) * 2022-10-11 2023-01-17 浙江大华技术股份有限公司 Training method of text recognition model, text recognition method and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215223B (en) * 2020-10-16 2024-03-19 清华大学 Multidirectional scene character recognition method and system based on multi-element attention mechanism

Also Published As

Publication number Publication date
CN116383428A (en) 2023-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant