CN114022690A - Instance matching method, device, equipment and storage medium - Google Patents

Instance matching method, device, equipment and storage medium

Info

Publication number
CN114022690A
Authority
CN
China
Prior art keywords: text, image, images, probability, semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111176314.3A
Other languages
Chinese (zh)
Inventor
刘艺 (Liu Yi)
秦伟 (Qin Wei)
李蒙蒙 (Li Mengmeng)
郑奇斌 (Zheng Qibin)
刁兴春 (Diao Xingchun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Big Data Advanced Technology Research Institute
Original Assignee
Beijing Big Data Advanced Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Big Data Advanced Technology Research Institute
Priority to CN202111176314.3A
Publication of CN114022690A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application relates to the technical field of data processing, and in particular to an instance matching method, apparatus, device and storage medium, aiming to improve the accuracy of instance matching tasks. The method comprises the following steps: inputting texts and images to be matched into a cyclic generation network; extracting features from each text through a text embedding network to obtain text features, and inputting the text features into a text image generation network; generating a semantic image through the text image generation network; discriminating among the semantic image, the real images and the erroneous images that do not match the text, to obtain the probability that each image is a real image and the conditional probability of each image; inputting the semantic image, the original images and the texts into a text reconstruction network to obtain the conditional probability of each text; and outputting a matching result according to the conditional probabilities of the images and of the texts.

Description

Instance matching method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to an instance matching method, an instance matching device, instance matching equipment and a storage medium.
Background
Instance matching is the task of matching data of different modalities, for example the two modalities of an image and a text describing that image; instance matching tasks are applied in many settings, for example electronic books and web pages. Because data of different modalities describe information in different ways, the information contained in samples of different modalities is not completely symmetric, and the information asymmetry caused by modality differences affects the usability of the data in tasks such as cross-modal instance matching. In the prior art, features are generally extracted from the text and the image, and the resulting text and image features are mapped into a common space for matching.
The prior art pays much attention to mining the relevance among modal data and overcoming the semantic gap between modalities, but ignores the problem of information asymmetry among modal data, so that erroneous results easily occur in instance matching tasks.
Disclosure of Invention
The embodiment of the application provides an instance matching method, apparatus, device and storage medium, aiming to improve the accuracy of instance matching tasks.
A first aspect of an embodiment of the present application provides an instance matching method, where the method includes:
inputting a plurality of texts and a plurality of images to be matched into a cyclic generation network;
extracting features of each text in the plurality of texts through a text embedding network in the cyclic generation network to obtain text features corresponding to the texts, and inputting the text features into a text image generation network in the cyclic generation network;
generating a semantic image according to the text features through the text image generation network;
discriminating among the semantic image, a real image in the plurality of images and an erroneous image in the plurality of images that does not match the text, to obtain the probability that each image in the plurality of images is a real image and the conditional probability of each image in the plurality of images;
inputting the semantic image, the plurality of images and the plurality of texts into a text reconstruction network in the cyclic generation network to obtain a conditional probability of each text in the plurality of texts;
and outputting a matching result according to the conditional probability of each image in the plurality of images and the conditional probability of each text in the plurality of texts.
Optionally, generating, by the text image generation network, a semantic image according to the text feature includes:
inputting the text features into a generator in the image generation network;
generating, by the generator, the semantic image from the text features.
Optionally, the discriminating the semantic image, a real image in the plurality of images, and an error image in the plurality of images that does not match the text to obtain a probability that each image in the plurality of images is a real image and a conditional probability of each image in the plurality of images includes:
inputting the semantic image, a real image of the plurality of images, and an erroneous image of the plurality of images that does not match the text into a discriminator in the text image generation network;
obtaining, by the discriminator, a probability that each of the plurality of images is a true image and a conditional probability of each of the plurality of images.
Optionally, the semantic image and the plurality of texts are input into a text reconstruction network in the cyclic generation network to obtain a conditional probability of each text in the plurality of texts, and the method further includes:
generating a text description corresponding to the semantic image according to the semantic image and the text;
calculating the probability that the semantic image and the text description are matched.
Optionally, reconstructing a text description of the semantic image from the semantic image and the text includes:
extracting image features of the semantic image through the text reconstruction model, and mapping the image features into a text representation space;
and obtaining a text description corresponding to the semantic image according to the mapped image characteristics and the text.
Optionally, outputting a matching result according to the conditional probability of each of the plurality of images and the conditional probability of each of the plurality of texts, including:
presetting a matching probability threshold;
outputting images and texts of which the probability value in the conditional probability of each image in the plurality of images is higher than the matching probability threshold value as matching pairs;
outputting the text and the image with the probability value higher than the matching probability threshold value in the conditional probability of each text in the plurality of texts as a matching pair.
A second aspect of the embodiments of the present application provides an example matching apparatus, including:
the data input module is used for inputting a plurality of texts and a plurality of images to be matched into a cyclic generation network;
the text embedding module is used for extracting the characteristics of each text in the plurality of texts through a text embedding network in the cyclic generation network to obtain the text characteristics corresponding to the texts, and inputting the text characteristics into a text image generation network in the cyclic generation network;
the image generation module is used for generating a semantic image from the text features through the text image generation network;
an image discrimination module, configured to discriminate the semantic image, a real image in the multiple images, and an error image in the multiple images that is not matched with the text, to obtain a probability that each image in the multiple images is a real image and a conditional probability of each image in the multiple images;
a text reconstruction module, configured to input the semantic image, the plurality of images, and the plurality of texts into a text reconstruction network in the cyclic generation network, so as to obtain a conditional probability of each text in the plurality of texts;
and the result output module is used for outputting a matching result according to the conditional probability of each image in the plurality of images and the conditional probability of each text in the plurality of texts.
Optionally, the image generation module comprises:
a first image generation sub-module for inputting the text features into a generator in the image generation network;
and the second image generation submodule is used for generating the semantic image according to the text features through the generator.
Optionally, the image discriminating module includes:
an image input sub-module for inputting the semantic image, a real image of the plurality of images, and an error image of the plurality of images that does not match the text into a discriminator in the text image generation network;
a first probability calculation sub-module for obtaining, by the discriminator, a probability that each of the plurality of images is a true image and a conditional probability of each of the plurality of images.
Optionally, the text reconstruction module further includes:
the text description generation submodule is used for generating a text description corresponding to the semantic image according to the semantic image and the text;
and the second probability calculation submodule is used for calculating the probability of matching the semantic image with the text description.
Optionally, the text description generation sub-module includes:
the feature extraction submodule is used for extracting the image features of the semantic image through the text reconstruction model and mapping the image features to a text representation space;
and the text description obtaining submodule is used for obtaining the text description corresponding to the semantic image according to the mapped image characteristics and the text.
Optionally, the result output sub-module includes:
the threshold setting submodule is used for presetting a matching probability threshold;
a first result output sub-module configured to output, as a matching pair, an image and a text for which a probability value of the conditional probability of each of the plurality of images is higher than the matching probability threshold;
and the second result output submodule is used for outputting the texts and the images of which the probability values in the conditional probability of each text in the plurality of texts are higher than the matching probability threshold value as matching pairs.
A third aspect of embodiments of the present application provides a readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the method according to the first aspect of the present application.
A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.
By adopting the instance matching method provided by the application, a plurality of texts and a plurality of images to be matched are input into a cyclic generation network; features are extracted from each text through a text embedding network in the cyclic generation network to obtain the text features corresponding to the text, and the text features are input into a text image generation network in the cyclic generation network; a semantic image is generated from the text features through the text image generation network; the semantic image, the real images in the plurality of images and the erroneous images in the plurality of images that do not match the text are discriminated to obtain the probability that each image is a real image and the conditional probability of each image; the semantic image, the plurality of images and the plurality of texts are input into a text reconstruction network in the cyclic generation network to obtain the conditional probability of each text; and a matching result is output according to the conditional probability of each image and the conditional probability of each text. The method performs instance matching with a cyclic generation network comprising a text embedding network, a text image generation network and a text reconstruction network. The text image generation network generates a semantic image from the input text and, from the semantic image and the input images, accurately estimates the conditional probability of each image; the text reconstruction network obtains the conditional probability of each text from the input images and texts. Because the text image generation network both generates images from text and discriminates the generated images, and its discrimination capability and image generation capability depend on each other, it can obtain the conditional probabilities of the images accurately; the conditional probabilities of the texts are obtained through the text reconstruction model; and an accurate matching result is thus obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an example matching method presented in an embodiment of the present application;
FIG. 2 is a schematic diagram of a cycle generation network according to an embodiment of the present application;
fig. 3 is a schematic diagram of an example matching apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are within the scope of the present disclosure.
Referring to fig. 1, fig. 1 is a flowchart of an example matching method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
s11: a plurality of texts and a plurality of images to be matched are input into a cycle generation network.
In this embodiment, the cyclic generation network includes a text embedding network, a text image generation network and a text reconstruction network. The network models both the process of determining a matching image from a text and the process of determining a matching text from an image, and can therefore estimate the conditional distributions in both directions.
In this embodiment, a plurality of texts and a plurality of images are input into the cyclic generation network. The texts and images can be paired one to one; in a matched text-image pair, the words of the text describe the content of the image.
For example, the text may be "an apple tree", and the image paired with it shows an apple tree.
S12: and extracting the characteristics of each text in the plurality of texts through a text embedding network in the cyclic generation network to obtain text characteristics corresponding to the texts, and inputting the text characteristics into a text image generation network in the cyclic generation network.
In this embodiment, the text embedding network in the cyclic generation network is used to extract text features: given an input text T, the corresponding word embedding t_w and sentence embedding t_s are extracted and transmitted to the text image generation network.
For example, the text embedding network may use an RNN (recurrent neural network) in order to capture both the local and the global information of the text. A recurrent neural network encoder encodes the original text T = {w_1, …, w_L} (where w_i denotes the word at position i and L is the sentence length) into the word embedding t_w and the sentence embedding t_s, as shown in Equation 1:

t_s, t_w = RNN(T)   (1)

where the word embedding t_w is the hidden-state representation of each word in T, and the sentence embedding t_s is given by the final output of the RNN encoder.
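For illustration only, a minimal PyTorch-style sketch of such a text embedding network follows; the patent does not fix an architecture, so the GRU choice, vocabulary size and dimensions below are assumptions rather than limitations of the embodiment:

```python
import torch
import torch.nn as nn

class TextEmbeddingNet(nn.Module):
    """Sketch of the text embedding network: returns the sentence embedding t_s
    (final RNN output) and the word embeddings t_w (per-step hidden states)."""
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):          # token_ids: (batch, L)
        x = self.embed(token_ids)          # (batch, L, emb_dim)
        t_w, h_n = self.rnn(x)             # t_w: (batch, L, hidden_dim)
        t_s = h_n[-1]                      # (batch, hidden_dim)
        return t_s, t_w                    # mirrors Equation 1: t_s, t_w = RNN(T)
```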
S13: generating a semantic image from the text features through the text image generation network.
In this embodiment, the text image generation network is a conditional generative adversarial network, and the semantic image is the image it generates from the input word embedding and sentence embedding; this image reflects the content of the text description but differs in some details from the original image matched with the text.
In this embodiment, the text features are input into the text image generation network to obtain the semantic image. In a classical generative adversarial network, the content of the generated image is controlled by random noise drawn from a given distribution (usually a Gaussian distribution), so the content of the generated image cannot be controlled and is random. This embodiment therefore proposes a conditional generative adversarial network in which the input text features, i.e., the word embedding t_w and the sentence embedding t_s, control the network to generate an image that reflects the corresponding text content.
Illustratively, the entered text is "a log cabin next to a river" and the generated semantic image is a house next to a river.
S14: discriminating among the semantic image, the real images in the plurality of images and the erroneous images in the plurality of images that do not match the text, to obtain the probability that each image in the plurality of images is a real image and the conditional probability of each image in the plurality of images.
In this embodiment, the real images are the plurality of original images first input into the cyclic generation network, as opposed to the semantic images generated by the text image generation network; the erroneous images that do not match the text are the other input images that do not match the given text features. The probability that each image is a real image is the probability that it is not a newly generated semantic image, and the conditional probability of each image is the probability that the image matches a given text, i.e., p(I | T), where I denotes an image and T denotes a text.
In this embodiment, after the text image generation network generates the semantic image, the images need to be discriminated; in the discrimination process, the probability that each image is a real image and the conditional probability of each image in the plurality of images are obtained. For example, the probability that image 1 matches the text may be 50%, that image 2 matches the text 80%, and that image 3 matches the text 96%; the higher the probability value, the higher the similarity.
Illustratively, the input text is "a log cabin next to a river" and the generated semantic image is a house next to a river, though perhaps not a wooden one, while the input original image shows a wooden cabin beside a river; the semantic image therefore differs slightly from the original image.
S15: inputting the semantic image, the plurality of images and the plurality of texts into a text reconstruction network in the cyclic generation network to obtain the conditional probability of each text in the plurality of texts.
In this embodiment, the text reconstruction network reconstructs the text description T_rec of the image from the content of the generated semantic image and the information of the original text, and gives the quality of the text reconstruction (the probability that the text description matches the semantic image). In this process, the text reconstruction network obtains the conditional probability of each text in the plurality of texts, i.e., the probability that each text matches a given image, p(T | I), where T denotes a text and I denotes an image. The higher the probability value, the higher the similarity. The text reconstruction network may use an RNN.
Illustratively, in the case of image 1 determination, the probability of matching text 1 with image 1 is 60%, the probability of matching text 2 with image 1 is 80%, and the probability of matching text 3 with image 1 is 95%.
S16: outputting a matching result according to the conditional probability of each image in the plurality of images and the conditional probability of each text in the plurality of texts.
In this embodiment, the matching result consists of the matched texts and images among the texts and images input at the beginning.
After the conditional probability of each image and the conditional probability of each text are obtained, a matching result can be determined according to a certain rule.
Illustratively, suppose there are 10 images and 10 texts in one-to-one correspondence. With text 1 given, if image 3 has the highest conditional probability among the 10 images, image 3 and text 1 are paired and output as a matching result; with image 1 given, if text 5 has the highest conditional probability among the 10 texts, image 1 and text 5 are paired and output as a matching result.
When the highest-probability pairings in the two directions are not the same, they may be considered together, for example by selecting the pair with the higher probability value as the matching result. For instance, with text 5 given, suppose image 3 has a conditional probability of 98%, the highest matching probability among the 10 images; with image 6 given, suppose text 5 has a conditional probability of 95%, the highest among the 10 texts. The group with the higher matching probability is selected as the final output, i.e., image 3 is matched with text 5.
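As a plain-Python illustration of this resolution rule (the probability tables, identifiers and function name are hypothetical, not part of the patent):

```python
def resolve_match(text_id, p_img_given_text, p_text_given_img):
    """Pick a pairing when the two directional best matches disagree.

    p_img_given_text[t][i] stands for p(I_i | T_t);
    p_text_given_img[i][t] stands for p(T_t | I_i).
    """
    # Best image for this text under p(I | T).
    best_img = max(p_img_given_text[text_id], key=p_img_given_text[text_id].get)
    forward_p = p_img_given_text[text_id][best_img]

    # Best text for that image under p(T | I).
    best_txt = max(p_text_given_img[best_img], key=p_text_given_img[best_img].get)
    backward_p = p_text_given_img[best_img][best_txt]

    # If the two directions agree, or the forward pairing is stronger,
    # keep it; otherwise keep the higher-probability backward pairing.
    if best_txt == text_id or forward_p >= backward_p:
        return (text_id, best_img, forward_p)
    return (best_txt, best_img, backward_p)
```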
In this embodiment, the text and the image are matched through the cyclic generation network: the text embedding network produces the word embedding and the sentence embedding, which control the conditional generative adversarial network to generate the semantic image; the semantic image and the original images are discriminated to obtain the conditional probability of each image; and the images are input into the text reconstruction network to obtain the conditional probability of each text. Because the image generation capability and the discrimination capability of the conditional generative adversarial network depend on each other, the conditional probabilities of the images can be estimated accurately while the semantic image is generated. The text reconstruction model is similar to existing image captioning models and can determine the corresponding text from an image, so the conditional probability of each text can also be obtained accurately; once the conditional probabilities in both directions are available, the final matching result can be output.
In another embodiment of the present application, a semantic image is generated from the text features by the text image generation network:
s21: inputting the text feature into a generator in the image generation network.
S22: generating, by the generator, the semantic image from the text features.
In this embodiment, a semantic image is generated by the generator in the conditional generative adversarial network; the specific generation steps are as follows:
First, a latent variable z that depends on the sentence embedding t_s is introduced, as shown in Equation 2:

z = F_0(μ + σ ⊙ ε),  ε ~ N(0, 1)   (2)

where μ and σ are the mean and standard deviation and F_0 is a compression function. The variable z is introduced for two reasons: first, directly concatenating the text embedding with random noise is unsmooth, and letting the text embedding control the distribution of the random noise reduces this unsmoothness; second, sampling from the variable z reduces the impact of insufficient training data.
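If Equation 2 is read as StackGAN-style conditioning augmentation, which the mean/standard-deviation pair and the KL term in Equation 8 suggest (an assumption for illustration, not a statement of the patent), a sketch could be:

```python
import torch
import torch.nn as nn

class CondAugment(nn.Module):
    """Sketch of Equation 2: sample z from a distribution conditioned on t_s."""
    def __init__(self, sent_dim=256, z_dim=100):
        super().__init__()
        # Assumed compression function F_0: maps t_s to a mean and log-variance.
        self.fc = nn.Linear(sent_dim, z_dim * 2)

    def forward(self, t_s):
        mu, logvar = self.fc(t_s).chunk(2, dim=-1)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)      # epsilon ~ N(0, 1)
        z = mu + sigma * eps               # reparameterized sample of z
        # KL(N(mu, sigma) || N(0, 1)): the smoothness term lambda_s * D_KL of Equation 8.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```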
In order to comprehensively utilize the word embedding and the sentence embedding, make the generated image consistent with the input text T both globally and locally, and improve generation quality, the generator structure of AttnGAN is adopted. The image generation part consists of k feature transformations and a generator G derived from DCGAN. Denoting the outputs of the k feature transformations as f_1, …, f_k, the generation process is:

f_1 = F_1(z, t_w, t_s)   (3)
f_i = F_i(f_{i-1}, A_i(f_{i-1}, t_w, t_s))   (4)
I_f = G(f_k)   (5)
where F_i denotes the i-th feature transform, f_i the transformed features, A_i the global-local collaborative attention model, I_f the generated image, and G a deconvolutional neural network, i.e., the generator. The attention model A_i consists of two parts, a word-level attention model A_i^w and a sentence-level attention model A_i^s. The word-level attention model imposes a local constraint on the generator by computing the inner product between the attention scores and the hidden representations of the word embedding t_w, as shown in Equation 6:

A_i^w(f_{i-1}, t_w) = soft(f_{i-1} · t̄_w) · t̄_w   (6)

where U_{i-1} is a perceptual layer, soft is the softmax function, and t̄_w = U_{i-1} t_w is a hidden representation of t_w. In addition, a global constraint is applied to the generator through the sentence embedding:

A_i^s(f_{i-1}, t_s) = f_{i-1} ⊙ (V_{i-1} t_s)   (7)

where V_{i-1} is also a perceptual layer and ⊙ denotes element-wise multiplication between vectors.
In addition, the loss function used when training the generator is:

L_G = E_{I_f ~ P_f}[log(1 - D(I_f, t_s))] + λ_s D_KL(N(μ, σ) ‖ N(0, 1)) + λ_r L_r   (8)

where P_f is the distribution of generated images, I_f is the image produced by the generator, λ_s and λ_r are adjustment factors, μ and σ are the mean and standard deviation, N(0, 1) is the standard normal distribution, D_KL is the KL divergence (Kullback-Leibler divergence), whose term enhances the smoothness of the latent manifold, L_r is the text reconstruction loss, which enhances the semantic consistency of image and text, and E denotes expectation.
In this embodiment, the semantic image is generated by the generator; generating a high-quality semantic image provides samples for the subsequent discrimination. Because the performance of the generator and that of the discriminator depend on each other, the closer the generated semantic image is to the real image, the higher the discrimination accuracy of the discriminator.
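Structurally, Equations 3-5 compose as a cascade; the sketch below only mirrors that wiring, with every submodule (F_1, the F_i, the A_i and G) left as a stand-in to be supplied, since the patent does not fix their internals:

```python
import torch.nn as nn

class CascadedGenerator(nn.Module):
    """Sketch of Equations 3-5:
    f_1 = F_1(z, t_w, t_s); f_i = F_i(f_{i-1}, A_i(f_{i-1}, t_w, t_s)); I_f = G(f_k)."""
    def __init__(self, first_stage, stages, attn_models, decoder):
        super().__init__()
        self.first_stage = first_stage           # F_1
        self.stages = nn.ModuleList(stages)      # F_2 ... F_k
        self.attn = nn.ModuleList(attn_models)   # A_2 ... A_k
        self.decoder = decoder                   # G, a deconvolutional network

    def forward(self, z, t_w, t_s):
        f = self.first_stage(z, t_w, t_s)        # Equation 3
        for F_i, A_i in zip(self.stages, self.attn):
            f = F_i(f, A_i(f, t_w, t_s))         # Equation 4
        return self.decoder(f)                   # Equation 5: the semantic image I_f
```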
In another embodiment of the present application, the discriminating the semantic image, the real image in the plurality of images, and the error image in the plurality of images that does not match the text to obtain a probability that each of the plurality of images is the real image and a conditional probability of each of the plurality of images includes:
s31: inputting the semantic image, a real image of the plurality of images, and an erroneous image of the plurality of images that does not match the text into a discriminator in the text image generation network.
S32: obtaining, by the discriminator, a probability that each of the plurality of images is a true image and a conditional probability of each of the plurality of images.
In the present embodiment, the discriminator determines the probability that each input image is a real image and the conditional probability of each image, i.e., the degree to which each image is consistent with the text semantics.
Illustratively, the discriminator consists of a convolutional neural network and two fully connected layers FC1 and FC2, which respectively give the probability that an image is real and the probability that the image is semantically consistent with the text.
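A minimal sketch of such a two-head discriminator follows; the convolutional trunk and all sizes are assumptions, since the embodiment fixes only the CNN-plus-two-FC structure:

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Sketch: a shared CNN trunk with FC1 (real/fake) and FC2 (text match)."""
    def __init__(self, sent_dim=256, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.LeakyReLU(0.2),
        )
        self.fc1 = nn.Linear(feat_dim, 1)              # p(image is real)
        self.fc2 = nn.Linear(feat_dim + sent_dim, 1)   # p(image matches text)

    def forward(self, image, t_s):
        h = self.cnn(image)
        p_real = torch.sigmoid(self.fc1(h))
        p_match = torch.sigmoid(self.fc2(torch.cat([h, t_s], dim=-1)))
        return p_real, p_match                         # the two probabilities above
```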
When the discriminator is trained, the loss function used is as shown in Equation 9:

L_D = E_{I_r ~ P_real}[log D(I_r, t_s)] + λ_f E_{I_f ~ P_f}[log(1 - D(I_f, t_s))] + λ_w E_{I_w}[log(1 - D(I_w, t_s))]   (9)

where P_real denotes the distribution of real images, I_w denotes an unmatched image, λ_f and λ_w are weighting factors, D denotes the discriminator, and E denotes expectation.
The generator G and the discriminator D are optimized through the value function V in a minimax fashion:

min_G max_D V(D, G)   (10)
In this embodiment, the input images are discriminated by the discriminator. Training continuously strengthens the discriminator's capability, and the discrimination results are fed back to the generator to adjust its parameters so that it generates higher-quality images; as the generator's image generation capability strengthens, so does the discriminator's discrimination capability, and the conditional probabilities of the images output by the trained text image generation network become accurate.
In this embodiment, because the performances of the generator and the discriminator in the generative adversarial network depend on each other, the discriminator can effectively estimate the matching probability between image and text. The specific formula is:

p(I | T) = p(D(I, t_s) = 1)   (11)

where D denotes the discriminator, I denotes an image, T denotes a text, and t_s is the sentence embedding of the input text T.
In another embodiment of the present application, the semantic image and the texts are input into a text reconstruction network in the cycle generation network, and a conditional probability of each text in the texts is obtained, where the method further includes:
s41: and generating a text description corresponding to the semantic image according to the semantic image and the text.
In this embodiment, in order to enhance the semantic consistency between the semantic image and the text, the text needs to be reconstructed from the content of the image, and the reconstruction loss is fed back to the generator. The specific steps are as follows:
s41-1: and extracting image features of the semantic image through the text reconstruction model, and mapping the image features into a text representation space.
In this embodiment, the image features may be extracted by a convolutional neural network (CNN) and then mapped into the text representation space, and the vector mapped into the text representation space is used as the initial input of a recurrent neural network (RNN). The specific expression is:

x_{-1} = W_e · CNN(I)

where x_{-1} is the initial input of the recurrent neural network, W_e is a linear mapping matrix, and I is the input image.
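A sketch of this mapping, with an assumed ResNet-18 backbone standing in for the unspecified CNN:

```python
import torch.nn as nn
import torchvision.models as models

class ImageToTextSpace(nn.Module):
    """Sketch: extract CNN features of the image and map them into the text
    representation space as the RNN's initial input x_{-1} = W_e * CNN(I)."""
    def __init__(self, text_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.W_e = nn.Linear(512, text_dim)     # the linear mapping matrix W_e

    def forward(self, image):                   # image: (batch, 3, H, W)
        feat = self.cnn(image).flatten(1)       # (batch, 512)
        return self.W_e(feat)                   # x_{-1}, fed to the RNN first
```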
S41-2: and obtaining a text description corresponding to the semantic image according to the mapped image characteristics and the text.
The mapped image features are input into the recurrent neural network (RNN) together with the reference text to learn the distribution of words in the sentence, as shown in Equation 12:

p_θ(w_t | w_{1:t-1})   (12)

i.e., the probability p_θ that, given that the first t-1 positions of a sentence are w_{1:t-1}, the word at position t is w_t. The distribution of words is generally learned by minimizing the cross-entropy loss, as shown in Equation 13:

L_XE = -Σ_{t=1}^{L} log p_θ(w_t | w_{1:t-1})   (13)

where L is the sentence length.
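Equation 13 is the standard teacher-forced cross-entropy over word positions; a sketch with assumed tensor shapes:

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, target_ids):
    """L_XE = -sum_t log p_theta(w_t | w_{1:t-1})  (Equation 13).

    logits: (batch, L, vocab) word distributions from the reconstruction RNN;
    target_ids: (batch, L) the reference caption.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    picked = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (batch, L)
    return -picked.sum(dim=1).mean()
```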
However, the cross-entropy loss in Equation 13 does not directly reflect the quality of the generated text. In natural language processing tasks, BLEU, ROUGE and CIDEr are generally used as evaluation metrics, but these metrics are non-differentiable and therefore difficult to use directly as training losses, and the inconsistency between the training loss and the evaluation metric harms the quality of text generation. Some studies model this problem as a reinforcement learning problem, and such models are generally trained by minimizing the negative expected reward, as shown in Equation 15:

L_RL = -E_{ŵ ~ p_θ}[r(ŵ)]   (15)

where the reward r is obtained by comparing the generated text ŵ with the reference text (for example, using the BLEU metric mentioned above), and ŵ_t is the word sampled from the model at time t.
To sum up, the text reconstruction loss L_r is composed of the cross-entropy loss and the reinforcement learning loss, as shown in Equation 16:

L_r = L_XE + L_RL   (16)

where L_XE is the cross-entropy loss and L_RL is the reinforcement learning loss.
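A sketch of Equation 16 with the reinforcement learning term written as a self-critical policy-gradient surrogate; the baseline (reward of a greedily decoded caption) is an assumption in the spirit of Equation 15, not a detail fixed by the patent:

```python
def reconstruction_loss(xe_loss, sampled_log_prob, sampled_reward, baseline_reward):
    """L_r = L_XE + L_RL  (Equation 16); all arguments are torch tensors.

    sampled_log_prob: (batch,) summed log p_theta of a sampled caption;
    sampled_reward / baseline_reward: (batch,) e.g. BLEU of the sampled
    caption vs. that of a greedily decoded one (the assumed baseline).
    """
    advantage = (sampled_reward - baseline_reward).detach()
    rl_loss = -(advantage * sampled_log_prob).mean()   # surrogate for Equation 15
    return xe_loss + rl_loss
```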
In the process of training the text reconstruction network, the text description corresponding to the semantic image is obtained from the mapped image features and the text. The text description and the image features can be used as another pair of data to train the whole cyclic generation network, and the text reconstruction loss can be used to adjust the parameters of the generator, continuously improving the quality of the semantic images it generates.
In this embodiment, the formula by which the text reconstruction model calculates the conditional probability of a text is:

p(T | I) = ∏_{t=1}^{L} p_θ(w_t | w_{1:t-1}, I)

where w_t denotes the word at position t in the sentence.
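In practice this product is evaluated as a sum of log-probabilities; a sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def text_conditional_log_prob(logits, token_ids):
    """log p(T | I) = sum_t log p_theta(w_t | w_{1:t-1}, I).

    logits: (L, vocab), the reconstruction RNN's per-position outputs
    conditioned on the image; token_ids: (L,), the candidate text T.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs[torch.arange(token_ids.numel()), token_ids].sum()
```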
S42: calculating the probability that the semantic image and the text description are matched.
In this embodiment, the text reconstruction network may also give the quality of the semantic reconstruction, i.e., the probability that a semantic image matches its corresponding text description, which helps improve the semantic consistency of the generated image and the text.
In another embodiment of the present application, outputting a matching result according to the conditional probability of each of the plurality of images and the conditional probability of each of the plurality of texts comprises:
s51: a matching probability threshold is preset.
In this embodiment, in order to filter out matched images and texts, a matching probability threshold may be preset; a text and an image whose matching probability value is higher than the threshold are regarded as a matching pair.
Illustratively, if the threshold matching probability is set to 90%, the text and the image with the matching probability higher than 90% are regarded as a matching pair.
S52: and outputting the image and the text of which the probability value in the conditional probability of each image in the plurality of images is higher than the matching probability threshold value as a matching pair.
S53: outputting the text and the image with the probability value higher than the matching probability threshold in the conditional probability of each text in the plurality of texts as a matching pair.
In this embodiment, the matching degree of a text and an image, i.e., the matching probability value, can be obtained from the probabilities in the two directions, and texts and images whose text conditional probability and image conditional probability are higher than the matching probability threshold are output as matching pairs.
Illustratively, with the matching probability threshold set to 90%: the matching probability of text 1 and image 1 is 95%, so text 1 and image 1 are regarded as a matching pair; the matching probability of text 2 and image 2 is 80%, lower than the threshold, so they are not output as a matching pair.
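A plain-Python sketch of steps S51-S53; the probability tables and identifiers are hypothetical:

```python
def threshold_matches(p_img_given_text, p_text_given_img, threshold=0.9):
    """Output (text, image) pairs whose conditional probability in either
    direction exceeds the preset matching probability threshold."""
    pairs = set()
    for t, row in p_img_given_text.items():        # p(I | T), step S52
        pairs |= {(t, i) for i, p in row.items() if p > threshold}
    for i, row in p_text_given_img.items():        # p(T | I), step S53
        pairs |= {(t, i) for t, p in row.items() if p > threshold}
    return sorted(pairs)

# Example from the text: 0.95 passes the 0.9 threshold, 0.80 does not.
print(threshold_matches({"text1": {"img1": 0.95}}, {"img2": {"text2": 0.80}}))
# -> [('text1', 'img1')]
```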
In another embodiment of the present application, as shown in fig. 2, fig. 2 is a schematic structural diagram of the cyclic generation network proposed in an embodiment of the present application. As shown in fig. 2, the cyclic generation network includes a text embedding network, a text image generation network and a text reconstruction network. The text embedding network mainly consists of a recurrent neural network (RNN); a text T is input into it to obtain the word embedding t_w and the sentence embedding t_s. The text image generation network comprises a generator and a discriminator, and the text reconstruction network also mainly consists of an RNN. In the training stage of the cyclic generation network, the generator receives the word embedding t_w and the sentence embedding t_s and generates a semantic image I_f. The semantic image I_f, the real image I_r and the image I_w that does not match the text are then respectively input into the discriminator D, which gives, on the one hand, the probability that each image is a real image and, on the other hand, the conditional probability p(I | T) of each image. The semantic image I_f, the real image I_r, the unmatched image I_w and the text T are then input into the text reconstruction network, which reconstructs the text description T_rec of the semantic image from I_f and the text while calculating the conditional probability p(T | I) of the text; the reconstructed text description T_rec and the semantic image I_f can be used in the next round of training, providing samples for training.
Throughout the training of the cyclic generation network, the quality of the semantic images generated by the generator keeps increasing, the performance of the discriminator keeps strengthening, and the conditional probabilities of the obtained images become more and more accurate; new samples are then generated from the high-quality semantic images and the reconstructed text descriptions to further train the cyclic generation network. After the whole cyclic generation network is trained, a text and images to be matched are input; the network generates a high-quality semantic image from the input text and reconstructs the corresponding text description, obtaining in the process the conditional probability of each input image and of each input text, and finally outputs the image-text matching result.
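Putting the pieces together, one training iteration of the cyclic generation network might be organized as follows; every module, method and optimizer name here is an assumed stand-in, since the patent describes the data flow but not an API:

```python
def train_step(batch, text_encoder, cond_aug, generator, discriminator,
               reconstructor, opt_g, opt_d):
    """Sketch of one cyclic-generation training iteration (names assumed)."""
    text_ids, real_img, mismatched_img = batch

    t_s, t_w = text_encoder(text_ids)       # text embedding network
    z, kl = cond_aug(t_s)                   # Equation 2
    fake_img = generator(z, t_w, t_s)       # semantic image I_f

    # Discriminator step: real image I_r, generated I_f, mismatched I_w.
    opt_d.zero_grad()
    d_loss = discriminator.loss(real_img, fake_img.detach(), mismatched_img, t_s)
    d_loss.backward()
    opt_d.step()

    # Generator step: adversarial term, KL smoothness term, and L_r.
    opt_g.zero_grad()
    g_loss = (discriminator.generator_loss(fake_img, t_s)
              + kl
              + reconstructor.loss(fake_img, text_ids))   # text reconstruction loss L_r
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```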
Based on the same inventive concept, an embodiment of the present application provides an instance matching apparatus. Referring to fig. 3, fig. 3 is a schematic diagram of an example matching apparatus 300 according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
a data input module 301, configured to input a plurality of texts and a plurality of images to be matched into a cyclic generation network;
a text embedding module 302, configured to perform feature extraction on each text in the multiple texts through a text embedding network in the cyclic generation network to obtain a text feature corresponding to the text, and input the text feature into a text image generation network in the cyclic generation network;
an image generation module 303, configured to generate a semantic image from the text features through the text image generation network;
an image discrimination module 304, configured to discriminate the semantic image, a real image in the multiple images, and an error image in the multiple images that is not matched with the text, so as to obtain a probability that each of the multiple images is a real image and a conditional probability of each of the multiple images;
a text reconstruction module 305, configured to input the semantic image, the plurality of images, and the plurality of texts into a text reconstruction network in the cyclic generation network, so as to obtain a conditional probability of each text in the plurality of texts;
a result output module 306, configured to output a matching result according to the conditional probability of each of the plurality of images and the conditional probability of each of the plurality of texts.
Optionally, the image generation module comprises:
a first image generation sub-module for inputting the text features into a generator in the image generation network;
and the second image generation submodule is used for generating the semantic image according to the text features through the generator.
Optionally, the image discriminating module includes:
an image input sub-module for inputting the semantic image, a real image of the plurality of images, and an error image of the plurality of images that does not match the text into a discriminator in the text image generation network;
a first probability calculation sub-module for obtaining, by the discriminator, a probability that each of the plurality of images is a true image and a conditional probability of each of the plurality of images.
Optionally, the text reconstruction module further includes:
the text description generation submodule is used for generating a text description corresponding to the semantic image according to the semantic image and the text;
and the second probability calculation submodule is used for calculating the probability of matching the semantic image with the text description.
Optionally, the text description generation sub-module includes:
the feature extraction submodule is used for extracting the image features of the semantic image through the text reconstruction model and mapping the image features to a text representation space;
and the text description obtaining submodule is used for obtaining the text description corresponding to the semantic image according to the mapped image characteristics and the text.
Optionally, the result output sub-module includes:
the threshold setting submodule is used for presetting a matching probability threshold;
a first result output sub-module configured to output, as a matching pair, an image and a text for which a probability value of the conditional probability of each of the plurality of images is higher than the matching probability threshold;
and the second result output submodule is used for outputting the texts and the images of which the probability values in the conditional probability of each text in the plurality of texts are higher than the matching probability threshold value as matching pairs.
Based on the same inventive concept, another embodiment of the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the example matching method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps in the example matching method according to any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above example matching method, apparatus, device and storage medium provided by the present application are introduced in detail, and a specific example is applied in the present application to explain the principle and implementation manner of the present application, and the description of the above embodiment is only used to help understand the method and core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (9)

1. An instance matching method, the method comprising:
inputting a plurality of texts and a plurality of images to be matched into a cyclic generation network;
extracting features of each text in the plurality of texts through a text embedding network in the cyclic generation network to obtain text features corresponding to the texts, and inputting the text features into a text image generation network in the cyclic generation network;
generating a semantic image according to the text features through the text image generation network;
discriminating among the semantic image, a real image in the plurality of images and an erroneous image in the plurality of images that does not match the text, to obtain the probability that each image in the plurality of images is a real image and the conditional probability of each image in the plurality of images;
inputting the semantic image, the plurality of images and the plurality of texts into a text reconstruction network in the cycle generation network to obtain a conditional probability of each text in the plurality of texts;
and outputting a matching result according to the conditional probability of each image in the plurality of images and the conditional probability of each text in the plurality of texts.
2. The method of claim 1, wherein generating, by the text image generation network, a semantic image from the text features comprises:
inputting the text features into a generator in the image generation network;
generating, by the generator, the semantic image from the text features.
3. The method of claim 1, wherein the discriminating between the semantic image, a real image of the plurality of images, and an erroneous image of the plurality of images that does not match the text to obtain a probability that each image of the plurality of images is a real image and a conditional probability for each image of the plurality of images comprises:
inputting the semantic image, a real image of the plurality of images, and an erroneous image of the plurality of images that does not match the text into a discriminator in the text image generation network;
obtaining, by the discriminator, a probability that each of the plurality of images is a true image and a conditional probability of each of the plurality of images.
4. The method of claim 1, wherein inputting the semantic image and the plurality of texts into a text reconstruction network in the cycle generation network results in a conditional probability for each text in the plurality of texts, the method further comprising:
generating a text description corresponding to the semantic image according to the semantic image and the text;
calculating the probability that the semantic image and the text description are matched.
5. The method of claim 4, wherein reconstructing a text description of the semantic image from the semantic image and the text comprises:
extracting image features of the semantic image through the text reconstruction model, and mapping the image features into a text representation space;
and obtaining a text description corresponding to the semantic image according to the mapped image characteristics and the text.
6. The method of claim 1, wherein outputting a matching result according to the conditional probability of each of the plurality of images and the conditional probability of each of the plurality of texts comprises:
presetting a matching probability threshold;
outputting images and texts of which the probability value in the conditional probability of each image in the plurality of images is higher than the matching probability threshold value as matching pairs;
outputting the text and the image with the probability value higher than the matching probability threshold value in the conditional probability of each text in the plurality of texts as a matching pair.
7. An instance matching apparatus, the apparatus comprising:
the data input module is used for inputting a plurality of texts and a plurality of images to be matched into a cyclic generation network;
the text embedding module is used for extracting the characteristics of each text in the plurality of texts through a text embedding network in the cyclic generation network to obtain the text characteristics corresponding to the texts, and inputting the text characteristics into a text image generation network in the cyclic generation network;
the image generation module is used for generating a semantic image from the text features through the text image generation network;
an image discrimination module, configured to discriminate the semantic image, a real image in the multiple images, and an error image in the multiple images that is not matched with the text, to obtain a probability that each image in the multiple images is a real image and a conditional probability of each image in the multiple images;
a text reconstruction module, configured to input the semantic image, the plurality of images, and the plurality of texts into a text reconstruction network in the cyclic generation network, so as to obtain a conditional probability of each text in the plurality of texts;
and the result output module is used for outputting a matching result according to the conditional probability of each image in the plurality of images and the conditional probability of each text in the plurality of texts.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the computer program is executed by the processor.
CN202111176314.3A 2021-10-09 2021-10-09 Instance matching method, device, equipment and storage medium Pending CN114022690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111176314.3A CN114022690A (en) 2021-10-09 2021-10-09 Instance matching method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111176314.3A CN114022690A (en) 2021-10-09 2021-10-09 Instance matching method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114022690A true CN114022690A (en) 2022-02-08

Family

ID=80055808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111176314.3A Pending CN114022690A (en) 2021-10-09 2021-10-09 Instance matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114022690A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN113408581A (en) * 2021-05-14 2021-09-17 北京大数据先进技术研究院 Multi-mode data matching method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN113361250A (en) * 2021-05-12 2021-09-07 山东师范大学 Bidirectional text image generation method and system based on semantic consistency
CN113408581A (en) * 2021-05-14 2021-09-17 北京大数据先进技术研究院 Multi-mode data matching method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIBIN ZHENG ET AL.: "Generalization or Instantiation? Estimating the Relative Abstractness between Images and Text", ICCAI '20: Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence *

Similar Documents

Publication Publication Date Title
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
US11861307B2 (en) Request paraphrasing system, request paraphrasing model and request determining model training method, and dialogue system
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN109582952B (en) Poetry generation method, poetry generation device, computer equipment and medium
CN111027292B (en) Method and system for generating limited sampling text sequence
CN112800239A (en) Intention recognition model training method, intention recognition method and device
CN110717027B (en) Multi-round intelligent question-answering method, system, controller and medium
CN111046178A (en) Text sequence generation method and system
CN116881470A (en) Method and device for generating question-answer pairs
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
CN111814479A (en) Enterprise short form generation and model training method and device
CN110992943A (en) Semantic understanding method and system based on word confusion network
CN113849623A (en) Text visual question answering method and device
CN113705207A (en) Grammar error recognition method and device
CN111832699A (en) Computationally efficient expressive output layer for neural networks
CN116680385A (en) Dialogue question-answering method and device based on artificial intelligence, computer equipment and medium
CN114707518B (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN114022690A (en) Instance matching method, device, equipment and storage medium
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114239555A (en) Training method of keyword extraction model and related device
CN111949783A (en) Question and answer result generation method and device in knowledge base
CN115713065B (en) Method for generating problem, electronic equipment and computer readable storage medium
KR102507810B1 (en) Voice-based sales information extraction and lead recommendation method using artificial intelligence, and data analysis apparatus therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220208)