CN114677569B - Character-image pair generation method and device based on feature decoupling - Google Patents

Character-image pair generation method and device based on feature decoupling

Info

Publication number
CN114677569B
CN114677569B (application number CN202210148651.XA)
Authority
CN
China
Prior art keywords
image
character
feature
text
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210148651.XA
Other languages
Chinese (zh)
Other versions
CN114677569A (en)
Inventor
王蕊
梁栋
李太豪
裴冠雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Zhejiang Lab
Original Assignee
Institute of Information Engineering of CAS
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, Zhejiang Lab filed Critical Institute of Information Engineering of CAS
Priority to CN202210148651.XA priority Critical patent/CN114677569B/en
Publication of CN114677569A publication Critical patent/CN114677569A/en
Application granted granted Critical
Publication of CN114677569B publication Critical patent/CN114677569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character-image pair generation method and device based on feature decoupling. The method first trains encoders with labeled character-image pair data to map the two modalities of text and image into the same hidden space; it then trains an image encoder and decoder with unlabeled image data, and a text encoder and decoder with unlabeled text data; finally, it extracts initial character-image features with the trained character-image feature encoder network, adds random sampling noise in the hidden space, decouples the result, and generates diversified character-image pairs with the decoders. The invention achieves better text-image data editing in natural scenes, such as changing high-level semantic attributes like texture and color.

Description

Character-image pair generation method and device based on feature decoupling
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a character-image pair generation method and device based on feature decoupling.
Background
With the rapid development of computers and the internet, the forms in which humans send and receive information have become diversified. Text serves as a carrier of information transmission and contains rich semantic information; images serve as the input of visual information and are the means by which humans intuitively perceive the world. Learning and jointly understanding knowledge from these two modalities allows machines to make better use of multimedia data and benefits many related fields. However, labeling such text-image pairs requires considerable manpower and material resources, and some professional image labeling even requires annotators with a certain level of domain expertise. Therefore, effectively and accurately augmenting existing data with a generative model has become an important way to address this problem. A text-image pair generation algorithm comprises two parts: given a set of text-image labels, it first modifies the text reasonably while preserving its semantic correctness, and at the same time modifies the image correspondingly so that it conforms to the text description.
Text-image pair generation differs considerably from pure image generation. Currently common image-to-image translation methods can convert an image from a source domain to a target domain, but they are limited to predefined domains and cannot generalize to images edited with arbitrary semantic text operations. For example, GAN Dissection can add or delete certain objects by modifying the hidden space, but it is limited to editing a small number of predefined objects and contents, which must be identifiable by semantic segmentation and expressible in the hidden space. Another, more closely related task is language-based image editing. Such methods require a large number of image and scene annotations, modification instructions, and modified images, but for large-scale datasets this annotation information is often difficult to obtain. To avoid the use of annotation information, some methods have recently emerged that use only images and text annotations as training data. Given an image A and an unmatched target text description B, the model must edit A to match B. Loss functions constrain the realism of the generated image and its consistency with the modification description, without requiring the real modified image as training supervision. However, this approach assumes that any randomly sampled modification is feasible. For example, given an image of a red flower, the method may use "yellow flower" as a modification description, but it is meaningless to use "blue bird" as the modification instruction for the red flower image. This approach is limited to datasets with fine-grained, human-annotated descriptions for each image and cannot generalize to other complex image datasets. How to generate reasonable text-image pairs with limited annotation data therefore remains a challenging task.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a character-image pair generation method and device based on feature decoupling, the specific technical scheme being as follows:
First, encoders are trained with labeled character-image pair data to map the two modalities of text and image into the same hidden space; then an image encoder and decoder are trained with unlabeled image data, and a text encoder and decoder are trained with unlabeled text data; finally, initial character-image features are extracted with the trained character-image feature encoder network, random sampling noise is added, and diversified character-image pairs are generated with the decoders. The method generates text and image data simultaneously, randomly samples and decouples them in the text-image fusion hidden space, and constrains the association between the generated text and image with a conditional adversarial loss function, so that close semantic relevance between them can be ensured; training the codecs with a large amount of unlabeled data further improves the quality of the generated images and text. The invention achieves better text-image data editing in natural scenes, such as changing high-level semantic attributes like texture and color.
More specifically, the character-image pair generation method based on feature decoupling comprises the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
Further, the character-image feature encoder is composed of 7 ResNet blocks with downsampling layers and an LSTM network; the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain the fusion feature.
Further, the ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the image and text features of the positive and negative examples, wherein v and v̄ denote the results of channel-wise averaging of the image features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
Further, the calculation formula for mapping the two modalities of text and image into the same hidden space for fusion, to obtain the encoded fusion feature, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the image features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
Further, the expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
Further, the perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, wherein F_k is the output of the k-th layer of the VGG network and N_k denotes the number of channels output by the k-th layer.
Further, the formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
Further, the image feature decoder is composed of 7 ResNet blocks with upsampling layers, the text feature decoder adopts a long short-term memory (LSTM) network, and the character-image feature decoder adopts a conditional adversarial loss function as the text-image semantic association loss function to constrain the semantic association between text and image; the conditional adversarial loss function is expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein I is the image data, G is the generator, D is the discriminator, E denotes the expectation (averaging) operation, v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
Further, in the third step, the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
A character-image pair generation device based on feature decoupling, comprising:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
and a module for sampling the fused text-image features in a random sampling manner, decoupling them to obtain semantically associated text and image features, and generating text-image pairs simultaneously with the text and image decoding modules to obtain the corresponding output images.
In summary, the invention provides a character-image pair generation method and device based on feature fusion, which can modify images correspondingly according to text descriptions. Compared with the prior art, the invention has the following advantages:
1. The GAN-based generative adversarial network is improved and a character-image pair generation network is designed;
2. The adversarial loss function is adapted and adopted so as to constrain the semantic relevance between text and image;
3. The network is highly adaptable: complex images can be edited by training with a small number of labeled samples and a large amount of unlabeled data.
Drawings
FIG. 1 is a diagram of a text-to-image feature encoder architecture of the present invention;
FIG. 2 is a diagram of a text-to-image feature decoder architecture of the present invention;
FIG. 3 is a diagram of an example of the text-to-image pair generation results of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments of the present invention.
The invention discloses a character-image pair generation method based on feature decoupling, which comprises the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
Training of the text-image feature encoder:
the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain a fusion feature.
Specifically, as shown in FIG. 1, a set of labeled text-image pairs (I, T) is sampled as positive examples, where I is an image and T is the text description corresponding to the image. First, the image is cropped and scaled to 256×256 resolution and input to the image feature encoder for feature extraction, which outputs a 1024×7×7 feature v; the image feature encoder consists of 7 residual network (ResNet) blocks with downsampling layers. Meanwhile, the text is input into the text feature encoder for feature extraction; the text feature encoder adopts a long short-term memory (LSTM) network and outputs a 1024×1×1 feature t. A group of text-image pairs is randomly sampled as negative examples, and their features v̄ and t̄ are extracted in the same way. The text-image feature encoder is trained by maximizing the correlation between text and image features under the ternary loss function constraint.
The ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the visual and text features of the positive and negative examples, wherein v and v̄ denote the results of channel-wise averaging of the visual features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
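For illustration, a minimal PyTorch sketch of one common instantiation of such a ternary loss is given below: a margin-based triplet ranking loss over the inner products defined above. The hinge form, the margin value, and the symmetric two-term structure are assumptions and not taken from the original disclosure.

```python
import torch
import torch.nn.functional as F

def ternary_loss(v, t, v_neg, t_neg, margin=0.2):
    """Triplet-style ranking loss over inner products of image/text features.

    v, v_neg : (B, 1024) channel-wise averaged image features (positive / negative)
    t, t_neg : (B, 1024) text features (positive / negative)
    The margin and the hinge form are assumptions; the patent only states that
    the loss constrains the correlation between matched text and image features.
    """
    pos = (v * t).sum(dim=1)        # <v, t>      matched pair similarity
    neg_t = (v * t_neg).sum(dim=1)  # <v, t_bar>  image vs. negative text
    neg_v = (v_neg * t).sum(dim=1)  # <v_bar, t>  negative image vs. text
    loss = F.relu(margin - pos + neg_t) + F.relu(margin - pos + neg_v)
    return loss.mean()
```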
The calculation formula for feature fusion, which maps the two modalities of text and image into the same hidden space, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the visual features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
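The following PyTorch sketch illustrates the encoder pair and the fusion step described above. The exact ResNet block configuration, vocabulary size, and embedding dimension are assumptions; only the stated output shapes (a 1024-channel image feature map and a 1024-dimensional text feature) and the element-wise fusion are taken from the text.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Image encoder producing a 1024-channel feature map (1024x7x7 in the patent).

    A torchvision ResNet-50 backbone is used here as a stand-in for the 7 custom
    ResNet blocks with downsampling layers described in the text.
    """
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.proj = nn.Conv2d(2048, 1024, kernel_size=1)                # project to 1024 channels

    def forward(self, img):                    # img: (B, 3, 256, 256)
        return self.proj(self.features(img))   # (B, 1024, 8, 8); the patent's custom blocks yield 7x7

class TextEncoder(nn.Module):
    """LSTM text encoder producing a 1024-dimensional sentence feature t."""
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N) word indices
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                           # (B, 1024)

def fuse(v, t):
    """Fusion f = t (element-wise product) v, with t broadcast over the spatial dimensions."""
    return v * t[:, :, None, None]             # (B, 1024, H, W)
```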
The image feature decoder consists of 7 ResNet blocks with upsampling layers, and the text feature decoder adopts a long short-term memory (LSTM) network. The text feature decoder and the image feature decoder are trained as follows:
The training process is divided into text decoding training and image decoding training, and uses both the labeled data and image data without text labels. When training with labeled data, the text feature encoder and the image feature encoder are fixed and the decoders are trained, as shown in FIG. 2. An image I of 256×256 resolution is input into the image feature encoder, the extracted 1024×7×7 hidden variable v is input into the image feature decoder, and a reconstructed 256×256 resolution image Î is obtained; the same reconstruction operation is performed for the text data. The realism of the generated results and the semantic relevance between text and image are constrained by the adversarial loss function and the perceptual loss function adopted for the image feature decoder, the cross-entropy loss function adopted for the text feature decoder, and the conditional adversarial loss function serving as the text-image semantic association loss function. When using image data without text labels, the decoders are used to generate the image and the text, and the realism of the generated text and image results is constrained by the adversarial loss function.
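A rough PyTorch sketch of the image feature decoder described above is given below. The patent states 7 ResNet blocks with upsampling layers mapping the hidden variable back to a 256×256 image; the channel widths, the choice of which blocks actually upsample, and the 8×8 input resolution are assumptions made so that the shapes work out.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Residual-style block with optional 2x nearest-neighbour upsampling."""
    def __init__(self, c_in, c_out, up=True):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1)
        self.up = nn.Upsample(scale_factor=2, mode='nearest') if up else nn.Identity()

    def forward(self, x):
        return self.up(torch.relu(self.conv(x) + self.skip(x)))

class ImageDecoder(nn.Module):
    """Maps a fused 1024-channel feature map back to a 256x256 RGB image.

    Seven blocks as stated in the text; letting only five of them upsample
    (8x8 -> 256x256) is an assumption made so that the output resolution matches.
    """
    def __init__(self):
        super().__init__()
        cfg = [(1024, 512, True), (512, 256, True), (256, 128, True),
               (128, 64, True), (64, 32, True), (32, 16, False), (16, 16, False)]
        self.blocks = nn.Sequential(*[UpBlock(ci, co, up) for ci, co, up in cfg])
        self.to_rgb = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, f):                                  # f: (B, 1024, 8, 8)
        return torch.tanh(self.to_rgb(self.blocks(f)))     # (B, 3, 256, 256)
```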
The expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
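The adversarial loss as written has a WGAN-style form without a logarithm; a direct transcription is sketched below. The split into a discriminator term and a generator term is an assumption following common practice, since the patent only states the combined expression.

```python
import torch

def adversarial_d_loss(D, real_images, fake_images):
    """Discriminator side of L_GAN = -E[D(I)] + E[D(G(v))] as written in the text."""
    return -D(real_images).mean() + D(fake_images.detach()).mean()

def adversarial_g_loss(D, fake_images):
    """Generator side; the patent only gives the combined expression, so using
    -E[D(G(v))] for the generator is an assumption following common WGAN practice."""
    return -D(fake_images).mean()
```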
The perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, wherein F_k is the output of the k-th layer of the VGG network and N_k denotes the number of channels output by the k-th layer.
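A hedged PyTorch sketch of such a perceptual loss is shown below. The choice of VGG-16 layers and the L1 distance are assumptions; the patent only names the layer outputs F_k and the channel counts N_k.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Distance between VGG features of the generated and target images.

    The chosen VGG-16 layers and the L1 distance are assumptions; the patent
    only names the layer outputs F_k and the channel counts N_k.
    """
    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        vgg = models.vgg16(weights=None).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated, target):
        loss = 0.0
        for f_gen, f_tgt in zip(self._features(generated), self._features(target)):
            # mean() over channels and positions folds in the 1/N_k normalisation
            loss = loss + (f_gen - f_tgt).abs().mean()
        return loss
```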
The formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
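A minimal PyTorch sketch of an LSTM caption decoder trained with this cross-entropy loss is given below. The hidden size, embedding size, and the projection used to feed CNN(I) as x_{-1} are assumptions; only the teacher-forced inputs x_t = W_e S_t and the per-step cross-entropy follow the formulas above.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM text decoder trained with L_CE = -sum_t log p_t(S_t).

    The hidden size, embedding size, and the linear projection of the image
    feature are assumptions; the patent only specifies x_{-1} = CNN(I) and
    x_t = W_e S_t with a trainable W_e.
    """
    def __init__(self, vocab_size=30000, emb_dim=512, hidden=512, img_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # maps CNN(I) into the input space (x_{-1})
        self.W_e = nn.Embedding(vocab_size, emb_dim)  # trainable word embedding W_e
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, img_feature, tokens):
        # img_feature: (B, img_dim) pooled CNN feature; tokens: (B, N) ground-truth words S_t
        x_img = self.img_proj(img_feature).unsqueeze(1)   # x_{-1}
        x_words = self.W_e(tokens[:, :-1])                # x_t = W_e S_t (teacher forcing)
        inputs = torch.cat([x_img, x_words], dim=1)       # (B, N, emb_dim)
        out, _ = self.lstm(inputs)                        # p_t = LSTM(x_{t-1})
        logits = self.classifier(out)                     # (B, N, vocab)
        return self.ce(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
```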
The conditional adversarial loss function, used as the text-image semantic association loss function, is expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
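A possible conditional discriminator for this loss is sketched below. Conditioning D on the text feature t by concatenating a spatially broadcast copy of t to the image features, as well as the layer sizes, are assumptions; the patent only specifies that D is conditioned on t.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Conditional discriminator for L_pair = -E[D(I|t)] + E[D(G(v|t))].

    Concatenating a broadcast copy of the text feature t to the image-feature map
    is an assumption; the patent only states that D is conditional on t.
    """
    def __init__(self, text_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(                       # image branch: 3x256x256 -> 512x4x4
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.joint = nn.Sequential(
            nn.Conv2d(512 + text_dim, 512, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 4),                        # scalar realism score per image
        )

    def forward(self, image, t):
        h = self.conv(image)                                      # (B, 512, 4, 4)
        t_map = t.view(t.size(0), -1, 1, 1).expand(-1, -1, 4, 4)  # broadcast text feature
        return self.joint(torch.cat([h, t_map], dim=1)).view(-1)

def conditional_d_loss(D, real_images, fake_images, t):
    """Direct transcription of the discriminator side of L_pair."""
    return -D(real_images, t).mean() + D(fake_images.detach(), t).mean()
```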
The third step is specifically as follows: the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
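The sampling-and-decoding step can be sketched as follows. The noise scale, the number of samples, and the text_decoder.generate helper are assumptions introduced for illustration; the patent only specifies adding z ~ N(0, I) noise to the fusion feature and decoding with the trained decoders.

```python
import torch

@torch.no_grad()
def augment_pair(image, tokens, image_encoder, text_encoder, image_decoder, text_decoder,
                 noise_scale=0.1, num_samples=4):
    """Sample perturbed fusion features around the initial feature and decode them
    into new text-image pairs. noise_scale, num_samples, and the hypothetical
    text_decoder.generate helper are assumptions, not part of the disclosure."""
    v = image_encoder(image)                           # (1, 1024, H, W) image feature
    t = text_encoder(tokens)                           # (1, 1024) text feature
    f = v * t[:, :, None, None]                        # initial fusion feature f = t * v
    pairs = []
    for _ in range(num_samples):
        f_new = f + noise_scale * torch.randn_like(f)  # sample in a neighbourhood of f
        new_image = image_decoder(f_new)               # decode the image branch
        new_text = text_decoder.generate(f_new.mean(dim=(2, 3)))  # decode the text branch (greedy)
        pairs.append((new_image, new_text))
    return pairs
```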
The method provided by the invention has the following test environment and experimental results:
(1) Test environment:
System environment: CentOS 7;
hardware environment: memory: 64GB, GPU: TITAN XP, hard disk: 256GB;
(2) Experimental data:
Training data:
Conceptual Captions database, comprising 3,000,000 images with text labels. The image data include various people, objects, scenery, etc., and are collected from the web by keyword search.
The training optimization method comprises the following steps: ADAM optimization algorithm.
Test data: MS COCO 2017 dataset.
(3) Experimental results:
to illustrate the effect of the present invention, data augmentation is performed on text-image pairs in the validation set of the MS COCO 2017 database. The fusion feature is encoded by using a character-image feature encoder, and a decoder is utilized to generate a corresponding image after random sampling noise is added. The results of the method are shown in FIG. 3.
Another embodiment of the present invention provides a character-image pair generating device based on feature fusion, which includes:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
With the method and the device, given a group of text-image pairs, sampling is carried out in a neighborhood of the initial features in the hidden space: the fused text-image features are sampled in a random sampling manner to obtain new coding vectors, which are decoupled into text and image features with close semantic association; the text and image decoding modules then generate text-image pairs simultaneously, yielding the corresponding output images.
The invention adopts an open editing framework that generates images and corresponding text descriptions from randomly sampled hidden variables. In particular, a generic visual-semantic generative model pre-trained on a large-scale text-image dataset is used; it decouples the hidden space into a text feature space and a visual feature space, constructing a mapping from the hidden space to arbitrary images and text descriptions. Features in the hidden space can locate the region of the input image relevant to the description and manipulate the corresponding visual features via vector arithmetic between the visual feature map and text features, e.g., visual embedding of "red flower" = visual embedding of "yellow flower" - text embedding of "yellow flower" + text embedding of "red flower". An image is then generated from the modified visual feature map with the image decoder. The image generator is constrained only by the image reconstruction loss function and does not require any annotated modification descriptions. During testing, features are sampled in the hidden space to obtain diversified hidden variables, which are then mapped to the image and text spaces with the image and text decoders to obtain augmented text-image pairs. The problem of mismatch between text and image semantics can therefore be avoided at test time.
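The vector-arithmetic editing described above can be sketched as follows; broadcasting the text embeddings over the spatial dimensions of the visual feature map is an assumption made for illustration.

```python
import torch

def edit_visual_feature(v_src, t_src_tokens, t_tgt_tokens, text_encoder, image_decoder):
    """Semantic editing by vector arithmetic in the shared hidden space, e.g.
    v("red flower") ~= v("yellow flower") - t("yellow flower") + t("red flower").
    Broadcasting the text features over the spatial map is an assumption."""
    t_src = text_encoder(t_src_tokens)[:, :, None, None]  # text embedding of the source description
    t_tgt = text_encoder(t_tgt_tokens)[:, :, None, None]  # text embedding of the target description
    v_edited = v_src - t_src + t_tgt                      # arithmetic on the visual feature map
    return image_decoder(v_edited)                        # generate the edited image
```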
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the foregoing detailed description of the invention has been provided, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing examples, and that certain features may be substituted for those illustrated and described herein. Modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A character-image pair generation method based on feature decoupling, characterized by comprising the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
2. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the character-image feature encoder is composed of 7 ResNet blocks with downsampling layers and an LSTM network; the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain the fusion feature.
3. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the image and text features of the positive and negative examples, in which v and v̄ denote the results of channel-wise averaging of the image features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
4. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the calculation formula for mapping the two modalities of text and image into the same hidden space for fusion, to obtain the encoded fusion feature, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the image features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
5. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
6. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, F_k being the output of the k-th layer of the VGG network and N_k the number of channels output by the k-th layer.
7. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
8. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the image feature decoder is composed of 7 ResNet blocks with upsampling layers, the text feature decoder adopts a long short-term memory (LSTM) network, and the character-image feature decoder adopts a conditional adversarial loss function as the text-image semantic association loss function to constrain the semantic association between text and image, the conditional adversarial loss function being expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein I is the image data, G is the generator, D is the discriminator, E denotes the expectation (averaging) operation, v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
9. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein in the third step the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
10. A character-image pair generating device based on feature decoupling, comprising:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
and a module for sampling the fused text-image features in a random sampling manner, decoupling them to obtain semantically associated text and image features, and generating text-image pairs simultaneously with the text and image decoding modules to obtain the corresponding output images.
CN202210148651.XA 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling Active CN114677569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148651.XA CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148651.XA CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Publications (2)

Publication Number Publication Date
CN114677569A CN114677569A (en) 2022-06-28
CN114677569B true CN114677569B (en) 2024-05-10

Family

ID=82072241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148651.XA Active CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Country Status (1)

Country Link
CN (1) CN114677569B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112818646A (en) * 2021-02-26 2021-05-18 南京邮电大学 Method for editing pictures according to texts based on generation countermeasure network and dynamic editing module
CN113935899A (en) * 2021-09-06 2022-01-14 杭州志创科技有限公司 Ship plate image super-resolution method based on semantic information and gradient supervision

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112818646A (en) * 2021-02-26 2021-05-18 南京邮电大学 Method for editing pictures according to texts based on generation countermeasure network and dynamic editing module
CN113935899A (en) * 2021-09-06 2022-01-14 杭州志创科技有限公司 Ship plate image super-resolution method based on semantic information and gradient supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Haijun; Liu Xueliang. Image caption generation method incorporating constraint learning. Journal of Image and Graphics, 2020, (02), full text. *

Also Published As

Publication number Publication date
CN114677569A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
Zhan et al. Multimodal image synthesis and editing: A survey and taxonomy
CN111858954B (en) Task-oriented text-generated image network model
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111369646B (en) Expression synthesis method integrating attention mechanism
CN114677569B (en) Character-image pair generation method and device based on feature decoupling
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
Bai et al. Loopy residual hashing: Filling the quantization gap for image retrieval
Li et al. [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning
Yan et al. Comprehensive visual question answering on point clouds through compositional scene manipulation
Zhan et al. Multimodal image synthesis and editing: A survey
Bende et al. VISMA: A Machine Learning Approach to Image Manipulation
CN114399646B (en) Image description method and device based on transform structure
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
Shah et al. Inferring context from pixels for multimodal image classification
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN115270917A (en) Two-stage processing multi-mode garment image generation method
Sun et al. PattGAN: Pluralistic Facial Attribute Editing
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Kong et al. DualPathGAN: Facial reenacted emotion synthesis
Bayoumi et al. Text-to-Image Synthesis: A Comparative Study
CN111898456B (en) Text modification picture network model training method based on multi-level attention mechanism
Chen et al. A review of multimodal learning for text to images
US20230360294A1 (en) Unsupervised style and color cues for transformer-based image generation
Patil et al. Real-time Audio Video Summarization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant