CN114677569B - Character-image pair generation method and device based on feature decoupling - Google Patents

Character-image pair generation method and device based on feature decoupling

Info

Publication number
CN114677569B
CN114677569B (application number CN202210148651.XA)
Authority
CN
China
Prior art keywords
image
character
feature
text
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210148651.XA
Other languages
Chinese (zh)
Other versions
CN114677569A (en)
Inventor
王蕊
梁栋
李太豪
裴冠雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Zhejiang Lab
Original Assignee
Institute of Information Engineering of CAS
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, Zhejiang Lab filed Critical Institute of Information Engineering of CAS
Priority to CN202210148651.XA priority Critical patent/CN114677569B/en
Publication of CN114677569A publication Critical patent/CN114677569A/en
Application granted granted Critical
Publication of CN114677569B publication Critical patent/CN114677569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character-image pair generation method and device based on feature decoupling. The method first trains encoders with labeled character-image pair data to map the two modalities of text and image into the same hidden space; it then trains an image encoder and decoder with unlabeled image data, and a text encoder and decoder with unlabeled text data; finally, it extracts initial character-image features with the trained character-image feature encoder network, adds random sampling noise in the hidden space, decouples the result, and generates diversified character-image pairs with the decoders. The invention achieves better text-image data editing in natural scenes, such as changing high-level semantic attributes like texture and color.

Description

Character-image pair generation method and device based on feature decoupling
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a character-image pair generation method and device based on feature decoupling.
Background
With the rapid development of computers and the internet, the forms in which humans send and receive information have become diversified. Text serves as a carrier of information transmission and contains rich semantic information; images serve as the input of visual information and are the means by which humans intuitively perceive the world. Learning and jointly understanding knowledge from these two modalities allows machines to make better use of multimedia data and benefits many related fields. However, labeling such text-image pairs requires considerable manpower and material resources, and some professional image labeling even requires annotators with a certain level of domain expertise. Therefore, effectively and accurately augmenting existing data with a generative model has become an important way to address this problem. A text-image pair generation algorithm comprises two parts: given a set of text-image labels, it first modifies the text reasonably while preserving its semantic correctness, and at the same time modifies the image correspondingly so that it conforms to the text description.
Text-image pair generation differs considerably from pure image generation. Currently common image-to-image translation methods can convert an image from a source domain to a target domain, but they are limited to predefined domains and cannot generalize to images edited with arbitrary semantic text operations. For example, GAN Dissection can add or delete certain objects by modifying the hidden space, but it is limited to editing a small number of predefined objects and contents, which must be identifiable by semantic segmentation and expressible in the hidden space. Another, more closely related task is language-based image editing. Such methods require a large number of image and scene annotations, modification instructions, and modified images, but for large-scale datasets this annotation information is often difficult to obtain. To avoid the use of annotation information, some methods have recently emerged that use only images and text annotations as training data. Given an image A and an unmatched target text description B, the model must edit A to match B. Loss functions constrain the realism of the generated image and its consistency with the modification description, without requiring the real modified image as training supervision. However, this approach assumes that any randomly sampled modification is feasible. For example, given an image of a red flower, the method may use "yellow flower" as a modification description, but it is meaningless to use "blue bird" as the modification instruction for the red flower image. This approach is limited to datasets with fine-grained, human-annotated descriptions for each image and cannot generalize to other complex image datasets. How to generate reasonable text-image pairs with limited annotation data therefore remains a challenging task.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a character-image pair generation method and device based on feature decoupling, the specific technical scheme being as follows:
First, encoders are trained with labeled character-image pair data to map the two modalities of text and image into the same hidden space; then an image encoder and decoder are trained with unlabeled image data, and a text encoder and decoder are trained with unlabeled text data; finally, initial character-image features are extracted with the trained character-image feature encoder network, random sampling noise is added, and diversified character-image pairs are generated with the decoders. The method generates text and image data simultaneously, randomly samples and decouples them in the text-image fusion hidden space, and constrains the association between the generated text and image with a conditional adversarial loss function, so that close semantic relevance between them can be ensured; training the codecs with a large amount of unlabeled data further improves the quality of the generated images and text. The invention achieves better text-image data editing in natural scenes, such as changing high-level semantic attributes like texture and color.
More specifically, the character-image pair generation method based on feature decoupling comprises the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
Further, the character-image feature encoder is composed of 7 ResNet blocks with downsampling layers and an LSTM network; the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain the fusion feature.
Further, the ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the image and text features of the positive and negative examples, wherein v and v̄ denote the results of channel-wise averaging of the image features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
Further, the calculation formula for mapping the two modalities of text and image into the same hidden space for fusion, to obtain the encoded fusion feature, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the image features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
Further, the expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
Further, the perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, wherein F_k is the output of the k-th layer of the VGG network and N_k denotes the number of channels output by the k-th layer.
Further, the formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
Further, the image feature decoder is composed of 7 ResNet blocks with upsampling layers, the text feature decoder adopts a long short-term memory (LSTM) network, and the character-image feature decoder adopts a conditional adversarial loss function as the text-image semantic association loss function to constrain the semantic association between text and image; the conditional adversarial loss function is expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein I is the image data, G is the generator, D is the discriminator, E denotes the expectation (averaging) operation, v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
Further, in the third step, the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
A character-image pair generation device based on feature decoupling, comprising:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
and a module for sampling the fused text-image features in a random sampling manner, decoupling them to obtain semantically associated text and image features, and generating text-image pairs simultaneously with the text and image decoding modules to obtain the corresponding output images.
In summary, the invention provides a character-image pair generation method and device based on feature fusion, which can modify images correspondingly according to text descriptions. Compared with the prior art, the invention has the following advantages:
1. The GAN-based generative adversarial network is improved and a character-image pair generation network is designed;
2. The adversarial loss function is adapted and adopted so as to constrain the semantic relevance between text and image;
3. The network is highly adaptable: complex images can be edited by training with a small number of labeled samples and a large amount of unlabeled data.
Drawings
FIG. 1 is a diagram of a text-to-image feature encoder architecture of the present invention;
FIG. 2 is a diagram of a text-to-image feature decoder architecture of the present invention;
FIG. 3 is a diagram of an example of the text-to-image pair generation results of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments of the present invention.
The invention discloses a character-image pair generation method based on feature decoupling, which comprises the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
Training of the text-image feature encoder:
the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain a fusion feature.
Specifically, as shown in FIG. 1, a set of labeled text-image pairs (I, T) is sampled as positive examples, where I is an image and T is the text description corresponding to the image. First, the image is cropped and scaled to 256×256 resolution and input to the image feature encoder for feature extraction, which outputs a 1024×7×7 feature v; the image feature encoder consists of 7 residual network (ResNet) blocks with downsampling layers. Meanwhile, the text is input into the text feature encoder for feature extraction; the text feature encoder adopts a long short-term memory (LSTM) network and outputs a 1024×1×1 feature t. A group of text-image pairs is randomly sampled as negative examples, and their features v̄ and t̄ are extracted in the same way. The text-image feature encoder is trained by maximizing the correlation between text and image features under the ternary loss function constraint.
The ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the visual and text features of the positive and negative examples, wherein v and v̄ denote the results of channel-wise averaging of the visual features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
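For illustration, a minimal PyTorch sketch of one common instantiation of such a ternary loss is given below: a margin-based triplet ranking loss over the inner products defined above. The hinge form, the margin value, and the symmetric two-term structure are assumptions and not taken from the original disclosure.

```python
import torch
import torch.nn.functional as F

def ternary_loss(v, t, v_neg, t_neg, margin=0.2):
    """Triplet-style ranking loss over inner products of image/text features.

    v, v_neg : (B, 1024) channel-wise averaged image features (positive / negative)
    t, t_neg : (B, 1024) text features (positive / negative)
    The margin and the hinge form are assumptions; the patent only states that
    the loss constrains the correlation between matched text and image features.
    """
    pos = (v * t).sum(dim=1)        # <v, t>      matched pair similarity
    neg_t = (v * t_neg).sum(dim=1)  # <v, t_bar>  image vs. negative text
    neg_v = (v_neg * t).sum(dim=1)  # <v_bar, t>  negative image vs. text
    loss = F.relu(margin - pos + neg_t) + F.relu(margin - pos + neg_v)
    return loss.mean()
```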
The calculation formula for feature fusion, which maps the two modalities of text and image into the same hidden space, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the visual features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
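The following PyTorch sketch illustrates the encoder pair and the fusion step described above. The exact ResNet block configuration, vocabulary size, and embedding dimension are assumptions; only the stated output shapes (a 1024-channel image feature map and a 1024-dimensional text feature) and the element-wise fusion are taken from the text.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """Image encoder producing a 1024-channel feature map (1024x7x7 in the patent).

    A torchvision ResNet-50 backbone is used here as a stand-in for the 7 custom
    ResNet blocks with downsampling layers described in the text.
    """
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.proj = nn.Conv2d(2048, 1024, kernel_size=1)                # project to 1024 channels

    def forward(self, img):                    # img: (B, 3, 256, 256)
        return self.proj(self.features(img))   # (B, 1024, 8, 8); the patent's custom blocks yield 7x7

class TextEncoder(nn.Module):
    """LSTM text encoder producing a 1024-dimensional sentence feature t."""
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N) word indices
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                           # (B, 1024)

def fuse(v, t):
    """Fusion f = t (element-wise product) v, with t broadcast over the spatial dimensions."""
    return v * t[:, :, None, None]             # (B, 1024, H, W)
```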
The image feature decoder consists of 7 ResNet blocks with upsampling layers, and the text feature decoder adopts a long short-term memory (LSTM) network. The text feature decoder and the image feature decoder are trained as follows:
The training process is divided into text decoding training and image decoding training, and uses both the labeled data and image data without text labels. When training with labeled data, the text feature encoder and the image feature encoder are fixed and the decoders are trained, as shown in FIG. 2. An image I of 256×256 resolution is input into the image feature encoder, the extracted 1024×7×7 hidden variable v is input into the image feature decoder, and a reconstructed 256×256 resolution image Î is obtained; the same reconstruction operation is performed for the text data. The realism of the generated results and the semantic relevance between text and image are constrained by the adversarial loss function and the perceptual loss function adopted for the image feature decoder, the cross-entropy loss function adopted for the text feature decoder, and the conditional adversarial loss function serving as the text-image semantic association loss function. When using image data without text labels, the decoders are used to generate the image and the text, and the realism of the generated text and image results is constrained by the adversarial loss function.
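A rough PyTorch sketch of the image feature decoder described above is given below. The patent states 7 ResNet blocks with upsampling layers mapping the hidden variable back to a 256×256 image; the channel widths, the choice of which blocks actually upsample, and the 8×8 input resolution are assumptions made so that the shapes work out.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Residual-style block with optional 2x nearest-neighbour upsampling."""
    def __init__(self, c_in, c_out, up=True):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1)
        self.up = nn.Upsample(scale_factor=2, mode='nearest') if up else nn.Identity()

    def forward(self, x):
        return self.up(torch.relu(self.conv(x) + self.skip(x)))

class ImageDecoder(nn.Module):
    """Maps a fused 1024-channel feature map back to a 256x256 RGB image.

    Seven blocks as stated in the text; letting only five of them upsample
    (8x8 -> 256x256) is an assumption made so that the output resolution matches.
    """
    def __init__(self):
        super().__init__()
        cfg = [(1024, 512, True), (512, 256, True), (256, 128, True),
               (128, 64, True), (64, 32, True), (32, 16, False), (16, 16, False)]
        self.blocks = nn.Sequential(*[UpBlock(ci, co, up) for ci, co, up in cfg])
        self.to_rgb = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, f):                                  # f: (B, 1024, 8, 8)
        return torch.tanh(self.to_rgb(self.blocks(f)))     # (B, 3, 256, 256)
```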
The expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
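The adversarial loss as written has a WGAN-style form without a logarithm; a direct transcription is sketched below. The split into a discriminator term and a generator term is an assumption following common practice, since the patent only states the combined expression.

```python
import torch

def adversarial_d_loss(D, real_images, fake_images):
    """Discriminator side of L_GAN = -E[D(I)] + E[D(G(v))] as written in the text."""
    return -D(real_images).mean() + D(fake_images.detach()).mean()

def adversarial_g_loss(D, fake_images):
    """Generator side; the patent only gives the combined expression, so using
    -E[D(G(v))] for the generator is an assumption following common WGAN practice."""
    return -D(fake_images).mean()
```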
The perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, wherein F_k is the output of the k-th layer of the VGG network and N_k denotes the number of channels output by the k-th layer.
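A hedged PyTorch sketch of such a perceptual loss is shown below. The choice of VGG-16 layers and the L1 distance are assumptions; the patent only names the layer outputs F_k and the channel counts N_k.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Distance between VGG features of the generated and target images.

    The chosen VGG-16 layers and the L1 distance are assumptions; the patent
    only names the layer outputs F_k and the channel counts N_k.
    """
    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        vgg = models.vgg16(weights=None).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated, target):
        loss = 0.0
        for f_gen, f_tgt in zip(self._features(generated), self._features(target)):
            # mean() over channels and positions folds in the 1/N_k normalisation
            loss = loss + (f_gen - f_tgt).abs().mean()
        return loss
```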
The formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
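A minimal PyTorch sketch of an LSTM caption decoder trained with this cross-entropy loss is given below. The hidden size, embedding size, and the projection used to feed CNN(I) as x_{-1} are assumptions; only the teacher-forced inputs x_t = W_e S_t and the per-step cross-entropy follow the formulas above.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM text decoder trained with L_CE = -sum_t log p_t(S_t).

    The hidden size, embedding size, and the linear projection of the image
    feature are assumptions; the patent only specifies x_{-1} = CNN(I) and
    x_t = W_e S_t with a trainable W_e.
    """
    def __init__(self, vocab_size=30000, emb_dim=512, hidden=512, img_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # maps CNN(I) into the input space (x_{-1})
        self.W_e = nn.Embedding(vocab_size, emb_dim)  # trainable word embedding W_e
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, img_feature, tokens):
        # img_feature: (B, img_dim) pooled CNN feature; tokens: (B, N) ground-truth words S_t
        x_img = self.img_proj(img_feature).unsqueeze(1)   # x_{-1}
        x_words = self.W_e(tokens[:, :-1])                # x_t = W_e S_t (teacher forcing)
        inputs = torch.cat([x_img, x_words], dim=1)       # (B, N, emb_dim)
        out, _ = self.lstm(inputs)                        # p_t = LSTM(x_{t-1})
        logits = self.classifier(out)                     # (B, N, vocab)
        return self.ce(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
```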
The conditional adversarial loss function, used as the text-image semantic association loss function, is expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
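A possible conditional discriminator for this loss is sketched below. Conditioning D on the text feature t by concatenating a spatially broadcast copy of t to the image features, as well as the layer sizes, are assumptions; the patent only specifies that D is conditioned on t.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Conditional discriminator for L_pair = -E[D(I|t)] + E[D(G(v|t))].

    Concatenating a broadcast copy of the text feature t to the image-feature map
    is an assumption; the patent only states that D is conditional on t.
    """
    def __init__(self, text_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(                       # image branch: 3x256x256 -> 512x4x4
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.joint = nn.Sequential(
            nn.Conv2d(512 + text_dim, 512, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 4),                        # scalar realism score per image
        )

    def forward(self, image, t):
        h = self.conv(image)                                      # (B, 512, 4, 4)
        t_map = t.view(t.size(0), -1, 1, 1).expand(-1, -1, 4, 4)  # broadcast text feature
        return self.joint(torch.cat([h, t_map], dim=1)).view(-1)

def conditional_d_loss(D, real_images, fake_images, t):
    """Direct transcription of the discriminator side of L_pair."""
    return -D(real_images, t).mean() + D(fake_images.detach(), t).mean()
```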
The third step is specifically as follows: the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
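The sampling-and-decoding step can be sketched as follows. The noise scale, the number of samples, and the text_decoder.generate helper are assumptions introduced for illustration; the patent only specifies adding z ~ N(0, I) noise to the fusion feature and decoding with the trained decoders.

```python
import torch

@torch.no_grad()
def augment_pair(image, tokens, image_encoder, text_encoder, image_decoder, text_decoder,
                 noise_scale=0.1, num_samples=4):
    """Sample perturbed fusion features around the initial feature and decode them
    into new text-image pairs. noise_scale, num_samples, and the hypothetical
    text_decoder.generate helper are assumptions, not part of the disclosure."""
    v = image_encoder(image)                           # (1, 1024, H, W) image feature
    t = text_encoder(tokens)                           # (1, 1024) text feature
    f = v * t[:, :, None, None]                        # initial fusion feature f = t * v
    pairs = []
    for _ in range(num_samples):
        f_new = f + noise_scale * torch.randn_like(f)  # sample in a neighbourhood of f
        new_image = image_decoder(f_new)               # decode the image branch
        new_text = text_decoder.generate(f_new.mean(dim=(2, 3)))  # decode the text branch (greedy)
        pairs.append((new_image, new_text))
    return pairs
```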
The method provided by the invention has the following test environment and experimental results:
(1) Test environment:
System environment: CentOS 7;
hardware environment: memory: 64GB, GPU: TITAN XP, hard disk: 256GB;
(2) Experimental data:
Training data:
Conceptual Captions database, comprising 3,000,000 images with text labels. The image data include various people, objects, scenery, etc., and are collected from the web by keyword search.
The training optimization method comprises the following steps: ADAM optimization algorithm.
Test data: MS COCO 2017 dataset.
(3) Experimental results:
to illustrate the effect of the present invention, data augmentation is performed on text-image pairs in the validation set of the MS COCO 2017 database. The fusion feature is encoded by using a character-image feature encoder, and a decoder is utilized to generate a corresponding image after random sampling noise is added. The results of the method are shown in FIG. 3.
Another embodiment of the present invention provides a character-image pair generating device based on feature fusion, which includes:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
With the method and the device, given a group of text-image pairs, sampling is carried out in a neighborhood of the initial features in the hidden space: the fused text-image features are sampled in a random sampling manner to obtain new coding vectors, which are decoupled into text and image features with close semantic association; the text and image decoding modules then generate text-image pairs simultaneously, yielding the corresponding output images.
The invention adopts an open editing framework that generates images and corresponding text descriptions from randomly sampled hidden variables. In particular, a generic visual-semantic generative model pre-trained on a large-scale text-image dataset is used; it decouples the hidden space into a text feature space and a visual feature space, constructing a mapping from the hidden space to arbitrary images and text descriptions. Features in the hidden space can locate the region of the input image relevant to the description and manipulate the corresponding visual features via vector arithmetic between the visual feature map and text features, e.g., visual embedding of "red flower" = visual embedding of "yellow flower" - text embedding of "yellow flower" + text embedding of "red flower". An image is then generated from the modified visual feature map with the image decoder. The image generator is constrained only by the image reconstruction loss function and does not require any annotated modification descriptions. During testing, features are sampled in the hidden space to obtain diversified hidden variables, which are then mapped to the image and text spaces with the image and text decoders to obtain augmented text-image pairs. The problem of mismatch between text and image semantics can therefore be avoided at test time.
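The vector-arithmetic editing described above can be sketched as follows; broadcasting the text embeddings over the spatial dimensions of the visual feature map is an assumption made for illustration.

```python
import torch

def edit_visual_feature(v_src, t_src_tokens, t_tgt_tokens, text_encoder, image_decoder):
    """Semantic editing by vector arithmetic in the shared hidden space, e.g.
    v("red flower") ~= v("yellow flower") - t("yellow flower") + t("red flower").
    Broadcasting the text features over the spatial map is an assumption."""
    t_src = text_encoder(t_src_tokens)[:, :, None, None]  # text embedding of the source description
    t_tgt = text_encoder(t_tgt_tokens)[:, :, None, None]  # text embedding of the target description
    v_edited = v_src - t_src + t_tgt                      # arithmetic on the visual feature map
    return image_decoder(v_edited)                        # generate the edited image
```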
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the foregoing detailed description of the invention has been provided, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing examples, and that certain features may be substituted for those illustrated and described herein. Modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A character-image pair generation method based on feature decoupling, characterized by comprising the following steps:
step one, constructing a character-image feature encoder based on a generative adversarial network (GAN) structure, using labeled character-image pair data, training the character-image feature encoder by maximizing the correlation between text and image features under a ternary (triplet) loss function constraint, and mapping the two modalities of text and image into the same hidden space for fusion to obtain the encoded fusion feature;
step two, constructing a character-image feature decoder based on the generative adversarial network (GAN) to decouple the fusion feature, wherein the image feature decoder network is trained under the constraint of an adversarial loss function and a perceptual loss function, the text feature decoder is trained with a cross-entropy loss function, the image feature encoder and the image feature decoder are trained using unlabeled image data, and the text feature encoder and the text feature decoder are trained using unlabeled text data;
and step three, extracting character-image features with the trained character-image feature encoder as initial features, adding random sampling noise, sampling and decoupling the fused character-image features with the trained character-image feature decoder to obtain semantically associated text and image features, and generating diversified character-image data.
2. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the character-image feature encoder is composed of 7 ResNet blocks with downsampling layers and an LSTM network; the images and texts in the character-image pair data are input into the image encoder and the text encoder respectively, the features of the two modalities are output respectively, and the features of the two modalities are multiplied to obtain the fusion feature.
3. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the ternary loss function is constructed from the inner products ⟨v, t⟩, ⟨v, t̄⟩ and ⟨v̄, t⟩ between the image and text features of the positive and negative examples, in which v and v̄ denote the results of channel-wise averaging of the image features of the positive and negative examples, t and t̄ denote the text features of the positive and negative examples, and ⟨·,·⟩ denotes the inner product.
4. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the calculation formula for mapping the two modalities of text and image into the same hidden space for fusion, to obtain the encoded fusion feature, is as follows:
f = t ⊙ v
wherein v, v̄ ∈ R^{1024×7×7} denote the image features of the positive and negative examples, the text feature t is broadcast along the spatial dimensions, and ⊙ denotes element-wise multiplication.
5. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the expression of the adversarial loss function is:
L_GAN = -E[D(I)] + E[D(G(v))]
wherein I is the image data, G is the generator, D is the discriminator, and E denotes the expectation (averaging) operation.
6. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the perceptual loss function penalizes the difference between the features that a pre-trained VGG network extracts from the generated image and from the target image, F_k being the output of the k-th layer of the VGG network and N_k the number of channels output by the k-th layer.
7. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the formula of the cross-entropy loss function is as follows:
L_CE = -∑_{t=1}^{N} log p_t(S_t)
wherein S is the word-vector representation of the text T, S_t is the word vector of the word T_t, p_t = LSTM(x_{t-1}), t ∈ {1, …, N}, denotes the output of the LSTM network, N is the length of the text sequence, and x_t is the input of the LSTM network at each time step; the initial value and the inputs are computed as:
x_{-1} = CNN(I)
x_t = W_e S_t, t ∈ {0, …, N-1},
wherein CNN is an image feature extraction network (a VGG network is used to extract image features in the experiments) and W_e is a trainable parameter.
8. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein the image feature decoder is composed of 7 ResNet blocks with upsampling layers, the text feature decoder adopts a long short-term memory (LSTM) network, and the character-image feature decoder adopts a conditional adversarial loss function as the text-image semantic association loss function to constrain the semantic association between text and image, the conditional adversarial loss function being expressed as:
L_pair = -E[D(I|t)] + E[D(G(v|t))]
wherein I is the image data, G is the generator, D is the discriminator, E denotes the expectation (averaging) operation, v ∈ R^{1024×1×1} represents the result of channel-wise averaging of the positive-example image feature, and t ∈ R^{1024×1×1} represents the text feature of the positive example.
9. The character-image pair generation method based on feature decoupling as claimed in claim 1, wherein in the third step the trained character-image feature encoder is used to extract the character-image feature as the initial feature, and a new coding vector is then sampled in a neighborhood of the initial feature in the hidden space:
f̃ = f + z
wherein z ~ N(0, I) is a random vector and f is the encoded fusion feature, namely the initial feature;
the new coding vector f̃ is input into the trained decoder networks to finally obtain the modified text and image.
10. A character-image pair generating device based on feature decoupling, comprising:
a character-image feature encoding module, comprising a text feature encoding module, which is an LSTM-based text feature extraction network that generates text semantic features from the text description labels, and an image feature encoding module, which is a ResNet-based image feature extraction network that extracts the corresponding visual features for a given image; the two modules are trained together, the association between text and image features is constrained with a ternary loss function during training, and the two modules encode the text and the image simultaneously and fuse their features;
a character-image feature decoding module, comprising a text feature decoding module, which is an LSTM-based text generation network that maps features to text and is trained with a cross-entropy loss function, and an image feature decoding module, which maps the fusion features to the image space, constrains the realism of the generated images with the adversarial loss function and the perceptual loss function, and constrains the relevance between text and images with the conditional adversarial loss function;
and a module for sampling the fused text-image features in a random sampling manner, decoupling them to obtain semantically associated text and image features, and generating text-image pairs simultaneously with the text and image decoding modules to obtain the corresponding output images.
CN202210148651.XA 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling Active CN114677569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148651.XA CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148651.XA CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Publications (2)

Publication Number Publication Date
CN114677569A CN114677569A (en) 2022-06-28
CN114677569B true CN114677569B (en) 2024-05-10

Family

ID=82072241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148651.XA Active CN114677569B (en) 2022-02-17 2022-02-17 Character-image pair generation method and device based on feature decoupling

Country Status (1)

Country Link
CN (1) CN114677569B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112818646A (en) * 2021-02-26 2021-05-18 南京邮电大学 Method for editing pictures according to texts based on generation countermeasure network and dynamic editing module
CN113935899A (en) * 2021-09-06 2022-01-14 杭州志创科技有限公司 Ship plate image super-resolution method based on semantic information and gradient supervision

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110866958A (en) * 2019-10-28 2020-03-06 清华大学深圳国际研究生院 Method for text to image
CN112818646A (en) * 2021-02-26 2021-05-18 南京邮电大学 Method for editing pictures according to texts based on generation countermeasure network and dynamic editing module
CN113935899A (en) * 2021-09-06 2022-01-14 杭州志创科技有限公司 Ship plate image super-resolution method based on semantic information and gradient supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Haijun; Liu Xueliang. Image caption generation method incorporating constraint learning. Journal of Image and Graphics, 2020, (02), full text. *

Also Published As

Publication number Publication date
CN114677569A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
Zhan et al. Multimodal image synthesis and editing: A survey and taxonomy
CN111858954B (en) Task-oriented text-generated image network model
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111369646B (en) Expression synthesis method integrating attention mechanism
CN114677569B (en) Character-image pair generation method and device based on feature decoupling
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
Bai et al. Loopy residual hashing: Filling the quantization gap for image retrieval
Li et al. [Retracted] Multimedia Data Processing Technology and Application Based on Deep Learning
Yan et al. Comprehensive visual question answering on point clouds through compositional scene manipulation
Zhan et al. Multimodal image synthesis and editing: A survey
Bende et al. VISMA: A Machine Learning Approach to Image Manipulation
CN114399646B (en) Image description method and device based on transform structure
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
Shah et al. Inferring context from pixels for multimodal image classification
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN115270917A (en) Two-stage processing multi-mode garment image generation method
Sun et al. PattGAN: Pluralistic Facial Attribute Editing
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
Kong et al. DualPathGAN: Facial reenacted emotion synthesis
Bayoumi et al. Text-to-Image Synthesis: A Comparative Study
CN111898456B (en) Text modification picture network model training method based on multi-level attention mechanism
Chen et al. A review of multimodal learning for text to images
US20230360294A1 (en) Unsupervised style and color cues for transformer-based image generation
Patil et al. Real-time Audio Video Summarization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant