CN108959322B - Information processing method and device for generating image based on text - Google Patents


Info

Publication number
CN108959322B
CN108959322B (application number CN201710379515.0A)
Authority
CN
China
Prior art keywords
text
image
local
decoder
sample
Prior art date
Legal status
Active
Application number
CN201710379515.0A
Other languages
Chinese (zh)
Other versions
CN108959322A (en)
Inventor
侯翠琴
夏迎炬
杨铭
张姝
孙俊
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201710379515.0A priority Critical patent/CN108959322B/en
Publication of CN108959322A publication Critical patent/CN108959322A/en
Application granted granted Critical
Publication of CN108959322B publication Critical patent/CN108959322B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses an information processing method and an apparatus for generating an image based on text. The method comprises the following steps: extracting, from a sample text, text features representing the associations between words in the sample text; selectively intercepting respective local parts of the text features with a window of variable size to obtain respective local text features; and training an image generation model based on the respective local text features of the sample text and a sample image corresponding to the sample text, wherein the image generation model comprises an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to an input text according to the respective local text features of the input text, and each local text feature is intercepted in a respective iteration.

Description

Information processing method and device for generating image based on text
Technical Field
The present invention relates to the field of information processing, in particular to the field of deep learning, and more particularly to an information processing method and an apparatus for generating an image based on text.
Background
Automatically generating images from natural-language descriptions is a very important research topic in the field of artificial intelligence and has very wide application. Deep learning approaches have made considerable progress in this regard. In deep learning, two methods are mainly used for generating images: one is the variational auto-encoding method, and the other is the generative adversarial network method.
The variational auto-encoding method proposed by Kingma & Welling can be regarded as a neural network with continuous hidden variables. The encoder-side model approximates the posterior probability distribution of the hidden variables, and the decoder-side model constructs an image based on the probability distribution of the hidden variables. Gregor et al. proposed the Deep Recurrent Attentive Writer (DRAW) model to generate images, which extends the variational auto-encoding method to a sequential variational auto-encoding framework.
The generative adversarial network method includes a generator model that generates data based on a probability distribution and a discriminator model that judges whether data is real data or generated data. Gauthier proposed a conditional adversarial network to generate images of different classes. Denton et al. trained a conditional generative adversarial network for each layer of images under the Laplacian pyramid framework, and then generated images from coarse to fine based on the conditional adversarial network of each layer.
Although the above-described techniques for generating images exist in the prior art, there is still a need for improved methods for generating images based on text.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention, and it is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
The invention provides an information processing method, which comprises the following steps: extracting text features representing the relevance between words in the sample text from the sample text; selectively intercepting each local part of the text features by a window with variable size to obtain each local text feature; training an image generation model based on each local text feature of the sample text and a sample image corresponding to the sample text, wherein the image generation model comprises an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
According to another aspect of the present invention, there is provided an apparatus for generating an image based on text, including: a text feature extraction unit that extracts text features representing the relevance between words in a text; a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and an image generation model, wherein a decoder module in the image generation model generates an image corresponding to the input text in an iteration mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
According to a further aspect of the present invention, there is provided a method for generating an image based on text using the trained device described above, comprising: extracting, by the text feature extraction section, text features representing associations between words in a text; selectively intercepting, by the local text feature intercepting part, respective parts of the text features with a variable-sized window to obtain local text features; and a decoder module in the image generation model generates an image corresponding to the input text iteratively according to each local text feature of the input text, wherein each local text feature is intercepted in each iteration.
According to still another aspect of the present invention, there is also provided a storage medium. The storage medium includes a program code readable by a machine, which, when executed on an information processing apparatus, causes the information processing apparatus to execute the above-described method according to the present invention.
According to still another aspect of the present invention, there is also provided a program. The program comprises machine-executable instructions that, when executed on an information processing device, cause the information processing device to perform the above-described method according to the invention.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
Other features and advantages of the present invention will be more readily understood from the following description of the various embodiments of the invention taken with the accompanying drawings, which are for the purpose of illustrating embodiments of the invention by way of illustration only, and not in all possible implementations, and which are not intended to limit the scope of the invention. In the drawings:
fig. 1 is a schematic diagram illustrating a structure of an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating a structure of a text feature extraction section in an apparatus for generating an image based on a text according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a structure of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a training process of an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 5 is a flowchart illustrating a training process of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 6 is a schematic diagram showing a configuration example of an apparatus for generating an image based on text in a training state according to an embodiment of the present invention.
Fig. 7 is a schematic diagram showing a configuration example of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a method of generating an image using a trained text-based image generation apparatus according to an embodiment of the present invention.
Fig. 9 is a flowchart showing a process in which the decoder module generates an image in a use state.
Fig. 10 is a schematic diagram showing a configuration example of an apparatus for generating an image based on text in a use state according to an embodiment of the present invention.
FIG. 11 is a schematic block diagram illustrating a computer for implementing methods and apparatus in accordance with embodiments of the invention.
Detailed Description
Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the following description is only exemplary and is not intended to limit the present invention. Further, in the following description, the same reference numbers will be used to refer to the same or like parts in different drawings. The different features in the different embodiments described below can be combined with each other to form further embodiments within the scope of the invention.
Referring first to fig. 1, there is shown a schematic diagram illustrating the structure of an apparatus for generating an image based on text according to an embodiment of the present invention. As shown in fig. 1, the apparatus 100 includes a text feature extraction section 110, a local text feature intercepting section 120, and an image generation model 130.
The text feature extraction section 110 is configured to extract text features representing associations between words in a text. Specifically, as shown in fig. 2 (but not in fig. 1), the text feature extraction section 110 includes a vectorization unit 111 and a text feature extraction unit 112. The vectorization unit 111 vectorizes the text using existing distributed representation techniques, such as the log-bilinear language model (LBL), the C&W model, Word2vec, etc., to obtain low-dimensional word vectors. The text feature extraction unit 112 extracts text features characterizing the associations between words in the text based on the word vectors, using well-known forward and backward recurrent neural networks. Here, the text features may also be extracted using a forward recurrent neural network or a backward recurrent neural network alone.
The local text feature intercepting section 120 selectively intercepts respective local parts of the text features with a variable-size window to obtain the local text features. Each local text feature is intercepted in a respective iteration of the image generation model 130; in the current iteration, the local text feature intercepting section 120 intercepts the local text feature based on the output of the decoder in the decoder module in the previous iteration.
The image generation model 130 is trained based on the respective local text features of the sample text and the sample image corresponding to the sample text. The decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to the respective local text features of the input text. The image generation model 130 may be the well-known DRAW (Deep Recurrent Attentive Writer) model.
FIG. 3 shows a schematic structural diagram of the image generation model 130 according to an embodiment of the present invention. As shown in FIG. 3, the image generation model 130 includes a decoder module 131, an encoder module 132, and a calculation module 133.
In the training state, the encoder module 132 iteratively compresses the sample image and outputs, in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text. In the training state, the decoder module 131 iteratively generates an output image based on the respective local text features and the respective first distributions of the sample text, and calculates second distributions of the feature quantities in the respective iterations. Here, the encoder module 132 and the decoder module 131 are each implemented by a Recurrent Neural Network (RNN). The calculation module 133 calculates a loss function of the image generation model based on the sample image, the output image, the first distribution and the second distribution to optimize the image generation model.
In a use state in which the trained apparatus 100 is used to generate an image based on a text, the decoder module 131 iteratively generates an image corresponding to the input text according to each local text feature of the input text and each second distribution, each local text feature being truncated in each iteration. In the use state, the encoder module 132 does not participate in the operation.
As shown in fig. 3, the encoder module 132 includes a reading section 1321, an encoder 1322, and a building section 1323. The reading section 1321 reads a part of the sample image based on the output of the decoder module and the output of the decoder in the previous iteration to obtain a local sample image. The encoder 1322 compresses the local sample image based on the output of the encoder and decoder in the previous iteration. The constructing section 1323 constructs the first distribution based on the output of the encoder.
The decoder module 131 includes a sampling section 1311, a decoder 1312, a construction section 1313, and a writing section 1314. In the training state, the sampling unit 1311 collects feature amounts from the first distribution. The decoder 1312 decodes the collected feature quantity based on the local text feature and the decoder output in the previous iteration. The construction section 1313 constructs the second distribution based on the output of the decoder in the previous iteration. The write-out unit 1314 writes out the decoder output in the current iteration into the corresponding region of the distribution matrix. The decoder module generates an output image based on the final resulting distribution matrix.
In the use state, the construction section 1313 constructs the second distribution based on the output of the decoder in the previous iteration. The sampling section 1311 collects feature amounts from the second distribution. The decoder 1312 decodes the collected feature quantity based on the local text feature and the decoder output in the previous iteration. The write-out unit 1314 writes out the decoder output in the current iteration into the corresponding region of the distribution matrix.
The calculation module 133 includes a first calculation part 1331, a second calculation part 1332, and a determination part 1333. The first calculation section 1331 calculates a first loss function with respect to the sample image and the output image. The second calculation section 1332 calculates a second loss function with respect to the first distribution and the second distribution. The determination section 1333 determines a total loss function based on the first loss function and the second loss function.
Next, the training process of the apparatus 100 is described with reference to fig. 4 to 7. Fig. 4 is a flowchart illustrating a training process of an apparatus for generating an image based on text according to an embodiment of the present invention. As shown in fig. 4, the training process 200 includes steps S210 to S230.
In step S210, text features characterizing the association between words in the sample text are extracted from the sample text. Specifically, the sample text is first vectorized using well-known distributed representation techniques to obtain a plurality of word vectors of low dimensionality. Text features characterizing the associations between words in the sample text are then extracted based on the word vectors, where the text features can be extracted using a forward recurrent neural network and/or a backward recurrent neural network.
In step S220, respective local parts of the text features are selectively intercepted with a variable-size window to obtain the respective local text features.
In step S230, an image generation model is trained based on each local text feature of the sample text and a sample image corresponding to the sample text, wherein the image generation model includes an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to each local text feature of the input text, and each local text feature is respectively truncated in each iteration.
Fig. 5 shows a specific flow of the training process of the image generation model 130. As shown in fig. 5, the process of step S230 specifically includes steps S231 to S233.
Referring to fig. 5, in step S231, the sample image is iteratively compressed by the encoder module, and a first distribution of feature quantities characterizing key information of the sample image and the sample text is output from the encoder module in each iteration. Specifically, a part of the sample image is read based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and the local sample image is compressed based on the output of the decoder and encoder in the previous iteration. In addition, a first distribution in each iteration is constructed based on the output of the encoder.
In step S232, based on the respective local text features and the respective first distributions of the sample text, a second distribution of feature quantities in the respective iterations is calculated with the decoder module and an output image is iteratively generated. Specifically, feature quantities are collected from the first distribution in each iteration; decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration; constructing a second distribution based on the output of the decoder in the previous iteration; writing the output of the decoder to the same matrix in each iteration as the output of the decoder module; and generating an output image based on the resulting matrix.
In step S233, a loss function of the image generation model is calculated based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model. Specifically, a first loss function between the sample image and the output image is first calculated, a second loss function between the first distribution and the second distribution is then calculated, and finally an overall loss function is determined based on the first loss function and the second loss function, and parameters of the model are updated using, for example, a back propagation method to minimize the loss function.
The training process of the text-based image generation apparatus according to the embodiment of the present invention is specifically explained below with reference to configuration examples in fig. 6 and 7. Fig. 6 is a diagram illustrating a configuration example of the apparatus 100 for generating an image based on text in a training state according to an embodiment of the present invention. Fig. 7 is a schematic diagram showing a configuration example of an image generation model. In fig. 6 and 7, the image generation model is shown as a DRAW model, however, the image generation model of the present invention is not limited to DRAW, and any other model capable of implementing the present invention may be adopted as needed by those skilled in the art.
In the following description, RNN_enc denotes the function implemented by the encoder 1322 in a single iteration step, and the output of RNN_enc in the t-th iteration step is a hidden vector h_t^enc of the encoding end. Similarly, RNN_dec denotes the function implemented by the decoder 1312 in a single iteration step, and the output of RNN_dec in the t-th iteration step is a hidden vector h_t^dec of the decoding end. RNN_f denotes the function implemented in a single iteration step by the forward recurrent neural network in the text feature extraction section 110, and the output of RNN_f in the t-th iteration step is a vector h_t^f. Similarly, RNN_b denotes the function implemented by the backward recurrent neural network in a single iteration step, and the output of RNN_b in the t-th iteration step is a vector h_t^b. In addition, in the following description, unless otherwise specified, b = W(a) denotes that the vector a is subjected to a linear weighting and offset operation to obtain the vector b. The specific training process is as follows:
Process 1. Initialization: the initial states of the recurrent neural networks of the encoding end and the decoding end are initialized, and the initial state of the bidirectional recurrent neural network is initialized. The state h_0^enc of the encoder, the state h_0^dec of the decoder, the state h_0^f of the forward recurrent neural network and the state h_{L-1}^b of the backward recurrent neural network are set to zero vectors of the corresponding dimensions. The distribution matrix C_0 is initialized as a zero matrix. The initial states of the write-out section 1314, the reading section 1321 and the local text feature intercepting section 120 are initialized. The value of the total number of iteration steps T is set.
Process 2. Extract text features from the sample text: a sentence y described in natural language is input, and a well-known distributed representation technique (such as the log-bilinear language model (LBL), the C&W model, Word2vec, etc.) is used to vectorize the sentence y into low-dimensional word vectors ey = (ey_0, ey_1, ..., ey_{L-1}), where L is the number of words contained in the sentence y. Inputting the L word vectors ey_i into the bidirectional recurrent neural network yields L bidirectional states S = (h_0^s, h_1^s, ..., h_{L-1}^s) = ([h_0^f, h_0^b], [h_1^f, h_1^b], ..., [h_{L-1}^f, h_{L-1}^b]) as the text features, where h_i^f = RNN_f(h_{i-1}^f, ey_i) and h_i^b = RNN_b(h_{i-1}^b, ey_{i_r}), with the reversed word vectors (ey_{0_r}, ey_{1_r}, ..., ey_{L-1_r}) = (ey_{L-1}, ..., ey_1, ey_0).
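Purely as an illustration of Process 2, the following Python/NumPy sketch computes the bidirectional states S under the assumption of a plain tanh recurrent cell; the cell type, the parameter names (W_f, U_f, b_f, W_b, U_b, b_b) and the dictionary layout are illustrative and are not specified by the patent.

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    # One recurrent step: h = tanh(W @ x + U @ h_prev + b).
    return np.tanh(W @ x + U @ h_prev + b)

def text_features(ey, p):
    """ey: list of L word vectors ey_0 ... ey_{L-1}; p: parameter dict.
    Returns S = (h_0^s, ..., h_{L-1}^s) with h_i^s = [h_i^f, h_i^b]."""
    L = len(ey)
    h_f = np.zeros(p["U_f"].shape[0])   # h_0^f initialised to the zero vector
    h_b = np.zeros(p["U_b"].shape[0])   # h_{L-1}^b initialised to the zero vector
    fwd, bwd = [], []
    for i in range(L):
        h_f = rnn_step(h_f, ey[i], p["W_f"], p["U_f"], p["b_f"])          # h_i^f
        fwd.append(h_f)
        h_b = rnn_step(h_b, ey[L - 1 - i], p["W_b"], p["U_b"], p["b_b"])  # h_i^b over the reversed sequence
        bwd.append(h_b)
    # Concatenate forward and backward states element by element.
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```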
Process 3. Intercept local text features: the local text feature intercepting section 120 selectively intercepts each local part of the text features S with a variable-size attention window using the attention model Text_att. Specifically, the attention model calculates the center position and size of the attention window on S based on the decoder output h_{t-1}^dec in the (t-1)-th iteration step:
center position of the attention window: P_center = L × sigmoid(h_{t-1}^dec × W_att + b_att),
size of the attention window: K_width = 0.5 × L × sigmoid(h_{t-1}^dec × W_wid + b_wid),
where W_att, b_att, W_wid and b_wid are the parameters of the attention model Text_att.
Next, the attention model Text_att is applied to S to obtain s_t, the local text feature of S centered at P_center and having width K_width.
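A minimal sketch of Process 3, assuming S is the L×d array of bidirectional states from the previous sketch and that W_att, b_att, W_wid, b_wid map the decoder state to scalars; the hard slicing used to realize the window is an illustrative simplification of the Text_att attention model, not the patent's exact formulation.

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def text_att(S, h_dec_prev, W_att, b_att, W_wid, b_wid):
    """Intercept the local text feature s_t from S with a window whose
    center P_center and width K_width depend on h_{t-1}^dec."""
    L = S.shape[0]
    p_center = L * sigmoid(float(h_dec_prev @ W_att + b_att))        # window center
    k_width = 0.5 * L * sigmoid(float(h_dec_prev @ W_wid + b_wid))   # window width
    lo = int(max(0, np.floor(p_center - k_width / 2)))
    hi = int(min(L, np.ceil(p_center + k_width / 2)))
    return S[lo:hi]                                                   # local text feature s_t
```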
Process 4. Read the local sample image: the reading unit 1321 reads a part of the image x using the attention model Read_att. Specifically, each local image is obtained by applying a two-dimensional Gaussian filter array to the image x and changing the position and zoom of the attention window.
The position of an N×N Gaussian filter array in the image is located by specifying the center coordinates (gX, gY) of the filter array and the stride δ between adjacent filters. The stride δ controls the "zoom" of the attention window; in other words, the larger the stride δ, the larger the area of the local image taken from the original image, but the lower its resolution. In the filter array, the positions μ_X^i and μ_Y^j of the filter in the i-th row and the j-th column can be expressed as:
μ_X^i = gX + (i − N/2 − 0.5) × δ,
μ_Y^j = gY + (j − N/2 − 0.5) × δ.
In addition to the above attention parameters gX, gY and δ, two further attention parameters are required to determine the operation of the Gaussian filters: the variance σ² of the Gaussian filters, and a scalar intensity γ that scales the filter response. Given an A×B input image x, in each iteration step the five parameters are determined dynamically from the decoder output h^dec:
(g̃X, g̃Y, log σ², log δ̃, log γ) = W(h_t^dec),
gX = (A + 1)/2 × (g̃X + 1),
gY = (B + 1)/2 × (g̃Y + 1),
δ = (max(A, B) − 1)/(N − 1) × δ̃.
Given the above attention parameters, the horizontal filter matrix F_X and the vertical filter matrix F_Y of the filter array (of dimensions N×A and N×B, respectively) are defined as follows:
F_X[i, a] = (1/Z_X) exp(−(a − μ_X^i)² / (2σ²)),
F_Y[j, b] = (1/Z_Y) exp(−(b − μ_Y^j)² / (2σ²)),
where (i, j) is a point in the attention window, with i and j ranging from 0 to N−1; (a, b) is a point in the input image, with a and b ranging over [0, A−1] and [0, B−1], respectively; and Z_X, Z_Y are normalization constants such that Σ_a F_X[i, a] = 1 and Σ_b F_Y[j, b] = 1.
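As an illustrative NumPy sketch, the filterbank matrices defined above can be assembled as follows; the grid-center formulas are the DRAW-style ones reconstructed above, and the function name and the small normalization guard are assumptions.

```python
def filterbank(gX, gY, sigma2, delta, N, A, B):
    """Build the N x A horizontal filter matrix F_X and the N x B vertical
    filter matrix F_Y of an N x N Gaussian filter array."""
    i = np.arange(N)
    mu_x = gX + (i - N / 2 - 0.5) * delta            # mu_X^i, grid centers along x
    mu_y = gY + (i - N / 2 - 0.5) * delta            # mu_Y^j, grid centers along y
    a = np.arange(A)
    b = np.arange(B)
    F_X = np.exp(-((a[None, :] - mu_x[:, None]) ** 2) / (2 * sigma2))
    F_Y = np.exp(-((b[None, :] - mu_y[:, None]) ** 2) / (2 * sigma2))
    F_X /= np.maximum(F_X.sum(axis=1, keepdims=True), 1e-8)   # Z_X: each row sums to 1
    F_Y /= np.maximum(F_Y.sum(axis=1, keepdims=True), 1e-8)   # Z_Y: each row sums to 1
    return F_X, F_Y
```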
Given F_X, F_Y and the intensity γ determined by h_{t-1}^dec, together with the input image x and the error image x̂_t, where
x̂_t = x − σ(C_{t-1})
and σ denotes the logistic sigmoid function
σ(z) = 1 / (1 + exp(−z)),
the reading section returns the concatenation of two N×N matrices computed from the input image and the error image:
read(x, x̂_t, h_{t-1}^dec) = γ [F_Y x F_X^T, F_Y x̂_t F_X^T].
here, the same filter matrix is applied to both the input image and the error image.
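Continuing the sketch, the read operation of Process 4 then reduces to two filtered glimpses of the input and error images; `filterbank` is the illustrative helper above, `attention_params` is an assumed stand-in for the learned linear map from h_{t-1}^dec to (gX, gY, σ², δ, γ), and the canvas C_{t-1} is assumed to have the same B×A shape as the image.

```python
def read(x, C_prev, h_dec_prev, params):
    """Return gamma * [F_Y x F_X^T, F_Y x_hat F_X^T] as an N x 2N array."""
    # The five attention parameters would be computed from h_{t-1}^dec by a
    # learned linear map; attention_params is an assumed helper for that.
    gX, gY, sigma2, delta, gamma = attention_params(h_dec_prev, params)
    F_X, F_Y = filterbank(gX, gY, sigma2, delta, params["N"], x.shape[1], x.shape[0])
    x_hat = x - sigmoid(C_prev)                      # error image
    glimpse_x = gamma * (F_Y @ x @ F_X.T)            # glimpse of the input image
    glimpse_err = gamma * (F_Y @ x_hat @ F_X.T)      # glimpse of the error image
    return np.concatenate([glimpse_x, glimpse_err], axis=1)
```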
Process 5. Compress the sample image: in the t-th iteration step, h_{t-1}^dec, x_t and x̂_t are input to the encoder 1322 to obtain the encoder state
h_t^enc = RNN_enc(h_{t-1}^enc, [x_t, x̂_t, h_{t-1}^dec]),
where W_enc1, W_enc2 and b_enc are the parameters of the encoder.
Process 6. Construct the first distribution: based on the encoder output h_t^enc, the first distribution Q(Z_t | z_1, ..., z_{t-1}, x, y) over the feature quantity z_t is constructed. Here, the first distribution Q obeys a Gaussian distribution with mean μ_t and variance σ_t expressed by the following equations:
Q(Z_t | z_1, ..., z_{t-1}, x, y) = N(Z_t | μ_t, σ_t),
μ_t = W(h_t^enc),
σ_t = exp(W(h_t^enc)).
The first distribution Q is not limited to the gaussian distribution described above, and those skilled in the art can select other suitable distributions according to actual needs.
Process 7. Sample the feature quantity from the first distribution: the sampling unit 1311 samples from the first distribution Q(Z_t | z_1, ..., z_{t-1}, x, y) to obtain the feature quantity z_t.
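Processes 6 and 7 together amount to drawing a reparameterized Gaussian sample from the encoder state; a minimal sketch, assuming W_mu, b_mu, W_sig, b_sig are the linear maps written W(·) in the text.

```python
def sample_q(h_enc, W_mu, b_mu, W_sig, b_sig, rng=np.random):
    """Construct Q(Z_t | z_1..t-1, x, y) = N(mu_t, sigma_t) and sample z_t."""
    mu_t = h_enc @ W_mu + b_mu                       # mu_t = W(h_t^enc)
    sigma_t = np.exp(h_enc @ W_sig + b_sig)          # sigma_t = exp(W(h_t^enc))
    z_t = mu_t + sigma_t * rng.standard_normal(mu_t.shape)
    return z_t, mu_t, sigma_t
```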
Process 8. Decode the feature quantity: z_t and s_t are input to the decoder 1312 to obtain the state h_t^dec of the decoder 1312 at the t-th iteration step.
Process 9. Write the output of the decoder out to the distribution matrix: the write-out section 1314 writes the output h_t^dec of the decoder at the t-th iteration step out to the distribution matrix C using the attention model Write_att. Specifically, the five parameters (gX, gY, σ, δ, γ) of the attention model Write_att are calculated similarly to Process 4:
(g̃X, g̃Y, log σ², log δ̃, log γ) = W(h_t^dec),
gX = (A + 1)/2 × (g̃X + 1),
gY = (B + 1)/2 × (g̃Y + 1),
δ = (max(A, B) − 1)/(N − 1) × δ̃,
where W(h_t^dec) = sigmoid(h_t^dec × W_write + b_write). The filter matrices F_x and F_y of the Gaussian filters are, respectively:
F_x[i, a] = (1/Z_x) exp(−(a − μ_x^i)² / (2σ²)),
F_y[j, b] = (1/Z_y) exp(−(b − μ_y^j)² / (2σ²)).
Next, the attention model Write_att is applied to h_t^dec to obtain the matrix Write_t:
Write_t = (1/γ) F_y^T W(h_t^dec) F_x.
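A sketch of the write step of Process 9, mirroring the read sketch; `write_attention_params` and the linear map producing the N×N write patch are illustrative assumptions rather than the patent's exact implementation.

```python
def write(h_dec, params):
    """Map h_t^dec to an N x N patch and spread it onto a B x A canvas update."""
    gX, gY, sigma2, delta, gamma = write_attention_params(h_dec, params)  # assumed helper
    N, A, B = params["N"], params["A"], params["B"]
    F_X, F_Y = filterbank(gX, gY, sigma2, delta, N, A, B)
    w_t = (h_dec @ params["W_write"] + params["b_write"]).reshape(N, N)   # write patch
    return (1.0 / gamma) * (F_Y.T @ w_t @ F_X)        # Write_t, shape B x A
```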
Process 10. Construct the second distribution: based on h_{t-1}^dec, the second distribution P(Z_t | z_1, ..., z_{t-1}) is constructed. The second distribution P obeys a Gaussian distribution N(Z_t | μ′_t, σ′_t) with mean μ′_t and variance σ′_t, where:
μ′_t = W(h_{t-1}^dec),
σ′_t = exp(W(h_{t-1}^dec)).
process 11. update distribution matrix: updating distribution matrix Ct=Ct-1+WritetWhere C is a matrix of the same size as the input image.
Process 12. iterative operation: the processes 3 to 11 are repeatedly performed until the maximum number of iterations T is satisfied.
Process 13. Calculate the loss function, and update the parameters of the apparatus 100 using back propagation to minimize the loss function. The loss function used here is:
Loss = −log P(x | y, z_1, ..., z_T) + Σ_{t=1}^{T} KL( Q(Z_t | z_1, ..., z_{t-1}, x, y) || P(Z_t | z_1, ..., z_{t-1}) ),
where −log P(x | y, z_1, ..., z_T) represents the image reconstruction loss, which can be understood as measuring the similarity of the generated image to the input image, and the KL term represents the loss between the constructed first distribution Q and second distribution P.
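Assuming, purely for illustration, that both Q and P are diagonal Gaussians as constructed above and that the reconstruction term is realized as a Bernoulli (binary cross-entropy) likelihood on sigmoid(C_T) — one common choice, not dictated by the text — the loss of Process 13 could be computed as:

```python
def total_loss(x, C_T, mus_q, sigmas_q, mus_p, sigmas_p, eps=1e-8):
    """Reconstruction loss plus the sum over iteration steps of KL(Q_t || P_t)."""
    x_rec = sigmoid(C_T)
    # -log P(x | y, z_1..T) as binary cross-entropy between x and the reconstruction.
    rec_loss = -np.sum(x * np.log(x_rec + eps) + (1 - x) * np.log(1 - x_rec + eps))
    kl = 0.0
    for mq, sq, mp, sp in zip(mus_q, sigmas_q, mus_p, sigmas_p):
        # KL divergence between diagonal Gaussians N(mq, sq^2) and N(mp, sp^2).
        kl += np.sum(np.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5)
    return rec_loss + kl
```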
The training process of the apparatus 100 for generating an image based on text is described above. A method of generating an image using the trained apparatus 100 and a configuration of the apparatus 100 in a use state will be described below with reference to fig. 8 to 10.
FIG. 8 illustrates a flowchart of a method for generating an image based on text using the trained apparatus 100 according to an embodiment of the present invention. As shown in fig. 8, the method 300 includes steps S310 to S330. In step S310, text features characterizing the associations between words in the input text are extracted from the input text by the text feature extraction section. In step S320, respective local parts of the text features are selectively intercepted with a variable-size window by the local text feature intercepting section to obtain the respective local text features. In step S330, the decoder module iteratively generates an image corresponding to the input text according to the respective local text features of the input text, each local text feature being intercepted in a respective iteration.
The operations of step S310 and step S320 are the same as the operations of step S210 and step S220 in fig. 4, and for the sake of simplicity, the description is omitted here. Hereinafter, the process of step S330 is specifically described with reference to fig. 9.
As shown in fig. 9, the process S330 in which the decoder module generates an image includes steps S331 to S335. In step S331, the second distribution P is constructed by the constructing section 1313 in each iteration based on the output of the decoder in the previous iteration. In step S332, the sampling unit 1311 collects the feature quantity from the second distribution P. In step S333, the collected feature quantity is decoded by the decoder 1312 based on the local text feature and the decoder output in the previous iteration. In step S334, the output of the decoder is written out to the same matrix by the write-out section 1314 in each iteration. In step S335, an output image is generated by the decoder module 131 based on the resulting matrix.
Fig. 10 is a schematic diagram showing a configuration example of the apparatus 100 in the use state according to the embodiment of the present invention. The encoder module 132 is omitted from fig. 10 because the first distribution Q does not need to be constructed in the use state, so the encoder module does not participate in the operation.
Next, a specific process of generating an image using the trained apparatus 100 is explained in detail with reference to fig. 10.
Process 1. Initialization: the initial state of the decoding-end recurrent neural network is initialized, and the initial state of the bidirectional recurrent neural network is initialized. The state h_0^dec of the decoder, the state h_0^f of the forward recurrent neural network, the state h_{L-1}^b of the backward recurrent neural network and the distribution matrix C_0 are set to zero vectors or matrices of the corresponding dimensions. The initial states of the write-out section 1314 and the local text feature intercepting section 120 are initialized. The value of the total number of iteration steps T is set; preferably, the value of T in the use state is the same as the value of T in the training state.
Process 2. Extract text features: the text features S of the input text y are extracted using well-known distributed representation techniques.
Process 3. Intercept local text features: the center position and size of the attention window of the attention model Text_att in the local text feature intercepting section 120 are calculated based on h_{t-1}^dec, and the attention model Text_att is applied to S to obtain the local text feature s_t.
Process 4. Construct the second distribution: based on h_{t-1}^dec, the second distribution P(Z_t | z_1, ..., z_{t-1}) over the feature quantity z_t is constructed.
Process 5. Sample the feature quantity from the second distribution: the sampling unit 1311 samples the feature quantity z_t from the second distribution P(Z_t | z_1, ..., z_{t-1}).
Process 6. Decode the feature quantity: z_t and s_t are input to the decoder 1312 to obtain the state h_t^dec of the t-th step.
Process 7. Write the output of the decoder out to the distribution matrix: based on h_t^dec, the parameters (gX, gY, σ, δ, γ) of the attention model Write_att in the write-out section 1314 are calculated, and the attention model Write_att is applied to h_t^dec to obtain the matrix Write_t.
Process 8. Update the distribution matrix: the distribution matrix is updated as C_t = C_{t-1} + Write_t.
Process 9. iterative operation: the processes 3 to 8 are repeatedly performed until the maximum number of iterations T is satisfied.
Process 10. Generate the image: based on the matrix C_T, an output image x′ is generated, x′ = sigmoid(C_T).
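Pulling the use-state processes together, the generation loop might be organized as below; every helper (text_att, write, the prior parameters W_mu_p/W_sig_p, and decoder_step for one decoder RNN step) is either one of the illustrative sketches above or an assumed stand-in, not the patent's exact implementation.

```python
def generate_image(S, params, T, rng=np.random):
    """Iteratively build the canvas and return x' = sigmoid(C_T)."""
    A, B = params["A"], params["B"]
    C = np.zeros((B, A))                              # distribution matrix C_0
    h_dec = np.zeros(params["dec_dim"])               # h_0^dec
    for _ in range(T):
        s_t = text_att(S, h_dec, params["W_att"], params["b_att"],
                       params["W_wid"], params["b_wid"])               # Process 3
        mu_p = h_dec @ params["W_mu_p"] + params["b_mu_p"]             # Process 4
        sigma_p = np.exp(h_dec @ params["W_sig_p"] + params["b_sig_p"])
        z_t = mu_p + sigma_p * rng.standard_normal(mu_p.shape)         # Process 5
        h_dec = decoder_step(h_dec, z_t, s_t, params)                  # Process 6 (assumed helper)
        C = C + write(h_dec, params)                                   # Processes 7-8
    return sigmoid(C)                                                  # Process 10: x' = sigmoid(C_T)
```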
In addition, it is noted that the components of the above system may be configured by software, firmware, hardware or a combination thereof. The specific means or manner in which the configuration can be used is well known to those skilled in the art and will not be described further herein. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network to a computer (for example, a general-purpose computer 1100 shown in fig. 11) having a dedicated hardware configuration, and the computer can execute various functions and the like when various programs are installed.
FIG. 11 shows a schematic block diagram of a computer that may be used to implement methods and systems according to embodiments of the invention.
In fig. 11, a Central Processing Unit (CPU)1101 performs various processes in accordance with a program stored in a Read Only Memory (ROM)1102 or a program loaded from a storage section 1108 to a Random Access Memory (RAM) 1103. In the RAM 1103, data necessary when the CPU 1101 executes various processes and the like is also stored as necessary. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other via a bus 1104. An input/output interface 1105 is also connected to bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, and the like), an output section 1107 (including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker and the like), a storage section 1108 (including a hard disk and the like), and a communication section 1109 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 may also be connected to the input/output interface 1105 as needed. A removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory can be mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1111.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1111 shown in fig. 11, in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1102, a hard disk included in the storage section 1108, or the like, in which programs are stored and which are distributed to users together with the device including them.
The present invention also provides a program product having machine-readable instruction code stored thereon. When read and executed by a machine, the instruction code causes the machine to perform the method according to the embodiments of the present invention.
Accordingly, storage media carrying the above-described program product having machine-readable instruction code stored thereon are also included within the scope of the present invention. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
It should be noted that the method of the present invention is not limited to being performed in the chronological order described in the specification, and may be performed sequentially in other orders, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
The foregoing description of the various embodiments of the invention is provided for the purpose of illustration only and is not intended to be limiting of the invention. It should be noted that in the above description, features described and/or illustrated with respect to one embodiment may be used in the same or similar manner in one or more other embodiments, in combination with or instead of the features of the other embodiments. It will be understood by those skilled in the art that various changes and modifications may be made to the above-described embodiments without departing from the inventive concept of the present invention, and all such changes and modifications are intended to be included within the scope of the present invention.
In summary, in the embodiments according to the present invention, the present invention provides the following technical solutions.
Scheme 1. an information processing method, comprising:
extracting text features representing the relevance between words in the sample text from the sample text;
selectively intercepting each local part of the text features by using a window with a variable size to obtain each local text feature;
training an image generation model based on respective local text features of the sample text and sample images corresponding to the sample text,
the decoder module in the trained image generation model generates images corresponding to the input text in an iterative mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
Scheme 2. the information processing method according to scheme 1, wherein extracting the text features from the sample text comprises:
vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
extracting text features characterizing associations between words in the sample text based on the word vectors.
Scheme 3. the information processing method according to scheme 2, wherein the text features are extracted using a forward recurrent neural network and/or a backward recurrent neural network.
Scheme 4. the information processing method according to scheme 1, wherein training the image generation model comprises:
iteratively compressing the sample image with the encoder module and outputting, from the encoder module in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text;
iteratively generating, with the decoder module, an output image based on the respective local text features and the respective first distributions of the sample text, and constructing, with the decoder module, second distributions of the feature quantities in the respective iterations; and
calculating a loss function of the image generation model based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model.
Scheme 5. the information processing method of scheme 4, wherein in each iteration the local text feature is truncated based on the output of the decoder in the decoder module in the previous iteration.
Scheme 6. the information processing method of scheme 5, wherein, in each iteration, iteratively compressing the sample image with an encoder module comprises:
reading a local portion of the sample image based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and
the encoder compresses the local sample image based on the encoder in the encoder module and the decoder's output in the previous iteration.
Scheme 7. the information processing method of scheme 6, wherein the first distribution in each iteration is constructed based on the output of the encoder.
Scheme 8. the information processing method of scheme 7, wherein iteratively generating an image with the decoder module comprises:
acquiring the feature quantity from the first distribution in each iteration;
decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration;
constructing the second distribution based on an output of a decoder in a previous iteration;
writing out the output of the decoder to the same matrix in each iteration as the output of the decoder module; and
generating the output image based on the resulting matrix.
Scheme 9. the information processing method according to scheme 8, wherein calculating a loss function of the image generation model comprises:
calculating a first loss function for the sample image and the output image;
calculating a second loss function with respect to the first distribution and the second distribution; and
determining the loss function based on the first loss function and the second loss function.
Scheme 10. the information processing method according to any one of schemes 6 to 9, wherein the encoder and the decoder are implemented using a recurrent neural network.
Scheme 11. the information processing method according to scheme 10, wherein the image generation model is a DRAW neural network.
Scheme 12. an apparatus for generating an image based on text, comprising:
a text feature extraction unit that extracts text features representing the relevance between words in a text;
a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and
an image generation model in which a decoder module iteratively generates an image corresponding to an input text from respective local text features of the input text, each local text feature being truncated in a respective iteration.
Scheme 13. The apparatus according to scheme 12, wherein the text feature extraction section includes:
the vectorization unit is used for vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
a text feature extraction unit that extracts text features characterizing the association between words in the sample text based on the word vectors.
Scheme 14. The apparatus according to scheme 13, wherein the text feature extraction section extracts the text features using a forward recurrent neural network and/or a backward recurrent neural network.
Scheme 15. The apparatus according to scheme 12, wherein the image generation model comprises:
an encoder module that iteratively compresses the sample image and outputs, in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text;
a decoder module that calculates a second distribution of the feature quantity in each iteration and iteratively generates an output image based on each local text feature and each first distribution of the sample text; and
a calculation module to calculate a loss function of the image generation model based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model.
Scheme 16. The apparatus according to scheme 15, wherein the encoder module comprises:
a reading section that reads a part of the sample image based on an output of the decoder module and an output of a decoder within the decoder module in a previous iteration to obtain a local sample image;
an encoder to compress the local sample image based on outputs of the encoder and the decoder in a previous iteration; and
a construction section that constructs a first distribution based on an output of the encoder in each iteration.
Scheme 17. The apparatus according to scheme 16, wherein the decoder module comprises:
a sampling unit that collects the feature amount from the first distribution in each iteration;
a decoding unit that decodes the acquired feature amount based on the local text feature and a decoder output in a previous iteration;
a construction section that constructs the second distribution based on an output of the decoder in a previous iteration; and
a write-out section writing out an output of the decoder to the same matrix as an output of the decoder module in each iteration,
wherein the decoder module generates the output image based on the final resulting matrix.
Scheme 18. The apparatus according to scheme 17, wherein the calculation module comprises:
a first calculation unit that calculates a first loss function with respect to the sample image and the output image;
a second calculation unit that calculates a second loss function with respect to the first distribution and the second distribution; and
a determination section that determines the loss function based on the first loss function and the second loss function.
Scheme 19. The apparatus according to any one of schemes 12 to 18, wherein the image generation model is a DRAW neural network.
Scheme 20. A method for generating an image based on text using a trained apparatus according to any one of schemes 12 to 19, comprising:
extracting, by the text feature extraction section, text features representing associations between words in a text;
selectively intercepting, by the local text feature intercepting part, respective parts of the text features with windows of varying sizes to obtain local text features; and
and a decoder module in the image generation model generates an image corresponding to the input text in an iteration mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.

Claims (9)

1. An information processing method comprising:
extracting text features representing the relevance between words in the sample text from the sample text;
selectively intercepting each local part of the text features by using a window with a variable size to obtain each local text feature; and
training an image generation model, comprising:
iteratively compressing a sample image with an encoder module and outputting, from the encoder module in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text,
iteratively generating an output image based on respective local text features and respective first distributions of the sample text with a decoder module, and constructing a second distribution of the feature quantities in respective iterations with the decoder module, an
Calculating a loss function of the image generation model based on the sample image, the output image, the first distribution and the second distribution to optimize the image generation model,
the decoder module in the trained image generation model generates images corresponding to the input text in an iterative mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
2. The information processing method of claim 1, wherein extracting the text feature from the sample text comprises:
vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
extracting text features characterizing associations between words in the sample text based on the word vectors.
3. The information processing method according to claim 1, wherein in each iteration, the local text feature is truncated based on an output of a decoder in a decoder module in a previous iteration.
4. The information processing method of claim 3, wherein, in each iteration, iteratively compressing the sample image with an encoder module comprises:
reading a local portion of the sample image based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and
the encoder compresses the local sample image based on the encoder in the encoder module and the decoder's output in the previous iteration.
5. The information processing method of claim 4, wherein the first distribution in each iteration is constructed based on an output of the encoder.
6. The information processing method of claim 5, wherein iteratively generating an image with the decoder module comprises:
acquiring the feature quantity from the first distribution in each iteration;
decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration;
constructing the second distribution based on an output of a decoder in a previous iteration;
writing out the output of the decoder to the same matrix in each iteration as the output of the decoder module; and
generating the output image based on the resulting matrix.
7. The information processing method according to claim 6, wherein calculating a loss function of the image generation model includes:
calculating a first loss function for the sample image and the output image;
calculating a second loss function with respect to the first distribution and the second distribution; and
determining the loss function based on the first loss function and the second loss function.
8. The information processing method according to any one of claims 4 to 7, wherein the encoder and the decoder are implemented with a recurrent neural network.
9. An apparatus for generating an image based on text, comprising:
a text feature extraction unit that extracts text features representing the relevance between words in a text;
a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and
an image generation model in which a decoder module iteratively generates an image corresponding to an input text from respective local text features of the input text, each local text feature being truncated in a respective iteration.
CN201710379515.0A 2017-05-25 2017-05-25 Information processing method and device for generating image based on text Active CN108959322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710379515.0A CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710379515.0A CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Publications (2)

Publication Number Publication Date
CN108959322A CN108959322A (en) 2018-12-07
CN108959322B true CN108959322B (en) 2021-09-10

Family

ID=64494571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710379515.0A Active CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Country Status (1)

Country Link
CN (1) CN108959322B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340056B (en) * 2018-12-18 2023-09-08 富士通株式会社 Information processing method and information processing apparatus
CN109933320B (en) * 2018-12-28 2021-05-18 联想(北京)有限公司 Image generation method and server
CN109920016B (en) * 2019-03-18 2021-06-25 北京市商汤科技开发有限公司 Image generation method and device, electronic equipment and storage medium
CN110163267A (en) * 2019-05-09 2019-08-23 厦门美图之家科技有限公司 A kind of method that image generates the training method of model and generates image
CN111985243B (en) * 2019-05-23 2023-09-08 中移(苏州)软件技术有限公司 Emotion model training method, emotion analysis device and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9836671B2 (en) * 2015-08-28 2017-12-05 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1402853A (en) * 1999-12-02 2003-03-12 三菱电机株式会社 Image retrieval system and image retrieval method
CN101924851A (en) * 2009-06-16 2010-12-22 佳能株式会社 Image processing apparatus and image processing method
CN101655912A (en) * 2009-09-17 2010-02-24 上海交通大学 Method for detecting computer generated image and natural image based on wavelet transformation
CN106529586A (en) * 2016-10-25 2017-03-22 天津大学 Image classification method based on supplemented text characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning to generalize to new compositions in image understanding; Atzmon, Y. et al.; arXiv; 2016-08-26; full text *
Research on image semantic understanding based on deep learning; Liang Huan; China Masters' Theses Full-text Database (electronic journal); 2017-03-15 (No. 3); full text *

Also Published As

Publication number Publication date
CN108959322A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108959322B (en) Information processing method and device for generating image based on text
US11475343B2 (en) Database utilizing spatial probability models for data compression
US10671889B2 (en) Committed information rate variational autoencoders
CN106649542B (en) System and method for visual question answering
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
US11657230B2 (en) Referring image segmentation
US20190180154A1 (en) Text recognition using artificial intelligence
US11010666B1 (en) Systems and methods for generation and use of tensor networks
CN111832440B (en) Face feature extraction model construction method, computer storage medium and equipment
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
US11966707B2 (en) Quantum enhanced word embedding for natural language processing
CN112434131A (en) Text error detection method and device based on artificial intelligence, and computer equipment
CN111562915A (en) Generation method and device of front-end code generation model
CN111598444A (en) Well logging lithology identification method and system based on convolutional neural network
CN112667979A (en) Password generation method and device, password identification method and device, and electronic device
Yan et al. A hybrid evolutionary algorithm for multiobjective sparse reconstruction
Amram et al. Denoising diffusion models with geometry adaptation for high fidelity calorimeter simulation
Lu et al. On the asymptotical regularization for linear inverse problems in presence of white noise
CN116452706A (en) Image generation method and device for presentation file
CN111161266A (en) Multi-style font generation method of variational self-coding machine based on vector quantization
US20210365622A1 (en) Noise mitigation through quantum state purification by classical ansatz training
CN113408418A (en) Calligraphy font and character content synchronous identification method and system
US20240135610A1 (en) Image generation using a diffusion model
CN116758618B (en) Image recognition method, training device, electronic equipment and storage medium
Bhandari et al. Nepali Handwritten Letter Generation using GAN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant