CN108959322B - Information processing method and device for generating image based on text - Google Patents


Info

Publication number
CN108959322B
CN108959322B (application number CN201710379515.0A)
Authority
CN
China
Prior art keywords
text
image
local
decoder
sample
Prior art date
Legal status
Active
Application number
CN201710379515.0A
Other languages
Chinese (zh)
Other versions
CN108959322A (en)
Inventor
侯翠琴
夏迎炬
杨铭
张姝
孙俊
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201710379515.0A priority Critical patent/CN108959322B/en
Publication of CN108959322A publication Critical patent/CN108959322A/en
Application granted granted Critical
Publication of CN108959322B publication Critical patent/CN108959322B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses an information processing method and an apparatus for generating an image based on text. The method comprises the following steps: extracting, from a sample text, text features representing the associations between words in the sample text; selectively intercepting respective local parts of the text features with a window of variable size to obtain respective local text features; and training an image generation model based on the respective local text features of the sample text and a sample image corresponding to the sample text, wherein the image generation model comprises an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to an input text according to the respective local text features of the input text, and each local text feature is intercepted in a respective iteration.

Description

Information processing method and device for generating image based on text
Technical Field
The present invention relates to the field of information processing, in particular to the field of deep learning, and more particularly to an information processing method and an apparatus for generating an image based on text.
Background
Automatically generating images from natural-language descriptions is a very important research topic in the field of artificial intelligence and has very wide application. Deep learning approaches have made considerable progress in this regard. In deep learning, two methods are mainly used for generating images: one is the variational auto-encoding method, and the other is the generative adversarial network method.
The variational auto-encoding method proposed by Kingma & Welling can be regarded as a neural network with continuous hidden variables. The encoder-side model approximates the posterior probability distribution of the hidden variables, and the decoder-side model constructs an image based on the probability distribution of the hidden variables. Gregor et al. proposed the Deep Recurrent Attentive Writer (DRAW) model to generate images, which extends the variational auto-encoding method to a sequential variational auto-encoding framework.
The generative adversarial network method includes a generator model that generates data based on a probability distribution and a discriminator model that judges whether data is real data or generated data. Gauthier proposed a conditional adversarial network to generate images of different classes. Denton et al. trained a conditional generative adversarial network for each layer of images under the Laplacian pyramid framework, and then generated images from coarse to fine based on the conditional adversarial network of each layer.
Although the above-described techniques for generating images exist in the prior art, there is still a need for improved methods for generating images based on text.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention, and it is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
The invention provides an information processing method, which comprises the following steps: extracting text features representing the relevance between words in the sample text from the sample text; selectively intercepting each local part of the text features by a window with variable size to obtain each local text feature; training an image generation model based on each local text feature of the sample text and a sample image corresponding to the sample text, wherein the image generation model comprises an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
According to another aspect of the present invention, there is provided an apparatus for generating an image based on text, including: a text feature extraction unit that extracts text features representing the relevance between words in a text; a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and an image generation model, wherein a decoder module in the image generation model generates an image corresponding to the input text in an iteration mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
According to a further aspect of the present invention, there is provided a method for generating an image based on text using the trained device described above, comprising: extracting, by the text feature extraction section, text features representing associations between words in a text; selectively intercepting, by the local text feature intercepting part, respective parts of the text features with a variable-sized window to obtain local text features; and a decoder module in the image generation model generates an image corresponding to the input text iteratively according to each local text feature of the input text, wherein each local text feature is intercepted in each iteration.
According to still another aspect of the present invention, there is also provided a storage medium. The storage medium includes a program code readable by a machine, which, when executed on an information processing apparatus, causes the information processing apparatus to execute the above-described method according to the present invention.
According to still another aspect of the present invention, there is also provided a program. The program comprises machine-executable instructions that, when executed on an information processing device, cause the information processing device to perform the above-described method according to the invention.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
Other features and advantages of the present invention will be more readily understood from the following description of the various embodiments of the invention taken with the accompanying drawings, which are for the purpose of illustrating embodiments of the invention by way of illustration only, and not in all possible implementations, and which are not intended to limit the scope of the invention. In the drawings:
fig. 1 is a schematic diagram illustrating a structure of an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating a structure of a text feature extraction section in an apparatus for generating an image based on a text according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a structure of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a training process of an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 5 is a flowchart illustrating a training process of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
Fig. 6 is a schematic diagram showing a configuration example of an apparatus for generating an image based on text in a training state according to an embodiment of the present invention.
Fig. 7 is a schematic diagram showing a configuration example of an image generation model in an apparatus for generating an image based on text according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a method of generating an image using a trained text-based image generation apparatus according to an embodiment of the present invention.
Fig. 9 is a flowchart showing a process in which the decoder module generates an image in a use state.
Fig. 10 is a schematic diagram showing a configuration example of an apparatus for generating an image based on text in a use state according to an embodiment of the present invention.
FIG. 11 is a schematic block diagram illustrating a computer for implementing methods and apparatus in accordance with embodiments of the invention.
Detailed Description
Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the following description is only exemplary and is not intended to limit the present invention. Further, in the following description, the same reference numbers will be used to refer to the same or like parts in different drawings. The different features in the different embodiments described below can be combined with each other to form further embodiments within the scope of the invention.
Referring first to fig. 1, there is shown a schematic diagram illustrating the structure of an apparatus for generating an image based on text according to an embodiment of the present invention. As shown in fig. 1, the apparatus 100 includes a text feature extraction section 110, a local text feature intercepting section 120, and an image generation model 130.
The text feature extraction section 110 is configured to extract text features representing associations between words in a text. Specifically, as shown in fig. 2 (but not in fig. 1), the text feature extraction section 110 includes a vectorization unit 111 and a text feature extraction unit 112. The vectorization unit 111 vectorizes the text using existing distributed representation techniques, such as the log-bilinear language model (LBL), the C&W model, Word2vec, etc., to obtain low-dimensional word vectors. The text feature extraction unit 112 extracts text features characterizing the associations between words in the text based on the word vectors, using well-known forward and backward recurrent neural networks. Here, the text features may also be extracted using a forward recurrent neural network or a backward recurrent neural network alone.
The local text feature intercepting section 120 selectively intercepts respective local parts of the text features with a variable-size window to obtain the local text features. Each local text feature is intercepted in a respective iteration of the image generation model 130; in the current iteration, the local text feature intercepting section 120 intercepts the local text feature based on the output of the decoder in the decoder module in the previous iteration.
The image generation model 130 is trained based on the respective local text features of the sample text and the sample image corresponding to the sample text. The decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to the respective local text features of the input text. The image generation model 130 may be the well-known DRAW (Deep Recurrent Attentive Writer) model.
FIG. 3 shows a schematic structural diagram of the image generation model 130 according to an embodiment of the present invention. As shown in FIG. 3, the image generation model 130 includes a decoder module 131, an encoder module 132, and a calculation module 133.
In the training state, the encoder module 132 iteratively compresses the sample image and outputs, in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text. In the training state, the decoder module 131 iteratively generates an output image based on the respective local text features and the respective first distributions of the sample text, and calculates second distributions of the feature quantities in the respective iterations. Here, the encoder module 132 and the decoder module 131 are each implemented by a Recurrent Neural Network (RNN). The calculation module 133 calculates a loss function of the image generation model based on the sample image, the output image, the first distribution and the second distribution to optimize the image generation model.
In a use state in which the trained apparatus 100 is used to generate an image based on a text, the decoder module 131 iteratively generates an image corresponding to the input text according to each local text feature of the input text and each second distribution, each local text feature being truncated in each iteration. In the use state, the encoder module 132 does not participate in the operation.
As shown in fig. 3, the encoder module 132 includes a reading section 1321, an encoder 1322, and a building section 1323. The reading section 1321 reads a part of the sample image based on the output of the decoder module and the output of the decoder in the previous iteration to obtain a local sample image. The encoder 1322 compresses the local sample image based on the output of the encoder and decoder in the previous iteration. The constructing section 1323 constructs the first distribution based on the output of the encoder.
The decoder module 131 includes a sampling section 1311, a decoder 1312, a construction section 1313, and a writing section 1314. In the training state, the sampling unit 1311 collects feature amounts from the first distribution. The decoder 1312 decodes the collected feature quantity based on the local text feature and the decoder output in the previous iteration. The construction section 1313 constructs the second distribution based on the output of the decoder in the previous iteration. The write-out unit 1314 writes out the decoder output in the current iteration into the corresponding region of the distribution matrix. The decoder module generates an output image based on the final resulting distribution matrix.
In the use state, the construction section 1313 constructs the second distribution based on the output of the decoder in the previous iteration. The sampling section 1311 collects feature amounts from the second distribution. The decoder 1312 decodes the collected feature quantity based on the local text feature and the decoder output in the previous iteration. The write-out unit 1314 writes out the decoder output in the current iteration into the corresponding region of the distribution matrix.
The calculation module 133 includes a first calculation part 1331, a second calculation part 1332, and a determination part 1333. The first calculation section 1331 calculates a first loss function with respect to the sample image and the output image. The second calculation section 1332 calculates a second loss function with respect to the first distribution and the second distribution. The determination section 1333 determines a total loss function based on the first loss function and the second loss function.
Next, the training process of the apparatus 100 is described with reference to fig. 4 to 7. Fig. 4 is a flowchart illustrating a training process of an apparatus for generating an image based on text according to an embodiment of the present invention. As shown in fig. 4, the training process 200 includes steps S210 to S230.
In step S210, text features characterizing the association between words in the sample text are extracted from the sample text. Specifically, the sample text is first vectorized using well-known distributed representation techniques to obtain a plurality of word vectors of low dimensionality. Text features characterizing the associations between words in the sample text are then extracted based on the word vectors, where the text features can be extracted using a forward recurrent neural network and/or a backward recurrent neural network.
In step S220, respective local parts of the text features are selectively intercepted with a variable-size window to obtain the respective local text features.
In step S230, an image generation model is trained based on each local text feature of the sample text and a sample image corresponding to the sample text, wherein the image generation model includes an encoder module and a decoder module, the decoder module in the trained image generation model iteratively generates an image corresponding to the input text according to each local text feature of the input text, and each local text feature is respectively truncated in each iteration.
Fig. 5 shows a specific flow of the training process of the image generation model 130. As shown in fig. 5, the process of step S230 specifically includes steps S231 to S233.
Referring to fig. 5, in step S231, the sample image is iteratively compressed by the encoder module, and a first distribution of feature quantities characterizing key information of the sample image and the sample text is output from the encoder module in each iteration. Specifically, a part of the sample image is read based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and the local sample image is compressed based on the output of the decoder and encoder in the previous iteration. In addition, a first distribution in each iteration is constructed based on the output of the encoder.
In step S232, based on the respective local text features and the respective first distributions of the sample text, a second distribution of feature quantities in the respective iterations is calculated with the decoder module and an output image is iteratively generated. Specifically, feature quantities are collected from the first distribution in each iteration; decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration; constructing a second distribution based on the output of the decoder in the previous iteration; writing the output of the decoder to the same matrix in each iteration as the output of the decoder module; and generating an output image based on the resulting matrix.
In step S233, a loss function of the image generation model is calculated based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model. Specifically, a first loss function between the sample image and the output image is first calculated, a second loss function between the first distribution and the second distribution is then calculated, and finally an overall loss function is determined based on the first loss function and the second loss function, and parameters of the model are updated using, for example, a back propagation method to minimize the loss function.
The training process of the text-based image generation apparatus according to the embodiment of the present invention is specifically explained below with reference to configuration examples in fig. 6 and 7. Fig. 6 is a diagram illustrating a configuration example of the apparatus 100 for generating an image based on text in a training state according to an embodiment of the present invention. Fig. 7 is a schematic diagram showing a configuration example of an image generation model. In fig. 6 and 7, the image generation model is shown as a DRAW model, however, the image generation model of the present invention is not limited to DRAW, and any other model capable of implementing the present invention may be adopted as needed by those skilled in the art.
In the following description, RNN_enc denotes the function implemented by the encoder 1322 in a single iteration step, and the output of RNN_enc in the t-th iteration step is a hidden vector h_t^enc of the encoding end. Similarly, RNN_dec denotes the function implemented by the decoder 1312 in a single iteration step, and the output of RNN_dec in the t-th iteration step is a hidden vector h_t^dec of the decoding end. RNN_f denotes the function implemented in a single iteration step by the forward recurrent neural network in the text feature extraction section 110, and the output of RNN_f in the t-th iteration step is a vector h_t^f. Similarly, RNN_b denotes the function implemented by the backward recurrent neural network in a single iteration step, and the output of RNN_b in the t-th iteration step is a vector h_t^b. In addition, in the following description, unless otherwise specified, b = W(a) denotes that the vector a is subjected to a linear weighting and offset operation to obtain the vector b. The specific training process is as follows:
Process 1. Initialization: the initial states of the recurrent neural networks of the encoding end and the decoding end are initialized, and the initial state of the bidirectional recurrent neural network is initialized. The state h_0^enc of the encoder, the state h_0^dec of the decoder, the state h_0^f of the forward recurrent neural network and the state h_{L-1}^b of the backward recurrent neural network are set to zero vectors of the corresponding dimensions. The distribution matrix C_0 is initialized as a zero matrix. The initial states of the write-out section 1314, the reading section 1321 and the local text feature intercepting section 120 are initialized. The value of the total number of iteration steps T is set.
Process 2. Extract text features from the sample text: a sentence y described in natural language is input, and a well-known distributed representation technique (such as the log-bilinear language model (LBL), the C&W model, Word2vec, etc.) is used to vectorize the sentence y into low-dimensional word vectors ey = (ey_0, ey_1, ..., ey_{L-1}), where L is the number of words contained in the sentence y. Inputting the L word vectors ey_i into the bidirectional recurrent neural network yields L bidirectional states S = (h_0^s, h_1^s, ..., h_{L-1}^s) = ([h_0^f, h_0^b], [h_1^f, h_1^b], ..., [h_{L-1}^f, h_{L-1}^b]) as the text features, where h_i^f = RNN_f(h_{i-1}^f, ey_i) and h_i^b = RNN_b(h_{i-1}^b, ey_{i_r}), with the reversed word vectors (ey_{0_r}, ey_{1_r}, ..., ey_{L-1_r}) = (ey_{L-1}, ..., ey_1, ey_0).
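Purely as an illustration of Process 2, the following Python/NumPy sketch computes the bidirectional states S under the assumption of a plain tanh recurrent cell; the cell type, the parameter names (W_f, U_f, b_f, W_b, U_b, b_b) and the dictionary layout are illustrative and are not specified by the patent.

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    # One recurrent step: h = tanh(W @ x + U @ h_prev + b).
    return np.tanh(W @ x + U @ h_prev + b)

def text_features(ey, p):
    """ey: list of L word vectors ey_0 ... ey_{L-1}; p: parameter dict.
    Returns S = (h_0^s, ..., h_{L-1}^s) with h_i^s = [h_i^f, h_i^b]."""
    L = len(ey)
    h_f = np.zeros(p["U_f"].shape[0])   # h_0^f initialised to the zero vector
    h_b = np.zeros(p["U_b"].shape[0])   # h_{L-1}^b initialised to the zero vector
    fwd, bwd = [], []
    for i in range(L):
        h_f = rnn_step(h_f, ey[i], p["W_f"], p["U_f"], p["b_f"])          # h_i^f
        fwd.append(h_f)
        h_b = rnn_step(h_b, ey[L - 1 - i], p["W_b"], p["U_b"], p["b_b"])  # h_i^b over the reversed sequence
        bwd.append(h_b)
    # Concatenate forward and backward states element by element.
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```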
Process 3. Intercept local text features: the local text feature intercepting section 120 selectively intercepts each local part of the text features S with a variable-size attention window using the attention model Text_att. Specifically, the attention model calculates the center position and size of the attention window on S based on the decoder output h_{t-1}^dec in the (t-1)-th iteration step:
center position of the attention window: P_center = L × sigmoid(h_{t-1}^dec × W_att + b_att),
size of the attention window: K_width = 0.5 × L × sigmoid(h_{t-1}^dec × W_wid + b_wid),
where W_att, b_att, W_wid and b_wid are the parameters of the attention model Text_att.
Next, the attention model Text_att is applied to S to obtain s_t, the local text feature of S centered at P_center and having width K_width.
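A minimal sketch of Process 3, assuming S is the L×d array of bidirectional states from the previous sketch and that W_att, b_att, W_wid, b_wid map the decoder state to scalars; the hard slicing used to realize the window is an illustrative simplification of the Text_att attention model, not the patent's exact formulation.

```python
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def text_att(S, h_dec_prev, W_att, b_att, W_wid, b_wid):
    """Intercept the local text feature s_t from S with a window whose
    center P_center and width K_width depend on h_{t-1}^dec."""
    L = S.shape[0]
    p_center = L * sigmoid(float(h_dec_prev @ W_att + b_att))        # window center
    k_width = 0.5 * L * sigmoid(float(h_dec_prev @ W_wid + b_wid))   # window width
    lo = int(max(0, np.floor(p_center - k_width / 2)))
    hi = int(min(L, np.ceil(p_center + k_width / 2)))
    return S[lo:hi]                                                   # local text feature s_t
```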
Process 4. Read the local sample image: the reading unit 1321 reads a part of the image x using the attention model Read_att. Specifically, each local image is obtained by applying a two-dimensional Gaussian filter array to the image x and changing the position and zoom of the attention window.
The position of an N×N Gaussian filter array in the image is located by specifying the center coordinates (gX, gY) of the filter array and the stride δ between adjacent filters. The stride δ controls the "zoom" of the attention window; in other words, the larger the stride δ, the larger the area of the local image taken from the original image, but the lower its resolution. In the filter array, the positions μ_X^i and μ_Y^j of the filter in the i-th row and the j-th column can be expressed as:
μ_X^i = gX + (i − N/2 − 0.5) × δ,
μ_Y^j = gY + (j − N/2 − 0.5) × δ.
In addition to the above attention parameters gX, gY and δ, two further attention parameters are required to determine the operation of the Gaussian filters: the variance σ² of the Gaussian filters, and a scalar intensity γ that scales the filter response. Given an A×B input image x, in each iteration step the five parameters are determined dynamically from the decoder output h^dec:
(g̃X, g̃Y, log σ², log δ̃, log γ) = W(h_t^dec),
gX = (A + 1)/2 × (g̃X + 1),
gY = (B + 1)/2 × (g̃Y + 1),
δ = (max(A, B) − 1)/(N − 1) × δ̃.
Given the above attention parameters, the horizontal filter matrix F_X and the vertical filter matrix F_Y of the filter array (of dimensions N×A and N×B, respectively) are defined as follows:
F_X[i, a] = (1/Z_X) exp(−(a − μ_X^i)² / (2σ²)),
F_Y[j, b] = (1/Z_Y) exp(−(b − μ_Y^j)² / (2σ²)),
where (i, j) is a point in the attention window, with i and j ranging from 0 to N−1; (a, b) is a point in the input image, with a and b ranging over [0, A−1] and [0, B−1], respectively; and Z_X, Z_Y are normalization constants such that Σ_a F_X[i, a] = 1 and Σ_b F_Y[j, b] = 1.
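As an illustrative NumPy sketch, the filterbank matrices defined above can be assembled as follows; the grid-center formulas are the DRAW-style ones reconstructed above, and the function name and the small normalization guard are assumptions.

```python
def filterbank(gX, gY, sigma2, delta, N, A, B):
    """Build the N x A horizontal filter matrix F_X and the N x B vertical
    filter matrix F_Y of an N x N Gaussian filter array."""
    i = np.arange(N)
    mu_x = gX + (i - N / 2 - 0.5) * delta            # mu_X^i, grid centers along x
    mu_y = gY + (i - N / 2 - 0.5) * delta            # mu_Y^j, grid centers along y
    a = np.arange(A)
    b = np.arange(B)
    F_X = np.exp(-((a[None, :] - mu_x[:, None]) ** 2) / (2 * sigma2))
    F_Y = np.exp(-((b[None, :] - mu_y[:, None]) ** 2) / (2 * sigma2))
    F_X /= np.maximum(F_X.sum(axis=1, keepdims=True), 1e-8)   # Z_X: each row sums to 1
    F_Y /= np.maximum(F_Y.sum(axis=1, keepdims=True), 1e-8)   # Z_Y: each row sums to 1
    return F_X, F_Y
```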
Given F_X, F_Y and the intensity γ determined by h_{t-1}^dec, together with the input image x and the error image x̂_t, where
x̂_t = x − σ(C_{t-1})
and σ denotes the logistic sigmoid function
σ(z) = 1 / (1 + exp(−z)),
the reading section returns the concatenation of two N×N matrices computed from the input image and the error image:
read(x, x̂_t, h_{t-1}^dec) = γ [F_Y x F_X^T, F_Y x̂_t F_X^T].
here, the same filter matrix is applied to both the input image and the error image.
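Continuing the sketch, the read operation of Process 4 then reduces to two filtered glimpses of the input and error images; `filterbank` is the illustrative helper above, `attention_params` is an assumed stand-in for the learned linear map from h_{t-1}^dec to (gX, gY, σ², δ, γ), and the canvas C_{t-1} is assumed to have the same B×A shape as the image.

```python
def read(x, C_prev, h_dec_prev, params):
    """Return gamma * [F_Y x F_X^T, F_Y x_hat F_X^T] as an N x 2N array."""
    # The five attention parameters would be computed from h_{t-1}^dec by a
    # learned linear map; attention_params is an assumed helper for that.
    gX, gY, sigma2, delta, gamma = attention_params(h_dec_prev, params)
    F_X, F_Y = filterbank(gX, gY, sigma2, delta, params["N"], x.shape[1], x.shape[0])
    x_hat = x - sigmoid(C_prev)                      # error image
    glimpse_x = gamma * (F_Y @ x @ F_X.T)            # glimpse of the input image
    glimpse_err = gamma * (F_Y @ x_hat @ F_X.T)      # glimpse of the error image
    return np.concatenate([glimpse_x, glimpse_err], axis=1)
```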
Process 5. Compress the sample image: in the t-th iteration step, h_{t-1}^dec, x_t and x̂_t are input to the encoder 1322 to obtain the encoder state
h_t^enc = RNN_enc(h_{t-1}^enc, [x_t, x̂_t, h_{t-1}^dec]),
where W_enc1, W_enc2 and b_enc are the parameters of the encoder.
Process 6. Construct the first distribution: based on the encoder output h_t^enc, the first distribution Q(Z_t | z_1, ..., z_{t-1}, x, y) over the feature quantity z_t is constructed. Here, the first distribution Q obeys a Gaussian distribution with mean μ_t and variance σ_t expressed by the following equations:
Q(Z_t | z_1, ..., z_{t-1}, x, y) = N(Z_t | μ_t, σ_t),
μ_t = W(h_t^enc),
σ_t = exp(W(h_t^enc)).
The first distribution Q is not limited to the gaussian distribution described above, and those skilled in the art can select other suitable distributions according to actual needs.
Process 7. Sample the feature quantity from the first distribution: the sampling unit 1311 samples from the first distribution Q(Z_t | z_1, ..., z_{t-1}, x, y) to obtain the feature quantity z_t.
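Processes 6 and 7 together amount to drawing a reparameterized Gaussian sample from the encoder state; a minimal sketch, assuming W_mu, b_mu, W_sig, b_sig are the linear maps written W(·) in the text.

```python
def sample_q(h_enc, W_mu, b_mu, W_sig, b_sig, rng=np.random):
    """Construct Q(Z_t | z_1..t-1, x, y) = N(mu_t, sigma_t) and sample z_t."""
    mu_t = h_enc @ W_mu + b_mu                       # mu_t = W(h_t^enc)
    sigma_t = np.exp(h_enc @ W_sig + b_sig)          # sigma_t = exp(W(h_t^enc))
    z_t = mu_t + sigma_t * rng.standard_normal(mu_t.shape)
    return z_t, mu_t, sigma_t
```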
Process 8. Decode the feature quantity: z_t and s_t are input to the decoder 1312 to obtain the state h_t^dec of the decoder 1312 at the t-th iteration step.
Process 9. Write the output of the decoder out to the distribution matrix: the write-out section 1314 writes the output h_t^dec of the decoder at the t-th iteration step out to the distribution matrix C using the attention model Write_att. Specifically, the five parameters (gX, gY, σ, δ, γ) of the attention model Write_att are calculated similarly to Process 4:
(g̃X, g̃Y, log σ², log δ̃, log γ) = W(h_t^dec),
gX = (A + 1)/2 × (g̃X + 1),
gY = (B + 1)/2 × (g̃Y + 1),
δ = (max(A, B) − 1)/(N − 1) × δ̃,
where W(h_t^dec) = sigmoid(h_t^dec × W_write + b_write). The filter matrices F_x and F_y of the Gaussian filters are, respectively:
F_x[i, a] = (1/Z_x) exp(−(a − μ_x^i)² / (2σ²)),
F_y[j, b] = (1/Z_y) exp(−(b − μ_y^j)² / (2σ²)).
Next, the attention model Write_att is applied to h_t^dec to obtain the matrix Write_t:
Write_t = (1/γ) F_y^T W(h_t^dec) F_x.
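A sketch of the write step of Process 9, mirroring the read sketch; `write_attention_params` and the linear map producing the N×N write patch are illustrative assumptions rather than the patent's exact implementation.

```python
def write(h_dec, params):
    """Map h_t^dec to an N x N patch and spread it onto a B x A canvas update."""
    gX, gY, sigma2, delta, gamma = write_attention_params(h_dec, params)  # assumed helper
    N, A, B = params["N"], params["A"], params["B"]
    F_X, F_Y = filterbank(gX, gY, sigma2, delta, N, A, B)
    w_t = (h_dec @ params["W_write"] + params["b_write"]).reshape(N, N)   # write patch
    return (1.0 / gamma) * (F_Y.T @ w_t @ F_X)        # Write_t, shape B x A
```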
Process 10. Construct the second distribution: based on h_{t-1}^dec, the second distribution P(Z_t | z_1, ..., z_{t-1}) is constructed. The second distribution P obeys a Gaussian distribution N(Z_t | μ′_t, σ′_t) with mean μ′_t and variance σ′_t, where:
μ′_t = W(h_{t-1}^dec),
σ′_t = exp(W(h_{t-1}^dec)).
process 11. update distribution matrix: updating distribution matrix Ct=Ct-1+WritetWhere C is a matrix of the same size as the input image.
Process 12. iterative operation: the processes 3 to 11 are repeatedly performed until the maximum number of iterations T is satisfied.
Process 13. Calculate the loss function, and update the parameters of the apparatus 100 using back propagation to minimize the loss function. The loss function used here is:
Loss = −log P(x | y, z_1, ..., z_T) + Σ_{t=1}^{T} KL( Q(Z_t | z_1, ..., z_{t-1}, x, y) || P(Z_t | z_1, ..., z_{t-1}) ),
where −log P(x | y, z_1, ..., z_T) represents the image reconstruction loss, which can be understood as measuring the similarity of the generated image to the input image, and the KL term represents the loss between the constructed first distribution Q and second distribution P.
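Assuming, purely for illustration, that both Q and P are diagonal Gaussians as constructed above and that the reconstruction term is realized as a Bernoulli (binary cross-entropy) likelihood on sigmoid(C_T) — one common choice, not dictated by the text — the loss of Process 13 could be computed as:

```python
def total_loss(x, C_T, mus_q, sigmas_q, mus_p, sigmas_p, eps=1e-8):
    """Reconstruction loss plus the sum over iteration steps of KL(Q_t || P_t)."""
    x_rec = sigmoid(C_T)
    # -log P(x | y, z_1..T) as binary cross-entropy between x and the reconstruction.
    rec_loss = -np.sum(x * np.log(x_rec + eps) + (1 - x) * np.log(1 - x_rec + eps))
    kl = 0.0
    for mq, sq, mp, sp in zip(mus_q, sigmas_q, mus_p, sigmas_p):
        # KL divergence between diagonal Gaussians N(mq, sq^2) and N(mp, sp^2).
        kl += np.sum(np.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5)
    return rec_loss + kl
```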
The training process of the apparatus 100 for generating an image based on text is described above. A method of generating an image using the trained apparatus 100 and a configuration of the apparatus 100 in a use state will be described below with reference to fig. 8 to 10.
FIG. 8 illustrates a flowchart of a method for generating an image based on text using the trained apparatus 100 according to an embodiment of the present invention. As shown in fig. 8, the method 300 includes steps S310 to S330. In step S310, text features characterizing the associations between words in the input text are extracted from the input text by the text feature extraction section. In step S320, respective local parts of the text features are selectively intercepted with a variable-size window by the local text feature intercepting section to obtain the respective local text features. In step S330, the decoder module iteratively generates an image corresponding to the input text according to the respective local text features of the input text, each local text feature being intercepted in a respective iteration.
The operations of step S310 and step S320 are the same as the operations of step S210 and step S220 in fig. 4, and for the sake of simplicity, the description is omitted here. Hereinafter, the process of step S330 is specifically described with reference to fig. 9.
As shown in fig. 9, the process S330 in which the decoder module generates an image includes steps S331 to S335. In step S331, the second distribution P is constructed by the constructing section 1313 in each iteration based on the output of the decoder in the previous iteration. In step S332, the sampling unit 1311 collects the feature quantity from the second distribution P. In step S333, the collected feature quantity is decoded by the decoder 1312 based on the local text feature and the decoder output in the previous iteration. In step S334, the output of the decoder is written out to the same matrix by the write-out section 1314 in each iteration. In step S335, an output image is generated by the decoder module 131 based on the resulting matrix.
Fig. 10 is a schematic diagram showing a configuration example of the apparatus 100 in the use state according to the embodiment of the present invention. The encoder module 132 is omitted from fig. 10 because the first distribution Q does not need to be constructed in the use state, so the encoder module does not participate in the operation.
Next, a specific process of generating an image using the trained apparatus 100 is explained in detail with reference to fig. 10.
Process 1. Initialization: the initial state of the decoding-end recurrent neural network is initialized, and the initial state of the bidirectional recurrent neural network is initialized. The state h_0^dec of the decoder, the state h_0^f of the forward recurrent neural network, the state h_{L-1}^b of the backward recurrent neural network and the distribution matrix C_0 are set to zero vectors or matrices of the corresponding dimensions. The initial states of the write-out section 1314 and the local text feature intercepting section 120 are initialized. The value of the total number of iteration steps T is set; preferably, the value of T in the use state is the same as the value of T in the training state.
Process 2. Extract text features: the text features S of the input text y are extracted using well-known distributed representation techniques.
Process 3. Intercept local text features: the center position and size of the attention window of the attention model Text_att in the local text feature intercepting section 120 are calculated based on h_{t-1}^dec, and the attention model Text_att is applied to S to obtain the local text feature s_t.
Process 4. Construct the second distribution: based on h_{t-1}^dec, the second distribution P(Z_t | z_1, ..., z_{t-1}) over the feature quantity z_t is constructed.
Process 5. Sample the feature quantity from the second distribution: the sampling unit 1311 samples the feature quantity z_t from the second distribution P(Z_t | z_1, ..., z_{t-1}).
Process 6. Decode the feature quantity: z_t and s_t are input to the decoder 1312 to obtain the state h_t^dec of the t-th step.
Process 7. Write the output of the decoder out to the distribution matrix: based on h_t^dec, the parameters (gX, gY, σ, δ, γ) of the attention model Write_att in the write-out section 1314 are calculated, and the attention model Write_att is applied to h_t^dec to obtain the matrix Write_t.
Process 8. Update the distribution matrix: the distribution matrix is updated as C_t = C_{t-1} + Write_t.
Process 9. iterative operation: the processes 3 to 8 are repeatedly performed until the maximum number of iterations T is satisfied.
Process 10. Generate the image: based on the matrix C_T, an output image x′ is generated, x′ = sigmoid(C_T).
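Pulling the use-state processes together, the generation loop might be organized as below; every helper (text_att, write, the prior parameters W_mu_p/W_sig_p, and decoder_step for one decoder RNN step) is either one of the illustrative sketches above or an assumed stand-in, not the patent's exact implementation.

```python
def generate_image(S, params, T, rng=np.random):
    """Iteratively build the canvas and return x' = sigmoid(C_T)."""
    A, B = params["A"], params["B"]
    C = np.zeros((B, A))                              # distribution matrix C_0
    h_dec = np.zeros(params["dec_dim"])               # h_0^dec
    for _ in range(T):
        s_t = text_att(S, h_dec, params["W_att"], params["b_att"],
                       params["W_wid"], params["b_wid"])               # Process 3
        mu_p = h_dec @ params["W_mu_p"] + params["b_mu_p"]             # Process 4
        sigma_p = np.exp(h_dec @ params["W_sig_p"] + params["b_sig_p"])
        z_t = mu_p + sigma_p * rng.standard_normal(mu_p.shape)         # Process 5
        h_dec = decoder_step(h_dec, z_t, s_t, params)                  # Process 6 (assumed helper)
        C = C + write(h_dec, params)                                   # Processes 7-8
    return sigmoid(C)                                                  # Process 10: x' = sigmoid(C_T)
```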
In addition, it is noted that the components of the above system may be configured by software, firmware, hardware or a combination thereof. The specific means or manner in which the configuration can be used is well known to those skilled in the art and will not be described further herein. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network to a computer (for example, a general-purpose computer 1100 shown in fig. 11) having a dedicated hardware configuration, and the computer can execute various functions and the like when various programs are installed.
FIG. 11 shows a schematic block diagram of a computer that may be used to implement methods and systems according to embodiments of the invention.
In fig. 11, a Central Processing Unit (CPU)1101 performs various processes in accordance with a program stored in a Read Only Memory (ROM)1102 or a program loaded from a storage section 1108 to a Random Access Memory (RAM) 1103. In the RAM 1103, data necessary when the CPU 1101 executes various processes and the like is also stored as necessary. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other via a bus 1104. An input/output interface 1105 is also connected to bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, and the like), an output section 1107 (including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker and the like), a storage section 1108 (including a hard disk and the like), and a communication section 1109 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 may also be connected to the input/output interface 1105 as needed. A removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory can be mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1111.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1111 shown in fig. 11, in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1102, a hard disk included in the storage section 1108, or the like, in which programs are stored and which are distributed to users together with the device including them.
The present invention also provides a program product having machine-readable instruction code stored thereon. When read and executed by a machine, the instruction code causes the machine to perform the method according to the embodiments of the present invention.
Accordingly, storage media carrying the above-described program product having machine-readable instruction code stored thereon are also included within the scope of the present invention. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
It should be noted that the method of the present invention is not limited to being performed in the chronological order described in the specification, and may be performed sequentially in other orders, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
The foregoing description of the various embodiments of the invention is provided for the purpose of illustration only and is not intended to be limiting of the invention. It should be noted that in the above description, features described and/or illustrated with respect to one embodiment may be used in the same or similar manner in one or more other embodiments, in combination with or instead of the features of the other embodiments. It will be understood by those skilled in the art that various changes and modifications may be made to the above-described embodiments without departing from the inventive concept of the present invention, and all such changes and modifications are intended to be included within the scope of the present invention.
In summary, in the embodiments according to the present invention, the present invention provides the following technical solutions.
Scheme 1. an information processing method, comprising:
extracting text features representing the relevance between words in the sample text from the sample text;
selectively intercepting each local part of the text features by using a window with a variable size to obtain each local text feature;
training an image generation model based on respective local text features of the sample text and sample images corresponding to the sample text,
the decoder module in the trained image generation model generates images corresponding to the input text in an iterative mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
Scheme 2. the information processing method according to scheme 1, wherein extracting the text features from the sample text comprises:
vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
extracting text features characterizing associations between words in the sample text based on the word vectors.
Scheme 3. the information processing method according to scheme 2, wherein the text features are extracted using a forward recurrent neural network and/or a backward recurrent neural network.
Scheme 4. the information processing method according to scheme 1, wherein training the image generation model comprises:
iteratively compressing the sample image with the encoder module and outputting, from the encoder module in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text;
iteratively generating, with the decoder module, an output image based on the respective local text features and the respective first distributions of the sample text, and constructing, with the decoder module, second distributions of the feature quantities in the respective iterations; and
calculating a loss function of the image generation model based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model.
Scheme 5. the information processing method of scheme 4, wherein in each iteration the local text feature is truncated based on the output of the decoder in the decoder module in the previous iteration.
Scheme 6. the information processing method of scheme 5, wherein, in each iteration, iteratively compressing the sample image with an encoder module comprises:
reading a local portion of the sample image based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and
the encoder compresses the local sample image based on the encoder in the encoder module and the decoder's output in the previous iteration.
Scheme 7. the information processing method of scheme 6, wherein the first distribution in each iteration is constructed based on the output of the encoder.
Scheme 8. the information processing method of scheme 7, wherein iteratively generating an image with the decoder module comprises:
acquiring the feature quantity from the first distribution in each iteration;
decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration;
constructing the second distribution based on an output of a decoder in a previous iteration;
writing out the output of the decoder to the same matrix in each iteration as the output of the decoder module; and
generating the output image based on the resulting matrix.
Scheme 9. the information processing method according to scheme 8, wherein calculating a loss function of the image generation model comprises:
calculating a first loss function for the sample image and the output image;
calculating a second loss function with respect to the first distribution and the second distribution; and
determining the loss function based on the first loss function and the second loss function.
Scheme 10. the information processing method according to any one of schemes 6 to 9, wherein the encoder and the decoder are implemented using a recurrent neural network.
Scheme 11. the information processing method according to scheme 10, wherein the image generation model is a DRAW neural network.
Scheme 12. an apparatus for generating an image based on text, comprising:
a text feature extraction unit that extracts text features representing the relevance between words in a text;
a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and
an image generation model in which a decoder module iteratively generates an image corresponding to an input text from respective local text features of the input text, each local text feature being truncated in a respective iteration.
Scheme 13. The apparatus according to scheme 12, wherein the text feature extraction section includes:
the vectorization unit is used for vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
a text feature extraction unit that extracts text features characterizing the association between words in the sample text based on the word vectors.
Scheme 14. The apparatus according to scheme 13, wherein the text feature extraction section extracts the text features using a forward recurrent neural network and/or a backward recurrent neural network.
Scheme 15. The apparatus according to scheme 12, wherein the image generation model comprises:
an encoder module that iteratively compresses the sample image and outputs, in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text;
a decoder module that calculates a second distribution of the feature quantity in each iteration and iteratively generates an output image based on each local text feature and each first distribution of the sample text; and
a calculation module to calculate a loss function of the image generation model based on the sample image, the output image, the first distribution, and the second distribution to optimize the image generation model.
Scheme 16. The apparatus according to scheme 15, wherein the encoder module comprises:
a reading section that reads a part of the sample image based on an output of the decoder module and an output of a decoder within the decoder module in a previous iteration to obtain a local sample image;
an encoder to compress the local sample image based on outputs of the encoder and the decoder in a previous iteration; and
a construction section that constructs a first distribution based on an output of the encoder in each iteration.
Scheme 17. The apparatus according to scheme 16, wherein the decoder module comprises:
a sampling unit that collects the feature amount from the first distribution in each iteration;
a decoding unit that decodes the acquired feature amount based on the local text feature and a decoder output in a previous iteration;
a construction section that constructs the second distribution based on an output of the decoder in a previous iteration; and
a write-out section writing out an output of the decoder to the same matrix as an output of the decoder module in each iteration,
wherein the decoder module generates the output image based on the final resulting matrix.
Scheme 18. The apparatus according to scheme 17, wherein the calculation module comprises:
a first calculation unit that calculates a first loss function with respect to the sample image and the output image;
a second calculation unit that calculates a second loss function with respect to the first distribution and the second distribution; and
a determination section that determines the loss function based on the first loss function and the second loss function.
Scheme 19. The apparatus according to any one of schemes 12 to 18, wherein the image generation model is a DRAW neural network.
Scheme 20. A method for generating an image based on text using a trained apparatus according to any one of schemes 12 to 19, comprising:
extracting, by the text feature extraction section, text features representing associations between words in a text;
selectively intercepting, by the local text feature intercepting part, respective parts of the text features with windows of varying sizes to obtain local text features; and
and a decoder module in the image generation model generates an image corresponding to the input text in an iteration mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.

Claims (9)

1. An information processing method comprising:
extracting text features representing the relevance between words in the sample text from the sample text;
selectively intercepting each local part of the text features by using a window with a variable size to obtain each local text feature; and
training an image generation model, comprising:
iteratively compressing a sample image with an encoder module and outputting, from the encoder module in each iteration, a first distribution of feature quantities characterizing key information of the sample image and the sample text,
iteratively generating an output image based on respective local text features and respective first distributions of the sample text with a decoder module, and constructing a second distribution of the feature quantities in respective iterations with the decoder module, an
Calculating a loss function of the image generation model based on the sample image, the output image, the first distribution and the second distribution to optimize the image generation model,
the decoder module in the trained image generation model generates images corresponding to the input text in an iterative mode according to each local text feature of the input text, and each local text feature is intercepted in each iteration.
2. The information processing method of claim 1, wherein extracting the text feature from the sample text comprises:
vectorizing the sample text to obtain a plurality of word vectors with low dimensionality; and
extracting text features characterizing associations between words in the sample text based on the word vectors.
3. The information processing method according to claim 1, wherein in each iteration, the local text feature is truncated based on an output of a decoder in a decoder module in a previous iteration.
4. The information processing method of claim 3, wherein, in each iteration, iteratively compressing the sample image with an encoder module comprises:
reading a local portion of the sample image based on an output of the decoder module and an output of the decoder in a previous iteration to obtain a local sample image; and
the encoder compresses the local sample image based on the encoder in the encoder module and the decoder's output in the previous iteration.
5. The information processing method of claim 4, wherein the first distribution in each iteration is constructed based on an output of the encoder.
6. The information processing method of claim 5, wherein iteratively generating an image with the decoder module comprises:
acquiring the feature quantity from the first distribution in each iteration;
decoding the collected feature quantity by using a decoder based on the local text feature and the output of the decoder in the previous iteration;
constructing the second distribution based on an output of a decoder in a previous iteration;
writing out the output of the decoder to the same matrix in each iteration as the output of the decoder module; and
generating the output image based on the resulting matrix.
7. The information processing method according to claim 6, wherein calculating a loss function of the image generation model includes:
calculating a first loss function for the sample image and the output image;
calculating a second loss function with respect to the first distribution and the second distribution; and
determining the loss function based on the first loss function and the second loss function.
8. The information processing method according to any one of claims 4 to 7, wherein the encoder and the decoder are implemented with a recurrent neural network.
9. An apparatus for generating an image based on text, comprising:
a text feature extraction unit that extracts text features representing the relevance between words in a text;
a local text feature intercepting part, which selectively intercepts each local part of the text feature by a window with variable size to obtain a local text feature; and
an image generation model in which a decoder module iteratively generates an image corresponding to an input text from respective local text features of the input text, each local text feature being truncated in a respective iteration.
CN201710379515.0A 2017-05-25 2017-05-25 Information processing method and device for generating image based on text Active CN108959322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710379515.0A CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710379515.0A CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Publications (2)

Publication Number Publication Date
CN108959322A CN108959322A (en) 2018-12-07
CN108959322B true CN108959322B (en) 2021-09-10

Family

ID=64494571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710379515.0A Active CN108959322B (en) 2017-05-25 2017-05-25 Information processing method and device for generating image based on text

Country Status (1)

Country Link
CN (1) CN108959322B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340056B (en) * 2018-12-18 2023-09-08 富士通株式会社 Information processing method and information processing apparatus
CN109933320B (en) * 2018-12-28 2021-05-18 联想(北京)有限公司 Image generation method and server
CN109920016B (en) * 2019-03-18 2021-06-25 北京市商汤科技开发有限公司 Image generation method and device, electronic equipment and storage medium
CN110163267A (en) * 2019-05-09 2019-08-23 厦门美图之家科技有限公司 A kind of method that image generates the training method of model and generates image
CN111985243B (en) * 2019-05-23 2023-09-08 中移(苏州)软件技术有限公司 Emotion model training method, emotion analysis device and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9836671B2 (en) * 2015-08-28 2017-12-05 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1402853A (en) * 1999-12-02 2003-03-12 三菱电机株式会社 Image retrieval system and image retrieval method
CN101924851A (en) * 2009-06-16 2010-12-22 佳能株式会社 Image processing apparatus and image processing method
CN101655912A (en) * 2009-09-17 2010-02-24 上海交通大学 Method for detecting computer generated image and natural image based on wavelet transformation
CN106529586A (en) * 2016-10-25 2017-03-22 天津大学 Image classification method based on supplemented text characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning to generalize to new compositions in image understanding; Atzmon, Y. et al.; arXiv; 2016-08-26; full text *
Research on image semantic understanding based on deep learning; Liang Huan; China Masters' Theses Full-text Database (electronic journal); 2017-03-15 (No. 3); full text *

Also Published As

Publication number Publication date
CN108959322A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108959322B (en) Information processing method and device for generating image based on text
US11475343B2 (en) Database utilizing spatial probability models for data compression
US10671889B2 (en) Committed information rate variational autoencoders
CN106649542B (en) System and method for visual question answering
RU2661750C1 (en) Symbols recognition with the use of artificial intelligence
US11657230B2 (en) Referring image segmentation
US20190180154A1 (en) Text recognition using artificial intelligence
US11010666B1 (en) Systems and methods for generation and use of tensor networks
CN111832440B (en) Face feature extraction model construction method, computer storage medium and equipment
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
US11966707B2 (en) Quantum enhanced word embedding for natural language processing
CN112434131A (en) Text error detection method and device based on artificial intelligence, and computer equipment
CN111562915A (en) Generation method and device of front-end code generation model
CN111598444A (en) Well logging lithology identification method and system based on convolutional neural network
CN112667979A (en) Password generation method and device, password identification method and device, and electronic device
Yan et al. A hybrid evolutionary algorithm for multiobjective sparse reconstruction
Amram et al. Denoising diffusion models with geometry adaptation for high fidelity calorimeter simulation
Lu et al. On the asymptotical regularization for linear inverse problems in presence of white noise
CN116452706A (en) Image generation method and device for presentation file
CN111161266A (en) Multi-style font generation method of variational self-coding machine based on vector quantization
US20210365622A1 (en) Noise mitigation through quantum state purification by classical ansatz training
CN113408418A (en) Calligraphy font and character content synchronous identification method and system
US20240135610A1 (en) Image generation using a diffusion model
CN116758618B (en) Image recognition method, training device, electronic equipment and storage medium
Bhandari et al. Nepali Handwritten Letter Generation using GAN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant