CN110021051A - Text-guided person image generation method based on generative adversarial networks - Google Patents
Text-guided person image generation method based on generative adversarial networks
- Publication number
- CN110021051A (application CN201910257463.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- person
- pose
- image
- adversarial network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks)
- G06N3/044—Architecture: recurrent networks, e.g. Hopfield networks
- G06N3/045—Architecture: combinations of networks
- G06N3/08—Learning methods
- G06T11/00—2D [Two Dimensional] image generation (G06T—Image data processing or generation, in general)
Abstract
The invention discloses a text-guided person image generation method based on generative adversarial networks, belonging to the field of computer vision. The method comprises the following steps: obtain a person image dataset for training and define the algorithm objective; obtain the pose information of all images in the dataset and derive basic poses from the pose information through a clustering algorithm; train a pose predictor based on generative adversarial networks to map text to a predicted pose; use the learned pose predictor of S2~S3 to obtain the corresponding person pose from text; train a person image generator based on generative adversarial networks to generate person images matching the text description, while establishing mapping relations between image sub-regions and the text through a multi-modal error. The text-guided person image generation method of the invention has good application value in scenarios such as image generation, image editing, and person re-identification.
Description
Technical field
The invention belongs to the field of computer vision, and in particular relates to a text-guided person image generation method based on generative adversarial networks.
Background
Text-guided person image generation is defined as the following problem: according to a target text description, simultaneously change the pose and the attributes (e.g. clothing color) of the person in a reference image so that the result is consistent with the text. In recent years, in computer vision tasks such as specific image generation, image retrieval, and person re-identification, generation methods that can produce images with specified content have played an important role in expanding datasets and increasing algorithm robustness. The task has two key points. The first is how to predict the target pose of the person from the text; the target pose should be consistent with the text description and serves as guidance for the pose transformation. The second is how to change the pose and the attributes of the person in the reference image at the same time, so that in the generated image the pose is changed and the attributes match the textual description. For the first point, we consider that a person pose contains two factors, pose orientation and pose action: the orientation determines the facing angle of the action, and the action is the configuration of the human limbs. For the second point, the invention embeds an attentional up-sampling module in the network, which effectively integrates data from multiple modalities (text, pose, image) when generating the person image, ensuring that the pose transformation and the attribute modification are completed simultaneously. Some previous methods consider person pose transformation, and others address text-to-image generation, but few methods consider changing both the pose and the attributes of a person according to a text description.
Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to image generation tasks. Existing learning-based methods mainly use the generative adversarial network framework: given a person image and a target text as input, they output a person image that matches the text description.
Summary of the invention
To solve the above problems, the purpose of the present invention is to provide a text-guided person image generation method based on generative adversarial networks. When predicting the person pose from text, since the text itself does not contain specific spatial correspondence information, we first obtain basic poses with different orientations through clustering, and then adjust the local details of a specific basic pose according to the text, obtaining a person pose that matches the text description. At the same time, key information must be effectively extracted from the text: the orientation and action information in the text relates to the person pose, while the attribute descriptions relate to the appearance of the person in the generated image. In addition, during person image generation the network considers data from multiple modalities (text, pose, image); for the fusion and expression of multi-modal features we introduce an attentional up-sampling module, which uses an attention mechanism to focus on the relevant information in the text while also completing the pose transformation. Combining these three aspects, we design a learning framework based on generative adversarial networks that makes the model establish connections between image sub-regions and the text, so as to perform feature representation of person images with different poses and attributes. Controlling image generation through text provides convenience and friendliness to the user.
To achieve the above object, the technical solution of the present invention is as follows:
A text-guided person image generation method based on generative adversarial networks comprises the following steps:
S1. Obtain a person image dataset for training, and define the algorithm objective;
S2. Obtain the pose information of all images in the person image dataset, and obtain basic poses from all pose information through a clustering algorithm;
S3. Train a pose generator based on generative adversarial networks to predict a pose from the target text;
S4. Use the pose generator learned in S2~S3 to predict the corresponding person pose from text;
S5. Train a person image generator based on generative adversarial networks to generate person images that match the text description, while establishing mapping relations between image sub-regions and the text through a multi-modal error;
S6. Using the person image generator learned in S5, input a reference image and the description text of the target image, and generate a person image that matches the text description.
Based on the above scheme, each step can be implemented as follows:
In step S1, the person image dataset contains several person images, each annotated with a text description of the person in the image. The algorithm objective is defined as follows: for each person in the training set there exist a reference image x, a target image x′, the pose p of the person in the target image, and the description text t of the target image; given the reference image x and the description text t of the target image, predict the pose and action of the target from the description text t, and generate an image similar to the target image x′.
Further, in step S2, obtaining the pose information of all images in the person image dataset and obtaining basic poses from all pose information through a clustering algorithm specifically comprises the following sub-steps:
S21. Obtain the person poses of all images in the dataset through a pose detection algorithm;
S22. Cluster the person poses through the K-means algorithm, and compute the average pose of the i-th cluster as a basic pose, obtaining K basic poses in total.
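The basic-pose extraction in S21~S22 amounts to a plain K-means over detected keypoint poses, with each cluster mean taken as a basic pose. The following is a minimal sketch, not the patent's implementation: the pose detector, the flattened-coordinate layout of each pose, and the farthest-point initialisation used here for robustness are all assumptions.

```python
import numpy as np

def basic_poses(poses, k, iters=50):
    """Cluster flattened keypoint poses with K-means and return the K
    cluster means ("basic poses") plus the cluster label of each pose.
    `poses` is an (n, d) array, one flattened pose per row."""
    poses = np.asarray(poses, dtype=float)
    # farthest-point initialisation: pick the first pose, then repeatedly
    # the pose farthest from all chosen centers (robust for a small sketch)
    centers = [poses[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(poses - c, axis=1) for c in centers], axis=0)
        centers.append(poses[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # assign every pose to its nearest center
        d = np.linalg.norm(poses[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the average pose of its cluster (S22)
        centers = np.stack([poses[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return centers, labels
```

The returned cluster means play the role of the K basic poses from which step S3 later selects one by predicted orientation.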
Further, in step S3, training the pose generator based on generative adversarial networks to predict a pose from the target text specifically comprises the following sub-steps:
S31. Use an LSTM network to extract the feature representation vector of the target description text t, and predict the orientation o of the pose described by the text through a fully connected neural network F_ori, where o ∈ {1, …, K}; select from the K basic poses the basic pose whose orientation is consistent with the predicted orientation o;
S32. Use a generator G_1 to learn to adjust the selected basic pose based on the text information, generating a predicted pose. During learning, compute the error between the orientation o (obtained with a softmax function) and the true orientation, the mean-square error between the predicted pose and the ground-truth pose p, and the adversarial error of the predicted pose; the three errors together serve as supervision.
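The orientation-prediction half of S31 is a K-way classification: the text feature vector goes through a fully connected layer and a softmax over the K basic-pose orientations, and the matching basic pose is selected. A minimal numpy sketch, where `W` and `b` are hypothetical stand-ins for the trained layer F_ori and the text vector is assumed to have been extracted by the LSTM already:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_orientation(t_vec, W, b, basic_poses):
    """Map a text feature vector t_vec to logits over the K orientations
    via a fully connected layer (F_ori), then return the predicted
    orientation index o and the basic pose selected for it."""
    logits = W @ t_vec + b          # fully connected layer
    probs = softmax(logits)         # softmax over the K orientations
    o = int(probs.argmax())         # predicted orientation o
    return o, basic_poses[o]
```

The generator G_1 that refines the selected basic pose is a trained network and is not reproduced here.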
Further, in step S4, using the pose generator learned in S2~S3 to predict the corresponding person pose from text specifically comprises the following sub-step:
Based on the person pose generator established in S2~S3, input the description text t of the target image, predict the person pose orientation from the text, adjust the basic pose according to the text, and generate a predicted person pose that matches the text description.
Further, in step S5, training the person image generator based on generative adversarial networks to generate person images matching the text description, while establishing the mapping relations between image sub-regions and the text through a multi-modal error, specifically comprises the following sub-steps:
S51. Extract features from the person reference image x using a convolutional neural network, obtaining depth features (v_1, v_2, …, v_m) at different scales, where v_i is the image depth feature at the i-th scale, i = 1, 2, …, m, and m is the total number of down-sampling steps;
S52. Extract features from the person pose predicted in step S4 using a convolutional neural network, obtaining depth features (s_1, s_2, …, s_m) at different scales, where s_i is the pose depth feature at the i-th scale, i = 1, 2, …, m, and m is the total number of down-sampling steps;
S53. Extract the text feature matrix e using a bidirectional LSTM; e is composed of all hidden state vectors h_j concatenated, i.e. e = (h_1, h_2, …, h_N), where j = 1, 2, …, N and N is the number of words in the text;
S54. Compute the visual-text attention at the i-th scale, c_i = v_i Softmax(v_i^T e), and measure the distance between the sub-regions of image x and the text t through the multi-scale visual-text distance, establishing the relation between image sub-regions and text:
where c_ij is the j-th column of the visual-text attention c_i, e_j is the j-th column of the text feature matrix e, i.e. h_j, and r(·) is the cosine similarity between two vectors;
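The attention of S54 follows directly from the stated formula c_i = v_i Softmax(v_i^T e), with r(·) as cosine similarity. The shapes below are assumptions (d-dimensional features over r sub-regions and N words), and the per-scale distance shown simply sums the column-wise similarities, since the exact aggregation formula is not reproduced in the text:

```python
import numpy as np

def softmax(x, axis=-1):
    ex = np.exp(x - x.max(axis=axis, keepdims=True))
    return ex / ex.sum(axis=axis, keepdims=True)

def visual_text_attention(v, e):
    """v: (d, r) image features over r sub-regions at one scale;
    e: (d, N) text feature matrix, one column per word.
    Returns c = v @ softmax(v^T e) with one visual context per word."""
    return v @ softmax(v.T @ e, axis=0)

def cosine(a, b):
    """r(.): cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def scale_distance(v, e):
    """Sum of cosine similarities between each word's visual context
    c_j and its word feature e_j (assumed aggregation at one scale)."""
    c = visual_text_attention(v, e)
    return sum(cosine(c[:, j], e[:, j]) for j in range(e.shape[1]))
```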
S55. For each training pair (x_i, t_i), compute the multi-scale visual-text distance matrix Λ, where I is the total number of training pairs in each batch, and x_i and t_i are the reference image and the target-image description text of the i-th pair; the element in row i, column j of Λ is the distance between x_i and t_j. The posterior probability that image matches text is P(t_i | x_i) = Softmax(Λ)_(i,i), and the posterior probability that text matches image is P(x_i | t_i) = Softmax(Λ^T)_(i,i). The multi-modal similarity error is computed as:
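The matching posteriors of S55 are the diagonals of a row-wise softmax over Λ and over Λ^T. The error formula itself is an image not reproduced in the text; the sketch below assumes the common choice of a negative log-likelihood over the two diagonals:

```python
import numpy as np

def softmax(x, axis=-1):
    ex = np.exp(x - x.max(axis=axis, keepdims=True))
    return ex / ex.sum(axis=axis, keepdims=True)

def multimodal_similarity_error(Lmat):
    """Lmat[i, j]: multi-scale visual-text distance between image x_i and
    text t_j in a batch of I pairs. P(t_i|x_i) and P(x_i|t_i) are the
    diagonals of row-wise softmaxes; the error is their summed -log
    (assumed form, the patent's formula image is not reproduced)."""
    p_t_given_x = np.diag(softmax(Lmat, axis=1))    # P(t_i | x_i)
    p_x_given_t = np.diag(softmax(Lmat.T, axis=1))  # P(x_i | t_i)
    return float(-(np.log(p_t_given_x) + np.log(p_x_given_t)).sum())
```

A batch whose matching pairs score highest on the diagonal yields a lower error than one where they do not, which is exactly the pressure that establishes the sub-region/text mapping.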
S56. Perform the attentional up-sampling operation when generating the person image: first compute the word-visual attention at the i-th scale, z_i = e Softmax(e^T v_i); the up-sampling at the i-th scale combines z_i with a nearest-neighbour up-sampling at that scale, where u_{i-1} is the up-sampling result at the previous scale (for i = 1 the input features are used). Cascading multiple attentional up-sampling operations generates the person image, which is learned through the adversarial error. During learning, compute the multi-modal similarity error, the adversarial error of the generated person image, and the L1 error with respect to the target image x′; the three errors together serve as supervision.
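The two ingredients of the attentional up-sampling in S56, the word-visual attention z_i = e Softmax(e^T v_i) and a nearest-neighbour up-sampling, can be sketched as follows. How the module combines them with u_{i-1} is not fully specified in the text, so only the pieces are shown, with assumed shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    ex = np.exp(x - x.max(axis=axis, keepdims=True))
    return ex / ex.sum(axis=axis, keepdims=True)

def word_visual_attention(e, v):
    """z = e @ softmax(e^T v): a text context vector aligned to each of
    the r visual sub-regions. e: (d, N) word features, v: (d, r)."""
    return e @ softmax(e.T @ v, axis=0)

def nearest_upsample(u, factor=2):
    """Nearest-neighbour up-sampling of a (c, h, w) feature map by
    repeating each spatial element `factor` times along h and w."""
    return u.repeat(factor, axis=1).repeat(factor, axis=2)
```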
Compared with existing person image generation methods, the text-guided person image generation method based on generative adversarial networks of the present invention has the following beneficial effects:
First, the present invention controls the generation of the person image through a text description, i.e. the text description both controls the pose change of the person and modifies the clothing-color attributes of the person. Seeking control through text description is friendlier and more convenient for users.
Second, the invention proposes a method for predicting the person pose from text, which can predict from the text description a reasonable person pose that matches the orientation and action described in the text.
Finally, the invention proposes the attentional up-sampling module, which effectively integrates data from different modalities, including text, pose and image. At the same time, the module can preserve the person identity information of the reference image, making the generated person image more natural and realistic.
The text-guided person image generation method based on generative adversarial networks of the present invention has good application value in scenarios such as image generation, image editing and person re-identification. For example, in an image-editing scenario, according to a text description and a reference image, an image can be generated in which the person is the same as in the reference image but the pose and clothing-color attributes are changed; images with different poses and attributes are obtained by modifying the keywords in the text description, an approach that is friendlier and more convenient for users. Generating such images also plays a fundamental role in other work, because obtaining datasets is itself expensive and in some cases even difficult; such person images can be generated by this application, which is conducive to the development of other related work.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention;
Fig. 2 is the flow diagram of the embodiment.
Detailed description of the embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
On the contrary, the present invention covers any substitution, modification, equivalent method and scheme made within the spirit and scope of the present invention as defined by the claims. Further, in order to give the public a better understanding of the present invention, some specific details are described below in detail; a person skilled in the art can also fully understand the present invention without these details.
With reference to Fig. 1, in a preferred embodiment, a text-guided person image generation method based on generative adversarial networks comprises the following steps:
S1. Obtain a person image dataset for training, and define the algorithm objective. Specifically: the person image dataset contains several person images, each annotated with a text description of the person in the image; the algorithm objective is defined as follows: for each person in the training set there exist a reference image x, a target image x′, the pose p of the person in the target image, and the description text t of the target image; given the reference image x and the description text t of the target image, predict the pose and action of the target from the description text t, and generate an image similar to the target image x′.
S2. Obtain the pose information of all images in the person image dataset, and obtain basic poses from all pose information through a clustering algorithm. The specific sub-steps are as follows:
S21. Obtain the person poses of all images in the dataset through a pose detection algorithm;
S22. Cluster the person poses through the K-means algorithm, and compute the average pose of the i-th cluster as a basic pose, obtaining K basic poses in total.
S3. Train a pose generator based on generative adversarial networks to predict a pose from the target text. The specific sub-steps are as follows:
S31. Use an LSTM network to extract the feature representation vector of the target description text t, and predict the orientation o of the pose described by the text through a fully connected neural network F_ori, where o ∈ {1, …, K}; select from the K basic poses the basic pose whose orientation is consistent with the predicted orientation o;
S32. Use a generator G_1 to learn to adjust the selected basic pose based on the text information, generating a predicted pose; during learning, compute the error between the orientation o (obtained with a softmax function) and the true orientation, the mean-square error between the predicted pose and the ground-truth pose p, and the adversarial error of the predicted pose, and use the three errors together as supervision.
S4. Use the pose generator learned in S2~S3 to predict the corresponding person pose from text. Specifically: based on the person pose generator established in S2~S3, input the description text t of the target image, predict the person pose orientation from the text, adjust the basic pose according to the text, and generate a predicted person pose that matches the text description.
S5. Train a person image generator based on generative adversarial networks to generate person images that match the text description, while establishing mapping relations between image sub-regions and the text through a multi-modal error. The specific sub-steps are as follows:
S51. Extract features from the person reference image x using a convolutional neural network, obtaining depth features (v_1, v_2, …, v_m) at different scales, where v_i is the image depth feature at the i-th scale, i = 1, 2, …, m, and m is the total number of down-sampling steps;
S52. Extract features from the person pose predicted in step S4 using a convolutional neural network, obtaining depth features (s_1, s_2, …, s_m) at different scales, where s_i is the pose depth feature at the i-th scale, i = 1, 2, …, m, and m is the total number of down-sampling steps;
S53. Extract the text feature matrix e using a bidirectional LSTM; e is composed of all hidden state vectors h_j concatenated, i.e. e = (h_1, h_2, …, h_N), where j = 1, 2, …, N and N is the number of words in the text;
S54. Compute the visual-text attention at the i-th scale, c_i = v_i Softmax(v_i^T e), and measure the distance between the sub-regions of image x and the text t through the multi-scale visual-text distance, establishing the relation between image sub-regions and text, where c_ij is the j-th column of the visual-text attention c_i, e_j is the j-th column of the text feature matrix e, i.e. h_j, and r(·) is the cosine similarity between two vectors;
S55. For each training pair (x_i, t_i), compute the multi-scale visual-text distance matrix Λ, where I is the total number of training pairs in each batch, and x_i and t_i are the reference image and the target-image description text of the i-th pair; the posterior probability that image matches text is P(t_i | x_i) = Softmax(Λ)_(i,i), and the posterior probability that text matches image is P(x_i | t_i) = Softmax(Λ^T)_(i,i); the multi-modal similarity error is computed from these posteriors;
S56. Perform the attentional up-sampling operation when generating the person image: first compute the word-visual attention at the i-th scale, z_i = e Softmax(e^T v_i); the up-sampling at the i-th scale combines z_i with a nearest-neighbour up-sampling at that scale, where u_{i-1} is the up-sampling result at the previous scale (for i = 1 the input features are used); cascading multiple attentional up-sampling operations generates the person image, which is learned through the adversarial error; during learning, compute the multi-modal similarity error, the adversarial error of the generated person image, and the L1 error with respect to the target image x′, and use the three errors together as supervision.
S6. Using the person image generator learned in S5, input the reference image and the description text of the target image to produce a person image that matches the text description.
The above method is applied in a specific embodiment below, so that those skilled in the art can better understand the effect of the present invention.
Embodiment
In this embodiment, the person pose generator and the person image generator are learned according to steps S1~S5 above; the implementation of each step is as described before and is not elaborated again; only the effect on the case data is shown below. This embodiment is carried out on the CUHK-PEDES dataset with text annotations, whose images come from five person re-identification datasets, namely CUHK03, Market-1501, SSM, VIPER and CUHK01, containing 40,206 images of 13,003 persons in total. This embodiment is tested on the CUHK-PEDES dataset.
The main flow of person image generation is as follows:
1) Predict a consistent person pose from the description text through the person pose generator;
2) Change the keywords describing the color attributes in the description text, as shown in Fig. 2;
3) Input the predicted person pose, the modified description text and the reference image into the person image generator, obtaining a person image in which the pose and attributes are changed;
4) To comprehensively compare the validity of this method, we compare against other state-of-the-art methods, with similar person image generation frameworks suitably modified to fit the task targeted by this method;
5) The structural similarity (SSIM) and Inception score (IS) of this embodiment are shown in Table 1, where PT refers to changing only the pose, and P&AT refers to changing both the pose and the color attributes. Furthermore, for this task this embodiment proposes a VQA perceptual score to measure the correctness of the color-attribute change. The data show that, in the three metrics of structural similarity, Inception score and VQA perceptual score, the present invention is further improved overall compared with other methods and with text-controlled person image generation methods under similar modified frameworks. The VQA perceptual score is computed as T / N: a program first randomly changes the color attributes in the description text (10 colors are considered in total) and generates the corresponding image, recording the changed color attribute as the correct answer; then the program asks a VQA model a question related to the person's body part (the color of the clothes or trousers); finally the answers returned by the VQA model are collected and the accuracy is computed, where T is the number of images for which the correct answer is returned and N is the total number of images.
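The VQA perceptual score reduces to an accuracy T / N over the colour-attribute answers; a direct sketch (the VQA model itself is external and only its returned answers are assumed here):

```python
def vqa_perceptual_score(answers, truths):
    """Score = T / N: the fraction of generated images for which the VQA
    model's answer matches the colour attribute written into the text.
    `answers`: colour names returned by the VQA model, one per image;
    `truths`: the recorded correct colours, in the same order."""
    assert len(answers) == len(truths)
    correct = sum(a == t for a, t in zip(answers, truths))  # T
    return correct / len(answers)                           # T / N
```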
Table 1. SSIM and IS metrics of this embodiment on the CUHK-PEDES dataset
Method | SSIM(PT) | IS(PT) | IS(P&AT) |
SIS[1] | 0.239±0.106 | 3.707±0.185 | 3.790±0.182 |
AttnGAN[2] | 0.298±0.126 | 3.695±0.110 | 3.726±0.123 |
PG2[3] | 0.237±0.120 | 3.473±0.009 | 3.486±0.125 |
Single AU | 0.305±0.121 | 4.015±0.009 | 4.071±0.149 |
ours | 0.364±0.123 | 4.209±0.165 | 4.218±0.195 |
Table 2. VQA perceptual score of this embodiment on the CUHK-PEDES dataset
Method | VQA perceptual score |
Real image | 0.698 |
SIS[1] | 0.275 |
AttnGAN[2] | 0.139 |
PG2[3] | 0.110 |
Single AU | 0.205 |
ours | 0.334 |
Here "ours" is the method of this embodiment, cascading 3 up-sampling operations in S56; "Single AU" uses a single attentional up-sampling operation in S56 instead of the cascade of 3, with everything else identical to ours; "Real image" in Table 2 refers to the result of questioning the VQA model on the original images in the dataset. The references corresponding to the other methods are as follows:
[1] H. Dong, S. Yu, C. Wu, and Y. Guo. Semantic image synthesis via adversarial learning. In ICCV, 2017.
[2] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
[3] L. Ma, J. Xu, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017.
Through the above technical scheme, the present invention provides, based on deep learning technology, a text-guided person image generation method based on generative adversarial networks. The present invention can generate realistic and vivid person images, controlling the pose and the attributes of the person in the generated image through the description text.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.
Claims (6)
1. one kind passes through text Conrad object image generation method based on confrontation network is generated, which is characterized in that including following
Step:
S1, it obtains for trained character image data set, and defines algorithm target;
S2, the posture information for obtaining all images in character image data set, are obtained from all posture informations by clustering algorithm
Take basic poses;
S3, the study carried out based on the posture generator for generating confrontation network from target text to prediction posture is utilized;
S4, using the acquistion of the middle school S2~S3 to posture generator predict to obtain corresponding personage's posture from text;
S5, the personage's picture generation for carrying out meeting text description based on the personage's picture generator for generating confrontation network is utilized
It practises, while establishing the mapping relations between picture subregion and text using multi-modal error.
S6, the personage's picture generator learnt using S5 input the description text of reference picture and Target Photo, generate symbol
Close personage's picture of text description.
2. as described in claim 1 pass through text Conrad object image generation method, feature based on generation confrontation network
It is, in step S1, the character image data set includes several personage's pictures, and each personage's picture is labelled with to be directed to and be somebody's turn to do
The text description of personage, the algorithm target of definition in picture are as follows: for each of training set personage, there are reference picture x,
Target Photo x ', the posture p of the personage and description text t of Target Photo in Target Photo;Input reference picture x and target
The description text t of picture, it is desirable that from the posture and movement of description text t prediction target, generate and the similar figure of Target Photo x '
Piece
3. as claimed in claim 2 pass through text Conrad object image generation method, feature based on generation confrontation network
It is, in step S2, obtains the posture information of all images in character image data set, believed by clustering algorithm from all postures
Basic poses are obtained in breath, specifically include following sub-step:
S21, personage's posture that all pictures in data set are obtained by attitude detection algorithm;
S22, personage's posture is clustered by K-means clustering algorithm, and calculates the average posture of ith cluster
And as basic poses, K basic poses are acquired altogether
4. The text-guided person image generation method based on a generative adversarial network according to claim 3, characterized in that in step S3, a pose generator based on a generative adversarial network is used to learn the prediction of the pose from the target text, specifically comprising the following sub-steps:
S31. Use an LSTM network to extract the feature representation vector of the goal description text t, and predict the direction o of the pose described by the text through a fully connected network, where o ∈ {1, ..., K}; select from the K basic poses the basic pose consistent with the predicted direction o;
S32. Use a generator G1 to learn to adjust the selected basic pose based on the text feature, generating a predicted pose. In the learning process, the softmax error between the predicted direction o and the true direction, the mean squared error between the predicted pose and the ground-truth pose p, and the adversarial error of the predicted pose are computed, and the three errors together serve as the supervision signal.
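A minimal PyTorch sketch of the S31–S32 pipeline is given below. The layer sizes, the residual way G1 adjusts the basic pose, and all names are assumptions for illustration only; the claim specifies only the components (LSTM text encoder, fully connected direction head, generator G1), not this architecture.

```python
import torch
import torch.nn as nn

class PoseGenerator(nn.Module):
    """Sketch of S31-S32: an LSTM encodes the description text, a fully
    connected head predicts the pose direction o among K basic poses,
    and a small generator G1 adjusts the selected basic pose
    conditioned on the text feature."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, k, pose_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.direction_head = nn.Linear(hidden_dim, k)   # predicts direction o
        self.g1 = nn.Sequential(                         # adjusts the basic pose
            nn.Linear(hidden_dim + pose_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, pose_dim),
        )

    def forward(self, tokens, basic_poses):
        # text feature: final LSTM hidden state
        _, (h, _) = self.lstm(self.embed(tokens))
        t_feat = h[-1]                                   # (B, hidden_dim)
        logits = self.direction_head(t_feat)             # (B, K), softmax-supervised
        o = logits.argmax(dim=1)                         # predicted direction
        base = basic_poses[o]                            # (B, pose_dim)
        # predicted pose = selected basic pose + text-conditioned adjustment
        pose = base + self.g1(torch.cat([t_feat, base], dim=1))
        return logits, pose

model = PoseGenerator(vocab_size=1000, embed_dim=32, hidden_dim=64, k=4, pose_dim=36)
tokens = torch.randint(0, 1000, (2, 7))   # batch of 2 texts, 7 tokens each
basic = torch.randn(4, 36)                # K = 4 basic poses
logits, pose = model(tokens, basic)
print(logits.shape, pose.shape)
```

During training, `logits` would receive a cross-entropy (softmax) loss against the true direction, `pose` an MSE loss against the ground-truth pose p, plus an adversarial loss from a pose discriminator, matching the three supervision errors of S32.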
5. The text-guided person image generation method based on a generative adversarial network according to claim 4, characterized in that in step S4, the pose generator trained in steps S2~S3 is used to predict the corresponding person pose from the text, specifically comprising the following sub-step:
Based on the person pose generator established through S2~S3, input the description text t of the target picture, predict the direction of the person pose from the text, adjust the basic pose according to the text, and generate a predicted person pose that conforms to the text description.
6. The text-guided person image generation method based on a generative adversarial network according to claim 5, characterized in that in step S5, a person picture generator based on a generative adversarial network is used to learn the generation of person pictures conforming to the text description, while a multi-modal error is used to establish the mapping relationship between picture sub-regions and the text, specifically comprising the following sub-steps:
S51. Perform feature extraction on the person reference picture x using a convolutional neural network, obtaining depth features (v1, v2, ..., vm) at different scales, where vi is the picture depth feature at the i-th scale, i = 1, 2, ..., m, and m is the total number of down-sampling operations;
S52. Perform feature extraction on the predicted person pose obtained in step S4 using a convolutional neural network, obtaining depth features (s1, s2, ..., sm) at different scales, where si is the pose depth feature at the i-th scale, i = 1, 2, ..., m, and m is the total number of down-sampling operations;
S53. Extract the text feature matrix e using a bidirectional LSTM, where e is the concatenation of all the hidden state vectors h, i.e. e = (h1, h2, ..., hN), ej = hj, j = 1, 2, ..., N, and N is the number of words in the text;
S54. Compute the visual-text attention at the i-th scale, ci = vi·Softmax(vi^T e), and measure the distance between the sub-regions of picture x and the text t by the multi-scale visual-text distance, thereby establishing the relationship between picture sub-regions and the text, where cij is the j-th column of the visual-text attention ci, ej is the j-th column of the text feature matrix e (i.e. hj), and r(·, ·) is the cosine similarity between two vectors;
S55. Compute the multi-scale visual-text distance matrix Λ over every training pair, where I is the total number of training pairs in each training batch, and xi and ti are respectively the reference picture of the i-th training pair and the description text of its target picture; the element in the i-th row and j-th column of Λ is the multi-scale visual-text distance between the i-th picture and the j-th text; the posterior probability that a picture matches its text is P(ti|xi) = Softmax(Λ)(i, i), the posterior probability that a text matches its picture is P(xi|ti) = Softmax(Λ^T)(i, i), and the multi-modal similarity error is computed from these posterior probabilities;
S56. Perform the attention up-sampling operation when generating the person picture: first compute the word-visual attention at the i-th scale, zi = e·Softmax(e^T vi); the up-sampling at the i-th scale combines the nearest-neighbour up-sampling at that scale with the up-sampling result ui−1 of the previous scale (absent when i = 1); cascade the multiple attention up-sampling operations to generate the person picture, learned through the adversarial error. In the learning process, the multi-modal similarity error, the adversarial error of the generated person picture, and the L1 error with respect to the target picture x′ are computed, and the three errors together serve as the supervision signal.
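The visual-text attention and similarity matrix of S54–S55 can be sketched as follows. This is a simplified PyTorch illustration under assumed shapes (feature dimension d, R sub-regions per scale, N words): the multi-scale distance here is a plain mean of cosine similarities r(cij, ej), since the claim's exact aggregation formula is not reproduced in the published text.

```python
import torch
import torch.nn.functional as F

def visual_text_attention(v, e):
    """S54: v is one scale's picture feature map, shape (d, R sub-regions);
    e is the text feature matrix, shape (d, N words).  c = v Softmax(v^T e),
    so column c_j is the region-attended visual feature for word j."""
    return v @ F.softmax(v.T @ e, dim=0)           # (d, N)

def multiscale_distance(vs, e):
    """Aggregate, over scales and words, the cosine similarity
    r(c_ij, e_j) between attended visual features and word features
    (simple mean aggregation, an assumption of this sketch)."""
    sims = []
    for v in vs:                                   # one feature map per scale
        c = visual_text_attention(v, e)            # (d, N)
        sims.append(F.cosine_similarity(c, e, dim=0).mean())
    return torch.stack(sims).mean()

def similarity_matrix(batch_vs, batch_e):
    """S55: Lambda[i, j] = multi-scale distance between picture i and
    text j; the diagonal softmax entries give the matching posteriors
    P(t_i | x_i) and P(x_i | t_i)."""
    n = len(batch_vs)
    lam = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            lam[i, j] = multiscale_distance(batch_vs[i], batch_e[j])
    p_t_given_x = F.softmax(lam, dim=1).diag()     # P(t_i | x_i)
    p_x_given_t = F.softmax(lam.T, dim=1).diag()   # P(x_i | t_i)
    return lam, p_t_given_x, p_x_given_t

# toy batch: 3 pairs, 2 scales with 16 and 4 sub-regions, 5 words, d = 8
torch.manual_seed(0)
vs = [[torch.randn(8, 16), torch.randn(8, 4)] for _ in range(3)]
es = [torch.randn(8, 5) for _ in range(3)]
lam, p_tx, p_xt = similarity_matrix(vs, es)
print(lam.shape, p_tx.shape)
```

A multi-modal similarity loss in the spirit of S55 would then penalize the negative log of the diagonal posteriors, pushing each picture toward its own description and away from the other texts in the batch.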
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910257463.9A CN110021051B (en) | 2019-04-01 | 2019-04-01 | Human image generation method based on generation of confrontation network through text guidance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910257463.9A CN110021051B (en) | 2019-04-01 | 2019-04-01 | Human image generation method based on generation of confrontation network through text guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110021051A true CN110021051A (en) | 2019-07-16 |
CN110021051B CN110021051B (en) | 2020-12-15 |
Family
ID=67190349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910257463.9A Active CN110021051B (en) | 2019-04-01 | 2019-04-01 | Human image generation method based on generation of confrontation network through text guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110021051B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427864A (en) * | 2019-07-29 | 2019-11-08 | 腾讯科技(深圳)有限公司 | A kind of image processing method, device and electronic equipment |
CN110555458A (en) * | 2019-07-24 | 2019-12-10 | 中北大学 | Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism |
CN110705306A (en) * | 2019-08-29 | 2020-01-17 | 首都师范大学 | Evaluation method for consistency of written and written texts |
CN111046166A (en) * | 2019-12-10 | 2020-04-21 | 中山大学 | Semi-implicit multi-modal recommendation method based on similarity correction |
CN111062865A (en) * | 2020-03-18 | 2020-04-24 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, computer equipment and storage medium |
CN111091059A (en) * | 2019-11-19 | 2020-05-01 | 佛山市南海区广工大数控装备协同创新研究院 | Data equalization method in household garbage plastic bottle classification |
CN111369468A (en) * | 2020-03-09 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Image processing method, image processing device, electronic equipment and computer readable medium |
CN111402365A (en) * | 2020-03-17 | 2020-07-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111476241A (en) * | 2020-03-04 | 2020-07-31 | 上海交通大学 | Character clothing conversion method and system |
CN111583213A (en) * | 2020-04-29 | 2020-08-25 | 西安交通大学 | Image generation method based on deep learning and no-reference quality evaluation |
CN111667547A (en) * | 2020-06-09 | 2020-09-15 | 创新奇智(北京)科技有限公司 | GAN network training method, clothing picture generation method, device and electronic equipment |
CN111898456A (en) * | 2020-07-06 | 2020-11-06 | 贵州大学 | Text modification picture network model training method based on multi-level attention mechanism |
CN111950346A (en) * | 2020-06-28 | 2020-11-17 | 中国电子科技网络信息安全有限公司 | Pedestrian detection data expansion method based on generation type countermeasure network |
CN112001279A (en) * | 2020-08-12 | 2020-11-27 | 山东省人工智能研究院 | Cross-modal pedestrian re-identification method based on dual attribute information |
CN112784677A (en) * | 2020-12-04 | 2021-05-11 | 上海芯翌智能科技有限公司 | Model training method and device, storage medium and computing equipment |
CN112966760A (en) * | 2021-03-15 | 2021-06-15 | 清华大学 | Neural network fusing text and image data and design method of building structure thereof |
CN113205574A (en) * | 2021-04-30 | 2021-08-03 | 武汉大学 | Art character style migration system based on attention system |
CN113222875A (en) * | 2021-06-01 | 2021-08-06 | 浙江大学 | Image harmonious synthesis method based on color constancy |
CN113919998A (en) * | 2021-10-14 | 2022-01-11 | 天翼数字生活科技有限公司 | Image anonymization method based on semantic and attitude map guidance |
CN114119811A (en) * | 2022-01-28 | 2022-03-01 | 北京智谱华章科技有限公司 | Image generation method and device and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180374249A1 (en) * | 2017-06-27 | 2018-12-27 | Mad Street Den, Inc. | Synthesizing Images of Clothing on Models |
CN109215007A (en) * | 2018-09-21 | 2019-01-15 | 维沃移动通信有限公司 | A kind of image generating method and terminal device |
CN109523616A (en) * | 2018-12-04 | 2019-03-26 | 科大讯飞股份有限公司 | A kind of FA Facial Animation generation method, device, equipment and readable storage medium storing program for executing |
Non-Patent Citations (2)
Title |
---|
LIQIAN MA et al.: "Disentangled Person Image Generation", ResearchGate *
HE Peilin et al.: "Face Image Translation Based on Generative Adversarial Text", Computer Technology and Automation *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555458A (en) * | 2019-07-24 | 2019-12-10 | 中北大学 | Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism |
CN110555458B (en) * | 2019-07-24 | 2022-04-19 | 中北大学 | Multi-band image feature level fusion method for generating countermeasure network based on attention mechanism |
CN110427864A (en) * | 2019-07-29 | 2019-11-08 | 腾讯科技(深圳)有限公司 | A kind of image processing method, device and electronic equipment |
CN110705306B (en) * | 2019-08-29 | 2020-08-18 | 首都师范大学 | Evaluation method for consistency of written and written texts |
CN110705306A (en) * | 2019-08-29 | 2020-01-17 | 首都师范大学 | Evaluation method for consistency of written and written texts |
CN111091059A (en) * | 2019-11-19 | 2020-05-01 | 佛山市南海区广工大数控装备协同创新研究院 | Data equalization method in household garbage plastic bottle classification |
CN111046166A (en) * | 2019-12-10 | 2020-04-21 | 中山大学 | Semi-implicit multi-modal recommendation method based on similarity correction |
CN111476241B (en) * | 2020-03-04 | 2023-04-21 | 上海交通大学 | Character clothing conversion method and system |
CN111476241A (en) * | 2020-03-04 | 2020-07-31 | 上海交通大学 | Character clothing conversion method and system |
CN111369468A (en) * | 2020-03-09 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Image processing method, image processing device, electronic equipment and computer readable medium |
CN111369468B (en) * | 2020-03-09 | 2022-02-01 | 北京字节跳动网络技术有限公司 | Image processing method, image processing device, electronic equipment and computer readable medium |
CN111402365B (en) * | 2020-03-17 | 2023-02-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111402365A (en) * | 2020-03-17 | 2020-07-10 | 湖南大学 | Method for generating picture from characters based on bidirectional architecture confrontation generation network |
CN111062865A (en) * | 2020-03-18 | 2020-04-24 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, computer equipment and storage medium |
CN111583213A (en) * | 2020-04-29 | 2020-08-25 | 西安交通大学 | Image generation method based on deep learning and no-reference quality evaluation |
CN111583213B (en) * | 2020-04-29 | 2022-06-07 | 西安交通大学 | Image generation method based on deep learning and no-reference quality evaluation |
CN111667547B (en) * | 2020-06-09 | 2023-08-11 | 创新奇智(北京)科技有限公司 | GAN network training method, garment picture generation method and device and electronic equipment |
CN111667547A (en) * | 2020-06-09 | 2020-09-15 | 创新奇智(北京)科技有限公司 | GAN network training method, clothing picture generation method, device and electronic equipment |
CN111950346A (en) * | 2020-06-28 | 2020-11-17 | 中国电子科技网络信息安全有限公司 | Pedestrian detection data expansion method based on generation type countermeasure network |
CN111898456A (en) * | 2020-07-06 | 2020-11-06 | 贵州大学 | Text modification picture network model training method based on multi-level attention mechanism |
CN111898456B (en) * | 2020-07-06 | 2022-08-09 | 贵州大学 | Text modification picture network model training method based on multi-level attention mechanism |
CN112001279A (en) * | 2020-08-12 | 2020-11-27 | 山东省人工智能研究院 | Cross-modal pedestrian re-identification method based on dual attribute information |
CN112001279B (en) * | 2020-08-12 | 2022-02-01 | 山东省人工智能研究院 | Cross-modal pedestrian re-identification method based on dual attribute information |
CN112784677A (en) * | 2020-12-04 | 2021-05-11 | 上海芯翌智能科技有限公司 | Model training method and device, storage medium and computing equipment |
CN112966760A (en) * | 2021-03-15 | 2021-06-15 | 清华大学 | Neural network fusing text and image data and design method of building structure thereof |
CN112966760B (en) * | 2021-03-15 | 2021-11-09 | 清华大学 | Neural network fusing text and image data and design method of building structure thereof |
CN113205574A (en) * | 2021-04-30 | 2021-08-03 | 武汉大学 | Art character style migration system based on attention system |
CN113222875A (en) * | 2021-06-01 | 2021-08-06 | 浙江大学 | Image harmonious synthesis method based on color constancy |
CN113919998A (en) * | 2021-10-14 | 2022-01-11 | 天翼数字生活科技有限公司 | Image anonymization method based on semantic and attitude map guidance |
CN113919998B (en) * | 2021-10-14 | 2024-05-14 | 天翼数字生活科技有限公司 | Picture anonymizing method based on semantic and gesture graph guidance |
CN114119811B (en) * | 2022-01-28 | 2022-04-01 | 北京智谱华章科技有限公司 | Image generation method and device and electronic equipment |
CN114119811A (en) * | 2022-01-28 | 2022-03-01 | 北京智谱华章科技有限公司 | Image generation method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110021051B (en) | 2020-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110021051A (en) | Text-guided person image generation method based on generative adversarial network | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
US20200250226A1 (en) | Similar face retrieval method, device and storage medium | |
CN109359559B (en) | Pedestrian re-identification method based on dynamic shielding sample | |
CN109447115A (en) | Zero sample classification method of fine granularity based on multilayer semanteme supervised attention model | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN111709409A (en) | Face living body detection method, device, equipment and medium | |
JP2017091525A (en) | System and method for attention-based configurable convolutional neural network (abc-cnn) for visual question answering | |
CN104142995B (en) | The social event recognition methods of view-based access control model attribute | |
CN110472688A (en) | The method and device of iamge description, the training method of image description model and device | |
CN106326857A (en) | Gender identification method and gender identification device based on face image | |
US11966829B2 (en) | Convolutional artificial neural network based recognition system in which registration, search, and reproduction of image and video are divided between and performed by mobile device and server | |
CN107992890B (en) | A kind of multi-angle of view classifier and design method based on local feature | |
CN107480688A (en) | Fine granularity image-recognizing method based on zero sample learning | |
CN112949622A (en) | Bimodal character classification method and device fusing text and image | |
CN113761153A (en) | Question and answer processing method and device based on picture, readable medium and electronic equipment | |
CN106897671A (en) | A kind of micro- expression recognition method encoded based on light stream and FisherVector | |
CN111881716A (en) | Pedestrian re-identification method based on multi-view-angle generation countermeasure network | |
CN111582342A (en) | Image identification method, device, equipment and readable storage medium | |
CN108154156A (en) | Image Ensemble classifier method and device based on neural topic model | |
CN111507184B (en) | Human body posture detection method based on parallel cavity convolution and body structure constraint | |
CN105718898A (en) | Face age estimation method and system based on sparse undirected probabilistic graphical model | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
Zhang | Innovation of English teaching model based on machine learning neural network and image super resolution | |
Feng | Mask RCNN-based single shot multibox detector for gesture recognition in physical education |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |