CN108268845A - A dynamics transfer system for synthesizing face video sequences using a generative adversarial network - Google Patents
A dynamics transfer system for synthesizing face video sequences using a generative adversarial network
- Publication number
- CN108268845A (application CN201810045782.9A)
- Authority
- CN
- China
- Prior art keywords
- dynamic
- face
- training
- frame
- appearance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
Abstract
A dynamics transfer system for synthesizing face video sequences using a generative adversarial network is proposed in the present invention. Its main components are: underlying framework definition, network setup, and network solving. The process is as follows: a target framework comprising a generator and discriminators is defined; a pre-trained recurrent network generates appearance-compressed dynamic features, which are merged with static features through a dynamics channel encoder and input to the generator. Then a min-max objective function involving all variables is designed, and the optimal solution of the network is obtained by solving it in three steps under constraint conditions. The present invention can substitute a target image into a face video sequence while reproducing the dynamic behavior of the original video; the generative adversarial network is driven to pursue an optimal equilibrium solution, and facial detail and the degree of dynamics are well preserved while the video frames are replaced.
Description
Technical field
The present invention relates to the field of computer vision, and more particularly to a dynamics transfer system for synthesizing face video sequences using a generative adversarial network.
Background technology
In computer vision, replacing a face in a video sequence — especially so that the replaced face follows the changes of the original sequence — has become a very challenging problem. With the popularity of smartphones and the rising resolution of smartphone cameras, people can take photos, selfies, and videos anytime and anywhere and share them on the Internet. These high-resolution images and videos provide abundant material for face-replacement applications. In addition, the continuous improvement of technologies such as face recognition, occlusion detection and localization, machine learning, and pattern recognition provides solid technical support for research on automatic face-replacement systems for images and video. Face replacement and synthesis in video sequences therefore has important theoretical significance and application value in entertainment, virtual reality, privacy protection, video chat, and other fields. In the entertainment economy, applications based on face-replacement technology occupy top positions in the download rankings of the major mobile application markets; they bring enjoyment to people's daily lives and generate huge economic benefits, while industries, companies, and individuals continually develop ever more lifelike and realistic related technologies to seize market share. Second, face-image replacement plays an important role in virtual reality: scenarios that require the same face to appear in different scenes, or different people to be simulated in the same scene — such as international video conferencing, disaster-escape rehearsal, panoramic techniques, and tourist-attraction previews — all benefit from face-replacement technology. Furthermore, face replacement has important research significance for privacy protection, a topic of wide concern: for example, how to remove or replace the faces of key persons when collecting large amounts of public information, or how to protect uninvolved bystanders in public-security or criminal cases — such privacy problems all require face-replacement technology. However, current industrial and academic approaches still fall short: for example, when transforming an image into video, rich facial expressions and detail information may not be retained, while noise may be introduced, distorting the image.
The present invention proposes a dynamics transfer system that synthesizes face video sequences using a generative adversarial network. First, a target framework comprising a generator and discriminators is defined; appearance-compressed dynamic features are generated with a pre-trained recurrent network, merged with static features through a dynamics channel encoder, and input to the generator. Then a min-max objective function involving all variables is designed, and the optimal solution of the network is obtained by a three-step constrained decomposition. The present invention can substitute a target image into a face video sequence while reproducing the dynamic behavior of the original video; the generative adversarial network is driven to pursue an optimal equilibrium solution, and facial detail and the degree of dynamics are well preserved while the video frames are replaced.
Summary of the invention
To solve the problem of face replacement in video sequences, the purpose of the present invention is to provide a dynamics transfer system that synthesizes face video sequences using a generative adversarial network. First, a target framework comprising a generator and discriminators is defined; appearance-compressed dynamic features are generated with a pre-trained recurrent network, merged with static features through a dynamics channel encoder, and input to the generator. Then a min-max objective function involving all variables is designed, and the optimal solution of the network is obtained by a three-step constrained decomposition. The present invention can substitute a target image into a face video sequence while reproducing the dynamics of the original video; the generative adversarial network is driven to pursue an optimal equilibrium solution, and facial detail and the degree of dynamics are well preserved while the video frames are replaced.
To solve the above problems, the present invention provides a dynamics transfer system for synthesizing face video sequences using a generative adversarial network, whose main components are:
(1) underlying framework definition;
(2) network setup;
(3) network solving.
The underlying framework definition specifies an applicable generative adversarial network model, specifically:
1) given a data set x, define a generator network G, whose role is, after receiving an input random variable z, to imitate the data set x: it transforms the distribution of z and produces an imitation data set G(z);
2) define a discriminator D, whose role is to judge whether the imitation data set G(z) has the same distribution as the true given data set x;
3) define the rule of the game: if the data generated by G successfully fool D, G wins; if D successfully identifies the data imitated by G, D wins;
4) define the framework objective: train the networks G and D simultaneously so that both reach their best performance; after their competition reaches equilibrium, the data set G(z) generated by G is distributed as closely as possible to x. The specific min-max process is
min_G max_D V(D, G) = E_{x~p_x}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]    (1)
where p_x and p_z are the distributions of the variables x and z, respectively.
The network setup replaces the dynamic face images in an original video sequence with a target static face image, and comprises an appearance-compression feature encoder A, a dynamics channel encoder F, a generator G, and a discriminator group D_s, D_d.
The appearance-compression feature encoder uses a pre-trained recurrent neural network and computes the difference between the dynamic faces in the original video and the static face of the first frame of the same video, yielding appearance-compressed dynamic features, which are the input features of the system. Specifically: given an original dynamic video sequence Y = [y_0, y_1, …, y_T] of length T, take the face image y_0 of its first frame as a starting point and replicate it T times to form a static sequence Y^(st) = [y_0, y_0, …, y_0]; then apply the pre-trained recurrent neural network to the dynamic and static sequences respectively to obtain the corresponding hidden spatio-temporal features H and H^(st); finally, take the difference of H and H^(st) to obtain the appearance-compressed dynamic features:
Δ = H − H^(st)    (2)
where the time span remains T.
The dynamics channel encoder linearly combines, frame by frame in time t, the appearance-compressed dynamic features with the static spatial features; the merged feature combination is then input to the generator G, where t ∈ {1, …, T}.
The generator uses a symmetric, front-to-back connected convolutional neural network to learn and extract image features; the features of the image itself can be preserved during training.
The discriminator group: for the currently output dynamic video sequence, a static discriminator D_s and a dynamic discriminator D_d are designed to judge whether the generated sequence is real. The static discriminator D_s checks the fidelity of the currently generated frame content, i.e. its deviation from the original target image; the dynamic discriminator D_d checks whether the current sequence is dynamic, i.e. whether the facial expression and appearance are changing. If the sequence exhibits true dynamics, the label Z^(d) is output; otherwise the opposite label is output.
The network solving comprises an objective function and an optimization process.
The objective function: every parameter that varies during training requires dynamic training, including the dynamics channel encoder F, the generator G, and the discriminators D_s, D_d. Specifically, the training objective is to make the images generated by F and G as close as possible to the original video, so their error must be minimized while the error of the discriminators D_s, D_d is maximized. The mathematical expression of the objective function is therefore the min-max problem
min_{F,G} max_{D_s,D_d} Σ_{t=0}^{T} [log D_s(y_t) + log(1 − D_s(ŷ_t)) + λ‖y_t − ŷ_t‖_1] + log D_d(Y) + log(1 − D_d(Ŷ)) + μ‖Δ − Δ̂‖_1    (3)
where T denotes the time span, ŷ_t and Ŷ denote the generated frame and sequence, Δ and Δ̂ the appearance-compressed dynamic features of the original and generated sequences, and λ, μ weighting coefficients.
The optimization process solves for the optimal solution of mathematical expression (3) in three steps, specifically:
1) maximize the loss terms of the discriminators D_s and D_d, i.e. Σ_t [log D_s(y_t) + log(1 − D_s(ŷ_t))] and log D_d(Y) + log(1 − D_d(Ŷ)), where ŷ_t denotes the generated frame and Ŷ the generated sequence;
2) minimize the adversarial loss Σ_t log(1 − D_s(ŷ_t)) + log(1 − D_d(Ŷ)) to train the generator; meanwhile, minimize the still-image reconstruction loss based on the L1 norm, Σ_t ‖y_t − ŷ_t‖_1, to improve the reconstruction quality of each still frame;
3) to maintain dynamic continuity when the time span is large, constrain the reconstruction loss of the appearance-compressed dynamic features with the L1 norm, ‖Δ − Δ̂‖_1, where Δ and Δ̂ are the appearance-compressed dynamic features of the original and generated sequences.
Through the above three steps, the optimal solution of mathematical expression (3) can be obtained.
Description of the drawings
Fig. 1 is a framework diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention.
Fig. 2 is a system network setup diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention.
Fig. 3 is an example schematic diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention.
Specific embodiment
It should be noted that in the absence of conflict, the feature in embodiment and embodiment in the application can phase
It mutually combines, the present invention is described in further detail in the following with reference to the drawings and specific embodiments.
Fig. 1 is a framework diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention. It mainly comprises the underlying framework definition, the network setup, and the network solving.
The underlying framework definition specifies an applicable generative adversarial network model, specifically:
1) given a data set x, define a generator network G, whose role is, after receiving an input random variable z, to imitate the data set x: it transforms the distribution of z and produces an imitation data set G(z);
2) define a discriminator D, whose role is to judge whether the imitation data set G(z) has the same distribution as the true given data set x;
3) define the rule of the game: if the data generated by G successfully fool D, G wins; if D successfully identifies the data imitated by G, D wins;
4) define the framework objective: train the networks G and D simultaneously so that both reach their best performance; after their competition reaches equilibrium, the data set G(z) generated by G is distributed as closely as possible to x. The specific min-max process is
min_G max_D V(D, G) = E_{x~p_x}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]    (1)
where p_x and p_z are the distributions of the variables x and z, respectively.
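As a minimal illustration of this min-max game, the value function of expression (1) can be evaluated on toy samples. The generator and discriminator below are hypothetical stand-ins (an affine map and a logistic score), not the patent's networks:

```python
import numpy as np

# Sketch of V(D, G) = E_{x~p_x}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))].
rng = np.random.default_rng(0)

def G(z):                      # toy "generator": an affine map of the noise
    return 2.0 * z + 1.0

def D(x):                      # toy "discriminator": a logistic score in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(1.0, 0.5, 10000)   # samples from the real distribution p_x
z = rng.normal(0.0, 1.0, 10000)   # samples from the noise distribution p_z

# D is trained to maximize this value; G is trained to minimize it.
value = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
print(round(value, 3))
```

Since both log terms are logarithms of numbers in (0, 1), the value is always negative; training pushes D up and G down against each other until equilibrium.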
Fig. 2 is a system network setup diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention. The system mainly comprises an appearance-compression feature encoder A, a dynamics channel encoder F, a generator G, and a discriminator group D_s, D_d.
The appearance-compression feature encoder uses a pre-trained recurrent neural network and computes the difference between the dynamic faces in the original video and the static face of the first frame of the same video, yielding appearance-compressed dynamic features, which are the input features of the system. Specifically: given an original dynamic video sequence Y = [y_0, y_1, …, y_T] of length T, take the face image y_0 of the first frame as a starting point and replicate it T times to form a static sequence Y^(st) = [y_0, y_0, …, y_0]; then apply the pre-trained recurrent neural network to the dynamic and static sequences respectively to obtain the corresponding hidden spatio-temporal features H and H^(st); finally, take the difference of H and H^(st) to obtain the appearance-compressed dynamic features:
Δ = H − H^(st)    (2)
where the time span remains T.
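The differencing step can be sketched as follows. The toy tanh recurrence is only a stand-in for the pre-trained recurrent network, and the frame vectors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim = 8, 16

Y = rng.normal(size=(T, dim))          # dynamic sequence y_0..y_{T-1} (flattened frames)
Y_st = np.repeat(Y[:1], T, axis=0)     # static sequence: first frame replicated T times

W = rng.normal(scale=0.1, size=(dim, dim))  # toy recurrent weights (stand-in for the pretrained RNN)

def rnn_states(seq):
    """Return the hidden state at every time step of a toy tanh RNN."""
    h = np.zeros(dim)
    states = []
    for frame in seq:
        h = np.tanh(frame + h @ W)
        states.append(h)
    return np.stack(states)

H, H_st = rnn_states(Y), rnn_states(Y_st)
delta = H - H_st                        # appearance-compressed dynamic features
print(delta.shape)                      # time span is preserved: (T, dim)
```

Note that the first rows of H and H^(st) coincide (both sequences start from y_0), so the difference at t = 0 is exactly zero: the appearance is compressed out and only the dynamics remain.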
The dynamics channel encoder linearly combines, frame by frame in time t, the appearance-compressed dynamic features with the static spatial features; the merged feature combination is then input to the generator G, where t ∈ {1, …, T}.
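A minimal sketch of the per-frame combining, with channel concatenation standing in for the linear combining; all shapes are illustrative assumptions:

```python
import numpy as np

T, c_dyn, c_st, hgt, wid = 8, 4, 3, 16, 16
dyn = np.random.default_rng(2).normal(size=(T, c_dyn, hgt, wid))  # per-frame dynamic features
static = np.random.default_rng(3).normal(size=(c_st, hgt, wid))   # spatial features of the target image

# Frame-by-frame combining: stack the static features onto each frame's dynamic features
combined = np.stack([np.concatenate([dyn[t], static], axis=0) for t in range(T)])
print(combined.shape)   # (T, c_dyn + c_st, hgt, wid): one generator input per frame
```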
The generator uses a symmetric, front-to-back connected convolutional neural network to learn and extract image features; the features of the image itself can be preserved during training.
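The symmetric front-to-back connected structure can be illustrated with toy pooling and upsampling layers standing in for the convolution and deconvolution stages; the skip additions show how features of the image itself are carried through to the output:

```python
import numpy as np

def downsample(x):   # toy 2x average pooling (stand-in for a strided conv layer)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):     # toy nearest-neighbour 2x upsampling (stand-in for a deconv layer)
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(4).normal(size=(16, 16))
e1 = downsample(x)            # contracting path, level 1
e2 = downsample(e1)           # contracting path, level 2 (bottleneck)
d1 = upsample(e2) + e1        # expanding path with a front-to-back skip connection
out = upsample(d1) + x        # symmetric: output resolution equals input resolution
print(out.shape)
```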
The discriminator group: for the currently output dynamic video sequence, a static discriminator D_s and a dynamic discriminator D_d are designed to judge whether the generated sequence is real. The static discriminator D_s checks the fidelity of the currently generated frame content, i.e. its deviation from the original target image; the dynamic discriminator D_d checks whether the current sequence is dynamic, i.e. whether the facial expression and appearance are changing. If the sequence exhibits true dynamics, the label Z^(d) is output; otherwise the opposite label is output.
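The two checks can be sketched with simple scores; these are illustrative stand-ins for the learned discriminators D_s and D_d, not the patent's networks:

```python
import numpy as np

frames = np.random.default_rng(6).normal(size=(8, 64))  # a generated sequence (flattened frames)
target = frames[0]                                      # the target appearance

# Static check (D_s's job, sketched): per-frame deviation from the target image
static_scores = np.linalg.norm(frames - target, axis=1)

# Dynamic check (D_d's job, sketched): is the sequence actually changing over time?
# A truly static sequence would give zero frame-to-frame variation here.
dynamic_score = np.mean(np.linalg.norm(np.diff(frames, axis=0), axis=1))
is_dynamic = dynamic_score > 1e-6       # would receive the label Z^(d) when True
print(bool(is_dynamic))
```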
The network solving comprises an objective function and an optimization process.
The objective function: every parameter that varies during training requires dynamic training, including the dynamics channel encoder F, the generator G, and the discriminators D_s, D_d. Specifically, the training objective is to make the images generated by F and G as close as possible to the original video, so their error must be minimized while the error of the discriminators D_s, D_d is maximized. The mathematical expression of the objective function is therefore the min-max problem
min_{F,G} max_{D_s,D_d} Σ_{t=0}^{T} [log D_s(y_t) + log(1 − D_s(ŷ_t)) + λ‖y_t − ŷ_t‖_1] + log D_d(Y) + log(1 − D_d(Ŷ)) + μ‖Δ − Δ̂‖_1    (3)
where T denotes the time span, ŷ_t and Ŷ denote the generated frame and sequence, Δ and Δ̂ the appearance-compressed dynamic features of the original and generated sequences, and λ, μ weighting coefficients.
The optimization process solves for the optimal solution of mathematical expression (3) in three steps, specifically:
1) maximize the loss terms of the discriminators D_s and D_d, i.e. Σ_t [log D_s(y_t) + log(1 − D_s(ŷ_t))] and log D_d(Y) + log(1 − D_d(Ŷ)), where ŷ_t denotes the generated frame and Ŷ the generated sequence;
2) minimize the adversarial loss Σ_t log(1 − D_s(ŷ_t)) + log(1 − D_d(Ŷ)) to train the generator; meanwhile, minimize the still-image reconstruction loss based on the L1 norm, Σ_t ‖y_t − ŷ_t‖_1, to improve the reconstruction quality of each still frame;
3) to maintain dynamic continuity when the time span is large, constrain the reconstruction loss of the appearance-compressed dynamic features with the L1 norm, ‖Δ − Δ̂‖_1, where Δ and Δ̂ are the appearance-compressed dynamic features of the original and generated sequences.
Through the above three steps, the optimal solution of mathematical expression (3) can be obtained.
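The three steps can be sketched numerically as follows. The discriminator scores, frames, and feature differences are toy stand-ins, and the loss forms (log-likelihood adversarial terms plus L1 reconstruction terms) follow the standard GAN formulation assumed here:

```python
import numpy as np

rng = np.random.default_rng(5)
T, dim = 8, 64
real = rng.normal(size=(T, dim))                     # original video frames (flattened)
fake = real + rng.normal(scale=0.1, size=(T, dim))   # generator output (hypothetical)

d_real = rng.uniform(0.6, 0.9, size=T)  # toy discriminator scores on real frames
d_fake = rng.uniform(0.1, 0.4, size=T)  # toy discriminator scores on generated frames

# Step 1: discriminator loss, maximized over D_s, D_d
loss_D = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Step 2: adversarial loss plus per-frame L1 reconstruction loss, minimized over F, G
loss_adv = np.mean(np.log(1.0 - d_fake))
loss_L1 = np.mean(np.abs(real - fake))

# Step 3: L1 constraint on the appearance-compressed dynamic features
delta_real = real - real[0]             # stand-in for H - H^(st) of the original
delta_fake = fake - fake[0]             # stand-in for H - H^(st) of the generation
loss_dyn = np.mean(np.abs(delta_real - delta_fake))

total_G = loss_adv + loss_L1 + loss_dyn
print(np.isfinite(total_G))
```

Alternating these steps — one ascent on the discriminator loss, then one descent on the combined generator loss — is the usual way such a min-max problem is optimized in practice.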
Fig. 3 is an example schematic diagram of the dynamics transfer system for synthesizing face video sequences using a generative adversarial network according to the present invention. As shown in the figure, for the two expressions "smile" and "surprise", the replaced face in the video reproduces the expressions of the original video, retains rich detail information, and introduces no distortion severe enough to cause visual discomfort.
For those skilled in the art, the present invention is not limited to the details of the above exemplary embodiments, and the present invention can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the invention.
Claims (10)
1. A dynamics transfer system for synthesizing face video sequences using a generative adversarial network, characterized by mainly comprising: an underlying framework definition (1); a network setup (2); and a network solving (3).
2. The underlying framework definition (1) according to claim 1, characterized in that an applicable generative adversarial network model is defined, specifically:
1) given a data set x, a generator network G is defined, whose role is, after receiving an input random variable z, to imitate the data set x by transforming the distribution of z and producing an imitation data set G(z);
2) a discriminator D is defined, whose role is to judge whether the imitation data set G(z) has the same distribution as the true given data set x;
3) the rule of the game is defined: if the data generated by G successfully fool D, G wins; if D successfully identifies the data imitated by G, D wins;
4) the framework objective is defined: the networks G and D are trained simultaneously so that both reach their best performance; after their competition reaches equilibrium, the data set G(z) generated by G is distributed as closely as possible to x, the specific min-max process being
min_G max_D V(D, G) = E_{x~p_x}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]    (1)
where p_x and p_z are the distributions of the variables x and z, respectively.
3. The network setup (2) according to claim 1, characterized in that the dynamic face images in an original video sequence are replaced by a target static face image, the setup comprising an appearance-compression feature encoder A, a dynamics channel encoder F, a generator G, and a discriminator group D_s, D_d.
4. The appearance-compression feature encoder according to claim 3, characterized in that a pre-trained recurrent neural network is used and the difference between the dynamic faces in the original video and the static face of the first frame of the same video is computed, yielding appearance-compressed dynamic features as the input features of the system, specifically: given an original dynamic video sequence Y = [y_0, y_1, …, y_T] of length T, the face image y_0 of the first frame is taken as a starting point and replicated T times to form a static sequence Y^(st) = [y_0, y_0, …, y_0]; the pre-trained recurrent neural network is then applied to the dynamic and static sequences respectively to obtain the corresponding hidden spatio-temporal features H and H^(st); finally the difference of H and H^(st) is taken to obtain the appearance-compressed dynamic features
Δ = H − H^(st)    (2)
where the time span remains T.
5. The dynamics channel encoder according to claim 3, characterized in that the appearance-compressed dynamic features and the static spatial features are linearly combined frame by frame in time t, and the merged feature combination is input to the generator G, where t ∈ {1, …, T}.
6. The generator according to claim 3, characterized in that a symmetric, front-to-back connected convolutional neural network is used to learn and extract image features, and the features of the image itself can be preserved during training.
7. The discriminator group according to claim 3, characterized in that, for the currently output dynamic video sequence, a static discriminator D_s and a dynamic discriminator D_d are designed to judge whether the generated sequence is real, wherein the static discriminator D_s checks the fidelity of the currently generated frame content, i.e. its deviation from the original target image, and the dynamic discriminator D_d checks whether the current sequence is dynamic, i.e. whether the facial expression and appearance are changing; if the sequence exhibits true dynamics, the label Z^(d) is output, otherwise the opposite label is output.
8. The network solving (3) according to claim 1, characterized by comprising an objective function and an optimization process.
9. The objective function according to claim 8, characterized in that every parameter that varies during training requires dynamic training, including the dynamics channel encoder F, the generator G, and the discriminators D_s, D_d; specifically, the training objective is to make the images generated by F and G as close as possible to the original video, so that their error is minimized while the error of the discriminators D_s, D_d is maximized, from which the mathematical expression (3) of the objective function is obtained as a min-max problem over F, G and D_s, D_d summed over the frames, where T denotes the time span.
10. The optimization process according to claim 8, characterized in that the optimal solution of mathematical expression (3) is solved in three steps, specifically: 1) the loss terms of the discriminators D_s and D_d are maximized; 2) the adversarial loss is minimized to train the generator, while the still-image reconstruction loss based on the L1 norm is minimized to improve the reconstruction quality of each still frame; 3) to maintain dynamic continuity when the time span is large, the reconstruction loss of the appearance-compressed dynamic features is constrained based on the L1 norm. Through the above three steps, the optimal solution of mathematical expression (3) can be obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810045782.9A CN108268845A (en) | 2018-01-17 | 2018-01-17 | A dynamics transfer system for synthesizing face video sequences using a generative adversarial network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108268845A true CN108268845A (en) | 2018-07-10 |
Family
ID=62775913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810045782.9A Withdrawn CN108268845A (en) | A dynamics transfer system for synthesizing face video sequences using a generative adversarial network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268845A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451619A (en) * | 2017-08-11 | 2017-12-08 | 深圳市唯特视科技有限公司 | Small-object detection method based on a perceptual generative adversarial network |
CN107577985A (en) * | 2017-07-18 | 2018-01-12 | 南京邮电大学 | Implementation method for face-portrait cartooning based on a cycle generative adversarial network |
Non-Patent Citations (1)
Title |
---|
Wissam J. Baddar, Geonmo Gu, et al.: "Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image", arXiv *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410179A (en) * | 2018-09-28 | 2019-03-01 | 合肥工业大学 | Image anomaly detection method based on a generative adversarial network |
CN109410179B (en) * | 2018-09-28 | 2021-07-23 | 合肥工业大学 | Image anomaly detection method based on a generative adversarial network |
CN109658369A (en) * | 2018-11-22 | 2019-04-19 | 中国科学院计算技术研究所 | Intelligent video generation method and device |
CN111401101A (en) * | 2018-12-29 | 2020-07-10 | 上海智臻智能网络科技股份有限公司 | Portrait-based video generation system |
CN110210386A (en) * | 2019-05-31 | 2019-09-06 | 北京市商汤科技开发有限公司 | Video generation method for motion transfer, and neural network training method and device |
US11443559B2 (en) | 2019-08-29 | 2022-09-13 | PXL Vision AG | Facial liveness detection with a mobile device |
US11669607B2 (en) | 2019-08-29 | 2023-06-06 | PXL Vision AG | ID verification with a mobile device |
CN110647659B (en) * | 2019-09-27 | 2023-09-15 | 上海依图网络科技有限公司 | Imaging system and video processing method |
CN110647659A (en) * | 2019-09-27 | 2020-01-03 | 上海依图网络科技有限公司 | Imaging system and video processing method |
CN110826593A (en) * | 2019-09-29 | 2020-02-21 | 腾讯科技(深圳)有限公司 | Training method for a fused image processing model, image processing method and apparatus, and storage medium |
US11526712B2 (en) | 2019-09-29 | 2022-12-13 | Tencent Technology (Shenzhen) Company Limited | Training method and apparatus for image fusion processing model, device, and storage medium |
CN110647864A (en) * | 2019-09-30 | 2020-01-03 | 上海依图网络科技有限公司 | Single-image and multi-image feature recognition method, device, and medium based on a generative adversarial network |
CN113034698A (en) * | 2019-12-24 | 2021-06-25 | 辉达公司 | Generating panoramas using one or more neural networks |
CN111242837B (en) * | 2020-01-03 | 2023-05-12 | 杭州电子科技大学 | Face anonymization privacy protection method based on a generative adversarial network |
CN111242837A (en) * | 2020-01-03 | 2020-06-05 | 杭州电子科技大学 | Face anonymization privacy protection method based on a generative adversarial network |
CN111243066B (en) * | 2020-01-09 | 2022-03-22 | 浙江大学 | Facial expression transfer method based on self-supervised learning and an adversarial generation mechanism |
CN111243066A (en) * | 2020-01-09 | 2020-06-05 | 浙江大学 | Facial expression transfer method based on self-supervised learning and an adversarial generation mechanism |
WO2022205416A1 (en) * | 2021-04-02 | 2022-10-06 | 深圳先进技术研究院 | Generative adversarial network-based facial expression generation method |
CN117726729A (en) * | 2024-01-30 | 2024-03-19 | 北京烽火万家科技有限公司 | Business card production method, system, medium, and device based on virtual digital human technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268845A (en) | A dynamics transfer system for synthesizing face video sequences using a generative adversarial network | |
Dolhansky et al. | The deepfake detection challenge (dfdc) dataset | |
Chen et al. | Vgan-based image representation learning for privacy-preserving facial expression recognition | |
Duarte et al. | WAV2PIX: Speech-conditioned Face Generation using Generative Adversarial Networks. | |
Zhang | Deepfake generation and detection, a survey | |
CN109815928A (en) | Face image synthesis method and apparatus based on adversarial learning | |
Yadav et al. | Deepfake: A survey on facial forgery technique using generative adversarial network | |
Burton et al. | Mental representations of familiar faces | |
Whittaker et al. | “All around me are synthetic faces”: the mad world of AI-generated media | |
DE112013001461T5 (en) | Modify the look of a participant during a videoconference | |
JP2021516831A (en) | Biological detection method, device and storage medium | |
CN114937115A (en) | Image processing method, face replacement model processing method and device and electronic equipment | |
CN113870133A (en) | Multimedia display and matching method, device, equipment and medium | |
WO2023154135A1 (en) | Systems and methods for facial attribute manipulation | |
Hajarolasvadi et al. | Generative adversarial networks in human emotion synthesis: A review | |
Weerawardana et al. | Deepfakes detection methods: a literature survey | |
Si et al. | Speech2video: Cross-modal distillation for speech to video generation | |
CN115100707A (en) | Model training method, video information generation method, device and storage medium | |
Arora et al. | A review of techniques to detect the GAN-generated fake images | |
CN117041664A (en) | Digital human video generation method and device, electronic equipment and storage medium | |
CN116449958A (en) | Virtual office system based on meta universe | |
WO2023124697A1 (en) | Image enhancement method, apparatus, storage medium, and electronic device | |
Parkin et al. | Creating artificial modalities to solve rgb liveness | |
Chen et al. | Hierarchical cross-modal talking face generationwith dynamic pixel-wise loss | |
Saif et al. | Deepfake videos: synthesis and detection techniques–a survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180710 |