CN114511441A - Model training method, image stylizing method, device, electronic equipment and storage medium


Info

Publication number
CN114511441A
Authority
CN
China
Prior art keywords
image
stylized
training
sample set
generation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210101622.8A
Other languages
Chinese (zh)
Inventor
蒋剑斌 (Jiang Jianbin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202210101622.8A
Publication of CN114511441A
Legal status: Pending

Classifications

    • G06T3/04
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Abstract

The embodiment of the invention provides a method, a device, equipment and a medium for model training and image stylization. The method comprises the following steps: obtaining a first training sample set, where each group of training samples comprises an original image of a target object and a corresponding stylized image; performing the same motion simulation on the target object in the original image and in the stylized image to obtain a motion-simulated original image and a motion-simulated stylized image; generating an optical flow image according to the original image and the corresponding motion-simulated original image; and training a first stylized image generation model according to the training samples, the optical flow images and the motion-simulated stylized images. Motion simulation is used to fabricate adjacent video frames, and the optical flow image is introduced as a model input. Therefore, when the model generates stylized images, the motion change between the stylized images of adjacent frames stays consistent with the motion change between the adjacent frames themselves, unpredictable frame-to-frame flicker of the stylized images is avoided, and the stability of the stylized video is improved.

Description

Model training method, image stylizing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a model training method, an image stylizing method, a model training apparatus, an image stylizing apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, intelligent terminals have become widely popularized, and more and more applications are being developed to facilitate and enrich people's work and life. Currently, many applications are dedicated to providing intelligent terminal users with more personalized visual effects with better visual perception, such as filter effects, sticker effects, and deformation effects.
Changing the style of an image is a common visual effect: by changing attributes of the image such as color and texture, the image can be converted into another style.
Existing image stylization generation models only consider image-to-image generation and do not consider the temporal relationship between frames in a video; each frame is stylized independently, so the shape contours in the stylized images of adjacent frames may differ considerably. As a result, conventional image stylized generation models perform poorly on video, which manifests as unpredictable flicker between frames in the captured video.
Disclosure of Invention
Embodiments of the present invention provide a model training method, an image stylizing method, a model training apparatus, an image stylizing apparatus, an electronic device, and a storage medium, so as to solve the problem of unpredictable flicker between frames in a captured video.
In order to solve the above problem, in a first aspect of the present invention, there is provided a model training method, including:
acquiring a first training sample set, wherein each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image;
generating a corresponding optical flow image according to the original image and the corresponding original image after the motion simulation;
and training to obtain a first stylized image generation model according to the training sample, the optical flow image and the stylized image after the motion simulation.
Optionally, the motion simulation comprises at least one of rotation, translation, random deformation, and motion blur.
Optionally, the training, according to the training sample, the optical flow image and the stylized image after the motion simulation, to obtain a first stylized image generation model includes:
stitching the original image, the stylized image and the optical flow image along the channel dimension to obtain an input image;
forming a second training sample set according to the input image and the stylized image after the motion simulation, wherein each group of training samples in the second training sample set comprises one input image and one stylized image after the motion simulation;
and training to obtain the first stylized image generation model according to the second training sample set.
Optionally, the training to obtain the first stylized image generation model according to the second training sample set includes:
performing weighted summation of the image loss and the adversarial loss according to a first ratio of the image loss to the adversarial loss to obtain a first total loss, where the difference between the first ratio and 1 is smaller than a first preset value;
training the first stylized image generation model according to the second training sample set based on the first total loss until the first total loss converges;
after the first total loss converges, performing weighted summation of the image loss and the adversarial loss according to a second ratio of the image loss to the adversarial loss to obtain a second total loss, where the second ratio is greater than or equal to a second preset value;
and training the first stylized image generation model according to the second training sample set based on the second total loss until the second total loss converges.
In a second aspect of the present invention, there is also provided an image stylizing method, comprising:
acquiring video data;
inputting a previous frame image of the video data into a second stylized image generation model for stylized processing to obtain a corresponding previous frame stylized image, wherein the second stylized image generation model is obtained by training according to a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
generating a corresponding optical flow image according to a current frame image and a previous frame image of the video data;
inputting the previous frame image, the previous frame stylized image and the optical flow image into a first stylized image generation model for stylization to obtain a corresponding current frame stylized image, where the first stylized image generation model is obtained by: acquiring the first training sample set; for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the motion-simulated original image; and training according to the training samples, the optical flow images and the motion-simulated stylized images.
Optionally, the method further comprises:
forming a third training sample set according to the current frame image and the corresponding current frame stylized image, wherein each group of training samples in the third training sample set comprises one current frame image and one corresponding current frame stylized image;
training to obtain a third stylized image generation model according to the third training sample set;
acquiring target video data;
and inputting the frame image in the target video data into the third stylized image generation model for stylized processing to obtain a corresponding stylized image.
In a third aspect of the present invention, there is also provided a model training apparatus, including:
a sample set acquisition module, used for acquiring a first training sample set, where each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
a motion model module, used for performing, for each group of training samples, the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image;
the image generation module is used for generating a corresponding optical flow image according to the original image and the corresponding original image after the motion simulation;
and the model training module is used for training to obtain a first stylized image generation model according to the training sample, the optical flow image and the stylized image after the motion simulation.
Optionally, the motion simulation comprises at least one of rotation, translation, random deformation, and motion blur.
Optionally, the model training module comprises:
the image stitching submodule is used for stitching the original image, the stylized image and the optical flow image along the channel dimension to obtain an input image;
the sample set forming submodule is used for forming a second training sample set according to the input image and the stylized image after the motion simulation, and each group of training samples in the second training sample set comprises one input image and one stylized image after the motion simulation;
and the model training submodule is used for training to obtain the first stylized image generation model according to the second training sample set.
Optionally, the model training sub-module includes:
the first summing unit is used for performing weighted summation of the image loss and the adversarial loss according to a first ratio of the image loss to the adversarial loss to obtain a first total loss, where the difference between the first ratio and 1 is smaller than a first preset value;
a first training unit, configured to train the first stylized image generation model according to the second training sample set based on the first total loss until the first total loss converges;
the second summing unit is used for performing weighted summation of the image loss and the adversarial loss according to a second ratio of the image loss to the adversarial loss after the first total loss converges, to obtain a second total loss, where the second ratio is greater than or equal to a second preset value;
and the second training unit is used for training the first stylized image generation model according to the second training sample set based on the second total loss until the second total loss converges.
In a fourth aspect of the present invention, there is also provided an image stylizing apparatus comprising:
the data acquisition module is used for acquiring video data;
the first stylized processing module is used for inputting a previous frame of image of the video data into a second stylized image generation model for stylized processing to obtain a corresponding previous frame of stylized image, the second stylized image generation model is obtained by training according to a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
the image generation module is used for generating a corresponding optical flow image according to a current frame image and a previous frame image of the video data;
and the second stylized processing module is used for inputting the previous frame image, the previous frame stylized image and the optical flow image into a first stylized image generation model for stylization to obtain a corresponding current frame stylized image, where the first stylized image generation model is obtained by: acquiring the first training sample set; for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image; and training according to the training samples, the optical flow images and the motion-simulated stylized images.
Optionally, the apparatus further comprises:
a sample set forming module, configured to form a third training sample set according to the current frame image and the corresponding current frame stylized image, where each set of training samples in the third training sample set includes one current frame image and one corresponding current frame stylized image;
the model training module is used for training to obtain a third stylized image generation model according to the third training sample set;
the target video acquisition module is used for acquiring target video data;
and the third stylized processing module is used for inputting the frame image in the target video data into the third stylized image generation model for stylized processing to obtain a corresponding stylized image.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the above method steps when executing a program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the methods described above.
The embodiment of the invention provides a model training method, a device, an electronic device and a storage medium. A first training sample set is obtained, where each group of training samples comprises an original image containing a target object and a corresponding stylized image. For each group of training samples, the same motion simulation is performed on the target object in the original image and the target object in the corresponding stylized image, yielding a motion-simulated original image and a corresponding motion-simulated stylized image. A corresponding optical flow image is generated according to the original image and the motion-simulated original image, and a first stylized image generation model is trained according to the training samples, the optical flow images and the motion-simulated stylized images. In this way, motion simulation is used to fabricate adjacent video frames for model training, and the optical flow image between the original image and the motion-simulated original image is introduced as an input of the model, so that the first stylized image generation model can utilize the correlation between adjacent frames when generating stylized images. The motion change between the generated stylized images of adjacent frames therefore stays consistent with the motion change between the adjacent frames themselves, unpredictable frame-to-frame flicker of the stylized images is avoided, and the stability of the stylized video is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart illustrating the steps of one embodiment of a model training method of the present invention;
FIG. 2 is a flow chart illustrating the steps of one embodiment of a model training method of the present invention;
FIG. 3 is a flow chart illustrating the steps of one embodiment of a model training method of the present invention;
FIG. 4 is a flow chart illustrating the steps of one embodiment of an image stylization method of the present invention;
FIG. 5 is a block diagram illustrating the structure of an embodiment of a model training apparatus according to the present invention;
FIG. 6 is a block diagram illustrating the architecture of an embodiment of an image stylization apparatus of the present invention;
fig. 7 shows a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a model training method according to the present invention is shown, which may specifically include the following steps:
step 101, a first training sample set is obtained, and each set of training samples in the first training sample set includes an original image containing a target object and a corresponding stylized image.
In the embodiment of the present invention, the target object may be preset according to actual needs, and is not limited herein. For example, the target object may be at least one of a person, an animal, or a thing, or may be at least one of one or more parts of a person, an animal, or a thing.
In the embodiment of the present invention, stylization changes an image into another style by changing attributes of the image such as color and texture. Examples include a black-and-white line style, a color oil painting style, a cartoon style, a hand-drawn style, a 3D (three-dimensional) style, and the like, or any other suitable style; the embodiment of the present invention is not limited in this respect.
For example, human face stylization converts a human face image into a specific stylized human face image, such as a sketch portrait style, a cartoon (animation) style, or an oil painting style.
In an embodiment of the present invention, the first training sample set includes pairs of raw images and stylized images, and each pair of corresponding raw images and stylized images constitutes a set of training samples. Wherein each pair of corresponding original image and stylized image have the same image content. That is, the stylized image in each set of training samples may be stylized from the original image in the set of training samples.
In an optional embodiment of the present invention, the first training sample set may be generated as follows: perform first target object detection on the video frames in first video data to obtain a plurality of video frames containing a first target object; train a stylized image pair generation model using the video frames containing the first target object; and generate the first training sample set using the stylized image pair generation model, where each group of training samples in the first training sample set comprises an original image containing a second target object and a corresponding stylized image, and the shape of the second target object is related to the shape of the first target object.
The stylized image pair generation model is a machine learning model that, after training, can be used to generate image pairs, where an image pair refers to an original image and a corresponding stylized image. The trained stylized image pair generation model can output an original image and a corresponding stylized image as needed. This model is trained using stylized images, and the resulting model is denoted the stylized image pair generation model.
For example, the stylized image pair generation model may employ the network architecture of a style-based generative adversarial network (StyleGAN). A style-based generative adversarial network can be constructed based on various types of generative adversarial networks (GANs).
Take the example where the first target object is a human face and the first video data is the video data of a cartoon episode. For the video data of a given cartoon episode, a multimedia video processing tool (such as FFmpeg) is used to extract frames, for example 3 frames per second. A face detection model dedicated to cartoon faces is used to detect the faces and facial features in the video frames, yielding a plurality of video frames containing cartoon faces. The stylized image pair generation model is trained using the plurality of video frames containing cartoon faces as training samples.
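As a rough sketch of the frame-extraction step (the 3 fps rate follows the example above; the file names and directory layout are illustrative, not from the patent):

```python
import subprocess

def extract_frames(video_path: str, out_dir: str, fps: int = 3) -> None:
    """Cut a video into still frames with FFmpeg at the given rate."""
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",              # e.g. 3 frames per second
         f"{out_dir}/frame_%05d.png"],
        check=True,
    )

extract_frames("cartoon_episode.mp4", "frames")
```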
The stylized image pair generation model is then used to generate original images containing real human faces and corresponding stylized images. There is a correlation between the shape of a real face and the shape of a cartoon face. The training sample set consisting of the original images containing real faces and the corresponding stylized images is denoted the first training sample set.
In an alternative embodiment of the present invention, for the first training sample set generated in the above manner, the stylized image is affine transformed according to the second target object in the original image, so that the difference between the shape of the second target object in the stylized image and the shape of the second target object in the original image becomes small.
Affine transformation means that, in geometry, a vector space undergoes a linear transformation followed by a translation and is thereby mapped to another vector space, i.e., transformed from one two-dimensional coordinate system to another; it belongs to the class of linear transformations. For example, a warp algorithm is an algorithm that implements affine transformations.
Because the stylized image is generated by the stylized image pair generation model, there is a difference between the shape of the face in the stylized image and the shape of the face in the original image. A warp (warping) algorithm is used to deform the cartoon face in the stylized image according to the face shape of the real face in the original image, so that the difference between the shape of the face in the stylized image and the shape of the face in the original image is reduced, which lowers the training difficulty of the stylized image generation model.
In an alternative embodiment of the present invention, the varying shapes and angles of the first target object across the plurality of video frames obtained from the first video data make training the stylized image pair generation model difficult. Therefore, before training the stylized image pair generation model using the plurality of video frames containing the first target object as training samples, the method may further include: performing affine transformation on the plurality of video frames containing the first target object according to a reference first target object contained in a reference image, to obtain a plurality of video frames in which the first target object is aligned with the reference first target object.
The reference first target object contained in the reference image is a first target object whose shape and angle satisfy the reference requirements. Affine transformation is performed on the plurality of video frames containing the cartoon face, deforming each frame according to the face shape of the reference face in the reference image, to obtain a plurality of video frames with the cartoon face aligned to the reference face, which serve as training samples.
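A minimal sketch of this alignment step, assuming facial landmarks are already available from the face detection step (the landmark arrays and output size are illustrative, not specified by the patent):

```python
import cv2
import numpy as np

def align_to_reference(frame: np.ndarray,
                       face_pts: np.ndarray,   # (N, 2) float32 landmarks detected in the frame
                       ref_pts: np.ndarray,    # (N, 2) float32 landmarks of the reference face
                       out_size: tuple) -> np.ndarray:
    """Estimate an affine transform mapping the detected landmarks onto the
    reference landmarks, then warp the whole frame so the face is aligned."""
    M, _inliers = cv2.estimateAffinePartial2D(face_pts, ref_pts)
    return cv2.warpAffine(frame, M, out_size)  # out_size is (width, height)
```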
Step 102, for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image.
In the embodiment of the invention, motion simulation changes the target object in the image so as to simulate various motions of the target object. For example, the motion simulation may include changes such as rotation, translation, random deformation, and motion-generated blur, or any other suitable changes; the embodiments of the present invention are not limited in this respect.
In the embodiment of the present invention, for each group of training samples in the first training sample set, after performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image, one motion-simulated original image and one corresponding motion-simulated stylized image can be obtained.
In an embodiment of the invention, a plurality of motion simulations may be performed on a set of training samples to obtain a plurality of pairs of motion simulated raw images and motion simulated stylized images. The specific type of motion simulation performed on the training samples is not limited.
In an alternative embodiment of the invention, the motion simulation comprises at least one of rotation, translation, random deformation, and motion blur. Rotation rotates the target object by a target angle. Translation moves the target object by a target distance in a target direction. Random deformation deforms the target object randomly. Motion blur is the blurring produced by movement of the target object.
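A minimal sketch of one such motion simulation on the OpenCV/NumPy stack (the parameter ranges and blur kernel are illustrative); the essential point is that one randomly sampled transform is applied identically to the original image and its stylized counterpart:

```python
import cv2
import numpy as np

def simulate_motion(x: np.ndarray, y: np.ndarray, rng: np.random.Generator):
    """Apply the same random rotation, translation and optional motion blur
    to a paired original image x and stylized image y, so that <x1, y1>
    mimics the next video frame of <x, y>."""
    h, w = x.shape[:2]
    angle = rng.uniform(-10, 10)              # rotation angle in degrees
    tx, ty = rng.uniform(-8, 8, size=2)       # translation in pixels
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (tx, ty)                       # append the translation

    x1 = cv2.warpAffine(x, M, (w, h))
    y1 = cv2.warpAffine(y, M, (w, h))

    if rng.random() < 0.5:                    # occasionally add motion blur
        k = 5
        kernel = np.zeros((k, k), np.float32)
        kernel[k // 2, :] = 1.0 / k           # horizontal blur kernel
        x1 = cv2.filter2D(x1, -1, kernel)
        y1 = cv2.filter2D(y1, -1, kernel)
    return x1, y1
```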
Step 103, generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image.
In the embodiment of the invention, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. It is a method that uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby compute the motion information of the object between adjacent frames. Because the pixel coordinates of corresponding pixels differ between the original image and the corresponding motion-simulated original image, the displacement relationship between the pixels at corresponding positions in the two images is found through registration, yielding the optical flow image. The optical flow image can be computed from the registered image pair by an optical flow algorithm; common optical flow algorithms include the Horn-Schunck algorithm, the Lucas-Kanade algorithm, and the like, and the algorithm can be selected according to actual needs.
In the embodiment of the present invention, when performing the registration operation on the original image and the corresponding motion-simulated original image, the pixels of the original image are first mapped, by means of coordinate mapping, into a first coordinate system (e.g., a wide-angle coordinate system) of the motion-simulated original image, so that the pixels of the motion-simulated original image and the pixels of the original image lie in the same coordinate system (i.e., the first coordinate system).
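As a sketch of this step with a dense optical flow method (OpenCV's Farnebäck implementation is used here as a stand-in; the patent names Horn-Schunck and Lucas-Kanade, and any suitable algorithm could be substituted):

```python
import cv2
import numpy as np

def optical_flow_image(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Compute a dense per-pixel displacement field of shape (H, W, 2)
    between an original image and its motion-simulated counterpart."""
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # two channels per pixel: the displacements dx and dy
```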
Step 104, training to obtain a first stylized image generation model according to the training samples, the optical flow image and the motion-simulated stylized image.
In an embodiment of the present invention, the first stylized image generation model is a machine learning model that, after training, can be used to stylize an original image: given an input, the trained first stylized image generation model can output a corresponding stylized image. This model needs to be trained using paired images, and the resulting model is denoted the first stylized image generation model.
For example, the first stylized image generation model may employ the network architecture of an adversarial generation network, which may be constructed based on various types of generative adversarial networks (GANs). The main structure of a GAN includes a generator G (Generator) and a discriminator D (Discriminator). The generator G stylizes the original image in the training sample and outputs a generated image; the discriminator D judges the authenticity of the stylized image and the generated image in the training sample, i.e., whether the stylized image is real (Real) or fake (Fake), and likewise whether the generated image is real or fake.
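A deliberately tiny sketch of such a generator/discriminator pair (all layer sizes are illustrative; the 8-channel generator input anticipates the channel-stitched original image + stylized image + 2-channel flow described below):

```python
import torch.nn as nn

# Generator G: maps the 8-channel stitched input to a 3-channel stylized frame.
generator = nn.Sequential(
    nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
)

# Discriminator D: judges a 3-channel image as real or fake (one logit per patch).
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, padding=1),
)
```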
In the embodiment of the invention, it is considered that the current frame in a video may have moved relative to the previous frame; if each frame is simply stylized on its own, the shape contours in the stylized images of adjacent frames may differ greatly, so a video composed of the stylized images of consecutive frames suffers from unpredictable flicker. To eliminate this problem, the invention proposes that, when the first stylized image generation model generates the stylized image of the current frame, the original image of the previous frame, the stylized image of the previous frame, and an optical flow image representing the motion information between the previous frame and the current frame are used as the basis of generation. Considering that the previous frame corresponds to an original image and a corresponding stylized image, while the current frame corresponds to a motion-simulated original image and a corresponding stylized image, during training the first stylized image generation model needs to be trained with the original image and corresponding stylized image in the training sample, the optical flow image, and the motion-simulated stylized image, so that the model does not stylize each frame in isolation but stylizes while taking adjacent images and the motion information between them into account.
In one implementation, the original image and corresponding stylized image in the training sample, together with the optical flow image, are combined into one input image, while the motion-simulated stylized image serves as the other image; the two make up the image pair input to the model. The manner in which the original image, the corresponding stylized image and the optical flow image are converted into one input image is not limited.
According to the embodiment of the invention, a first training sample set is obtained, where each group of training samples comprises an original image containing a target object and a corresponding stylized image. For each group of training samples, the same motion simulation is performed on the target object in the original image and the target object in the corresponding stylized image, yielding a motion-simulated original image and a corresponding motion-simulated stylized image. A corresponding optical flow image is generated according to the original image and the motion-simulated original image, and the first stylized image generation model is trained according to the training samples, the optical flow images and the motion-simulated stylized images. Motion simulation is thus used to fabricate adjacent video frames for model training, and the optical flow image between the original image and the motion-simulated original image is introduced as an input of the model. As a result, the first stylized image generation model can utilize the correlation between adjacent frames when generating stylized images, the motion change between the generated stylized images of adjacent frames stays consistent with the motion change between the adjacent frames themselves, unpredictable frame-to-frame flicker of the stylized images is avoided, and the stability of the stylized video is improved.
In an alternative embodiment of the present invention, as shown in fig. 2, the step 104 includes:
step 1041, performing image stitching on the original image, the stylized image and the optical flow image in a channel dimension to obtain an input image.
Step 1042, forming a second training sample set according to the input images and the motion-simulated stylized images, where each group of training samples in the second training sample set comprises one input image and one corresponding motion-simulated stylized image.
Step 1043, training to obtain the first stylized image generation model according to the second training sample set.
Image stitching is a method of combining multiple images into a larger image whose output is the union of the multiple input images. The original image, the stylized image and the optical flow image are stitched along the channel dimension to obtain an image, which is denoted the input image. For each group of training samples and the corresponding optical flow image, a corresponding input image is obtained in the same stitching manner. When stitching, the order of the original image, the stylized image and the optical flow image must be fixed, i.e., the same stitching order must be used when training the model and when using it to generate stylized images.
The second training sample set comprises pairs of input images and motion-simulated stylized images; each pair of corresponding input image and motion-simulated stylized image forms a group of training samples. The first stylized image generation model is trained using the multiple groups of training samples in the second training sample set, yielding the trained first stylized image generation model.
For example, the training samples of the first training sample set are input/output image pairs <x, y>, where x denotes the original image and y the stylized image. The same motion simulation is applied to each pair <x, y>, producing the motion-simulated image pair <x1, y1>. From x and the corresponding x1, an optical flow image x2 is generated. Then x, y and x2 are stitched along the channel dimension; the stitched tensor is used as the input and denoted X0, y1 is used as the output, and the training samples of the constructed second training sample set are the pairs <X0, y1>.
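A minimal sketch of this construction in PyTorch (the image resolution is illustrative, and the 2-channel flow assumes a dense dx/dy field as in the earlier sketch):

```python
import torch

# Illustrative shapes: 3-channel images and a 2-channel optical flow field.
x  = torch.randn(1, 3, 256, 256)   # original image
y  = torch.randn(1, 3, 256, 256)   # stylized image
x2 = torch.randn(1, 2, 256, 256)   # optical flow between x and x1 (dx, dy)

# Fixed stitching order along the channel dimension: (x, y, x2).
X0 = torch.cat([x, y, x2], dim=1)  # -> shape (1, 8, 256, 256), the model input
# The training pair of the second sample set is then <X0, y1>.
```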
In an alternative embodiment of the present invention, as shown in fig. 3, the step 1043 includes:
step 10431, performing weighted summation on the image loss and the countermeasure loss according to a first ratio of the image loss and the countermeasure loss to obtain a first total loss, where a difference between the first ratio and 1 is smaller than a first preset value.
Step 10432, training the first stylized image generation model according to the second training sample set based on the first total loss until the first total loss converges.
Step 10433, after the convergence of the first total loss, performing weighted summation on the image loss and the countermeasure loss according to a second ratio of the image loss and the countermeasure loss to obtain a second total loss, where the second ratio is greater than or equal to a second preset value.
Step 10434, based on the second total loss, training the first stylized image generation model according to the second training sample set until the second total loss converges.
In an embodiment of the invention, the first stylized image generation model employs the network architecture of an adversarial generation network.
During training, the generated image in each group of training samples may differ from the corresponding real stylized image; for each such pair, the same pixels in the two images are compared one by one, the difference value at each pixel is determined, and the image loss between the two images is then determined from these per-pixel differences. For example, the L1 loss uses the absolute error as the distance.
The adversarial losses may include a real-sample loss corresponding to the motion-simulated stylized images, a fake-sample-real loss corresponding to the generated images, and a fake-sample-fake loss corresponding to the generated images.
In the embodiment of the present disclosure, the discriminator network should judge all motion-simulated stylized images as real samples (that is, with a real probability of 1). In the actual training process, however, the probability with which the discriminator judges each motion-simulated stylized image as real may not be 1; an adversarial loss can then be determined based on this judgment of the real/fake probability of the motion-simulated stylized images.
Likewise, the discriminator network should judge all generated images as fake samples (that is, with a real probability of 0), but in the actual training process the probability with which each generated image is judged real by the discriminator may not be 0; another adversarial loss can then be determined based on this judgment of the real/fake probability of the generated images, which is defined in the embodiment of the present disclosure as the fake-sample-real loss corresponding to the generated images.
Because the generator needs to reduce the difference between the generated samples (generated images) and the real samples (motion-simulated stylized images) as much as possible, that is, the generator tries to make the discriminator err and judge the generated images as real samples, yet another adversarial loss can be determined based on the discriminator's judgment of the real/fake probability of the generated images as driven by the generator. This is defined in the embodiment of the present disclosure as the fake-sample-fake loss corresponding to the generated images.
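A minimal PyTorch sketch of these loss terms, assuming a standard binary cross-entropy GAN formulation and an L1 image loss (the patent does not fix the exact loss functions):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Real-sample loss (target 1 on motion-simulated stylized images) plus
    fake-sample-real loss (target 0 on generated images)."""
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Fake-sample-fake loss: the generator pushes the discriminator to judge
    the generated images as real (target 1)."""
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

def image_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 image loss: absolute error between the generated image and the
    motion-simulated stylized image."""
    return F.l1_loss(generated, target)
```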
Considering that the image loss and the adversarial loss corresponding to each group of training samples contribute to network optimization to different degrees, in the embodiment of the present disclosure a ratio between the image loss and the adversarial loss is set in order to adjust the importance of each loss.
According to the second training sample set, the process of training the first stylized image generation model can be performed in two stages.
In the first stage, the ratio between the image loss and the adversarial loss is set to a first ratio, and the difference between the first ratio and 1 is smaller than a first preset value. The first preset value is not limited and is set according to actual needs, so that the first ratio is close to 1, i.e., the image loss and the adversarial loss are of roughly equal importance; for example, the first ratio is set to 1. The image loss and the adversarial loss are weighted and summed according to this first ratio, and the resulting total loss is denoted the first total loss.
Then, in the training process of the first stage, the model parameters of the first stylized image generation model are adjusted according to the first total loss corresponding to each group of training samples in the second training sample set, optimizing the first stylized image generation model; after adjustment over the multiple groups of training samples, the first total loss converges, completing the first stage of training the first stylized image generation model.
In the second stage, the ratio between the image loss and the adversarial loss is set to a second ratio, the second ratio being greater than or equal to a second preset value. The second preset value is not limited and is set according to actual needs, so that the second ratio is large, i.e., the image loss is given higher importance than the adversarial loss; for example, the second ratio is set to 200. The image loss and the adversarial loss are weighted and summed according to this second ratio, and the resulting total loss is denoted the second total loss.
Then, in the training process of the second stage, the model parameters of the first stylized image generation model are adjusted according to the second total loss corresponding to each group of training samples in the second training sample set, optimizing the first stylized image generation model; after adjustment over the multiple groups of training samples, the second total loss converges, completing the second stage of training the first stylized image generation model.
Through a large number of experiments, the inventor of the present disclosure found that setting the first ratio of image loss to adversarial loss close to 1, so that the two losses have similar importance, and optimizing the first stylized image generation model according to the first total loss benefits the stylization effect of the model; subsequently setting a larger second ratio, so that the image loss has greater importance than the adversarial loss, and continuing to optimize according to the second total loss further improves the fidelity of the generated images.
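A compact sketch of this two-stage weighting schedule, reusing the loss helpers from the previous sketch (the ratios 1 and 200 follow the examples above; the optimizer, learning rate and fixed step count standing in for the convergence test are illustrative):

```python
import torch

def total_loss(img_loss: torch.Tensor, adv_loss: torch.Tensor,
               ratio: float) -> torch.Tensor:
    """Weighted sum with image-loss-to-adversarial-loss ratio `ratio`."""
    return ratio * img_loss + adv_loss

def train_two_stages(generator, discriminator, loader, steps_per_stage=10_000):
    """Stage 1 uses ratio 1; stage 2 raises the ratio to 200. The loader is
    assumed to yield <X0, y1> pairs from the second training sample set."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    for ratio in (1.0, 200.0):
        for _step, (X0, y1) in zip(range(steps_per_stage), loader):
            fake = generator(X0)
            loss = total_loss(image_loss(fake, y1),
                              generator_adv_loss(discriminator(fake)),
                              ratio)
            opt_g.zero_grad()
            loss.backward()
            opt_g.step()
            # The discriminator update with discriminator_loss is omitted
            # for brevity, as is a real convergence check per stage.
```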
Referring to fig. 4, a flowchart illustrating steps of an embodiment of an image stylization method according to the present invention is shown, which may specifically include the following steps:
in step 201, video data is acquired.
In the embodiment of the present invention, the video data may be video data obtained by the electronic device shooting a video in real time, or video data of a video whose shooting has been completed. For example, the video data may be video data of a video stored locally by the electronic device, a video transmitted by another electronic device, or a video on the Internet.
Step 202, inputting a previous frame image of the video data into a second stylized image generation model for stylized processing, so as to obtain a corresponding previous frame stylized image, wherein the second stylized image generation model is obtained by training according to a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image.
In an embodiment of the present invention, the second stylized image generation model is a machine learning model that, after training, can be used to stylize an original image: given an image, the trained second stylized image generation model can output a corresponding stylized image. This model is trained using the first training sample set, and the resulting model is denoted the second stylized image generation model. For example, the second stylized image generation model may employ the network architecture of an adversarial generation network.
In an embodiment of the present invention, the video data is composed of multiple frames of images. When stylizing the video data, the frames may be processed one by one, or one frame may be processed every several frames; this is not limited in the embodiment of the present invention. The current frame image and the previous frame image may be adjacent frames or non-adjacent frames. For example, the current frame image and the previous frame image may be two frames separated by one frame.
In the embodiment of the invention, the previous frame image of the video data is input into the second stylized image generation model for stylized processing, and the generated stylized image is marked as the corresponding previous frame stylized image.
Step 203, generating a corresponding optical flow image according to the current frame image and the previous frame image of the video data.
In the embodiment of the invention, the corresponding optical flow image is obtained by searching the displacement relation of the pixels at the corresponding positions in the previous frame image and the current frame image through registration.
Step 204, inputting the previous frame image, the previous frame stylized image and the optical flow image into a first stylized image generation model for stylization, so as to obtain a corresponding current frame stylized image, where the first stylized image generation model is obtained by: acquiring the first training sample set; for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image; and training according to the training samples, the optical flow images and the motion-simulated stylized images.
In the embodiment of the present invention, the previous frame image, the previous frame stylized image obtained in step 202, and the corresponding optical flow image generated in step 203 are input into the trained first stylized image generation model. The first stylized image generation model performs the stylization, and the generated stylized image is denoted the corresponding current frame stylized image.
In the embodiment of the invention, the previous frame image, the previous frame stylized image and the corresponding optical flow image are stitched along the channel dimension to obtain the corresponding input to the first stylized image generation model.
For example, the video data is first decoded into a picture sequence. The first frame image Xin1 is passed through the second stylized image generation model to generate the first frame stylized image Yout1. A corresponding optical flow image Xflow is generated from the second frame image Xin2 and Xin1; Xin1, Yout1 and Xflow are then stitched along the channel dimension to form the input to the trained first stylized image generation model, which outputs the second frame stylized image Yout2, and so on until the stylized images of all frames of the video are obtained.
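A sketch of this inference loop (optical_flow_image follows the earlier sketch; to_tensor is a hypothetical helper that converts an HWC array to a normalized 1xCxHxW tensor, and reading frames with OpenCV is an assumption):

```python
import cv2
import torch

def stylize_video(path: str, model1, model2):
    """Frame-by-frame stylization: the first frame goes through the second
    (single-image) model; each later frame is generated by the first model
    from <previous frame, previous stylized frame, optical flow>."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    prev_styl = model2(to_tensor(prev))           # Yout1 from Xin1
    outputs = [prev_styl]
    while True:
        ok, curr = cap.read()
        if not ok:
            break
        flow = optical_flow_image(prev, curr)     # Xflow from adjacent frames
        X0 = torch.cat([to_tensor(prev), prev_styl,
                        to_tensor(flow)], dim=1)  # fixed stitching order, 8 channels
        prev_styl = model1(X0)                    # Yout for the current frame
        outputs.append(prev_styl)
        prev = curr
    cap.release()
    return outputs
```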
In the embodiment of the present invention, the other frame images except for the first frame image in the video data may be stylized in the above manner.
In an optional embodiment of the present invention, the method further includes: forming a third training sample set according to the current frame images and the corresponding current frame stylized images, where each group of training samples in the third training sample set comprises one current frame image and one corresponding current frame stylized image; training to obtain a third stylized image generation model according to the third training sample set; acquiring target video data; and inputting the frame images in the target video data into the third stylized image generation model for stylization to obtain corresponding stylized images.
Because the stylization process of the first stylized image generation model is complex, it affects the real-time shooting speed of electronic devices with limited computing power, such as mobile terminals. Therefore, the embodiment of the present invention proposes to use the first stylized image generation model to stylize a large amount of video data and to form a third training sample set from the current frame images and the corresponding current frame stylized images, where each group of training samples comprises a current frame image and a current frame stylized image. The third training sample set can be used as stable video training data.
The third stylized image generation model is a machine learning model that, after training, can be used to stylize an original image: given an image, the trained third stylized image generation model can output a corresponding stylized image. This model is trained using the third training sample set, and the resulting model is denoted the third stylized image generation model. For example, the third stylized image generation model may employ the network architecture of an adversarial generation network, and it may be a lightweight model.
The target video data may be video data obtained by the electronic device shooting a video in real time, or video data of a video whose shooting has been completed. For example, the target video data may be video data of a video stored locally by the electronic device, a video transmitted by another electronic device, or a video on the Internet.
The target video data is composed of multiple frame images. When stylizing the target video data, the frames may be processed one by one, or one frame may be processed every several frames; this is not limited in the embodiment of the present invention.
The frame images of the target video data are input into the trained third stylized image generation model for stylization, generating the corresponding stylized images.
Because the third training sample set is stable video training data, the third stylized image generation model likewise avoids unpredictable flicker in the stylized images between frames and improves the stability of the stylized video. In addition, the stylization process of the third stylized image generation model can be simpler than that of the first stylized image generation model, so the third stylized image generation model can meet the real-time shooting requirements of electronic devices with limited computing power, such as mobile terminals.
According to the embodiment of the invention, video data is acquired, and a previous frame image of the video data is input into a second stylized image generation model for stylization to obtain a corresponding previous frame stylized image; the second stylized image generation model is trained according to a first training sample set, where each group of training samples comprises an original image containing a target object and a corresponding stylized image. A corresponding optical flow image is generated according to the current frame image and the previous frame image of the video data, and the previous frame image, the previous frame stylized image and the optical flow image are input into a first stylized image generation model for stylization to obtain a corresponding current frame stylized image. The first stylized image generation model is obtained by: acquiring the first training sample set; for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image; and training according to the training samples, the optical flow images and the motion-simulated stylized images. Motion simulation is thus used to fabricate adjacent video frames for model training, and the optical flow image between the original image and the motion-simulated original image is introduced as an input of the model, so that the first stylized image generation model can utilize the correlation between adjacent frames when generating stylized images; the motion change between the stylized images of adjacent frames stays consistent with the motion change between the adjacent frames themselves, unpredictable frame-to-frame flicker of the stylized images is avoided, and the stability of the stylized video is improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art will recognize that the embodiments of the present invention are not limited by the described order of actions, as some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to fig. 5, a block diagram of a model training apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a sample set obtaining module 301, configured to obtain a first training sample set, where each training sample in the first training sample set includes an original image containing a target object and a corresponding stylized image;
a motion model module 302, configured to perform the same motion simulation on the target object in the original image and the target object in the corresponding stylized image for each group of training samples, to obtain one motion-simulated original image and one corresponding motion-simulated stylized image;
an image generating module 303, configured to generate a corresponding optical flow image according to the original image and a corresponding motion-simulated original image;
and the model training module 304 is configured to train to obtain a first stylized image generation model according to the training sample, the optical flow image and the stylized image after the motion simulation.
In an optional embodiment of the invention, the motion simulation comprises at least one of rotation, translation, random deformation, and motion blur.
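To make the motion simulation and optical flow generation concrete, the following is a minimal sketch assuming OpenCV and NumPy; the function names, the rotation-and-translation-only simulation, and the Farneback flow estimator are illustrative choices, not details fixed by this embodiment.

```python
import cv2
import numpy as np

def simulate_motion(original, stylized, max_angle=5.0, max_shift=10.0):
    """Apply one identical random rotation + translation to both images,
    so the pair stays pixel-aligned (a stand-in for an adjacent frame)."""
    h, w = original.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    m[:, 2] += (tx, ty)  # fold the translation into the affine matrix
    moved_original = cv2.warpAffine(original, m, (w, h))
    moved_stylized = cv2.warpAffine(stylized, m, (w, h))
    return moved_original, moved_stylized

def optical_flow(prev_img, next_img):
    """Dense optical flow between an image and its motion-simulated
    counterpart (Farneback is one possible estimator)."""
    prev = cv2.cvtColor(prev_img, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_img, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

Because the same affine transform is applied to both images, the motion-simulated stylized image is a pixel-exact "next frame" target for the motion-simulated original image.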
In an optional embodiment of the invention, the model training module comprises:
the image splicing submodule is used for performing image splicing on the original image, the stylized image and the optical flow image in the channel dimension to obtain an input image (a concatenation sketch follows this list);
the sample set forming submodule is used for forming a second training sample set according to the input image and the stylized image after the motion simulation, and each group of training samples in the second training sample set comprises one input image and one stylized image after the motion simulation;
and the model training submodule is used for training to obtain the first stylized image generation model according to the second training sample set.
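As referenced in the image splicing submodule above, the channel-dimension splicing can be sketched as follows, assuming PyTorch tensors in (N, C, H, W) layout; the 3 + 3 + 2 channel split assumes RGB images and a two-channel flow field, which this embodiment does not fix.

```python
import torch

def make_model_input(original, stylized, flow):
    # original: (N, 3, H, W), stylized: (N, 3, H, W), flow: (N, 2, H, W)
    # -> one (N, 8, H, W) input image for the first stylized image
    #    generation model
    return torch.cat([original, stylized, flow], dim=1)
```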
In an optional embodiment of the invention, the model training submodule comprises:
the first summing unit is used for performing weighted summation on the image loss and the adversarial loss according to a first ratio of the image loss to the adversarial loss to obtain a first total loss, wherein the difference between the first ratio and 1 is smaller than a first preset value;
a first training unit, configured to train the first stylized image generation model according to the second training sample set based on the first total loss until the first total loss converges;
the second summing unit is used for performing weighted summation on the image loss and the adversarial loss according to a second ratio of the image loss to the adversarial loss after the first total loss converges to obtain a second total loss, wherein the second ratio is greater than or equal to a second preset value;
and the second training unit is used for training the first stylized image generation model according to the second training sample set based on the second total loss until the second total loss converges.
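The two-stage schedule above can be sketched as follows, assuming a PyTorch generator/discriminator setup with an L1 image loss and a binary cross-entropy adversarial loss; the concrete ratio values are illustrative hyperparameters, not values fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(generated, target, disc_logits, ratio):
    """Weighted sum of image loss and adversarial loss; `ratio` is the
    weight of the image loss relative to the adversarial loss."""
    image_loss = F.l1_loss(generated, target)
    # the generator wants the discriminator to score its output as real
    adversarial_loss = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return ratio * image_loss + adversarial_loss

# Stage 1: the first ratio differs from 1 by less than a small preset
# value, so both losses carry comparable weight; train to convergence.
stage1_ratio = 1.0
# Stage 2: the second ratio is at or above a larger preset value, so the
# image loss dominates; continue training to convergence again.
stage2_ratio = 10.0
```

One plausible reading of this schedule is that the balanced first stage lets the adversarial term establish the style, while the image-loss-dominated second stage tightens pixel-level fidelity to the motion-simulated targets.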
According to the embodiment of the invention, a first training sample set is obtained, where each set of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image. For each set of training samples, the same motion simulation is performed on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; a corresponding optical flow image is generated according to the original image and the corresponding motion-simulated original image; and a first stylized image generation model is trained according to the training samples, the optical flow images and the motion-simulated stylized images. In this way, motion simulation is used to manufacture adjacent frames in a video for model training, and the optical flow image between the original image and the motion-simulated original image is introduced as an input of the model. The first stylized image generation model can therefore utilize the correlation between adjacent frames when generating stylized images, the motion change between the generated stylized images of adjacent frames is kept consistent with the motion change between the adjacent frames, unpredictable inter-frame flicker in the stylized images is avoided, and the stability of the stylized video is improved.
Referring to fig. 6, a block diagram of an embodiment of an image stylizing apparatus according to the present invention is shown, which may specifically include the following modules:
a data obtaining module 401, configured to obtain video data;
a first stylized processing module 402, configured to input a previous frame image of the video data into a second stylized image generation model for stylized processing to obtain a corresponding previous frame stylized image, where the second stylized image generation model is trained on a first training sample set, and each set of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
an image generating module 403, configured to generate a corresponding optical flow image according to a current frame image and a previous frame image of the video data;
a second stylized processing module 404, configured to input the previous frame image, the previous frame stylized image, and the optical flow image into a first stylized image generation model for stylized processing to obtain a corresponding current frame stylized image, where the first stylized image generation model is obtained by: acquiring the first training sample set; for each set of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain one motion-simulated original image and one corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image; and training according to the training samples, the optical flow images, and the motion-simulated stylized images.
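Taken together, the two stylizing modules can be sketched as the following frame-by-frame inference loop; it reuses the optical_flow and make_model_input helpers from the earlier sketches, and the tensor-conversion helpers and model call signatures are assumptions for illustration.

```python
import torch

def to_tensor(img):
    # HxWx3 uint8 BGR image -> (1, 3, H, W) float tensor in [0, 1]
    return torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 255.0

def flow_to_tensor(flow):
    # HxWx2 float32 flow field -> (1, 2, H, W) float tensor
    return torch.from_numpy(flow).permute(2, 0, 1).float().unsqueeze(0)

def stylize_video(frames, first_model, second_model):
    """frames: list of HxWx3 BGR images; returns one stylized tensor per
    frame. The second model stylizes the first frame; every later frame
    is produced by the first model from (previous frame, previous
    stylized frame, optical flow)."""
    outputs = []
    prev = frames[0]
    prev_stylized = second_model(to_tensor(prev))
    outputs.append(prev_stylized)
    for cur in frames[1:]:
        flow = optical_flow(prev, cur)  # flow between adjacent frames
        x = make_model_input(to_tensor(prev), prev_stylized,
                             flow_to_tensor(flow))
        cur_stylized = first_model(x)
        outputs.append(cur_stylized)
        prev, prev_stylized = cur, cur_stylized
    return outputs
```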
In an optional embodiment of the invention, the apparatus further comprises:
a sample set forming module, configured to form a third training sample set according to the current frame image and the corresponding current frame stylized image, where each set of training samples in the third training sample set includes one current frame image and one corresponding current frame stylized image;
the model training module is used for training to obtain a third stylized image generation model according to the third training sample set;
the target video acquisition module is used for acquiring target video data;
and the third stylized processing module is used for inputting the frame image in the target video data into the third stylized image generation model for stylized processing to obtain a corresponding stylized image.
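A minimal sketch of how the third stylized image generation model could be trained on such (current frame, current frame stylized image) pairs, assuming PyTorch; the training loop, loss, and hyperparameters are illustrative, not details fixed by this embodiment.

```python
import torch
import torch.nn.functional as F

def train_third_model(third_model, third_sample_set, epochs=10, lr=1e-4):
    """third_sample_set: iterable of (frame, stylized) tensor pairs
    produced by the first and second models. The third model learns the
    direct frame -> stylized-frame mapping."""
    opt = torch.optim.Adam(third_model.parameters(), lr=lr)
    for _ in range(epochs):
        for frame, stylized in third_sample_set:
            loss = F.l1_loss(third_model(frame), stylized)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because the distilled model maps a single frame directly to its stylized counterpart, it needs neither the optical flow computation nor the previous stylized frame at inference time, which is what makes it suitable for real-time use on mobile terminals.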
According to the embodiment of the invention, video data is acquired, and a previous frame image of the video data is input into the second stylized image generation model for stylization to obtain a corresponding previous frame stylized image; the second stylized image generation model is trained on the first training sample set, where each set of training samples comprises an original image containing a target object and a corresponding stylized image. A corresponding optical flow image is generated according to a current frame image and the previous frame image of the video data, and the previous frame image, the previous frame stylized image and the optical flow image are input into the first stylized image generation model for stylization to obtain a corresponding current frame stylized image. Because the first stylized image generation model is trained on motion-simulated adjacent frames with the optical flow image between them as an additional input, it can use the correlation between adjacent frames when generating stylized images, the motion change between the stylized images of adjacent frames is kept consistent with the motion change between the adjacent frames, unpredictable inter-frame flicker in the stylized images is avoided, and the stability of the stylized video is improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, which includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 communicate with each other through the communication bus 904;
a memory 903 for storing computer programs;
the processor 901 is configured to implement the steps described in any of the foregoing method embodiments when executing the program stored in the memory 903.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is further provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the model training method described in any one of the above embodiments.
In yet another embodiment of the present invention, there is further provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the model training method described in any one of the above embodiments.
In yet another embodiment of the present invention, there is further provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the image stylization method described in any one of the above embodiments.
In yet another embodiment of the present invention, there is further provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the image stylization method described in any one of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, and so on). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of model training, comprising:
acquiring a first training sample set, wherein each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
aiming at each group of training samples, carrying out the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain one motion-simulated original image and one corresponding motion-simulated stylized image;
generating a corresponding optical flow image according to the original image and the corresponding original image after the motion simulation;
and training to obtain a first stylized image generation model according to the training sample, the optical flow image and the stylized image after the motion simulation.
2. The method of claim 1, wherein the motion simulation comprises at least one of rotation, translation, random deformation, and motion blur.
3. The method of claim 1, wherein training a first stylized image generation model based on the training samples, optical flow images, and motion-simulated stylized images comprises:
performing image splicing on the original image, the stylized image and the optical flow image on a channel dimension to obtain an input image;
forming a second training sample set according to the input image and the stylized image after the motion simulation, wherein each group of training samples in the second training sample set comprises one input image and one stylized image after the motion simulation;
and training to obtain the first stylized image generation model according to the second training sample set.
4. The method of claim 3, wherein the training the first stylized image generation model based on the second set of training samples comprises:
performing weighted summation on the image loss and the adversarial loss according to a first ratio of the image loss to the adversarial loss to obtain a first total loss, wherein the difference between the first ratio and 1 is smaller than a first preset value;
training the first stylized image generation model according to the second training sample set based on the first total loss until the first total loss converges;
after the first total loss converges, performing weighted summation on the image loss and the adversarial loss according to a second ratio of the image loss to the adversarial loss to obtain a second total loss, wherein the second ratio is greater than or equal to a second preset value;
and training the first stylized image generation model according to the second training sample set based on the second total loss until the second total loss converges.
5. An image stylization method, comprising:
acquiring video data;
inputting a previous frame image of the video data into a second stylized image generation model for stylized processing to obtain a corresponding previous frame stylized image, wherein the second stylized image generation model is obtained by training according to a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
generating a corresponding optical flow image according to a current frame image and a previous frame image of the video data;
inputting the previous frame image, the previous frame stylized image and the optical flow image into a first stylized image generation model for stylization to obtain a corresponding current frame stylized image, wherein the first stylized image generation model is obtained by: acquiring the first training sample set; for each group of training samples, performing the same motion simulation on the target object in the original image and the target object in the corresponding stylized image to obtain a motion-simulated original image and a corresponding motion-simulated stylized image; generating a corresponding optical flow image according to the original image and the corresponding motion-simulated original image; and training according to the training samples, the optical flow images and the motion-simulated stylized images.
6. The method of claim 5, further comprising:
forming a third training sample set according to the current frame image and the corresponding current frame stylized image, wherein each group of training samples in the third training sample set comprises one current frame image and one corresponding current frame stylized image;
training according to the third training sample set to obtain a third stylized image generation model;
acquiring target video data;
and inputting the frame image in the target video data into the third stylized image generation model for stylized processing to obtain a corresponding stylized image.
7. A model training apparatus, comprising:
the system comprises a sample set acquisition module, a data processing module and a data processing module, wherein the sample set acquisition module is used for acquiring a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
the motion model module is used for carrying out the same motion simulation on the target object in the original image and the target object in the corresponding stylized image aiming at each group of training samples to obtain one motion-simulated original image and one corresponding motion-simulated stylized image;
the image generation module is used for generating a corresponding optical flow image according to the original image and the corresponding original image after the motion simulation;
and the model training module is used for training to obtain a first stylized image generation model according to the training sample, the optical flow image and the stylized image after the motion simulation.
8. An image stylizing apparatus, comprising:
the data acquisition module is used for acquiring video data;
the first stylized processing module is used for inputting a previous frame of image of the video data into a second stylized image generation model for stylized processing to obtain a corresponding previous frame of stylized image, the second stylized image generation model is obtained by training according to a first training sample set, and each group of training samples in the first training sample set comprises an original image containing a target object and a corresponding stylized image;
the image generation module is used for generating a corresponding optical flow image according to a current frame image and a previous frame image of the video data;
and the second stylized processing module is used for inputting the previous frame image, the previous frame stylized image and the optical flow image into a first stylized image generation model for stylized processing to obtain a corresponding current frame stylized image, wherein the first stylized image generation model is obtained by acquiring the first training sample set, performing the same motion simulation on a target object in the original image and a target object in the corresponding stylized image aiming at each group of training samples to obtain an original image after the motion simulation and a corresponding stylized image after the motion simulation, generating a corresponding optical flow image according to the original image and the corresponding original image after the motion simulation, and training the optical flow image and the stylized image after the motion simulation according to the training samples.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.