CN115171023A - Style migration model training method, video processing method and related device - Google Patents

Style migration model training method, video processing method and related device

Info

Publication number
CN115171023A
CN115171023A CN202210862700.6A CN202210862700A
Authority
CN
China
Prior art keywords
style
model
image
migration
target
Prior art date
Legal status
Pending
Application number
CN202210862700.6A
Other languages
Chinese (zh)
Inventor
孔耀祖
Current Assignee
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202210862700.6A
Publication of CN115171023A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a style migration model training method, a video processing method and a related device, wherein the method comprises the following steps: acquiring a training sample set, constructing a generative adversarial model, and training the generative adversarial model with the training sample set to obtain a style migration model corresponding to target style characteristics.

Description

Style migration model training method, video processing method and related device
Technical Field
The invention relates to the technical field of video processing, in particular to a style migration model training method, a video processing method and a related device.
Background
Image style migration is a technique for transferring the picture style of a reference image to an original image; the process preserves the main content structure of the original image while giving it the picture style of the reference image. Video style migration applies picture style migration at the video level and requires higher stability and accuracy than image style migration.
Existing image style migration techniques generally use methods such as deep learning to extract multiple feature layers from the image to be migrated, separate content features from style features, and finally mix the content features and style features of different images to achieve style migration. To guarantee the style and content quality of the migrated images, multiple rounds of iterative optimization training are required, or a separate model is trained for each individual style. However, these methods generally perform only simple migration of textures or pixels within a single style, and their effect and stability for the migration of complex objects and complex styles are very poor.
Disclosure of Invention
An objective of the present invention is to provide a style migration model training method, a video processing method and a related apparatus, so as to improve the migration effect and stability for complex objects and complex styles.
In a first aspect, the present invention provides a style migration model training method, including: acquiring a training sample set; the training sample set comprises at least one content image and at least one reference image, and the reference image has a target style characteristic; constructing an initial generative adversarial model; wherein the generative adversarial model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration graphs corresponding to each training sample; the resolutions corresponding to the style feature migration graphs are different from each other; training the generative adversarial model by utilizing the training sample set to obtain a style migration model corresponding to the target style characteristic; and the style migration model is used for processing the video stream to be processed so that each frame of image of the video stream to be processed has the target style characteristic.
In a second aspect, the present invention provides a video processing method, including: acquiring a video stream to be processed and a target style; inputting each frame of image of the video stream to be processed into a style migration model corresponding to the target style to obtain a target image corresponding to each frame of image; the target image has the target style, and the style migration model is obtained by the style migration model training method according to the first aspect; and obtaining the processed video stream based on all the target images.
In a third aspect, the present invention provides a style migration model training apparatus, including: the acquisition module is used for acquiring a training sample set; the training sample set comprises at least one content image and at least one reference image, and the reference image has a target style characteristic; a construction module for constructing an initial generative adversarial model; wherein the generative adversarial model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration graphs corresponding to each training sample; the resolutions corresponding to the style feature migration graphs are different from each other; the training module is used for training the generative adversarial model by utilizing the training sample set to obtain a style migration model corresponding to the target style characteristics; and the style migration model is used for processing the video stream to be processed so that each frame of image of the video stream to be processed has the target style characteristic.
In a fourth aspect, the present invention provides a video processing apparatus, comprising: the acquisition module is used for acquiring a video stream to be processed and a target style; the migration module is used for inputting each frame of image of the video stream to be processed into a style migration model corresponding to the target style to obtain a target image corresponding to each frame of image; wherein the target image has the target style, and the style migration model is obtained by the style migration model training method according to the first aspect; and the processing module is used for obtaining the processed video stream based on all the target images.
In a fifth aspect, the present invention provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being capable of executing the computer program to implement the method of the first aspect or the second aspect.
In a sixth aspect, the present invention provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first or second aspect.
The invention provides a style migration model training method, a video processing method and a related device, wherein the method comprises the following steps: obtaining content images and reference images for training, constructing a generative adversarial model, and training the generative adversarial model with the training sample set to obtain a style migration model corresponding to target style characteristics.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic view of an application scenario of a style migration model training method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a style migration model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a generator according to an embodiment of the present invention;
fig. 4 is another schematic structural diagram of a generator according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of step S203 according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of a video processing method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a user interface provided by an embodiment of the present invention;
FIG. 8 is a functional block diagram of a style migration model training apparatus according to an embodiment of the present invention;
FIG. 9 is a functional block diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that if the terms "upper", "lower", "inside", "outside", etc. indicate an orientation or a positional relationship based on that shown in the drawings or that the product of the present invention is used as it is, this is only for convenience of description and simplification of the description, and it does not indicate or imply that the device or the element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
Style migration refers to the conversion of images between two different domains. Specifically, given a style image, any image can be converted into that style while retaining the content of the original image as much as possible, for example, converting a real-person photo into a cartoon-style photo, an oil-painting-style photo, or a hand-drawn-style photo.
Existing image style migration techniques generally use methods such as deep learning to extract multiple feature layers from the image to be migrated, separate content features from style features, and finally mix the content features and style features of different images to achieve style migration. To guarantee the style and content quality of the migrated images, multiple rounds of iterative optimization training are required, or a separate model is trained for each individual style. However, these methods generally perform only simple migration of texture and pixel styles and do not work well for the migration of complex objects and complex styles, and when extended to video style migration the problem of poor style stability remains.
For example, the related art provides a method that extracts pixel points from frame images in a video stream for pixel clustering and replaces each pixel group of a frame image in the video to be migrated with the pixel of its stylized cluster centre, extracting one or more frames either randomly or at equal intervals for processing to obtain the video stream after style migration.
However, the above method is only suitable for style migration at the pixel level: there is no high-level semantic distinction of the migration target, so a complex target cannot be identified and given detailed stylized migration. The style migration scale is a clustered pixel group formed by pixel points with similar attributes, i.e. a single-scale region of no fixed shape; simple style migration can only be performed for homogeneous regions such as skin, while multi-scale complex regions such as faces, eyes and pupils cannot be divided and migrated through clustering.
In order to solve the above problems, embodiments of the present invention provide a style migration model and a training method corresponding to the style migration model, which can implement style migration on a complex target, and can effectively improve the quality and stability of style migration, so as to implement style migration of a live video stream with a lot of details, stability, and high quality.
The style migration model and the style migration model training method provided in the embodiment of the present invention will be described in detail below.
The style migration model training method provided by the embodiment of the application can be applied to equipment with a model training function, such as terminal equipment, a server and the like. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may specifically be an application server or a Web server, and when the server is deployed in actual application, the server may be an independent server or a cluster server.
In practical application, the terminal device and the server may train the style migration model independently or may train the style migration model in an interactive manner, and when the terminal device and the server train the style migration model in an interactive manner, the terminal device may acquire a training sample set from the server, and then perform model training using the training sample set to obtain the style migration model, or the server may acquire the training sample set from the terminal, and then perform model training using the training sample set to obtain the style migration model.
It should be understood that, when the terminal device or the server executes the training method provided in the embodiment of the present application, after obtaining the style migration model through training, the style migration model may be sent to other terminal devices, so as to run the style migration model on the terminal devices, thereby implementing corresponding functions; the style migration model can also be sent to other servers to run the style migration model on the other servers, and corresponding functions are realized through the servers.
In order to facilitate understanding of the technical solution provided by the embodiment of the present application, a training method provided by the embodiment of the present application is described below by taking a server training style migration model as an example and combining with an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of the style migration model training method provided in the embodiment of the present application. The scene comprises a terminal device 101 and a server 102 for model training, wherein the terminal device 101 and the server 102 are connected through a network. The terminal device 101 is capable of providing a content image and a reference image for the server, wherein the content image may be an image containing any image content, for example, an image of a person, an image of an animal, an image of a scene, and the like; the reference image may be a style feature transition image obtained based on the content image and a target style, the reference image and the content image appear in pairs, their content is similar and their styles differ; the target style may be, but is not limited to, a canvas style, a bright style, a colored-pencil style, and the like, which is not limited here.
After the server 102 acquires the content images and the reference images from the terminal device 101 through the network, the content images and the reference images form a training sample set, next, the server can construct an initial generative confrontation model, and execute the training method provided by the embodiment of the present invention on the constructed generative confrontation model by using the training sample set, so as to finally obtain a style migration model corresponding to a target style, wherein the structure of the generative confrontation model and the training method constructed in the embodiment of the present invention will be described in detail in the following content.
After the server 102 generates the style migration models, the style migration models may be further sent to the terminal device 101, so as to run the style migration models on the terminal device 101, and implement corresponding functions by using the style migration models.
It can be understood that the embodiment of the present invention may obtain multiple styles of migration models through pre-training, each migration model corresponds to one image style, and the image styles corresponding to each migration model are different, that is, each migration model may implement the migration of one image style to an image.
In some embodiments, multiple style migration models may be stored locally in the terminal 101 or the server 102, and a file of a style migration model corresponding to a target style may be directly read locally in a scene in which the style migration model needs to be used.
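As a concrete illustration of reading a locally stored model for a selected target style, a minimal sketch follows; it assumes PyTorch and TorchScript model files with hypothetical file names, none of which are specified in this application:

```python
# Hypothetical sketch: loading a locally stored style migration model by target style.
# The file layout, names, and TorchScript format are assumptions for illustration only.
import torch

STYLE_MODEL_FILES = {
    "oil_painting": "models/style_oil_painting.pt",
    "cartoon": "models/style_cartoon.pt",
    "sketch": "models/style_sketch.pt",
}

def load_style_model(target_style: str, device: str = "cuda") -> torch.nn.Module:
    path = STYLE_MODEL_FILES[target_style]              # one model file per style
    model = torch.jit.load(path, map_location=device)   # assumes models were exported with TorchScript
    model.eval()
    return model
```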
It should be noted that the application scenario shown in fig. 1 is only an example, and in practical application, the style migration model training method provided in the embodiment of the present application may also be applied to other application scenarios, and no limitation is made to the application scenario of the style migration model training method herein.
Referring to fig. 2, fig. 2 is a schematic flowchart of a style migration model training method according to an embodiment of the present disclosure. For convenience of description, the following embodiments are described with a server as an execution subject, and it should be understood that the execution subject of the style transition model training method is not limited to the server, and may be applied to a device having a model training function, such as a terminal device. As shown in fig. 2, the style migration model training method includes the following steps:
s201, acquiring a training sample set; the training sample set comprises at least one content image and at least one reference image, and the reference image has target style characteristics;
s202, constructing an initial generation countermeasure model; the generation of the confrontation model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration graphs corresponding to each training sample; the resolution ratios corresponding to the style characteristic migration graphs are different;
s203, training the generative reactance model by using the training sample set to obtain a style migration model corresponding to the target style characteristics; the style migration model is used for processing the video stream to be processed, so that each frame of image of the video stream to be processed has a target style characteristic.
According to the style migration model training method provided by the embodiment of the invention, a content image and a reference image for training are first obtained, a generative adversarial model is then constructed, and the generative adversarial model is trained with the training sample set to obtain a style migration model corresponding to the target style characteristics.
The above steps S201 to S203 provided by the embodiment of the present invention will be described in detail with reference to the drawings.
In step S201, a training sample set is acquired.
In the embodiment of the application, the training sample set includes a content image and a reference image. The reference image is obtained from the content image and is a style transition image whose content is similar to that of the content image but whose style differs, and the reference image has the target style characteristic; that is, when any style characteristic needs to be migrated into a content image to be processed, a reference image can be obtained based on that style characteristic and the content image.
In some embodiments, the number of content images and the number of reference images is not limited: there may be one or multiple content images, and the number of reference images is consistent with the number of content images. When a plurality of content images and reference images are subsequently used for training, the content images and the reference images can be used for model training at the same time, which improves the accuracy of the style migration model obtained after training.
In order to ensure the efficiency and effect of model training, after a training sample set is obtained, the content image and the reference image can be normalized first, and the normalized content image and the normalized reference image can be used in the subsequent training process.
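As a small illustration of this normalization step, the following sketch maps 8-bit RGB pixel values to [-1, 1]; this particular mapping is an assumption, chosen as the inverse of the output mapping RGB = [Resize(y)+1]*127.5 that appears later in this description:

```python
import numpy as np

def normalize_image(rgb: np.ndarray) -> np.ndarray:
    """Map an 8-bit RGB image to [-1, 1]; assumed to mirror RGB = (y + 1) * 127.5 at output time."""
    return rgb.astype(np.float32) / 127.5 - 1.0
```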
In step S202, an initial generative adversarial model is constructed.
The generative adversarial model in the embodiment of the invention may include a generator and a discriminator, where the generator is configured to generate a plurality of style feature migration maps corresponding to each training sample, and the resolutions of the plurality of style feature migration maps are different from each other.
In order to enable the generator to output a plurality of style feature migration maps with different resolutions, in an optional implementation manner, please refer to fig. 3, which is a schematic structural diagram of the generator provided in the embodiment of the present invention. The generator includes M groups of sub-generation models; the M groups of sub-generation models are connected in series through pooling layers; each group of sub-generation models corresponds to a discriminator; the resolution of the style feature migration map generated by the mth group of sub-generative models is greater than the resolution of the style feature migration map generated by the (m+1)th group of sub-generative models, and M is greater than or equal to 2. This can be understood as follows: starting with the first sub-generative model, which receives the input training samples, all the sub-generative models are numbered in increasing order, and from the 1st group of sub-generative models to the Mth group the resolution of the output style feature migration maps is reduced successively by a preset multiple.
To achieve the effect that the resolutions of the style feature migration maps are sequentially reduced according to the preset multiple, please refer to fig. 4, where fig. 4 is another schematic structural diagram of the generator according to the embodiment of the present invention:
each group of sub-generative models can be formed by combining an encoder and a decoder; each decoder has a first output branch and a second output branch; the input of the encoder of the mth group of sub-generative models is the output of the encoder of the (m-1) th group of sub-generative models; the input of the decoder of the m-th group of sub-generative models is the output of the encoder of the m-th group of sub-generative models and the output of the first output branch of the decoder of the m + 1-th group of sub-generative models; the second output branch of each decoder is used for outputting the first style feature migration map or the second style feature migration map.
It can be understood that the encoder-decoder model is a common convolutional neural network model: after training, the target can be detected at the encoder stage and the image content restored at the decoder stage. In the prior art, an encoder-decoder model is used by inputting an image into the encoder for feature extraction to obtain feature maps of reduced resolution, which are then passed to the decoder for decoding; the intermediate features of each encoder level are passed to the decoder through skip connections to assist decoding, and the decoder finally outputs one group of decoded maps.
As can be seen from the above, an encoder-decoder model of the prior-art design has only a single group of outputs, which can result in an imbalance in the model's encoding and decoding capabilities for multi-scale target features. Since a complex target usually consists of target features at various scales, this imbalance is unfavourable for the style migration of complex targets. Meanwhile, the decoder has only one output branch, so only the highest-resolution information is explicitly supervised during training; a disturbance in an unconstrained intermediate layer therefore easily disturbs the output, causing an unstable video-stream style migration effect.
Therefore, the generator shown in fig. 4 is constructed based on multiple sets of encoder-decoder models, each set of encoder-decoder can be used as a sub-generation model, and in order to achieve the effect that each sub-generation model outputs a style feature migration diagram with different resolution, the embodiment of the present invention performs special design on the encoder-decoder in each set of sub-generation models, which is specifically as follows:
(1) For the encoder in each sub-generative model, the number N of convolutional-and-pooling groups is determined by the index of the sub-generative model and the total number of sub-generative models, and the encoder is formed by connecting N groups of convolutional layers and pooling layers in series; for the mth group of sub-generative models (1 ≤ m ≤ M), N = M - m + 1.
It can be seen that the multiple sub-generative models can be numbered in sequence according to the data stream direction between the encoders to obtain the number corresponding to each sub-generative model.
(2) The decoder in each sub-generation model consists of N convolutional layers (the same number as in the encoder) and up-sampling layers connected in series, and is provided with a first output branch and a second output branch: the first output branch outputs a feature map, and the second output branch outputs the generated style feature migration map and can consist of a convolutional layer with 3 channels.
(3) The two inputs of the decoder of the mth group of sub-generative models (1 ≤ m ≤ M) are the output of the mth-group encoder and the output of the (m+1)th-group decoder. For the mth decoder, its output feature map has respectively undergone m-1 groups of encoder coding and m-1 groups of decoder decoding, and, after combination, undergoes M-m+1 further rounds of coding and decoding within the sub-generation model.
That is, the feature map of the first output branch of each decoder is encoded and decoded a constant (m-1) + (M-m+1) = M times, so that target features under receptive fields ranging from 1 to 2^(M-1) have equivalent encoding and decoding capability. This solves the problem of unbalanced encoding and decoding capability for multi-scale target features, helps the model implicitly learn to detect targets of different scales, and realizes style migration of multi-scale complex targets.
At the same time, since each decoder also has a second output branch, the generator as a whole has M output branches and can output style feature migration maps whose resolutions are reduced by factors of 1 to 2^(M-1) relative to the original image. This coarse-to-fine method of gradually adding migration details can significantly improve the stability of the style feature migration maps.
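To make the multi-scale structure described above more concrete, the following is a loose, simplified sketch in PyTorch (an assumption; this application does not provide code). It keeps the key ideas (M stages, coarse-to-fine decoding, and a feature branch plus a 3-channel image branch per stage, producing outputs at resolutions 1 to 1/2^(M-1)) but simplifies the exact encoder chaining and layer counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.InstanceNorm2d(cout),
                         nn.ReLU(inplace=True))

class MultiScaleGenerator(nn.Module):
    """Loose sketch of an M-group generator: an encoder pyramid plus a coarse-to-fine
    decoder cascade in which every stage m has a feature branch and a 3-channel image
    branch, so the generator emits M style transfer maps at 1, 1/2, ..., 1/2**(M-1)
    of the input resolution."""
    def __init__(self, num_scales=4, ch=64):
        super().__init__()
        self.num_scales = num_scales
        self.stem = conv_block(3, ch)
        self.encoders = nn.ModuleList([conv_block(ch, ch) for _ in range(num_scales)])
        self.decoders = nn.ModuleList([conv_block(ch, ch) for _ in range(num_scales)])
        self.feature_branches = nn.ModuleList([conv_block(ch, ch) for _ in range(num_scales)])
        self.image_branches = nn.ModuleList([nn.Conv2d(ch, 3, 3, padding=1) for _ in range(num_scales)])
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        # encoder pyramid: feats[m] is the encoding at 1/2**m of the input resolution
        feats, h = [], self.stem(x)
        for enc in self.encoders:
            h = enc(h)
            feats.append(h)
            h = self.pool(h)

        # decode coarse-to-fine; outputs[m] is the style transfer map at 1/2**m resolution
        outputs = [None] * self.num_scales
        prev_feat = None
        for m in reversed(range(self.num_scales)):
            d_in = feats[m] if prev_feat is None else feats[m] + F.interpolate(
                prev_feat, size=feats[m].shape[-2:], mode="nearest")
            d = self.decoders[m](d_in)
            prev_feat = self.feature_branches[m](d)             # first output branch: feature map
            outputs[m] = torch.tanh(self.image_branches[m](d))  # second output branch: image
        return outputs  # list of M maps, index 0 is the highest resolution
```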
In step S203, the generative adversarial model is trained by using the training sample set to obtain a style migration model corresponding to the target style characteristics; the style migration model is used for processing the video stream to be processed, so that each frame of image of the video stream to be processed has the target style characteristic.
With reference to fig. 3 and fig. 4, step S203 can be understood as follows: for a content image, the generator produces M first style feature migration maps y'_m (m = 1, 2, …, M) whose resolutions are reduced by successive multiples, and each y'_m is input to the mth discriminator to obtain first discrimination information; the reference image is down-sampled to obtain M second style feature migration maps y_m (m = 1, 2, …, M) with correspondingly reduced resolutions, and each y_m is input to the mth discriminator to obtain second discrimination information. The loss values of a plurality of preset loss functions are then calculated based on the first and second discrimination information obtained by all discriminators, and training finishes when a preset condition is reached.
It can be understood that in the process of training the style migration model, the embodiment of the present invention expects to down-sample the reference image to obtain the style migration map under different scales, and use this as the supervision information, the generator down-samples and encodes the content image, in this process, the generator can detect each target to be migrated in the content image, and then convert the decoded features into the style feature migration map with corresponding resolution in the decoding stage, and when the style migration map under different scales exists as the supervision information, the features corresponding to the decoding stage can be explicitly trained as multi-scale features, that is, each scale of the reference image needs the same scale of features to be generated. This ensures that each stage of the decoder focuses on generating features of its corresponding scale, resulting in a refined style transition map at the final decoded output.
In the embodiment of the present invention, the discriminator is used to judge whether an input image is a style feature transition map generated by the generator or a reference image. For example, suppose the feature map output by the first output branch of the decoder of the mth sub-generation model is s'_m and the second output branch of that decoder outputs the style feature migration map y'_m; the mth discriminator then determines whether y'_m is a style feature transition map generated by the generator or a reference image, and outputs the corresponding discrimination information.
A common generative adversarial training setup is used here: the style feature transition map y' generated by the generator is labelled false and the reference image y is labelled true; the generator aims to have y' judged true, while the discriminator aims to judge y as true and y' as false. Concretely, the discriminator passes its input (either y or y') through several convolution and pooling layers to obtain one or more output values, from which the losses of the generator and the discriminator are derived via a GAN loss (LSGAN is used here), thereby performing generative adversarial training.
The significance of the above training method is that the style transition maps generated by the generator become consistent with, and indistinguishable from, the reference images during training, so that the style transition maps produced during actual inference are closer to the style of the reference image.
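The following sketch illustrates one such adversarial training step with least-squares (LSGAN) losses and per-scale discriminators; it is a schematic example rather than the exact procedure of this application, and the non-adversarial loss terms are omitted:

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    return 0.5 * ((d_fake - 1.0) ** 2).mean()

def train_step(generator, discriminators, g_opt, d_opt, x, y):
    """One illustrative adversarial step; x is a batch of content images, y the paired reference images."""
    fakes = generator(x)  # M style transfer maps, highest resolution first
    refs = [y if m == 0 else
            F.interpolate(y, scale_factor=0.5 ** m, mode="bilinear", align_corners=False)
            for m in range(len(fakes))]  # down-sampled references act as per-scale supervision

    # discriminator update: reference images are "real", generated maps (detached) are "fake"
    d_loss = sum(lsgan_d_loss(D(r), D(f.detach()))
                 for D, r, f in zip(discriminators, refs, fakes))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator update: adversarial term per scale (LPIPS / feature-matching / style terms omitted here)
    g_loss = sum(lsgan_g_loss(D(f)) for D, f in zip(discriminators, fakes))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```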
Therefore, in an alternative implementation manner, please refer to fig. 5, where fig. 5 is a schematic flowchart of step S203 provided in an embodiment of the present invention:
s203-1, inputting the training sample set into the generation confrontation model, and generating each first style feature transfer graph corresponding to each content image and each second style feature transfer graph corresponding to each reference image by utilizing each sub generation model.
In other words, during training the content image is input into the generator to generate first style feature transition maps at different resolution scales; after the reference image is input into the model, the reference image at each resolution scale is directly obtained through down-sampling, i.e. the second style feature transition maps are obtained. The first and second style feature transition maps are input into the discriminators, which give different discrimination information by judging whether the input image is a generated image or a reference image.
S203-2, inputting the first style feature transition diagram and the second style feature transition diagram corresponding to each sub-generation model into the discriminator corresponding to that sub-generation model, to obtain first discrimination information for each first style feature transition diagram and second discrimination information for each second style feature transition diagram.
It can be understood that, since the content image and the reference image in the embodiment of the present application appear in pairs, for each content image, a plurality of first style feature migration maps corresponding to the content image are generated by the generator, then the second style feature migration map of the reference image corresponding to the content image is obtained by downsampling, and then the first style feature migration map and the second style feature migration map are input to the discriminator for discrimination.
The first discrimination information is the discriminator's judgment of whether the input first style feature transition diagram was generated by the generator or is a reference image; the second discrimination information is the discriminator's judgment of whether the input second style feature transition diagram was generated by the generator or is a reference image.
S203-3, determining loss values of a plurality of loss functions corresponding to the generated countermeasure model based on the first discrimination information and the second discrimination information obtained by all discriminators;
S203-4, back-propagating the loss values of the loss functions to each layer of the generative adversarial model to iteratively update the model parameters until a preset condition is reached, and taking the trained generative adversarial model as the style migration model corresponding to the target style characteristics.
In an optional implementation manner, the plurality of loss functions in the embodiment of the present invention may be: a least-squares loss function L_LS, a learned perceptual image patch similarity loss function L_LPIPS, a feature matching loss function L_FM, and a style feature loss function L_style, where x is the normalized content image, y is the normalized reference image, F_LPIPS and F_VGG are the LPIPS and VGG inference models respectively, j indexes the output features of a given layer of the model, and w_j is the weight of the jth-layer loss term.
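The exact expressions used in this application are not reproduced above, so the following sketch only shows commonly used forms of the four named losses (least-squares GAN terms, LPIPS, per-layer feature matching, and a Gram-matrix style loss over VGG features); these concrete formulations are assumptions consistent with the variable names x, y, F_LPIPS, F_VGG and w_j, not the application's own definitions:

```python
import torch
import torch.nn.functional as F

def least_squares_losses(d_fake, d_real):
    """L_LS: least-squares GAN terms (LSGAN), as referenced in the text."""
    loss_d = 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()
    loss_g = 0.5 * ((d_fake - 1.0) ** 2).mean()
    return loss_g, loss_d

def lpips_loss(f_lpips, y_fake, y_ref):
    """L_LPIPS: learned perceptual image patch similarity between generated and reference images."""
    return f_lpips(y_fake, y_ref).mean()

def feature_matching_loss(feats_fake, feats_real, weights):
    """L_FM: weighted L1 distance between per-layer features (layers j with weights w_j)."""
    return sum(w * F.l1_loss(a, b) for w, a, b in zip(weights, feats_fake, feats_real))

def style_loss(f_vgg, y_fake, y_ref, weights):
    """L_style: distance between Gram matrices of VGG features (one common formulation, assumed here)."""
    def gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)
    return sum(w * F.l1_loss(gram(a), gram(b))
               for w, a, b in zip(weights, f_vgg(y_fake), f_vgg(y_ref)))
```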
In an optional embodiment, in order to perform computational graph optimization on the trained model to accelerate inference, the following steps may also be performed:
and removing second output branches in the remaining groups of sub-generative models except the initial group of sub-generative models in the style migration model, and taking the removed style migration model as a style migration model corresponding to the target style characteristics.
It is understood that, as can be seen in conjunction with fig. 4, the resolution of the output of the starting group of sub-generative models (which may also be understood as the first group of sub-generative models) is the highest, so in a practical application scenario, only the output of the starting group of sub-generative models needs to be retained.
In an application implementation scenario, the style migration model obtained through the training may be deployed on a target device through an inference engine (including but not limited to TensorRT, MNN, and the like), and an inference warm-up effect is achieved by running on the target device multiple times, so that the model can be compatible with the target device.
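A minimal sketch of preparing a trained model for use on a target device is given below; it assumes PyTorch and only illustrates the warm-up idea of running the model several times on dummy input (in practice, pruning of the extra output branches and conversion to an engine such as TensorRT or MNN would happen at export time):

```python
import torch

def prepare_for_inference(model: torch.nn.Module, device: str = "cuda",
                          input_size=(1, 3, 512, 512), warmup_runs: int = 5):
    """Assumed helper: move the deployed (branch-pruned) model to the device and warm it up."""
    model = model.to(device).eval()
    dummy = torch.zeros(input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup_runs):   # repeated runs trigger kernel selection / graph caching
            _ = model(dummy)
    return model
```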
After obtaining the style migration model corresponding to each style feature, the style migration processing may be performed on the video stream or the image, so an embodiment of the present invention further provides a video processing method, please refer to fig. 6, where fig. 6 is a schematic flowchart of the video processing method provided in the embodiment of the present invention, an execution subject of the video processing method may be the terminal 101 or the server 102 in fig. 1, and the method may include:
s301, a video stream to be processed and a target style are obtained.
In the embodiment of the present invention, the real-time video stream may be a video stream from a network, or may be a video stream locally stored in a device.
In an optional implementation manner, reference may be made to fig. 7 for a manner of acquiring the video stream and the target genre, where fig. 7 is a schematic view of a user interface provided by an embodiment of the present invention, and based on the user interface, the implementation manner of the step S301 may be:
a1, displaying a user interface; the user interface is provided with a video acquisition area and a style selection area;
a2, responding to user operation on a video acquisition area to acquire a video stream;
and a3, acquiring the target style by the selection operation on the style selection area.
It can be understood that the user interface is only an example, and is not a limitation on the user interface, in an actual scene, the video acquisition area and the style selection area may also be displayed through two sub-interfaces, that is, the sub-interface corresponding to the video acquisition area may be displayed first, and after a video stream is obtained, the sub-interface displaying the style selection area may be triggered.
It can also be understood that, for the sub-interface of the style selection area, besides the identifiers corresponding to various styles, a preview interface corresponding to each style can be displayed, that is, an effect corresponding to each style can be displayed for the user on the preview interface, so that the user can select a target style meeting the requirement.
S302, inputting each frame of image of the video stream to be processed into a style migration model corresponding to a target style, and obtaining a target image corresponding to each frame of image.
Wherein the target image has a target style. The style migration model is obtained through the style migration model training method provided by the embodiment of the invention.
S303, a processed video stream is obtained based on all the target images.
According to the video processing method provided by the embodiment of the application, after the video stream and the target style are obtained, the style migration model corresponding to the target style is utilized to perform the style migration processing on each frame of image of the video stream to be processed, the target image with the target style is obtained, then the video stream after the style migration is obtained based on the target image, the style migration of the complex target can be realized by utilizing the style migration model trained in advance in the whole process, and the accuracy and the stability are improved.
In an alternative embodiment, the video stream in the embodiment of the present invention is generally stored in the form of YUV video-frame byte strings. After the video stream is obtained, the frame data therefore needs to be converted into original images in RGB format before each frame image is input to the style migration model. An embodiment of the present invention accordingly provides the following conversion method:
and b1, reading frame data corresponding to the video stream, and preprocessing the frame data to obtain YUV component data corresponding to each frame of image.
It will be appreciated that the frame data is in YUV420p/I420 format, and in order to construct the data structure required by the model and to pass it to the graphics card device, the expression of the frame data is as follows:
{I_Y, I_U, I_V} = DecodeToDevice(ByteStream_in)

The meaning of the above relation is: read the byte data from memory, transmit it to the processing device (a GPU or NPU; if it is a CPU, no extra transmission is needed), and split it into the Y, U and V component planes.
And b2, obtaining RGB format data corresponding to each frame of image based on the YUV component data corresponding to each frame of image and a preset color space conversion matrix.
Specifically, the frame data obtained in step b1 is processed by separating the YUV sub-images, performing nearest-neighbour interpolation up-sampling on the U-component sub-image and the V-component sub-image respectively, recombining the sub-images, and then multiplying by the conversion matrix from YUV format to RGB format. RGB format data corresponding to each frame of image is finally obtained, with the specific expression:
RGB = M_YUV2RGB * I_YUV
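A sketch of this YUV420p-to-RGB step is shown below; the BT.601 full-range conversion matrix and the chroma offset convention are assumptions, since the application does not state which matrix M_YUV2RGB it uses:

```python
import numpy as np

# Assumed BT.601 full-range YUV -> RGB matrix; the concrete matrix is not given in the application.
M_YUV2RGB = np.array([[1.0,  0.0,       1.402],
                      [1.0, -0.344136, -0.714136],
                      [1.0,  1.772,     0.0]], dtype=np.float32)

def yuv420p_frame_to_rgb(frame: bytes, width: int, height: int) -> np.ndarray:
    """Split a YUV420p byte string into Y/U/V planes, upsample the chroma planes, convert to RGB."""
    y_size = width * height
    uv_size = y_size // 4
    buf = np.frombuffer(frame, dtype=np.uint8)
    y = buf[:y_size].reshape(height, width).astype(np.float32)
    u = buf[y_size:y_size + uv_size].reshape(height // 2, width // 2).astype(np.float32)
    v = buf[y_size + uv_size:y_size + 2 * uv_size].reshape(height // 2, width // 2).astype(np.float32)
    # nearest-neighbour up-sampling of the chroma planes to full resolution
    u = u.repeat(2, axis=0).repeat(2, axis=1)
    v = v.repeat(2, axis=0).repeat(2, axis=1)
    yuv = np.stack([y, u - 128.0, v - 128.0], axis=-1)  # chroma offset convention is an assumption
    rgb = yuv @ M_YUV2RGB.T
    return np.clip(rgb, 0, 255).astype(np.uint8)
```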
and b3, converting the RGB format data corresponding to each frame of image into a data format corresponding to the style migration model.
Because the style migration model is based on a data structure suitable for the model itself, in order to enable the model to process video stream data, RGB format data corresponding to each frame of image needs to be converted into a data structure that can be processed by the model, and the specific conversion method is as follows:
y = G(x)

where x is each frame image after conversion, G is the style migration model, and y is the output of the style migration model.
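As an illustration of this step, the following sketch normalizes an RGB frame into a model tensor and runs it through the model; the normalization x = RGB / 127.5 - 1 is an assumption chosen as the inverse of the output mapping RGB = [Resize(y)+1]*127.5 given later:

```python
import numpy as np
import torch

def stylize_frame(rgb: np.ndarray, generator: torch.nn.Module, device: str = "cuda") -> torch.Tensor:
    """Normalise an RGB frame (assumed mapping x = RGB / 127.5 - 1) and run the style migration model."""
    x = torch.from_numpy(rgb.astype(np.float32)).permute(2, 0, 1).unsqueeze(0).to(device)  # HWC -> NCHW
    x = x / 127.5 - 1.0
    with torch.no_grad():
        y = generator(x)
    # a deployed model is assumed to return only the highest-resolution style transfer map
    return y[0] if isinstance(y, (list, tuple)) else y
```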
in an optional implementation manner, in order to enable the data output by the model to be displayed on the terminal, the data format of the data output by the model needs to be converted into an RGB format, and then the RGB format is converted into a YUV format, so that an optional implementation manner is provided in an embodiment of the present invention, that is, an implementation manner of obtaining a processed video stream based on all target images may include the following steps:
c1, converting each target image into YUV format data based on the color space conversion matrix;
the YUV format data refers to frame data in YUV420p/I420 format.
And c2, obtaining YUV component data corresponding to the target image based on the YUV format data.
The YUV component data here refers to a subpicture in YUV format.
After obtaining the data output by the model, the data can be converted into RGB format data according to the following relation:
RGB=[Resize(y)+1]*127.5
Then the obtained RGB format data is multiplied by the conversion matrix from RGB format to YUV format, the U-component sub-image and the V-component sub-image are down-sampled, and the output YUV420p video frame is finally obtained through byte encoding and transmitted to the terminal device. The specific conversion process is as follows:
O_YUV = M_RGB2YUV * O_RGB

ByteStream_out = EncodeToHost(O_Y, O_U, O_V)
that is, three component data are obtained from the processing device, spliced together, and converted into byte data.
In the above formulas, O_Y, O_U and O_V are the three components of the output stylized YUV video-stream frame data, O_YUV is the output image in YUV444 format, and M_RGB2YUV is the color space conversion matrix from RGB to YUV.
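The following sketch illustrates this output path; the RGB-to-YUV matrix, the chroma offset, the average-pooled chroma down-sampling and the omission of the Resize(y) step to the original frame size are all assumptions for illustration:

```python
import numpy as np
import torch

# Assumed inverse BT.601 matrix; the concrete M_RGB2YUV is not given in the application.
M_RGB2YUV = np.array([[ 0.299,     0.587,     0.114],
                      [-0.168736, -0.331264,  0.5],
                      [ 0.5,      -0.418688, -0.081312]], dtype=np.float32)

def model_output_to_yuv420p(y: torch.Tensor, width: int, height: int) -> bytes:
    """Map the model output back to 8-bit RGB via RGB = (y + 1) * 127.5, convert to YUV, subsample chroma."""
    rgb = ((y.squeeze(0).permute(1, 2, 0).cpu().numpy() + 1.0) * 127.5).clip(0, 255)
    yuv = rgb @ M_RGB2YUV.T
    Y = yuv[..., 0]
    U = yuv[..., 1] + 128.0
    V = yuv[..., 2] + 128.0
    # 2x2 average-pooled down-sampling of the chroma planes (width and height assumed even)
    U = U.reshape(height // 2, 2, width // 2, 2).mean(axis=(1, 3))
    V = V.reshape(height // 2, 2, width // 2, 2).mean(axis=(1, 3))
    planes = [p.clip(0, 255).astype(np.uint8).tobytes() for p in (Y, U, V)]
    return b"".join(planes)
```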
Through the above implementation, the video stream after style migration is obtained, and the user can intuitively perceive the effect of the style migration on the terminal device.
The style migration model training method provided in the embodiment of the present application may be implemented in a hardware device or in the form of a software module. When it is implemented in the form of a software module, an embodiment of the present application further provides a style migration model training apparatus; please refer to fig. 8, which is a functional block diagram of the style migration model training apparatus provided in the embodiment of the present application, and the style migration model training apparatus 400 may include:
a first obtaining module 410, configured to obtain a training sample set; the training sample set comprises at least one content image and at least one reference image, and the reference image has target style characteristics;
a construction module 420 for constructing an initial generative adversarial model; the generative adversarial model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration graphs corresponding to each training sample; the resolutions corresponding to the style feature migration graphs are different from each other;
the training module 430 is configured to train the generative adversarial model by using the training sample set to obtain a style migration model corresponding to the target style characteristics; the style migration model is used for processing the video stream to be processed, so that each frame of image of the video stream to be processed has the target style characteristic.
It is appreciated that the first obtaining module 410, the building module 420, and the training module 430 may cooperatively perform the various steps of fig. 2 to achieve a corresponding technical effect.
In an alternative embodiment, the building module 420 is used to build a generator as shown in fig. 3 or fig. 4.
In an alternative embodiment, the training module 430 is specifically configured to perform the various steps in fig. 5 to achieve the corresponding technical effect.
In an optional implementation manner, the apparatus may further include a processing module, configured to remove a second output branch from remaining sub-generative models except the initial group of sub-generative models in the style migration model, and use the removed style migration model as a style migration model corresponding to the target style feature.
The video processing method provided in the embodiment of the present application may be implemented in a hardware device or in the form of a software module. When it is implemented in the form of a software module, an embodiment of the present application further provides a video processing apparatus; please refer to fig. 9, which is a functional block diagram of the video processing apparatus provided in the embodiment of the present application, and the video processing apparatus 500 may include:
a second obtaining module 510, configured to obtain a video stream to be processed and a target style;
a migration module 520, configured to input each frame of image of the video stream to be processed into a style migration model corresponding to a target style, so as to obtain a target image corresponding to each frame of image; the target image has a target style, and the style migration model is obtained by the style migration model training method provided by the embodiment of the invention;
a processing module 530, configured to obtain a processed video stream based on all target images.
It is appreciated that the second obtaining module 510, the migration module 520 and the processing module 530 may cooperatively perform the various steps of fig. 6 to achieve the corresponding technical effect.
In an optional embodiment, the second obtaining module 510 is specifically configured to: displaying a user interface; the user interface is provided with a video acquisition area and a style selection area; responding to user operation on the video acquisition area to acquire a video stream; and responding to the selection operation on the style selection area to acquire the target style.
In an optional embodiment, the processing module 530 is further configured to read frame data corresponding to the video stream, and pre-process the frame data to obtain YUV component data corresponding to each frame of image; obtaining an RGB format corresponding to each frame of image based on YUV component data corresponding to each frame of image and a preset color space conversion matrix; and converting the RGB format data corresponding to each frame of image into a data format corresponding to the style migration model.
In an optional embodiment, the processing module 530 is specifically configured to convert each target image into YUV format data based on a color space conversion matrix; and acquiring YUV component data corresponding to the target image based on the YUV format data, and generating a processed video stream based on the YUV component data corresponding to all the target images.
Fig. 10 is a block diagram of an electronic device according to an embodiment of the present invention.
As shown in fig. 10, the electronic device 600 comprises a memory 601, a processor 602, and a communication interface 603, wherein the memory 601, the processor 602, and the communication interface 603 are electrically connected to each other directly or indirectly to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 601 may be used to store software programs and modules, such as instructions/modules of the style migration model training apparatus 400 or the video processing apparatus 500 provided by the embodiment of the present invention, and may be stored in the memory 601 in the form of software or firmware (firmware) or fixed in an Operating System (OS) of the electronic device 600, and the processor 602 executes the software programs and modules stored in the memory 601, so as to execute various functional applications and data processing. The communication interface 603 may be used for communicating signaling or data with other node devices.
The memory 601 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 602 may be an integrated circuit chip having signal processing capabilities. The processor 602 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It will be appreciated that the configuration shown in FIG. 10 is merely illustrative and that electronic device 600 may include more or fewer components than shown in FIG. 10 or have a different configuration than shown in FIG. 10. The components shown in fig. 10 may be implemented in hardware, software, or a combination thereof.
The embodiment of the present application further provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the style migration model training method or the video processing method according to any one of the foregoing embodiments. The computer readable storage medium may be, but is not limited to, any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (14)

1. A style migration model training method, the method comprising:
acquiring a training sample set; wherein the training sample set comprises at least one content image and at least one reference image, and the reference image has a target style feature;
constructing an initial generative adversarial model; wherein the generative adversarial model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration maps corresponding to each training sample; and the resolutions corresponding to the style feature migration maps are different from each other;
training the generative adversarial model by using the training sample set to obtain a style migration model corresponding to the target style feature; wherein the style migration model is used for processing a video stream to be processed so that each frame of image of the video stream to be processed has the target style feature.
2. The style migration model training method according to claim 1, wherein the generator comprises M groups of sub-generation models; the M groups of sub-generation models are connected in series through pooling layers; each group of the sub-generation models corresponds to one discriminator; the resolution of the style feature migration map generated by the mth group of sub-generation models is greater than the resolution of the style feature migration map generated by the (m+1)th group of sub-generation models; M is greater than or equal to 2, and m is a positive integer.
3. The style migration model training method according to claim 2, wherein each group of the sub-generation models is formed by combining an encoder and a decoder; each of the decoders has a first output branch and a second output branch; the input of the encoder of the mth group of sub-generation models is the output of the encoder of the (m-1)th group of sub-generation models; the input of the decoder of the mth group of sub-generation models is the output of the encoder of the mth group of sub-generation models and the output of the first output branch of the decoder of the (m+1)th group of sub-generation models; and the second output branch of each decoder is used for outputting a first style feature migration map or a second style feature migration map.
4. The style migration model training method according to claim 2, wherein training the generative adversarial model by using the training sample set to obtain the style migration model corresponding to the target style feature comprises:
inputting the training sample set into the generative adversarial model, and generating, by each sub-generation model, a first style feature migration map corresponding to each content image and a second style feature migration map corresponding to each reference image;
inputting the first style feature migration map and the second style feature migration map corresponding to each sub-generation model into the discriminator corresponding to the sub-generation model, to obtain first discrimination information of each first style feature migration map and second discrimination information of each second style feature migration map;
determining loss values of a plurality of loss functions corresponding to the generative adversarial model based on the first discrimination information and the second discrimination information obtained by all the discriminators; and
back-propagating the loss values of the plurality of loss functions to each layer of the generative adversarial model to iteratively update model parameters until a preset condition is reached, and taking the trained generative adversarial model as the style migration model corresponding to the target style feature.
5. The style migration model training method according to claim 4, wherein the plurality of loss functions are respectively: a least squares loss function, a learned perceptual image patch similarity (LPIPS) loss function, a feature matching loss function, and a style feature loss function.
6. The style migration model training method according to claim 2, wherein after training the generative adversarial model by using the training sample set to obtain the style migration model corresponding to the target style feature, the method further comprises:
removing the second output branches of all the sub-generation models in the style migration model except the first group of sub-generation models, and taking the style migration model after the removal as the style migration model corresponding to the target style feature.
7. A video processing method, the method comprising:
acquiring a video stream to be processed and a target style;
inputting each frame of image of the video stream to be processed into a style migration model corresponding to the target style to obtain a target image corresponding to each frame of image;
wherein the target image has the target style, and the style migration model is obtained by the style migration model training method according to any one of claims 1 to 6;
and obtaining the processed video stream based on all the target images.
8. The video processing method according to claim 7, wherein obtaining the video stream to be processed and the target style comprises:
displaying a user interface; the user interface is provided with a video acquisition area and a style selection area;
responding to user operation on the video acquisition area to acquire the video stream;
and responding to the selection operation on the style selection area to acquire the target style.
9. The video processing method according to claim 7, wherein before inputting each frame of image of the video stream to be processed into the style migration model corresponding to the target style to obtain the target image corresponding to each frame of image, the method further comprises:
reading frame data corresponding to the video stream, and preprocessing the frame data to obtain YUV component data corresponding to each frame of image;
obtaining RGB format data corresponding to each frame of image based on the YUV component data corresponding to each frame of image and a preset color space conversion matrix;
and converting the RGB format data corresponding to each frame of image into a data format corresponding to the style migration model.
10. The video processing method according to claim 7, wherein obtaining the processed video stream based on all of the target images comprises:
converting each target image into YUV format data based on a color space conversion matrix;
and acquiring YUV component data corresponding to the target image based on the YUV format data, and generating the processed video stream based on the YUV component data corresponding to all the target images.
11. A style migration model training apparatus, comprising:
the acquisition module is used for acquiring a training sample set; wherein the training sample set comprises at least one content image and at least one reference image, and the reference image has a target style feature;
the construction module is used for constructing an initial generative adversarial model; wherein the generative adversarial model comprises a generator and a discriminator; the generator is used for generating a plurality of style feature migration maps corresponding to each training sample; and the resolutions corresponding to the style feature migration maps are different from each other;
the training module is used for training the generative adversarial model by using the training sample set to obtain a style migration model corresponding to the target style feature; wherein the style migration model is used for processing a video stream to be processed so that each frame of image of the video stream to be processed has the target style feature.
12. A video processing apparatus, comprising:
the acquisition module is used for acquiring a video stream to be processed and a target style;
the migration module is used for inputting each frame of image of the video stream to be processed into a style migration model corresponding to the target style to obtain a target image corresponding to each frame of image;
wherein the target image has the target style, and the style migration model is obtained by the style migration model training method according to any one of claims 1 to 6;
and the processing module is used for obtaining the processed video stream based on all the target images.
13. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being configured to execute the computer program to implement the method of any one of claims 1 to 6 or the method of any one of claims 7 to 10.
14. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN202210862700.6A 2022-07-20 2022-07-20 Style migration model training method, video processing method and related device Pending CN115171023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210862700.6A CN115171023A (en) 2022-07-20 2022-07-20 Style migration model training method, video processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210862700.6A CN115171023A (en) 2022-07-20 2022-07-20 Style migration model training method, video processing method and related device

Publications (1)

Publication Number Publication Date
CN115171023A true CN115171023A (en) 2022-10-11

Family

ID=83495880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210862700.6A Pending CN115171023A (en) 2022-07-20 2022-07-20 Style migration model training method, video processing method and related device

Country Status (1)

Country Link
CN (1) CN115171023A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611434A (en) * 2024-01-17 2024-02-27 腾讯科技(深圳)有限公司 Model training method, image style conversion method and device and electronic equipment
CN117611434B (en) * 2024-01-17 2024-05-07 腾讯科技(深圳)有限公司 Model training method, image style conversion method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111402143B (en) Image processing method, device, equipment and computer readable storage medium
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN111275784A (en) Method and device for generating image
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2021231042A1 (en) Image super-resolution
CN113191495A (en) Training method and device for hyper-resolution model and face recognition method and device, medium and electronic equipment
CN113569840A (en) Form recognition method and device based on self-attention mechanism and storage medium
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN117095287A (en) Remote sensing image change detection method based on space-time interaction transducer model
CN115171023A (en) Style migration model training method, video processing method and related device
CN115909445A (en) Face image counterfeiting detection method and related equipment
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
CN117094362B (en) Task processing method and related device
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN112884702A (en) Polyp identification system and method based on endoscope image
Zheng et al. Transformer-based hierarchical dynamic decoders for salient object detection
CN111738186A (en) Target positioning method and device, electronic equipment and readable storage medium
CN116977714A (en) Image classification method, apparatus, device, storage medium, and program product
CN115272768A (en) Content identification method, device, equipment, storage medium and computer program product
CN117011907A (en) Cross-age face recognition method and related device
CN114511702A (en) Remote sensing image segmentation method and system based on multi-scale weighted attention
CN116433674B (en) Semiconductor silicon wafer detection method, device, computer equipment and medium
CN117788826A (en) Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination