CN111667399B - Training method of style migration model, video style migration method and device - Google Patents

Info

Publication number
CN111667399B
CN111667399B
Authority
CN
China
Prior art keywords
image
model
loss function
images
frames
Prior art date
Legal status
Active
Application number
CN202010409043.0A
Other languages
Chinese (zh)
Other versions
CN111667399A (en)
Inventor
张依曼
陈醒濠
王云鹤
许春景
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010409043.0A
Publication of CN111667399A
Application granted
Publication of CN111667399B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/04Context-preserving transformations, e.g. by using an importance map
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a style migration model, a video style migration method and a device in the artificial intelligence field, comprising the following steps: acquiring training data; performing image style migration processing on the N frames of sample content images according to the sample style images through the neural network model to obtain N frames of predicted composite images; and determining parameters of the neural network model according to an image loss function between the N-frame sample content image and the N-frame prediction synthesized image, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N-frame sample content image. The technical scheme of the application can improve the stability of the video after style migration processing.

Description

Training method of style migration model, video style migration method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a training method of a style migration model in the field of computer vision, a video style migration method and a device.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Image rendering tasks such as image style migration have wide application scenarios on terminal devices. With the rapid improvement of terminal device performance and network performance, the entertainment requirements on terminal devices are gradually shifting from the image level to the video level, that is, from image style migration processing of a single image to image style migration processing of a video. Compared with an image style migration task, a video style migration task not only needs to consider the stylization effect of the images, but also needs to consider the stability among the multiple frames of images included in the video, so as to ensure the fluency of the video after image style migration processing.
Therefore, how to improve the stability of a video after image style migration processing is a problem to be solved.
Disclosure of Invention
The application provides a training method of a style migration model, a video style migration method and a device. A low-rank loss function is introduced in the process of training the style migration model for videos, so that the stability of the video after style migration can be kept consistent with the stability of the original video, and the stability of the video after style migration processing obtained by the target style migration model can be improved.
In a first aspect, a method for training a style migration model is provided, including: acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted composite images,
The image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frame images in the N-frame sample content image.
It should be appreciated that, for a matrix of multi-frame images, a low rank matrix may be used to represent areas that are all present in the N-frame images and are not motion boundaries. The sparse matrix may be used to represent areas of intermittent appearance in the N frames of images; for example, the sparse matrix may refer to an area that newly appears or disappears at the image boundary due to camera movement, or a boundary area of a moving object.
In the embodiment of the application, a low-rank loss function is introduced when training the target style migration model used for video style migration processing, so that regions which appear in multiple adjacent frames of the video to be processed and are not motion boundaries remain consistent after the style migration processing; that is, the rank of such regions in the video after style migration processing approximates the rank of the same regions in the video to be processed, so that the stability of the video after style migration processing can be improved.
It should be understood that image style migration processing refers to fusing the image content of a content image A, which is an image with a style migration need, with the image style of a style image B, thereby generating a composite image C that has the content of image A and the style of image B; the composite image C may also be referred to as a fused image C.
The style image may refer to the reference image for the style migration processing, and the style of an image may include the texture characteristics of the image and the artistic expression form of the image; for example, the artistic expression form of the image may include styles such as cartoon, oil painting, watercolor, and ink wash. The content image may refer to the image requiring style migration, and the content of an image may refer to the semantic information in the image, that is, it may include high-frequency information, low-frequency information, and the like in the content image.
In one possible implementation, the first low-rank matrix is derived based on the N frames of sample content images and the optical flow information. For example, optical flow information may be calculated between adjacent image frames in the N frames of sample content images, where the optical flow information is used to represent the motion information of corresponding pixel points between adjacent frame images; mask information may then be obtained according to the optical flow information, where the mask information may be used to represent the changed area between two consecutive frames obtained from the optical flow information. Further, according to the optical flow information and the mask information, the N frames of sample content images are mapped onto a fixed frame of image, the N frames of mapped sample content images are each turned into a vector, and the vectors are combined into a matrix by columns; this matrix is the first low-rank matrix. Similarly, the second low-rank matrix may be obtained based on the N frames of predicted composite images and the optical flow information: the N frames of predicted composite images are mapped onto a fixed frame of image according to the optical flow information and the mask information, the N frames of mapped predicted composite images are each turned into a vector, and the vectors are combined into a matrix by columns; this matrix is the second low-rank matrix. The optical flow information is used to represent the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
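As an illustrative sketch only (the patent does not specify an implementation), the following code shows one way the mapping and matrix construction described above could be realized in PyTorch; the backward-warping details, the function names, and the use of the nuclear norm as a differentiable surrogate for the rank are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_to_reference(frames, flows, masks):
    """Map each frame onto a fixed reference frame by backward warping.

    frames: (N, C, H, W); flows: (N, 2, H, W) per-pixel (dx, dy) offsets towards
    the reference frame; masks: (N, 1, H, W), 1 where the flow is reliable
    (i.e. not a changed or occluded area).
    """
    n, c, h, w = frames.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()              # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flows                       # shifted sampling positions
    grid_x = coords[:, 0] / (w - 1) * 2 - 1                  # normalize to [-1, 1]
    grid_y = coords[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)             # (N, H, W, 2)
    warped = F.grid_sample(frames, grid, align_corners=True)
    return warped * masks                                    # keep only reliable regions

def frames_to_matrix(frames, flows, masks):
    """Turn each warped frame into a vector and combine the vectors by columns."""
    warped = warp_to_reference(frames, flows, masks)
    return warped.flatten(start_dim=1).t()                   # (C*H*W, N)

def low_rank_loss(content_frames, predicted_frames, flows, masks):
    """Compare the first low-rank matrix (content frames) with the second one
    (predicted composite frames); the nuclear norm is used here as a
    differentiable stand-in for the rank."""
    m_content = frames_to_matrix(content_frames, flows, masks)
    m_pred = frames_to_matrix(predicted_frames, flows, masks)
    return (torch.linalg.matrix_norm(m_pred, ord="nuc")
            - torch.linalg.matrix_norm(m_content, ord="nuc")).abs()
```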
With reference to the first aspect, in certain implementation manners of the first aspect, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample composite image and a second sample composite image, where the first sample composite image is an image obtained by performing image style migration processing on the N-frame sample content image through a first model, the second sample composite image is an image obtained by performing image style migration processing on the N-frame sample content image through a second model, the first model and the second model are image style migration models trained in advance according to the sample style image, the second model includes an optical flow module, and the first model does not include the optical flow module, and the optical flow module is used to determine the optical flow information.
In the embodiment of the application, the purpose of introducing the residual loss function when training the target style migration model is to enable the neural network model to learn, during training, the difference between the composite images output by the style migration model that includes the optical flow module and the style migration model that does not include the optical flow module, so that the stability of the video after style migration processing obtained by the target style migration model can be improved.
It should be appreciated that the difference between the first and second sample composite images may refer to the difference between the pixel values corresponding to the first and second sample composite images.
In one possible implementation, the first model and the second model may employ the same sample content image and sample style image during the training phase; for example, the first model and the second model may refer to the same model during the training phase; however, the second model also needs to calculate optical flow information between the multi-frame sample content images during the test phase; the first model does not need to calculate optical flow information between the multiple frames of images.
With reference to the first aspect, in some implementations of the first aspect, the first model and the second model are teacher models trained in advance, and the target style migration model refers to a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
In one possible implementation manner, the target style migration model may refer to a target student model, and when the target student model is trained, a student model to be trained may be trained according to a first teacher model (excluding an optical flow module) which is trained in advance, a second teacher model (including an optical flow module) which is trained in advance, and a basic model which is trained in advance, so as to obtain the target student model; the student model to be trained, the pre-trained basic model and the target student model have the same network structure, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained base model may be a style migration model that is pre-trained with a perceptual loss function and that does not include an optical flow module in the test stage; alternatively, the pre-trained base model may refer to a style migration model that is pre-trained with a perceptual loss function and an optical flow loss function and that does not include an optical flow module in the test stage. The perceptual loss function is used to represent the content loss between the composite image and the content image and the style loss between the composite image and the style image; the optical flow loss function is used to represent the difference between corresponding pixels of the composite images of adjacent frames.
In one possible implementation manner, in the process of training the student model to be trained, the residual loss function makes the difference between the migration results (also called composite images) output by the student model to be trained and the pre-trained base model continuously approximate the difference between the migration results output by the second model and the first model.
In the embodiment of the application, the target style migration model may be a target student model. A knowledge distillation method of teacher-student model learning is adopted, so that the difference between the style migration results output by the student model to be trained and the pre-trained base model continuously approximates the difference between the style migration results output by the teacher model that includes the optical flow module and the teacher model that does not include the optical flow module. This training method can effectively avoid the ghosting (double-image) phenomenon caused by inconsistent styles between the teacher models and the student model.
With reference to the first aspect, in certain implementations of the first aspect, the residual loss function is obtained according to the following equation:

$$L_{res} = \sum_{i} \left\| \left( N_S(x_i) - \tilde{N}_S(x_i) \right) - \left( N_T(x_i) - \tilde{N}_T(x_i) \right) \right\|$$

where $L_{res}$ represents the residual loss function; $N_T$ represents the second model; $\tilde{N}_T$ represents the first model; $N_S$ represents the student model to be trained; $\tilde{N}_S$ represents the pre-trained base model, which has the same network structure as the student model to be trained; and $x_i$ represents the i-th frame sample content image included in the sample video, where i is a positive integer.
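A minimal sketch of how the residual loss defined above could be evaluated, assuming each model is a callable that maps a content frame to a composite frame; the mean-squared distance and the function names are assumptions, since the patent only requires the student's residual to approach the teachers' residual.

```python
import torch

def residual_loss(n_s, n_s0, n_t, n_t0, sample_frames):
    """n_s: student model to be trained; n_s0: pre-trained base model;
    n_t: second model (with optical flow module); n_t0: first model (without).
    sample_frames: sample content images x_i of the sample video."""
    loss = 0.0
    for x_i in sample_frames:
        student_residual = n_s(x_i) - n_s0(x_i)
        with torch.no_grad():                                 # teacher models are fixed
            teacher_residual = n_t(x_i) - n_t0(x_i)
        loss = loss + torch.mean((student_residual - teacher_residual) ** 2)
    return loss
```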
With reference to the first aspect, in certain implementations of the first aspect, the image loss function further includes a perceptual loss function, wherein the perceptual loss function includes a content loss representing an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto and a style loss representing an image style difference between the N-frame predicted composite image and the sample style image.
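For illustration, a common realization of such a perceptual loss uses feature maps from a fixed, pre-trained feature network (e.g. VGG), with the style loss computed on Gram matrices; the patent does not fix these choices, so the sketch below is an assumption.

```python
import torch

def gram_matrix(feat):                          # feat: (C, H, W) feature map
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

def perceptual_loss(feat_pred, feat_content, feat_style):
    """Feature maps of the predicted composite image, the sample content image,
    and the sample style image, extracted by the same fixed feature network."""
    content_loss = torch.mean((feat_pred - feat_content) ** 2)
    style_loss = torch.mean((gram_matrix(feat_pred) - gram_matrix(feat_style)) ** 2)
    return content_loss + style_loss
```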
With reference to the first aspect, in certain implementations of the first aspect, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
With reference to the first aspect, in certain implementations of the first aspect, the parameters of the target style migration model are obtained by iterating a back propagation algorithm a plurality of times based on the image loss function.
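The two implementation manners above can be combined into a single training step. The sketch below (optimizer handling, weight values, and function names are assumptions) shows the image loss being formed as a weighted sum of the low-rank, residual, and perceptual losses and then driving repeated back-propagation updates.

```python
def train(model, optimizer, batches, loss_fns, w_lowrank=1.0, w_res=1.0, w_perc=1.0):
    """loss_fns: dict of callables 'lowrank', 'res', 'perc' evaluated per batch."""
    for batch in batches:                                  # iterate multiple times
        l_lowrank = loss_fns["lowrank"](model, batch)
        l_res = loss_fns["res"](model, batch)
        l_perc = loss_fns["perc"](model, batch)
        image_loss = w_lowrank * l_lowrank + w_res * l_res + w_perc * l_perc
        optimizer.zero_grad()
        image_loss.backward()                              # back propagation
        optimizer.step()                                   # update model parameters
```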
In a second aspect, a method of video style migration includes: acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthesized images; obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images,
the parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
It should be noted that, the image style migration refers to fusing the image content in a content image a with the image style of a style image B, so as to generate a composite image C having the image content a and the image style B; wherein, the style in the image can include information such as texture characteristics of the image; the content in the image may refer to semantic information in the image, i.e. may include high frequency information, low frequency information, etc. in the content image.
It should be appreciated that, for a matrix of multi-frame images, a low rank matrix may be used to represent areas that are all present in the N-frame images and are not motion boundaries. The sparse matrix may be used to represent areas of intermittent appearance in the N frames of images; for example, the sparse matrix may refer to an area that newly appears or disappears at the image boundary due to camera movement, or a boundary area of a moving object.
In the embodiment of the application, a low-rank loss function is introduced when training the target style migration model used for video style migration processing, so that regions which appear in multiple adjacent frames of the video to be processed and are not motion boundaries remain consistent after the style migration processing; that is, the rank of such regions in the video after style migration processing approximates the rank of the same regions in the video to be processed, so that the stability of the video after style migration processing can be improved.
On the other hand, in the process of performing style migration processing on the video to be processed, the target style migration model provided by the embodiment of the application does not need to calculate optical flow information among multiple frames of images included in the video to be processed, so that the target style migration model provided by the embodiment of the application can improve stability, shorten the time of style migration processing of the model and improve the operation efficiency of the target style migration model.
In one possible implementation, the video to be processed may be a video captured by the electronic device through a camera, or the video to be processed may also be a video obtained from inside the electronic device (for example, a video stored in an album of the electronic device, or a video obtained by the electronic device from the cloud).
It should be understood that the video to be processed may be a video with style migration requirements, and the present application is not limited to any source of the video to be processed.
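As described, the target style migration model needs no optical flow at test time; a minimal sketch of the per-frame inference flow (the helper names are assumptions) could look as follows.

```python
import torch

@torch.no_grad()
def stylize_video(target_model, frames):
    """frames: the N (N >= 2) to-be-processed content frames of the video."""
    target_model.eval()
    stylized_frames = [target_model(frame) for frame in frames]   # N composite frames
    return stylized_frames        # reassembled into the style-migrated output video
```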
With reference to the second aspect, in certain implementations of the second aspect, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample composite image and a second sample composite image, where the first sample composite image is an image obtained by performing image style migration processing on the N-frame sample content image through a first model, the second sample composite image is an image obtained by performing image style migration processing on the N-frame sample content image through a second model, the first model and the second model are image style migration models trained in advance according to the sample style image, and the second model includes an optical flow module, and the first model does not include the optical flow module, and the optical flow module is used to determine the optical flow information.
In the embodiment of the application, the purpose of introducing the residual loss function when training the target style migration model is to enable the neural network model to learn, during training, the difference between the composite images output by the style migration model that includes the optical flow module and the style migration model that does not include the optical flow module, so that the stability of the video after style migration processing obtained by the target style migration model can be improved.
It should be appreciated that the difference between the first and second sample composite images may refer to the difference between the pixel values corresponding to the first and second sample composite images. In one possible implementation, the first model and the second model may employ the same sample content image and sample style image during the training phase; for example, the first model and the second model may refer to the same model during the training phase; however, the second model also needs to calculate optical flow information between the multi-frame sample content images during the test phase; the first model does not need to calculate optical flow information between the multiple frames of images.
With reference to the second aspect, in some implementations of the second aspect, the first model and the second model are teacher models trained in advance, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
It should be noted that, the network structures of the student model and the target student model may be the same, that is, the student model may refer to a style migration model that is trained in advance and does not need to input optical flow information in a test stage; the target student model is a model which is further trained by the residual loss function and the low-rank loss function on the basis of the student model.
In one possible implementation, the pre-trained student model may be a student model pre-trained with a perceptual loss function, where the perceptual loss function is used to represent the effect of video stylization, that is, to represent the content difference between the sample composite image and the sample content image and the style difference between the sample composite image and the sample style image.
In one possible implementation, the pre-trained student model may be a student model pre-trained with a perceptual loss function and an optical flow loss function, where the optical flow loss function is used to represent the difference between corresponding pixels of the composite images of adjacent frames.
With reference to the second aspect, in certain implementations of the second aspect, the residual loss function is obtained according to the following equation:

$$L_{res} = \sum_{i} \left\| \left( N_S(x_i) - \tilde{N}_S(x_i) \right) - \left( N_T(x_i) - \tilde{N}_T(x_i) \right) \right\|$$

where $L_{res}$ represents the residual loss function; $N_T$ represents the second model; $\tilde{N}_T$ represents the first model; $N_S$ represents the student model to be trained; $\tilde{N}_S$ represents the pre-trained base model, which has the same network structure as the student model to be trained; and $x_i$ represents the i-th frame sample content image included in the sample video, where i is a positive integer.
In one possible implementation manner, the target style migration model may refer to a target student model, and when the target student model is trained, a student model to be trained may be trained according to a first teacher model (excluding an optical flow module) which is trained in advance, a second teacher model (including an optical flow module) which is trained in advance, and a basic model which is trained in advance, so as to obtain the target student model; the student model to be trained, the pre-trained basic model and the target student model have the same network structure, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained base model may be a style migration model that is pre-trained with a perceptual loss function and that does not include an optical flow module in the test stage; alternatively, the pre-trained base model may refer to a style migration model that is pre-trained with a perceptual loss function and an optical flow loss function and that does not include an optical flow module in the test stage. The perceptual loss function is used to represent the content loss between the composite image and the content image and the style loss between the composite image and the style image; the optical flow loss function is used to represent the difference between corresponding pixels of the composite images of adjacent frames.
In one possible implementation manner, in the process of training the student model to be trained, the residual loss function makes the difference between the migration results (also called composite images) output by the student model to be trained and the pre-trained base model continuously approximate the difference between the migration results output by the second model and the first model.
In the embodiment of the application, the target style migration model may be a target student model. A knowledge distillation method of teacher-student model learning is adopted, so that the difference between the style migration results output by the student model to be trained and the pre-trained base model continuously approximates the difference between the style migration results output by the teacher model that includes the optical flow module and the teacher model that does not include the optical flow module. This training method can effectively avoid the ghosting (double-image) phenomenon caused by inconsistent styles between the teacher models and the student model.
With reference to the second aspect, in certain implementations of the second aspect, the image loss function further includes a perceptual loss function, wherein the perceptual loss function includes a content loss representing an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto and a style loss representing an image style difference between the N-frame predicted composite image and the sample style image.
With reference to the second aspect, in certain implementations of the second aspect, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
With reference to the second aspect, in some implementations of the second aspect, the parameters of the target style migration model are obtained by iterating a back propagation algorithm multiple times based on the image loss function.
In a third aspect, a training device for a style migration model is provided, including: the training device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring training data, the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; the processing unit is used for performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images; and determining parameters of the neural network model according to an image loss function between the N-frame sample content image and the N-frame prediction synthesized image, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N-frame sample content image.
In a possible implementation manner, the training device includes a functional unit/module further configured to perform the method of the first aspect and any implementation manner of the first aspect.
It should be appreciated that the extensions, definitions, explanations and illustrations of the relevant content in the first aspect described above also apply to the same content in the third aspect.
In a fourth aspect, an apparatus for video style migration is provided, including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed, the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; the processing unit is used for carrying out image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthesized images; obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images,
the parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
In a possible implementation manner, the above apparatus includes a functional unit/module further configured to perform the method of the second aspect and any implementation manner of the second aspect.
It should be appreciated that the extensions, limitations, explanations and illustrations of the relevant content in the second aspect described above also apply to the same content in the fourth aspect.
In a fifth aspect, a training device for a style migration model is provided, including: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to, when executed, perform: acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images; and determining parameters of the neural network model according to an image loss function between the N-frame sample content image and the N-frame prediction synthesized image, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N-frame sample content image.
In a possible implementation manner, the training device includes a processor, and the processor is further configured to perform the method in the first aspect and any implementation manner of the first aspect.
It should be appreciated that the extensions, limitations, explanations and illustrations of the relevant content in the first aspect described above also apply to the same content in the fifth aspect.
In a sixth aspect, an apparatus for video style migration is provided, including: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to, when executed, perform: acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthesized images; and obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images, wherein parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low rank loss function, the low rank loss function is used for representing the difference between a first low rank matrix and a second low rank matrix, the first low rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low rank matrix is obtained based on the N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
In a possible implementation manner, the processor included in the above apparatus is further configured to perform the method in the second aspect and any implementation manner of the second aspect.
It should be appreciated that the extensions, limitations, explanations and illustrations of the relevant content in the second aspect described above also apply to the same content in the sixth aspect.
In a seventh aspect, a computer-readable medium is provided. The computer-readable medium stores program code for execution by a device, and the program code includes instructions for performing the method in any one of the first aspect, the second aspect, or the implementation manners of the first aspect and the second aspect.
In an eighth aspect, a computer program product including instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in any one of the first aspect, the second aspect, or the implementation manners of the first aspect and the second aspect.
In a ninth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored on a memory to perform the method in any one of the first aspect, the second aspect, or the implementation manners of the first aspect and the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory; when the instructions are executed, the processor is configured to perform the method in any one of the first aspect, the second aspect, or the implementation manners of the first aspect and the second aspect.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence subject framework provided by an embodiment of the present application;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 3 is a system architecture provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
FIG. 6 is a system architecture provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of a training method of a style migration model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a training process of a style migration model provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram of a method for video style migration provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a training phase and a testing phase provided by an embodiment of the present application;
FIG. 11 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application;
FIG. 12 is a schematic block diagram of a training apparatus for a style migration model provided by an embodiment of the present application;
FIG. 13 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application;
FIG. 14 is a schematic block diagram of a training apparatus for a style migration model provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application; it will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
FIG. 1 illustrates a schematic diagram of an artificial intelligence framework that describes the overall workflow of an artificial intelligence system, applicable to general artificial intelligence field requirements.
The above-described artificial intelligence topic framework 100 is described in detail below from two dimensions, the "Smart information chain" (horizontal axis) and the "information technology (information technology, IT) value chain" (vertical axis).
The "intelligent information chain" reflects a list of processes from the acquisition of data to the processing. For example, there may be general procedures of intelligent information awareness, intelligent information representation and formation, intelligent reasoning, intelligent decision making, intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" gel process.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry from the underlying infrastructure of personal intelligence, information (provisioning and processing technology implementation), to the industrial ecological process of the system.
(1) Infrastructure 110
The infrastructure can provide computing capability support for the artificial intelligence system, achieve communication with the outside world, and achieve support through the base platform.
The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by the smart chip.
The smart chip may be a hardware acceleration chip such as a central processing unit (central processing unit, CPU), a neural network processor (neural-network processing unit, NPU), a graphics processor (graphics processing unit, GPU), an application-specific integrated circuit (application specific integrated circuit, ASIC), or a field programmable gate array (field programmable gate array, FPGA).
The basic platform of the infrastructure can comprise a distributed computing framework, network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection network and the like.
For example, for the infrastructure, data may be obtained through sensors and external communication and then provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data 120
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to internet of things data of traditional equipment, wherein the data comprise service data of an existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing 130
Such data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) Generic capabilities 140
After the data is processed as mentioned above, further general capabilities may be formed according to the result of the data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application 150
The intelligent products and industry applications refer to products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, which turns intelligent information decision-making into products and realizes practical deployment. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, automatic driving, safe city, intelligent terminals, and the like.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application.
As shown in fig. 2, the video style migration method according to the embodiment of the present application may be applied to an intelligent terminal; for example, a to-be-processed video captured by a camera of the intelligent terminal, or a to-be-processed video stored in an album of the intelligent terminal, is input into the target style migration model provided by the embodiment of the application, so as to obtain the video after style migration processing. By adopting the target style migration model provided by the embodiment of the application, the stability of the video after style migration processing can be ensured, that is, the fluency of the video obtained after style migration processing is ensured.
In an example, the method for migrating video styles provided by the embodiment of the application can be applied to offline scenes.
For example, the video to be processed is acquired, and is input into the target style migration model, so that the video after style migration processing, namely the video with stable style is obtained.
In one example, the method for migrating video styles provided by the embodiment of the application can be applied to an online scene.
For example, a video recorded in real time by an intelligent terminal is acquired and input into the target style migration model, so as to obtain the video after style migration processing output in real time; this can be used, for example, in scenarios such as real-time display at an exhibition booth.
For example, when an online video call is performed through the intelligent terminal, the user video shot by the camera in real time can be input into the target style migration model, so as to obtain the output video after style migration processing; in this way, a stable stylized video can be displayed to the other party in real time, which adds interest to the call.
The target style migration model is a pre-trained model obtained by training through the training method of the style migration model provided by the embodiment of the application.
Illustratively, the above intelligent terminal may be mobile or fixed; for example, the intelligent terminal may be a mobile phone having an image processing function, a tablet personal computer (tablet personal computer, TPC), a media player, a smart television, a laptop computer (laptop computer, LC), a personal digital assistant (personal digital assistant, PDA), a personal computer (personal computer, PC), a camera, a video camera, a smart watch, a wearable device (wearable device, WD), an autonomous vehicle, or the like, which is not limited in the embodiment of the present application.
It should be understood that the foregoing is illustrative of an application scenario, and is not intended to limit the application scenario of the present application in any way.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, the following description will first discuss the terms and concepts related to neural networks that may be involved in embodiments of the present application.
1. Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is an activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert an input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining a plurality of such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of a previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
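A tiny numerical illustration of the single neural unit above, with the sigmoid chosen as the example activation function.

```python
import numpy as np

def neural_unit(x, w, b):
    """x: inputs x_s; w: weights W_s; b: bias. Returns f(sum_s W_s * x_s + b)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))   # sigmoid activation f
```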
2. Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks with multiple hidden layers. According to the positions of different layers, the layers inside a DNN can be divided into three types: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also relatively large. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
In summary, the coefficient from the kth neuron of the (L-1)th layer to the jth neuron of the Lth layer is defined as $W^{L}_{jk}$.
It should be noted that the input layer is devoid of W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the greater the "capacity", meaning that it can accomplish more complex learning tasks. The process of training the deep neural network, i.e. learning the weight matrix, has the final objective of obtaining a weight matrix (a weight matrix formed by a number of layers of vectors W) for all layers of the trained deep neural network.
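A short sketch of the layer-by-layer computation described above, where each hidden layer applies $\vec{y} = \alpha(W\vec{x} + \vec{b})$ with its own weight matrix and offset vector (the activation choice here is an assumption).

```python
import numpy as np

def dnn_forward(x, weights, biases, activation=np.tanh):
    """weights[l], biases[l]: the weight matrix W and offset vector b of layer l+1
    (the input layer itself has no W parameters)."""
    h = x
    for w, b in zip(weights, biases):
        h = activation(w @ h + b)      # y = alpha(W x + b) for each layer
    return h
```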
3. Convolutional neural network
The convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer, which can be regarded as a filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
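As a brief illustration of weight sharing, the same convolution kernels are applied at every spatial position of the input, producing one feature plane per kernel (the framework and sizes below are arbitrary examples, not values from the application).

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
feature_maps = conv(torch.randn(1, 3, 64, 64))   # 16 feature planes, one per shared kernel
print(feature_maps.shape)                        # torch.Size([1, 16, 64, 64])
```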
4. Loss function
In the process of training the deep neural network, because the output of the deep neural network is expected to be as close to the value really wanted to be predicted as possible, the predicted value of the current network and the really wanted target value can be compared, and then the weight vector of each layer of the neural network is updated according to the difference condition between the predicted value and the really wanted target value (of course, an initialization process is usually carried out before the first update, namely, the pre-configuration parameters of each layer in the deep neural network); for example, if the predicted value of the network is high, the weight vector is adjusted to make it predicted lower, and the adjustment is continued until the deep neural network is able to predict the truly desired target value or a value very close to the truly desired target value. Thus, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which is a loss function (loss function) or an objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function is, the larger the difference is, and then the training of the deep neural network becomes a process of reducing the loss as much as possible.
5. Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the parameter in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller.
Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward propagation process dominated by the error loss, and aims to obtain the parameters of the optimal neural network model, such as the weight matrices.
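The two concepts above (comparing the predicted value with the target value through a loss function, and updating the weights by back propagation) can be sketched together as follows, assuming PyTorch; the network, data, and learning rate are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()  # measures the difference between prediction and target
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 4)        # input data (placeholder)
target = torch.randn(16, 1)   # the value really wanted to be predicted

for step in range(100):
    pred = model(x)                  # forward propagation
    loss = loss_fn(pred, target)     # the higher the loss, the larger the difference
    optimizer.zero_grad()
    loss.backward()                  # back propagation of the error loss information
    optimizer.step()                 # update weights so that the error loss converges
```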
6. Image style migration
Image style migration refers to fusing the image content in a content image a with the image style of a style image B to produce a composite image C having both the image content a and the image style B.
Illustratively, performing image style migration on the content image 1 according to the style image 1, so as to obtain a composite image 1, wherein the composite image 1 comprises the content in the content image 1 and the style in the style image 1; similarly, the image style migration is performed on the content image 1 according to the style image 2, and a composite image 2 can be obtained, wherein the composite image 2 includes the content in the content image 1 and the style in the style image 2.
The style image can refer to a reference image used for style migration, and the style of an image can comprise texture characteristics of the image and the artistic expression form of the image; for example, the artistic expression form of an image may include styles such as cartoon, oil painting, watercolor, ink painting, and the like; the content image may refer to an image requiring style migration, and the content of an image may refer to semantic information in the image, that is, it may include high-frequency information, low-frequency information, and the like in the content image.
7. Optical flow information
Optical flow (optical flow or optic flow) is used to represent the instantaneous velocity of the pixel motion of a spatially moving object in the observation imaging plane. It is a method that uses the change of pixels in the time domain of an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby calculates the motion information of the object between adjacent frames.
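Optical flow is typically used to align one frame with another through a warping operation (the mapping operation referred to later in this document). The following is an illustrative sketch assuming PyTorch and a backward flow field given in pixels; it is not the implementation used in the embodiment:

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp an image (B x C x H x W) with a backward optical flow field
    (B x 2 x H x W, in pixels): output(x) = image(x + flow(x))."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # 1 x 2 x H x W
    coords = grid + flow                                        # sampling positions
    # normalize coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)       # B x H x W x 2
    return F.grid_sample(image, grid_norm, align_corners=True)

prev = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: warping returns the same image
print(torch.allclose(warp(prev, flow), prev, atol=1e-5))
```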
8. Knowledge distillation
Knowledge distillation is a key technology for miniaturizing deep learning models so that they meet the deployment requirements of terminal devices. Compared with compression techniques such as quantization and sparsification, it can compress a model without requiring specific hardware support. The knowledge distillation technique adopts a teacher-student model learning strategy, where the teacher model may refer to a model with a large number of parameters, which generally cannot meet deployment requirements, while the student model has few parameters and can be deployed directly. By designing an effective knowledge distillation algorithm, the student model learns to imitate the behavior of the teacher model, and effective knowledge migration is performed, so that the student model can finally exhibit the same processing capability as the teacher model.
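A minimal sketch of the teacher-student learning strategy described above, assuming PyTorch; the two networks and the imitation loss here are placeholders chosen only to illustrate the idea that the student imitates the teacher's behavior, not the distillation scheme used in the embodiment:

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))   # large model (placeholder)
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 3, 3, padding=1))    # small, deployable model

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
x = torch.randn(4, 3, 64, 64)

with torch.no_grad():
    target = teacher(x)               # the teacher's behavior to be imitated
loss = nn.functional.mse_loss(student(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```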
Firstly, a system architecture of a video style migration method and a training method of a style migration model provided by the embodiment of the application is introduced.
Fig. 3 illustrates a system architecture 200 according to an embodiment of the present application.
As shown in the system architecture 200 of fig. 3, the data acquisition device 260 is used to acquire training data. For the training method of the style migration model in the embodiment of the present application, the target style migration model may be further trained by training data, that is, training data collected by the data collection device 260.
For example, in the embodiment of the present application, the training data for training the target style migration model may be an N-frame sample content image, a sample style image, and an N-frame sample composite image, where the N-frame sample composite image is an image obtained by performing image style migration processing on the N-frame sample content image according to the sample style image, and N is an integer greater than or equal to 2.
After the training data is collected, the data collection device 260 stores the training data in the database 230, and the training device 220 trains the target model/rule 201 (i.e., the target style migration model in the embodiment of the present application) according to the training data maintained in the database 230. The training device 220 inputs the training data into the target style migration model until the difference between the output data of the target style migration model being trained and the sample data meets a preset condition (e.g., the difference between the predicted data and the sample data is less than a certain threshold, or the difference between the predicted data and the sample data remains unchanged or no longer decreases), thereby completing the training of the target model/rule 201.
The output data may refer to N-frame predicted composite images output by the target style migration model; the sample data may refer to an N-frame sample composite image.
In the embodiment provided by the present application, the target model/rule 201 is obtained by training a target style migration model, and the target style migration model may be used for performing style migration processing on the video to be processed. It should be noted that, in practical applications, the training data maintained in the database 230 is not necessarily all acquired by the data acquisition device 260, but may be received from other devices.
It should be noted that, the training device 220 does not necessarily need to completely train the target model/rule 201 according to the training data maintained by the database 230, and it is also possible to acquire the training data from the cloud or other places to train the model, which should not be taken as a limitation of the embodiments of the present application.
It should also be noted that at least part of the training data maintained in the database 230 may also be used in the processing performed by the execution device 210.
The target model/rule 201 obtained by training according to the training device 220 may be applied to different systems or devices, such as the execution device 210 shown in fig. 3, where the execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, etc., and may also be a server or cloud terminal, etc.
In fig. 3, the execution device 210 is configured with an input/output (I/O) interface 212 for data interaction with an external device, and a user may input data to the I/O interface 212 through the client device 240, where, in the embodiment of the present application, the input data may include: the video to be processed input by the client device.
The preprocessing module 213 and the preprocessing module 214 are configured to perform preprocessing according to input data (such as video to be processed) received by the I/O interface 212. In the embodiment of the present application, the preprocessing module 213 and the preprocessing module 214 (or only one of them may be used) may be omitted, and the calculation module 211 may be directly used to process the input data.
In preprocessing input data by the execution device 210, or in performing processing related to computation or the like by the computation module 211 of the execution device 210, the execution device 210 may call data, codes or the like in the data storage system 250 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 250.
Finally, the I/O interface 212 returns the processing result, i.e., the video after style migration processing corresponding to the video to be processed, to the client device 240, so as to provide it to the user.
It should be noted that, the training device 220 may generate, according to different training data, a corresponding target model/rule 201 for different targets or different tasks, where the corresponding target model/rule 201 may be used to achieve the targets or complete the tasks, so as to provide the user with the desired result.
In the case shown in FIG. 3, in one case, the user may manually specify the input data, and this may be operated through an interface provided by the I/O interface 212.
In another case, the client device 240 may automatically send the input data to the I/O interface 212, and if the client device 240 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 240. The user may view the results output by the execution device 210 at the client device 240, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 240 may also be used as a data collection terminal to collect input data from the input I/O interface 212 and output results from the output I/O interface 212 as new sample data, and store the new sample data in the database 230. Of course, the input data input to the I/O interface 212 and the output result output from the I/O interface 212 as shown in the figure may be stored as new sample data in the database 230 directly by the I/O interface 212 instead of being collected by the client device 240.
It should be noted that fig. 3 is only a schematic diagram of a system architecture according to an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings is not limited in any way. For example, in FIG. 3, data storage system 250 is external memory to execution device 210, in other cases, data storage system 250 may be located within execution device 210.
As shown in fig. 3, the target model/rule 201 is obtained by training according to the training device 220, where the target model/rule 201 may be a target style migration model in the embodiment of the present application, and specifically, the target style migration model provided in the embodiment of the present application may be a deep neural network, a convolutional neural network, or may be a deep convolutional neural network.
The structure of the convolutional neural network is described in detail below with reference to fig. 4. As described in the basic concept introduction above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning architecture, in which multiple levels of learning are performed at different abstraction levels through machine learning algorithms. As a deep learning architecture, a convolutional neural network is a feed-forward artificial neural network in which individual neurons can respond to an image input thereto.
The network structure of the style migration model in the embodiment of the present application may be as shown in fig. 4. In fig. 4, the convolutional neural network 300 may include an input layer 310, a convolutional layer/pooling layer 320 (where the pooling layer is optional), and a neural network layer 330. The input layer 310 may acquire an image to be processed, and hand the acquired image to be processed over to the convolutional layer/pooling layer 320 and the subsequent neural network layer 330 for processing, so as to obtain the processing result of the image. The internal layer structure of the CNN 300 in fig. 4 is described in detail below.
Convolution layer/pooling layer 320:
the convolutional layer/pooling layer 320 shown in fig. 4 may include layers 321 to 326, as examples. For instance, in one implementation, layer 321 is a convolutional layer, layer 322 is a pooling layer, layer 323 is a convolutional layer, layer 324 is a pooling layer, layer 325 is a convolutional layer, and layer 326 is a pooling layer; in another implementation, layers 321 and 322 are convolutional layers, layer 323 is a pooling layer, layers 324 and 325 are convolutional layers, and layer 326 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal principles of operation of one convolution layer will be described below using convolution layer 321 as an example.
The convolutional layer 321 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually processed over the input image in the horizontal direction, pixel by pixel (or two pixels by two pixels, etc., depending on the value of the stride), so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolution output of a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple weight matrices of the same shape, are applied. The output of each weight matrix is stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above.
Different weight matrices may be used to extract different features in the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The sizes (rows × columns) of these weight matrices are the same, the sizes of the convolutional feature maps extracted by weight matrices of the same size are also the same, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network 300 can perform correct prediction.
When convolutional neural network 300 has multiple convolutional layers, the initial convolutional layer (e.g., 321) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 300 increases, features extracted by the later convolutional layers (e.g., 326) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In layers 321 to 326 illustrated at 320 in FIG. 4, this may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, used to sample the input image so as to obtain an image of smaller size. The average pooling operator may calculate the average of the pixel values within a particular range of the image as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
Neural network layer 330:
after processing by convolutional layer/pooling layer 320, convolutional neural network 300 is not yet sufficient to output the desired output information. Because, as previously described, the convolution layer/pooling layer 320 will only extract features and reduce the parameters imposed by the input image. However, to generate the final output information (the required class information or other relevant information), convolutional neural network 300 needs to utilize neural network layer 330 to generate an output of one or a set of the required number of classes. Thus, multiple hidden layers (331, 332 to 33n as shown in fig. 4) and an output layer 340 may be included in the neural network layer 330, where parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type, for example, the task type may include image recognition, image classification, image detection, and image super-resolution reconstruction.
After the hidden layers in the neural network layer 330, the final layer of the overall convolutional neural network 300 is the output layer 340. The output layer 340 has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the overall convolutional neural network 300 (e.g., propagation from 310 to 340 in fig. 4) is completed, the backward propagation (e.g., propagation from 340 to 310 in fig. 4) begins to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 300 and the error between the result output by the convolutional neural network 300 through the output layer and the ideal result.
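The layer arrangement of fig. 4 (input layer, alternating convolutional and pooling layers, hidden fully connected layers, and an output layer) can be sketched as follows, assuming PyTorch; this is only an illustrative stack with arbitrary channel numbers and class count, not the actual structure of the style migration model:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # convolutional layer / pooling layer (320): convolution followed by pooling
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # neural network layer (330): hidden layers and output layer (340)
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
    nn.Linear(128, 10),        # output layer, e.g., 10 classes
)

x = torch.randn(1, 3, 64, 64)  # image obtained by the input layer (310)
print(cnn(x).shape)            # 1 x 10
```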
It should be noted that, the convolutional neural network shown in fig. 4 is only used as a structural example of the target style migration model in the embodiment of the present application, and in a specific application, the style migration model adopted in the video style migration method in the embodiment of the present application may also exist in the form of other network models.
Fig. 5 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural-network processor 400 (NPU). The chip may be provided in an execution device 210 as shown in fig. 3 to perform the calculation of the calculation module 211. The chip may also be provided in a training device 220 as shown in fig. 3 for completing training work of the training device 220 and outputting the target model/rule 201. The algorithm of each layer in the convolutional neural network as shown in fig. 4 can be implemented in a chip as shown in fig. 5.
The NPU 400 is mounted as a coprocessor on a main central processing unit (central processing unit, CPU), and the main CPU distributes tasks. The core part of the NPU 400 is the arithmetic circuit 403; the controller 404 controls the arithmetic circuit 403 to extract data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 403 includes a plurality of processing units (PEs) inside. In some implementations, the operation circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 403 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit 403 takes the data corresponding to the matrix B from the weight memory 402 and buffers the data on each PE in the arithmetic circuit 403. The arithmetic circuit 403 takes the matrix a data from the input memory 401 and performs matrix operation with the matrix B, and the partial result or the final result of the matrix obtained is stored in the accumulator 408 (accumulator).
The vector calculation unit 407 may further process the output of the operation circuit 403, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 407 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 407 can store the vector of processed outputs to the unified memory 406. For example, the vector calculation unit 407 may apply a nonlinear function to an output of the operation circuit 403, for example, a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 407 generates a normalized value, a combined value, or both.
In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 403, for example for use in subsequent layers in a neural network.
The unified memory 406 is used to store input data and output data.
A storage unit access controller 405 (direct memory access controller, DMAC) transfers the input data in the external memory into the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
A bus interface unit 410 (bus interface unit, BIU) is used for interaction among the main CPU, the DMAC, and the instruction fetch memory 409 through a bus.
An instruction fetch memory 409 (instruction fetch buffer) connected to the controller 404 stores instructions for use by the controller 404.
And the controller 404 is used for calling the instruction cached in the instruction fetch memory 409 to realize the control of the working process of the operation accelerator.
Typically, the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch memory 409 are on-chip (On-Chip) memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
The operations of the layers in the convolutional neural network shown in fig. 4 may be performed by the operation circuit 403 or the vector calculation unit 407.
The execution device 210 in fig. 3 described above is capable of executing the steps of the video style migration method according to the embodiment of the present application, and the CNN model shown in fig. 4 and the chip shown in fig. 5 may also be used to execute the steps of the video style migration method according to the embodiment of the present application.
Fig. 6 illustrates a system architecture 500 provided by an embodiment of the present application. The system architecture includes a local device 520, a local device 530, and an execution device 510 and data storage system 550, wherein the local device 520 and the local device 530 are connected to the execution device 510 through a communication network.
The execution device 510 may be implemented by one or more servers. Alternatively, the execution device 510 may be used with other computing devices, such as: data storage, routers, load balancers, etc. The execution device 510 may be disposed on one physical site or distributed across multiple physical sites. The execution device 510 may use data in the data storage system 550 or invoke program code in the data storage system 550 to implement the method of video style migration of embodiments of the present application.
It should be noted that, the executing device 510 may also be referred to as a cloud device, and the executing device 510 may be deployed at the cloud.
Specifically, the execution device 510 may execute the following procedure:
acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images; and determining parameters of the neural network model according to an image loss function between the N-frame sample content image and the N-frame prediction synthesized image, wherein the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N-frame sample content image.
Alternatively, the execution device 510 may execute the following procedure:
acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2; performing image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthesized images; according to the N frames of synthesized images, obtaining a video after style migration processing corresponding to the video to be processed, wherein parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low rank loss function, the low rank loss function is used for representing differences between a first low rank matrix and a second low rank matrix, the first low rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing position differences of corresponding pixels between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
The user may operate respective user devices (e.g., local device 520 and local device 530) to interact with the execution device 510. Each local device may represent any computing device, such as a personal computer, computer workstation, smart phone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set top box, game console, etc.
The local device of each user may interact with the performing device 510 via a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the local device 520, 530 may obtain relevant parameters of the target style migration model from the execution device 510, deploy the target style migration model on the local device 520, 530, use the target style migration model to perform video style migration processing, and so on.
In another implementation, the target style migration model may be directly deployed on the executing device 510, where the executing device 510 obtains the video to be processed from the local device 520 and the local device 530, and performs style migration processing on the video to be processed according to the target style migration model, and so on.
At present, video style migration models for stabilizing stylized videos by utilizing optical flow information mainly comprise two types, wherein the first type is to use optical flow in a training process of the style migration model, but the optical flow information is not introduced in a test stage; the second is to blend the optical flow module into the structure of the style migration model; however, with the first method, the operation efficiency of the style migration model in the test stage can be ensured, but the stability of the obtained video after the style migration processing is poor; the second method can ensure the stability of the output video after the style migration processing, but because the optical flow module is introduced, the optical flow information between the image frames included in the video needs to be calculated in the test stage, so that the operation efficiency of the style migration model in the test stage cannot be ensured.
In view of this, the embodiments of the application provide a training method of a style migration model and a video style migration method, in which a low-rank loss function is introduced in the process of training the style migration model for video; by learning low-rank information, the stability of the video after style migration can be synchronized with that of the original video, so that the stability of the video after style migration processing obtained by the target style migration model can be improved. In addition, the style migration model provided by the embodiment of the application does not need to calculate the optical flow information among the multiple frames of images included in the video in the test stage, that is, in the process of performing style migration processing on the video to be processed; therefore, the target style migration model provided by the embodiment of the application can improve stability, shorten the time of the style migration processing of the model, and improve the operation efficiency of the target style migration model.
FIG. 7 shows a schematic flow chart of a method 600 for training a style migration model, which may be performed by an apparatus capable of image style migration, provided by an embodiment of the present application; for example, the training method may be performed by the execution device 510 in fig. 6, or may be performed by the local device 520. The training method 600 includes S610 to S630, and these steps are described in detail below, respectively.
S610, training data is acquired.
The training data may include N frames of sample content images, sample style images, and N frames of composite images, where the N frames of composite images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2.
Illustratively, the N-frame sample content image may refer to N-frame consecutive sample content images included in the sample video; the N-frame composite image may refer to N-frame continuous composite images included in a video obtained by performing style migration processing on a sample video according to a sample style image.
It should be understood that only the content in the content image and the style in the style image need be considered for the style migration process of the single frame image, i.e., the image style migration; however, for the video, since the video comprises multiple continuous frames, the style migration of the video needs to consider not only the stylized effect of the images, but also the stability among the multiple frames of images; the smoothness of the video after style migration processing needs to be ensured, and noise such as screen flash and artifacts is avoided.
The N frames of sample content images refer to N adjacent frames of images in the video; the N frames of composite images refer to the images corresponding to the N frames of sample content images.
S620, performing image style migration processing on the N frames of sample content images according to the sample style images through the neural network model to obtain N frames of predicted composite images.
For example, the N frames of sample content images included in the sample video and the sample style image may be input into the neural network model.
For example, N frames of sample content images may be input into a neural network model one frame by one frame respectively, and the neural network model may perform image style migration processing on the one frame of sample content image according to the sample style image, so as to obtain a frame of predicted composite image corresponding to the one frame of sample content image; after the above process is performed N times, N frame prediction synthesis images corresponding to the N frame sample content images can be obtained.
For example, multiple frames of the N frames of sample content images may alternatively be input into the neural network model at once, and the neural network model may perform image style migration processing on the multiple frames of sample content images according to the sample style image.
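The frame-by-frame variant described above can be sketched as follows; here the neural network model is assumed to be a callable that takes one content image and the style image and returns one predicted composite image (the function name and calling convention are placeholders):

```python
def stylize_frames(model, content_frames, style_image):
    """content_frames: list of N tensors (C x H x W); style_image: tensor (C x H x W).
    Feeds the sample content images into the model one frame at a time and collects
    the corresponding predicted composite images."""
    predicted = []
    for frame in content_frames:
        # one frame of sample content image -> one frame of predicted composite image
        out = model(frame.unsqueeze(0), style_image.unsqueeze(0))
        predicted.append(out.squeeze(0))
    return predicted  # N frames of predicted composite images
```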
S630, determining parameters of the neural network model according to an image loss function between the N-frame sample content image and the N-frame prediction synthesized image.
The image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
It should be noted that, in the matrix constituted by the plurality of frame images, a low rank matrix may be used to represent an area where a motion boundary does not occur in all of the N frame images. The sparse matrix may be used to represent areas of intermittent appearance in the N frames of images; for example, the sparse matrix may refer to an area that newly appears or disappears at the image boundary due to camera movement, or a boundary area of a moving object.
For example, the N-frame sample content image may refer to an image in which a user is moving, and a low rank matrix constituted by the N-frame sample content image may be used to represent an area in which all of the N-frame sample content image appears and is not a moving boundary; for example, a low rank matrix of N frames of sample content may be used to represent background areas where no motion has occurred, or areas where users are present in all of the N frames of sample content images and are not motion boundaries.
Illustratively, assume that a video includes consecutive 5 frames of images, i.e., consecutive 5 frames of sample content images; obtaining a 5-frame style migration result, namely a 5-frame sample synthesized image, of the style migration processing of the 5-frame sample content image; the low rank matrix may be calculated as follows:
step 1, calculating optical flow information between image frames according to 5 frames of images in a video, namely 5 frames of sample content images;
step 2, calculating mask information according to the optical flow information, wherein the mask information can be used for representing a change area in two continuous frames of images obtained according to the optical flow information;
step 3, calculating the low-rank part after the 1st, 2nd, 4th and 5th frame images are aligned with the 3rd frame image according to the optical flow information and the mask information, and setting the sparse part to 0;
and 4, respectively generating vectors from the aligned 5 frames of images and combining the vectors into a matrix according to columns, wherein the matrix can be a low-rank matrix.
It should be understood that in calculating the low-rank loss function, the objective is to approximate the rank of the low-rank portion of the image matrix composed of 5 frames of sample content images to the rank of the low-rank portion of the image matrix composed of 5 frames of sample composite images; the rank of the low rank part can be continuously optimized by optimizing a nuclear norm, wherein the nuclear norm is obtained by carrying out singular value decomposition on a matrix.
Illustratively, consider K consecutive frames of images x_t, the optical flow information f_t corresponding thereto (which may include, for example, forward optical flow information and backward optical flow information), and the composite images N_s(x_t) output by the student model.
First, the K frames of composite images may be mapped to a fixed frame, typically the τ = [K/2]-th frame, based on the forward optical flow information, the backward optical flow information, and the mask information; that is, after the t-th frame composite image is mapped to the τ-th frame, its low rank matrix can be expressed as:
R_t = M_{t-τ} ⊙ W[N_s(x_t), f_{t-τ}];
wherein M_{t-τ} is the mask information calculated from the forward optical flow information and the backward optical flow information of the K frames of images, and W[·] is used to represent the mapping (warp) operation.
According to step 4 above, the vectorized and column-wise combined matrix X can be obtained: X = [vec(R_0), ..., vec(R_K)]^T ∈ R^{K×L}, where L = H×W×3; K is used to represent the number of rows of the matrix X, i.e., the number of frames of images; L is used to represent the number of columns of the matrix X; H is used to represent the height of each frame of image; and W is used to represent the width of each frame of image.
Singular value decomposition is then performed on X to obtain the nuclear norm; the decomposition is X = UΣV^T, where the size of the matrix X is K×L, U ∈ R^{K×K}, V ∈ R^{L×L}, and the nuclear norm is ||X||_* = tr(Σ). Here tr is used to represent the trace of a matrix; for example, for an n×n matrix A, the sum of the elements on its main diagonal (the diagonal from the upper left to the lower right) is called the trace of the matrix A, denoted tr(A).
The low rank loss function is:
L = (||X_input||_* - ||X_s||_*)^2;
wherein X_input represents the vectorized matrix obtained from the input K frames of images, and X_s represents the vectorized matrix obtained from the K frames of composite images output by the target student network.
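Under the formulation above, the low rank loss can be sketched as follows, assuming PyTorch and assuming the aligned low-rank parts R_t of the input frames and of the composite frames have already been computed as in steps 1 to 4; the nuclear norm is obtained from the singular values of the stacked matrix:

```python
import torch

def nuclear_norm(aligned_frames):
    """aligned_frames: list of K aligned images R_t, each of shape H x W x 3.
    Vectorizes each frame, stacks the vectors into a matrix X (K x L, L = H*W*3),
    and returns the nuclear norm ||X||_* = sum of singular values."""
    X = torch.stack([r.reshape(-1) for r in aligned_frames], dim=0)
    return torch.linalg.svdvals(X).sum()

def low_rank_loss(aligned_inputs, aligned_outputs):
    # L = (||X_input||_* - ||X_s||_*)^2
    return (nuclear_norm(aligned_inputs) - nuclear_norm(aligned_outputs)) ** 2
```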
In the embodiment of the application, the aim of introducing the low-rank loss function when training the style migration model is to ensure that the areas which are not motion boundaries and appear in adjacent multi-frame content images in the original video remain the same after the style migration processing, namely, the ranks of the areas in the video after the style migration processing are approximate to the ranks of the areas in the original video, so that the stability of the video after the style migration processing can be improved.
Further, in an embodiment of the present application, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample composite image and a second sample composite image, where the first sample composite image is obtained by performing an image style migration process on the N frames of sample content images through a first model, the second sample composite image is obtained by performing an image style migration process on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model includes an optical flow module, and the first model does not include the optical flow module, and the optical flow module is used to determine optical flow information of the N frames of sample content images.
It should be noted that, the first model and the second model refer to a style migration model trained through the same style image, where the first model and the second model are different in that the first model does not include an optical flow module; the second model comprises an optical flow module; that is, the first model and the second model may employ the same sample content image and sample style image during training, for example, the first model and the second model may refer to the same model during the training stage; however, the second model also needs to calculate optical flow information between the multi-frame sample content images during the test phase; the first model does not need to calculate optical flow information between the multiple frames of images.
In the embodiment of the application, the aim of introducing the residual error loss function when training the target style migration model is to enable the neural network model to learn the difference of the synthesized image output by the style migration model comprising the optical flow module and the style migration model not comprising the optical flow module in the training process, so that the stability of the video after the style migration processing obtained by the target migration model can be improved.
Further, in the embodiment of the application, a teacher-student model learning strategy can be adopted to meet the deployment requirement of the mobile terminal, namely, the trained style migration model can be a target student model; in the training process, parameters of the student model can be updated through the image loss function, so that a target student model is obtained.
It should be noted that, the network structure of the student model is the same as that of the target student model, and the student model may refer to a pre-trained style migration model that does not need to input optical flow information in the test stage.
Optionally, in one possible implementation manner, the first model and the second model may be pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to a residual loss function and a knowledge distillation algorithm.
It should be understood that knowledge distillation refers to a key technology that enables deep learning models to be miniaturized to meet the deployment requirements of terminal devices. Compared with the compression technology such as quantization and sparsification, the compression model can be achieved without specific hardware support. The knowledge distillation technology adopts a teacher-student model learning strategy, wherein the teacher model can refer to a model with large parameters, and can not meet deployment requirements generally; and the student model has few parameters and can be directly deployed. Through designing an effective knowledge distillation algorithm, the student model learns to simulate the behavior of a teacher model, and effective knowledge migration is performed, so that the student model can finally show the same processing capacity as the teacher model.
In an embodiment of the application, knowledge distillation can be performed on a model without an optical flow module at test by employing a model including an optical flow module at test; in the style migration of the video, the stylized effect of the student model and the teacher model may not be completely the same due to the different structures and different training modes of the teacher model and the student model; if the student model is made to learn the output information of the teacher model directly at the pixel level, the output of the student model may appear as ghost or blurry. In the embodiment of the application, the target style migration model can be a target student model, the difference between the style migration results output by the student model to be trained and the pre-trained basic model is continuously approximate to the difference between the style migration results output by the teacher model comprising the optical flow module and the teacher model not comprising the optical flow module by adopting a knowledge distillation method for teacher-student model learning, and the double image phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided by adopting the training method.
In one possible implementation, the residual loss function is optionally derived from the following equation:
L_res = Σ_i ||(N_T(x_i) - N_T′(x_i)) - (N_S(x_i) - N_S′(x_i))||^2;
wherein L_res represents the residual loss function; N_T represents the second model; N_T′ represents the first model; N_S represents the student model to be trained; N_S′ represents the pre-trained basic model, which has the same network structure as the student model to be trained; x_i represents the i-th frame of sample content image included in the sample video, and i is a positive integer.
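A sketch of this residual distillation term is given below, assuming the composite images output by the four models for the same frame x_i have already been computed (the inputs each teacher actually requires, such as the warped previous composite image and the mask, are omitted here); the squared-error form is illustrative:

```python
import torch

def residual_loss(teacher_flow_out, teacher_noflow_out, student_out, base_out):
    """All arguments are composite images for the same frame x_i.
    Makes the difference (student - pre-trained base model) approximate the
    difference (teacher with optical flow module - teacher without it)."""
    teacher_residual = teacher_flow_out - teacher_noflow_out
    student_residual = student_out - base_out
    return torch.mean((student_residual - teacher_residual) ** 2)
```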
In one example, the target style migration model may refer to a target student model, and when the target student model is trained, a student model to be trained may be trained according to a first teacher model (excluding an optical flow module) which is trained in advance, a second teacher model (including an optical flow module) which is trained in advance, and a basic model which is trained in advance, so as to obtain the target student model; the student model to be trained, the pre-trained basic model and the target student model have the same network structure, and the student model to be trained is trained through the low-rank loss function, the residual loss function and the perception loss function, so that the target student model is obtained.
The pre-trained basic model may be a style migration model that is obtained by training a perceptual loss function in advance and does not include an optical flow module in a test stage; alternatively, the pre-trained style migration model may refer to a style migration model pre-trained by a perceptual loss function and an optical flow loss function that does not include an optical flow module during the test phase; the perceptual loss function is used to represent content loss between the composite image and the content image and style loss between the composite image and the style image; the optical flow loss function is used to represent the difference between corresponding pixels of the synthesized image of adjacent frames.
In one possible implementation manner, in the process of training the student model to be trained, the difference of migration results (also called composite images) output between the student model to be trained and the pre-trained basic model is enabled to be continuously approximate to the difference of migration results output between the second model and the first model through the residual loss function.
In the embodiment of the application, the target style migration model can be a target student model, the difference between the style migration results output by the student model to be trained and the pre-trained basic model is continuously approximate to the difference between the style migration results output by the teacher model comprising the optical flow module and the teacher model not comprising the optical flow module by adopting a knowledge distillation method for teacher-student model learning, and the double image phenomenon caused by the non-uniform styles of the teacher model and the student model can be effectively avoided by adopting the training method.
In one example, parameters of the neural network model are determined from an image loss function between the N-frame sample composite image and the N-frame predictive composite image, wherein the image loss function includes the low rank loss function described above, and the low rank loss function is used to represent a difference between a low rank matrix formed of the N-frame sample content images and a low rank matrix formed of the N-frame sample composite images.
In one example, the parameters of the neural network model are determined according to an image loss function between the N frames of sample composite images and the N frames of predicted composite images, wherein the image loss function includes the low-rank loss function and the residual loss function, and the low-rank loss function is used for representing the difference between the low-rank matrix formed by the N frames of sample content images and the low-rank matrix formed by the N frames of sample composite images; the residual loss function is derived from the difference between the first sample composite image and the second sample composite image; the first sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through the first model, the second sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through the second model, the first model and the second model are image style migration models trained in advance according to the same sample style image, the second model comprises an optical flow module, the first model does not comprise an optical flow module, and the optical flow module is used for determining the optical flow information of the N frames of sample content images.
Optionally, in one possible implementation manner, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
Wherein the perceptual loss function may be used to represent content similarity between the sample content image and the corresponding composite image; and for representing style similarity between the sample style image and the corresponding composite image.
In one example, parameters of the neural network model are determined from an image loss function between the N-frame sample composite image and the N-frame prediction composite image, wherein the image loss function includes the low rank loss function, the residual loss function, and the perceptual loss function.
For example, the image loss is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, in a possible implementation manner, the parameters of the target style migration model are obtained by iterating through a back propagation algorithm based on the image loss function.
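The weighted combination of the three terms and the parameter update by back propagation can be sketched as follows; the weights w1, w2, w3 are hypothetical hyperparameters, not values given in the embodiment, and the three loss terms are assumed to have been computed as in the sketches above:

```python
import torch

def training_step(optimizer, low_rank, residual, perceptual, w1=1.0, w2=1.0, w3=1.0):
    """One iteration: weight the low rank loss, residual loss and perceptual loss
    into the image loss, then update the student model parameters by back propagation."""
    loss = w1 * low_rank + w2 * residual + w3 * perceptual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```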
Fig. 8 is a schematic diagram illustrating a training process of a style migration model according to an embodiment of the present application.
As shown in fig. 8, the first teacher model may refer to the second model described above, that is, a pre-trained style migration model including an optical flow module; the second teacher model may refer to the first model, that is, a pre-trained style migration model that does not include an optical flow module; the pre-trained basic model and the student model to be trained have the same network structure as the target student model; the input data of the first teacher model may include a content image of a T-1 frame, a synthesized image of a T-1 frame processed by using optical flow information, and change information calculated by using the optical flow information, where the change information may refer to a different area in two frames of content images obtained according to the content image of the T-1 frame and the content image of the T-1 frame; the optical flow information may refer to motion information of a corresponding pixel in the T-1 th content image and the T-th frame content image, and the output data of the first teacher model is the T-th frame composite image (# 1). For the second teacher model, since the model does not include the optical flow module, the change information in the input data of the second teacher model may be set to 1 entirely; the T-1 frame composite image processed with the optical flow information may be set to 0 entirely, and the output data of the second teacher model is the T-th frame composite image (# 2); the input data of the student model to be trained is a T frame content image, and the output data is a T frame synthesized image (# 3); in the training process, the input data of the pre-trained basic model may be the T-th frame content image, and the output data is the predicted T-th frame synthesized image (# 4); and sequentially inputting the sample content images of the frames from the T frame to the T+N-1, wherein the pre-trained basic model can obtain a T-T+N-1 frame and N frame prediction synthesized image, and continuously updating parameters of the student model to be trained through a back propagation algorithm according to an image loss function, namely a low-rank loss function, a residual loss function and a perception loss function, so as to obtain the trained target student model.
The pre-trained basic model may be a style migration model that is obtained by training a perceptual loss function in advance and does not include an optical flow module in a test stage; alternatively, the pre-trained style migration model may refer to a style migration model pre-trained by a perceptual loss function and an optical flow loss function that does not include an optical flow module during the test phase; the perceptual loss function is used to represent content loss between the composite image and the content image and style loss between the composite image and the style image; the optical flow loss function is used to represent the difference between corresponding pixels of the synthesized image of adjacent frames.
In the embodiment of the application, a low-rank loss function is introduced in the process of training the style migration model for the video, and the stability of the video after the style migration and the original video can be synchronized through the learning of low-rank information, so that the stability of the video after the style migration processing obtained by the target migration model can be improved.
In addition, the style migration model for the video trained in the embodiment of the application can be a target student model obtained by adopting a teacher-learning model learning strategy, so that the requirement of deploying the style migration model for the mobile equipment can be met on one hand; on the other hand, when the target student model is trained, the learning model learns the difference between output information of the teacher model including the optical flow module and the teacher model not including the optical flow module, so that the ghost phenomenon caused by non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target migration model can be improved.
FIG. 9 illustrates a schematic flow diagram of a method 700 for video style migration, which may be performed by an image style migration enabled device, provided by an embodiment of the present application; for example, the method may be performed by the execution device 510 of fig. 6, or may be performed by the local device 520. The method 700 includes S710 to S730, and these steps are described in detail below.
S710, acquiring a video to be processed.
The video to be processed comprises N frames of content images to be processed, wherein N is an integer greater than or equal to 2.
The video to be processed may be a video captured by the electronic device through a camera, or the video to be processed may also be a video obtained from inside the electronic device (for example, a video stored in an album of the electronic device, or a video obtained from a cloud end by the electronic device).
It should be understood that the video to be processed may be a video with style migration requirements, and the present application is not limited to any source of the video to be processed.
S720, performing image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthesized images.
And S730, synthesizing the image according to the N frames to obtain a video after style migration processing corresponding to the video to be processed.
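As a purely illustrative sketch of S710 to S730, assuming the target style migration model is a trained network that maps one content image to one composite image, the per-frame processing could look as follows in Python (PyTorch):

```python
import torch

def stylize_video(target_model, frames):
    """S720/S730 sketch: stylize each of the N content frames independently and
    return the stylized sequence; no optical flow is computed at test time."""
    target_model.eval()
    outputs = []
    with torch.no_grad():
        for frame in frames:                     # frame: tensor of shape (1, 3, H, W)
            outputs.append(target_model(frame))
    return outputs
```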
The parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixels between two adjacent frames of the N frames of sample content images, and the N frames of predicted synthesized images are images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
The N frames of sample content images refer to N adjacent frames in a video; the N frames of composite images refer to the images corresponding to the N frames of sample content images; the target style migration model may refer to a style migration model obtained in advance by the training method shown in fig. 7.
It should be understood that for the style migration processing of a single frame image, that is, image style migration, only the content in the content image and the style in the style image need to be considered; however, for a video, since the video comprises multiple continuous frames, the style migration of the video needs to consider not only the stylization effect of the images but also the stability among the multiple frames of images; the smoothness of the video after style migration processing needs to be ensured, and noise such as flicker and artifacts needs to be avoided.
For example, assume that a video includes 5 consecutive frames of images, that is, 5 consecutive frames of sample content images; performing style migration processing on the 5 frames of sample content images yields a 5-frame style migration result, that is, 5 frames of sample composite images; the low-rank matrix may then be calculated as follows (see the sketch after these steps):
step 1, calculating optical flow information between image frames according to 5 frames of images in a video, namely 5 frames of sample content images;
step 2, calculating mask information according to the optical flow information, wherein the mask information can be used for representing a change area in two continuous frames of images obtained according to the optical flow information;
step 3, calculating the low-rank part after the 1st, 2nd, 4th and 5th frame images are aligned with the 3rd frame image according to the optical flow information and the mask information, and setting the sparse part to 0;
step 4, respectively generating vectors from the aligned 5 frames of images and combining the vectors into a matrix by columns, wherein the matrix may be a low-rank matrix.
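The following Python (PyTorch) sketch illustrates steps 3 and 4 above; the optical flow estimation (step 1) and the mask computation (step 2) are assumed to be done beforehand by an external estimator, and the tensor shapes are illustrative assumptions.

```python
import torch

def build_low_rank_matrix(aligned_frames, masks):
    """Steps 3-4: `aligned_frames` are the 5 frames already warped to the
    reference (3rd) frame using the optical flow, and `masks` mark the regions
    that are not motion boundaries; the sparse (changing) part is set to 0."""
    columns = []
    for frame, mask in zip(aligned_frames, masks):
        low_rank_part = frame * mask                # keep the low-rank part, zero the rest
        columns.append(low_rank_part.reshape(-1))   # vectorize the aligned frame
    return torch.stack(columns, dim=1)              # one column per frame: (C*H*W, 5) matrix
```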
It should be understood that in calculating the low-rank loss function, the objective is to make the rank of the low-rank portion of the image matrix composed of the 5 frames of sample content images approximate the rank of the low-rank portion of the image matrix composed of the 5 frames of sample composite images; the rank of the low-rank part can be continuously optimized by optimizing the nuclear norm, where the nuclear norm, that is, the sum of the singular values, is obtained by carrying out singular value decomposition on the matrix.
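One plausible way to express the corresponding loss term in Python (PyTorch) is sketched below; the use of the absolute difference between the two nuclear norms is an assumption of this sketch, since the text above only states that the ranks of the two low-rank parts are brought close by optimizing the nuclear norm.

```python
import torch

def nuclear_norm(matrix):
    # Nuclear norm = sum of singular values, obtained by singular value decomposition.
    return torch.linalg.svdvals(matrix).sum()

def low_rank_loss(content_matrix, composite_matrix):
    # Both matrices are built from aligned frames as in the steps above.
    return torch.abs(nuclear_norm(composite_matrix) - nuclear_norm(content_matrix))
```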
In the embodiment of the application, the aim of introducing the low-rank loss function when training the style migration model is to ensure that the areas which are not motion boundaries and appear in adjacent multi-frame images in the original video remain the same after the style migration processing, namely, the rank of the areas in the video after the style migration processing is approximate to the rank of the areas in the original video, so that the stability of the video after the style migration processing can be improved.
Further, in an embodiment of the present application, the image loss function further includes a residual loss function, where the residual loss function is obtained according to a difference between a first sample composite image and a second sample composite image, where the first sample composite image is obtained by performing an image style migration process on the N frames of sample content images through a first model, the second sample composite image is obtained by performing an image style migration process on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, and the second model includes an optical flow module, where the optical flow module is configured to determine optical flow information of the N frames of sample content images.
It should be appreciated that the difference between the first sample composite image and the second sample composite image may refer to the difference between the pixel values at corresponding positions in the first sample composite image and the second sample composite image.
It should be noted that, the first model and the second model refer to a style migration model trained through the same style image, where the first model and the second model are different in that the first model does not include an optical flow module; the second model comprises an optical flow module; the first model and the second model can adopt the same sample content image and sample style image during training; for example, the first model and the second model may refer to the same model during the training phase; however, the second model also needs to calculate optical flow information between the multi-frame sample content images during the test phase; the first model does not need to calculate optical flow information between the multiple frames of images.
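For readers unfamiliar with how optical flow is applied, the following Python (PyTorch) sketch shows one common way to warp the previous composite image to the current frame; the pixel-unit flow convention and the bilinear sampling are assumptions of this sketch, not a definition taken from the second model.

```python
import torch
import torch.nn.functional as F

def warp(prev_composite, flow):
    """Warp O_(t-1) towards frame t using optical flow of shape (N, 2, H, W) in pixels."""
    n, _, h, w = prev_composite.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                              # absolute sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(prev_composite, grid, align_corners=True)
```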
In the embodiment of the application, the aim of introducing the residual error loss function when training the target style migration model is to enable the neural network model to learn the difference of the synthesized image output by the style migration model comprising the optical flow module and the style migration model not comprising the optical flow module in the training process, so that the stability of the video after the style migration processing obtained by the target migration model can be improved.
Further, in the embodiment of the application, a teacher-student model learning strategy can be adopted to meet the deployment requirement of the mobile terminal, namely, the trained style migration model can be a target student model; in the training process, parameters of the student model can be updated through the image loss function, so that a target student model is obtained.
It should be noted that the student model to be trained and the target student model have the same network structure, and the target student model refers to a style migration model that does not need to input optical flow information in the test stage.
Optionally, in one possible implementation manner, the first model and the second model may be pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to a residual loss function and a knowledge distillation algorithm.
It should be understood that knowledge distillation is a key technology that enables deep learning models to be miniaturized so as to meet the deployment requirements of terminal devices. Compared with compression techniques such as quantization and sparsification, it can compress a model without requiring specific hardware support. The knowledge distillation technique adopts a teacher-student model learning strategy, where the teacher model may refer to a model with a large number of parameters that generally cannot meet deployment requirements, while the student model has few parameters and can be deployed directly. By designing an effective knowledge distillation algorithm, the student model learns to simulate the behavior of the teacher model and effective knowledge migration is performed, so that the student model can finally exhibit the same processing capability as the teacher model.
In an embodiment of the application, a model that includes an optical flow module at test time can be used to perform knowledge distillation on a model that does not include an optical flow module at test time. In the style migration of a video, because the teacher model and the student model have different structures and are trained in different ways, the stylization effects of the student model and the teacher model may not be completely the same; if the student model is made to learn the output information of the teacher model directly at the pixel level, the output of the student model may exhibit ghosting or blurring. In the embodiment of the application, by learning the difference between the style migration results output by the teacher model that includes the optical flow module at test time and the teacher model that does not include the optical flow module at test time, the ghosting phenomenon caused by non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target style migration model can be improved.
In one possible implementation, the residual loss function is optionally derived from the following equation:

L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained basic model having the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, where i is a positive integer.
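A minimal Python (PyTorch) sketch of this residual loss, under the assumption that an L1 norm is used and that the teachers and the pre-trained basic model are frozen, is given below; the function and argument names are illustrative.

```python
import torch

def residual_loss(n_t, n_t_tilde, n_s, n_s_tilde, sample_frames):
    """For each sample content image x_i, drive the student residual
    N_S(x_i) - Ñ_S(x_i) towards the teacher residual N_T(x_i) - Ñ_T(x_i)."""
    loss = 0.0
    for x_i in sample_frames:
        with torch.no_grad():
            teacher_residual = n_t(x_i) - n_t_tilde(x_i)
            base_output = n_s_tilde(x_i)
        student_residual = n_s(x_i) - base_output
        loss = loss + torch.mean(torch.abs(student_residual - teacher_residual))
    return loss
```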
In one example, parameters of the neural network model are determined from an image loss function between the N-frame sample composite image and the N-frame predictive composite image, wherein the image loss function includes the low rank loss function described above, and the low rank loss function is used to represent a difference between a low rank matrix formed of the N-frame sample content images and a low rank matrix formed of the N-frame sample composite images.
In one example, parameters of the neural network model are determined according to an image loss function between the N-frame sample composite image and the N-frame prediction composite image, wherein the image loss function includes the low-rank loss function and the residual loss function, and the low-rank loss function is used for representing a difference between a low-rank matrix formed by the N-frame sample content image and a low-rank matrix formed by the N-frame sample composite image; the residual loss function is derived from the difference between the first sample composite image and the second sample composite image; the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the same sample style images, the second model comprises an optical flow module, the first model does not comprise an optical flow module, and the optical flow module can be used for determining optical flow information of the N frames of sample content images.
Optionally, in one possible implementation manner, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
The perceptual loss function may be used to represent the content similarity between the sample content image and the corresponding composite image, and to represent the style similarity between the sample style image and the corresponding composite image.
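A common realization of such a perceptual loss uses features of a pre-trained VGG network and Gram matrices for the style term; the following Python (PyTorch/torchvision) sketch is only one possible instantiation, and the choice of VGG-16, the layer cut-off and the equal weighting of the two terms are assumptions.

```python
import torch
import torchvision

_vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def gram(features):
    b, c, h, w = features.shape
    f = features.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # (b, c, c) Gram matrix

def perceptual_loss(composite, content, style):
    f_composite, f_content, f_style = _vgg(composite), _vgg(content), _vgg(style)
    content_loss = torch.mean((f_composite - f_content) ** 2)            # image content difference
    style_loss = torch.mean((gram(f_composite) - gram(f_style)) ** 2)    # image style difference
    return content_loss + style_loss
```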
In one example, parameters of the neural network model are determined from an image loss function between the N-frame sample composite image and the N-frame prediction composite image, wherein the image loss function includes the low rank loss function, the residual loss function, and the perceptual loss function.
For example, the image loss is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
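Expressed as a short sketch with purely hypothetical weights, this weighted combination could look as follows:

```python
def image_loss(rank_term, residual_term, perceptual_term,
               w_rank=1.0, w_res=1.0, w_perc=1.0):
    # Weighted combination of the three loss terms; the weights are hypothetical.
    return w_rank * rank_term + w_res * residual_term + w_perc * perceptual_term
```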
Optionally, in a possible implementation manner, the parameters of the target style migration model are obtained by iterating through a back propagation algorithm based on the image loss function.
In the embodiment of the application, a low-rank loss function is introduced in the process of training the target style migration model; by learning the low-rank information, the stability of the video after style migration can be kept consistent with that of the original video, so that the stability of the video after style migration obtained by the target style migration model can be improved.
In addition, the target style migration model in the embodiment of the application may be a target student model obtained by adopting a teacher-student model learning strategy, so that on one hand the requirement of deploying the style migration model on a mobile device can be met; on the other hand, when the target student model is trained, the student model learns the difference between the output information of the teacher model including the optical flow module and that of the teacher model not including the optical flow module, so that the ghosting phenomenon caused by non-uniform styles of the teacher model and the student model can be effectively avoided, and the stability of the video after style migration processing obtained by the target style migration model can be improved.
Furthermore, the target style migration model provided by the embodiment of the application does not need to calculate the optical flow information among the multi-frame images included in the video in the test stage, namely in the process of performing style migration processing on the video to be processed, so that the target style migration model provided by the embodiment of the application can improve the stability, shorten the time of style migration processing of the model and improve the operation efficiency of the target style migration model.
Fig. 10 is a schematic diagram illustrating a training phase and a testing phase according to an embodiment of the present application.
Training phase:
for example, in the embodiment of the present application, a FlowNet2 network and a data set with optical flow data generated from the Hollywood2 data set may be used, and the training method shown in the embodiment of the present application is used to train the network.
For example, the specific implementation steps include: firstly, a style migration model, namely the pre-trained basic model, is trained by adopting only the perceptual loss function, or by adopting the perceptual loss function and the optical flow loss function; a teacher model N_T that includes an optical flow module and a teacher model Ñ_T that does not include an optical flow module are trained using the video data and the optical flow data; then, the student model N_S to be trained is trained according to N_T, Ñ_T, the low-rank loss function and the residual loss function, and the trained target student model is finally obtained, wherein the pre-trained basic model, the student model to be trained and the target student model have the same network structure.
It should be noted that, for the specific implementation of the training phase, reference may be made to the descriptions in fig. 7 and fig. 8, and details are not repeated here.
Testing:
for example, during testing, test data may be input into the trained target student model, and the test result, that is, the data after style migration processing, is obtained through the target student model.
It should be noted that, the specific implementation of the testing stage may be referred to the description in fig. 9, and will not be repeated here.
TABLE 1
The teacher model in table 1 may refer to the second model in the above embodiment, that is, the style migration model that includes the optical flow module in the test stage; the first class of student models may refer to pre-trained student models obtained through perceptual loss function training; the second class of student models may refer to pre-trained student models trained with the perceptual loss function and the optical flow loss function; loss function 1 may refer to the residual loss function in the present application; loss function 2 may refer to the low-rank loss function in the present application; alley_2, ambush_5, bandage_2, market_6 and sample_2 represent the names of five video sequences in the MPI-Sintel dataset, respectively; All represents the above five videos taken together. The results of the stability tests for the different models using the MPI-Sintel dataset are shown in table 1; the stability index may be calculated by the following formula:
stability index = (1/(T−1)) Σ_{t=2..T} (1/D) ‖ M_t ⊙ (O_t − W_t(O_{t−1})) ‖²

wherein T represents the number of image frames included in the video; D = C × W × H; M_t ∈ R^(W×H) represents mask information; O_t represents the style migration result of the t-th frame; O_{t−1} represents the style migration result of the (t−1)-th frame; W_t represents the optical flow information from the (t−1)-th frame to the t-th frame; and W_t(O_{t−1}) represents the result of aligning the style migration result of the (t−1)-th frame with the style migration result of the t-th frame.
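A Python (PyTorch) sketch of such a stability index is given below, assuming that the warped results W_t(O_(t−1)) and the masks M_t have been precomputed; the exact normalization follows the reconstructed formula above and should be treated as an assumption.

```python
import torch

def stability_index(outputs, warped_prev_outputs, masks):
    """Average masked difference between each stylized frame O_t and the warped
    previous stylized frame W_t(O_(t-1)), normalized by D = C*W*H and by T-1."""
    errors = []
    for o_t, warped_prev, m_t in zip(outputs[1:], warped_prev_outputs, masks):
        d = o_t.numel()
        errors.append(torch.sum(m_t * (o_t - warped_prev) ** 2) / d)
    return torch.stack(errors).mean()
```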
As shown in table 1, the smaller the stability index is, the better the stability of the model's output data after style migration processing is; from the test results shown in table 1, it can be seen that the stability of the style-migrated output data of the target style migration model provided by the embodiment of the present application is significantly better than that of the other models.
It should be understood that the above description is intended to aid those skilled in the art in understanding the embodiments of the present application, and is not intended to limit the embodiments of the present application to the specific values or particular scenarios illustrated. It will be apparent to those skilled in the art from the foregoing description that various equivalent modifications or variations can be made, and such modifications or variations are intended to be within the scope of the embodiments of the present application.
The training method for style migration and the method for video style migration provided by the embodiment of the application are described in detail above with reference to fig. 1 to 10; an embodiment of the device of the present application will be described in detail below with reference to fig. 11 to 14. It should be understood that the image processing apparatus in the embodiment of the present application may perform the various methods of the foregoing embodiment of the present application, that is, the specific working processes of the following various products may refer to the corresponding processes in the foregoing method embodiment.
Fig. 11 is a schematic block diagram of an apparatus for video style migration provided by an embodiment of the present application.
It should be appreciated that the apparatus 800 for video style migration may perform the method shown in fig. 9, or the method of the test phase shown in fig. 10. The apparatus 800 includes: an acquisition unit 810 and a processing unit 820.
The obtaining unit 810 is configured to obtain a video to be processed, where the video to be processed includes N frames of content images to be processed, where N is an integer greater than or equal to 2; the processing unit 820 is configured to perform image style migration processing on the N frames of content images to be processed according to a target style migration model, so as to obtain N frames of composite images; obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images,
the parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images refer to images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
Optionally, as an embodiment, the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information of the N frames of sample content images.
Optionally, as an embodiment, the first model and the second model are teacher models trained in advance, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
Alternatively, as an example, the residual loss function is derived from the following equation:

L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained base model having the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, where i is a positive integer.
Optionally, as an embodiment, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss and a style loss, the content loss is used to represent an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto, and the style loss is used to represent an image style difference between the N-frame predicted composite image and the sample style image.
Optionally, as an embodiment, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, as an embodiment, the parameters of the target style migration model are obtained by iterating through a back propagation algorithm based on the image loss function.
FIG. 12 is a schematic block diagram of a training apparatus for a style migration model provided by an embodiment of the present application.
It should be appreciated that the training apparatus 900 may perform the training method of the style migration model shown in fig. 7, 8, or 10. The training device 900 includes: an acquisition unit 910 and a processing unit 920.
The acquiring unit 910 is configured to acquire training data, where the training data includes N frames of sample content images, a sample style image, and N frames of composite images, where the N frames of composite images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style image, and N is an integer greater than or equal to 2; the processing unit 920 is configured to perform image style migration processing on the N frames of sample content images according to the sample style images through a neural network model, so as to obtain N frames of predicted composite images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted composite images,
the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images.
Optionally, as an embodiment, the image loss function further comprises a residual loss function, the residual loss function being derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information of the N frames of sample content images.
Optionally, as an embodiment, the first model and the second model are teacher models trained in advance, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
Alternatively, as an example, the residual loss function is derived from the following equation:

L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained base model having the same network structure as the student model to be trained; x_i represents the i-th frame sample content image included in the sample video, where i is a positive integer.
Optionally, as an embodiment, the image loss function further includes a perceptual loss function, where the perceptual loss function includes a content loss representing an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto and a style loss representing an image style difference between the N-frame predicted composite image and the sample style image.
Optionally, as an embodiment, the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
Optionally, as an embodiment, the parameters of the target style migration model are obtained by iterating through a back propagation algorithm based on the image loss function.
The apparatus 800 and the training apparatus 900 are embodied as functional units. The term "unit" herein may be implemented in software and/or hardware, without specific limitation.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include application specific integrated circuits (application specific integrated circuit, ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 13 is a schematic hardware structure of a video style migration apparatus according to an embodiment of the present application. The apparatus 1000 shown in fig. 13 (the apparatus 1000 may be a computer device in particular) includes a memory 1010, a processor 1020, a communication interface 1030, and a bus 1040. The memory 1010, the processor 1020, and the communication interface 1030 are communicatively connected to each other via a bus 1040.
The memory 1010 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 1010 may store a program that, when executed by the processor 1020, the processor 1020 is configured to perform the steps of the method of video style migration of an embodiment of the present application; for example, the steps shown in fig. 9 are performed.
It should be understood that the video style migration device shown in the embodiment of the present application may be a server, for example, a cloud server, or may also be a chip configured in the cloud server.
The processor 1020 may employ a general purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to implement the method of video style migration of the method embodiments of the present application.
The processor 1020 may also be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method of video style migration of the present application may be performed by integrated logic circuitry of hardware in the processor 1020 or by instructions in the form of software.
The processor 1020 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1010, and the processor 1020 reads the information in the memory 1010, and in combination with its hardware, performs the functions required to be performed by the units included in the apparatus shown in fig. 11 in the implementation of the present application, or performs the method for video style migration shown in fig. 9 in the embodiment of the method of the present application.
The communication interface 1030 enables communication between the apparatus 1000 and other devices or communication networks using transceiving apparatus such as, but not limited to, a transceiver.
A bus 1040 may include a path for transferring information between various components of the device 1000 (e.g., the memory 1010, the processor 1020, the communication interface 1030).
Fig. 14 is a schematic hardware structure of a training device for a style migration model according to an embodiment of the present application. The exercise apparatus 1100 shown in fig. 14 (the exercise apparatus 1100 may be a computer device in particular) includes a memory 1110, a processor 1120, a communication interface 1130, and a bus 1140. Wherein the memory 1110, the processor 1120, and the communication interface 1130 implement communication connection therebetween through the bus 1140.
The memory 1110 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 1110 may store a program, and the processor 1120 is configured to perform the respective steps of the training method of the style migration model according to the embodiment of the present application when the program stored in the memory 1110 is executed by the processor 1120; for example, the respective steps shown in fig. 7 or 8 are performed.
It should be understood that the training device shown in the embodiment of the present application may be a server, for example, a cloud server, or may be a chip configured in the cloud server.
Illustratively, the processor 1120 may employ a general-purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to implement the training method of the style migration model of the method embodiment of the present application.
The processor 1120 may also be an integrated circuit chip with signal processing capabilities, for example. In implementation, the various steps of the training method of the style migration model of the present application may be performed by integrated logic circuits of hardware in the processor 1120 or by instructions in the form of software.
The processor 1120 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1110, and the processor 1120 reads the information in the memory 1110, and combines the hardware thereof to perform the functions required by the units included in the training apparatus shown in fig. 12, or performs the training method of the style migration model shown in fig. 7 or fig. 8 according to the method embodiment of the present application.
Communication interface 1130 enables communication between exercise device 1100 and other devices or communication networks using a transceiver device such as, but not limited to, a transceiver.
Bus 1140 may include a path for transferring information between components of training device 1100 (e.g., memory 1110, processor 1120, communication interface 1130).
It should be noted that while the above-described apparatus 1000 and training apparatus 1100 only illustrate a memory, processor, communication interface, those skilled in the art will appreciate that in a particular implementation, apparatus 1000 and training apparatus 1100 may also include other devices necessary to achieve proper operation. Also, those skilled in the art will appreciate that the apparatus 1000 and training apparatus 1100 described above may also include hardware devices that perform other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the apparatus 1000 and training apparatus 1100 described above may also include only the necessary components to implement embodiments of the present application, and not necessarily all of the components shown in fig. 13 or 14.
The embodiment of the application also provides a chip, which comprises a receiving and transmitting unit and a processing unit. The receiving and transmitting unit can be an input and output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip; the chip can execute the video style migration method in the embodiment of the method.
The embodiment of the application also provides a chip, which comprises a receiving and transmitting unit and a processing unit. The receiving and transmitting unit can be an input and output circuit and a communication interface; the processing unit is a processor or a microprocessor or an integrated circuit integrated on the chip; the chip can execute the training method of the style migration model in the embodiment of the method.
Exemplary, embodiments of the present application also provide a computer-readable storage medium having instructions stored thereon that, when executed, perform the method of video style migration in the method embodiments described above.
Exemplary, embodiments of the present application also provide a computer readable storage medium having instructions stored thereon that, when executed, perform the training method of the style migration model in the method embodiments described above.
Exemplary embodiments of the present application also provide a computer program product comprising instructions that when executed perform the method of video style migration in the method embodiments described above.
The present application also illustratively provides a computer program product comprising instructions which, when executed, perform the method of training the style migration model in the method embodiment described above.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more sets of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (30)

1. A method for training a style migration model, comprising:
acquiring training data, wherein the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2;
performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images;
determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted composite images,
the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frame images in the N-frame sample content image.
2. The training method of claim 1 wherein the image loss function further comprises a residual loss function derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information.
3. The training method of claim 2, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
4. The training method of claim 3, wherein the residual loss function is obtained according to the following equation:

L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained base model having the same network structure as the student model to be trained; x_i represents an i-th frame of sample content image included in the N frames of sample content images, wherein i is a positive integer.
5. Training method according to any of the claims 2-4, wherein the image loss function further comprises a perceptual loss function, wherein the perceptual loss function comprises a content loss representing an image content difference between the N-frame predictive composite image and the N-frame sample content image corresponding thereto and a style loss representing an image style difference between the N-frame predictive composite image and the sample style image.
6. The training method of claim 5 wherein said image loss function is obtained by weighting said low rank loss function, said residual loss function, and said perceptual loss function.
7. Training method according to claim 3 or 4, wherein the parameters of the target style migration model are obtained by a number of iterations of a back propagation algorithm based on the image loss function.
8. A method of video style migration, comprising:
acquiring a video to be processed, wherein the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2;
performing image style migration processing on the N frames of content images to be processed according to a target style migration model to obtain N frames of synthesized images;
obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images,
the parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
9. The method of claim 8, wherein the image loss function further comprises a residual loss function derived from a difference between the first sample composite image and the second sample composite image,
the first sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample synthesized image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining optical flow information.
10. The method of claim 9, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
11. The method of claim 10, wherein the residual loss function is derived from the following equation:

L_res = Σ_i ‖ (N_T(x_i) − Ñ_T(x_i)) − (N_S(x_i) − Ñ_S(x_i)) ‖

wherein L_res represents the residual loss function; N_T represents the second model; Ñ_T represents the first model; N_S represents the student model to be trained; Ñ_S represents a pre-trained style migration model having the same network structure as the student model to be trained; x_i represents an i-th frame of sample content image included in the N frames of sample content images, wherein i is a positive integer.
12. The method of any of claims 9 to 11, wherein the image loss function further comprises a perceptual loss function, wherein the perceptual loss function comprises a content loss representing an image content difference between the N-frame predicted composite image and the N-frame sample content image corresponding thereto and a style loss representing an image style difference between the N-frame predicted composite image and the sample style image.
13. The method of claim 12, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
14. The method according to any one of claims 8 to 11, wherein the parameters of the target style migration model are obtained by a plurality of iterations of a back propagation algorithm based on the image loss function.
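For illustration only (not part of the claims): a sketch of a single training iteration combining claims 13 and 14, reusing the hypothetical low_rank_loss, residual_loss and perceptual_loss sketches above. The loss weights and argument names are assumptions.

import torch

def train_step(student, student_pretrained, teacher_flow, teacher_noflow, batch,
               feature_extractor, warp_fn, optimizer,
               w_perc=1.0, w_rank=10.0, w_res=1.0):
    """Weight the perceptual, low-rank and residual losses into a single image
    loss and update the student by back propagation; repeating this step over
    many iterations yields the parameters of the target style migration model."""
    content_frames, style_image, flows = batch
    stylized = [student(x) for x in content_frames]

    loss = (w_perc * perceptual_loss(stylized, content_frames, style_image, feature_extractor)
            + w_rank * low_rank_loss(content_frames, stylized, flows, warp_fn)
            + w_res * residual_loss(content_frames, teacher_flow, teacher_noflow,
                                    student, student_pretrained))

    optimizer.zero_grad()
    loss.backward()          # back propagation of the weighted image loss
    optimizer.step()
    return loss.item()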
15. A training device for a style migration model, comprising:
an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring training data, the training data comprises N frames of sample content images, sample style images and N frames of synthesized images, the N frames of synthesized images are images obtained by performing image style migration processing on the N frames of sample content images according to the sample style images, and N is an integer greater than or equal to 2;
the processing unit is used for performing image style migration processing on the N frames of sample content images according to the sample style images through a neural network model to obtain N frames of predicted composite images; determining parameters of the neural network model according to an image loss function between the N frames of sample content images and the N frames of predicted composite images,
the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N-frame sample content image and optical flow information, the second low-rank matrix is obtained based on the N-frame prediction synthesized image and the optical flow information, and the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frame images in the N-frame sample content image.
16. The training device of claim 15, wherein the image loss function further comprises a residual loss function derived from a difference between a first sample composite image and a second sample composite image,
wherein the first sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
17. The training apparatus of claim 16 wherein the first model and the second model are pre-trained teacher models and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
18. The training apparatus of claim 17, wherein the residual loss function is derived from the following equation:
L_res = Σ_i ‖ (N_S(x_i) − N̂_S(x_i)) − (N_T(x_i) − N̂_T(x_i)) ‖,
wherein L_res represents the residual loss function; N_T represents the second model; N̂_T represents the first model; N_S represents the student model to be trained; N̂_S represents a pre-trained base model having the same network structure as the student model to be trained; and x_i represents the i-th frame of sample content image included in the N frames of sample content images, i being a positive integer.
19. The training apparatus of any one of claims 16 to 18, wherein the image loss function further comprises a perceptual loss function, the perceptual loss function comprising a content loss and a style loss, wherein the content loss represents an image content difference between the N frames of predicted composite images and the corresponding N frames of sample content images, and the style loss represents an image style difference between the N frames of predicted composite images and the sample style images.
20. The training apparatus of claim 19 wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
21. The training apparatus of claim 17 or 18, wherein the parameters of the target style migration model are obtained by a plurality of iterations of a back propagation algorithm based on the image loss function.
22. An apparatus for video style migration, comprising:
an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed, the video to be processed comprises N frames of content images to be processed, and N is an integer greater than or equal to 2;
the processing unit is used for carrying out image style migration processing on the N frames of content images to be processed according to the target style migration model to obtain N frames of synthesized images; obtaining a video after style migration processing corresponding to the video to be processed according to the N frames of synthesized images,
the parameters of the target style migration model are determined according to an image loss function of performing style migration processing on N frames of sample content images by the target style migration model, the image loss function comprises a low-rank loss function, the low-rank loss function is used for representing the difference between a first low-rank matrix and a second low-rank matrix, the first low-rank matrix is obtained based on the N frames of sample content images and optical flow information, the second low-rank matrix is obtained based on N frames of predicted synthesized images and the optical flow information, the optical flow information is used for representing the position difference of corresponding pixel points between two adjacent frames of images in the N frames of sample content images, and the N frames of predicted synthesized images are images obtained after performing image style migration processing on the N frames of sample content images according to the sample style images by the target style migration model.
23. The apparatus of claim 22, wherein the image loss function further comprises a residual loss function derived from a difference between a first sample composite image and a second sample composite image,
wherein the first sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through a first model, the second sample composite image is an image obtained by performing image style migration processing on the N frames of sample content images through a second model, the first model and the second model are image style migration models trained in advance according to the sample style images, the second model comprises an optical flow module, the first model does not comprise the optical flow module, and the optical flow module is used for determining the optical flow information.
24. The apparatus of claim 23, wherein the first model and the second model are pre-trained teacher models, and the target style migration model is a target student model obtained by training a student model to be trained according to the residual loss function and a knowledge distillation algorithm.
25. The apparatus of claim 24, wherein the residual loss function is derived from the following equation:
L_res = Σ_i ‖ (N_S(x_i) − N̂_S(x_i)) − (N_T(x_i) − N̂_T(x_i)) ‖,
wherein L_res represents the residual loss function; N_T represents the second model; N̂_T represents the first model; N_S represents the student model to be trained; N̂_S represents a pre-trained base model having the same network structure as the student model to be trained; and x_i represents the i-th frame of sample content image included in the N frames of sample content images, i being a positive integer.
26. The apparatus of any one of claims 23 to 25, wherein the image loss function further comprises a perceptual loss function, the perceptual loss function comprising a content loss and a style loss, wherein the content loss represents an image content difference between the N frames of predicted composite images and the corresponding N frames of sample content images, and the style loss represents an image style difference between the N frames of predicted composite images and the sample style images.
27. The apparatus of claim 26, wherein the image loss function is obtained by weighting the low rank loss function, the residual loss function, and the perceptual loss function.
28. The apparatus of any one of claims 22 to 25, wherein the parameters of the target style migration model are derived by a plurality of iterations of a back propagation algorithm based on the image loss function.
29. A training device for a style migration model, comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 1-7 or 8-14.
30. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein program instructions, which when executed by a processor, implement the method of any of claims 1 to 7 or 8 to 14.
CN202010409043.0A 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device Active CN111667399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010409043.0A CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010409043.0A CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Publications (2)

Publication Number Publication Date
CN111667399A CN111667399A (en) 2020-09-15
CN111667399B true CN111667399B (en) 2023-08-25

Family

ID=72383795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010409043.0A Active CN111667399B (en) 2020-05-14 2020-05-14 Training method of style migration model, video style migration method and device

Country Status (1)

Country Link
CN (1) CN111667399B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210358164A1 (en) * 2020-05-15 2021-11-18 Nvidia Corporation Content-aware style encoding using neural networks
CN114494566A (en) * 2020-11-09 2022-05-13 华为技术有限公司 Image rendering method and device
CN112365556B (en) * 2020-11-10 2021-09-28 成都信息工程大学 Image extension method based on perception loss and style loss
CN114615421B (en) * 2020-12-07 2023-06-30 华为技术有限公司 Image processing method and electronic equipment
CN112734627B (en) * 2020-12-24 2023-07-11 北京达佳互联信息技术有限公司 Training method of image style migration model, image style migration method and device
CN112785493B (en) * 2021-01-22 2024-02-09 北京百度网讯科技有限公司 Model training method, style migration method, device, equipment and storage medium
CN113076685A (en) * 2021-03-04 2021-07-06 华为技术有限公司 Training method of image reconstruction model, image reconstruction method and device thereof
CN113033566B (en) * 2021-03-19 2022-07-08 北京百度网讯科技有限公司 Model training method, recognition method, device, storage medium, and program product
CN113362243A (en) * 2021-06-03 2021-09-07 Oppo广东移动通信有限公司 Model training method, image processing method and apparatus, medium, and electronic device
CN113327265B (en) * 2021-06-10 2022-07-15 厦门市美亚柏科信息股份有限公司 Optical flow estimation method and system based on guiding learning strategy
CN113570636B (en) * 2021-06-16 2024-05-10 北京农业信息技术研究中心 Method and device for detecting ventilation quantity of fan
CN116362956A (en) * 2021-12-24 2023-06-30 北京字跳网络技术有限公司 Video texture migration method and device, electronic equipment and storage medium
CN117808934A (en) * 2022-09-29 2024-04-02 华为技术有限公司 Data processing method and related equipment
CN117078790B (en) * 2023-10-13 2024-03-29 腾讯科技(深圳)有限公司 Image generation method, device, computer equipment and storage medium
CN117541625B (en) * 2024-01-05 2024-03-29 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537776A (en) * 2018-03-12 2018-09-14 维沃移动通信有限公司 A kind of image Style Transfer model generating method and mobile terminal
CN109766895A (en) * 2019-01-03 2019-05-17 京东方科技集团股份有限公司 The training method and image Style Transfer method of convolutional neural networks for image Style Transfer
CN109840531A (en) * 2017-11-24 2019-06-04 华为技术有限公司 The method and apparatus of training multi-tag disaggregated model
CN109859096A (en) * 2018-12-28 2019-06-07 北京达佳互联信息技术有限公司 Image Style Transfer method, apparatus, electronic equipment and storage medium
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110598781A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111079781A (en) * 2019-11-07 2020-04-28 华南理工大学 Lightweight convolutional neural network image identification method based on low rank and sparse decomposition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140156575A1 (en) * 2012-11-30 2014-06-05 Nuance Communications, Inc. Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization
US9639761B2 (en) * 2014-03-10 2017-05-02 Mitsubishi Electric Research Laboratories, Inc. Method for extracting low-rank descriptors from images and videos for querying, classification, and object detection
US20170169563A1 (en) * 2015-12-11 2017-06-15 Macau University Of Science And Technology Low-Rank and Sparse Matrix Decomposition Based on Schatten p=1/2 and L1/2 Regularizations for Separation of Background and Dynamic Components for Dynamic MRI
US10152768B2 (en) * 2017-04-14 2018-12-11 Facebook, Inc. Artifact reduction for image style transfer
US10318889B2 (en) * 2017-06-26 2019-06-11 Konica Minolta Laboratory U.S.A., Inc. Targeted data augmentation using neural style transfer

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840531A (en) * 2017-11-24 2019-06-04 华为技术有限公司 The method and apparatus of training multi-tag disaggregated model
CN108537776A (en) * 2018-03-12 2018-09-14 维沃移动通信有限公司 A kind of image Style Transfer model generating method and mobile terminal
CN109859096A (en) * 2018-12-28 2019-06-07 北京达佳互联信息技术有限公司 Image Style Transfer method, apparatus, electronic equipment and storage medium
CN109766895A (en) * 2019-01-03 2019-05-17 京东方科技集团股份有限公司 The training method and image Style Transfer method of convolutional neural networks for image Style Transfer
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110598781A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111079781A (en) * 2019-11-07 2020-04-28 华南理工大学 Lightweight convolutional neural network image identification method based on low rank and sparse decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Monkey image classification method based on improved VGG16; Tian Jialu; Deng Liguo; Information Technology and Network Security (Issue 05); full text *

Also Published As

Publication number Publication date
CN111667399A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111667399B (en) Training method of style migration model, video style migration method and device
CN112308200B (en) Searching method and device for neural network
CN110532871B (en) Image processing method and device
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN112183718B (en) Deep learning training method and device for computing equipment
US20210398252A1 (en) Image denoising method and apparatus
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN111914997B (en) Method for training neural network, image processing method and device
CN112070664B (en) Image processing method and device
US20220157046A1 (en) Image Classification Method And Apparatus
CN111402130A (en) Data processing method and data processing device
CN111882031A (en) Neural network distillation method and device
CN112598597A (en) Training method of noise reduction model and related device
CN113076685A (en) Training method of image reconstruction model, image reconstruction method and device thereof
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN113011562A (en) Model training method and device
CN113066018A (en) Image enhancement method and related device
Greco et al. Benchmarking deep networks for facial emotion recognition in the wild
CN113536970A (en) Training method of video classification model and related device
CN113627163A (en) Attention model, feature extraction method and related device
WO2021057091A1 (en) Viewpoint image processing method and related device
WO2021042774A1 (en) Image recovery method, image recovery network training method, device, and storage medium
CN113313133A (en) Training method for generating countermeasure network and animation image generation method
CN111652349A (en) Neural network processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant