WO2022242122A1

WO2022242122A1 - Video optimization method and apparatus, terminal device, and storage medium

Info

Publication number: WO2022242122A1
Application number: PCT/CN2021/137583
Authority: WO
Inventors: 刘翼豪; 赵恒远; 董超; 乔宇
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-05-21
Filing date: 2021-12-13
Publication date: 2022-11-24
Also published as: CN113298728A; CN113298728B

Abstract

A video optimization method and apparatus, a terminal device, and a storage medium, relating to the technical field of deep learning, and can improve the continuity of video optimization. The video optimization method comprises: using a trained feature extraction network to respectively extract intermediate features of M frame anchor point frames in a video frame sequence to be optimized (S201), wherein said video frame sequence comprises N frame video frames, and the M frame anchor point frames comprise a first frame video frame and an N-th frame video frame of said video frame sequence; respectively determining a forward optical flow parameter and a reverse optical flow parameter of each N-M frame intermediate frame using a trained optical flow network (S202); determining an intermediate feature of the N-M frame intermediate frame according to the forward optical flow parameter and the reverse optical flow parameter of the N-M frame intermediate frame, and the intermediate features of the M frame anchor point frames (S203); and performing feature estimation on intermediate features of N frame video frames of said video frame sequence using a trained feature estimation network to obtain N frame optimized images, wherein the N frame optimized images constitute an optimized video of said video frame sequence (S204).

Description

A video optimization method, device, terminal equipment and storage medium

technical field

The present application relates to the technical field of deep learning, and in particular to a video optimization method, device, terminal equipment and storage medium.

Background technique

Video optimization generally includes optimization operations such as video denoising, video rain removal, video super resolution, video color correction, and black and white video coloring. At present, in deep learning-based video optimization schemes, image optimization models (such as image denoising models, image deraining models, super-resolution models, image color toning models, black and white image coloring models, etc.) are often used to extract each image in the video. The intermediate features of a frame of video frame, and perform feature estimation on the intermediate features of each frame of video frame, and obtain the optimized image corresponding to each frame of video frame, so as to realize the optimization of video.

However, this method of independently optimizing each video frame in the video based on the image optimization model may cause different video frames to have different optimization effects, affecting the continuity of the optimized video.

Contents of the invention

In view of this, the present application provides a video optimization method, device, terminal equipment, and storage medium, so as to improve the continuity of the optimized video.

In a first aspect, the present application provides a video optimization method, including:

Use the trained feature extraction network to extract the intermediate features of the M frame anchor frames in the video frame sequence to be optimized. The video frame sequence includes N frame video frames, and the M frame anchor frame includes the first video frame of the video frame sequence. and the Nth video frame, M is a positive integer greater than 2 and less than N; use the trained optical flow network to determine the forward optical flow parameters and reverse optical flow parameters of the N-M frame intermediate frame respectively, and the forward optical flow parameter of the intermediate frame The flow parameters are used to describe the transformation relationship from the previous frame of the intermediate frame to the intermediate frame, and the reverse optical flow parameters of the intermediate frame are used to describe the transformation relationship from the subsequent frame to the intermediate frame. Optimizing the video frames except the anchor point frame in the video; according to the forward optical flow parameter and the reverse optical flow parameter of the N-M frame intermediate frame, and the intermediate feature of the M frame anchor frame, determine the intermediate feature of the N-M frame intermediate frame; utilize The trained feature estimation network performs feature estimation on the intermediate features of each frame of the video frame sequence respectively to obtain N frames of optimized images, and the N frames of optimized images constitute the optimized video of the video frame sequence.

In an optional implementation manner, according to the forward optical flow parameter and the reverse optical flow parameter of the intermediate frame of the N-M frame, and the intermediate feature of the anchor point frame of the M frame, the intermediate feature of the intermediate frame of the N-M frame is determined, including:

For the i-th frame video frame in the video frame sequence, the value of i is {1, 2, ..., N-1, N}, when the i-th frame video frame is an intermediate frame: use the positive value of the i-th frame video frame Transform the shape of the intermediate features of the i-1th video frame to the optical flow parameters to obtain the forward features of the i-th video frame; use the reverse optical flow parameters of the i-th video frame to transform the i+1th video frame Transform the shape of the reverse feature of the i-th video frame to obtain the reverse feature of the i-th video frame; perform feature fusion on the forward feature of the i-th video frame and the reverse feature of the i-th video frame to obtain the i-th video frame Intermediate features; Wherein, if the i+1th video frame is an anchor frame, the reverse feature value of the i+1th video frame is the intermediate feature of the i+1th video frame.

In an optional implementation, feature fusion is performed on the forward feature of the i-th video frame and the reverse feature of the i-th video frame to obtain the intermediate features of the i-th video frame, including:

The i-1th video frame, the i-th video frame, the i+1th video frame, the forward feature of the i-th video frame, the reverse feature of the i-th video frame, the i-1th video frame The forward features of the i+1th video frame and the reverse features of the i+1th video frame are input into the trained FFM model for fusion processing, and the intermediate features of the i-th video frame are obtained, wherein, if the i-1th video frame is the anchor Point frame, the value of the forward feature of the i-1th frame video frame is the intermediate feature of the i-1th frame video frame.

In an optional implementation manner, fusion processing includes:

Obtain the fusion feature of the i-1th frame video frame, the i-th frame video frame and the i+1-th frame video frame; perform fusion feature, the forward feature of the i-th frame video frame and the reverse feature of the i-th frame video frame Weight estimation to obtain a weight matrix; use the weight matrix to weight the forward features of the i-th video frame and the reverse features of the i-th video frame to obtain weighted features; weighted features, fusion features, i-1 frame video The forward feature of the frame and the reverse feature of the i+1th video frame are convoluted to obtain supplementary features; the supplementary features and weighted features are superimposed to obtain the intermediate features of the i-th video frame.

In an optional implementation, the method also includes:

Construct the initial model of video optimization, which includes the initial network of feature extraction, initial network of optical flow, initial network of feature estimation and initial model of FFM; use the preset loss function and training set to conduct unsupervised training on the initial model of video optimization, and get The trained feature extraction network, optical flow network, feature estimation network and FFM model; wherein, the training set includes a plurality of video frame sequence samples to be optimized.

In an optional implementation manner, the feature extraction network and the feature estimation network are obtained by splitting a preset image optimization model, and the image optimization model is used to perform image optimization on a two-dimensional image.

In an optional implementation, the image optimization model is an image coloring model, and the video frame sequence includes N frames of grayscale images; for the i-th frame grayscale image in the video frame sequence, the value of i is {1, 2 ,..., N-1, N}, use the feature estimation network to perform feature estimation on the intermediate features of the grayscale image of the i-th frame, and obtain the optimized image of the grayscale image of the i-th frame including:

Perform color estimation on the intermediate features of the i-th grayscale image, and obtain the a-channel image and b-channel image corresponding to the i-th frame grayscale image; obtain the i-th The color image of the frame grayscale image in the Lab domain, and the color image is the optimized image of the i-th frame grayscale image.

In a second aspect, the present application provides a video optimization device, including:

The extraction unit is used to utilize the trained feature extraction network to extract the intermediate features of the M frame anchor frame in the video frame sequence to be optimized respectively, the video frame sequence includes N frame video frames, and the M frame anchor frame includes the video frame sequence The first video frame and the Nth video frame, M is a positive integer greater than 2 and less than N;

The determining unit is used to determine the forward optical flow parameters and the reverse optical flow parameters of the N-M frame intermediate frame respectively by using the trained optical flow network, and the forward optical flow parameter of the intermediate frame is used to describe the direction of the previous frame of the intermediate frame to the The transformation relationship of the intermediate frame transformation, the reverse optical flow parameter of the intermediate frame is used to describe the transformation relationship of the next frame of the intermediate frame to the intermediate frame transformation, and the intermediate frame is a video frame other than the anchor point frame in the video to be optimized;

The determination unit is also used to determine the intermediate features of the N-M frame intermediate frames according to the forward optical flow parameters and reverse optical flow parameters of the N-M frame intermediate frames, and the intermediate features of the M frame anchor frame;

The estimation unit is used to use the trained feature estimation network to perform feature estimation processing on the intermediate features of the N frames of the video frame sequence to obtain N frames of optimized images, and the N frames of optimized images constitute the optimized video of the video frame sequence.

In a third aspect, the present application provides a terminal device, including: a memory and a processor, where the memory is used to store a computer program; and the processor is used to execute the method described in any one of the above first aspects when calling the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method described in any one of the above-mentioned first aspects is implemented.

In a fifth aspect, an embodiment of the present application provides a computer program product, which, when the computer program product runs on a processor, causes the processor to execute the method described in any one of the above-mentioned first aspects.

Based on the video optimization method, device, terminal equipment and storage medium provided in the present application, by extracting the anchor frame from the video frame sequence to be optimized, the intermediate features of the anchor frame are extracted by using the feature extraction network. For the intermediate frames between the anchor frames, the optical flow parameters between the intermediate frames of each frame and the adjacent two frames before and after each frame are obtained through the optical flow network (that is, including the previous frame used to describe the intermediate frame to the The forward optical flow parameter of the transformation relationship of the intermediate frame transformation, and the reverse optical flow parameter used to describe the transformation relationship of the subsequent frame of the intermediate frame to the intermediate frame). The intermediate features of the intermediate frames are then calculated using the optical flow parameters and the intermediate features of the anchor frames located before and after the intermediate frame. That is, the intermediate features of the intermediate frames are obtained by forward-propagating and back-propagating the intermediate features of the anchor frame between the intermediate frames. Therefore, the intermediate features of the intermediate frames retain the transformation information from frame to frame. Therefore, the optimized video obtained after feature estimation based on the intermediate features of each frame improves the continuity to a certain extent.

Description of drawings

FIG. 1 is a schematic diagram of a network structure of a video optimization model provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a video optimization method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a network structure of an FFM model provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a video optimization device provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed ways

At present, in the video optimization algorithm based on deep learning, the image optimization model is often directly used to individually optimize each frame in the video to achieve video optimization. This method of independently optimizing each video frame in the video based on the image optimization model may cause different video frames to have different optimization effects, affecting the continuity of the optimized video.

To solve this problem, the present application provides a video optimization method. After extracting the intermediate features of the anchor frames in the sequence of video frames to be optimized, the intermediate features of the anchor frames are placed in the intermediate frames (located between the anchor frames) Forward propagation and backpropagation between video frames) to calculate the intermediate features of the intermediate frames. The intermediate features of the intermediate frames retain the transformation information between frames. Therefore, the optimized video obtained after performing feature estimation on the intermediate features of each frame can guarantee the continuity to a certain extent.

The technical solution of the present application will be described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.

First, a video optimization model provided by this application is exemplarily introduced with reference to FIG. 1 . The video optimization model is deployed in a video processing device, and the video processing device can process a sequence of video frames to be optimized based on the video optimization model, so as to implement the video optimization method provided in this application. Wherein, the video processing device may be a mobile terminal device such as a smart phone, a tablet computer, or a video camera, or may be a terminal device capable of processing video data such as a desktop computer, a robot, or a server.

Exemplarily, as shown in FIG. 1 , the video optimization model provided by the present application includes a feature extraction (feature extraction) network GE , an optical flow network ( _FlowNet ) and a feature estimation network G _C .

Among them, the feature extraction network is used to extract the intermediate features of the input image, and the size of the intermediate features matches the input size required by the feature estimation network. The feature estimation network is used to perform feature estimation on the input intermediate features (including feature mapping, feature reconstruction, etc.), and the output is an optimized image.

In one example, the feature extraction network and the feature estimation network can be obtained by splitting an image optimization model for image optimization on 2D images.

For example, when the video optimization model is applied to the video optimization scene of black and white video coloring, the feature extraction network and feature estimation network are obtained by splitting the image coloring model. Wherein, the image coloring model can be any network model capable of automatically coloring black and white images, for example, Pix2Pix model, colornet.t7 model, colornet_imagenet.t7 model, etc.

The image coloring model generally extracts the intermediate features of the input grayscale image (that is, black and white image, which can be regarded as an L channel image in the Lab domain) through network layers such as layer by layer convolution layer, activation layer, and/or pooling layer. Perform color mapping or color reconstruction on the final extracted intermediate features to obtain an a-channel image and a b-channel image. Finally, the color image corresponding to the grayscale image in the Lab domain is constructed through the a-channel image, b-channel image and the input grayscale image.

When splitting the network model, it can be split from any intermediate layer before the network layer that outputs the a-channel image and the b-channel image to obtain two sub-networks. Among them, the sub-network whose input is a grayscale image and whose output is an intermediate feature is defined as a feature extraction network; the sub-network whose input is an intermediate feature and whose output is a color image is defined as a feature estimation network.

For another example, when the video optimization model is applied to the video optimization scene of video super-resolution (that is, converting low-resolution video to high-resolution video), the feature extraction network and feature estimation network are obtained by splitting the super-resolution model. Wherein, the super-resolution model can be any network model capable of mapping low-resolution images to high-resolution images, for example, FSRCNN model, CARN model, SRResNet model, RCAN model, etc.

The super-resolution model generally extracts the intermediate features of the input low-resolution image through layer-by-layer convolutional layers, residual layers, pooling layers, and/or deconvolution layers, and then upsamples the final extracted intermediate features. (i.e., image reconstruction) to obtain the corresponding high-resolution image. When splitting the network model, it can be split from any intermediate layer before the upsampling layer to obtain two sub-networks. Among them, the sub-network whose input is a low-resolution image and whose output is an intermediate feature is defined as a feature extraction network; the sub-network whose input is an intermediate feature and whose output is a high-resolution image is defined as a feature estimation network.

It can be understood that corresponding to different video optimization scenarios, for example, in addition to the above-mentioned video colorization and video super-resolution, video optimization scenarios such as video rain removal, video defogging, and video color adjustment may also be included. You can directly split the image optimization model corresponding to the scene to build a video optimization model. I won't list them all here.

The optical flow network is used to estimate the optical flow parameters of two adjacent video frames, that is, the amount of movement of the same object from one video frame to another, which can describe the transformation from one video frame to another. transform relationship. Exemplarily, FlowNet2.0 may be used as the optical flow network in this application.

Based on the above video optimization model, the video processing device obtains the video frame sequence to be optimized, and after determining the M frame anchor frame and the N-M frame intermediate frame in the video frame sequence, the video frame sequence can be input into the trained Processed in the video optimization model to obtain the optimized video.

Wherein, the video frame sequence to be optimized may be a video segment cut out from a video, or a complete video. Assume that the video frame sequence includes N frames of video. In the N video frames including the first video frame and the Nth video frame, there are M anchor frames, where M is a positive integer greater than 2 and less than N.

The M frames of anchor frames may be designated manually, or may be identified by the video processing device from N frames of video frames according to preset anchor frame extraction rules. For example, if the number of intervals between intermediate frames is set to 10, then the video processing device can start from the first video frame, identify the first video frame as the first frame anchor frame, and identify the twelfth frame of video after an interval of 10 intermediate frames The frame is the second anchor frame, and so on until the Nth video frame is identified as the Mth anchor frame. It can be understood that the number of intermediate frames between the anchor frame of the M-th frame and the anchor frame of the M-1 frame may be less than 10 frames. As the name implies, the intermediate frame is the video frame between two adjacent anchor frames in the N frame video frame, for example, the first video frame and the 12th video frame are two adjacent video frames, located in the first video frame The 2nd-11th frame video frame between the 12th frame and the 12th video frame is an intermediate frame.

Exemplarily, the video processing device performs video optimization on the video frame sequence to be optimized based on the video optimization model, as shown in FIG. 2 , including:

S201. Use the trained feature extraction network to extract the intermediate features of M frame anchor frames respectively.

For example, take four consecutive video frames x ₁ , x ₂ , x ₃ , and x ₄ shown in FIG. 1 as an example. Among them, the first video frame x ₁ and the fourth video frame x ₄ are anchor frames, and the second video frame x ₂ and the third video frame x ₃ are intermediate frames. The video processing device inputs x ₁ and x ₄ respectively to the feature extraction network _GE for processing to obtain the intermediate feature F _{1 of x 1} _and the intermediate feature F _{4 of x 4} _.

S202, using the trained optical flow network to determine the forward optical flow parameters and reverse optical flow parameters of the intermediate frames of the N-M frames respectively.

Among them, the forward optical flow parameters of the intermediate frame are used to describe the transformation relationship from the previous frame of the intermediate frame to the intermediate frame, and the reverse optical flow parameters of the intermediate frame are used to describe the transition from the next frame of the intermediate frame to the intermediate frame. The transformation relationship of frame transformation.

For example, as shown in Figure 1, for the intermediate frame x ₂ , the video processing device inputs x ₁ and x ₂ into the optical flow network, and obtains the forward optical flow parameter f _1→2 of x ₂ (used to describe x ₁ to x ₂ transform transformation relationship). Input x ₃ and x ₂ into the optical flow network to obtain the reverse optical flow parameter f _3→2 of x ₂ (used to describe the transformation relationship from x ₃ to x ₂ ). For the intermediate frame x ₃ , the video processing device inputs x ₂ and x ₃ into the optical flow network to obtain the forward optical flow parameter f _2→3 of x ₃ (used to describe the transformation relationship from x ₂ to x ₃ ). Input x ₄ and x ₃ into the optical flow network to obtain the reverse optical flow parameter f _4→3 of x ₃ (used to describe the transformation relationship from x ₄ to x ₃ ).

S203. According to the forward optical flow parameters and reverse optical flow parameters of the intermediate frames of the N-M frames, and the intermediate features of the anchor frames of the M frames, determine the intermediate features of the intermediate frames of the N-M frames.

In the embodiment of the present application, for an intermediate frame located between two adjacent anchor frames, the optical flow parameters are used to make the intermediate features of the two anchor frames propagate between the intermediate frames. That is to say, the optical flow parameters between each intermediate frame and the adjacent two frames of video frames are calculated through the optical flow network, and based on each optical flow parameter, the intermediate features of the anchor frame are forward or backward Propagated frame by frame such that the intermediate features of the intermediate frame are aligned to the intermediate features of the anchor frame.

Exemplarily, for the i-th video frame in the video frame sequence, the value of i is {1, 2, ..., N-1, N}, when the i-th video frame is an intermediate frame:

The video processing device can use the forward optical flow parameters of the i-th frame of video frame to perform shape transformation on the intermediate features of the i-1th frame of video frame to obtain the forward feature of the i-th frame of video frame; Perform shape transformation on the reverse feature of the i+1 frame video frame to the optical flow parameter to obtain the reverse feature of the i frame video frame; for the forward feature of the i frame video frame and the reverse feature of the i frame video frame Feature fusion is performed to obtain the intermediate features of the i-th video frame.

It is worth noting that the intermediate feature of the anchor frame can also be used as the reverse feature and forward feature of the anchor frame, that is, the values of the intermediate feature, reverse feature, and forward feature of the anchor frame are the same. That is to say, if the i+1th frame video frame is an anchor frame, the reverse feature value of the i+1th frame video frame is the intermediate feature of the i+1th frame video frame extracted by the feature extraction network.

For example, taking x ₁ , x ₂ , x ₃ and x ₄ shown in Figure 1 as an example, it shows that the intermediate features of the anchor frame x ₁ and x ₄ are back-propagated and forward-propagated between the intermediate frames x ₂ and x ₃ , in a way to obtain the intermediate features of the intermediate frames _x2 and _x3 .

As shown in Figure 1, firstly, the intermediate feature _{F4 of x4} _is backpropagated to obtain the reverse features of _x2 and _x3 . That is, use the reverse optical flow parameter f _4→3 of x ₃ to perform a shape change (warp) operation on F ₄ to obtain the reverse feature of x ₃

get

After that, continue to use the reverse optical flow parameter f _3→2 of x ₂ for

Perform warp operation to get the reverse feature of x ₂

Then based on the reverse features _of _x2 and _x3 , the intermediate feature F1 _of x1 is forward propagated to obtain the intermediate features of _x2 and _x3 . That is, use the forward optical flow parameter f _1→2 of x ₂ to perform warp operation on F ₁ to obtain the forward feature of x ₂

followed by

and

Perform feature fusion to obtain the intermediate feature F _{2 of x 2} _. After obtaining F ₂ , continue to use the forward optical flow parameter f _2→3 of x ₃ to perform warp operation on F ₂ to obtain the forward feature of x ₃

followed by

and

Perform feature fusion to obtain the intermediate feature F _{3 of x 3} _.

It can be seen that when calculating the intermediate features of the intermediate frame, the intermediate features of one anchor frame are transmitted backwards first, and the intermediate features of another anchor frame are forward propagated. This two-way transmission of information can complement the information loss caused by the optical flow network and warp operation in a single transmission direction, and the temporal continuity of the intermediate features of each frame. This is more conducive to the subsequent video optimization effect.

In addition, since the intermediate features of the intermediate frame are calculated based on the intermediate features of the anchor frames located on both sides of the intermediate frame, when there is a changing scene in the video frame sequence, some The influence information only exists in this time interval (even between two anchor frames), and will not affect the accuracy of intermediate features of intermediate frames in other time intervals.

Among them, when performing feature fusion on the forward feature of the i-th video frame and the reverse feature of the i-th video frame, the feature fusion can be performed by numerical calculation, or the feature fusion network can be set in the video optimization model. feature fusion. The feature fusion network may be a conventional feature fusion network, for example, a field-aware factorization machine (FFM), a factorization machine (Factorization Machines, FM) and the like with field-aware capabilities.

Optionally, this embodiment of the present application provides an improved FFM model, by inputting the i-1th video frame x _i-1 , the i-th video frame x _i , the i+1th video frame x _i+1 , Forward features of the i-th video frame

Reverse feature of the i-th video frame

Forward features of the i-1th video frame

And the reverse feature of the i+1th frame video frame

Perform a feature fusion operation to output the intermediate feature F _i of the i-th video frame. That is, as shown in FIG. 1 , the video optimization model provided in the embodiment of the present application also includes the FFM model provided in the embodiment of the present application.

Exemplary, to fuse the forward features of the i-th frame video frame

and the reverse feature of the i-th video frame

Taking the intermediate feature F _i of the i-th video frame as an example, the network structure of the FFM model provided in this application can be shown in FIG. 3 .

Firstly, feature extraction is performed on _xi-1 , _xi and _xi+1 , for example, a convolutional layer is used to extract features on _xi-1 , _xi and _xi+1 respectively. The extracted features are combined (concat) to obtain a combined feature. The merged features are then fed into the weighting network and feature refinement network respectively.

Among them, the weight estimation network and the feature compensation network are respectively composed of multiple convolutional layers. The weight estimation network is based on the input merged features,

and

After multi-layer convolution operation, the weight matrix W is output. Using W to

and

Weighted, can achieve the

and

The choice of the same pixel in the middle, get a fusion feature

(E.g,

).

Will

After a 1×1 convolution operation, it is input into the feature compensation network. The feature compensation network merges the input features, the convolutional

and

After multi-layer convolution operation, the output can be compared with

Corresponding Supplementary Features

The supplementary feature

can be restored in the calculation

and

In the process, the missing information is caused by the optical flow network and warp operation. Will

and

After the superposition, the intermediate feature F _i of the i-th video frame is obtained.

It is worth noting that the FFM model provided by this application can not only refer to the positive features of the i-1th video frame x _i-1 , the i+1th video frame x _i+1 , and the i-1th video frame

And the reverse feature of the i+1th frame video frame

to construct the F _i of the i-th video frame. That is, the information of the previous and subsequent frames is considered, so that the F _i of the i-th video frame has more continuity with the intermediate features of the previous and subsequent frames in time. while being able to

and

Supplement missing information due to optical flow network and warp operation. Therefore, the continuity of the intermediate features of the intermediate frames can be further improved.

S204. Use the trained feature estimation network to perform feature estimation on the intermediate features of each frame of the video frame sequence to obtain N frames of optimized images, and the N frames of optimized images constitute an optimized video of the video frame sequence.

Take the video optimization scenario of black and white video colorization as an example. For the i-th frame grayscale image in the video frame sequence, the feature estimation network is used to perform feature estimation on the intermediate features of the i-th frame grayscale image, and the optimized image of the i-th frame grayscale image includes:

Perform color estimation on the intermediate features of the i-th frame grayscale image to obtain output information

in,

Including the a-channel image and b-channel image corresponding to the i-th frame grayscale image; according to the i-th frame grayscale image, a-channel image and b-channel image, the color image of the i-th frame grayscale image in the Lab domain is obtained, and the color The image is the optimized image of the i-th frame grayscale image.

For example, in Figure 1, after obtaining the intermediate features of x ₁ , x ₂ , x ₃ and x ₄ , input the intermediate features of x ₁ , x ₂ , x ₃ and x ₄ into the feature estimation network G _C for processing, Output Get the output information corresponding to x ₁ , x ₂ , x ₃ and x ₄ respectively

Since the video optimization scene applied to black and white video coloring is taken as an example in Figure 1, each output information

Including a channel image and b channel image. After the a-channel image, b-channel image and grayscale image are combined, color images corresponding to x ₁ , x ₂ , x ₃ and x ₄ respectively can be obtained.

In summary, with the video optimization method provided by this application, instead of extracting the intermediate features of each video frame independently, the anchor frame is selected, and after extracting the intermediate features of the anchor frame, the intermediate feature of the anchor frame is extracted. The features are forward-propagated and back-propagated between the intermediate frames to compute intermediate features for the intermediate frames. The intermediate features of the intermediate frames retain the transformation information between frames. Therefore, the optimized video obtained after performing feature estimation on the intermediate features of each frame can guarantee the continuity to a certain extent.

The training process of the video optimization model provided by this application is described as an example below.

First, an initial model for video optimization is constructed, which includes an initial network for feature extraction, an initial network for optical flow, and an initial network for feature estimation.

It can be understood that when constructing the initial video optimization model, a corresponding image optimization model can be selected based on a specific video optimization scenario. Then the image optimization model is split to obtain the corresponding feature extraction initial network and feature estimation initial network.

In addition, if the feature fusion network is used to fuse the intermediate features and reverse features, then the feature fusion network can also be set in the initial model of video optimization. For example, the improved FFM initial model provided by the present application may be set in the video optimization initial model.

Afterwards, unsupervised training is performed on the video optimization initial model using the preset loss function and training set to obtain a trained video optimization module. Correspondingly, the trained video optimization module includes the above-mentioned trained feature extraction network, optical flow network, feature estimation network and FFM model.

In the embodiment of the present application, the training set includes a plurality of video frame sequence samples to be optimized. Since unsupervised training is adopted, the training set may not need to collect corresponding color video frame sequences.

The design of the loss function can be designed based on the actual video optimization scenario. For example, taking the video optimization scene of black and white video coloring as an example, the loss function can be designed as:

Among them, M is the occlusion matrix,

N is the frame number of video frame sequence samples, d is the interval between adjacent frames, d=1 means two adjacent frames, and d=2 means two adjacent frames separated by one frame.

Indicates the output information of the i-th video frame sample.

Indicates the result of transforming the video frame sample of the i+d frame to the video frame sample of the i frame through the warp operation. due to need

and

The content is consistent, so the loss constraint can be lost based on the loss function.

Exemplarily, a gradient descent algorithm may be used during training. The parameters of the network are learned through iteration. For example, the initial learning rate can be set to 1e-4, and every 50,000 iterations, the learning rate is decayed by half until the network converges.

It is worth noting that the video optimization model and training method provided by this application are universal. It can be applied to any video optimization task or any task that takes the video optimization effect as an evaluation index.

Based on the same inventive concept, as an implementation of the above method, the embodiment of the present application provides a video optimization device. The device embodiment corresponds to the aforementioned method embodiment. The details in the present invention will be described one by one, but it should be clear that the device in this embodiment can correspondingly implement all the content in the foregoing method embodiments.

FIG. 4 is a schematic structural diagram of a video optimization device provided by an embodiment of the present application. As shown in FIG. 4 , the video optimization device provided by this embodiment includes: an extraction unit 401 , a determination unit 402 and an estimation unit 403 .

The extraction unit 401 is used to utilize the trained feature extraction network to extract the intermediate features of the M frame anchor frames in the video frame sequence to be optimized respectively, the video frame sequence includes N frame video frames, and the M frame anchor frame includes a video frame sequence The 1st video frame and the Nth video frame of , M is a positive integer greater than 2 and less than N.

The determination unit 402 is used to determine the forward optical flow parameters and reverse optical flow parameters of the N-M frame intermediate frames respectively by using the trained optical flow network, and the forward optical flow parameters of the intermediate frames are used to describe the direction of the previous frame of the intermediate frame. The transformation relationship of the intermediate frame transformation, the reverse optical flow parameter of the intermediate frame is used to describe the transformation relationship of the subsequent frame of the intermediate frame to the intermediate frame, and the intermediate frame is a video frame other than the anchor point frame in the video to be optimized.

The determination unit 402 is further configured to determine the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and reverse optical flow parameters of the N-M intermediate frames, and the intermediate features of the M anchor frames.

The estimation unit 403 is configured to use the trained feature estimation network to perform feature estimation processing on the intermediate features of N frames of the video frame sequence to obtain N frames of optimized images, and the N frames of optimized images constitute an optimized video of the video frame sequence.

The video optimization device provided in this embodiment can execute the above-mentioned method embodiment, and its implementation principle and technical effect are similar, and details are not repeated here.

Those skilled in the art can clearly understand that for the convenience and brevity of description, only the division of the above-mentioned functional units and modules is used for illustration. In practical applications, the above-mentioned functions can be assigned to different functional units, Completion of modules means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit, and the above-mentioned integrated units may adopt hardware It can also be implemented in the form of software functional units. In addition, the specific names of each functional unit and module are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, and details will not be repeated here.

Based on the same inventive concept, the embodiment of the present application also provides a terminal device. FIG. 5 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in FIG. 5 , the terminal device provided in this embodiment includes: a memory 501 and a processor 502, the memory 501 is used to store computer programs; the processor 502 is used to The methods described in the above method embodiments are executed when the computer program is called.

Exemplarily, the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 501 and executed by the processor 502 to complete this Apply the method described in the examples. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program in the terminal device.

Those skilled in the art can understand that FIG. 5 is only an example of a terminal device, and does not constitute a limitation on the terminal device. It may include more or less components than those shown in the figure, or combine certain components, or different components, such as The terminal device may also include an input and output device, a network access device, a bus, and the like.

The processor 502 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The storage 501 may be an internal storage unit of the terminal device, such as a hard disk or memory of the terminal device. The memory 82 may also be an external storage device of the terminal device, such as a plug-in hard disk equipped on the terminal device, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, Flash card (Flash Card), etc. Further, the memory 501 may also include both an internal storage unit of the terminal device and an external storage device. The memory 501 is used to store the computer program and other programs and data required by the terminal device. The memory 501 can also be used to temporarily store data that has been output or will be output.

The terminal device provided in this embodiment can execute the foregoing method embodiment, and its implementation principle and technical effect are similar, and details are not repeated here.

The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method described in the foregoing method embodiment is implemented.

The embodiment of the present application further provides a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to implement the method described in the foregoing method embodiments when executed.

If the above-mentioned integrated units are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments in the present application can be completed by instructing related hardware through computer programs, and the computer programs can be stored in a computer-readable storage medium. The computer program When executed by a processor, the steps in the above-mentioned various method embodiments can be realized. Wherein, the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form. The computer-readable storage medium may at least include: any entity or device capable of carrying computer program codes to a photographing device/terminal device, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access Memory (Random Access Memory, RAM), electrical carrier signal, telecommunication signal and software distribution medium. Such as U disk, mobile hard disk, magnetic disk or optical disk, etc. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunication signals under legislation and patent practice.

In the above-mentioned embodiments, the descriptions of each embodiment have their own emphases, and for parts that are not detailed or recorded in a certain embodiment, refer to the relevant descriptions of other embodiments.

Those skilled in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

In the embodiments provided in this application, it should be understood that the disclosed device/device and method can be implemented in other ways. For example, the device/device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods, such as multiple units or Components may be combined or integrated into another system, or some features may be omitted, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of described features, integers, steps, operations, elements and/or components, but does not exclude one or more other Presence or addition of features, wholes, steps, operations, elements, components and/or collections thereof.

It should also be understood that the term "and/or" used in the description of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and the appended claims, the term "if" may be construed, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting ". Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once determined" or "in response to the determination" or "once detected [the described condition or event] ]” or “in response to detection of [described condition or event]”.

In addition, in the description of the specification and appended claims of the present application, the terms "first", "second", "third" and so on are only used to distinguish descriptions, and should not be understood as indicating or implying relative importance.

Reference to "one embodiment" or "some embodiments" or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in other embodiments," etc. in various places in this specification are not necessarily All refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless specifically stated otherwise.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and are not intended to limit it; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: It is still possible to modify the technical solutions described in the foregoing embodiments, or perform equivalent replacements for some or all of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the technical solutions of the various embodiments of the present application. scope.

Claims

A video optimization method, characterized in that the method comprises:

Utilize the trained feature extraction network to extract the intermediate features of M frame anchor frames in the video frame sequence to be optimized respectively, the video frame sequence includes N frame video frames, and the M frame anchor frame includes the video frame sequence The 1st frame video frame and the Nth frame video frame, M is a positive integer greater than 2 and less than N;

Use the trained optical flow network to determine the forward optical flow parameters and reverse optical flow parameters of the N-M frame intermediate frame respectively, and the forward optical flow parameter of the intermediate frame is used to describe the direction of the previous frame of the intermediate frame to the The transformation relationship of the intermediate frame transformation, the reverse optical flow parameter of the intermediate frame is used to describe the transformation relationship of the subsequent frame of the intermediate frame to the intermediate frame transformation, and the intermediate frame is the video to be optimized except a video frame other than the anchor frame;

According to the forward optical flow parameter and the reverse optical flow parameter of the intermediate frame described in the N-M frame, and the intermediate feature of the anchor point frame described in the M frame, determine the intermediate feature of the intermediate frame described in the N-M frame;

Using the trained feature estimation network to perform feature estimation on the intermediate features of each frame of the video frame sequence to obtain N frames of optimized images, and the N frames of optimized images constitute an optimized video of the video frame sequence.
The method according to claim 1, wherein the N-M frame is determined according to the forward optical flow parameter and the reverse optical flow parameter of the intermediate frame of the N-M frame, and the intermediate feature of the anchor frame of the M frame The intermediate features of the intermediate frame include:

For the i-th video frame in the video frame sequence, i takes a value of {1, 2, ..., N-1, N}, when the i-th video frame is the intermediate frame:

Using the forward optical flow parameters of the i-th video frame to perform shape transformation on the intermediate features of the i-1th video frame to obtain the forward features of the i-th video frame;

Using the reverse optical flow parameters of the i-th frame video frame to carry out shape transformation to the reverse feature of the i+1 frame video frame, to obtain the reverse feature of the i-th frame video frame;

Carrying out feature fusion to the forward feature of the i-th video frame and the reverse feature of the i-th video frame to obtain the intermediate feature of the i-th video frame;

Wherein, if the i+1th video frame is the anchor frame, the reverse feature of the i+1th video frame is an intermediate feature of the i+1th video frame.
The method according to claim 2, wherein the feature fusion is performed on the forward feature of the i-th video frame and the reverse feature of the i-th video frame to obtain the i-th video frame Intermediate features of the frame, including:

The i-1th frame video frame, the i-th frame video frame, the i+1-th frame video frame, the forward feature of the i-th frame video frame, the reverse feature of the i-th frame video frame , the forward feature of the i-1th frame video frame and the reverse feature of the i+1th frame video frame are input into the trained FFM model for fusion processing to obtain the middle of the i-th frame video frame feature, wherein, if the i-1th video frame is the anchor frame, the value of the forward feature of the i-1th video frame is the intermediate feature of the i-1th video frame.
The method according to claim 3, wherein the fusion process comprises:

Acquiring fusion features of the i-1th video frame, the i-th video frame, and the i+1th video frame;

Carrying out weight estimation to the fusion feature, the forward feature of the i-th video frame and the reverse feature of the i-th video frame, to obtain a weight matrix;

Using the weight matrix to weight the forward feature of the i-th video frame and the reverse feature of the i-th video frame to obtain weighted features;

Perform convolution calculation on the weighted feature, the fusion feature, the forward feature of the i-1th video frame and the reverse feature of the i+1th video frame to obtain the supplementary feature;

The supplementary features and the weighted features are superimposed to obtain the intermediate features of the i-th video frame.
The method according to claim 3, further comprising:

Build video optimization initial model, described video optimization initial model comprises feature extraction initial network, optical flow initial network, feature estimation initial network and FFM initial model;

Using a preset loss function and training set to carry out unsupervised training on the video optimization initial model to obtain the trained feature extraction network, the optical flow network, the feature estimation network and the FFM model;

Wherein, the training set includes a plurality of video frame sequence samples to be optimized.
The method according to any one of claims 1-5, wherein the feature extraction network and the feature estimation network are obtained by splitting a preset image optimization model, and the image optimization model is used for two-dimensional Image for image optimization.
The method according to claim 6, wherein the image optimization model is an image coloring model, and the video frame sequence includes N frames of grayscale images;

For the i-th frame grayscale image in the video frame sequence, the value of i is {1, 2, ..., N-1, N}, and the feature estimation network is used to analyze the i-th frame grayscale image Perform feature estimation on the intermediate features, and obtain the optimized image of the i-th frame grayscale image including:

Carrying out color estimation on the intermediate features of the grayscale image of the i-th frame, and obtaining the a-channel image and the b-channel image corresponding to the grayscale image of the i-th frame;

Obtain the color image of the i-th frame gray-scale image in the Lab domain according to the i-th frame gray-scale image, the a-channel image and the b-channel image, and the color image is the i-th frame gray-scale image Optimized image for graph.
A video optimization device, characterized in that it comprises:

The extraction unit is used to utilize the trained feature extraction network to extract the intermediate features of the M frame anchor frames in the video frame sequence to be optimized respectively, the video frame sequence includes N frames of video frames, and the anchor frame of the M frames includes The first video frame and the Nth video frame of the video frame sequence, M is a positive integer greater than 2 and less than N;

A determining unit, configured to use the trained optical flow network to determine the forward optical flow parameters and reverse optical flow parameters of the N-M frame intermediate frames, the forward optical flow parameters of the intermediate frames are used to describe the front of the intermediate frames Transformation relationship from one frame to the intermediate frame, the reverse optical flow parameter of the intermediate frame is used to describe the transformation relationship from the next frame of the intermediate frame to the intermediate frame, the intermediate frame is the Video frames other than the anchor frame in the video to be optimized;

The determining unit is further configured to determine the intermediate features of the intermediate frames of the N-M frames according to the forward optical flow parameters and reverse optical flow parameters of the intermediate frames of the N-M frames, and the intermediate features of the anchor frame of the M frames ;

The estimation unit is used to perform feature estimation processing on the intermediate features of the N frames of video frames in the video frame sequence by using the trained feature estimation network to obtain N frames of optimized images, and the N frames of optimized images constitute the video frame sequence optimized video for .
A terminal device, characterized in that it includes: a memory and a processor, the memory is used to store a computer program; the processor is used to execute the method described in any one of claims 1-6 when calling the computer program method.
A computer-readable storage medium on which a computer program is stored, wherein the computer program implements the method according to any one of claims 1-6 when executed by a processor.