CN115631093A - Video super-resolution model training method and device and video super-resolution processing method and device - Google Patents

Video super-resolution model training method and device and video super-resolution processing method and device

Info

Publication number
CN115631093A
Authority
CN
China
Prior art keywords
video
hyper
network
alignment
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211304774.4A
Other languages
Chinese (zh)
Inventor
王娜
江列霖
党青青
赖宝华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211304774.4A priority Critical patent/CN115631093A/en
Publication of CN115631093A publication Critical patent/CN115631093A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformation in the plane of the image
    • G06T 3/40: Scaling the whole image or part thereof
    • G06T 3/4046: Scaling the whole image or part thereof using neural networks

Abstract

The disclosure provides a video super-resolution model training method and apparatus, and relates to technical fields such as computer vision and deep learning. The specific implementation scheme is as follows: acquiring a video set comprising two or more video frames; acquiring a pre-established video super-resolution network, wherein the video super-resolution network comprises a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; and performing the following training steps: inputting consecutive video frames selected from the video set into the video super-resolution network, calculating a loss value of the video super-resolution network, and obtaining a video super-resolution model based on the loss value of the video super-resolution network. The embodiment improves the video super-resolution effect of the model.

Description

Video super-resolution model training method and device and video super-resolution processing method and device
Technical Field
The present disclosure relates to the field of computer technologies, in particular to the fields of computer vision and deep learning, and more particularly to a video super-resolution model training method and apparatus, a video super-resolution processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
The video super-resolution task is to recover the corresponding high-resolution video from a given low-resolution video. In recent years, with the explosive growth of internet video data, video super-resolution technology has received great attention from researchers. Like image super-resolution, video super-resolution is an ill-posed problem. Unlike image super-resolution, however, video super-resolution requires not only attention to the corresponding low-resolution frame but also the use of information from successive frames in the video sequence. Some early video super-resolution methods continue the idea of image super-resolution and do not make good use of consecutive video frame information, resulting in unsatisfactory video super-resolution results.
Disclosure of Invention
The present disclosure provides a video super-resolution model training method and apparatus, a video super-resolution processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to a first aspect, there is provided a video super-resolution model training method, the method comprising: acquiring a video set comprising two or more video frames; acquiring a pre-established video super-resolution network, wherein the video super-resolution network comprises a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; and performing the following training steps: inputting consecutive video frames selected from the video set into the video super-resolution network, and calculating a loss value of the video super-resolution network; and obtaining a video super-resolution model based on the loss value of the video super-resolution network.
According to a second aspect, there is provided a video super-resolution processing method, the method comprising: acquiring a plurality of video frames to be processed; and inputting the video frames to be processed into the video super-resolution model generated by the method described in any implementation of the first aspect, and outputting a video super-resolution processing result of the video frames to be processed.
According to a third aspect, there is provided a video super-resolution model training apparatus, the apparatus comprising: a video acquisition unit configured to acquire a video set comprising two or more video frames; a network acquisition unit configured to acquire a pre-established video super-resolution network, the video super-resolution network comprising a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, wherein the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; a selection unit configured to input consecutive video frames selected from the video set into the video super-resolution network; a calculation unit configured to calculate a loss value of the video super-resolution network; and an obtaining unit configured to obtain a video super-resolution model based on the loss value of the video super-resolution network.
According to a fourth aspect, there is provided a video super-resolution processing apparatus comprising: an acquisition unit configured to acquire a plurality of video frames to be processed; and an input unit configured to input the video frames to be processed into the video super-resolution model generated by the apparatus according to any implementation of the third aspect, to obtain a video super-resolution processing result of the video frames to be processed.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the implementations of the first aspect or the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method as described in any implementation of the first or second aspects.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspect.
The embodiments of the present disclosure provide a video super-resolution model training method and apparatus. The method comprises: first, acquiring a video set comprising two or more video frames; second, acquiring a pre-established video super-resolution network, wherein the video super-resolution network comprises a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; third, inputting consecutive video frames selected from the video set into the video super-resolution network; then, calculating a loss value of the video super-resolution network; and finally, obtaining a video super-resolution model based on the loss value of the video super-resolution network. By training a video super-resolution network that combines a sliding-window network and a bidirectional recurrent network, the obtained video super-resolution model has the advantages of both networks; and because the video super-resolution network performs multi-stage super-resolution and refinement on the input video frames, the detail preservation of the video frame sequence is ensured and the video super-resolution effect of the model is improved.
In the video super-resolution processing method and apparatus provided by the embodiments of the present disclosure, a plurality of video frames to be processed are acquired and input into the video super-resolution model generated by the video super-resolution model training method of the above embodiment, to obtain a video super-resolution processing result of the frames to be processed. Because the video super-resolution model contains a video super-resolution network with several super-resolution stages, reliable video super-resolution processing can be performed on the video to be processed and the effectiveness of the super-resolution processing is guaranteed.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of one embodiment of a video super-resolution model training method according to the present disclosure;
FIG. 2 is a schematic structural diagram of a video super-resolution network in an embodiment of the present disclosure;
FIG. 3 is a flow diagram of one embodiment of a video super-resolution processing method according to the present disclosure;
FIG. 4 is a schematic diagram of an embodiment of a video super-resolution model training apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of an embodiment of a video super-resolution processing apparatus according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the video super-resolution model training method or the video super-resolution processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For video frames in the same piece of video, the content source may be the same, but there can be motion offsets, scene offsets, or illumination offsets between frames. In this case, how to align the video frames and find a common reference point is the problem that video frame alignment needs to solve.
Video alignment methods align the adjacent frames in a video with a target frame by extracting motion information, and mainly fall into two categories: motion estimation and compensation, and deformable convolution. Motion estimation and compensation extracts inter-frame motion information (for example by optical flow estimation) and performs a warping operation between frames according to that motion information to align them. The deformable convolution method uses deformable convolution to produce offsets for the input features through a convolution operation; the convolution kernel of the deformable convolution is obtained by adding offsets to a conventional convolution kernel, as in the EDVR (Video Restoration with Enhanced Deformable Convolutional Networks) model.
Some early methods were inspired by image super-resolution and simply applied that framework to video super-resolution, so the temporal information between adjacent frames was not fully exploited. To solve this problem, recent approaches have designed more elaborate modules, such as bidirectional recurrent neural networks. The basic idea of a bidirectional recurrent neural network is to run two recurrent networks over each training sequence, one forward and one backward: the forward subnetwork takes the video frames in forward order and the backward subnetwork takes them in reverse order. Such recurrent networks often introduce an optical-flow alignment network (SPyNet) to perform alignment during propagation. An important advantage of recurrent networks is that they can use information about previous and subsequent frames in the mapping between input and output sequences; the improvement brought by the bidirectional recurrent network is that the current output (the output of the t-th frame) is assumed to be related not only to the preceding frames but also to the following frames. For example, predicting a frame in a video then uses both the previous and the subsequent frames. A bidirectional recurrent neural network is formed by stacking two recurrent neural networks, and the output is determined by the hidden states of both.
Depending on whether video frames are explicitly aligned, video super-resolution methods can be divided into two broad categories: non-aligned methods and aligned methods. Non-aligned methods have a simple network structure but generally recover large-motion videos poorly. Aligned methods typically have complex alignment modules with a large number of parameters. Designing a model that includes an alignment module, has a small number of parameters, and still achieves a good recovery effect is therefore an urgent and difficult problem. Based on this, the present disclosure proposes a video super-resolution model training method. Fig. 1 shows a flow 100 of an embodiment of the video super-resolution model training method of the present disclosure, which includes the following steps:
Step 101, obtaining a video set comprising two or more video frames.
In this embodiment, the execution subject on which the video super-resolution model training method runs may obtain the video set in a number of ways. For example, the execution subject may obtain a video set stored in a database server through a wired or wireless connection. As another example, the execution subject may obtain a video set collected by a terminal by communicating with that terminal.
Here, the video set may include at least two or more video frames, and those video frames may make up at least one video sequence. Consecutive video frames may be selected from the video set as the samples required for training the video super-resolution network; the selected consecutive video frames form one video sequence, and inputting that video sequence into the video super-resolution network yields the video sequence output by the network.
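For illustration only, the selection of consecutive training frames described above might be sketched as follows. The sequence length, tensor shapes and helper names are assumptions of this sketch, and PyTorch is used purely as an example framework; none of this is prescribed by the disclosure.

```python
import random
import torch

def sample_consecutive_frames(video_set, seq_len=5):
    """Pick one video sequence and a random window of `seq_len` consecutive frames.

    `video_set` is assumed to be a list of tensors, each of shape (T, C, H, W),
    holding one low-resolution video sequence.
    """
    video = random.choice(video_set)
    t = random.randint(0, video.shape[0] - seq_len)
    return video[t:t + seq_len]  # (seq_len, C, H, W)

# toy example: two sequences of 30 frames, 64x64 RGB
video_set = [torch.rand(30, 3, 64, 64) for _ in range(2)]
lr_frames = sample_consecutive_frames(video_set)
print(lr_frames.shape)  # torch.Size([5, 3, 64, 64])
```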
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the video frames and video sequences involved are performed after authorization and comply with relevant laws and regulations.
Step 102, acquiring a pre-established video super-resolution network.
As shown in fig. 2, in the video super-resolution network the sliding-window alignment subnetwork 1, the recurrent alignment subnetwork 2 and the refined alignment subnetwork 3 are connected in sequence. The sliding-window alignment subnetwork 1 receives a plurality of input consecutive video frames (a video sequence); after processing by the recurrent alignment subnetwork 2 and the refined alignment subnetwork 3 in turn, the super-resolved video frames output by the refined alignment subnetwork 3 are obtained (the video frames output by subnetwork 3 in fig. 2). The output is again a sequence of consecutive video frames, and its resolution is higher than that of the consecutive video frames input to the sliding-window alignment subnetwork 1, achieving the effect of recovering a high-resolution video from a low-resolution one.
In this embodiment, the sliding-window alignment subnetwork estimates, by a sliding-window method, a high-quality reference frame close to the annotated high-resolution video from a plurality of consecutive video frames. For example, the sliding-window alignment subnetwork takes 2N+1 (N ≥ 1) consecutive video frames as input, the middle frame of the input is taken as the reference frame, and the other frames are treated as neighboring frames. The sliding-window alignment subnetwork comprises an alignment module (PCD, Pyramid, Cascading and Deformable convolution) and a fusion module (TSA, Temporal and Spatial Attention): each neighboring frame is aligned with the reference frame at the feature level by the alignment module, and the fusion module fuses the image information of the different frames to obtain a fused image.
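The 2N+1 windowing described above can be illustrated with a small indexing sketch; replicating indices at the sequence borders is an assumption of this sketch rather than something stated in the disclosure.

```python
def sliding_windows(num_frames, n=1):
    """For each reference frame i, return the indices of its 2N+1 window,
    replicating indices at the sequence borders."""
    windows = []
    for i in range(num_frames):
        window = [min(max(j, 0), num_frames - 1) for j in range(i - n, i + n + 1)]
        windows.append(window)
    return windows

print(sliding_windows(5, n=1))
# [[0, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 4]]
```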
Optionally, in the sliding-window alignment subnetwork, a pre-deblurring module may also be used before the alignment module to pre-process blurred input video frames and thereby improve the alignment accuracy.
In some optional implementations of this embodiment, the sliding-window alignment subnetwork comprises a feature extraction module and a local fusion module. The feature extraction module is used to extract features of the selected consecutive video frames. The local fusion module is used to align a target-frame feature among the features of the selected consecutive video frames with the neighboring-frame features adjacent to it, obtaining at least one neighbor-aligned feature; the local fusion module then concatenates the at least one neighbor-aligned feature with the target-frame feature in order and obtains the fused feature through residual blocks.
In this optional implementation, the feature extraction module mainly maps the selected consecutive video frames into a high-dimensional feature space to obtain high-dimensional video-frame features. The feature extraction module may be an encoder; for example, the feature extractor may consist of two convolutional layers.
In this optional implementation, the neighbor-aligned feature characterizes a neighboring-frame feature that has been aligned with the target-frame feature, and the neighbor-aligned feature and the target-frame feature may be concatenated together by a concat operation.
Specifically, as in fig. 2, the sliding-window alignment subnetwork 1 includes a local fusion module LFM, which fuses into the target-frame feature (for example the middle frame of the selected consecutive video frames) the information of its neighboring frames (the video frames on both sides of the middle frame), and then passes the fused features together with the features of the selected consecutive video frames to the next-stage recurrent alignment subnetwork 2.
In this embodiment, the input of the sliding-window alignment subnetwork is N consecutive video frames; the feature extraction module extracts the features of the N consecutive video frames, which are fed to the local fusion module LFM in turn. For example, when the i-th feature (i > 1) of the N consecutive video frames is input, the (i+1)-th and (i-1)-th features are input at the same time, and the fused i-th feature is obtained after alignment and fusion in the local fusion module LFM.
In this embodiment, the local fusion module LFM first aligns the neighboring-frame features with the target-frame feature and then fuses the aligned neighboring features with it through a concat operation followed by several residual blocks. Feature alignment here may employ flow-guided deformable convolution, denoted $\mathcal{A}(\cdot)$. The fused feature may be represented by the following formula (1):

$$\hat{g}_i = \mathrm{ResBlocks}\big(\mathrm{concat}\big(\mathcal{A}(g_{i-1}, g_i),\; g_i,\; \mathcal{A}(g_{i+1}, g_i)\big)\big) \tag{1}$$

In equation (1), the target-frame feature $g_i$ is the feature of the middle frame obtained by the feature extraction module from the selected consecutive video frames. The concat operation combines the aligned neighboring-frame features with the target-frame feature, and feature fusion is then performed through the residual blocks ResBlocks (a series of modules formed by convolutions) to obtain the fused feature $\hat{g}_i$.
In this optional implementation, the local feature fusion of the local fusion module is performed before the feature propagation across video frames, so cross-frame feature fusion during propagation can be enhanced and the fusion effect of the sliding-window alignment subnetwork on video-frame features is improved.
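A rough sketch of the local fusion step of equation (1) is given below. The alignment operator is stubbed out with a plain convolution because flow-guided deformable convolution needs a dedicated op; module names, channel sizes and the number of residual blocks are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class LocalFusionModule(nn.Module):
    """Fuse a target-frame feature with its two aligned neighbour features."""
    def __init__(self, ch=64, num_blocks=3):
        super().__init__()
        # stand-in for flow-guided deformable alignment of a neighbour to the target
        self.align = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
        self.res_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(num_blocks)])

    def forward(self, prev_feat, target_feat, next_feat):
        prev_aligned = self.align(torch.cat([prev_feat, target_feat], dim=1))
        next_aligned = self.align(torch.cat([next_feat, target_feat], dim=1))
        x = self.fuse(torch.cat([prev_aligned, target_feat, next_aligned], dim=1))
        return self.res_blocks(x)  # fused feature for the target frame

feats = [torch.rand(1, 64, 32, 32) for _ in range(3)]
fused = LocalFusionModule()(*feats)
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```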
In this embodiment, the recurrent alignment subnetwork and the refined alignment subnetwork may both adopt a bidirectional recurrent neural network, which can make full use of the temporal information of the video sequence and thereby improve the super-resolution effect.
As shown in fig. 2, the recurrent alignment subnetwork 2 may be a conventional bidirectional recurrent neural network and comprises a feature bidirectional-propagation module and a reconstruction module. The feature bidirectional-propagation module propagates bidirectionally and aligns the features output by the sliding-window alignment subnetwork 1 (which may include the features of the selected video and the fused features output by the local fusion module) to obtain reconstruction features and alignment parameters. The reconstruction module reconstructs the reconstruction features output by the feature bidirectional-propagation module to obtain a reconstructed picture. In fig. 2, the reconstructed picture generated by the reconstruction module of the recurrent alignment subnetwork 2 can be used to calculate the branch loss of the recurrent alignment subnetwork 2.
In this embodiment, the alignment parameters are the parameters used by the bidirectional recurrent neural network during inter-frame alignment such as optical flow and deformable convolution, for example the offsets between two video frames output by a residual block during alignment. Compared with a conventional bidirectional recurrent neural network, the refined alignment subnetwork can make full use of the alignment parameters of the earlier alignment operations in the recurrent alignment subnetwork and realign on the basis of those parameters, thus optimizing the alignment parameters of the recurrent alignment subnetwork and obtaining a better alignment result.
A bidirectional recurrent neural network generally includes a residual network (ResNet) with multiple residual blocks, and each residual block can use a different principle to fuse the features: for example, fusing the offsets generated by the alignment module (the offsets between the reference frame and the neighboring frames) to obtain fused offsets, or fusing the offsets and the weight values (corresponding to the different offsets) generated by the alignment module to obtain fused offsets and fused weight values.
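A stripped-down sketch of the bidirectional propagation used by the recurrent alignment subnetwork follows. The propagation cell is reduced to a single convolution and no alignment is performed, so this only illustrates the forward and backward scans; the class and parameter names are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class BidirectionalPropagation(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.forward_cell = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.backward_cell = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, feats):           # feats: list of (B, C, H, W), one per frame
        b, c, h, w = feats[0].shape
        # backward pass: propagate information from future frames
        hidden = feats[0].new_zeros(b, c, h, w)
        backward = []
        for f in reversed(feats):
            hidden = self.backward_cell(torch.cat([f, hidden], dim=1))
            backward.insert(0, hidden)
        # forward pass: propagate information from past frames
        hidden = feats[0].new_zeros(b, c, h, w)
        out = []
        for f, bwd in zip(feats, backward):
            hidden = self.forward_cell(torch.cat([f, hidden], dim=1))
            out.append(hidden + bwd)    # each frame sees both directions
        return out

feats = [torch.rand(1, 64, 32, 32) for _ in range(5)]
print(len(BidirectionalPropagation()(feats)))  # 5
```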
Unlike image super-resolution, video super-resolution generally needs to align the neighboring frames with the current frame so as to better integrate the information of the neighboring frames. In some large-motion video super-resolution tasks the effect of alignment is particularly obvious. When a bidirectional recurrent network is used, the same alignment operation often occurs multiple times. In order to make full use of the result of the previous alignment operation, the refined alignment subnetwork adopts a refined alignment module that reuses the parameters of the previous alignment and obtains a better alignment result.
In some optional implementations of this embodiment, as shown in fig. 2, the refined alignment subnetwork 3 may comprise a feature bidirectional-propagation module, a refined alignment module and a reconstruction module. The refined alignment module pre-aligns the reconstruction features based on the alignment parameters to obtain pre-aligned features; generates residuals of the alignment parameters based on the pre-aligned features; and generates new parameters based on the alignment parameters and their residuals. The feature bidirectional-propagation module propagates bidirectionally and aligns the reconstruction features based on the new parameters to generate reconstruction features; and the reconstruction module reconstructs the reconstruction features to obtain the super-resolved video frames corresponding to them.
In this optional implementation, the refined alignment module performs pre-alignment using the offsets and weight values generated by the recurrent alignment subnetwork 2, and the pre-aligned feature is represented by the following formula (2):

$$\bar{f}_{i+1} = \mathcal{D}\big(f_{i+1};\, o_{i+1 \to i},\, m_{i+1 \to i}\big) \tag{2}$$

where $f_{i+1}$ is the (i+1)-th input feature of the refined alignment subnetwork (the (i+1)-th reconstruction feature output by the recurrent alignment subnetwork), $o_{i+1 \to i}$ and $m_{i+1 \to i}$ are the offset and weight value used when aligning the (i+1)-th feature to the i-th feature in the recurrent alignment subnetwork, and $\mathcal{D}$ is a deformable convolution. The pre-aligned feature $\bar{f}_{i+1}$ and the i-th feature are then used to generate residuals of the offset and weight value, $\Delta o_{i+1 \to i}$ and $\Delta m_{i+1 \to i}$, as shown in formula (3):

$$\Delta o_{i+1 \to i},\, \Delta m_{i+1 \to i} = \mathcal{C}\big(\mathrm{concat}(\bar{f}_{i+1},\, f_i)\big) \tag{3}$$

where $\mathcal{C}$ denotes the convolution layers that predict the residuals. Finally, the two pairs of offsets and weight values are used for the final alignment to obtain the reconstruction feature $\hat{f}_{i+1}$, as shown in formula (4):

$$\hat{f}_{i+1} = \mathcal{D}\big(f_{i+1};\, o_{i+1 \to i} + \Delta o_{i+1 \to i},\, m_{i+1 \to i} + \Delta m_{i+1 \to i}\big) \tag{4}$$
in this optional implementation, the feature bidirectional propagation module is configured to perform bidirectional propagation and alignment on the reconstructed features using a bidirectional recurrent neural network to generate reconstructed features.
In this optional implementation manner, the reconstruction module is configured to restore the video frame characteristics to obtain a restored video frame.
Optionally, the refinement alignment subnet may further include a pixel selection module (pixel buffer), and the pixel selection module may select a pixel in the video frame output by the reconstruction module, so as to obtain a better over-divided video frame.
The refinement alignment subnet provided by the optional implementation mode utilizes the alignment parameters of the circularly aligned subnet and obtains a better alignment result. The method comprises the steps that a thinning and aligning module firstly uses offset and weight values generated when feature alignment is carried out on a circular alignment sub-network to carry out pre-alignment on input features of the thinning and aligning sub-network, then residual errors of the offset and the weight values are generated by using the pre-alignment features, and finally two pairs of the offset and the weight values are used for carrying out final feature alignment, so that the alignment effect of a video hyper-resolution network is improved, and the processing effect of a video hyper-resolution model on a video sequence is ensured.
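To make the structure of equations (2)-(4) concrete, a minimal sketch is given below. The deformable convolution is replaced by a stand-in convolution so the sketch stays self-contained, and all module names, channel sizes and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RefinedAlignModule(nn.Module):
    """Illustrates equations (2)-(4): pre-align with the offsets and masks saved
    by the recurrent alignment stage, predict residual offsets/masks, then align
    again with the refined parameters."""
    def __init__(self, ch=64):
        super().__init__()
        self.dcn_stub = nn.Conv2d(3 * ch, ch, 3, padding=1)   # stand-in for DCN(feat, offset, mask)
        self.residual_head = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)

    def align(self, feat, offset, mask):
        return self.dcn_stub(torch.cat([feat, offset, mask], dim=1))

    def forward(self, feat_next, feat_cur, offset, mask):
        pre_aligned = self.align(feat_next, offset, mask)                  # eq. (2)
        res = self.residual_head(torch.cat([pre_aligned, feat_cur], dim=1))
        d_offset, d_mask = res.chunk(2, dim=1)                             # eq. (3)
        return self.align(feat_next, offset + d_offset, mask + d_mask)     # eq. (4)

f_next, f_cur = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
offset, mask = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
print(RefinedAlignModule()(f_next, f_cur, offset, mask).shape)
```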
Step 103, inputting consecutive video frames selected from the video set into the video super-resolution network.
The selected consecutive video frames form a video sequence containing a plurality of consecutive video frames.
In this embodiment, the execution subject may select a plurality of consecutive video frames from the video set obtained in step 101 and perform the training steps 103 to 105 to complete one training iteration of the video super-resolution network. The selection manner and the number of video frames selected from the video set are not limited in this application, and neither is the number of training iterations of the video super-resolution network. For example, in one training iteration, a plurality of consecutive video frames may be selected at random, the selected frames are annotated with ground-truth values, the loss value of the video super-resolution network is calculated according to the ground truth of the selected frames, and the parameters of the video super-resolution network are adjusted.
In this embodiment, the selected consecutive video frames are input into the video super-resolution network together, in the temporal order of the frames; the video super-resolution network performs super-resolution processing on them to obtain a super-resolution result consisting of a plurality of high-resolution video frames, whose resolution is higher than that of the selected input frames.
Step 104, calculating the loss value of the video super-resolution network.
In this embodiment, in each training iteration of the video super-resolution network, a plurality of consecutive video frames are selected from the video set and input into the video super-resolution network, and the loss value of the video super-resolution network is calculated based on a loss function preset for it.
In this embodiment, the loss function of the video super-resolution network may be a mean-squared-error function, i.e. the expectation of the squared difference between the predicted (estimated) value of the video super-resolution network and the true value. During the iterative training of the video super-resolution network, the loss function may be minimized with a gradient-descent algorithm so as to iteratively optimize the network parameters.
The gradient is a vector: the directional derivative of the loss function at a point takes its maximum value along the gradient direction, i.e. the loss function changes most rapidly, at the greatest rate, along that direction at that point. In deep learning, the main task of a neural network is to find the optimal network parameters (weights and biases) during learning, that is, the parameters at which the loss function is minimal.
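As a sketch, a single gradient-descent update that minimizes a mean-squared-error loss might look like the following; the stand-in model, optimizer choice and data shapes are illustrative assumptions, since the disclosure does not prescribe a specific framework or optimizer.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)        # stand-in for the video SR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_frames = torch.rand(2, 3, 32, 32)                # selected low-resolution frames
hr_frames = torch.rand(2, 3, 32, 32)                # corresponding ground-truth frames

prediction = model(lr_frames)
loss = F.mse_loss(prediction, hr_frames)             # mean-squared-error loss
optimizer.zero_grad()
loss.backward()                                       # back-propagate the gradient
optimizer.step()                                      # descend along the gradient
print(float(loss))
```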
In order to optimize each subnetwork in the video super-resolution network, a branch loss function may be set separately for the recurrent alignment subnetwork, the branch loss of the recurrent alignment subnetwork calculated, and the loss value of the video super-resolution network taken as the overall loss. In some optional implementations of this embodiment, calculating the loss value of the video super-resolution network comprises: calculating the overall loss of the video super-resolution network; calculating the branch loss of the recurrent alignment subnetwork based on a preset branch loss function; and obtaining the loss value of the video super-resolution network based on the overall loss and the branch loss of the recurrent alignment subnetwork.
In this optional implementation, in each training iteration of the video super-resolution network, the overall loss is calculated based on the loss function preset for the video super-resolution network, and the branch loss of the recurrent alignment subnetwork is calculated based on the branch loss function. Obtaining the loss value of the video super-resolution network based on the overall loss and the branch loss of the recurrent alignment subnetwork comprises: adding the overall loss and the branch loss of the recurrent alignment subnetwork to obtain the loss value of the video super-resolution network.
In this embodiment, both the loss function preset for the video super-resolution network and the branch loss function may be mean-squared-error functions.
Optionally, in each training iteration of the video super-resolution network, obtaining the loss value of the video super-resolution network based on the overall loss and the branch loss of the recurrent alignment subnetwork comprises: setting weight values for the video super-resolution network and the recurrent alignment subnetwork respectively; multiplying the overall loss by the weight value of the video super-resolution network to obtain a first product; multiplying the branch loss of the recurrent alignment subnetwork by the weight value of the recurrent alignment subnetwork to obtain a second product; and adding the first product and the second product to obtain the loss value of the video super-resolution network.
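The weighted combination described above amounts to a couple of lines of code; the particular weight values below are illustrative assumptions, not values given in the disclosure.

```python
def total_loss(overall_loss, branch_loss, w_overall=1.0, w_branch=0.5):
    """Loss value of the video super-resolution network as a weighted sum of the
    overall loss and the branch loss of the recurrent alignment subnetwork."""
    return w_overall * overall_loss + w_branch * branch_loss

print(total_loss(0.12, 0.30))  # 0.27
```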
In this optional implementation, by setting a loss function for the recurrent alignment subnetwork, its branch loss can be calculated separately: the features are propagated bidirectionally in the recurrent alignment subnetwork, and a branch loss is added to the reconstruction result of the features generated by the recurrent alignment subnetwork, so that the features of the recurrent alignment subnetwork are closer to the real high-resolution feature space and the reliability of its super-resolution processing is improved.
In some optional implementations of this embodiment, calculating the branch loss of the recurrent alignment subnetwork based on the preset branch loss function comprises: acquiring the reconstructed picture of the recurrent alignment subnetwork, the reconstructed picture being obtained by reconstructing the reconstruction features; and calculating the branch loss of the recurrent alignment subnetwork based on the reconstructed picture and the branch loss function.
In this embodiment, as shown in fig. 2, the two reconstruction modules in the recurrent alignment subnetwork and the refined alignment subnetwork have the same function and can both produce a reconstructed picture; the reconstruction module follows the conventional reconstruction principle, which is not described again here. Let $h_i$ denote the feature in the reconstruction module of the recurrent alignment subnetwork; a reconstructed picture can be obtained after upsampling and convolution of this feature, and the branch loss function is calculated from it. Specifically, the branch loss AuxLoss is calculated by formula (5):

$$\mathrm{AuxLoss} = \sqrt{\big\|\hat{y}_i - y_i\big\|^2 + \varepsilon^2} \tag{5}$$

The function characterized by equation (5) is the branch loss function set for the recurrent alignment subnetwork, where $y_i$ is the ground-truth value of the video frame, $\hat{y}_i$ is the reconstructed picture (obtained after upsampling and convolution of the reconstruction-module feature $h_i$), and $\varepsilon$ is an error value, which may be a fixed value preset for the recurrent alignment subnetwork.
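A minimal sketch of the branch loss of equation (5) follows; averaging the per-pixel values over the frame and the particular value of epsilon are assumptions of this sketch, since the reduction is not specified in the text.

```python
import torch

def branch_loss(reconstructed, ground_truth, eps=1e-3):
    """Charbonnier-style loss between the reconstructed picture of the recurrent
    alignment subnetwork and the ground-truth high-resolution frame."""
    return torch.sqrt((reconstructed - ground_truth) ** 2 + eps ** 2).mean()

y_hat = torch.rand(1, 3, 128, 128)
y = torch.rand(1, 3, 128, 128)
print(float(branch_loss(y_hat, y)))
```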
In this optional implementation, the branch loss of the recurrent alignment subnetwork is calculated from the reconstructed picture it outputs, which ensures the reliability of the loss-value calculation for the recurrent alignment subnetwork and provides a reliable basis for its stable training.
Step 105, obtaining a video super-resolution model based on the loss value of the video super-resolution network.
In this embodiment, the video super-resolution model is the trained video super-resolution network obtained after the parameters of the video super-resolution network have been adjusted through multiple training iterations. Whether the video super-resolution network meets the training-completion condition can be detected from its loss value, and the video super-resolution model is obtained once the network meets that condition.
In this embodiment, the training-completion condition comprises at least one of the following: the number of training iterations of the video super-resolution network reaches a predetermined iteration threshold, or the loss value of the video super-resolution network is smaller than a predetermined loss-value threshold. The predetermined iteration threshold is an empirical value based on the loss value of the video super-resolution network; for example, the predetermined iteration threshold for the video super-resolution network is 50,000 iterations, and the predetermined loss-value threshold is 0.01.
Optionally, in this embodiment, in response to the video super-resolution network not satisfying the training-completion condition, the relevant parameters in the video super-resolution network are adjusted so that its loss value converges, and the training steps 103 to 105 continue to be performed on the adjusted video super-resolution network.
In this optional implementation, when the video super-resolution network does not meet the training-completion condition, adjusting its relevant parameters helps the loss value of the video super-resolution network converge.
In this embodiment, if training is not completed, the loss value of the video super-resolution network can be made to converge by adjusting its parameters. Specifically, adjusting the relevant parameters in the video super-resolution network so that its loss value converges comprises: repeatedly adjusting, by performing steps 103 to 105, the parameters or loss weight values of any one of the sliding-window alignment subnetwork, the recurrent alignment subnetwork and the refined alignment subnetwork, so that the loss value of the video super-resolution network converges.
Optionally, in each iteration, the parameters of two or more of the sliding-window alignment subnetwork, the recurrent alignment subnetwork and the refined alignment subnetwork may also be adjusted at the same time, to ensure that the loss value of the video super-resolution network decreases gradually until it stabilizes.
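The iterative training and stopping logic of steps 103 to 105 can be sketched as below; the thresholds and helper names are illustrative assumptions consistent with the example values above, not a definitive implementation.

```python
MAX_ITERATIONS = 50_000       # predetermined iteration threshold
LOSS_THRESHOLD = 0.01         # predetermined loss-value threshold

def train(network, video_set, sample_fn, loss_fn, optimizer):
    for iteration in range(1, MAX_ITERATIONS + 1):
        lr_frames, hr_frames = sample_fn(video_set)       # select consecutive frames
        sr_frames = network(lr_frames)                     # multi-stage super-resolution
        loss = loss_fn(sr_frames, hr_frames)               # overall + branch loss
        optimizer.zero_grad()
        loss.backward()                                    # adjust network parameters
        optimizer.step()
        if float(loss) < LOSS_THRESHOLD:                   # training-completion condition
            break
    return network                                         # trained video SR model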
In this embodiment, obtaining a video super-resolution model based on the loss value of the video super-resolution network comprises: in response to the video super-resolution network satisfying the training-completion condition, taking the video super-resolution network that satisfies the condition as the video super-resolution model; and taking the output of the refined alignment subnetwork as the output of the video super-resolution model.
In some optional implementations of this embodiment, obtaining a video super-resolution model based on the loss value of the video super-resolution network comprises: in response to the video super-resolution network satisfying the training-completion condition, taking the video super-resolution network as the video super-resolution model; and adding the fused features of the sliding-window alignment subnetwork and the super-resolved video frames output by the refined alignment subnetwork to obtain the output of the video super-resolution model.
In this optional implementation, the fused features of the sliding-window alignment subnetwork are added to the super-resolved video output by the refined alignment subnetwork and the sum is taken as the output of the video super-resolution model, which enriches the output of the video super-resolution model and improves its output effect.
The present disclosure proposes a multi-stage video super-resolution network that combines the ideas of the sliding-window method and the recurrent-network method and uses a multi-stage strategy for video super-resolution. Specifically, feature extraction and local feature fusion are first performed on the input video frames in the sliding-window alignment subnetwork; the fused features are then propagated in the recurrent alignment subnetwork, where a branch loss is introduced to strengthen feature alignment during propagation; finally, a refined alignment module is introduced in the refined alignment subnetwork to reuse the alignment parameters generated by the recurrent alignment subnetwork for refined feature alignment and strengthened propagation, thereby obtaining the super-resolved video frames.
The video super-resolution model obtained by the embodiment of the present disclosure uses only 1.45M parameters, and its PSNR (Peak Signal-to-Noise Ratio) reaches 28.13 on the Vid4 data set (a common data set in the video super-resolution field). Among current lightweight video super-resolution methods, this video super-resolution model achieves the highest PSNR and SSIM (structural similarity index measure) values on four standard video super-resolution test data sets (the REDS4, UDM10, Vimeo-90K-T and Vid4 data sets) with the smallest number of parameters.
The video super-resolution model training method provided by the embodiment of the present disclosure comprises: first, acquiring a video set comprising two or more video frames; second, acquiring a pre-established video super-resolution network, wherein the video super-resolution network comprises a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; third, inputting consecutive video frames selected from the video set into the video super-resolution network; then, calculating the loss value of the video super-resolution network; and finally, obtaining a video super-resolution model based on the loss value of the video super-resolution network. By training a video super-resolution network that combines a sliding-window network and a bidirectional recurrent network, the obtained video super-resolution model has the advantages of both networks; and because the video super-resolution network performs multi-stage super-resolution and refinement on the input video frames, the detail preservation of the video frame sequence is ensured and the video super-resolution effect of the model is improved.
Further, based on the video super-resolution model training method provided by the above embodiment, the present disclosure also provides an embodiment of a video super-resolution processing method; the video super-resolution processing method of the present disclosure combines artificial-intelligence fields such as computer vision and deep learning.
Referring to fig. 3, a flow 300 of an embodiment of a video super-resolution processing method according to the present disclosure is shown. The video super-resolution processing method provided by this embodiment includes the following steps:
Step 301, obtaining a plurality of video frames to be processed.
In this embodiment, the plurality of video frames to be processed may be a plurality of consecutive video frames, and each frame to be processed may contain information about people, objects, scenes and so on; the frames to be processed are processed by the video super-resolution model to obtain the video super-resolution processing result. The execution subject of the video super-resolution processing method can acquire the video frames to be processed in various ways. For example, the execution subject may obtain the video frames to be processed stored in a database server through a wired or wireless connection. As another example, the execution subject may also receive, in real time, video frames to be processed that are acquired in real time by a terminal or another device.
Step 302, inputting the video frames to be processed into the video super-resolution model, and outputting the video super-resolution processing result of the video frames to be processed.
In this embodiment, the execution subject may input the video frames to be processed obtained in step 301 into the video super-resolution model to obtain the video super-resolution processing result of those frames. Note that the video super-resolution processing result consists of the video frames obtained by performing high-resolution super-resolution processing on the acquired video to be processed; owing to the structure of the video super-resolution model, the video frames in the result have a higher resolution than the acquired frames to be processed.
In this embodiment, the video super-resolution model may be obtained by training with the method described in the embodiment of fig. 1; for the specific training process, reference may be made to the related description of the embodiment of fig. 1, which is not repeated here.
In some optional implementations of this embodiment, the method further comprises: upsampling the video frames to be processed to obtain sampled video frames; and adding the video super-resolution processing result to the sampled video frames to obtain the processed video frames.
As shown in fig. 2, a video frame S to be processed is input both to the video super-resolution model (formed by the sliding-window alignment subnetwork 1, the recurrent alignment subnetwork 2 and the refined alignment subnetwork 3) and to the upsampler 4; upsampling is performed in the upsampler 4 to obtain the sampled video frame, the video super-resolution model produces the video super-resolution processing result, and the sampled video frame and the result output by the video super-resolution model are added to obtain the processed video frame S'.
In this optional implementation, adding the video super-resolution processing result to the sampled video frame makes the overall video effect of the frame to be processed better and improves the reliability of the video super-resolution processing result.
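A sketch of this inference path, assuming bicubic upsampling and a x4 scale factor, is given below; the disclosure does not fix the sampler, the scale factor, or the stand-in model used here.

```python
import torch
import torch.nn.functional as F

def super_resolve(model, lr_frames, scale=4):
    """Add the model's super-resolution output to an upsampled copy of the input."""
    upsampled = F.interpolate(lr_frames, scale_factor=scale,
                              mode='bicubic', align_corners=False)
    return model(lr_frames) + upsampled      # processed video frames S'

# toy stand-in model that outputs frames at 4x resolution
model = torch.nn.Sequential(torch.nn.Conv2d(3, 3 * 16, 3, padding=1),
                            torch.nn.PixelShuffle(4))
frames = torch.rand(2, 3, 32, 32)
print(super_resolve(model, frames).shape)    # torch.Size([2, 3, 128, 128])
```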
In the video super-resolution processing method provided by the embodiment of the present disclosure, a plurality of video frames to be processed are acquired and input into the video super-resolution model generated by the video super-resolution model training method of the above embodiment, to obtain a video super-resolution processing result of the video to be processed. Because the video super-resolution model contains a video super-resolution network with several super-resolution stages, reliable video super-resolution processing can be performed on the video to be processed and the effectiveness of the super-resolution processing is guaranteed.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a video super-resolution model training apparatus, which corresponds to the method embodiment shown in fig. 1 and is specifically applicable to various electronic devices.
As shown in fig. 4, the video super-resolution model training apparatus 400 provided in this embodiment includes: a video acquisition unit 401, a network acquisition unit 402, a selection unit 403, a calculation unit 404 and an obtaining unit 405. The video acquisition unit 401 may be configured to acquire a video set comprising two or more video frames. The network acquisition unit 402 may be configured to acquire a pre-established video super-resolution network, wherein the video super-resolution network comprises a sliding-window alignment subnetwork, a recurrent alignment subnetwork and a refined alignment subnetwork, the sliding-window alignment subnetwork is used for performing feature extraction and fusion on a plurality of consecutive video frames, the recurrent alignment subnetwork is used for performing bidirectional propagation on the features output by the sliding-window alignment subnetwork to generate reconstruction features and alignment parameters, and the refined alignment subnetwork performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames. The selection unit 403 may be configured to input consecutive video frames selected from the video set into the video super-resolution network. The calculation unit 404 may be configured to calculate the loss value of the video super-resolution network. The obtaining unit 405 may be configured to obtain a video super-resolution model based on the loss value of the video super-resolution network.
In the video super-resolution model training apparatus 400 of this embodiment, for the specific processing of the video acquisition unit 401, the network acquisition unit 402, the selection unit 403, the calculation unit 404 and the obtaining unit 405 and its technical effects, reference may be made to the related descriptions of steps 101 to 105 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the calculation unit 404 includes: an overall calculation module (not shown), a local calculation module (not shown) and an obtaining module (not shown). The overall calculation module may be configured to calculate the overall loss of the video super-resolution network. The local calculation module may be configured to calculate the branch loss of the recurrent alignment subnetwork based on a preset branch loss function. The obtaining module may be configured to obtain the loss value of the video super-resolution network based on the overall loss and the branch loss of the recurrent alignment subnetwork.
In some optional implementations of this embodiment, the local calculation module includes: an acquisition submodule (not shown) and a calculation submodule (not shown). The acquisition submodule may be configured to acquire the reconstructed picture of the recurrent alignment subnetwork, the reconstructed picture being obtained by reconstructing the reconstruction features. The calculation submodule may be configured to calculate the branch loss of the recurrent alignment subnetwork based on the reconstructed picture and the branch loss function.
In some optional implementations of this embodiment, the sliding window aligning the subnet includes: a feature extraction module and a local fusion module. The feature extraction module is used for extracting features of the selected continuous video frames; the local fusion module is used for aligning a target frame feature in the features of the selected continuous video frames and an adjacent frame feature adjacent to the target frame feature to obtain at least one adjacent alignment feature; the local fusion module is further used for connecting at least one adjacent alignment feature and the target frame feature in sequence and obtaining fused features through a residual block.
In some optional implementations of this embodiment, the refined alignment sub-network includes: a feature bidirectional propagation module, a refined alignment module, and a reconstruction module. The refined alignment module is used for pre-aligning the reconstruction features based on the alignment parameters to obtain pre-alignment features, generating a residual of the alignment parameters based on the pre-alignment features, and generating new alignment parameters based on the alignment parameters and the residual of the alignment parameters. The feature bidirectional propagation module performs bidirectional propagation and alignment on the reconstruction features based on the new alignment parameters to generate refined reconstruction features. The reconstruction module is used for reconstructing the refined reconstruction features to obtain the super-resolved video frames corresponding to the reconstruction features.
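Read as pseudocode, the refined alignment sub-network might operate as sketched below, treating the alignment parameters as optical-flow-like offsets that are corrected by a predicted residual before realignment and propagation. The flow-warping formulation, the single-convolution stand-ins for propagation and reconstruction, and every name in the sketch are assumptions made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat, flow):
    # Warp a feature map by a flow field; treating the alignment parameters as optical
    # flow is an assumption of this sketch.
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow        # (B, 2, H, W)
    grid = torch.stack((2.0 * grid[:, 0] / max(w - 1, 1) - 1.0,            # normalise to [-1, 1]
                        2.0 * grid[:, 1] / max(h - 1, 1) - 1.0), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class RefinedAlignSubnet(nn.Module):
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.flow_residual = nn.Conv2d(channels, 2, 3, padding=1)          # residual of the alignment parameters
        self.propagate = nn.Conv2d(channels * 2, channels, 3, padding=1)   # stand-in for bidirectional propagation
        self.reconstruct = nn.Sequential(nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
                                         nn.PixelShuffle(scale))

    def forward(self, recon_feat, flow):
        pre_aligned = flow_warp(recon_feat, flow)             # pre-alignment features
        new_flow = flow + self.flow_residual(pre_aligned)     # new parameters = parameters + residual
        re_aligned = flow_warp(recon_feat, new_flow)          # realignment with the refined parameters
        refined = self.propagate(torch.cat([recon_feat, re_aligned], dim=1))
        return self.reconstruct(refined)                      # super-resolved video frame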
In some optional implementations of this embodiment, the obtaining unit 405 includes: a shaping module (not shown) and an adding module (not shown). The shaping module may be configured to, in response to the video super-resolution network satisfying a training completion condition, take the video super-resolution network as the video super-resolution model. The adding module may be configured to add the fused features of the sliding window alignment sub-network and the super-resolved video frames output by the refined alignment sub-network to obtain the output of the video super-resolution model.
In the video super-resolution model training apparatus provided by the embodiment of the present disclosure, the video acquisition unit 401 first acquires a video set including two or more video frames; next, the network acquisition unit 402 acquires a pre-established video super-resolution network, where the video super-resolution network includes: a sliding window alignment sub-network, a cyclic alignment sub-network, and a refined alignment sub-network, the sliding window alignment sub-network being used for performing feature extraction and fusion on a plurality of consecutive video frames, the cyclic alignment sub-network being used for performing bidirectional propagation on the features output by the sliding window alignment sub-network to generate reconstruction features and alignment parameters, and the refined alignment sub-network performing realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames; then the selection unit 403 inputs consecutive video frames selected from the video set into the video super-resolution network; the calculation unit 404 calculates a loss value of the video super-resolution network; and finally the obtaining unit 405 obtains a video super-resolution model based on the loss value of the video super-resolution network. In this way, a video super-resolution network that combines a sliding window network with a bidirectional recurrent network is trained, so that the resulting video super-resolution model has the advantages of both; and the video super-resolution network performs multi-stage super-resolution and refinement on the input video frames, which preserves the details of the video frame sequence and improves the super-resolution effect of the model.
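For concreteness, the end-to-end flow implemented by the five units could look like the minimal training-loop sketch below. The optimiser choice, the number of iterations, and the dataset interface sample_consecutive_clip are hypothetical; only the sequence of steps mirrors the description above.

import torch
import torch.nn.functional as F

def train_video_sr_model(model, video_set, num_iters=10000, lr=1e-4, device="cuda"):
    # model: a pre-established network combining the sliding window alignment,
    # cyclic alignment, and refined alignment sub-networks (supplied by the caller).
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(num_iters):
        # Selection: consecutive low-resolution frames and their high-resolution ground truth
        # (sample_consecutive_clip is an assumed dataset interface).
        lr_frames, hr_frames = video_set.sample_consecutive_clip()
        lr_frames, hr_frames = lr_frames.to(device), hr_frames.to(device)

        # Forward pass: super-resolved frames plus the cyclic branch's reconstructed image.
        sr_frames, cyclic_recon = model(lr_frames)

        # Loss calculation: overall loss plus the branch loss of the cyclic alignment sub-network.
        loss = F.l1_loss(sr_frames, hr_frames) + 0.5 * F.l1_loss(cyclic_recon, hr_frames)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Obtaining: once a training completion condition is met, the network is taken as the model.
    return model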
With further reference to fig. 5, as an implementation of the method shown in the figures above, the present disclosure provides an embodiment of a video super-resolution processing apparatus. The apparatus embodiment corresponds to the method embodiment shown in fig. 3, and the apparatus is applicable to various electronic devices.
As shown in fig. 5, the video super-resolution processing apparatus 500 provided in this embodiment includes: an acquisition unit 501 and an input unit 502. The acquisition unit 501 may be configured to acquire a plurality of video frames to be processed. The input unit 502 may be configured to input the video frames to be processed into the video super-resolution model generated by the apparatus described in the embodiment corresponding to fig. 4, and to output the video super-resolution processing result of the video frames to be processed.
In the video super-resolution processing apparatus 500 of this embodiment, the specific processing of the acquisition unit 501 and the input unit 502, and the technical effects thereof, may refer to the descriptions of step 301 and step 302 in the embodiment corresponding to fig. 3, and are not repeated here.
In some optional implementations of this embodiment, the video super-resolution processing apparatus 500 further includes: a sampling unit (not shown) and an adding unit (not shown). The sampling unit may be configured to upsample the video frames to be processed to obtain sampled video frames. The adding unit may be configured to add the video super-resolution processing result to the sampled video frames to obtain processed video frames.
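One possible reading of the optional sampling and adding units is the residual formulation sketched below: the model predicts detail on top of a plainly upsampled frame, and the two are added to give the processed frames. Bicubic interpolation, the assumed tensor shapes, and the function name super_resolve are illustrative assumptions; the disclosure only states that the frames are upsampled and the results are added.

import torch
import torch.nn.functional as F

def super_resolve(model, lr_frames, scale=4):
    # lr_frames: (B, T, C, H, W) consecutive low-resolution frames to be processed.
    with torch.no_grad():
        sr_result = model(lr_frames)          # video super-resolution processing result
    b, t, c, h, w = lr_frames.shape
    # Sampling unit: plain bicubic upsampling of the frames to be processed (assumed mode).
    upsampled = F.interpolate(lr_frames.reshape(b * t, c, h, w), scale_factor=scale,
                              mode="bicubic", align_corners=False)
    upsampled = upsampled.reshape(b, t, c, h * scale, w * scale)   # sampled video frames
    # Adding unit: the processing result and the sampled frames are added element-wise.
    return sr_result + upsampled              # processed video frames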
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or loaded from a storage unit 608 into a Random Access Memory (RAM) 603. Various programs and data required for the operation of the device 600 can also be stored in the RAM 603. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the video super-resolution model training method or the video super-resolution processing method. For example, in some embodiments, the video super-resolution model training method or the video super-resolution processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the video super-resolution model training method or the video super-resolution processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured, by any other suitable means (e.g., by means of firmware), to perform the video super-resolution model training method or the video super-resolution processing method.
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable video super-resolution model training apparatus or video super-resolution processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A video super-resolution model training method, the method comprising:
acquiring a video set comprising more than two video frames;
acquiring a pre-established video super-resolution network, wherein the video super-resolution network comprises: a sliding window alignment sub-network, a cyclic alignment sub-network, and a refined alignment sub-network, wherein the sliding window alignment sub-network is used for performing feature extraction and fusion on a plurality of consecutive video frames, the cyclic alignment sub-network is used for performing bidirectional propagation on features output by the sliding window alignment sub-network to generate reconstruction features and alignment parameters, and the refined alignment sub-network performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames;
performing the following training steps:
inputting consecutive video frames selected from the video set into the video super-resolution network, and calculating a loss value of the video super-resolution network; obtaining a video super-resolution model based on the loss value of the video super-resolution network;
wherein the sliding window alignment sub-network comprises: a feature extraction module and a local fusion module;
the feature extraction module is used for extracting features of the selected consecutive video frames;
the local fusion module is used for aligning a target frame feature among the features of the selected consecutive video frames with an adjacent frame feature adjacent to the target frame feature, to obtain at least one adjacent alignment feature;
the local fusion module is further used for connecting the at least one adjacent alignment feature and the target frame feature in sequence, and then obtaining fused features through a residual block.
2. The method of claim 1, wherein the calculating the loss value of the video super-resolution network comprises:
calculating an overall loss of the video super-resolution network;
calculating a branch loss of the cyclic alignment sub-network based on a preset branch loss function;
and obtaining the loss value of the video super-resolution network based on the overall loss and the branch loss of the cyclic alignment sub-network.
3. The method of claim 2, wherein the calculating the branch loss of the cyclic alignment sub-network based on the preset branch loss function comprises:
acquiring a reconstructed image of the cyclic alignment sub-network, wherein the reconstructed image is obtained by reconstructing the reconstruction features;
and calculating the branch loss of the cyclic alignment sub-network based on the reconstructed image and the branch loss function.
4. The method of claim 1, wherein the obtaining a video super-resolution model based on the loss value of the video super-resolution network comprises:
in response to the video super-resolution network satisfying a training completion condition, taking the video super-resolution network as the video super-resolution model;
and adding the fused features of the sliding window alignment sub-network and the super-resolved video frames output by the refined alignment sub-network to obtain an output of the video super-resolution model.
5. A video super-resolution processing method, the method comprising:
acquiring a plurality of video frames to be processed;
inputting the video frames to be processed into the video super-resolution model generated by the method according to any one of claims 1-4, and outputting a video super-resolution processing result of the video frames to be processed.
6. The method of claim 5, further comprising:
upsampling the video frames to be processed to obtain sampled video frames;
and adding the video super-resolution processing result to the sampled video frames to obtain processed video frames.
7. A video super-resolution model training apparatus, the apparatus comprising:
a video acquisition unit configured to acquire a video set comprising two or more video frames;
a network acquisition unit configured to acquire a pre-established video super-resolution network, the video super-resolution network comprising: a sliding window alignment sub-network, a cyclic alignment sub-network, and a refined alignment sub-network, wherein the sliding window alignment sub-network is used for performing feature extraction and fusion on a plurality of consecutive video frames, the cyclic alignment sub-network is used for performing bidirectional propagation on features output by the sliding window alignment sub-network to generate reconstruction features and alignment parameters, and the refined alignment sub-network performs realignment and bidirectional propagation on the reconstruction features based on the alignment parameters to obtain super-resolved video frames;
a selection unit configured to input consecutive video frames selected from the video set into the video super-resolution network;
a calculation unit configured to calculate a loss value of the video super-resolution network;
an obtaining unit configured to obtain a video super-resolution model based on the loss value of the video super-resolution network;
wherein the sliding window alignment sub-network comprises: a feature extraction module and a local fusion module;
the feature extraction module is used for extracting features of the selected consecutive video frames;
the local fusion module is used for aligning a target frame feature among the features of the selected consecutive video frames with an adjacent frame feature adjacent to the target frame feature, to obtain at least one adjacent alignment feature;
the local fusion module is further used for connecting the at least one adjacent alignment feature and the target frame feature in sequence, and then obtaining fused features through a residual block.
8. The apparatus of claim 7, wherein the calculation unit comprises:
an overall calculation module configured to calculate an overall loss of the video super-resolution network;
a local calculation module configured to calculate a branch loss of the cyclic alignment sub-network based on a preset branch loss function;
a deriving module configured to derive the loss value of the video super-resolution network based on the overall loss and the branch loss of the cyclic alignment sub-network.
9. The apparatus of claim 8, wherein the local calculation module comprises:
an acquisition sub-module configured to acquire a reconstructed image of the cyclic alignment sub-network, the reconstructed image being obtained by reconstructing the reconstruction features;
and a calculation sub-module configured to calculate the branch loss of the cyclic alignment sub-network based on the reconstructed image and the branch loss function.
10. The apparatus of claim 7, wherein the obtaining unit comprises:
a shaping module configured to take the video super-resolution network as the video super-resolution model in response to the video super-resolution network satisfying a training completion condition;
and an adding module configured to add the fused features of the sliding window alignment sub-network and the super-resolved video frames output by the refined alignment sub-network to obtain an output of the video super-resolution model.
11. A video super-resolution processing apparatus, the apparatus comprising:
an acquisition unit configured to acquire a plurality of video frames to be processed;
an input unit configured to input the video frames to be processed into the video super-resolution model generated by the apparatus according to any one of claims 7-10, and to output a video super-resolution processing result of the video frames to be processed.
12. The apparatus of claim 11, the apparatus further comprising:
a sampling unit configured to upsample the video frames to be processed to obtain sampled video frames;
and an adding unit configured to add the video super-resolution processing result to the sampled video frames to obtain processed video frames.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202211304774.4A 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device Pending CN115631093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304774.4A CN115631093A (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211304774.4A CN115631093A (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device
CN202111411350.3A CN114119371B (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111411350.3A Division CN114119371B (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device

Publications (1)

Publication Number Publication Date
CN115631093A true CN115631093A (en) 2023-01-20

Family

ID=80372822

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111411350.3A Active CN114119371B (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device
CN202211304774.4A Pending CN115631093A (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111411350.3A Active CN114119371B (en) 2021-11-25 2021-11-25 Video super-resolution model training method and device and video super-resolution processing method and device

Country Status (1)

Country Link
CN (2) CN114119371B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519609A (en) * 2024-01-02 2024-02-06 中移(苏州)软件技术有限公司 Video file processing method and device and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013250891A (en) * 2012-06-01 2013-12-12 Univ Of Tokyo Super-resolution method and apparatus
US8675999B1 (en) * 2012-09-28 2014-03-18 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Apparatus, system, and method for multi-patch based super-resolution from an image
CN112218072B (en) * 2020-10-10 2023-04-07 南京大学 Video coding method based on deconstruction compression and fusion
CN112435165B (en) * 2020-11-25 2023-08-04 哈尔滨工业大学(深圳) Two-stage video super-resolution reconstruction method based on generation countermeasure network
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN113205456B (en) * 2021-04-30 2023-09-22 东北大学 Super-resolution reconstruction method for real-time video session service

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519609A (en) * 2024-01-02 2024-02-06 中移(苏州)软件技术有限公司 Video file processing method and device and electronic equipment
CN117519609B (en) * 2024-01-02 2024-04-09 中移(苏州)软件技术有限公司 Video file processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN114119371B (en) 2023-01-10
CN114119371A (en) 2022-03-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination