CN117998090A - Video processing method, device and related products

Video processing method, device and related products

Info

Publication number
CN117998090A
Authority
CN
China
Prior art keywords
video frame
feature
enhanced
sample
video
Prior art date
Legal status
Pending
Application number
CN202410296014.6A
Other languages
Chinese (zh)
Inventor
庞映雪
赵世杰
郭孟曦
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202410296014.6A
Publication of CN117998090A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/147Data rate or code amount at the encoder output according to rate distortion criteria
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96Tree coding, e.g. quad-tree coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of the present disclosure provide a video processing method, an apparatus, and related products. The method includes: acquiring video data and a coding unit mask image corresponding to a first video frame in the video data, where the coding unit mask image is used for representing a coding unit division result of the first video frame; and performing, through a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, to obtain an enhanced first video frame. In this way, the video data can be enhanced in a region-adaptive manner, which avoids introducing noise into the video data, improves the quality of the video data, and improves the viewing experience of the video data.

Description

Video processing method, device and related products
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video processing method, apparatus, and related products.
Background
Feature enhancement processing of video data refers to processing the video data to improve its quality and enhance its viewing experience. In the prior art, video data is generally subjected to globally uniform feature enhancement processing; however, this processing manner may introduce noise into regions with little texture, thereby reducing the quality of the video data.
Disclosure of Invention
The embodiments of the present disclosure provide a video processing method, a video processing apparatus, and related products, which can perform region-adaptive enhancement on video data, avoid introducing noise into the video data, improve the quality of the video data, and improve the viewing experience of the video data.
In a first aspect, an embodiment of the present disclosure provides a video processing method, including:
acquiring video data and a coding unit mask image corresponding to a first video frame in the video data, where the coding unit mask image is used for representing a coding unit division result of the first video frame; and
performing, through a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, to obtain an enhanced first video frame.
In a second aspect, an embodiment of the present disclosure provides a video processing apparatus, including:
a data acquisition module, configured to acquire video data and acquire a coding unit mask image corresponding to a first video frame in the video data, where the coding unit mask image is used for representing a coding unit division result of the first video frame; and
a feature enhancement module, configured to perform, through a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, to obtain an enhanced first video frame.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to implement the steps of the method of the first aspect described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium for storing computer-executable instructions which, when executed by a processor, implement the steps of the method of the first aspect described above.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the first aspect described above.
In one or more embodiments of the present disclosure, video data is first acquired, and a coding unit mask image corresponding to a first video frame in the video data is acquired, where the coding unit mask image is used to represent a coding unit division result of the first video frame. Then, through a video processing model, region-adaptive feature enhancement processing is performed on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, so as to obtain an enhanced first video frame. Therefore, according to this embodiment, the coding unit mask image corresponding to the first video frame can be used to perform region-adaptive feature enhancement processing on the first video frame to obtain the enhanced first video frame, and matched feature enhancement can be achieved for different regions of the video data in a region-adaptive manner, thereby avoiding introducing noise into the video data, improving the quality of the video data, and improving the viewing experience of the video data.
Drawings
For a clearer description of one or more embodiments of the present disclosure or of the solutions of the prior art, the drawings that are needed in the description of the embodiments or of the prior art will be briefly described, it being obvious that the drawings in the description that follows are only some of the embodiments described in the present disclosure, and that other drawings may be obtained from these drawings by those skilled in the art without inventive effort;
Fig. 1 is a flow chart of a video processing method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a video processing model according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating a processing principle of a first feature enhancement unit according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a processing principle of a first mask attention unit according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram illustrating a processing principle of a second feature enhancement unit according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a processing principle of a second mask attention unit according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a training process of a video processing model according to an embodiment of the disclosure;
FIG. 8 is a schematic diagram of setting blur intensities according to the size of a coding unit according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training principle of a video processing model provided in an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the disclosure;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that those skilled in the art will better understand the technical solutions in one or more embodiments of the present disclosure, a detailed description will be made below, with reference to the accompanying drawings in one or more embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. All other embodiments, which may be made by one of ordinary skill in the art based on one or more embodiments of the present disclosure without undue burden, are intended to be within the scope of the present disclosure.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation the user requests to perform will require acquiring and using the user's personal information. Thus, according to the prompt information, the user can autonomously choose whether to provide personal information to software or hardware, such as an electronic device, an application program, a server, or a storage medium, that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, by way of a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control for the user to choose to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
The embodiments of the present disclosure provide a video processing method, which can perform region-adaptive enhancement on video data, avoid introducing noise into the video data, improve the quality of the video data, and improve the viewing experience of the video data. The video processing method may be performed by a server that performs video processing.
Fig. 1 is a flow chart of a video processing method according to an embodiment of the disclosure, as shown in fig. 1, the method includes:
Step S102: obtaining video data, and obtaining a coding unit mask image corresponding to a first video frame in the video data, where the coding unit mask image is used for representing a coding unit division result of the first video frame;
Step S104: performing, through a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, to obtain an enhanced first video frame.
In this embodiment, video data is first obtained, and a coding unit mask image corresponding to a first video frame in the video data is obtained, where the coding unit mask image is used to represent a coding unit division result of the first video frame. Then, through a video processing model, region-adaptive feature enhancement processing is performed on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image, so as to obtain an enhanced first video frame. Therefore, according to this embodiment, the coding unit mask image corresponding to the first video frame can be used to perform region-adaptive feature enhancement processing on the first video frame to obtain the enhanced first video frame, and matched feature enhancement can be achieved for different regions of the video data in a region-adaptive manner, thereby avoiding introducing noise into the video data, improving the quality of the video data, and improving the viewing experience of the video data.
In step S102, the video data may be any video data that needs to be subjected to feature enhancement processing to improve its quality, which is not limited in this embodiment. In step S102, a coding unit mask image corresponding to the first video frame in the video data is also acquired, where the coding unit mask image is used to represent the coding unit division result of the first video frame. In this embodiment, the video data includes the first video frame, and the first video frame may be any video frame in the video data. The coding unit mask image corresponding to the first video frame may be an image with the same size as the first video frame, in which each divided coding unit is marked and each coding unit contains at least one pixel, so that the coding unit division result of the first video frame is represented by the coding unit mask image.
In one embodiment, acquiring an encoding unit mask image corresponding to a first video frame in video data includes:
performing coding unit division on the first video frame based on a rate-distortion optimization algorithm through a coding unit division network in the video processing model, to obtain the coding unit mask image corresponding to the first video frame.
In this embodiment, the video processing model includes a coding unit division network, which may divide an image into coding units based on a rate-distortion optimization (RDO) algorithm. In one embodiment, the coding unit division network may first divide an input image into a series of coding tree units (CTUs), and then divide each CTU into individual coding units (CUs) based on the RDO algorithm, where the sizes of the CUs may be the same or different, so as to obtain the coding unit division result of the image, and the coding unit division result may be represented by the coding unit mask image.
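For intuition only, the following Python sketch shows one way such a coding unit mask image could be rasterized once the CU partition of a frame is known. The `(x, y, w, h)` rectangle format, the `cu_partition_to_mask` name, and the size-based labelling are illustrative assumptions; the patent does not specify how the mask values encode the division result.

```python
import numpy as np

def cu_partition_to_mask(frame_h, frame_w, cu_list):
    """Rasterize a coding-unit partition into a mask image (illustrative sketch).

    cu_list: iterable of (x, y, w, h) rectangles covering the frame, e.g. the
    output of an RDO-based CTU/CU split (hypothetical format, not the patent's).
    The mask has the same spatial size as the video frame; here each pixel is
    labelled with a value derived from the size of the CU it belongs to.
    """
    mask = np.zeros((frame_h, frame_w), dtype=np.float32)
    for (x, y, w, h) in cu_list:
        # Assumed labelling: normalized log2 CU size, so large CUs (typically
        # flat regions) and small CUs (detailed regions) get distinct values.
        mask[y:y + h, x:x + w] = np.log2(max(w, h)) / np.log2(64)
    return mask

# Toy example: a 64x128 frame split into one 64x64 CU and four 32x32 CUs.
mask = cu_partition_to_mask(64, 128, [(0, 0, 64, 64),
                                      (64, 0, 32, 32), (96, 0, 32, 32),
                                      (64, 32, 32, 32), (96, 32, 32, 32)])
```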
Based on this, fig. 2 is a schematic diagram of a processing principle of a video processing model according to an embodiment of the present disclosure, and as shown in fig. 2, the video processing model includes a coding unit partition network, which may also be referred to as a CP network. In this embodiment, a first video frame is input to a coding unit division network, and coding unit division is performed on the first video frame based on a rate distortion optimization algorithm through the coding unit division network, so as to obtain a coding unit mask image corresponding to the first video frame.
It can be seen that, according to this embodiment, coding unit division can be performed on the first video frame based on the rate-distortion optimization algorithm through the coding unit division network in the video processing model, so as to obtain the coding unit mask image corresponding to the first video frame. Performing coding unit division based on the rate-distortion optimization algorithm can reduce the quality distortion of the encoded first video frame as much as possible while limiting its code rate, thereby improving coding efficiency.
In the step S104, the first video frame is subjected to the region adaptive feature enhancement processing according to the first video frame, the previous video frame of the first video frame in the video data, the next video frame of the first video frame in the video data, and the encoding unit mask image by the video processing model, so as to obtain the enhanced first video frame.
In this embodiment, the video processing model may be obtained by performing feature enhancement processing on a degraded sample video frame to obtain an enhanced sample video frame and training with a rate-distortion loss corresponding to the enhanced sample video frame, where the rate-distortion loss is used to measure the post-encoding code rate and the post-encoding quality loss of the enhanced sample video frame.
A degraded sample video frame refers to a video frame whose quality (for example, its definition) is low after degradation processing. Feature enhancement processing is performed on the degraded sample video frame to obtain an enhanced sample video frame, and the video processing model is trained with the rate-distortion loss corresponding to the enhanced sample video frame. Since the rate-distortion loss measures the post-encoding code rate and the post-encoding quality loss of the enhanced sample video frame, the trained video processing model can take both into account, so that the post-encoding code rate of the enhanced video is as small as possible and the post-encoding quality loss is as small as possible. In this way, the code rate of the enhanced video data is reduced as much as possible while the video data is enhanced, the code rate of the enhanced video data is allocated to regions that noticeably affect the video quality, reasonable allocation of the code rate is achieved, and the transmission efficiency and storage efficiency of the video data are improved.
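For reference, a rate-distortion training objective of this kind is commonly written in the following generic form; the trade-off weight and the specific rate and distortion measures applied to the enhanced sample video frame are not given in the text, so this is a sketch under those assumptions rather than the patent's exact loss.

```latex
% Generic rate-distortion loss (sketch; \lambda, R(\cdot) and D(\cdot) are assumed forms)
\mathcal{L}_{RD} = R\big(\mathrm{enc}(\hat{x}_{enh})\big)
                 + \lambda \, D\big(\mathrm{dec}(\mathrm{enc}(\hat{x}_{enh})),\ x_{ref}\big)
```

Here \hat{x}_{enh} denotes the enhanced sample video frame, R(\cdot) the post-encoding code rate, and D(\cdot) the post-encoding quality loss measured against a reference frame x_{ref}.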
In addition, in this embodiment, the coding unit division result obtained based on the rate-distortion optimization algorithm is used as prior information, which can effectively guide and constrain the distribution of the additional code rate introduced after the video data is enhanced, so that joint optimization of visual enhancement of the video data and effective bit-rate saving can be achieved. Experiments show that the video processing method in this embodiment can effectively save bit rate while improving video quality, thereby significantly improving video transmission and storage efficiency.
The specific process of enhancing video data through the video processing model will be described in detail later, as will the training process and the cases where the first video frame has no previous video frame (for example, the first video frame is the first frame of the video data) or has no subsequent video frame (for example, the first video frame is the last frame of the video data). It should be noted that the previous video frame of the first video frame refers to the video frame immediately before the first video frame, and the next video frame of the first video frame refers to the video frame immediately after the first video frame, in timestamp order.
In one embodiment, performing, through the video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image to obtain the enhanced first video frame includes:
generating, through the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions respectively according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image, where the enhanced image features are in one-to-one correspondence with the feature propagation directions; and
generating, through the video processing model, the enhanced first video frame according to each enhanced image feature and the first video frame.
In this embodiment, the first video frame, the previous video frame of the first video frame, and the next video frame of the first video frame represent at least a first propagation direction in which video features propagate from front to back and a second propagation direction in which video features propagate from back to front. Therefore, through the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions are generated according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image, where each propagation direction corresponds to one enhanced image feature, so that a plurality of enhanced image features in one-to-one correspondence with the feature propagation directions can be obtained. Since the enhanced image features are determined in combination with the coding unit mask image, they can be regarded as region-adaptive enhanced image features of the first video frame. Next, the enhanced first video frame is generated through the video processing model according to each enhanced image feature and the first video frame.
It can be seen that, through this embodiment, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions are generated through the video processing model according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image, and the enhanced first video frame is generated according to each enhanced image feature and the first video frame, so that long-term and global context information of the video data is obtained through multidirectional feature propagation, and high-quality feature enhancement of the video data is achieved.
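The following PyTorch-style sketch illustrates one plausible way the per-direction enhanced image features and the first video frame could be combined into the enhanced first video frame: concatenate the directional features, map them to a residual image, and add it to the input frame. The `DirectionalFusion` module, the channel counts, and the residual formulation are assumptions for illustration, not the patent's exact reconstruction step.

```python
import torch
import torch.nn as nn

class DirectionalFusion(nn.Module):
    """Sketch: fuse forward/backward enhanced features into an enhanced frame."""
    def __init__(self, feat_channels=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(feat_channels, 3, 3, padding=1),  # map features to an RGB residual
        )

    def forward(self, frame, feat_fwd, feat_bwd):
        # Concatenate the per-direction enhanced features, predict a residual,
        # and add it to the original frame to obtain the enhanced frame.
        residual = self.fuse(torch.cat([feat_fwd, feat_bwd], dim=1))
        return frame + residual
```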
In one embodiment, generating, through the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions respectively according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image includes:
generating, through a first enhancement network in the video processing model, a region-adaptive first enhanced image feature of the first video frame in a first propagation direction in which video features propagate from front to back, according to the first video frame, the previous video frame, and the coding unit mask image; and
generating, through a second enhancement network in the video processing model, a region-adaptive second enhanced image feature of the first video frame in a second propagation direction in which video features propagate from back to front, according to the first video frame, the next video frame, and the coding unit mask image.
In this embodiment, the first video frame, the previous video frame of the first video frame, and the next video frame of the first video frame represent at least a first propagation direction in which video features propagate from front to back and a second propagation direction in which video features propagate from back to front. Therefore, referring to FIG. 2, the video processing model includes a first enhancement network and a second enhancement network, where the first enhancement network may generate the region-adaptive first enhanced image feature of the first video frame in the first propagation direction based on the first video frame, the previous video frame, and the coding unit mask image, and the second enhancement network may generate the region-adaptive second enhanced image feature of the first video frame in the second propagation direction based on the first video frame, the next video frame, and the coding unit mask image.
It can be seen that, through this embodiment, region-adaptive enhanced image features of the first video frame in the first propagation direction and the second propagation direction can be obtained, so that long-term and global context information of the video data is obtained through bidirectional feature propagation, and high-quality feature enhancement of the video data is achieved.
In one embodiment, generating, through the first enhancement network in the video processing model, the region-adaptive first enhanced image feature of the first video frame in the first propagation direction in which video features propagate from front to back, according to the first video frame, the previous video frame, and the coding unit mask image includes:
predicting, through a first feature prediction sub-network in the first enhancement network, a first initial enhanced image feature of the first video frame in the first propagation direction according to the first video frame, the previous video frame, and an enhanced image feature of the previous video frame in the first propagation direction; and
performing feature enhancement processing on the first initial enhanced image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced image feature of the first video frame in the first propagation direction, where the first final enhanced image feature serves as the first enhanced image feature, and the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
Referring to FIG. 2, the first enhancement network of the video processing model is used to generate the region-adaptive first enhanced image feature of the first video frame, and the first enhancement network includes a first feature prediction sub-network and a first feature enhancement sub-network. In this embodiment, the first video frame, the previous video frame of the first video frame, and the enhanced image feature of the previous video frame in the first propagation direction are acquired, where the first propagation direction refers to the direction in which video features propagate from front to back. When the video data is enhanced, the video data may be divided into a plurality of sequences, and for each sequence, the frames are processed one by one from the first frame of the sequence to the last frame of the sequence, so as to obtain the first enhanced image feature of each frame in the first propagation direction. In this way, the enhanced image feature of the previous video frame of the first video frame in the first propagation direction is obtained, where this enhanced image feature is the enhanced image feature output by the first enhancement network when processing the previous video frame.
Then, the first video frame, the previous video frame of the first video frame, and the enhanced image feature of the previous video frame in the first propagation direction are input to the first feature prediction sub-network, and the first initial enhanced image feature of the first video frame in the first propagation direction is predicted through the first feature prediction sub-network.
Then, the first initial enhanced image feature and the coding unit mask image corresponding to the first video frame are input to the first feature enhancement sub-network, and feature enhancement processing is performed on the first initial enhanced image feature through the first feature enhancement sub-network to obtain the first final enhanced image feature of the first video frame in the first propagation direction, which serves as the region-adaptive first enhanced image feature of the first video frame in the first propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
The process of generating, through the first enhancement network, the region-adaptive first enhanced image feature of the first video frame in the first propagation direction in which video features propagate from front to back, from the first video frame, the previous video frame, and the coding unit mask image, can be understood by the following formula:

\hat{f}_t^f = F_f(LQ_t, LQ_{t-1}, M_t, \hat{f}_{t-1}^f)    (1)

In formula (1), \hat{f}_t^f represents the first enhanced image feature of the first video frame, F_f represents the processing of the first enhancement network, which includes the processing of the first feature prediction sub-network and the processing of the first feature enhancement sub-network, LQ_t represents the first video frame, LQ_{t-1} represents the previous video frame of the first video frame, M_t represents the coding unit mask image corresponding to the first video frame, and \hat{f}_{t-1}^f represents the enhanced image feature of the previous video frame in the first propagation direction.
In this embodiment, when the video data is enhanced, the video data may be divided into a plurality of sequences, and for each sequence, the frames are processed one by one from the first frame of the sequence to the last frame of the sequence, so as to obtain the first enhanced image feature of each frame in the first propagation direction. Thus, the enhanced image feature of the previous video frame of the first video frame in the first propagation direction is obtained first, where this enhanced image feature is the enhanced feature output by the first enhancement network when processing the previous video frame. It can also be seen from the formula that the enhanced image feature of the previous video frame of the first video frame in the first propagation direction is the enhanced feature output by the first enhancement network when processing the previous video frame.
Therefore, when the first video frame is the first frame of a sequence, and the previous video frame and the enhanced image feature of the previous video frame in the first propagation direction therefore do not exist, the first feature prediction sub-network can be skipped, and feature enhancement processing is performed on the first video frame directly through the first feature enhancement sub-network to obtain the first enhanced image feature of the first video frame in the first propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
It can be seen that, when the first enhanced image feature of the first video frame in the first propagation direction is generated, the enhanced image feature of the previous video frame is taken into account and further enhanced through the first feature enhancement sub-network, which improves the depth-enhancement optimization effect on the video features. As can also be seen from the flow shown in FIG. 2, the enhanced image feature of the previous video frame in the first propagation direction is output by the first enhancement network after processing the previous video frame.
It can be seen that, through this embodiment, the first initial enhanced image feature of the first video frame in the first propagation direction can be predicted through the first feature prediction sub-network according to the first video frame, the previous video frame, and the enhanced image feature of the previous video frame in the first propagation direction, and then feature enhancement processing is performed on the first initial enhanced image feature through the first feature enhancement sub-network to obtain the first enhanced image feature of the first video frame in the first propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image. In this way, by combining feature prediction and feature enhancement, the first enhanced image feature of the first video frame in the first propagation direction is further predicted and enhanced on the basis of the enhanced image feature of the previous video frame in the first propagation direction, achieving a depth-enhancement optimization effect on the video features.
In one embodiment, predicting, through the first feature prediction sub-network in the first enhancement network, the first initial enhanced image feature of the first video frame in the first propagation direction according to the first video frame, the previous video frame, and the enhanced image feature of the previous video frame in the first propagation direction includes:
estimating, by a first optical flow feature estimation unit in the first feature prediction sub-network, a first optical flow feature between the first video frame and a previous video frame;
And performing spatial alignment on the first optical flow characteristic and the enhanced image characteristic of the previous video frame in the first propagation direction through a first spatial warping unit in the first characteristic prediction sub-network to obtain a first initial enhanced image characteristic of the first video frame in the first propagation direction.
Referring to FIG. 2, the first feature prediction sub-network includes a first optical flow feature estimation unit, which may be a FlowNet unit, and a first spatial warping unit, which may be a Warp unit. In the present embodiment, the first video frame LQ_t and the previous video frame LQ_{t-1} are input to the first optical flow feature estimation unit, and the first optical flow feature s_t^f between the first video frame and the previous video frame is estimated by the first optical flow feature estimation unit. The first optical flow feature may be an optical flow feature of the first video frame relative to the previous video frame, or may be an optical flow feature of the previous video frame relative to the first video frame. The processing procedure of the first optical flow feature estimation unit can be expressed by the formula:

s_t^f = \mathrm{FlowNet}(LQ_t, LQ_{t-1})    (2)

Next, the first optical flow feature s_t^f and the enhanced image feature \hat{f}_{t-1}^f of the previous video frame in the first propagation direction are input to the first spatial warping unit, and the first spatial warping unit spatially aligns the first optical flow feature and the enhanced image feature of the previous video frame in the first propagation direction, to obtain the first initial enhanced image feature \bar{f}_t^f of the first video frame in the first propagation direction. The processing of the first spatial warping unit can be expressed as:

\bar{f}_t^f = \mathrm{Warp}(s_t^f, \hat{f}_{t-1}^f)    (3)
Therefore, according to this embodiment, the first initial enhanced image feature of the first video frame in the first propagation direction can be accurately predicted by means of optical flow prediction and spatial alignment, combining the previous video frame and the enhanced image feature of the previous video frame in the first propagation direction, so that the accuracy of feature generation is improved by using the context information of the video.
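As an illustration of this optical-flow-plus-warping step, the sketch below aligns the previous frame's enhanced feature to the current frame with `torch.nn.functional.grid_sample`. The flow estimator itself is omitted (any FlowNet-style network producing a per-pixel (dx, dy) field would do), and the flow layout and normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feat_prev, flow):
    """Warp the previous frame's enhanced feature to the current frame via optical flow.

    feat_prev: (B, C, H, W) enhanced feature of the previous frame.
    flow:      (B, 2, H, W) per-pixel displacement (dx, dy) in pixels, e.g. the
               output of a FlowNet-style estimator (assumed layout).
    Returns the first initial enhanced image feature of the current frame.
    """
    b, _, h, w = feat_prev.shape
    # Base sampling grid in pixel coordinates
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(feat_prev.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                   # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(feat_prev, sample_grid, align_corners=True)
```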
In one embodiment, performing feature enhancement processing on the first initial enhanced image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced image feature of the first video frame in the first propagation direction, including:
Performing global feature enhancement processing on the first initial enhanced image feature by a first feature enhancement unit in a first feature enhancement sub-network in a feature extraction mode to obtain an enhanced first initial enhanced image feature;
And performing region-adaptive feature enhancement processing on the enhanced first initial enhanced image feature by using a mask image of the coding unit based on an attention mechanism through a first mask attention unit in the first feature enhancement sub-network to obtain a first final enhanced image feature of the first video frame in a first propagation direction.
Referring to FIG. 2, the first feature enhancement sub-network includes a first feature enhancement unit, which may be referred to as an EB_f unit, and a first mask attention unit, which may be referred to as an MA_f unit. In FIG. 2, the number of first feature enhancement units may be one or more, that is, N in the figure, where N is a natural number greater than or equal to 1. In this embodiment, the first initial enhanced image feature is input to the first feature enhancement unit, and the first feature enhancement unit performs global feature enhancement processing on the first initial enhanced image feature through feature extraction, so as to obtain the enhanced first initial enhanced image feature f_t^f. The operation of the first feature enhancement unit may be expressed by the formula:

f_t^f = \mathrm{EB}_f(\bar{f}_t^f)    (4)

Then, the coding unit mask image M_t and the enhanced first initial enhanced image feature f_t^f are input to the first mask attention unit, and the first mask attention unit performs region-adaptive feature enhancement processing on the enhanced first initial enhanced image feature using the coding unit mask image based on an attention mechanism, to obtain the first final enhanced image feature of the first video frame in the first propagation direction, where the first final enhanced image feature is the first enhanced image feature \hat{f}_t^f of the first video frame in the first propagation direction. The operation of the first mask attention unit may be expressed as:

\hat{f}_t^f = \mathrm{MA}_f(M_t, f_t^f)    (5)
Therefore, according to this embodiment, global enhancement and region-adaptive enhancement based on the coding unit mask image can be performed on the first initial enhanced image feature, so that the first enhanced image feature of the first video frame in the first propagation direction can be accurately generated, the video enhancement effect is improved, and region-adaptive enhancement is achieved.
Fig. 3 is a schematic diagram of the processing principle of the first feature enhancement unit provided in an embodiment of the present disclosure. As mentioned above, the number of first feature enhancement units may be one or more; when there are a plurality of first feature enhancement units, the input of the first unit is the first initial enhanced image feature, the output of each unit is the input of the next unit, and the output of the last unit is the enhanced first initial enhanced image feature.
FIG. 3 illustrates an example in which the number of first feature enhancement units is 1. As shown in FIG. 3, the first feature enhancement unit includes a first convolution layer, a first activation layer, a second convolution layer, a third convolution layer, and a fourth convolution layer. After the first initial enhanced image feature is processed through the first convolution layer and the first activation layer, the processing result is transmitted to the second convolution layer and the third convolution layer for further processing. The processing result of the first activation layer is multiplied by the processing result of the third convolution layer, the multiplied result is processed through the fourth convolution layer, and the output of the fourth convolution layer is residual-connected with the first initial enhanced image feature to obtain the enhanced first initial enhanced image feature.
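A minimal PyTorch sketch of a feature enhancement unit following one reading of the FIG. 3 description is given below. The text does not fully specify how the second convolution layer feeds the multiplication, so the sequential conv2-then-conv3 wiring and the channel count are assumptions; only the conv1-plus-activation front end, the element-wise multiplication with the third convolution's output, the fourth convolution, and the residual connection follow the description directly.

```python
import torch
import torch.nn as nn

class FeatureEnhanceBlock(nn.Module):
    """Sketch of an EB-style feature enhancement unit (one reading of FIG. 3)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act1 = nn.LeakyReLU(0.1, inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # further processing (assumed sequential)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.act1(self.conv1(x))    # first convolution + first activation
        z = self.conv3(self.conv2(y))   # second and third convolution layers
        gated = y * z                   # activation output multiplied by the conv3 output
        return x + self.conv4(gated)    # fourth convolution, residual with the block input
```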
Fig. 4 is a schematic diagram of the processing principle of the first mask attention unit provided in an embodiment of the present disclosure. As shown in FIG. 4, the first mask attention unit includes a fifth convolution layer, a second activation layer, a sixth convolution layer, and a seventh convolution layer. After the coding unit mask image is processed through the fifth convolution layer and the second activation layer, the processing result is transmitted to the sixth convolution layer on the one hand and to the seventh convolution layer on the other hand. The processing result of the sixth convolution layer is multiplied with the enhanced first initial enhanced image feature, and the multiplied result is residual-connected with the output of the seventh convolution layer to obtain the first enhanced image feature of the first video frame in the first propagation direction.
In fig. 3 and 4, parameters of each convolution layer and each activation layer may be obtained by means of model training. In this embodiment, the first feature enhancement unit may enhance the input features by using feature extraction and fine alignment. The first mask attention unit may perform a feature affine transformation of the enhanced first initial enhanced image feature using the coding unit mask image based on an attention mechanism, the first mask attention unit may automatically learn importance degrees of respective coding units in the first video frame according to the coding unit mask image, and automatically learn parameters of the feature affine transformation of the respective coding units based on the importance degrees of the coding units, thereby implementing the region-adaptive video enhancement.
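A corresponding sketch of the mask attention unit in FIG. 4 is shown below: the coding unit mask is mapped through a convolution and activation, then split into a multiplicative branch and an additive branch, which amounts to a learned per-pixel affine transform of the feature. The channel counts and the single-channel mask input are assumptions.

```python
import torch
import torch.nn as nn

class MaskAttention(nn.Module):
    """Sketch of an MA-style mask attention unit (FIG. 4 reading)."""
    def __init__(self, feat_channels=64, mask_channels=1):
        super().__init__()
        self.conv5 = nn.Conv2d(mask_channels, feat_channels, 3, padding=1)
        self.act2 = nn.LeakyReLU(0.1, inplace=True)
        self.conv6 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)  # multiplicative branch
        self.conv7 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)  # additive (residual) branch

    def forward(self, feat, cu_mask):
        m = self.act2(self.conv5(cu_mask))
        # Region-adaptive affine transform of the enhanced feature, driven by the CU mask
        return feat * self.conv6(m) + self.conv7(m)
```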
Referring to FIG. 2, in this embodiment, the video processing model further includes a second enhancement network for generating the region-adaptive second enhanced image feature of the first video frame in the second propagation direction, in which video features propagate from back to front, based on the first video frame, the subsequent video frame, and the coding unit mask image. In one embodiment, generating, through the second enhancement network in the video processing model, the region-adaptive second enhanced image feature of the first video frame in the second propagation direction according to the first video frame, the subsequent video frame, and the coding unit mask image includes:
predicting, through a second feature prediction sub-network in the second enhancement network, a second initial enhanced image feature of the first video frame in the second propagation direction according to the first video frame, the subsequent video frame, and an enhanced image feature of the subsequent video frame in the second propagation direction; and
performing feature enhancement processing on the second initial enhanced image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced image feature of the first video frame in the second propagation direction, where the second final enhanced image feature serves as the second enhanced image feature, and the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
Referring to FIG. 2, the second enhancement network of the video processing model is used to generate the second enhanced image feature of the first video frame, and the second enhancement network includes a second feature prediction sub-network and a second feature enhancement sub-network. In this embodiment, the first video frame, the subsequent video frame of the first video frame, and the enhanced image feature of the subsequent video frame in the second propagation direction are acquired, where the second propagation direction refers to the direction in which video features propagate from back to front. When the video data is enhanced, the video data may be divided into a plurality of sequences, and for each sequence, the frames are processed one by one from the last frame of the sequence to the first frame of the sequence, so as to obtain the second enhanced image feature of each frame in the second propagation direction. In this way, the enhanced image feature of the subsequent video frame of the first video frame in the second propagation direction is obtained, where this enhanced image feature is the enhanced image feature output by the second enhancement network when processing the subsequent video frame.
Then, the first video frame, the subsequent video frame of the first video frame, and the enhanced image feature of the subsequent video frame in the second propagation direction are input to the second feature prediction sub-network, and the second initial enhanced image feature of the first video frame in the second propagation direction is predicted through the second feature prediction sub-network.
Then, the second initial enhanced image feature and the coding unit mask image corresponding to the first video frame are input to the second feature enhancement sub-network, and feature enhancement processing is performed on the second initial enhanced image feature through the second feature enhancement sub-network to obtain the second final enhanced image feature of the first video frame in the second propagation direction, which serves as the region-adaptive second enhanced image feature of the first video frame in the second propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
The process of generating, through the second enhancement network, the region-adaptive second enhanced image feature of the first video frame in the second propagation direction in which video features propagate from back to front, based on the first video frame, the subsequent video frame, and the coding unit mask image, can be understood by the following formula:

\hat{f}_t^b = B_b(LQ_t, LQ_{t+1}, M_t, \hat{f}_{t+1}^b)    (6)

In formula (6), \hat{f}_t^b represents the second enhanced image feature of the first video frame, B_b represents the processing of the second enhancement network, which includes the processing of the second feature prediction sub-network and the processing of the second feature enhancement sub-network, LQ_t represents the first video frame, LQ_{t+1} represents the subsequent video frame of the first video frame, M_t represents the coding unit mask image corresponding to the first video frame, and \hat{f}_{t+1}^b represents the enhanced image feature of the subsequent video frame in the second propagation direction.
In this embodiment, when the video data is enhanced, the video data may be divided into a plurality of sequences. For each sequence, feature propagation in the first propagation direction is performed first: the frames are processed one by one from the first frame of the sequence to the last frame of the sequence, so as to obtain the first enhanced image feature of each frame in the first propagation direction, thereby obtaining the enhanced image feature of the previous video frame of the first video frame in the first propagation direction, where this enhanced image feature is the enhanced feature output by the first enhancement network when processing the previous video frame.
For each sequence, feature propagation in the second propagation direction is then performed: the frames are processed one by one from the last frame of the sequence to the first frame of the sequence, so as to obtain the second enhanced image feature of each frame in the second propagation direction, thereby obtaining the enhanced image feature of the subsequent video frame of the first video frame in the second propagation direction, where this enhanced image feature is the enhanced feature output by the second enhancement network when processing the subsequent video frame. It can also be seen from the formula that the enhanced image feature of the subsequent video frame of the first video frame in the second propagation direction is the enhanced feature output by the second enhancement network when processing the subsequent video frame.
Therefore, when the first video frame is the last frame of a sequence, and the subsequent video frame and the enhanced image feature of the subsequent video frame in the second propagation direction therefore do not exist, the second feature prediction sub-network can be skipped, and feature enhancement processing is performed on the first video frame directly through the second feature enhancement sub-network to obtain the second enhanced image feature of the first video frame in the second propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image.
For each sequence, the video frames in the sequence are propagated bidirectionally in this way. When the bidirectional propagation is completed, the enhanced image feature of any video frame in the first propagation direction (for example, that of the previous video frame of the first video frame) and the enhanced image feature of any video frame in the second propagation direction (for example, that of the subsequent video frame of the first video frame) can be obtained, so that the enhancement processing of the first video frame is realized.
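The two-pass propagation order over one sequence can be sketched as follows. `forward_net`, `backward_net`, and `fusion` are stand-ins for the first enhancement network, the second enhancement network, and the reconstruction step respectively; their call signatures are assumptions made only to show the propagation order.

```python
def propagate_bidirectional(frames, masks, forward_net, backward_net, fusion):
    """Sketch of the two-pass (bidirectional) propagation over one sequence.

    frames, masks: per-frame tensors of one sequence; forward_net, backward_net
    and fusion stand in for the first/second enhancement networks and the
    reconstruction step (signatures are assumptions, not the patent's API).
    When the neighboring frame is None, the networks are expected to skip their
    feature prediction sub-network, as described above.
    """
    T = len(frames)
    fwd_feats, bwd_feats = [None] * T, [None] * T

    # Pass 1: first propagation direction, from the first frame to the last frame.
    feat = None
    for t in range(T):
        prev_frame = frames[t - 1] if t > 0 else None
        feat = forward_net(frames[t], prev_frame, masks[t], feat)
        fwd_feats[t] = feat

    # Pass 2: second propagation direction, from the last frame back to the first frame.
    feat = None
    for t in range(T - 1, -1, -1):
        next_frame = frames[t + 1] if t < T - 1 else None
        feat = backward_net(frames[t], next_frame, masks[t], feat)
        bwd_feats[t] = feat

    # Reconstruct each enhanced frame from the frame and both directional features.
    return [fusion(frames[t], fwd_feats[t], bwd_feats[t]) for t in range(T)]
```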
It can be seen that, when the second enhanced image feature of the first video frame in the second propagation direction is generated, the enhanced image feature of the subsequent video frame is taken into account and further enhanced through the second feature enhancement sub-network, which achieves a depth-enhancement optimization effect on the video features. As can also be seen from the flow shown in FIG. 2, the enhanced image feature of the subsequent video frame in the second propagation direction is output by the second enhancement network after processing the subsequent video frame.
It can be seen that, through this embodiment, the second initial enhanced image feature of the first video frame in the second propagation direction can be predicted through the second feature prediction sub-network according to the first video frame, the subsequent video frame, and the enhanced image feature of the subsequent video frame in the second propagation direction, and then feature enhancement processing is performed on the second initial enhanced image feature through the second feature enhancement sub-network to obtain the second enhanced image feature of the first video frame in the second propagation direction, where the feature enhancement processing includes feature enhancement processing that is region-adaptive according to the coding unit mask image. In this way, by combining feature prediction and feature enhancement, the second enhanced image feature of the first video frame in the second propagation direction is further predicted and enhanced on the basis of the enhanced image feature of the subsequent video frame in the second propagation direction, achieving a depth-enhancement optimization effect on the video features.
In one embodiment, predicting, through the second feature prediction sub-network in the second enhancement network, the second initial enhanced image feature of the first video frame in the second propagation direction according to the first video frame, the subsequent video frame, and the enhanced image feature of the subsequent video frame in the second propagation direction includes:
Estimating, by a second optical flow feature estimation unit in the second feature prediction sub-network, a second optical flow feature between the first video frame and a subsequent video frame;
And performing spatial alignment on the second optical flow characteristic and the enhanced image characteristic of the subsequent video frame in the second propagation direction through a second spatial warping unit in the second characteristic prediction sub-network to obtain a second initial enhanced image characteristic of the first video frame in the second propagation direction.
Referring to FIG. 2, the second feature prediction sub-network includes a second optical flow feature estimation unit, which may be a FlowNet unit, and a second spatial warping unit, which may be a Warp unit. In the present embodiment, the first video frame LQ_t and the subsequent video frame LQ_{t+1} are input to the second optical flow feature estimation unit, and the second optical flow feature s_t^b between the first video frame and the subsequent video frame is estimated by the second optical flow feature estimation unit. The second optical flow feature may be an optical flow feature of the first video frame relative to the subsequent video frame, or may be an optical flow feature of the subsequent video frame relative to the first video frame. The processing procedure of the second optical flow feature estimation unit can be expressed by the formula:

s_t^b = \mathrm{FlowNet}(LQ_t, LQ_{t+1})    (7)

Next, the second optical flow feature s_t^b and the enhanced image feature \hat{f}_{t+1}^b of the subsequent video frame in the second propagation direction are input to the second spatial warping unit, and the second spatial warping unit spatially aligns the second optical flow feature and the enhanced image feature of the subsequent video frame in the second propagation direction, to obtain the second initial enhanced image feature \bar{f}_t^b of the first video frame in the second propagation direction. The processing of the second spatial warping unit can be expressed as:

\bar{f}_t^b = \mathrm{Warp}(s_t^b, \hat{f}_{t+1}^b)    (8)
Therefore, according to this embodiment, the second initial enhanced image feature of the first video frame in the second propagation direction can be accurately predicted by combining the subsequent video frame and its enhanced image feature in the second propagation direction through optical flow prediction and spatial alignment, and the accuracy of feature generation is improved by incorporating the context information of the video.
In one embodiment, feature enhancement processing is performed on the second initial enhanced image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced image feature of the first video frame in the second propagation direction, including:
performing global feature enhancement processing on the second initial enhanced image feature by a second feature enhancement unit in the second feature enhancement sub-network in a feature extraction mode to obtain an enhanced second initial enhanced image feature;
And performing region-adaptive feature enhancement processing on the enhanced second initial enhanced image feature by using a mask image of the coding unit based on an attention mechanism through a second mask attention unit in the second feature enhancer network to obtain a second final enhanced image feature of the first video frame in a second propagation direction.
Referring to fig. 2, the second feature enhancement sub-network includes a second feature enhancement unit, which may be referred to as an EB_b unit, and a second mask attention unit, which may be referred to as an MA_b unit. In fig. 2, the number of second feature enhancement units may be one or more, that is, N in the figure, where N is a natural number greater than or equal to 1. In this embodiment, the second initial enhanced image feature is input to the second feature enhancement unit, and the second feature enhancement unit performs global feature enhancement processing on the second initial enhanced image feature by means of feature extraction, so as to obtain the enhanced second initial enhanced image feature. The operation of the second feature enhancement unit can be expressed as applying the EB_b unit to the second initial enhanced image feature.
Next, the coding unit mask image M_t and the enhanced second initial enhanced image feature are input to the second mask attention unit, and the second mask attention unit performs region-adaptive feature enhancement processing on the enhanced second initial enhanced image feature by using the coding unit mask image based on an attention mechanism, so as to obtain the second final enhanced image feature of the first video frame in the second propagation direction, which is taken as the second enhanced image feature of the first video frame in the second propagation direction. The operation of the second mask attention unit can be expressed as applying the MA_b unit to the coding unit mask image and the enhanced second initial enhanced image feature.
Therefore, according to this embodiment, the second initial enhanced image feature can be subjected to global enhancement and to region-adaptive enhancement based on the coding unit mask image, so that the second enhanced image feature of the first video frame in the second propagation direction is accurately generated, the video enhancement effect is improved, and the purpose of region-adaptive enhancement is achieved.
Fig. 5 is a schematic diagram of the processing principle of a second feature enhancement unit provided in an embodiment of the present disclosure. As mentioned above, the number of second feature enhancement units may be one or more; when there are multiple second feature enhancement units, the input of the first unit is the second initial enhanced image feature, the output of each unit is the input of the next unit, and the output of the last unit is the enhanced second initial enhanced image feature.
Fig. 5 takes the case where the number of second feature enhancement units is 1 as an example. As shown in fig. 5, the second feature enhancement unit includes an eighth convolution layer, a third activation layer, a ninth convolution layer, a tenth convolution layer, and an eleventh convolution layer. After the second initial enhanced image feature is processed by the eighth convolution layer and the third activation layer, the processing result is transmitted to the ninth convolution layer and to the tenth convolution layer for further processing; the processing result of the third activation layer is multiplied element-wise with the processing result of the tenth convolution layer, the multiplied result is processed by the eleventh convolution layer, and the output of the eleventh convolution layer is residually connected with the second initial enhanced image feature, so that the enhanced second initial enhanced image feature is obtained.
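As an illustrative aid only, the following sketch shows one possible PyTorch form of such a feature enhancement unit. The channel count, kernel sizes, the activation type, and in particular which two branches are multiplied (the text above is ambiguous on this point; the sketch multiplies the outputs of the ninth and tenth convolution layers as a gating-style interaction) are assumptions.

```python
import torch
import torch.nn as nn

class FeatureEnhanceUnit(nn.Module):
    """Sketch of an EB-style feature enhancement unit (channel count, kernel
    size and activation type are assumptions; the text does not fix them)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv8 = nn.Conv2d(channels, channels, 3, padding=1)   # eighth convolution layer
        self.act3 = nn.LeakyReLU(0.1, inplace=True)                # third activation layer
        self.conv9 = nn.Conv2d(channels, channels, 3, padding=1)   # ninth convolution layer
        self.conv10 = nn.Conv2d(channels, channels, 3, padding=1)  # tenth convolution layer
        self.conv11 = nn.Conv2d(channels, channels, 3, padding=1)  # eleventh convolution layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act3(self.conv8(x))
        # Combine the two parallel branches by element-wise multiplication;
        # the exact pairing of branches is an assumption.
        gated = self.conv9(y) * self.conv10(y)
        return x + self.conv11(gated)                              # residual connection with the input
```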
Fig. 6 is a schematic diagram of the processing principle of a second mask attention unit according to an embodiment of the present disclosure. As shown in fig. 6, the second mask attention unit includes a twelfth convolution layer, a fourth activation layer, a thirteenth convolution layer, and a fourteenth convolution layer. After the coding unit mask image is processed by the twelfth convolution layer and the fourth activation layer, the processing result is transmitted on the one hand to the thirteenth convolution layer for processing and on the other hand to the fourteenth convolution layer for processing; the processing result of the thirteenth convolution layer is then multiplied element-wise with the enhanced second initial enhanced image feature, and the multiplied result is residually connected with the output of the fourteenth convolution layer to obtain the second enhanced image feature of the first video frame in the second propagation direction.
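As an illustrative aid only, a corresponding sketch of the mask attention unit is given below. It follows the wiring described above: the coding unit mask image produces a multiplicative term (thirteenth convolution layer) and an additive term (fourteenth convolution layer) that modulate the enhanced feature, similar to a spatial feature transform. Channel counts and the activation type are assumptions.

```python
import torch
import torch.nn as nn

class MaskAttentionUnit(nn.Module):
    """Sketch of an MA-style mask attention unit: the coding-unit mask image is
    turned into a per-pixel scale and shift that modulate the enhanced feature.
    Channel counts and the activation type are assumptions."""

    def __init__(self, mask_channels: int = 1, feat_channels: int = 64):
        super().__init__()
        self.conv12 = nn.Conv2d(mask_channels, feat_channels, 3, padding=1)  # twelfth convolution layer
        self.act4 = nn.LeakyReLU(0.1, inplace=True)                          # fourth activation layer
        self.conv13 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)  # thirteenth convolution layer
        self.conv14 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)  # fourteenth convolution layer

    def forward(self, feat: torch.Tensor, cu_mask: torch.Tensor) -> torch.Tensor:
        m = self.act4(self.conv12(cu_mask))
        scale = self.conv13(m)          # region-adaptive multiplicative term
        shift = self.conv14(m)          # region-adaptive additive term
        return feat * scale + shift
```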
In fig. 5 and 6, the parameters of each convolution layer and each activation layer may be obtained by means of model training. In this embodiment, the second feature enhancement unit may enhance the input features by means of feature extraction and fine alignment. The second mask attention unit may perform a feature affine transformation on the enhanced second initial enhanced image feature by using the coding unit mask image based on an attention mechanism; it may automatically learn the importance degree of each coding unit in the first video frame according to the coding unit mask image, and automatically learn the parameters of the feature affine transformation for each coding unit based on that importance degree, thereby realizing region-adaptive video enhancement. In this embodiment, since the region adaptation is realized through the coding unit mask image, the region adaptation may also be understood as coding unit adaptation.
As described above, the enhanced first video frame may further be generated by the video processing model according to the enhanced image features corresponding to the respective feature propagation directions and the first video frame. It can be appreciated that each feature propagation direction corresponds to one enhanced image feature, and these include the first enhanced image feature and the second enhanced image feature.
In one embodiment, generating, by a video processing model, an enhanced first video frame from each of the enhanced image features and the first video frame, includes:
Feature fusion is carried out on each enhanced image feature through a feature fusion network in the video processing model, so that fused image features are obtained;
and carrying out residual connection on the fused image characteristics and the first video frame through a video processing model to obtain an enhanced first video frame.
Referring to fig. 2, the video processing model further includes a feature fusion network, where the feature fusion network includes a feature fusion convolution layer and a feature fusion activation layer, and performs feature stitching on the first enhanced image feature and the second enhanced image feature through the video processing model, and processes the stitched feature through the feature fusion convolution layer and the feature fusion activation layer to obtain a fused image feature. And then, residual connection is carried out on the fused image characteristics and the first video frame through a video processing model, so that the enhanced first video frame R t is obtained.
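As an illustrative aid only, the following sketch shows one possible form of the feature fusion network and the final residual connection described above; the final projection convolution back to image channels and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of a feature fusion network plus residual connection: the forward-
    and backward-direction enhanced features are concatenated, fused by a
    convolution and activation, projected back to image channels, and added to
    the input frame. Layer shapes are assumptions."""

    def __init__(self, feat_channels: int = 64, image_channels: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1),  # feature fusion convolution layer
            nn.LeakyReLU(0.1, inplace=True),                            # feature fusion activation layer
            nn.Conv2d(feat_channels, image_channels, 3, padding=1),     # projection to image channels (assumed)
        )

    def forward(self, lq_t: torch.Tensor,
                feat_fwd: torch.Tensor, feat_bwd: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([feat_fwd, feat_bwd], dim=1))       # feature stitching then fusion
        return lq_t + fused                                             # residual connection with the first video frame
```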
Therefore, through the embodiment, feature fusion can be performed on each enhanced image feature, the fused image feature is obtained, residual connection is performed on the fused image feature and the first video frame, the enhanced first video frame is obtained, as much information as possible of the first video frame is reserved in a residual connection mode, region self-adaptive video enhancement is realized, and the quality of video data is improved.
The application process of the video processing model is described above, and the training process of the video processing model is described below.
In one embodiment, the method further comprises:
acquiring a first degraded sample video frame; performing region self-adaptive feature enhancement processing on the first degraded sample video frame through a pre-built neural network structure to obtain a first enhanced sample video frame;
training a neural network structure based on the rate distortion loss corresponding to the first enhanced sample video frame; the trained neural network structure is a video processing model.
In this embodiment, a neural network structure is pre-built, and the pre-built neural network structure is the same as the video processing model shown in fig. 2, with the difference that the coding unit division network in fig. 2 is a pre-trained network; in the model training process, only the parts other than the coding unit division network need to be trained. In fig. 2, the parts other than the coding unit division network (including the first enhancement network, the second enhancement network, and the feature fusion network) may be collectively referred to as a video enhancer (CPEnhancer) based on a Coding Tree Unit (CTU) division mask.
In this embodiment, a first degraded sample video frame is obtained, and a region-adaptive feature enhancement processing is performed on the first degraded sample video frame through a pre-built neural network structure, so as to obtain a first enhanced sample video frame, and based on a rate distortion loss corresponding to the first enhanced sample video frame, the neural network structure is trained, and the trained neural network structure is a video processing model.
As can be seen from the foregoing description, a degraded sample video frame refers to a video frame whose quality, such as definition, has been lowered by degradation processing. In this embodiment, the degraded sample video frame is subjected to feature enhancement processing to obtain an enhanced sample video frame, and the video processing model is trained in combination with the rate distortion loss corresponding to the enhanced sample video frame, where the rate distortion loss is used to measure the code rate after encoding and the quality loss after encoding of the enhanced sample video frame.
Therefore, according to this embodiment, the enhanced sample video frame is obtained by performing feature enhancement processing on the degraded sample video frame, and the video processing model is trained in combination with the rate distortion loss corresponding to the enhanced sample video frame. The trained video processing model can thus take into account both the encoded code rate and the encoded quality loss of the enhanced video, keeping both as small as possible, so that while the video data is enhanced, its code rate is reduced as much as possible and the additional code rate is allocated to regions that have an obvious influence on video quality, realizing reasonable code rate allocation and improving the transmission and storage efficiency of the video data.
In one embodiment, acquiring a first degraded sample video frame includes:
Acquiring original sample video data, and performing characteristic degradation treatment on the original sample video data to obtain degraded sample video data; the raw sample video data comprises a first raw sample video frame; the degraded sample video data comprises a first degraded sample video frame corresponding to the first original sample video frame;
training a neural network structure based on a rate distortion loss corresponding to the first enhanced sample video frame, comprising:
And training the neural network structure by using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame.
Fig. 7 is a schematic diagram of a training process of a video processing model according to an embodiment of the disclosure, as shown in fig. 7, where the process includes:
step S702, obtaining original sample video data, and performing characteristic degradation processing on the original sample video data to obtain degraded sample video data; the raw sample video data comprises a first raw sample video frame; the degraded sample video data comprises a first degraded sample video frame corresponding to the first original sample video frame;
step S704, performing region self-adaptive feature enhancement processing on the first degraded sample video frame through a pre-built neural network structure to obtain a first enhanced sample video frame;
Step S706, training a neural network structure by using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame, where the trained neural network structure is a video processing model.
In step S702, original sample video data is acquired, where the original sample video data is high-quality video data and includes a plurality of original sample video frames. And then, performing characteristic degradation processing on the original sample video data to obtain degraded sample video data, wherein the degradation processing can be fuzzy processing, and the degraded sample video data is low-quality video data and comprises a plurality of degraded sample video frames. In this embodiment, the original sample video data includes a first original sample video frame; the degraded sample video data includes a first degraded sample video frame corresponding to the first original sample video frame.
In the above step S704, the first degraded sample video frame is subjected to the region adaptive feature enhancement processing through the pre-built neural network structure, so as to obtain the first enhanced sample video frame, which is similar to the previously described region adaptive feature enhancement processing for the first video frame, and will be described in detail later.
In step S706, the neural network structure is trained using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame, and the rate distortion loss corresponding to the first enhanced sample video frame.
Therefore, according to the embodiment, the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame are considered at the same time, so that the video processing model obtained through training can improve the video quality after enhancement as much as possible while perceiving the video code rate, and the code rate is saved while enhancing the video effect.
In one embodiment, performing feature degradation processing on original sample video data to obtain degraded sample video data, including:
dividing an original sample video frame in original sample video data into coding units;
determining the fuzzy strength of each coding unit of the original sample video frame according to the size of each coding unit of the original sample video frame obtained by dividing;
and carrying out degradation treatment on the original sample video frame according to the fuzzy strength of each coding unit of the original sample video frame to obtain degraded sample video data.
First, original sample video data is acquired, and high-quality video data may be acquired as the original sample video data by any method, which is not limited herein. Next, coding unit division is performed for each original sample video frame in the original sample video data. Alternatively, the coding unit division may be performed for each original sample video frame based on a rate distortion optimization algorithm.
Then, according to the size of each coding unit of each original sample video frame obtained by dividing, the blurring strength of each coding unit of each original sample video frame is determined. In one embodiment, the coding unit with the largest area and the coding unit with the smallest area may be set with lower blur intensities, and the other coding units may be set with higher blur intensities.
Fig. 8 is a schematic diagram of setting blur intensities according to the sizes of coding units according to an embodiment of the present disclosure, where, as shown in fig. 8, coding units are divided into local areas of one original sample video frame to obtain a plurality of coding units, and the blur intensities of the coding unit with the largest area and the coding unit with the smallest area are set to be lower, for example, 1, and the blur intensities of other coding units are set to be higher, for example, 2.
In this embodiment, the blur intensities of the coding unit with the largest area and the coding unit with the smallest area are set to be lower, and the blur intensities of the other coding units are set to be higher. Since the largest coding units generally represent texture-sparse regions and the smallest coding units generally represent texture-dense regions, and since the magnitude of the blur intensity guides the magnitude of the enhancement intensity (the blur intensity is in direct proportion to the enhancement intensity), this setting avoids the problem of introducing noise by applying an excessive enhancement effect to texture-sparse regions and texture-dense regions during video enhancement.
And finally, carrying out degradation treatment on each original sample video frame according to the fuzzy strength of each coding unit of each original sample video frame to obtain each degradation sample video frame, wherein the degradation sample video frames correspond to the original sample video frames one by one, and each degradation sample video frame forms degradation sample video data.
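As an illustrative aid only, the following sketch shows one way the coding-unit-adaptive degradation described above could be realized; the rectangle representation of the coding units, the concrete blur intensities, and the use of a Gaussian blur are assumptions.

```python
import numpy as np
import cv2

def degrade_frame(frame: np.ndarray, coding_units: list) -> np.ndarray:
    """Sketch of coding-unit-adaptive degradation for building training data.

    `coding_units` is assumed to be a list of (x, y, w, h) rectangles produced
    by the coding unit division; the largest- and smallest-area units receive a
    low blur intensity and all other units a higher one, and each region is
    blurred with a Gaussian kernel whose sigma follows that intensity. The
    exact intensities and kernel sizes are assumptions.
    """
    areas = [w * h for (_, _, w, h) in coding_units]
    lo, hi = min(areas), max(areas)
    degraded = frame.copy()
    for (x, y, w, h), area in zip(coding_units, areas):
        intensity = 1.0 if area in (lo, hi) else 2.0          # lower blur for largest/smallest units
        patch = frame[y:y + h, x:x + w]
        degraded[y:y + h, x:x + w] = cv2.GaussianBlur(patch, (0, 0), sigmaX=intensity)
    return degraded
```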
Therefore, according to this embodiment, on the one hand, feature degradation processing is performed on the original sample video data to obtain the degraded sample video data, achieving the generation of training data for the video processing model; on the other hand, the blur intensity of each coding unit is determined according to the size of that coding unit, so that different regions of the original sample video frame are blurred with different intensities, which prepares for region adaptation during subsequent video enhancement.
The degraded sample video data comprises a first degraded sample video frame. In one embodiment, performing, by using a pre-built neural network structure, a region-adaptive feature enhancement process on a first degraded sample video frame to obtain a first enhanced sample video frame, where the method includes:
acquiring a sample coding unit mask image corresponding to a first degraded sample video frame; the sample coding unit mask image is used to represent coding unit division results of the first degraded video frame;
And performing region self-adaptive feature enhancement processing on the first degraded sample video frame according to the first degraded sample video frame, a previous degraded sample video frame of the first degraded sample video frame in the degraded sample video data, a next degraded sample video frame of the first degraded sample video frame in the degraded sample video data and a sample coding unit mask image to obtain the first enhanced sample video frame.
Fig. 9 is a schematic diagram of the training principle of a video processing model provided in an embodiment of the present disclosure. As shown in fig. 9, the pre-built neural network structure is the same as the video processing model shown in fig. 2, with the difference that the coding unit division network in fig. 2 is a pre-trained network; in the model training process, the coding unit division network does not need to be trained, and only the other parts need to be trained.
After the first degraded sample video frame is obtained, the first degraded sample video frame is subjected to coding unit division through a coding unit division network in fig. 9, so that a sample coding unit mask image corresponding to the first degraded sample video frame is obtained, and the sample coding unit mask image is used for representing a coding unit division result of the first degraded video frame.
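As an illustrative aid only, the following sketch shows one possible way a coding unit mask image could be laid out from a division result; the exact encoding of the mask is not specified above, so the per-pixel normalized-area encoding used here is purely an assumption.

```python
import numpy as np

def build_cu_mask(height: int, width: int, coding_units: list) -> np.ndarray:
    """Sketch of turning a coding-unit division result into a mask image.

    `coding_units` is assumed to be a list of (x, y, w, h) rectangles; each
    pixel of the mask stores the normalized area of the coding unit it belongs
    to, so the mask encodes the division result in a form a convolutional
    network can consume. This encoding is only one possible choice.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    max_area = float(max(w * h for (_, _, w, h) in coding_units))
    for x, y, w, h in coding_units:
        mask[y:y + h, x:x + w] = (w * h) / max_area
    return mask
```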
Next, through the part of the neural network structure in fig. 9 other than the coding unit division network, the first degraded sample video frame is subjected to region-adaptive feature enhancement processing according to the first degraded sample video frame, the previous degraded sample video frame of the first degraded sample video frame in the degraded sample video data, the next degraded sample video frame of the first degraded sample video frame in the degraded sample video data, and the sample coding unit mask image, so as to obtain the first enhanced sample video frame.
Therefore, through this embodiment, the first degraded sample video frame can be subjected to region-adaptive feature enhancement processing through the pre-built neural network structure to obtain the first enhanced sample video frame, so that the video processing model can be trained efficiently and quickly by using the first enhanced sample video frame.
In one embodiment, performing, by a neural network structure, a region-adaptive feature enhancement process on a first degraded sample video frame according to the first degraded sample video frame, a previous degraded sample video frame of the first degraded sample video frame in the degraded sample video data, a next degraded sample video frame of the first degraded sample video frame in the degraded sample video data, and a sample coding unit mask image, to obtain a first enhanced sample video frame, including:
Generating regional self-adaptive enhanced sample image features of the first degraded sample video frame in a plurality of feature propagation directions respectively according to the first degraded sample video frame, the previous degraded sample video frame, the next degraded sample video frame and the sample coding unit mask images through a neural network structure; enhancing the one-to-one correspondence of the sample image features and the propagation directions of the features;
And generating a first enhanced sample video frame according to the image characteristics of each enhanced sample and the first degraded sample video frame through the neural network structure.
In this embodiment, the previous degraded sample video frame of the first degraded sample video frame refers to the previous degraded sample video frame whose time stamp is located before the first degraded sample video frame, and the next degraded sample video frame of the first degraded sample video frame refers to the next degraded sample video frame whose time stamp is located after the first degraded sample video frame.
In this embodiment, the first degraded sample video frame, the previous degraded sample video frame of the first degraded sample video frame, and the next degraded sample video frame of the first degraded sample video frame represent at least a first propagation direction in which video features propagate from front to back and a second propagation direction in which video features propagate from back to front. Therefore, region-adaptive enhanced sample image features of the first degraded sample video frame in a plurality of feature propagation directions are generated by the neural network structure according to the first degraded sample video frame, the previous degraded sample video frame, the next degraded sample video frame, and the sample coding unit mask image. Each propagation direction corresponds to one enhanced sample image feature, so a plurality of enhanced sample image features corresponding one-to-one to the feature propagation directions can be obtained; and since the enhanced sample image features are determined here in combination with the sample coding unit mask image, they can be regarded as region-adaptive enhanced sample image features of the first degraded sample video frame. Next, the first enhanced sample video frame is generated by the neural network structure according to the respective enhanced sample image features and the first degraded sample video frame.
Therefore, according to the embodiment, the long-term and global context information of the video data can be acquired through the mode of multi-directional feature propagation, so that the video processing model obtained through training can realize high-quality feature enhancement of the video data.
The training process of the video processing model is relatively similar to the application process of the video processing model, so the training process is only briefly described below; for details that are not described, reference may be made to the foregoing application process.
In one embodiment, generating, by a neural network structure, region-adaptive enhanced sample image features of a first degraded sample video frame in a plurality of feature propagation directions from the first degraded sample video frame, a previous degraded sample video frame, a next degraded sample video frame, and a sample coding unit mask image, respectively, includes:
Generating, by a first enhancement network in the neural network structure, a region-adaptive first enhancement sample image feature of the first degraded sample video frame in a first propagation direction in which the video feature propagates from front to back, according to the first degraded sample video frame, the previous degraded sample video frame, and the sample coding unit mask image;
And generating, by a second enhancement network in the neural network structure, a region-adaptive second enhancement sample image feature of the first degraded sample video frame in a second propagation direction in which the video feature propagates from back to front according to the first degraded sample video frame, the subsequent degraded sample video frame, and the sample coding unit mask image.
Wherein generating, by a first enhancement network in the neural network structure, a region-adaptive first enhanced sample image feature of the first degraded sample video frame in a first propagation direction in which the video feature propagates from front to back, from the first degraded sample video frame, the previous degraded sample video frame, and the sample coding unit mask image, comprises:
predicting, by a first feature prediction sub-network in the first enhancement network, first initial enhanced sample image features of the first degraded sample video frame in the first propagation direction based on the first degraded sample video frame, the previous degraded sample video frame, enhanced sample image features of the previous degraded sample video frame in the first propagation direction;
Performing feature enhancement processing on the first initial enhanced sample image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced sample image feature of the first degraded sample video frame in the first propagation direction, wherein the first final enhanced sample image feature is used as a first enhanced sample image feature; the feature enhancement processing includes feature enhancement processing that is adaptive to the region of the sample coding unit mask image.
Wherein predicting, by a first feature prediction sub-network in the first enhancement network, a first initial enhanced sample image feature of the first degraded sample video frame in the first propagation direction from the first degraded sample video frame, the previous degraded sample video frame, the enhanced sample image feature of the previous degraded sample video frame in the first propagation direction, comprises:
Estimating, by a first optical flow feature estimation unit in the first feature prediction sub-network, first sample optical flow features between the first degraded sample video frame and the previous degraded sample video frame;
And performing spatial alignment on the first sample optical flow characteristic and the enhanced sample image characteristic of the previous degraded sample video frame in the first propagation direction through a first spatial warping unit in the first characteristic prediction sub-network to obtain a first initial enhanced sample image characteristic of the first degraded sample video frame in the first propagation direction.
Wherein performing feature enhancement processing on the first initial enhanced sample image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced sample image feature of the first degraded sample video frame in the first propagation direction includes:
Performing global feature enhancement processing on the first initial enhancement sample image features in a feature extraction mode through a first feature enhancement unit in a first feature enhancement sub-network to obtain enhanced first initial enhancement sample image features;
And performing region-adaptive feature enhancement processing on the enhanced first initial enhanced sample image feature by using a first mask attention unit in the first feature enhancement sub-network based on an attention mechanism by using a sample coding unit mask image to obtain a first final enhanced sample image feature of the first degraded sample video frame in a first propagation direction.
Wherein generating, by a second enhancement network in the neural network structure, a region-adaptive second enhanced sample image feature of the first degraded sample video frame in a second propagation direction in which the video feature propagates from back to front, from the first degraded sample video frame, the subsequent degraded sample video frame, and the sample coding unit mask image, comprises:
Predicting, by a second feature prediction sub-network in the second enhancement network, second initial enhanced sample image features of the first degraded sample video frame in the second propagation direction based on the enhanced sample image features of the first degraded sample video frame, the subsequent degraded sample video frame, and the subsequent degraded sample video frame in the second propagation direction;
Performing feature enhancement processing on the second initial enhanced sample image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced sample image feature of the first degraded sample video frame in a second propagation direction, wherein the second final enhanced sample image feature is used as a second enhanced sample image feature; the feature enhancement processing includes feature enhancement processing that is adaptive to the region of the sample coding unit mask image.
Wherein predicting, by a second feature prediction sub-network in the second enhancement network, second initial enhanced sample image features of the first degraded sample video frame in the second propagation direction from enhanced sample image features of the first degraded sample video frame, the subsequent degraded sample video frame, and the subsequent degraded sample video frame in the second propagation direction comprises:
Estimating, by a second optical flow feature estimation unit in the second feature prediction sub-network, second sample optical flow features between the first degraded sample video frame and the subsequent degraded sample video frame;
And performing spatial alignment on the second sample optical flow characteristic and the enhanced sample image characteristic of the later degraded sample video frame in the second propagation direction through a second spatial warping unit in the second characteristic prediction sub-network to obtain a second initial enhanced sample image characteristic of the first degraded sample video frame in the second propagation direction.
Wherein performing feature enhancement processing on the second initial enhanced sample image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced sample image feature of the first degraded sample video frame in the second propagation direction includes:
performing global feature enhancement processing on the features of the second initial enhanced sample image in a feature extraction mode through a second feature enhancement unit in the second feature enhancement sub-network to obtain enhanced features of the second initial enhanced sample image;
and performing region-adaptive feature enhancement processing on the enhanced second initial enhanced sample image feature by using a mask image of the sample coding unit based on an attention mechanism through a second mask attention unit in the second feature enhancer network to obtain a second final enhanced sample image feature of the first degraded sample video frame in a second propagation direction.
Wherein generating, by the neural network structure, a first enhanced sample video frame from each enhanced sample image feature and the first degraded sample video frame, comprises:
feature fusion is carried out on the features of each enhanced sample image through a feature fusion network in the neural network structure, so that fused sample image features are obtained;
And residual connection is carried out on the fused sample image characteristics and the first degraded sample video frame through a neural network structure, so that the first enhanced sample video frame is obtained.
For aspects of the above model training that are not described in detail, reference may be made to the model application process described above.
In this embodiment, the first original sample video frame, the first degraded sample video frame, and the first enhanced sample video frame correspond to one another. After the first enhanced sample video frame is obtained, the neural network structure is trained by using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame, so as to obtain the video processing model.
In one embodiment, the reconstruction loss between the first original sample video frame and the first enhanced sample video frame may be obtained by the following equation:
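The equation itself is not reproduced in the text; assuming a Charbonnier-type penalty consistent with the symbols defined below, one plausible form is:

L_{rec} = \sqrt{\left\| GT_t - R_t \right\|_2^2 + \epsilon^2}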
In this formula, L_rec represents the reconstruction loss, GT_t represents the first original sample video frame, R_t represents the first enhanced sample video frame, and ε represents a preset parameter.
In one embodiment, the rate distortion loss corresponding to the first enhanced sample video frame may be obtained in the following manner:
Encoding the first enhanced sample video frame through a neural video encoder to obtain a first encoded sample video frame;
And determining the rate distortion loss corresponding to the first enhanced sample video frame according to the first original sample video frame, the first coded sample video frame and the first enhanced sample video frame.
As shown in fig. 9, a neural video encoder (Neural Video Codec, NVC), which is an encoder that uses deep learning techniques for video compression, is also introduced when training the video processing model. Unlike conventional video coding standards (e.g., H.264, H.265/HEVC, or AV1), a neural video encoder uses a neural network model to compress and decompress video data more effectively. The neural video encoder may approximate the rate-distortion behavior of a standard encoder, thereby enabling the enhanced video data to maintain a higher video quality with as little code rate as possible.
In this embodiment, the neural video encoder may be a pre-trained encoder, and no further training of the neural video encoder is required when training the video processing model.
In this embodiment, a first enhanced sample video frame is input to a neural video encoder, the first enhanced sample video frame is encoded by the neural video encoder to obtain a first encoded sample video frame, and a rate distortion loss corresponding to the first enhanced sample video frame is determined according to the first original sample video frame, the first encoded sample video frame, and the first enhanced sample video frame.
It can be seen that, through this embodiment, the first enhanced sample video frame can be encoded by means of the neural video encoder, so as to obtain a first encoded sample video frame, and according to the first original sample video frame, the first encoded sample video frame, and the first enhanced sample video frame, a rate distortion loss corresponding to the first enhanced sample video frame is determined, so that the enhanced video data maintains a higher video quality with as few code rates as possible.
In one embodiment, determining a rate distortion loss corresponding to a first enhanced sample video frame from a first original sample video frame, a first encoded sample video frame, and a first enhanced sample video frame comprises:
Determining the quality loss size of the encoded enhanced sample video frame according to the first original sample video frame and the first encoded sample video frame, and determining the code rate size of the encoded first enhanced sample video frame;
and determining the rate distortion loss corresponding to the first enhanced sample video frame according to the quality loss size and the code rate size of the encoded enhanced sample video frame.
In this embodiment, the rate distortion loss corresponding to the first enhanced sample video frame may be calculated by the following formula:
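The formula itself is not reproduced in the text; under the symbol definitions given below, and writing the first coded sample video frame as \widehat{GT}_t (a notational choice made here), a standard rate-distortion form would be as follows, where the exact placement of the weighting parameter λ is an assumption:

L_{rd} = \lambda \cdot D\left(GT_t, \widehat{GT}_t\right) + R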
Wherein L_rd represents the rate distortion loss corresponding to the first enhanced sample video frame, R represents the encoded code rate size of the first enhanced sample video frame, GT_t represents the first original sample video frame, the first coded sample video frame is denoted here as \widehat{GT}_t, and D represents the quality loss size between the first original sample video frame and the first coded sample video frame, calculated as an L2 distance, that is, a Euclidean distance; this quality loss size is used as the quality loss size of the enhanced sample video frame after encoding, and λ is a parameter.
Through the above formula, the quality loss size between the first original sample video frame and the first coded sample video frame is calculated and used as the quality loss size of the enhanced sample video frame after encoding, the code rate size of the first enhanced sample video frame after encoding is determined, and the quality loss size and the code rate size are weighted and summed to obtain the rate distortion loss corresponding to the first enhanced sample video frame.
Therefore, according to the embodiment, the rate distortion loss corresponding to the first enhanced sample video frame can be accurately determined and obtained according to the first original sample video frame, the first encoded sample video frame and the first enhanced sample video frame, so that the model training efficiency and the training accuracy are improved.
According to the above procedure, in the present embodiment, when training the video processing model, the total loss function may be represented by the following formula, where γ is a preset parameter:
L_overall = L_rec + γ·L_rd (13)
The neural network structure in fig. 9 is trained through the above loss function, and the neural video encoder in fig. 9 is removed after training is completed, so as to obtain the video processing model in fig. 2.
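As an illustrative aid only, the following sketch shows what one training step implementing this total loss could look like. The model and codec interfaces (`model`, `nvc`), the Charbonnier-style reconstruction term, the placement of λ, and all hyper-parameter values are assumptions, not the disclosed implementation.

```python
import torch

def training_step(model, nvc, optimizer, lq_frames, cu_masks, gt_frame,
                  eps: float = 1e-6, lam: float = 0.01, gamma: float = 1.0):
    """One hypothetical training step computing L_overall = L_rec + gamma * L_rd.

    `model` is the pre-built neural network structure; `nvc` is a frozen,
    differentiable neural video codec assumed to return (encoded_frame,
    rate_in_bits). Both interfaces and all hyper-parameter values are
    assumptions for illustration.
    """
    enhanced = model(lq_frames, cu_masks)                 # first enhanced sample video frame R_t

    # Charbonnier-style reconstruction loss between GT_t and R_t (assumed form).
    l_rec = torch.sqrt(((gt_frame - enhanced) ** 2).sum() + eps ** 2)

    # Rate-distortion loss: encode the enhanced frame with the frozen codec;
    # gradients flow through the codec to the enhancer, the codec is not updated.
    encoded, rate = nvc(enhanced)
    distortion = torch.mean((gt_frame - encoded) ** 2)    # L2 quality loss D
    l_rd = lam * distortion + rate                        # weighted sum of D and R (weighting assumed)

    loss = l_rec + gamma * l_rd                           # total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```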
In this embodiment, by generating the sample coding unit mask image, the model can be guided to realize region-adaptive optimization during training, allocating the increased code rate to the regions that benefit the most from visual quality improvement. In addition, the code rate distribution of the enhancement result can be globally controlled through the neural video encoder, which can maintain the enhancement quality while reducing the bit rate through rate distortion optimization, thereby realizing synchronous optimization of the code rate and the video quality.
In summary, through the above embodiments, an end-to-end video processing model is provided, which can save the code rate of video data while enhancing the quality of the video data, realize quality enhancement and code rate optimization, and reasonably allocate the code rate to the key area to be enhanced, thereby improving the transmission and storage efficiency of the video data.
Fig. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the disclosure, as shown in fig. 10, the apparatus includes:
a data obtaining module 1001, configured to obtain video data, and obtain a coding unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
The feature enhancement module 1002 is configured to perform, by using a video processing model, region adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the encoding unit mask image, to obtain an enhanced first video frame.
Optionally, the data acquisition module 1001 is specifically configured to:
and dividing the coding unit of the first video frame based on a rate distortion optimization algorithm through a coding unit division network in the video processing model to obtain a coding unit mask image corresponding to the first video frame.
Optionally, the feature enhancement module 1002 is specifically configured to:
generating, by the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image, respectively; the enhanced image features are in one-to-one correspondence with the feature propagation directions;
and generating the enhanced first video frame according to each enhanced image characteristic and the first video frame through the video processing model.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Generating, by a first enhancement network in the video processing model, a first enhanced image feature of the first video frame that is region-adaptive in a first propagation direction in which video features propagate from front to back, according to the first video frame, the previous video frame, and the coding unit mask image;
And generating, by a second enhancement network in the video processing model, a region-adaptive second enhanced image feature of the first video frame in a second propagation direction in which video features propagate from back to front, according to the first video frame, the next video frame, and the coding unit mask image.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Predicting, by a first feature prediction sub-network in the first enhancement network, a first initial enhanced image feature of the first video frame in the first propagation direction according to enhanced image features of the first video frame, the previous video frame, and the previous video frame in the first propagation direction;
Performing feature enhancement processing on the first initial enhanced image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced image feature of the first video frame in the first propagation direction, wherein the first final enhanced image feature is used as the first enhanced image feature; the feature enhancement processing includes feature enhancement processing that is adaptive according to regions of the coding unit mask image.
Optionally, the feature enhancement module 1002 is also specifically configured to:
estimating, by a first optical flow feature estimation unit in the first feature prediction sub-network, a first optical flow feature between the first video frame and the previous video frame;
and performing spatial alignment on the first optical flow characteristic and the enhanced image characteristic of the previous video frame in the first propagation direction through a first spatial warping unit in the first characteristic prediction sub-network to obtain a first initial enhanced image characteristic of the first video frame in the first propagation direction.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Performing global feature enhancement processing on the first initial enhanced image feature by a first feature enhancement unit in the first feature enhancement sub-network in a feature extraction mode to obtain an enhanced first initial enhanced image feature;
and performing region self-adaptive feature enhancement processing on the enhanced first initial enhanced image feature by using a first mask attention unit in the first feature enhancement sub-network based on an attention mechanism by using the code unit mask image to obtain a first final enhanced image feature of the first video frame in the first propagation direction.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Predicting, by a second feature prediction sub-network in the second enhancement network, second initial enhanced image features of the first video frame in the second propagation direction according to enhanced image features of the first video frame, the subsequent video frame, and the subsequent video frame in the second propagation direction;
Performing feature enhancement processing on the second initial enhanced image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced image feature of the first video frame in the second propagation direction, wherein the second final enhanced image feature is used as the second enhanced image feature; the feature enhancement processing includes feature enhancement processing that is adaptive according to regions of the coding unit mask image.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Estimating, by a second optical flow feature estimation unit in the second feature prediction sub-network, a second optical flow feature between the first video frame and the subsequent video frame;
And performing spatial alignment on the second optical flow characteristic and the enhanced image characteristic of the subsequent video frame in the second propagation direction through a second spatial warping unit in the second characteristic prediction sub-network to obtain a second initial enhanced image characteristic of the first video frame in the second propagation direction.
Optionally, the feature enhancement module 1002 is also specifically configured to:
performing global feature enhancement processing on the second initial enhanced image feature by a second feature enhancement unit in the second feature enhancement sub-network in a feature extraction mode to obtain an enhanced second initial enhanced image feature;
And performing region-adaptive feature enhancement processing on the enhanced second initial enhanced image feature by using a second mask attention unit in the second feature enhancer network based on an attention mechanism by using the code unit mask image to obtain a second final enhanced image feature of the first video frame in the second propagation direction.
Optionally, the feature enhancement module 1002 is also specifically configured to:
Feature fusion is carried out on each enhanced image feature through a feature fusion network in the video processing model, so that fused image features are obtained;
and carrying out residual connection on the fused image characteristics and the first video frame through the video processing model to obtain the enhanced first video frame.
Optionally, the method further comprises:
A sample processing unit for acquiring a first degraded sample video frame; performing region self-adaptive feature enhancement processing on the first degraded sample video frame through a pre-built neural network structure to obtain a first enhanced sample video frame;
the model training unit is used for training the neural network structure based on the rate distortion loss corresponding to the first enhanced sample video frame; the trained neural network structure is the video processing model.
Optionally, the sample processing unit is specifically configured to:
Acquiring original sample video data, and performing characteristic degradation treatment on the original sample video data to obtain degraded sample video data; the raw sample video data comprises a first raw sample video frame; the degraded sample video data comprises the first degraded sample video frame corresponding to the first original sample video frame;
The model training unit is specifically used for:
And training the neural network structure by using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame.
Optionally, the sample processing unit is further specifically configured to:
Dividing an original sample video frame in the original sample video data into coding units;
Determining the fuzzy strength of each coding unit of the original sample video frame according to the size of each coding unit of the original sample video frame obtained by dividing;
And carrying out degradation treatment on the original sample video frame according to the fuzzy strength of each coding unit of the original sample video frame to obtain degraded sample video data.
Optionally, the sample processing unit is further specifically configured to:
acquiring a sample coding unit mask image corresponding to the first degraded sample video frame; the sample coding unit mask image is used for representing a coding unit division result of the first degraded video frame;
And performing region self-adaptive feature enhancement processing on the first degraded sample video frame according to the first degraded sample video frame, a previous degraded sample video frame of the first degraded sample video frame in the degraded sample video data, a next degraded sample video frame of the first degraded sample video frame in the degraded sample video data and the sample coding unit mask image to obtain the first enhanced sample video frame.
Optionally, the sample processing unit is further specifically configured to:
Generating, by the neural network structure, region-adaptive enhanced sample image features of the first degraded sample video frame in a plurality of feature propagation directions according to the first degraded sample video frame, the previous degraded sample video frame, the next degraded sample video frame, and the sample coding unit mask image, respectively; the enhanced sample image features are in one-to-one correspondence with the feature propagation directions;
And generating the first enhanced sample video frame according to each enhanced sample image characteristic and the first degraded sample video frame through the neural network structure.
Optionally, the method further comprises a loss determination unit for:
Encoding the first enhanced sample video frame through a neural video encoder to obtain a first encoded sample video frame;
and determining a rate distortion loss corresponding to the first enhanced sample video frame according to the first original sample video frame, the first coded sample video frame and the first enhanced sample video frame.
Optionally, the loss determination unit is specifically configured to:
Determining the quality loss size of the encoded enhanced sample video frame according to the first original sample video frame and the first encoded sample video frame, and determining the code rate size of the encoded first enhanced sample video frame;
And determining the rate distortion loss corresponding to the first enhanced sample video frame according to the quality loss size and the code rate size of the encoded enhanced sample video frame.
The video processing device in the embodiment of the present disclosure may implement the respective processes of the embodiment of the video processing method described above, and achieve the same effects and functions, which are not repeated here.
An embodiment of the present disclosure further provides an electronic device, and fig. 11 is a schematic structural diagram of the electronic device provided in an embodiment of the present disclosure, as shown in fig. 11, where the electronic device may have a relatively large difference due to different configurations or performances, and may include one or more processors 1101 and a memory 1102, where the memory 1102 may store one or more application programs or data. Wherein the memory 1102 may be transient storage or persistent storage. The application programs stored in the memory 1102 may include one or more modules (not shown), each of which may include a series of computer-executable instructions in the electronic device. Still further, the processor 1101 may be arranged to communicate with the memory 1102 and execute a series of computer executable instructions in the memory 1102 on an electronic device. The electronic device can also include one or more power supplies 1103, one or more wired or wireless network interfaces 1104, one or more input or output interfaces 1105, one or more keyboards 1106, and the like.
In a specific embodiment, an electronic device includes a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to implement the following:
Acquiring video data and a code unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
And performing region self-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data and the encoding unit mask image by using a video processing model to obtain the enhanced first video frame.
The electronic device in the embodiment of the present disclosure may implement the processes of the embodiment of the video processing method described above, and achieve the same effects and functions, which are not repeated here.
Another embodiment of the present disclosure also provides a computer-readable storage medium for storing computer-executable instructions that, when executed by a processor, implement the following flow:
Acquiring video data and a coding unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
And performing, by a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data and the coding unit mask image, to obtain the enhanced first video frame.
The storage medium in the embodiments of the present disclosure may implement the respective processes of the embodiments of the video processing method described above, and achieve the same effects and functions, which are not repeated here.
Another embodiment of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the following flow:
Acquiring video data and a coding unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
And performing, by a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data and the coding unit mask image, to obtain the enhanced first video frame.
The computer program product in the embodiments of the present disclosure may implement the respective processes of the embodiments of the video processing method described above and achieve the same effects and functions, which are not repeated here.
In various embodiments of the present disclosure, the computer-readable storage medium includes a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has developed, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development: the source code to be compiled is written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can readily be obtained merely by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for implementing various functions may also be regarded as structures within the hardware component. Or even the means for implementing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing the embodiments of the present disclosure, the functions of the various units may be implemented in one or more pieces of software and/or hardware.
One skilled in the art will appreciate that one or more embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present disclosure may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the disclosure may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The various embodiments in the present disclosure are described in a progressive manner; identical or similar parts of the various embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments, and reference may be made to the relevant parts of the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present disclosure, are intended to be included within the scope of the claims of the present disclosure.

Claims (22)

1. A video processing method, comprising:
Acquiring video data and a coding unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
And performing, by a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data and the coding unit mask image, to obtain the enhanced first video frame.
2. The method of claim 1, wherein the acquiring the coding unit mask image corresponding to the first video frame in the video data comprises:
and performing coding unit division on the first video frame based on a rate distortion optimization algorithm through a coding unit division network in the video processing model to obtain the coding unit mask image corresponding to the first video frame.
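For illustration, one simple way to materialize a coding unit division result as a mask image is to fill each coding unit's region with a value derived from its size; the (x, y, width, height) block format and the 64×64 maximum coding unit size below are assumptions, not details from the claim.

```python
import numpy as np

def build_cu_mask(frame_height, frame_width, coding_units, max_cu_size=64):
    """Build a single-channel coding unit mask image from a division result.

    `coding_units` is assumed to be a list of (x, y, width, height) blocks
    produced by rate-distortion-optimized partitioning; each block's region is
    filled with its normalized size so that regions coded with large units and
    regions coded with small, detail-heavy units are distinguishable."""
    mask = np.zeros((frame_height, frame_width), dtype=np.float32)
    for x, y, w, h in coding_units:
        mask[y:y + h, x:x + w] = (w * h) / float(max_cu_size * max_cu_size)
    return mask
```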
3. The method according to claim 1, wherein the performing, by a video processing model, region-adaptive feature enhancement processing on the first video frame according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data, and the coding unit mask image to obtain the enhanced first video frame includes:
generating, by the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions according to the first video frame, the previous video frame, the next video frame, and the coding unit mask image, respectively; the enhanced image features are in one-to-one correspondence with the feature propagation directions;
and generating the enhanced first video frame according to each enhanced image feature and the first video frame through the video processing model.
4. A method according to claim 3, wherein generating, by the video processing model, region-adaptive enhanced image features of the first video frame in a plurality of feature propagation directions from the first video frame, the previous video frame, the next video frame, and the coding unit mask image, respectively, comprises:
Generating, by a first enhancement network in the video processing model, a first enhanced image feature of the first video frame that is region-adaptive in a first propagation direction in which video features propagate from front to back, according to the first video frame, the previous video frame, and the coding unit mask image;
And generating, by a second enhancement network in the video processing model, a second enhanced image feature of the first video frame that is region-adaptive in a second propagation direction in which video features propagate from back to front, according to the first video frame, the next video frame and the coding unit mask image.
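The skeleton below sketches, under assumptions, how such bidirectional propagation could be driven over a clip: the first and second enhancement networks are stand-in callables, the first feature in each direction starts from no accumulated feature, and boundary frames simply reuse themselves as neighbours.

```python
def propagate_features(frames, cu_masks, forward_net, backward_net):
    """Sketch of bidirectional, region-adaptive feature propagation.

    forward_net / backward_net stand in for the first and second enhancement
    networks; each is assumed to map (frame, neighbour_frame, neighbour_feat,
    cu_mask) to an enhanced image feature for its propagation direction,
    accepting neighbour_feat=None at the start of the pass."""
    n = len(frames)
    forward_feats, backward_feats = [None] * n, [None] * n

    prev_feat = None
    for i in range(n):  # first propagation direction: front to back
        neighbour = frames[i - 1] if i > 0 else frames[i]
        prev_feat = forward_net(frames[i], neighbour, prev_feat, cu_masks[i])
        forward_feats[i] = prev_feat

    next_feat = None
    for i in reversed(range(n)):  # second propagation direction: back to front
        neighbour = frames[i + 1] if i < n - 1 else frames[i]
        next_feat = backward_net(frames[i], neighbour, next_feat, cu_masks[i])
        backward_feats[i] = next_feat

    return forward_feats, backward_feats
```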
5. The method of claim 4, wherein generating, by the first enhancement network in the video processing model, a first enhanced image feature of the first video frame that is region-adaptive in a first propagation direction in which video features propagate from front to back based on the first video frame, the previous video frame, and the coding unit mask image, comprises:
Predicting, by a first feature prediction sub-network in the first enhancement network, a first initial enhanced image feature of the first video frame in the first propagation direction according to the first video frame, the previous video frame, and the enhanced image feature of the previous video frame in the first propagation direction;
Performing feature enhancement processing on the first initial enhanced image feature through a first feature enhancement sub-network in the first enhancement network to obtain a first final enhanced image feature of the first video frame in the first propagation direction, wherein the first final enhanced image feature is used as the first enhanced image feature; the feature enhancement processing includes feature enhancement processing that is adaptive according to regions of the coding unit mask image.
6. The method of claim 5, wherein predicting, by a first feature prediction sub-network in the first enhancement network, a first initial enhanced image feature of the first video frame in the first propagation direction according to the first video frame, the previous video frame, and the enhanced image feature of the previous video frame in the first propagation direction, comprises:
estimating, by a first optical flow feature estimation unit in the first feature prediction sub-network, a first optical flow feature between the first video frame and the previous video frame;
and performing spatial alignment on the first optical flow characteristic and the enhanced image characteristic of the previous video frame in the first propagation direction through a first spatial warping unit in the first characteristic prediction sub-network to obtain a first initial enhanced image characteristic of the first video frame in the first propagation direction.
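A minimal sketch of this spatial-warping step is given below, assuming a dense optical flow field in pixel units and bilinear resampling via grid_sample; the flow sign convention and border handling are simplifying assumptions, and the optical flow itself would come from the optical flow feature estimation unit.

```python
import torch
import torch.nn.functional as F

def warp_feature(feature, flow):
    """Spatially align `feature` (N, C, H, W) of the previous frame using the
    optical flow (N, 2, H, W) estimated between the first video frame and the
    previous video frame.  Minimal grid_sample-based warp."""
    n, _, h, w = feature.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feature.device, dtype=feature.dtype),
        torch.arange(w, device=feature.device, dtype=feature.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]          # displaced y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (N, H, W, 2)
    return F.grid_sample(feature, grid, align_corners=True)
```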
7. The method of claim 5, wherein said performing feature enhancement processing on said first initial enhanced image feature through a first feature enhancement sub-network in said first enhancement network to obtain a first final enhanced image feature of said first video frame in said first propagation direction, comprises:
Performing global feature enhancement processing on the first initial enhanced image feature by a first feature enhancement unit in the first feature enhancement sub-network in a feature extraction mode to obtain an enhanced first initial enhanced image feature;
and performing, by a first mask attention unit in the first feature enhancement sub-network, region-adaptive feature enhancement processing on the enhanced first initial enhanced image feature based on an attention mechanism using the coding unit mask image, to obtain a first final enhanced image feature of the first video frame in the first propagation direction.
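The module below is a simplified, illustrative take on such a mask attention unit: spatial attention weights are predicted jointly from the feature map and the coding unit mask image and applied as a residual modulation, so regions with different coding unit sizes receive different enhancement strengths. It is a sketch of the idea, not the claimed network structure.

```python
import torch
import torch.nn as nn

class MaskAttentionUnit(nn.Module):
    """Illustrative mask attention unit driven by the coding unit mask image."""

    def __init__(self, channels):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feature, cu_mask):
        # cu_mask: (N, 1, H, W), same spatial size as the feature map.
        weights = self.attention(torch.cat([feature, cu_mask], dim=1))
        return feature + feature * weights  # region-adaptive residual enhancement
```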
8. The method of claim 4, wherein generating, by the second enhancement network in the video processing model, a second enhanced image feature of the first video frame that is region-adaptive in a second propagation direction in which video features propagate from back to front based on the first video frame, the next video frame, and the coding unit mask image, comprises:
Predicting, by a second feature prediction sub-network in the second enhancement network, a second initial enhanced image feature of the first video frame in the second propagation direction according to the first video frame, the next video frame, and the enhanced image feature of the next video frame in the second propagation direction;
Performing feature enhancement processing on the second initial enhanced image feature through a second feature enhancement sub-network in the second enhancement network to obtain a second final enhanced image feature of the first video frame in the second propagation direction, wherein the second final enhanced image feature is used as the second enhanced image feature; the feature enhancement processing includes feature enhancement processing that is adaptive according to regions of the coding unit mask image.
9. The method of claim 8, wherein predicting, by the second feature prediction sub-network in the second enhancement network, a second initial enhanced image feature of the first video frame in the second propagation direction according to the first video frame, the next video frame, and the enhanced image feature of the next video frame in the second propagation direction comprises:
Estimating, by a second optical flow feature estimation unit in the second feature prediction sub-network, a second optical flow feature between the first video frame and the next video frame;
And performing spatial alignment on the second optical flow feature and the enhanced image feature of the next video frame in the second propagation direction through a second spatial warping unit in the second feature prediction sub-network to obtain a second initial enhanced image feature of the first video frame in the second propagation direction.
10. The method of claim 8, wherein said performing feature enhancement processing on said second initial enhanced image feature via a second feature enhancement sub-network in said second enhancement network to obtain a second final enhanced image feature of said first video frame in said second propagation direction, comprises:
performing global feature enhancement processing on the second initial enhanced image feature by a second feature enhancement unit in the second feature enhancement sub-network in a feature extraction mode to obtain an enhanced second initial enhanced image feature;
And performing, by a second mask attention unit in the second feature enhancement sub-network, region-adaptive feature enhancement processing on the enhanced second initial enhanced image feature based on an attention mechanism using the coding unit mask image, to obtain a second final enhanced image feature of the first video frame in the second propagation direction.
11. A method according to claim 3, wherein said generating, by said video processing model, said enhanced first video frame from each of said enhanced image features and said first video frame comprises:
performing feature fusion on each enhanced image feature through a feature fusion network in the video processing model to obtain fused image features;
and performing residual connection on the fused image features and the first video frame through the video processing model to obtain the enhanced first video frame.
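As an illustrative sketch of this fusion-plus-residual step, the module below concatenates the two directional enhanced image features, fuses them with convolutions, projects back to image space, and adds the result to the first video frame; the layer layout and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of feature fusion followed by a residual connection to the frame."""

    def __init__(self, feat_channels, img_channels=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, img_channels, kernel_size=3, padding=1),
        )

    def forward(self, forward_feat, backward_feat, frame):
        # Fuse the enhanced image features from both propagation directions.
        fused = self.fuse(torch.cat([forward_feat, backward_feat], dim=1))
        return frame + fused  # residual connection yields the enhanced frame
```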
12. The method according to claim 1, wherein the method further comprises:
acquiring a first degraded sample video frame; performing region-adaptive feature enhancement processing on the first degraded sample video frame through a pre-built neural network structure to obtain a first enhanced sample video frame;
Training the neural network structure based on a rate distortion loss corresponding to the first enhanced sample video frame; the trained neural network structure is the video processing model.
13. The method of claim 12, wherein:
The acquiring a first degraded sample video frame includes:
Acquiring original sample video data, and performing feature degradation processing on the original sample video data to obtain degraded sample video data; the original sample video data comprises a first original sample video frame; the degraded sample video data comprises the first degraded sample video frame corresponding to the first original sample video frame;
The training the neural network structure based on the rate distortion loss corresponding to the first enhanced sample video frame includes:
And training the neural network structure by using the reconstruction loss between the first original sample video frame and the first enhanced sample video frame and the rate distortion loss corresponding to the first enhanced sample video frame.
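A minimal training-objective sketch consistent with this claim is shown below: an L1 reconstruction term between the original and enhanced sample frames plus a rate distortion term following the same R + λ·D convention as the earlier sketch; the weights alpha, beta and lmbda are illustrative assumptions.

```python
import torch.nn.functional as F

def training_loss(original_frame, enhanced_frame, encoded_frame,
                  bits_per_pixel, alpha=1.0, beta=1.0, lmbda=0.01):
    """Combined objective for training the neural network structure:
    reconstruction loss between the first original sample video frame and the
    first enhanced sample video frame, plus the rate distortion loss of
    encoding the enhanced frame with the neural video encoder."""
    reconstruction = F.l1_loss(enhanced_frame, original_frame)
    rate_distortion = bits_per_pixel + lmbda * F.mse_loss(encoded_frame, original_frame)
    return alpha * reconstruction + beta * rate_distortion
```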
14. The method of claim 13, wherein performing feature degradation processing on the original sample video data to obtain degraded sample video data comprises:
Dividing an original sample video frame in the original sample video data into coding units;
Determining a blur strength of each coding unit of the original sample video frame according to the size of each coding unit of the original sample video frame obtained by the dividing;
And performing degradation processing on the original sample video frame according to the blur strength of each coding unit of the original sample video frame to obtain the degraded sample video data.
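For illustration, the sketch below blurs each coding unit with a Gaussian whose sigma depends on the coding unit size; the claim does not specify whether larger units should be blurred more or less, so the size-to-sigma mapping, the (x, y, width, height) block format, and the (H, W, 3) frame layout are all assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade_frame(frame, coding_units, max_cu_size=64):
    """Blur each coding unit of an original sample video frame (H, W, 3) with a
    strength derived from its size, as one possible feature degradation."""
    degraded = frame.astype(np.float32).copy()
    for x, y, w, h in coding_units:
        # Size-dependent blur strength (illustrative mapping only).
        sigma = 2.0 * max(w, h) / float(max_cu_size)
        block = degraded[y:y + h, x:x + w]
        # Smooth spatially, leave the channel axis untouched.
        degraded[y:y + h, x:x + w] = gaussian_filter(block, sigma=(sigma, sigma, 0))
    return degraded
```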
15. The method according to claim 13, wherein the performing, by using a pre-built neural network structure, the region-adaptive feature enhancement processing on the first degraded sample video frame to obtain a first enhanced sample video frame includes:
acquiring a sample coding unit mask image corresponding to the first degraded sample video frame; the sample coding unit mask image is used for representing a coding unit division result of the first degraded sample video frame;
And performing region-adaptive feature enhancement processing on the first degraded sample video frame according to the first degraded sample video frame, a previous degraded sample video frame of the first degraded sample video frame in the degraded sample video data, a next degraded sample video frame of the first degraded sample video frame in the degraded sample video data and the sample coding unit mask image to obtain the first enhanced sample video frame.
16. The method according to claim 15, wherein the performing, by the neural network structure, the region-adaptive feature enhancement processing on the first degraded sample video frame according to the first degraded sample video frame, a previous degraded sample video frame to the first degraded sample video frame in the degraded sample video data, a next degraded sample video frame to the first degraded sample video frame in the degraded sample video data, and the sample coding unit mask image, to obtain the first enhanced sample video frame includes:
Generating, by the neural network structure, region-adaptive enhanced sample image features of the first degraded sample video frame in a plurality of feature propagation directions according to the first degraded sample video frame, the previous degraded sample video frame, the next degraded sample video frame, and the sample coding unit mask image, respectively; the enhanced sample image features are in one-to-one correspondence with the feature propagation directions;
And generating the first enhanced sample video frame according to each enhanced sample image feature and the first degraded sample video frame through the neural network structure.
17. The method of claim 13, wherein the method further comprises:
Encoding the first enhanced sample video frame through a neural video encoder to obtain a first encoded sample video frame;
and determining a rate distortion loss corresponding to the first enhanced sample video frame according to the first original sample video frame, the first encoded sample video frame and the first enhanced sample video frame.
18. The method of claim 17, wherein determining a rate distortion loss corresponding to the first enhanced sample video frame from the first original sample video frame, the first encoded sample video frame, and the first enhanced sample video frame comprises:
Determining a quality loss of the encoded first enhanced sample video frame according to the first original sample video frame and the first encoded sample video frame, and determining a code rate of the encoded first enhanced sample video frame;
And determining the rate distortion loss corresponding to the first enhanced sample video frame according to the quality loss and the code rate.
19. A video processing apparatus, comprising:
The data acquisition module is used for acquiring video data and acquiring a coding unit mask image corresponding to a first video frame in the video data; the coding unit mask image is used for representing a coding unit division result of the first video frame;
And the feature enhancement module is used for performing region-adaptive feature enhancement processing on the first video frame through a video processing model according to the first video frame, a previous video frame of the first video frame in the video data, a next video frame of the first video frame in the video data and the coding unit mask image, to obtain the enhanced first video frame.
20. An electronic device, comprising:
A processor; and
A memory configured to store computer-executable instructions that, when executed, cause the processor to perform the steps of the method of any of the preceding claims 1-18.
21. A computer readable storage medium for storing computer executable instructions which when executed by a processor implement the steps of the method of any of the preceding claims 1-18.
22. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the steps of the method of any of the preceding claims 1-18.
CN202410296014.6A 2024-03-14 2024-03-14 Video processing method, device and related products Pending CN117998090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410296014.6A CN117998090A (en) 2024-03-14 2024-03-14 Video processing method, device and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410296014.6A CN117998090A (en) 2024-03-14 2024-03-14 Video processing method, device and related products

Publications (1)

Publication Number Publication Date
CN117998090A true CN117998090A (en) 2024-05-07

Family

ID=90893247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410296014.6A Pending CN117998090A (en) 2024-03-14 2024-03-14 Video processing method, device and related products

Country Status (1)

Country Link
CN (1) CN117998090A (en)

Similar Documents

Publication Publication Date Title
Hu et al. Learning end-to-end lossy image compression: A benchmark
KR101728088B1 (en) Contrast enhancement
CN111986278B (en) Image encoding device, probability model generating device, and image compression system
CN114600457B (en) Video encoding and decoding method, device and readable medium
JP7143529B2 (en) IMAGE RESTORATION METHOD AND DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
CN112102212B (en) Video restoration method, device, equipment and storage medium
JP2009509418A (en) Classification filtering for temporal prediction
CN114339219A (en) Inter-frame prediction method and device, encoding and decoding method, encoder and decoder and electronic equipment
CN115409716B (en) Video processing method, device, storage medium and equipment
CN116205820A (en) Image enhancement method, target identification method, device and medium
CN110689478B (en) Image stylization processing method and device, electronic equipment and readable medium
CN116980611A (en) Image compression method, apparatus, device, computer program product, and medium
CN117726542B (en) Controllable noise removing method and system based on diffusion model
CN115868161A (en) Reinforcement learning based rate control
CN111861940A (en) Image toning enhancement method based on condition continuous adjustment
CN117998090A (en) Video processing method, device and related products
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN111343465A (en) Electronic circuit and electronic device
CN111901595B (en) Video coding method, device and medium based on deep neural network
CN115829827A (en) Face image processing method, device, equipment and medium
CN114697709B (en) Video transmission method and device
CN113438488B (en) Low-bit-rate video optimization coding method, device, equipment and storage medium
Mietens et al. New DCT computation technique based on scalable resources
CN113965756A (en) Image coding method, storage medium and terminal equipment
CN118660183A (en) Method and device for generating video enhancement model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination