CN111726623A - Method for improving reconstruction quality of spatial scalable coding video in packet loss network - Google Patents

Method for improving reconstruction quality of spatial scalable coding video in packet loss network

Info

Publication number
CN111726623A
Authority
CN
China
Prior art keywords
image
resolution
network
preliminary
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010456887.0A
Other languages
Chinese (zh)
Other versions
CN111726623B (en)
Inventor
宋利
虞盛炜
解蓉
张文军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010456887.0A
Publication of CN111726623A
Application granted
Publication of CN111726623B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/33 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/59 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a method for improving the reconstruction quality of spatially scalable coded video in a packet loss network. The method comprises the following steps: obtaining a preliminary high-resolution image, and extracting the image features of the preliminary high-resolution image through a CNN network; extracting the motion features between the preliminary high-resolution image of the current frame and each of the high-resolution images decoded from the previous frames, and fusing these motion features with the preliminary image features to obtain fused features; and concatenating all the fused features to restore the high-resolution image of the current frame. By exploiting the characteristics of spatially scalable coded video, the invention makes full use of the low-resolution image information of the current frame and the high-resolution image information already available from the previous frames, and can therefore recover a high-quality high-resolution image at positions where the enhancement layer is lost.

Description

Method for improving reconstruction quality of spatial scalable coding video in packet loss network
Technical Field
The invention relates to the field of video reconstruction quality optimization, in particular to reconstruction quality optimization for spatially scalable coded video, and more particularly to a method for improving the reconstruction quality of spatially scalable coded video in a packet loss network.
Background
Video accounts for an ever-increasing share of total Internet traffic, and how to transmit video content over the network more effectively has become a research focus. Compared with conventionally coded video, scalable coded video adapts better to fluctuations in network bandwidth and better tolerates network packet loss during transmission, because even if part of the enhancement-layer bitstream is lost, the decoder can still obtain video of basic quality by decoding the base-layer bitstream.
Scalable coding includes spatial, temporal, and quality (SNR) scalability, among others. Taking the most common case, spatial scalable coding, as an example: the encoded bitstream consists of a base-layer bitstream and one or more enhancement-layer bitstreams. The base-layer bitstream alone decodes to the lowest-resolution video, and combining it with successive enhancement-layer bitstreams yields progressively higher resolutions. Owing to coding-complexity constraints, a two-layer structure (one base layer plus one enhancement layer) is often adopted in practice. During transmission, the base-layer content is usually given stronger protection, such as forward error correction or retransmission on packet loss, while the enhancement layer is protected more weakly.
For spatially scalable coding, if the enhancement layer of a frame is lost, that frame can only be decoded from the base-layer bitstream into a low-resolution image, which must be super-resolved to the enhancement-layer resolution before playback. Traditional super-resolution algorithms perform poorly and cannot accurately restore picture details, so the sudden quality drop at such a frame produces severe visual artifacts during playback. Neural-network-based image and video super-resolution algorithms outperform traditional methods; video super-resolution in particular exploits information from neighboring frames and recovers richer detail. However, none of these super-resolution algorithms is designed for scalable coded video, and they fail to make full use of the already decoded high-resolution images of the preceding frames. Moreover, some video super-resolution algorithms require information from subsequent frames to improve quality, which introduces additional delay.
Disclosure of Invention
Aiming at the problem that existing super-resolution algorithms cannot fully exploit the information in scalable coded video, the invention provides a method for improving the reconstruction quality of spatially scalable coded video in a packet loss network.
To achieve this purpose, the invention adopts the following technical scheme.
The invention provides a method for improving the reconstruction quality of spatially scalable coded video in a packet loss network, which comprises the following steps:
S1, acquiring a preliminary high-resolution image of the current frame;
S2, extracting preliminary image features from the preliminary high-resolution image through a CNN network;
S3, using a CNN network as a recurrent network, extracting the motion features between the preliminary high-resolution image of the current frame and each of the high-resolution images decoded from the previous frames, and fusing the motion features with the preliminary image features to obtain fused features;
S4, concatenating all the fused features output by the recurrent network and restoring the high-resolution image of the current frame through a CNN network.
In the above S1, when a spatially scalable coded video stream is transmitted over the network, the enhancement layer is allowed to be lost in order to adapt to fluctuations in network bandwidth, and the frames at these positions can only be decoded into low-resolution images. Therefore, when the spatially scalable coded video is decoded, if the enhancement layer of the current frame is lost and only a low-resolution image can be decoded, a preliminary high-resolution image is obtained through a neural-network-based image super-resolution algorithm.
In S3, the motion features are image features containing the motion information between two frames; they are acquired implicitly by concatenating the two frames and passing them through a CNN network.
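As a non-limiting illustration, a minimal PyTorch sketch of this implicit motion-feature extraction is given below; the two-layer structure and the 64-channel width are assumptions for illustration and are not specified by the invention:

```python
import torch
import torch.nn as nn

class MotionFeatureNet(nn.Module):
    """Implicitly extracts motion features from two concatenated RGB frames."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, channels, kernel_size=3, padding=1),  # 2 frames x 3 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, cur_hr: torch.Tensor, prev_hr: torch.Tensor) -> torch.Tensor:
        # Concatenate the current preliminary HR frame with a previously decoded
        # HR frame along the channel axis; the CNN learns motion cues implicitly.
        return self.net(torch.cat([cur_hr, prev_hr], dim=1))
```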
The present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to execute the above-mentioned method for improving the reconstruction quality of a spatially scalable coded video in a packet loss network when executing the program.
Compared with the prior art, embodiments of the invention have the following beneficial effect:
The invention can recover the current frame at higher quality when the enhancement layer is lost. By combining a neural-network-based super-resolution algorithm with the characteristics of spatially scalable coded video, it makes full use of the low-resolution image information of the current frame and the high-resolution image information of the previous frames, and can therefore recover a high-quality high-resolution image at positions where the enhancement layer is lost.
Specifically, on the video super-resolution test set Vid4, the average PSNR of the current frames recovered by the embodiment of the invention (32.95 dB) is 5.61 dB higher than that of EDVR (27.34 dB), an advanced video super-resolution algorithm, and the average SSIM (0.952) is 0.126 higher than that of EDVR (0.826).
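For reference, PSNR over 8-bit images can be computed as in the sketch below; this is the generic definition of the metric, not code from the patent (SSIM is more involved and is typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref: np.ndarray, dist: np.ndarray) -> float:
    """PSNR in dB between two uint8 images of identical shape."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```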
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a flow chart of a process of a method according to an embodiment of the invention;
FIG. 2 is a block diagram of a network model architecture in an embodiment of the present invention;
FIG. 3 is a block diagram of a feature fusion module according to an embodiment of the present invention;
FIG. 4 is a visual-quality comparison, at a scaling factor of 4x, between an embodiment of the invention and other mainstream super-resolution schemes on the City video sequence of the Vid4 test data set;
FIG. 5 is a comparison of reconstructed video quality, at a scaling factor of 4x, between an embodiment of the invention and the Bicubic scheme under random and continuous packet-loss scenarios.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
When recovering the high-resolution frame at a position where the enhancement layer is lost, the invention makes full use of the current low-resolution frame and the previously decoded high-resolution frames, takes into account that subsequent frames are not yet available at decoding time in practical applications, and uses multiple CNN networks to achieve better performance than existing super-resolution methods.
The following embodiment illustrates how the technical solution of the invention is implemented; the overall flow is shown in FIG. 1:
First, a preliminary high-resolution image of the current frame is obtained using an existing high-performance image super-resolution algorithm.
Second, the preliminary image features of the current frame are extracted using the first CNN network, which captures the internal features of the current frame.
Third, the previous high-resolution frame and the preliminary high-resolution image are concatenated, motion features are extracted through the second CNN network, and these features are fused with the preliminary image features by a feature fusion module to obtain new preliminary image features. The second CNN network serves as the recurrent network and comprises the motion feature extraction and fusion networks.
Fourth, the previous high-resolution frame is replaced by the next earlier high-resolution frame, and the previous step is repeated a specified number of times.
Fifth, a high-quality high-resolution image of the current frame is restored from all the obtained image features through the third CNN network; the third network is a single-layer network used for final image restoration.
Specifically, in order to better understand the implementation of the above steps, they are illustrated below with reference to a specific embodiment, to which the invention is not limited.
1. Acquiring a preliminary high-resolution image:
Once the enhancement-layer bitstream of the spatially scalable coding is lost, the current frame can only be decoded from the base layer into a low-resolution picture $L_t$, which must be super-resolved to the enhancement-layer resolution before rendering and playback.
In an embodiment, this step may use an existing image super-resolution technique; as shown in FIG. 2, the ESPCN image super-resolution algorithm is used in the embodiment. The image obtained by super-resolution is recorded as the preliminary high-resolution image $H_t$, i.e.
$$H_t = \mathrm{Net}_{SISR}(L_t; \theta_{SISR}),$$
where $\theta_{SISR}$ denotes the parameters of the super-resolution network model.
In this step, the preliminary high-resolution image must be obtained by a neural-network-based image super-resolution algorithm, because the motion features are extracted by a CNN network after concatenating this frame with a previous high-resolution frame, so the two must have the same resolution. Traditional super-resolution algorithms perform poorly and would not yield accurate motion features.
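As an illustration, a minimal ESPCN-style network can be sketched in PyTorch as follows; the layer widths and kernel sizes follow the original ESPCN design, but the exact model used by the embodiment is not specified beyond the algorithm name, so treat the details as assumptions:

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """Sub-pixel convolution super-resolution network, here with 4x upscaling."""
    def __init__(self, scale: int = 4, in_channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(32, in_channels * scale ** 2, kernel_size=3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(scale)  # rearranges channels into pixels

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        # H_t = Net_SISR(L_t; theta_SISR): low-resolution frame in,
        # preliminary high-resolution frame out.
        return self.shuffle(self.body(lr))
```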
2. Obtaining the preliminary image features of the current frame
The preliminary image features of the preliminary high-resolution image $H_t$ are extracted using a CNN network and recorded as $F_t^0$, i.e.
$$F_t^0 = \mathrm{Net}_S(H_t; \theta_S),$$
where $\theta_S$ denotes the parameters of the image feature extraction network model.
3. Extracting motion features and performing feature fusion
The motion features are image features containing the motion information between two frames; they can be acquired implicitly by concatenating the two frames and passing them through a CNN network. The motion features between the current frame and the previous frame are
$$M_{t,t-1} = \mathrm{Net}_M([H_t, H_{t-1}]; \theta_M),$$
where $\theta_M$ denotes the parameters of the motion feature extraction network model.
The motion features are fused with the features of the current frame by a feature fusion module. FIG. 3 shows the feature fusion module of the embodiment, which consists mainly of two ResBlocks and adopts a residual structure overall. Specifically, the difference between the preliminary image features and the motion features is first computed to obtain a feature residual; this residual is refined by one ResBlock and added to the preliminary image features to obtain refined image features; finally, the other ResBlock refines the image features again, producing new preliminary image features that are fed into the next stage of the recurrent network.
After the feature fusion module, the new preliminary image features, i.e. the fused features, are
$$F_t^1 = \mathrm{Net}_F(F_t^0, M_{t,t-1}; \theta_F),$$
where $\theta_F$ denotes the network parameters of the feature fusion module.
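A minimal PyTorch sketch of this fusion module, following the difference / refine / add structure of FIG. 3 described above (the 64-channel width and the ResBlock internals are illustrative assumptions):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 conv - ReLU - 3x3 conv with an identity skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureFusion(nn.Module):
    """Fuses preliminary image features with motion features (cf. FIG. 3)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.refine1 = ResBlock(channels)
        self.refine2 = ResBlock(channels)

    def forward(self, img_feat, motion_feat):
        residual = self.refine1(img_feat - motion_feat)  # refine the feature residual
        return self.refine2(img_feat + residual)         # new preliminary features
```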
In order to make full use of the image information of the previous frame, a single-layer CNN network is also used to restore the new preliminary image features produced by the feature fusion module into the high-resolution frame $H_{t-1}$.
The motion features obtained in this step cannot directly restore a high-quality current frame; instead, they are passed, together with the preliminary image features obtained in the previous step, through the feature fusion module to obtain the fused image features.
4. Recurring over earlier high-resolution frames
The previous-frame input $H_{t-1}$ of the CNN network in step 3 is successively replaced by $H_{t-2}, \ldots, H_{t-n+1}$, and the preliminary image features $F_t^0$ are replaced by the latest fused features $F_t^1, F_t^2, \ldots$; the features that successively fuse earlier high-resolution frame information,
$$F_t^1, F_t^2, \ldots, F_t^{n-1},$$
are thus obtained in turn.
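This recurrence can be sketched as a plain loop over the previously decoded high-resolution frames, reusing modules of the kind sketched above; motion_net, fusion, and the choice of n are assumptions for illustration:

```python
def fuse_history(motion_net, fusion, img_feat, prelim_hr, prev_hr_frames):
    """Fold previously decoded HR frames H_{t-1}, ..., H_{t-n+1} into the features."""
    fused = [img_feat]                           # F_t^0
    feat = img_feat
    for prev_hr in prev_hr_frames:               # newest previous frame first
        motion = motion_net(prelim_hr, prev_hr)  # M_{t,t-k}
        feat = fusion(feat, motion)              # F_t^1, F_t^2, ...
        fused.append(feat)
    return fused                                 # [F_t^0, ..., F_t^{n-1}]
```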
5. Restoring the high-quality high-resolution frame
All the image features obtained in step 4,
$$F_t^0, F_t^1, \ldots, F_t^{n-1},$$
are concatenated, and the high-quality high-resolution frame $SR_t$ is restored through a CNN network, i.e.
$$SR_t = \mathrm{Net}_R\big([F_t^0, F_t^1, \ldots, F_t^{n-1}]; \theta_R\big),$$
where $\theta_R$ denotes the model parameters of this CNN network.
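A sketch of this final restoration step, assuming n fused feature maps of 64 channels each; the single 3x3 convolution mirrors the stated single-layer network, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class RestorationHead(nn.Module):
    """Single-layer CNN mapping the concatenated features to the HR frame SR_t."""
    def __init__(self, channels: int = 64, n_features: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels * n_features, 3, kernel_size=3, padding=1)

    def forward(self, fused_features):
        # SR_t = Net_R([F_t^0, ..., F_t^{n-1}]; theta_R)
        return self.conv(torch.cat(fused_features, dim=1))
```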
The following table compares the performance of the method of this embodiment with existing super-resolution algorithms on the Vid4 super-resolution data set at a scaling factor of 4x; higher PSNR and SSIM values indicate better performance.
TABLE 1 Performance comparison (PSNR/SSIM) between the method of this embodiment and existing super-resolution algorithms
(The contents of Table 1 appear as an image in the original publication.)
FIG. 4 shows a subjective comparison between the method of this embodiment and existing super-resolution algorithms on the City sequence of Vid4. The results show that the method substantially outperforms the existing super-resolution algorithms.
To verify the performance of the proposed method in an actual packet loss network, an SHVC encoder is first used to spatially scalable-encode the HEVC Class B standard test sequences (BasketballDrive, BQTerrace, Cactus, Kimono, and ParkScene), with a base-layer resolution of 480x270, an enhancement-layer resolution of 1920x1080, and QP set to 22. Random frame loss is then simulated on the enhancement layer; the following table compares the performance of the method of this embodiment with Bicubic under different enhancement-layer frame loss rates.
Table 2 Performance comparison between the method of this embodiment and Bicubic under random enhancement-layer frame loss
(The contents of Table 2 appear as an image in the original publication.)
FIG. 5 compares the reconstructed video quality of the embodiment of the invention with the Bicubic scheme under random and continuous packet-loss scenarios, using VMAF as the quality metric and the BasketballDrive video sequence. In (a), under random packet loss, the VMAF of the video reconstructed by the method of this embodiment is far higher than that of the Bicubic scheme; (b) shows the per-frame VMAF under continuous packet loss, where the enhancement layer of frames 30 to 37 is lost: the VMAF of the video reconstructed by the method of this embodiment declines only slowly, thereby reducing the impact of frame loss on subjective quality.
In summary, the invention makes full use of the low-resolution image information of the current frame and the high-resolution image information of the previous frames (decoded normally or recovered by the method of this embodiment), and restores a high-quality high-resolution image of the current frame, thereby improving the overall reconstruction quality of spatially scalable coded video in a packet loss network.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (8)

1. A method for improving the reconstruction quality of spatially scalable coded video in a packet loss network, comprising:
S1, acquiring a preliminary high-resolution image of the current frame;
S2, extracting preliminary image features from the preliminary high-resolution image through a first CNN network;
S3, extracting the motion features between the preliminary high-resolution image of the current frame and each of the high-resolution images decoded from the previous frames, and fusing the motion features with the preliminary image features to obtain fused features;
S4, concatenating all the fused features output by the recurrent network to restore the high-resolution image of the current frame.
2. The method of claim 1 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: in S1, when the spatially scalable coded video is decoded, if the enhancement layer of the current frame is lost and only a low-resolution image can be decoded, the preliminary high-resolution image is obtained through a neural-network-based image super-resolution algorithm.
3. The method of claim 1 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: the motion features are image features containing the motion information between two frames.
4. The method of claim 3 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: the motion features are acquired by a recurrent network.
5. The method of claim 4 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: the recurrent network is a second CNN network, which implements the motion feature extraction and fusion.
6. The method of claim 1 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: in S3, the motion features and the preliminary image features are fused by a feature fusion module, which consists mainly of two ResBlocks and adopts a residual structure overall.
7. The method of claim 1 for improving the reconstruction quality of spatially scalable coded video in a packet loss network, wherein: in S4, all the obtained fused features and the preliminary image features are input together into a third CNN network to obtain the high-quality high-resolution image; the third network is a single-layer network used for final image restoration.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the program when executed by the processor is operable to perform the method of any of claims 1 to 7.
CN202010456887.0A 2020-05-26 2020-05-26 Method for improving reconstruction quality of spatial scalable coding video in packet loss network Active CN111726623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010456887.0A CN111726623B (en) 2020-05-26 2020-05-26 Method for improving reconstruction quality of spatial scalable coding video in packet loss network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010456887.0A CN111726623B (en) 2020-05-26 2020-05-26 Method for improving reconstruction quality of spatial scalable coding video in packet loss network

Publications (2)

Publication Number Publication Date
CN111726623A 2020-09-29
CN111726623B 2022-11-11

Family

ID=72565171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010456887.0A Active CN111726623B (en) 2020-05-26 2020-05-26 Method for improving reconstruction quality of spatial scalable coding video in packet loss network

Country Status (1)

Country Link
CN (1) CN111726623B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113365079A (en) * 2021-06-01 2021-09-07 闽南师范大学 Video coding pixel motion compensation method based on super-resolution network
CN114051137A (en) * 2021-10-13 2022-02-15 上海工程技术大学 Spatial scalable video coding method and decoding method
CN115002482A (en) * 2022-04-27 2022-09-02 电子科技大学 End-to-end video compression method and system using structural preservation motion estimation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101018333A (en) * 2007-02-09 2007-08-15 上海大学 Coding method of fine and classified video of space domain classified noise/signal ratio
CN101511017A (en) * 2009-03-20 2009-08-19 西安电子科技大学 Hierarchical encoder of stereo video space based on grid and decoding method thereof
US20180137603A1 (en) * 2016-11-07 2018-05-17 Umbo Cv Inc. Method and system for providing high resolution image through super-resolution reconstruction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101018333A (en) * 2007-02-09 2007-08-15 上海大学 Coding method of fine and classified video of space domain classified noise/signal ratio
CN101511017A (en) * 2009-03-20 2009-08-19 西安电子科技大学 Hierarchical encoder of stereo video space based on grid and decoding method thereof
US20180137603A1 (en) * 2016-11-07 2018-05-17 Umbo Cv Inc. Method and system for providing high resolution image through super-resolution reconstruction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YILUN ZHOU: "Low-precision CNN Model Quantization based on Optimal Scaling Factor Estimation", IEEE *
ZHOU Hang et al.: "Super-resolution reconstruction of compressed video using a dual-network structure", Telecommunication Engineering (电讯技术) *
SONG Xiaowei et al.: "Research on a region-of-interest-based quality scalable coding and transmission scheme", Journal of Zhongyuan University of Technology (中原工学院学报) *
QIAN Tuanjie et al.: "A robust wireless video transmission method with scalable transcoding capability", Journal of Shanghai Jiao Tong University (上海交通大学学报) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113365079A (en) * 2021-06-01 2021-09-07 闽南师范大学 Video coding pixel motion compensation method based on super-resolution network
CN113365079B (en) * 2021-06-01 2023-05-30 闽南师范大学 Super-resolution network-based video coding sub-pixel motion compensation method
CN114051137A (en) * 2021-10-13 2022-02-15 上海工程技术大学 Spatial scalable video coding method and decoding method
CN115002482A (en) * 2022-04-27 2022-09-02 电子科技大学 End-to-end video compression method and system using structural preservation motion estimation
CN115002482B (en) * 2022-04-27 2024-04-16 电子科技大学 End-to-end video compression method and system using structural preserving motion estimation

Also Published As

Publication number Publication date
CN111726623B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
Wang et al. Wireless deep video semantic transmission
CN111726623B (en) Method for improving reconstruction quality of spatial scalable coding video in packet loss network
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
US10819994B2 (en) Image encoding and decoding methods and devices thereof
WO2022111631A1 (en) Video transmission method, server, terminal, and video transmission system
CN107105278A (en) The coding and decoding video framework that motion vector is automatically generated
Yu et al. Quality enhancement network via multi-reconstruction recursive residual learning for video coding
CN111031315B (en) Compressed video quality enhancement method based on attention mechanism and time dependence
CN113747242B (en) Image processing method, image processing device, electronic equipment and storage medium
WO2023246926A1 (en) Model training method, video encoding method, and video decoding method
CN115689917A (en) Efficient space-time super-resolution video compression restoration method based on deep learning
Wang et al. Multi-frame compressed video quality enhancement by spatio-temporal information balance
CN108833920B (en) DVC side information fusion method based on optical flow and block matching
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
CN113115075B (en) Method, device, equipment and storage medium for enhancing video image quality
CN115665427A (en) Live broadcast data processing method and device and electronic equipment
CN114885178A (en) Extremely-low-bit-rate face video hybrid compression method and system based on bidirectional frame prediction
CN111212288B (en) Video data encoding and decoding method and device, computer equipment and storage medium
Ho et al. End-to-end learned image compression with augmented normalizing flows
Yu et al. Learning-based quality enhancement for scalable coded video over packet lossy networks
CN113573076A (en) Method and apparatus for video encoding
CN115103188B (en) SVC error concealment method, model training method, system and equipment
CN113256521B (en) Error concealment method and device for data loss
US20230269380A1 (en) Encoding method, decoding method, encoder, decoder and storage medium
CN115604479A (en) Video quality enhancement method for short video website

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant