CN108965847B - Method and device for processing panoramic video data

Publication number: CN108965847B
Authority: CN (China)
Prior art keywords: picture, resolution, layer, data, boundary
Legal status: Active
Application number: CN201710393777.2A
Other languages: Chinese (zh)
Other versions: CN108965847A
Inventor: 魏开进
Assignee (original and current): Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to CN201710393777.2A
Publication of CN108965847A (application)
Application granted; publication of CN108965847B (grant)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method and a device for processing panoramic video data, used to improve video quality during panoramic video playback and improve the user's viewing experience. The method comprises the following steps: a terminal receives and decodes N layers of data code streams of a panoramic video to obtain N pictures to be processed; the 1st picture to be processed, obtained by decoding the 1st-layer data code stream, is used as the 1st picture to be spliced, and the i-th picture to be processed, obtained by decoding the i-th-layer data code stream, is magnified P_i times through an image super-resolution algorithm to obtain the i-th picture to be spliced, where P_i equals the ratio of the 1st resolution to the i-th resolution; the data of the N-th picture to be spliced is used as the data of the entire picture area of the picture frame to be displayed; in order of i decreasing from N, the data of the (i-1)-th picture to be spliced successively overwrites the data in the (i-1)-th area of the picture frame to be displayed; and the boundaries of the 1st through (N-1)-th areas of the picture frame to be displayed are smoothed before display.

Description

Method and device for processing panoramic video data
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for processing panoramic video data.
Background
Virtual Reality (VR) technology is a computer simulation technology for creating and experiencing a virtual world: a computer generates a simulated environment into which the user is immersed. VR technology mainly involves the simulated environment, perception, natural skills, and sensing devices. The simulated environment is a real-time, dynamic, three-dimensional realistic image generated by a computer. Perception means that an ideal VR system should provide all the senses a person has: in addition to the visual perception generated by computer graphics technology, it includes hearing, touch, force, motion, and even smell and taste. Natural skills refer to a person's head rotation, eye movement, gestures, and other body actions; the computer processes data matching the participant's actions, responds to the user's input in real time, and feeds the responses back to the user's senses. Sensing devices are three-dimensional interaction devices that capture the user's actions and feed them back to the computer simulation system as input.
Visual perception plays an extremely important role in VR, and the most basic VR systems address virtual vision first. A basic VR system therefore has the following three characteristics: first, it blocks the person's original visual input; second, it fills the entire field of vision with the light of virtual images; third, interaction with the images achieves the effect of deceiving the brain.
Panoramic video extends traditional video technology to achieve VR immersion. Unlike traditional video, in which the viewer passively watches the pictures and shots of a fixed Field Of View (FOV) chosen by the photographer, panoramic video lets the user interactively watch dynamic video in any direction, 360 degrees up, down, left, and right around the shooting point, giving the user a genuine sense of presence free of the limits of time, space, and location. Because panoramic video records every scene in a 360-degree space, it is characterized by a large data volume, which poses great challenges to its transmission over a network. If the server simply transmits all of the panoramic video data to the client for decoding and playback, a large amount of transmission resources is needed, and the client's player must be substantially modified to render and display the panoramic video.
Disclosure of Invention
The present application provides a method and an apparatus for processing panoramic video data, which are used to improve video quality during panoramic video playback and improve the user's viewing experience.
In a first aspect, the present application provides a method for processing panoramic video data, where the method includes:
a terminal receives N layers of data code streams of a panoramic video, where, for each picture frame of the panoramic video: the 1st-layer data code stream among the N layers is used to transmit data in a 1st area of the picture frame, the 1st area including the area of the picture frame determined by the field of view of the terminal's user, and the original resolution of the picture frame being the 1st resolution; the i-th-layer data code stream is used to transmit data in an i-th area after the picture frame is down-sampled to an i-th resolution, where i is an integer greater than 1 and not greater than N, the i-th resolution is lower than the (i-1)-th resolution, and the i-th area contains the (i-1)-th area; and the N-th-layer data code stream is used to transmit data of the entire picture area after the picture frame is down-sampled to the N-th resolution;
the terminal decodes the data code stream of the N layers to obtain N pictures to be processed;
the terminal uses the 1st picture to be processed, obtained by decoding the 1st-layer data code stream, as the 1st picture to be spliced, and magnifies the i-th picture to be processed, obtained by decoding the i-th-layer data code stream, by P_i times through an image super-resolution algorithm to obtain the i-th picture to be spliced, where P_i equals the ratio of the 1st resolution to the i-th resolution;
the terminal uses the data of the N-th picture to be spliced as the data of the entire picture area of the picture frame to be displayed and, in order of i decreasing from N, successively overwrites the data in the (i-1)-th area of the picture frame to be displayed with the data of the (i-1)-th picture to be spliced;
and the terminal smooths the boundaries of the N-1 areas, from the 1st area to the (N-1)-th area, of the picture frame to be displayed, and then displays the frame.
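To make the claimed flow concrete, the following Python sketch walks through the five steps; it is illustrative only, and `decode_stream`, `super_resolve`, and `smooth_boundary` are hypothetical stand-ins for a real decoder, a real image super-resolution model, and a spatial filter, none of which the claim fixes.

```python
def compose_display_frame(streams, regions, resolutions,
                          decode_stream, super_resolve, smooth_boundary):
    """Sketch of the claimed steps for one picture frame.

    streams[k]     : encoded bitstream of layer k+1 (k = 0 .. N-1)
    regions[k]     : (x, y, w, h) of area k+1, in full-resolution coordinates
    resolutions[k] : width of layer k+1 before magnification
    """
    n = len(streams)
    # Steps 1-2: decode every layer into a picture to be processed.
    pictures = [decode_stream(s) for s in streams]

    # Step 3: picture 1 is used as-is; picture i is magnified by
    # P_i = (1st resolution) / (i-th resolution) with super-resolution.
    to_stitch = [pictures[0]]
    for k in range(1, n):
        p = resolutions[0] / resolutions[k]          # P_i as a width ratio
        to_stitch.append(super_resolve(pictures[k], p))

    # Step 4: picture N fills the whole frame; then, with i decreasing from
    # N to 2, picture i-1 overwrites area i-1 (the areas are nested).
    frame = to_stitch[-1].copy()
    for k in range(n - 2, -1, -1):                   # k = N-2 .. 0
        x, y, w, h = regions[k]
        frame[y:y + h, x:x + w] = to_stitch[k][:h, :w]

    # Step 5: smooth the boundaries of areas 1 .. N-1 before display.
    for k in range(n - 1):
        frame = smooth_boundary(frame, regions[k])
    return frame
```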
It can be seen that, in the embodiments of the present application, on the basis of layered tile transmission of panoramic video, the pictures carried by the low-resolution layers are magnified with an image super-resolution magnification algorithm, and the picture boundaries produced by splicing are smoothed; this overcomes defects of panoramic video such as uneven picture-quality transitions and blurry dynamic transitions, improves the overall picture quality during playback, and improves the user's viewing experience.
In a possible design, the terminal's smoothing of the boundaries of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed may be implemented as follows:
for each of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed, the terminal smooths the boundary of the area through a spatial filtering algorithm, using the set of pixels whose distance from the boundary is within a first set distance range.
Through this spatial filtering, the boundaries formed by resolution differences in the picture frame to be displayed can be effectively eliminated, improving the video quality during panoramic video playback and the user's viewing experience.
In a possible design, after the terminal smooths the boundaries of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed, and before displaying, the method further includes:
the terminal applies temporal filtering to the picture frame obtained after the boundary smoothing, through a temporal filtering algorithm, so that the resolution of moving objects in the panoramic video transitions smoothly.
Through this temporal filtering, the sharpness of moving objects in the panoramic video is enhanced, blurring and flickering in the time domain during playback are weakened, and the temporal quality is smoothed.
In a possible design, the terminal's magnification, by P_i times through an image super-resolution algorithm, of the i-th picture to be processed obtained by decoding the i-th-layer data code stream, to obtain the i-th picture to be spliced, may be implemented as follows:
the terminal magnifies, by P_i times using an image super-resolution algorithm, the pixels of the i-th picture to be processed whose distance from the boundary of the (i-1)-th area is within a second set distance range, and magnifies the other pixels of the i-th picture to be processed by P_i times using a non-super-resolution algorithm, to obtain the i-th picture to be spliced.
This mitigates the heavy consumption of computing resources that would result from applying super-resolution magnification to entire pictures, since image super-resolution algorithms usually involve a large amount of computation; processing complexity is reduced without affecting the user's viewing experience.
In one possible design, the method further includes:
the terminal smooths, through a spatial filtering algorithm, the boundary formed in the i-th picture to be spliced between the area obtained with the image super-resolution algorithm and the area obtained with an image interpolation algorithm.
This addresses the boundary that appears within a picture to be spliced when, as in some embodiments of the application, only a partial region of the picture is selected for image super-resolution magnification.
In one possible design, the spatial filtering algorithm is an H.264 deblocking filtering algorithm or an H.265 deblocking filtering algorithm.
In a second aspect, the present application provides an apparatus for processing panoramic video data, the apparatus comprising:
a receiving module, configured to receive N layers of data code streams of a panoramic video, where, for each picture frame of the panoramic video: the 1st-layer data code stream among the N layers is used to transmit data in a 1st area of the picture frame, the 1st area including the area of the picture frame determined by the field of view of the terminal's user, and the original resolution of the picture frame being the 1st resolution; the i-th-layer data code stream is used to transmit data in an i-th area after the picture frame is down-sampled to an i-th resolution, where i is an integer greater than 1 and not greater than N, the i-th resolution is lower than the (i-1)-th resolution, and the i-th area contains the (i-1)-th area; and the N-th-layer data code stream is used to transmit data of the entire picture area after the picture frame is down-sampled to the N-th resolution;
a decoding module, configured to decode the N layers of data code streams received by the receiving module to obtain N pictures to be processed;
a magnifying module, configured to use the 1st picture to be processed, obtained by the decoding module decoding the 1st-layer data code stream, as the 1st picture to be spliced, and to magnify the i-th picture to be processed, obtained by the decoding module decoding the i-th-layer data code stream, by P_i times through an image super-resolution algorithm to obtain the i-th picture to be spliced, where P_i equals the ratio of the 1st resolution to the i-th resolution;
a splicing module, configured to use the data of the N-th picture to be spliced, obtained by the magnifying module, as the data of the entire picture area of the picture frame to be displayed and, in order of i decreasing from N, to successively overwrite the data in the (i-1)-th area of the picture frame to be displayed with the data of the (i-1)-th picture to be spliced obtained by the magnifying module;
a boundary processing module, configured to smooth the boundaries of the N-1 areas, from the 1st area to the (N-1)-th area, of the picture frame to be displayed obtained by the splicing module;
and a display module, configured to display the picture frame processed by the boundary processing module.
In a possible design, when smoothing the boundaries of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed obtained by the splicing module, the boundary processing module is specifically configured to:
for each of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed obtained by the splicing module, smooth the boundary of the area through a spatial filtering algorithm, using the set of pixels whose distance from the boundary is within a first set distance range.
In a possible design, after smoothing the boundaries of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed obtained by the splicing module, the boundary processing module is further configured to:
apply temporal filtering to the picture frame obtained after the boundary smoothing, through a temporal filtering algorithm, so that the resolution of moving objects in the panoramic video transitions smoothly.
In a possible design, when magnifying, by P_i times through an image super-resolution algorithm, the i-th picture to be processed obtained by the decoding module decoding the i-th-layer data code stream, the magnifying module is specifically configured to:
magnify, by P_i times using an image super-resolution algorithm, the pixels of the i-th picture to be processed whose distance from the boundary of the (i-1)-th area is within a second set distance range, and magnify the other pixels of the i-th picture to be processed by P_i times using a non-super-resolution algorithm.
In one possible design, the boundary processing module is further configured to:
smooth, through a spatial filtering algorithm, the boundary formed in the i-th picture to be spliced obtained by the magnifying module between the area obtained with the image super-resolution algorithm and the area obtained with an image interpolation algorithm.
In one possible design, the spatial filtering algorithm is an H.264 deblocking filtering algorithm or an H.265 deblocking filtering algorithm.
For the implementation and advantageous effects of the second aspect or any possible design of the second aspect, reference may be made to the implementation and advantageous effects of the first aspect or the corresponding possible design of the first aspect, and repeated details are not repeated.
In a third aspect, the present application provides a terminal, including: a processing unit, a storage unit and a communication unit;
wherein the storage unit stores a computer program, the communication unit is configured to send and receive data, and the processing unit is configured to execute, by calling the computer program stored in the storage unit and relying on the communication unit's sending and receiving of data, the method for processing panoramic video data according to the first aspect or any possible design of the first aspect.
In one possible design, the terminal further includes: a display unit, configured to display the picture frame to be displayed that is obtained when the processing unit executes the method for processing panoramic video data according to the first aspect or any possible design of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer instructions for performing the functions of the first aspect or any possible design of the first aspect; the storage medium contains a program designed to perform the method of the first aspect or any possible design of the first aspect, so that a processor can execute that method when the computer instructions are invoked.
Drawings
FIG. 1 is a schematic view of a user's field of view in some embodiments of the present application;
FIG. 2 is a schematic diagram of a panoramic video layered tile transmission process according to some embodiments of the present application;
FIG. 3 is a schematic flow chart of a processing scheme for panoramic video data according to some embodiments of the present application;
FIG. 4 is a schematic diagram illustrating a terminal performing image super-resolution magnification on a selected partial area of a picture to be spliced in some embodiments of the present application;
FIG. 5(a) is a schematic diagram illustrating the principle by which a terminal smooths a region boundary in a picture frame to be displayed in some embodiments of the present application;
FIG. 5(b) is a schematic diagram illustrating the principle by which a terminal smooths a region boundary in a picture frame to be displayed in some embodiments of the present application;
FIG. 6(a) is a diagram illustrating the motion trajectory of a moving object crossing, in the time domain, from a high-resolution region of a picture to a low-resolution region in some embodiments of the present application;
FIG. 6(b) is a diagram illustrating temporal filtering of a picture frame in an example scenario according to some embodiments of the present application;
FIG. 7 is a diagram illustrating temporal filtering based on motion estimation in some embodiments of the present application;
FIG. 8(a) is a schematic diagram illustrating the resolution change when a moving object in a panoramic video moves from a high-resolution region to a low-resolution region without temporal filtering, according to some embodiments of the present application;
FIG. 8(b) is a schematic diagram illustrating the resolution change when a moving object in a panoramic video moves from a high-resolution region to a low-resolution region after temporal filtering, according to some embodiments of the present application;
FIG. 8(c) is a schematic diagram illustrating the resolution change when a moving object in a panoramic video moves from a low-resolution region to a high-resolution region without temporal filtering, according to some embodiments of the present application;
FIG. 8(d) is a schematic diagram illustrating the resolution change when a moving object in a panoramic video moves from a low-resolution region to a high-resolution region after temporal filtering, according to some embodiments of the present application;
FIG. 9 is a schematic flow chart of the processing scheme for panoramic video data applied in an example scenario according to some embodiments of the present application;
FIG. 10 is a schematic structural diagram of an apparatus for processing panoramic video data according to some embodiments of the present application;
FIG. 11 is a schematic structural diagram of a terminal according to some embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
The application provides a method and an apparatus for processing panoramic video data, which solve problems that arise when a terminal decodes and plays panoramic video, such as blurry partial areas, visible boundaries, and flickering of moving objects, thereby improving video quality during playback and the user's viewing experience. The method and the apparatus are based on the same inventive concept; since the principles by which they solve the problem are similar, their implementations may refer to each other, and repeated parts are not repeated.
In the embodiments of the application, a picture frame of a panoramic video is transmitted from the server side to the terminal side through N layers of data code streams, where, for each picture frame, the original resolution of the picture frame is taken as the 1st resolution; the 1st-layer data code stream among the N layers is used to transmit data in a 1st area of the picture frame determined by the field of view of the terminal's user; the i-th-layer data code stream is used to transmit data in an i-th area after the picture frame is down-sampled to an i-th resolution, where the i-th resolution is lower than the (i-1)-th resolution, the i-th area contains the (i-1)-th area, and i is an integer greater than 1 and not greater than N; and the N-th-layer data code stream is used to transmit data of the entire picture area after the picture frame is down-sampled to the N-th resolution. After receiving the N layers of data code streams of the panoramic video, the terminal decodes them to obtain N pictures to be processed; it then uses the 1st picture to be processed, obtained by decoding the 1st-layer data code stream, as the 1st picture to be spliced, and magnifies the i-th picture to be processed, obtained by decoding the i-th-layer data code stream, by P_i times through an image super-resolution algorithm to obtain the i-th picture to be spliced, where P_i equals the ratio of the 1st resolution to the i-th resolution, so that N pictures to be spliced are obtained. The terminal uses the data of the N-th picture to be spliced as the data of the entire picture area of the picture frame to be displayed and, in order of i decreasing from N, successively overwrites the data in the (i-1)-th area of the picture frame to be displayed with the data of the (i-1)-th picture to be spliced; finally, the boundaries of the N-1 areas, from the 1st area to the (N-1)-th area, of the picture frame to be displayed are smoothed before display.
Through the above process, when the panoramic video is transmitted to the terminal in a layered, resolution-reduced manner, the terminal receives and decodes the layered data code streams, then magnifies the pictures recovered from each resolution-reduced layer with a super-resolution magnification algorithm, obtaining higher-quality pictures for each layer; the terminal splices the resulting layers and smooths the boundaries formed by splicing before display. A panoramic picture frame with higher picture quality and an effectively reduced boundary effect can thus be presented to the terminal's user, improving the video quality during panoramic video playback and the viewing experience of the terminal's user.
Hereinafter, some terms in the present application will be explained so as to be understood by those skilled in the art.
1. Field Of View (FOV): the angle subtended, at the observation point (the position of the eyes), by the edges of the scene visible to a person's eyes at a given moment; for example, in a display system, the field of view is typically the angle subtended at the viewing point (the eyes) by the edges of the display.
In the description herein, the area determined by the user's FOV on a picture frame of the panoramic video is referred to as the user field-of-view area (FOV area). For example, FIG. 1 shows an example of a field of view: in a 360-degree space, area A is the FOV area formed by the user's FOV.
2. ROI (region of interest): in the field of image processing, a region of interest is an image region selected from an image; delineating this region for further processing can reduce processing time and increase processing precision. In many applications of image algorithms, the position of the target to be processed is often relatively fixed in the image, and defining the image's ROI according to prior information can greatly reduce the complexity of the algorithm and improve its efficiency and robustness. Depending on the shape of the object in the image, the ROI can be defined in various ways; the region to be processed is often delineated by a rectangle, circle, ellipse, irregular polygon, or the like.
It should be noted that the terminal referred to in this application may include devices capable of processing panoramic video data, such as a smart phone, a tablet computer, a desktop computer, a notebook computer, and a VR device (such as a VR head-mounted display device).
For example, the terminal referred to in this application may specifically be a VR device. A VR device may generally include a virtual environment, a virtual-environment processor with a high-performance computer at its core, a display system, an auditory tracking system, a body orientation and posture tracking system based on an orientation tracker, data gloves, and a data garment, as well as feedback units or modules for taste, smell, touch, and force; the virtual-environment processor can receive and process panoramic video data, and the display system can play the panoramic video.
It should be understood that the above structural description does not limit the terminal related to the embodiments of the present application, and may include more functions or modules, which are not specifically limited by the present application.
The panoramic video referred to in the present application may specifically be a VR video, or may also be a stereoscopic two-way 360-degree video.
The terminal referred to in this application may be provided with an application supporting panoramic video playback; for example, VR applications include, but are not limited to, VR game applications, VR cinema applications, VR social applications, VR simulated-shopping applications, VR education applications, VR sports applications, and the like, or other applications developed in the future; the specifics are not limited.
In addition, in this application, "a plurality of" means two or more; the terms "first" ("1st"), "second" ("2nd"), and the like in the description are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or order.
In order to more clearly illustrate the processing scheme of panoramic video data provided in the embodiments of the present application, a brief description will be given below of a related technical scheme of panoramic video transmission.
Conventional schemes for transmitting panoramic video and playing it at a terminal include the following. In one scheme, the server directly transmits the whole panoramic video to the terminal for rendering and playback. Its drawback is low bandwidth utilization: at any moment, the terminal's user sees only the picture in the viewing area directly in front (the FOV area), which usually occupies only about 1/6 of the whole panoramic picture area. In addition, the terminal's player must be substantially modified so that it can map the panoramic video as a texture onto the corresponding projection surface, such as a sphere or a cube, according to the panoramic video's projection format, and re-project and render from the center of the projection surface to display the panoramic picture.
In another scheme, the server renders the panoramic picture: the server obtains the rotation of the terminal user's viewing angle in real time, renders and compresses the picture of the FOV area the user wants to watch, and transmits it to the terminal; the terminal decodes the received picture data stream and displays the picture on its own screen or passes it to an externally connected screen for display. The drawback of this scheme is the long processing path: collecting the user's behavior at the terminal, transmitting it to the server, rendering, encoding, transmitting, buffering at the server side, and finally decoding and displaying at the terminal results in a high motion-to-photon (MTP) latency, making it difficult to meet the requirement of most panoramic video applications that MTP latency be below 20 ms.
In yet another scheme, the server performs projection-transformation preprocessing on the panoramic video in its source file format to reduce file size and transmission bandwidth. A typical example uses a quadrangular-pyramid projection format to pre-project the panoramic video, at a limited number of discrete FOV angles, into different pyramid files. When the server detects that the terminal user's viewing angle has moved toward a certain FOV direction, it selects the pyramid file closest to that direction and transmits it to the terminal; after receiving the code stream of the pyramid file, the terminal decodes it into a diamond-shaped (rhombus) video, maps it as a texture onto the corresponding quadrangular pyramid surface, and re-projects and renders from the center of the pyramid to obtain the display picture. This scheme still has many drawbacks: both the server and the terminal must perform projection rendering; the server must hold files for all the different FOV directions, consuming a large amount of server storage; and a large Content Delivery Network (CDN) bandwidth is also required.
In view of the respective drawbacks of the above solutions, the industry currently offers a solution that transmits panoramic video based on a layered block (tile) transmission technique. In this scheme, for each picture frame of the panoramic video, the server transmits to the terminal a code stream comprising several different resolution layers of the picture frame, each layer covering a different ROI size. As an example, FIG. 2 shows a schematic diagram of the panoramic video layered tile transmission flow.
Referring to FIG. 2, the example uses a typical three-layer transmission scheme. The first layer (resolution layer 1) corresponds to the original picture frame of the panoramic video and has the highest resolution, so to save bandwidth only the minimum ROI region containing the FOV region (the ROI 1 region) is transmitted. The middle, second layer (resolution layer 2) corresponds to a low-resolution picture obtained by 2:1 down-sampling of the original picture frame; because the resolution is reduced, an ROI region containing the FOV region and larger than that of the first layer (the ROI 2 region) can be transmitted. The third, bottom layer (resolution layer 3) is the picture obtained by 2:1 down-sampling again, i.e., a still lower-resolution picture obtained by 4:1 down-sampling of the original picture frame; because its resolution is very low, its whole picture area (the ROI 3 region) can be transmitted to the terminal.
for the convenience of data transmission of the first layer and the second layer, the two resolution layers adopt a tile slice coding mode. For example, a tile slice is typically encoded by dividing a picture into independent tile streams of MxN rectangular blocks (tiles) in the horizontal and vertical directions and transmitting the tile streams. As shown in fig. 2, the third layer may be directly transmitted as one tile (one ROI-tile) without slicing the tiles, the second layer corresponds to tile streams of 2 × 2 tiles sliced in the horizontal and vertical directions, i.e., the terminal will receive a plurality of ROI-tiles from the second layer, and the first layer corresponds to tile streams sliced in the horizontal and vertical directions into 4 × 4 tiles, i.e., the terminal will receive a plurality of ROI-tiles from the first layer;
On the terminal side, the terminal receives and decodes the tile streams of the three layers (the third layer contains one tile stream; the second and first layers contain several tile streams each), synthesizes and splices the resulting pictures into the final panoramic picture frame to be displayed, and sends it to the Graphics Processing Unit (GPU) for rendering and display.
It can be seen that, thanks to the layered tile transmission technique, when the user switches the FOV viewing angle toward a given picture area, the tile streams of the rectangular blocks corresponding to that area can be transmitted to the terminal under the tile-sliced coding scheme, and combining tile streams achieves FOV-adaptive transmission.
However, panoramic video transmitted by the above layered tile scheme still has defects at playback time. First, when the pictures decoded from the panoramic video's data code streams are synthesized and spliced, the low-resolution pictures carried by the low-resolution layers are usually magnified with an image interpolation magnification algorithm before splicing; pictures magnified by interpolation are of poor quality and tend to look blurry and unsharp, and the quality of dynamic picture transitions while the user rotates the FOV viewing angle is particularly unsatisfactory.
For example, after the user rotates the FOV viewing angle quickly, the tile streams for the next moment may not yet have reached the terminal because of transmission delay, so the terminal's player can only render the FOV-area picture using the existing, lower-resolution second-layer content as the texture; if the FOV rotation is large, the player may even have to render the FOV-area picture from the lowest-resolution bottom-layer data. Rendering from low-resolution pictures produces blurry, low-quality output and degrades the user's viewing experience.
In addition, pictures from different resolution layers exhibit a spatial boundary effect when spliced and synthesized, so the picture-quality transition in the spatial domain is not smooth. Moreover, if the panoramic video contains a moving object and its motion over time crosses the picture boundaries formed by the resolution layers, a boundary effect in the time domain occurs, i.e., the picture-quality transition over time is not smooth. For example, when the human eye tracks a moving object in the panoramic video and the object moves from a high-resolution region to a low-resolution region, the object blurs abruptly and flickers in the time domain, which also degrades the user's viewing experience.
Considering these playback defects of layered-tile-transmitted panoramic video together with the advantages of the layered tile transmission technique, the embodiments of the present application provide a processing scheme for panoramic video data.
The main idea of the processing scheme provided by the embodiments of the application is that, on the basis of layered tile transmission of panoramic video, the pictures carried by the low-resolution layers are magnified with an image super-resolution magnification algorithm, and the picture boundaries produced by splicing are smoothed, thereby overcoming defects of panoramic video such as uneven picture-quality transitions and blurry dynamic transitions, improving the overall picture quality during playback, and improving the user's viewing experience.
The following describes a processing scheme for panoramic video data according to some embodiments of the present application with reference to FIG. 3, which is a schematic flow chart of the scheme.
Referring to FIG. 3, the process may include the following steps:
In step 301, a terminal receives N layers of data code streams of a panoramic video; then, in step 302, the terminal decodes the N layers of data code streams to obtain N pictures to be processed.
As introduced above for layered tile transmission of panoramic video, for each picture frame of the panoramic video, the 1st-layer data code stream among the N layers may be used to transmit data in the 1st area of the picture frame (the original resolution of the picture frame being the 1st resolution), where the 1st area may be the area of the picture frame containing the field of view (FOV) of the terminal's user; the i-th-layer data code stream is used to transmit data in the i-th area after the picture frame is down-sampled to the i-th resolution, where i is an integer greater than 1 and not greater than N, the i-th resolution is lower than the (i-1)-th resolution, and the i-th area contains the (i-1)-th area; and the N-th-layer data code stream is used to transmit data of the entire picture area after the picture frame is down-sampled to the N-th resolution.
For example, in the typical three-layer transmission scenario of layered tile transmission described above, in step 301 the terminal receives: from resolution layer 1 (the 1st layer), which corresponds to the picture frame at its original resolution (the 1st resolution), the data stream of the picture in the ROI 1 region containing the FOV region (the 1st area); from resolution layer 2 (the 2nd layer), which corresponds to the lower-resolution picture obtained by 2:1 down-sampling (the 2nd resolution), the data stream of the picture in the ROI 2 region, which contains the FOV region and is larger than the ROI 1 region (the 2nd area); and from resolution layer 3 (the 3rd, i.e., N-th, layer), which corresponds to the still lower-resolution picture obtained by 4:1 down-sampling of the original picture frame (the 3rd, i.e., N-th, resolution), the data stream of the picture in the entire picture area, i.e., the ROI 3 region (the 3rd, i.e., N-th, area).
After receiving the N layers of data code streams of the panoramic video in step 301, the terminal may decode them in step 302, with each layer's data code stream decoded into one picture.
Continuing the three-layer scenario above: the terminal decodes the data stream of resolution layer 1 to obtain picture 1 to be spliced, whose resolution equals that of the original picture frame and whose data is the original-resolution picture data in the ROI 1 region of the panoramic picture; it decodes the data stream of resolution layer 2 to obtain the lower-resolution picture 2 to be spliced, whose data is the down-sampled, lower-resolution picture data in the ROI 2 region of the panoramic picture; and it decodes the data stream of resolution layer 3 to obtain the lowest-resolution picture 3 to be spliced, whose data is the down-sampled, lowest-resolution picture data in the ROI 3 region (i.e., the whole picture area) of the panoramic picture.
Further, in step 303, the terminal may use the 1st picture to be processed, obtained by decoding the 1st-layer data code stream, as the 1st picture to be spliced, and magnify the i-th picture to be processed, obtained by decoding the i-th-layer data code stream, by P_i times through an image super-resolution algorithm to obtain the i-th picture to be spliced, where P_i equals the ratio of the 1st resolution to the i-th resolution.
After the above processing, in step 304 the terminal may use the data of the N-th picture to be spliced as the data of the entire picture area of the picture frame to be displayed and, in order of i decreasing from N, successively overwrite the data in the (i-1)-th area of the picture frame to be displayed with the data of the (i-1)-th picture to be spliced, thereby obtaining the picture frame to be displayed.
For example, still in the three-layer scenario, the terminal decodes the data streams of the two low-resolution layers (resolution layer 2 and resolution layer 3) to obtain picture 2 to be spliced and picture 3 to be spliced, and magnifies each of them in the horizontal and vertical directions with an image super-resolution magnification algorithm: picture 2 to be spliced is magnified at 1:2, i.e., by 2 times (corresponding to P_2 = 2 above), and picture 3 to be spliced is magnified at 1:4, i.e., by 4 times (corresponding to P_3 = 4 above). Picture 1 to be spliced, obtained by decoding resolution layer 1, the 2x-magnified picture 2 to be spliced, and the 4x-magnified picture 3 to be spliced are then synthesized and spliced together in step 304 to form the picture frame to be displayed.
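A quick arithmetic check of the magnification factors in this example, with an assumed original frame width of 4096 pixels:

```python
res_1, res_2, res_3 = 4096, 2048, 1024   # widths after 1:1, 2:1, 4:1 sampling
p_2 = res_1 / res_2                      # P_2 = 2: picture 2 magnified 2x
p_3 = res_1 / res_3                      # P_3 = 4: picture 3 magnified 4x
assert (p_2, p_3) == (2.0, 4.0)
```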
Specifically, the image super-resolution algorithm may be, for example, a non-local interpolation algorithm, a temporal-pixel algorithm, a statistical dictionary method, a sparse-representation-based super-resolution algorithm, a deep-learning-based super-resolution algorithm, an image-decomposition-based super-resolution algorithm, or the like; such algorithms restore a low-resolution image or image sequence to a high-resolution image.
It can be seen that, because in step 303 the terminal magnifies the lower-resolution pictures with an image super-resolution algorithm, i.e., applies image super-resolution enhancement to the 2nd through N-th pictures to be spliced obtained by decoding the 2nd- through N-th-layer data code streams, rather than conventional image interpolation magnification, higher-quality pictures can be obtained; this overcomes the defects of magnifying low-resolution pictures with an ordinary image interpolation magnification algorithm, namely poor magnification quality, blurry and unsharp images, and poor dynamic transitions while the FOV rotates.
Specifically, the practical implementation of step 304, in which the terminal synthesizes the N pictures to be spliced processed in step 303 into the picture frame to be displayed, mainly consists of memory copying and overwriting and involves no other operations on the picture data.
For example, in the three-layer scenario, given picture 1 to be spliced obtained by decoding resolution layer 1, the 2x-magnified picture 2 to be spliced, and the 4x-magnified picture 3 to be spliced, the terminal may first copy the picture data of the highest-resolution picture 1 to be spliced over the corresponding storage locations of the intermediate-resolution picture 2 to be spliced (i.e., overwrite the storage locations corresponding to the ROI 1 region in picture 2 to be spliced with the highest-resolution data of the ROI 1 region), and then copy the image data of the overwritten intermediate-resolution picture 2 to be spliced over the corresponding storage locations of the lowest-resolution picture 3 to be spliced (i.e., overwrite the storage locations corresponding to the ROI 2 region in picture 3 to be spliced with the ROI 2 region data that now includes the ROI 1 data). The resulting picture frame to be displayed contains the highest-resolution picture data in its ROI 1 region, intermediate-resolution picture data in the part of the ROI 2 region outside the ROI 1 region, and the lowest-resolution picture data in the part of the ROI 3 region (i.e., the whole picture area) outside the ROI 2 region.
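Because the synthesis is pure copy-and-overwrite, it reduces to array slice assignments. A sketch with numpy follows; the frame size, ROI sizes, and ROI positions are hypothetical:

```python
import numpy as np

# Hypothetical sizes: a 4096x2048 panorama with arbitrarily placed ROIs.
pic3 = np.zeros((2048, 4096, 3), np.uint8)       # 4x-magnified full-frame picture
pic2 = np.full((1024, 2048, 3), 128, np.uint8)   # 2x-magnified ROI 2 picture
pic1 = np.full((512, 1024, 3), 255, np.uint8)    # original-resolution ROI 1 picture

# Highest resolution first: ROI 1 data overwrites its location inside picture 2
# (assumed here to start at x=512, y=256 in picture 2's coordinates) ...
pic2[256:256 + 512, 512:512 + 1024] = pic1
# ... then the overwritten picture 2 overwrites the ROI 2 location inside
# picture 3 (assumed here to start at x=1024, y=512 in frame coordinates).
frame = pic3.copy()
frame[512:512 + 1024, 1024:1024 + 2048] = pic2
```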
Further, since image super-resolution magnification algorithms usually involve a large amount of computation, applying super-resolution magnification to entire pictures consumes substantial computing resources. In a real viewing scenario, the user only watches the picture in the ROI region containing the FOV region, while the larger surrounding picture is accessed only with some probability because the user may rotate the viewing angle; for example, when a new FOV prediction is inaccurate during FOV rotation, or the FOV-region picture arrives late so the terminal cannot obtain the next high-resolution FOV picture in time, the terminal must render the FOV from the larger-ROI picture of an existing lower-resolution layer in order to keep the MTP latency low. Therefore, in some implementation scenarios of the flow shown in FIG. 3, when applying super-resolution magnification to a lower-resolution picture to be spliced, the terminal may select only a partial area of the picture for super-resolution processing: the farther an area is from the current user's FOV area, the less likely it is to be accessed, so the processing complexity can be reduced without affecting the user's viewing experience.
Specifically, for the i-th picture to be processed obtained by decoding the i-th-layer data code stream (with i taking values from 2 to N in turn), the terminal may magnify, by P_i times using an image super-resolution algorithm, the pixels of the i-th picture to be processed whose distance from the boundary of the (i-1)-th area is within a second set distance range, and magnify the other pixels of the i-th picture to be processed by P_i times using a non-super-resolution algorithm (typically, an image interpolation magnification algorithm), thereby obtaining the i-th picture to be spliced.
Based on the foregoing three-layer transmission scenario, FIG. 4 shows, as an example, a schematic diagram of a terminal performing image super-resolution magnification on a selected partial area of a picture to be spliced in some embodiments of the present application.
Referring to FIG. 4, the area filled with horizontal lines represents picture 3 to be spliced (corresponding to the ROI 3 region, i.e., the whole picture area), the area filled with oblique lines represents picture 2 to be spliced (corresponding to the ROI 2 region), and the area filled with grid lines represents the original-resolution picture 1 to be spliced (corresponding to the ROI 1 region). From the viewpoint of the user watching the panoramic video, only the minimum region containing the FOV region, i.e., the ROI 1 region, is currently being accessed; the larger ROI 2 region and the whole picture area (the ROI 3 region) are accessed only with some probability, for example when a new FOV prediction is inaccurate during FOV rotation or the FOV picture arrives late, so that the terminal cannot obtain the next picture of the high-resolution ROI 1 region containing the FOV in time and, to keep the MTP latency low, must temporarily render the new FOV region from the existing low-resolution, blurry picture of the ROI 2 or ROI 3 region.
Accordingly, since the user's viewing angle moves continuously, areas closer to the current FOV area are more likely to be accessed next. To reduce processing complexity, as shown in FIG. 4, when magnifying picture 3 to be spliced the terminal applies image super-resolution magnification only to the picture in a rectangular band near the ROI 2 region boundary (area A, enclosed by the ROI 2 boundary and the dotted line in FIG. 4) rather than to the whole of picture 3; similarly, when magnifying picture 2 to be spliced, it applies image super-resolution magnification only to the picture in a rectangular band near the ROI 1 region boundary (area B, enclosed by the ROI 1 boundary and the dotted line in FIG. 4) rather than to the whole of picture 2. A sketch of this banded processing follows.
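In the sketch below, `super_resolve` and `interpolate` are hypothetical stand-ins for the two magnification paths, and the band rectangle is assumed, for simplicity, to be aligned to multiples of the integer factor p:

```python
def magnify_with_band(picture, p, inner_roi, band, super_resolve, interpolate):
    """Upscale `picture` by integer factor p, running the costly
    super-resolution path only in a band of width `band` (in output pixels)
    around the inner ROI boundary. Pixels inside the inner ROI are later
    overwritten by the next-higher layer, so extra effort there is wasted.
    """
    out = interpolate(picture, p)            # cheap path for the whole picture
    x, y, w, h = inner_roi                   # inner ROI in output coordinates
    x0, y0 = max(x - band, 0), max(y - band, 0)
    x1 = min(x + w + band, out.shape[1])
    y1 = min(y + h + band, out.shape[0])
    # Super-resolve only the source crop mapping onto the band's bounding box
    # (coordinates assumed aligned to multiples of p).
    crop = picture[y0 // p : y1 // p, x0 // p : x1 // p]
    out[y0:y1, x0:x1] = super_resolve(crop, p)
    return out
```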
It can be seen that selecting, in a low-resolution-layer picture to be spliced, only the pixels around the higher-resolution layer's ROI for image super-resolution magnification, rather than super-resolving all pixel regions, greatly reduces complexity because far fewer pixels are processed; and because pixels farther from the high-resolution layer's ROI are less likely to be accessed, skipping super-resolution for them has little impact on the user's experience of watching the panoramic video.
Because the picture frame to be displayed obtained in step 304 is synthesized from the N pictures to be spliced, magnified in step 303, which have different resolutions and correspond to the N areas of the picture, displaying it directly would expose a boundary effect along the boundaries of the N-1 areas from the 1st area (corresponding to picture 1 to be spliced) to the (N-1)-th area (corresponding to picture N-1 to be spliced), resulting in a poor viewing experience. To address this boundary effect, in the flow shown in FIG. 3, after obtaining the picture frame to be displayed through steps 301-304, the terminal may, in step 305, smooth the boundaries of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed before displaying it.
Specifically, for each of the N-1 areas from the 1st area to the (N-1)-th area of the picture frame to be displayed, the terminal may smooth the boundary of the area through a spatial filtering algorithm, using the set of pixels whose distance from the boundary is within a first set distance range.
The spatial filtering algorithm may be, for example, an H.264 deblocking filtering algorithm or an H.265 deblocking filtering algorithm.
As an example, assuming the region boundaries (i.e., the boundaries of the 1st through (N-1)-th areas) in the picture frame to be displayed are all rectangular, FIG. 5(a) and FIG. 5(b) show schematic diagrams of how the terminal smooths a region boundary in the picture frame to be displayed in some embodiments of the present application.
Referring to FIG. 5(a), for a vertical region boundary in the picture frame to be displayed, the terminal may filter through a spatial filtering algorithm using the 4 pixels p3, p2, p1, p0 to the left of the boundary and the 4 pixels q3, q2, q1, q0 to the right of it, thereby weakening the vertical boundary. Referring to FIG. 5(b), for a horizontal region boundary, the terminal may filter using the 4 pixels p3, p2, p1, p0 above the boundary and the 4 pixels q3, q2, q1, q0 below it, thereby weakening the horizontal boundary. At an inflection point where a horizontal and a vertical boundary intersect, filtering may be done first in the horizontal direction and then in the vertical direction, or first in the vertical direction and then in the horizontal direction.
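A minimal single-channel version of the FIG. 5(a) filtering, smoothing image rows across a vertical boundary at column x using the four pixels on each side; a simple binomial low-pass stands in here for the H.264/H.265 deblocking filters named above:

```python
import numpy as np

def smooth_vertical_boundary(frame, x, taps=4):
    """Low-pass filter rows of a 2-D frame across a vertical boundary at
    column x, over `taps` pixels on each side (p3..p0 | q0..q3 in FIG. 5(a)).
    Assumes taps <= x <= frame width - taps."""
    f = frame.astype(np.float32)
    lo, hi = x - taps, x + taps
    band = f[:, lo:hi]                                       # columns p3..q3
    kernel = np.array([1, 4, 6, 4, 1], np.float32) / 16.0    # binomial low-pass
    padded = np.pad(band, ((0, 0), (2, 2)), mode="edge")
    smoothed = sum(kernel[k] * padded[:, k:k + band.shape[1]] for k in range(5))
    f[:, lo:hi] = smoothed
    return f.astype(frame.dtype)
```

The same routine applied to the transposed frame handles a horizontal boundary.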
Similarly, when a region boundary in the frame to be displayed has another shape, such as a polygon, circle, or ellipse, boundary smoothing can be performed in the same manner as above.
Applying spatial filtering to a picture frame stitched from pictures of different resolutions effectively eliminates the visible boundaries formed by the resolution differences, reducing the boundary effect, improving the video quality during panoramic video display, and improving the user's viewing experience.
Further, besides the boundaries formed by stitching pictures of different resolutions, another kind of boundary can appear within a single picture to be stitched: if, during the amplification in step 303, only part of the picture's regions is enlarged by image super-resolution while the other regions are enlarged in another manner, a boundary will appear inside that picture. To address this, in some embodiments of the present application, the terminal may also smooth, through a spatial filtering algorithm, the boundary formed in each amplified picture to be stitched between the region enlarged by the image super-resolution algorithm and the region enlarged by a non-super-resolution algorithm; the specific manner and process are similar to the spatial filtering applied to the boundaries caused by the different resolution layers described above, and are not repeated here.
In a panoramic video, if a moving object crosses picture areas of different resolutions as it moves over time, a temporal boundary effect arises and the video picture does not transition smoothly in the time domain. For example, as shown in FIG. 6(a), when an object's motion trajectory runs from a high-resolution picture area to a low-resolution picture area and the human eye tracks the object, the FOV angle also rotates from the high resolution toward the low resolution; because the resolution of the tracked object changes abruptly from high to low, the picture appears to blur suddenly. Similarly, if the object moves from a low-resolution area to a high-resolution area, the FOV rotates from the low resolution toward the high resolution, and the abrupt change from low to high resolution makes the picture appear to sharpen suddenly. Both cause temporal flicker during panoramic video playback, degrading the user's viewing experience.
Spatial filtering can remove the boundary effect formed between the resolution layers within each individual picture frame of a panoramic video transmitted by layered tiles, i.e., within the frame at each moment; it cannot, however, effectively handle the temporal boundary effect caused by a moving object crossing resolution regions across the frames of multiple moments.
To solve the abrupt blurring and flickering of moving objects in the time domain, in some embodiments of the present application, after the terminal spatially smooths the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed in step 305, it may further apply temporal filtering to the spatially smoothed frame, and then render and display the frame after both spatial and temporal filtering. This enhances the definition of moving objects in the panoramic video, weakens the temporal blurring and flickering during playback of a panoramic video transmitted by layered tiles, and smooths the temporal quality.
Specifically, the terminal may apply a temporal filtering algorithm to the picture frame obtained after the boundary smoothing of step 305, so as to enhance the definition of moving objects in the panoramic video and make their resolution transition smoothly.
Temporal filtering generally refers to replacing the current pixel with a weighted average of the current pixel and similar or co-located pixels in temporally adjacent frames. Temporal filtering algorithms include motion-adaptive algorithms and motion estimation/motion compensation algorithms, among others; temporal filtering can attenuate video quality defects such as noise and flicker caused by pixels that change discontinuously over time.
For example, for the case shown in FIG. 6(a), where the object's motion trajectory moves over time from the high-resolution picture area to the low-resolution picture area, FIG. 6(b) shows a schematic diagram of temporally filtering the picture frames in some embodiments of the present application. Referring to FIG. 6(b), the object's motion generates, in temporal order, picture frames 1, 2 and 3 with the object in the high-resolution picture area and picture frame 4 with the object in the low-resolution picture area. To smooth the picture-quality transition of the moving object as it crosses resolutions, the terminal may use frames 1, 2, 3 and 4 and obtain, through a temporal filtering algorithm, a temporally filtered frame 4 of enhanced picture quality. This sharpens the moving object as it crosses the boundary, reduces temporal blur and flicker, and smooths the temporal quality.
The following briefly introduces the implementation of temporal filtering, taking third-order (3-tap) temporal filtering as an example:
assume the current pixel value is denoted cur, ref1 and ref2 denote two similar reference pixels in frames temporally adjacent to the current pixel, and output denotes the temporally filtered pixel value;
the third-order temporal filter can then be expressed as: output = w0*cur + w1*ref1 + w2*ref2,
where w0, w1 and w2 are the weighting coefficients of cur, ref1 and ref2, and these weights (or weighting coefficients) sum to 1: w0 + w1 + w2 = 1.
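As a minimal sketch of this third-order filter on 8-bit frames (the example weights are illustrative; the patent derives them as described below):

```python
import numpy as np

def temporal_filter_3tap(cur, ref1, ref2, w0=0.5, w1=0.25, w2=0.25):
    # output = w0*cur + w1*ref1 + w2*ref2, with w0 + w1 + w2 = 1.
    assert abs(w0 + w1 + w2 - 1.0) < 1e-6, "weights must sum to 1"
    out = (w0 * cur.astype(np.float32)
           + w1 * ref1.astype(np.float32)
           + w2 * ref2.astype(np.float32))
    return np.clip(out, 0.0, 255.0).astype(np.uint8)
```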
Thus, performing the temporal filtering requires determining, for the current pixel (cur), the reference pixels (ref1 and ref2) and the weighting coefficients of the current pixel and the reference pixels (w0, w1 and w2);
in some simpler time-domain filtering algorithms, for example, based on a motion adaptive algorithm, pixels at the same positions as those of a current frame time-domain neighboring frame where a current pixel is located are mainly selected as reference pixels, and a region motion detection mode is adopted to adjust each filtering weight; for example, based on cur, ref1, ref2, w0, w1, w2 respectively correspond, for pixels in a region with large motion, the weight coefficients w1 and w2 are reduced during calculation, and for pixels in a region with small motion, the weight coefficients w1 and w2 are increased during calculation; the algorithm has the advantages of simplicity and easy implementation, but the effect is not ideal for video frames with severe motion and the motion smear phenomenon of moving objects is also caused.
In some temporal filtering algorithms of higher complexity, such as those based on motion estimation and motion compensation, motion estimation and motion compensation are used to find temporal predictions of the current frame in the temporally adjacent frames, which effectively removes the smear effect, but the computational complexity is higher and real-time performance is weaker.
For example, still using cur, ref1 and ref2 with corresponding weights w0, w1 and w2, FIG. 7 shows an example of motion estimation based on 16×16 blocks, where the cur block (cur_block in the figure) is the matching block corresponding to cur, the ref1 block (ref1_block in the figure) is the matching block corresponding to ref1, and the ref2 block (ref2_block in the figure) is the matching block corresponding to ref2. After the matching blocks are obtained, the temporally filtered output block can be calculated by the following formula:
Output_block = w0*cur_block + w1*ref1_block + w2*ref2_block, with w0 + w1 + w2 = 1;
where the weighting coefficients w1 and w2 can be obtained, before normalization, as:
w1 = 1 / MAD(cur_block, ref1_block),
w2 = 1 / MAD(cur_block, ref2_block).
As these expressions show, the more similar ref1 and ref2 are to cur, the more the cur block can be replaced by the high-definition ref1 and ref2 blocks; in the formula this is reflected by the mean absolute difference (MAD) of the matching blocks: the smaller the MAD, the larger the weighting coefficients w1 and w2 and the stronger the filtering effect. The weighting coefficient w0 of the cur block is related to the upsampling amplification factor (the larger the factor, the smaller w0) and can be set empirically;
finally, to satisfy the constraint w0 + w1 + w2 = 1, the weighting coefficients may be normalized by:
wk = wk / (w0 + w1 + w2), for k = 0, 1, 2.
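Putting the block-based variant together, here is a sketch under the relations stated above: w1 and w2 inversely proportional to the block MAD, an empirical w0, and a final normalization so the three weights sum to 1. The eps guard and the default w0 are illustrative assumptions.

```python
import numpy as np

def mad(a, b):
    # Mean absolute difference between two matched blocks.
    return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))))

def mc_temporal_filter_block(cur_block, ref1_block, ref2_block, w0=0.4, eps=1.0):
    # Smaller MAD (a better motion-compensated match) -> larger weight.
    w1 = 1.0 / (mad(cur_block, ref1_block) + eps)
    w2 = 1.0 / (mad(cur_block, ref2_block) + eps)
    s = w0 + w1 + w2                        # normalize: w0 + w1 + w2 = 1
    w0n, w1n, w2n = w0 / s, w1 / s, w2 / s
    out = (w0n * cur_block.astype(np.float32)
           + w1n * ref1_block.astype(np.float32)
           + w2n * ref2_block.astype(np.float32))
    return np.clip(out, 0.0, 255.0).astype(np.uint8)
```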
After the temporal filtering, video quality defects such as noise and flicker caused by temporally discontinuous pixels can be effectively attenuated.
FIG. 8(a) is a schematic diagram of the change in resolution when a moving object in a panoramic video moves from a high-resolution region to a low-resolution region without temporal filtering, and FIG. 8(b) shows the same case after the temporal filtering of some embodiments of the present application is applied.
It can be seen that, with temporal filtering, the resolution (which can also be understood as the definition) of the moving object changes gradually from high to low rather than abruptly. On one hand this smooths the temporal quality transition and eliminates the flicker effect; on the other hand it improves the definition of the moving object in the low-resolution layer and mitigates the blur of the dynamic transition, thereby improving the user's experience of watching the panoramic video.
Similarly, FIG. 8(c) shows the change in resolution when a moving object in the panoramic video moves from a low-resolution area to a high-resolution area without temporal filtering, and FIG. 8(d) shows the same case after the temporal filtering of some embodiments of the present application. With temporal filtering, the resolution (definition) of the moving object changes gradually from low to high rather than abruptly, achieving a smooth temporal quality transition, eliminating flicker, improving the definition of the moving object in the low-resolution layer, and mitigating the blur of the dynamic transition, thereby improving the user's viewing experience.
After the above processing is completed, the terminal may display the processed picture frame to the user, for example by sending it to a graphics processing unit (GPU) for rendering and display.
As can be seen from the above description, the panoramic video data processing scheme provided in the foregoing embodiments of the present application can, first, amplify the decoded pictures transmitted by the low-resolution layers of a layered-tile panoramic video using image super-resolution amplification, enhancing the quality of the amplified low-resolution pictures, improving the overall quality of the played panoramic video, and smoothing the quality transition of the moving picture as the user's FOV rotates; second, after the pictures from the different resolution layers are fused into a picture frame to be displayed, apply spatial filtering to the boundaries formed by the frame's different resolution areas, removing the spatial boundary effect; third, apply temporal filtering across consecutive picture frames to enhance the definition of moving objects crossing different resolution areas, mitigating abrupt blurring and temporal flicker and smoothing the temporal quality; and finally, when amplifying a low-resolution picture with the image super-resolution amplification algorithm, restrict the algorithm's use to selected areas, reducing its complexity.
Therefore, the panoramic video data processing scheme provided by the embodiments of the present application can alleviate problems such as blurred dynamic transitions, obvious boundary effects and picture flicker caused by inaccurate prediction of the new FOV area during the user's FOV rotation, or by delayed transmission of the picture in the FOV area, improving the user's experience of watching the panoramic video.
To make the processing scheme of the panoramic video data provided by the above embodiments clearer, the following briefly introduces, with reference to FIG. 9, its application in a practical scenario.
Based on the three-layer panoramic video transmission example described above, FIG. 9 shows a flowchart of the processing scheme of the panoramic video data provided by some embodiments of the present application as applied in this exemplary scenario.
Referring to FIG. 9, and based on the foregoing three-layer transmission example of the panoramic video shown in FIG. 2, assume the panoramic video in this scenario is transmitted in three layers: resolution layer 1 transmits, at the highest resolution, the picture in the ROI1 region of a panoramic video picture frame, where ROI1 is the smallest region containing the user's current FOV area; resolution layer 2 transmits the picture in the ROI2 region after the picture frame is down-sampled 2:1, where ROI2 contains ROI1; and resolution layer 3 transmits the picture in the ROI3 region after the picture frame is down-sampled 4:1, where ROI3 contains ROI2.
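For reference, this layer setup can be captured in a small configuration table; a minimal sketch with illustrative field names (the values come from the scenario just described):

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    layer: int        # resolution layer index
    region: str       # region of the frame transmitted by this layer
    downsample: str   # down-sampling ratio relative to the source frame
    upscale_pi: int   # Pi = layer-1 resolution / layer-i resolution

LAYERS = [
    LayerConfig(1, "ROI1 (smallest region containing the FOV)", "1:1", 1),
    LayerConfig(2, "ROI2 (contains ROI1)", "2:1", 2),
    LayerConfig(3, "ROI3 (contains ROI2)", "4:1", 4),
]
```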
as shown in fig. 9, after the terminal receives the three layers of data streams of the panoramic video transmitted by the layered tile, the data streams of the layers are decoded first: the terminal carries out ROI-tiles decoding on the data code stream transmitted by the resolution layer 1 and the resolution layer 2, and directly decodes the data code stream transmitted by the resolution layer 3;
Next, the terminal selectively applies super-resolution enhancement, using an image super-resolution amplification algorithm, to the low-resolution pictures to be stitched decoded from the low-resolution layers (resolution layer 2 and resolution layer 3 in FIG. 9): instead of conventional uniform interpolation amplification, only the pixels near the boundary of the higher-resolution layer's ROI area in each low-resolution-layer picture are super-resolved. As shown in FIG. 9, the picture region decoded from the resolution layer 2 code stream is selectively super-resolved by a factor of 2, and the picture region decoded from the resolution layer 3 code stream by a factor of 4; the specific process is as described above and is not repeated here.
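A sketch of this region-selective amplification follows, assuming OpenCV's bicubic resize as the non-super-resolution path and a caller-supplied sr_fn (hypothetical; any super-resolution model with the contract sr_fn(image, factor) -> upscaled image) as the super-resolution path. Here roi_inner_lr is the next-higher layer's region in low-resolution coordinates, and band_lr plays the role of the set distance around its boundary.

```python
import cv2
import numpy as np

def selective_upscale(low_res, factor, roi_inner_lr, band_lr=8, sr_fn=None):
    # Non-SR path for most pixels: plain bicubic interpolation.
    h, w = low_res.shape[:2]
    base = cv2.resize(low_res, (w * factor, h * factor),
                      interpolation=cv2.INTER_CUBIC)
    if sr_fn is None:
        return base
    # SR path only for a crop reaching band_lr beyond the ROI boundary.
    x0, y0, x1, y1 = roi_inner_lr
    cx0, cy0 = max(x0 - band_lr, 0), max(y0 - band_lr, 0)
    cx1, cy1 = min(x1 + band_lr, w), min(y1 + band_lr, h)
    crop_sr = sr_fn(low_res[cy0:cy1, cx0:cx1], factor)
    # The ROI interior is overwritten by the higher layer at stitch time,
    # so super-resolving the whole crop rather than just the ring is cheap.
    base[cy0 * factor:cy1 * factor, cx0 * factor:cx1 * factor] = crop_sr
    return base
```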
Subsequently, the terminal may combine the processed pictures from the three resolution layers into one picture frame to be displayed (the inter-layer splicing shown in FIG. 9), which is mainly a memory copy-and-overlay: the highest-resolution picture data is copied over the corresponding storage location of the intermediate-resolution picture, and the intermediate-resolution picture data is then copied over the corresponding storage location of the lowest-resolution picture.
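A sketch of this copy-and-overlay, assuming all pictures are already at the layer-1 resolution and that each layer's top-left offset within the full frame is known (the offsets are assumed inputs):

```python
import numpy as np

def compose_display_frame(layers, offsets):
    # layers[0] is the highest-resolution ROI picture, layers[-1] the
    # full-frame lowest-resolution picture; offsets[i] = (x, y) is the
    # top-left corner of layer i inside the full frame.
    frame = layers[-1].copy()                  # Nth picture fills the frame
    # Overlay with i decreasing so higher resolutions end up on top.
    for pic, (x, y) in zip(reversed(layers[:-1]), reversed(offsets[:-1])):
        h, w = pic.shape[:2]
        frame[y:y + h, x:x + w] = pic          # memory copy-and-overlay
    return frame
```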
Furthermore, the terminal may perform spatial filtering (the spatial filtering shown in FIG. 9) on the stitched single-frame panoramic picture frame, to eliminate both the boundaries formed by stitching pictures of different resolutions and the boundaries formed between differently amplified areas after the region-selective super-resolution;
finally, the terminal may perform temporal filtering (the temporal filtering shown in FIG. 9) on the spatially filtered panoramic video, to improve the picture quality of moving objects, eliminate the flickering and blurring of moving objects crossing resolution-layer boundaries in the panoramic video, and improve the quality of dynamic transitions.
In summary, the processing scheme of the panoramic video data provided in the embodiments of the present application mainly addresses the blurred partial areas, obvious boundaries, and flickering of moving objects that occur when a panoramic video transmitted in layers is decoded and played by a terminal;
specifically, under the framework of panoramic video layered-tile transmission, the scheme applies an image super-resolution algorithm to amplify the pictures transmitted by the low-resolution layers, enhancing the quality of the amplified low-resolution pictures; and, to reduce processing complexity during this super-resolution amplification, it selectively super-resolves only the pixels in the boundary areas of each region, which lowers the processing complexity without affecting the panoramic video viewing experience;
the scheme further applies spatial filtering to the boundaries formed by the different resolutions of the regions within the same picture frame of the panoramic video, weakening the boundary effect caused by the resolution differences; and, for the boundaries of the differently resolved regions across different picture frames, applies temporal filtering to enhance the definition of moving objects crossing those boundaries in the time domain, so that the video picture quality of moving objects transitions smoothly over time. Through these processes, the video quality during panoramic video playback is improved, and the user's viewing experience is improved.
Based on the same inventive concept, the present application further provides a processing apparatus for panoramic video data, where the processing apparatus may be deployed in a terminal capable of supporting panoramic video processing, and is configured to execute the above-mentioned embodiment of the processing method for panoramic video data of the present application, and each functional module in the processing apparatus may be specifically implemented by software, hardware, or a combination of software and hardware.
As shown in fig. 10, the processing apparatus includes:
a receiving module 1001, configured to receive an N-layer data code stream of a panoramic video; for each picture frame of the panoramic video, a layer 1 data code stream in the layer N data code streams is used for transmitting data in a layer 1 area in the picture frame, the layer 1 area comprises an area determined by a user field angle of the terminal in the picture frame, the original resolution of the picture frame is a layer 1 resolution, the layer i data code stream is used for transmitting data in the layer i area after the picture frame is subjected to down-sampling to the layer i resolution, i is an integer which is greater than 1 and not greater than N, the layer i resolution is smaller than the layer i-1 resolution, the layer i area comprises a layer i-1 area, and the layer N data code stream is used for transmitting data of all picture areas after the picture frame is subjected to down-sampling to the layer N resolution;
a decoding module 1002, configured to decode the N-layer data code stream received by the receiving module 1001, so as to obtain N to-be-processed pictures;
the amplifying module 1003 is configured to use a 1 st to-be-processed picture obtained by decoding the 1 st data code stream by the decoding module 1002 as a 1 st to-be-spliced picture, and amplify an i th to-be-processed picture obtained by decoding the i th data code stream by the decoding module by Pi times through an image super-resolution algorithm to obtain an i th to-be-spliced picture, where Pi is equal to a ratio of a 1 st resolution to an i th resolution;
a splicing module 1004, configured to use the data of the Nth picture to be spliced obtained by the amplifying module 1003 as the data in all picture areas of the picture frame to be displayed, and to sequentially cover the data in the (i-1)th area of the picture frame to be displayed with the data of the (i-1)th picture to be spliced obtained by the amplifying module 1003, in the order of i decreasing from N;
a boundary processing module 1005, configured to smooth the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module 1004;
a display module 1006, configured to display the frame processed by the boundary processing module 1005.
In a possible design, when smoothing the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module 1004, the boundary processing module 1005 is specifically configured to:
smooth the boundary of each of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module 1004 through a spatial filtering algorithm, using the set of pixels whose distance from the region boundary is within a first set distance range.
In a possible design, after smoothing the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module 1004, the boundary processing module 1005 is further configured to:
and performing time domain filtering processing on the picture frame obtained after the boundary smoothing processing through a time domain filtering algorithm so as to enable the resolution of the moving object in the panoramic video to be in smooth transition.
In a possible design, when the i-th to-be-processed picture obtained by decoding the i-th data code stream by the decoding module 1002 is amplified by Pi times through an image super-resolution algorithm, the amplifying module 1003 is specifically configured to:
amplify, by Pi times using an image super-resolution algorithm, the pixels in the ith to-be-processed picture whose distance from the (i-1)th region boundary is within a second set distance range, and amplify the other pixels in the ith to-be-processed picture by Pi times using a non-image-super-resolution algorithm.
In one possible design, the boundary processing module 1005 is further configured to: smooth, through a spatial filtering algorithm, the boundary formed in the ith picture to be spliced obtained by the amplifying module 1003 between the region obtained by the image super-resolution algorithm and the region obtained by the image interpolation algorithm.
In one possible design, the spatial filtering algorithm is the H.264 deblocking filtering algorithm or the H.265 deblocking filtering algorithm.
Specifically, since the apparatus and the method provided in the foregoing embodiments of the present application solve the problem on the same principles, their specific implementations may be cross-referenced, and repeated details are not repeated here.
The division of modules in the embodiments of the present application is schematic and is merely a division by logical function; in actual implementation there may be other division manners. In addition, the functional modules in the embodiments of the present application may be integrated in one processor, exist separately in physical form, or be integrated in one module as two or more modules. An integrated module may be implemented in hardware or as a software functional module.
Based on the same inventive concept, the present application further provides a terminal, where the processing apparatus for panoramic video data shown in fig. 10 may be deployed on the terminal, so as to execute the flow of the processing method for panoramic video data shown in fig. 3.
Fig. 11 shows a schematic structural diagram of a terminal provided in an embodiment of the present application. Referring to fig. 11, the terminal 1100 may include a processing unit 1101 and a communication unit 1102. The processing unit 1101 may include a Central Processing Unit (CPU), or may also include a digital processing module, etc.
Specifically, the communication unit 1102 is configured to receive an N-layer data stream of the panoramic video; for each picture frame of the panoramic video, a layer 1 data code stream in the layer N data code streams is used for transmitting data in a layer 1 area in the picture frame, the layer 1 area comprises an area determined by a user field angle of the terminal in the picture frame, the original resolution of the picture frame is a layer 1 resolution, the layer i data code stream is used for transmitting data in the layer i area after the picture frame is subjected to down-sampling to the layer i resolution, i is an integer which is greater than 1 and not greater than N, the layer i resolution is smaller than the layer i-1 resolution, the layer i area comprises a layer i-1 area, and the layer N data code stream is used for transmitting data of all picture areas after the picture frame is subjected to down-sampling to the layer N resolution;
the processing unit 1101 is configured to decode the N-layer data code stream to obtain N to-be-processed pictures, use the 1 st to-be-processed picture obtained by decoding the 1 st data code stream as the 1 st to-be-spliced picture, and amplify the P of the i-th to-be-processed picture obtained by decoding the i-th data code stream through an image super-resolution algorithmiMultiplying to obtain the ith picture to be spliced, wherein P isiThe ratio of the 1 st resolution to the ith resolution is equal, further, the data of the Nth picture to be spliced is used as the data in all the picture areas in the picture frame to be displayed, and the data in the i-1 st area in the picture frame to be displayed is covered by the data of the i-1 st picture to be spliced according to the sequence that the value of i is decreased from N; and the device is used for smoothing the boundary of the total N-1 areas from the 1 st area to the N-1 st area in the frame of the picture to be displayed.
Specifically, the processing unit 1101 and the communication unit 1102 may be specifically configured to execute the processing method of the panoramic video data provided by the foregoing embodiments of the present application. This application is not described in detail herein.
In particular, the terminal 1100 may further comprise a storage unit 1104 for storing the computer programs executed by the processing unit 1101. The storage unit 1104 may be a nonvolatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM); it may also be any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
As shown in fig. 11, the terminal 1100 may further include a display unit 1103 for displaying the to-be-displayed picture frame obtained by the processing unit 1101 through the above-described process.
In the embodiment of the present application, a specific connection medium among the processing unit 1101, the communication unit 1102, the display unit 1103, and the storage unit 1104 is not limited. For example, the connection may be via a bus, which may be divided into an address bus, a data bus, a control bus, etc.
An embodiment of the present application further provides a readable storage medium for storing the software instructions to be executed by the above processing unit, including the program to be executed by the above processing unit.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A method for processing panoramic video data, the method comprising:
a terminal receives an N-layer data code stream of a panoramic video; for each picture frame of the panoramic video, a layer 1 data code stream in the layer N data code streams is used for transmitting data in a layer 1 area in the picture frame, the layer 1 area comprises an area determined by a user field angle of the terminal in the picture frame, the original resolution of the picture frame is a layer 1 resolution, the layer i data code stream is used for transmitting data in the layer i area after the picture frame is subjected to down-sampling to the layer i resolution, i is an integer which is greater than 1 and not greater than N, the layer i resolution is smaller than the layer i-1 resolution, the layer i area comprises a layer i-1 area, and the layer N data code stream is used for transmitting data of all picture areas after the picture frame is subjected to down-sampling to the layer N resolution;
the terminal decodes the data code stream of the N layers to obtain N pictures to be processed;
the terminal takes the 1st to-be-processed picture obtained by decoding the 1st layer data code stream as the 1st picture to be spliced, and amplifies the ith to-be-processed picture obtained by decoding the ith layer data code stream by Pi times through an image super-resolution algorithm to obtain the ith picture to be spliced, wherein Pi is equal to the ratio of the 1st resolution to the ith resolution;
the terminal uses the data of the Nth picture to be spliced as the data in all picture areas of the picture frame to be displayed, and sequentially covers the data in the (i-1)th area of the picture frame to be displayed with the data of the (i-1)th picture to be spliced, in the order of i decreasing from N;
and the terminal displays the picture frame to be displayed after smoothing the boundaries of the N-1 areas from the 1st area to the (N-1)th area in the frame.
2. The method as claimed in claim 1, wherein the terminal smoothing the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed comprises:
the terminal smooths the boundary of each of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed through a spatial filtering algorithm, using the set of pixels whose distance from the region boundary is within a first set distance range.
3. The method according to claim 1 or 2, wherein, after the terminal smooths the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed and before displaying, the method further comprises:
and the terminal carries out time domain filtering processing on the picture frame obtained after the boundary smoothing processing through a time domain filtering algorithm so as to enable the resolution of the moving object in the panoramic video to be in smooth transition.
4. The method according to claim 1 or 2, wherein the terminal amplifying the ith to-be-processed picture obtained by decoding the ith layer data code stream by Pi times through an image super-resolution algorithm to obtain the ith picture to be spliced comprises:
the terminal amplifies, by Pi times using the image super-resolution algorithm, the pixels in the ith to-be-processed picture whose distance from the (i-1)th area boundary is within a second set distance range, and amplifies the other pixels in the ith to-be-processed picture by Pi times using a non-image-super-resolution algorithm, to obtain the ith picture to be spliced.
5. The method of claim 4, further comprising:
the terminal smooths, through a spatial filtering algorithm, the boundary formed in the ith picture to be spliced between the region obtained by the image super-resolution algorithm and the region obtained by the image interpolation algorithm.
6. The method of claim 2 or 5, wherein the spatial filtering algorithm is an H.264 deblocking filtering algorithm or an H.265 deblocking filtering algorithm.
7. An apparatus for processing panoramic video data, the apparatus comprising:
the receiving module is used for receiving an N-layer data code stream of the panoramic video; for each picture frame of the panoramic video, a layer 1 data code stream in the layer N data code streams is used for transmitting data in a layer 1 area in the picture frame, the layer 1 area comprises an area determined by a user field angle of a terminal in the picture frame, the original resolution of the picture frame is a layer 1 resolution, the layer i data code stream is used for transmitting data in the layer i area after the picture frame is subjected to down-sampling to the layer i resolution, i is an integer which is greater than 1 and not greater than N, the layer i resolution is smaller than the layer i-1 resolution, the layer i area comprises a layer i-1 area, and the layer N data code stream is used for transmitting data of all the picture areas after the picture frame is subjected to down-sampling to the layer N resolution;
the decoding module is used for decoding the N layers of data code streams received by the receiving module to obtain N pictures to be processed;
an amplifying module, configured to take the 1st to-be-processed picture obtained by the decoding module decoding the 1st layer data code stream as the 1st picture to be spliced, and to amplify the ith to-be-processed picture obtained by the decoding module decoding the ith layer data code stream by Pi times through an image super-resolution algorithm to obtain the ith picture to be spliced, wherein Pi is equal to the ratio of the 1st resolution to the ith resolution;
the splicing module is used for using the data of the Nth picture to be spliced obtained by the amplifying module as the data in all the picture areas in the picture frame to be displayed, and sequentially using the data of the (i-1) th picture to be spliced obtained by the amplifying module to cover the data in the (i-1) th area in the picture frame to be displayed according to the sequence that the value of i is decreased from N;
the boundary processing module is used for smoothing the boundary of N-1 areas from the 1 st area to the N-1 st area in the frame of the picture to be displayed, which is obtained by the splicing module;
and the display module is used for displaying the picture frame processed by the boundary processing module.
8. The apparatus according to claim 7, wherein, when smoothing the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module, the boundary processing module is specifically configured to:
smooth the boundary of each of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module through a spatial filtering algorithm, using the set of pixels whose distance from the region boundary is within a first set distance range.
9. The apparatus according to claim 7 or 8, wherein, after smoothing the boundaries of the N-1 regions from the 1st region to the (N-1)th region in the picture frame to be displayed obtained by the splicing module, the boundary processing module is further configured to:
and performing time domain filtering processing on the picture frame obtained after the boundary smoothing processing through a time domain filtering algorithm so as to enable the resolution of the moving object in the panoramic video to be in smooth transition.
10. The apparatus according to claim 7 or 8, wherein, when amplifying the ith to-be-processed picture obtained by the decoding module decoding the ith layer data code stream by Pi times through an image super-resolution algorithm, the amplifying module is specifically configured to:
amplify, by Pi times using the image super-resolution algorithm, the pixels in the ith to-be-processed picture whose distance from the (i-1)th area boundary is within a second set distance range, and amplify the other pixels in the ith to-be-processed picture by Pi times using a non-image-super-resolution algorithm.
11. The apparatus of claim 10, wherein the boundary processing module is further configured to:
smooth, through a spatial filtering algorithm, the boundary formed in the ith picture to be spliced obtained by the amplifying module between the region obtained by the image super-resolution algorithm and the region obtained by the image interpolation algorithm.
12. The apparatus of claim 8 or 11, wherein the spatial filtering algorithm is the H.264 deblocking filtering algorithm or the H.265 deblocking filtering algorithm.
CN201710393777.2A 2017-05-27 2017-05-27 Method and device for processing panoramic video data Active CN108965847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710393777.2A CN108965847B (en) 2017-05-27 2017-05-27 Method and device for processing panoramic video data


Publications (2)

Publication Number Publication Date
CN108965847A CN108965847A (en) 2018-12-07
CN108965847B true CN108965847B (en) 2020-04-14






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant