CN114095756A

CN114095756A - Adaptive panoramic video streaming transmission system and method based on long-term view prediction

Info

Publication number: CN114095756A
Application number: CN202111361167.7A
Authority: CN
Inventors: 李克秋; 周晓波; 邱铁; 高晓松; 曾嘉欣
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-11-17
Filing date: 2021-11-17
Publication date: 2022-02-25
Anticipated expiration: 2041-11-17
Also published as: CN114095756B

Abstract

The invention discloses a self-adaptive panoramic video streaming transmission system based on long-term visual field prediction, which comprises a video server and a video client, wherein video content is transmitted between the video server and the video client through a dynamically-changing wireless network; the video server comprises a video coding module, a first view prediction processing module, a first cache region and a request processing module; the video client comprises a video decoding module, a second visual field prediction processing module, a second buffer area, a video tracking feedback module and a video bit rate distribution processing module; the video client predicts the long-term visual field of the user and allocates the bit rate of the video block to the user through a second visual field prediction processing module and a video bit rate allocation processing module respectively so as to realize the maximized user experience quality; the invention can adaptively adjust the bit rate of the video according to the fluctuating network environment, thereby improving the experience quality of the panoramic video user.

Description

Adaptive panoramic video streaming transmission system and method based on long-term view prediction

Technical Field

The invention mainly relates to the technical field of streaming media, in particular to a self-adaptive panoramic video streaming transmission system and a self-adaptive panoramic video streaming transmission method based on long-term visual field prediction.

Background

Panoramic video is of interest because it is one of the basic technologies for virtual reality, which can bring an immersive viewing experience to the user. However, implementing an immersive panoramic video viewing experience requires high resolution (e.g., 4K × 2K) and ultra-low latency (below 20ms) requirements, and the required network transmission rates will rapidly rise to the Gb/s level, which undoubtedly puts tremendous pressure on today's wireless networks. According to data statistical analysis, the current wireless network environment cannot realize smooth transmission and playing of panoramic videos, and watching experience of users can be greatly reduced. The data is searched and read, and a user watching the panoramic video only watches the video content in a view field (Viewport) area at a certain moment, and the area outside the view field is invisible to the user, so that serious bandwidth resource waste is caused by transmitting the complete panoramic video content.

From the prior art, it can be known that: the amount of panoramic video data is about 3-4 times that of ordinary video. In the face of the characteristic of large data volume of panoramic video, the current academic community has proposed a scheme of video coding and transmission based on tile (tile) to replace the overall transmission scheme of panoramic video, so as to reduce the waste of bandwidth resources while ensuring the watching experience of users. The specific method comprises the following steps: firstly, projecting a three-dimensional panoramic video onto a two-dimensional plane; then dividing the two-dimensional picture into a plurality of video blocks, wherein each video block can be independently coded and transmitted; and finally, splicing and restoring the video blocks at the user side, and restoring the video blocks into a panoramic video through projection inverse mapping.

From the related panoramic video transmission work it is known that: under the condition of limited bandwidth, in order to improve the experience quality of video users, a user view prediction strategy is generally adopted to adaptively transmit video content. The principle is as follows: the future visual field of the video watched by the user is predicted, and the bit rates of different video blocks are determined (i.e. the video blocks in the visual field of the user are allocated with high bit rates, and other areas are allocated with low bit rates). And the user side requests the content of the video in advance according to the predicted visual field and stores the content in the buffer area of the player, so that the smooth playing of the video is guaranteed. However, there is a deviation between the predicted view and the actual view, and the deviation will increase with the increase of the length of the prediction window. Considering the accuracy of view prediction, the user usually can only buffer a small amount of video content in advance, such as 1-2 seconds. When the network condition is poor, the player can quickly play the content in the buffer area, so that the video is blocked and the user experience is influenced; if the user tends to cache longer video segments, there may be a large deviation between the cached content and the user's view, which also affects the user experience.

The existing panoramic video transmission scheme cannot obtain an accurate long-term view prediction result, so that the duration of the cache content of a user side is limited. In addition, in addition to view prediction, video streaming also needs to adaptively adjust the bit rate level of a video block according to a series of changing dynamic factors such as network bandwidth and user buffer, so as to improve the user experience quality, i.e. the pre-fetching decision is affected by both the view prediction result and bandwidth fluctuation.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, and the self-adaptive panoramic video streaming transmission system based on the long-term visual field prediction comprises a video server and a video client, wherein video content is transmitted between the server and the client through a dynamically-changing wireless network. The system carries out user long-term visual field prediction and video block bit rate allocation decision at a user side so as to realize the maximization of user experience quality.

The invention is implemented by adopting the following technical scheme:

the adaptive panoramic video streaming transmission system based on the long-term visual field prediction comprises a video server and a video client; the video server transmits video content to the video client through a dynamically changing wireless network; the video server comprises a video coding module, a first view prediction processing module, a first cache region and a request processing module; the video client comprises a video decoding module, a second visual field prediction processing module, a second buffer area, a video tracking feedback module and a video bit rate distribution processing module; wherein: the video client predicts the long-term visual field of the user and allocates the bit rate of the video block to the user through a second visual field prediction processing module and a video bit rate allocation processing module respectively so as to realize the maximized user experience quality; the method comprises the following steps: the video bit rate distribution processing module selects a specific bit rate for each video block and sends a request to a video server in units of one video clip, and the maximized user experience quality is generated by the following formula:

Q＝α₁*QoE₁-α₂*QoE₂-α₃*QoE₃-α₄*QoE₄

wherein (alpha)₁，α₂，α₃，α₄) A non-negative weight parameter; QoE₁Is the average bit rate quality of the video block within the field of view; QoE₂Buffering duration of a video pause for the duration of downloading the video segment; QoE₃Is the spatial difference in bit rate between adjacent video blocks; QoE₄Is the time difference of the bit rate of adjacent video segments.

The invention can also be implemented by the following method:

the video server off-line processing panoramic video file step:

the video coding module divides a two-dimensional video obtained by projection into video segments with equal time length and cuts the video segments into independent video blocks; the video coding module carries out bit rate coding on video blocks, converts the video blocks into video files with various bit rates and stores the video files into a first cache region;

the first visual field prediction processing module collects visual field movement tracks of historical users to generate a thermal map file and stores the thermal map file in a first cache region;

the request processing module receives a client video request and takes a video file in the first buffer area as a response;

the video client pre-fetches all video blocks by taking one video segment as a unit according to the time sequence, and performs the following steps each time a video is requested from the video server:

the second view prediction processing module predicts a user view of the upcoming requested video segment;

the video bit rate allocation processing module selects a specific bit rate for each video block and sends a request to the server in units of one video segment, maximizing the user experience quality as in the following equation (1):

Q＝α₁*QoE₁-α₂*QoE₂-α₃*QoE₃-α₄*QoE₄ (1)

wherein (alpha)₁，α₂，α₃，α₄) A non-negative weight parameter; QoE₁Is the average bit rate quality of the video block within the field of view; QoE₂Buffering duration of a video pause for the duration of downloading the video segment; QoE₃Is the spatial difference in bit rate between adjacent video blocks; QoE₄Is the time difference of the bit rate of the adjacent video segments;

the second buffer area stores a video description file which is sent by a server and contains a video segment sequence number and a request set corresponding to the bit rate grade of each video block;

the video tracking feedback module records the visual field moving track of the user in real time and calculates the QoE of the user;

and the video decoding module splices all the video blocks into a complete video and projects the complete video into a panoramic video in a display for a user to watch.

Further, the second visual field prediction processing module predicts the user visual field in a plurality of future video segments according to the following formula according to thermodynamic diagram data information provided in the video description file responded by the request processing module and the visual field moving track recorded by the video tracking feedback module:

P_c+1，…，P_c+n＝LSTM([V_c-m，…，V_c]，[H_c+1，…，H_c+n])

wherein c denotes the sequence number of the current video segment, V_cThe user's view field indicating the time, H_cThermodynamic diagram data information representing the video segment, m, n representing the length of the time window, respectively; predicted result P_c+1Showing the user's center point of view at each videoProbability distribution in the block.

Further, the video bit rate allocation processing module allocates the user experience quality to each video block, and obtains the user experience quality index according to the following formula:

wherein, 0-1 vector

Field of view, vector, representing the current user

Representing the bit rate allocated for each video block, the user experience indicator QoE₂Expressed in real measured video pause duration (in seconds).

Further, the video bit rate allocation processing module allocates an equal bit rate for all video blocks in the same area for each video block allocation bit rate allocation, and the following conditions are satisfied: the view area bit rate ≧ the edge area bit rate ≧ the outer area bit rate.

Advantageous effects

1. The invention designs a panoramic video transmission system from a server side to a client side, which can adaptively adjust the bit rate of a video according to fluctuating network environment, thereby improving the experience quality of a panoramic video user.

2. The invention designs a novel visual field prediction method and a bandwidth self-adaptive bit rate allocation method. Video blocks are allocated into different regions (importance levels) according to the prediction view and the appropriate bit rate is selected adaptively, thereby obtaining better user experience quality.

Drawings

Fig. 1 is a system diagram of a panoramic video stream.

Fig. 2 video pre-processing: projection, blocking and encoding.

Fig. 3 video block area division.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the following detailed discussion of the present invention will be made with reference to the accompanying drawings and examples, which are only illustrative and not limiting, and the scope of the present invention is not limited thereby.

As shown in fig. 1, the present invention provides an adaptive panoramic video streaming system based on long-term visual field prediction, the system includes a video server and a video client, and video content is transmitted between the server and the client through a dynamically changing wireless network; wherein:

the video server 100 comprises a video encoding module 101, a first view prediction processing module 102, a first buffer 103 and a request processing module 104; the method comprises the following specific steps:

the video server 100 processes the panoramic video file offline, wherein: the video coding module 101 divides a two-dimensional video obtained by projection into video segments with equal time length and cuts the video segments into independent video blocks; the video coding module 101 performs bit rate coding on video blocks to convert the video blocks into video files with various bit rates and stores the video files into the first buffer area 103;

the first visual field prediction processing module 102 collects visual field movement tracks of historical users to generate a video file containing thermodynamic diagrams and stores the video file in a first cache region 103;

the request processing module 104 receives a client video request and takes a video file in the first buffer area as a response;

the video client 200 comprises a video decoding module 201, a second view prediction processing module 202, a second buffer 203, a video tracking feedback module 204 and a video bit rate allocation processing module 205; the video client 200 respectively predicts the long-term visual field of the user and allocates the bit rate of the video block through a second visual field prediction processing module 202 and a video bit rate allocation processing module 205 to realize the maximization of the user experience quality;

the video client pre-fetches all video blocks in a video segment unit according to a time sequence, and performs the following operations each time a video is requested from the video server:

the second view prediction processing module 202 predicts a user view of the upcoming requested video segment;

the video bitrate allocation processing module 205 selects a specific bitrate for each video block and sends a request to the server in units of one video segment, maximizing the user experience quality as in the following equation (1):

Q＝α₁*QoE₁-α₂*QoE₂-α₃*QoE₃-α₄*QoE₄ (1)

the second buffer area 203 stores the video file sent by the server;

the video tracking feedback module 204 records the visual field movement track of the user in real time and calculates the QoE of the user;

the video decoding module 201 splices all video blocks into a complete video, and projects the complete video into a panoramic video on a display for a user to watch.

The embodiment of the invention comprises the following steps:

step 1: video offline preprocessing

The video server side projects the panoramic video into a planar video by adopting an ERP projection method, wherein the video coding module divides the video into video segments with equal duration; and each video slice is divided into a number of equal-sized video blocks, wherein each video block is encoded into a plurality of different bit rate versions. The process of video processing is shown in fig. 2;

in addition, the first visual field prediction processing module records the historical visual field movement track of the user and uses the historical visual field movement track to generate a thermodynamic diagram for later visual field prediction. Wherein the thermodynamic diagram

Expressed as the probability that each video block in the same video segment is viewed by the user, is calculated as follows,

where c is the number of the video segment, i is the number of the video block, vector

Representing the user's u field of view within the video clip c.

Step 2: user request for video content

Before actually requesting video data, a video client sends a request panoramic video to a video server; the request processing module responds to and sends out a corresponding video description file, and the video description file contains a generated video thermal map file, trained view prediction data and bit rate distribution data;

the video client requests all video blocks in one video segment each time according to the sequence of the video segments, and allocates different bit rates to the video blocks according to factors such as the visual field of a user, bandwidth fluctuation and the like, wherein the video blocks are mainly completed by a second visual field prediction module and a video bit rate allocation processing module. In each video request:

1) and a second visual field prediction processing module of the video client predicts the user visual field in a plurality of video segments in the future by adopting long-short term memory (LSTM) of an encoder-decoder structure according to the latest collected visual field movement track and thermodynamic diagram information of the video:

P_c+1，…，P_c+n＝LSTM([V_c-m，…，V_c]，[H_c+1，…，H_c+n]) (3)

wherein c denotes the sequence number of the current video segment, V_cThe user's view field indicating the time, H_cThermodynamic data information representing the video segment, m, n each representing the length of a time window. Wherein: the LSTM model for visual field prediction is trained in advance at the server side and sent to the client side.

2) In the prediction of the previous step, the result of prediction P_c+1The probability distribution of the user view center point in each video block is represented. Selecting the video block with the maximum probability as a visual field center, calculating a user visual field area according to the projection relation according to the visual field center point, and dividing all the video blocks into three areas: the division mode of the visual field area, the edge area and the outer area is shown in fig. 3. The video block positioned in the visual field of the user is a visual field area, the video block adjacent to the visual field area is an edge area, and the rest video blocks are external areas.

3) After obtaining the user view, the video bit rate allocation processing module allocates a bit rate to each video block by maximizing the user experience quality shown in formula (1); the specific calculation method of the user experience quality index is as follows:

wherein, 0-1 vector

Field of view, vector, representing the current user

Representing the allocated bit rate for each video block. In addition, the user experience metric QoE2 is expressed in terms of the actual measured video pause duration (in seconds).

The specific bit rate distribution method comprises the following steps: using a reinforcement learning algorithm Soft Actor-Critic to allocate bit rates, allocating an equal bit rate for all video blocks in the same region, and satisfying the following conditions: the view area bit rate ≧ the edge area bit rate ≧ the outer area bit rate. To conserve bandwidth, the outer region is fixed at the lowest bit rate to prioritize bandwidth for transmitting video blocks within the field of view. The video stream distribution module generates bit rate level decisions for the field of view region and the edge region.

4) In the downloading process, the user side records the moving track of the head of the user by using a sensor of the device so as to predict the visual field, and simultaneously records information such as network bandwidth condition, downloading time and the like.

And step 3: video playback

The user end firstly puts the downloaded video segment data in the player buffer area. When the segment is played, all the video blocks are taken out to be spliced into a complete panoramic video picture, and the complete panoramic video picture is finally displayed to a user through projection.

The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. The system comprises a video server and a video client, wherein video content is transmitted between the video server and the video client through a dynamically changing wireless network; the method is characterized in that:

the video server comprises a video coding module, a first view prediction processing module, a first cache region and a request processing module; the video client comprises a video decoding module, a second visual field prediction processing module, a second buffer area, a video tracking feedback module and a video bit rate distribution processing module; wherein: the video client predicts the long-term visual field of the user and allocates the bit rate of the video block to the user through a second visual field prediction processing module and a video bit rate allocation processing module respectively so as to realize the maximized user experience quality; the method comprises the following steps: the video bit rate distribution processing module selects a specific bit rate for each video block, and sends a request to a video server by taking a video clip as a unit, so that the user experience quality of the following formula is maximized:

Q＝α₁*QoE₁-α₂*QoE₂-α₃*QoE₃-α₄*QoE₄

2. A method for long term view prediction using the adaptive panoramic video streaming system of claim 1, wherein:

the video server off-line processing panoramic video file step:

the video coding module divides a two-dimensional video obtained by projection into video segments with equal time length and cuts the video segments into independent video blocks; the video coding module carries out bit rate coding on the video blocks, converts the video blocks into video files with various bit rates and stores the video files into a first cache region;

the video bit rate allocation processing module selects a specific bit rate for each video block and sends a request to the server in units of one video segment, maximizing the user experience quality as in the following formula (1):

Q＝α₁*QoE₁-α₂*QoE₂-α₃*QoE₃-α₄*QoE₄ (1)

the second buffer area stores the video file sent by the server side;

3. The adaptive panoramic video streaming system based on long-term-of-view prediction of claim 1, characterized by: the second visual field prediction processing module predicts the user visual field in a plurality of video segments in the future according to the visual field moving track provided in the video description file responded by the request processing module and thermodynamic diagram data information of the video according to the following formula:

P_c+1，…，P_c+n＝LSTM([V_c-m，…，V_c]，[H_c+1，…，H_c+n])

wherein c represents the sequence number of the current video segment, and m and n respectively represent the length of the time window; predicted result P_c+1The probability distribution of the user view center point in each video block is represented.

4. The adaptive panoramic video streaming system based on long-term-of-view prediction of claim 1, characterized by: the video bit rate distribution processing module distributes the user experience quality to each video block according to the following formula to obtain the user experience quality index:

wherein, 0-1 vector

Field of view, vector, representing the current user

5. The adaptive panoramic video streaming system based on long-term-of-view prediction of claim 4, characterized by: the video bit rate distribution processing module distributes an equal bit rate for all video blocks in the same area for each video block distribution bit rate distribution, and the following conditions are met: the view area bit rate ≧ the edge area bit rate ≧ the outer area bit rate.