CN116017003A - Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods

Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods

Info

Publication number
CN116017003A
CN116017003A CN202310028902.5A
Authority
CN
China
Prior art keywords
video
adaptive
network
code rate
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310028902.5A
Other languages
Chinese (zh)
Inventor
闫彩霞
张凯喆
刘汇川
郑庆华
杜海鹏
王志文
曹坚翔
袁慕遥
王洋
张志浩
张未展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202310028902.5A priority Critical patent/CN116017003A/en
Publication of CN116017003A publication Critical patent/CN116017003A/en
Pending legal-status Critical Current

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides an adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods. A generative adversarial network is used to perform saliency detection on the original video, and according to the detection results the original video is dynamically divided into several spatial blocks and stored on a server. When a video is requested and watched, a long short-term memory (LSTM) network is used to build an extraction model of network trace features and predict the bandwidth at future moments. The predicted bandwidth information and the past viewport trajectory information are taken as the state input of the code rate decision, and the PPO algorithm is used to train an A3C network to decide the corresponding optimal code rate; the corresponding video blocks are downloaded and played according to the code rate decision result. The generative adversarial network is thereby guaranteed to divide the video region effectively and to the maximum extent; the network state can be fully extracted to predict the bandwidth, providing effective input for the code rate adaptive decision; and the viewport-prediction-based method makes maximal use of the network for effective transmission, reduces bandwidth waste, and effectively improves users' viewing quality.

Description

Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods
Technical Field
The invention belongs to the technical field of video transmission, and particularly relates to a self-adaptive VR360 video-on-demand method and system based on a plurality of artificial intelligence methods.
Background
With the widespread use of multimedia technology and intelligent terminals, video services have become one of the main modes of study, work and entertainment. 360-degree video is becoming increasingly popular thanks to tremendous advances in panoramic cameras and head-mounted devices. However, since 360-degree video is typically high resolution, its transmission requires extremely high bandwidth. To protect the quality of experience (QoE) of users, 360-degree video streaming systems based on spatial blocking have been proposed, assigning high or low bit rates to the corresponding spatial blocks so as to give users the highest viewing quality within a limited bandwidth. However, different videos have different points of emphasis: in person-centered video the focus is usually at the center of the frame, while scene-centered video often draws attention to the edges. Applying the same spatial blocking to different videos can therefore lead to inaccurate transmission and wasted bandwidth. Hence dynamic spatial-block division strategies for different videos have been developed: so that the positions a user may focus on fall within the same spatial block, the blocks are divided according to the learned saliency, thereby saving bandwidth and improving user QoE.
To reduce video quality switching delay and improve user QoE, the future network bandwidth at the user side must be predicted so that video blocks of the relevant code rate versions can be pre-fetched in combination with the current network conditions; this is a time-series prediction problem. In bandwidth prediction, the bandwidth change over a period of time can be predicted from the preceding bandwidth time series with a seq2seq model, and long-term prediction performance is further improved by learning feature weights with an attention mechanism, providing better bandwidth estimates for the subsequent adaptive video transmission and playback and ensuring good user experience quality. The FoV (field of view) prediction problem is in essence also a time-series prediction problem, so a similar approach is used to learn both.
According to the applicant's search, the following patents related to the invention in the field of video transmission were found:
CN108063961B, a self-adaptive code rate video transmission method and system based on reinforcement learning.
CN1594307A, subscriber video-on-demand delivery.
The above patent 1 provides an adaptive code rate video transmission method and system based on reinforcement learning. The method performs code rate prediction with a deep neural network: the state space corresponding to the video block to be downloaded is input into the code rate prediction neural network, which outputs a code rate policy; the video block to be downloaded is then fetched according to the code rate policy output by the network. After each video block is downloaded, the video playback quality index corresponding to that block is calculated and returned to the code rate prediction neural network, which trains on the returned quality index and the state space of the most recently downloaded video block. That invention reduces the labor and time cost of rule setting and parameter tuning and greatly improves the video quality experience.
The above patent 2 provides a video delivery system in which video-on-demand content is stored at least partially near the user location. The video delivery system has a large number of generally viewable channels at the user's location and content receivers connected to those channels. One of the channels is used to transmit a hidden video stream that cannot be viewed when it arrives at the user location. The content receiver, located near the user, includes a storage device connected to the hidden channel and a video reproduction circuit.
Patent 1 above uses deep reinforcement learning for prediction: the state space corresponding to the video block to be downloaded is input into a code rate prediction neural network, a code rate policy is output, and the required video block is downloaded according to that policy. The state space in patent 1 includes information such as video block throughput and download time but ignores the influence of network bandwidth information on viewing quality and its accurate measurement, so when the network bandwidth fluctuates severely the method can hardly give a good code rate policy, hurting user QoE. Moreover, the method is designed for traditional video and lacks factors essential to 360 video, such as saliency regions and the FoV. Patent 2 provides only an optimization for ordinary video-on-demand transmission and is not fully applicable to 360 video.
Summary of the Invention
In order to solve the problems in the prior art, the invention provides an adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods, which combine several advanced techniques (saliency region detection, dynamic spatial-block division, bandwidth prediction, viewport change prediction and code rate adaptive decision) to provide, for the first time, a method addressing the high bandwidth consumption of VR360 video.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: an adaptive VR360 video-on-demand method based on multiple artificial intelligence methods, comprising the steps of:
step 1, processing the original video with an attention-based generative adversarial network to generate saliency regions, performing dynamic spatial-block division according to the saliency regions, and storing the generated new video on a server;
step 2, building an extraction model of network trace features with a long short-term memory network to measure the network bandwidth;
and step 3, taking the bandwidth prediction result and the viewport change trajectory as the state input of the code rate adaptive decision, training an A3C network with the PPO algorithm to decide the corresponding optimal code rate, selecting the video file of the corresponding code rate based on the server's optimal code rate adaptation result, and downloading it to the buffer for decoding.
In step 1, the video is extracted frame by frame, and salient region generation is then performed on the frames using a generative adversarial network model; the specific steps are as follows:
step 1.1, salient region identification is performed through an attention-based generative adversarial network; the overall structure of the generator in the adversarial network is an encoder-decoder that generates the salient region map, while the discriminator judges whether its input is a predicted map or a real map, with the goal of making the predicted and real maps indistinguishable, so that a predicted map close to the real map is output;
and step 1.2, the generated salient region map is further processed: the obtained salient regions are processed with the Minimal Overlapping Cover algorithm, and different dynamic spatial blocks are divided.
In step 1.1, the encoder uses a VGG-16 model, with initial parameters obtained through ImageNet pre-training; the decoder generates feature maps whose sizes correspond layer by layer to those of the encoder, progressively upsampling the features obtained by the encoder; the discriminator consists of three convolutional layers, the last of which uses a sigmoid function for classification.
In step 1.2, the region is divided into three parts: a core region, an edge region and an irrelevant region; dynamic spatial blocks are generated according to this region division, the original video file is processed according to the generated dynamic spatial block positions, and a new DASH video is generated, stored on the server side, and awaits the client's request and playback.
When the extraction model of network trace features is built with the long short-term memory network to measure the network bandwidth, a bandwidth prediction model is built on the server side and the network bandwidth is predicted from bandwidth history data; the bandwidth prediction model adopts a seq2seq model with an attention mechanism, where the encoder layer is a single-layer bidirectional GRU, the decoder layer is a single-layer unidirectional GRU, and the attention layer is a fully connected neural network.
Step 3 specifically comprises the following steps:
the client interacts with the server, and the server obtains the view angle changes sent by the client, thereby deriving the expected view angle;
the server takes the bandwidth prediction result, the buffer and the expected view angle as the state space of the code rate adaptive decision; code rate selection is implemented with the reinforcement learning algorithm PPO on the A3C framework, and through the interaction of the three elements of environment state, action and reward function, the optimal code rate adaptation policy π is finally obtained;
The client selects the video file corresponding to the code rate to download to the buffer area and decode; and rendering to the Unity system for playing according to the corresponding synchronous rendering logic, and continuously collecting the visual angle change of the user.
When the client interacts with the server, the user watches using a mobile VR device, and the mobile VR device, acting as the client, collects viewport change data in real time and sends it back to the server.
The environment state comprises the estimated viewport position when the current video block is requested, the estimated bandwidth value, the buffer occupancy, and the saliency region positions of the current video block; the action space is the code rate allocation policy for the different spatial blocks of the current video block; and the reward is the QoE metric value after a playback segment ends;
when the server generates the dynamic spatial-block video, each spatial block is numbered by (i, j), the size of the spatial block of the c-th video block at (i, j) under code rate r is denoted d_c,ij(r), and the size of the c-th video block is z(c), with

z(c) = Σ_(i,j) d_c,ij(r)

a matrix V = (v_ij), v_ij ∈ {0, 1}, is defined to represent whether the (i, j) spatial block is in the viewport; the sum of the code rates in the viewport is then:

R_c = Σ_(i,j) v_ij · d_c,ij(r)
The bandwidth at timestamp t is N(t); the user initiates the request for the c-th video block at time t_c, the average download speed of this block is N_c, and there is a delay Δt_c between the c-th and the (c+1)-th video block; then

t_(c+1) = t_c + z(c)/N_c + Δt_c

N_c = ( ∫ from t_c to t_(c+1)-Δt_c of N(t) dt ) / ( t_(c+1) - Δt_c - t_c )
The video blocks are stored in a playback buffer after being downloaded; the buffer is divided into several slots, each holding one data block. B(t) ∈ [0, B_max] is defined as the buffer occupancy at timestamp t, where B_max denotes the buffer capacity; B_max is set to a few seconds of video content. In the start-up phase, c video blocks are downloaded to ensure the buffer is not empty, after which the buffer occupancy is B(t_(c+1)) = c·T, where T is the playback duration of one block. After the start-up phase, each time a new block is downloaded, the buffer is updated as:

B(t_(c+1)) = ρ( B(t_c) - z(c)/N_c - Δt_c ) + T

where ρ(x) = max{x, 0};
the QoE metric is determined from three factors that influence user experience: average viewport quality, rebuffering time, and average viewport quality variation; the QoE metric is obtained as their weighted sum.
When the video file of the corresponding code rate is selected, downloaded to the buffer and decoded, the segmented video is downloaded according to the mpd file, the video is rendered onto a Unity material sphere for playing, and the newly collected viewport change trajectory is fed back to the server side for subsequent prediction.
The invention also provides an adaptive VR360 video-on-demand system based on multiple artificial intelligence methods, comprising a client and a server that transmit video data over a network, the client being a VR device; during video-on-demand, the client and the server operate according to the method of the invention.
Compared with the prior art, the invention has at least the following beneficial effects: the invention aims at a truly immersive experience, with the user selecting and watching videos on a VR head-mounted display; several techniques for improving user experience are used, such as dynamic spatial-block division for VR360 video, bandwidth prediction and code rate adaptive decision, so that user experience is maximized; the method ensures that the generative adversarial network effectively generates salient regions and divides the video region to the maximum extent according to the generation result; the network state can be fully extracted to predict the bandwidth, providing effective input for the code rate adaptive decision; and the viewport-prediction-based method makes maximal use of the network for effective transmission, reduces bandwidth waste and effectively improves users' viewing quality.
Drawings
Fig. 1 is a schematic diagram of the structure of the present invention.
Fig. 2 is a flow chart of the algorithm of the present invention.
Fig. 3 is a schematic diagram of processing a salient region.
Fig. 4 is a schematic diagram of a bandwidth prediction model structure.
Detailed Description
The technical scheme of the invention is described in detail below with reference to specific application examples.
Referring to fig. 1, 2 and 4, the present invention provides an adaptive VR360 video-on-demand method based on multiple artificial intelligence methods, comprising the following steps:
Step 1, the original video is processed with an attention-based generative adversarial network to generate saliency regions, dynamic spatial-block division is performed according to the saliency regions, and the generated new video is stored on a server for later request and playback;
Step 1.1, salient region identification is performed through an attention-based generative adversarial network. The overall structure of the generator in the adversarial network is an encoder-decoder responsible for generating the saliency map; the discriminator judges whether its input is a predicted map or a real map, aiming to make the predicted and real maps indistinguishable, so that a predicted map close to the real map is output. The encoder part of the generator uses a VGG-16 model, with initial parameters obtained through ImageNet pre-training; the decoder part progressively generates feature maps whose sizes correspond to the encoder's layers, progressively upsampling the features obtained by the encoder. The discriminator consists of three convolutional layers, the last of which uses a sigmoid function for classification.
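For illustration only, a minimal PyTorch sketch of the discriminator described above is given below: three convolutional layers ending in a sigmoid classifier. The channel widths, kernel sizes and the choice to condition on the frame alongside the saliency map are assumptions; the text fixes only the layer count and the sigmoid output.

```python
import torch
import torch.nn as nn

class SaliencyDiscriminator(nn.Module):
    """Three convolutional layers with a sigmoid classifier at the end,
    judging whether a saliency map is real (ground truth) or predicted.
    Widths, kernels and frame conditioning are illustrative assumptions."""

    def __init__(self, in_channels: int = 4):  # RGB frame + 1-channel map
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, frame: torch.Tensor, sal_map: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame, sal_map], dim=1)   # condition on the frame
        score = self.net(x).mean(dim=(1, 2, 3))  # pool to one score per sample
        return torch.sigmoid(score)              # probability "real"
```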
Step 1.2, the generated saliency map is further processed to divide the different spatial blocks. The obtained salient regions are processed with the Minimal Overlapping Cover (MNC) algorithm. The region is divided into three parts: core region, edge region and irrelevant region, and spatial blocks are generated according to this region division. The generation process is shown in fig. 3, where a is the original video picture, b is the picture after saliency processing, and c is the fine-grained division of the salient regions into many small spatial tiles; a sketch of this tile classification is given below. Finally, the original video file is processed according to the generated spatial block positions, and a new DASH video is generated, stored on the server side, and awaits the client's request and playback.
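The internals of the MNC step are not spelled out in the text, but the classification that feeds it can be sketched as follows: fine-granularity tiles are labeled core, edge or irrelevant by their mean saliency. This is a hedged illustration in which the grid size and the two thresholds are assumptions.

```python
import numpy as np

def classify_tiles(saliency: np.ndarray, rows: int, cols: int,
                   core_thr: float = 0.6, edge_thr: float = 0.3) -> np.ndarray:
    """Split a saliency map into a rows x cols grid of fine tiles and label
    each tile core / edge / irrelevant by its mean saliency. The thresholds
    are illustrative assumptions; the text does not fix them."""
    h, w = saliency.shape
    th, tw = h // rows, w // cols
    labels = np.empty((rows, cols), dtype=object)
    for i in range(rows):
        for j in range(cols):
            m = saliency[i * th:(i + 1) * th, j * tw:(j + 1) * tw].mean()
            labels[i, j] = ("core" if m >= core_thr
                            else "edge" if m >= edge_thr
                            else "irrelevant")
    return labels
```

Adjacent tiles carrying the same label would then be merged into the larger dynamic spatial blocks by the Minimal Overlapping Cover step.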
Step 2, a bandwidth prediction model is built on the server side, and the network bandwidth is predicted from bandwidth history data; the bandwidth prediction model is a seq2seq model with an attention mechanism. The encoder layer is a single-layer bidirectional GRU, the decoder layer is a single-layer unidirectional GRU, and the attention layer is a fully connected neural network. The network model structure is shown in fig. 4.
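A minimal PyTorch sketch of this predictor follows, assuming one scalar bandwidth sample per time step; the hidden size, prediction horizon and exact form of the attention scorer are assumptions beyond what the text specifies (single-layer bidirectional GRU encoder, single-layer unidirectional GRU decoder, fully connected attention layer).

```python
import torch
import torch.nn as nn

class BandwidthSeq2Seq(nn.Module):
    """Seq2seq bandwidth predictor: bidirectional GRU encoder, unidirectional
    GRU decoder, fully connected attention. Sizes are assumed values."""

    def __init__(self, hidden: int = 64, horizon: int = 5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(input_size=1, hidden_size=hidden)
        self.attn = nn.Linear(3 * hidden, 1)  # FC scorer over decoder+encoder states
        self.out = nn.Linear(3 * hidden, 1)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, 1) past bandwidth samples
        enc_out, _ = self.encoder(history)            # (batch, T, 2*hidden)
        h = history.new_zeros(history.size(0), self.decoder.hidden_size)
        x = history[:, -1, :]                         # seed with the last sample
        preds = []
        for _ in range(self.horizon):
            h = self.decoder(x, h)
            h_exp = h.unsqueeze(1).expand(-1, enc_out.size(1), -1)
            scores = self.attn(torch.cat([h_exp, enc_out], dim=-1))
            weights = torch.softmax(scores, dim=1)    # attention over time steps
            context = (weights * enc_out).sum(dim=1)  # (batch, 2*hidden)
            x = self.out(torch.cat([h, context], dim=-1))  # next bandwidth value
            preds.append(x)
        return torch.stack(preds, dim=1)              # (batch, horizon, 1)
```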
Step 3, the bandwidth prediction result and the viewport change trajectory are input as the state of the code rate adaptive decision, and the client selects the video file of the corresponding code rate based on the server's code rate adaptation result, downloads it to the buffer and decodes it.
Step 3.1, the client interacts with the server, and the server obtains the view angle changes sent by the client. The user watches on a mobile VR device, and the device collects viewport change data in real time and sends it back to the server.
Step 3.2, the server takes the bandwidth prediction result, the buffer and the expected view angle as the state space of the code rate adaptive decision; the selection of the code rate is implemented with the reinforcement learning algorithm PPO (Proximal Policy Optimization) on the A3C framework, and through the interaction of the three elements of environment state, action and reward function, the optimal code rate adaptation policy π is finally obtained.
The environment state comprises the estimated viewport position when the current video block is requested, the estimated bandwidth value, the buffer occupancy, and the saliency region positions of the current video block. The action space is the code rate allocation policy for the different spatial blocks of the current video block. The reward is the QoE metric value after a playback segment ends.
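The text names PPO on the A3C framework without giving the update rule; for reference, a sketch of the standard clipped PPO surrogate that such an actor would minimize is shown below. The clipping constant 0.2 is the common default, not a value taken from the text.

```python
import torch

def ppo_clip_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
                  advantage: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate for the actor that maps the state (predicted
    bandwidth, buffer occupancy, expected viewport, saliency positions)
    to a code rate allocation action."""
    ratio = torch.exp(new_logp - old_logp)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # ascend reward -> minimize negative
```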
The following characteristic indices are defined: when the server generates the dynamic spatial-block video, the spatial blocks are numbered by (i, j), the size of the spatial block of the c-th video block at (i, j) under code rate r is denoted d_c,ij(r), and the size of the c-th video block is z(c), with

z(c) = Σ_(i,j) d_c,ij(r)

A matrix V = (v_ij), v_ij ∈ {0, 1}, is defined to represent whether the (i, j) spatial block is in the viewport. The sum of the code rates in the viewport is then defined as:

R_c = Σ_(i,j) v_ij · d_c,ij(r)
The bandwidth at timestamp t is defined as N(t); assuming the user initiates the request for the c-th video block at time t_c, the average download speed of this block is N_c, and there is a short delay Δt_c between the c-th and the (c+1)-th video block, then

t_(c+1) = t_c + z(c)/N_c + Δt_c

N_c = ( ∫ from t_c to t_(c+1)-Δt_c of N(t) dt ) / ( t_(c+1) - Δt_c - t_c )
The video blocks are stored in a playback buffer after being downloaded; the buffer is divided into several slots, each holding one data block. B(t) ∈ [0, B_max] is defined as the buffer occupancy at timestamp t, where B_max denotes the buffer capacity. Because viewing 360-degree video requires a large amount of interaction, B_max is typically set to a few seconds of video content. The start-up phase requires downloading c video blocks to ensure that the buffer is not empty; the buffer occupancy at this point is B(t_(c+1)) = c·T, where T is the playback duration of one block. After the start-up phase, each time a new block is downloaded, the buffer is updated as follows:

B(t_(c+1)) = ρ( B(t_c) - z(c)/N_c - Δt_c ) + T

where ρ(x) = max{x, 0}.
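The buffer recursion above can be exercised with a small helper. Treating z(c)/N_c + Δt_c as the download time of one block and T as its playback duration follows the definitions in the text; clamping the result at B_max is an assumption implied by B(t) ∈ [0, B_max].

```python
def buffer_after_download(buffer_s: float, z_c_bits: float, n_c_bps: float,
                          delay_s: float, chunk_dur_s: float,
                          b_max_s: float) -> float:
    """B(t_{c+1}) = rho(B(t_c) - z(c)/N_c - delta_t_c) + T, rho(x) = max(x, 0).
    A rho(...) of zero means playback stalled (rebuffering) during download."""
    rho = lambda x: max(x, 0.0)
    download_s = z_c_bits / n_c_bps + delay_s   # z(c)/N_c + delta_t_c
    return min(rho(buffer_s - download_s) + chunk_dur_s, b_max_s)
```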
Next, the QoE metric is defined. Considering the video blocks c = 1, …, C of a playback segment, several factors across video blocks can affect the user experience:
Average viewport quality: since during VR360 video viewing only the video blocks in the viewport are seen by the user, only the video in the viewport is considered to affect the user's experience, and the average viewport quality is:

Q_v = (1/C) Σ_(c=1..C) R_c
Rebuffering time: rebuffering is an event that obviously degrades the user experience; it can be calculated as:

T_rebuf = Σ_(c=1..C) ρ( z(c)/N_c - B(t_c) )
Average viewport quality variation: a stable code rate within the viewport is considered to give the user a more comfortable experience, while frequent code rate switching causes discomfort, so this index is quantified as:

Q_s = ( 1/(C-1) ) Σ_(c=1..C-1) | R_(c+1) - R_c |
Finally, QoE is defined as the weighted sum of the three indices above, with the weights adjusted according to the requirements of the specific scenario; since rebuffering and code rate switching degrade the experience, they enter with negative weight:

QoE = w_1·Q_v - w_2·T_rebuf - w_3·Q_s
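Putting the three indices together, the QoE computation might look like the following sketch; the weight values are illustrative assumptions, and the sign convention (rebuffering and switching act as penalties) follows the discussion above.

```python
import numpy as np

def viewport_rate(v: np.ndarray, d: np.ndarray) -> float:
    """R_c = sum_ij v_ij * d_c,ij(r): code rate delivered inside the viewport,
    with v a 0/1 viewport mask and d the per-tile code rate sizes."""
    return float((v * d).sum())

def qoe(rates, rebuffer_s: float, w1: float = 1.0, w2: float = 4.0,
        w3: float = 1.0) -> float:
    """Weighted sum of average viewport quality, rebuffering time and
    viewport quality variation. Weights are scenario-dependent assumptions."""
    rates = np.asarray(rates, dtype=float)    # R_1 .. R_C per video block
    q_view = rates.mean()
    q_var = np.abs(np.diff(rates)).mean() if rates.size > 1 else 0.0
    return w1 * q_view - w2 * rebuffer_s - w3 * q_var
```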
Step 3.3, the client selects the video file of the corresponding code rate, downloads it to the buffer and decodes it; the video is rendered to the Unity system for playing according to the corresponding synchronous rendering logic while the user's view angle changes are continuously collected. The client downloads the segmented video according to the mpd file, renders the video onto a Unity material sphere for playing, and feeds the newly collected viewport change trajectory back to the server for subsequent prediction.
The method of the present invention is described in detail through the following steps.
Step 1, the original video is placed on the server, frame extraction is implemented with ffmpeg, salient regions are detected for each segment of video, and the salient regions are divided for the video frames. Small-granularity spatial tiles are divided for the video and then merged according to the saliency regions to generate large-granularity spatial blocks. An mpd file is generated from the resulting spatial blocks and stored on the server side, awaiting calls.
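The frame extraction could be driven from Python as sketched below; the output naming scheme and the one-frame-per-second sampling rate are assumptions, since the text only states that ffmpeg performs the frame extraction.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
    """Extract frames with ffmpeg for saliency detection; fps=1 keeps one
    frame per second, an assumed sampling rate."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         str(Path(out_dir) / "frame_%05d.png")],
        check=True,
    )
```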
Step 2, the client and the server establish a connection. To play a video, the client first downloads the mpd file, then downloads the initial spatial blocks, and playback starts once the buffer meets the playback threshold. The client begins to track changes in the user's viewport and sends them back to the server.
Step 3, the server monitors bandwidth changes in real time and uses the seq2seq model to predict the bandwidth.
Step 4, the server receives the user viewport changes fed back by the client, makes code rate decisions with the code rate adaptive decision model in combination with the bandwidth prediction values, saliency regions and other information, and distributes the corresponding video blocks to the client. Return to step 2 until the video ends.
With the above technical scheme, the invention provides an adaptive VR360 video-on-demand system based on multiple artificial intelligence methods, so as to improve the QoE of users viewing bandwidth-hungry VR360 immersive video. In this method, a generative adversarial network first performs the generation task, segmenting the initial VR360 video to produce salient regions. Spatial blocks are then generated for the video with the MNC algorithm according to the saliency regions, so as to achieve maximal bandwidth savings. During playback, the client transmits the user's viewport changes to the server in real time. The server uses seq2seq network models to predict the bandwidth and the viewport changes respectively, producing a predicted bandwidth and a predicted viewport change value. The server side then makes the code rate adaptive decision from the various state information and distributes video to the client according to the decision results.

Claims (10)

1. An adaptive VR360 video-on-demand method based on a plurality of artificial intelligence methods, comprising the steps of:
step 1, processing the original video with an attention-based generative adversarial network to generate saliency regions, performing dynamic spatial-block division according to the saliency regions, and storing the generated new video on a server;
step 2, building an extraction model of network trace features with a long short-term memory network to measure the network bandwidth;
and step 3, taking the bandwidth prediction result and the viewport change trajectory as the state input of the code rate adaptive decision, training an A3C network with the PPO algorithm to decide the corresponding optimal code rate, selecting the video file of the corresponding code rate based on the server's optimal code rate adaptation result, and downloading it to the buffer for decoding.
2. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 1, wherein in step 1 the video is extracted frame by frame and salient region generation is then performed on the frames using a generative adversarial network model, with the following specific steps:
step 1.1, salient region identification is performed through an attention-based generative adversarial network; the overall structure of the generator in the adversarial network is an encoder-decoder that generates the salient region map, while the discriminator judges whether its input is a predicted map or a real map, with the goal of making the predicted and real maps indistinguishable, so that a predicted map close to the real map is output;
and step 1.2, the generated salient region map is further processed: the obtained salient regions are processed with the Minimal Overlapping Cover algorithm, and different dynamic spatial blocks are divided.
3. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 2, wherein in step 1.1 the encoder uses a VGG-16 model, with initial parameters obtained through ImageNet pre-training; the decoder generates feature maps whose sizes correspond layer by layer to those of the encoder, progressively upsampling the features obtained by the encoder; the discriminator consists of three convolutional layers, the last of which uses a sigmoid function for classification.
4. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 2, wherein in step 1.2 the region is divided into three parts: a core region, an edge region and an irrelevant region; dynamic spatial blocks are generated according to this region division, the original video file is processed according to the generated dynamic spatial block positions, and a new DASH video is generated, stored on the server side, and awaits the client's request and playback.
5. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 1, wherein when the extraction model of network trace features is built with the long short-term memory network to measure the network bandwidth, a bandwidth prediction model is built on the server side and the network bandwidth is predicted from bandwidth history data; the bandwidth prediction model adopts a seq2seq model with an attention mechanism, where the encoder layer is a single-layer bidirectional GRU, the decoder layer is a single-layer unidirectional GRU, and the attention layer is a fully connected neural network.
6. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 1, wherein step 3 specifically comprises the following steps:
the client interacts with the server, and the server obtains the view angle changes sent by the client, thereby deriving the expected view angle;
the server takes the bandwidth prediction result, the buffer and the expected view angle as the state space of the code rate adaptive decision; code rate selection is implemented with the reinforcement learning algorithm PPO on the A3C framework, and through the interaction of the three elements of environment state, action and reward function, the optimal code rate adaptation policy π is finally obtained;
The client selects the video file corresponding to the code rate to download to the buffer area and decode; and rendering to the Unity system for playing according to the corresponding synchronous rendering logic, and continuously collecting the visual angle change of the user.
7. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 6, wherein when the client and the server interact, the user watches using a mobile VR device, and the mobile VR device, acting as the client, collects viewport change data in real time and sends it back to the server.
8. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 6, wherein the environment state includes the estimated viewport position when the current video block is requested, the estimated bandwidth value, the buffer occupancy, and the saliency region positions of the current video block; the action space is the code rate allocation policy for the different spatial blocks of the current video block; and the reward is the QoE metric value after a playback segment ends;
when the server generates the dynamic spatial-block video, the spatial blocks are numbered by (i, j), the size of the spatial block of the c-th video block at (i, j) under code rate r is denoted d_c,ij(r), and the size of the c-th video block is z(c), with

z(c) = Σ_(i,j) d_c,ij(r)

a matrix V = (v_ij), v_ij ∈ {0, 1}, is defined to represent whether the (i, j) spatial block is in the viewport; the sum of the code rates in the viewport is then:

R_c = Σ_(i,j) v_ij · d_c,ij(r)
the bandwidth at timestamp t is N(t); the user initiates the request for the c-th video block at time t_c, the average download speed of this block is N_c, and there is a delay Δt_c between the c-th and the (c+1)-th video block; then

t_(c+1) = t_c + z(c)/N_c + Δt_c

N_c = ( ∫ from t_c to t_(c+1)-Δt_c of N(t) dt ) / ( t_(c+1) - Δt_c - t_c )
the video blocks are stored in a playback buffer after being downloaded; the buffer is divided into several slots, each holding one data block; B(t) ∈ [0, B_max] is defined as the buffer occupancy at timestamp t, where B_max denotes the buffer capacity and B_max is set to a few seconds of video content; in the start-up phase, c video blocks are downloaded to ensure the buffer is not empty, after which the buffer occupancy is B(t_(c+1)) = c·T, where T is the playback duration of one block; after the start-up phase, each time a new block is downloaded, the buffer is updated as:

B(t_(c+1)) = ρ( B(t_c) - z(c)/N_c - Δt_c ) + T

where ρ(x) = max{x, 0};
the QoE metric is determined from three factors that influence user experience: average viewport quality, rebuffering time, and average viewport quality variation; the QoE metric is obtained as their weighted sum.
9. The adaptive VR360 video on demand method based on multiple artificial intelligence methods of claim 6, wherein when the video file of the corresponding code rate is selected, downloaded to the buffer and decoded, the segmented video is downloaded according to the mpd file, the video is rendered onto a Unity material sphere for playing, and the newly collected viewport change trajectories are fed back to the server for subsequent prediction.
10. An adaptive VR360 video-on-demand system based on multiple artificial intelligence methods, characterized by comprising a client and a server that transmit video data over a network, the client being a VR device; during video-on-demand, the client and the server operate according to the method of any one of claims 1 to 9.
CN202310028902.5A 2023-01-09 2023-01-09 Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods Pending CN116017003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310028902.5A CN116017003A (en) 2023-01-09 2023-01-09 Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310028902.5A CN116017003A (en) 2023-01-09 2023-01-09 Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods

Publications (1)

Publication Number Publication Date
CN116017003A true CN116017003A (en) 2023-04-25

Family

ID=86019536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310028902.5A Pending CN116017003A (en) 2023-01-09 2023-01-09 Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods

Country Status (1)

Country Link
CN (1) CN116017003A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116996661A (en) * 2023-09-27 2023-11-03 中国科学技术大学 Three-dimensional video display method, device, equipment and medium
CN116996661B (en) * 2023-09-27 2024-01-05 中国科学技术大学 Three-dimensional video display method, device, equipment and medium
CN117156175A (en) * 2023-10-30 2023-12-01 山东大学 Panoramic video stream QoE optimization method based on visual port prediction distance control
CN117156175B (en) * 2023-10-30 2024-01-30 山东大学 Panoramic video stream QoE optimization method based on visual port prediction distance control

Similar Documents

Publication Publication Date Title
CN116017003A (en) Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods
Zhang et al. DRL360: 360-degree video streaming with deep reinforcement learning
Sun et al. Flocking-based live streaming of 360-degree video
CN112291620A (en) Video playing method and device, electronic equipment and storage medium
CN101262443B (en) A self-adapted real-time transmission method for mobile phone stream media
Wu et al. Dynamic resource allocation via video content and short-term traffic statistics
CN103905820A (en) Client side video quality self-adaption method and system based on SVC
Park et al. Navigation graph for tiled media streaming
Fu et al. Sequential reinforced 360-degree video adaptive streaming with cross-user attentive network
Jiang et al. A hierarchical buffer management approach to rate adaptation for 360-degree video streaming
CN113783944B (en) Video data processing method, device, system and equipment based on cloud edge cooperation
CN112714315A (en) Layered buffering method and system based on panoramic video
CN114584801A (en) Video resource caching method based on graph neural network recommendation algorithm
US20220408097A1 (en) Adaptively encoding video frames using content and network analysis
CN108810468B (en) Video transmission device and method for optimizing display effect
Feng et al. Perceptual quality aware adaptive 360-degree video streaming with deep reinforcement learning
CN113162895B (en) Dynamic coding method, streaming media quality determination method and electronic equipment
CN112672227B (en) Service processing method, device, node and storage medium based on edge node
CN112751865B (en) Data uplink optimization method and device
Huang et al. QoE-driven mobile 360 video streaming: Predictive view generation and dynamic tile selection
Zhang et al. Deep reinforcement learning based adaptive 360-degree video streaming with field of view joint prediction
CN111586414B (en) SVC and DASH-based 360-degree video stream scheduling method
Liu et al. Throughput Prediction-Enhanced RL for Low-Delay Video Application
Zeynali et al. BOLA360: Near-optimal View and Bitrate Adaptation for 360-degree Video Streaming
Li et al. CAST: An Intricate-Scene Aware Adaptive Bitrate Approach for Video Streaming via Parallel Training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination