CN115037962B - Video adaptive transmission method, device, terminal equipment and storage medium - Google Patents


Publication number: CN115037962B
Application number: CN202210609323.5A
Authority: CN (China)
Prior art keywords: video, user, video content, result, decision
Legal status: Active (an assumption by Google, not a legal conclusion)
Original language: Chinese (zh)
Other versions: CN115037962A
Inventors: 王琦, 程志鹏, 李康敬, 杨忠尧, 张志浩, 张源鸿, 张未展
Current and original assignees: China Mobile Communications Group Co Ltd; Xian Jiaotong University; MIGU Video Technology Co Ltd; MIGU Culture Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by: China Mobile Communications Group Co Ltd, Xian Jiaotong University, MIGU Video Technology Co Ltd, MIGU Culture Technology Co Ltd
Priority application: CN202210609323.5A
Published as CN115037962A; granted and published as CN115037962B
Legal status: Active

Classifications

    • H Electricity
    • H04 Electric communication technique
    • H04N Pictorial communication, e.g. television
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; operations thereof
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/25866 Management of end-user data

Abstract

The invention discloses a video adaptive transmission method, an apparatus, a terminal device and a storage medium. Panoramic video data are acquired; an adaptive decision is made based on a pre-obtained user viewing angle prediction result to obtain a decision result; the panoramic video data are adjusted according to the decision result to obtain video content; and the video content and the decision result are sent to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video. By predicting the user viewing angle in advance, user preferences can be identified accurately over the long term; by making the adaptive decision based on the prediction result, adjusting the panoramic video data accordingly and sending the video content and the decision result to the player, the video quality can be improved while adapting to the available bandwidth, thereby improving the user's immersive viewing experience of the panoramic video.

Description

Video adaptive transmission method, device, terminal equipment and storage medium
Technical Field
The present invention relates to the field of video transmission technologies, and in particular, to a video adaptive transmission method, apparatus, terminal device, and storage medium.
Background
360° panoramic video is an emerging video application that gives people an immersive viewing experience. As HTTP adaptive streaming (HAS) has become the mainstream technology for streaming media distribution, adaptive transmission of 360° panoramic video streams plays a vital role in ensuring a good viewing experience for users.
At present, among 360° panoramic video stream adaptive transmission strategies, one approach analyses salient feature points in the 360° panoramic video with a mathematical method and then determines the position of the user's viewing angle from the positions of those feature points; however, this approach ignores user preferences, and its prediction accuracy fluctuates considerably as the user switches viewpoints. Another approach uses a recurrent neural network to predict the user's future viewing angle by learning the temporal relationship between the user's viewpoints at different times, but it ignores the influence of changes in the 360° video content on the user's viewing angle, so when the video content changes substantially, this approach cannot accurately predict the change of the user's viewing angle.
Therefore, a solution that improves the user's immersive viewing experience of panoramic video is needed.
Disclosure of Invention
The invention mainly aims to provide a video adaptive transmission method, apparatus, terminal device and storage medium, aiming at improving the user's immersive viewing experience of panoramic video.
To achieve the above object, the present invention provides a video adaptive transmission method applied to a server, the method including the following steps:
acquiring panoramic video data;
making an adaptive decision based on a pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video.
Optionally, before the step of making an adaptive decision based on the pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content to a player, the method further includes:
acquiring the user's head motion trajectory and panoramic video images;
encoding the user's head motion trajectory to obtain temporal feature information;
extracting salient features of the panoramic video images to obtain user preference features;
and obtaining the user viewing angle prediction result according to the temporal feature information and the user preference features.
Optionally, the step of obtaining the user viewing angle prediction result according to the temporal feature information and the user preference features includes:
decoding the temporal feature information through a decoder to obtain a predicted viewing-angle motion trajectory for the current frame image;
and integrating the predicted viewing-angle motion trajectory with the user preference features through a fully connected neural network to obtain the user viewing angle prediction result for the current frame image.
Optionally, after the step of integrating the predicted viewing-angle motion trajectory with the user preference features through the fully connected neural network to obtain the user viewing angle prediction result for the current frame image, the method further includes:
inputting the user viewing angle prediction result of the current frame image into the decoder for decoding to obtain the predicted viewing-angle motion trajectory of the next frame image, taking the next frame image as the current frame image, and returning to the step of integrating the predicted viewing-angle motion trajectory with the user preference features through the fully connected neural network to obtain the user viewing angle prediction result of the current frame image, until all video images are processed.
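The frame-by-frame loop above can be sketched in a few lines. This is an illustrative stand-in, not the patent's actual networks: `decoder_step` plays the role of one LSTM decoder step, `fuse` plays the role of the fully connected integration, and the key point is that the fused (corrected) viewpoint, not the raw decoder output, is fed back for the next frame.

```python
# Hypothetical sketch of the corrected autoregressive prediction loop: the
# decoder's per-frame output is fused with a content-preference feature, and
# the fused viewpoint is fed back as the next decoder input.

def decoder_step(prev_viewpoint, state):
    # Stand-in for one LSTM decoder step: drift the viewpoint slightly
    # while carrying a running scalar state (a real model learns this).
    yaw, pitch = prev_viewpoint
    state = 0.9 * state + 0.1 * yaw
    return (yaw + state * 0.01, pitch), state

def fuse(predicted_viewpoint, preference_feature, alpha=0.3):
    # Stand-in for the fully connected fusion: pull the temporal prediction
    # toward the content-salient location by a fixed weight.
    return tuple((1 - alpha) * p + alpha * s
                 for p, s in zip(predicted_viewpoint, preference_feature))

def predict_viewpoints(start, preference_features):
    viewpoint, state, out = start, 0.0, []
    for pref in preference_features:      # one preference entry per frame
        raw, state = decoder_step(viewpoint, state)
        viewpoint = fuse(raw, pref)       # corrected result is fed back
        out.append(viewpoint)
    return out

track = predict_viewpoints((10.0, 0.0), [(12.0, 1.0), (14.0, 2.0), (16.0, 3.0)])
```

With the toy weights above, the predicted yaw is steadily drawn toward the salient targets, which is the corrective effect the fusion step is meant to provide.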
Optionally, the step of making an adaptive decision based on the pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content to a player includes:
tiling the panoramic video data to obtain video tiles, and obtaining a network state estimation result of the player;
making an adaptive decision according to the network state estimation result and the user viewing angle prediction result through a multi-decision reinforcement learning model to obtain the decision result, wherein the decision result includes a target bitrate and a reconstruction strategy;
repackaging each of the video tiles according to the target bitrate to obtain the video content;
and sending the video content to the player.
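The server-side steps above can be sketched as follows. The names and the simple greedy rule are assumptions made for illustration; the patent's decision is produced by a multi-decision reinforcement learning model, not a fixed rule. Tiles inside the predicted viewing angle get the highest bitrate the estimated bandwidth allows, tiles outside it get the lowest, and a low-bitrate in-view tile is flagged for super-resolution reconstruction at the player.

```python
# Illustrative server-side decision: per-tile (bitrate, reconstruct) pairs.

BITRATES = [1.0, 2.5, 5.0]  # Mbps levels per tile (illustrative ladder)

def decide(tiles_in_fov, total_tiles, bandwidth_mbps):
    decision = {}
    for tile in range(total_tiles):
        if tile in tiles_in_fov:
            # highest affordable level for visible tiles
            budget = bandwidth_mbps / max(len(tiles_in_fov), 1)
            rate = max([b for b in BITRATES if b <= budget],
                       default=BITRATES[0])
            # flag low-rate visible tiles for super-resolution upscaling
            reconstruct = rate < BITRATES[-1]
        else:
            rate, reconstruct = BITRATES[0], False
        decision[tile] = (rate, reconstruct)
    return decision

d = decide(tiles_in_fov={0, 1}, total_tiles=6, bandwidth_mbps=6.0)
```

Here 6 Mbps split over two visible tiles affords the 2.5 Mbps level, so both in-view tiles are flagged for reconstruction while background tiles stay at the lowest rate.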
Optionally, the video adaptive transmission method is applied to a player and includes the following steps:
receiving the video content and the decision result sent by a server, and determining, according to the decision result, whether to reconstruct the video content to obtain a target video.
Optionally, the step of receiving the video content and the decision result sent by the server and determining, according to the decision result, whether to reconstruct the video content to obtain the target video includes:
receiving the video content and the decision result sent by the server, wherein the decision result includes a reconstruction strategy, the player includes a first buffer and a second buffer, the first buffer is used for buffering the video content, and the second buffer is used for buffering a reconstructed video;
judging, according to the reconstruction strategy, whether video reconstruction needs to be performed on the video content;
if video reconstruction needs to be performed on the video content, performing video reconstruction through a pre-trained super-resolution reconstruction model to obtain the reconstructed video, and taking the reconstructed video as the target video;
and if video reconstruction does not need to be performed on the video content, taking the video content as the target video.
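The player-side two-buffer flow above can be sketched as follows. All names are hypothetical, and the string replacement stands in for the patent's pre-trained super-resolution reconstruction model.

```python
# Illustrative player flow: chunks move from a download buffer (first
# buffer) to a playback buffer (second buffer), optionally passing through
# super-resolution reconstruction according to the received decision.

def super_resolve(chunk):
    # Stand-in for the pre-trained super-resolution reconstruction model.
    return chunk.replace("low", "high")

def play(received, decisions):
    download_buffer = list(received)     # first buffer: raw video content
    playback_buffer = []                 # second buffer: ready-to-play video
    for chunk, needs_reconstruction in zip(download_buffer, decisions):
        playback_buffer.append(
            super_resolve(chunk) if needs_reconstruction else chunk)
    return playback_buffer

out = play(["low-res-1", "high-res-2"], [True, False])
```

Only the chunk whose decision flag is set is reconstructed; the other passes through unchanged, exactly as the branch in the step list above describes.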
In addition, to achieve the above object, the present invention also provides a video adaptive transmission apparatus, including:
an acquisition module, used for acquiring panoramic video data;
a transmission module, used for making an adaptive decision based on a pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player, so that the player determines, according to the decision result, whether to perform video reconstruction on the video content to obtain a target video.
In addition, to achieve the above object, the present invention also provides a terminal device including a memory, a processor, and a video adaptive transmission program stored in the memory and executable on the processor, the video adaptive transmission program implementing the steps of the video adaptive transmission method described above when executed by the processor.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing a video adaptive transmission program which, when executed by a processor, implements the steps of the video adaptive transmission method described above.
The embodiments of the invention provide a video adaptive transmission method, apparatus, terminal device and storage medium. Panoramic video data are acquired; an adaptive decision is made based on a pre-obtained user viewing angle prediction result to obtain a decision result; the panoramic video data are adjusted according to the decision result to obtain video content; and the video content and the decision result are sent to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video. By predicting the user viewing angle in advance, user preferences can be identified accurately over the long term; by making the adaptive decision based on the prediction result, adjusting the panoramic video data accordingly and sending the video content and the decision result to the player, the video quality can be improved while adapting to the available bandwidth, thereby improving the user's immersive viewing experience of the panoramic video.
Drawings
FIG. 1 is a schematic diagram of the functional modules of a terminal device to which a video adaptive transmission apparatus of the present invention belongs;
FIG. 2 is a flow chart of an exemplary embodiment of the video adaptive transmission method of the present invention;
FIG. 3 is a schematic diagram of a user viewing angle prediction framework in an embodiment of the present invention;
FIG. 4 is a flowchart of another exemplary embodiment of the video adaptive transmission method of the present invention;
FIG. 5 is a schematic diagram illustrating the specific flow of step S114 in the embodiment of FIG. 4;
FIG. 6 is a schematic diagram illustrating the specific flow of step S20 in the embodiment of FIG. 2;
FIG. 7 is a flow chart of yet another exemplary embodiment of the video adaptive transmission method of the present invention;
FIG. 8 is a schematic flow chart of step A10 in the embodiment of FIG. 7;
FIG. 9 is a schematic diagram of a super-resolution-based panoramic video bitrate adaptive transmission method in an embodiment of the present invention;
FIG. 10 is a schematic diagram of the adaptive transmission principle of a simulation system in an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The main solution of the embodiments of the present invention is: obtaining video data; making a bitrate decision on the video data based on a pre-obtained user viewing angle prediction result to obtain video content; and sending the video content to a player. By predicting the user viewing angle in advance, user preferences can be identified accurately over the long term; by making the bitrate decision based on the prediction result and transmitting the video content to the player, the video quality can be improved while adapting to the available bandwidth, thereby improving the user's immersive viewing experience of the panoramic video.
Technical terms involved in the embodiments of the present invention:
Quality of Experience (QoE): the subjective perception by a user of the quality and performance of a device, network, system, application or service; it reflects how easy or difficult the user finds it to complete the entire service process.
HTTP Adaptive Streaming (HAS): a technology that intelligently senses the user's download speed and dynamically adjusts the encoding rate of the video, providing the user with high-quality, smoother video playback.
Adaptive Bitrate Streaming (ABR): a technique that senses changes in the network environment, or automatically makes reasonable bitrate adjustments according to the client's buffer and playback conditions, so as to improve (maximize) the quality of experience of online video viewing.
Long Short-Term Memory network (LSTM): a recurrent neural network specifically designed to solve the long-term dependence problem of general RNNs (recurrent neural networks).
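As a minimal illustration of the LSTM gating just defined, one cell step can be written in plain Python with scalar states and hand-picked weights (a toy sketch, not a trained model; real LSTMs use separate learned weight matrices per gate):

```python
# One LSTM cell step: forget/input/output gates control what the long-term
# cell state keeps, adds, and exposes as the short-term hidden state.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5):
    # All four gates share the same toy weights for brevity.
    f = sigmoid(w * x + u * h_prev)          # forget gate
    i = sigmoid(w * x + u * h_prev)          # input gate
    o = sigmoid(w * x + u * h_prev)          # output gate
    c_tilde = math.tanh(w * x + u * h_prev)  # candidate cell state
    c = f * c_prev + i * c_tilde             # cell state carries long-term info
    h = o * math.tanh(c)                     # hidden state: short-term output
    return h, c

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:                   # a short input sequence
    h, c = lstm_step(x, h, c)
```

The additive update of `c` is what lets gradients survive over long sequences, which is why the viewing-angle predictor below builds on this cell rather than a plain RNN.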
360° panoramic video is an emerging video application that gives people an immersive viewing experience. As HTTP adaptive streaming (HAS) has become the mainstream technology for streaming media distribution, adaptive transmission of 360° panoramic video streams can not only greatly reduce transmission bandwidth consumption but also ensure a good viewing experience (QoE) for users. In adaptive transmission strategies for 360° panoramic video streams, accurately predicting the user's field of view (FoV) and formulating an optimal adaptive bitrate (ABR) transmission strategy that saves network bandwidth while ensuring a good immersive viewing experience are the main difficulties and challenges currently faced.
At present, among 360° panoramic video stream adaptive transmission strategies, one is a dynamic adaptive streaming bitrate allocation method that maintains the spatio-temporal consistency of 360° video. It comprises a bitrate adaptation algorithm, a field of view (FoV) conversion model, a tile priority calculation model and a tile bitrate allocation algorithm: a Gaussian model and a Zipf model are used to estimate the FoV, the priorities of all tiles of the 360° video are calculated, and the bitrate adaptation algorithm then comprehensively considers the buffer length and the video quality to determine the segment bitrate for downloading the current video segment. Another is a viewpoint prediction model based on a recurrent neural network fused with a viewpoint tracking module based on a correlation filter; through training, it learns the temporal relationship between the user's viewing viewpoints at different times so as to predict the viewpoint sequence at several future times.
However, the first method uses a mathematical approach to analyse the salient feature points in the 360° panoramic video and then determines the position of the user's viewing angle from the positions of those feature points; it ignores user preferences, the obtained FoV is based only on the video information, and the prediction accuracy fluctuates considerably as the user switches viewpoints. The second method uses a recurrent neural network to predict the user's future viewing angle by learning the temporal relationship between the user's viewpoints at different times, but it ignores the influence of changes in the 360° video content on the user's viewing angle; when the video content changes substantially, the model cannot accurately predict the change of the user's viewing angle, which is a clear limitation. Thus, problems remain with current user viewing angle prediction, resulting in a poor experience when users watch 360° video.
As a solution, the invention provides a novel super-resolution-based bitrate adaptive transmission method, which mainly comprises two parts: a user viewing angle prediction method and a super-resolution-based 360° video transmission method. Finally, the approach adopted in this proposal is verified through a 360° video adaptive transmission prototype simulation system, showing that it can effectively improve the user's viewing experience.
Specifically, referring to FIG. 1, FIG. 1 is a schematic functional block diagram of a terminal device to which the video adaptive transmission apparatus of the present invention belongs. The video adaptive transmission apparatus may be a device independent of the terminal device that is capable of video adaptive transmission, and may be carried on the terminal device in the form of hardware or software. The terminal device may be an intelligent mobile terminal with a data processing function, such as a mobile phone or tablet computer, or a fixed terminal device or server with a data processing function.
In this embodiment, the terminal device to which the video adaptive transmission apparatus belongs includes at least an output module 110, a processor 120, a memory 130, and a communication module 140.
The memory 130 stores an operating system and a video adaptive transmission program. The video adaptive transmission apparatus may store in the memory 130 the acquired panoramic video data, the decision result obtained by the adaptive decision based on the pre-obtained user viewing angle prediction result, the video content obtained by adjusting the panoramic video data according to the decision result, and other information. The output module 110 may be a display screen or the like. The communication module 140 may include a WIFI module, a mobile communication module, a Bluetooth module, and the like, and communicates with an external device or server through the communication module 140.
Wherein the video adaptive transmission program in the memory 130, when executed by the processor, implements the following steps:
acquiring panoramic video data;
making an adaptive decision based on a pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further performs the steps of:
acquiring the user's head motion trajectory and panoramic video images;
encoding the user's head motion trajectory to obtain temporal feature information;
extracting salient features of the panoramic video images to obtain user preference features;
and obtaining the user viewing angle prediction result according to the temporal feature information and the user preference features.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further implements the following steps:
decoding the temporal feature information through a decoder to obtain a predicted viewing-angle motion trajectory for the current frame image;
and integrating the predicted viewing-angle motion trajectory with the user preference features through a fully connected neural network to obtain the user viewing angle prediction result for the current frame image.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further implements the following steps:
inputting the user viewing angle prediction result of the current frame image into the decoder for decoding to obtain the predicted viewing-angle motion trajectory of the next frame image, taking the next frame image as the current frame image, and returning to the step of integrating the predicted viewing-angle motion trajectory with the user preference features through the fully connected neural network to obtain the user viewing angle prediction result of the current frame image, until all video images are processed.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further implements the following steps:
tiling the panoramic video data to obtain video tiles, and obtaining a network state estimation result of the player;
making an adaptive decision according to the network state estimation result and the user viewing angle prediction result through a multi-decision reinforcement learning model to obtain the decision result, wherein the decision result includes a target bitrate and a reconstruction strategy;
repackaging each of the video tiles according to the target bitrate to obtain the video content;
and sending the video content to the player.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further performs the steps of:
receiving the video content and the decision result sent by a server, and determining, according to the decision result, whether to reconstruct the video content to obtain a target video.
Further, the video adaptive transmission program in the memory 130, when executed by the processor, further implements the following steps:
receiving the video content and the decision result sent by the server, wherein the decision result includes a reconstruction strategy, the player includes a first buffer and a second buffer, the first buffer is used for buffering the video content, and the second buffer is used for buffering a reconstructed video;
judging, according to the reconstruction strategy, whether video reconstruction needs to be performed on the video content;
if video reconstruction needs to be performed on the video content, performing video reconstruction through a pre-trained super-resolution reconstruction model to obtain the reconstructed video, and taking the reconstructed video as the target video;
and if video reconstruction does not need to be performed on the video content, taking the video content as the target video.
According to this scheme, panoramic video data are acquired; an adaptive decision is made based on a pre-obtained user viewing angle prediction result to obtain a decision result; the panoramic video data are adjusted according to the decision result to obtain video content; and the video content and the decision result are sent to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video. By predicting the user viewing angle in advance, user preferences can be identified accurately over the long term; by making the adaptive decision based on the prediction result, adjusting the panoramic video data accordingly and sending the video content and the decision result to the player, the video quality can be improved while adapting to the available bandwidth, thereby improving the user's immersive viewing experience of the panoramic video.
The method embodiments of the present invention are proposed based on the above terminal device architecture, but are not limited to it.
The implementation subject of the method of this embodiment may be a video adaptive transmission apparatus or a terminal device; this embodiment takes the video adaptive transmission apparatus as an example.
Referring to FIG. 2, FIG. 2 is a flowchart of an exemplary embodiment of the video adaptive transmission method of the present invention. The video adaptive transmission method includes the following steps:
step S10, panoramic video data is obtained;
the panoramic video is also called 360-degree video, is a spherical video, and covers the picture content of 360-degree horizontal and 180-degree vertical, after a user wears the head-mounted display, the picture content of different areas can be watched through rotating the head, the user viewing angle watched by human eyes is about 110 degrees, and the user viewing angle area only occupies a part of the panoramic video, so that a large amount of bandwidth resources can be wasted in panoramic transmission, video playing is blocked and higher time delay is easily brought, the video watching experience of the user cannot be guaranteed, and therefore, the acquired video data is necessary to be subjected to user viewing angle prediction and code rate decision, so that bandwidth resources are effectively saved, and meanwhile, the user experience is improved. Before this, the video data may be acquired by receiving the video data from the video server through the gateway device, and further processing the video data.
Step S20, making an adaptive decision based on a pre-obtained user viewing angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video.
After the panoramic video data are obtained, the user viewing angle can be predicted based on a pre-constructed user viewing angle prediction model; an adaptive decision is then made on the video data based on the prediction result to obtain a decision result, and the panoramic video data are adjusted according to the decision result, so that the video content is obtained and sent to the player.
Specifically, referring to fig. 3, fig. 3 is a schematic diagram of a user view angle prediction framework in an embodiment of the present invention, and as shown in fig. 3, based on the thought of understanding video content, a frame 360 of panoramic video image is analyzed to extract a target that may be of interest to a user. And then, a Long Short-Term Memory (LSTM) is used as a basic model for extracting time characteristics, a model for extracting 360 panoramic video image content information characteristics by a user is added, and prediction of a user viewing angle is made by combining time characteristic information and user preference information of 360 panoramic videos in a space dimension.
The problem of inconsistent probability distribution before and after prediction of a user view movement track time sequence can be solved based on an encoder-decoder model, but the structure still utilizes time sequence characteristics to predict future view, the robustness of prediction is reduced along with the increase of prediction steps, particularly, the deviation of a predicted value at a certain moment is continuously transmitted to the future prediction, so that the user preference information based on video content is integrated on the basis of the encoder-decoder structure, the input of the encoder is not the output of a self-predicted result, the time sequence characteristics and the user preference characteristics based on the video content are integrated through a fully connected neural network, and the corrected user view predicted result is used as the input of the next prediction.
The video-content-based user preference information mainly includes the semantic information of the image content (features of the user's region of interest) and the corresponding position information in the video image. As one implementation, the video content of the user's viewing region can be extracted according to the user view angle coordinates, and salient features are then extracted from the video content of that view angle region. These video content features can also be regarded as a time series: the user preference features at each moment are fused by a fully connected neural network and used as the input of the decoder LSTM network, and the fused result of the fully connected neural network is in fact the output prediction result.
Furthermore, on the basis of the user view angle prediction, performing code-rate adaptive transmission for the user view angle can effectively improve the performance of the 360 panoramic video transmission system. High-definition images are reconstructed locally using the computing capability of the video playback client, which reduces the dependence on network bandwidth while maintaining video picture quality and improves the user's quality of experience. A deep reinforcement learning model is applied to jointly decide the download code rate and the image super-resolution reconstruction so as to achieve the optimal user quality of experience.
In this embodiment, panoramic video data is acquired; an adaptive decision is made based on a pre-obtained user view angle prediction result to obtain a decision result; the panoramic video data is adjusted according to the decision result to obtain video content; and the video content and the decision result are sent to a player, so that the player can determine, according to the decision result, whether to reconstruct the video content to obtain a target video. Predicting the user view angle in advance makes it possible to identify user preferences accurately over the long term. By making an adaptive decision based on the user view angle prediction result, adjusting the panoramic video data accordingly, and sending the video content and the decision result to the player, the video quality can be improved while adapting to the bandwidth resources, thereby improving the user's immersive viewing experience of the panoramic video.
Referring to fig. 4, fig. 4 is a flowchart illustrating another exemplary embodiment of a video adaptive transmission method according to the present invention. Based on the embodiment shown in fig. 2, in this embodiment, before performing an adaptive decision based on a pre-obtained user view angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content to a player, the video adaptive transmission method further includes:
step S111, acquiring a head motion trail and a panoramic video image of a user;
the user view angle may also be called a viewpoint. Accurate prediction of viewpoint changes is key to improving the user experience, since a prediction error can reduce the picture quality of the video the user watches or lose the picture entirely. Before user view angle prediction is performed, a user view angle prediction model must be constructed: a head-motion data set can be obtained from an open-source database, and the user view angle prediction model is trained on the data in this set.
In the process of predicting the user viewing angle by applying the user viewing angle prediction model, firstly, a user head movement track and a panoramic video image of a user viewing area are acquired, time characteristic information can be obtained by encoding the user head movement track, and user preference characteristics can be obtained by extracting the panoramic video image of the user viewing area.
Step S112, coding the motion trail of the head of the user to obtain time characteristic information;
specifically, the seq2seq model adopted in the embodiment of the present invention includes an Encoder and a Decoder: the encoder encodes the entire input sequence into a unified semantic vector, and the decoder then decodes that semantic vector. In the embodiment of the invention, an LSTM model encodes the historical trajectory x_t over the times t = {1, 2, …, T}, and another LSTM network is used as the decoder to predict the future user view angle motion trajectory. The decoder is initialized with the encoder's latest hidden state h_t and memory state c_t, and the most recent historical data of the user's viewing trajectory is used as the decoder's initial input value. The encoder-decoder model can thus solve the problem of inconsistent probability distributions before and after prediction of the time series of the user's view angle motion trajectory.
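The encoder-decoder data flow described here can be sketched in a few lines. This is a toy illustration, not the patent's model: the LSTM cells are replaced by a trivial averaging "cell" purely to show how the encoder consumes the history and how each decoder output is fed back as the next input.

```python
# Toy sketch of the encoder-decoder prediction loop. The 0.5 mixing factor
# and all names are illustrative assumptions, not part of the patent.

def toy_cell(x, h):
    """Stand-in for one LSTM step: returns (output, new hidden state)."""
    h = 0.5 * h + 0.5 * x
    return h, h

def predict_trajectory(history, steps):
    """Encode the history, then roll the decoder forward `steps` times."""
    h = 0.0
    for x in history:            # encoder pass over the historical trajectory
        _, h = toy_cell(x, h)
    y = history[-1]              # decoder seeded with the latest observation
    out = []
    for _ in range(steps):       # each prediction becomes the next input
        y, h = toy_cell(y, h)
        out.append(y)
    return out

preds = predict_trajectory([0.0, 0.2, 0.4, 0.6], steps=3)
```

The decoder loop length corresponds to the prediction horizon; in the real model, each fed-back value would first be corrected with the user preference features.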
Step S113, extracting salient features of the panoramic video image to obtain user preference features;
further, the video content of the user's viewing area is extracted according to the user view angle coordinates, and salient features are then extracted from this view angle region content (here, the undistorted image content inside the user's view angle).
Finally, the saliency map is downsampled so that each pixel represents the saliency feature of a small region of the image. The video content features can also be regarded as a time series: the user preference features at each moment are fused by a fully connected neural network and used as the input of the decoder LSTM network, and the fused result of the fully connected neural network is in fact the output prediction result.
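The downsampling step can be sketched as a block average: each output pixel summarizes the saliency of one small block of the input map. Block size and map values below are illustrative assumptions.

```python
# Minimal block-average downsampling of a saliency map (nested lists).

def downsample(saliency, block):
    rows, cols = len(saliency), len(saliency[0])
    out = []
    for r in range(0, rows, block):
        row = []
        for c in range(0, cols, block):
            vals = [saliency[i][j]
                    for i in range(r, r + block)
                    for j in range(c, c + block)]
            row.append(sum(vals) / len(vals))  # one pixel per block
        out.append(row)
    return out

sal = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 2, 2],
       [0, 0, 2, 2]]
small = downsample(sal, 2)   # each 2x2 block collapses to its mean
```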
And step S114, obtaining the user visual angle prediction result according to the time characteristic information and the user preference characteristics.
And inputting the user view angle prediction result into the decoder for decoding to obtain a predicted motion track of the next user view angle, so as to be used for predicting the next user view angle.
Furthermore, the temporal feature information is obtained by encoding the user's head motion trajectory, and the user preference features are obtained by extracting them from the panoramic video image of the user's viewing area. The temporal feature information and the video-content-based user preference features are fused through the fully connected neural network to obtain the user view angle prediction result; in addition, the corrected user view angle prediction result can be used as the input of the next prediction step.
According to this scheme, the user's head motion trajectory and the panoramic video image are acquired; the head motion trajectory is encoded to obtain temporal feature information; salient features are extracted from the panoramic video image to obtain user preference features; and the user view angle prediction result is obtained from the temporal feature information and the user preference features. Combining the temporal features encoded from the head motion trajectory with the user preference features extracted from the panoramic video image yields a long-term, effective prediction of the user view angle across both the time and space dimensions, improving the user's immersive viewing experience of the panoramic video.
Referring to fig. 5, fig. 5 is a specific flowchart of step S114 in the embodiment of fig. 4. The present embodiment is based on the embodiment shown in fig. 4, and in the present embodiment, the step S114 includes:
step S1141, decoding the time characteristic information through a decoder to obtain a user visual angle prediction motion trail of the current frame image;
in particular, in order to solve the problem of inconsistent distributions between the historical data and future predictions, the embodiment of the present invention uses a seq2seq model rather than a single LSTM model. The seq2seq model contains a single Encoder and a single Decoder: the encoder encodes the entire input sequence into a unified semantic vector, which is then decoded by the decoder. In the embodiment of the invention, an LSTM model encodes the historical trajectory x_t over the times t = {1, 2, …, T}, and another LSTM network is used as the decoder to predict the future user view angle motion trajectory. The decoder is initialized with the encoder's latest hidden state h_t and memory state c_t, and the most recent historical data of the user's viewing trajectory is used as the decoder's initial input value. The LSTM decoder uses the prediction y_{t'-1} at time t'-1 as the cyclic input for predicting the view angle at time t', so the length of the decoder's output loop can be adjusted to the required prediction horizon. The encoder-decoder model can thus solve the problem of inconsistent probability distributions before and after prediction of the view angle motion trajectory time series. As one embodiment, the LSTM has 2 hidden layers with 128 neurons per layer.
And step S1142, integrating the user visual angle prediction motion trail and the user preference characteristics through a fully connected neural network to obtain a user visual angle prediction result of the current frame image.
Inputting the user view angle prediction result of the current frame image into the decoder for decoding to obtain a user view angle prediction motion trail of the next frame image, taking the next frame image as the current frame image, and returning to the execution step: and integrating the user visual angle prediction motion trail and the user preference characteristics through a fully-connected neural network to obtain a user visual angle prediction result of the current frame image until all video images are processed.
Further, after the predicted user view angle motion trajectory is obtained by decoding, the user preference features are obtained by extracting salient features from the panoramic video image of the user's viewing area. Because the user preference features can be regarded as a time series, the features at each moment are fused by a fully connected neural network, and the fused result is used as the input of the decoder LSTM network; this fused result is in fact the output user view angle prediction result. Since the input processed by the whole decoder structure is a temporally continuous sequence, the predicted result can reflect the real user view angle trajectory.
According to this scheme, the decoder decodes the temporal feature information to obtain the predicted user view angle motion trajectory of the current frame image, and this trajectory is fused with the user preference features through a fully connected neural network to obtain the user view angle prediction result for the current frame image. Because the decoded trajectory is combined with the salient user preference features extracted from the panoramic video image of the viewing area, fused by the fully connected network, and fed back as the input of the subsequent decoder LSTM network, the final prediction result reflects the real user view angle trajectory more accurately, improving the user's immersive viewing experience of the panoramic video.
Referring to fig. 6, fig. 6 is a specific flowchart of step S20 in the embodiment of fig. 2. The present embodiment is based on the embodiment shown in fig. 2, and in the present embodiment, the step S20 includes:
step S201, partitioning and dicing the panoramic video data to obtain diced videos, and obtaining a network state estimation result of the player;
specifically, in the embodiment of the present invention, the main functions of the server are to DASH-format the video, train the video's super-resolution (SR) network, store the video and the SR network, and process requests from the client, i.e. the player. The tools mainly used for DASH formatting are Kvazaar, GPAC and FFmpeg. Kvazaar is mainly used to re-encode the video with motion constraints so that each video slice is encoded independently, i.e. each slice of the video can be played independently. GPAC then partitions the video and repackages it into DASH-formatted video content to accommodate the network state of the player.
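A server-side pipeline of this kind might be scripted as below. The snippet only assembles command strings for FFmpeg (multi-rate re-encoding) and GPAC's MP4Box (DASH packaging); the exact flags, file names, and the Kvazaar tiling step are illustrative assumptions, and a real deployment would need project-specific options.

```python
# Hedged sketch: build shell command strings for a multi-rate DASH pipeline.

def encode_cmds(src, bitrates_mbps):
    """One FFmpeg re-encode command per target code rate (flags assumed)."""
    stem = src.rsplit('.', 1)[0]
    return [f"ffmpeg -i {src} -b:v {b}M {stem}_{b}M.mp4"
            for b in bitrates_mbps]

def dash_cmd(files, seg_ms=4000):
    """MP4Box DASH packaging command (segment length assumed)."""
    return f"MP4Box -dash {seg_ms} " + " ".join(files)

cmds = encode_cmds("pano.mp4", [1, 2, 5, 10, 20])
mpd = dash_cmd(["pano_1M.mp4", "pano_20M.mp4"])
```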
Since the network state of the player plays a vital role in the user's viewing experience, it must be predicted accurately. The network state of the player is mainly estimated with a bandwidth prediction algorithm, for which a single-value method, a mean method, a weighted-average method, or the like can be adopted. As one implementation, the download rate of a video segment is computed from the amount of downloaded data and the download time; this rate represents the available bandwidth to a certain extent. The single-value method directly takes the download rate of the current video segment as the predicted bandwidth for the next moment, while the mean method takes the average download rate of all historical video segments as the predicted bandwidth for the next moment.
Step S202, performing an adaptive decision according to the network state estimation result and the user view angle prediction result through a multi-decision reinforcement learning model to obtain a decision result, wherein the decision result includes a target code rate and a reconstruction strategy;
further, after the network state estimation result is obtained, a suitable code rate version can be allocated to the tiles in the 360 panoramic video. Guided by the user view angle prediction, and exploiting the fact that streaming media can be encoded into small files with multiple code rate versions each of which can be played independently, the most suitable code rate version is dynamically selected according to the player's state, improving the user's viewing experience (including video picture quality, playback smoothness, and so on). Meanwhile, when bandwidth resources are severely insufficient, even highly accurate bandwidth estimation cannot let the player transmit high-definition-code-rate video content; the player can only guarantee smooth playback, which is not enough to compensate for the drop in viewing experience caused by the drop in video quality. The embodiment of the invention therefore adds an image super-resolution technique to the adaptive transmission system, so that the player can reconstruct high-definition video content using the client's computing capability when network bandwidth is insufficient.
Specifically, in order to solve the multi-decision problem of code rate decision and super-resolution reconstruction, the embodiment of the invention also constructs a 360-panorama video code rate self-adaptive transmission frame with a double-buffer mechanism (double-buffer), wherein one buffer is used for buffering downloaded video content, and the other buffer is used for buffering SR reconstructed video content.
With SR reconstruction of local video content in the dual-buffer mechanism, the client can still improve picture quality by reconstructing high-definition content, whether the picture quality of the user's view angle is reduced by a view angle prediction error or by the low-code-rate content downloaded under insufficient bandwidth. Playback stalling may still occur, however, when the client's local computing resources are insufficient to meet the real-time reconstruction requirements. Deciding how to download and reconstruct at the same time is therefore very challenging, because the two decision actions compete with each other for time.
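The time competition between the two actions can be made concrete with a toy buffer model: downloading a chunk and SR-reconstructing it both consume wall-clock time while the playback buffer drains. All numbers below are illustrative assumptions.

```python
# Toy dual-buffer accounting: time spent downloading (and optionally
# reconstructing) one chunk drains the playback buffer before the new
# chunk's duration is added back.

def buffer_after(buffer_s, chunk_s, download_s, reconstruct_s, do_sr):
    """Playback buffer level after fetching one chunk (clamped at 0)."""
    spent = download_s + (reconstruct_s if do_sr else 0.0)
    return max(0.0, buffer_s - spent) + chunk_s

# With 4 s buffered and 2 s chunks: a 1 s download alone keeps 5 s buffered,
# but adding a 2.5 s SR reconstruction drops the buffer to 2.5 s.
b_plain = buffer_after(4.0, 2.0, 1.0, 2.5, do_sr=False)
b_sr = buffer_after(4.0, 2.0, 1.0, 2.5, do_sr=True)
```

This is why the reconstruction decision cannot be made greedily: SR improves quality but eats the same time budget that keeps playback smooth.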
Specifically, a state space is first defined. In reinforcement learning, the environment is everything that interacts with the agent; in the environment of 360 panoramic video code rate adaptive transmission, the state space contains all information related to video code rate decision control. Specifically, the state space includes the prediction of network throughput, the player's current buffer occupancy, the code rate of the last chunk, the number of remaining chunks, the historical chunk download times, the sizes of the next chunk at the different code rates, and the user view angle position. The state can be expressed as shown in formula (1):
S_k = (T_k, τ_k, n_k, B_k, C_k, d_k, v_k)        (1)

wherein: S_k — the state when the player downloads the k-th video slice; T_k — the network throughput prediction over the past k video slices; τ_k — the download times of the past k video slices, i.e. the time intervals of the throughput measurements; n_k — the size of each tile in the next video slice at the different code rates; B_k — the current buffer occupancy; C_k — the number of remaining video slices; d_k — the download code rate of each tile in the last video slice; v_k — the predicted user view angle of the next chunk.
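The state components listed in formula (1) can be assembled as below. Field names and example values are assumptions for the sketch, not the patent's exact data layout.

```python
# Illustrative state observed when downloading the k-th video slice.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class State:
    throughput_hist: List[float]      # past throughput measurements (T_k)
    download_times: List[float]       # download times of past slices (tau_k)
    next_tile_sizes: List[List[int]]  # per-tile sizes at each code rate (n_k)
    buffer_s: float                   # current buffer occupancy (B_k)
    chunks_left: int                  # remaining slices (C_k)
    last_rates: List[int]             # code rate of each tile, last slice (d_k)
    viewport: Tuple[float, float]     # predicted view angle (v_k)

s = State([3.2, 2.8], [1.1, 1.4], [[100, 250], [90, 240]],
          buffer_s=4.0, chunks_left=30, last_rates=[2, 2],
          viewport=(0.25, -0.1))
```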
For the actions, the image-super-resolution-based 360 panoramic video code rate adaptive transmission framework must simultaneously decide the video code rate inside the user's view angle and whether to reconstruct high-resolution video content locally with SR. The code rate decision action is A_1 = {1, 2, 5, 10, 20} Mbps, and the action deciding whether to perform local SR reconstruction is A_2 = {0, 1}, where 1 means image super-resolution reconstruction is performed and 0 means it is not. The reward is the return an agent obtains from the environment after executing an action that acts on that environment. In the present model, the reward function is defined using the previously constructed quality representation model of the video chunk, specifically:
r_k = QoE_k

wherein QoE_k is the quality of experience of the k-th video chunk.
Since reinforcement learning focuses on the long-term cumulative return of a strategy, introducing a discount factor γ better describes the effect of a reward on the cumulative reward along the time dimension, giving the cumulative discounted reward:

R = Σ_k γ^k · r_k
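The cumulative discounted reward is computed directly from the per-chunk QoE rewards; γ < 1 down-weights far-future chunks.

```python
# Discounted return: R = sum_k gamma^k * r_k.

def discounted_return(rewards, gamma):
    return sum(r * gamma ** k for k, r in enumerate(rewards))

R = discounted_return([1.0, 1.0, 1.0], gamma=0.9)   # 1 + 0.9 + 0.81 = 2.71
```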
Because there are two decision actions, and in order to better coordinate the competition between them, the problem is treated as a game between two agents with policies parameterized by θ = {θ_1, θ_2}; let the set of agent policies be π = {π_1, π_2}. The gradient of the expected return J(θ_i) = E[R_i], following the multi-agent deterministic policy gradient, is:

∇_{θ_i} J(θ_i) = E[∇_{θ_i} μ_i(o_i) · ∇_{a_i} Q_i^μ(x, a_1, a_2) |_{a_i = μ_i(o_i)}]
wherein Q_i^μ(x, a_1, a_2) is the centralized action-value function: it takes the actions of both agents plus the state information as input and outputs the Q value of agent i. To simplify the calculation, x is the joint observation of all agents. Because both agents aim to maximize the overall quality of experience of video playback, they use the same reward (any reward scheme may be used). The experience replay buffer stores tuples <x, x′, a_1, a_2, r_1, r_2> recording the experience of all agents. The centralized action-value function Q_i^μ is updated by minimizing:

L(θ_i) = E[(Q_i^μ(x, a_1, a_2) − y)^2],  y = r_i + γ · Q_i^{μ′}(x′, a_1′, a_2′)

wherein μ′ is the set of target policies used for interacting with the environment.
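The temporal-difference target used to update the centralized critic can be sketched as follows. The dictionary-based critic is a stand-in assumption for the actual neural networks; only the target formula y = r_i + γ·Q_i^{μ′}(x′, a_1′, a_2′) is taken from the text above.

```python
# Hedged sketch of the centralized-critic TD target for agent i.

def td_target(r_i, gamma, q_target, x_next, a1_next, a2_next):
    """y = r_i + gamma * Q_i^{mu'}(x', a1', a2')."""
    return r_i + gamma * q_target[(x_next, a1_next, a2_next)]

q_target = {("s1", 5, 1): 10.0}          # toy target critic lookup table
y = td_target(r_i=2.0, gamma=0.9, q_target=q_target,
              x_next="s1", a1_next=5, a2_next=1)   # 2.0 + 0.9 * 10 = 11.0
```

Note that the target takes the joint next state and the next actions of *both* agents, which is what makes the critic "centralized".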
the above scheme is implemented using multiple processes, each representing one agent and one decision strategy in the environment. Each agent interacts directly with the system's environment through its own observations and obtains a large number of action-state tuples in a short time. Each agent observes only the environment variables that affect its own action execution, but the reward it obtains is the global reward produced after the action is executed; that is, each agent's exploration result is the change its own action causes in the whole environment, and each agent's exploration process is independent of the others. During decision-network training, all observations obtained by agent exploration are taken into account, and the reward obtained by combining the decisions of the multiple agents evaluates the effectiveness of those decisions.
Each policy network selects an action with a certain probability; it evaluates the score of the current action based on the actions in the experience replay buffer, and then modifies the probability of the selected action according to that score, i.e. it updates the action policy. The 1D-CNN layer in the policy network contains 128 filters, each of size 4; the fully connected FC layer contains 128 units.
Applying super-resolution in the adaptive transmission system reduces the system's dependence on network bandwidth resources. The code rate decision and the super-resolution reconstruction decision are in a relationship of both competition and cooperation: both serve to improve video playback quality, but each affects the adaptive transmission system's perception of the environment, so neither alone can make the optimal decision. To solve this multi-decision problem, the method introduces a multi-decision reinforcement learning model on top of the super-resolution-based 360 panoramic video code rate adaptive transmission method, improving the effectiveness and robustness of the algorithm. The decision result obtained by the adaptive decision includes a target code rate and a reconstruction strategy: the target code rate yields video content suited to the player's network state, and the player judges from the reconstruction strategy whether the video content needs to be reconstructed.
Step 203, repackaging each of the diced videos according to the target code rate to obtain the video content;
by DASH-formatting the video, it can be re-encoded with motion constraints so that each video slice is encoded independently, i.e. each slice of the video can be played independently; GPAC then partitions the video and repackages it into DASH-formatted video content. In addition, FFmpeg is used to re-encode the video into versions with multiple code rates; in the embodiment of the invention this re-encoding is performed before the DASH formatting with Kvazaar, finally yielding video content suited to the player.
Step S204, sending the video content to the player.
Furthermore, after the video content with the optimal code rate version is obtained through the 360 panoramic video code rate adaptive transmission framework, the corresponding video content can be sent to the player. Because the adaptive algorithm dynamically selects the code rate according to network changes and the user view angle prediction result, playback stalling can be avoided, giving the user a smooth, high-definition viewing experience.
According to this scheme, the panoramic video data is partitioned and diced to obtain diced videos, and the network state estimation result of the player is obtained; an adaptive decision is made from the network state estimation result and the user view angle prediction result through a multi-decision reinforcement learning model, the decision result including a target code rate and a reconstruction strategy; each diced video is repackaged according to the target code rate to obtain the video content; and the video content is sent to the player. Performing code rate adaptive transmission for the user view angle on the basis of user view angle prediction effectively improves the performance of the panoramic video transmission system, and applying the deep reinforcement learning model to jointly decide the download code rate and the image super-resolution reconstruction achieves the optimal quality of experience, improving the user's immersive viewing experience of the panoramic video.
Referring to fig. 7, fig. 7 is a flowchart of still another exemplary embodiment of a video adaptive transmission method according to the present invention, where the video adaptive transmission method is applied to a player, and the video adaptive transmission method includes:
and step A10, receiving video content and a decision result sent by a server, and determining whether to reconstruct the video content according to the decision result to obtain a target video.
When the server makes an adaptive decision based on a pre-obtained user view angle prediction result, it obtains a decision result, adjusts the panoramic video data according to that decision result to obtain video content, and sends the video content and the decision result to the player. The player, as the client, receives the video content and the decision result, judges from the reconstruction strategy in the decision result whether to reconstruct the video content, and thereby obtains the target video for the user to watch.
Referring to fig. 8, fig. 8 is a schematic flow chart of step a10 in the embodiment of fig. 7. The present embodiment is based on the embodiment shown in fig. 7, and in the present embodiment, the step a10 includes:
step A101, receiving video content and a decision result sent by the server, wherein the decision result comprises a reconstruction strategy, the player comprises a first buffer zone and a second buffer zone, the first buffer zone is used for caching the video content, and the second buffer zone is used for caching the reconstructed video;
Step A102, judging whether video reconstruction is needed to be carried out on the video content according to the reconstruction strategy;
step A103, if video reconstruction is required to be carried out on the video content, carrying out video reconstruction through a pre-trained super-resolution reconstruction model to obtain the reconstructed video, and taking the reconstructed video as the target video;
and step A104, if the video content does not need to be subjected to video reconstruction, taking the video content as the target video.
specifically, after the network state estimation result of the player is obtained, it must be judged whether the result is below a preset threshold; the threshold can be adjusted to the actual application conditions. When bandwidth resources are severely insufficient, high-definition-code-rate video content cannot be transmitted and the player can only guarantee smooth playback, which does not compensate for the drop in viewing experience caused by the drop in video quality. It is therefore judged from the network state estimation result whether video reconstruction through the image super-resolution technique is needed, so as to improve the user's viewing experience.
Further, in order to maximize the user's viewing experience, the embodiment of the invention adopts the image super-resolution reconstruction technique so that the player can reconstruct high-definition video content using the client's computing capability when network bandwidth is insufficient. If the reconstruction strategy in the decision result indicates that video reconstruction is needed, the target video is obtained by reconstructing through the super-resolution model; before this, the super-resolution model must be trained on the head-motion data set.
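The client-side decision of steps A102-A104 can be sketched as below: the server's reconstruction-strategy flag drives the choice, optionally gated by the bandwidth-threshold check described above. `sr_reconstruct` is a hypothetical stand-in for the trained super-resolution model, and combining the flag with the threshold is an illustrative assumption.

```python
# Hedged sketch of the reconstruct-or-not decision on the player side.

def choose_target(video_chunk, reconstruct_flag, bandwidth, threshold,
                  sr_reconstruct=lambda v: v + "_SR"):
    """Return the target video: SR output (second buffer) or as downloaded
    (first buffer)."""
    if reconstruct_flag == 1 and bandwidth < threshold:
        return sr_reconstruct(video_chunk)
    return video_chunk

t1 = choose_target("chunk7_480p", 1, bandwidth=0.8, threshold=2.0)
t2 = choose_target("chunk8_1080p", 0, bandwidth=5.0, threshold=2.0)
```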
In the embodiment of the invention, the super-resolution model MDSR is modified for better effect. The SR model is trained with the 360 panoramic videos in an open-source head-motion data set (including 195 4K videos): the 195 4K videos are taken as the highest-resolution originals, lower-resolution video data sets (2K, 1080p, 720p and 480p) are generated by re-encoding, the video images are divided into 20×10 regions, and each region is used as a video tile for training the SR model.
More specifically, before the super-resolution model is adopted to reconstruct the video, training of the super-resolution model is completed, and training data adopted in the training process can be a head motion data set obtained from an open source database, so that 360 panoramic video in the data set is selected.
Further, after the head motion data set is obtained, selecting a plurality of panoramic videos meeting preset definition, in the embodiment of the invention, 195 4K videos are selected as the original videos with the highest resolution, and then recoding the panoramic videos.
Furthermore, after the original panoramic video with the highest resolution is selected, the panoramic video can be recoded through an encoder to generate a video data set with lower resolution, for example, in the embodiment of the invention, the video data set with lower resolution including 2K, 1080p, 720p and 480p is generated, and then video slicing can be performed on the video data set with lower resolution.
After the encoder generates the lower-resolution video data set, the video images are divided into regions, for example 20×10 regions, each region serving as one video tile, and the super-resolution model is trained with these tiles.
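The 20×10 tiling can be sketched as a grid of pixel bounding boxes, one per training tile. The 3840×1920 frame size below is an illustrative assumption for a 4K equirectangular frame.

```python
# Divide a frame into a cols x rows tile grid; each cell is one SR tile.

def tile_bounds(width, height, cols=20, rows=10):
    """Pixel bounding boxes (x0, y0, x1, y1) of each tile, row-major."""
    tw, th = width // cols, height // rows
    return [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
            for r in range(rows) for c in range(cols)]

tiles = tile_bounds(3840, 1920)   # 200 tiles of 192 x 192 pixels each
```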
After the video image is divided into areas to obtain each video block, the super-resolution model can be trained through the video block with lower resolution to generate a corresponding video block with higher resolution, and after training is completed, the super-resolution model can reconstruct the video with lower resolution into the video with higher resolution under the condition that the bandwidth resource of a player is seriously insufficient, so that the watching experience of a user is effectively improved.
In this embodiment, the player receives the video content and the decision result sent by the server, the decision result including a reconstruction strategy; the player includes a first buffer for caching the video content and a second buffer for caching the reconstructed video. Whether the video content needs video reconstruction is judged according to the reconstruction strategy: if so, video reconstruction is performed through the pre-trained super-resolution reconstruction model and the reconstructed video is taken as the target video; if not, the video content itself is taken as the target video. Video reconstruction effectively improves the quality of the panoramic video content and thus the user's immersive viewing experience. Introducing reinforcement learning to adaptively select the tile code rate and generating high-resolution tile versions with the image super-resolution technique effectively balances the video picture quality watched by the user against the client's computing resources and network bandwidth, improving the quality of experience of watching streaming video.
In addition, an embodiment of the present invention further provides a video adaptive transmission device, where the video adaptive transmission device includes:
the acquisition module is used for acquiring panoramic video data;
the transmission module is used for carrying out self-adaptive decision based on a pre-obtained user visual angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player so that the player can determine whether to carry out video reconstruction on the video content according to the decision result to obtain a target video.
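A rough sketch of the decision made by the transmission module might look as follows; the bitrate values, the 5 Mbps threshold, and the function name are assumptions for illustration, not values from the patent:

```python
# Illustrative sketch: tiles inside the predicted viewport get a high code
# rate, the rest a low one, and the reconstruction flag is raised when the
# estimated bandwidth is scarce. All constants are hypothetical.
def adaptive_decision(predicted_tiles, bandwidth_mbps,
                      hi_kbps=8000, lo_kbps=1000, sr_threshold_mbps=5.0):
    """Return a per-tile code-rate map plus a reconstruction strategy flag."""
    bitrates = {t: (hi_kbps if t in predicted_tiles else lo_kbps)
                for t in range(200)}  # 20x10 grid => 200 tiles
    return {"bitrates": bitrates,
            "reconstruct": bandwidth_mbps < sr_threshold_mbps}
```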
Referring to fig. 9, fig. 9 is a schematic diagram of a super-resolution-based panoramic video code rate adaptive transmission method in the embodiment of the present invention. As shown in fig. 9, the method mainly includes two parts: a user visual angle prediction method and a super-resolution-based 360-degree video transmission method. Finally, the method is verified through a 360-degree video self-adaptive transmission prototype simulation system, showing that the viewing experience of the user can be effectively improved.
Referring to fig. 10, fig. 10 is a schematic diagram illustrating the adaptive transmission principle of a simulation system according to an embodiment of the present invention. As shown in fig. 10, the simulation system is extended on the basis of Pensieve by adding support for 360-degree panoramic video slicing and a super-resolution (SR) model module. The simulation system comprises a server and a client. The main functions of the server are to perform DASH conversion of videos, train the super-resolution network of the videos, store the videos and the SR network, and process requests from the client. On top of the original self-adaptive transmission framework, the client adds a user visual angle prediction module and an SR super-resolution module adapted to the 360-degree panoramic video self-adaptive transmission scheme. The content downloaded by the client includes the video content for viewing by the user and the super-resolution model. The 360-degree video self-adaptive transmission prototype simulation system can obtain comparison results quickly within a short time.
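One step of the simulated client, combining the two added modules, could be sketched as below; `predictor`, `fetch`, and `sr_model` are hypothetical placeholders for the visual angle prediction module, the downloader, and the SR module:

```python
# Rough sketch of one step of the simulated client: predict the viewport,
# download the tiles at the decided code rates, and apply super-resolution
# when the decision result asks for it. Placeholder names throughout.
def client_step(predictor, fetch, sr_model, history, decision):
    viewport = predictor(history)                  # user visual angle prediction
    chunk = fetch(viewport, decision["bitrates"])  # download per-tile content
    if decision["reconstruct"]:
        chunk = sr_model(chunk)                    # SR module reconstructs tiles
    return chunk
```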
In addition, an embodiment of the present invention further provides a terminal device, which comprises a memory, a processor, and a video self-adaptive transmission program stored on the memory and executable on the processor, wherein the video self-adaptive transmission program, when executed by the processor, implements the steps of the video self-adaptive transmission method described above.
Since the video self-adaptive transmission program, when executed by the processor, adopts all the technical solutions of all the above embodiments, it has at least all the beneficial effects brought by those technical solutions, which are not described in detail herein.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a video self-adaptive transmission program, and the video self-adaptive transmission program, when executed by a processor, implements the steps of the video self-adaptive transmission method described above.
Since the video self-adaptive transmission program, when executed by the processor, adopts all the technical solutions of all the above embodiments, it has at least all the beneficial effects brought by those technical solutions, which are not described in detail herein.
Compared with the prior art, the video self-adaptive transmission method, device, terminal equipment, and storage medium provided by the embodiments of the present invention acquire panoramic video data; perform self-adaptive decision based on a pre-obtained user visual angle prediction result to obtain a decision result; adjust the panoramic video data according to the decision result to obtain video content; and send the video content and the decision result to a player, so that the player determines, according to the decision result, whether to reconstruct the video content to obtain a target video. Performing user visual angle prediction on the video data in advance allows user preferences to be identified accurately over the long term; making a self-adaptive decision based on the prediction result, adjusting the panoramic video data accordingly, and sending the video content and the decision result to the player improves video quality while adapting to bandwidth resources, thereby improving the user's immersive viewing experience of the panoramic video.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general hardware platform, and of course may also be implemented by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a controlled terminal, a network device, or the like) to perform the method of each embodiment of the present application.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the patent scope of the invention; any equivalent structural or process transformation made using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A video self-adaptive transmission method, characterized in that the video self-adaptive transmission method is applied to a server and comprises the following steps:
acquiring panoramic video data;
performing self-adaptive decision based on a pre-obtained user visual angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player for the player to determine whether to reconstruct the video content according to the decision result to obtain a target video, wherein the player comprises a first buffer zone and a second buffer zone, the first buffer zone is used for caching the video content, and the second buffer zone is used for caching reconstructed video;
the user visual angle prediction result is obtained by encoding the acquired user head motion trail based on the seq2seq model to obtain time characteristic information and combining the time characteristic information with user preference characteristics extracted according to the panoramic video image.
2. The video self-adaptive transmission method according to claim 1, wherein before the step of performing self-adaptive decision based on a pre-obtained user visual angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content to a player, the method further comprises:
Acquiring a head motion trail and a panoramic video image of a user;
coding the head movement track of the user to obtain time characteristic information;
extracting the salient features of the panoramic video image to obtain user preference features;
and obtaining the user visual angle prediction result according to the time characteristic information and the user preference characteristics.
3. The video adaptive transmission method according to claim 2, wherein the step of obtaining the user viewing angle prediction result according to the time characteristic information and the user preference feature comprises:
decoding the time characteristic information through a decoder to obtain a user visual angle prediction motion trail of the current frame image;
and integrating the user visual angle prediction motion trail and the user preference characteristics through a fully-connected neural network to obtain a user visual angle prediction result of the current frame image.
4. The video adaptive transmission method according to claim 3, wherein the step of integrating the user view angle prediction motion trail and the user preference feature through the fully connected neural network to obtain the user view angle prediction result of the current frame image further comprises:
Inputting the user view angle prediction result of the current frame image into the decoder for decoding to obtain a user view angle prediction motion trail of the next frame image, taking the next frame image as the current frame image, and returning to the execution step: and integrating the user visual angle prediction motion trail and the user preference characteristics through a fully-connected neural network to obtain a user visual angle prediction result of the current frame image until all video images are processed.
5. The method for adaptively transmitting video according to claim 1, wherein said step of adaptively deciding based on a pre-obtained user viewing angle prediction result to obtain a decision result, adjusting said panoramic video data according to said decision result to obtain video content, and transmitting said video content to a player comprises:
partitioning and slicing the panoramic video data to obtain sliced videos, and obtaining a network state estimation result of the player;
performing self-adaptive decision according to the network state estimation result and the user visual angle prediction result through a multi-decision reinforcement learning model to obtain the decision result, wherein the decision result comprises a target code rate and a reconstruction strategy;
repackaging each of the sliced videos according to the target code rate to obtain the video content;
and sending the video content to the player.
6. A video adaptive transmission method, wherein the video adaptive transmission method is applied to a player, and the video adaptive transmission method comprises the following steps:
receiving video content and a decision result sent by a server, and determining whether to reconstruct the video content according to the decision result to obtain a target video;
the player comprises a first buffer zone and a second buffer zone, wherein the first buffer zone is used for caching the video content, and the second buffer zone is used for caching the reconstructed video;
the decision result is obtained by carrying out self-adaptive decision on the basis of a user visual angle prediction result obtained in advance by the service end; the video content is obtained by the server side according to the panoramic video data obtained by adjustment of the decision result;
the user visual angle prediction result is obtained by encoding the acquired user head motion trail based on the seq2seq model to obtain time characteristic information and combining the time characteristic information with user preference characteristics extracted according to the panoramic video image.
7. The video self-adaptive transmission method according to claim 6, wherein the step of receiving the video content and the decision result sent by the server and determining whether to perform video reconstruction on the video content according to the decision result to obtain a target video comprises:
receiving video content and a decision result sent by the server, wherein the decision result comprises a reconstruction strategy;
judging whether video reconstruction is needed to be carried out on the video content according to the reconstruction strategy;
if the video content is required to be subjected to video reconstruction, performing video reconstruction through a pre-trained super-resolution reconstruction model to obtain the reconstructed video, and taking the reconstructed video as the target video;
and if the video content does not need to be subjected to video reconstruction, taking the video content as the target video.
8. A video adaptive transmission apparatus, characterized in that the video adaptive transmission apparatus comprises:
the acquisition module is used for acquiring panoramic video data;
the transmission module is used for carrying out self-adaptive decision based on a pre-obtained user visual angle prediction result to obtain a decision result, adjusting the panoramic video data according to the decision result to obtain video content, and sending the video content and the decision result to a player so that the player can determine whether to reconstruct the video content according to the decision result to obtain a target video, wherein the player comprises a first buffer zone and a second buffer zone, the first buffer zone is used for caching the video content, and the second buffer zone is used for caching reconstructed video;
The user visual angle prediction result is obtained by encoding the acquired user head motion trail based on the seq2seq model to obtain time characteristic information and combining the time characteristic information with user preference characteristics extracted according to the panoramic video image.
9. A terminal device, comprising a memory, a processor, and a video self-adaptive transmission program stored on the memory and executable on the processor, wherein the video self-adaptive transmission program, when executed by the processor, implements the steps of the video self-adaptive transmission method according to any one of claims 1-5 or 6-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a video adaptive transmission program, which when executed by a processor, implements the steps of the video adaptive transmission method according to any of claims 1-5 or 6-7.
CN202210609323.5A 2022-05-31 2022-05-31 Video self-adaptive transmission method, device, terminal equipment and storage medium Active CN115037962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609323.5A CN115037962B (en) 2022-05-31 2022-05-31 Video self-adaptive transmission method, device, terminal equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115037962A CN115037962A (en) 2022-09-09
CN115037962B true CN115037962B (en) 2024-03-12

Family

ID=83123019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210609323.5A Active CN115037962B (en) 2022-05-31 2022-05-31 Video self-adaptive transmission method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115037962B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115589499B (en) * 2022-10-08 2023-09-29 深圳市东恒达智能科技有限公司 Remote education playing code stream distribution control system and method
CN116708843B (en) * 2023-08-03 2023-10-31 清华大学 User experience quality feedback regulation system in semantic communication process

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10560759B1 (en) * 2018-10-23 2020-02-11 At&T Intellectual Property I, L.P. Active network support on adaptive virtual reality video transmission
CN110827198A (en) * 2019-10-14 2020-02-21 唐山学院 Multi-camera panoramic image construction method based on compressed sensing and super-resolution reconstruction
CN112953922A (en) * 2021-02-03 2021-06-11 西安电子科技大学 Self-adaptive streaming media control method, system, computer equipment and application
CN113313123A (en) * 2021-06-11 2021-08-27 西北工业大学 Semantic inference based glance path prediction method
CN113395505A (en) * 2021-06-21 2021-09-14 河海大学 Panoramic video coding optimization algorithm based on user field of view
CN113573140A (en) * 2021-07-09 2021-10-29 西安交通大学 Code rate self-adaptive decision-making method supporting face detection and real-time super-resolution
CN113905221A (en) * 2021-09-30 2022-01-07 福州大学 Stereo panoramic video asymmetric transmission stream self-adaption method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10694249B2 (en) * 2015-09-09 2020-06-23 Vantrix Corporation Method and system for selective content processing based on a panoramic camera and a virtual-reality headset
US10764494B2 (en) * 2018-05-25 2020-09-01 Microsoft Technology Licensing, Llc Adaptive panoramic video streaming using composite pictures


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Unequal Error Protection Aided Region of Interest Aware Wireless Panoramic Video; Yongkai Huo et al.; IEEE Access, vol. 7, pp. 80262-80276 *
Compression and Post-processing of Panoramic Video; Li Yaru; China Dissertation Full-text Database; full text *
Virtual Reality Video Processing and Transmission Technology; Dong Zhen et al.; Telecommunications Science, no. 8, pp. 51-58 *
Research Progress on Virtual Viewpoint Synthesis Technology for 3D Video; Zhang Bowen et al.; Computer Engineering and Applications, vol. 57, no. 2, pp. 12-17 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant