CN110166850B - Method and system for predicting panoramic video watching position by multiple CNN networks - Google Patents
- Publication number: CN110166850B (application CN201910465138.1A)
- Authority: CN (China)
- Prior art keywords: video frame, panoramic video, saliency map, network, CNN
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- H—ELECTRICITY; H04—Electric communication technique; H04N—Pictorial communication, e.g. television; H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/44016—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
- H04N21/816—Monomedia components involving special video data, e.g. 3D video
Abstract
The invention provides a method and a system for predicting the viewing position of a panoramic video using multiple CNN networks. The method comprises the following steps: based on the viewing trajectory over the preceding period, predicting the viewing point at the next moment with a neural network; mapping the panoramic video frame at the moment to be predicted into small video frames in multiple directions, obtaining a saliency map for each small video frame through a first convolutional neural network (CNN), merging these into a saliency map of the whole video frame, and refining that map through a second CNN to obtain the panoramic video frame saliency map; and inputting the predicted viewing point and the panoramic video frame saliency map into a fully connected network to obtain the final predicted point, i.e., the panoramic video viewing position. The invention jointly considers the temporal continuity of viewing behavior and the mapping distortion of panoramic video, and combines the two to obtain the final optimal predicted point, thereby achieving higher prediction accuracy.
Description
Technical Field
The invention relates to a method for predicting the viewing position of a panoramic video, and in particular to a method and a system for predicting the viewing position of a panoramic video based on multiple convolutional neural networks.
Background
In recent years, video traffic has continued to account for a large share of total network traffic, and panoramic video has developed rapidly thanks to its uniquely immersive experience. However, because the data volume of panoramic video is large, its demands on the network environment are high, and current network infrastructure cannot transmit this volume of information without some preprocessing. The problem to be solved is therefore to reduce the amount of transmitted data while preserving video quality as far as possible. By predicting the viewing position within the panoramic video, the content the viewer is about to watch can be transmitted in advance under existing spatially tiled transmission protocols such as MPEG-DASH; higher prediction accuracy thus improves the viewing experience and makes fuller, more reasonable use of limited network resources.
Predicting the viewing angle of a panoramic video poses several difficulties: different viewers may differ greatly in which regions of the same video interest them; the same viewer watches different video content with considerable randomness; and because the data volume of a panoramic video is several times that of an ordinary video, even the same viewer watching the same video exhibits large uncertainty in the viewing point at any given moment.
In recent years, many viewing-angle prediction methods for panoramic video have been proposed, but most do not consider the problem comprehensively. For example, some methods predict the viewing angle at the next moment from the preceding viewing trajectory, using techniques such as linear regression or neural networks, and achieve a certain accuracy. Incorporating the salient region of the video frame at the corresponding moment can further narrow the predictable range, and prediction accuracy should improve correspondingly. Moreover, scene switches during playback inevitably occur, in which case prediction from the preceding trajectory alone incurs large errors; correcting the predicted region with the salient region of the current video frame is therefore important for improving prediction accuracy.
Disclosure of Invention
Aiming at the insufficient prediction accuracy of panoramic video viewing-angle regions in the prior art, the invention provides a method, a system and a terminal for predicting the panoramic video viewing position based on multiple convolutional neural networks (CNNs).
To achieve the above purpose, the invention adopts the following technical scheme:
according to a first aspect of the present invention, there is provided a method for predicting a panoramic video viewing position by multiple CNN networks, the method comprising:
based on the viewing trajectory over the preceding period, predicting the viewing point at the next moment with a neural network;
mapping the panoramic video frame into small video frames in multiple directions, obtaining a corresponding saliency map for each small video frame through a first convolutional neural network (CNN), merging the saliency maps into a saliency map of the whole video frame, and refining that map through a second convolutional neural network (CNN) to obtain the panoramic video frame saliency map;
and inputting the predicted viewing point and the panoramic video frame saliency map into a fully connected network to obtain the final predicted point, i.e., the panoramic video viewing position point.
Optionally, the neural network method adopts an LSTM model: the viewing trajectory of the previous second is read and input into the LSTM model to predict the viewing point at the next moment.
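As a concrete illustration of this step, the sketch below predicts the next viewing point from the previous second's trajectory of (yaw, pitch) samples. The patent specifies a trained LSTM model; here a constant-velocity extrapolation over the last two samples stands in for it (an assumption for illustration only), with the yaw seam at 0°/360° handled explicitly.

```python
def predict_next_viewpoint(trajectory):
    """Extrapolate the next viewing point from the previous second's trajectory.

    `trajectory` is a list of (yaw, pitch) samples in degrees, oldest first,
    assumed uniformly spaced. A constant-velocity step over the last two
    samples stands in for the trained LSTM described in the patent.
    """
    (y0, p0), (y1, p1) = trajectory[-2], trajectory[-1]
    # Wrap the yaw step into (-180, 180] so crossing the 0/360 seam works.
    dyaw = (y1 - y0 + 180.0) % 360.0 - 180.0
    yaw = (y1 + dyaw) % 360.0
    pitch = max(-90.0, min(90.0, p1 + (p1 - p0)))  # clamp to valid latitude
    return yaw, pitch
```

For example, a trajectory ending at (359°, 0°), (1°, 0°) correctly extrapolates across the seam to yaw 3° rather than jumping backwards.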
Optionally, the panoramic video frame at the moment to be predicted is mapped into small video frames in multiple directions, wherein: cube mapping is applied to the panoramic video frame to be predicted, yielding small video frames in six directions, namely up, down, front, back, left and right.
Optionally, each small video frame obtains a corresponding saliency map through the first convolutional neural network (CNN), wherein: each small video frame is passed through a trained VGG16 network to obtain its corresponding saliency map.
Optionally, the fully connected network is a two-layer fully connected network.
The invention thus designs a method for predicting the panoramic video viewing position based on multiple CNN networks. First, an LSTM network is used to obtain the predicted point for the corresponding moment; when analyzing the saliency map of a panoramic video frame, viewers' viewing habits and the distortion introduced by cube mapping are taken into account; and because the merged saliency map carries corresponding distortion, it is passed through a second CNN network to obtain the final saliency map.
According to a second aspect of the present invention, there is provided a system for predicting a panoramic video viewing position by multiple CNN networks, comprising:
the neural network module is used for predicting a viewing point at the next moment by using a neural network method according to the viewing track of the previous period of time;
a mapping module that maps the panoramic video frame into small video frames in a plurality of directions;
a saliency map construction module, which obtains a corresponding saliency map from each small video frame through a first Convolutional Neural Network (CNN), merges the saliency maps into a saliency map of the whole video frame, and refines the saliency map of the whole video frame through a second Convolutional Neural Network (CNN) to obtain a saliency map of the panoramic video frame;
and the prediction module, which inputs the viewing point predicted by the neural network module and the panoramic video frame saliency map obtained by the saliency map construction module into a fully connected network to obtain the final predicted point, i.e., the panoramic video viewing position point.
Optionally, the mapping module, wherein:
and performing cube mapping on the panoramic video frame to be predicted to obtain small video frames in six directions, namely up, down, front, rear, left and right.
Optionally, the prediction module, wherein:
the fully connected network is a two-layer fully connected network.
According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor is operable to execute the method for predicting a panoramic video viewing position by multiple CNN networks.
According to a fourth aspect of the present invention, there is provided a computer readable medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method for predicting a panoramic video viewing position by multiple CNN networks.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the system and the terminal, the distortion problem during panoramic video mapping is considered by watching the track, the saliency map of the panoramic video frame and the mapping distortion problem of the panoramic video, how to process the distortion problem is also considered, the prediction point obtained according to the track and the saliency map of the video frame are combined to obtain the final prediction point, and the prediction accuracy is effectively improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a method for panoramic video view prediction according to an embodiment of the present invention;
FIG. 2 is a representation of an original video frame, a mapped thumbnail, a saliency map of a thumbnail, and a merged saliency map in accordance with an embodiment of the present invention;
fig. 3 compares the merged saliency map with the saliency map obtained from it through second-CNN-network learning, according to an embodiment of the present invention;
FIGS. 4a and 4b compare the prediction accuracy of LSTM with and without the saliency map at a 1-second prediction interval, in an embodiment of the present invention;
fig. 5 is a block diagram of a system for panoramic video view prediction according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention; all such variations fall within the scope of the present invention.
The invention comprehensively considers the viewing trajectory and the video frame saliency map when watching panoramic video, accounts for the distortion that arises when merging the mapped saliency maps, and achieves higher prediction accuracy than conventional methods by applying multiple CNN networks.
Specifically, referring to fig. 1, a method for predicting a panoramic video viewing position based on a multiple CNN network in an embodiment of the present invention includes the following steps:
and S1, inputting the watching track of the previous period into an LSTM (Long Short-Time Memory) network, wherein the LSTM network has Memory capacity and better learning capacity for Time sequences, so that the predicted point of the next moment is obtained through the LSTM network.
S2, mapping the panoramic video frame into small video frames in multiple directions, obtaining a corresponding saliency map from each small video frame through a first Convolutional Neural Network (CNN), and combining the saliency maps into a saliency map of the whole video frame;
When watching a panoramic video, viewers pay less attention to the upper and lower regions of the frame and more to the middle region, and each region has its own saliency characteristics. A panoramic video frame is therefore mapped into views in 6 directions, namely up, down, front, back, left and right; the 6 views are respectively passed through the first CNN network to obtain 6 corresponding saliency maps, which are then inverse-mapped into a saliency map of the whole video frame. The saliency map is a grayscale image.
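The cube mapping underlying this step can be sketched as follows: a unit viewing direction is assigned to one of the six faces by its dominant axis, together with in-face coordinates. The axis convention (+x right, +y up, +z front) and the face orientations are assumptions for illustration, since the patent does not fix them.

```python
def cube_face(x, y, z):
    """Map a (non-zero) viewing-direction vector to one of the six cube faces.

    Returns (face, u, v) with u, v in [-1, 1]. The axis convention
    (+x right, +y up, +z front) is an assumption; the patent does not
    specify one.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:          # dominant x: left or right face
        if x > 0:
            return 'right', -z / ax, y / ax
        return 'left', z / ax, y / ax
    if ay >= az:                       # dominant y: up or down face
        if y > 0:
            return 'up', x / ay, -z / ay
        return 'down', x / ay, z / ay
    if z > 0:                          # dominant z: front or back face
        return 'front', x / az, y / az
    return 'back', -x / az, y / az
```

Looking straight ahead (0, 0, 1) lands at the center of the front face; a direction such as (0.2, -0.9, 0.1) falls on the down face, matching the observation that viewers rarely look there.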
S3, the obtained saliency map of the whole video frame is passed through a second CNN network to obtain a refined saliency map of the panoramic video frame;
the problem that the saliency map of the whole video frame has certain distortion and coverage at the splicing position during reflection is solved through the second CNN network.
S4, the predicted point at the next moment obtained in S1 and the panoramic video frame saliency map obtained in S3 are input into a two-layer fully connected network to obtain the final predicted point.
Referring to fig. 5, in correspondence to the method described above, an embodiment of the present invention further provides a system for predicting a panoramic video viewing position by using multiple CNN networks, where the system includes:
the neural network module is used for predicting a viewing point at the next moment by using a neural network method according to the viewing track of the previous period of time;
a mapping module that maps a panoramic video frame at a time to be predicted into small video frames in a plurality of directions;
the salient map building module is used for obtaining a corresponding salient map from each small video frame through a first convolutional neural network CNN, combining the salient maps into a salient map of the whole video frame, and refining the salient map of the whole video frame through a second convolutional neural network CNN to obtain a panoramic video frame salient map;
and the prediction module, which inputs the viewing point predicted by the neural network module and the panoramic video frame saliency map obtained by the saliency map construction module into a fully connected network to obtain the final predicted point, i.e., the panoramic video viewing position point.
Corresponding to the method, the embodiment of the invention also provides a terminal, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor can be used for executing the method for predicting the panoramic video viewing position by the multiple CNN networks when executing the program.
Corresponding to the above method, an embodiment of the present invention further provides a computer readable medium having a computer program stored thereon, which when executed by a processor, implements the method for predicting a panoramic video viewing position by multiple CNN networks.
How the above method is implemented is illustrated by a specific embodiment; the operation flow is shown in fig. 1:
Firstly, the viewing trajectory of the previous second is input into an LSTM network to predict the viewing point at the next moment.
Secondly, the video frame is mapped into small images in 6 directions by cube mapping, a first CNN network (VGG-16) learns a saliency map for each of the 6 small images, and the resulting saliency maps are merged into a saliency map of the whole video frame.
Thirdly, because the merged saliency map suffers distortion and overlap, it is passed through a second CNN network to obtain the final effective video frame saliency map.
Fourthly, the predicted viewing point at the next moment and the final effective video frame saliency map are input as features into a two-layer fully connected network, which outputs the final predicted point.
In this embodiment, an LSTM network is first adopted to predict the viewing point, a corresponding saliency map is then obtained from the characteristics of the panoramic video frame, and the two are finally combined to obtain the final predicted point. The following describes viewing-point prediction with the LSTM method and the corresponding data set, then the acquisition of the saliency map, and finally how the final predicted point is obtained.
1. Predicting the viewpoint with the LSTM method:
Assume the viewing position 0.1 s ahead is predicted from the previous second's viewing trajectory. The LSTM model is trained in advance on various types of panoramic video; the trained model then yields the predicted position point P_LSTM-0.1s.
2. Obtaining a saliency map using a first CNN network
First, video frames at the corresponding moments in the data set are extracted, i.e., one video frame per second, and each is mapped into 6 directional views using cube mapping. The first CNN model replaces the last pooling layer of the VGG-16 network with 4 convolutional layers so as to better learn the key information of the whole image; this model is also trained in advance on the data set. The 6 views are then passed through the network to obtain corresponding saliency maps, which are finally merged into a saliency map of the whole video frame. A specific example is shown in fig. 2.
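The merging (inverse mapping) described above can be sketched in a minimal form: for every pixel of the equirectangular output, compute its viewing direction from longitude/latitude, pick the covering cube face by the dominant axis, and sample that face's saliency map nearest-neighbour. The face orientations and the sampling scheme are assumptions for illustration; a real implementation would interpolate.

```python
import math

def merge_face_maps(face_maps, width, height):
    """Inverse-map six per-face saliency maps into one equirectangular map.

    `face_maps` is {face_name: NxN list of lists of grayscale values}.
    For every output pixel, the viewing direction is computed from its
    longitude/latitude, the covering cube face is chosen by the dominant
    axis, and the face map is sampled nearest-neighbour.
    """
    n = len(next(iter(face_maps.values())))
    out = [[0.0] * width for _ in range(height)]
    for row in range(height):
        lat = math.pi * (0.5 - (row + 0.5) / height)        # +pi/2 top ... -pi/2 bottom
        for col in range(width):
            lon = 2.0 * math.pi * ((col + 0.5) / width) - math.pi
            x = math.cos(lat) * math.sin(lon)
            y = math.sin(lat)
            z = math.cos(lat) * math.cos(lon)
            ax, ay, az = abs(x), abs(y), abs(z)
            if ax >= ay and ax >= az:                       # left/right faces
                face, u, v = ('right', -z / ax, y / ax) if x > 0 else ('left', z / ax, y / ax)
            elif ay >= az:                                  # up/down faces
                face, u, v = ('up', x / ay, -z / ay) if y > 0 else ('down', x / ay, z / ay)
            else:                                           # front/back faces
                face, u, v = ('front', x / az, y / az) if z > 0 else ('back', -x / az, y / az)
            i = min(n - 1, int((1.0 - (v + 1.0) / 2.0) * n))   # v = +1 -> top row
            j = min(n - 1, int((u + 1.0) / 2.0 * n))
            out[row][col] = face_maps[face][i][j]
    return out
```

Feeding six constant-valued dummy face maps shows each equirectangular region picking up the value of its covering face: the top rows sample the up face, the bottom rows the down face, and the horizontal band alternates through back/left/front/right as longitude sweeps around.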
Cube mapping is adopted because, when viewing a panoramic video, viewers mostly concentrate on the middle region of the picture and pay relatively little attention to the regions above and below; cube mapping accounts for multiple directions, so this approach is used.
3. Obtaining an improved saliency map using a second CNN network, and obtaining a final predicted point by combining with the LSTM predicted point
The saliency maps obtained in the previous step suffer distortion and occlusion during merging, so the merged map is passed through a further CNN network to obtain the refined saliency map; this network model is likewise trained in advance. After the refined saliency map is obtained, P_LSTM-0.1s is combined with it and input into a two-layer fully connected network, which then yields the final predicted point.
The two-layer fully connected network can be implemented directly with existing fully connected layers.
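A two-layer fully connected network of the kind referred to here can be sketched as a plain forward pass. The input would be the LSTM-predicted point concatenated with (a downsampled form of) the saliency map; the hidden-layer size, the ReLU activation and the output dimensionality are assumptions, as the patent does not specify them.

```python
def fc_forward(features, w1, b1, w2, b2):
    """Forward pass of a two-layer fully connected network (sketch).

    `features` would be the LSTM-predicted point concatenated with the
    flattened panoramic-frame saliency map; the output is the final
    predicted viewing position (e.g. longitude, latitude). Weight shapes
    and the ReLU choice are assumptions, not taken from the patent.
    """
    # Hidden layer: affine transform followed by ReLU.
    hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: affine transform, no activation (regression output).
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]
```

In practice the weights would be learned jointly with the rest of the pipeline; the sketch only fixes the data flow from fused features to the final predicted point.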
The following table summarizes the prediction accuracy of the LSTM-based method versus conventional methods, where the values represent the average angular error from the actual viewing position.
TABLE 1 prediction accuracy of the present method and the conventional method
Figs. 4a and 4b compare the prediction accuracy with and without the saliency map at a 1-second prediction interval, including results predicted directly from the saliency map, LSTM-only results, results with one CNN network added, and results with two CNN networks added. The comparison shows that the present method achieves better prediction accuracy than the original method.
Fig. 3 shows frame 7 of the DrivingInCity sequence processed by the two methods in one embodiment of the present invention; compared with the "original method" above, it shows better visual quality and less blocking in some regions of interest, such as the shape of the car.
Fig. 4 shows frame 15 of the AerialCity sequence processed in the two ways in an embodiment of the present invention; the wall edge under the "original method" is distorted, while under the "current method" it is continuous, with higher visual quality.
In summary, the above embodiments of the present invention comprehensively consider the time continuity when viewing the video and the mapping distortion problem of the panoramic video, and combine the two to obtain the final optimal prediction point, thereby achieving a higher prediction accuracy.
It should be noted that the steps of the method provided by the invention can be implemented with the corresponding modules, devices and units of the system; those skilled in the art may refer to the technical solution of the system to implement the method's step flow, i.e., the embodiments of the system may be understood as preferred examples for implementing the method, and details are not repeated here.
Those skilled in the art will appreciate that, besides implementing the system and its various devices purely as computer-readable program code, the method steps can equally be implemented by embodying the system and its devices in logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system and its various devices may therefore be regarded as hardware components, and the devices within them for realizing the various functions as structures within those components; means for performing the functions may likewise be regarded simultaneously as software modules and as structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.
Claims (10)
1. A method for predicting the viewing position of a panoramic video with multiple CNN networks, characterized in that the method comprises the following steps:
based on the watching track of the previous period of time, a neural network method is used for predicting the watching point of the next moment;
mapping the panoramic video frame into small video frames in multiple directions, obtaining a corresponding saliency map for each small video frame through a first convolutional neural network (CNN), merging the saliency maps into a saliency map of the whole video frame, and refining the saliency map of the whole video frame through a second convolutional neural network (CNN) to obtain the panoramic video frame saliency map; wherein, when a panoramic video is watched, viewers pay less attention to the upper and lower regions of the frame and more to the middle region, and each region has its own saliency characteristics; the panoramic video frame is therefore mapped to obtain views in 6 directions, namely up, down, front, back, left and right, the 6 views are respectively passed through the first CNN network to obtain 6 corresponding saliency maps, and the 6 saliency maps are then inverse-mapped into a saliency map of the whole video frame, the saliency map being a grayscale image;
and inputting the predicted viewing point and the panoramic video frame saliency map into a full-connection network to obtain a final predicted point, namely a panoramic video viewing position point.
2. The method for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 1, wherein: the neural network method adopts an LSTM model, which reads the viewing trajectory of the previous second and takes it as input to predict the viewing point at the next moment.
3. The method for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 1, wherein: the mapping of the panoramic video frame into small video frames in multiple directions, wherein:
and performing cube mapping on the panoramic video frame to be predicted to obtain small video frames in six directions, namely up, down, front, rear, left and right.
4. The method for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 1, wherein: and obtaining a corresponding saliency map of each small video frame through a first Convolutional Neural Network (CNN), wherein:
and (3) passing each small video frame through the trained VGG16 network to obtain a corresponding saliency map.
5. The method for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 1, wherein: the fully connected network is a two-layer fully connected network.
6. A system for predicting the viewing position of a panoramic video with multiple CNN networks, characterized by comprising:
the neural network module is used for predicting a viewing point at the next moment by using a neural network method according to the viewing track of the previous period of time;
a mapping module that maps a panoramic video frame at a time to be predicted into small video frames in a plurality of directions;
a saliency map construction module, which obtains a corresponding saliency map for each small video frame through a first convolutional neural network (CNN), merges the saliency maps into a saliency map of the whole video frame, and refines the saliency map of the whole video frame through a second convolutional neural network (CNN) to obtain the panoramic video frame saliency map; wherein, when a panoramic video is watched, viewers pay less attention to the upper and lower regions of the frame and more to the middle region, and each region has its own saliency characteristics; the panoramic video frame is therefore mapped to obtain views in 6 directions, namely up, down, front, back, left and right, the 6 views are respectively passed through the first CNN network to obtain 6 corresponding saliency maps, and the 6 saliency maps are then inverse-mapped into a saliency map of the whole video frame, the saliency map being a grayscale image;
and a prediction module for inputting the viewing point predicted by the neural network module and the panoramic video frame saliency map obtained by the saliency map construction module into a fully connected network to obtain the final predicted point, namely the panoramic video viewing position point.
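The inverse-mapping step of the saliency map construction module above can be sketched as follows: for every pixel of the equirectangular frame, its viewing direction selects one of the six face saliency maps, which is then sampled. The face names, axis conventions (x right, y up, z forward), and nearest-neighbor sampling are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def merge_face_saliency(faces: dict, H: int, W: int) -> np.ndarray:
    """Inverse cube mapping: merge six per-face gray saliency maps into one
    equirectangular saliency map of the whole panoramic frame (a sketch;
    face naming/orientation are illustrative assumptions).

    faces: {'up','down','front','back','left','right'} -> (S, S) uint8 maps.
    Returns an (H, W) uint8 equirectangular gray map (nearest-neighbor).
    """
    S = next(iter(faces.values())).shape[0]
    out = np.empty((H, W), dtype=np.uint8)
    for v in range(H):
        lat = (0.5 - (v + 0.5) / H) * np.pi          # +pi/2 .. -pi/2
        for u in range(W):
            lon = ((u + 0.5) / W - 0.5) * 2 * np.pi  # -pi .. +pi
            # Unit viewing direction: x right, y up, z forward.
            x = np.cos(lat) * np.sin(lon)
            y = np.sin(lat)
            z = np.cos(lat) * np.cos(lon)
            ax, ay, az = abs(x), abs(y), abs(z)
            # Dominant axis picks the cube face; the other two coordinates,
            # divided by it, give in-face coordinates in [-1, 1].
            if ax >= ay and ax >= az:
                face, a, b = ('right', -z, -y) if x > 0 else ('left', z, -y)
                m = ax
            elif ay >= az:
                face, a, b = ('up', x, z) if y > 0 else ('down', x, -z)
                m = ay
            else:
                face, a, b = ('front', x, -y) if z > 0 else ('back', -x, -y)
                m = az
            i = min(int((b / m * 0.5 + 0.5) * S), S - 1)  # row on the face
            j = min(int((a / m * 0.5 + 0.5) * S), S - 1)  # column on the face
            out[v, u] = faces[face][i, j]
    return out
```

In the claimed pipeline, the resulting whole-frame gray map would then be refined by the second CNN before being fed to the prediction module.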
7. The system for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 6, wherein, in the mapping module:
cube mapping is performed on the panoramic video frame at the time to be predicted to obtain small video frames in six directions, namely up, down, front, back, left, and right.
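The cube mapping named in this claim can be sketched as a gnomonic projection: each of the six face images is filled by casting a ray through each face pixel and sampling the equirectangular frame. The axis conventions (x right, y up, z forward) and nearest-neighbor sampling are illustrative assumptions, not details from the patent.

```python
import numpy as np

def cube_faces(equi: np.ndarray, S: int) -> dict:
    """Cube-map a panoramic (equirectangular) frame into six SxS face
    images, one per direction: up, down, front, back, left, right.
    A sketch of claim 7's mapping module under assumed axis conventions.
    """
    H, W = equi.shape[:2]
    # Each face: (forward axis, in-face right axis, in-face down axis).
    axes = {
        'front': (( 0, 0,  1), ( 1, 0,  0), (0, -1,  0)),
        'back':  (( 0, 0, -1), (-1, 0,  0), (0, -1,  0)),
        'right': (( 1, 0,  0), ( 0, 0, -1), (0, -1,  0)),
        'left':  ((-1, 0,  0), ( 0, 0,  1), (0, -1,  0)),
        'up':    (( 0, 1,  0), ( 1, 0,  0), (0,  0,  1)),
        'down':  (( 0, -1, 0), ( 1, 0,  0), (0,  0, -1)),
    }
    faces = {}
    for name, (f, r, d) in axes.items():
        f, r, d = map(np.array, (f, r, d))
        face = np.empty((S, S) + equi.shape[2:], dtype=equi.dtype)
        for i in range(S):
            for j in range(S):
                a = (j + 0.5) / S * 2 - 1            # in-face x in [-1, 1]
                b = (i + 0.5) / S * 2 - 1            # in-face y in [-1, 1]
                x, y, z = f + a * r + b * d          # ray through face pixel
                lon = np.arctan2(x, z)               # -pi .. pi
                lat = np.arctan2(y, np.hypot(x, z))  # -pi/2 .. pi/2
                u = min(int((lon / (2 * np.pi) + 0.5) * W), W - 1)
                v = min(int((0.5 - lat / np.pi) * H), H - 1)
                face[i, j] = equi[v, u]              # nearest-neighbor sample
        faces[name] = face
    return faces
```

Each of the six face images would then be fed to the first CNN to produce its saliency map.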
8. The system for predicting a panoramic video viewing position by multiple CNN networks as claimed in claim 6, wherein, in the prediction module:
the fully connected network is a two-layer fully connected network.
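A minimal forward pass for the prediction module's two-layer fully connected network might look as follows. The patent fixes only the two-layer FC shape; the input layout (predicted viewing point concatenated with the flattened saliency map), the ReLU hidden layer, the hidden size, and the random untrained weights are all illustrative assumptions.

```python
import numpy as np

def two_layer_fc_predict(viewpoint, saliency, params):
    """Two-layer fully connected network of the prediction module (sketch).

    viewpoint: (2,) viewing point predicted from the trajectory.
    saliency:  (H, W) gray saliency map of the panoramic frame.
    params:    dict with weights W1 (D, K), b1 (K,), W2 (K, 2), b2 (2,).
    Returns the final predicted (x, y) viewing position point.
    """
    x = np.concatenate([np.asarray(viewpoint, dtype=float),
                        saliency.astype(float).ravel() / 255.0])
    h = np.maximum(0.0, x @ params['W1'] + params['b1'])  # hidden layer, ReLU
    return h @ params['W2'] + params['b2']                # final (x, y) point

def init_params(H, W, K=32, seed=0):
    """Random (untrained) parameters sized for an HxW saliency map."""
    rng = np.random.default_rng(seed)
    D = 2 + H * W
    return {'W1': rng.normal(0, 0.1, (D, K)), 'b1': np.zeros(K),
            'W2': rng.normal(0, 0.1, (K, 2)), 'b2': np.zeros(2)}
```

In practice the weights would be trained on recorded viewing trajectories so that the network learns to pull the trajectory-based prediction toward salient regions.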
9. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to perform the method of any of claims 1 to 5 when executing the program.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910465138.1A CN110166850B (en) | 2019-05-30 | 2019-05-30 | Method and system for predicting panoramic video watching position by multiple CNN networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110166850A CN110166850A (en) | 2019-08-23 |
CN110166850B true CN110166850B (en) | 2020-11-06 |
Family
ID=67630671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910465138.1A Active CN110166850B (en) | 2019-05-30 | 2019-05-30 | Method and system for predicting panoramic video watching position by multiple CNN networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110166850B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112468828B (en) * | 2020-11-25 | 2022-06-17 | 深圳大学 | Code rate distribution method and device for panoramic video, mobile terminal and storage medium |
CN113329266B (en) * | 2021-06-08 | 2022-07-05 | 合肥工业大学 | Panoramic video self-adaptive transmission method based on limited user visual angle feedback |
CN113949893A (en) * | 2021-10-15 | 2022-01-18 | 中国联合网络通信集团有限公司 | Live broadcast processing method and device, electronic equipment and readable storage medium |
CN114979652A (en) * | 2022-05-20 | 2022-08-30 | 北京字节跳动网络技术有限公司 | Video processing method and device, electronic equipment and storage medium |
CN115022546B (en) * | 2022-05-31 | 2023-11-14 | 咪咕视讯科技有限公司 | Panoramic video transmission method, device, terminal equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103020933A (en) * | 2012-12-06 | 2013-04-03 | 天津师范大学 | Multi-source image fusion method based on bionic visual mechanism |
CN106157319A (en) * | 2016-07-28 | 2016-11-23 | 哈尔滨工业大学 | The significance detection method that region based on convolutional neural networks and Pixel-level merge |
CN108462868A (en) * | 2018-02-12 | 2018-08-28 | 叠境数字科技(上海)有限公司 | The prediction technique of user's fixation point in 360 degree of panorama VR videos |
CN108694471A (en) * | 2018-06-11 | 2018-10-23 | 深圳市唯特视科技有限公司 | A kind of user preference prediction technique based on personalized attention network |
CN108765383A (en) * | 2018-03-22 | 2018-11-06 | 山西大学 | Video presentation method based on depth migration study |
CN109784150A (en) * | 2018-12-06 | 2019-05-21 | 东南大学 | Video driving behavior recognition methods based on multitask space-time convolutional neural networks |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8502860B2 (en) * | 2009-09-29 | 2013-08-06 | Toyota Motor Engineering & Manufacturing North America (Tema) | Electronic control system, electronic control unit and associated methodology of adapting 3D panoramic views of vehicle surroundings by predicting driver intent |
CN105323552B (en) * | 2015-10-26 | 2019-03-12 | 北京时代拓灵科技有限公司 | A kind of panoramic video playback method and system |
CN105915937B (en) * | 2016-05-10 | 2019-12-13 | 上海乐相科技有限公司 | Panoramic video playing method and device |
US10547704B2 (en) * | 2017-04-06 | 2020-01-28 | Sony Interactive Entertainment Inc. | Predictive bitrate selection for 360 video streaming |
US10062414B1 (en) * | 2017-08-22 | 2018-08-28 | Futurewei Technologies, Inc. | Determining a future field of view (FOV) for a particular user viewing a 360 degree video stream in a network |
CN109257584B (en) * | 2018-08-06 | 2020-03-10 | 上海交通大学 | User watching viewpoint sequence prediction method for 360-degree video transmission |
Non-Patent Citations (1)
Title |
---|
Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM; Lai Jiang et al.; https://arxiv.org/abs/1709.06316; 2019-01-14; pages 2, 6-12, Figure 9 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110166850B (en) | Method and system for predicting panoramic video watching position by multiple CNN networks | |
US12015787B2 (en) | Predicting and verifying regions of interest selections | |
US10560725B2 (en) | Aggregated region-based reduced bandwidth video streaming | |
CN108370416A (en) | Generating output video from a video stream |
CN110049336B (en) | Video encoding method and video decoding method | |
WO2021227704A1 (en) | Image recognition method, video playback method, related device, and medium | |
CN111402399A (en) | Face driving and live broadcasting method and device, electronic equipment and storage medium | |
CN109688407B (en) | Reference block selection method and device for coding unit, electronic equipment and storage medium | |
CN105144728A (en) | Resilience in the presence of missing media segments in dynamic adaptive streaming over http | |
US11159823B2 (en) | Multi-viewport transcoding for volumetric video streaming | |
CN113365156A (en) | Panoramic video multicast stream view angle prediction method based on limited view field feedback | |
US20140082208A1 (en) | Method and apparatus for multi-user content rendering | |
Li et al. | A super-resolution flexible video coding solution for improving live streaming quality | |
CA3182110A1 (en) | Reinforcement learning based rate control | |
CN105578110A (en) | Video call method, device and system | |
CN105407313A (en) | Video calling method, equipment and system | |
CN113747242A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
Hu et al. | Mobile edge assisted live streaming system for omnidirectional video | |
CN114157868B (en) | Video frame coding mode screening method and device and electronic equipment | |
CN112399231A (en) | Playing method | |
CN111988520B (en) | Picture switching method and device, electronic equipment and storage medium | |
CN113996056A (en) | Data sending and receiving method of cloud game and related equipment | |
CN114363710A (en) | Live broadcast watching method and device based on time shifting acceleration | |
Li et al. | Perceptual quality assessment of face video compression: A benchmark and an effective method | |
CN115086665A (en) | Error code masking method, device, system, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||