CN114173206A - Low-complexity viewpoint prediction method fusing user interest and behavior characteristics - Google Patents
- Publication number: CN114173206A (application CN202111510706.9A)
- Authority: CN (China)
- Prior art keywords: prediction, user, frame, viewpoint, block
- Legal status: Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/593—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/442—Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
- H04N21/44204—Monitoring of content usage, e.g. the number of times a movie has been viewed, copied or the amount which has been watched
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4667—Processing of monitored end-user data, e.g. trend analysis based on the log file of viewer selections
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g 3D video
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a low-complexity viewpoint prediction method that fuses user interest and behavior characteristics. The method divides the video for which viewpoint prediction is to be performed into several video segments and marks the sequence numbers of the salient objects in each segment using the video-frame saliency maps. It then obtains the viewpoint dwell time of users who have already watched the video on each salient object, classifies users by dwell time, builds an interest model for each user class from the dwell times of users in that class, and combines the class interest model with the video-frame saliency maps to obtain an interest distribution map. A user behavior model is constructed from the random motion of the user viewpoint and the viewpoint feedback of videos the user has watched historically. On the basis of the video saliency analysis, user interest and user behavior characteristics are jointly considered to establish a low-complexity viewpoint prediction model that can accurately predict the future viewpoint position of the user over the long term, and this model is used to predict the user viewpoint position.
Description
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a low-complexity viewpoint prediction method fusing user interest and behavior characteristics.
Background
Virtual Reality (VR) video has attracted wide attention thanks to its immersive viewing experience and low-cost, convenient viewing mode, and is currently one of the fastest-growing online VR applications. According to a VR/AR industry report released in 2016, VR video services were expected to have 52 million users in 2020, accounting for 40% of all expected users of VR applications, and the VR video user base was expected to reach 174 million by 2025. However, the "high bit rate, low latency" characteristics of VR video pose a great challenge to network transmission. Especially in mobile networks, limited bandwidth resources and time-varying network transmission capability seriously hinder improvement of the VR video viewing experience.
A VR video covers a 360-degree field of view, while the horizontal field of view of the human eye generally does not exceed 180 degrees, and the field of view supported by VR terminal devices (such as VR headsets) is only about 90-110 degrees. Therefore, in recent years, tile-based VR video adaptive transmission schemes have become a hot topic and a common consensus in academia and industry. Such a scheme divides the VR video spatially into multiple video tiles and dynamically selects the tiles within the current viewing angle for transmission according to the user's viewpoint, which reduces the network bandwidth required by VR video while preserving a good visual experience. To avoid picture delay, stalling, or quality degradation caused by transmission latency when the user's viewpoint switches, viewpoint prediction is used to predict the user's new viewpoint at the next moment, and the tiles within the new viewing angle are pre-downloaded and pre-cached. Accurate prediction of the user viewpoint is therefore of great significance for improving the viewing experience.
The most mature and widely applied viewpoint prediction methods at the current stage of research can be roughly divided into two types: prediction methods based on motion estimation and prediction methods based on content analysis.
Prediction methods based on motion estimation predict the future viewing position mainly from the user's historical browsing behavior. They ignore the guiding effect of video content on the user's viewpoint, and the user characteristics they exploit are limited to the recent motion of the user's head, which makes long-term viewpoint prediction difficult.
Prediction methods based on content analysis improve the accuracy of long-term viewpoint prediction to a certain extent through video saliency features or correlation analysis of browsed content, but they do not explore in depth the influence of user characteristics (such as interests, habits, and behaviors) on viewpoint prediction, and they have difficulty accurately reflecting the internal rules governing how different users' viewpoints change. At the same time, these methods have extremely high implementation complexity, consuming time, labor, and money, and are difficult to use for VR video real-time communication. They can be further divided into two approaches: one determines the region where the user's future viewpoint will lie by exploiting the strong correlation between the contents the user browses; the other performs prediction based on the saliency features of the video. Saliency features reflect how interested users are in the video content of each region. Generally, the stronger the saliency, the more interested the user, and the higher the viewing probability. Salient-region extraction for static images is relatively mature, so many video saliency detection methods start from existing image saliency detection models and extend them to video by introducing motion features.
Since most videos transmitted and stored on the Internet are compression-coded, many scholars have recently begun to explore video saliency detection algorithms that work in the compressed domain, avoiding the complex computation caused by full decoding. Xu et al. calculate a motion saliency map using the sum of the absolute values of the motion vectors and adaptively fuse it with a static saliency map to obtain the final saliency map. Muthuswamy et al. argue that motion plays a decisive role in video saliency detection, and therefore modify the still-image saliency map with an accumulated temporal motion map and combine it with a spatiotemporal similarity representing camera motion to achieve the final detection. To make better use of motion vectors in computing the motion saliency map, Fang et al. calculate the static saliency map of I frames using fixed Gaussian-weighted DCT coefficients and the motion saliency map of P frames using motion-vector weighting, and introduce a fusion rule of normalization, summation, and parameter multiplication to fuse the two maps; applying Gaussian weights to the motion vectors further improves the performance of the algorithm. However, although compressed-domain video saliency analysis greatly reduces computational complexity, its accuracy is difficult to guarantee because the intra-frame prediction coding modes are not considered.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a low-complexity viewpoint prediction method that fuses user interest and behavior characteristics. On the basis of analyzing the video saliency features, it comprehensively considers personalized characteristics such as user interest and user behavior to establish a low-complexity viewpoint prediction model capable of accurately predicting the future viewpoint position of the user over the long term.
To achieve this purpose, the invention provides the following technical scheme. A low-complexity viewpoint prediction method fusing user interest and behavior characteristics comprises the following specific steps:
S1, acquiring the video-frame saliency maps of the video for which viewpoint prediction is to be performed, including the I-frame saliency maps and the P-frame saliency maps;
S2, dividing the video into several video segments and marking, with the video-frame saliency maps, the sequence numbers of the pi most salient objects in each segment;
S3, acquiring the viewpoint dwell time of users who have watched the video on the pi most salient objects, classifying the users according to the dwell time, obtaining an interest model for each user class from the dwell times of users of that class on the pi most salient objects of the video, and combining the class interest model with the video-frame saliency maps to obtain an interest distribution map for each frame of the video;
S4, constructing a user behavior model from the random motion of the user viewpoint and the viewpoint feedback of videos the user has watched historically, and deriving from it a user behavior distribution map reflecting the probability of the user viewpoint appearing at each position;
S5, combining the interest distribution map of the user's class with the user behavior distribution map to obtain a viewpoint prediction model, and predicting the user's viewpoint position with this model (see the sketch after this list).
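For concreteness, the following minimal Python sketch shows how the maps produced by steps S1-S5 could be combined; the array shapes, the multiplicative fusion, and all function names are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def predict_viewpoints(saliency, interest_weights, behavior):
    """saliency: (F, H, W) per-frame saliency maps (S1);
    interest_weights: (H, W) per-pixel interest weights derived from the
    class interest model and the salient-object labels (S3);
    behavior: (H, W) viewpoint-occurrence probability map (S4)."""
    interest_maps = saliency * interest_weights          # (F, H, W), S3
    fused = interest_maps * behavior[None, :, :]         # S5 fusion
    flat = fused.reshape(fused.shape[0], -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, fused.shape[1:])
    # predicted viewpoint per frame = location of maximum fused probability
    return list(zip(xs.tolist(), ys.tolist()))

# toy usage with random maps
rng = np.random.default_rng(0)
print(predict_viewpoints(rng.random((3, 64, 128)),
                         rng.random((64, 128)),
                         rng.random((64, 128))))
```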
Further, in step S1, the specific step of generating the I-frame saliency map is as follows:
S1.1, from the intra-frame prediction coding modes and the residual DCT (discrete cosine transform) coefficients, estimating the DC coefficient the image block would have if it were DCT-transformed directly, without prediction; this DC coefficient represents the luminance and color characteristics of the image block;
S1.2, acquiring the prediction direction corresponding to the intra-frame prediction coding mode and taking it as the texture direction of the intra-prediction-coded image block, and taking the texture intensity of a neighboring block whose texture direction is closest to that of the block as the texture intensity of the intra-prediction-coded image block;
S1.3, recovering the original pixel values of I_PCM-coded image blocks from the compressed domain and computing their DCT coefficients from these pixel values; the DC coefficient expresses the luminance and color of the I_PCM-coded block, and the AC coefficients express its texture direction and intensity characteristics;
S1.4, constructing the motion vector set of the I-frame image from the coding modes and motion vectors of the inter-prediction-coded image blocks in the P frames before and after the I frame, exploiting the temporal continuity of the video content;
S1.5, performing saliency detection separately on the luminance, color, texture intensity, texture direction, and motion features of the I-frame image, and adaptively fusing the detection results into the I-frame saliency map;
the specific steps for generating the P-frame saliency map are as follows:
S1.6, acquiring the motion vectors of the inter-prediction-coded image blocks in the compressed domain, sorting and filling them, and establishing a complete motion vector set for each P frame;
S1.7, using the temporal reference relationship between P-frame image blocks and I-frame image blocks in inter-frame prediction coding, translating the saliency features of the I-frame image blocks as indicated by the motion vectors to obtain the P-frame saliency map (a sketch of this translation follows).
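The translation in step S1.7 can be pictured with the following sketch, which copies the saliency of each referenced I-frame block to the corresponding P-frame block; the block size and the motion-field layout are assumptions for illustration.

```python
import numpy as np

def propagate_saliency(i_frame_sal, mv_field, block=4):
    """i_frame_sal: (H, W) saliency map of the reference I frame;
    mv_field: (H//block, W//block, 2) motion vectors (dx, dy) in pixels
    pointing from each P-frame block to its reference block."""
    H, W = i_frame_sal.shape
    p_sal = np.zeros_like(i_frame_sal)
    for by in range(H // block):
        for bx in range(W // block):
            dx, dy = mv_field[by, bx]
            # clamp the referenced block position to the frame borders
            rx = int(np.clip(bx * block + dx, 0, W - block))
            ry = int(np.clip(by * block + dy, 0, H - block))
            p_sal[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                i_frame_sal[ry:ry+block, rx:rx+block]
    return p_sal
```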
Further, in step S1.1, for an N×N image block $i$ of the video for which viewpoint prediction is to be performed that uses intra-frame prediction coding, the DCT transform coefficients $C_i$ of image block $i$ can be calculated from equation (1-1):
$C_i = C_i^{\mathrm{pred}} + C_i^{\mathrm{res}} \qquad (1\text{-}1)$
where $C_i^{\mathrm{pred}}$ denotes the DCT coefficients of the intra prediction block corresponding to image block $i$, and $C_i^{\mathrm{res}}$ denotes the DCT coefficients of the intra prediction residual block corresponding to image block $i$;
the DCT coefficients $C_i^{\mathrm{res}}$ of the intra prediction residual block are extracted directly from the compressed domain of the video;
by the definition of the DCT, $C_i^{\mathrm{pred}}$ takes the form of equation (1-2):
$C_i^{\mathrm{pred}}(u,v) = \frac{2}{N}\,c(u)\,c(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\hat p_i(x,y)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N} \qquad (1\text{-}2)$
where $\hat p_i(x,y)$ denotes the intra prediction value of the pixel of image block $i$ at $(x,y)$;
if $\{s_{i,q}\}$, $q=0,1,\dots,Q-1$, is defined as the set of coded and reconstructed neighboring pixels used to predict image block $i$, the intra prediction value of each pixel of image block $i$ is calculated from equation (1-3):
$\hat p_i(x,y) = \sum_{q=0}^{Q-1} w_{i,q}(x,y)\,p(s_{i,q}) \qquad (1\text{-}3)$
where $p(s_{i,q})$ is the pixel value of pixel $s_{i,q}$ and $w_{i,q}(x,y)$ is the prediction weight corresponding to pixel $s_{i,q}$;
define $DC^{(j)}$, $j=0,1,2,\dots,J$, as the DC coefficients of the neighboring 4×4 blocks that contain the coded reconstructed pixels $s_{i,q}$, $q=0,1,\dots,Q-1$, used to predict the image block; assuming that the pixels of a 4×4 block are equal to one another and equal to the average of all pixels of the whole 4×4 block, the DC coefficient $DC_i^{\mathrm{pred}}$ of the prediction block can be calculated from equation (1-4), i.e.
$DC_i^{\mathrm{pred}} = \sum_j w_j\,DC^{(j)} \qquad (1\text{-}4)$
where the weights $w_j$ are determined by the prediction mode adopted;
substituting equation (1-4) into equation (1-1), the DC coefficient representing the luminance and color characteristics of image block $i$ can be calculated by equation (1-5):
$DC_i = DC_i^{\mathrm{pred}} + DC_i^{\mathrm{res}} \qquad (1\text{-}5)$
Further, in step S1.1, the prediction pixels of a 4×4 partition are selected from the pixels $s_{i,0}\sim s_{i,12}$ of the four neighboring blocks located to its upper left, above, upper right, and left. Let $DC^{(j)}$, $j=0,1,2,3$, denote the DC coefficients of these four neighboring blocks. Then $DC_i^{\mathrm{pred}}$ is obtained as follows (a code sketch of the first three cases follows this list):
1) when the prediction mode of the 4×4 block is 0 (vertical), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}$ of the neighboring block above it; therefore, by equation (1-4), $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block above;
2) when the prediction mode of the 4×4 block is 1 (horizontal), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left; therefore $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block on the left;
3) when the prediction mode of the 4×4 block is 2 (DC), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks above and to the left, every pixel being the rounded mean of those eight pixels, where round(α) denotes rounding the value α; therefore $DC_i^{\mathrm{pred}}$ is the average of the DC coefficients of the blocks above and on the left;
4) when the prediction mode of the 4×4 block is 3 (diagonal down-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,8}$ of the neighboring blocks above and to its upper right;
5) when the prediction mode of the 4×4 block is 4 (diagonal down-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
6) when the prediction mode of the 4×4 block is 5 (vertical-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,10}$ of the neighboring blocks to its upper left, above, and left;
7) when the prediction mode of the 4×4 block is 6 (horizontal-down), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,3}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
8) when the prediction mode of the 4×4 block is 7 (vertical-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,7}$ of the neighboring blocks above and to its upper right;
9) when the prediction mode of the 4×4 block is 8 (horizontal-up), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left;
in modes 3-8, $DC_i^{\mathrm{pred}}$ likewise follows from equation (1-4), with the weights $w_j$ induced by the listed prediction pixels.
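The following sketch restates the first three cases in code; it assumes the neighboring blocks' DC coefficients are already available, and modes 3-8 are omitted because their weights are mode-specific.

```python
def dc_pred_4x4(mode, dc_above, dc_left):
    """DC coefficient of the intra prediction block of a 4x4 partition,
    under the assumption of equation (1-4) that all pixels of a 4x4
    neighbouring block equal that block's average."""
    if mode == 0:      # vertical: predicted entirely from the block above
        return dc_above
    if mode == 1:      # horizontal: predicted entirely from the left block
        return dc_left
    if mode == 2:      # DC: mean of the above and left neighbours
        return 0.5 * (dc_above + dc_left)
    raise NotImplementedError("modes 3-8: mode-specific weighted sums")
```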
Further, in step S1.1, when intra prediction is performed on a 16×16 partition $m$, the DC coefficient $DC_i^{\mathrm{pred}}$ of each 4×4 block $i$ ($i=0,1,\dots,15$) in the 16×16 partition is determined as follows:
1) when the prediction mode of 16×16 partition $m$ is 0 (vertical), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,16}$ of the neighboring partition above $m$; here mod(·,·) denotes the modulo operation, i.e., mod(i,4) returns the remainder of $i$ divided by 4; hence, if the DC coefficients of the 4×4 neighboring blocks above are denoted, from left to right, $DC^{(p)}$, $p=1,2,3,4$, then $DC_i^{\mathrm{pred}} = DC^{(\mathrm{mod}(i,4)+1)}$;
2) when the prediction mode of 16×16 partition $m$ is 1 (horizontal), each 4×4 block $i$ is predicted from the pixels $s_{m,17}\sim s_{m,32}$ of the neighboring partition to the left of $m$; hence, if the DC coefficients of the 4×4 neighboring blocks on the left are denoted, from top to bottom, $DC^{(p)}$, $p=5,6,7,8$, then $DC_i^{\mathrm{pred}} = DC^{(\lfloor i/4\rfloor+5)}$;
3) when the prediction mode of 16×16 partition $m$ is 2 (DC), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$; therefore $DC_i^{\mathrm{pred}} = \tfrac{1}{8}\sum_{p=1}^{8} DC^{(p)}$;
4) when the prediction mode of 16×16 partition $m$ is 3 (plane), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$ using the plane-mode expression, in which $i=0,1,\dots,15$; $x,y=0,1,2,3$; and Clip1(x)=min(255, max(0,x)); $DC_i^{\mathrm{pred}}$ then follows from equation (1-4) as the average of the plane-predicted pixels of block $i$.
Further, in step S1.2, the texture intensity $T_i$ of the intra-prediction-coded image block $i$ is obtained from equation (1-23) by scaling $T_j$ with the ratio of the partition sizes,
where $N_i\times N_i$ denotes the partition size of image block $i$, $N_j\times N_j$ denotes the partition size of the neighboring block $j$ whose texture direction is closest to that of image block $i$ and whose prediction weight is highest, and $T_j$ denotes the texture intensity of that neighboring block.
Further, in step S1.3, the texture direction $\theta_{i'}$ and texture intensity $T_{i'}$ of an I_PCM-coded image block $i'$ are expressed through the AC coefficients of its DCT coefficients, as shown in equations (1-24) and (1-25),
where $N_{i'}\times N_{i'}$ denotes the partition size of the I_PCM-coded image block $i'$, and the DCT coefficients $C_{i'}(u,v)$, $u,v=0,1,2,3$, are obtained by applying a 4×4 DCT to the original pixel values of block $i'$ recovered from the compressed domain.
Further, in step S3, the interest model $Int_l$ of user class $l$ is given by equation (1-27),
where $l$ is the user class, the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$, and $t_{m_l}^{p,o}$ is the viewpoint dwell time of user $m_l$, who has watched the video, on the pi most salient objects of video segment $p$;
in this formula, $\mathcal{P}$ denotes the set of segments into which the video is divided, $\mathcal{R}_{p,o}$ denotes the set of positions of the region occupied by salient object $o$ in video segment $p$, and $t_{m_l}^{p,o}$ denotes the viewpoint dwell time of user $m_l$ on salient object $o$ of segment $p$;
according to the interest distribution map, the user interest degree $\widetilde S_f(x,y)$ at position $(x,y)$ of frame $f$ is given by equation (1-28),
where $S_f(x,y)$ denotes the saliency at position $(x,y)$ of frame $f$ (a sketch follows).
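A minimal sketch of equation (1-28), assuming the interest weight of the object covering each pixel multiplies the saliency there; the label-map representation is an illustrative assumption, and the exact fusion form is otherwise left open by the text above.

```python
import numpy as np

def interest_map(saliency, labels, interest):
    """saliency: (H, W) saliency map S_f of frame f;
    labels: (H, W) integer map, labels[y, x] = salient-object number o
    (0 = background); interest: dict mapping o -> class interest Int_l(o)."""
    weights = np.ones_like(saliency)
    for o, w in interest.items():
        weights[labels == o] = w          # weight pixels of object o
    return saliency * weights             # per-pixel interest degree
```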
Further, in step S4, the "current" statistical model is used to describe the random motion of the user's viewpoint; the motion prediction equation is given by equations (1-29),
in which $x_f, y_f$, $\dot x_f, \dot y_f$, and $\ddot x_f, \ddot y_f$ respectively denote the position, velocity, and acceleration of the viewpoint in the x-axis and y-axis directions when the user watches frame $f$; $\bar a_x, \bar a_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and $\alpha$ is the reciprocal of the maneuver acceleration time constant, i.e., the maneuver frequency (see the sketch after this paragraph);
the probability that the viewpoint is located at $(x,y)$ when the user watches frame $f+\delta$ can then be calculated by equation (1-30);
the user behavior model $Act_k$ is given by equation (1-31),
and the user behavior distribution map reflecting the probability of the user viewpoint appearing at each position is calculated from equations (1-29) and (1-30).
Further, in step S5, the user viewpoint position is predicted by equation (1-32),
in which $\widetilde S_{f+\delta}(x,y)$ and $Act_{f+\delta}(x,y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position $(x,y)$ of frame $f+\delta$, and the function $\Phi$ is the fusion function of the user interest distribution map and the user behavior distribution map (a sketch follows).
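A sketch of equation (1-32), assuming an elementwise product as the default fusion (the function Φ is left generic above):

```python
import numpy as np

def predict_position(interest_map, behavior_map, phi=np.multiply):
    """The predicted viewpoint at frame f+delta is the location (x, y)
    maximizing phi(interest, behavior); the product default for phi is
    an assumption."""
    fused = phi(interest_map, behavior_map)
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return x, y
```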
Compared with the prior art, the invention has at least the following beneficial effects:
the invention provides a low-complexity viewpoint prediction method fusing user interest and behavior characteristics, which starts from video compression code streams, comprehensively considers the significance characteristics of video contents and the user interest and behavior characteristics, and designs a viewpoint prediction model capable of accurately predicting the future viewpoint position of a user with lower complexity. Specifically, an interest model of a user is constructed, and is combined with the video saliency characteristics to generate an interest distribution map capable of reflecting the interest degree of the user on each salient object of the VR video; estimating the probability of the future viewpoint appearing at different positions for the user according to the current viewpoint position of the user and the behavior characteristics (such as speed, acceleration, maneuvering frequency and the like) of the user; and integrating the user interest distribution and establishing a viewpoint prediction model capable of accurately predicting the long-term viewpoint change of the user. In addition, the significance analysis of the invention is based on video compression domain information, and the spatial correlation between adjacent blocks of video content and the time continuity between adjacent frames are utilized to carry out comprehensive significance analysis on the intra-frame prediction coding block on the basis of the prior art.
The compressed-domain video saliency analysis adopted by the invention effectively reduces computation and implementation complexity. Unlike prior work, which considered only the intra-frame non-predictive and inter-frame predictive coding modes, the proposed method analyzes the luminance, color, and texture saliency of intra-prediction blocks using the extracted prediction-residual DCT coefficients, the spatial correlation of the video content, and the prediction directionality of the intra prediction modes; it estimates the motion vectors missing for intra-coded blocks by exploiting the temporal continuity of the video content; and it finally obtains the video saliency result by adaptive fusion with the other saliency features. Because the intra prediction modes are taken into account, the proposed method effectively improves the accuracy of video saliency analysis based on H.264/AVC compressed bitstreams without increasing computation or implementation complexity.
User characteristics are one of the key factors influencing changes of the user viewpoint. Unlike prior work, which focused only on the influence of the user's recent viewpoint motion, the invention explores in depth the mechanism by which user interest and behavior characteristics act on the user viewpoint, and combines these user characteristics with the compressed-domain saliency features of the video to establish a low-complexity viewpoint prediction model that takes the user as the core and the video content as the guide, significantly improving the accuracy of viewpoint prediction.
Drawings
FIG. 1 is a diagram of a technical route of a viewpoint prediction algorithm research integrating user interests and behavior characteristics;
fig. 2 is a schematic diagram of 4 x 4 block intra prediction;
fig. 3 is a flow chart of a video saliency detection algorithm based on compressed domain.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in fig. 1, the present invention provides a low-complexity viewpoint prediction method fusing user interests and behavior characteristics, and the specific implementation steps include:
(1) video saliency detection based on compressed domain
Saliency detection is performed on the video for which viewpoint prediction is to be performed, yielding the I-frame saliency maps and the P-frame saliency maps, specifically:
1) Extracting compressed-domain information: the coding modes and residual DCT coefficients of intra-prediction-coded image blocks, the coding modes and motion vectors of inter-prediction-coded image blocks, and the pixel values of I_PCM-coded image blocks are extracted from the video compressed domain.
2) I-frame (intra-coded frame) saliency map generation:
estimating a DC coefficient of the image block after direct DCT (discrete cosine transform) conversion without prediction according to the coding mode of the intra-frame prediction coding image block obtained in the step 1) and the residual DCT coefficient so as to represent the brightness and color characteristics of the image block;
selecting a prediction direction corresponding to an intra-frame prediction mode as a texture direction of an intra-frame prediction coding image block, predicting the texture intensity of the image block by using the texture intensity of an adjacent block with the texture direction similar to that of the intra-frame prediction coding image block so as to represent the texture direction and intensity characteristics of the image block, calculating a DCT (discrete cosine transformation) coefficient of the image block by using an original pixel value of an I _ PCM (inter-frame pulse code) coding image block recovered from a compression domain, and estimating the brightness, the color, the texture direction and the intensity characteristics of the image block according to the DCT coefficient;
and according to the coding mode and the motion vector of the inter-frame predictive coding image block of the previous frame and the next frame of the I frame or the P frame extracted from the compressed domain, constructing a motion vector set based on 4 x 4 blocks for the I frame image by utilizing the time continuity of the video content so as to represent the motion characteristic of the I frame image.
And respectively carrying out significance detection on the brightness, the color, the texture intensity, the texture direction and the motion characteristics of the acquired I frame image, and adaptively fusing the brightness, the color, the texture intensity, the texture direction and the motion characteristics into an I frame significance map so as to comprehensively represent the significance characteristics of each object in the I frame.
3) P-frame (inter-coded frame) saliency map generation:
Sorting and filling the motion vectors of the inter-prediction-coded image blocks extracted from the compressed domain, and establishing a complete motion vector set based on 4×4 blocks for each P frame;
then, using the temporal reference relationship provided by the P-frame motion vector set, translating the saliency features of the corresponding matched blocks in the preceding I frame according to the motion vectors, thereby obtaining the P-frame saliency map.
(2) Construction of user interest distribution map
5) Salient-object segmentation and labeling: the whole video for which viewpoint prediction is to be performed is divided into several video segments. For each segment, the I-frame image is divided into several salient objects according to the I-frame saliency map information generated in step (1), and the pi most salient objects in the I-frame image are numbered in descending order of saliency value.
At the same time, according to the temporal reference relationship between P-frame and I-frame image blocks in inter-frame prediction coding, the salient objects of each P-frame image are labeled with the salient-object label values referenced in the preceding I frame.
Preferably, if a salient object of an I frame has the same or similar saliency features as a salient object of the preceding P frame, the I-frame object is labeled according to the label value of that object in the preceding P frame.
6) Viewpoint dwell-time statistics: using historical real-viewpoint feedback, the viewpoint dwell time of users who have watched the video on the pi most salient objects is counted;
7) User classification based on interest similarity: the users are classified with the K-means clustering algorithm from machine learning according to the counted dwell times, so that users within a class have higher interest similarity than users in different classes (see the sketch after this list).
8) Obtaining the interest distribution map of each user class: an interest model is generated for each class from the dwell times of its users on the pi most salient objects of the video, and, combining the I-frame and P-frame saliency maps acquired in step (1), an interest distribution map of every frame of the video is generated for each user class.
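Step 7) can be realized, for example, with scikit-learn; the feature layout (one row per user, one column per dwell-time statistic) and the number of classes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# each row: one user's viewpoint dwell times on the pi most salient
# objects of every segment (here 50 users x 12 dwell-time features)
dwell = np.random.default_rng(1).random((50, 12))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(dwell)
labels = kmeans.labels_            # interest class of each user
centers = kmeans.cluster_centers_  # the users m_1..m_L are those nearest
                                   # to these cluster centres
```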
(3) User behavior profile prediction
9) Constructing the user behavior model: borrowing from the modeling of maneuvering targets, the random motion of the user viewpoint is described with the existing "current" statistical model, and the user behavior model is constructed from the viewpoint feedback of videos the user has watched historically.
10) Generating the user behavior distribution map: a user behavior distribution map reflecting the probability of the user viewpoint appearing at each position is calculated from the user behavior model.
11) Viewpoint prediction: the viewpoint is predicted by combining the interest distribution map of the user's class with the user behavior distribution map.
Example 1
(1) Video saliency detection based on compressed domain
Most videos transmitted and stored on the Internet are compression-coded, so performing saliency detection directly in the compressed domain avoids the complex computation caused by decoding. Obtaining the intensity, color, and texture features of the static image from DCT coefficients, estimating the motion intensity from motion vectors, and performing saliency detection and fusion on these data is currently the most effective approach to compressed-domain video saliency detection. However, none of these methods considers the intra-frame prediction coding modes, so it is difficult for them to perform accurate saliency detection on bitstreams using the H.264/AVC coding standard, which has become one of the mainstream compression coding standards.
1) Extraction of brightness and color characteristics of intra-frame prediction coding image block in I frame
The intra prediction mode is a coding mode that exploits spatial correlation: the current block to be coded is predicted from already-coded neighboring pixels in the same frame, and the DCT is applied to the prediction residual. Therefore, for intra-prediction-coded blocks, the DCT coefficients extracted directly from the compressed domain can no longer be used directly to represent the luminance, color, and texture features of the original image block; they require some preprocessing, as follows.
For an N×N image block $i$ of the video that uses intra-frame prediction coding, the DCT transform coefficients $C_i$ of image block $i$ can be calculated from equation (1-1):
$C_i = C_i^{\mathrm{pred}} + C_i^{\mathrm{res}} \qquad (1\text{-}1)$
where $C_i^{\mathrm{pred}}$ denotes the DCT coefficients of the intra prediction block corresponding to image block $i$, and $C_i^{\mathrm{res}}$ denotes the DCT coefficients of the intra prediction residual block corresponding to image block $i$.
For intra-prediction-coded blocks the encoder applies the DCT only to the intra prediction residual block, so the DCT coefficients extracted directly from the video compressed domain are the residual coefficients $C_i^{\mathrm{res}}$ of image block $i$. By equation (1-1), computing the DCT transform coefficients $C_i$ of image block $i$ therefore only requires estimating the value of $C_i^{\mathrm{pred}}$, the DCT coefficients of the intra prediction block.
According to the principle of the DCT, $C_i^{\mathrm{pred}}$ can be expressed in the form of equation (1-2):
$C_i^{\mathrm{pred}}(u,v) = \frac{2}{N}\,c(u)\,c(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\hat p_i(x,y)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N} \qquad (1\text{-}2)$
where $\hat p_i(x,y)$ denotes the intra prediction value of the pixel of image block $i$ at $(x,y)$.
In general, an image block $i$ is coded in intra prediction mode if and only if it has strong spatial correlation with its neighboring blocks; the predicted pixels of image block $i$ are therefore weighted sums of its neighboring pixels. If $\{s_{i,q}\}$, $q=0,1,\dots,Q-1$, is defined as the set of coded and reconstructed neighboring pixels used to predict image block $i$, the intra prediction value of each pixel of image block $i$ is calculated from equation (1-3):
$\hat p_i(x,y) = \sum_{q=0}^{Q-1} w_{i,q}(x,y)\,p(s_{i,q}) \qquad (1\text{-}3)$
where $p(s_{i,q})$ is the pixel value of pixel $s_{i,q}$ and $w_{i,q}(x,y)$ is the prediction weight corresponding to pixel $s_{i,q}$.
Define $DC^{(j)}$, $j=0,1,2,\dots,J$, as the DC coefficients of the neighboring 4×4 blocks that contain the coded reconstructed pixels $s_{i,q}$, $q=0,1,\dots,Q-1$, used to predict image block $i$. Assuming that the pixels of a 4×4 block are equal to one another and equal to the average of all pixels of the whole 4×4 block, the DC coefficient $DC_i^{\mathrm{pred}}$ of the prediction block can be obtained as a weighted sum of the $DC^{(j)}$, i.e.
$DC_i^{\mathrm{pred}} = \sum_j w_j\,DC^{(j)} \qquad (1\text{-}4)$
where the weights $w_j$ depend only on the prediction mode adopted.
Substituting equation (1-4) into equation (1-1), the DC coefficient representing the luminance and color characteristics of image block $i$ can be calculated by equation (1-5):
$DC_i = DC_i^{\mathrm{pred}} + DC_i^{\mathrm{res}} \qquad (1\text{-}5)$
H.264/AVC supports intra prediction with two partition sizes, 4×4 and 16×16. A 4×4 partition has 9 selectable prediction modes, and each 4×4 block of a macroblock is predicted independently (as shown in fig. 2); a 16×16 partition has 4 intra prediction modes and the whole macroblock is predicted at once, which suits the coding of flat image regions.
Specifically, for prediction based on 4×4 partitions, the prediction pixels are selected from the pixels $s_{i,0}\sim s_{i,12}$ of the four neighboring blocks located to the upper left, above, upper right, and left of the block. Letting $DC^{(j)}$, $j=0,1,2,3$, denote the DC coefficients of these four neighboring blocks, $DC_i^{\mathrm{pred}}$ can be calculated as follows:
1) when the prediction mode of the 4×4 block is 0 (vertical), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}$ of the neighboring block above it; therefore, by equation (1-4), $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block above;
2) when the prediction mode of the 4×4 block is 1 (horizontal), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left; therefore $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block on the left;
3) when the prediction mode of the 4×4 block is 2 (DC), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks above and to the left, every pixel being the rounded mean of those eight pixels, where round(α) denotes rounding the value α; therefore $DC_i^{\mathrm{pred}}$ is the average of the DC coefficients of the blocks above and on the left;
4) when the prediction mode of the 4×4 block is 3 (diagonal down-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,8}$ of the neighboring blocks above and to its upper right;
5) when the prediction mode of the 4×4 block is 4 (diagonal down-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
6) when the prediction mode of the 4×4 block is 5 (vertical-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,10}$ of the neighboring blocks to its upper left, above, and left;
7) when the prediction mode of the 4×4 block is 6 (horizontal-down), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,3}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
8) when the prediction mode of the 4×4 block is 7 (vertical-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,7}$ of the neighboring blocks above and to its upper right;
9) when the prediction mode of the 4×4 block is 8 (horizontal-up), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left;
in modes 3-8, $DC_i^{\mathrm{pred}}$ likewise follows from equation (1-4), with the weights $w_j$ induced by the listed prediction pixels.
For intra prediction based on 16×16 partitions, the invention derives, in a similar way, a corresponding strategy to estimate the DC coefficient $DC_i^{\mathrm{pred}}$ of each 4×4 block $i$ ($i=0,1,\dots,15$) in the 16×16 partition, so as to represent the luminance and color characteristics of the original image:
1) when the prediction mode of 16×16 partition $m$ is 0 (vertical), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,16}$ of the neighboring partition above $m$; here mod(·,·) denotes the modulo operation, i.e., mod(i,4) returns the remainder of $i$ divided by 4; hence, if the DC coefficients of the 4×4 neighboring blocks above are denoted, from left to right, $DC^{(p)}$, $p=1,2,3,4$, then $DC_i^{\mathrm{pred}} = DC^{(\mathrm{mod}(i,4)+1)}$;
2) when the prediction mode of 16×16 partition $m$ is 1 (horizontal), each 4×4 block $i$ is predicted from the pixels $s_{m,17}\sim s_{m,32}$ of the neighboring partition to the left of $m$; hence, if the DC coefficients of the 4×4 neighboring blocks on the left are denoted, from top to bottom, $DC^{(p)}$, $p=5,6,7,8$, then $DC_i^{\mathrm{pred}} = DC^{(\lfloor i/4\rfloor+5)}$;
3) when the prediction mode of 16×16 partition $m$ is 2 (DC), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$; therefore $DC_i^{\mathrm{pred}} = \tfrac{1}{8}\sum_{p=1}^{8} DC^{(p)}$;
4) when the prediction mode of 16×16 partition $m$ is 3 (plane), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$ using the plane-mode expression, in which $i=0,1,\dots,15$; $x,y=0,1,2,3$; and Clip1(x)=min(255, max(0,x)); $DC_i^{\mathrm{pred}}$ then follows from equation (1-4) as the average of the plane-predicted pixels of block $i$.
2) Texture direction and intensity feature extraction of intra-frame prediction coding image block in I frame
The prediction direction corresponding to the intra prediction coding mode is selected as the texture direction of image block $i$, exploiting the close relationship between the intra prediction mode and the image texture information. If the texture direction of neighboring block $j$ is closest to that of image block $i$ and its prediction weight is highest, the texture intensity of block $j$ is used to predict the texture intensity of image block $i$, as in equation (1-23),
where $N_i\times N_i$ denotes the partition size of image block $i$, $N_j\times N_j$ denotes the partition size of the neighboring block $j$ whose texture direction is closest to that of image block $i$ and whose prediction weight is highest, and $T_j$ denotes the texture intensity of that neighboring block.
3) Extraction of brightness, color, texture direction and intensity characteristics of I _ PCM coded image block in I frame
A 4×4 DCT is applied to the original pixel values of each I_PCM-coded image block recovered from the compressed domain, and the block's luminance, color, texture direction, and intensity characteristics are estimated from the resulting DCT coefficients. The luminance and color of an I_PCM-coded image block $i'$ are described by the DC coefficient of its DCT coefficients, and the texture direction $\theta_{i'}$ and intensity $T_{i'}$ are calculated from the AC coefficients, as shown in equations (1-24) and (1-25),
where $N_{i'}\times N_{i'}$ denotes the partition size of the I_PCM-coded image block $i'$. Since smaller partition sizes are typically used for regions richer in texture, a scale factor is introduced in equation (1-25) to ensure that smaller partitions receive greater texture intensity.
4) I-frame image motion feature estimation
The motion features of the I-frame image are described by motion vectors. Because all image blocks of an I frame are intra-coded, motion vectors cannot be extracted for them directly from the compressed bitstream; the motion vector of an I-frame image block must instead be interpolated from the motion vectors of the image blocks of the P frames before and after it, making full use of the temporal continuity of the video content. The motion vectors of the inter-prediction-coded blocks of P frames are extracted directly from the compressed domain; they were chosen from a coding perspective, contain a certain amount of noise, and hardly represent the true motion features, whereas a truly moving object appears in each frame image as a region. The motion vectors are therefore preprocessed before interpolation (a simplified sketch follows), including:
a) motion vector filling: from the perspective of spatial correlation, the motion vectors missing for intra-prediction blocks are estimated from the motion vectors of neighboring blocks along the prediction direction;
b) global motion filtering: the global motion vectors are removed to obtain motion vectors that truly reflect object motion;
c) temporal-spatial amplitude filtering: isolated motion-vector noise of small amplitude is filtered out from the perspective of temporal continuity and spatial correlation;
d) temporal-spatial phase filtering: isolated motion-vector noise with abrupt direction changes is filtered out from the perspective of phase consistency;
e) motion region dilation: holes in the motion regions are connected to improve the integrity of moving objects.
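A simplified sketch of steps b)-d) on a block motion field is given below (filling (a) and region dilation (e) are omitted); the median-based global-motion estimate and the thresholds are illustrative assumptions.

```python
import numpy as np

def preprocess_motion_field(mv, amp_thresh=1.0):
    """mv: (Hb, Wb, 2) float block motion field. Returns a filtered copy."""
    # b) global motion filtering: subtract the dominant (median) motion
    mv = mv - np.median(mv.reshape(-1, 2), axis=0)
    # c) amplitude filtering: zero out low-amplitude vectors
    amp = np.linalg.norm(mv, axis=2)
    mv[amp < amp_thresh] = 0.0
    # d) phase filtering: zero vectors opposing their 3x3 neighbourhood mean
    mean = np.zeros_like(mv)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            mean += np.roll(mv, (dy, dx), axis=(0, 1))
    mean /= 9.0
    mv[(mv * mean).sum(axis=2) < 0] = 0.0   # negative correlation = outlier
    return mv
```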
5) I-frame saliency map generation
A center-surround operator is applied separately to the image-block luminance, color, texture intensity and direction, and motion features obtained in steps 1)-4), and the saliency detection results of these features are adaptively fused into an I-frame saliency map that represents the saliency of each object in the I frame (a sketch follows).
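A center-surround operator can be sketched as local contrast against a blurred surround, with the adaptive fusion reduced to a weighted sum; both simplifications are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def center_surround(feature, surround=16):
    """Saliency of one feature map as contrast between each location and
    its local surround (box average), normalized to [0, 1]."""
    contrast = np.abs(feature - uniform_filter(feature, size=surround))
    return contrast / (contrast.max() + 1e-12)

def fuse_features(features, weights=None):
    """Adaptive fusion reduced to a weighted sum of the per-feature maps."""
    maps = [center_surround(f) for f in features]
    w = weights or [1.0 / len(maps)] * len(maps)
    return sum(wi * m for wi, m in zip(w, maps))
```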
6) P-frame saliency map generation
P frames are not analyzed for saliency independently. Instead, the temporal reference relationship between I-frame and P-frame image blocks in inter-frame prediction coding is read from the motion vectors of the P-frame image blocks, and the saliency features of the I-frame image are translated as the motion vectors indicate, yielding the P-frame saliency map at reduced computational complexity. The overall algorithm flow is shown in fig. 3.
(2) Building user interest distribution map
Unlike ordinary video, VR video covers a 360-degree field of view, and its scenes are complex, usually containing several salient objects of different features and different sizes distributed over different areas of the image. The invention therefore starts from the video saliency maps and combines them with viewpoint feedback information to establish a user interest model that reflects the user's degree of interest in the different salient objects, and on this basis generates a personalized interest distribution map for each frame of the video for each user, which better guides the subsequent accurate viewpoint prediction. The specific approach is as follows:
2.1 Divide the whole video to be viewpoint-predicted into several video segments. For each video segment, partition the I-frame image into several salient objects according to the generated I-frame saliency map, and number the pi most salient objects in the I-frame image in descending order of saliency value. At the same time, according to the temporal reference relationship between P frames and the I frame during predictive coding, mark the salient objects of each P-frame image with the labels of the salient objects they reference in the preceding I frame. If a salient object in an I frame has the same or similar saliency characteristics as a salient object in the preceding P frame, the I-frame object is preferentially marked with the label value of that object in the preceding P frame, keeping labels consistent across segments (a labeling sketch is given below).
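The labeling sketch referenced above: a hedged illustration that segments the saliency map into connected regions and numbers the pi most salient ones. The threshold and the connected-component segmentation are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def label_top_objects(saliency, pi=5, thresh=0.5):
    """Segment the I-frame saliency map into objects and keep the pi most
    salient, numbered in decreasing order of saliency (thresh is assumed)."""
    labels, n = ndimage.label(saliency > thresh * saliency.max())
    means = ndimage.mean(saliency, labels, index=range(1, n + 1))
    order = np.argsort(means)[::-1][:pi]      # most salient objects first
    marked = np.zeros_like(labels)
    for rank, obj in enumerate(order, start=1):
        marked[labels == obj + 1] = rank      # rank 1 = most salient object
    return marked
```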
2.2 For any segment p, the user viewpoint feedback information is used to count the viewpoint dwell time of each user k' who has watched the video on the pi most salient objects, which can be expressed in the following form:

$T_p^{k'} = \left\{ t_{p,o}^{k'} \mid o = 1, 2, \dots, \pi \right\}, \quad p \in \mathbb{P}$

where $\mathbb{P}$ denotes the set of segments into which the video is partitioned; $\Omega_{p,o}$ denotes the set of positions of the region in which salient object o is located in video segment p; and $t_{p,o}^{m_l}$ denotes the viewpoint dwell time of user $m_l$ on salient object o of video segment p.
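As an illustration, the dwell times $t_{p,o}^{k'}$ above could be accumulated from a per-frame gaze trace as in the sketch below. The function names, the mask representation, and the uniform frame timing are assumptions; gaze positions are assumed to lie inside the frame.

```python
import numpy as np

def dwell_times(gaze, object_masks, frame_dt):
    """Total time a user's viewpoint stays on each of the pi most salient
    objects of a segment.

    gaze         : list of (x, y) viewpoint positions, one per frame
    object_masks : (pi, H, W) boolean masks, index o-1 = o-th salient object
    frame_dt     : duration of one frame in seconds
    """
    t = np.zeros(object_masks.shape[0])
    for x, y in gaze:
        for o, mask in enumerate(object_masks):
            if mask[int(y), int(x)]:
                t[o] += frame_dt
    return t
```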
2.3 According to the viewpoint dwell times, classify the users with the K-means clustering algorithm from machine learning, so that users within the same category have higher interest similarity than users in different categories.
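A sketch of step 2.3 using scikit-learn's KMeans. The feature layout (one dwell-time entry per object and segment) and the number of classes L = 3 are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# 40 users, each described by a 10-dimensional dwell-time vector (synthetic data).
rng = np.random.default_rng(0)
dwell = rng.random((40, 10))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dwell)
centers = km.cluster_centers_   # dwell-time profiles of the cluster centers
labels = km.labels_             # class membership of each user
```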
If the users are classified into L classes and the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$, then the interest model $Int_l$ of class-l users can be described using equations (1-27).
2.4 For a user k watching the video for the first time, the user's class is predicted from the interest similarity of that user when watching other videos, and an interest distribution map of each frame is generated for user k from the class's user interest model and the video saliency. The user interest degree $I_f(x, y)$ at position (x, y) of frame f, estimated from the interest distribution map, is given by equations (1-28),
where $S_f(x, y)$ denotes the saliency at position (x, y) of frame f.
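Equations (1-28) are not reproduced here; the sketch below assumes a simple multiplicative form in which the class interest weight of the object covering a pixel scales the saliency at that pixel. Both the form and the argument names are assumptions.

```python
import numpy as np

def interest_map(saliency, object_marks, interest_weights):
    """Weight the per-pixel saliency by the class interest in the salient
    object covering that pixel (assumed multiplicative combination)."""
    out = np.zeros_like(saliency)
    for o, weight in enumerate(interest_weights, start=1):
        sel = object_marks == o          # pixels of the o-th salient object
        out[sel] = weight * saliency[sel]
    return out
```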
(3) User behavior distribution map prediction
While watching VR video, humans switch viewpoints by controlling head motion. Therefore, drawing on the modeling methods for maneuvering targets, the random motion of the user's viewpoint is described by a "current" statistical model; the specific motion prediction equation is shown in equations (1-29),
where $x_f$, $y_f$, $\dot{x}_f$, $\dot{y}_f$, $\ddot{x}_f$, $\ddot{y}_f$ respectively denote the position, velocity, and acceleration of the viewpoint in the x-axis and y-axis directions when the user watches frame f; $\bar{a}_x$ and $\bar{a}_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and α is the reciprocal of the maneuvering acceleration time constant, i.e., the maneuvering frequency.
Due to the randomness, complexity, and diversity of viewpoint motion, inaccurate predictions inevitably occur when this model is used to describe the motion state of the viewpoint. The invention therefore introduces two independent random variables $e_x$ and $e_y$ to describe the model's prediction error of the viewpoint in the x-axis and y-axis directions, assuming that they follow zero-mean distributions with variances $\sigma_x^2$ and $\sigma_y^2$, respectively, and are independent of each other. Then the probability that the viewpoint is located at (x, y) when the user views frame (f + δ) can be calculated by equations (1-30).
Considering that the parameters α, $\bar{a}_x$, $\bar{a}_y$, $\sigma_x^2$, and $\sigma_y^2$ required in the above analysis are all related only to the user's behavior characteristics, a user behavior model $Act_k$ is defined from them and constructed using the user viewpoint feedback information. After the user behavior model is obtained, the user behavior distribution map reflecting the probability of the user viewpoint's appearance can be calculated by equations (1-29) and (1-30).
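The following sketch illustrates a one-axis "current"-statistical-model predictor and the Gaussian behavior probability discussed above. The discrete transition matrices are the standard CS-model form and are an assumption here; they may differ from the exact matrices of equations (1-29).

```python
import numpy as np

def cs_predict(state, a_bar, alpha, T, delta):
    """One-axis viewpoint prediction with a 'current' statistical (CS) model.

    state : [position, velocity, acceleration] of the viewpoint at frame f
    a_bar : mean maneuvering acceleration of this user
    alpha : reciprocal of the maneuvering time constant (maneuvering frequency)
    T     : frame period in seconds; delta : frames to look ahead
    """
    e = np.exp(-alpha * T)
    # Standard discrete CS-model transition (assumed form).
    F = np.array([[1.0, T, (alpha * T - 1.0 + e) / alpha**2],
                  [0.0, 1.0, (1.0 - e) / alpha],
                  [0.0, 0.0, e]])
    G = np.array([T**2 / 2.0 - (alpha * T - 1.0 + e) / alpha**2,
                  T - (1.0 - e) / alpha,
                  1.0 - e])
    x = np.asarray(state, dtype=float)
    for _ in range(delta):
        x = F @ x + G * a_bar
    return x  # predicted [position, velocity, acceleration] at frame f + delta

def behaviour_probability(x, y, pred_x, pred_y, sigma_x, sigma_y):
    """Probability of the viewpoint at (x, y) under the stated zero-mean,
    independent error assumption (Gaussian form assumed)."""
    return (np.exp(-((x - pred_x)**2 / (2 * sigma_x**2)
                     + (y - pred_y)**2 / (2 * sigma_y**2)))
            / (2 * np.pi * sigma_x * sigma_y))
```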
(4) Viewpoint prediction
In practical applications, it is found that, guided by the selective visual attention mechanism and the inertia of user behavior, users are more inclined to focus on objects that both conform to the maneuvering reality and are of interest. Thus, the viewpoint position of the end user when viewing frame (f + δ) can be predicted from equations (1-32),
where $I_{f+\delta}(x, y)$ and $B_{f+\delta}(x, y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position (x, y) of frame (f + δ), and the function Φ is the fusion function of the user interest distribution map and the user behavior distribution map.
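A sketch of the final prediction step. The multiplicative fusion Φ(I, B) = I · B is an illustrative assumption; equations (1-32) themselves are not reproduced in this text.

```python
import numpy as np

def fuse_and_predict(interest_map, behaviour_map):
    """Pick the position maximising the fused interest-behavior score."""
    score = interest_map * behaviour_map   # assumed Phi(I, B) = I * B
    y, x = np.unravel_index(np.argmax(score), score.shape)
    return x, y                            # predicted viewpoint at frame f + delta
```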
The "high bit rate and low delay" characteristics of VR video provide great challenges for network transmission. Especially in a mobile network, limited bandwidth resources and time-varying network transmission capability will seriously hinder the improvement of VR video user viewing experience. The VR video covers 360-degree visual field angle, the horizontal visual field range of human eyes does not exceed 180 degrees generally, and the visual field angle which can be supported by VR terminal equipment (such as VR helmet) is only about 90-110 degrees. Therefore, in recent years, VR video adaptive transmission schemes based on video blocking are becoming hot spots and common consensus in academia and industry. The invention divides the VR video into a plurality of video blocks according to the space and dynamically selects the video blocks within the visual angle range according to the viewpoint of the user for transmission, thereby reducing the requirement of the VR video on the network bandwidth while ensuring good visual experience. In order to avoid the problems of picture delay, picture blocking or quality reduction and the like caused by transmission delay when the view points of the users are switched, a view point prediction technology is adopted to predict a new view point of the user at the next moment, and the video blocks in a new view angle range are pre-downloaded and pre-cached. Therefore, the accurate prediction of the user view point has an important significance for improving the user viewing experience.
Claims (10)
1. A low-complexity viewpoint prediction method fusing user interest and behavior characteristics is characterized by comprising the following specific steps:
s1, acquiring the video frame saliency maps of the video to be viewpoint-predicted, wherein the video frame saliency maps comprise an I-frame saliency map and a P-frame saliency map;
s2, dividing the video to be viewpoint-predicted into several video segments, and marking the serial numbers of the pi most salient objects in the video segments by using the video frame saliency maps;
s3, obtaining the viewpoint dwell time of users who have watched the video on the pi most salient objects, classifying the users according to the viewpoint dwell time, obtaining the interest model of each user class according to the viewpoint dwell time of its users on the pi most salient objects of the video, and combining the user interest model with the obtained video frame saliency maps to obtain the interest distribution map of each frame of the video;
s4, constructing a user behavior model by using the random motion of the user viewpoint and the viewpoint feedback information of the video watched by the user historically, and acquiring a user behavior distribution map reflecting the occurrence probability of the user viewpoint according to the user behavior model;
s5, combining the interest distribution map of the users of the same category with the user behavior distribution map to obtain a viewpoint prediction model, and predicting the viewpoint positions of the users by using the viewpoint prediction model.
2. The method for predicting low-complexity viewpoints by fusing user interests and behavior features as claimed in claim 1, wherein in step S1, the specific steps for generating the I-frame saliency map are as follows:
s1.1, obtaining the intra-frame prediction coding modes and the residual DCT (discrete cosine transform) coefficients, and deriving from them the DC coefficients of the image blocks without prediction reconstruction and direct DCT transformation, wherein the DC coefficients are used for representing the brightness and color characteristics of the image blocks;
s1.2, acquiring the prediction direction corresponding to the intra-frame prediction coding mode, taking the prediction direction as the texture direction of the intra-frame prediction coded image block, and acquiring the texture intensity of the adjacent block whose texture direction is most similar to that of the intra-frame prediction coded image block, wherein this texture intensity is taken as the texture intensity of the intra-frame prediction coded image block;
s1.3, recovering the original pixel values of I_PCM coded image blocks from the compressed domain and calculating the DCT coefficients of the I_PCM coded image blocks from these pixel values, wherein the DC coefficients among the DCT coefficients are used for expressing the brightness and color of the I_PCM coded image blocks, and the AC coefficients among the DCT coefficients are used for expressing the texture direction and intensity characteristics of the I_PCM coded image blocks;
s1.4, constructing the motion vector set of the I-frame image from the coding modes and motion vectors of the inter-frame predictive coded image blocks in the P frames immediately before and after the I frame, using the temporal continuity of the content of the video to be viewpoint-predicted;
s1.5, respectively carrying out significance detection on the brightness, the color, the texture intensity, the texture direction and the motion characteristic of the acquired I frame image, and adaptively fusing the significance detection results into an I frame significance map;
the specific steps for generating the P frame saliency map are as follows:
s1.6, obtaining the motion vectors of the inter-frame prediction coded image blocks in the compressed domain, organizing and filling the motion vectors, and establishing a complete motion vector set for each P frame;
s1.7, translating the significance characteristics of the image blocks in the I frame according to the indication of the motion vector by utilizing the time domain reference relationship between the P frame image blocks and the I frame image blocks in the inter-frame prediction coding process to obtain a P frame significance map.
3. The method according to claim 2, wherein in step S1.1, for an N × N image block i coded by intra-frame prediction in the video to be viewpoint-predicted, the DCT transform coefficients $C_i^{DCT}$ of image block i can be calculated from equation (1-1):

$C_i^{DCT} = P_i^{DCT} + R_i^{DCT} \quad (1\text{-}1)$

where $P_i^{DCT}$ denotes the DCT coefficients of the intra-frame prediction block corresponding to image block i, and $R_i^{DCT}$ denotes the DCT coefficients of the intra-frame prediction residual block corresponding to image block i;

the DCT coefficients $R_i^{DCT}$ of the intra-frame prediction residual block are directly extracted from the DCT coefficients in the compressed domain of the video to be viewpoint-predicted;

the residual is given by equation (1-2), $r_i(x, y) = p_i(x, y) - \hat{p}_i(x, y)$, where $p_i(x, y)$ denotes the pixel value of image block i at (x, y) and $\hat{p}_i(x, y)$ denotes its intra-frame prediction value;

if $\{s_{i,q}\}$, q = 0, 1, …, Q-1, is defined as the set of coded and reconstructed neighboring pixels used for predicting the image block, the intra-frame prediction value of each pixel of image block i is calculated from equation (1-3):

$\hat{p}_i(x, y) = \sum_{q=0}^{Q-1} w_{i,q}(x, y)\, s_{i,q} \quad (1\text{-}3)$

where $s_{i,q}$ is the value of pixel $s_{i,q}$ and $w_{i,q}(x, y)$ is the prediction weight corresponding to pixel $s_{i,q}$;

considering the coded and reconstructed neighboring pixels $s_{i,q}$, q = 0, 1, …, Q-1, used for predicting the image block, and assuming that the pixel values within the same 4 × 4 block are equal and equal to the average value of all pixels in that 4 × 4 block, the DC coefficient $P_i^{DC}$ among the DCT coefficients of the prediction block can be obtained by equation (1-4), i.e.

$P_i^{DC} = \sum_j w_j\, C_j^{DC} \quad (1\text{-}4)$

where $w_j$ is the weight, whose specific value is determined by the adopted prediction mode, and $C_j^{DC}$ is the DC coefficient of the neighboring 4 × 4 block j;

substituting equation (1-4) into equation (1-1), the DC coefficient $C_i^{DC}$ representing the brightness and color characteristics of image block i can be calculated by equation (1-5):

$C_i^{DC} = \sum_j w_j\, C_j^{DC} + R_i^{DC} \quad (1\text{-}5)$
4. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 3, wherein in step S1.1, the predicted pixels of a 4 × 4 partition are selected from the pixels $s_{i,0} \sim s_{i,12}$ of the four neighboring blocks located to the upper left, above, upper right, and left of the predicted pixels. Let $C_{UL}^{DC}$, $C_U^{DC}$, $C_{UR}^{DC}$, and $C_L^{DC}$ respectively denote the DC coefficients of these four neighboring blocks; then $P_i^{DC}$ is obtained as follows:

1) When the prediction mode of the 4 × 4 block is 0 (vertical), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,4}$ of the neighboring block above it, i.e. $\hat{p}_i(x, y) = s_{i,x+1}$; therefore, $P_i^{DC} = C_U^{DC}$.

2) When the prediction mode of the 4 × 4 block is 1 (horizontal), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,9} \sim s_{i,12}$ of the neighboring block to its left, i.e. $\hat{p}_i(x, y) = s_{i,9+y}$; therefore, $P_i^{DC} = C_L^{DC}$.

3) When the prediction mode of the 4 × 4 block is 2 (DC), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,4}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks above and to the left of it, i.e.

$\hat{p}_i(x, y) = \mathrm{round}\Big(\frac{1}{8}\sum_{q=1}^{4}\big(s_{i,q} + s_{i,8+q}\big)\Big)$

where round(α) denotes rounding the value α; therefore, $P_i^{DC} = \frac{1}{2}\big(C_U^{DC} + C_L^{DC}\big)$.

4) When the prediction mode of the 4 × 4 block is 3 (diagonal down-left), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,8}$ of the neighboring blocks above and to the upper right of it; therefore, $P_i^{DC}$ is a weighted combination of $C_U^{DC}$ and $C_{UR}^{DC}$, with the weights determined by the mode-3 prediction filter.

5) When the prediction mode of the 4 × 4 block is 4 (diagonal down-right), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,4}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

6) When the prediction mode of the 4 × 4 block is 5 (vertical-right), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,4}, s_{i,9} \sim s_{i,10}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

7) When the prediction mode of the 4 × 4 block is 6 (horizontal-down), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,3}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

8) When the prediction mode of the 4 × 4 block is 7 (vertical-left), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,7}$ of the neighboring blocks above and to the upper right of it; therefore, $P_i^{DC}$ is a weighted combination of $C_U^{DC}$ and $C_{UR}^{DC}$.

9) When the prediction mode of the 4 × 4 block is 8 (horizontal-up), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,9} \sim s_{i,12}$ of the neighboring block to its left; therefore, $P_i^{DC} = C_L^{DC}$.
5. The method according to claim 3, wherein in step S1.1, when intra-frame prediction is performed on a 16 × 16 partition basis, the DC coefficient $P_{m,i}^{DC}$ of the prediction block of each 4 × 4 block i in a 16 × 16 partition m is obtained as follows:

1) When the prediction mode of 16 × 16 partition m is 0 (vertical), each 4 × 4 block is predicted from the pixels $s_{m,1} \sim s_{m,16}$ of the adjacent partition above partition m, i.e. $\hat{p}_{m,i}(x, y) = s_{m,\,4\,\mathrm{mod}(i,4)+x+1}$, where mod(·,·) denotes the modulo operation and mod(i, 4) returns the remainder of i divided by 4. Therefore, if $C_{U,1}^{DC} \sim C_{U,4}^{DC}$ denote, from left to right, the DC coefficients of the 4 × 4 neighboring blocks above, then $P_{m,i}^{DC} = C_{U,\,\mathrm{mod}(i,4)+1}^{DC}$.

2) When the prediction mode of 16 × 16 partition m is 1 (horizontal), each 4 × 4 block i is predicted from the pixels $s_{m,17} \sim s_{m,32}$ of the adjacent partition to the left of partition m, i.e. $\hat{p}_{m,i}(x, y) = s_{m,\,16+4\lfloor i/4 \rfloor+y+1}$. Therefore, if $C_{L,1}^{DC} \sim C_{L,4}^{DC}$ denote, from top to bottom, the DC coefficients of the 4 × 4 neighboring blocks to the left, then $P_{m,i}^{DC} = C_{L,\,\lfloor i/4 \rfloor+1}^{DC}$.

3) When the prediction mode of 16 × 16 partition m is 2 (DC), each 4 × 4 block i is predicted from the pixels $s_{m,1} \sim s_{m,32}$ of the adjacent partitions above and to the left of partition m, i.e. $\hat{p}_{m,i}(x, y) = \mathrm{round}\big(\frac{1}{32}\sum_{q=1}^{32} s_{m,q}\big)$; therefore, $P_{m,i}^{DC} = \frac{1}{8}\sum_{j=1}^{4}\big(C_{U,j}^{DC} + C_{L,j}^{DC}\big)$.

4) When the prediction mode of 16 × 16 partition m is 3 (plane), each 4 × 4 block i is predicted from the pixels $s_{m,1} \sim s_{m,32}$ of the adjacent partitions above and to the left of partition m according to the plane prediction formula, where i = 0, 1, …, 15; x, y = 0, 1, 2, 3; and Clip1(x) = min(255, max(0, x)). Therefore, $P_{m,i}^{DC}$ is the corresponding weighted combination of $C_{U,1}^{DC} \sim C_{U,4}^{DC}$ and $C_{L,1}^{DC} \sim C_{L,4}^{DC}$, with the weights determined by the plane prediction formula.
6. The method as claimed in claim 2, wherein in step S1.2, the texture intensity of the intra-frame prediction coded image block i is obtained by scaling the texture intensity of its reference neighbor according to the partition sizes:

$T_i = \frac{N_i \times N_i}{N_j \times N_j}\, T_j$

where $N_i \times N_i$ denotes the partition size of image block i, $N_j \times N_j$ denotes the partition size of the neighboring block j that is closest to the texture direction of image block i and has a higher prediction weight, and $T_j$ denotes the texture intensity of that neighboring block j.
7. The method as claimed in claim 2, wherein in step S1.3, the texture direction $\theta_{i'}$ and texture intensity $T_{i'}$ of an I_PCM coded image block i' are expressed by the AC coefficients among its DCT coefficients, as shown in equations (1-24) and (1-25).
8. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 1, wherein in step S3, the interest model $Int_l$ of class-l users is given by equations (1-27),

where l is the user class; the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$; and $t_{p,o}^{m_l}$ is the viewpoint dwell time of user $m_l$, who has watched the video to be detected, on the o-th of the pi most salient objects in video segment p;

in the formula, $\mathbb{P}$ denotes the set of segments into which the video is partitioned, and $\Omega_{p,o}$ denotes the set of positions of the region in which salient object o is located in video segment p;

according to the interest distribution map, the user interest degree $I_f(x, y)$ at position (x, y) of frame f is obtained by equations (1-28), where $S_f(x, y)$ denotes the saliency at position (x, y) of frame f.
9. The method for predicting low-complexity viewpoints as claimed in claim 1, wherein in step S4, the "current" statistical model is used to describe the random motion of the user viewpoint, and the specific motion prediction equation is shown in equations (1-29),

where $x_f$, $y_f$, $\dot{x}_f$, $\dot{y}_f$, $\ddot{x}_f$, $\ddot{y}_f$ respectively denote the position, velocity, and acceleration of the user viewpoint in the x-axis and y-axis directions when the user watches frame f; $\bar{a}_x$ and $\bar{a}_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and α is the reciprocal of the maneuvering acceleration time constant, i.e., the maneuvering frequency;

the probability that the viewpoint is located at (x, y) when the user views frame (f + δ) can be calculated by equations (1-30);

the user behavior model $Act_k$ is defined from these behavior parameters, and the user behavior distribution map reflecting the probability of occurrence of the user viewpoint is calculated according to equations (1-29) and (1-30).
10. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 1, wherein in step S5, the user viewpoint position is predicted according to equations (1-32),

where $I_{f+\delta}(x, y)$ and $B_{f+\delta}(x, y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position (x, y) of frame (f + δ), and the function Φ is the fusion function of the user interest distribution map and the user behavior distribution map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111510706.9A CN114173206B (en) | 2021-12-10 | 2021-12-10 | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114173206A true CN114173206A (en) | 2022-03-11 |
CN114173206B CN114173206B (en) | 2023-06-06 |
Family
ID=80485557
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111510706.9A Active CN114173206B (en) | 2021-12-10 | 2021-12-10 | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114173206B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018055509A (en) * | 2016-09-29 | 2018-04-05 | ファイフィット株式会社 | Method of pre-treating composite finite element, method of analyzing composite material, analysis service system and computer readable recording medium |
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism |
JP2020150519A (en) * | 2019-03-15 | 2020-09-17 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Attention degree calculating device, attention degree calculating method and attention degree calculating program |
CN111325124A (en) * | 2020-02-05 | 2020-06-23 | 上海交通大学 | Real-time man-machine interaction system under virtual scene |
Non-Patent Citations (2)
Title |
---|
MAI XU et al.: "Predicting head movement in panoramic video: a deep reinforcement learning approach", IEEE Transactions on Pattern Analysis and Machine Intelligence *
ZHANG Jiwen: "Research on prediction methods of microblog information propagation based on user interest features" (基于用户兴趣特征的微波信息传播预测方法研究), CNKI (知网) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115103023A (en) * | 2022-06-14 | 2022-09-23 | 北京字节跳动网络技术有限公司 | Video caching method, device, equipment and storage medium |
CN115103023B (en) * | 2022-06-14 | 2024-04-05 | 北京字节跳动网络技术有限公司 | Video caching method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114173206B (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110087087B (en) | VVC inter-frame coding unit prediction mode early decision and block division early termination method | |
CN109309834B (en) | Video compression method based on convolutional neural network and HEVC compression domain significant information | |
CN104378643B (en) | A kind of 3D video depths image method for choosing frame inner forecast mode and system | |
CN108989802B (en) | HEVC video stream quality estimation method and system by utilizing inter-frame relation | |
CN103618900B (en) | Video area-of-interest exacting method based on coding information | |
CN111355956B (en) | Deep learning-based rate distortion optimization rapid decision system and method in HEVC intra-frame coding | |
EP3343923B1 (en) | Motion vector field coding method and decoding method, and coding and decoding apparatuses | |
CN103826125B (en) | Concentration analysis method and device for compression monitor video | |
CN110852964A (en) | Image bit enhancement method based on deep learning | |
CN105933711B (en) | Neighborhood optimum probability video steganalysis method and system based on segmentation | |
CN111479110B (en) | Fast affine motion estimation method for H.266/VVC | |
WO2016155070A1 (en) | Method for acquiring adjacent disparity vectors in multi-texture multi-depth video | |
CN114745549B (en) | Video coding method and system based on region of interest | |
CN112001308A (en) | Lightweight behavior identification method adopting video compression technology and skeleton features | |
Liu et al. | Fast depth intra coding based on depth edge classification network in 3D-HEVC | |
Fu et al. | Efficient depth intra frame coding in 3D-HEVC by corner points | |
CN114173206B (en) | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics | |
CN106878754B (en) | A kind of 3D video depth image method for choosing frame inner forecast mode | |
CN117176960A (en) | Convolutional neural network chroma prediction coding method with multi-scale position information embedded | |
US20050259878A1 (en) | Motion estimation algorithm | |
Zuo et al. | Bi-layer texture discriminant fast depth intra coding for 3D-HEVC | |
Bachu et al. | Adaptive order search and tangent-weighted trade-off for motion estimation in H. 264 | |
CN107509074B (en) | Self-adaptive 3D video compression coding and decoding method based on compressed sensing | |
Bocheck et al. | Real-time estimation of subjective utility functions for MPEG-4 video objects | |
CN109982079B (en) | Intra-frame prediction mode selection method combined with texture space correlation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |