CN114173206A - Low-complexity viewpoint prediction method fusing user interest and behavior characteristics - Google Patents
- Publication number: CN114173206A (application CN202111510706.9A)
- Authority: CN (China)
- Prior art keywords: prediction, user, frame, viewpoint, block
- Legal status: Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/593—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/442—Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
- H04N21/44204—Monitoring of content usage, e.g. the number of times a movie has been viewed, copied or the amount which has been watched
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4667—Processing of monitored end-user data, e.g. trend analysis based on the log file of viewer selections
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g 3D video
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a low-complexity viewpoint prediction method that fuses user interest and behavior characteristics. The method divides the video for which viewpoint prediction is to be performed into several video segments and marks the sequence numbers of the salient objects in each segment using the video-frame saliency maps. It then obtains the viewpoint dwell time of users who have already watched the video on each salient object, classifies users by dwell time, builds an interest model for each user class from the dwell times of users in that class, and combines the class interest model with the video-frame saliency maps to obtain an interest distribution map. A user behavior model is constructed from the random motion of the user viewpoint and the viewpoint feedback of videos the user has watched historically. On the basis of the video saliency analysis, user interest and user behavior characteristics are jointly considered to establish a low-complexity viewpoint prediction model that can accurately predict the future viewpoint position of the user over the long term, and this model is used to predict the user viewpoint position.
Description
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a low-complexity viewpoint prediction method fusing user interest and behavior characteristics.
Background
Virtual Reality (VR) video has attracted wide attention thanks to its immersive viewing experience and low-cost, convenient viewing mode, and is currently one of the fastest-growing online VR applications. According to a VR/AR industry report released in 2016, VR video services were expected to have 52 million users in 2020, accounting for 40% of all expected users of VR applications, and the VR video user base was expected to reach 174 million by 2025. However, the "high bit rate, low latency" characteristics of VR video pose a great challenge to network transmission. Especially in mobile networks, limited bandwidth resources and time-varying network transmission capability seriously hinder improvement of the VR video viewing experience.
A VR video covers a 360-degree field of view, while the horizontal field of view of the human eye generally does not exceed 180 degrees, and the field of view supported by VR terminal devices (such as VR headsets) is only about 90-110 degrees. Therefore, in recent years, tile-based VR video adaptive transmission schemes have become a hot topic and a common consensus in academia and industry. Such a scheme divides the VR video spatially into multiple video tiles and dynamically selects the tiles within the current viewing angle for transmission according to the user's viewpoint, which reduces the network bandwidth required by VR video while preserving a good visual experience. To avoid picture delay, stalling, or quality degradation caused by transmission latency when the user's viewpoint switches, viewpoint prediction is used to predict the user's new viewpoint at the next moment, and the tiles within the new viewing angle are pre-downloaded and pre-cached. Accurate prediction of the user viewpoint is therefore of great significance for improving the viewing experience.
The most mature and widely applied viewpoint prediction methods at the current stage of research can be roughly divided into two types: prediction methods based on motion estimation and prediction methods based on content analysis.
Prediction methods based on motion estimation predict the future viewing position mainly from the user's historical browsing behavior. They ignore the guiding effect of video content on the user's viewpoint, and the user characteristics they exploit are limited to the recent motion of the user's head, which makes long-term viewpoint prediction difficult.
Prediction methods based on content analysis improve the accuracy of long-term viewpoint prediction to a certain extent through video saliency features or correlation analysis of browsed content, but they do not explore in depth the influence of user characteristics (such as interests, habits, and behaviors) on viewpoint prediction, and they have difficulty accurately reflecting the internal rules governing how different users' viewpoints change. At the same time, these methods have extremely high implementation complexity, consuming time, labor, and money, and are difficult to use for VR video real-time communication. They can be further divided into two approaches: one determines the region where the user's future viewpoint will lie by exploiting the strong correlation between the contents the user browses; the other performs prediction based on the saliency features of the video. Saliency features reflect how interested users are in the video content of each region. Generally, the stronger the saliency, the more interested the user, and the higher the viewing probability. Salient-region extraction for static images is relatively mature, so many video saliency detection methods start from existing image saliency detection models and extend them to video by introducing motion features.
Since most videos transmitted and stored on the Internet are compression-coded, many scholars have recently begun to explore video saliency detection algorithms that work in the compressed domain, avoiding the complex computation caused by full decoding. Xu et al. calculate a motion saliency map using the sum of the absolute values of the motion vectors and adaptively fuse it with a static saliency map to obtain the final saliency map. Muthuswamy et al. argue that motion plays a decisive role in video saliency detection, and therefore modify the still-image saliency map with an accumulated temporal motion map and combine it with a spatiotemporal similarity representing camera motion to achieve the final detection. To make better use of motion vectors in computing the motion saliency map, Fang et al. calculate the static saliency map of I frames using fixed Gaussian-weighted DCT coefficients and the motion saliency map of P frames using motion-vector weighting, and introduce a fusion rule of normalization, summation, and parameter multiplication to fuse the two maps; applying Gaussian weights to the motion vectors further improves the performance of the algorithm. However, although compressed-domain video saliency analysis greatly reduces computational complexity, its accuracy is difficult to guarantee because the intra-frame prediction coding modes are not considered.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a low-complexity viewpoint prediction method that fuses user interest and behavior characteristics. On the basis of analyzing the video saliency features, it comprehensively considers personalized characteristics such as user interest and user behavior to establish a low-complexity viewpoint prediction model capable of accurately predicting the future viewpoint position of the user over the long term.
To achieve this purpose, the invention provides the following technical scheme. A low-complexity viewpoint prediction method fusing user interest and behavior characteristics comprises the following specific steps:
S1, acquiring the video-frame saliency maps of the video for which viewpoint prediction is to be performed, including the I-frame saliency maps and the P-frame saliency maps;
S2, dividing the video into several video segments and marking, with the video-frame saliency maps, the sequence numbers of the pi most salient objects in each segment;
S3, acquiring the viewpoint dwell time of users who have watched the video on the pi most salient objects, classifying the users according to the dwell time, obtaining an interest model for each user class from the dwell times of users of that class on the pi most salient objects of the video, and combining the class interest model with the video-frame saliency maps to obtain an interest distribution map for each frame of the video;
S4, constructing a user behavior model from the random motion of the user viewpoint and the viewpoint feedback of videos the user has watched historically, and deriving from it a user behavior distribution map reflecting the probability of the user viewpoint appearing at each position;
S5, combining the interest distribution map of the user's class with the user behavior distribution map to obtain a viewpoint prediction model, and predicting the user's viewpoint position with this model (see the sketch after this list).
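For concreteness, the following minimal Python sketch shows how the maps produced by steps S1-S5 could be combined; the array shapes, the multiplicative fusion, and all function names are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def predict_viewpoints(saliency, interest_weights, behavior):
    """saliency: (F, H, W) per-frame saliency maps (S1);
    interest_weights: (H, W) per-pixel interest weights derived from the
    class interest model and the salient-object labels (S3);
    behavior: (H, W) viewpoint-occurrence probability map (S4)."""
    interest_maps = saliency * interest_weights          # (F, H, W), S3
    fused = interest_maps * behavior[None, :, :]         # S5 fusion
    flat = fused.reshape(fused.shape[0], -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, fused.shape[1:])
    # predicted viewpoint per frame = location of maximum fused probability
    return list(zip(xs.tolist(), ys.tolist()))

# toy usage with random maps
rng = np.random.default_rng(0)
print(predict_viewpoints(rng.random((3, 64, 128)),
                         rng.random((64, 128)),
                         rng.random((64, 128))))
```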
Further, in step S1, the specific step of generating the I-frame saliency map is as follows:
S1.1, from the intra-frame prediction coding modes and the residual DCT (discrete cosine transform) coefficients, estimating the DC coefficient the image block would have if it were DCT-transformed directly, without prediction; this DC coefficient represents the luminance and color characteristics of the image block;
S1.2, acquiring the prediction direction corresponding to the intra-frame prediction coding mode and taking it as the texture direction of the intra-prediction-coded image block, and taking the texture intensity of a neighboring block whose texture direction is closest to that of the block as the texture intensity of the intra-prediction-coded image block;
S1.3, recovering the original pixel values of I_PCM-coded image blocks from the compressed domain and computing their DCT coefficients from these pixel values; the DC coefficient expresses the luminance and color of the I_PCM-coded block, and the AC coefficients express its texture direction and intensity characteristics;
S1.4, constructing the motion vector set of the I-frame image from the coding modes and motion vectors of the inter-prediction-coded image blocks in the P frames before and after the I frame, exploiting the temporal continuity of the video content;
S1.5, performing saliency detection separately on the luminance, color, texture intensity, texture direction, and motion features of the I-frame image, and adaptively fusing the detection results into the I-frame saliency map;
the specific steps for generating the P-frame saliency map are as follows:
S1.6, acquiring the motion vectors of the inter-prediction-coded image blocks in the compressed domain, sorting and filling them, and establishing a complete motion vector set for each P frame;
S1.7, using the temporal reference relationship between P-frame image blocks and I-frame image blocks in inter-frame prediction coding, translating the saliency features of the I-frame image blocks as indicated by the motion vectors to obtain the P-frame saliency map (a sketch of this translation follows).
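The translation in step S1.7 can be pictured with the following sketch, which copies the saliency of each referenced I-frame block to the corresponding P-frame block; the block size and the motion-field layout are assumptions for illustration.

```python
import numpy as np

def propagate_saliency(i_frame_sal, mv_field, block=4):
    """i_frame_sal: (H, W) saliency map of the reference I frame;
    mv_field: (H//block, W//block, 2) motion vectors (dx, dy) in pixels
    pointing from each P-frame block to its reference block."""
    H, W = i_frame_sal.shape
    p_sal = np.zeros_like(i_frame_sal)
    for by in range(H // block):
        for bx in range(W // block):
            dx, dy = mv_field[by, bx]
            # clamp the referenced block position to the frame borders
            rx = int(np.clip(bx * block + dx, 0, W - block))
            ry = int(np.clip(by * block + dy, 0, H - block))
            p_sal[by*block:(by+1)*block, bx*block:(bx+1)*block] = \
                i_frame_sal[ry:ry+block, rx:rx+block]
    return p_sal
```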
Further, in step S1.1, for an N×N image block $i$ of the video for which viewpoint prediction is to be performed that uses intra-frame prediction coding, the DCT transform coefficients $C_i$ of image block $i$ can be calculated from equation (1-1):
$C_i = C_i^{\mathrm{pred}} + C_i^{\mathrm{res}} \qquad (1\text{-}1)$
where $C_i^{\mathrm{pred}}$ denotes the DCT coefficients of the intra prediction block corresponding to image block $i$, and $C_i^{\mathrm{res}}$ denotes the DCT coefficients of the intra prediction residual block corresponding to image block $i$;
the DCT coefficients $C_i^{\mathrm{res}}$ of the intra prediction residual block are extracted directly from the compressed domain of the video;
by the definition of the DCT, $C_i^{\mathrm{pred}}$ takes the form of equation (1-2):
$C_i^{\mathrm{pred}}(u,v) = \frac{2}{N}\,c(u)\,c(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\hat p_i(x,y)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N} \qquad (1\text{-}2)$
where $\hat p_i(x,y)$ denotes the intra prediction value of the pixel of image block $i$ at $(x,y)$;
if $\{s_{i,q}\}$, $q=0,1,\dots,Q-1$, is defined as the set of coded and reconstructed neighboring pixels used to predict image block $i$, the intra prediction value of each pixel of image block $i$ is calculated from equation (1-3):
$\hat p_i(x,y) = \sum_{q=0}^{Q-1} w_{i,q}(x,y)\,p(s_{i,q}) \qquad (1\text{-}3)$
where $p(s_{i,q})$ is the pixel value of pixel $s_{i,q}$ and $w_{i,q}(x,y)$ is the prediction weight corresponding to pixel $s_{i,q}$;
define $DC^{(j)}$, $j=0,1,2,\dots,J$, as the DC coefficients of the neighboring 4×4 blocks that contain the coded reconstructed pixels $s_{i,q}$, $q=0,1,\dots,Q-1$, used to predict the image block; assuming that the pixels of a 4×4 block are equal to one another and equal to the average of all pixels of the whole 4×4 block, the DC coefficient $DC_i^{\mathrm{pred}}$ of the prediction block can be calculated from equation (1-4), i.e.
$DC_i^{\mathrm{pred}} = \sum_j w_j\,DC^{(j)} \qquad (1\text{-}4)$
where the weights $w_j$ are determined by the prediction mode adopted;
substituting equation (1-4) into equation (1-1), the DC coefficient representing the luminance and color characteristics of image block $i$ can be calculated by equation (1-5):
$DC_i = DC_i^{\mathrm{pred}} + DC_i^{\mathrm{res}} \qquad (1\text{-}5)$
Further, in step S1.1, the prediction pixels of a 4×4 partition are selected from the pixels $s_{i,0}\sim s_{i,12}$ of the four neighboring blocks located to its upper left, above, upper right, and left. Let $DC^{(j)}$, $j=0,1,2,3$, denote the DC coefficients of these four neighboring blocks. Then $DC_i^{\mathrm{pred}}$ is obtained as follows (a code sketch of the first three cases follows this list):
1) when the prediction mode of the 4×4 block is 0 (vertical), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}$ of the neighboring block above it; therefore, by equation (1-4), $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block above;
2) when the prediction mode of the 4×4 block is 1 (horizontal), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left; therefore $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block on the left;
3) when the prediction mode of the 4×4 block is 2 (DC), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks above and to the left, every pixel being the rounded mean of those eight pixels, where round(α) denotes rounding the value α; therefore $DC_i^{\mathrm{pred}}$ is the average of the DC coefficients of the blocks above and on the left;
4) when the prediction mode of the 4×4 block is 3 (diagonal down-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,8}$ of the neighboring blocks above and to its upper right;
5) when the prediction mode of the 4×4 block is 4 (diagonal down-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
6) when the prediction mode of the 4×4 block is 5 (vertical-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,10}$ of the neighboring blocks to its upper left, above, and left;
7) when the prediction mode of the 4×4 block is 6 (horizontal-down), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,3}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
8) when the prediction mode of the 4×4 block is 7 (vertical-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,7}$ of the neighboring blocks above and to its upper right;
9) when the prediction mode of the 4×4 block is 8 (horizontal-up), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left;
in modes 3-8, $DC_i^{\mathrm{pred}}$ likewise follows from equation (1-4), with the weights $w_j$ induced by the listed prediction pixels.
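The following sketch restates the first three cases in code; it assumes the neighboring blocks' DC coefficients are already available, and modes 3-8 are omitted because their weights are mode-specific.

```python
def dc_pred_4x4(mode, dc_above, dc_left):
    """DC coefficient of the intra prediction block of a 4x4 partition,
    under the assumption of equation (1-4) that all pixels of a 4x4
    neighbouring block equal that block's average."""
    if mode == 0:      # vertical: predicted entirely from the block above
        return dc_above
    if mode == 1:      # horizontal: predicted entirely from the left block
        return dc_left
    if mode == 2:      # DC: mean of the above and left neighbours
        return 0.5 * (dc_above + dc_left)
    raise NotImplementedError("modes 3-8: mode-specific weighted sums")
```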
Further, in step S1.1, when intra prediction is performed on a 16×16 partition $m$, the DC coefficient $DC_i^{\mathrm{pred}}$ of each 4×4 block $i$ ($i=0,1,\dots,15$) in the 16×16 partition is determined as follows:
1) when the prediction mode of 16×16 partition $m$ is 0 (vertical), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,16}$ of the neighboring partition above $m$; here mod(·,·) denotes the modulo operation, i.e., mod(i,4) returns the remainder of $i$ divided by 4; hence, if the DC coefficients of the 4×4 neighboring blocks above are denoted, from left to right, $DC^{(p)}$, $p=1,2,3,4$, then $DC_i^{\mathrm{pred}} = DC^{(\mathrm{mod}(i,4)+1)}$;
2) when the prediction mode of 16×16 partition $m$ is 1 (horizontal), each 4×4 block $i$ is predicted from the pixels $s_{m,17}\sim s_{m,32}$ of the neighboring partition to the left of $m$; hence, if the DC coefficients of the 4×4 neighboring blocks on the left are denoted, from top to bottom, $DC^{(p)}$, $p=5,6,7,8$, then $DC_i^{\mathrm{pred}} = DC^{(\lfloor i/4\rfloor+5)}$;
3) when the prediction mode of 16×16 partition $m$ is 2 (DC), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$; therefore $DC_i^{\mathrm{pred}} = \tfrac{1}{8}\sum_{p=1}^{8} DC^{(p)}$;
4) when the prediction mode of 16×16 partition $m$ is 3 (plane), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$ using the plane-mode expression, in which $i=0,1,\dots,15$; $x,y=0,1,2,3$; and Clip1(x)=min(255, max(0,x)); $DC_i^{\mathrm{pred}}$ then follows from equation (1-4) as the average of the plane-predicted pixels of block $i$.
Further, in step S1.2, the texture intensity $T_i$ of the intra-prediction-coded image block $i$ is obtained from equation (1-23) by scaling $T_j$ with the ratio of the partition sizes,
where $N_i\times N_i$ denotes the partition size of image block $i$, $N_j\times N_j$ denotes the partition size of the neighboring block $j$ whose texture direction is closest to that of image block $i$ and whose prediction weight is highest, and $T_j$ denotes the texture intensity of that neighboring block.
Further, in step S1.3, the texture direction $\theta_{i'}$ and texture intensity $T_{i'}$ of an I_PCM-coded image block $i'$ are expressed through the AC coefficients of its DCT coefficients, as shown in equations (1-24) and (1-25),
where $N_{i'}\times N_{i'}$ denotes the partition size of the I_PCM-coded image block $i'$, and the DCT coefficients $C_{i'}(u,v)$, $u,v=0,1,2,3$, are obtained by applying a 4×4 DCT to the original pixel values of block $i'$ recovered from the compressed domain.
Further, in step S3, the interest model $Int_l$ of user class $l$ is given by equation (1-27),
where $l$ is the user class, the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$, and $t_{m_l}^{p,o}$ is the viewpoint dwell time of user $m_l$, who has watched the video, on the pi most salient objects of video segment $p$;
in this formula, $\mathcal{P}$ denotes the set of segments into which the video is divided, $\mathcal{R}_{p,o}$ denotes the set of positions of the region occupied by salient object $o$ in video segment $p$, and $t_{m_l}^{p,o}$ denotes the viewpoint dwell time of user $m_l$ on salient object $o$ of segment $p$;
according to the interest distribution map, the user interest degree $\widetilde S_f(x,y)$ at position $(x,y)$ of frame $f$ is given by equation (1-28),
where $S_f(x,y)$ denotes the saliency at position $(x,y)$ of frame $f$ (a sketch follows).
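A minimal sketch of equation (1-28), assuming the interest weight of the object covering each pixel multiplies the saliency there; the label-map representation is an illustrative assumption, and the exact fusion form is otherwise left open by the text above.

```python
import numpy as np

def interest_map(saliency, labels, interest):
    """saliency: (H, W) saliency map S_f of frame f;
    labels: (H, W) integer map, labels[y, x] = salient-object number o
    (0 = background); interest: dict mapping o -> class interest Int_l(o)."""
    weights = np.ones_like(saliency)
    for o, w in interest.items():
        weights[labels == o] = w          # weight pixels of object o
    return saliency * weights             # per-pixel interest degree
```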
Further, in step S4, the "current" statistical model is used to describe the random motion of the user's viewpoint; the motion prediction equation is given by equations (1-29),
in which $x_f, y_f$, $\dot x_f, \dot y_f$, and $\ddot x_f, \ddot y_f$ respectively denote the position, velocity, and acceleration of the viewpoint in the x-axis and y-axis directions when the user watches frame $f$; $\bar a_x, \bar a_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and $\alpha$ is the reciprocal of the maneuver acceleration time constant, i.e., the maneuver frequency (see the sketch after this paragraph);
the probability that the viewpoint is located at $(x,y)$ when the user watches frame $f+\delta$ can then be calculated by equation (1-30);
the user behavior model $Act_k$ is given by equation (1-31),
and the user behavior distribution map reflecting the probability of the user viewpoint appearing at each position is calculated from equations (1-29) and (1-30).
Further, in step S5, the user viewpoint position is predicted by equation (1-32),
in which $\widetilde S_{f+\delta}(x,y)$ and $Act_{f+\delta}(x,y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position $(x,y)$ of frame $f+\delta$, and the function $\Phi$ is the fusion function of the user interest distribution map and the user behavior distribution map (a sketch follows).
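A sketch of equation (1-32), assuming an elementwise product as the default fusion (the function Φ is left generic above):

```python
import numpy as np

def predict_position(interest_map, behavior_map, phi=np.multiply):
    """The predicted viewpoint at frame f+delta is the location (x, y)
    maximizing phi(interest, behavior); the product default for phi is
    an assumption."""
    fused = phi(interest_map, behavior_map)
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return x, y
```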
Compared with the prior art, the invention has at least the following beneficial effects:
the invention provides a low-complexity viewpoint prediction method fusing user interest and behavior characteristics, which starts from video compression code streams, comprehensively considers the significance characteristics of video contents and the user interest and behavior characteristics, and designs a viewpoint prediction model capable of accurately predicting the future viewpoint position of a user with lower complexity. Specifically, an interest model of a user is constructed, and is combined with the video saliency characteristics to generate an interest distribution map capable of reflecting the interest degree of the user on each salient object of the VR video; estimating the probability of the future viewpoint appearing at different positions for the user according to the current viewpoint position of the user and the behavior characteristics (such as speed, acceleration, maneuvering frequency and the like) of the user; and integrating the user interest distribution and establishing a viewpoint prediction model capable of accurately predicting the long-term viewpoint change of the user. In addition, the significance analysis of the invention is based on video compression domain information, and the spatial correlation between adjacent blocks of video content and the time continuity between adjacent frames are utilized to carry out comprehensive significance analysis on the intra-frame prediction coding block on the basis of the prior art.
The compressed-domain video saliency analysis adopted by the invention effectively reduces computation and implementation complexity. Unlike prior work, which considered only the intra-frame non-predictive and inter-frame predictive coding modes, the proposed method analyzes the luminance, color, and texture saliency of intra-prediction blocks using the extracted prediction-residual DCT coefficients, the spatial correlation of the video content, and the prediction directionality of the intra prediction modes; it estimates the motion vectors missing for intra-coded blocks by exploiting the temporal continuity of the video content; and it finally obtains the video saliency result by adaptive fusion with the other saliency features. Because the intra prediction modes are taken into account, the proposed method effectively improves the accuracy of video saliency analysis based on H.264/AVC compressed bitstreams without increasing computation or implementation complexity.
User characteristics are one of the key factors influencing changes of the user viewpoint. Unlike prior work, which focused only on the influence of the user's recent viewpoint motion, the invention explores in depth the mechanism by which user interest and behavior characteristics act on the user viewpoint, and combines these user characteristics with the compressed-domain saliency features of the video to establish a low-complexity viewpoint prediction model that takes the user as the core and the video content as the guide, significantly improving the accuracy of viewpoint prediction.
Drawings
FIG. 1 is a diagram of a technical route of a viewpoint prediction algorithm research integrating user interests and behavior characteristics;
fig. 2 is a schematic diagram of 4 x 4 block intra prediction;
fig. 3 is a flow chart of a video saliency detection algorithm based on compressed domain.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in fig. 1, the present invention provides a low-complexity viewpoint prediction method fusing user interests and behavior characteristics, and the specific implementation steps include:
(1) video saliency detection based on compressed domain
Saliency detection is performed on the video for which viewpoint prediction is to be performed, yielding the I-frame saliency maps and the P-frame saliency maps, specifically:
1) Extracting compressed-domain information: the coding modes and residual DCT coefficients of intra-prediction-coded image blocks, the coding modes and motion vectors of inter-prediction-coded image blocks, and the pixel values of I_PCM-coded image blocks are extracted from the video compressed domain.
2) I-frame (intra-coded frame) saliency map generation:
estimating a DC coefficient of the image block after direct DCT (discrete cosine transform) conversion without prediction according to the coding mode of the intra-frame prediction coding image block obtained in the step 1) and the residual DCT coefficient so as to represent the brightness and color characteristics of the image block;
selecting a prediction direction corresponding to an intra-frame prediction mode as a texture direction of an intra-frame prediction coding image block, predicting the texture intensity of the image block by using the texture intensity of an adjacent block with the texture direction similar to that of the intra-frame prediction coding image block so as to represent the texture direction and intensity characteristics of the image block, calculating a DCT (discrete cosine transformation) coefficient of the image block by using an original pixel value of an I _ PCM (inter-frame pulse code) coding image block recovered from a compression domain, and estimating the brightness, the color, the texture direction and the intensity characteristics of the image block according to the DCT coefficient;
and according to the coding mode and the motion vector of the inter-frame predictive coding image block of the previous frame and the next frame of the I frame or the P frame extracted from the compressed domain, constructing a motion vector set based on 4 x 4 blocks for the I frame image by utilizing the time continuity of the video content so as to represent the motion characteristic of the I frame image.
And respectively carrying out significance detection on the brightness, the color, the texture intensity, the texture direction and the motion characteristics of the acquired I frame image, and adaptively fusing the brightness, the color, the texture intensity, the texture direction and the motion characteristics into an I frame significance map so as to comprehensively represent the significance characteristics of each object in the I frame.
3) P-frame (inter-coded frame) saliency map generation:
Sorting and filling the motion vectors of the inter-prediction-coded image blocks extracted from the compressed domain, and establishing a complete motion vector set based on 4×4 blocks for each P frame;
then, using the temporal reference relationship provided by the P-frame motion vector set, translating the saliency features of the corresponding matched blocks in the preceding I frame according to the motion vectors, thereby obtaining the P-frame saliency map.
(2) Construction of user interest distribution map
5) Salient-object segmentation and labeling: the whole video for which viewpoint prediction is to be performed is divided into several video segments. For each segment, the I-frame image is divided into several salient objects according to the I-frame saliency map information generated in step (1), and the pi most salient objects in the I-frame image are numbered in descending order of saliency value.
At the same time, according to the temporal reference relationship between P-frame and I-frame image blocks in inter-frame prediction coding, the salient objects of each P-frame image are labeled with the salient-object label values referenced in the preceding I frame.
Preferably, if a salient object of an I frame has the same or similar saliency features as a salient object of the preceding P frame, the I-frame object is labeled according to the label value of that object in the preceding P frame.
6) Viewpoint dwell-time statistics: using historical real-viewpoint feedback, the viewpoint dwell time of users who have watched the video on the pi most salient objects is counted;
7) User classification based on interest similarity: the users are classified with the K-means clustering algorithm from machine learning according to the counted dwell times, so that users within a class have higher interest similarity than users in different classes (see the sketch after this list).
8) Obtaining the interest distribution map of each user class: an interest model is generated for each class from the dwell times of its users on the pi most salient objects of the video, and, combining the I-frame and P-frame saliency maps acquired in step (1), an interest distribution map of every frame of the video is generated for each user class.
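Step 7) can be realized, for example, with scikit-learn; the feature layout (one row per user, one column per dwell-time statistic) and the number of classes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# each row: one user's viewpoint dwell times on the pi most salient
# objects of every segment (here 50 users x 12 dwell-time features)
dwell = np.random.default_rng(1).random((50, 12))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(dwell)
labels = kmeans.labels_            # interest class of each user
centers = kmeans.cluster_centers_  # the users m_1..m_L are those nearest
                                   # to these cluster centres
```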
(3) User behavior profile prediction
9) Constructing the user behavior model: borrowing from the modeling of maneuvering targets, the random motion of the user viewpoint is described with the existing "current" statistical model, and the user behavior model is constructed from the viewpoint feedback of videos the user has watched historically.
10) Generating the user behavior distribution map: a user behavior distribution map reflecting the probability of the user viewpoint appearing at each position is calculated from the user behavior model.
11) Viewpoint prediction: the viewpoint is predicted by combining the interest distribution map of the user's class with the user behavior distribution map.
Example 1
(1) Video saliency detection based on compressed domain
Most videos transmitted and stored on the Internet are compression-coded, so performing saliency detection directly in the compressed domain avoids the complex computation caused by decoding. Obtaining the intensity, color, and texture features of the static image from DCT coefficients, estimating the motion intensity from motion vectors, and performing saliency detection and fusion on these data is currently the most effective approach to compressed-domain video saliency detection. However, none of these methods considers the intra-frame prediction coding modes, so it is difficult for them to perform accurate saliency detection on bitstreams using the H.264/AVC coding standard, which has become one of the mainstream compression coding standards.
1) Extraction of brightness and color characteristics of intra-frame prediction coding image block in I frame
The intra prediction mode is a coding mode that exploits spatial correlation: the current block to be coded is predicted from already-coded neighboring pixels in the same frame, and the DCT is applied to the prediction residual. Therefore, for intra-prediction-coded blocks, the DCT coefficients extracted directly from the compressed domain can no longer be used directly to represent the luminance, color, and texture features of the original image block; they require some preprocessing, as follows.
For an N×N image block $i$ of the video that uses intra-frame prediction coding, the DCT transform coefficients $C_i$ of image block $i$ can be calculated from equation (1-1):
$C_i = C_i^{\mathrm{pred}} + C_i^{\mathrm{res}} \qquad (1\text{-}1)$
where $C_i^{\mathrm{pred}}$ denotes the DCT coefficients of the intra prediction block corresponding to image block $i$, and $C_i^{\mathrm{res}}$ denotes the DCT coefficients of the intra prediction residual block corresponding to image block $i$.
For intra-prediction-coded blocks the encoder applies the DCT only to the intra prediction residual block, so the DCT coefficients extracted directly from the video compressed domain are the residual coefficients $C_i^{\mathrm{res}}$ of image block $i$. By equation (1-1), computing the DCT transform coefficients $C_i$ of image block $i$ therefore only requires estimating the value of $C_i^{\mathrm{pred}}$, the DCT coefficients of the intra prediction block.
According to the principle of the DCT, $C_i^{\mathrm{pred}}$ can be expressed in the form of equation (1-2):
$C_i^{\mathrm{pred}}(u,v) = \frac{2}{N}\,c(u)\,c(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}\hat p_i(x,y)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N} \qquad (1\text{-}2)$
where $\hat p_i(x,y)$ denotes the intra prediction value of the pixel of image block $i$ at $(x,y)$.
In general, an image block $i$ is coded in intra prediction mode if and only if it has strong spatial correlation with its neighboring blocks; the predicted pixels of image block $i$ are therefore weighted sums of its neighboring pixels. If $\{s_{i,q}\}$, $q=0,1,\dots,Q-1$, is defined as the set of coded and reconstructed neighboring pixels used to predict image block $i$, the intra prediction value of each pixel of image block $i$ is calculated from equation (1-3):
$\hat p_i(x,y) = \sum_{q=0}^{Q-1} w_{i,q}(x,y)\,p(s_{i,q}) \qquad (1\text{-}3)$
where $p(s_{i,q})$ is the pixel value of pixel $s_{i,q}$ and $w_{i,q}(x,y)$ is the prediction weight corresponding to pixel $s_{i,q}$.
Define $DC^{(j)}$, $j=0,1,2,\dots,J$, as the DC coefficients of the neighboring 4×4 blocks that contain the coded reconstructed pixels $s_{i,q}$, $q=0,1,\dots,Q-1$, used to predict image block $i$. Assuming that the pixels of a 4×4 block are equal to one another and equal to the average of all pixels of the whole 4×4 block, the DC coefficient $DC_i^{\mathrm{pred}}$ of the prediction block can be obtained as a weighted sum of the $DC^{(j)}$, i.e.
$DC_i^{\mathrm{pred}} = \sum_j w_j\,DC^{(j)} \qquad (1\text{-}4)$
where the weights $w_j$ depend only on the prediction mode adopted.
Substituting equation (1-4) into equation (1-1), the DC coefficient representing the luminance and color characteristics of image block $i$ can be calculated by equation (1-5):
$DC_i = DC_i^{\mathrm{pred}} + DC_i^{\mathrm{res}} \qquad (1\text{-}5)$
H.264/AVC supports intra prediction with two partition sizes, 4×4 and 16×16. A 4×4 partition has 9 selectable prediction modes, and each 4×4 block of a macroblock is predicted independently (as shown in fig. 2); a 16×16 partition has 4 intra prediction modes and the whole macroblock is predicted at once, which suits the coding of flat image regions.
Specifically, for prediction based on 4×4 partitions, the prediction pixels are selected from the pixels $s_{i,0}\sim s_{i,12}$ of the four neighboring blocks located to the upper left, above, upper right, and left of the block. Letting $DC^{(j)}$, $j=0,1,2,3$, denote the DC coefficients of these four neighboring blocks, $DC_i^{\mathrm{pred}}$ can be calculated as follows:
1) when the prediction mode of the 4×4 block is 0 (vertical), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}$ of the neighboring block above it; therefore, by equation (1-4), $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block above;
2) when the prediction mode of the 4×4 block is 1 (horizontal), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left; therefore $DC_i^{\mathrm{pred}}$ equals the DC coefficient of the block on the left;
3) when the prediction mode of the 4×4 block is 2 (DC), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks above and to the left, every pixel being the rounded mean of those eight pixels, where round(α) denotes rounding the value α; therefore $DC_i^{\mathrm{pred}}$ is the average of the DC coefficients of the blocks above and on the left;
4) when the prediction mode of the 4×4 block is 3 (diagonal down-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,8}$ of the neighboring blocks above and to its upper right;
5) when the prediction mode of the 4×4 block is 4 (diagonal down-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
6) when the prediction mode of the 4×4 block is 5 (vertical-right), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,4}, s_{i,9}\sim s_{i,10}$ of the neighboring blocks to its upper left, above, and left;
7) when the prediction mode of the 4×4 block is 6 (horizontal-down), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,0}\sim s_{i,3}, s_{i,9}\sim s_{i,12}$ of the neighboring blocks to its upper left, above, and left;
8) when the prediction mode of the 4×4 block is 7 (vertical-left), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,1}\sim s_{i,7}$ of the neighboring blocks above and to its upper right;
9) when the prediction mode of the 4×4 block is 8 (horizontal-up), $\hat p_i(x,y)$ is predicted from the pixels $s_{i,9}\sim s_{i,12}$ of the neighboring block to its left;
in modes 3-8, $DC_i^{\mathrm{pred}}$ likewise follows from equation (1-4), with the weights $w_j$ induced by the listed prediction pixels.
For intra prediction based on 16×16 partitions, the invention derives, in a similar way, a corresponding strategy to estimate the DC coefficient $DC_i^{\mathrm{pred}}$ of each 4×4 block $i$ ($i=0,1,\dots,15$) in the 16×16 partition, so as to represent the luminance and color characteristics of the original image:
1) when the prediction mode of 16×16 partition $m$ is 0 (vertical), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,16}$ of the neighboring partition above $m$; here mod(·,·) denotes the modulo operation, i.e., mod(i,4) returns the remainder of $i$ divided by 4; hence, if the DC coefficients of the 4×4 neighboring blocks above are denoted, from left to right, $DC^{(p)}$, $p=1,2,3,4$, then $DC_i^{\mathrm{pred}} = DC^{(\mathrm{mod}(i,4)+1)}$;
2) when the prediction mode of 16×16 partition $m$ is 1 (horizontal), each 4×4 block $i$ is predicted from the pixels $s_{m,17}\sim s_{m,32}$ of the neighboring partition to the left of $m$; hence, if the DC coefficients of the 4×4 neighboring blocks on the left are denoted, from top to bottom, $DC^{(p)}$, $p=5,6,7,8$, then $DC_i^{\mathrm{pred}} = DC^{(\lfloor i/4\rfloor+5)}$;
3) when the prediction mode of 16×16 partition $m$ is 2 (DC), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$; therefore $DC_i^{\mathrm{pred}} = \tfrac{1}{8}\sum_{p=1}^{8} DC^{(p)}$;
4) when the prediction mode of 16×16 partition $m$ is 3 (plane), each 4×4 block $i$ is predicted from the pixels $s_{m,1}\sim s_{m,32}$ of the neighboring partitions above and to the left of $m$ using the plane-mode expression, in which $i=0,1,\dots,15$; $x,y=0,1,2,3$; and Clip1(x)=min(255, max(0,x)); $DC_i^{\mathrm{pred}}$ then follows from equation (1-4) as the average of the plane-predicted pixels of block $i$.
2) Texture direction and intensity feature extraction of intra-frame prediction coding image block in I frame
The prediction direction corresponding to the intra prediction coding mode is selected as the texture direction of image block $i$, exploiting the close relationship between the intra prediction mode and the image texture information. If the texture direction of neighboring block $j$ is closest to that of image block $i$ and its prediction weight is highest, the texture intensity of block $j$ is used to predict the texture intensity of image block $i$, as in equation (1-23),
where $N_i\times N_i$ denotes the partition size of image block $i$, $N_j\times N_j$ denotes the partition size of the neighboring block $j$ whose texture direction is closest to that of image block $i$ and whose prediction weight is highest, and $T_j$ denotes the texture intensity of that neighboring block.
3) Extraction of brightness, color, texture direction and intensity characteristics of I _ PCM coded image block in I frame
A 4×4 DCT is applied to the original pixel values of each I_PCM-coded image block recovered from the compressed domain, and the block's luminance, color, texture direction, and intensity characteristics are estimated from the resulting DCT coefficients. The luminance and color of an I_PCM-coded image block $i'$ are described by the DC coefficient of its DCT coefficients, and the texture direction $\theta_{i'}$ and intensity $T_{i'}$ are calculated from the AC coefficients, as shown in equations (1-24) and (1-25),
where $N_{i'}\times N_{i'}$ denotes the partition size of the I_PCM-coded image block $i'$. Since smaller partition sizes are typically used for regions richer in texture, a scale factor is introduced in equation (1-25) to ensure that smaller partitions receive greater texture intensity.
4) I-frame image motion feature estimation
The motion features of the I-frame image are described by motion vectors. Because all image blocks of an I frame are intra-coded, motion vectors cannot be extracted for them directly from the compressed bitstream; the motion vector of an I-frame image block must instead be interpolated from the motion vectors of the image blocks of the P frames before and after it, making full use of the temporal continuity of the video content. The motion vectors of the inter-prediction-coded blocks of P frames are extracted directly from the compressed domain; they were chosen from a coding perspective, contain a certain amount of noise, and hardly represent the true motion features, whereas a truly moving object appears in each frame image as a region. The motion vectors are therefore preprocessed before interpolation (a simplified sketch follows), including:
a) motion vector filling: from the perspective of spatial correlation, the motion vectors missing for intra-prediction blocks are estimated from the motion vectors of neighboring blocks along the prediction direction;
b) global motion filtering: the global motion vectors are removed to obtain motion vectors that truly reflect object motion;
c) temporal-spatial amplitude filtering: isolated motion-vector noise of small amplitude is filtered out from the perspective of temporal continuity and spatial correlation;
d) temporal-spatial phase filtering: isolated motion-vector noise with abrupt direction changes is filtered out from the perspective of phase consistency;
e) motion region dilation: holes in the motion regions are connected to improve the integrity of moving objects.
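A simplified sketch of steps b)-d) on a block motion field is given below (filling (a) and region dilation (e) are omitted); the median-based global-motion estimate and the thresholds are illustrative assumptions.

```python
import numpy as np

def preprocess_motion_field(mv, amp_thresh=1.0):
    """mv: (Hb, Wb, 2) float block motion field. Returns a filtered copy."""
    # b) global motion filtering: subtract the dominant (median) motion
    mv = mv - np.median(mv.reshape(-1, 2), axis=0)
    # c) amplitude filtering: zero out low-amplitude vectors
    amp = np.linalg.norm(mv, axis=2)
    mv[amp < amp_thresh] = 0.0
    # d) phase filtering: zero vectors opposing their 3x3 neighbourhood mean
    mean = np.zeros_like(mv)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            mean += np.roll(mv, (dy, dx), axis=(0, 1))
    mean /= 9.0
    mv[(mv * mean).sum(axis=2) < 0] = 0.0   # negative correlation = outlier
    return mv
```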
5) I-frame saliency map generation
A center-surround operator is applied separately to the image-block luminance, color, texture intensity and direction, and motion features obtained in steps 1)-4), and the saliency detection results of these features are adaptively fused into an I-frame saliency map that represents the saliency of each object in the I frame (a sketch follows).
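A center-surround operator can be sketched as local contrast against a blurred surround, with the adaptive fusion reduced to a weighted sum; both simplifications are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def center_surround(feature, surround=16):
    """Saliency of one feature map as contrast between each location and
    its local surround (box average), normalized to [0, 1]."""
    contrast = np.abs(feature - uniform_filter(feature, size=surround))
    return contrast / (contrast.max() + 1e-12)

def fuse_features(features, weights=None):
    """Adaptive fusion reduced to a weighted sum of the per-feature maps."""
    maps = [center_surround(f) for f in features]
    w = weights or [1.0 / len(maps)] * len(maps)
    return sum(wi * m for wi, m in zip(w, maps))
```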
6) P-frame saliency map generation
P frames are not analyzed for saliency independently. Instead, the temporal reference relationship between I-frame and P-frame image blocks in inter-frame prediction coding is read from the motion vectors of the P-frame image blocks, and the saliency features of the I-frame image are translated as the motion vectors indicate, yielding the P-frame saliency map at reduced computational complexity. The overall algorithm flow is shown in fig. 3.
(2) Building user interest distribution map
Unlike ordinary video, VR video covers a 360-degree field of view, and its scenes are complex, usually containing several salient objects of different features and different sizes distributed over different areas of the image. The invention therefore starts from the video saliency maps and combines them with viewpoint feedback information to establish a user interest model that reflects the user's degree of interest in the different salient objects, and on this basis generates a personalized interest distribution map for each frame of the video for each user, which better guides the subsequent accurate viewpoint prediction. The specific approach is as follows:
2.1 Divide the whole video to be viewpoint-predicted into several video segments. For each video segment, partition the I-frame image into several salient objects according to the generated I-frame saliency map, and number the pi most salient objects in the I-frame image in descending order of saliency value. At the same time, according to the temporal reference relationship between P frames and the I frame during predictive coding, mark the salient objects of each P-frame image with the labels of the salient objects they reference in the preceding I frame. If a salient object in an I frame has the same or similar saliency characteristics as a salient object in the preceding P frame, the I-frame object is preferentially marked with the label value of that object in the preceding P frame, keeping labels consistent across segments (a labeling sketch is given below).
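The labeling sketch referenced above: a hedged illustration that segments the saliency map into connected regions and numbers the pi most salient ones. The threshold and the connected-component segmentation are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def label_top_objects(saliency, pi=5, thresh=0.5):
    """Segment the I-frame saliency map into objects and keep the pi most
    salient, numbered in decreasing order of saliency (thresh is assumed)."""
    labels, n = ndimage.label(saliency > thresh * saliency.max())
    means = ndimage.mean(saliency, labels, index=range(1, n + 1))
    order = np.argsort(means)[::-1][:pi]      # most salient objects first
    marked = np.zeros_like(labels)
    for rank, obj in enumerate(order, start=1):
        marked[labels == obj + 1] = rank      # rank 1 = most salient object
    return marked
```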
2.2 For any segment p, the user viewpoint feedback information is used to count the viewpoint dwell time of each user k' who has watched the video on the pi most salient objects, which can be expressed in the following form:

$T_p^{k'} = \left\{ t_{p,o}^{k'} \mid o = 1, 2, \dots, \pi \right\}, \quad p \in \mathbb{P}$

where $\mathbb{P}$ denotes the set of segments into which the video is partitioned; $\Omega_{p,o}$ denotes the set of positions of the region in which salient object o is located in video segment p; and $t_{p,o}^{m_l}$ denotes the viewpoint dwell time of user $m_l$ on salient object o of video segment p.
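As an illustration, the dwell times $t_{p,o}^{k'}$ above could be accumulated from a per-frame gaze trace as in the sketch below. The function names, the mask representation, and the uniform frame timing are assumptions; gaze positions are assumed to lie inside the frame.

```python
import numpy as np

def dwell_times(gaze, object_masks, frame_dt):
    """Total time a user's viewpoint stays on each of the pi most salient
    objects of a segment.

    gaze         : list of (x, y) viewpoint positions, one per frame
    object_masks : (pi, H, W) boolean masks, index o-1 = o-th salient object
    frame_dt     : duration of one frame in seconds
    """
    t = np.zeros(object_masks.shape[0])
    for x, y in gaze:
        for o, mask in enumerate(object_masks):
            if mask[int(y), int(x)]:
                t[o] += frame_dt
    return t
```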
2.3 According to the viewpoint dwell times, classify the users with the K-means clustering algorithm from machine learning, so that users within the same category have higher interest similarity than users in different categories.
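A sketch of step 2.3 using scikit-learn's KMeans. The feature layout (one dwell-time entry per object and segment) and the number of classes L = 3 are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# 40 users, each described by a 10-dimensional dwell-time vector (synthetic data).
rng = np.random.default_rng(0)
dwell = rng.random((40, 10))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dwell)
centers = km.cluster_centers_   # dwell-time profiles of the cluster centers
labels = km.labels_             # class membership of each user
```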
If the users are classified into L classes and the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$, then the interest model $Int_l$ of class-l users can be described using equations (1-27).
2.4 For a user k watching the video for the first time, the user's class is predicted from the interest similarity of that user when watching other videos, and an interest distribution map of each frame is generated for user k from the class's user interest model and the video saliency. The user interest degree $I_f(x, y)$ at position (x, y) of frame f, estimated from the interest distribution map, is given by equations (1-28),
where $S_f(x, y)$ denotes the saliency at position (x, y) of frame f.
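Equations (1-28) are not reproduced here; the sketch below assumes a simple multiplicative form in which the class interest weight of the object covering a pixel scales the saliency at that pixel. Both the form and the argument names are assumptions.

```python
import numpy as np

def interest_map(saliency, object_marks, interest_weights):
    """Weight the per-pixel saliency by the class interest in the salient
    object covering that pixel (assumed multiplicative combination)."""
    out = np.zeros_like(saliency)
    for o, weight in enumerate(interest_weights, start=1):
        sel = object_marks == o          # pixels of the o-th salient object
        out[sel] = weight * saliency[sel]
    return out
```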
(3) User behavior distribution map prediction
While watching VR video, humans switch viewpoints by controlling head motion. Therefore, drawing on the modeling methods for maneuvering targets, the random motion of the user's viewpoint is described by a "current" statistical model; the specific motion prediction equation is shown in equations (1-29),
where $x_f$, $y_f$, $\dot{x}_f$, $\dot{y}_f$, $\ddot{x}_f$, $\ddot{y}_f$ respectively denote the position, velocity, and acceleration of the viewpoint in the x-axis and y-axis directions when the user watches frame f; $\bar{a}_x$ and $\bar{a}_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and α is the reciprocal of the maneuvering acceleration time constant, i.e., the maneuvering frequency.
Due to the randomness, complexity, and diversity of viewpoint motion, inaccurate predictions inevitably occur when this model is used to describe the motion state of the viewpoint. The invention therefore introduces two independent random variables $e_x$ and $e_y$ to describe the model's prediction error of the viewpoint in the x-axis and y-axis directions, assuming that they follow zero-mean distributions with variances $\sigma_x^2$ and $\sigma_y^2$, respectively, and are independent of each other. Then the probability that the viewpoint is located at (x, y) when the user views frame (f + δ) can be calculated by equations (1-30).
Considering that the parameters α, $\bar{a}_x$, $\bar{a}_y$, $\sigma_x^2$, and $\sigma_y^2$ required in the above analysis are all related only to the user's behavior characteristics, a user behavior model $Act_k$ is defined from them and constructed using the user viewpoint feedback information. After the user behavior model is obtained, the user behavior distribution map reflecting the probability of the user viewpoint's appearance can be calculated by equations (1-29) and (1-30).
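The following sketch illustrates a one-axis "current"-statistical-model predictor and the Gaussian behavior probability discussed above. The discrete transition matrices are the standard CS-model form and are an assumption here; they may differ from the exact matrices of equations (1-29).

```python
import numpy as np

def cs_predict(state, a_bar, alpha, T, delta):
    """One-axis viewpoint prediction with a 'current' statistical (CS) model.

    state : [position, velocity, acceleration] of the viewpoint at frame f
    a_bar : mean maneuvering acceleration of this user
    alpha : reciprocal of the maneuvering time constant (maneuvering frequency)
    T     : frame period in seconds; delta : frames to look ahead
    """
    e = np.exp(-alpha * T)
    # Standard discrete CS-model transition (assumed form).
    F = np.array([[1.0, T, (alpha * T - 1.0 + e) / alpha**2],
                  [0.0, 1.0, (1.0 - e) / alpha],
                  [0.0, 0.0, e]])
    G = np.array([T**2 / 2.0 - (alpha * T - 1.0 + e) / alpha**2,
                  T - (1.0 - e) / alpha,
                  1.0 - e])
    x = np.asarray(state, dtype=float)
    for _ in range(delta):
        x = F @ x + G * a_bar
    return x  # predicted [position, velocity, acceleration] at frame f + delta

def behaviour_probability(x, y, pred_x, pred_y, sigma_x, sigma_y):
    """Probability of the viewpoint at (x, y) under the stated zero-mean,
    independent error assumption (Gaussian form assumed)."""
    return (np.exp(-((x - pred_x)**2 / (2 * sigma_x**2)
                     + (y - pred_y)**2 / (2 * sigma_y**2)))
            / (2 * np.pi * sigma_x * sigma_y))
```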
(4) Viewpoint prediction
In practical applications, it is found that, guided by the selective visual attention mechanism and the inertia of user behavior, users are more inclined to focus on objects that both conform to the maneuvering reality and are of interest. Thus, the viewpoint position of the end user when viewing frame (f + δ) can be predicted from equations (1-32),
where $I_{f+\delta}(x, y)$ and $B_{f+\delta}(x, y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position (x, y) of frame (f + δ), and the function Φ is the fusion function of the user interest distribution map and the user behavior distribution map.
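A sketch of the final prediction step. The multiplicative fusion Φ(I, B) = I · B is an illustrative assumption; equations (1-32) themselves are not reproduced in this text.

```python
import numpy as np

def fuse_and_predict(interest_map, behaviour_map):
    """Pick the position maximising the fused interest-behavior score."""
    score = interest_map * behaviour_map   # assumed Phi(I, B) = I * B
    y, x = np.unravel_index(np.argmax(score), score.shape)
    return x, y                            # predicted viewpoint at frame f + delta
```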
The "high bit rate and low delay" characteristics of VR video provide great challenges for network transmission. Especially in a mobile network, limited bandwidth resources and time-varying network transmission capability will seriously hinder the improvement of VR video user viewing experience. The VR video covers 360-degree visual field angle, the horizontal visual field range of human eyes does not exceed 180 degrees generally, and the visual field angle which can be supported by VR terminal equipment (such as VR helmet) is only about 90-110 degrees. Therefore, in recent years, VR video adaptive transmission schemes based on video blocking are becoming hot spots and common consensus in academia and industry. The invention divides the VR video into a plurality of video blocks according to the space and dynamically selects the video blocks within the visual angle range according to the viewpoint of the user for transmission, thereby reducing the requirement of the VR video on the network bandwidth while ensuring good visual experience. In order to avoid the problems of picture delay, picture blocking or quality reduction and the like caused by transmission delay when the view points of the users are switched, a view point prediction technology is adopted to predict a new view point of the user at the next moment, and the video blocks in a new view angle range are pre-downloaded and pre-cached. Therefore, the accurate prediction of the user view point has an important significance for improving the user viewing experience.
Claims (10)
1. A low-complexity viewpoint prediction method fusing user interest and behavior characteristics is characterized by comprising the following specific steps:
s1, acquiring the video frame saliency maps of the video to be viewpoint-predicted, wherein the video frame saliency maps comprise an I-frame saliency map and a P-frame saliency map;
s2, dividing the video to be viewpoint-predicted into several video segments, and marking the serial numbers of the pi most salient objects in the video segments by using the video frame saliency maps;
s3, obtaining the viewpoint dwell time of users who have watched the video on the pi most salient objects, classifying the users according to the viewpoint dwell time, obtaining the interest model of each user class according to the viewpoint dwell time of its users on the pi most salient objects of the video, and combining the user interest model with the obtained video frame saliency maps to obtain the interest distribution map of each frame of the video;
s4, constructing a user behavior model by using the random motion of the user viewpoint and the viewpoint feedback information of the video watched by the user historically, and acquiring a user behavior distribution map reflecting the occurrence probability of the user viewpoint according to the user behavior model;
s5, combining the interest distribution map of the users of the same category with the user behavior distribution map to obtain a viewpoint prediction model, and predicting the viewpoint positions of the users by using the viewpoint prediction model.
2. The method for predicting low-complexity viewpoints by fusing user interests and behavior features as claimed in claim 1, wherein in step S1, the specific steps for generating the I-frame saliency map are as follows:
s1.1, obtaining the intra-frame prediction coding modes and the residual DCT (discrete cosine transform) coefficients, and deriving from them the DC coefficients of the image blocks without prediction reconstruction and direct DCT transformation, wherein the DC coefficients are used for representing the brightness and color characteristics of the image blocks;
s1.2, acquiring the prediction direction corresponding to the intra-frame prediction coding mode, taking the prediction direction as the texture direction of the intra-frame prediction coded image block, and acquiring the texture intensity of the adjacent block whose texture direction is most similar to that of the intra-frame prediction coded image block, wherein this texture intensity is taken as the texture intensity of the intra-frame prediction coded image block;
s1.3, recovering the original pixel values of I_PCM coded image blocks from the compressed domain and calculating the DCT coefficients of the I_PCM coded image blocks from these pixel values, wherein the DC coefficients among the DCT coefficients are used for expressing the brightness and color of the I_PCM coded image blocks, and the AC coefficients among the DCT coefficients are used for expressing the texture direction and intensity characteristics of the I_PCM coded image blocks;
s1.4, constructing the motion vector set of the I-frame image from the coding modes and motion vectors of the inter-frame predictive coded image blocks in the P frames immediately before and after the I frame, using the temporal continuity of the content of the video to be viewpoint-predicted;
s1.5, respectively carrying out significance detection on the brightness, the color, the texture intensity, the texture direction and the motion characteristic of the acquired I frame image, and adaptively fusing the significance detection results into an I frame significance map;
the specific steps for generating the P frame saliency map are as follows:
s1.6, obtaining the motion vectors of the inter-frame prediction coded image blocks in the compressed domain, organizing and filling the motion vectors, and establishing a complete motion vector set for each P frame;
s1.7, translating the significance characteristics of the image blocks in the I frame according to the indication of the motion vector by utilizing the time domain reference relationship between the P frame image blocks and the I frame image blocks in the inter-frame prediction coding process to obtain a P frame significance map.
3. The method according to claim 2, wherein in step S1.1, for an N × N image block i coded by intra-frame prediction in the video to be viewpoint-predicted, the DCT transform coefficients $C_i^{DCT}$ of image block i can be calculated from equation (1-1):

$C_i^{DCT} = P_i^{DCT} + R_i^{DCT} \quad (1\text{-}1)$

where $P_i^{DCT}$ denotes the DCT coefficients of the intra-frame prediction block corresponding to image block i, and $R_i^{DCT}$ denotes the DCT coefficients of the intra-frame prediction residual block corresponding to image block i;

the DCT coefficients $R_i^{DCT}$ of the intra-frame prediction residual block are directly extracted from the DCT coefficients in the compressed domain of the video to be viewpoint-predicted;

the residual is given by equation (1-2), $r_i(x, y) = p_i(x, y) - \hat{p}_i(x, y)$, where $p_i(x, y)$ denotes the pixel value of image block i at (x, y) and $\hat{p}_i(x, y)$ denotes its intra-frame prediction value;

if $\{s_{i,q}\}$, q = 0, 1, …, Q-1, is defined as the set of coded and reconstructed neighboring pixels used for predicting the image block, the intra-frame prediction value of each pixel of image block i is calculated from equation (1-3):

$\hat{p}_i(x, y) = \sum_{q=0}^{Q-1} w_{i,q}(x, y)\, s_{i,q} \quad (1\text{-}3)$

where $s_{i,q}$ is the value of pixel $s_{i,q}$ and $w_{i,q}(x, y)$ is the prediction weight corresponding to pixel $s_{i,q}$;

considering the coded and reconstructed neighboring pixels $s_{i,q}$, q = 0, 1, …, Q-1, used for predicting the image block, and assuming that the pixel values within the same 4 × 4 block are equal and equal to the average value of all pixels in that 4 × 4 block, the DC coefficient $P_i^{DC}$ among the DCT coefficients of the prediction block can be obtained by equation (1-4), i.e.

$P_i^{DC} = \sum_j w_j\, C_j^{DC} \quad (1\text{-}4)$

where $w_j$ is the weight, whose specific value is determined by the adopted prediction mode, and $C_j^{DC}$ is the DC coefficient of the neighboring 4 × 4 block j;

substituting equation (1-4) into equation (1-1), the DC coefficient $C_i^{DC}$ representing the brightness and color characteristics of image block i can be calculated by equation (1-5):

$C_i^{DC} = \sum_j w_j\, C_j^{DC} + R_i^{DC} \quad (1\text{-}5)$
4. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 3, wherein in step S1.1, the predicted pixels of a 4 × 4 partition are selected from the pixels $s_{i,0} \sim s_{i,12}$ of the four neighboring blocks located to the upper left, above, upper right, and left of the predicted pixels. Let $C_{UL}^{DC}$, $C_U^{DC}$, $C_{UR}^{DC}$, and $C_L^{DC}$ respectively denote the DC coefficients of these four neighboring blocks; then $P_i^{DC}$ is obtained as follows:

1) When the prediction mode of the 4 × 4 block is 0 (vertical), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,4}$ of the neighboring block above it, i.e. $\hat{p}_i(x, y) = s_{i,x+1}$; therefore, $P_i^{DC} = C_U^{DC}$.

2) When the prediction mode of the 4 × 4 block is 1 (horizontal), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,9} \sim s_{i,12}$ of the neighboring block to its left, i.e. $\hat{p}_i(x, y) = s_{i,9+y}$; therefore, $P_i^{DC} = C_L^{DC}$.

3) When the prediction mode of the 4 × 4 block is 2 (DC), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,4}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks above and to the left of it, i.e.

$\hat{p}_i(x, y) = \mathrm{round}\Big(\frac{1}{8}\sum_{q=1}^{4}\big(s_{i,q} + s_{i,8+q}\big)\Big)$

where round(α) denotes rounding the value α; therefore, $P_i^{DC} = \frac{1}{2}\big(C_U^{DC} + C_L^{DC}\big)$.

4) When the prediction mode of the 4 × 4 block is 3 (diagonal down-left), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,8}$ of the neighboring blocks above and to the upper right of it; therefore, $P_i^{DC}$ is a weighted combination of $C_U^{DC}$ and $C_{UR}^{DC}$, with the weights determined by the mode-3 prediction filter.

5) When the prediction mode of the 4 × 4 block is 4 (diagonal down-right), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,4}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

6) When the prediction mode of the 4 × 4 block is 5 (vertical-right), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,4}, s_{i,9} \sim s_{i,10}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

7) When the prediction mode of the 4 × 4 block is 6 (horizontal-down), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,0} \sim s_{i,3}, s_{i,9} \sim s_{i,12}$ of the neighboring blocks to the upper left, above, and to the left of it; therefore, $P_i^{DC}$ is a weighted combination of $C_{UL}^{DC}$, $C_U^{DC}$, and $C_L^{DC}$.

8) When the prediction mode of the 4 × 4 block is 7 (vertical-left), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,1} \sim s_{i,7}$ of the neighboring blocks above and to the upper right of it; therefore, $P_i^{DC}$ is a weighted combination of $C_U^{DC}$ and $C_{UR}^{DC}$.

9) When the prediction mode of the 4 × 4 block is 8 (horizontal-up), $\hat{p}_i(x, y)$ is predicted from the pixels $s_{i,9} \sim s_{i,12}$ of the neighboring block to its left; therefore, $P_i^{DC} = C_L^{DC}$.
5. The method according to claim 3, wherein in step S1.1, when intra-frame prediction is performed on a 16 × 16 partition basis, the DC coefficient $P_{m,i}^{DC}$ of the prediction block of each 4 × 4 block i in a 16 × 16 partition m is obtained as follows:

1) When the prediction mode of 16 × 16 partition m is 0 (vertical), each 4 × 4 block is predicted from the pixels $s_{m,1} \sim s_{m,16}$ of the adjacent partition above partition m, i.e. $\hat{p}_{m,i}(x, y) = s_{m,\,4\,\mathrm{mod}(i,4)+x+1}$, where mod(·,·) denotes the modulo operation and mod(i, 4) returns the remainder of i divided by 4. Therefore, if $C_{U,1}^{DC} \sim C_{U,4}^{DC}$ denote, from left to right, the DC coefficients of the 4 × 4 neighboring blocks above, then $P_{m,i}^{DC} = C_{U,\,\mathrm{mod}(i,4)+1}^{DC}$.

2) When the prediction mode of 16 × 16 partition m is 1 (horizontal), each 4 × 4 block i is predicted from the pixels $s_{m,17} \sim s_{m,32}$ of the adjacent partition to the left of partition m, i.e. $\hat{p}_{m,i}(x, y) = s_{m,\,16+4\lfloor i/4 \rfloor+y+1}$. Therefore, if $C_{L,1}^{DC} \sim C_{L,4}^{DC}$ denote, from top to bottom, the DC coefficients of the 4 × 4 neighboring blocks to the left, then $P_{m,i}^{DC} = C_{L,\,\lfloor i/4 \rfloor+1}^{DC}$.

3) When the prediction mode of 16 × 16 partition m is 2 (DC), each 4 × 4 block i is predicted from the pixels $s_{m,1} \sim s_{m,32}$ of the adjacent partitions above and to the left of partition m, i.e. $\hat{p}_{m,i}(x, y) = \mathrm{round}\big(\frac{1}{32}\sum_{q=1}^{32} s_{m,q}\big)$; therefore, $P_{m,i}^{DC} = \frac{1}{8}\sum_{j=1}^{4}\big(C_{U,j}^{DC} + C_{L,j}^{DC}\big)$.

4) When the prediction mode of 16 × 16 partition m is 3 (plane), each 4 × 4 block i is predicted from the pixels $s_{m,1} \sim s_{m,32}$ of the adjacent partitions above and to the left of partition m according to the plane prediction formula, where i = 0, 1, …, 15; x, y = 0, 1, 2, 3; and Clip1(x) = min(255, max(0, x)). Therefore, $P_{m,i}^{DC}$ is the corresponding weighted combination of $C_{U,1}^{DC} \sim C_{U,4}^{DC}$ and $C_{L,1}^{DC} \sim C_{L,4}^{DC}$, with the weights determined by the plane prediction formula.
6. The method as claimed in claim 2, wherein in step S1.2, the texture intensity of the intra-frame prediction coded image block i is obtained by scaling the texture intensity of its reference neighbor according to the partition sizes:

$T_i = \frac{N_i \times N_i}{N_j \times N_j}\, T_j$

where $N_i \times N_i$ denotes the partition size of image block i, $N_j \times N_j$ denotes the partition size of the neighboring block j that is closest to the texture direction of image block i and has a higher prediction weight, and $T_j$ denotes the texture intensity of that neighboring block j.
7. The method as claimed in claim 2, wherein in step S1.3, the texture direction $\theta_{i'}$ and texture intensity $T_{i'}$ of an I_PCM coded image block i' are expressed by the AC coefficients among its DCT coefficients, as shown in equations (1-24) and (1-25).
8. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 1, wherein in step S3, the interest model $Int_l$ of class-l users is given by equations (1-27),

where l is the user class; the users at the cluster centers of the classes are, in turn, $m_1, m_2, \dots, m_L$; and $t_{p,o}^{m_l}$ is the viewpoint dwell time of user $m_l$, who has watched the video to be detected, on the o-th of the pi most salient objects in video segment p;

in the formula, $\mathbb{P}$ denotes the set of segments into which the video is partitioned, and $\Omega_{p,o}$ denotes the set of positions of the region in which salient object o is located in video segment p;

according to the interest distribution map, the user interest degree $I_f(x, y)$ at position (x, y) of frame f is obtained by equations (1-28), where $S_f(x, y)$ denotes the saliency at position (x, y) of frame f.
9. The method for predicting low-complexity viewpoints as claimed in claim 1, wherein in step S4, the "current" statistical model is used to describe the random motion of the user viewpoint, and the specific motion prediction equation is shown in equations (1-29),

where $x_f$, $y_f$, $\dot{x}_f$, $\dot{y}_f$, $\ddot{x}_f$, $\ddot{y}_f$ respectively denote the position, velocity, and acceleration of the user viewpoint in the x-axis and y-axis directions when the user watches frame f; $\bar{a}_x$ and $\bar{a}_y$ respectively denote the mean acceleration of the user viewpoint in the x-axis and y-axis directions; and α is the reciprocal of the maneuvering acceleration time constant, i.e., the maneuvering frequency;

the probability that the viewpoint is located at (x, y) when the user views frame (f + δ) can be calculated by equations (1-30);

the user behavior model $Act_k$ is defined from these behavior parameters, and the user behavior distribution map reflecting the probability of occurrence of the user viewpoint is calculated according to equations (1-29) and (1-30).
10. The method for predicting low-complexity viewpoints by fusing user interest and behavior features as claimed in claim 1, wherein in step S5, the user viewpoint position is predicted according to equations (1-32),

where $I_{f+\delta}(x, y)$ and $B_{f+\delta}(x, y)$ respectively denote the values of the user interest distribution map and the user behavior distribution map at position (x, y) of frame (f + δ), and the function Φ is the fusion function of the user interest distribution map and the user behavior distribution map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111510706.9A CN114173206B (en) | 2021-12-10 | 2021-12-10 | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114173206A true CN114173206A (en) | 2022-03-11 |
CN114173206B CN114173206B (en) | 2023-06-06 |
Family
ID=80485557
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111510706.9A Active CN114173206B (en) | 2021-12-10 | 2021-12-10 | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114173206B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018055509A (en) * | 2016-09-29 | 2018-04-05 | ファイフィット株式会社 | Method of pre-treating composite finite element, method of analyzing composite material, analysis service system and computer readable recording medium |
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism |
JP2020150519A (en) * | 2019-03-15 | 2020-09-17 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Attention degree calculating device, attention degree calculating method and attention degree calculating program |
CN111325124A (en) * | 2020-02-05 | 2020-06-23 | 上海交通大学 | Real-time man-machine interaction system under virtual scene |
Non-Patent Citations (2)
Title |
---|
MAI XU et al.: "Predicting head movement in panoramic video: a deep reinforcement learning approach", IEEE Transactions on Pattern Analysis and Machine Intelligence *
ZHANG Jiwen: "Research on prediction methods of microblog information propagation based on user interest features" (基于用户兴趣特征的微波信息传播预测方法研究), CNKI (知网) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115103023A (en) * | 2022-06-14 | 2022-09-23 | 北京字节跳动网络技术有限公司 | Video caching method, device, equipment and storage medium |
CN115103023B (en) * | 2022-06-14 | 2024-04-05 | 北京字节跳动网络技术有限公司 | Video caching method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114173206B (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110087087B (en) | VVC inter-frame coding unit prediction mode early decision and block division early termination method | |
CN109309834B (en) | Video compression method based on convolutional neural network and HEVC compression domain significant information | |
CN104378643B (en) | A kind of 3D video depths image method for choosing frame inner forecast mode and system | |
CN108989802B (en) | HEVC video stream quality estimation method and system by utilizing inter-frame relation | |
CN103618900B (en) | Video area-of-interest exacting method based on coding information | |
CN111355956B (en) | Deep learning-based rate distortion optimization rapid decision system and method in HEVC intra-frame coding | |
EP3343923B1 (en) | Motion vector field coding method and decoding method, and coding and decoding apparatuses | |
CN103826125B (en) | Concentration analysis method and device for compression monitor video | |
CN110852964A (en) | Image bit enhancement method based on deep learning | |
CN105933711B (en) | Neighborhood optimum probability video steganalysis method and system based on segmentation | |
CN111479110B (en) | Fast affine motion estimation method for H.266/VVC | |
WO2016155070A1 (en) | Method for acquiring adjacent disparity vectors in multi-texture multi-depth video | |
CN114745549B (en) | Video coding method and system based on region of interest | |
CN112001308A (en) | Lightweight behavior identification method adopting video compression technology and skeleton features | |
Liu et al. | Fast depth intra coding based on depth edge classification network in 3D-HEVC | |
Fu et al. | Efficient depth intra frame coding in 3D-HEVC by corner points | |
CN114173206B (en) | Low-complexity viewpoint prediction method integrating user interests and behavior characteristics | |
CN106878754B (en) | A kind of 3D video depth image method for choosing frame inner forecast mode | |
CN117176960A (en) | Convolutional neural network chroma prediction coding method with multi-scale position information embedded | |
US20050259878A1 (en) | Motion estimation algorithm | |
Zuo et al. | Bi-layer texture discriminant fast depth intra coding for 3D-HEVC | |
Bachu et al. | Adaptive order search and tangent-weighted trade-off for motion estimation in H. 264 | |
CN107509074B (en) | Self-adaptive 3D video compression coding and decoding method based on compressed sensing | |
Bocheck et al. | Real-time estimation of subjective utility functions for MPEG-4 video objects | |
CN109982079B (en) | Intra-frame prediction mode selection method combined with texture space correlation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |