CN110827193A - Panoramic video saliency detection method based on multi-channel features
- Publication number
- CN110827193A (application CN201911000029.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- different
- panoramic
- image block
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T3/08—Projecting images onto non-planar surfaces, e.g. geodetic screens (G06T—Image data processing or generation, in general)
- G06T7/215—Motion-based segmentation (G06T7/20—Analysis of motion)
- G06T7/90—Determination of colour characteristics (G06T7/00—Image analysis)
- G06T2207/10016—Video; Image sequence (G06T2207/10—Image acquisition modality)
- G06T2207/20221—Image fusion; Image merging (G06T2207/20—Special algorithmic details)
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a panoramic video saliency detection method based on multi-channel features, which comprises: performing an inverse ERP transformation on the panoramic image, mapping the planar panoramic image onto a sphere to generate a spherical panoramic image; simulating visual window images with planes tangent to the spherical panoramic image to obtain different image blocks; in different feature spaces, extracting the salient regions of the image blocks with different saliency operators to form different saliency feature sub-maps, while incorporating motion information between image-block sequences to turn image saliency detection into video saliency detection; and fusing the different saliency feature sub-maps to generate an overall saliency map. The method simulates the human visual attention mechanism with good accuracy.
Description
Technical Field
The invention relates to the technical field of image saliency detection, and in particular to a panoramic video saliency detection method based on multi-channel features, namely direction, color, spatial frequency and motion features.
Background
Saliency detection for traditional images is a well-studied topic, and researchers have proposed many models over the past three decades, most of them based on two ideas: bottom-up and top-down. Bottom-up models are data-driven: they combine primary features of the image such as color, contrast and orientation, and consider how a pixel differs from its surrounding neighborhood in these features, independently of a person's subjective state; the visual saliency computation model proposed by Itti L. et al. is one example. Top-down models are task-driven: prior knowledge about the scene is added to the model as an important basis for guiding the saliency distribution, thus incorporating the cognition of human psychological activity; for example, faces, vehicles and central positions are more easily noticed by an observer.
When collecting saliency data for images, the observer is allowed to view the still image repeatedly and search for its salient regions, which differs greatly from video. When viewing panoramic video, the picture content is dynamic, and the observer often misses some objects while fixating on one position or moving the head, so the salient regions of the still image cannot fully correspond to the salient regions of the panoramic video.
For saliency prediction of panoramic video, De Abreu Ana et al. presented at the Ninth International Conference on Quality of Multimedia Experience (2017) a method that converts the 360° image into a conventional two-dimensional planar image through sphere-to-rectangular-plane mapping (ERP transformation) and predicts the salient regions with a conventional planar-image saliency detection algorithm. However, this method does not deal with the distortion introduced when the panoramic image is mapped to a planar image, so the result still differs from the panoramic content viewed by human eyes in a virtual reality environment. Battisti Federica et al. published "A feature-based approach for saliency estimation of omnidirectional images" in Signal Processing: Image Communication (2018), which extracts visual window images from the 360° image, performs saliency measurement on chroma, saturation and graph-theory-based GBVS features, and combines the results with skin and face detection to integrate the final saliency map. However, this method only considers salient-region prediction for panoramic images; since it omits inter-frame information, it is not suitable for salient-region prediction of panoramic video. Researchers have also proposed panoramic video saliency detection algorithms based on deep learning, but their limitations are large, mainly because eye-movement datasets of dynamic scenes are few and generally small in scale.
At present, no description or report of technology similar to the present invention has been found, nor has similar data been collected at home or abroad.
Summary of the Invention
In view of the above deficiencies in the prior art, the present invention aims to provide a panoramic video saliency detection method based on multi-channel features (direction, color, spatial frequency and motion features). The method adopts a distortion-free mapping from the 360° image to planar images, combines bottom-up feature extraction and combination with the top-down modeling idea, and at the same time considers the influence of the video's inter-frame information on saliency prediction, thereby simulating the human visual attention mechanism with good accuracy.
The invention is realized by the following technical scheme.
The invention provides a panoramic video saliency detection method based on multi-channel features, which comprises the following steps:
S1: performing an inverse ERP transformation on the panoramic image, mapping the planar panoramic image onto a sphere to generate a spherical panoramic image;
S2: simulating visual window images with planes tangent to the spherical panoramic image to obtain different image blocks;
S3: in different feature spaces, extracting the salient regions of the image blocks with different saliency operators to form different saliency feature sub-maps, while incorporating motion information between image-block sequences to turn image saliency detection into video saliency detection;
S4: fusing the different saliency feature sub-maps to generate an overall saliency map.
Preferably, in step 1, the expression of the spherical panoramic image is the inverse equirectangular relation

λ = x/(R·cos φ₁) + λ₀,  φ = φ₁ + y/R

wherein R is the radius of the sphere, λ is the longitude of the point (x, y) in the rectangular coordinate system of the planar panoramic image after projection onto the sphere, φ is the latitude of the planar panoramic image after projection onto the sphere, φ₁ is the latitude corresponding to the horizontal central axis of the planar panoramic image, which is 0, and λ₀ is the longitude corresponding to the central meridian of the planar panoramic image.
Preferably, the step 2 comprises the following sub-steps:
S2.1: setting a plane tangent to the sphere of the spherical panoramic image, and then projecting the limited-angle curved surface of the sphere inside the visual window onto the plane as an image block of the current picture;
S2.2: rotating the visual window by a fixed angle, and moving the plane tangent to the sphere accordingly to a new longitude and latitude tangent to the window center to obtain the next projected image block;
S2.3: repeating step S2.2 to obtain a series of image blocks simulating multi-view viewing of the planar panoramic image through the visual window.
Preferably, the plane is a rectangular plane of fixed length and width tangent to the sphere of the spherical panoramic image, set as the projection plane at the center of the visual window, and the limited-visual-angle curved surfaces in the visual window are all mapped onto this rectangular plane.
Preferably, the step 3 comprises the following sub-steps:
S3.1: extracting the statistical feature sub-map f₁(s) based on different levels and orientations of the sideband pyramid domain of pixel s:
constructing a steerable pyramid model on the grayscale image of the image block of the planar panoramic image; computing histograms of the pictures at different spatial frequencies and orientations to estimate probability density distributions, and performing a weighted linear addition of the results over the different levels and orientations to obtain the statistical feature sub-map f₁(s), wherein α_k denotes the weights for all orientations and levels, the vertical and horizontal directions are given the same weight, the weights between different frequency components are assigned by a function, P_k denotes the probability of the corresponding luminance in sideband k of the pyramid W, and I_s denotes the luminance of pixel s;
S3.2: extracting the color feature sub-map f₂(s) based on pixel s:
computing the distribution of the image in the image block over the three RGB channels and integrating them to obtain the color feature value O(s) of pixel s, wherein λ_c is a weight for the color channel learned through the luminance-value conversion of RGB to a given color format (YUV), and P_c denotes the corresponding probability of the luminance of the different color channels;
then multiplying the color distance in CIELAB space by a weight based on the spatial distance between pixels and normalizing to obtain f₂(s), wherein k_s is the normalizing denominator term, C computes the color distance in CIELAB space, the function g_d is used to set a weight according to the distance between pixel positions, s′ denotes another pixel in space and I_s′ its corresponding luminance, Ω denotes the set of pixels of the image block, and ΔL*, Δa*, Δb* respectively denote the distances between two pixels on the three components of CIELAB space;
S3.3: extracting the local symmetry feature sub-map f₃ of the image block:
detecting the local symmetry axes of the image in the image block, and taking the result as the local symmetry feature sub-map f₃ of the image block;
S3.4: extracting the semantic feature sub-map f₄ of the image block:
extracting the high-order features of the image in the image block (including persons, cars and faces) with a target detection algorithm to obtain the semantic feature sub-map f₄ of the image block;
S3.5: extracting the motion information feature sub-map f₅ of the image block:
detecting the image-block sequence in the visual window, adding motion information into the detection, and taking the result as the motion information feature sub-map f₅ of a group of image blocks.
Preferably, in S3.1, the steerable pyramid model uses spatial filters of different orientations and bandwidths to construct each layer, and these spatial filters are applied to extract information in different directions of the grayscale map.
Preferably, in S3.1, weights between different frequency components are assigned by a CSF function.
Preferably, in S3.4, the high-order features of the image in the image block are extracted with a target detection algorithm based on mixtures of multi-scale deformable part models, and an image pyramid is used to extract features of the image at different levels.
Preferably, in S3.5, the LK optical flow method is adopted to detect the image block sequence in the visual window.
Preferably, in S4, feature fusion is performed on the different feature sub-maps obtained in S3 by a linear weighting method.
Preferably, in the feature fusion process, the saliency feature sub-maps corresponding to visual windows at high latitude are assigned lower weights to suppress the likelihood of salient regions at the two poles.
Compared with the prior art, the invention has the following beneficial effects:
1. Through the mapping change of the coordinate system, the saliency detection algorithms of traditional images can be applied to panoramic images without being affected by distortion;
2. The saliency detection framework based on multi-visual-channel feature fusion is highly extensible, flexible and easy to modify;
3. Feature estimation of motion information is introduced on top of the image saliency detection algorithm, yielding a new panoramic video saliency detection algorithm that reduces the neglect of other salient content while attending to moving objects.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a multi-window mapping process of a panoramic picture;
FIG. 2 is a flow chart of saliency detection based on multi-channel features;
FIG. 3 is a diagram comparing the effects of normal rendering and gaze-point rendering.
Detailed Description
The following examples illustrate the invention in detail. The embodiments are implemented on the premise of the technical scheme of the invention, and detailed implementation modes and specific operation processes are given. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of protection of the present invention.
The embodiment of the invention provides a panoramic video saliency detection method based on multi-channel features, wherein the multi-channel features include direction, color, spatial frequency and motion features.
The method comprises the following steps:
Step 1: first, performing an inverse ERP (Equirectangular Projection) transformation on the panoramic image, mapping the planar panoramic image onto a sphere to generate a spherical panoramic image;
Step 2: simulating visual window images with planes tangent to the spherical panoramic image to obtain different image blocks;
Step 3: in different feature spaces, extracting the salient regions of the image blocks with different saliency operators, and forming saliency feature maps that also take motion information between image-block sequences into account;
Step 4: synthesizing an overall saliency map through a saliency-map fusion process.
Further, the method also comprises:
Step 5: repeating steps 1 to 4 until an overall saliency map has been obtained for every panoramic frame of the panoramic video, completing the saliency detection of the panoramic video.
Further, in step 1, the mathematical expression of the spherical panoramic image is the inverse equirectangular relation

λ = x/(R·cos φ₁) + λ₀,  φ = φ₁ + y/R

wherein R is the radius of the sphere, λ is the longitude of the point (x, y) of the planar panoramic image in its rectangular coordinate system after projection onto the sphere, φ is the latitude of the planar panoramic image after projection onto the sphere, φ₁ is the latitude corresponding to the horizontal central axis of the planar panoramic image, which is 0, and λ₀ is the longitude corresponding to the central meridian of the planar panoramic image.
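For illustration, the mapping can be sketched as follows in NumPy (this code is not part of the patent; it assumes a unit sphere R = 1, plane coordinates normalized to x ∈ [-π, π] and y ∈ [-π/2, π/2], and φ₁ = 0, and all function and variable names are illustrative):

```python
import numpy as np

def inverse_erp(width, height, lambda0=0.0, phi1=0.0):
    """Map each pixel (x, y) of a planar ERP panorama to longitude/latitude
    and to 3D points on the unit sphere (the spherical panoramic image)."""
    # Pixel grid -> plane coordinates x in [-pi, pi], y in [-pi/2, pi/2]
    xs = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    ys = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    x, y = np.meshgrid(xs, ys)
    # Inverse equirectangular relation on the unit sphere (R = 1)
    lam = x / np.cos(phi1) + lambda0  # longitude
    phi = phi1 + y                    # latitude
    # Spherical -> Cartesian coordinates
    pts = np.stack([np.cos(phi) * np.cos(lam),
                    np.cos(phi) * np.sin(lam),
                    np.sin(phi)], axis=-1)
    return lam, phi, pts
```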
Further, the step 2 includes the following sub-steps:
Step 2.1: after the planar panoramic image is mapped onto the sphere, this embodiment sets several planes tangent to the sphere to simulate viewing the planar panoramic picture in a head-mounted display (HMD) (as shown in FIG. 1), and then projects the limited-angle curved surfaces of the sphere onto these planes as image blocks of the current picture;
Step 2.2: the visual window then rotates by a fixed angle, and the rectangular plane tangent to the sphere moves accordingly to a new longitude and latitude tangent to the window center, yielding the next projected image block;
Step 2.3: by repeating step 2.2, this embodiment obtains a series of image blocks that simulate human eyes viewing the planar panoramic image in the HMD from multiple views; the planar panoramic image is mapped into these small image blocks, which then undergo saliency detection (as shown in FIG. 3).
In the above steps, after the planar panoramic image is mapped onto the sphere, this embodiment sets a rectangular plane of fixed length and width tangent to the sphere as the initial projection plane at the center of the spherical panoramic image, and the limited-visual-angle image in the visual window is mapped onto this plane.
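One plausible realization of this tangent-plane projection is the gnomonic (rectilinear) projection; the sketch below samples one image block directly from the ERP image, with the viewport center (lam_c, phi_c), field of view and block size as illustrative assumptions not fixed by the patent:

```python
import numpy as np

def viewport_block(erp_img, lam_c, phi_c, fov=np.pi / 3.0, size=256):
    """Project the limited-angle spherical surface inside the visual window
    onto a fixed-size rectangular plane tangent at (lam_c, phi_c)."""
    h, w = erp_img.shape[:2]
    # Tangent-plane grid spanning the assumed field of view
    t = np.tan(fov / 2.0)
    u, v = np.meshgrid(np.linspace(-t, t, size), np.linspace(t, -t, size))
    rho = np.sqrt(u ** 2 + v ** 2)
    c = np.arctan(rho)
    # Inverse gnomonic projection: plane (u, v) -> sphere (lam, phi)
    phi = np.arcsin(np.cos(c) * np.sin(phi_c)
                    + v * np.sin(c) * np.cos(phi_c) / np.maximum(rho, 1e-12))
    lam = lam_c + np.arctan2(u * np.sin(c),
                             rho * np.cos(phi_c) * np.cos(c)
                             - v * np.sin(phi_c) * np.sin(c))
    # Sphere -> ERP pixel coordinates, nearest-neighbour sampling
    px = ((lam + np.pi) % (2.0 * np.pi)) / (2.0 * np.pi) * (w - 1)
    py = (np.pi / 2.0 - phi) / np.pi * (h - 1)
    return erp_img[py.round().astype(int), px.round().astype(int)]
```

Rotating the window by a fixed angle then amounts to stepping lam_c and phi_c over a grid of viewport centers and calling this function once per window.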
Further, the step 3 comprises the following sub-steps:
Step 3.1: extracting the statistical feature sub-map f₁(s) based on different levels and orientations of the sideband pyramid domain of pixel s;
Step 3.2: extracting the color feature sub-map f₂(s) based on pixel s;
Step 3.3: extracting the local symmetry feature sub-map f₃ of the image block;
Step 3.4: extracting the semantic feature sub-map f₄ of the image block;
Step 3.5: extracting the motion information feature sub-map f₅ of the image block.
Further:
Step 3.1: statistics over different levels and orientations of the sideband pyramid domain: considering multiple visual channels and contrast sensitivity, a steerable pyramid model is constructed on the grayscale image of the image block of the planar panoramic image, with spatial filters of different orientations and bandwidths used to construct each layer. This embodiment then computes histograms of the pictures at different spatial frequencies and orientations to estimate probability density distributions, computes the feature value of a pixel s, and performs a weighted linear addition of the results over the different levels and orientations to obtain the feature sub-map f₁(s), wherein α_k contains the weight considerations for all orientations and levels: Gabor filters are applied to extract information in different directions of the image, the vertical and horizontal directions are given the same weight, and the weights between different frequency components are assigned by the CSF function; P_k denotes the probability of the corresponding luminance in sideband k of the pyramid W. The saliency feature sub-map corresponding to pixel s is obtained by linear combination over all layers of the pyramid.
Here, the CSF is the contrast sensitivity function presented in "Effects of spatial bandwidth and temporal presentation" by Eli Peli et al., published in Spatial Vision in 1993, in which spatial frequency serves as the input variable and the detection threshold changes with the input, so that contents of different spatial frequencies in the picture can be assigned different weights.
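As a rough illustration of this channel (not the patent's exact construction: a small Gabor filter bank stands in for the steerable pyramid, the weights α_k are taken uniform instead of CSF-derived, and reading rarer band responses as more salient is an added assumption):

```python
import cv2
import numpy as np

def statistical_feature_map(gray, n_orient=4, scales=(5, 11, 21)):
    """Simplified f1: histogram each band's responses to get a per-pixel
    probability P_k, then combine the bands by weighted linear addition."""
    gray = gray.astype(np.float32) / 255.0
    bands = []
    for ksize in scales:                      # stand-in for pyramid levels
        for i in range(n_orient):             # orientations (incl. horiz./vert.)
            theta = i * np.pi / n_orient
            kern = cv2.getGaborKernel((ksize, ksize), sigma=ksize / 4.0,
                                      theta=theta, lambd=ksize / 2.0,
                                      gamma=0.5, psi=0.0)
            bands.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, kern)))
    f1 = np.zeros_like(gray)
    alpha = 1.0 / len(bands)                  # uniform alpha_k (CSF would vary these)
    for band in bands:
        hist, edges = np.histogram(band, bins=64)
        p = hist / hist.sum()                 # P_k over response values
        idx = np.digitize(band, edges[1:-1])  # bin index of every pixel
        f1 += alpha * (1.0 - p[idx])          # assumption: rarity -> saliency
    return cv2.normalize(f1, None, 0.0, 1.0, cv2.NORM_MINMAX)
```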
Step 3.2: the method for calculating the characteristic value of the color of a certain pixel s is obtained by calculating and integrating the distribution of the image in the image block in three channels of RGB:
wherein λ iscIs a weight, P, for a color channel learned by luminance value conversion of RGB to a given color format (YUV)cRepresenting the probability that the luminance of different color channels corresponds. In addition, according to the study on the contrast, the present embodiment attempts to emphasize a feature map obtained in the case of high contrast among color features. As shown in the following equation, this embodiment multiplies the color distance in CIELAB space by the distance between pixelsThe weights of the spatial distances are weighted and normalized.
Wherein k issFor the normalized denominator term, C calculates the color distance in CIELAB space, function gdFor setting weights based on the distance between pixel spaces, the present embodiment uses a gaussian function whose width is controlled by the standard deviation σ, so that the feature of the pixel s is enhanced by the local color contrast to obtain a feature sub-graph f2(s)。
Step 3.3: in this embodiment, an extraction algorithm of a basic feature in "Learning-Based Symmetry Detection in natural images" proposed by Stavros Tsogkas et al in "European consensus Computer Vision" of 2012 is added to detect a local Symmetry axis of an image in an image block, and a result obtained by the Detection is used as a third-class salient feature sub-graph f3。
Step 3.4: humans tend to focus on some particular objects, such as people, cars, faces, etc., in a high-dimensional semantic understanding of the image. In this embodiment, a relevant target detection algorithm proposed by Pedr F Felzenzwalb et al in "IEEEtransactions on Pattern Analysis and Machine Analysis" published in 2010 "Objectdetection with characterization related parts-Based Models" is used to extract such high-order features, and a significant feature sub-graph F is obtained4. The target detection algorithm is based on multi-scale variability grouping model mixing, objects are detected and identified through mixing of a main coarse precision filter bank and a series of high-resolution filter banks, and the image pyramid is used for extracting features of different layers.
Step 3.5: this example introduces Bruce D.Lucas in the feature detection, which is proposed in the 1985 article "Generalized Image Matching by the Method of Differences" to detect the Image sequence in the visual window, and the result is used as a set of feature sub-graphs f5Thereby adding motion information to the model to take into account the appearance of the image in the image blockThe saliency detection algorithm is improved into a saliency detection algorithm of the video.
Further, in step 4, the five kinds of feature sub-maps obtained in step 3 are fused by a linear weighting method; considering that viewers' attention to panoramic content is biased toward the central axis, the saliency feature sub-maps corresponding to visual windows at high latitude are assigned lower weights during fusion to suppress the likelihood of salient regions at the two poles.
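The fusion step can be sketched as follows (the equal channel weights and the Gaussian latitude prior are illustrative assumptions; the patent specifies only linear weighting with lower weights at high latitude):

```python
import numpy as np

def fuse_feature_maps(sub_maps, window_lat,
                      weights=(0.2, 0.2, 0.2, 0.2, 0.2),
                      lat_sigma=np.pi / 4.0):
    """Linear-weighted fusion of the five feature sub-maps of one visual
    window, down-weighted by window latitude to suppress the two poles."""
    fused = sum(wi * m for wi, m in zip(weights, sub_maps))
    lat_weight = np.exp(-window_lat ** 2 / (2.0 * lat_sigma ** 2))
    return lat_weight * fused
```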
In the panoramic video saliency detection method based on multi-channel features provided by the embodiment of the invention, an inverse ERP transformation is performed on the panoramic image, mapping the planar panoramic image onto a sphere to generate a spherical panoramic image; visual window images are simulated with planes tangent to the spherical panoramic image to obtain different image blocks; in different feature spaces, the salient regions of the image blocks are extracted with different saliency operators to form different saliency feature sub-maps, while motion information between image-block sequences turns image saliency detection into video saliency detection; and the different saliency feature sub-maps are fused to generate an overall saliency map. The method simulates the human visual attention mechanism with good accuracy.
The foregoing has described specific embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above, and a person skilled in the art may make various changes and modifications within the scope of the appended claims without departing from the spirit of the invention.
Claims (10)
1. A panoramic video saliency detection method based on multi-channel features, characterized by comprising the following steps:
S1: performing an inverse ERP transformation on the panoramic image, mapping the planar panoramic image onto a sphere to generate a spherical panoramic image;
S2: simulating visual window images with planes tangent to the spherical panoramic image to obtain different image blocks;
S3: in different feature spaces, extracting the salient regions of the image blocks with different saliency operators to form different saliency feature sub-maps, while incorporating motion information between image-block sequences to turn image saliency detection into video saliency detection;
S4: fusing the different saliency feature sub-maps to generate an overall saliency map.
2. The panoramic video saliency detection method based on multi-channel features according to claim 1, characterized in that, in the step 1, the expression of the spherical panoramic image is the inverse equirectangular relation

λ = x/(R·cos φ₁) + λ₀,  φ = φ₁ + y/R

wherein R is the radius of the sphere, λ is the longitude of the point (x, y) in the rectangular coordinate system of the planar panoramic image after projection onto the sphere, φ is the latitude of the planar panoramic image after projection onto the sphere, φ₁ is the latitude corresponding to the horizontal central axis of the planar panoramic image, which is 0, and λ₀ is the longitude corresponding to the central meridian of the planar panoramic image.
3. The panoramic video saliency detection method based on multi-channel features according to claim 1, characterized in that the step 2 comprises the following sub-steps:
S2.1: setting a plane tangent to the sphere of the spherical panoramic image, and then projecting the limited-angle curved surface of the sphere inside the visual window onto the plane as an image block of the current picture;
S2.2: rotating the visual window by a fixed angle, and moving the plane tangent to the sphere accordingly to a new longitude and latitude tangent to the window center to obtain the next projected image block;
S2.3: repeating step S2.2 to obtain a series of image blocks simulating multi-view viewing of the planar panoramic image through the visual window.
4. The panoramic video saliency detection method based on multi-channel features according to claim 3, characterized in that the plane is a rectangular plane of fixed length and width tangent to the sphere of the spherical panoramic image, and the limited-visual-angle curved surfaces in the visual window are all mapped onto this rectangular plane.
5. The panoramic video saliency detection method based on multi-channel features according to claim 1, characterized in that the step 3 comprises the following sub-steps:
S3.1: extracting the statistical feature sub-map f₁(s) based on different levels and orientations of the sideband pyramid domain of pixel s:
constructing a steerable pyramid model on the grayscale image of the image block of the planar panoramic image; computing histograms of the pictures at different spatial frequencies and orientations to estimate probability density distributions, and performing a weighted linear addition of the results over the different levels and orientations to obtain the statistical feature sub-map f₁(s), wherein α_k denotes the weights for all orientations and levels, the vertical and horizontal directions are given the same weight, the weights between different frequency components are assigned by a function, P_k denotes the probability of the corresponding luminance in sideband k of the pyramid W, and I_s denotes the luminance of pixel s;
S3.2: extracting the color feature sub-map f₂(s) based on pixel s:
computing the distribution of the image in the image block over the three RGB channels and integrating them to obtain the color feature value O(s) of pixel s, wherein λ_c is a weight for the color channel learned through the luminance-value conversion of RGB to a given color format (YUV), and P_c denotes the corresponding probability of the luminance of the different color channels;
then multiplying the color distance in CIELAB space by a weight based on the spatial distance between pixels and normalizing to obtain f₂(s), wherein k_s is the normalizing denominator term, C computes the color distance in CIELAB space, the function g_d is used to set a weight according to the distance between pixel positions, s′ denotes another pixel in space and I_s′ its corresponding luminance, Ω denotes the set of pixels of the image block, and ΔL*, Δa*, Δb* respectively denote the distances between two pixels on the three components of CIELAB space;
S3.3: extracting the local symmetry feature sub-map f₃ of the image block:
detecting the local symmetry axes of the image in the image block, and taking the result as the local symmetry feature sub-map f₃ of the image block;
S3.4: extracting the semantic feature sub-map f₄ of the image block:
extracting the high-order features of the image in the image block (including persons, cars and faces) with a target detection algorithm to obtain the semantic feature sub-map f₄ of the image block;
S3.5: extracting the motion information feature sub-map f₅ of the image block:
detecting the image-block sequence in the visual window, adding motion information into the detection, and taking the result as the motion information feature sub-map f₅ of a group of image blocks.
6. The panoramic video saliency detection method based on multi-channel features according to claim 5, characterized in that, in S3.1, the steerable pyramid model uses spatial filters of different orientations and bandwidths to construct each layer, and these spatial filters are applied to extract information in different directions of the grayscale map; and/or
in S3.1, the weights between different frequency components are assigned by a CSF function.
7. The panoramic video saliency detection method based on multi-channel features according to claim 5, characterized in that, in S3.4, the high-order features of the image in the image block are extracted with a target detection algorithm based on mixtures of multi-scale deformable part models, and an image pyramid is used to extract features of the image at different levels.
8. The panoramic video saliency detection method based on multi-channel features according to claim 5, characterized in that, in S3.5, the LK optical flow method is used to detect the image-block sequence in the visual window.
9. The panoramic video saliency detection method based on multi-channel features according to claim 1, characterized in that, in S4, feature fusion is performed on the different feature sub-maps obtained in S3 by a linear weighting method.
10. The panoramic video saliency detection method based on multi-channel features according to claim 9, characterized in that, in the feature fusion process, the saliency feature sub-maps corresponding to visual windows at high latitude are assigned lower weights to suppress the likelihood of salient regions at the two poles.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911000029.9A CN110827193B (en) | 2019-10-21 | 2019-10-21 | Panoramic video significance detection method based on multichannel characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911000029.9A CN110827193B (en) | 2019-10-21 | 2019-10-21 | Panoramic video significance detection method based on multichannel characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110827193A (en) | 2020-02-21
CN110827193B CN110827193B (en) | 2023-05-09 |
Family
ID=69549745
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911000029.9A Active CN110827193B (en) | 2019-10-21 | 2019-10-21 | Panoramic video significance detection method based on multichannel characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110827193B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488888A (en) * | 2020-04-10 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Image feature extraction method and human face feature generation device |
CN111488886A (en) * | 2020-03-12 | 2020-08-04 | 上海交通大学 | Panorama image significance prediction method and system with attention feature arrangement and terminal |
CN111832414A (en) * | 2020-06-09 | 2020-10-27 | 天津大学 | Animal counting method based on graph regular optical flow attention network |
CN113569636A (en) * | 2021-06-22 | 2021-10-29 | 中国科学院信息工程研究所 | Fisheye image feature processing method and system based on spherical features and electronic equipment |
CN114529589A (en) * | 2020-11-05 | 2022-05-24 | 北京航空航天大学 | Panoramic video browsing interaction method |
CN114639171A (en) * | 2022-05-18 | 2022-06-17 | 松立控股集团股份有限公司 | Panoramic safety monitoring method for parking lot |
WO2022126921A1 (en) * | 2020-12-18 | 2022-06-23 | 平安科技(深圳)有限公司 | Panoramic picture detection method and device, terminal, and storage medium |
CN114898120A (en) * | 2022-05-27 | 2022-08-12 | 杭州电子科技大学 | 360-degree image salient target detection method based on convolutional neural network |
CN115131589A (en) * | 2022-08-31 | 2022-09-30 | 天津艺点意创科技有限公司 | Image generation method for intelligent design of Internet literary works |
CN117036154A (en) * | 2023-08-17 | 2023-11-10 | 中国石油大学(华东) | Panoramic video fixation point prediction method without head display and distortion |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150310303A1 (en) * | 2014-04-29 | 2015-10-29 | International Business Machines Corporation | Extracting salient features from video using a neurosynaptic system |
CN105488812A (en) * | 2015-11-24 | 2016-04-13 | 江南大学 | Motion-feature-fused space-time significance detection method |
CN106780297A (en) * | 2016-11-30 | 2017-05-31 | 天津大学 | Image high registration accuracy method under scene and Varying Illumination |
CN106951829A (en) * | 2017-02-23 | 2017-07-14 | 南京邮电大学 | A kind of notable method for checking object of video based on minimum spanning tree |
CN106899840A (en) * | 2017-03-01 | 2017-06-27 | 北京大学深圳研究生院 | Panoramic picture mapping method |
CN108462868A (en) * | 2018-02-12 | 2018-08-28 | 叠境数字科技(上海)有限公司 | The prediction technique of user's fixation point in 360 degree of panorama VR videos |
CN109064444A (en) * | 2018-06-28 | 2018-12-21 | 东南大学 | Track plates Defect inspection method based on significance analysis |
CN109166178A (en) * | 2018-07-23 | 2019-01-08 | 中国科学院信息工程研究所 | A kind of significant drawing generating method of panoramic picture that visual characteristic is merged with behavioral trait and system |
CN110070073A (en) * | 2019-05-07 | 2019-07-30 | 国家广播电视总局广播电视科学研究院 | Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism |
Non-Patent Citations (2)
Title |
---|
Zhang Qian; Deng Xiangdong; Ning Jinhui; Wang Huiming; Sun Yan; Ou Zhenyan; Wei Anming: "Research on image quality assessment and improvement for cable digital television subscribers' homes" *
Su Qun: "Saliency detection of panoramic video and its application in coding and transmission" *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488886B (en) * | 2020-03-12 | 2023-04-28 | 上海交通大学 | Panoramic image significance prediction method, system and terminal for arranging attention features |
CN111488886A (en) * | 2020-03-12 | 2020-08-04 | 上海交通大学 | Panorama image significance prediction method and system with attention feature arrangement and terminal |
CN111488888A (en) * | 2020-04-10 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Image feature extraction method and human face feature generation device |
CN111832414A (en) * | 2020-06-09 | 2020-10-27 | 天津大学 | Animal counting method based on graph regular optical flow attention network |
CN114529589A (en) * | 2020-11-05 | 2022-05-24 | 北京航空航天大学 | Panoramic video browsing interaction method |
CN114529589B (en) * | 2020-11-05 | 2024-05-24 | 北京航空航天大学 | Panoramic video browsing interaction method |
WO2022126921A1 (en) * | 2020-12-18 | 2022-06-23 | 平安科技(深圳)有限公司 | Panoramic picture detection method and device, terminal, and storage medium |
CN113569636A (en) * | 2021-06-22 | 2021-10-29 | 中国科学院信息工程研究所 | Fisheye image feature processing method and system based on spherical features and electronic equipment |
CN113569636B (en) * | 2021-06-22 | 2023-12-05 | 中国科学院信息工程研究所 | Fisheye image feature processing method and system based on spherical features and electronic equipment |
CN114639171A (en) * | 2022-05-18 | 2022-06-17 | 松立控股集团股份有限公司 | Panoramic safety monitoring method for parking lot |
CN114898120A (en) * | 2022-05-27 | 2022-08-12 | 杭州电子科技大学 | 360-degree image salient target detection method based on convolutional neural network |
CN115131589A (en) * | 2022-08-31 | 2022-09-30 | 天津艺点意创科技有限公司 | Image generation method for intelligent design of Internet literary works |
CN115131589B (en) * | 2022-08-31 | 2022-11-22 | 天津艺点意创科技有限公司 | Image generation method for intelligent design of Internet literary works |
CN117036154A (en) * | 2023-08-17 | 2023-11-10 | 中国石油大学(华东) | Panoramic video fixation point prediction method without head display and distortion |
CN117036154B (en) * | 2023-08-17 | 2024-02-02 | 中国石油大学(华东) | Panoramic video fixation point prediction method without head display and distortion |
Also Published As
Publication number | Publication date |
---|---|
CN110827193B (en) | 2023-05-09 |
Similar Documents
Publication | Title
---|---
CN110827193B (en) | Panoramic video significance detection method based on multichannel characteristics
Lebreton et al. | GBVS360, BMS360, ProSal: Extending existing saliency prediction models from 2D to omnidirectional images
Xu et al. | Arid: A new dataset for recognizing action in the dark
US20210279971A1 (en) | Method, storage medium and apparatus for converting 2d picture set to 3d model
CN109684925B (en) | Depth image-based human face living body detection method and device
US5802220A (en) | Apparatus and method for tracking facial motion through a sequence of images
DE112018007721T5 (en) | Acquire and modify 3D faces using neural imaging and time tracking networks
CN110650368A (en) | Video processing method and device and electronic equipment
CN108134937B (en) | Compressed domain significance detection method based on HEVC
US20180144212A1 (en) | Method and device for generating an image representative of a cluster of images
US20180357819A1 (en) | Method for generating a set of annotated images
Xu et al. | Saliency prediction on omnidirectional image with generative adversarial imitation learning
CN107749066A (en) | A kind of multiple dimensioned space-time vision significance detection method based on region
CN106993188B (en) | A kind of HEVC compaction coding method based on plurality of human faces saliency
CN106156714A (en) | The Human bodys' response method merged based on skeletal joint feature and surface character
Han et al. | A mixed-reality system for broadcasting sports video to mobile devices
CN107481067B (en) | Intelligent advertisement system and interaction method thereof
CN108141568A (en) | Osd information generation video camera, osd information synthesis terminal device 20 and the osd information shared system being made of it
CN112633217A (en) | Human face recognition living body detection method for calculating sight direction based on three-dimensional eyeball model
CN109523590B (en) | 3D image depth information visual comfort evaluation method based on sample
CN104298961B (en) | Video method of combination based on Mouth-Shape Recognition
CN113673567A (en) | Panorama emotion recognition method and system based on multi-angle subregion self-adaption
CN112954313A (en) | Method for calculating perception quality of panoramic image
CN112488165A (en) | Infrared pedestrian identification method and system based on deep learning model
CN113805824A (en) | Electronic device and method for displaying image on display equipment
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |