Disclosure of Invention
The invention aims to solve the technical problem of providing a visual-attention-based video region-of-interest extraction method which ensures that the extracted video region of interest has high precision and good stability and accords with the semantic characteristics of human stereoscopic vision.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a visual-attention-based video region-of-interest extraction method comprising the following steps:
firstly, defining a two-dimensional color video as the texture video, defining the size of the texture video frame at each moment in the texture video as W × H, W being the width and H being the height of the texture video frame at each moment in the texture video, recording the texture video frame at time t in the texture video as $F_t$, defining the texture video frame $F_t$ at time t as the current texture video frame, and detecting the static image domain visual attention of the current texture video frame by a known still-image visual attention detection method to obtain a distribution map of the static image domain visual attention of the current texture video frame, denoted $S_I$; the distribution map $S_I$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
secondly, detecting the motion visual attention of the current texture video frame by a motion visual attention detection method to obtain a distribution map of the motion visual attention of the current texture video frame, denoted $S_M$; the distribution map $S_M$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
thirdly, defining the depth video frame at each moment in the depth video corresponding to the texture video as a grayscale map represented with $Z_D$-bit depth, setting the size of the depth video frame at each moment in the depth video to W × H, W being the width and H being the height of the depth video frame at each moment in the depth video, recording the depth video frame at time t in the depth video as $D_t$, defining the depth video frame $D_t$ at time t as the current depth video frame, and detecting, by a depth visual attention detection method, the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame to obtain a distribution map of the depth visual attention of the three-dimensional video image, denoted $S_D$; the distribution map $S_D$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
fourthly, fusing, by a depth-perception-based visual attention fusion method, the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, so as to extract a distribution map of three-dimensional visual attention conforming to human stereoscopic perception, denoted S; the distribution map S has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
fifthly, performing thresholding and macroblock post-processing on the distribution map S of the three-dimensional visual attention to obtain the final region of interest of the current texture video frame conforming to human stereoscopic perception;
sixthly, repeating steps one to five until all texture video frames in the texture video are processed, thereby obtaining the video region of interest of the texture video.
The specific process of the motion visual attention detection method in the second step is as follows:
②-1, recording the texture video frame at time t+j consecutive with the current texture video frame in the texture video as $F_{t+j}$, and recording the texture video frame at time t−j consecutive with the current texture video frame in the texture video as $F_{t-j}$, where $j \in (0, N_F/2]$ and $N_F$ is a positive integer less than 10;
②-2, calculating, by a known optical flow method, the motion vector image in the horizontal direction and the motion vector image in the vertical direction between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, and between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j; recording the horizontal and vertical motion vector images between the current texture video frame and $F_{t+j}$ as $V_{t+j}^H$ and $V_{t+j}^V$, and the horizontal and vertical motion vector images between the current texture video frame and $F_{t-j}$ as $V_{t-j}^H$ and $V_{t-j}^V$; $V_{t+j}^H$, $V_{t+j}^V$, $V_{t-j}^H$ and $V_{t-j}^V$ all have width W and height H;
②-3, superposing the absolute value of $V_{t+j}^H$ and the absolute value of $V_{t+j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, denoted $M_{t+j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t+j}$ as $m_{t+j}(x, y)$; superposing the absolute value of $V_{t-j}^H$ and the absolute value of $V_{t-j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j, denoted $M_{t-j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t-j}$ as $m_{t-j}(x, y)$;
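As an illustration of steps ②-2 and ②-3, the sketch below computes a motion amplitude image from dense optical flow. The patent does not fix a particular optical flow algorithm; OpenCV's Farneback flow and all parameter values are assumptions.

```python
import cv2
import numpy as np

def motion_amplitude(frame_a, frame_b):
    """Motion amplitude image between two grayscale frames:
    the per-pixel sum |V^H| + |V^V| of the optical flow components."""
    # Farneback dense flow stands in for the "known optical flow method".
    flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_h, v_v = flow[..., 0], flow[..., 1]   # V^H and V^V
    return np.abs(v_h) + np.abs(v_v)        # M with entries m(x, y)

# e.g. M_{t+j} and M_{t-j} for the current frame F_t:
# m_plus  = motion_amplitude(f_t, f_t_plus_j)
# m_minus = motion_amplitude(f_t, f_t_minus_j)
```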
②-4, extracting a joint motion map, denoted $M_j^\Delta$, by using the current texture video frame, the texture video frame $F_{t+j}$ at time t+j and the texture video frame $F_{t-j}$ at time t−j; the specific process of extracting $M_j^\Delta$ is: judging whether the minimum of the motion amplitude values of the pixels at corresponding coordinates in the motion amplitude image $M_{t+j}$ and in the motion amplitude image $M_{t-j}$ is larger than a set first threshold $T_1$; if so, setting the pixel value of the pixel at the corresponding coordinates in the joint motion map $M_j^\Delta$ to the average of the motion amplitude values of the pixels at the corresponding coordinates in $M_{t+j}$ and $M_{t-j}$, and otherwise setting it to 0; that is, for the pixel at coordinates (x, y) in $M_{t+j}$ and the pixel at coordinates (x, y) in $M_{t-j}$, judging whether $\min(m_{t+j}(x, y), m_{t-j}(x, y)) > T_1$; if so, the pixel value of the pixel at (x, y) in $M_j^\Delta$ is $(m_{t+j}(x, y) + m_{t-j}(x, y))/2$, and otherwise it is 0, where min() is the minimum-value function;
②-5, weighting and superposing the joint motion maps at each offset from 1 to $N_F/2$ from time t to obtain the weighted joint motion map of the current texture video frame, denoted M; recording the pixel value of the pixel at coordinates (x, y) in the weighted joint motion map M as m(x, y), with $m(x, y) = \sum_{j=1}^{N_F/2} \zeta_j \, m_j^\Delta(x, y)$, where $m_j^\Delta(x, y)$ denotes the pixel value of the pixel at coordinates (x, y) in the joint motion map $M_j^\Delta$ at offset j from time t, and $\zeta_j$ is a weighting coefficient satisfying $\sum_{j=1}^{N_F/2} \zeta_j = 1$;
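Steps ②-4 and ②-5 can be sketched as follows; the function names and the uniform default weights are illustrative assumptions.

```python
import numpy as np

def joint_motion_map(m_plus, m_minus, t1=1.0):
    """Joint motion map M_j^Delta (step 2-4): where both motion amplitudes
    exceed the first threshold T1, take their average, otherwise 0."""
    keep = np.minimum(m_plus, m_minus) > t1
    return np.where(keep, (m_plus + m_minus) / 2.0, 0.0)

def weighted_joint_motion(joint_maps, weights=None):
    """Weighted joint motion map M (step 2-5): sum_j zeta_j * M_j^Delta
    with the zeta_j summing to 1; uniform weights assumed by default."""
    if weights is None:
        weights = [1.0 / len(joint_maps)] * len(joint_maps)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m for w, m in zip(weights, joint_maps))
```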
②-6, performing Gaussian pyramid decomposition of the weighted joint motion map M of the current texture video frame into $n_L$ layers of weighted joint motion maps; recording the i-th layer weighted joint motion map obtained after the Gaussian pyramid decomposition of the weighted joint motion map M as M(i), the width and height of M(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, W is the width of the current texture video frame, and H is the height of the current texture video frame;
②-7, extracting the distribution map $S_M$ of the motion visual attention of the current texture video frame by using the $n_L$ layers of weighted joint motion maps of the current texture video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_M$ as $s_m(x, y)$, with $S_M = \bar F_M$ and $\bar F_M = N\big(\oplus_c \oplus_s N(|M(c) \ominus M(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, M(c) is the c-th layer weighted joint motion map, and M(s) is the s-th layer weighted joint motion map; the symbol $\ominus$ denotes the cross-level difference operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c); the symbol $\oplus$ denotes the cross-level addition operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is summed with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is summed with the corresponding pixel of the up-sampled M(c).
The first threshold set in step ②-4 is $T_1 = 1$.
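The cross-level (center-surround) machinery of steps ②-6 and ②-7, which reappears below for the depth and direction features, can be sketched as follows. The normalization interval is not reproduced in the text above, so normalization onto [0, 255] is an assumption, as is accumulating the cross-level sums at the base resolution.

```python
import cv2
import numpy as np

def gaussian_pyramid(img, n_levels=9):
    """n_L-layer Gaussian pyramid; layer i has size (W/2^i, H/2^i)."""
    pyr = [img.astype(np.float32)]
    for _ in range(n_levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def resize_to(img, like):
    h, w = like.shape[:2]
    return cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)

def normalize(img, max_val=255.0):
    # N(.): ASSUMED normalization onto [0, max_val].
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) * max_val if hi > lo else np.zeros_like(img)

def center_surround_feature(pyr, deltas=(-3, -2, -1, 1, 2, 3)):
    """F = N( cross-level sum of N(|pyr[c] (-) pyr[s]|) ), s = c + delta.
    The coarser map is up-sampled to the finer one before differencing;
    all partial maps are accumulated at the base resolution."""
    n = len(pyr)
    acc = np.zeros_like(pyr[0])
    for c in range(n):
        for d in deltas:
            s = c + d
            if not (0 <= s < n):
                continue
            fine = min(c, s)      # difference at the finer resolution
            diff = np.abs(resize_to(pyr[max(c, s)], pyr[fine]) - pyr[fine])
            acc += resize_to(normalize(diff), pyr[0])  # cross-level addition
    return normalize(acc)
```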
The depth visual attention detection method in the third step comprises the following specific processes:
③-1, performing Gaussian pyramid decomposition of the current depth video frame to obtain $n_L$ layers of depth video frames; recording the i-th layer depth video frame obtained after the Gaussian pyramid decomposition of the current depth video frame as D(i), the width and height of D(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, W is the width of the current depth video frame, and H is the height of the current depth video frame;
③-2, extracting the depth feature map of the current depth video frame, denoted $\bar F_D$, by using the $n_L$ layers of depth video frames of the current depth video frame, with $\bar F_D = N\big(\oplus_c \oplus_s N(|D(c) \ominus D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, D(c) is the c-th layer depth video frame, and D(s) is the s-th layer depth video frame; the symbol $\ominus$ denotes the cross-level difference operator applied to D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and then each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and then each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c); the symbol $\oplus$ denotes the cross-level addition operator applied to D(c) and D(s), defined in the same way with summation in place of differencing;
③-3, performing convolution operations on the current depth video frame with known Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components of these directions and obtain four directional component maps of the current depth video frame, denoted $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ respectively; performing Gaussian pyramid decomposition of each of the directional component maps $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ of the current depth video frame into $n_L$ layers of directional component maps; recording the i-th layer directional component map obtained by the Gaussian pyramid decomposition of the directional component map of direction θ as $O_\theta^D(i)$, the width and height of $O_\theta^D(i)$ being $W/2^i$ and $H/2^i$ respectively, where $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$, $i \in [0, n_L-1]$, W is the width of the current depth video frame, and H is the height of the current depth video frame;
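A sketch of the Gabor convolution of step ③-3; the kernel size and the Gabor envelope parameters are illustrative assumptions, since the patent only fixes the four orientations.

```python
import cv2
import numpy as np

def directional_components(depth_frame):
    """Four directional component maps O_theta^D of the depth frame,
    theta in {0, pi/4, pi/2, 3*pi/4}, via Gabor-filter convolution.
    Kernel size and Gabor parameters are illustrative choices."""
    comps = {}
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = cv2.getGaborKernel(
            (15, 15),  # ksize
            4.0,       # sigma of the Gaussian envelope
            theta,     # filter orientation
            10.0,      # wavelength of the sinusoid
            0.5,       # spatial aspect ratio
            0.0)       # phase offset
        comps[theta] = cv2.filter2D(depth_frame.astype(np.float32),
                                    cv2.CV_32F, kernel)
    return comps
```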
③-4, extracting a preliminary depth direction feature map of the current depth video frame, denoted $\bar F'_{DO}$, by using the $n_L$ layers of directional component maps of each direction of the current depth video frame, with $\bar F'_{DO} = \frac{1}{4} \sum_{\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}} \bar F_{O_\theta}$ and $\bar F_{O_\theta} = N\big(\oplus_c \oplus_s N(|O_\theta^D(c) \ominus O_\theta^D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, $O_\theta^D(c)$ is the c-th layer directional component map of the direction θ, and $O_\theta^D(s)$ is the s-th layer directional component map of the direction θ; the symbol $\ominus$ denotes the cross-level difference operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$: if c < s, $O_\theta^D(s)$ is up-sampled to an image with the same resolution as $O_\theta^D(c)$, and then each pixel of $O_\theta^D(c)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(s)$; if c > s, $O_\theta^D(c)$ is up-sampled to an image with the same resolution as $O_\theta^D(s)$, and then each pixel of $O_\theta^D(s)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(c)$; the symbol $\oplus$ denotes the cross-level addition operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$, defined in the same way with summation in place of differencing;
③-5, performing, with the known morphological dilation algorithm using a structuring element of size $w_1 \times h_1$ as the basic dilation unit, $n_1$ dilation operations on the preliminary depth direction feature map $\bar F'_{DO}$ of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted $\bar F_{DO}$;
③-6, obtaining a distribution map of the preliminary depth visual attention of the current depth video frame, denoted $S'_D$, by using the depth feature map $\bar F_D$ and the depth direction feature map $\bar F_{DO}$ of the current depth video frame; recording the pixel value of the pixel at coordinates (x, y) in $S'_D$ as $s'_d(x, y)$, where $N(\cdot)$ is a normalization function onto a fixed interval;
③-7, obtaining the depth visual attention distribution map $S_D$ of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame by using the distribution map $S'_D$ of the preliminary depth visual attention of the current depth video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_D$ as $s_d(x, y)$, with $s_d(x, y) = s'_d(x, y) \cdot g(x, y)$, where g(x, y) is a function suppressing the left and right image boundary regions, W is the width of the current depth video frame, H is the height of the current depth video frame, b is a set second threshold, and the symbol "||" is the "or" operator.
In step ③-5, $w_1 = 8$, $h_1 = 8$ and $n_1 = 2$; the second threshold b set in step ③-7 is 16.
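Steps ③-5 to ③-7 can be sketched as below with the stated parameters $w_1 = h_1 = 8$, $n_1 = 2$, b = 16. The explicit formulas for combining $\bar F_D$ with $\bar F_{DO}$ and for the boundary-suppression function g(x, y) are not reproduced in the text above, so the normalized sum and the hard left/right border mask used here are assumptions.

```python
import cv2
import numpy as np

def depth_attention(f_d, f_do_prelim, w1=8, h1=8, n1=2, b=16):
    """Steps 3-5 to 3-7 (sketch).
    f_d: depth feature map, f_do_prelim: preliminary depth direction map."""
    # 3-5: n1 morphological dilations with a w1 x h1 structuring element.
    kernel = np.ones((h1, w1), np.uint8)
    f_do = cv2.dilate(f_do_prelim.astype(np.float32), kernel, iterations=n1)

    # 3-6: preliminary depth visual attention S'_D; the normalized sum
    # below is an ASSUMED combination (the patent's formula is not shown).
    s = f_d + f_do
    s_prime = (s - s.min()) / (s.max() - s.min() + 1e-9) * 255.0

    # 3-7: suppress the left/right image borders: s_d = s'_d * g.
    # g is ASSUMED to zero out columns within b pixels of either border.
    h, w = s_prime.shape
    g = np.ones((h, w), np.float32)
    g[:, :b] = 0.0
    g[:, w - b:] = 0.0
    return s_prime * g
```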
The specific process of the depth-perception-based visual attention fusion method in the fourth step is as follows:
④-1, performing a scale transformation on the current depth video frame by Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a set range, d(x, y) denotes the pixel value of the pixel at coordinates (x, y) in the current depth video frame, and Q(d(x, y)) denotes the pixel value of the pixel at coordinates (x, y) in the scale-transformed current depth video frame;
④-2, acquiring the three-dimensional visual attention distribution map S by using the scale-transformed current depth video frame, the depth visual attention distribution map $S_D$ of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, the motion visual attention distribution map $S_M$ of the current texture video frame, and the static image domain visual attention distribution map $S_I$ of the current texture video frame; the pixel value of the pixel at coordinates (x, y) in the three-dimensional visual attention distribution map S is s(x, y), where $K_D$, $K_M$ and $K_I$ are the weighting coefficients of $S_D$, $S_M$ and $S_I$ respectively, satisfying the condition $\sum_{a \in \{D, M, I\}} K_a = 1$ with $0 \le K_a \le 1$; $N(\cdot)$ is a normalization function onto a fixed interval; $s_D(x, y)$, $s_M(x, y)$ and $s_I(x, y)$ denote the pixel values of the pixels at coordinates (x, y) in $S_D$, $S_M$ and $S_I$ respectively; $\Theta_{ab}(x, y)$ is the visual attention correlation value, $\Theta_{ab}(x, y) = \min(s_a(x, y), s_b(x, y))$, where min() is the minimum-value function; $C_{ab}$ is the correlation coefficient, satisfying the condition $\sum_{a, b \in \{D, M, I\}, a \neq b} C_{ab} = 1$ with $0 \le C_{ab} < 1$; the correlation coefficient $C_{DM}$ denotes the degree of correlation between $S_D$ and $S_M$, the correlation coefficient $C_{DI}$ denotes the degree of correlation between $S_D$ and $S_I$, and the correlation coefficient $C_{IM}$ denotes the degree of correlation between $S_I$ and $S_M$; $a, b \in \{D, M, I\}$ and $a \neq b$.
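Since the explicit expression for s(x, y) is not reproduced in the text above, the sketch below is only one fusion consistent with the stated ingredients: the weighting coefficients $K_a$, the pairwise correlation values $\Theta_{ab} = \min(s_a, s_b)$ with coefficients $C_{ab}$, and modulation by the scale-transformed depth Q(d) = d + γ; the specific combination and all default values are assumptions.

```python
import numpy as np

def fuse_attention(s_d, s_m, s_i, depth, gamma=16.0,
                   k=(1/3, 1/3, 1/3), c=(1/3, 1/3, 1/3)):
    """ASSUMED fusion: weighted maps + weighted pairwise correlations,
    modulated by the scale-transformed depth Q(d) = d + gamma.
    k = (K_D, K_M, K_I), sum to 1; c = (C_DM, C_DI, C_IM), sum to 1."""
    assert abs(sum(k) - 1.0) < 1e-9 and abs(sum(c) - 1.0) < 1e-9
    q = depth.astype(np.float32) + gamma          # step 4-1
    weighted = k[0] * s_d + k[1] * s_m + k[2] * s_i
    corr = (c[0] * np.minimum(s_d, s_m)           # Theta_DM
            + c[1] * np.minimum(s_d, s_i)         # Theta_DI
            + c[2] * np.minimum(s_i, s_m))        # Theta_IM
    s = (weighted + corr) * q / q.max()           # ASSUMED modulation
    return (s - s.min()) / (s.max() - s.min() + 1e-9) * 255.0
```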
The specific process of thresholding and macro block post-processing the distribution graph S of the three-dimensional visual attention in the fifth step is as follows:
⑤-1, recording the pixel value of the pixel at coordinates (x, y) in the distribution map S of the three-dimensional visual attention as s(x, y), and defining a third threshold $T_S$, $T_S = k_T \cdot \sum_{y=0}^{H-1} \sum_{x=0}^{W-1} s(x, y) / (W \times H)$, where W is the width of the distribution map S of the three-dimensional visual attention, H is the height of the distribution map S of the three-dimensional visual attention, and $k_T \in (0, 3)$; creating a new preliminary binary mask image and judging whether $s(x, y) \ge T_S$; if so, marking the pixel at coordinates (x, y) in the preliminary binary mask image as an interested pixel, and otherwise marking the pixel at coordinates (x, y) in the preliminary binary mask image as a non-interested pixel;
⑤-2, dividing the preliminary binary mask image into $(W/w_2) \times (H/h_2)$ non-overlapping blocks of size $w_2 \times h_2$, and recording the block with block abscissa u and block ordinate v as $B_{u,v}$, where $u \in [0, W/w_2 - 1]$ and $v \in [0, H/h_2 - 1]$; determining, according to each block in the preliminary binary mask image, whether the pixels in each corresponding block in the current texture video frame are interested pixels or non-interested pixels: for the block $B_{u,v}$, judging whether the number of pixels marked as interested pixels in $B_{u,v}$ is larger than a set fourth threshold $T_b$, where $0 \le T_b \le w_2 \times h_2$; if so, marking all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as interested pixels and taking the block corresponding to $B_{u,v}$ as a region-of-interest block, and otherwise marking all pixels in the block corresponding to $B_{u,v}$ as non-interested pixels and taking the block corresponding to $B_{u,v}$ as a non-region-of-interest block, thereby obtaining the preliminary region-of-interest mask image of the current texture video frame, which is composed of region-of-interest blocks and non-region-of-interest blocks;
⑤-3, marking all pixels in the non-region-of-interest blocks nearest to the region-of-interest blocks in the preliminary region-of-interest mask image as the $N_R$-th level transition region of interest and updating the preliminary region-of-interest mask image; then marking all pixels in the non-region-of-interest blocks nearest to the $N_R$-th level transition region of interest in the updated preliminary region-of-interest mask image as the $(N_R - 1)$-th level transition region of interest and recursively updating the preliminary region-of-interest mask image; repeating this recursion until the 1st level transition region of interest has been marked; finally obtaining the final region-of-interest mask image of the current texture video frame, which is composed of region-of-interest blocks, $N_R$ levels of transition regions of interest, and non-region-of-interest blocks;
⑤-4, recording the pixel value of the pixel at coordinates (x, y) in the final region-of-interest mask image as r(x, y); setting the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255; setting the pixel values of all pixels in the e-th level transition region of interest of the final region-of-interest mask image to $r(x, y) = \frac{e}{N_R + 1} \times f(x, y)$; and setting the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to r(x, y) = f(x, y), thereby obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, $e \in [1, N_R]$, and f(x, y) denotes the pixel value of the pixel at coordinates (x, y) in the current texture video frame.
In step ⑤-2, $w_2 = 16$, $h_2 = 16$, and the fourth threshold $T_b = 50$.
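A sketch of steps ⑤-1 to ⑤-4 with the stated parameters $w_2 = h_2 = 16$ and $T_b = 50$; the values of $k_T$ and $N_R$, the 4-neighbour notion of "nearest" blocks, and a grayscale texture frame are assumptions.

```python
import numpy as np

def dilate_blocks(mask):
    """One-step 4-neighbour dilation of a boolean block grid."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def roi_mask(s, frame, k_t=1.0, w2=16, h2=16, t_b=50, n_r=2):
    """Steps 5-1 to 5-4 (sketch): threshold the attention map S, make the
    block decision, grow N_R transition levels, and fill the mask image.
    Assumes W and H are divisible by w2 and h2."""
    h, w = s.shape
    t_s = k_t * s.mean()                 # T_S = k_T * mean(s)
    interested = s >= t_s                # preliminary binary mask

    # 5-2: ROI blocks where more than T_b pixels are interested pixels.
    bh, bw = h // h2, w // w2
    roi = np.zeros((bh, bw), bool)
    for v in range(bh):
        for u in range(bw):
            roi[v, u] = interested[v*h2:(v+1)*h2, u*w2:(u+1)*w2].sum() > t_b

    # 5-3: transition levels N_R, N_R-1, ..., 1 grown around the ROI blocks.
    level = np.where(roi, n_r + 1, 0)    # code n_r+1 marks an ROI block
    marked = roi.copy()
    for e in range(n_r, 0, -1):          # e-th level transition region
        ring = dilate_blocks(marked) & ~marked
        level[ring] = e
        marked |= ring

    # 5-4: r = 255 in non-ROI, (e/(N_R+1))*f in level-e transition, f in ROI.
    r = np.full((h, w), 255.0, np.float32)
    f = frame.astype(np.float32)
    for v in range(bh):
        for u in range(bw):
            blk = (slice(v*h2, (v+1)*h2), slice(u*w2, (u+1)*w2))
            if level[v, u] == n_r + 1:
                r[blk] = f[blk]
            elif level[v, u] > 0:
                r[blk] = level[v, u] / (n_r + 1.0) * f[blk]
    return r
```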
Compared with the prior art, the method has the advantage of jointly utilizing temporally synchronized texture video frames and their corresponding depth video frames. First, the static image domain visual attention of the texture video frame is extracted to obtain its distribution map; the motion visual attention is extracted from temporally consecutive texture video frames to obtain the distribution map of the motion visual attention of the texture video frame; and the depth visual attention of the depth video frame is extracted to obtain the depth visual attention distribution map of the three-dimensional video image jointly represented by the depth video frame and the texture video frame. Then, using the distribution maps of the static image domain visual attention, the motion visual attention and the depth visual attention together with the depth information, the distribution map of three-dimensional (stereoscopic) visual attention conforming to human stereoscopic vision is obtained through a depth-perception-based fusion method, and thresholding and macroblock post-processing operations yield the final video region of interest conforming to human stereoscopic perception, together with the mask image of the corresponding regions of interest and non-interest. The regions of interest extracted by the method fuse the static image domain visual attention, the motion visual attention and the depth visual attention, effectively suppressing the inherent one-sidedness and inaccuracy of each individual visual attention cue: the noise problem caused by complex backgrounds in static image domain visual attention is overcome, as is the inability of motion visual attention to extract regions of interest with local motion or small motion amplitude; the computational precision is improved, the stability of the algorithm is enhanced, and regions of interest can be extracted from backgrounds and motion environments with complex textures. In addition, the region of interest obtained by the method accords not only with the human visual interest characteristics for static texture video frames and for moving objects, but also with the depth perception characteristic of interest in objects with strong depth contrast or at close distance in stereoscopic vision, thus according with the semantic characteristics of human stereoscopic vision.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention relates to a visual-attention-based video region-of-interest extraction method which jointly utilizes the information of a texture video and of its temporally synchronized depth video to extract the video region of interest. In the present embodiment, the texture video is a two-dimensional color video, exemplified by the test sequence "Ballet" two-dimensional color video and the "Door Flower" two-dimensional color video. Figure 1a shows the color video frame at time t in the test sequence "Ballet" two-dimensional color video, and figure 1b shows the color video frame at time t in the test sequence "Door Flower" two-dimensional color video; figure 2a shows the depth video frame at time t in the depth video corresponding to the test sequence "Ballet" two-dimensional color video, and figure 2b shows the depth video frame at time t in the depth video corresponding to the test sequence "Door Flower" two-dimensional color video. The depth video frame at each moment in the depth video corresponding to the two-dimensional color video is a grayscale map represented with $Z_D$-bit depth, whose grayscale values represent the relative distance from the object represented by each pixel in the depth video frame to the camera. The size of the texture video frame at each moment in the texture video is defined as W × H; for the depth video frame at each moment in the depth video corresponding to the texture video, if its size differs from the size of the texture video frame, it is generally set to the same size as the texture video frame, namely W × H, by existing methods such as scaling and interpolation, where W is the width of the texture video frame at each moment in the texture video or of the depth video frame at each moment in the depth video, and H is the height of the texture video frame at each moment in the texture video or of the depth video frame at each moment in the depth video; the size of the depth video frame is set the same as that of the texture video frame in order to extract the video region of interest more conveniently.
The general flow diagram of the method of the present invention is shown in fig. 3, and specifically includes the following steps:
firstly, defining a two-dimensional color video as the texture video, defining the size of the texture video frame at each moment in the texture video as W × H, W being the width and H being the height of the texture video frame at each moment in the texture video, recording the texture video frame at time t in the texture video as $F_t$, defining the texture video frame $F_t$ at time t as the current texture video frame, and detecting the static image domain visual attention of the current texture video frame by the known still-image visual attention detection method to obtain a distribution map of the static image domain visual attention of the current texture video frame, denoted $S_I$; the distribution map $S_I$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher relative degree of human attention to the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree.
In this embodiment, the block diagram of the process of detecting the static image domain visual attention of the current texture video frame by the known still-image visual attention detection method is shown in fig. 4; each rectangle in fig. 4 represents a data processing step, each diamond represents an image, and diamonds of different sizes represent images of different resolutions that serve as input and output data of the corresponding operations. The current texture video frame is an image in RGB format, each pixel being represented by the three color channels R, G and B. First, the color channel components of each pixel of the current texture video frame are linearly transformed and decomposed into one luminance component map and two chrominance component maps, namely a red-green component map and a blue-yellow component map, denoted I, RG and BY respectively. The pixel value of the luminance component map I at coordinates (x, y) is $I_{x,y} = (r_{x,y} + g_{x,y} + b_{x,y})/3$, where $I_{x,y}$ denotes the value of the luminance component at coordinates (x, y), and $r_{x,y}$, $g_{x,y}$ and $b_{x,y}$ denote the pixel values of the R, G and B color channels of the current texture video frame at coordinates (x, y); $RG_{x,y}$ denotes the value of the red-green component map RG at coordinates (x, y), and $BY_{x,y}$ denotes the value of the blue-yellow component map BY at coordinates (x, y). Four directional component maps of the luminance component map, in the 0, π/4, π/2 and 3π/4 directions, are extracted with known Gabor filters and denoted $O_\theta^T$, $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$. The one luminance component map, the two chrominance component maps and the four directional component maps are each decomposed by Gaussian pyramid into $n_L$ layers, where $n_L$ is a positive integer less than 20. Denoting any component map uniformly by l, the i-th layer component map obtained after the Gaussian pyramid decomposition of the component map l is recorded as l(i), where $i \in [0, n_L-1]$ and $l \in \{I\} \cup \{RG, BY\} \cup \{O_0^T, O_{\pi/4}^T, O_{\pi/2}^T, O_{3\pi/4}^T\}$; 7 component maps are thus generated in total, each decomposed into $n_L$ layer component maps, giving $7 \times n_L$ layer component maps; in this example $n_L$ takes the value 9. The feature map of each component (chrominance, luminance and direction) is calculated from the extracted layer component maps as $\bar F_l = N\big(\oplus_c \oplus_s N(|l(c) \ominus l(s)|)\big)$ for all $l \in \{I\} \cup \{RG, BY\} \cup \{O_0^T, O_{\pi/4}^T, O_{\pi/2}^T, O_{3\pi/4}^T\}$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, and $N(\cdot)$ is a normalization function onto a fixed interval in which the maximum value denotes the most noticeable and 0 the least noticeable; l(c) denotes the c-th layer component map of the component map l, and l(s) denotes the s-th layer component map of the component map l; the symbol $\ominus$ denotes the cross-level difference operator between l(c) and l(s): if c < s, l(s) is up-sampled to an image with the same resolution as l(c), and then each pixel of l(c) is differenced with the corresponding pixel of the up-sampled l(s); if c > s, l(c) is up-sampled to an image with the same resolution as l(s), and then each pixel of l(s) is differenced with the corresponding pixel of the up-sampled l(c); the symbol $\oplus$ denotes the cross-level addition operator between l(c) and l(s), defined in the same way with summation in place of differencing. The feature maps of all components are linearly fused and normalized to obtain the static image domain visual attention map $S_I$ of the current texture video frame. The size of each image of the test sequences "Ballet" and "Door Flower" is 1024 × 768; the luminance feature map, the chrominance feature map and the direction feature map of the color video frame at time t in the test sequence "Ballet" two-dimensional color video are shown in figs. 7a, 7b and 7c respectively, and those of the color video frame at time t in the test sequence "Door Flower" two-dimensional color video are shown in figs. 12a, 12b and 12c respectively. In this embodiment $Z_S = 8$, i.e. the distribution map $S_I$ of the static image domain visual attention is represented with 8-bit depth; the distribution map of the static image domain visual attention of the color video frame at time t in the test sequence "Ballet" two-dimensional color video is shown in fig. 8a, and that of the color video frame at time t in the test sequence "Door Flower" two-dimensional color video is shown in fig. 13a. Other known visual attention detection methods may also be used as the static image domain visual attention detection method.
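The channel decomposition at the start of this process can be sketched as follows. The luminance formula is as stated; the patent's explicit red-green and blue-yellow expressions are not reproduced above, so the standard Itti opponent-color definitions used here are assumptions.

```python
import numpy as np

def decompose_channels(rgb):
    """Luminance and two chrominance (opponent-color) component maps.
    I = (r+g+b)/3 as stated; the RG/BY formulas below follow Itti's
    standard definitions and are ASSUMED, since the patent's explicit
    expressions are not reproduced in the text."""
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    i_map = (r + g + b) / 3.0
    rg = r - g                       # assumed red-green opponent map
    by = b - (r + g) / 2.0           # assumed blue-yellow opponent map
    return i_map, rg, by
```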
Secondly, detecting the motion visual attention of the current texture video frame by a motion visual attention detection method to obtain a distribution map of the motion visual attention of the current texture video frame, denoted $S_M$; the distribution map $S_M$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher degree of human attention to the relative motion of the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree.
In this embodiment, a flow chart of the motion visual attention detection method is shown in fig. 5, and the specific process of the motion visual attention detection method is as follows:
②-1, recording the texture video frame at time t+j consecutive with the current texture video frame in the texture video as $F_{t+j}$, and recording the texture video frame at time t−j consecutive with the current texture video frame in the texture video as $F_{t-j}$, where $j \in (0, N_F/2]$ and $N_F$ is a positive integer less than 10; in the specific application of this example $N_F = 4$, i.e. the motion region of the texture video is extracted jointly from the current texture video frame and its two preceding and two following frames.
②-2, calculating, by a known optical flow method, the motion vector image in the horizontal direction and the motion vector image in the vertical direction between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, and between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j; recording the horizontal and vertical motion vector images between the current texture video frame and $F_{t+j}$ as $V_{t+j}^H$ and $V_{t+j}^V$, and the horizontal and vertical motion vector images between the current texture video frame and $F_{t-j}$ as $V_{t-j}^H$ and $V_{t-j}^V$; $V_{t+j}^H$, $V_{t+j}^V$, $V_{t-j}^H$ and $V_{t-j}^V$ all have width W and height H.
②-3, superposing the absolute value of $V_{t+j}^H$ and the absolute value of $V_{t+j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, denoted $M_{t+j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t+j}$ as $m_{t+j}(x, y)$; superposing the absolute value of $V_{t-j}^H$ and the absolute value of $V_{t-j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j, denoted $M_{t-j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t-j}$ as $m_{t-j}(x, y)$.
②-4, extracting a joint motion map, denoted $M_j^\Delta$, by using the current texture video frame, the texture video frame $F_{t+j}$ at time t+j and the texture video frame $F_{t-j}$ at time t−j; the specific process of extracting $M_j^\Delta$ is: judging whether the minimum of the motion amplitude values of the pixels at corresponding coordinates in the motion amplitude image $M_{t+j}$ and in the motion amplitude image $M_{t-j}$ is larger than a set first threshold $T_1$; if so, setting the pixel value of the pixel at the corresponding coordinates in the joint motion map $M_j^\Delta$ to the average of the motion amplitude values of the pixels at the corresponding coordinates in $M_{t+j}$ and $M_{t-j}$, and otherwise setting it to 0; that is, for the pixel at coordinates (x, y) in $M_{t+j}$ and the pixel at coordinates (x, y) in $M_{t-j}$, judging whether $\min(m_{t+j}(x, y), m_{t-j}(x, y)) > T_1$; if so, the pixel value of the pixel at (x, y) in $M_j^\Delta$ is $(m_{t+j}(x, y) + m_{t-j}(x, y))/2$, and otherwise it is 0, where min() is the minimum-value function. Here the first threshold $T_1 = 1$, to filter out small noise points caused by very slight camera jitter.
②-5, weighting and superposing the joint motion maps at each offset from 1 to $N_F/2$ from time t to obtain the weighted joint motion map of the current texture video frame, denoted M; recording the pixel value of the pixel at coordinates (x, y) in the weighted joint motion map M as m(x, y), with $m(x, y) = \sum_{j=1}^{N_F/2} \zeta_j \, m_j^\Delta(x, y)$, where $m_j^\Delta(x, y)$ denotes the pixel value of the pixel at coordinates (x, y) in the joint motion map $M_j^\Delta$ at offset j from time t, and $\zeta_j$ is a weighting coefficient satisfying $\sum_{j=1}^{N_F/2} \zeta_j = 1$.
In a video, moving objects are the main regions of interest; however, the degree of attention people pay to them differs with the type of motion. The motion in a video falls mainly into the following two cases. First, for shooting with a static camera, the background is static and the moving object is the main object of interest. Second, for shooting with a moving camera, the background moves globally, and the moving object either keeps relatively still with respect to the camera or moves inconsistently with the background; in this case the moving object is still the object of interest. From the above analysis, the motion attention region mainly arises from motion attributes of an object that differ from the motion attributes of the background environment, i.e. a region of large motion contrast, so the following steps can be adopted to obtain the motion visual attention.
②-6, performing Gaussian pyramid decomposition of the weighted joint motion map M of the current texture video frame into $n_L$ layers of weighted joint motion maps; recording the i-th layer weighted joint motion map obtained after the Gaussian pyramid decomposition of the weighted joint motion map M as M(i), the width and height of M(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, the 0th layer is the bottom layer and the $(n_L-1)$-th layer is the highest layer, W is the width of the current texture video frame, and H is the height of the current texture video frame; in the specific application of this embodiment $n_L$ takes the value 9.
②-7, extracting the distribution map $S_M$ of the motion visual attention of the current texture video frame by using the $n_L$ layers of weighted joint motion maps of the current texture video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_M$ as $s_m(x, y)$, with $S_M = \bar F_M$ and $\bar F_M = N\big(\oplus_c \oplus_s N(|M(c) \ominus M(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, M(c) is the c-th layer weighted joint motion map, and M(s) is the s-th layer weighted joint motion map; the symbol $\ominus$ denotes the cross-level difference operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c); the symbol $\oplus$ denotes the cross-level addition operator applied to M(c) and M(s), defined in the same way with summation in place of differencing.
The distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Ballet" two-dimensional color video is processed by this step is shown in fig. 8b; fig. 13b shows the distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Door Flower" two-dimensional color video is processed by this step.
Thirdly, defining the depth video frame at each moment in the depth video corresponding to the texture video as a grayscale map represented with $Z_D$-bit depth, whose grayscale values from 0 to $2^{Z_D}-1$ represent the relative distance from the photographed object represented by each pixel in the depth video frame to the shooting camera, grayscale value 0 corresponding to the maximum depth and grayscale value $2^{Z_D}-1$ to the minimum depth; setting the size of the depth video frame at each moment in the depth video to W × H, W being the width and H being the height of the depth video frame at each moment in the depth video; recording the depth video frame at time t in the depth video as $D_t$, defining the depth video frame $D_t$ at time t as the current depth video frame, and detecting, by a depth visual attention detection method, the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame to obtain a distribution map of the depth visual attention of the three-dimensional video image, denoted $S_D$; the distribution map $S_D$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher degree of human attention to the relative depth of the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree. In this embodiment, each pixel of the depth video frame is represented with $Z_D = 8$-bit depth, and each pixel of the visual attention distribution map is represented with $Z_S = 8$-bit depth.
Stereoscopic impression is the main characteristic distinguishing stereoscopic video from conventional single-channel video. For the visual attention of stereoscopic video, depth perception influences the user's visual attention mainly in two respects: on the one hand, the user's degree of interest in scenery (or objects) close to the shooting camera array is generally greater than that in scenery (or objects) far from the shooting camera array; on the other hand, depth-discontinuous regions provide the user with strong depth contrast. In this embodiment, a flow chart of the depth visual attention detection method is shown in fig. 6, and the specific process of the depth visual attention detection method is as follows:
③-1, performing Gaussian pyramid decomposition of the current depth video frame to obtain $n_L$ layers of depth video frames; recording the i-th layer depth video frame obtained after the Gaussian pyramid decomposition of the current depth video frame as D(i), the width and height of D(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, the 0th layer is the bottom layer with the largest resolution and $D(0) = D_t$, the $(n_L-1)$-th layer is the highest layer with the lowest resolution, W is the width of the current depth video frame, and H is the height of the current depth video frame.
③-2, extracting the depth feature map of the current depth video frame, denoted $\bar F_D$, by using the $n_L$ layers of depth video frames of the current depth video frame, with $\bar F_D = N\big(\oplus_c \oplus_s N(|D(c) \ominus D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, D(c) is the c-th layer depth video frame, and D(s) is the s-th layer depth video frame; the symbol $\ominus$ denotes the cross-level difference operator applied to D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and then each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and then each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c); the symbol $\oplus$ denotes the cross-level addition operator applied to D(c) and D(s), defined in the same way with summation in place of differencing.
③-3, a depth edge region with a larger depth difference gives the user a stronger sense of depth, so the depth edge regions in the current depth video frame are important regions of depth visual attention. Convolution operations are performed on the current depth video frame with known Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components of these directions and obtain four directional component maps of the current depth video frame, denoted $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ respectively; each of the directional component maps $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ of the current depth video frame is decomposed by Gaussian pyramid into $n_L$ layers of directional component maps; the i-th layer directional component map obtained by the Gaussian pyramid decomposition of the directional component map of direction θ is recorded as $O_\theta^D(i)$, the width and height of $O_\theta^D(i)$ being $W/2^i$ and $H/2^i$ respectively, where $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$, $i \in [0, n_L-1]$, the 0th layer is the bottom layer with $O_\theta^D(0) = O_\theta^D$, the $(n_L-1)$-th layer is the highest layer, W is the width of the current depth video frame, and H is the height of the current depth video frame.
③-4, extracting a preliminary depth direction feature map of the current depth video frame, denoted $\bar F'_{DO}$, by using the $n_L$ layers of directional component maps of each direction of the current depth video frame, with $\bar F'_{DO} = \frac{1}{4} \sum_{\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}} \bar F_{O_\theta}$ and $\bar F_{O_\theta} = N\big(\oplus_c \oplus_s N(|O_\theta^D(c) \ominus O_\theta^D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, $O_\theta^D(c)$ is the c-th layer directional component map of the direction θ, and $O_\theta^D(s)$ is the s-th layer directional component map of the direction θ; the symbol $\ominus$ denotes the cross-level difference operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$: if c < s, $O_\theta^D(s)$ is up-sampled to an image with the same resolution as $O_\theta^D(c)$, and then each pixel of $O_\theta^D(c)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(s)$; if c > s, $O_\theta^D(c)$ is up-sampled to an image with the same resolution as $O_\theta^D(s)$, and then each pixel of $O_\theta^D(s)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(c)$; the symbol $\oplus$ denotes the cross-level addition operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$, defined in the same way with summation in place of differencing.
③-5, performing, with the known morphological dilation algorithm using a structuring element of size $w_1 \times h_1$ as the basic dilation unit, $n_1$ dilation operations on the preliminary depth direction feature map $\bar F'_{DO}$ of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted $\bar F_{DO}$. In this embodiment, for the "Ballet" and "Door Flower" test sequences, where the size of each image in the test sequence is 1024 × 768, the basic unit of morphological dilation is an 8 × 8 block, i.e. $w_1 \times h_1 = 8 \times 8$, and the number of dilations $n_1 = 2$.
Thirdly-6, using the depth feature map $F_D$ of the current depth video frame and the depth direction feature map $F_{DO}$ of the current depth video frame, obtain the distribution map of the preliminary depth visual attention of the current depth video frame, denoted $S'_D$. Denote the pixel value of the pixel with coordinate $(x, y)$ in $S'_D$ as $s'_d(x, y)$, where $N(\cdot)$ is a normalization function to the interval $[0, 2^{Z_S}-1]$.
Thirdly-7, at the left and right borders of the image, the left image boundary of the left-view image has no corresponding region in the right-view image, so no stereoscopic effect can be formed in the human brain; similarly, it is difficult to form a stereoscopic effect at the right image boundary of the right-view image. Therefore, in a stereoscopic video the left and right boundary regions of the image provide a weak stereoscopic effect or none at all and are non-stereoscopic visual attention regions, so the invention suppresses these regions in the distribution map $S'_D$ of the preliminary depth visual attention of the current depth video frame, and uses the suppressed $S'_D$ to obtain the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame. Denote the pixel value of the pixel with coordinate $(x, y)$ in $S_D$ as $s_d(x, y)$, $s_d(x, y) = s'_d(x, y) \cdot g(x, y)$, where $g(x, y)$ is a left/right boundary suppression function, W is the width of the current depth video frame, H is the height of the current depth video frame, b is a set second threshold, and the symbol "|" is the "or" operator. Here the second threshold b takes the value 16. The function $g(x, y)$ may also be another two-dimensional function that suppresses the edge regions of the image, such as a two-dimensional Gaussian function whose template size equals the size of the texture video frame.
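A sketch of this boundary suppression, assuming NumPy. The exact form of $g(x, y)$ is an assumption here: an indicator mask that zeroes columns within b = 16 pixels of the left and right image borders, consistent with the "or" operator and the threshold b defined above.

```python
import numpy as np

def suppress_lr_borders(s_d_prime, b=16):
    """Multiply the preliminary depth attention map by an assumed
    left/right boundary suppression mask g(x, y)."""
    h, w = s_d_prime.shape
    x = np.arange(w)
    g = np.where((x < b) | (x > w - 1 - b), 0.0, 1.0)  # assumed indicator form
    return s_d_prime * g[np.newaxis, :]                # broadcast over rows
```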
FIG. 8c shows the depth visual attention distribution map of the three-dimensional video image jointly displayed by the color video frame at time t in the two-dimensional color video of the test sequence "Ballet" and the corresponding depth video frame; fig. 13c shows the depth visual attention distribution map of the three-dimensional video image jointly displayed by the color video frame at time t in the two-dimensional color video of the test sequence "Door Flower" and the corresponding depth video frame.
Fourthly, a visual attention fusion method based on depth perception is adopted to fuse the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, so as to extract a distribution map of the three-dimensional visual attention conforming to human stereoscopic perception, denoted S; the distribution map S of the three-dimensional visual attention has size W × H and is a grayscale map represented with $Z_S$-bit depth. The larger the pixel value of a pixel in this grayscale map, the higher the relative degree of attention of human eyes to the corresponding pixel of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame; the smaller the pixel value, the lower that relative degree of attention.
In a traditional single channel, a moving object is more likely to attract the viewer's attention than a static object; among objects that are all static, brightly colored regions, regions with high color or brightness contrast, and regions with large texture direction differences are more likely to attract attention. In a stereoscopic video, besides motion visual attention and static image domain visual attention, the visual attention distribution of human eyes is also influenced by the special stereoscopic perception the video provides to users. Stereoscopic vision arises mainly from the small positional offset, i.e. the parallax, between the scenes seen by the left and right eyes: the two eyes are about 6 cm apart, the images received by them project onto the retinas with a small positional offset, and the brain automatically integrates this offset into a stereoscopic image with depth, forming stereoscopic vision; the relative distance information of objects conveyed by stereoscopic vision is another important factor that directly influences a person's attention selection. In a stereoscopic video, objects in depth-discontinuous regions or regions with large depth contrast give the user a stronger sense of depth difference, have a stronger stereoscopic impression or depth feeling, and are among the regions in which the user is interested; moreover, viewers are more interested in foreground regions close to the shooting camera (or video viewer) than in regions far from it, so foreground regions are usually important potential candidates for the stereoscopic video viewer's region of interest. Based on the above analysis, the factors influencing the three-dimensional visual attention of human eyes are determined to be four: static image domain visual attention, motion visual attention, depth visual attention and depth itself. The specific process of the depth-perception-based visual attention fusion method in this embodiment is therefore as follows:
Fourthly-1, perform a scale transformation on the current depth video frame by $Q(d(x, y)) = d(x, y) + \gamma$, where γ is a coefficient within a set range, $d(x, y)$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the current depth video frame, and $Q(d(x, y))$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the scaled current depth video frame.
Fourthly-2, using the scaled current depth video frame, the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, obtain the distribution map S of the three-dimensional visual attention. The pixel value of the pixel with coordinate $(x, y)$ in S is $s(x, y)$, which combines the scaled depth $Q(d(x, y))$, the weighted attention values $K_D s_D(x, y)$, $K_M s_M(x, y)$ and $K_I s_I(x, y)$, and the visual attention correlation terms $C_{ab}\Theta_{ab}(x, y)$, where $K_D$, $K_M$ and $K_I$ are the weighting coefficients of $S_D$, $S_M$ and $S_I$ respectively and satisfy the condition

$$\sum_{a \in \{D, M, I\}} K_a = 1, \qquad 0 \le K_a \le 1,$$

$N(\cdot)$ is a normalization function to the interval $[0, 2^{Z_S}-1]$; $s_D(x, y)$, $s_M(x, y)$ and $s_I(x, y)$ denote the pixel values of the pixel with coordinate $(x, y)$ in $S_D$, $S_M$ and $S_I$ respectively; $\Theta_{ab}(x, y)$ is the visual attention correlation value, $\Theta_{ab}(x, y) = \min(s_a(x, y), s_b(x, y))$, where $\min(\cdot)$ is the minimum function; and $C_{ab}$ are the correlation coefficients, satisfying the condition

$$\sum_{a, b \in \{D, M, I\},\, a \ne b} C_{ab} = 1, \qquad 0 \le C_{ab} < 1,$$

where the correlation coefficient $C_{DM}$ denotes the degree of correlation between $S_D$ and $S_M$, the correlation coefficient $C_{DI}$ denotes the degree of correlation between $S_D$ and $S_I$, the correlation coefficient $C_{IM}$ denotes the degree of correlation between $S_I$ and $S_M$, and $a, b \in \{D, M, I\}$ with $a \ne b$.
Motion visual attention, static image domain visual attention and depth visual attention all play important roles in human visual attention; however, motion visual attention is the most important component of video visual attention, followed by static image domain visual attention caused by the brightness, color and direction of the image domain, and then by depth visual attention. In this embodiment, each visual attention distribution map is represented with $Z_S = 8$ bit depth, and the weights are taken as $K_D = 0.15$, $K_M = 0.4$ and $K_I = 0.35$. The degree of correlation between depth visual attention and motion visual attention is small, the degree of correlation between depth visual attention and static image domain visual attention is small, and the degree of correlation between static image domain visual attention and motion visual attention is large; the correlation coefficients $C_{DM}$, $C_{DI}$ and $C_{IM}$ are set accordingly. The scale transformation coefficient γ reflects the scene depth of the texture video scene: the smaller γ is, the greater the scene depth and the stronger the depth feeling given to the viewer; conversely, the larger γ is, the smaller the scene depth and the weaker the depth feeling. For the "Ballet" and "Door Flower" test sequences, whose scene depths are small, the scale transformation coefficient γ is set to 50. The distribution map of the three-dimensional visual attention extracted from the texture video frame at time t of the "Ballet" test sequence and the corresponding depth video frame is shown in FIG. 9, and that extracted from the texture video frame at time t of the "Door Flower" test sequence and the corresponding depth video frame is shown in FIG. 14.
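A sketch of the fusion of step Fourthly, assuming NumPy. The exact combination formula is not reproduced in this text, so the form below, in which the scaled depth multiplies a weighted sum of the three attention maps plus the min-based correlation terms before normalization, is one plausible instantiation rather than the patent's verbatim formula; the correlation coefficient values in `c` are likewise assumed.

```python
import numpy as np

def fuse_attention(depth, s_d, s_m, s_i, gamma=50,
                   k=(0.15, 0.4, 0.35), c=(0.02, 0.02, 0.06), z_s=8):
    """Fuse depth, motion and static-image attention maps into one map."""
    q = depth.astype(np.float64) + gamma          # step Fourthly-1 scale transform
    k_d, k_m, k_i = k
    c_dm, c_di, c_im = c                          # correlation coefficients (assumed values)
    weighted = k_d * s_d + k_m * s_m + k_i * s_i
    corr = (c_dm * np.minimum(s_d, s_m) +
            c_di * np.minimum(s_d, s_i) +
            c_im * np.minimum(s_i, s_m))          # Theta_ab = min(s_a, s_b)
    raw = q * (weighted + corr)                   # assumed combination form
    return (2 ** z_s - 1) * (raw - raw.min()) / (raw.max() - raw.min() + 1e-9)
```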
Fifthly, perform thresholding and macro-block post-processing on the distribution map S of the three-dimensional visual attention to obtain the final region of interest of the current texture video frame that conforms to human stereoscopic perception.
In this embodiment, the specific process of thresholding and macro-block post-processing the distribution map S of the three-dimensional visual attention is as follows:
Fifthly-1, denote the pixel value of the pixel with coordinate $(x, y)$ in the distribution map S of the three-dimensional visual attention as $s(x, y)$, and define a third threshold $T_S$:

$$T_S = k_T \cdot \sum_{y=0}^{H-1}\sum_{x=0}^{W-1} s(x, y) \,/\, (W \times H),$$

where W is the width of the distribution map S of the three-dimensional visual attention, H is the height of the distribution map S of the three-dimensional visual attention, and $k_T \in (0, 3)$; in the application process of this embodiment $k_T$ takes the value 1.5. Create a new preliminary binary mask image and judge whether $s(x, y) \ge T_S$: if so, mark the pixel with coordinate $(x, y)$ in the preliminary binary mask image as a pixel of interest; otherwise, mark the pixel with coordinate $(x, y)$ in the preliminary binary mask image as a non-interest pixel.
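A minimal sketch of this thresholding step, assuming NumPy: the third threshold $T_S$ is $k_T$ times the mean of the attention map, and pixels at or above $T_S$ become pixels of interest in a preliminary binary mask.

```python
import numpy as np

def preliminary_mask(s, k_t=1.5):
    """Return a boolean mask: True = pixel of interest."""
    t_s = k_t * s.mean()          # T_S = k_T * sum(s) / (W * H)
    return s >= t_s
```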
Fifthly-2, divide the preliminary binary mask image into $(W/w_2) \times (H/h_2)$ non-overlapping blocks of size $w_2 \times h_2$, and denote the block with block abscissa u and block ordinate v as $B_{u,v}$, where $u \in [0, W/w_2 - 1]$ and $v \in [0, H/h_2 - 1]$. According to each block in the preliminary binary mask image, determine whether the pixels in the corresponding block of the current texture video frame are pixels of interest or non-interest pixels: for block $B_{u,v}$, judge whether the number of pixels marked as pixels of interest in $B_{u,v}$ is greater than a set fourth threshold $T_b$, where $0 \le T_b \le w_2 \times h_2$. If so, mark all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as pixels of interest and take the block corresponding to $B_{u,v}$ as a region-of-interest block; otherwise, mark all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as non-interest pixels and take the block corresponding to $B_{u,v}$ as a non-region-of-interest block. This yields the preliminary region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks and non-region-of-interest blocks.
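A sketch of this block post-processing, assuming NumPy: the preliminary mask is cut into non-overlapping $w_2 \times h_2$ blocks, and a block becomes a region-of-interest block when it contains more than $T_b$ pixels of interest.

```python
import numpy as np

def block_postprocess(mask, w2=16, h2=16, t_b=50):
    """mask: boolean preliminary mask. Returns a block-aligned ROI mask."""
    h, w = mask.shape
    roi = np.zeros_like(mask, dtype=bool)
    for v in range(h // h2):
        for u in range(w // w2):
            block = mask[v * h2:(v + 1) * h2, u * w2:(u + 1) * w2]
            if block.sum() > t_b:                 # count of interest pixels
                roi[v * h2:(v + 1) * h2, u * w2:(u + 1) * w2] = True
    return roi
```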
In this embodiment, each image in the test sequences "Ballet" and "Door Flower" is of size 1024 × 768, so the size of block $B_{u,v}$ can be set to $w_2 \times h_2 = 16 \times 16$; a region containing only a small number of pixels of interest is unlikely to interest the viewer, so the fourth threshold $T_b$ is set to 50 here.
Fifthly-3, since the change between the region of interest and the non-interest region is not abrupt but gradual, the invention sets $N_R$ levels of transition regions of interest between the region of interest and the non-interest region, as shown in the sketch after this paragraph. Mark all pixels in the non-region-of-interest blocks most adjacent to the region-of-interest blocks in the preliminary region-of-interest mask image as the $N_R$-th level transition region of interest, and update the preliminary region-of-interest mask image; then mark all pixels in the non-region-of-interest blocks nearest to the $N_R$-th level transition region of interest in the updated preliminary region-of-interest mask image as the $(N_R-1)$-th level transition region of interest, and recursively update the preliminary region-of-interest mask image; repeat this recursion until the level-1 transition region of interest has been marked. This finally yields the final region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks, $N_R$ levels of transition regions of interest and non-region-of-interest blocks. In this embodiment $N_R$ takes the value 2, i.e. a 2-level transition region of interest is set.
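A sketch of this recursive labelling, assuming SciPy: starting from the block-level ROI grid, each successive ring of non-ROI blocks around the ROI receives a decreasing transition level, from $N_R$ (next to the ROI) down to 1. Encoding the ROI itself as level $N_R + 1$ is a choice of this sketch.

```python
import numpy as np
from scipy import ndimage

def transition_levels(roi_blocks, n_r=2):
    """roi_blocks: boolean (H/h2) x (W/w2) block grid. Returns an integer
    grid: 0 = non-ROI, 1..n_r = transition level, n_r + 1 = ROI."""
    levels = np.where(roi_blocks, n_r + 1, 0)
    current = roi_blocks.copy()
    for e in range(n_r, 0, -1):                  # e = n_r, ..., 1
        ring = ndimage.binary_dilation(current) & ~current
        levels[ring] = e
        current |= ring
    return levels
```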
FIG. 10a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Ballet"; FIG. 15a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Door Flower". In fig. 10a and 15a, the black area represents the region of interest, the gray area is the transition region of interest, and the white area is the region of non-interest.
Fifthly-4, denote the pixel value of the pixel with coordinate $(x, y)$ in the final region-of-interest mask image as $r(x, y)$. Set the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to $r(x, y) = 255$; set the pixel values of all pixels in the e-th level transition region of interest of the final region-of-interest mask image to

$$r(x, y) = \frac{e}{N_R + 1} \times f(x, y),$$

and set the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to $r(x, y) = f(x, y)$, obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, $e \in [1, N_R]$, and $f(x, y)$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the current texture video frame.
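A sketch of this final rendering, assuming NumPy and the per-pixel level map encoding from the previous sketch: ROI pixels keep the texture frame's values, level-e transition pixels are dimmed to $e/(N_R+1)$ of their value, and non-ROI pixels are set to 255 (white).

```python
import numpy as np

def render_roi(frame, levels_per_pixel, n_r=2):
    """levels_per_pixel: per-pixel map with 0 = non-ROI, 1..n_r = transition
    level e, n_r + 1 = ROI (block labels expanded to pixel resolution)."""
    out = np.full_like(frame, 255, dtype=np.float64)
    for e in range(1, n_r + 1):
        sel = levels_per_pixel == e
        out[sel] = (e / (n_r + 1)) * frame[sel]   # r = e/(N_R+1) * f
    sel = levels_per_pixel == n_r + 1
    out[sel] = frame[sel]                          # r = f inside the ROI
    return out.astype(np.uint8)
```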
FIG. 10b shows the region of interest of the texture video frame at time t of the test sequence "Ballet"; fig. 15b shows the region of interest of the texture video frame at time t of the test sequence "Door Flower". The regions of interest in fig. 10b and fig. 15b keep the same pixel values as the texture video frame at time t and display the color texture content; the transition regions of interest are displayed as dark gray regions with reduced brightness; and the smooth white regions are the non-interest regions corresponding to the white regions of the region-of-interest mask image. For comparison of extraction effects, fig. 11a and fig. 16a show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted from the static image domain visual attention cue alone, where the noise regions with rich background texture cannot be removed. Fig. 11b and fig. 16b show the regions of interest extracted from the motion visual attention cue alone, as in the prior art: for the "Ballet" sequence, this method cannot completely extract the man whose motion is very slow, and the background noise caused by motion shadow is severe; for the "Door Flower" sequence, only the motion region is extracted, and neither the texture complexity nor the depth feeling provided by stereoscopic vision is considered. Fig. 11c and fig. 16c show the regions of interest extracted by combining the motion visual attention and static image domain visual attention cues; although this method combines static and motion visual information, it cannot effectively suppress the texture regions and motion noise in the background environment.
As can be seen from the comparison between fig. 10a and 10b and fig. 11a, 11b and 11c, and between fig. 15a and 15b and fig. 16a, 16b and 16c, the region of interest extracted by the invention combines static image domain visual attention, motion visual attention and depth visual attention, effectively suppresses the inherent one-sidedness and inaccuracy of each single visual attention extraction, solves the noise problem caused by complex backgrounds in static image domain visual attention, and solves the problem that motion visual attention cannot extract regions of interest with local or small-amplitude motion, thereby improving calculation accuracy, enhancing the stability of the algorithm, and enabling regions of interest to be extracted from backgrounds and motion environments with complex texture. In addition, the region of interest obtained by the invention not only conforms to the visual interest characteristics of human eyes for static texture video frames and for moving objects, but also conforms to the depth perception characteristic of interest in objects with strong depth or at close distance in stereoscopic vision, and thus accords with the semantic characteristics of human stereoscopic vision.
Sixthly, repeating the steps from the first step to the fifth step until all texture video frames in the texture video are processed, and obtaining the video interesting area of the texture video.
In this embodiment, the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, the distribution map $S_D$ of the depth visual attention of the three-dimensional video image and the distribution map S of the three-dimensional visual attention are all grayscale maps represented with $Z_S$-bit depth, and the depth video frame at each time in the depth video corresponding to the texture video is a grayscale map represented with $Z_D$-bit depth. Here a grayscale map with 256 gray levels is represented with 8-bit depth, so $Z_S = 8$ and $Z_D = 8$; of course, other bit depths may be used to represent the grayscale maps in practical applications, such as 16-bit depth, which gives higher representation accuracy.