Disclosure of Invention
The invention aims to solve the technical problem of providing a visual-attention-based video region-of-interest extraction method which ensures that the extracted video region of interest has high precision and good stability and accords with the semantic characteristics of human stereoscopic vision.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a visual-attention-based video region-of-interest extraction method comprising the following steps:
firstly, defining a two-dimensional color video as the texture video, defining the size of the texture video frame at each moment in the texture video as W × H, W being the width and H being the height of the texture video frame at each moment in the texture video, recording the texture video frame at time t in the texture video as $F_t$, defining the texture video frame $F_t$ at time t as the current texture video frame, and detecting the static image domain visual attention of the current texture video frame by a known still-image visual attention detection method to obtain a distribution map of the static image domain visual attention of the current texture video frame, denoted $S_I$; the distribution map $S_I$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
secondly, detecting the motion visual attention of the current texture video frame by a motion visual attention detection method to obtain a distribution map of the motion visual attention of the current texture video frame, denoted $S_M$; the distribution map $S_M$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
thirdly, defining the depth video frame at each moment in the depth video corresponding to the texture video as a grayscale map represented with $Z_D$-bit depth, setting the size of the depth video frame at each moment in the depth video to W × H, W being the width and H being the height of the depth video frame at each moment in the depth video, recording the depth video frame at time t in the depth video as $D_t$, defining the depth video frame $D_t$ at time t as the current depth video frame, and detecting, by a depth visual attention detection method, the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame to obtain a distribution map of the depth visual attention of the three-dimensional video image, denoted $S_D$; the distribution map $S_D$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
fourthly, fusing, by a depth-perception-based visual attention fusion method, the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, so as to extract a distribution map of three-dimensional visual attention conforming to human stereoscopic perception, denoted S; the distribution map S has a size of W × H and is a grayscale map represented with $Z_S$-bit depth;
fifthly, performing thresholding and macroblock post-processing on the distribution map S of the three-dimensional visual attention to obtain the final region of interest of the current texture video frame conforming to human stereoscopic perception;
sixthly, repeating steps one to five until all texture video frames in the texture video are processed, thereby obtaining the video region of interest of the texture video.
The specific process of the motion visual attention detection method in the second step is as follows:
②-1, recording the texture video frame at time t+j consecutive with the current texture video frame in the texture video as $F_{t+j}$, and recording the texture video frame at time t−j consecutive with the current texture video frame in the texture video as $F_{t-j}$, where $j \in (0, N_F/2]$ and $N_F$ is a positive integer less than 10;
②-2, calculating, by a known optical flow method, the motion vector image in the horizontal direction and the motion vector image in the vertical direction between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, and between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j; recording the horizontal and vertical motion vector images between the current texture video frame and $F_{t+j}$ as $V_{t+j}^H$ and $V_{t+j}^V$, and the horizontal and vertical motion vector images between the current texture video frame and $F_{t-j}$ as $V_{t-j}^H$ and $V_{t-j}^V$; $V_{t+j}^H$, $V_{t+j}^V$, $V_{t-j}^H$ and $V_{t-j}^V$ all have width W and height H;
②-3, superposing the absolute value of $V_{t+j}^H$ and the absolute value of $V_{t+j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, denoted $M_{t+j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t+j}$ as $m_{t+j}(x, y)$; superposing the absolute value of $V_{t-j}^H$ and the absolute value of $V_{t-j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j, denoted $M_{t-j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t-j}$ as $m_{t-j}(x, y)$;
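As an illustration of steps ②-2 and ②-3, the sketch below computes a motion amplitude image from dense optical flow. The patent does not fix a particular optical flow algorithm; OpenCV's Farneback flow and all parameter values are assumptions.

```python
import cv2
import numpy as np

def motion_amplitude(frame_a, frame_b):
    """Motion amplitude image between two grayscale frames:
    the per-pixel sum |V^H| + |V^V| of the optical flow components."""
    # Farneback dense flow stands in for the "known optical flow method".
    flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v_h, v_v = flow[..., 0], flow[..., 1]   # V^H and V^V
    return np.abs(v_h) + np.abs(v_v)        # M with entries m(x, y)

# e.g. M_{t+j} and M_{t-j} for the current frame F_t:
# m_plus  = motion_amplitude(f_t, f_t_plus_j)
# m_minus = motion_amplitude(f_t, f_t_minus_j)
```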
②-4, extracting a joint motion map, denoted $M_j^\Delta$, by using the current texture video frame, the texture video frame $F_{t+j}$ at time t+j and the texture video frame $F_{t-j}$ at time t−j; the specific process of extracting $M_j^\Delta$ is: judging whether the minimum of the motion amplitude values of the pixels at corresponding coordinates in the motion amplitude image $M_{t+j}$ and in the motion amplitude image $M_{t-j}$ is larger than a set first threshold $T_1$; if so, setting the pixel value of the pixel at the corresponding coordinates in the joint motion map $M_j^\Delta$ to the average of the motion amplitude values of the pixels at the corresponding coordinates in $M_{t+j}$ and $M_{t-j}$, and otherwise setting it to 0; that is, for the pixel at coordinates (x, y) in $M_{t+j}$ and the pixel at coordinates (x, y) in $M_{t-j}$, judging whether $\min(m_{t+j}(x, y), m_{t-j}(x, y)) > T_1$; if so, the pixel value of the pixel at (x, y) in $M_j^\Delta$ is $(m_{t+j}(x, y) + m_{t-j}(x, y))/2$, and otherwise it is 0, where min() is the minimum-value function;
②-5, weighting and superposing the joint motion maps at each offset from 1 to $N_F/2$ from time t to obtain the weighted joint motion map of the current texture video frame, denoted M; recording the pixel value of the pixel at coordinates (x, y) in the weighted joint motion map M as m(x, y), with $m(x, y) = \sum_{j=1}^{N_F/2} \zeta_j \, m_j^\Delta(x, y)$, where $m_j^\Delta(x, y)$ denotes the pixel value of the pixel at coordinates (x, y) in the joint motion map $M_j^\Delta$ at offset j from time t, and $\zeta_j$ is a weighting coefficient satisfying $\sum_{j=1}^{N_F/2} \zeta_j = 1$;
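Steps ②-4 and ②-5 can be sketched as follows; the function names and the uniform default weights are illustrative assumptions.

```python
import numpy as np

def joint_motion_map(m_plus, m_minus, t1=1.0):
    """Joint motion map M_j^Delta (step 2-4): where both motion amplitudes
    exceed the first threshold T1, take their average, otherwise 0."""
    keep = np.minimum(m_plus, m_minus) > t1
    return np.where(keep, (m_plus + m_minus) / 2.0, 0.0)

def weighted_joint_motion(joint_maps, weights=None):
    """Weighted joint motion map M (step 2-5): sum_j zeta_j * M_j^Delta
    with the zeta_j summing to 1; uniform weights assumed by default."""
    if weights is None:
        weights = [1.0 / len(joint_maps)] * len(joint_maps)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * m for w, m in zip(weights, joint_maps))
```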
②-6, performing Gaussian pyramid decomposition of the weighted joint motion map M of the current texture video frame into $n_L$ layers of weighted joint motion maps; recording the i-th layer weighted joint motion map obtained after the Gaussian pyramid decomposition of the weighted joint motion map M as M(i), the width and height of M(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, W is the width of the current texture video frame, and H is the height of the current texture video frame;
②-7, extracting the distribution map $S_M$ of the motion visual attention of the current texture video frame by using the $n_L$ layers of weighted joint motion maps of the current texture video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_M$ as $s_m(x, y)$, with $S_M = \bar F_M$ and $\bar F_M = N\big(\oplus_c \oplus_s N(|M(c) \ominus M(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, M(c) is the c-th layer weighted joint motion map, and M(s) is the s-th layer weighted joint motion map; the symbol $\ominus$ denotes the cross-level difference operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c); the symbol $\oplus$ denotes the cross-level addition operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is summed with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is summed with the corresponding pixel of the up-sampled M(c).
The first threshold set in step ②-4 is $T_1 = 1$.
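The cross-level (center-surround) machinery of steps ②-6 and ②-7, which reappears below for the depth and direction features, can be sketched as follows. The normalization interval is not reproduced in the text above, so normalization onto [0, 255] is an assumption, as is accumulating the cross-level sums at the base resolution.

```python
import cv2
import numpy as np

def gaussian_pyramid(img, n_levels=9):
    """n_L-layer Gaussian pyramid; layer i has size (W/2^i, H/2^i)."""
    pyr = [img.astype(np.float32)]
    for _ in range(n_levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def resize_to(img, like):
    h, w = like.shape[:2]
    return cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)

def normalize(img, max_val=255.0):
    # N(.): ASSUMED normalization onto [0, max_val].
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) * max_val if hi > lo else np.zeros_like(img)

def center_surround_feature(pyr, deltas=(-3, -2, -1, 1, 2, 3)):
    """F = N( cross-level sum of N(|pyr[c] (-) pyr[s]|) ), s = c + delta.
    The coarser map is up-sampled to the finer one before differencing;
    all partial maps are accumulated at the base resolution."""
    n = len(pyr)
    acc = np.zeros_like(pyr[0])
    for c in range(n):
        for d in deltas:
            s = c + d
            if not (0 <= s < n):
                continue
            fine = min(c, s)      # difference at the finer resolution
            diff = np.abs(resize_to(pyr[max(c, s)], pyr[fine]) - pyr[fine])
            acc += resize_to(normalize(diff), pyr[0])  # cross-level addition
    return normalize(acc)
```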
The depth visual attention detection method in the third step comprises the following specific processes:
③-1, performing Gaussian pyramid decomposition of the current depth video frame to obtain $n_L$ layers of depth video frames; recording the i-th layer depth video frame obtained after the Gaussian pyramid decomposition of the current depth video frame as D(i), the width and height of D(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, W is the width of the current depth video frame, and H is the height of the current depth video frame;
③-2, extracting the depth feature map of the current depth video frame, denoted $\bar F_D$, by using the $n_L$ layers of depth video frames of the current depth video frame, with $\bar F_D = N\big(\oplus_c \oplus_s N(|D(c) \ominus D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, D(c) is the c-th layer depth video frame, and D(s) is the s-th layer depth video frame; the symbol $\ominus$ denotes the cross-level difference operator applied to D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and then each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and then each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c); the symbol $\oplus$ denotes the cross-level addition operator applied to D(c) and D(s), defined in the same way with summation in place of differencing;
③-3, performing convolution operations on the current depth video frame with known Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components of these directions and obtain four directional component maps of the current depth video frame, denoted $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ respectively; performing Gaussian pyramid decomposition of each of the directional component maps $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ of the current depth video frame into $n_L$ layers of directional component maps; recording the i-th layer directional component map obtained by the Gaussian pyramid decomposition of the directional component map of direction θ as $O_\theta^D(i)$, the width and height of $O_\theta^D(i)$ being $W/2^i$ and $H/2^i$ respectively, where $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$, $i \in [0, n_L-1]$, W is the width of the current depth video frame, and H is the height of the current depth video frame;
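A sketch of the Gabor convolution of step ③-3; the kernel size and the Gabor envelope parameters are illustrative assumptions, since the patent only fixes the four orientations.

```python
import cv2
import numpy as np

def directional_components(depth_frame):
    """Four directional component maps O_theta^D of the depth frame,
    theta in {0, pi/4, pi/2, 3*pi/4}, via Gabor-filter convolution.
    Kernel size and Gabor parameters are illustrative choices."""
    comps = {}
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = cv2.getGaborKernel(
            (15, 15),  # ksize
            4.0,       # sigma of the Gaussian envelope
            theta,     # filter orientation
            10.0,      # wavelength of the sinusoid
            0.5,       # spatial aspect ratio
            0.0)       # phase offset
        comps[theta] = cv2.filter2D(depth_frame.astype(np.float32),
                                    cv2.CV_32F, kernel)
    return comps
```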
③-4, extracting a preliminary depth direction feature map of the current depth video frame, denoted $\bar F'_{DO}$, by using the $n_L$ layers of directional component maps of each direction of the current depth video frame, with $\bar F'_{DO} = \frac{1}{4} \sum_{\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}} \bar F_{O_\theta}$ and $\bar F_{O_\theta} = N\big(\oplus_c \oplus_s N(|O_\theta^D(c) \ominus O_\theta^D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, $O_\theta^D(c)$ is the c-th layer directional component map of the direction θ, and $O_\theta^D(s)$ is the s-th layer directional component map of the direction θ; the symbol $\ominus$ denotes the cross-level difference operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$: if c < s, $O_\theta^D(s)$ is up-sampled to an image with the same resolution as $O_\theta^D(c)$, and then each pixel of $O_\theta^D(c)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(s)$; if c > s, $O_\theta^D(c)$ is up-sampled to an image with the same resolution as $O_\theta^D(s)$, and then each pixel of $O_\theta^D(s)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(c)$; the symbol $\oplus$ denotes the cross-level addition operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$, defined in the same way with summation in place of differencing;
③-5, performing, with the known morphological dilation algorithm using a structuring element of size $w_1 \times h_1$ as the basic dilation unit, $n_1$ dilation operations on the preliminary depth direction feature map $\bar F'_{DO}$ of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted $\bar F_{DO}$;
③-6, obtaining a distribution map of the preliminary depth visual attention of the current depth video frame, denoted $S'_D$, by using the depth feature map $\bar F_D$ and the depth direction feature map $\bar F_{DO}$ of the current depth video frame; recording the pixel value of the pixel at coordinates (x, y) in $S'_D$ as $s'_d(x, y)$, where $N(\cdot)$ is a normalization function onto a fixed interval;
③-7, obtaining the depth visual attention distribution map $S_D$ of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame by using the distribution map $S'_D$ of the preliminary depth visual attention of the current depth video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_D$ as $s_d(x, y)$, with $s_d(x, y) = s'_d(x, y) \cdot g(x, y)$, where g(x, y) is a function suppressing the left and right image boundary regions, W is the width of the current depth video frame, H is the height of the current depth video frame, b is a set second threshold, and the symbol "||" is the "or" operator.
In step ③-5, $w_1 = 8$, $h_1 = 8$ and $n_1 = 2$; the second threshold b set in step ③-7 is 16.
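Steps ③-5 to ③-7 can be sketched as below with the stated parameters $w_1 = h_1 = 8$, $n_1 = 2$, b = 16. The explicit formulas for combining $\bar F_D$ with $\bar F_{DO}$ and for the boundary-suppression function g(x, y) are not reproduced in the text above, so the normalized sum and the hard left/right border mask used here are assumptions.

```python
import cv2
import numpy as np

def depth_attention(f_d, f_do_prelim, w1=8, h1=8, n1=2, b=16):
    """Steps 3-5 to 3-7 (sketch).
    f_d: depth feature map, f_do_prelim: preliminary depth direction map."""
    # 3-5: n1 morphological dilations with a w1 x h1 structuring element.
    kernel = np.ones((h1, w1), np.uint8)
    f_do = cv2.dilate(f_do_prelim.astype(np.float32), kernel, iterations=n1)

    # 3-6: preliminary depth visual attention S'_D; the normalized sum
    # below is an ASSUMED combination (the patent's formula is not shown).
    s = f_d + f_do
    s_prime = (s - s.min()) / (s.max() - s.min() + 1e-9) * 255.0

    # 3-7: suppress the left/right image borders: s_d = s'_d * g.
    # g is ASSUMED to zero out columns within b pixels of either border.
    h, w = s_prime.shape
    g = np.ones((h, w), np.float32)
    g[:, :b] = 0.0
    g[:, w - b:] = 0.0
    return s_prime * g
```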
The specific process of the depth-perception-based visual attention fusion method in the fourth step is as follows:
④-1, performing a scale transformation on the current depth video frame by Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a set range, d(x, y) denotes the pixel value of the pixel at coordinates (x, y) in the current depth video frame, and Q(d(x, y)) denotes the pixel value of the pixel at coordinates (x, y) in the scale-transformed current depth video frame;
④-2, acquiring the three-dimensional visual attention distribution map S by using the scale-transformed current depth video frame, the depth visual attention distribution map $S_D$ of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, the motion visual attention distribution map $S_M$ of the current texture video frame, and the static image domain visual attention distribution map $S_I$ of the current texture video frame; the pixel value of the pixel at coordinates (x, y) in the three-dimensional visual attention distribution map S is s(x, y), where $K_D$, $K_M$ and $K_I$ are the weighting coefficients of $S_D$, $S_M$ and $S_I$ respectively, satisfying the condition $\sum_{a \in \{D, M, I\}} K_a = 1$ with $0 \le K_a \le 1$; $N(\cdot)$ is a normalization function onto a fixed interval; $s_D(x, y)$, $s_M(x, y)$ and $s_I(x, y)$ denote the pixel values of the pixels at coordinates (x, y) in $S_D$, $S_M$ and $S_I$ respectively; $\Theta_{ab}(x, y)$ is the visual attention correlation value, $\Theta_{ab}(x, y) = \min(s_a(x, y), s_b(x, y))$, where min() is the minimum-value function; $C_{ab}$ is the correlation coefficient, satisfying the condition $\sum_{a, b \in \{D, M, I\}, a \neq b} C_{ab} = 1$ with $0 \le C_{ab} < 1$; the correlation coefficient $C_{DM}$ denotes the degree of correlation between $S_D$ and $S_M$, the correlation coefficient $C_{DI}$ denotes the degree of correlation between $S_D$ and $S_I$, and the correlation coefficient $C_{IM}$ denotes the degree of correlation between $S_I$ and $S_M$; $a, b \in \{D, M, I\}$ and $a \neq b$.
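Since the explicit expression for s(x, y) is not reproduced in the text above, the sketch below is only one fusion consistent with the stated ingredients: the weighting coefficients $K_a$, the pairwise correlation values $\Theta_{ab} = \min(s_a, s_b)$ with coefficients $C_{ab}$, and modulation by the scale-transformed depth Q(d) = d + γ; the specific combination and all default values are assumptions.

```python
import numpy as np

def fuse_attention(s_d, s_m, s_i, depth, gamma=16.0,
                   k=(1/3, 1/3, 1/3), c=(1/3, 1/3, 1/3)):
    """ASSUMED fusion: weighted maps + weighted pairwise correlations,
    modulated by the scale-transformed depth Q(d) = d + gamma.
    k = (K_D, K_M, K_I), sum to 1; c = (C_DM, C_DI, C_IM), sum to 1."""
    assert abs(sum(k) - 1.0) < 1e-9 and abs(sum(c) - 1.0) < 1e-9
    q = depth.astype(np.float32) + gamma          # step 4-1
    weighted = k[0] * s_d + k[1] * s_m + k[2] * s_i
    corr = (c[0] * np.minimum(s_d, s_m)           # Theta_DM
            + c[1] * np.minimum(s_d, s_i)         # Theta_DI
            + c[2] * np.minimum(s_i, s_m))        # Theta_IM
    s = (weighted + corr) * q / q.max()           # ASSUMED modulation
    return (s - s.min()) / (s.max() - s.min() + 1e-9) * 255.0
```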
The specific process of thresholding and macro block post-processing the distribution graph S of the three-dimensional visual attention in the fifth step is as follows:
⑤-1, recording the pixel value of the pixel at coordinates (x, y) in the distribution map S of the three-dimensional visual attention as s(x, y), and defining a third threshold $T_S$, $T_S = k_T \cdot \sum_{y=0}^{H-1} \sum_{x=0}^{W-1} s(x, y) / (W \times H)$, where W is the width of the distribution map S of the three-dimensional visual attention, H is the height of the distribution map S of the three-dimensional visual attention, and $k_T \in (0, 3)$; creating a new preliminary binary mask image and judging whether $s(x, y) \ge T_S$; if so, marking the pixel at coordinates (x, y) in the preliminary binary mask image as an interested pixel, and otherwise marking the pixel at coordinates (x, y) in the preliminary binary mask image as a non-interested pixel;
⑤-2, dividing the preliminary binary mask image into $(W/w_2) \times (H/h_2)$ non-overlapping blocks of size $w_2 \times h_2$, and recording the block with block abscissa u and block ordinate v as $B_{u,v}$, where $u \in [0, W/w_2 - 1]$ and $v \in [0, H/h_2 - 1]$; determining, according to each block in the preliminary binary mask image, whether the pixels in each corresponding block in the current texture video frame are interested pixels or non-interested pixels: for the block $B_{u,v}$, judging whether the number of pixels marked as interested pixels in $B_{u,v}$ is larger than a set fourth threshold $T_b$, where $0 \le T_b \le w_2 \times h_2$; if so, marking all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as interested pixels and taking the block corresponding to $B_{u,v}$ as a region-of-interest block, and otherwise marking all pixels in the block corresponding to $B_{u,v}$ as non-interested pixels and taking the block corresponding to $B_{u,v}$ as a non-region-of-interest block, thereby obtaining the preliminary region-of-interest mask image of the current texture video frame, which is composed of region-of-interest blocks and non-region-of-interest blocks;
⑤-3, marking all pixels in the non-region-of-interest blocks nearest to the region-of-interest blocks in the preliminary region-of-interest mask image as the $N_R$-th level transition region of interest and updating the preliminary region-of-interest mask image; then marking all pixels in the non-region-of-interest blocks nearest to the $N_R$-th level transition region of interest in the updated preliminary region-of-interest mask image as the $(N_R - 1)$-th level transition region of interest and recursively updating the preliminary region-of-interest mask image; repeating this recursion until the 1st level transition region of interest has been marked; finally obtaining the final region-of-interest mask image of the current texture video frame, which is composed of region-of-interest blocks, $N_R$ levels of transition regions of interest, and non-region-of-interest blocks;
⑤-4, recording the pixel value of the pixel at coordinates (x, y) in the final region-of-interest mask image as r(x, y); setting the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255; setting the pixel values of all pixels in the e-th level transition region of interest of the final region-of-interest mask image to $r(x, y) = \frac{e}{N_R + 1} \times f(x, y)$; and setting the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to r(x, y) = f(x, y), thereby obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, $e \in [1, N_R]$, and f(x, y) denotes the pixel value of the pixel at coordinates (x, y) in the current texture video frame.
In step ⑤-2, $w_2 = 16$, $h_2 = 16$, and the fourth threshold $T_b = 50$.
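A sketch of steps ⑤-1 to ⑤-4 with the stated parameters $w_2 = h_2 = 16$ and $T_b = 50$; the values of $k_T$ and $N_R$, the 4-neighbour notion of "nearest" blocks, and a grayscale texture frame are assumptions.

```python
import numpy as np

def dilate_blocks(mask):
    """One-step 4-neighbour dilation of a boolean block grid."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def roi_mask(s, frame, k_t=1.0, w2=16, h2=16, t_b=50, n_r=2):
    """Steps 5-1 to 5-4 (sketch): threshold the attention map S, make the
    block decision, grow N_R transition levels, and fill the mask image.
    Assumes W and H are divisible by w2 and h2."""
    h, w = s.shape
    t_s = k_t * s.mean()                 # T_S = k_T * mean(s)
    interested = s >= t_s                # preliminary binary mask

    # 5-2: ROI blocks where more than T_b pixels are interested pixels.
    bh, bw = h // h2, w // w2
    roi = np.zeros((bh, bw), bool)
    for v in range(bh):
        for u in range(bw):
            roi[v, u] = interested[v*h2:(v+1)*h2, u*w2:(u+1)*w2].sum() > t_b

    # 5-3: transition levels N_R, N_R-1, ..., 1 grown around the ROI blocks.
    level = np.where(roi, n_r + 1, 0)    # code n_r+1 marks an ROI block
    marked = roi.copy()
    for e in range(n_r, 0, -1):          # e-th level transition region
        ring = dilate_blocks(marked) & ~marked
        level[ring] = e
        marked |= ring

    # 5-4: r = 255 in non-ROI, (e/(N_R+1))*f in level-e transition, f in ROI.
    r = np.full((h, w), 255.0, np.float32)
    f = frame.astype(np.float32)
    for v in range(bh):
        for u in range(bw):
            blk = (slice(v*h2, (v+1)*h2), slice(u*w2, (u+1)*w2))
            if level[v, u] == n_r + 1:
                r[blk] = f[blk]
            elif level[v, u] > 0:
                r[blk] = level[v, u] / (n_r + 1.0) * f[blk]
    return r
```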
Compared with the prior art, the method has the advantage of jointly utilizing temporally synchronized texture video frames and their corresponding depth video frames. First, the static image domain visual attention of the texture video frame is extracted to obtain its distribution map; the motion visual attention is extracted from temporally consecutive texture video frames to obtain the distribution map of the motion visual attention of the texture video frame; and the depth visual attention of the depth video frame is extracted to obtain the depth visual attention distribution map of the three-dimensional video image jointly represented by the depth video frame and the texture video frame. Then, using the distribution maps of the static image domain visual attention, the motion visual attention and the depth visual attention together with the depth information, the distribution map of three-dimensional (stereoscopic) visual attention conforming to human stereoscopic vision is obtained through a depth-perception-based fusion method, and thresholding and macroblock post-processing operations yield the final video region of interest conforming to human stereoscopic perception, together with the mask image of the corresponding regions of interest and non-interest. The regions of interest extracted by the method fuse the static image domain visual attention, the motion visual attention and the depth visual attention, effectively suppressing the inherent one-sidedness and inaccuracy of each individual visual attention cue: the noise problem caused by complex backgrounds in static image domain visual attention is overcome, as is the inability of motion visual attention to extract regions of interest with local motion or small motion amplitude; the computational precision is improved, the stability of the algorithm is enhanced, and regions of interest can be extracted from backgrounds and motion environments with complex textures. In addition, the region of interest obtained by the method accords not only with the human visual interest characteristics for static texture video frames and for moving objects, but also with the depth perception characteristic of interest in objects with strong depth contrast or at close distance in stereoscopic vision, thus according with the semantic characteristics of human stereoscopic vision.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention relates to a visual-attention-based video region-of-interest extraction method which jointly utilizes the information of a texture video and of its temporally synchronized depth video to extract the video region of interest. In the present embodiment, the texture video is a two-dimensional color video, exemplified by the test sequence "Ballet" two-dimensional color video and the "Door Flower" two-dimensional color video. Figure 1a shows the color video frame at time t in the test sequence "Ballet" two-dimensional color video, and figure 1b shows the color video frame at time t in the test sequence "Door Flower" two-dimensional color video; figure 2a shows the depth video frame at time t in the depth video corresponding to the test sequence "Ballet" two-dimensional color video, and figure 2b shows the depth video frame at time t in the depth video corresponding to the test sequence "Door Flower" two-dimensional color video. The depth video frame at each moment in the depth video corresponding to the two-dimensional color video is a grayscale map represented with $Z_D$-bit depth, whose grayscale values represent the relative distance from the object represented by each pixel in the depth video frame to the camera. The size of the texture video frame at each moment in the texture video is defined as W × H; for the depth video frame at each moment in the depth video corresponding to the texture video, if its size differs from the size of the texture video frame, it is generally set to the same size as the texture video frame, namely W × H, by existing methods such as scaling and interpolation, where W is the width of the texture video frame at each moment in the texture video or of the depth video frame at each moment in the depth video, and H is the height of the texture video frame at each moment in the texture video or of the depth video frame at each moment in the depth video; the size of the depth video frame is set the same as that of the texture video frame in order to extract the video region of interest more conveniently.
The general flow diagram of the method of the present invention is shown in fig. 3, and specifically includes the following steps:
firstly, defining a two-dimensional color video as the texture video, defining the size of the texture video frame at each moment in the texture video as W × H, W being the width and H being the height of the texture video frame at each moment in the texture video, recording the texture video frame at time t in the texture video as $F_t$, defining the texture video frame $F_t$ at time t as the current texture video frame, and detecting the static image domain visual attention of the current texture video frame by the known still-image visual attention detection method to obtain a distribution map of the static image domain visual attention of the current texture video frame, denoted $S_I$; the distribution map $S_I$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher relative degree of human attention to the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree.
In this embodiment, the block diagram of the process of detecting the static image domain visual attention of the current texture video frame by the known still-image visual attention detection method is shown in fig. 4; each rectangle in fig. 4 represents a data processing step, each diamond represents an image, and diamonds of different sizes represent images of different resolutions that serve as input and output data of the corresponding operations. The current texture video frame is an image in RGB format, each pixel being represented by the three color channels R, G and B. First, the color channel components of each pixel of the current texture video frame are linearly transformed and decomposed into one luminance component map and two chrominance component maps, namely a red-green component map and a blue-yellow component map, denoted I, RG and BY respectively. The pixel value of the luminance component map I at coordinates (x, y) is $I_{x,y} = (r_{x,y} + g_{x,y} + b_{x,y})/3$, where $I_{x,y}$ denotes the value of the luminance component at coordinates (x, y), and $r_{x,y}$, $g_{x,y}$ and $b_{x,y}$ denote the pixel values of the R, G and B color channels of the current texture video frame at coordinates (x, y); $RG_{x,y}$ denotes the value of the red-green component map RG at coordinates (x, y), and $BY_{x,y}$ denotes the value of the blue-yellow component map BY at coordinates (x, y). Four directional component maps of the luminance component map, in the 0, π/4, π/2 and 3π/4 directions, are extracted with known Gabor filters and denoted $O_\theta^T$, $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$. The one luminance component map, the two chrominance component maps and the four directional component maps are each decomposed by Gaussian pyramid into $n_L$ layers, where $n_L$ is a positive integer less than 20. Denoting any component map uniformly by l, the i-th layer component map obtained after the Gaussian pyramid decomposition of the component map l is recorded as l(i), where $i \in [0, n_L-1]$ and $l \in \{I\} \cup \{RG, BY\} \cup \{O_0^T, O_{\pi/4}^T, O_{\pi/2}^T, O_{3\pi/4}^T\}$; 7 component maps are thus generated in total, each decomposed into $n_L$ layer component maps, giving $7 \times n_L$ layer component maps; in this example $n_L$ takes the value 9. The feature map of each component (chrominance, luminance and direction) is calculated from the extracted layer component maps as $\bar F_l = N\big(\oplus_c \oplus_s N(|l(c) \ominus l(s)|)\big)$ for all $l \in \{I\} \cup \{RG, BY\} \cup \{O_0^T, O_{\pi/4}^T, O_{\pi/2}^T, O_{3\pi/4}^T\}$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, and $N(\cdot)$ is a normalization function onto a fixed interval in which the maximum value denotes the most noticeable and 0 the least noticeable; l(c) denotes the c-th layer component map of the component map l, and l(s) denotes the s-th layer component map of the component map l; the symbol $\ominus$ denotes the cross-level difference operator between l(c) and l(s): if c < s, l(s) is up-sampled to an image with the same resolution as l(c), and then each pixel of l(c) is differenced with the corresponding pixel of the up-sampled l(s); if c > s, l(c) is up-sampled to an image with the same resolution as l(s), and then each pixel of l(s) is differenced with the corresponding pixel of the up-sampled l(c); the symbol $\oplus$ denotes the cross-level addition operator between l(c) and l(s), defined in the same way with summation in place of differencing. The feature maps of all components are linearly fused and normalized to obtain the static image domain visual attention map $S_I$ of the current texture video frame. The size of each image of the test sequences "Ballet" and "Door Flower" is 1024 × 768; the luminance feature map, the chrominance feature map and the direction feature map of the color video frame at time t in the test sequence "Ballet" two-dimensional color video are shown in figs. 7a, 7b and 7c respectively, and those of the color video frame at time t in the test sequence "Door Flower" two-dimensional color video are shown in figs. 12a, 12b and 12c respectively. In this embodiment $Z_S = 8$, i.e. the distribution map $S_I$ of the static image domain visual attention is represented with 8-bit depth; the distribution map of the static image domain visual attention of the color video frame at time t in the test sequence "Ballet" two-dimensional color video is shown in fig. 8a, and that of the color video frame at time t in the test sequence "Door Flower" two-dimensional color video is shown in fig. 13a. Other known visual attention detection methods may also be used as the static image domain visual attention detection method.
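The channel decomposition at the start of this process can be sketched as follows. The luminance formula is as stated; the patent's explicit red-green and blue-yellow expressions are not reproduced above, so the standard Itti opponent-color definitions used here are assumptions.

```python
import numpy as np

def decompose_channels(rgb):
    """Luminance and two chrominance (opponent-color) component maps.
    I = (r+g+b)/3 as stated; the RG/BY formulas below follow Itti's
    standard definitions and are ASSUMED, since the patent's explicit
    expressions are not reproduced in the text."""
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    i_map = (r + g + b) / 3.0
    rg = r - g                       # assumed red-green opponent map
    by = b - (r + g) / 2.0           # assumed blue-yellow opponent map
    return i_map, rg, by
```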
Secondly, detecting the motion visual attention of the current texture video frame by a motion visual attention detection method to obtain a distribution map of the motion visual attention of the current texture video frame, denoted $S_M$; the distribution map $S_M$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher degree of human attention to the relative motion of the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree.
In this embodiment, a flow chart of the motion visual attention detection method is shown in fig. 5, and the specific process of the motion visual attention detection method is as follows:
②-1, recording the texture video frame at time t+j consecutive with the current texture video frame in the texture video as $F_{t+j}$, and recording the texture video frame at time t−j consecutive with the current texture video frame in the texture video as $F_{t-j}$, where $j \in (0, N_F/2]$ and $N_F$ is a positive integer less than 10; in the specific application of this example $N_F = 4$, i.e. the motion region of the texture video is extracted jointly from the current texture video frame and its two preceding and two following frames.
②-2, calculating, by a known optical flow method, the motion vector image in the horizontal direction and the motion vector image in the vertical direction between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, and between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j; recording the horizontal and vertical motion vector images between the current texture video frame and $F_{t+j}$ as $V_{t+j}^H$ and $V_{t+j}^V$, and the horizontal and vertical motion vector images between the current texture video frame and $F_{t-j}$ as $V_{t-j}^H$ and $V_{t-j}^V$; $V_{t+j}^H$, $V_{t+j}^V$, $V_{t-j}^H$ and $V_{t-j}^V$ all have width W and height H.
②-3, superposing the absolute value of $V_{t+j}^H$ and the absolute value of $V_{t+j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t+j}$ at time t+j, denoted $M_{t+j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t+j}$ as $m_{t+j}(x, y)$; superposing the absolute value of $V_{t-j}^H$ and the absolute value of $V_{t-j}^V$ to obtain the motion amplitude image between the current texture video frame and the texture video frame $F_{t-j}$ at time t−j, denoted $M_{t-j}$, and recording the motion amplitude value of the pixel at coordinates (x, y) in $M_{t-j}$ as $m_{t-j}(x, y)$.
②-4, extracting a joint motion map, denoted $M_j^\Delta$, by using the current texture video frame, the texture video frame $F_{t+j}$ at time t+j and the texture video frame $F_{t-j}$ at time t−j; the specific process of extracting $M_j^\Delta$ is: judging whether the minimum of the motion amplitude values of the pixels at corresponding coordinates in the motion amplitude image $M_{t+j}$ and in the motion amplitude image $M_{t-j}$ is larger than a set first threshold $T_1$; if so, setting the pixel value of the pixel at the corresponding coordinates in the joint motion map $M_j^\Delta$ to the average of the motion amplitude values of the pixels at the corresponding coordinates in $M_{t+j}$ and $M_{t-j}$, and otherwise setting it to 0; that is, for the pixel at coordinates (x, y) in $M_{t+j}$ and the pixel at coordinates (x, y) in $M_{t-j}$, judging whether $\min(m_{t+j}(x, y), m_{t-j}(x, y)) > T_1$; if so, the pixel value of the pixel at (x, y) in $M_j^\Delta$ is $(m_{t+j}(x, y) + m_{t-j}(x, y))/2$, and otherwise it is 0, where min() is the minimum-value function. Here the first threshold $T_1 = 1$, to filter out small noise points caused by very slight camera jitter.
②-5, weighting and superposing the joint motion maps at each offset from 1 to $N_F/2$ from time t to obtain the weighted joint motion map of the current texture video frame, denoted M; recording the pixel value of the pixel at coordinates (x, y) in the weighted joint motion map M as m(x, y), with $m(x, y) = \sum_{j=1}^{N_F/2} \zeta_j \, m_j^\Delta(x, y)$, where $m_j^\Delta(x, y)$ denotes the pixel value of the pixel at coordinates (x, y) in the joint motion map $M_j^\Delta$ at offset j from time t, and $\zeta_j$ is a weighting coefficient satisfying $\sum_{j=1}^{N_F/2} \zeta_j = 1$.
In a video, moving objects are the main regions of interest; however, the degree of attention people pay to them differs with the type of motion. The motion in a video falls mainly into the following two cases. First, for shooting with a static camera, the background is static and the moving object is the main object of interest. Second, for shooting with a moving camera, the background moves globally, and the moving object either keeps relatively still with respect to the camera or moves inconsistently with the background; in this case the moving object is still the object of interest. From the above analysis, the motion attention region mainly arises from motion attributes of an object that differ from the motion attributes of the background environment, i.e. a region of large motion contrast, so the following steps can be adopted to obtain the motion visual attention.
②-6, performing Gaussian pyramid decomposition of the weighted joint motion map M of the current texture video frame into $n_L$ layers of weighted joint motion maps; recording the i-th layer weighted joint motion map obtained after the Gaussian pyramid decomposition of the weighted joint motion map M as M(i), the width and height of M(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, the 0th layer is the bottom layer and the $(n_L-1)$-th layer is the highest layer, W is the width of the current texture video frame, and H is the height of the current texture video frame; in the specific application of this embodiment $n_L$ takes the value 9.
②-7, extracting the distribution map $S_M$ of the motion visual attention of the current texture video frame by using the $n_L$ layers of weighted joint motion maps of the current texture video frame; recording the pixel value of the pixel at coordinates (x, y) in $S_M$ as $s_m(x, y)$, with $S_M = \bar F_M$ and $\bar F_M = N\big(\oplus_c \oplus_s N(|M(c) \ominus M(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, M(c) is the c-th layer weighted joint motion map, and M(s) is the s-th layer weighted joint motion map; the symbol $\ominus$ denotes the cross-level difference operator applied to M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and then each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and then each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c); the symbol $\oplus$ denotes the cross-level addition operator applied to M(c) and M(s), defined in the same way with summation in place of differencing.
The distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Ballet" two-dimensional color video is processed by this step is shown in fig. 8b; fig. 13b shows the distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Door Flower" two-dimensional color video is processed by this step.
Thirdly, defining the depth video frame at each moment in the depth video corresponding to the texture video as a grayscale map represented with $Z_D$-bit depth, whose grayscale values from 0 to $2^{Z_D}-1$ represent the relative distance from the photographed object represented by each pixel in the depth video frame to the shooting camera, grayscale value 0 corresponding to the maximum depth and grayscale value $2^{Z_D}-1$ to the minimum depth; setting the size of the depth video frame at each moment in the depth video to W × H, W being the width and H being the height of the depth video frame at each moment in the depth video; recording the depth video frame at time t in the depth video as $D_t$, defining the depth video frame $D_t$ at time t as the current depth video frame, and detecting, by a depth visual attention detection method, the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame to obtain a distribution map of the depth visual attention of the three-dimensional video image, denoted $S_D$; the distribution map $S_D$ has a size of W × H and is a grayscale map represented with $Z_S$-bit depth, where a larger pixel value of a pixel in the grayscale map indicates a higher degree of human attention to the relative depth of the corresponding pixel of the current texture video frame, and a smaller pixel value a lower degree. In this embodiment, each pixel of the depth video frame is represented with $Z_D = 8$-bit depth, and each pixel of the visual attention distribution map is represented with $Z_S = 8$-bit depth.
Stereoscopic impression is the main characteristic distinguishing stereoscopic video from conventional single-channel video. For the visual attention of stereoscopic video, depth perception influences the user's visual attention mainly in two respects: on the one hand, the user's degree of interest in scenery (or objects) close to the shooting camera array is generally greater than that in scenery (or objects) far from the shooting camera array; on the other hand, depth-discontinuous regions provide the user with strong depth contrast. In this embodiment, a flow chart of the depth visual attention detection method is shown in fig. 6, and the specific process of the depth visual attention detection method is as follows:
③-1, performing Gaussian pyramid decomposition of the current depth video frame to obtain $n_L$ layers of depth video frames; recording the i-th layer depth video frame obtained after the Gaussian pyramid decomposition of the current depth video frame as D(i), the width and height of D(i) being $W/2^i$ and $H/2^i$ respectively, where $n_L$ is a positive integer less than 20, $i \in [0, n_L-1]$, the 0th layer is the bottom layer with the largest resolution and $D(0) = D_t$, the $(n_L-1)$-th layer is the highest layer with the lowest resolution, W is the width of the current depth video frame, and H is the height of the current depth video frame.
③-2, extracting the depth feature map of the current depth video frame, denoted $\bar F_D$, by using the $n_L$ layers of depth video frames of the current depth video frame, with $\bar F_D = N\big(\oplus_c \oplus_s N(|D(c) \ominus D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, D(c) is the c-th layer depth video frame, and D(s) is the s-th layer depth video frame; the symbol $\ominus$ denotes the cross-level difference operator applied to D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and then each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and then each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c); the symbol $\oplus$ denotes the cross-level addition operator applied to D(c) and D(s), defined in the same way with summation in place of differencing.
③-3, a depth edge region with a larger depth difference gives the user a stronger sense of depth, so the depth edge regions in the current depth video frame are important regions of depth visual attention. Convolution operations are performed on the current depth video frame with known Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components of these directions and obtain four directional component maps of the current depth video frame, denoted $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ respectively; each of the directional component maps $O_0^D$, $O_{\pi/4}^D$, $O_{\pi/2}^D$ and $O_{3\pi/4}^D$ of the current depth video frame is decomposed by Gaussian pyramid into $n_L$ layers of directional component maps; the i-th layer directional component map obtained by the Gaussian pyramid decomposition of the directional component map of direction θ is recorded as $O_\theta^D(i)$, the width and height of $O_\theta^D(i)$ being $W/2^i$ and $H/2^i$ respectively, where $\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}$, $i \in [0, n_L-1]$, the 0th layer is the bottom layer with $O_\theta^D(0) = O_\theta^D$, the $(n_L-1)$-th layer is the highest layer, W is the width of the current depth video frame, and H is the height of the current depth video frame.
③-4, extracting a preliminary depth direction feature map of the current depth video frame, denoted $\bar F'_{DO}$, by using the $n_L$ layers of directional component maps of each direction of the current depth video frame, with $\bar F'_{DO} = \frac{1}{4} \sum_{\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}} \bar F_{O_\theta}$ and $\bar F_{O_\theta} = N\big(\oplus_c \oplus_s N(|O_\theta^D(c) \ominus O_\theta^D(s)|)\big)$, where $s, c \in [0, n_L-1]$, $s = c + \delta$, $\delta \in \{-3, -2, -1, 1, 2, 3\}$, $N(\cdot)$ is a normalization function onto a fixed interval, the symbol "| |" is the absolute-value operator, $O_\theta^D(c)$ is the c-th layer directional component map of the direction θ, and $O_\theta^D(s)$ is the s-th layer directional component map of the direction θ; the symbol $\ominus$ denotes the cross-level difference operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$: if c < s, $O_\theta^D(s)$ is up-sampled to an image with the same resolution as $O_\theta^D(c)$, and then each pixel of $O_\theta^D(c)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(s)$; if c > s, $O_\theta^D(c)$ is up-sampled to an image with the same resolution as $O_\theta^D(s)$, and then each pixel of $O_\theta^D(s)$ is differenced with the corresponding pixel of the up-sampled $O_\theta^D(c)$; the symbol $\oplus$ denotes the cross-level addition operator applied to $O_\theta^D(c)$ and $O_\theta^D(s)$, defined in the same way with summation in place of differencing.
③-5, performing, with the known morphological dilation algorithm using a structuring element of size $w_1 \times h_1$ as the basic dilation unit, $n_1$ dilation operations on the preliminary depth direction feature map $\bar F'_{DO}$ of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted $\bar F_{DO}$. In this embodiment, for the "Ballet" and "Door Flower" test sequences, where the size of each image in the test sequence is 1024 × 768, the basic unit of morphological dilation is an 8 × 8 block, i.e. $w_1 \times h_1 = 8 \times 8$, and the number of dilations $n_1 = 2$.
Thirdly-6, using the depth feature map $F_D$ of the current depth video frame and the depth direction feature map $F_{DO}$ of the current depth video frame, obtain the distribution map of the preliminary depth visual attention of the current depth video frame, denoted $S'_D$. Denote the pixel value of the pixel with coordinate $(x, y)$ in $S'_D$ as $s'_d(x, y)$, where $N(\cdot)$ is a normalization function to the interval $[0, 2^{Z_S}-1]$.
Thirdly-7, at the left and right borders of the image, the left image boundary of the left-view image has no corresponding region in the right-view image, so no stereoscopic effect can be formed in the human brain; similarly, it is difficult to form a stereoscopic effect at the right image boundary of the right-view image. Therefore, in a stereoscopic video the left and right boundary regions of the image provide a weak stereoscopic effect or none at all and are non-stereoscopic visual attention regions, so the invention suppresses these regions in the distribution map $S'_D$ of the preliminary depth visual attention of the current depth video frame, and uses the suppressed $S'_D$ to obtain the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame. Denote the pixel value of the pixel with coordinate $(x, y)$ in $S_D$ as $s_d(x, y)$, $s_d(x, y) = s'_d(x, y) \cdot g(x, y)$, where $g(x, y)$ is a left/right boundary suppression function, W is the width of the current depth video frame, H is the height of the current depth video frame, b is a set second threshold, and the symbol "|" is the "or" operator. Here the second threshold b takes the value 16. The function $g(x, y)$ may also be another two-dimensional function that suppresses the edge regions of the image, such as a two-dimensional Gaussian function whose template size equals the size of the texture video frame.
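A sketch of this boundary suppression, assuming NumPy. The exact form of $g(x, y)$ is an assumption here: an indicator mask that zeroes columns within b = 16 pixels of the left and right image borders, consistent with the "or" operator and the threshold b defined above.

```python
import numpy as np

def suppress_lr_borders(s_d_prime, b=16):
    """Multiply the preliminary depth attention map by an assumed
    left/right boundary suppression mask g(x, y)."""
    h, w = s_d_prime.shape
    x = np.arange(w)
    g = np.where((x < b) | (x > w - 1 - b), 0.0, 1.0)  # assumed indicator form
    return s_d_prime * g[np.newaxis, :]                # broadcast over rows
```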
FIG. 8c shows the depth visual attention distribution map of the three-dimensional video image jointly displayed by the color video frame at time t in the two-dimensional color video of the test sequence "Ballet" and the corresponding depth video frame; fig. 13c shows the depth visual attention distribution map of the three-dimensional video image jointly displayed by the color video frame at time t in the two-dimensional color video of the test sequence "Door Flower" and the corresponding depth video frame.
Fourthly, a visual attention fusion method based on depth perception is adopted to fuse the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, so as to extract a distribution map of the three-dimensional visual attention conforming to human stereoscopic perception, denoted S; the distribution map S of the three-dimensional visual attention has size W × H and is a grayscale map represented with $Z_S$-bit depth. The larger the pixel value of a pixel in this grayscale map, the higher the relative degree of attention of human eyes to the corresponding pixel of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame; the smaller the pixel value, the lower that relative degree of attention.
In a traditional single channel, a moving object is more likely to attract the viewer's attention than a static object; among objects that are all static, brightly colored regions, regions with high color or brightness contrast, and regions with large texture direction differences are more likely to attract attention. In a stereoscopic video, besides motion visual attention and static image domain visual attention, the visual attention distribution of human eyes is also influenced by the special stereoscopic perception the video provides to users. Stereoscopic vision arises mainly from the small positional offset, i.e. the parallax, between the scenes seen by the left and right eyes: the two eyes are about 6 cm apart, the images received by them project onto the retinas with a small positional offset, and the brain automatically integrates this offset into a stereoscopic image with depth, forming stereoscopic vision; the relative distance information of objects conveyed by stereoscopic vision is another important factor that directly influences a person's attention selection. In a stereoscopic video, objects in depth-discontinuous regions or regions with large depth contrast give the user a stronger sense of depth difference, have a stronger stereoscopic impression or depth feeling, and are among the regions in which the user is interested; moreover, viewers are more interested in foreground regions close to the shooting camera (or video viewer) than in regions far from it, so foreground regions are usually important potential candidates for the stereoscopic video viewer's region of interest. Based on the above analysis, the factors influencing the three-dimensional visual attention of human eyes are determined to be four: static image domain visual attention, motion visual attention, depth visual attention and depth itself. The specific process of the depth-perception-based visual attention fusion method in this embodiment is therefore as follows:
Fourthly-1, perform a scale transformation on the current depth video frame by $Q(d(x, y)) = d(x, y) + \gamma$, where γ is a coefficient within a set range, $d(x, y)$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the current depth video frame, and $Q(d(x, y))$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the scaled current depth video frame.
Fourthly-2, using the scaled current depth video frame, the distribution map $S_D$ of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, and the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, obtain the distribution map S of the three-dimensional visual attention. The pixel value of the pixel with coordinate $(x, y)$ in S is $s(x, y)$, which combines the scaled depth $Q(d(x, y))$, the weighted attention values $K_D s_D(x, y)$, $K_M s_M(x, y)$ and $K_I s_I(x, y)$, and the visual attention correlation terms $C_{ab}\Theta_{ab}(x, y)$, where $K_D$, $K_M$ and $K_I$ are the weighting coefficients of $S_D$, $S_M$ and $S_I$ respectively and satisfy the condition

$$\sum_{a \in \{D, M, I\}} K_a = 1, \qquad 0 \le K_a \le 1,$$

$N(\cdot)$ is a normalization function to the interval $[0, 2^{Z_S}-1]$; $s_D(x, y)$, $s_M(x, y)$ and $s_I(x, y)$ denote the pixel values of the pixel with coordinate $(x, y)$ in $S_D$, $S_M$ and $S_I$ respectively; $\Theta_{ab}(x, y)$ is the visual attention correlation value, $\Theta_{ab}(x, y) = \min(s_a(x, y), s_b(x, y))$, where $\min(\cdot)$ is the minimum function; and $C_{ab}$ are the correlation coefficients, satisfying the condition

$$\sum_{a, b \in \{D, M, I\},\, a \ne b} C_{ab} = 1, \qquad 0 \le C_{ab} < 1,$$

where the correlation coefficient $C_{DM}$ denotes the degree of correlation between $S_D$ and $S_M$, the correlation coefficient $C_{DI}$ denotes the degree of correlation between $S_D$ and $S_I$, the correlation coefficient $C_{IM}$ denotes the degree of correlation between $S_I$ and $S_M$, and $a, b \in \{D, M, I\}$ with $a \ne b$.
Motion visual attention, static image domain visual attention and depth visual attention all play important roles in human visual attention; however, motion visual attention is the most important component of video visual attention, followed by static image domain visual attention caused by the brightness, color and direction of the image domain, and then by depth visual attention. In this embodiment, each visual attention distribution map is represented with $Z_S = 8$ bit depth, and the weights are taken as $K_D = 0.15$, $K_M = 0.4$ and $K_I = 0.35$. The degree of correlation between depth visual attention and motion visual attention is small, the degree of correlation between depth visual attention and static image domain visual attention is small, and the degree of correlation between static image domain visual attention and motion visual attention is large; the correlation coefficients $C_{DM}$, $C_{DI}$ and $C_{IM}$ are set accordingly. The scale transformation coefficient γ reflects the scene depth of the texture video scene: the smaller γ is, the greater the scene depth and the stronger the depth feeling given to the viewer; conversely, the larger γ is, the smaller the scene depth and the weaker the depth feeling. For the "Ballet" and "Door Flower" test sequences, whose scene depths are small, the scale transformation coefficient γ is set to 50. The distribution map of the three-dimensional visual attention extracted from the texture video frame at time t of the "Ballet" test sequence and the corresponding depth video frame is shown in FIG. 9, and that extracted from the texture video frame at time t of the "Door Flower" test sequence and the corresponding depth video frame is shown in FIG. 14.
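A sketch of the fusion of step Fourthly, assuming NumPy. The exact combination formula is not reproduced in this text, so the form below, in which the scaled depth multiplies a weighted sum of the three attention maps plus the min-based correlation terms before normalization, is one plausible instantiation rather than the patent's verbatim formula; the correlation coefficient values in `c` are likewise assumed.

```python
import numpy as np

def fuse_attention(depth, s_d, s_m, s_i, gamma=50,
                   k=(0.15, 0.4, 0.35), c=(0.02, 0.02, 0.06), z_s=8):
    """Fuse depth, motion and static-image attention maps into one map."""
    q = depth.astype(np.float64) + gamma          # step Fourthly-1 scale transform
    k_d, k_m, k_i = k
    c_dm, c_di, c_im = c                          # correlation coefficients (assumed values)
    weighted = k_d * s_d + k_m * s_m + k_i * s_i
    corr = (c_dm * np.minimum(s_d, s_m) +
            c_di * np.minimum(s_d, s_i) +
            c_im * np.minimum(s_i, s_m))          # Theta_ab = min(s_a, s_b)
    raw = q * (weighted + corr)                   # assumed combination form
    return (2 ** z_s - 1) * (raw - raw.min()) / (raw.max() - raw.min() + 1e-9)
```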
Fifthly, perform thresholding and macro-block post-processing on the distribution map S of the three-dimensional visual attention to obtain the final region of interest of the current texture video frame that conforms to human stereoscopic perception.
In this embodiment, the specific process of thresholding and macro-block post-processing the distribution map S of the three-dimensional visual attention is as follows:
Fifthly-1, denote the pixel value of the pixel with coordinate $(x, y)$ in the distribution map S of the three-dimensional visual attention as $s(x, y)$, and define a third threshold $T_S$:

$$T_S = k_T \cdot \sum_{y=0}^{H-1}\sum_{x=0}^{W-1} s(x, y) \,/\, (W \times H),$$

where W is the width of the distribution map S of the three-dimensional visual attention, H is the height of the distribution map S of the three-dimensional visual attention, and $k_T \in (0, 3)$; in the application process of this embodiment $k_T$ takes the value 1.5. Create a new preliminary binary mask image and judge whether $s(x, y) \ge T_S$: if so, mark the pixel with coordinate $(x, y)$ in the preliminary binary mask image as a pixel of interest; otherwise, mark the pixel with coordinate $(x, y)$ in the preliminary binary mask image as a non-interest pixel.
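A minimal sketch of this thresholding step, assuming NumPy: the third threshold $T_S$ is $k_T$ times the mean of the attention map, and pixels at or above $T_S$ become pixels of interest in a preliminary binary mask.

```python
import numpy as np

def preliminary_mask(s, k_t=1.5):
    """Return a boolean mask: True = pixel of interest."""
    t_s = k_t * s.mean()          # T_S = k_T * sum(s) / (W * H)
    return s >= t_s
```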
Fifthly-2, divide the preliminary binary mask image into $(W/w_2) \times (H/h_2)$ non-overlapping blocks of size $w_2 \times h_2$, and denote the block with block abscissa u and block ordinate v as $B_{u,v}$, where $u \in [0, W/w_2 - 1]$ and $v \in [0, H/h_2 - 1]$. According to each block in the preliminary binary mask image, determine whether the pixels in the corresponding block of the current texture video frame are pixels of interest or non-interest pixels: for block $B_{u,v}$, judge whether the number of pixels marked as pixels of interest in $B_{u,v}$ is greater than a set fourth threshold $T_b$, where $0 \le T_b \le w_2 \times h_2$. If so, mark all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as pixels of interest and take the block corresponding to $B_{u,v}$ as a region-of-interest block; otherwise, mark all pixels in the block of the current texture video frame corresponding to $B_{u,v}$ as non-interest pixels and take the block corresponding to $B_{u,v}$ as a non-region-of-interest block. This yields the preliminary region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks and non-region-of-interest blocks.
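A sketch of this block post-processing, assuming NumPy: the preliminary mask is cut into non-overlapping $w_2 \times h_2$ blocks, and a block becomes a region-of-interest block when it contains more than $T_b$ pixels of interest.

```python
import numpy as np

def block_postprocess(mask, w2=16, h2=16, t_b=50):
    """mask: boolean preliminary mask. Returns a block-aligned ROI mask."""
    h, w = mask.shape
    roi = np.zeros_like(mask, dtype=bool)
    for v in range(h // h2):
        for u in range(w // w2):
            block = mask[v * h2:(v + 1) * h2, u * w2:(u + 1) * w2]
            if block.sum() > t_b:                 # count of interest pixels
                roi[v * h2:(v + 1) * h2, u * w2:(u + 1) * w2] = True
    return roi
```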
In this embodiment, each image in the test sequences "Ballet" and "Door Flower" is of size 1024 × 768, so the size of block $B_{u,v}$ can be set to $w_2 \times h_2 = 16 \times 16$; a region containing only a small number of pixels of interest is unlikely to interest the viewer, so the fourth threshold $T_b$ is set to 50 here.
Fifthly-3, since the change between the region of interest and the non-interest region is not abrupt but gradual, the invention sets $N_R$ levels of transition regions of interest between the region of interest and the non-interest region, as shown in the sketch after this paragraph. Mark all pixels in the non-region-of-interest blocks most adjacent to the region-of-interest blocks in the preliminary region-of-interest mask image as the $N_R$-th level transition region of interest, and update the preliminary region-of-interest mask image; then mark all pixels in the non-region-of-interest blocks nearest to the $N_R$-th level transition region of interest in the updated preliminary region-of-interest mask image as the $(N_R-1)$-th level transition region of interest, and recursively update the preliminary region-of-interest mask image; repeat this recursion until the level-1 transition region of interest has been marked. This finally yields the final region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks, $N_R$ levels of transition regions of interest and non-region-of-interest blocks. In this embodiment $N_R$ takes the value 2, i.e. a 2-level transition region of interest is set.
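A sketch of this recursive labelling, assuming SciPy: starting from the block-level ROI grid, each successive ring of non-ROI blocks around the ROI receives a decreasing transition level, from $N_R$ (next to the ROI) down to 1. Encoding the ROI itself as level $N_R + 1$ is a choice of this sketch.

```python
import numpy as np
from scipy import ndimage

def transition_levels(roi_blocks, n_r=2):
    """roi_blocks: boolean (H/h2) x (W/w2) block grid. Returns an integer
    grid: 0 = non-ROI, 1..n_r = transition level, n_r + 1 = ROI."""
    levels = np.where(roi_blocks, n_r + 1, 0)
    current = roi_blocks.copy()
    for e in range(n_r, 0, -1):                  # e = n_r, ..., 1
        ring = ndimage.binary_dilation(current) & ~current
        levels[ring] = e
        current |= ring
    return levels
```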
FIG. 10a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Ballet"; FIG. 15a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Door Flower". In fig. 10a and 15a, the black area represents the region of interest, the gray area is the transition region of interest, and the white area is the region of non-interest.
Fifthly-4, denote the pixel value of the pixel with coordinate $(x, y)$ in the final region-of-interest mask image as $r(x, y)$. Set the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to $r(x, y) = 255$; set the pixel values of all pixels in the e-th level transition region of interest of the final region-of-interest mask image to

$$r(x, y) = \frac{e}{N_R + 1} \times f(x, y),$$

and set the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to $r(x, y) = f(x, y)$, obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, $e \in [1, N_R]$, and $f(x, y)$ denotes the pixel value of the pixel with coordinate $(x, y)$ in the current texture video frame.
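A sketch of this final rendering, assuming NumPy and the per-pixel level map encoding from the previous sketch: ROI pixels keep the texture frame's values, level-e transition pixels are dimmed to $e/(N_R+1)$ of their value, and non-ROI pixels are set to 255 (white).

```python
import numpy as np

def render_roi(frame, levels_per_pixel, n_r=2):
    """levels_per_pixel: per-pixel map with 0 = non-ROI, 1..n_r = transition
    level e, n_r + 1 = ROI (block labels expanded to pixel resolution)."""
    out = np.full_like(frame, 255, dtype=np.float64)
    for e in range(1, n_r + 1):
        sel = levels_per_pixel == e
        out[sel] = (e / (n_r + 1)) * frame[sel]   # r = e/(N_R+1) * f
    sel = levels_per_pixel == n_r + 1
    out[sel] = frame[sel]                          # r = f inside the ROI
    return out.astype(np.uint8)
```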
FIG. 10b shows the region of interest of the texture video frame at time t of the test sequence "Ballet"; fig. 15b shows the region of interest of the texture video frame at time t of the test sequence "Door Flower". The regions of interest in fig. 10b and fig. 15b keep the same pixel values as the texture video frame at time t and display the color texture content; the transition regions of interest are displayed as dark gray regions with reduced brightness; and the smooth white regions are the non-interest regions corresponding to the white regions of the region-of-interest mask image. For comparison of extraction effects, fig. 11a and fig. 16a show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted from the static image domain visual attention cue alone, where the noise regions with rich background texture cannot be removed. Fig. 11b and fig. 16b show the regions of interest extracted from the motion visual attention cue alone, as in the prior art: for the "Ballet" sequence, this method cannot completely extract the man whose motion is very slow, and the background noise caused by motion shadow is severe; for the "Door Flower" sequence, only the motion region is extracted, and neither the texture complexity nor the depth feeling provided by stereoscopic vision is considered. Fig. 11c and fig. 16c show the regions of interest extracted by combining the motion visual attention and static image domain visual attention cues; although this method combines static and motion visual information, it cannot effectively suppress the texture regions and motion noise in the background environment.
As can be seen from the comparison between fig. 10a and 10b and fig. 11a, 11b and 11c, and between fig. 15a and 15b and fig. 16a, 16b and 16c, the region of interest extracted by the invention combines static image domain visual attention, motion visual attention and depth visual attention, effectively suppresses the inherent one-sidedness and inaccuracy of each single visual attention extraction, solves the noise problem caused by complex backgrounds in static image domain visual attention, and solves the problem that motion visual attention cannot extract regions of interest with local or small-amplitude motion, thereby improving calculation accuracy, enhancing the stability of the algorithm, and enabling regions of interest to be extracted from backgrounds and motion environments with complex texture. In addition, the region of interest obtained by the invention not only conforms to the visual interest characteristics of human eyes for static texture video frames and for moving objects, but also conforms to the depth perception characteristic of interest in objects with strong depth or at close distance in stereoscopic vision, and thus accords with the semantic characteristics of human stereoscopic vision.
Sixthly, repeating the steps from the first step to the fifth step until all texture video frames in the texture video are processed, and obtaining the video interesting area of the texture video.
In this embodiment, the distribution map $S_I$ of the static image domain visual attention of the current texture video frame, the distribution map $S_M$ of the motion visual attention of the current texture video frame, the distribution map $S_D$ of the depth visual attention of the three-dimensional video image and the distribution map S of the three-dimensional visual attention are all grayscale maps represented with $Z_S$-bit depth, and the depth video frame at each time in the depth video corresponding to the texture video is a grayscale map represented with $Z_D$-bit depth. Here a grayscale map with 256 gray levels is represented with 8-bit depth, so $Z_S = 8$ and $Z_D = 8$; of course, other bit depths may be used to represent the grayscale maps in practical applications, such as 16-bit depth, which gives higher representation accuracy.