CN101651772B - Method for extracting video interested region based on visual attention - Google Patents

Method for extracting video interested region based on visual attention

Info

Publication number
CN101651772B
CN101651772B CN2009101525203A CN200910152520A
Authority
CN
China
Prior art keywords
video frame
depth
pixel
current
visual attention
Prior art date
Legal status
Expired - Fee Related
Application number
CN2009101525203A
Other languages
Chinese (zh)
Other versions
CN101651772A (en)
Inventor
张云 (Zhang Yun)
蒋刚毅 (Jiang Gangyi)
郁梅 (Yu Mei)
Current Assignee
Shanghai Spparks Technology Co ltd
Original Assignee
Ningbo University
Priority date
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN2009101525203A
Publication of CN101651772A
Application granted
Publication of CN101651772B


Landscapes

  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method for extracting video regions of interest based on visual attention. The region of interest extracted by the method fuses static image domain visual attention, motion visual attention and depth perception attention. The fusion effectively suppresses the inherent singleness and inaccuracy of each individual visual attention cue, overcomes the noise caused by complex backgrounds in static image domain visual attention, and overcomes the inability of motion visual attention to extract regions of interest with only local motion or small motion amplitude. The method therefore improves calculation accuracy, enhances algorithm stability, and can extract regions of interest from backgrounds with complex textures and from moving environments. Furthermore, the region of interest obtained by the method accords not only with the human visual characteristics of attending to static texture in a video frame and to moving objects, but also with the depth perception characteristic of stereoscopic vision, in which objects with strong depth perception or at a short distance attract interest, and thus with the semantic characteristics of human binocular stereoscopic vision.

Description

Method for extracting video regions of interest based on visual attention
Technical Field
The invention relates to a video signal processing method, and in particular to a method for extracting video regions of interest based on visual attention.
Background
Stereoscopic television, also called 3DTV (Three-Dimensional Television), has received great attention from research institutes and industry at home and abroad because it provides the leap from flat to stereoscopic display, giving viewers a vivid stereoscopic impression and sense of realism. In 2002, the ATTEST (Advanced Three-Dimensional Television Systems Technology) project was initiated within the IST programme supported by the European Commission, with the aim of establishing a complete, backward-compatible three-dimensional digital television broadcast chain. The ATTEST project proposed the new concept of a 3DTV broadcast chain that is downward compatible with existing two-dimensional broadcasting and broadly supports various kinds of two-dimensional and three-dimensional displays. Its main design idea is to add a depth map (Depth Map) as enhancement-layer information on top of conventional two-dimensional color video transmission, i.e. the "two-dimensional color video plus depth" data representation; the three-dimensional video is decoded and reconstructed at the display terminal from the two-dimensional color video and the depth, and this display mode is already supported by advanced glasses-free auto-stereoscopic display terminals in the industry.
In the human visual receiving and processing system, because brain resources are limited and the importance of external information varies, the human brain does not treat all external information equally; it exhibits selectivity during processing, i.e. different degrees of interest. Extraction of video regions of interest has long been one of the core and most difficult technologies of content-based video processing in fields such as video compression and communication, video retrieval and pattern recognition. Visual psychology studies have shown that this selectivity, or variability in the degree of interest of the human eye with respect to external visual input, is closely and inseparably linked to human visual attention characteristics. Current research on visual attention cues is mainly divided into two branches: top-down (also known as concept-driven) attention cues and bottom-up (also known as stimulus-driven) attention cues. Top-down attention cues arise mainly from complex psychological processes and direct attention to particular objects in the scene, including object shapes, actions and other recognizable features such as patterns; these cues are influenced by factors such as personal knowledge, preferences and the subconscious, and therefore vary from person to person. Bottom-up attention cues, in contrast, come mainly from the direct stimulation of the visual cortex by visual features of the video scene, chiefly color, luminance and orientation. Bottom-up attention is instinctive and automatic, generally applicable, relatively stable, and essentially unaffected by conscious factors such as personal knowledge and preferences, so bottom-up cues are one of the hot topics in research on automatic region-of-interest extraction methods.
Existing automatic region-of-interest extraction methods fall mainly into three categories. 1) Methods that use the intra-image information of a single viewpoint, including stimulus information such as luminance, color, texture or orientation, to extract the region of the current video frame that attracts the human eye; these methods mainly extract regions with large contrast in luminance, color and texture as regions of interest, so they are difficult to apply to region-of-interest extraction in complex background environments. 2) Methods that, based on the visual principle that the human eye is interested in moving regions, use the motion information between video frames as the main cue; however, such methods have difficulty accurately extracting slowly moving or locally moving objects and are also difficult to apply under global motion. 3) Methods that combine static texture and motion information; because the redundancy and correlation between the static texture and the motion information are low, the extraction errors and noise present in either cue cannot be effectively suppressed, so the extraction accuracy remains low. Because of the limited amount of information they use, these three traditional kinds of methods yield regions of interest that are not accurate enough and have poor stability. Moreover, traditional methods do not consider the stereoscopic visual characteristic of being interested in objects with strong depth perception or objects close to the viewer, and therefore cannot properly express the real degree of interest of human eyes with stereoscopic vision, which makes them difficult to apply to extracting regions of interest that accord with the stereoscopic semantic features of new-generation stereoscopic (three-dimensional)/multi-view video.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for extracting video regions of interest based on visual attention, such that the extracted video regions of interest have higher precision and better stability and accord with the semantic characteristics of human binocular stereoscopic vision.
The technical solution adopted by the invention to solve the above technical problem is as follows: a method for extracting video regions of interest based on visual attention, comprising the following steps:
① Define the two-dimensional color video as the texture video and define the size of the texture video frame at each moment in the texture video as W × H, where W is the width and H the height of the texture video frames. Denote the texture video frame at time t in the texture video as F_t and define F_t as the current texture video frame. Detect the static image domain visual attention of the current texture video frame with a known still-image visual attention detection method to obtain the distribution map of the static image domain visual attention of the current texture video frame, denoted S_I; S_I has size W × H and is a grayscale map represented with Z_S-bit depth.
② Detect the motion visual attention of the current texture video frame with the motion visual attention detection method to obtain the distribution map of the motion visual attention of the current texture video frame, denoted S_M; S_M has size W × H and is a grayscale map represented with Z_S-bit depth.
③ Define the depth video frame at each moment in the depth video corresponding to the texture video as a grayscale map represented with Z_D-bit depth, and set the size of the depth video frame at each moment in the depth video to W × H, where W is the width and H the height of the depth video frames. Denote the depth video frame at time t in the depth video as D_t and define D_t as the current depth video frame. Detect, with the depth visual attention detection method, the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, to obtain the distribution map of the depth visual attention of the three-dimensional video image, denoted S_D; S_D has size W × H and is a grayscale map represented with Z_S-bit depth.
④ Using the depth-perception-based visual attention fusion method, fuse the static image domain visual attention distribution map S_I of the current texture video frame, the motion visual attention distribution map S_M of the current texture video frame, and the depth visual attention distribution map S_D of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, to extract a distribution map of three-dimensional visual attention that accords with human stereoscopic perception, denoted S; S has size W × H and is a grayscale map represented with Z_S-bit depth.
⑤ Perform thresholding and macroblock post-processing on the three-dimensional visual attention distribution map S to obtain the final region of interest, conforming to human stereoscopic perception, of the current texture video frame.
⑥ Repeat steps ① to ⑤ until all texture video frames in the texture video have been processed, obtaining the video regions of interest of the texture video.
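For orientation, the six steps above can be read as a single per-frame pipeline. The following Python/NumPy sketch only illustrates that control flow under assumed interfaces; the callables static_attention, motion_attention, depth_attention, fuse_attention and postprocess are hypothetical stand-ins for the detectors and post-processing described in steps ① to ⑤, not part of the patent.

```python
from typing import Callable, List, Sequence
import numpy as np

def extract_video_roi(texture_frames: Sequence[np.ndarray],
                      depth_frames: Sequence[np.ndarray],
                      static_attention: Callable,
                      motion_attention: Callable,
                      depth_attention: Callable,
                      fuse_attention: Callable,
                      postprocess: Callable) -> List[np.ndarray]:
    """Hypothetical driver for steps (1)-(6): one region-of-interest image per texture frame."""
    rois = []
    for t, (f_t, d_t) in enumerate(zip(texture_frames, depth_frames)):
        s_i = static_attention(f_t)                # step 1: S_I from the texture frame
        s_m = motion_attention(texture_frames, t)  # step 2: S_M from neighbouring frames
        s_d = depth_attention(d_t)                 # step 3: S_D from the depth frame
        s = fuse_attention(s_i, s_m, s_d, d_t)     # step 4: depth-perception-based fusion
        rois.append(postprocess(s, f_t))           # step 5: thresholding + macroblock post-processing
    return rois                                    # step 6: repeat for every frame
```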
The specific process of the motion visual attention detection method in step ② is as follows:
②-1. Denote the texture video frame at time t+j, temporally adjacent to the current texture video frame in the texture video, as F_{t+j}, and the texture video frame at time t−j as F_{t−j}, where j ∈ (0, N_F/2] and N_F is a positive integer less than 10.
②-2. Using a known optical flow method, compute the horizontal and vertical motion vector images between the current texture video frame and the texture video frame F_{t+j} at time t+j, and between the current texture video frame and the texture video frame F_{t−j} at time t−j. Denote the horizontal and vertical motion vector images between the current texture video frame and F_{t+j} as V^H_{t+j} and V^V_{t+j}, and those between the current texture video frame and F_{t−j} as V^H_{t−j} and V^V_{t−j}; V^H_{t+j}, V^V_{t+j}, V^H_{t−j} and V^V_{t−j} all have width W and height H.
②-3. Superimpose the absolute value of V^H_{t+j} and the absolute value of V^V_{t+j} to obtain the motion amplitude image of the current texture video frame and F_{t+j}, denoted M_{t+j}, i.e. M_{t+j} = |V^H_{t+j}| + |V^V_{t+j}|; denote the motion amplitude value of the pixel with coordinates (x, y) in M_{t+j} as m_{t+j}(x, y). Likewise, superimpose the absolute value of V^H_{t−j} and the absolute value of V^V_{t−j} to obtain the motion amplitude image of the current texture video frame and F_{t−j}, denoted M_{t−j}, i.e. M_{t−j} = |V^H_{t−j}| + |V^V_{t−j}|; denote the motion amplitude value of the pixel with coordinates (x, y) in M_{t−j} as m_{t−j}(x, y).
②-4. Using the current texture video frame and the texture video frames F_{t+j} and F_{t−j}, extract the joint motion map, denoted M^Δ_j. The specific process of extracting M^Δ_j is as follows: determine, for each pixel, whether the minimum of the motion amplitude values of the pixels at the corresponding coordinates in M_{t+j} and M_{t−j} is greater than a set first threshold T_1; if so, the pixel value of the pixel at the corresponding coordinates in M^Δ_j is the average of the motion amplitude values of the pixels at those coordinates in M_{t+j} and M_{t−j}; otherwise, the pixel value of the pixel at the corresponding coordinates in M^Δ_j is 0. That is, for the pixel with coordinates (x, y) in M_{t+j} and the pixel with coordinates (x, y) in M_{t−j}, if min(m_{t+j}(x, y), m_{t−j}(x, y)) > T_1, then the pixel value of the pixel with coordinates (x, y) in M^Δ_j is
m^Δ_j(x, y) = (m_{t+j}(x, y) + m_{t−j}(x, y)) / 2;
otherwise, the pixel value of the pixel with coordinates (x, y) in M^Δ_j is 0, where min() is the minimum function.
②-5. Weight and superimpose the joint motion maps at distances 1 to N_F/2 from time t to obtain the weighted joint motion map of the current texture video frame, denoted M. Denote the pixel value of the pixel with coordinates (x, y) in M as m(x, y):
m(x, y) = Σ_{j=1}^{N_F/2} ζ_j · m^Δ_j(x, y),
where m^Δ_j(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j at distance j from time t, and ζ_j is a weighting coefficient satisfying Σ_{j=1}^{N_F/2} ζ_j = 1.
②-6. Decompose the weighted joint motion map M of the current texture video frame into n_L layers by Gaussian pyramid decomposition. Denote the i-th layer weighted joint motion map obtained from the Gaussian pyramid decomposition of M as M(i); the width and height of M(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], W is the width of the current texture video frame and H is its height.
②-7. Using the n_L layers of the weighted joint motion map of the current texture video frame, extract the motion visual attention distribution map S_M of the current texture video frame, and denote the pixel value of the pixel with coordinates (x, y) in S_M as s_m(x, y). S_M = F̄_M, where
F̄_M = N( ⊕_{c} ⊕_{s} | M(c) ⊖ M(s) | ),
with s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}; N(·) is a normalization function that normalizes to the interval [0, 2^{Z_S}−1]; the symbol "| |" denotes the absolute value operation; M(c) is the c-th layer weighted joint motion map and M(s) the s-th layer weighted joint motion map. The symbol ⊖ denotes the cross-scale difference operation between M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c) and then each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s) and then each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c). The symbol ⊕ denotes the cross-scale addition operation between M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c) and then each pixel of M(c) is summed with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s) and then each pixel of M(s) is summed with the corresponding pixel of the up-sampled M(c).
The first threshold T_1 set in step ②-4 is T_1 = 1.
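A compact sketch of steps ②-1 to ②-7 is given below, assuming Python with OpenCV and NumPy. The Farnebäck algorithm stands in for the unspecified "known optical flow method", equal weights ζ_j are assumed, the number of pyramid levels is an assumed value, and the across-scale accumulation approximates the ⊖/⊕ operators by always up-sampling the coarser level to the finer one; none of these specific choices is mandated by the claim.

```python
import cv2
import numpy as np

def normalize_map(img, z_s=8):
    """Map an arbitrary-range map onto [0, 2^Z_S - 1] (assumed form of the N(.) operator)."""
    lo, hi = float(img.min()), float(img.max())
    if hi - lo < 1e-12:
        return np.zeros_like(img, np.float32)
    return (img - lo) / (hi - lo) * (2 ** z_s - 1)

def motion_attention(gray_frames, t, n_f=4, t1=1.0, n_l=5, zeta=None):
    """Sketch of steps (2)-1 to (2)-7 for the frame at index t.

    gray_frames: list of 8-bit single-channel frames; Farneback optical flow
    stands in for the unspecified 'known optical flow method'.
    """
    h, w = gray_frames[t].shape
    if zeta is None:
        zeta = [2.0 / n_f] * (n_f // 2)          # equal weights, summing to 1 (assumed)
    m_weighted = np.zeros((h, w), np.float32)
    for j in range(1, n_f // 2 + 1):
        # (2)-2: forward and backward optical flow; (2)-3: motion amplitude images
        fwd = cv2.calcOpticalFlowFarneback(gray_frames[t], gray_frames[t + j],
                                           None, 0.5, 3, 15, 3, 5, 1.2, 0)
        bwd = cv2.calcOpticalFlowFarneback(gray_frames[t], gray_frames[t - j],
                                           None, 0.5, 3, 15, 3, 5, 1.2, 0)
        m_fwd = np.abs(fwd[..., 0]) + np.abs(fwd[..., 1])
        m_bwd = np.abs(bwd[..., 0]) + np.abs(bwd[..., 1])
        # (2)-4: joint motion map, zeroing pixels whose smaller amplitude is <= T_1
        joint = np.where(np.minimum(m_fwd, m_bwd) > t1, (m_fwd + m_bwd) / 2.0, 0.0)
        # (2)-5: weighted superposition of the joint motion maps
        m_weighted += zeta[j - 1] * joint.astype(np.float32)
    # (2)-6: Gaussian pyramid of the weighted joint motion map
    pyr = [m_weighted]
    for _ in range(1, n_l):
        pyr.append(cv2.pyrDown(pyr[-1]))
    # (2)-7: across-scale centre-surround differences, accumulated at full resolution
    acc = np.zeros((h, w), np.float32)
    for c in range(n_l):
        for delta in (-3, -2, -1, 1, 2, 3):
            s = c + delta
            if 0 <= s < n_l:
                fine, coarse = min(c, s), max(c, s)
                up = cv2.resize(pyr[coarse], (pyr[fine].shape[1], pyr[fine].shape[0]))
                acc += cv2.resize(np.abs(pyr[fine] - up), (w, h))
    return normalize_map(acc)
```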
The depth visual attention detection method in the third step comprises the following specific processes:
③-1. Decompose the current depth video frame into n_L layers by Gaussian pyramid decomposition. Denote the i-th layer depth video frame obtained from the Gaussian pyramid decomposition of the current depth video frame as D(i); the width and height of D(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], W is the width of the current depth video frame and H is its height.
③-2. Using the n_L layers of depth video frames of the current depth video frame, extract the depth feature map of the current depth video frame, denoted F_D:
F_D = N( ⊕_{c} ⊕_{s} | D(c) ⊖ D(s) | ),
where s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}; N(·) is a normalization function that normalizes to the interval [0, 2^{Z_S}−1]; the symbol "| |" denotes the absolute value operation; D(c) is the c-th layer depth video frame and D(s) the s-th layer depth video frame. The symbol ⊖ denotes the cross-scale difference operation between D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c) and then each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s) and then each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c). The symbol ⊕ denotes the cross-scale addition operation between D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c) and then each pixel of D(c) is summed with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s) and then each pixel of D(s) is summed with the corresponding pixel of the up-sampled D(c).
③-3. Using known Gabor filters in the 0, π/4, π/2 and 3π/4 directions, perform convolution on the current depth video frame to extract the four direction components in the 0, π/4, π/2 and 3π/4 directions, obtaining four direction component maps of the current depth video frame, denoted O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} respectively. Decompose each of the direction component maps O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} of the current depth video frame into n_L layers by Gaussian pyramid decomposition; denote the i-th layer direction component map obtained from the Gaussian pyramid decomposition of the direction component map in the θ direction as O^D_θ(i), whose width and height are W/2^i and H/2^i respectively, where θ ∈ {0, π/4, π/2, 3π/4}, i ∈ [0, n_L−1], W is the width of the current depth video frame and H is its height.
③-4. Using the n_L layers of the direction component map in each direction of the current depth video frame, extract the preliminary depth direction feature map of the current depth video frame, denoted F′_DO:
F′_DO = (1/4) Σ_{θ ∈ {0, π/4, π/2, 3π/4}} F_{O_θ},
where
F_{O_θ} = N( ⊕_{c} ⊕_{s} | O^D_θ(c) ⊖ O^D_θ(s) | ),
with s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}; N(·) is a normalization function that normalizes to the interval [0, 2^{Z_S}−1]; the symbol "| |" denotes the absolute value operation; O^D_θ(c) is the c-th layer direction component map of the direction component map in the θ direction, and O^D_θ(s) the s-th layer direction component map. The symbol ⊖ denotes the cross-scale difference operation between O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c) and then each pixel of O^D_θ(c) is differenced with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s) and then each pixel of O^D_θ(s) is differenced with the corresponding pixel of the up-sampled O^D_θ(c). The symbol ⊕ denotes the cross-scale addition operation between O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c) and then each pixel of O^D_θ(c) is summed with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s) and then each pixel of O^D_θ(s) is summed with the corresponding pixel of the up-sampled O^D_θ(c).
③-5. Using a known morphological dilation algorithm with a basic dilation element of size w_1 × h_1, perform n_1 dilation operations on the preliminary depth direction feature map F′_DO of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted F_DO.
③-6. Using the depth feature map F_D and the depth direction feature map F_DO of the current depth video frame, obtain the distribution map of the preliminary depth visual attention of the current depth video frame, denoted S′_D, by combining F_D and F_DO and normalizing the result with the normalization function N(·); denote the pixel value of the pixel with coordinates (x, y) in S′_D as s′_d(x, y).
③-7. Using the distribution map S′_D of the preliminary depth visual attention of the current depth video frame, obtain the distribution map S_D of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame. Denote the pixel value of the pixel with coordinates (x, y) in S_D as s_d(x, y); s_d(x, y) = s′_d(x, y) · g(x, y), where
g(x, y) = 0.2 if x < b || y < b || x > W−b || y > H−b;
g(x, y) = 0.4 else if x < 2b || y < 2b || x > W−2b || y > H−2b;
g(x, y) = 0.6 else if x < 3b || y < 3b || x > W−3b || y > H−3b;
g(x, y) = 1 otherwise,
where W is the width of the current depth video frame, H is its height, b is a set second threshold, and the symbol "||" is the "or" operator.
In step ③-5, w_1 = 8 and h_1 = 8; the second threshold b set in step ③-7 is 16.
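The following sketch illustrates steps ③-1 to ③-7 in Python with OpenCV and NumPy. The Gabor kernel parameters, the number of pyramid levels, the single dilation (n_1 = 1) and, in particular, the product used to combine F_D with F_DO in step ③-6 are assumptions made for illustration; the border weighting g(x, y) of step ③-7 follows the piecewise definition above.

```python
import cv2
import numpy as np

def cross_scale_map(pyr, width, height):
    """Across-scale |X(c) - X(s)| differences summed at full resolution (assumed
    reading of the cross-scale operators described in step (3)-2)."""
    acc = np.zeros((height, width), np.float32)
    n_l = len(pyr)
    for c in range(n_l):
        for delta in (-3, -2, -1, 1, 2, 3):
            s = c + delta
            if 0 <= s < n_l:
                fine, coarse = min(c, s), max(c, s)
                up = cv2.resize(pyr[coarse], (pyr[fine].shape[1], pyr[fine].shape[0]))
                acc += cv2.resize(np.abs(pyr[fine] - up), (width, height))
    return acc

def gaussian_pyramid(img, n_l):
    pyr = [img]
    for _ in range(1, n_l):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def depth_attention(depth_frame, n_l=5, w1=8, h1=8, n1=1, b=16):
    """Sketch of steps (3)-1 to (3)-7; Gabor parameters, n1 = 1 and the product used
    to combine F_D with F_DO are illustrative assumptions."""
    d = depth_frame.astype(np.float32)
    h, w = d.shape
    # (3)-1 / (3)-2: depth pyramid and depth feature map F_D
    f_d = cross_scale_map(gaussian_pyramid(d, n_l), w, h)
    # (3)-3 / (3)-4: four Gabor direction components and the preliminary direction map F'_DO
    f_do = np.zeros((h, w), np.float32)
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kern = cv2.getGaborKernel((9, 9), 2.0, theta, 5.0, 0.5, 0, ktype=cv2.CV_32F)
        o = cv2.filter2D(d, cv2.CV_32F, kern)
        f_do += cross_scale_map(gaussian_pyramid(o, n_l), w, h) / 4.0
    # (3)-5: n1 morphological dilations with a w1 x h1 structuring element -> F_DO
    f_do = cv2.dilate(f_do, np.ones((h1, w1), np.uint8), iterations=n1)
    # (3)-6: preliminary depth visual attention S'_D (pointwise product assumed), normalized
    s_d = f_d * f_do
    s_d = (s_d - s_d.min()) / (s_d.max() - s_d.min() + 1e-12) * 255.0
    # (3)-7: attenuate the image borders with the piecewise weight g(x, y)
    g = np.ones((h, w), np.float32)
    yy, xx = np.mgrid[0:h, 0:w]
    for k, wgt in ((3, 0.6), (2, 0.4), (1, 0.2)):  # widest band first so narrower bands overwrite it
        band = (xx < k * b) | (yy < k * b) | (xx > w - k * b) | (yy > h - k * b)
        g[band] = wgt
    return s_d * g
```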
The specific process of the depth-perception-based visual attention fusion method in step ④ is as follows:
④-1. Perform a scale transformation on the current depth video frame by Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a set range, d(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the current depth video frame, and Q(d(x, y)) denotes the pixel value of the pixel with coordinates (x, y) in the scale-transformed current depth video frame.
④-2. Using the scale-transformed current depth video frame, the distribution map S_D of the depth visual attention of the three-dimensional video image jointly displayed by the current depth video frame and the current texture video frame, the distribution map S_M of the motion visual attention of the current texture video frame, and the distribution map S_I of the static image domain visual attention of the current texture video frame, acquire the three-dimensional visual attention distribution map S. The pixel value s(x, y) of the pixel with coordinates (x, y) in S is computed from Q(d(x, y)), the pixel values s_D(x, y), s_M(x, y) and s_I(x, y) together with their weighting coefficients K_D, K_M and K_I, and the visual attention correlation values Θ_ab(x, y) together with their correlation coefficients C_ab, followed by normalization with the normalization function N(·). Here K_D, K_M and K_I are the weighting coefficients of S_D, S_M and S_I respectively, satisfying Σ_{a ∈ {D, M, I}} K_a = 1 and 0 ≤ K_a ≤ 1; s_D(x, y), s_M(x, y) and s_I(x, y) denote the pixel values of the pixels with coordinates (x, y) in S_D, S_M and S_I respectively; Θ_ab(x, y) is the visual attention correlation value, Θ_ab(x, y) = min(s_a(x, y), s_b(x, y)), where min() is the minimum function; C_ab is the correlation coefficient, satisfying Σ_{a, b ∈ {D, M, I}, a ≠ b} C_ab = 1 and 0 ≤ C_ab < 1; the correlation coefficient C_DM denotes the degree of correlation between S_D and S_M, C_DI the degree of correlation between S_D and S_I, and C_IM the degree of correlation between S_I and S_M, with a, b ∈ {D, M, I} and a ≠ b.
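Because the fusion expression itself is carried by the drawings, the sketch below only illustrates one plausible reading of step ④-2: a weighted sum of the three attention maps plus the pairwise correlation terms C_ab·Θ_ab, modulated by the scale-transformed depth Q(d) and normalized to Z_S bits. The particular weights, correlation coefficients and γ value are placeholders, not values taken from the patent.

```python
import numpy as np

def fuse_attention(s_i, s_m, s_d, depth_frame, gamma=32.0, z_s=8,
                   weights=None, corr=None):
    """One plausible reading of steps (4)-1 / (4)-2 (the exact formula is an assumption)."""
    maps = {'D': s_d.astype(np.float32),
            'M': s_m.astype(np.float32),
            'I': s_i.astype(np.float32)}
    if weights is None:                        # K_a: sum to 1, 0 <= K_a <= 1 (assumed values)
        weights = {'D': 0.4, 'M': 0.3, 'I': 0.3}
    if corr is None:                           # C_ab: sum to 1, 0 <= C_ab < 1 (assumed values)
        corr = {('D', 'M'): 0.4, ('D', 'I'): 0.3, ('I', 'M'): 0.3}
    q = depth_frame.astype(np.float32) + gamma      # step (4)-1: Q(d) = d + gamma (gamma assumed)
    s = sum(weights[a] * maps[a] for a in maps)     # weighted primary attention terms
    for (a, b), c_ab in corr.items():               # correlation terms C_ab * min(s_a, s_b)
        s += c_ab * np.minimum(maps[a], maps[b])
    s *= q / (q.max() + 1e-12)                      # modulate by the scale-transformed depth (assumed role)
    s -= s.min()
    return s / (s.max() + 1e-12) * (2 ** z_s - 1)   # normalize to [0, 2^Z_S - 1]
```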
The specific process of thresholding and macroblock post-processing of the three-dimensional visual attention distribution map S in step ⑤ is as follows:
⑤-1. Denote the pixel value of the pixel with coordinates (x, y) in the three-dimensional visual attention distribution map S as s(x, y), and define a third threshold T_S:
T_S = k_T · Σ_{y=0}^{H−1} Σ_{x=0}^{W−1} s(x, y) / (W × H),
where W is the width of the three-dimensional visual attention distribution map S, H is its height, and k_T ∈ (0, 3). Create a new preliminary binary mask image and determine whether s(x, y) ≥ T_S; if so, mark the pixel with coordinates (x, y) in the preliminary binary mask image as a pixel of interest, otherwise mark it as a pixel of no interest.
⑤-2. Divide the preliminary binary mask image into (W/w_2) × (H/h_2) non-overlapping blocks of size w_2 × h_2, and denote the block with block abscissa u and block ordinate v as B_{u,v}, where u ∈ [0, W/w_2 − 1] and v ∈ [0, H/h_2 − 1]. According to each block in the preliminary binary mask image, determine whether the pixels in each corresponding block in the current texture video frame are pixels of interest or pixels of no interest. For block B_{u,v}, determine whether the number of pixels marked as pixels of interest in B_{u,v} is greater than a set fourth threshold T_b, where 0 ≤ T_b ≤ w_2 × h_2; if so, mark all pixels in the block of the current texture video frame corresponding to B_{u,v} as pixels of interest and take the block corresponding to B_{u,v} as a region-of-interest block; otherwise, mark all pixels in the block of the current texture video frame corresponding to B_{u,v} as pixels of no interest and take the block corresponding to B_{u,v} as a non-region-of-interest block. This yields the preliminary region-of-interest mask image of the current texture video frame, which consists of region-of-interest blocks and non-region-of-interest blocks.
⑤-3. Mark all pixels in the non-region-of-interest blocks most adjacent to the region-of-interest blocks in the preliminary region-of-interest mask image as the N_R-th level transition region of interest and update the preliminary region-of-interest mask image; then mark all pixels in the non-region-of-interest blocks most adjacent to the N_R-th level transition region of interest in the updated preliminary region-of-interest mask image as the (N_R−1)-th level transition region of interest, recursively updating the preliminary region-of-interest mask image; repeat this recursive process until the 1st level transition region of interest has been marked. Finally, the final region-of-interest mask image of the current texture video frame is obtained, consisting of the region-of-interest blocks, the N_R levels of transition regions of interest and the non-region-of-interest blocks.
⑤-4. Denote the pixel value of the pixel with coordinates (x, y) in the final region-of-interest mask image as r(x, y). Set the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255; set the pixel values of all pixels in the e-th level transition region of interest of the final region-of-interest mask image to
r(x, y) = e/(N_R + 1) × f(x, y);
and set the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to r(x, y) = f(x, y), obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, e ∈ [1, N_R], and f(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the current texture video frame.
In step ⑤-2, w_2 = 16, h_2 = 16, and the fourth threshold T_b = 50.
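The thresholding and macroblock post-processing of steps ⑤-1, ⑤-2 and ⑤-4 can be sketched as below, assuming Python with NumPy; the N_R-level transition regions of step ⑤-3 are omitted to keep the example short, and k_T = 1.5 is an assumed value within the stated range (0, 3).

```python
import numpy as np

def postprocess(s, frame, k_t=1.5, w2=16, h2=16, t_b=50):
    """Sketch of steps (5)-1, (5)-2 and (5)-4; the N_R transition levels of (5)-3 are omitted."""
    h, w = s.shape
    t_s = k_t * s.sum() / (w * h)            # (5)-1: adaptive third threshold T_S
    interest = s >= t_s                       # preliminary binary mask of pixels of interest
    roi = np.full_like(frame, 255)            # non-region-of-interest pixels set to 255
    for v in range(h // h2):                  # (5)-2: non-overlapping w2 x h2 blocks
        for u in range(w // w2):
            ys = slice(v * h2, (v + 1) * h2)
            xs = slice(u * w2, (u + 1) * w2)
            if interest[ys, xs].sum() > t_b:  # more than T_b pixels of interest -> ROI block
                roi[ys, xs] = frame[ys, xs]   # (5)-4: keep the original texture pixels
    return roi
```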
Compared with the prior art, the method has the following advantages. The method jointly uses temporally synchronized texture video frames and their corresponding depth video frames: it first extracts the static image domain visual attention of the texture video frame to obtain its distribution map, extracts the motion visual attention from temporally consecutive texture video frames to obtain its distribution map, and extracts the depth visual attention of the depth video frame to obtain the distribution map of the depth visual attention of the three-dimensional video image jointly displayed by the depth video frame and the texture video frame. Then, using the distribution maps of the static image domain visual attention, the motion visual attention and the depth visual attention together with the depth information, a distribution map of three-dimensional (stereoscopic) visual attention conforming to the stereoscopic visual characteristics of the human eye is obtained through the depth-perception-based fusion method, and thresholding and macroblock post-processing are performed to obtain the final video region of interest conforming to human stereoscopic perception, together with the mask image of the corresponding regions of interest and non-interest. The regions of interest extracted by the method fuse static image domain visual attention, motion visual attention and depth visual attention, which effectively suppresses the inherent singleness and inaccuracy of each individual visual attention cue, overcomes the noise caused by complex backgrounds in static image domain visual attention, and overcomes the inability of motion visual attention to extract regions of interest with only local motion or small motion amplitude; the calculation accuracy is thus improved, the stability of the algorithm is enhanced, and regions of interest can be extracted from backgrounds with complex textures and from moving environments. In addition, the regions of interest obtained by the method accord not only with the visual characteristics of the human eye attending to static texture in video frames and to moving objects, but also with the depth perception characteristic of interest in objects with strong depth perception or at short distance in stereoscopic vision, and thus with the semantic characteristics of human binocular stereoscopic vision.
Drawings
FIG. 1a is a color video frame at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 1b is a color video frame at time t in a two-dimensional color video of the test sequence "Door Flower";
FIG. 2a is a depth video frame at time t in a depth video corresponding to a test sequence "Ballet" two-dimensional color video;
FIG. 2b is a depth video frame at time t in a depth video corresponding to a two-dimensional color video of a test sequence "Door Flower";
FIG. 3 is a general flow diagram of the method of the present invention;
FIG. 4 is a block diagram of a process for detecting the visual attention of the still image domain of a current texture video frame using a conventional still image visual attention detection method;
FIG. 5 is a block flow diagram of a method of motion visual attention detection;
FIG. 6 is a block flow diagram of a method of depth vision attention detection;
FIG. 7a is a luminance characteristic diagram of a color video frame at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 7b is a chart of chrominance characteristics of a color video frame at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 7c is a diagram of the directional characteristics of a color video frame at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 8a is a distribution diagram of the still image domain visual attention of a color video frame at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 8b is a graph of the distribution of the motion visual attention of color video frames at time t in a test sequence "Ballet" two-dimensional color video;
FIG. 8c is a diagram showing a depth visual attention distribution of a three-dimensional video image jointly displayed by a color video frame at time t and a corresponding depth video frame in a test sequence "Ballet" two-dimensional color video;
FIG. 9 is a three-dimensional visual attention distribution diagram obtained after the color video frame at time t and the corresponding depth video frame in the two-dimensional color video of the test sequence "Ballet" are processed by the present invention;
FIG. 10a is the final region-of-interest mask image of the texture video frame at time t of the test sequence "Ballet", extracted by the present invention;
FIG. 10b is the region of interest of the texture video frame at time t of the test sequence "Ballet", extracted by the present invention;
FIG. 11a is a region of interest of a texture video frame at time t of a test sequence "Ballet" extracted by a conventional method for extracting a region of interest based only on a static image domain visual attention cue;
FIG. 11b is a region of interest of a texture video frame at time t of the test sequence "Ballet" extracted according to a conventional method for extracting regions of interest based only on motion visual attention cues;
FIG. 11c is the region of interest of the texture video frame at time t of the test sequence "Ballet", extracted by the conventional region-of-interest extraction method combining static image domain visual attention and motion visual attention;
FIG. 12a is a graph showing the luminance characteristics of color video frames at time t in a two-dimensional color video of the test sequence "Door Flower";
FIG. 12b is a chart showing the chrominance characteristics of color video frames at time t in a two-dimensional color video of the test sequence "Door Flower";
FIG. 12c is a diagram showing the directional characteristics of color video frames at time t in a two-dimensional color video of the test sequence "Door Flower";
FIG. 13a is a diagram of a static image domain visual attention distribution of color video frames at time t in a test sequence "Door Flower" two-dimensional color video;
FIG. 13b is a diagram showing a distribution of motion visual attention of color video frames at time t in a two-dimensional color video of the test sequence "Door Flower";
FIG. 13c is a diagram showing a depth visual attention distribution of a three-dimensional video image jointly represented by a color video frame at time t and a corresponding depth video frame in a two-dimensional color video of the test sequence "Door Flower";
FIG. 14 is a three-dimensional visual attention distribution diagram obtained by processing a color video frame at time t and a corresponding depth video frame in a two-dimensional color video of a test sequence "Door Flower" according to the present invention;
FIG. 15a is the final region-of-interest mask image of the texture video frame at time t of the test sequence "Door Flower", extracted by the present invention;
FIG. 15b is the region of interest of the texture video frame at time t of the test sequence "Door Flower", extracted by the present invention;
FIG. 16a is a region of interest of a texture video frame at time t in the test sequence "Door Flower" extracted by the conventional method for extracting a region of interest based only on the visual attention cue of a still image region;
FIG. 16b is a region of interest of a texture video frame at time t in the test sequence "Door Flower" extracted by the conventional method for extracting regions of interest based only on the attention cues for sports vision;
FIG. 16c is the region of interest of the texture video frame at time t of the test sequence "Door Flower", extracted by the conventional region-of-interest extraction method combining static image domain visual attention and motion visual attention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a method for extracting video regions of interest based on visual attention, which mainly extracts the video regions of interest by jointly using the information of the texture video and the information of the temporally synchronized depth video. In this embodiment the texture video is a two-dimensional color video, exemplified by the test sequences "Ballet" and "Door Flower". Fig. 1a shows the color video frame at time t in the "Ballet" two-dimensional color video; Fig. 1b shows the color video frame at time t in the "Door Flower" two-dimensional color video; Fig. 2a shows the depth video frame at time t in the depth video corresponding to the "Ballet" two-dimensional color video; Fig. 2b shows the depth video frame at time t in the depth video corresponding to the "Door Flower" two-dimensional color video. The depth video frame at each moment in the depth video corresponding to a two-dimensional color video is a grayscale map represented with Z_D-bit depth, whose gray values represent the relative distance from the object represented by each pixel in the depth video frame to the camera. The size of the texture video frame at each moment in the texture video is defined as W × H. For the depth video frame at each moment in the depth video corresponding to the texture video, if its size differs from that of the texture video frame, it is generally set to the same size as the texture video frame, i.e. W × H, using existing methods such as scale transformation and interpolation, where W is the width of the texture video frame at each moment in the texture video or of the depth video frame at each moment in the depth video, and H is the corresponding height; setting the size of the depth video frame to be the same as that of the texture video frame makes it more convenient to extract the video region of interest.
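When the depth frame and the texture frame differ in size, a simple scale transformation such as the bilinear resize below (OpenCV assumed, interpolation choice assumed) is enough to bring the depth frame to the same W × H as the texture frame before the following steps are applied.

```python
import cv2

def match_depth_to_texture(depth_frame, texture_frame):
    """Resize the depth frame to the texture frame's W x H (bilinear interpolation assumed)."""
    h, w = texture_frame.shape[:2]
    if depth_frame.shape[:2] != (h, w):
        depth_frame = cv2.resize(depth_frame, (w, h), interpolation=cv2.INTER_LINEAR)
    return depth_frame
```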
The general flow diagram of the method of the present invention is shown in fig. 3, and specifically includes the following steps:
① Define the two-dimensional color video as the texture video and define the size of the texture video frame at each moment in the texture video as W × H, where W is the width and H the height of the texture video frames. Denote the texture video frame at time t in the texture video as F_t and define F_t as the current texture video frame. Detect the static image domain visual attention of the current texture video frame with a known still-image visual attention detection method to obtain the distribution map of the static image domain visual attention of the current texture video frame, denoted S_I; S_I has size W × H and is a grayscale map represented with Z_S-bit depth, in which a larger pixel value indicates a higher relative degree of attention of the human eye to the corresponding pixel of the current texture video frame, and a smaller pixel value indicates a lower relative degree of attention.
In this embodiment, the block diagram of the process of detecting the static image domain visual attention of the current texture video frame with the known still-image visual attention detection method is shown in Fig. 4; each rectangle in Fig. 4 represents a data-processing operation, each diamond represents an image, and diamonds of different sizes represent images of different resolutions that are the input and output data of the corresponding operations. The current texture video frame is an image in RGB format in which each pixel is represented by the three color channels R, G and B. First, the color channel components of each pixel of the current texture video frame are linearly transformed and decomposed into one luminance component map and two chrominance component maps, namely a red-green component map and a blue-yellow component map, denoted I, RG and BY respectively. The pixel value of the luminance component map I at coordinates (x, y) is I_{x,y} = (r_{x,y} + g_{x,y} + b_{x,y})/3, where r_{x,y}, g_{x,y} and b_{x,y} are the pixel values of the R, G and B color channels of the current texture video frame at coordinates (x, y). The two chrominance component maps are computed as
R_{x,y} = r_{x,y} − (g_{x,y} + b_{x,y})/2,
G_{x,y} = g_{x,y} − (r_{x,y} + b_{x,y})/2,
B_{x,y} = b_{x,y} − (r_{x,y} + g_{x,y})/2,
Y_{x,y} = r_{x,y} + g_{x,y} − 2(|r_{x,y} − g_{x,y}| + b_{x,y}),
RG_{x,y} = R_{x,y} − G_{x,y},
BY_{x,y} = B_{x,y} − Y_{x,y},
where RG_{x,y} is the pixel value of the red-green component map RG at coordinates (x, y) and BY_{x,y} is the pixel value of the blue-yellow component map BY at coordinates (x, y). Four direction component maps of the luminance component map in the directions 0, π/4, π/2 and 3π/4 are then extracted with known Gabor filters and denoted O^T_θ, θ ∈ {0, π/4, π/2, 3π/4}. The one luminance component map, the two chrominance component maps and the four direction component maps are each decomposed into n_L layers by Gaussian pyramid decomposition, where n_L is a positive integer less than 20. All component maps are uniformly denoted l, and the i-th layer component map obtained after Gaussian pyramid decomposition of the component map l is denoted l(i), where i ∈ [0, n_L−1] and l ∈ {I} ∪ {RG, BY} ∪ {O^T_0, O^T_{π/4}, O^T_{π/2}, O^T_{3π/4}}; thus 7 component maps are generated in total, and each is decomposed into n_L layer component maps, giving 7 × n_L layer component maps. In this example n_L is 9. The feature maps of the components (the chrominance, luminance and direction components) are then calculated from the extracted layer component maps as follows:
For every component map l ∈ {I} ∪ {RG, BY} ∪ {O^T_0, O^T_{π/4}, O^T_{π/2}, O^T_{3π/4}}, the feature map is
F̄_l = N( ⊕_{c} ⊕_{s} | l(c) ⊖ l(s) | ),
where s, c ∈ [0, n_L−1], s = c+δ, δ = {−3, −2, −1, 1, 2, 3}; N(·) is a normalization function that normalizes to the interval [0, 1], in which 1 denotes the most noticeable and 0 the least noticeable; l(c) denotes the c-th layer component map of the component map l and l(s) the s-th layer component map. The symbol ⊖ denotes the cross-scale difference operation between l(c) and l(s): if c < s, l(s) is up-sampled to an image with the same resolution as l(c) and then each pixel of l(c) is differenced with the corresponding pixel of the up-sampled l(s); if c > s, l(c) is up-sampled to an image with the same resolution as l(s) and then each pixel of l(s) is differenced with the corresponding pixel of the up-sampled l(c). The symbol ⊕ denotes the cross-scale addition operation between l(c) and l(s): if c < s, l(s) is up-sampled to an image with the same resolution as l(c) and then each pixel of l(c) is summed with the corresponding pixel of the up-sampled l(s); if c > s, l(c) is up-sampled to an image with the same resolution as l(s) and then each pixel of l(s) is summed with the corresponding pixel of the up-sampled l(c). The feature maps of all components are then linearly fused and normalized to obtain the static image domain visual attention distribution map S_I of the current texture video frame.
The size of each image of the test sequences "Ballet" and "Door Flower" is 1024 × 768. The luminance feature map, the chrominance feature map and the direction feature map of the color video frame at time t in the "Ballet" two-dimensional color video are shown in Fig. 7a, Fig. 7b and Fig. 7c respectively; those of the color video frame at time t in the "Door Flower" two-dimensional color video are shown in Fig. 12a, Fig. 12b and Fig. 12c respectively. In this embodiment Z_S = 8, i.e. the distribution map S_I of the static image domain visual attention is represented with 8-bit depth. The distribution map of the static image domain visual attention of the color video frame at time t in the "Ballet" two-dimensional color video is shown in Fig. 8a, and that of the color video frame at time t in the "Door Flower" two-dimensional color video is shown in Fig. 13a. Other known visual attention detection methods may also be used as the static image domain visual attention detection method.
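The per-pixel color decomposition described above can be written directly in NumPy; the sketch below assumes the input frame stores its channels in R, G, B order and covers only the I, RG and BY component maps (the Gabor direction components and the pyramid stages follow the same pattern as in the motion sketch).

```python
import numpy as np

def opponent_channels(frame_rgb):
    """Compute the intensity and color-opponent component maps I, RG and BY from a
    texture frame whose channels are assumed to be ordered R, G, B."""
    r, g, b = (frame_rgb[..., i].astype(np.float32) for i in range(3))
    i = (r + g + b) / 3.0                       # luminance component map I
    comp_r = r - (g + b) / 2.0                  # broadly tuned red
    comp_g = g - (r + b) / 2.0                  # broadly tuned green
    comp_b = b - (r + g) / 2.0                  # broadly tuned blue
    comp_y = r + g - 2.0 * (np.abs(r - g) + b)  # broadly tuned yellow
    return i, comp_r - comp_g, comp_b - comp_y  # I, RG, BY
```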
② Detect the motion visual attention of the current texture video frame with the motion visual attention detection method to obtain the distribution map of the motion visual attention of the current texture video frame, denoted S_M; S_M has size W × H and is a grayscale map represented with Z_S-bit depth, in which a larger pixel value indicates a higher relative degree of attention of the human eye to the motion of the corresponding pixel of the current texture video frame, and a smaller pixel value indicates a lower relative degree of attention.
In this embodiment, a flow chart of the motion visual attention detection method is shown in fig. 5, and the specific process of the motion visual attention detection method is as follows:
②-1. Denote the texture video frame at time t+j, temporally adjacent to the current texture video frame in the texture video, as F_{t+j}, and the texture video frame at time t−j as F_{t−j}, where j ∈ (0, N_F/2] and N_F is a positive integer less than 10. In the specific application of this example, N_F = 4, i.e. the motion regions of the texture video are extracted jointly from the current texture video frame together with its two preceding and two following frames.
②-2. Using a known optical flow method, compute the horizontal and vertical motion vector images between the current texture video frame and the texture video frame F_{t+j} at time t+j, and between the current texture video frame and the texture video frame F_{t−j} at time t−j. Denote the horizontal and vertical motion vector images between the current texture video frame and F_{t+j} as V^H_{t+j} and V^V_{t+j}, and those between the current texture video frame and F_{t−j} as V^H_{t−j} and V^V_{t−j}; V^H_{t+j}, V^V_{t+j}, V^H_{t−j} and V^V_{t−j} all have width W and height H.
②-3. Superimpose the absolute value of V^H_{t+j} and the absolute value of V^V_{t+j} to obtain the motion amplitude image of the current texture video frame and F_{t+j}, denoted M_{t+j}, i.e. M_{t+j} = |V^H_{t+j}| + |V^V_{t+j}|; denote the motion amplitude value of the pixel with coordinates (x, y) in M_{t+j} as m_{t+j}(x, y). Likewise, superimpose the absolute value of V^H_{t−j} and the absolute value of V^V_{t−j} to obtain the motion amplitude image of the current texture video frame and F_{t−j}, denoted M_{t−j}, i.e. M_{t−j} = |V^H_{t−j}| + |V^V_{t−j}|; denote the motion amplitude value of the pixel with coordinates (x, y) in M_{t−j} as m_{t−j}(x, y).
②-4, extracting the joint motion map, denoted M^Δ_j, by jointly using the current texture video frame, the texture video frame F_{t+j} at time t+j and the texture video frame F_{t-j} at time t−j. The specific process of extracting the joint motion map M^Δ_j is: judging whether the minimum of the motion amplitude value of each pixel in the motion amplitude image M_{t+j} of the current texture video frame and the texture video frame F_{t+j} at time t+j and the motion amplitude value of the pixel of the corresponding coordinates in the motion amplitude image M_{t-j} of the current texture video frame and the texture video frame F_{t-j} at time t−j is larger than a set first threshold T_1; if so, determining the pixel value of the pixel of the corresponding coordinates in the joint motion map M^Δ_j to be the average of the sum of the motion amplitude values of the pixels of the corresponding coordinates in M_{t+j} and M_{t-j}; otherwise, determining the pixel value of the pixel of the corresponding coordinates in the joint motion map M^Δ_j to be 0. That is, for the pixel with coordinates (x, y) in M_{t+j} and the pixel with coordinates (x, y) in M_{t-j}, judging whether min(m_{t+j}(x, y), m_{t-j}(x, y)) is larger than the set first threshold T_1; if so, the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j is m^Δ_j(x, y) = (m_{t+j}(x, y) + m_{t-j}(x, y))/2; otherwise, the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j is 0, where min() is the minimum-value function. Here, the first threshold T_1 = 1, in order to filter out small noise points caused by very small camera parameter jitter.
②-5, weighting and superposing the joint motion maps at each time at a distance of 1 to N_F/2 from time t to obtain the weighted joint motion map of the current texture video frame, denoted M; the pixel value of the pixel with coordinates (x, y) in the weighted joint motion map M of the current texture video frame is denoted m(x, y), m(x, y) = Σ_{j=1}^{N_F/2} ζ_j · m^Δ_j(x, y), where m^Δ_j(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j at a time distance of j from time t, ζ_j is a weighting coefficient, and the weighting coefficients ζ_j satisfy Σ_{j=1}^{N_F/2} ζ_j = 1.
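A minimal sketch of steps ②-4 and ②-5, assuming the motion amplitude images are NumPy arrays and that equal weights ζ_j summing to 1 are acceptable; the helper names are hypothetical.

```python
import numpy as np

def joint_motion_map(m_fwd, m_bwd, t1=1.0):
    """M_j^Delta: average of the two amplitudes where both exceed T_1, else 0."""
    keep = np.minimum(m_fwd, m_bwd) > t1
    return np.where(keep, 0.5 * (m_fwd + m_bwd), 0.0)

def weighted_joint_motion_map(joint_maps, weights=None):
    """M: weighted superposition of the joint motion maps for j = 1 .. N_F/2."""
    if weights is None:                                   # equal weights, sum to 1 (an assumption)
        weights = [1.0 / len(joint_maps)] * len(joint_maps)
    return sum(w * jm for w, jm in zip(weights, joint_maps))
```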
In a video, moving objects are the main regions of interest; however, the degree of attention people pay differs with the type of motion. The motion in a video mainly falls into the following two cases. In the first case, for shooting with a static camera, the background is static and the moving object is the main object of interest. In the second case, for shooting with a moving camera, the background moves globally and the moving object either remains relatively still with respect to the camera or moves inconsistently with the background; in this case the moving object is still the object of interest. From the above analysis, the motion attention region mainly comes from motion attributes of an object that differ from the motion attributes of the background environment, i.e. a region with large motion contrast, so the following steps can be adopted to obtain the motion visual attention.
②-6, performing Gaussian pyramid decomposition on the weighted joint motion map M of the current texture video frame to decompose it into n_L layer weighted joint motion maps; the i-th layer weighted joint motion map obtained by Gaussian pyramid decomposition of the weighted joint motion map M is denoted M(i), and the width and height of the i-th layer weighted joint motion map M(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], the 0th layer is the bottom layer, the (n_L−1)-th layer is the highest layer, W is the width of the current texture video frame, and H is the height of the current texture video frame; in the specific application process of this embodiment, n_L takes the value 9.
②-7, using the n_L layer weighted joint motion maps of the weighted joint motion map M of the current texture video frame, extracting the distribution map S_M of the motion visual attention of the current texture video frame; the pixel value of the pixel with coordinates (x, y) in S_M is denoted s_m(x, y), and S_M = F_M, where F_M = N( ⊕_{c,s} | M(c) ⊖ M(s) | ), s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, M(c) is the c-th layer weighted joint motion map, and M(s) is the s-th layer weighted joint motion map. The symbol ⊖ denotes performing the cross-level difference operation on M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c). The symbol ⊕ denotes performing the cross-level addition operation on M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and each pixel of M(c) is summed with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and each pixel of M(s) is summed with the corresponding pixel of the up-sampled M(c).
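Steps ②-6 and ②-7 (and the analogous depth feature map of step ③-2) rest on a Gaussian pyramid plus cross-level difference, cross-level addition and normalization. The sketch below is one plausible reading of those operators, assuming the cross-level sums are accumulated at the bottom-layer resolution and that bilinear resizing is an acceptable up-sampling; it is not the patent's reference implementation.

```python
import cv2
import numpy as np

def gaussian_pyramid(img, n_levels=9):
    """n_L-level Gaussian pyramid; level i has size (W/2^i, H/2^i)."""
    pyr = [img.astype(np.float32)]
    for _ in range(n_levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def normalize(img, z_s=8):
    """N(.): map an image to the interval [0, 2^Z_S - 1]."""
    lo, hi = float(img.min()), float(img.max())
    if hi - lo < 1e-12:
        return np.zeros_like(img, dtype=np.float32)
    return (img - lo) / (hi - lo) * (2 ** z_s - 1)

def cross_scale_feature(pyr, deltas=(-3, -2, -1, 1, 2, 3), z_s=8):
    """Accumulate |pyr[c] (cross-level difference) pyr[s]| over all (c, s = c + delta) pairs.

    The coarser image is up-sampled to the finer resolution before the absolute
    difference; all difference maps are then resized to and summed at the bottom-layer
    resolution (cross-level addition) and normalized.
    """
    base_shape = pyr[0].shape[::-1]                      # (W, H) as expected by cv2.resize
    acc = np.zeros(pyr[0].shape, dtype=np.float32)
    for c in range(len(pyr)):
        for d in deltas:
            s = c + d
            if not 0 <= s < len(pyr):
                continue
            fine, coarse = (c, s) if c < s else (s, c)
            up = cv2.resize(pyr[coarse], pyr[fine].shape[::-1])
            diff = np.abs(pyr[fine] - up)
            acc += cv2.resize(diff, base_shape)
    return normalize(acc, z_s)

# e.g. S_M ~ cross_scale_feature(gaussian_pyramid(weighted_joint_motion_map_M))
```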
The distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Ballet" two-dimensional color video is processed in this step is shown in fig. 8b; fig. 13b shows the distribution map of the motion visual attention obtained after the color video frame at time t in the test sequence "Door Flower" two-dimensional color video is processed in this step.
Thirdly, defining the depth video frame at each time in the depth video corresponding to the texture video as a grayscale map represented with Z_D bit depth, in which gray values in the range from 0 to 2^{Z_D}−1 represent the relative distance from the photographed object represented by each pixel in the depth video frame to the shooting camera: the gray value 0 corresponds to the maximum depth and the gray value 2^{Z_D}−1 corresponds to the minimum depth. The size of the depth video frame at each time in the depth video is set to W × H, where W is the width of the depth video frame at each time in the depth video and H is the height of the depth video frame at each time in the depth video. The depth video frame at time t in the depth video is denoted D_t, and the depth video frame D_t at time t in the depth video is defined as the current depth video frame. The depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame is detected by adopting a depth visual attention detection method, and the distribution map of the depth visual attention of the three-dimensional video image is obtained, denoted S_D. The depth visual attention distribution map S_D of the three-dimensional video image has a size of W × H and is a grayscale map represented with Z_S bit depth, wherein a larger pixel value of a pixel in the grayscale map indicates that the human eye pays more attention to the relative depth of the corresponding pixel of the current texture video frame, and a smaller pixel value indicates less attention. In this embodiment, each pixel of the depth video frame is represented with Z_D = 8 bit depth, and each pixel of the visual attention distribution map is represented with Z_S = 8 bit depth.
The specific stereoscopic impression is the main characteristic that the stereoscopic video is different from the traditional single-channel video, and for the visual attention of the stereoscopic video, the depth perception mainly influences the visual attention of a user through two aspects, on one hand, the interest degree of the user on a scenery (or an object) close to the shooting camera array is generally larger than that of the scenery (or the object) far away from the shooting camera array; on the other hand, the depth discontinuity areas provide the user with a strong depth contrast. In this embodiment, a flow chart of the deep visual attention detection method is shown in fig. 6, and the specific process of the deep visual attention detection method is as follows:
③-1, performing Gaussian pyramid decomposition on the current depth video frame to obtain n_L layer depth video frames; the i-th layer depth video frame obtained by Gaussian pyramid decomposition of the current depth video frame is denoted D(i), and the width and height of the i-th layer depth video frame D(i) are W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], the 0th layer is the bottom layer with the maximum resolution and D(0) = D_t, the (n_L−1)-th layer is the highest layer with the lowest resolution, W is the width of the current depth video frame, and H is the height of the current depth video frame.
③-2, using the n_L layer depth video frames of the current depth video frame, extracting the depth feature map of the current depth video frame, denoted F_D, F_D = N( ⊕_{c,s} | D(c) ⊖ D(s) | ), where s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, D(c) is the c-th layer depth video frame, and D(s) is the s-th layer depth video frame. The symbol ⊖ denotes performing the cross-level difference operation on D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c). The symbol ⊕ denotes performing the cross-level addition operation on D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and each pixel of D(c) is summed with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and each pixel of D(s) is summed with the corresponding pixel of the up-sampled D(c).
③-3, a depth edge region with a large depth difference gives the user a stronger sense of depth, so the depth edge regions in the current depth video frame are potential depth visual attention regions. Convolution operations are performed on the current depth video frame with known Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components in the 0, π/4, π/2 and 3π/4 directions and obtain four direction component maps of the current depth video frame, denoted O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} respectively. The O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} direction component maps of the current depth video frame are each decomposed by Gaussian pyramid decomposition into n_L layer direction component maps; the i-th layer direction component map obtained by Gaussian pyramid decomposition of the direction component map in the θ direction is denoted O^D_θ(i), and the width and height of O^D_θ(i) are W/2^i and H/2^i respectively, where θ ∈ {0, π/4, π/2, 3π/4}, i ∈ [0, n_L−1], the 0th layer is the bottom layer and O^D_θ(0) = O^D_θ, the (n_L−1)-th layer is the highest layer, W is the width of the current depth video frame, and H is the height of the current depth video frame.
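The four directional component maps of step ③-3 could be produced with OpenCV's Gabor kernels; the kernel size and filter parameters below are assumptions of this sketch, not values given by the patent.

```python
import cv2
import numpy as np

def direction_components(depth_frame, ksize=31):
    """Convolve the depth frame with Gabor filters at 0, pi/4, pi/2 and 3*pi/4."""
    components = {}
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0.0, ktype=cv2.CV_32F)
        components[theta] = cv2.filter2D(depth_frame.astype(np.float32), cv2.CV_32F, kernel)
    return components   # {theta: O_theta^D}
```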
③-4, using the n_L layer direction component maps of the direction component map of each direction of the current depth video frame, extracting the preliminary depth direction feature map of the current depth video frame, denoted F'_DO, F'_DO = (1/4) · Σ_{θ∈{0, π/4, π/2, 3π/4}} F̄_{O_θ}, where F̄_{O_θ} = N( ⊕_{c,s} | O^D_θ(c) ⊖ O^D_θ(s) | ), s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, O^D_θ(c) is the c-th layer direction component map of the θ direction, and O^D_θ(s) is the s-th layer direction component map of the θ direction. The symbol ⊖ denotes performing the cross-level difference operation on O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c), and each pixel of O^D_θ(c) is differenced with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s), and each pixel of O^D_θ(s) is differenced with the corresponding pixel of the up-sampled O^D_θ(c). The symbol ⊕ denotes performing the cross-level addition operation on O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c), and each pixel of O^D_θ(c) is summed with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s), and each pixel of O^D_θ(s) is summed with the corresponding pixel of the up-sampled O^D_θ(c).
③-5, using a known morphological dilation algorithm with a block of size w_1 × h_1 as the basic dilation unit, performing n_1 dilation operations on the preliminary depth direction feature map F'_DO of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted F_DO. In this embodiment, for the "Ballet" and "Door Flower" test sequences, where the size of each image in the test sequence is 1024 × 768, the basic unit of morphological dilation uses 8 × 8 blocks, i.e. w_1 × h_1 = 8 × 8, and the number of dilations is n_1 = 2.
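Step ③-5 maps directly to a morphological dilation with an 8 × 8 structuring element applied twice; a sketch with illustrative names:

```python
import cv2
import numpy as np

def depth_direction_feature(preliminary_fdo, unit=(8, 8), n_dilations=2):
    """F_DO: dilate the preliminary map F'_DO n_1 times with a w_1 x h_1 basic unit."""
    kernel = np.ones(unit, dtype=np.uint8)
    return cv2.dilate(preliminary_fdo, kernel, iterations=n_dilations)
```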
③-6, using the depth feature map F_D and the depth direction feature map F_DO of the current depth video frame, obtaining the distribution map of the preliminary depth visual attention of the current depth video frame, denoted S'_D, by combining F_D and F_DO and normalizing the result with N(·), a normalization function to the interval [0, 2^{Z_S}−1]; the pixel value of the pixel with coordinates (x, y) in S'_D is denoted s'_d(x, y).
③-7, for the left and right border regions of the image: the left image border of the left-viewpoint image has no corresponding region in the right-viewpoint image, so a stereoscopic effect cannot be formed in the human brain there; similarly, it is difficult to form a stereoscopic effect at the right image border of the right-viewpoint image. Therefore, in stereoscopic video the left and right border regions of the image provide a weak stereoscopic effect or no stereoscopic effect and are non-stereoscopic visual attention regions, so the invention suppresses the borders of the distribution map S'_D of the preliminary depth visual attention of the current depth video frame, and uses the distribution map S'_D of the preliminary depth visual attention of the current depth video frame to obtain the depth visual attention distribution map S_D of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame. The pixel value of the pixel with coordinates (x, y) in S_D is denoted s_d(x, y), s_d(x, y) = s'_d(x, y) · g(x, y), where g(x, y) = 0.2 if x < b || y < b || x > W−b || y > H−b; g(x, y) = 0.4 else if x < 2b || y < 2b || x > W−2b || y > H−2b; g(x, y) = 0.6 else if x < 3b || y < 3b || x > W−3b || y > H−3b; and g(x, y) = 1 otherwise; W is the width of the current depth video frame, H is the height of the current depth video frame, b is a set second threshold, and the symbol "||" is the "or" operator. Here the second threshold b takes the value 16. The function g(x, y) may also be another two-dimensional function that suppresses the border region of the image, such as a two-dimensional Gaussian function whose template size is the same as the size of the texture video frame.
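The stepped border-suppression weight g(x, y) of step ③-7 can be precomputed once per frame size; a sketch assuming b = 16 and zero-based pixel coordinates:

```python
import numpy as np

def border_weight(width, height, b=16):
    """g(x, y): 0.2 / 0.4 / 0.6 within 1*b, 2*b, 3*b of any image border, 1 elsewhere."""
    x = np.arange(width)[None, :]
    y = np.arange(height)[:, None]
    g = np.ones((height, width), dtype=np.float32)
    for k, w in ((3, 0.6), (2, 0.4), (1, 0.2)):      # outermost band last so it wins near the border
        band = (x < k * b) | (y < k * b) | (x > width - k * b) | (y > height - k * b)
        g[band] = w
    return g

# S_D = S_D_preliminary * border_weight(W, H)
```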
Fig. 8c shows the depth visual attention distribution map of the three-dimensional video image jointly represented by the color video frame at time t and the corresponding depth video frame in the test sequence "Ballet" two-dimensional color video; fig. 13c shows the depth visual attention distribution map of the three-dimensional video image jointly represented by the color video frame at time t and the corresponding depth video frame in the test sequence "Door Flower" two-dimensional color video.
Fourthly, adopting a visual attention fusion method based on depth perception to fuse the distribution map S_I of the static image domain visual attention of the current texture video frame, the distribution map S_M of the motion visual attention of the current texture video frame, and the depth visual attention distribution map S_D of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, so as to extract a distribution map of three-dimensional visual attention conforming to human-eye stereoscopic perception, denoted S; the distribution map S of three-dimensional visual attention has a size of W × H and is a grayscale map represented with Z_S bit depth, wherein a larger pixel value of a pixel in the grayscale map indicates that the human eye pays more relative attention to the corresponding pixel of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, and a smaller pixel value indicates less relative attention.
In a traditional single channel, a moving object is more likely to attract the attention of a viewer than a static object; among objects that are all static, brightly colored regions, regions with high color or brightness contrast, and regions with large texture direction differences are more likely to attract the viewer's attention. In stereoscopic video, the visual attention distribution of the human eye is influenced not only by motion visual attention and static image domain visual attention but also by the special stereoscopic perception that the stereoscopic video provides to the user. Stereoscopic vision mainly comes from the small positional deviation, i.e. parallax, between the scenes seen by the left and right eyes: the distance between the two eyes is about 6 cm, so the object images received by the two eyes are projected onto the retinas with a small positional deviation, and the brain automatically integrates this small deviation into a stereoscopic image with depth, forming stereoscopic vision. The relative distance information of objects reflected by stereoscopic vision is another important factor that directly influences attention selection. In a stereoscopic video, objects contained in a depth-discontinuous region or a region with large depth contrast give the user a stronger depth difference, have a stronger stereoscopic impression or sense of depth, and are one of the regions in which the user is interested; on the other hand, the viewer is more interested in a foreground region close to the shooting camera (or the video viewer) than in a region far away from the shooting camera (or the video viewer), so the foreground region is usually an important potential region of interest for the stereoscopic video viewer. Based on the above analysis, the factors influencing the three-dimensional visual attention of the human eye are determined to include four factors: static image domain visual attention, motion visual attention, depth visual attention and depth. The specific process of the visual attention fusion method based on depth perception in this specific embodiment is as follows:
④-1, performing a scale transformation on the current depth video frame as Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a set range, d(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the current depth video frame, and Q(d(x, y)) denotes the pixel value of the pixel with coordinates (x, y) in the scale-transformed current depth video frame.
④-2, using the scale-transformed current depth video frame, the depth visual attention distribution map S_D of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, the distribution map S_M of the motion visual attention of the current texture video frame, and the distribution map S_I of the static image domain visual attention of the current texture video frame, acquiring the three-dimensional visual attention distribution map S; the pixel value of the pixel with coordinates (x, y) in the three-dimensional visual attention distribution map S is denoted s(x, y) and is obtained by fusing Q(d(x, y)), s_D(x, y), s_M(x, y) and s_I(x, y), wherein K_D, K_M and K_I are the weighting coefficients of S_D, S_M and S_I respectively, and the weighting coefficients satisfy the condition Σ_{a∈{D,M,I}} K_a = 1 with 0 ≤ K_a ≤ 1; N(·) is a normalization function to the interval [0, 2^{Z_S}−1]; s_D(x, y), s_M(x, y) and s_I(x, y) denote the pixel values of the pixels with coordinates (x, y) in S_D, S_M and S_I respectively; Θ_ab(x, y) is the visual attention correlation value, Θ_ab(x, y) = min(s_a(x, y), s_b(x, y)), where min() is the minimum-value function; C_ab is the correlation coefficient, and the correlation coefficients satisfy the condition Σ_{a,b∈{D,M,I}, a≠b} C_ab = 1 with 0 ≤ C_ab < 1, where the correlation coefficient C_DM denotes the degree of correlation between S_D and S_M, the correlation coefficient C_DI denotes the degree of correlation between S_D and S_I, and the correlation coefficient C_IM denotes the degree of correlation between S_I and S_M, with a, b ∈ {D, M, I} and a ≠ b.
Motion visual attention, static image domain visual attention and depth visual attention all play an important role in combined human visual attention; however, motion visual attention is the most important component of video visual attention, followed by the static image domain visual attention arising from image-domain brightness, color and orientation, and then the depth visual attention. In this embodiment, each visual attention distribution map is represented with Z_S = 8 bit depth, and K_D = 0.15, K_M = 0.4 and K_I = 0.35 are taken. The degree of correlation between the depth visual attention and the motion visual attention is small, the degree of correlation between the depth visual attention and the static image domain visual attention is small, and the degree of correlation between the static image domain visual attention and the motion visual attention is large, so the correlation coefficients C_DM, C_DI and C_IM are set accordingly. The scale transformation coefficient γ represents the scene depth of the texture video scene: the smaller γ is, the greater the scene depth and the stronger the sense of depth given to the viewer; conversely, the larger γ is, the smaller the scene depth and the weaker the sense of depth given to the viewer. For the "Ballet" and "Door Flower" test sequences the scene depth is relatively small, so the scale transformation coefficient γ is set to 50. The distribution map of the three-dimensional visual attention extracted for the texture video frame at time t of the "Ballet" test sequence and the corresponding depth video frame is shown in fig. 9, and the distribution map of the three-dimensional visual attention extracted for the texture video frame at time t of the "Door Flower" test sequence and the corresponding depth video frame is shown in fig. 14.
And fifthly, carrying out thresholding and macro block post-processing on the distribution graph S of the three-dimensional visual attention to obtain the final region of interest which accords with human eye stereoscopic perception of the current texture video frame.
In this embodiment, the specific process of thresholding and macro-block post-processing the distribution map S of the three-dimensional visual attention is as follows:
⑤-1, the pixel value of the pixel with coordinates (x, y) in the distribution map S of the three-dimensional visual attention is denoted s(x, y); a third threshold T_S is defined as T_S = k_T · Σ_{y=0}^{H−1} Σ_{x=0}^{W−1} s(x, y) / (W × H), where W is the width of the three-dimensional visual attention distribution map S, H is the height of the three-dimensional visual attention distribution map S, and k_T ∈ (0, 3); in the application process of this embodiment k_T may take the value 1.5. A new preliminary binary mask image is created, and it is judged whether s(x, y) ≥ T_S holds; if so, the pixel with coordinates (x, y) in the preliminary binary mask image is marked as an interest pixel, otherwise the pixel with coordinates (x, y) in the preliminary binary mask image is marked as a non-interest pixel.
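Step ⑤-1's mean-proportional threshold and preliminary binary mask, sketched with k_T = 1.5 as in this embodiment:

```python
import numpy as np

def preliminary_mask(s, k_t=1.5):
    """T_S = k_T * mean(S); pixels with s(x, y) >= T_S become interest pixels (1)."""
    t_s = k_t * s.mean()
    return (s >= t_s).astype(np.uint8)
```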
⑤-2, the preliminary binary mask image is divided into (W/w_2) × (H/h_2) non-overlapping blocks of size w_2 × h_2; the block whose block abscissa is u and block ordinate is v is denoted B_{u,v}, where u ∈ [0, W/w_2−1] and v ∈ [0, H/h_2−1]. According to each block in the preliminary binary mask image, it is determined whether the pixels in the corresponding block in the current texture video frame are interest pixels or non-interest pixels: for block B_{u,v}, it is judged whether the number of pixels marked as interest pixels in block B_{u,v} is greater than a set fourth threshold T_b, where 0 ≤ T_b ≤ w_2 × h_2; if so, all pixels in the block of the current texture video frame corresponding to block B_{u,v} are marked as interest pixels and the block corresponding to block B_{u,v} is taken as a region-of-interest block; otherwise, all pixels in the block of the current texture video frame corresponding to block B_{u,v} are marked as non-interest pixels and the block corresponding to block B_{u,v} is taken as a non-region-of-interest block. A preliminary region-of-interest mask image of the current texture video frame is thus obtained, consisting of region-of-interest blocks and non-region-of-interest blocks.
In the present embodiment, the size of each image in the test sequences "Ballet" and "Door Flower" is 1024 × 768, so the size of block B_{u,v} can be set to w_2 × h_2 = 16 × 16; a region with a small number of interest pixels is not likely to be of interest to the viewer, so the fourth threshold T_b is set to 50 here.
⑤-3, since the transition between the region of interest and the region of non-interest is gradual rather than abrupt, the invention sets N_R levels of transition regions of interest between the region of interest and the region of non-interest. All pixels in the non-region-of-interest blocks most adjacent to the region-of-interest blocks in the preliminary region-of-interest mask image are marked as the N_R-th level transition region of interest, and the preliminary region-of-interest mask image is updated; then all pixels in the non-region-of-interest blocks most adjacent to the N_R-th level transition region of interest in the updated preliminary region-of-interest mask image are marked as the (N_R−1)-th level transition region of interest, and the preliminary region-of-interest mask image is updated recursively; this recursive process is repeated until the 1st level transition region of interest has been marked. Finally, the final region-of-interest mask image of the current texture video frame is obtained, consisting of region-of-interest blocks, N_R levels of transition regions of interest and non-region-of-interest blocks. In this embodiment N_R takes the value 2, i.e. 2 levels of transition regions of interest are set.
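Steps ⑤-2 and ⑤-3 could be sketched as below, with 16 × 16 blocks, T_b = 50 and N_R = 2; the integer label encoding (0 = non-interest, 1..N_R = transition levels counted outward from the region of interest, N_R + 1 = region of interest) and the 4-neighbour adjacency are choices of this sketch.

```python
import numpy as np

def block_roi_mask(pixel_mask, block=16, t_b=50):
    """Step 5-2: a block is a region-of-interest block if it holds more than T_b interest pixels."""
    h, w = pixel_mask.shape
    bh, bw = h // block, w // block
    blocks = pixel_mask[:bh * block, :bw * block].reshape(bh, block, bw, block)
    return blocks.sum(axis=(1, 3)) > t_b                 # boolean block map of shape (H/16, W/16)

def add_transition_levels(roi_blocks, n_r=2):
    """Step 5-3: label blocks 0 = non-interest, 1..N_R = transition (N_R nearest the ROI), N_R+1 = ROI."""
    labels = np.where(roi_blocks, n_r + 1, 0)
    current = roi_blocks.copy()
    for level in range(n_r, 0, -1):                      # N_R, N_R-1, ..., 1
        # blocks 4-adjacent to the currently labelled region and not yet labelled
        neighbour = np.zeros_like(current)
        neighbour[1:, :] |= current[:-1, :]
        neighbour[:-1, :] |= current[1:, :]
        neighbour[:, 1:] |= current[:, :-1]
        neighbour[:, :-1] |= current[:, 1:]
        new = neighbour & (labels == 0)
        labels[new] = level
        current |= new
    return labels
```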
FIG. 10a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Ballet"; FIG. 15a shows the final region of interest mask image of the texture video frame at time t of the test sequence "Door Flower". In fig. 10a and 15a, the black area represents the region of interest, the gray area is the transition region of interest, and the white area is the region of non-interest.
⑤-4, the pixel value of the pixel with coordinates (x, y) in the final region-of-interest mask image is denoted r(x, y); the pixel values of all pixels in the non-region-of-interest blocks in the final region-of-interest mask image are set to r(x, y) = 255, the pixel values of all pixels in the e-th level transition region of interest among the N_R levels of transition regions of interest in the final region-of-interest mask image are set to r(x, y) = e/(N_R + 1) × f(x, y), and the pixel values of all pixels in the region-of-interest blocks in the final region-of-interest mask image are set to r(x, y) = f(x, y), thereby obtaining the region of interest of the current texture video frame, where e denotes the level of the transition region of interest, e ∈ [1, N_R], and f(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the current texture video frame.
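Step ⑤-4's rendering of the region of interest from those block labels, assuming a grayscale texture frame for simplicity and the label encoding of the previous sketch:

```python
import numpy as np

def render_roi(texture_frame, labels, block=16, n_r=2):
    """r(x, y) = f(x, y) in ROI blocks, e/(N_R+1) * f(x, y) in level-e transition blocks,
    and 255 in non-interest blocks."""
    per_pixel = np.kron(labels, np.ones((block, block), dtype=labels.dtype))
    h, w = per_pixel.shape
    f = texture_frame[:h, :w].astype(np.float32)
    r = np.full_like(f, 255.0)                 # non-interest regions default to white
    roi = per_pixel == n_r + 1
    r[roi] = f[roi]
    for e in range(1, n_r + 1):                # transition levels with reduced brightness
        sel = per_pixel == e
        r[sel] = e / (n_r + 1.0) * f[sel]
    return r.astype(np.uint8)
```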
Fig. 10b shows the region of interest of the texture video frame at time t of the test sequence "Ballet"; fig. 15b shows the region of interest of the texture video frame at time t of the test sequence "Door Flower". The regions of interest in fig. 10b and fig. 15b have the same pixel values as the texture video frame at time t and display the color texture content, the transition regions of interest are displayed as dark gray regions with reduced brightness, and the smooth white regions are the regions of non-interest corresponding to the white regions of the region-of-interest mask image. As a comparison of extraction effects, fig. 11a and fig. 16a respectively show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted only according to the static image domain visual attention cue; the noise regions with rich background texture cannot be removed. Fig. 11b and fig. 16b show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted only according to the motion visual attention cue of the prior art: for the "Ballet" sequence, the region-of-interest extraction method based only on the motion visual attention cue cannot completely extract the man with very slow motion, and at the same time the background noise caused by motion shadow is serious; for the "Door Flower" sequence, only the motion region is extracted, while the texture complexity and the sense of depth provided by stereoscopic vision are not considered. Fig. 11c and fig. 16c show the regions of interest of the texture video frames at time t of the test sequences "Ballet" and "Door Flower" extracted by combining the motion visual attention and the static image domain visual attention cues; although this method combines static and motion visual information, the texture regions and the motion noise in the background environment cannot be effectively suppressed.
As can be seen from the comparison experiments between fig. 10a and 10b and fig. 11a, 11b and 11c, and between fig. 15a and 15b and fig. 16a, 16b and 16c, the region of interest extracted by the present invention combines the static image domain visual attention, the motion visual attention and the depth visual attention, effectively suppresses the inherent singleness and inaccuracy of each individual visual attention extraction, solves the problem of noise caused by a complex background in the static image domain visual attention, and solves the problem that motion visual attention cannot extract regions of interest with only local motion or small motion amplitude, thereby improving the calculation accuracy, enhancing the stability of the algorithm, and making it possible to extract the region of interest from a background with complex texture and from a motion environment. In addition, the region of interest obtained by the invention not only accords with the visual interest characteristic of the human eye in a static texture video frame and the visual interest characteristic of the human eye in a moving object, but also accords with the depth perception characteristic of being interested in objects with strong depth impression or at close distance in stereoscopic vision, and accords with the semantic characteristics of human stereoscopic vision.
Sixthly, repeating the steps from the first step to the fifth step until all texture video frames in the texture video are processed, and obtaining the video interesting area of the texture video.
In this embodiment, the distribution map S_I of the static image domain visual attention of the current texture video frame, the distribution map S_M of the motion visual attention of the current texture video frame, the distribution map S_D of the depth visual attention of the three-dimensional video image, and the distribution map S of the three-dimensional visual attention are all grayscale maps represented with Z_S bit depth, and the depth video frame at each time in the depth video corresponding to the texture video is a grayscale map represented with Z_D bit depth; here the grayscale map uses 256 gray levels and is represented with 8-bit depth, so Z_S = 8 and Z_D = 8. Of course, other bit depths may be used to represent the grayscale map in practical applications, such as 16-bit depth; if the grayscale map is represented with 16-bit depth, the representation accuracy is higher.

Claims (8)

1. A method for extracting a video interesting region based on visual attention is characterized by comprising the following steps:
firstly, defining a two-dimensional color video as a texture video, defining the size of the texture video frame at each time in the texture video as W × H, W being the width of the texture video frame at each time in the texture video and H being the height of the texture video frame at each time in the texture video, recording the texture video frame at time t in the texture video as F_t, defining the texture video frame F_t at time t in the texture video as the current texture video frame, detecting the static image domain visual attention of the current texture video frame by adopting a static image domain visual attention detection method to obtain a distribution map of the static image domain visual attention of the current texture video frame, denoted S_I, the static image domain visual attention distribution map S_I of the current texture video frame having a size of W × H and being a grayscale map represented with Z_S bit depth;
secondly, detecting the motion visual attention of the current texture video frame by adopting a motion visual attention detection method to obtain a distribution map of the motion visual attention of the current texture video frame, denoted S_M, the motion visual attention distribution map S_M of the current texture video frame having a size of W × H and being a grayscale map represented with Z_S bit depth;
defining the depth video frame at each time in the depth video corresponding to the texture video as a grayscale map represented with Z_D bit depth, setting the size of the depth video frame at each time in the depth video to W × H, W being the width of the depth video frame at each time in the depth video and H being the height of the depth video frame at each time in the depth video, recording the depth video frame at time t in the depth video as D_t, defining the depth video frame D_t at time t in the depth video as the current depth video frame, detecting the depth visual attention of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame by adopting a depth visual attention detection method to obtain a depth visual attention distribution map of the three-dimensional video image, denoted S_D, the depth visual attention distribution map S_D of the three-dimensional video image having a size of W × H and being a grayscale map represented with Z_S bit depth;
fourthly, adopting a visual attention fusion method based on depth perception to fuse the distribution map S_I of the static image domain visual attention of the current texture video frame, the distribution map S_M of the motion visual attention of the current texture video frame and the depth visual attention distribution map S_D of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, so as to extract a distribution map of three-dimensional visual attention conforming to human-eye stereoscopic perception, denoted S, the distribution map S of three-dimensional visual attention having a size of W × H and being a grayscale map represented with Z_S bit depth;
performing thresholding and macro block post-processing on the distribution graph S of the three-dimensional visual attention to obtain a final region of interest which accords with human eye three-dimensional perception of the current texture video frame;
sixthly, repeating the steps from the first step to the fifth step until all texture video frames in the texture video are processed, and obtaining the video interesting area of the texture video.
2. The method for extracting a video interesting region based on visual attention according to claim 1, wherein the specific process of the moving visual attention detection method in the step (II) is as follows:
②-1, recording the texture video frame at the time t+j continuous with the current texture video frame in the texture video as F_{t+j}, and recording the texture video frame at the time t−j continuous with the current texture video frame in the texture video as F_{t-j}, wherein j ∈ (0, N_F/2] and N_F is a positive integer less than 10;
②-2, calculating, by adopting an optical flow method, the motion vector image in the horizontal direction and the motion vector image in the vertical direction of the current texture video frame and the texture video frame F_{t+j} at time t+j, and the motion vector image in the horizontal direction and the motion vector image in the vertical direction of the current texture video frame and the texture video frame F_{t-j} at time t−j; recording the motion vector image in the horizontal direction of the current texture video frame and the texture video frame F_{t+j} at time t+j as V^H_{t+j} and the motion vector image in the vertical direction as V^V_{t+j}; recording the motion vector image in the horizontal direction of the current texture video frame and the texture video frame F_{t-j} at time t−j as V^H_{t-j} and the motion vector image in the vertical direction as V^V_{t-j}; V^H_{t+j}, V^V_{t+j}, V^H_{t-j} and V^V_{t-j} all have width W and height H;
②-3, superposing the absolute value of V^H_{t+j} and the absolute value of V^V_{t+j} to obtain the motion amplitude image of the current texture video frame and the texture video frame F_{t+j} at time t+j, denoted M_{t+j}, M_{t+j} = |V^H_{t+j}| + |V^V_{t+j}|, the motion amplitude value of the pixel with coordinates (x, y) in M_{t+j} being denoted m_{t+j}(x, y); superposing the absolute value of V^H_{t-j} and the absolute value of V^V_{t-j} to obtain the motion amplitude image of the current texture video frame and the texture video frame F_{t-j} at time t−j, denoted M_{t-j}, M_{t-j} = |V^H_{t-j}| + |V^V_{t-j}|, the motion amplitude value of the pixel with coordinates (x, y) in M_{t-j} being denoted m_{t-j}(x, y);
②-4, extracting the joint motion map, denoted M^Δ_j, by using the current texture video frame, the texture video frame F_{t+j} at time t+j and the texture video frame F_{t-j} at time t−j, the specific process of extracting the joint motion map M^Δ_j being: judging whether the minimum of the motion amplitude value of each pixel in the motion amplitude image M_{t+j} of the current texture video frame and the texture video frame F_{t+j} at time t+j and the motion amplitude value of the pixel of the corresponding coordinates in the motion amplitude image M_{t-j} of the current texture video frame and the texture video frame F_{t-j} at time t−j is larger than a set first threshold T_1; if so, determining the pixel value of the pixel of the corresponding coordinates in the joint motion map M^Δ_j to be the average of the sum of the motion amplitude values of the pixels of the corresponding coordinates in M_{t+j} and M_{t-j}; otherwise, determining the pixel value of the pixel of the corresponding coordinates in the joint motion map M^Δ_j to be 0; that is, for the pixel with coordinates (x, y) in M_{t+j} and the pixel with coordinates (x, y) in M_{t-j}, judging whether min(m_{t+j}(x, y), m_{t-j}(x, y)) is larger than the set first threshold T_1; if so, determining the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j to be m^Δ_j(x, y) = (m_{t+j}(x, y) + m_{t-j}(x, y))/2; otherwise, determining the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j to be 0, where min() is the minimum-value function;
②-5, weighting and superposing the joint motion maps at each time at a distance of 1 to N_F/2 from time t to obtain the weighted joint motion map of the current texture video frame, denoted M, the pixel value of the pixel with coordinates (x, y) in the weighted joint motion map M of the current texture video frame being denoted m(x, y), m(x, y) = Σ_{j=1}^{N_F/2} ζ_j · m^Δ_j(x, y), where m^Δ_j(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the joint motion map M^Δ_j at a time distance of j from time t, ζ_j is a weighting coefficient, and the weighting coefficients ζ_j satisfy Σ_{j=1}^{N_F/2} ζ_j = 1;
②-6, performing Gaussian pyramid decomposition on the weighted joint motion map M of the current texture video frame to decompose it into n_L layer weighted joint motion maps, the i-th layer weighted joint motion map obtained after Gaussian pyramid decomposition of the weighted joint motion map M of the current texture video frame being denoted M(i), and the width and height of the i-th layer weighted joint motion map M(i) being W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], W is the width of the current texture video frame, and H is the height of the current texture video frame;
②-7, using the n_L layer weighted joint motion maps of the weighted joint motion map of the current texture video frame, extracting the motion visual attention distribution map S_M of the current texture video frame, the pixel value of the pixel with coordinates (x, y) in S_M being denoted s_m(x, y), S_M = F_M, where F_M = N( ⊕_{c,s} | M(c) ⊖ M(s) | ), s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, M(c) is the c-th layer weighted joint motion map, M(s) is the s-th layer weighted joint motion map, the symbol ⊖ denotes performing the cross-level difference operation on M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and each pixel of M(c) is differenced with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and each pixel of M(s) is differenced with the corresponding pixel of the up-sampled M(c); the symbol ⊕ denotes performing the cross-level addition operation on M(c) and M(s): if c < s, M(s) is up-sampled to an image with the same resolution as M(c), and each pixel of M(c) is summed with the corresponding pixel of the up-sampled M(s); if c > s, M(c) is up-sampled to an image with the same resolution as M(s), and each pixel of M(s) is summed with the corresponding pixel of the up-sampled M(c).
3. The method as claimed in claim 2, wherein the first threshold set in step ②-4 is T_1 = 1.
4. The method for extracting video interesting regions based on visual attention according to claim 1 or 2, wherein the depth visual attention detection method in the third step comprises the following specific processes:
③-1, performing Gaussian pyramid decomposition on the current depth video frame to obtain n_L layer depth video frames, the i-th layer depth video frame obtained after Gaussian pyramid decomposition of the current depth video frame being denoted D(i), and the width and height of the i-th layer depth video frame D(i) being W/2^i and H/2^i respectively, where n_L is a positive integer less than 20, i ∈ [0, n_L−1], W is the width of the current depth video frame, and H is the height of the current depth video frame;
③-2, using the n_L layer depth video frames of the current depth video frame, extracting the depth feature map of the current depth video frame, denoted F_D, F_D = N( ⊕_{c,s} | D(c) ⊖ D(s) | ), where s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, D(c) is the c-th layer depth video frame, D(s) is the s-th layer depth video frame, the symbol ⊖ denotes performing the cross-level difference operation on D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and each pixel of D(c) is differenced with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and each pixel of D(s) is differenced with the corresponding pixel of the up-sampled D(c); the symbol ⊕ denotes performing the cross-level addition operation on D(c) and D(s): if c < s, D(s) is up-sampled to an image with the same resolution as D(c), and each pixel of D(c) is summed with the corresponding pixel of the up-sampled D(s); if c > s, D(c) is up-sampled to an image with the same resolution as D(s), and each pixel of D(s) is summed with the corresponding pixel of the up-sampled D(c);
③-3, performing convolution operations on the current depth video frame with Gabor filters in the 0, π/4, π/2 and 3π/4 directions to extract the four directional components in the 0, π/4, π/2 and 3π/4 directions and obtain four direction component maps of the current depth video frame, denoted O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} respectively; decomposing each of the O^D_0, O^D_{π/4}, O^D_{π/2} and O^D_{3π/4} direction component maps of the current depth video frame into n_L layer direction component maps by Gaussian pyramid decomposition, the i-th layer direction component map obtained by Gaussian pyramid decomposition of the direction component map in the θ direction being denoted O^D_θ(i), the width and height of O^D_θ(i) being W/2^i and H/2^i respectively, where θ ∈ {0, π/4, π/2, 3π/4}, i ∈ [0, n_L−1], W is the width of the current depth video frame, and H is the height of the current depth video frame;
③-4, using the n_L layer direction component maps of the direction component map of each direction of the current depth video frame, extracting the preliminary depth direction feature map of the current depth video frame, denoted F'_DO, F'_DO = (1/4) · Σ_{θ∈{0, π/4, π/2, 3π/4}} F̄_{O_θ}, where F̄_{O_θ} = N( ⊕_{c,s} | O^D_θ(c) ⊖ O^D_θ(s) | ), s, c ∈ [0, n_L−1], s = c + δ, δ = {−3, −2, −1, 1, 2, 3}, N(·) is a normalization function to the interval [0, 2^{Z_S}−1], the symbol "| |" is the absolute value operator, O^D_θ(c) is the c-th layer direction component map of the θ direction, O^D_θ(s) is the s-th layer direction component map of the θ direction, the symbol ⊖ denotes performing the cross-level difference operation on O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c), and each pixel of O^D_θ(c) is differenced with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s), and each pixel of O^D_θ(s) is differenced with the corresponding pixel of the up-sampled O^D_θ(c); the symbol ⊕ denotes performing the cross-level addition operation on O^D_θ(c) and O^D_θ(s): if c < s, O^D_θ(s) is up-sampled to an image with the same resolution as O^D_θ(c), and each pixel of O^D_θ(c) is summed with the corresponding pixel of the up-sampled O^D_θ(s); if c > s, O^D_θ(c) is up-sampled to an image with the same resolution as O^D_θ(s), and each pixel of O^D_θ(s) is summed with the corresponding pixel of the up-sampled O^D_θ(c);
③-5, using a morphological dilation algorithm with a block of size w_1 × h_1 as the basic dilation unit, performing n_1 dilation operations on the preliminary depth direction feature map F'_DO of the current depth video frame to obtain the depth direction feature map of the current depth video frame, denoted F_DO;
③-6, using the depth feature map F_D and the depth direction feature map F_DO of the current depth video frame, obtaining the distribution map of the preliminary depth visual attention of the current depth video frame, denoted S'_D, by combining F_D and F_DO and normalizing the result with N(·), a normalization function to the interval [0, 2^{Z_S}−1], the pixel value of the pixel with coordinates (x, y) in S'_D being denoted s'_d(x, y);
③-7, using the distribution map S'_D of the preliminary depth visual attention of the current depth video frame, obtaining the depth visual attention distribution map S_D of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, the pixel value of the pixel with coordinates (x, y) in S_D being denoted s_d(x, y), s_d(x, y) = s'_d(x, y) · g(x, y), where g(x, y) = 0.2 if x < b || y < b || x > W−b || y > H−b; g(x, y) = 0.4 else if x < 2b || y < 2b || x > W−2b || y > H−2b; g(x, y) = 0.6 else if x < 3b || y < 3b || x > W−3b || y > H−3b; and g(x, y) = 1 otherwise; W is the width of the current depth video frame, H is the height of the current depth video frame, b is the set second threshold, and the symbol "||" is the "or" operator.
5. The method according to claim 4, wherein in step ③-5, w_1 = 8, h_1 = 8 and n_1 = 2, and in step ③-7 the second threshold b is set to 16.
6. The method for extracting a video interesting region based on visual attention according to claim 1, wherein the visual attention fusion method based on depth perception in the step (iv) comprises the following specific processes:
④-1, performing a scale transformation on the current depth video frame as Q(d(x, y)) = d(x, y) + γ, where γ is a coefficient within a set range, d(x, y) represents the pixel value of the pixel with coordinates (x, y) in the current depth video frame, and Q(d(x, y)) represents the pixel value of the pixel with coordinates (x, y) in the scale-transformed current depth video frame;
④-2. Using the scale-transformed current depth video frame, the current depth video frame, the depth visual attention distribution map SD of the three-dimensional video image jointly represented by the current depth video frame and the current texture video frame, the motion visual attention distribution map SM of the current texture video frame, and the static image domain visual attention distribution map SI of the current texture video frame, obtain the three-dimensional visual attention distribution map S. The pixel value of the pixel with coordinates (x, y) in S is s(x, y), computed as [formula image], where KD, KM and KI are the weighting coefficients of SD, SM and SI respectively, satisfying the condition [formula image] with 0 ≤ Ka ≤ 1; [formula image] is a normalization function that normalizes values to a fixed interval; sD(x, y), sM(x, y) and sI(x, y) respectively represent the pixel values of the pixel with coordinates (x, y) in SD, SM and SI; Θab(x, y) is the visual attention correlation value, Θab(x, y) = min(sa(x, y), sb(x, y)), where min() is the minimum-value function; and Cab is the correlation coefficient, satisfying the condition [formula image] with 0 ≤ Cab < 1, where the correlation coefficient CDM denotes the degree of correlation between SD and SM, the correlation coefficient CDI denotes the degree of correlation between SD and SI, and the correlation coefficient CIM denotes the degree of correlation between SI and SM, with a, b ∈ {D, M, I} and a ≠ b.
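A short sketch of the fusion in step ④-2 is given below. The claim names the weights KD, KM and KI, the correlation values Θab(x, y) = min(sa, sb) and the coefficients Cab, but the exact combining formula appears only as an image; the weighted-sum-minus-correlation form used here is therefore an assumption that merely wires those named quantities together.

```python
# Sketch of step ④-2; the combining formula itself is an assumption.
import numpy as np
from itertools import combinations

def fuse_attention(maps, weights, corr):
    """maps, weights: dicts keyed by 'D', 'M', 'I'; corr: dict keyed by sorted pairs."""
    s = sum(weights[k] * maps[k].astype(np.float64) for k in maps)   # sum of Ka * sa
    for a, b in combinations(sorted(maps), 2):
        # subtract the shared energy Cab * THETAab(x, y), with THETAab = min(sa, sb)
        s -= corr[(a, b)] * np.minimum(maps[a], maps[b])
    return np.clip(s, 0.0, 255.0)   # assumed final normalization

# Example call with equal weights and modest correlation coefficients:
# S = fuse_attention({'D': SD, 'M': SM, 'I': SI},
#                    {'D': 1/3, 'M': 1/3, 'I': 1/3},
#                    {('D', 'I'): 0.2, ('D', 'M'): 0.2, ('I', 'M'): 0.2})
```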
7. The method for extracting a video region of interest based on visual attention according to claim 6, wherein the specific process of thresholding and macroblock post-processing the three-dimensional visual attention distribution map S in step ⑤ is as follows:
⑤-1. Denote the pixel value of the pixel with coordinates (x, y) in the three-dimensional visual attention distribution map S as s(x, y), and define a third threshold TS by [formula image], where W is the width of the three-dimensional visual attention distribution map S, H is its height, and kT ∈ (0, 3). Create a new preliminary binary mask image and judge whether s(x, y) ≥ TS; if so, mark the pixel with coordinates (x, y) in the preliminary binary mask image as a pixel of interest, otherwise mark it as a non-interest pixel;
⑤-2. Divide the preliminary binary mask image into (W/w2)×(H/h2) non-overlapping blocks of size w2×h2, and denote the block whose block abscissa is u and whose block ordinate is v as Bu,v, where u ∈ [0, W/w2−1] and v ∈ [0, H/h2−1]. According to each block in the preliminary binary mask image, determine whether the pixels in the corresponding block of the current texture video frame are pixels of interest or non-interest pixels: for block Bu,v, judge whether the number of pixels marked as pixels of interest in Bu,v is greater than a set fourth threshold Tb, where 0 ≤ Tb ≤ w2×h2; if so, mark all pixels in the block of the current texture video frame corresponding to Bu,v as pixels of interest and take the block corresponding to Bu,v as a region-of-interest block; otherwise, mark all pixels in the block of the current texture video frame corresponding to Bu,v as non-interest pixels and take the block corresponding to Bu,v as a non-region-of-interest block. This yields a preliminary region-of-interest mask image of the current texture video frame, composed of region-of-interest blocks and non-region-of-interest blocks;
⑤-3. Mark all pixels in every non-region-of-interest block Bu,v that is nearest-adjacent to a region-of-interest block in the preliminary region-of-interest mask image as the NR-th level transition region of interest, and update the preliminary region-of-interest mask image; then mark all pixels in every non-region-of-interest block Bu,v that is nearest-adjacent to the NR-th level transition region of interest in the updated preliminary region-of-interest mask image as the (NR−1)-th level transition region of interest, and recursively update the preliminary region-of-interest mask image; repeat this recursion until the 1st level transition region of interest has been marked. Finally, obtain the final region-of-interest mask image of the current texture video frame, composed of region-of-interest blocks, NR levels of transition regions of interest, and non-region-of-interest blocks;
⑤-4. Denote the pixel value of the pixel with coordinates (x, y) in the final region-of-interest mask image as r(x, y). Set the pixel values of all pixels in the non-region-of-interest blocks of the final region-of-interest mask image to r(x, y) = 255; set the pixel values of all pixels in the NR levels of transition regions of interest of the final region-of-interest mask image according to [formula image]; and set the pixel values of all pixels in the region-of-interest blocks of the final region-of-interest mask image to r(x, y) = f(x, y). This yields the region of interest of the current texture video frame, where e represents the level of the transition region of interest, e ∈ [1, NR], and f(x, y) represents the pixel value of the pixel with coordinates (x, y) in the current texture video frame.
8. The method according to claim 7, wherein in step ⑤-2, w2 = 16, h2 = 16 and the fourth threshold Tb = 50.
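The thresholding and macroblock post-processing of claims 7 and 8 can be sketched as below. The definition of the third threshold TS is given only as an image in the claim, so TS = kT · mean(S) is an assumed stand-in that at least uses the quantities W, H and kT named there; the block size, the fourth threshold Tb and the recursive marking of NR transition levels follow the claim text, while the final pixel-value assignment of step ⑤-4 (whose transition formula is likewise only an image) is omitted.

```python
# Sketch of steps ⑤-1 to ⑤-3; TS = kT * mean(S) is an assumed stand-in.
import numpy as np

def roi_block_labels(S, kT=1.5, w2=16, h2=16, Tb=50, NR=2):
    H, W = S.shape
    TS = kT * S.sum() / (W * H)            # assumed form of the third threshold
    interest = S >= TS                     # ⑤-1: preliminary binary mask

    # ⑤-2: a block becomes a region-of-interest block when it contains more
    # than Tb pixels of interest. Labels: 0 = ROI block, NR + 1 = non-ROI block.
    bu, bv = W // w2, H // h2
    labels = np.full((bv, bu), NR + 1, dtype=int)
    for v in range(bv):
        for u in range(bu):
            if interest[v*h2:(v+1)*h2, u*w2:(u+1)*w2].sum() > Tb:
                labels[v, u] = 0

    # ⑤-3: ring by ring, non-ROI blocks 4-adjacent to the already marked area
    # become transition levels NR (touching the ROI) down to 1 (outermost ring).
    marked = labels == 0
    for level in range(NR, 0, -1):
        grow = np.zeros_like(marked)
        grow[1:, :] |= marked[:-1, :]
        grow[:-1, :] |= marked[1:, :]
        grow[:, 1:] |= marked[:, :-1]
        grow[:, :-1] |= marked[:, 1:]
        new = (labels == NR + 1) & grow    # untouched non-ROI neighbours
        labels[new] = level
        marked |= new
    return labels  # per block: 0 = ROI, NR..1 = transition level, NR+1 = non-ROI
```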
CN2009101525203A 2009-09-11 2009-09-11 Method for extracting video interested region based on visual attention Expired - Fee Related CN101651772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101525203A CN101651772B (en) 2009-09-11 2009-09-11 Method for extracting video interested region based on visual attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009101525203A CN101651772B (en) 2009-09-11 2009-09-11 Method for extracting video interested region based on visual attention

Publications (2)

Publication Number Publication Date
CN101651772A CN101651772A (en) 2010-02-17
CN101651772B true CN101651772B (en) 2011-03-16

Family

ID=41673862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101525203A Expired - Fee Related CN101651772B (en) 2009-09-11 2009-09-11 Method for extracting video interested region based on visual attention

Country Status (1)

Country Link
CN (1) CN101651772B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853513B (en) * 2010-06-06 2012-02-29 华中科技大学 Time and space significance visual attention method based on entropy
CN101894371B (en) * 2010-07-19 2011-11-30 华中科技大学 Bio-inspired top-down visual attention method
US8994792B2 (en) 2010-08-27 2015-03-31 Broadcom Corporation Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information
CN101950362B (en) * 2010-09-14 2013-01-09 武汉大学 Analytical system for attention of video signal
CN101964911B (en) * 2010-10-09 2012-10-17 浙江大学 Ground power unit (GPU)-based video layering method
CN102034267A (en) * 2010-11-30 2011-04-27 中国科学院自动化研究所 Three-dimensional reconstruction method of target based on attention
JP5627439B2 (en) * 2010-12-15 2014-11-19 キヤノン株式会社 Feature detection apparatus, feature detection method, and program thereof
CN102036073B (en) * 2010-12-21 2012-11-28 西安交通大学 Method for encoding and decoding JPEG2000 image based on vision potential attention target area
CN102063623B (en) * 2010-12-28 2012-11-07 中南大学 Method for extracting image region of interest by combining bottom-up and top-down ways
EP2485495A3 (en) * 2011-02-03 2013-08-28 Broadcom Corporation Method and system for creating a 3D video from a monoscopic 2D video and corresponding depth information
US20130009980A1 (en) * 2011-07-07 2013-01-10 Ati Technologies Ulc Viewing-focus oriented image processing
CN102496024B (en) * 2011-11-25 2014-03-12 山东大学 Method for detecting incident triggered by characteristic frame in intelligent monitor
CN102663741B (en) * 2012-03-22 2014-09-24 侯克杰 Method for carrying out visual stereo perception enhancement on color digit image and system thereof
US9661296B2 (en) 2012-07-12 2017-05-23 Samsung Electronics Co., Ltd. Image processing apparatus and method
CN103095996B (en) * 2013-01-25 2015-09-02 西安电子科技大学 Based on the multisensor video fusion method that time and space significance detects
CN104010180B (en) * 2014-06-13 2017-01-25 华为技术有限公司 Method and device for filtering three-dimensional video
CN104318569B (en) * 2014-10-27 2017-02-22 北京工业大学 Space salient region extraction method based on depth variation model
CN105550685B (en) * 2015-12-11 2019-01-08 哈尔滨工业大学 The large format remote sensing image area-of-interest exacting method of view-based access control model attention mechanism
CN105893999B (en) * 2016-03-31 2019-12-13 北京奇艺世纪科技有限公司 region-of-interest extraction method and device
CN109754357B (en) * 2018-01-26 2021-09-21 京东方科技集团股份有限公司 Image processing method, processing device and processing equipment
CN108961261B (en) * 2018-03-14 2022-02-15 中南大学 Optic disk region OCT image hierarchy segmentation method based on space continuity constraint
CN110110578B (en) * 2019-02-21 2023-09-29 北京工业大学 Indoor scene semantic annotation method
CN109903247B (en) * 2019-02-22 2023-02-03 西安工程大学 High-precision graying method for color image based on Gaussian color space correlation
CN111723829B (en) * 2019-03-18 2022-05-06 四川大学 Full-convolution target detection method based on attention mask fusion
CN110070538B (en) * 2019-04-28 2022-04-15 华北电力大学(保定) Bolt two-dimensional visual structure clustering method based on morphological optimization depth characteristics
CN110399842B (en) * 2019-07-26 2021-09-28 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN110675940A (en) * 2019-08-01 2020-01-10 平安科技(深圳)有限公司 Pathological image labeling method and device, computer equipment and storage medium
CN112654546B (en) * 2020-04-30 2022-08-02 华为技术有限公司 Method and device for identifying object of interest of user
CN113572958B (en) * 2021-07-15 2022-12-23 杭州海康威视数字技术股份有限公司 Method and equipment for automatically triggering camera to focus
CN113936015B (en) * 2021-12-17 2022-03-25 青岛美迪康数字工程有限公司 Method and device for extracting effective region of image

Also Published As

Publication number Publication date
CN101651772A (en) 2010-02-17

Similar Documents

Publication Publication Date Title
CN101651772B (en) Method for extracting video interested region based on visual attention
CN101588445B (en) Video area-of-interest extraction method based on depth
CN102592275B (en) Virtual viewpoint rendering method
US8488868B2 (en) Generation of a depth map from a monoscopic color image for rendering stereoscopic still and video images
CN102271254B (en) Depth image preprocessing method
CN109462747B (en) DIBR system cavity filling method based on generation countermeasure network
CN101699512B (en) Depth generating method based on background difference sectional drawing and sparse optical flow method
CN102202224B (en) Caption flutter-free method and apparatus used for plane video stereo transition
CN108513131B (en) Free viewpoint video depth map region-of-interest coding method
CN110378838A (en) Become multi-view image generation method, device, storage medium and electronic equipment
CN105069808A (en) Video image depth estimation method based on image segmentation
CN103384340B (en) Method for obtaining 3D imaging image from single 2D image
CN102420985B (en) Multi-view video object extraction method
Kuo et al. Depth estimation from a monocular view of the outdoors
Xu et al. Depth-aided exemplar-based hole filling for DIBR view synthesis
CN102447939A (en) Method for optimizing 2D (two-dimensional) to 3D (three-dimensional) conversion of video work
Schmeing et al. Depth image based rendering
Hwang et al. Stereo image quality assessment using visual attention and distortion predictors
CN112634127B (en) Unsupervised stereo image redirection method
CN101610422A (en) Method for compressing three-dimensional image video sequence
Yang et al. Depth map generation using local depth hypothesis for 2D-to-3D conversion
Ye et al. Hybrid scheme of image’s regional colorization using mask r-cnn and Poisson editing
CN102930542A (en) Detection method for vector saliency based on global contrast
CN104754320A (en) Method for calculating 3D-JND threshold value
CN108712642B (en) Automatic selection method for adding position of three-dimensional subtitle suitable for three-dimensional video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHANGHAI SILICON INTELLECTUAL PROPERTY EXCHANGE CE

Free format text: FORMER OWNER: NINGBO UNIVERSITY

Effective date: 20120105

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 315211 NINGBO, ZHEJIANG PROVINCE TO: 200030 XUHUI, SHANGHAI

TR01 Transfer of patent right

Effective date of registration: 20120105

Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704

Patentee after: Shanghai Silicon Intellectual Property Exchange Co.,Ltd.

Address before: 315211 Zhejiang Province, Ningbo Jiangbei District Fenghua Road No. 818

Patentee before: Ningbo University

ASS Succession or assignment of patent right

Owner name: SHANGHAI SIPAI KESI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: SHANGHAI SILICON INTELLECTUAL PROPERTY EXCHANGE CENTER CO., LTD.

Effective date: 20120217

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 200030 XUHUI, SHANGHAI TO: 201203 PUDONG NEW AREA, SHANGHAI

TR01 Transfer of patent right

Effective date of registration: 20120217

Address after: 201203 Shanghai Chunxiao Road No. 350 South Building Room 207

Patentee after: Shanghai spparks Technology Co.,Ltd.

Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704

Patentee before: Shanghai Silicon Intellectual Property Exchange Co.,Ltd.

ASS Succession or assignment of patent right

Owner name: SHANGHAI GUIZHI INTELLECTUAL PROPERTY SERVICE CO.,

Free format text: FORMER OWNER: SHANGHAI SIPAI KESI TECHNOLOGY CO., LTD.

Effective date: 20120606

C41 Transfer of patent application or patent right or utility model
C56 Change in the name or address of the patentee
CP02 Change in the address of a patent holder

Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1706

Patentee after: Shanghai spparks Technology Co.,Ltd.

Address before: 201203 Shanghai Chunxiao Road No. 350 South Building Room 207

Patentee before: Shanghai spparks Technology Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20120606

Address after: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704

Patentee after: Shanghai Guizhi Intellectual Property Service Co.,Ltd.

Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1706

Patentee before: Shanghai spparks Technology Co.,Ltd.

DD01 Delivery of document by public notice

Addressee: Shi Lingling

Document name: Notification of Passing Examination on Formalities

TR01 Transfer of patent right

Effective date of registration: 20200120

Address after: 201203 block 22301-1450, building 14, No. 498, GuoShouJing Road, Pudong New Area (Shanghai) pilot Free Trade Zone, Shanghai

Patentee after: Shanghai spparks Technology Co.,Ltd.

Address before: 200030 Shanghai City No. 333 Yishan Road Huixin International Building 1 building 1704

Patentee before: Shanghai Guizhi Intellectual Property Service Co.,Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110316

Termination date: 20200911

CF01 Termination of patent right due to non-payment of annual fee