WO2009097714A1 - Depth searching method and depth estimating method for multi-viewing angle video image - Google Patents


Info

Publication number
WO2009097714A1
Authority
WO
WIPO (PCT)
Prior art keywords
depth
search
pixel
view
value
Prior art date
Application number
PCT/CN2008/072141
Other languages
French (fr)
Chinese (zh)
Inventor
Xiaoyun Zhang
George L Yang
Original Assignee
Panovasic Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panovasic Technology Co., Ltd.
Publication of WO2009097714A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images

Definitions

  • TECHNICAL FIELD The present invention relates to multi-view video image processing techniques.
  • BACKGROUND OF THE INVENTION In recent years, researchers have recognized that advanced three-dimensional television and free viewpoint video systems (FVV, Free Viewpoint Video System) should make use of computer vision, video processing, and depth-image-based scene synthesis. These techniques separate the acquisition and display settings of the video, so that the viewing angle and the orientation of the acquiring cameras do not restrict each other, providing a high degree of flexibility, interactivity, and operability.
  • European stereoscopic television projects adopted a video-plus-depth data format, in which each image pixel carries a depth value (C. Fehn, "Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," in Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291, CA, U.S.A., Jan. 2004, pp. 93-104). With depth-image-based rendering (DIBR), the decoder at the receiving end generates stereo image pairs according to the display settings and viewing angle, so that the viewing angle and the camera orientation are mutually unrestricted.
  • An April 2007 JVT proposal (A. Smolic and K. Mueller, et al., "Multi-View Video plus Depth (MVD) Format for Advanced 3D Video Systems", ISO/IEC JTC1/SC29/WG11, Doc. JVT-W100, San Jose, USA, April 2007) extended video plus depth to multi-view video as the MVD (multi-view video plus depth) coding format. Because MVD meets an essential requirement of advanced 3D or free-viewpoint video applications, namely generating continuous views of arbitrary angle within a range at the decoder rather than a limited number of discrete views, the video-plus-depth MVD scheme has been adopted by JVT and identified as the direction of future development. How to obtain the depth information of a scene from two or more views of different viewpoints has therefore become one of the important problems in multi-view video processing.
  • The current depth search approach performs the search with a fixed step size (a uniform depth grid) within a fixed search range. With a fixed step, if the step corresponds to an offset of 1 pixel at a smaller depth value, then at a larger depth value the pixel offset corresponding to that step is less than 1 pixel. If projection at a given depth value lands on a non-integer pixel and the nearest-neighbor pixel is taken as the projection point, the same pixel is then searched at several different depth values, i.e. a repeated search occurs.
  • Conversely, if the given step corresponds to a 1-pixel offset at a larger depth value, then at a smaller depth value the pixel offset corresponding to that step is greater than 1 pixel, i.e. two adjacent depth values map to two non-adjacent pixels, so some pixels are skipped and the search is incomplete. One therefore expects to search N pixels over the range [z_min, z_max], but because of repeated and missed searches the number of effective search points is less than N.
  • To guarantee that the search range covers all possible values of the true scene depth, the range is usually set large, and to guarantee a certain search accuracy the step is set small; this greatly increases the number of searches and the corresponding computation, and because of missed and repeated searches the search results are still poor.
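To make the repeated/missed search behaviour concrete, the following is an illustrative sketch under an assumed parallel-camera model (disparity d = f·B/z); the constant f·B is an assumption chosen so that a 10 mm step corresponds to about a 1-pixel offset near a mid-range depth of 3170 mm, matching the working example given later in this document.

```python
# Assumed parallel-camera model, not from the patent text:
# disparity d = f*B / z (pixels); f*B chosen so that a 10 mm depth step
# moves the projection by ~1 pixel near z = 3170 mm.
fB = 3170.0 ** 2 / 10.0                               # pixels * mm

z_values = list(range(2000, 4501, 10))                # fixed 10 mm depth grid
pixels = [round(fB / z) for z in z_values]            # nearest-neighbor pixel hit

print(len(z_values), "depth values searched")         # 251 candidate depths
print(len(set(pixels)), "distinct pixels reached")    # fewer: repeats at large z
                                                      # and skipped pixels at small z
```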
  • To date there has been much research on depth estimation, but most algorithms first estimate disparity on rectified, parallel stereo image pairs and then compute depth information from the relation between disparity and depth. For example, in a parallel camera system only horizontal disparity exists between two images; disparity is first estimated by feature-based or block-matching methods, and depth is then computed from the inverse relation between depth and disparity. For non-parallel camera systems, a series of steps (image rectification, disparity matching, depth computation, and de-rectification) is required to obtain the depth map of the original view, so this class of depth estimation is essentially disparity estimation and its performance is determined mainly by the disparity estimation algorithm. Disparity estimation, or stereo matching, is a classic problem in computer vision; despite a large body of work and results, the matching ambiguity and uncertainty caused by lack of texture or by occlusion keep disparity matching a difficult and active research topic in computer vision.
  • In 2006, a JVT proposal (S. Yea, J. Oh, S. Ince, E. Martinian and A. Vetro, "Report on Core Experiment CE3 of Multiview Coding", ISO/IEC JTC1/SC29/WG11, Doc. JVT-T123, Klagenfurt, Austria, July 2006) proposed using the internal and external camera parameters together with depth-based view synthesis: within a specified depth search range, with a given search step, the depth that minimizes the error between the synthesized view and the actual view is searched for and taken as the estimate.
  • M. Okutomi et al. proposed the multiple-baseline stereo matching method. It uses the inverse relation between depth and disparity to convert disparity estimation into a depth-solving problem and eliminates the ambiguity of disparity matching (M. Okutomi and T. Kanade, "A multiple-baseline stereo", IEEE Trans. on Pattern Analysis and Machine Intelligence 15(4): 353-363, 1993).
  • N. Kim et al. proposed carrying out the depth search, matching, and view synthesis operations directly in range/depth space (N. Kim, M. Trivedi and H. Ishiguro, "Generalized multiple baseline stereo and direct view synthesis using range-space search, match, and render", International Journal of Computer Vision 47(1/2/3): 131-148, 2002): the depth search is performed directly in depth space with no disparity matching, image rectification is handled within the search process, and the depth value is continuous, so its precision is not limited by the image pixel resolution as a disparity vector is. In practice, however, a depth search range and step size must be specified and an optimum found under some cost function, and whether these values are suitable is critical to the estimation performance.
  • In disparity matching, the disparity search range is usually determined intuitively from the nature of the images. In depth search, especially with non-parallel camera systems, the relation between depth variation and image pixel offset is not obvious, so the search range is difficult to determine reasonably. How to determine a suitable depth search interval and step size for a given set of multi-view images is therefore the key to estimating depth information effectively.
  • JVT-W059 (S. Yea and A. Vetro, "Report of CE6 on View Synthesis Prediction", ISO/IEC JTC1/SC29/WG11, Doc. JVT-W059, San Jose, USA, April 2007) proposed using matched feature-point pairs of two views and selecting, from several candidate combinations of minimum depth, maximum depth, and step size, the one minimizing the error between the matched feature points as the depth search range and step. The method requires feature extraction and matching with the KLT (Kanade-Lucas-Tomasi) algorithm (C. Tomasi and T. Kanade, "Detection and tracking of point features", Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991), so its performance depends on the correctness of the feature matching.
  • M. Okutomi, N. Kim, et al. take as search step the depth variation corresponding to a 1-pixel offset in the reference view with the longest baseline, which guarantees that the pixel offset in all other reference views is less than 1 pixel.
  • Both of the above methods use a fixed search step; the step is not adapted to changes in the image content or the scene.
  • SUMMARY OF THE INVENTION The technical problem to be solved by the present invention is to provide an adaptive method for determining the search step size that avoids repeated and missed searches of pixel points.
  • In addition, the present invention proposes a depth estimation method based on the adaptive search step size.
  • The technical solution adopted by the present invention is a multi-view video image depth search method in which the search step of each step within the depth search range is adjusted dynamically according to the current depth value: the smaller the current depth value, the smaller the step; the larger the current depth value, the larger the step, so that every step corresponds to the same pixel search precision.
  • Using the relation between the depth variation value and the pixel offset vector, the determination of the depth search range and step size is converted into the determination of a pixel search range and a pixel search precision. The pixel search precision equals the length of the pixel offset vector of each search step; it may be sub-pixel (e.g. one-half or one-quarter pixel) or integer-pixel (e.g. one or two pixels). The search step equals the depth variation corresponding to that pixel offset vector.
  • The search step of the target view is determined by the current depth value in the target view, the pixel offset vector in the reference view, and the internal and external camera parameters of the two views; each step in the target view corresponds to a pixel offset vector of the same length in the reference view. The target view is the image whose depth currently needs to be estimated; the reference views are the other images in the multi-view video system. The reference view can be selected automatically during the depth search or specified by the user.
  • A depth estimation method for multi-view video images: using depth-based view synthesis and block-matching-based depth search, the depth search range and the search step of the target view are determined by the pixel search range and the pixel search precision of the reference view.
  • Within the search range, the step of each search step is adjusted dynamically according to the current depth value (the smaller the current depth value, the smaller the step; the larger the current depth value, the larger the step), so that every step corresponds to the same pixel search precision. The depth search step is determined by the current depth value in the target view, the pixel offset vector in the reference view, and the camera parameters of the two views; each step in the target view corresponds to a pixel offset vector of the same length in the reference view.
  • Depth-based view synthesis means: given a pixel of the target view and its depth value, the pixel is back-projected to a point in three-dimensional scene space according to the internal and external camera parameters of the target and reference viewpoints, and that space point is then re-projected onto the image plane of the reference viewpoint, yielding a synthesized view of the target view at the reference viewpoint.
  • The depth-based view synthesis and block-matching-based depth search proceed as follows: the view is synthesized at the current depth value, the error between the pixel block of the synthesized view and the corresponding pixel block of the reference view is computed, and the depth value with the minimum error is taken as the depth estimate of the target view. The method comprises the following steps:
  • Step 1: estimate the initial depth search value z_0 in the target view and set k = 0.
  • Step 2: determine the pixel search range and pixel search precision in the reference view to which the depth search corresponds, and obtain the pixel offset vector ΔP_r in the reference view from the pixel search precision.
  • Step 3: from the current depth value z_k and the pixel offset vector ΔP_r, obtain the corresponding depth variation Δz_k; this Δz_k is the next search step.
  • Step 4: synthesize the view at the current depth value and compute the error e_k between the pixel block of the synthesized view and the pixel block of the reference view; update z_{k+1} = z_k + Δz_k and k = k + 1.
  • Step 5: determine whether the given pixel search range has been exceeded; if so, go to Step 6, otherwise go to Step 3.
  • Step 6: take as the estimate the depth value corresponding to the minimum of the errors e_k (k = 0, ..., N-1, where N is the total number of search steps).
  • The search step is obtained from a closed-form relation (reproduced in the original as an equation image) among the following quantities: P is the pixel of the target view whose depth is being estimated; z is the current depth value of P; Δz is the depth variation of P, i.e. the search step; ΔP_r is the pixel offset vector in reference view r corresponding to the depth change Δz, with ‖ΔP_r‖² = ΔP_r^T·ΔP_r; B_r = A_r R_r⁻¹ R A⁻¹ and C_r = A_r R_r⁻¹ are 3×3 matrices and Δt_r = t − t_r is a three-dimensional vector, where R is the three-dimensional rotation matrix of the target-view camera coordinate system relative to the world coordinate system, t is its translation vector, A is the internal parameter matrix of the target-view camera, R_r, t_r, A_r are the corresponding quantities for the reference view, and b_3 and c_3 are the third row vectors of B_r and C_r respectively.
  • The pixel offset vector in the reference view satisfies the epipolar constraint of the target and reference views, ΔP_r^T ((C_r Δt_r) × (B_r P)) = 0, where P is the pixel in the target view. Two offset vectors of opposite direction satisfy this constraint, corresponding to the directions of increasing and decreasing depth; the depth variation of the increasing direction is larger than that of the decreasing direction. For a parallel camera system, the depth variation is proportional to the square of the current depth value.
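The closed-form step-size expression appears in the source only as an embedded equation image. The following is a reconstruction of the derivation's structure from equations (1) and (2) and the notation above; it is a sketch, and the published arrangement of the formula may differ.

```latex
% Combining (1) and (2) for the target view (R, t, A) and a reference view
% (R_r, t_r, A_r) gives the projection of target pixel P at depth z:
%   z_r P_r = B_r P z + C_r \Delta t_r ,
% with B_r = A_r R_r^{-1} R A^{-1}, \; C_r = A_r R_r^{-1}, \; \Delta t_r = t - t_r .
% Since z_r = b_3 P z + c_3 \Delta t_r (third components), differentiating the
% inhomogeneous pixel coordinates of P_r with respect to z yields
\Delta P_r \approx \frac{\Delta z}{z_r^{2}}
  \bigl[\,(B_r P)(c_3 \Delta t_r) - (C_r \Delta t_r)(b_3 P)\,\bigr]_{1,2}
\quad\Longrightarrow\quad
\Delta z \approx \frac{z_r^{2}\,\lVert \Delta P_r \rVert}
  {\bigl\lVert\,(B_r P)(c_3 \Delta t_r) - (C_r \Delta t_r)(b_3 P)\,\bigr\rVert_{1,2}}
```

This makes both stated properties explicit: for a fixed pixel offset the depth variation grows with the squared depth, and the direction of ΔP_r lies along the epipolar line, consistent with the constraint above.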
  • The beneficial effects of the invention are that the depth search with the adaptive step produces no missed or repeated pixel searches, the absolute difference between the synthesized image block and the reference image block in depth estimation is small, erroneous estimates are few, and the amount of computation (the number of depth searches) is low.
  • BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a schematic diagram of the coordinate systems in a multi-view video system;
  • FIG. 2 is a schematic diagram of depth-based view synthesis;
  • FIG. 3 (a) and (b) are views at the initial moment of the video sequence of the 7th camera in the Uli test sequence;
  • FIG. 3 (c) is a close-up of FIG. 3 (a); the 16 marked points indicate the image region from pixel [527, 430] to [590, 493];
  • FIG. 4 shows the relation between the depth variation value and the square of the depth value;
  • FIG. 5 is a schematic diagram of the depth variation value and the pixel offset vector of the present invention;
  • FIG. 6 illustrates missed pixel searches when the depth value is small;
  • FIG. 7 illustrates repeated pixel searches when the depth value is large;
  • FIG. 8 illustrates the adaptive adjustment of the depth search step size according to the present invention;
  • FIG. 9 shows the distribution of the pixels searched using the adaptive variable-length search step of the present invention;
  • FIG. 10 compares the depth search performance of fixed search steps and of the adaptive step of the present invention.
  • DETAILED DESCRIPTION The present invention proposes an adaptive method for determining the depth search step. Using the internal and external camera parameters and the perspective projection relation, it first derives the relation among a pixel's depth value, its depth variation, and the pixel offset that the depth change produces at the projected point in the synthesized view. From this relation, the determination of the depth search range is converted into the determination of a pixel search range; a pixel offset has an intuitive meaning in the image and is easy to set reasonably. Because the larger the depth value, the smaller the pixel offset caused by the same depth change, the search step is adjusted dynamically so that every step corresponds to the same pixel search precision, avoiding repeated and missed pixel searches and improving search efficiency and performance.
  • The present invention also proposes a simple and effective initial depth estimation method: it solves for the convergence point of the camera optical axes in a convergent camera system and treats that point as a condensed representative point of the scene, obtaining a rough estimate of the scene depth.
  • In multi-view video, three types of coordinate system are usually needed to describe the scene and its image position information: the world coordinate system in which the scene lies, the camera coordinate systems, and the pixel coordinate systems, as shown in FIG. 1.
  • A camera coordinate system takes the camera center as origin and the optical axis as z-axis, with the xy plane parallel to the image plane; a pixel coordinate system takes the upper-left corner of the image as origin, with horizontal and vertical coordinates u, v.
  • Let the position of the camera coordinate system o_i-x_i y_i z_i of camera C_i (i = 1, ..., m, where m is the number of cameras) relative to the world coordinate system o-xyz be given by a three-dimensional rotation matrix R_i and a translation vector t_i. If a scene point has coordinates p = [x, y, z] in the world coordinate system and p_i = [x_i, y_i, z_i] in the camera coordinate system, then by spatial geometry and coordinate transformation p = R_i p_i + t_i (1). By the perspective projection principle of computer vision, the camera coordinates p_i and the homogeneous pixel coordinates P_i = [u_i, v_i, 1] on the image plane satisfy z_i P_i = A_i p_i (2), where A_i is the internal parameter matrix of camera C_i, mainly comprising the focal length, principal point, and distortion coefficients.
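A minimal code sketch of relations (1) and (2); the calibration data R_i, t_i, A_i are assumed given, A_i is assumed to have the usual intrinsic form with third row [0, 0, 1], and lens distortion is ignored.

```python
import numpy as np

def cam_to_world(p_i, R_i, t_i):
    """Eq. (1): world coordinates from camera-i coordinates."""
    return R_i @ p_i + t_i

def world_to_cam(p, R_i, t_i):
    """Inverse of eq. (1); R_i is a rotation, so R_i^-1 = R_i^T."""
    return R_i.T @ (p - t_i)

def cam_to_pixel(p_i, A_i):
    """Eq. (2): homogeneous pixel coordinates [u, v, 1] and depth z_i."""
    q = A_i @ p_i
    z_i = q[2]            # third row of A_i assumed [0, 0, 1]
    return q / z_i, z_i

def pixel_to_cam(P_i, z_i, A_i):
    """Back-projection: camera-i coordinates of pixel [u, v, 1] at depth z_i."""
    return z_i * np.linalg.inv(A_i) @ P_i
```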
  • The present invention performs block-based depth search in depth space: using the camera parameters and depth-based view synthesis, it searches within the depth search range, advancing by the search step, for the depth value that minimizes the error between a pixel block of the synthesized view and the corresponding block of the actual reference view, and takes that depth value as the depth estimate of the pixel of the target view.
  • The target view and target viewpoint are the image and viewpoint whose depth currently needs to be estimated; the reference views and reference viewpoints are the other images and viewpoints in the multi-view video system. The reference view and viewpoint can be selected automatically during the depth search or specified by the user.
  • When the depth value of a pixel in a view is given, the pixel can be back-projected into scene space according to the internal and external camera parameters to obtain a space point, and that space point can be re-projected onto the image plane of any required viewpoint, giving a synthesized view at that viewpoint; this is the depth-based view synthesis technique, shown in FIG. 2.
  • Let a pixel P_1 of view 1 have depth value z_1 in the coordinate system of camera C_1, and let its corresponding pixel in view 2 be P_2 with depth value z_2 in the coordinate system of camera C_2. From equations (1) and (2) the position of P_2 can be derived, and the luminance-chrominance value of the synthesized view 2 at pixel P_2 for depth z is the value of view 1 at P_1: Synthesized_I_2(P_2) = I_1(P_1), where I_1 is view 1, I_2 is view 2, and Synthesized_I_2 is the synthesized view 2 of view 1 at reference camera viewpoint 2.
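Building on the helper functions sketched above (and reusing their assumptions), a minimal sketch of depth-based view synthesis for one pixel and for a pixel block; nearest-neighbor rounding of the projected position is an assumption made for simplicity.

```python
import numpy as np  # helpers cam_to_world, world_to_cam, cam_to_pixel,
                    # pixel_to_cam are assumed defined as sketched earlier

def synthesize_pixel(P1, z1, R1, t1, A1, R2, t2, A2):
    """Back-project pixel P1 = [u, v, 1] of view 1 at depth z1 to a scene
    point, then re-project it into view 2 (FIG. 2)."""
    p1 = pixel_to_cam(np.asarray(P1, float), z1, A1)   # camera-1 coordinates
    p = cam_to_world(p1, R1, t1)                       # world coordinates
    p2 = world_to_cam(p, R2, t2)                       # camera-2 coordinates
    return cam_to_pixel(p2, A2)                        # (P2, z2)

def synthesize_block(img1, block_pixels, z, cams):
    """Synthesized_I2(P2) = I1(P1) for each pixel of a block at depth z."""
    R1, t1, A1, R2, t2, A2 = cams
    out = {}
    for (u, v) in block_pixels:
        P2, _ = synthesize_pixel([u, v, 1.0], z, R1, t1, A1, R2, t2, A2)
        out[(int(round(P2[0])), int(round(P2[1])))] = img1[v, u]
    return out
```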
  • The description above takes a system of two cameras as an example; the same principle extends to a camera system of m cameras. Assuming the pixels within a local window W centered on pixel P share the same scene depth, the absolute difference within W between the synthesized view 2 of view 1 and the reference view 2 actually captured by the camera at viewpoint 2 can be formed as a block error.
  • Under the true scene depth, the synthesized view 2 in theory has the same luminance-chrominance values as the reference view 2, so solving for the depth of view 1 at pixel P can be transformed into the following problem: within the given depth range [z_min, z_max], find the depth z that minimizes the absolute difference between the synthesized view and the reference view; that z is the final depth estimate.
  • This way of searching directly in depth space requires no disparity matching, the image rectification is handled directly within the depth search, and the depth value is continuous, so its precision is not limited by the image pixel resolution as a disparity vector is. From equation (7) it follows that, with the camera parameters known, the pixels of the synthesized view 2 are a function of the pixels of view 1 and their depth values.
  • A change Δz of the depth value of pixel P_1 in view 1 causes a pixel offset of its projection in the synthesized view 2. From equation (12), the relation between the depth change of the pixel in view 1 and the corresponding pixel offset vector ΔP_2 in the synthesized view 2 can be derived; equation (17) is a homogeneous linear equation in the two components Δu and Δv of the pixel offset vector ΔP_2.
  • The pixel offset is proportional to the depth change and inversely proportional to the square of the depth value; equivalently, for a fixed pixel offset the depth change is proportional to the square of the depth value: the larger the depth value, the larger the corresponding depth change; the smaller the depth value, the smaller the corresponding depth change.
  • For a scene point in the Uli view (the button on the right side of the shirt collar, whose true depth value is about 3172 mm), the depth variation values corresponding to different depth values were computed according to equation (14); their relation is shown in FIG. 4, with the square of the depth value on the abscissa and the depth variation value on the ordinate.
  • FIG. 4 shows that, for a given pixel offset in the synthesized view, the depth variation is approximately linear in the square of the depth value, which means that the same depth change of a pixel of view 1 causes different pixel offsets in the synthesized view at different depth values.
  • Since equation (17) is a homogeneous linear equation in the pixel offset vector ΔP, two offset vectors of opposite direction, ΔP+ and ΔP-, exist; substituting them into equation (14) yields the corresponding depth variations. Because, for a given offset, the depth variation is proportional to the square of the depth value, the depth variations corresponding to the two equal-length, oppositely directed offset vectors differ: the depth decrease |Δz-| is smaller than the depth increase |Δz+|, as shown in FIG. 5.
  • If the search step corresponds to a 1-pixel offset at a smaller depth value, then at a larger depth value the pixel offset corresponding to that step is less than 1 pixel; taking the nearest-neighbor pixel as the projection point when a given depth projects to a non-integer pixel, the same pixel is searched at several different depth values, i.e. a repeated search occurs.
  • When the depth value is small, pixels are skipped instead: for example, the pixel coordinate u searched at depth 2090 is 661, while that searched at depth 2080 is 663, so a pixel is skipped in between. When the depth value is large, two different depth values, 4450 and 4460, find the same pixel with u-coordinate 437, i.e. that pixel is searched repeatedly. Since a search step of 10 mm corresponds to a search precision of 1 pixel only in the local range around the true depth value 3170, one would expect to search 250 different pixels in the range [2000, 4500]; but because of missed and repeated pixel searches, computation shows that only 200 pixels are actually searched.
  • To make the depth search correspond to the same pixel search precision in the reference view throughout, i.e. so that the search step always corresponds to a fixed pixel offset in the reference view, the search step must be adjusted dynamically according to the relation between the depth variation and the depth value, and the corresponding search range determined accordingly.
  • Suppose the initial search depth of pixel P_1 in view 1 is z_0; the depth variation Δz in view 1 corresponding to a pixel offset ΔP in reference view 2 is conveniently obtained from equation (14). At the initial depth value z_0, the pixel offset between the pixel in reference view 2 corresponding to P_1 and the true corresponding pixel found at depth z is usually limited to a certain range N. Given the pixel search range N, the problem is how to adaptively determine the search step according to the depth value so that the step always corresponds to a fixed pixel offset.
  • Given the pixel P_1 and the camera parameters, the two offset vectors ΔP± of magnitude ‖ΔP‖ are easily solved from the epipolar constraint equation (16) of the pixel offset vector; the corresponding depth variations are then computed from equation (14) and used as the search steps of the next step in the directions of decreasing and increasing depth respectively, as shown in FIG. 8.
  • Starting from z_0^± = z_0, the search depth and step of the n-th step are obtained recursively as z_{n+1}^± = z_n^± + Δz_n^±, where Δz_n^± is the depth variation computed at z_n^± for the offset vector ΔP±. The number of search steps n is determined by the search range N and the search precision, i.e. n satisfies n·‖ΔP‖ ≤ N. Therefore, once the search range and the initial depth value are determined, variable-length search steps that adapt to the depth value are obtained by the above method, the same pixel search precision is maintained throughout the depth search, and the defect of repeated or missed pixel searches under a fixed step is overcome. Since the depth search range is accumulated from the search steps, it too adapts as the depth value changes.
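A sketch of the adaptive search loop just described. The function standing in for equation (14) is here a first-order numerical approximation (`numeric_depth_step`), an assumption rather than the patent's closed form; `project_fn(z)` is a hypothetical helper returning the reference-view position [u, v] of the target pixel at depth z (e.g. built from the synthesis sketch above).

```python
import numpy as np

def numeric_depth_step(z, dP, project_fn, eps=1e-3):
    """Signed depth change Δz moving the reference-view projection by ~dP
    (first-order stand-in for eq. (14))."""
    g = (project_fn(z + eps) - project_fn(z)) / eps    # dP_r/dz at depth z
    return float(dP @ g) / float(g @ g)                # least-squares Δz

def adaptive_depth_candidates(z0, dP_unit, depth_step_fn, N):
    """Candidate depths z_n^±, each step worth one pixel-offset unit:
    z_{n+1}^± = z_n^± + Δz_n^±, n = 0..N-1 (N = pixel search range)."""
    z_plus = z_minus = z0
    depths = [z0]
    for _ in range(N):
        z_plus += depth_step_fn(z_plus, dP_unit)       # increasing-depth branch
        z_minus += depth_step_fn(z_minus, -dP_unit)    # decreasing-depth branch
        depths += [z_plus, z_minus]
    return sorted(depths)

def estimate_depth(z0, dP_unit, depth_step_fn, block_error_fn, N):
    """Steps 1-6: evaluate the synthesis error e_k at every candidate depth
    and return the minimizing depth."""
    return min(adaptive_depth_candidates(z0, dP_unit, depth_step_fn, N),
               key=block_error_fn)

# Usage sketch:
#   step_fn = lambda z, dP: numeric_depth_step(z, dP, project_fn)
#   z_hat = estimate_depth(z0=2817, dP_unit=np.array([1.0, 0.0]),
#                          depth_step_fn=step_fn,
#                          block_error_fn=my_block_sad, N=32)
```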
  • At larger depth values, the depth search range corresponding to the same pixel offset ‖ΔP‖ becomes correspondingly larger; at smaller depth values, it becomes correspondingly smaller. The corresponding depth search step can therefore be determined by fixing the pixel search precision, and the determination of the depth search range is likewise converted into the determination of the corresponding pixel offset. Determining the pixel offset and search precision is similar to determining the search range and precision in disparity estimation: it is intuitive and easy, and can be adjusted dynamically according to the image content or the application requirements.
  • Besides the depth search range and step size, the depth estimation process needs an initial depth value z_0, whose quality affects the search performance and results. When z_0 deviates little from the true depth, a smaller pixel offset (i.e. a smaller search range) can be used, reducing the amount of search; when the deviation is large, a relatively large pixel offset must be used to ensure that the true depth value is reached, at the cost of a larger amount of computation.
  • A poor initial depth value must be compensated by a wide search range and a fine, high-precision step, while a good initial value allows a small search range and a suitable step, improving both the efficiency and the performance of the depth search. The estimation and determination of the initial depth value is therefore also very important.
  • The determination of the initial depth value for the images of a video sequence falls into two situations: the image at the initial moment, and subsequent images. For the initial-moment image there are again two cases: the first pixel and the other pixels. For the first pixel, no depth search has yet been performed for any pixel, so no scene depth information is known; for the other pixels, the initial depth can be determined from the depth estimates of neighboring pixels within the image.
  • Since the depth values of video images of the same viewpoint are strongly correlated (the depth of the stationary background remains unchanged and only a small number of motion regions change in depth), the depth value at the same pixel position of the previous-moment image can be taken as the initial value for subsequent images. In determining the initial depth value, the key is therefore to obtain the scene depth information of the initial-moment image and to provide a good initial depth value for the first pixel.
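A hypothetical helper illustrating the initial-value policy just described; the names, the Manhattan-distance neighbor choice, and the data structures are illustrative assumptions, not from the patent.

```python
def initial_depth(pixel, frame_idx, prev_depth_map, cur_estimates, z_scene):
    """Pick z_0: previous frame at the same position for later frames,
    a nearby already-estimated pixel within the initial frame, or the
    coarse scene depth for the very first pixel."""
    if frame_idx > 0:                            # subsequent image
        return prev_depth_map[pixel]             # same position, previous moment
    if cur_estimates:                            # initial image, not first pixel
        u, v = pixel
        q = min(cur_estimates, key=lambda p: abs(p[0] - u) + abs(p[1] - v))
        return cur_estimates[q]                  # nearest estimated neighbor
    return z_scene                               # first pixel: coarse scene depth
```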
  • The differences between images of different viewpoints, and the position information of the cameras, usually contain information about the scene depth. An initial estimation of the scene depth based only on camera parameters or image information, without any known depth information, is given below.
  • The main goal of multi-view video is to capture information of the same scene at multiple angles, so the cameras are usually placed along an arc with their optical axes converging at one point: a convergent system. Even if the cameras do not converge strictly at one point, a point closest to the optical axes of all the cameras can always be found and regarded as the convergence point. The convergence point usually lies where the scene is located and can be regarded as a condensed representative point of the scene; by computing the position of the convergence point, an approximate value of the scene depth is obtained and used as the initial value of the depth search.
  • Let the convergence point be M_c = [x_c, y_c, z_c]. Requiring M_c to be the point nearest to the optical axes of all the cameras leads to equation (25), a system of 3(n-1) linear equations in the depths z_1, z_2, ..., z_n along the optical axes (n being the number of cameras), which can be solved in the least-squares sense.
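A sketch of the convergence-point computation, written as the equivalent least-squares problem of finding the point minimizing the summed squared distance to the optical axes; this standard formulation is an assumption and need not match the exact arrangement of system (25). Under equation (1), camera center i is t_i and its optical-axis direction is R_i·[0, 0, 1]^T.

```python
import numpy as np

def convergence_point(centers, directions):
    """Least-squares point nearest to all camera optical axes
    (centers[i]: camera center; directions[i]: axis direction)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)   # projector orthogonal to the axis
        A += M
        b += M @ c
    return np.linalg.solve(A, b)

def scene_depth(M_c, R_i, t_i):
    """Initial depth for camera i: z-coordinate of the convergence point
    M_c expressed in camera-i coordinates (inverse of eq. (1))."""
    return float((R_i.T @ (M_c - t_i))[2])
```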
  • Alternatively, since disparity has the simple inverse relation (18) with depth, depth information can be obtained by computing the global disparity between two views. The global disparity can be defined as the pixel offset that minimizes the absolute difference between the two views, i.e. it is found by the minimization in equation (26), where R is the number of pixels in the overlapping area of views 1 and 2.
  • Since high estimation accuracy is not required for the global disparity, the search unit of the pixel offset x in equation (26) can be set large, e.g. 8 or 16 pixels, greatly reducing the amount of computation. The initial depth value is then obtained from relation (18), depth being inversely proportional to disparity.
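A sketch of the coarse global-disparity search of equation (26), assuming a purely horizontal shift between the two views and a coarse search unit; converting the disparity to an initial depth via relation (18) uses assumed constants f (focal length in pixels) and B (baseline).

```python
import numpy as np

def global_disparity(img1, img2, max_shift, unit=16):
    """Horizontal pixel shift x (in multiples of `unit`) minimizing the
    mean absolute difference over the overlap of views 1 and 2."""
    h, w = img1.shape[:2]
    best_x, best_err = 0, float("inf")
    for x in range(0, max_shift + 1, unit):
        a = img1[:, x:].astype(np.float64)       # overlap region of view 1
        b = img2[:, :w - x].astype(np.float64)   # overlap region of view 2
        err = np.abs(a - b).mean()               # sum / R (overlap pixel count)
        if err < best_err:
            best_x, best_err = x, err
    return best_x

def initial_depth_from_disparity(d_global, f, B):
    """Relation (18): depth inversely proportional to disparity."""
    return f * B / d_global
```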
  • Although the depth of field of the Uli scene does not vary much, the initial depth estimates in Table 1 differ little from the true depth information of the scene points, indicating that the initial depth estimation is effective and reasonable and provides a good initial value for the depth estimation.
  • Table 1. The Uli view shown in FIG. 3 (c) covers the 64x64 image region from pixel [527, 430] to [590, 493]. Taking one pixel every 15 pixels in the region, 16 pixels in total, depth search was performed with fixed step sizes and with the adaptive step size. Three fixed-step searches were run over the fixed search range [2000, 5000] with steps of 20, 35, and 50. For the adaptive search step, the initial depth was 2817, the pixel offset was set to 32 pixels with a search precision of 1 pixel, and the initial depth value of each subsequent pixel was set to the depth estimate of its neighboring pixel.
  • The search steps corresponding to a search precision of one pixel within the 32-pixel search range around the initial search pixel [527, 430] are listed in Table 2; the pixels searched with these steps are shown in FIG. 9.
  • Table 2 shows that the steps along the direction of decreasing depth are negative, and their absolute value gradually decreases as the pixel offset increases and the depth value decreases; the steps along the direction of increasing depth are positive, and their absolute value gradually increases as the pixel offset and the depth value increase. FIG. 9 shows that when the depth search is performed with the variable-length steps of Table 2, the corresponding pixel search precision remains constant, always 1 pixel.
  • Table 2 lists, for each pixel offset, the step size in the direction of increasing depth and the step size in the direction of decreasing depth (the numerical entries of the table are not reproduced here).
  • Depth estimation was then performed by the block-matching-based method; the depth search results are shown in FIG. 10. Each point in the figure represents the absolute difference between the synthesized block and the actual block at the depth value found by the search; the smaller the value, the more accurate the depth estimate. The absolute differences at the depth values found with a step of 20 mm are smaller than those with 35 mm, and those with 35 mm are smaller than those with 50 mm; the depth values found with the adaptive search step are the best, with the smallest corresponding absolute differences.
  • FIG. 3 (c) shows the 16 pixels used for depth estimation in the image region from pixel [527, 430] to [590, 493] of FIG. 3 (a); they were searched with the adaptive search step and with fixed steps of 20, 35, and 50.
  • Table 3 shows that with the adaptive search step all 16 pixels are found with the correct depth value, while the fixed steps produce erroneous depth estimates. This is because these pixels lie in a region lacking texture: within a wide fixed search range, the minimum point of the absolute difference in the search does not necessarily correspond to the correct pixel. With the adaptive method, the pixel offset can be set small, i.e. the search confined to a relatively small local range, which reduces the probability of matching a wrong pixel and also guarantees a certain smoothness of the depth.
  • Table 3 lists the depth estimation results, the number of depth searches, and the number of erroneous estimates with the adaptive search step and with the fixed search steps; the bordered entries in the table are erroneous data. The results in Table 3 show that the adaptive-step search requires few searches and produces no erroneous estimates, whereas the fixed-step searches require many more searches and still produce erroneous estimates. For example, an adaptive depth search with a 32-pixel offset searches only 64 depth values, while a fixed step of 20 mm must search 150 depth values over the range [2000, 5000].

Abstract

A depth searching method for multi-view video images is disclosed. The search step sizes within a depth search range are dynamically adjusted according to the current depth value, so that each step size corresponds to the same pixel search precision. A depth estimating method for multi-view video images is also disclosed, which dynamically adjusts each search step size within the depth search range while using depth-based view synthesis and block-matching-based depth search.

Description

多视角视频图像深度搜索方法及深度估计方法  Multi-view video image depth search method and depth estimation method
技术领域 本发明涉及多视角视频图像处理技术。 背景技术 近年来, 研究者们逐渐认识到, 未说来先进三维电视和任意视角视频应用系统(FVV, Free Viewpoint Video System)中应该利用计算机视觉、视频处理和基于深度图像的场景 合成等技术, 把视频的获取和显示设置分离开来, 即观看视角与获取视频的照相机方位 相互不受限制, 从而提供高度的灵活性、 交互性书和可操作性。 欧洲的立体电视项目采用 了视频加深度的数据格式 ( "基于深度图像的合成、 压缩和传输的三维电视新方法" , 立体显示和虚拟现实系统 SPIE会议, 2004.; C. Fehn, "Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," in Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291, CA, U.S.A., Jan. 2004, pp. 93-104. ) , 即图像的每个像素 对应一个深度值; 利用基于深度图像的视图合成方法 (DIBR: Depth Image Based Rendering): 接收端解码器根据显示设置和观看视角生成立体图像对, 从而使观看视角 与获取视频的照相机方位相互不受限制。 2007年 4月 JVT会议提案 ( "先进三维视频 系统的多视角视频加深度的数据格式"; A. Smolic and K. Mueller, et al., "Multi-View Video plus Depth (MVD) Format for Advanced 3D Video Systems", ISO IEC JTC1/SC29AVG11 , Doc. JVT-W100, San Jose, USA, Apnl 2007. )把视频加深度推广到多视角视频, 提出了视频加深度的多视角 编码数据格式 MVD (Multi-view video plus depth) 。 由于 MVD能够满足先进三维视频 或任意视角视频应用的一个本质需求, 即能够在解码端生成一定范围内的连续的任意视 角的视图, 而不是数量有限的离散的视图, 所以视频加深度的 MVD方案已经被 JVT采 纳, 被确定为今后的发展方向。 所以, 如何从不同视角的两幅或多幅视图获取场景的深度信息成为多视角视频处理 中的重要问题之一。 目前的深度搜索方式为: 在固定搜索范围内采用固定搜索步长 (uniform depth-grid) 进行深度搜索。 使用固定搜索步长时, 若在较小深度值处给定的搜索步长对应于 1个像 素的偏移量, 则在较大深度值处, 该搜索步长对应的像素偏移量将小于 1个像素。 假设 在给定的深度值下投影到非整数像素时, 取最近邻的像素点作为投影点, 则深度搜索时 将在多个不同的深度值处搜索到同一像素点, 即出现了重复搜索。 反过来, 若给定的搜 索步长在较大深度值处对应于 1个像素的偏移量, 则在较小深度值处该搜索步长对应的 像素偏移量将大于 1个像素, 即相邻两个深度值将搜索到两个非相邻的像素点, 从而使 得有些像素点漏检, 产生搜索不全。 所以, 本来期望在搜索范围 [Zmm,zmax]内搜索 N个像 素点, 但由于产生了像素点重复搜索或漏搜索, 实际搜索到的有效搜索点要少于 N。 为 了保证搜索范围包含场景真实深度值的所有可能取值, 通常把搜索范围设得足够大, 而 为了保证一定的搜索精度, 把搜索步长设得较小, 这大大增加了搜索次数和相应的计算 量, 并且由于漏搜索和重复搜索的存在, 搜索效果并不好。 TECHNICAL FIELD The present invention relates to multi-view video image processing techniques. BACKGROUND OF THE INVENTION In recent years, researchers have come to realize that technologies such as computer vision, video processing, and scene synthesis based on depth images should be utilized in advanced 3D television and Free Viewpoint Video System (FVV). Separating the acquisition and display settings of the video, that is, the viewing angle and the camera orientation of the acquired video are mutually unrestricted, thereby providing a high degree of flexibility, interactive book and operability. European stereoscopic TV projects use video plus depth data formats ("Deep Image Based Synthesis, Compression and Transmission of 3D TV New Methods", Stereoscopic Display and Virtual Reality Systems SPIE Conference, 2004.; C. Fehn, "Depth- Image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV," in Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291, CA, USA, Jan. 2004, pp. 93 -104.), that is, each pixel of the image corresponds to a depth value; using a depth image based view synthesis method (DIBR: Depth Image Based Rendering): the receiver decoder generates a stereo image pair according to the display setting and the viewing angle, thereby The viewing angle and the camera orientation for acquiring the video are not limited to each other. Proposal for the JVT meeting in April 2007 ("Multi-view video and depth data format for advanced 3D video systems"; A. Smolic and K. Mueller, et al., "Multi-View Video plus Depth (MVD) Format for Advanced 3D Video Systems", ISO IEC JTC1/SC29AVG11, Doc. JVT-W100, San Jose, USA, Apnl 2007. ) Extends video depth to multi-view video, and proposes video plus depth multi-view encoded data format MVD (Multi- View video plus depth) . 
Since MVD can meet the essential requirement of advanced 3D video or arbitrary viewing video applications, that is, it can generate a continuous view of arbitrary angles of view within a certain range, instead of a limited number of discrete views, so the video plus depth MVD scheme It has been adopted by JVT and has been identified as the future direction of development. Therefore, how to obtain the depth information of a scene from two or more views from different perspectives becomes one of the important issues in multi-view video processing. The current deep search method is: Use a fixed depth step (uniform depth-grid) for deep search within a fixed search range. When using a fixed search step, if the given search step at a smaller depth value corresponds to an offset of 1 pixel, then at a larger depth value, the pixel offset corresponding to the search step will be less than 1 pixel. Hypothesis When projecting to a non-integer pixel at a given depth value, taking the nearest neighbor pixel as a projection point, the same pixel will be searched at a plurality of different depth values in the depth search, that is, a repeated search occurs. Conversely, if the given search step corresponds to an offset of 1 pixel at a larger depth value, the pixel offset corresponding to the search step will be greater than 1 pixel at a smaller depth value, ie Two adjacent depth values will search for two non-adjacent pixel points, causing some pixels to miss detection and incomplete search. Therefore, it is originally expected to search for N pixel points in the search range [ Zmm , zmax ], but since the pixel point repeated search or the miss search is generated, the actual searched effective search point is less than N. In order to ensure that the search range contains all possible values of the true depth value of the scene, the search range is usually set to be large enough, and in order to ensure a certain search accuracy, the search step size is set to be small, which greatly increases the number of searches and corresponding The amount of calculation, and due to the existence of miss search and repeated search, the search effect is not good.
迄今, 已有很多与深度估计相关的研究和估计算法, 但大多数通过对校正的、 平行 立体图像对先进行视差估计, 再根据视差与深度的关系计算深度信息。 例如, 平行相机 系统中两幅图像之间只存在水平视差, 利用基于特征或块匹配的方法先估计视差, 然后 根据深度与视差成反比的关系计算出深度信息; 而对于非平行相机系统, 则要经过图像 对校正、 视差匹配、 深度计算和反校正等一系列处理才能得到原始视图对应的深度图。 该类深度估计问题本质上就是进行视差估计, 其性能主要由视差估计算法决定。 众所周 知, 视差估计或立体匹配是计算机视觉中的经典问题, 虽然至今已有大量的研究工作和 成果,但纹理信息缺乏或遮挡所引起的匹配模糊性或不确定性使得视差匹配问题仍旧是 计算机视觉中的研究热点和难点。  So far, there have been many research and estimation algorithms related to depth estimation, but most of them first calculate the disparity of the corrected, parallel stereo image pairs, and then calculate the depth information according to the relationship between parallax and depth. For example, there is only horizontal disparity between two images in a parallel camera system, the disparity is estimated first by feature or block matching, and then the depth information is calculated according to the inverse relationship between the depth and the parallax; for non-parallel camera systems, A series of processes such as image pair correction, parallax matching, depth calculation, and inverse correction are required to obtain the depth map corresponding to the original view. This type of depth estimation problem is essentially the estimation of disparity, and its performance is mainly determined by the disparity estimation algorithm. As we all know, disparity estimation or stereo matching is a classic problem in computer vision. Although there has been a lot of research work and results so far, the lack of matching or ambiguity caused by ambiguity or uncertainty makes the parallax matching problem still computer vision. Research hotspots and difficulties in the research.
2006年, JVT会议提案 ( "多视角视频编码核心实验 3报告" ; S. Yea, J. Oh, S. Ince, E. Martinian and A. Vetro, "Report on Core Experiment CE3 of Multiview Coding", ISO IEC JTC1/SC29AVG11 , Doc. JVT-T123, Klagenfurt, Austria, July 2006.)提出了利用照相机内外部参数 和基于深度的视图合成, 在某一指定的深度搜索范围内用给定的搜索步长, 搜索使得合 成视图与实际视图之间的误差最小的深度作为估计值。 M.Okutom等人提出了多基线立 体系统的立体匹配方法 (A multiple-baseline stereo), 该方法利用深度与视差的反比关系, 把视差估计转化为深度求解问题, 并且消除了视差匹配中的不确定性难题( "多基线立 体系统" , 模式识别与机器智能 IEEE学报; M. Okutomi and K. Kanade, "A multiple-baseline stereo", IEEE Trans, on Pattern Analysis and Machine Intelligence 15 (4): 353 - 363, 1993. ) 。 N.Kim等 人提出了在距离 /深度空间直接进行深度搜索、 匹配和视图合成操作( "一般的多基线立 体系统和利用深度空间搜索、匹配和合成的直接视图合成",计算机视觉国际期刊; (N. Kim, M. Trivedi and H. Ishiguro, "Generalized multiple baseline stereo and direct view synthesis using range-space search, match, and render", International Journal of Computer Vision 47 (1/2/3): 131 - 148, 2002. ) : 在深度空间直接进行深度搜索, 不需要视差匹配, 图像校正处理直接在深度搜 索过程中完成, 并且深度值是连续值, 其精度不像视差向量那样受图像像素分辨率的限 制。但是实际求解中,需要指定深度搜索范围和搜索步长,根据某一代价函数求最优解, 而搜索范围和步长的取值是否合适对估计性能至关重要。 视差匹配中, 视差搜索范围通常根据图像性质直观确定, 而深度搜索中, 特别是在 非平行相机系统中, 由于深度变化与图像像素偏移的关系并不显而易见, 所以其搜索范 围难以合理确定。 所以,对给定的多视角视图如何确定合适的深度搜索区间和步长成为有效估计深度 信息的关键。 2006, JVT Conference Proposal ("Multi-View Video Coding Core Experiment 3 Report"; S. Yea, J. Oh, S. Ince, E. Martinian and A. Vetro, "Report on Core Experiment CE3 of Multiview Coding", ISO IEC JTC1/SC29AVG11, Doc. JVT-T123, Klagenfurt, Austria, July 2006.) proposes the use of camera internal and external parameters and depth-based view synthesis, with a given search step within a specified depth search range, The depth that minimizes the error between the composite view and the actual view is searched as an estimate. M. Okutom et al. proposed a multiple-baseline stereo method for multi-baseline stereo systems. This method uses the inverse relationship between depth and parallax to convert the disparity estimation into a deep solution problem and eliminates the parallax matching. Deterministic Problems ("Multi-Baseline Stereo System", IEEE Transactions on Pattern Recognition and Machine Intelligence; M. Okutomi and K. Kanade, "A multiple-baseline stereo", IEEE Trans, on Pattern Analysis and Machine Intelligence 15 (4): 353 - 363, 1993. ). N. Kim et al. proposed direct depth search, matching, and view synthesis operations in distance/depth space ("general multi-baseline stereo systems and direct view synthesis using deep space search, matching, and synthesis", International Journal of Computer Vision; (N. Kim, M. Trivedi and H. Ishiguro, "Generalized multiple baseline stereo and direct view synthesis using Range-space search, match, and render", International Journal of Computer Vision 47 (1/2/3): 131 - 148, 2002. ) : Direct depth search in depth space, no parallax matching required, image correction processing directly Completed in the depth search process, and the depth value is a continuous value, the accuracy is not limited by the image pixel resolution as the disparity vector. However, in the actual solution, the depth search range and the search step size need to be specified, and the cost function is determined according to a certain cost function. The optimal solution, and whether the search range and the step value are appropriate is critical to the estimation performance. 
In parallax matching, the parallax search range is usually determined intuitively according to the nature of the image, while in deep search, especially in non-parallel camera systems, Since the relationship between depth variation and image pixel offset is not obvious, its search range is difficult to determine reasonably. Therefore, how to determine the appropriate depth search interval and step size for a given multi-view view becomes the key to effectively estimate depth information.
JVT-W059 '视图合成预测核心实验 6报告 "; S. Yea and A. Vetro, "Report of CE6 on View Synthesis Prediction", ISO IEC JTC1/SC29AVG11, Doc. JVT-W059, San Jose, USA, April 2007. )提出 利用两幅视图的匹配特征点对, 从若干组备选的深度搜索的最小值、 最大值和搜索步长 中选取使得匹配特征点对之间的误差最小的一组作为深度搜索范围和步长, 该方法需要 用 KLT (Kanade-Lucas-Tomasi)算法( "特征点的检测和跟踪", 卡内基梅隆大学技术报 告; C. Tomasi, and T. Kanade, "Detection and tracking of point features", Technical Report CMU-CS-91-132, Carnegie Mellon University , 1991.)进行特征提取匹配,性能依赖于特征 匹配的正确性。 JVT-W059 'View Synthesis Prediction Core Experiment 6 Report'; S. Yea and A. Vetro, "Report of CE6 on View Synthesis Prediction", ISO IEC JTC1/SC29AVG11, Doc. JVT-W059, San Jose, USA, April 2007 . ) Using a matching feature point pair of two views, selecting a minimum of the difference between the pair of matching feature points from the minimum value, the maximum value and the search step size of the selected group of depth searches as the depth search range And the step size, this method requires the KLT (Kanade-Lucas-Tomasi) algorithm ("Feature Point Detection and Tracking", Carnegie Mellon University Technical Report; C. Tomasi, and T. Kanade, "Detection and tracking of Point features", Technical Report CMU-CS-91-132, Carnegie Mellon University, 1991.) Feature extraction matching, performance depends on the correctness of feature matching.
M.Okutom和 N.Kim等人提到用最长基线的参考视图的 1个像素偏移量所对应的深 度变化值作为搜索步长, 从而保证在所有其他参考视图中的像素偏移量小于 1个像素。 上述两种方法都是使用固定的搜索步长, 没有根据图像内容或场景的变化自适应地 调整步长。 发明内容 本发明所要解决的技术问题是, 提出一种搜索步长的自适应确定方法, 能避免像素 点重复搜索或漏搜索。另外,本发明还提出了一种基于自适应搜索步长的深度估计方法。 本发明为解决上述技术问题所采用的技术方案是, 多视角视频图像深度搜索方法, 其特征在于, 在深度搜索范围内每一步的搜索步长根据当前深度值动态调整, 当前深度 值越小, 采用的搜索步长越小; 当前深度值越大, 采用的搜索步长越大, 使得每一步的 搜索步长对应于相同的像素搜索精度; 根据深度变化值和像素偏移向量的关系,把所述深度搜索范围和搜索步长的确定转 化为像素搜索范围和像素搜索精度的确定; 所述像素搜索精度等于搜索中每一次像素偏 移向量的长度; 所述像素搜索精度可以为分像素精度如二分之一个像素, 四分之一个像 素, 或整像素精度, 如一个像素, 两个像素; 所述搜索步长等于搜索中每一次像素偏移 向量所对应的深度变化值; 目标视图的搜索步长由目标视图中当前深度值、参考视图中的像素偏移向量和视图 对应的照相机内外部参数确定, 目标视图中每一步的搜索步长在参考视图中对应于相同 长度的像素偏移向量。 所述的目标视图是指当前需要估计深度的图像, 所述的参考视图 是指多视角视频系统中的其他图像。参考视图可以在深度搜索过程中自动选择或由用户 指定; 搜索步长由以下公式得到:
Figure imgf000006_0001
其中: P是目标视图中待深度估计的像素点, z为像素点 P的当前深度值, Δζ为像素 点 Ρ的深度变化值即搜索步长, APr为目标视图中像素点 P的深度变化值 Δζ在参考视图 r 中对应的像素偏移向量, II ΔΡΓ II 2= ΔΡΓ Τ · ΔΡΓ, ; Br =4 — ^^― 1禾 =4 —1是 3 x 3的 矩阵, △ = t _tT是三维向量; 其中, R为目标视角的相机坐标系相对于世界坐标系的 三维旋转矩阵; t为目标视角的相机坐标系相对于世界坐标系的平移向量; A为目标视 角的相机内部参数矩阵; ¾为参考视角的相机坐标系相对于世界坐标系的三维旋转矩 阵; ^为参考视角的相机坐标系相对于世界坐标系的平移向量; Ατ为参考视角的相机内 部参数矩阵; b3和 c3分别是矩阵 B 的第三行向量。 对于平行相机系统, 所述深度变 化值与当前深度值的平方成正比。 所述参考视图中的像素偏移向量满足目标视角和参考视角的极线约束方程:
M. Okutom and N. Kim et al. refer to the depth variation value corresponding to the 1 pixel offset of the reference view of the longest baseline as the search step size, thereby ensuring that the pixel offset in all other reference views is less than 1 pixel. Both of the above methods use a fixed search step size, and the step size is not adaptively adjusted according to changes in image content or scene. SUMMARY OF THE INVENTION The technical problem to be solved by the present invention is to provide an adaptive determination method for a search step size, which can avoid repeated or missing search of pixel points. In addition, the present invention also proposes a depth estimation method based on an adaptive search step size. The technical solution adopted by the present invention to solve the above technical problem is a multi-view video image depth search method, which is characterized in that the search step length of each step in the depth search range is dynamically adjusted according to the current depth value, and the current depth value is smaller. The smaller the search step size is; the larger the current depth value is, the larger the search step size is, so that the search step size of each step corresponds to the same pixel search accuracy; Determining the depth search range and the search step length into a determination of a pixel search range and a pixel search accuracy according to a relationship between the depth change value and the pixel offset vector; the pixel search precision is equal to each pixel offset vector in the search The pixel search accuracy may be sub-pixel precision such as one-half pixel, one-quarter pixel, or integer pixel precision, such as one pixel, two pixels; the search step size is equal to each search The depth change value corresponding to the primary pixel offset vector; the search step size of the target view is determined by the current depth value in the target view, the pixel offset vector in the reference view, and the camera internal and external parameters corresponding to the view, each step in the target view The search step size corresponds to a pixel offset vector of the same length in the reference view. The target view refers to an image that currently needs to be estimated, and the reference view refers to other images in a multi-view video system. The reference view can be automatically selected during the deep search or specified by the user; the search step is obtained by the following formula:
Figure imgf000006_0001
Where: P is the pixel to be depth estimated in the target view, z is the current depth value of the pixel point P, Δζ is the depth change value of the pixel point 即, that is, the search step size, and AP r is the depth change of the pixel point P in the target view value of the pixel corresponding to the offset vector Δζ reference view r,, II ΔΡ Γ II 2 = ΔΡ Γ Τ · ΔΡ Γ,; B r = 4 - ^^ - 1 = Wo 4--1 is a 3 x 3 matrix, △ = t _t T is a three-dimensional vector; where R is the three-dimensional rotation matrix of the camera coordinate system of the target perspective relative to the world coordinate system; t is the translation vector of the camera coordinate system of the target perspective relative to the world coordinate system; A is the target perspective Camera internal parameter matrix; 3⁄4 is the three-dimensional rotation matrix of the camera coordinate system with reference to the world coordinate system; ^ is the translation vector of the camera coordinate system with reference to the world coordinate system; Α τ is the camera internal parameter matrix of the reference angle of view ; b 3 and c 3 are the third row vectors of matrix B, respectively. For a parallel camera system, the depth change value is proportional to the square of the current depth value. The pixel offset vector in the reference view satisfies the polar constraint equation of the target view and the reference view:
APr T (C Atr x Br)P = 0 , 其中, P是目标视图中的像素点, △ ^为参考视图中的像素偏移向 量。 存在两个互为相反方向的所述像素偏移向量 ΔΡΤ满足所述的极线约束方程, 所述 2 个像素偏移向量分别对应深度值增大方向、 深度值减小方向; 深度值增大方向的偏移向 量所对应的深度变化值大于深度值减小方向的偏移向量所对应的深度变化值。 多视角视频图像的深度估计方法,在利用基于深度的视图合成和基于块配的深度搜 索中, 目标视图的深度搜索范围和搜索步长由参考视图的像素搜索范围和像素搜索精度 决定; 在深度搜索范围内, 每一步的搜索步长根据当前深度值动态调整, 当前深度值越 小, 采用的搜索步长越小; 当前深度值越大, 采用的搜索步长越大, 使得每一步的搜索 步长对应于相同的像素搜索精度; 所述深度搜索步长由目标视图中当前深度值、参考视图中的像素偏移向量和视图对 应的照相机内外部参数确定, 目标视图中每一步的搜索步长在参考视图中对应于相同长 度的像素偏移向量; 所述的基于深度的视图合成, 是指给定目标视图的像素点和深度值, 根据目标视角 和参考视角的相机内外部参数, 把该像素点反投影到三维场景空间点, 再把该空间点重 新投影到参考视角的图像平面的方法, 得到目标视图在该参考视角的合成视图; 所述基于深度的视图合成和基于块配的深度搜索具体为,利用当前深度值进行视图 合成, 并计算合成视图的像素块与参考视图的像素块之间的误差; 采用最小误差对应的 深度值为目标视图的深度估计值; 多视角视频图像的深度估计方法具体包括以下步骤: 步骤 1 估计目标视图中的深度搜索初始值 ¾=。; 步骤 2 确定深度搜索对应于参考视图中的像素搜索范围和像素搜索精度,根据像素 搜索精度得到参考视图中像素偏移向量 ΔΡ^ 步骤 3 根据当前深度值 ¾和像素偏移向量 Δ , 得到对应的深度变化值 Δ¾, 所述 深度变化值△¾即为下一步搜索步长; 步骤 4 利用当前深度值 进行视图合成, 并计算合成视图的像素块与参考视图的像 素块之间的误差 ek; 步骤 4 更新当前深度值 ¾=¾+△¾; k=k+l ; 步骤 5 判断是否超过给定的像素搜索范围, 如是进入步骤 6, 如否, 进入步骤 3 ; 步骤 6 以误差 ek(k=0,…… ,N-1, N为搜索总步数)中最小误差对应的深度值为估计 值。 所述搜索步长由以下公式得到:
Figure imgf000007_0001
其中: P是目标视图中待深度估计的像素点, z为像素点 P的当前深度值, Δζ为像素点 P 的深度变化值即搜索步长, 为目标视图中像素点 Ρ的深度变化值 Δζ在参考视图 r中 对应的像素偏移向量, II ΔΡΓ II 2= ΔΡΓ τ · ΔΡΓ, ; Br = 4 J - 1禾 P C =4A- 1是 3x3的 矩阵, △ = t _tT是三维向量; 其中, R为目标视角的相机坐标系相对于世界坐标系的 三维旋转矩阵; t为目标视角的相机坐标系相对于世界坐标系的平移向量; A为目标视 角的相机内部参数矩阵; ¾为参考视角的相机坐标系相对于世界坐标系的三维旋转矩 阵; ^为参考视角的相机坐标系相对于世界坐标系的平移向量; Ατ为参考视角的相机内 部参数矩阵; b3和 c3分别是矩阵 B 的第三行向量。 对于平行相机系统, 所述当前深 度值的平方与深度变化值成正比。所述参考视图中的像素偏移向量满足目标视角和参考 视角的极线约束方程:
Figure imgf000008_0001
χΑ)Ρ = 0, 其中, Ρ是目标视图中的像素点, APr为参 考视图中的像素偏移向量。 本发明的有益效果是, 自适应搜索步长的深度搜索不会出现像素漏搜索与重复搜 索, 深度估计中合成的图像块与参考图像块的绝对差小, 错误估计少, 且计算量或深度 搜索次数少。
AP r T (C At r x B r )P = 0 , where P is the pixel point in the target view and Δ ^ is the pixel offset vector in the reference view. There are two pixel offset vectors ΔΡ Τ in mutually opposite directions satisfying the polar line constraint equation, and the two pixel offset vectors respectively correspond to a depth value increasing direction and a depth value decreasing direction; The depth change value corresponding to the offset vector in the large direction is larger than the depth change value corresponding to the offset vector in the direction in which the depth value decreases. A depth estimation method for multi-view video images, in depth-based view synthesis and block-based depth search, the depth search range and search step size of the target view are determined by the pixel search range and pixel search precision of the reference view In the depth search range, the search step size of each step is dynamically adjusted according to the current depth value. The smaller the current depth value is, the smaller the search step size is. The larger the current depth value is, the larger the search step size is. The search step size of each step corresponds to the same pixel search precision; the depth search step size is determined by the current depth value in the target view, the pixel offset vector in the reference view, and the camera internal and external parameters corresponding to the view, each in the target view The search step of one step corresponds to a pixel offset vector of the same length in the reference view; the depth-based view synthesis refers to the pixel point and depth value of a given target view, according to the target angle of view and the reference angle of view within the camera An external parameter, a method of backprojecting the pixel point to a three-dimensional scene space point, and then re-projecting the spatial point to an image plane of the reference perspective, obtaining a composite view of the target view at the reference view; the depth-based view synthesis and The depth search based on the block configuration is specifically, using the current depth value for view synthesis, and calculating the synthesis The error between the pixel block of the figure and the pixel block of the reference view; the depth value corresponding to the minimum error is the depth estimation value of the target view; the depth estimation method of the multi-view video image specifically includes the following steps: Step 1 Estimating the target view The depth search initial value is 3⁄4=. Step 2 Determine that the depth search corresponds to the pixel search range and the pixel search precision in the reference view, and obtain the pixel offset vector ΔΡ^ in the reference view according to the pixel search precision. Step 3 According to the current depth value 3⁄4 and the pixel offset vector Δ, the corresponding correspondence is obtained. The depth change value Δ3⁄4, the depth change value Δ3⁄4 is the next search step; Step 4 uses the current depth value for view synthesis, and calculates the error e k between the pixel block of the composite view and the pixel block of the reference view ; step 4 to update the current depth value ¾ = ¾ + △ ¾; k = k + l; step 5 determining whether more than a given pixel search range, and if so proceeds to step 6, if not, proceeds to step 3; step 6 error EK ( The depth value corresponding to the smallest error in k=0, ..., N-1, N is the total number of search steps) is an estimated value. 
The search step is obtained by the following formula:
Figure imgf000007_0001
Where: P is the pixel point to be depth estimated in the target view, z is the current depth value of the pixel point P, and Δζ is the depth change value of the pixel point P, that is, the search step length, which is the depth change value Δζ of the pixel point 目标 in the target view. In the reference view r, the corresponding pixel offset vector, II ΔΡ Γ II 2 = ΔΡ Γ τ · ΔΡ Γ , ; B r = 4 J - 1 and PC = 4A - 1 is a matrix of 3x3, Δ = t _t T is a three-dimensional vector; wherein, R is a three-dimensional rotation matrix of the camera coordinate system of the target perspective relative to the world coordinate system; t is a translation vector of the camera coordinate system of the target perspective relative to the world coordinate system; A is a camera internal parameter matrix of the target perspective; 3⁄4 is the three-dimensional rotation matrix of the camera coordinate system with reference to the world coordinate system; ^ is the translation vector of the camera coordinate system with reference to the world coordinate system; Α τ is the camera internal parameter matrix of the reference angle of view; b 3 and c 3 is the third row vector of matrix B, respectively. For a parallel camera system, the square of the current depth value is proportional to the depth change value. The pixel offset vector in the reference view satisfies the polar constraint equation of the target view and the reference view:
Figure imgf000008_0001
χΑ)Ρ = 0, where Ρ is the pixel in the target view and AP r is the pixel offset vector in the reference view. The invention has the beneficial effects that the depth search of the adaptive search step does not cause the pixel leak search and the repeated search, the absolute difference between the synthesized image block and the reference image block in the depth estimation is small, the error estimation is small, and the calculation amount or depth Less searches.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the coordinate systems in a multi-view video system; FIG. 2 is a schematic diagram of depth-based view synthesis; FIG. 3(a) is a view at the initial time of the video sequence of the 7th camera in the Uli test sequence; FIG. 3(b) is a view at the initial time of the video sequence of the 8th camera in the Uli test sequence; FIG. 3(c) is a partial view of FIG. 3(a), in which 16 marked points indicate the image region from pixel [527, 430] to pixel [590, 493]; FIG. 4 is a schematic diagram of the relationship between the depth change and the square of the depth value; FIG. 5 is a schematic diagram of the depth change and the pixel offset vector according to the invention; FIG. 6 is a schematic diagram of missed pixel searches at small depth values; FIG. 7 is a schematic diagram of repeated pixel searches at large depth values; FIG. 8 is a schematic diagram of the adaptive adjustment of the depth search step according to the invention; FIG. 9 is a schematic diagram of the distribution of the pixels reached with the adaptive, variable-length search step of the invention; FIG. 10 is a schematic diagram of depth search performance with a fixed search step and with the adaptive step of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention proposes an adaptive method for determining the depth search step. Using the internal and external camera parameters and the perspective projection relationship, it first derives the relationship between the depth value of a pixel, the depth change, and the pixel offset of the projected point in the synthesized view caused by that depth change. Based on the derived relation between the depth change and the corresponding pixel offset, the determination of the depth search range is converted into the determination of a pixel search range; a pixel offset has an intuitive meaning in the image and is easy to set reasonably. Moreover, exploiting the relationship between pixel offset and depth value (the larger the depth value, the smaller the pixel offset caused by the same depth change), the search step is adjusted dynamically so that every search step corresponds to the same pixel search precision, avoiding repeated or missed pixel searches and thereby improving search efficiency and performance. In addition, the invention proposes a simple and effective initial depth estimation method, which solves for the convergence point of the camera optical axes in a convergent camera system and treats this point as a representative point of the scene, thus obtaining a rough estimate of the scene depth.

In multi-view video, three types of coordinate systems are usually needed to describe the scene and its image position information: the world coordinate system in which the scene lies, the camera coordinate systems, and the pixel coordinate systems, as shown in FIG. 1. A camera coordinate system takes the camera center as the origin and the optical axis as the z-axis, with the xy-plane parallel to the image plane; a pixel coordinate system takes the upper-left corner of the image as the origin, with horizontal and vertical coordinates u, v.

Let the position of the camera coordinate system o_i-x_i y_i z_i of camera c_i (i = 1, ..., m) relative to the world coordinate system o-xyz be represented by a three-dimensional rotation matrix R_i and a translation vector t_i, where m is the number of cameras. The coordinates of a scene point in the world coordinate system are written as the vector p = [x, y, z]ᵀ, and its coordinates in the camera coordinate system o_i-x_i y_i z_i as the vector p_i = [x_i, y_i, z_i]ᵀ. From spatial geometry and coordinate transformation:

p = R_i·p_i + t_i   (1)

According to the perspective projection principle of computer vision, the coordinates p_i in the camera coordinate system and the homogeneous pixel coordinates P_i = [u_i, v_i, 1]ᵀ in the image plane satisfy:

z_i·P_i = A_i·p_i   (2)

where A_i is the internal parameter matrix of camera c_i, mainly comprising the focal length, the principal point and the distortion coefficients.
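The two transformations can be written directly as code. The sketch below, with illustrative parameter values (not the Uli calibration data), maps a world point into a camera frame via equation (1) and projects it to homogeneous pixel coordinates via equation (2):

```python
import numpy as np

def world_to_camera(p, R_i, t_i):
    """Invert equation (1): p = R_i p_i + t_i  =>  p_i = R_i^-1 (p - t_i)."""
    return np.linalg.inv(R_i) @ (p - t_i)

def project(p_i, A_i):
    """Equation (2): z_i P_i = A_i p_i; returns (P_i, z_i) with P_i = [u, v, 1]."""
    q = A_i @ p_i
    return q / q[2], q[2]

# Illustrative parameters: identity rotation, a camera 100 mm behind the
# world origin, focal length 1000 px, principal point (512, 384).
R = np.eye(3)
t = np.array([0.0, 0.0, -100.0])
A = np.array([[1000.0, 0.0, 512.0],
              [0.0, 1000.0, 384.0],
              [0.0, 0.0, 1.0]])
P, z = project(world_to_camera(np.array([35.07, 433.93, 1189.78]), R, t), A)
```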
The present invention performs block-matching-based depth search in depth space: using the internal and external camera parameters and depth-based view synthesis, it searches, with the adaptive step size and within the depth search range, for the depth value that minimizes the error between the pixel block of the synthesized view and the corresponding pixel block of the actual reference view, and takes this depth value as the depth estimate of the pixel in the target view. The target view and target viewpoint are the image whose depth is currently being estimated and its viewpoint; the reference views and reference viewpoints are the other images and viewpoints in the multi-view video system. The reference view and reference viewpoint can be selected automatically during the depth search or specified by the user.
When the depth value of a pixel in a view is given, the pixel can be back-projected into the scene space according to the internal and external camera parameters to obtain a space point, and this space point can then be re-projected onto the image plane of the desired viewpoint, yielding a synthesized view for that viewpoint. This is the depth-based view synthesis technique, shown in FIG. 2. Consider the case of two views, with view 1 as the target view and view 2 as the reference view. A pixel P₁ in view 1 has depth value z₁ in the coordinate system of its camera c₁; its corresponding pixel in view 2 is P₂, with depth value z₂ in the coordinate system of camera c₂.
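Before deriving the closed form, a sketch of this back-project/re-project chain for a single pixel can be composed from the primitives of equations (1) and (2); the function below is an illustrative pipeline under the notation of this description, not the literal procedure of the disclosure:

```python
import numpy as np

def warp_pixel(P1, z1, A1, R1, t1, A2, R2, t2):
    """Back-project pixel P1 (homogeneous [u, v, 1]) at depth z1 from
    camera 1 into world space, then re-project it into camera 2."""
    p1 = z1 * np.linalg.inv(A1) @ P1   # camera-1 coordinates, eq. (2)
    p = R1 @ p1 + t1                   # world coordinates, eq. (1)
    p2 = np.linalg.inv(R2) @ (p - t2)  # camera-2 coordinates, eq. (1)
    q = A2 @ p2                        # projection, eq. (2)
    return q / q[2], q[2]              # pixel P2 and its depth z2
```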
From equations (1) and (2) one derives:

z₁·R₁·A₁⁻¹·P₁ + t₁ = z₂·R₂·A₂⁻¹·P₂ + t₂   (3)

From equation (3):

z₂·P₂ = z₁·A₂·R₂⁻¹·R₁·A₁⁻¹·P₁ + A₂·R₂⁻¹·(t₁ - t₂)   (4)

For convenience of description, write:

B = A₂·R₂⁻¹·R₁·A₁⁻¹,  C = A₂·R₂⁻¹,  Δt = t₁ - t₂   (5)

Then (4) becomes:

z₁·B·P₁ + C·Δt = z₂·P₂   (6)

where B and C are 3x3 matrices and Δt is the translation vector between the two cameras. Since P₁ and P₂ are homogeneous coordinates, z₂ can be eliminated from (6), giving the homogeneous pixel coordinates of pixel P₁ in view 2 as:

P₂ = ( z₁·B·P₁ + C·Δt ) / ( z₁·b₃ᵀP₁ + c₃ᵀΔt )   (7)

where b₃ᵀ and c₃ᵀ are the third row vectors of the matrices B and C, respectively. Equation (7) shows that, when the internal and external parameters of cameras c₁ and c₂ are known, the pixel position in view 2 is a function of the pixel position in view 1 and its depth value. Equation (7) is used to synthesize view 1 at the reference viewpoint 2: a pixel P₁ of view 1 at a given depth z is back-projected and re-projected to obtain the pixel P₂ of the synthesized view 2 at the viewpoint of camera c₂,

P₂ = f₂(z, P₁)

According to a common assumption in computer vision, the corresponding pixels of the same scene point in views from different viewpoints have the same luminance and chrominance values. Hence, at depth value z, the luminance/chrominance value at pixel P₂ of the synthesized view 2 of pixel P₁ of view 1 is:

Synthesized_I₂(P₂) = Synthesized_I₂(f₂(z, P₁)) = I₁(P₁)   (8)
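Since B, C and Δt depend only on the calibration, they can be precomputed once and equation (7) evaluated per pixel and per candidate depth. A minimal sketch, with hypothetical calibration values chosen only for illustration:

```python
import numpy as np

def f2(z, P1, B, C, dt):
    """Equation (7): closed-form reprojection of P1 at depth z into view 2."""
    num = z * (B @ P1) + C @ dt   # z1*B*P1 + C*dt, numerator of eq. (6)
    return num / num[2]           # divide by z1*b3'P1 + c3'dt

# B, C, dt assembled as in equation (5); parameter values are placeholders.
A1 = A2 = np.array([[1000.0, 0, 512.0], [0, 1000.0, 384.0], [0, 0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = np.eye(3), np.array([100.0, 0.0, 0.0])   # 100 mm baseline
C_ = A2 @ np.linalg.inv(R2)
B_ = C_ @ R1 @ np.linalg.inv(A1)
P2 = f2(3000.0, np.array([526.0, 429.0, 1.0]), B_, C_, t1 - t2)
```

With these placeholder values the parallax of about 33 pixels agrees with the parallel-camera relation d = f·B/z discussed later, which is a useful sanity check on the matrices.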
Here I₁ is view 1, I₂ is view 2, and Synthesized_I₂ is the synthesized view 2 of view 1 at the viewpoint of reference camera 2. The description above takes a two-camera system as an example; the same principle extends to a camera system composed of m cameras. Assuming that the pixels P_j within a local window W centered on pixel P share the same scene depth, the absolute difference, within window W, between the synthesized view 2 of view 1 and the reference view 2 actually captured by the camera at viewpoint 2 is:

SAD(z, P) = Σ_{P_j ∈ W} | Synthesized_I₂(f(z, P_j)) - I₂(f(z, P_j)) | = Σ_{P_j ∈ W} | I₁(P_j) - I₂(f(z, P_j)) |   (9)
Since the synthesized view 2 is computed with the camera parameters of reference view 2, the synthesized view 2 at the true scene depth theoretically has the same luminance and chrominance values as reference view 2. Therefore, solving for the depth of view 1 at pixel P can be cast as the following problem:

ẑ = arg min_{z ∈ [z_min, z_max]} SAD(z, P)   (10)

That is, within a given depth search range, the depth z that minimizes the absolute difference between the synthesized view and the reference view is taken as the final depth estimate.
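The matching cost of equations (9)-(10) is a plain block SAD. The sketch below evaluates it on NumPy image arrays, with `warp` standing in for f(z, P_j) from the earlier snippets; nearest-neighbour rounding is an assumption made here for simplicity:

```python
import numpy as np

def sad(z, P, I1, I2, warp, half=2):
    """Equation (9): SAD over a (2*half+1)^2 window centered on pixel P.

    warp -- callable (z, u, v) -> (u2, v2): position of (u, v) in view 2
            when its depth is z (equation (7))
    """
    u0, v0 = P
    total = 0.0
    for v in range(v0 - half, v0 + half + 1):
        for u in range(u0 - half, u0 + half + 1):
            u2, v2 = warp(z, u, v)
            u2, v2 = int(round(u2)), int(round(v2))   # nearest-neighbour pixel
            if 0 <= v2 < I2.shape[0] and 0 <= u2 < I2.shape[1]:
                total += abs(float(I1[v, u]) - float(I2[v2, u2]))
    return total

# Equation (10): the estimate is the candidate depth with the smallest SAD.
def best_depth(candidates, P, I1, I2, warp):
    costs = [sad(z, P, I1, I2, warp) for z in candidates]
    return candidates[int(np.argmin(costs))]
```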
This method of searching directly in depth space requires no disparity matching, image rectification is handled within the depth search itself, and the depth value is continuous, so its precision is not limited by the image pixel resolution the way a disparity vector is.

From equation (7), when the internal and external camera parameters are known, the pixel position in the synthesized view 2 is a function of the pixel position in view 1 and its depth value. If the depth value of pixel P₁ in view 1 changes by Δz, its pixel coordinates in the synthesized view 2 become:

P₂' = f₂(z + Δz, P₁) = ( (z + Δz)·B·P₁ + C·Δt ) / ( (z + Δz)·b₃ᵀP₁ + c₃ᵀΔt )   (11)

Hence the depth change Δz of pixel P₁ in view 1 causes the pixel offset vector in the synthesized view 2:

ΔP = P₂' - P₂ = f₂(z + Δz, P₁) - f₂(z, P₁)   (12)

From (12), the relation between the depth change of a pixel of view 1 and the corresponding pixel offset vector ΔP in the synthesized view 2 is derived as:

Δz·[ (c₃ᵀΔt)·B·P₁ - (b₃ᵀP₁)·C·Δt ] = (z₁·b₃ᵀP₁ + c₃ᵀΔt)·( (z₁ + Δz)·b₃ᵀP₁ + c₃ᵀΔt )·ΔP   (13)

Left-multiplying both sides of (13) by ΔPᵀ and solving for Δz gives:

Δz = ( (z₁·b₃ᵀP₁ + c₃ᵀΔt)²·‖ΔP‖² ) / ( ΔPᵀ·[ (c₃ᵀΔt)·B·P₁ - (b₃ᵀP₁)·C·Δt ] - (z₁·b₃ᵀP₁ + c₃ᵀΔt)·(b₃ᵀP₁)·‖ΔP‖² )   (14)

where ‖ΔP‖² = ΔPᵀ·ΔP is the squared modulus of the pixel offset vector ΔP. Thus, when the camera parameters are known, (14) yields the depth change Δz corresponding to the pixel offset vector ΔP at depth z₁.

In addition, from equation (6) the corresponding pixels of the two views must satisfy the following epipolar constraint equations, for P₂ and P₂' respectively:

P₂ᵀ·(C·Δt × B·P₁) = 0   (15)

P₂'ᵀ·(C·Δt × B·P₁) = 0   (16)

where × is the vector cross product. Subtracting (16) from (15) shows that the pixel offset vector ΔP must also satisfy the epipolar constraint equation:

ΔPᵀ·(C·Δt × B)·P₁ = 0   (17)

Given the camera parameters and the pixel P₁, equation (17) is a homogeneous linear equation in the two components Δu and Δv of the pixel offset vector ΔP.

For a parallel camera system, the disparity d of a scene point in the two views is inversely proportional to its depth:

d = f·B / z   (18)

where d and z are the disparity and the depth, and f and B are the focal length and the baseline length of the cameras, respectively. When the depth value of pixel P₁ in view 1 changes from z₁ to z₂, the pixel offset of the corresponding projected point in the synthesized view 2 is:

Δd = f·B·(z₂ - z₁) / (z₁·z₂) ≈ f·B·Δz / z²   (19)
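Equation (17) pins down the direction of ΔP up to sign; a unit-length offset along the epipolar line and its opposite can be computed as in the sketch below, a straightforward reading of (17) with the same B, C and Δt as in the earlier snippets:

```python
import numpy as np

def unit_offsets(P1, B, C, dt):
    """Solve eq. (17): dP^T (C dt x B P1) = 0 for unit-length [du, dv, 0].

    Writing L = (C dt) x (B P1) = [l1, l2, l3], the constraint reads
    du*l1 + dv*l2 = 0, so (du, dv) ~ (l2, -l1); normalizing gives the two
    opposite unit offset vectors dP+ and dP-.
    """
    L = np.cross(C @ dt, B @ P1)
    d = np.array([L[1], -L[0], 0.0])
    d /= np.linalg.norm(d[:2])
    # The sign of dz from eq. (14) tells which direction increases depth.
    return d, -d
```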
From (19), the depth change is proportional to the pixel offset and to the square of the depth value (equivalently, for a fixed depth change, the induced pixel offset is inversely proportional to the square of the depth). For the same pixel offset, the larger the depth value, the larger the corresponding depth change; the smaller the depth value, the smaller the corresponding depth change. For a convergent camera system, when the angle between the two cameras is not very large, equation (12) shows an approximately analogous relationship between depth change, pixel offset and depth value.

To verify this conclusion, we use the parameters of the 7th and 8th cameras of the Uli test sequence shown in FIG. 3. This multi-view video data is provided by the Heinrich-Hertz-Institut (HHI), Germany, and can be downloaded from https://www.3dtv-research.org/3dav_CfP_FhG_HHI/; the sequence is captured by 8 cameras arranged in a convergent configuration, video format 1024x768, 25 fps, and this description uses the views at the initial time of the video sequences of the 7th and 8th cameras. According to equations (14) and (17), we compute, at pixel P = [526, 429] of view 7 (FIG. 3(a); the pixel corresponds to the button to the right of the shirt collar), the relationship between the depth change, the square of the depth value, and the pixel offset. Given a unit pixel offset vector satisfying the epipolar constraint (17), i.e. |ΔP| = 1, the depth changes corresponding to different depth values are computed from (14); their relationship is shown in FIG. 4, where the abscissa is the square of the depth value and the ordinate is the depth change. FIG. 4 shows that, for a given pixel offset in the synthesized view, the depth change is approximately linear in the square of the depth value, which means that at different depth values the same depth change of a pixel of view 1 causes different pixel offsets in the synthesized view.

It is worth noting that, since (17) is a homogeneous linear equation in the pixel offset vector ΔP, there are two solutions ΔP+ and ΔP- of mutually opposite directions. Substituting them into (14) yields one positive and one negative depth change Δz+ and Δz-; that is, ΔP+ and ΔP- correspond to the pixel offsets caused by a depth increase and a depth decrease, respectively. From the preceding analysis, for a given pixel offset the depth change is approximately proportional to the square of the depth value, so the depth changes corresponding to two pixel offset vectors ΔP of the same length but opposite directions are not equal: the depth decrease |Δz-| is smaller than the depth increase |Δz+|, as shown in FIG. 5. For example, taking pixel P = [526, 429] of Uli view 7 (FIG. 3(a)), depth value 3172 mm and pixel offset of 64 pixels, i.e. ‖ΔP‖ = 64, equations (14) and (17) give the depth changes corresponding to the two opposite offset vectors as Δz+ = 930 and Δz- = -593.

From the above analysis, for the same pixel offset the depth change is approximately proportional to the square of the depth value. Hence, with a fixed search step, if the step corresponds to an offset of 1 pixel at a small depth value, then at a larger depth value the pixel offset corresponding to that step is less than 1 pixel. If, when the projection at a given depth falls on a non-integer pixel, the nearest pixel is taken as the projection point, the depth search will reach the same pixel at several different depth values, i.e. repeated searches occur. Conversely, if the given step corresponds to an offset of 1 pixel at a large depth value, then at a smaller depth value the pixel offset corresponding to that step exceeds 1 pixel, i.e. two adjacent depth values reach two non-adjacent pixels, so some pixels are missed and the search is incomplete. Thus, although one expects to search N pixels within the range [z_min, z_max], the number of effective search points actually reached is less than N because of repeated or missed pixel searches.

For example, for pixel P = [526, 429] of Uli view 7 we perform a depth search over [2000, 4500] with a fixed step of 10 mm. As shown in FIG. 6, when the depth value is small, the u-coordinate of the pixel reached at depth 2090 is 661 while that reached at depth 2080 is 663, so a pixel in between is skipped and never searched; as shown in FIG. 7, when the depth value is large, two different depth values 4450 and 4460 reach the same pixel with u-coordinate 437, i.e. the pixel is searched repeatedly. Since the 10 mm step corresponds to a search precision of 1 pixel in the neighborhood of the true depth 3170, one would expect 250 distinct pixels to be searched over [2000, 4500], but because of missed and repeated searches the actual computation finds that only 200 pixels are searched.

For the search step to correspond to the same pixel search precision in the reference view throughout the depth search, i.e. for each step to always correspond to a fixed pixel offset in the reference view, the step must be adjusted dynamically according to the relation between depth change and depth value, and the corresponding search range determined accordingly. Suppose the initial search depth of pixel P₁ in view 1 is z₀. Equation (14) then readily gives the depth change Δz in view 1 corresponding to a pixel offset ΔP in reference view 2 at depth z₀. When the initial depth z₀ does not differ too much from the true depth, the pixel offset between the true corresponding pixel of P₁ in reference view 2 and the pixel obtained at depth z₀ is usually confined to a certain range. The following shows how, within a pixel search range N, the search step is determined adaptively from the depth value so that each step always corresponds to a fixed pixel offset.

Given pixel P₁ and the camera parameters, the epipolar constraint equation (17) for the pixel offset vector is easily solved to obtain the two mutually opposite offset vectors ΔP+ and ΔP- corresponding to a pixel offset ‖ΔP‖; the corresponding depth changes Δz+1 and Δz-1 are then computed from (14) and used as the search steps of the next step in the depth-increasing and depth-decreasing directions, as shown in FIG. 8:

z₋₁ = z₀ + Δz₋₁
z₁ = z₀ + Δz₊₁   (20)

Next, at depth z₋₁ with offset vector ΔP-, (14) gives the corresponding depth change Δz₋₂, and at depth z₁ with offset vector ΔP+, (14) gives the corresponding depth change Δz₊₂; these are taken as the search steps of the following step:

z₋₂ = z₋₁ + Δz₋₂
z₂ = z₁ + Δz₊₂   (21)

and so on; the search depth and step of the n-th step are:

z₋ₙ = z₋₍ₙ₋₁₎ + Δz₋ₙ
zₙ = zₙ₋₁ + Δz₊ₙ   (22)

where the number of search steps n is determined by the search range N and the search precision Δ, i.e. n satisfies n·Δ ≤ N.

Therefore, once the search range and the initial depth are determined, the above procedure yields a variable search step that adapts to the depth value, so that the same pixel search precision is maintained throughout the depth search, overcoming the repeated and missed pixel searches of a fixed step. Since the depth search range is obtained by accumulating the search steps, it also adapts to the depth value: when the depth value grows, the depth search range corresponding to the same pixel offset ‖ΔP‖ grows accordingly; when the depth value shrinks, it shrinks accordingly. Moreover, the depth search precision is conveniently controlled through the pixel precision Δ: Δ = 1 corresponds to a search precision of one pixel, and Δ = 1/2 to half-pixel precision.

Hence, with the relation between depth change and pixel offset vector, equation (14), the depth search step is determined by choosing the pixel search precision, and the determination of the depth search range likewise reduces to choosing a pixel offset. Choosing the pixel offset and search precision is analogous to choosing the search range and precision in disparity estimation: it is intuitive and easy to do, and the depth search range and step can be determined dynamically by adjusting the pixel offset and search precision according to the image content or the application requirements.

In the depth estimation process an initial depth value z₀ must be given, and its quality affects the performance and result of the depth search. When z₀ deviates little from the true depth, a small pixel offset, i.e. a small search range, can be used, reducing the amount of search and increasing speed; when z₀ deviates substantially from the true depth, a relatively large pixel offset must be used to guarantee that the true depth is reached, so the computation grows. Although a poor initial depth can be compensated by a wide search range and a high-precision step, a good initial depth allows a small search range and a suitable step, improving the efficiency and performance of the depth search. The estimation and determination of the initial depth is therefore also very important.

The determination of the initial depth of a video sequence splits into two cases: the image at the initial time and the subsequent images. For the image at the initial time there are again two cases, the first pixel and the other pixels. For the first pixel, no pixel has been depth-searched yet, so no scene depth information is known; one must consider how to obtain a rough scene depth from information such as the image features and camera parameters to use as the initial value. For the subsequent pixels, the initial depth can be set from the depth estimates of neighboring pixels in the image. For the subsequent images, since the depth values of the video sequence of a single viewpoint are strongly correlated (the depth of the static background remains unchanged while only a few moving regions change in depth), the depth at the same pixel position in the previous image can be used as the initial value. The key in determining the initial depth is therefore to obtain the scene depth of the image at the initial time, providing a good initial depth for the first pixel.
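Stepping back to the recursion (20)-(22), it translates directly into code. The sketch below builds the two-sided depth schedule, reusing the hypothetical `depth_step` and `unit_offsets` helpers from the earlier sketches; n is chosen so that n·Δ ≤ N as stated above:

```python
import numpy as np

def depth_schedule(z0, n, delta, P1, B, C, dt, depth_step, unit_offsets):
    """Equations (20)-(22): adaptive depth grids in both directions.

    z0    -- initial depth estimate
    n     -- number of steps per direction, with n * delta <= N
    delta -- pixel search precision (1 = one pixel, 0.5 = half pixel)
    """
    d_pos, d_neg = unit_offsets(P1, B, C, dt)
    # Assumption: d_pos is the direction for which depth_step returns a
    # positive dz; swap the two directions otherwise.
    z_up, z_down = [z0], [z0]
    for _ in range(n):
        # each step spans the same pixel offset delta in the reference view
        z_up.append(z_up[-1] + depth_step(z_up[-1], P1, delta * d_pos, B, C, dt))
        z_down.append(z_down[-1] + depth_step(z_down[-1], P1, delta * d_neg, B, C, dt))
    return z_down[::-1] + z_up[1:]   # ordered grid z_-n ... z_0 ... z_n
```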
In multi-view video, the differences between the images of different views, or the position information of the cameras, usually contain information about the scene depth. For the two cases of a convergent camera system and a parallel camera system, an initial estimate of the scene depth from the camera parameters or the image information, without any known depth information, is given below.

The main goal of multi-view video is to capture the same scene from several angles, so the cameras are usually placed along an arc with their optical axes converging at one point, i.e. a convergent system. In practice the cameras may not converge strictly at one point, but one can always find the point closest to all camera optical axes, and this point is regarded as the convergence point. The convergence point usually lies where the scene is and can be regarded as a representative point of the scene, so solving for the position of the convergence point yields a rough value of the scene depth, which is used as the initial value of the depth search.

Let the coordinates of the convergence point in the world coordinate system be M_c = [x_c, y_c, z_c]ᵀ. This point lies on the optical axis of every camera, so in the camera coordinate system whose z-axis is the optical axis it can be written as:

M_ci = [0, 0, z_ci]ᵀ   (23)

where z_ci is the depth of the convergence point in the coordinate system of camera c_i. From the relation between world coordinates and camera coordinates:

M_c = R₁·M_c1 + t₁
M_c = R₂·M_c2 + t₂
...
M_c = R_m·M_cm + t_m   (24)

Eliminating M_c gives:

R₁·[0, 0, z_c1]ᵀ + t₁ = R₂·[0, 0, z_c2]ᵀ + t₂
R₁·[0, 0, z_c1]ᵀ + t₁ = R₃·[0, 0, z_c3]ᵀ + t₃
...
R₁·[0, 0, z_c1]ᵀ + t₁ = R_m·[0, 0, z_cm]ᵀ + t_m   (25)

Equation (25) is a system of 3(m-1) linear equations in the depths z_c1, z_c2, ..., z_cm. Solving (25) by linear least squares yields the depth of the convergence point in each camera coordinate system; these depths are a rough value of the scene depth and can serve as the initial value of the depth search.

A parallel camera system has no convergence point, so depth information cannot be obtained by the above method; but in this case disparity and depth obey the simple inverse relation (18), so depth information can be obtained by computing the global disparity between two views. The global disparity can be defined as the pixel offset that minimizes the absolute difference of the two views, i.e. it is obtained as:

d_g = arg min_x Σ_{P ∈ R} | I₁(P) - I₂(P + x) |   (26)

where R is the overlapping region of views 1 and 2, over whose pixels the sum runs. Since high precision is not required of the global disparity estimate, the search unit of the pixel offset x in (26) can be set fairly large, for example 8 or 16 pixels, which greatly reduces the computation. Once the global disparity is obtained, the initial depth follows from the inverse relation (18) between depth and disparity.

Using a scene point provided in the Uli video sequence parameter document, the real-world coordinates [35.07, 433.93, -1189.78] (in mm) of the high-brightness point to the left of the glasses, and the relation (1) between world and camera coordinates, the coordinates and true depth of this scene point in the camera coordinate systems are obtained. The above method for the convergence point of two cameras, i.e. solving the linear system (25), then gives the depths of the convergence point in the coordinate systems of camera 7 and camera 8; the results are shown in Table 1. By human visual inspection the depth of field of the Uli scene varies little, and the initial depth estimates in Table 1 differ little from the true depth of the scene point, showing that the initial depth estimates are effective and reasonable and provide a good initial value for the depth estimation.
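A sketch of the least-squares solution of (25): stacking the m-1 vector equations gives a 3(m-1) x m linear system in the unknown depths, solvable with a standard solver. The camera poses here are placeholders for real calibration data:

```python
import numpy as np

def convergence_depths(Rs, ts):
    """Solve eq. (25) in least squares for the depths z_c1 .. z_cm.

    Rs, ts -- lists of rotation matrices and translation vectors of the
              m cameras (world pose of each camera coordinate system)
    """
    m = len(Rs)
    r = [R[:, 2] for R in Rs]   # R_i [0,0,z]^T = z * (third column of R_i)
    A = np.zeros((3 * (m - 1), m))
    b = np.zeros(3 * (m - 1))
    for i in range(1, m):
        rows = slice(3 * (i - 1), 3 * i)
        A[rows, 0] = r[0]         # + z_c1 * r_1
        A[rows, i] = -r[i]        # - z_ci * r_i
        b[rows] = ts[i] - ts[0]   # t_i - t_1
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z                      # rough scene depth seen from each camera
```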
[Table 1. Initial depth estimates of the convergence point in the coordinate systems of cameras 7 and 8 compared with the true depth of the scene point; the table was embedded as an image in the source and its values are not recoverable here.]

Consider the 64x64 image region of the Uli view shown in FIG. 3(c), from pixel [527, 430] to pixel [590, 493]. For the pixels taken every 15 pixels in this region, 16 pixels in total, depth searches are carried out with a fixed step and with the adaptive step. Within the fixed search range [2000, 5000] three searches with fixed steps are performed, with steps 20, 35 and 50 respectively. For the adaptive search step, the initial depth is 2817, the pixel offset is set to 32 pixels, the search precision is 1 pixel, and the initial depth of each subsequent pixel is set to the depth estimate of a neighboring pixel. With the method of the invention for determining the adaptive search step, the search steps corresponding to a search precision of one pixel are obtained for pixel [527, 430] within a search range of 32 pixels around the initial search pixel, as listed in Table 2; the pixels reached with these search steps are shown in FIG. 9. Table 2 shows that the steps along the depth-decreasing direction are negative, and their absolute value gradually decreases as the pixel offset grows and the depth value decreases; the steps along the depth-increasing direction are positive, and their absolute value gradually increases as the pixel offset grows and the depth value increases. FIG. 9 shows that when the depth search uses the variable steps of Table 2, the corresponding pixel search precision indeed remains constant, always one pixel.
Pixel offset | Step (depth increasing) | Step (depth decreasing) | Pixel offset | Step (depth increasing) | Step (depth decreasing)
1  | 11.4877 | -11.2503 | 17 | 12.8909 | -10.1000
2  | 11.5686 | -11.1728 | 18 | 12.9870 | -10.0340
3  | 11.6502 | -11.0960 | 19 | 13.0842 | -9.9687
4  | 11.7328 | -11.0201 | 20 | 13.1824 | -9.9041
5  | 11.8162 | -10.9450 | 21 | 13.2818 | -9.8400
6  | 11.9005 | -10.8706 | 22 | 13.3823 | -9.7766
7  | 11.9858 | -10.7969 | 23 | 13.4840 | -9.7138
8  | 12.0719 | -10.7240 | 24 | 13.5868 | -9.6515
9  | 12.1590 | -10.6519 | 25 | 13.6908 | -9.5899
10 | 12.2470 | -10.5805 | 26 | 13.7961 | -9.5289
11 | 12.3360 | -10.5098 | 27 | 13.9025 | -9.4684
12 | 12.4260 | -10.4397 | 28 | 14.0101 | -9.4086
13 | 12.5169 | -10.3704 | 29 | 14.1191 | -9.3493
14 | 12.6089 | -10.3018 | 30 | 14.2293 | -9.2905
15 | 12.7018 | -10.2339 | 31 | 14.3407 | -9.2323
16 | 12.7958 | -10.1666 | 32 | 14.4535 | -9.1746

Table 2

Depth estimation is carried out with the block-matching method; the depth search results are shown in FIG. 10, where each point represents the absolute difference between the synthesized block and the actual block at the depth found by the search. The smaller this value, the more accurate the depth estimate usually is. With a fixed search step, a smaller step means a higher search precision, so the depth estimation improves: the absolute difference at the depth found with the 20 mm step is smaller than that with the 35 mm step, and that with 35 mm is smaller than that with 50 mm. However, the depths found with the adaptive search step are the best, with the smallest corresponding absolute difference.

FIG. 3(c) shows the 16 pixels of the image region from [527, 430] to [590, 493] of view 7 for which depth is estimated, using the adaptive search step and the fixed steps 20, 35 and 50. Table 3 shows that with the adaptive search step all 16 pixels find the correct depth, whereas with a fixed step there are erroneous depth estimates. The reason is that these pixels lie in a texture-poor area: within a large fixed search range, the point of minimum absolute difference found by the search does not correspond to the correct pixel. With the adaptive search step, since the initial value is determined from neighboring information, the pixel offset can be set small, i.e. the search is confined to a relatively small local range, which lowers the probability of reaching a wrong pixel and guarantees a certain smoothness of the depth. Table 3 lists the depth estimation results, the number of depth searches and the number of erroneous estimates with the adaptive search step and with the fixed search steps; the boxed entries of the original table are the erroneous values. The results of Table 3 show that the adaptive-step search needs fewer searches and produces no erroneous estimates, whereas the fixed-step searches need more searches and still produce erroneous estimates. For example, the adaptive depth search with a pixel offset of 32 only searches 64 depth values, whereas the 20 mm fixed step must search 150 depth values over the range [2000, 5000].
[Table 3. Depth estimation results for the 16 pixels, number of depth searches, and number of erroneous estimates. The adaptive-step row was embedded as an image and is not recoverable here; per the text, the adaptive step finds the correct depth for all 16 pixels with 64 searches and no erroneous estimate. The recovered fixed-step rows follow; the clearly wrong values (e.g. 2300, 2350, 2380) are the entries boxed as errors in the original.]

Fixed search step (20 mm), 150 searches:
3160 3180 3180 3200
3160 3160 3160 2380
3160 3160 3160 2300
3140 3160 3160 3160

Fixed search step (35 mm), 86 searches:
3155 3190 3155 3190
3155 3155 3155 3190
3155 3155 3155 2350
3120 3155 3155 3120

Fixed search step (50 mm), 60 searches:
3150 3200 3150 3230
3150 3150 3150 2350
3150 3150 3150 2300
3150 3150 3150 3150

From the results of Table 3 and FIG. 10 it is concluded that the depth search performance of the adaptive search step is higher than that of the fixed search step: the absolute difference between the image block synthesized with the estimated depth and the reference image block is small, there are few erroneous estimates, and the number of depth searches is small.

Claims

1. A depth search method for multi-view video images, characterized in that the search step of every step within the depth search range is dynamically adjusted according to the current depth value: the smaller the current depth value, the smaller the search step used; the larger the current depth value, the larger the search step used, so that the search step of every step corresponds to the same pixel search precision.
2. The depth search method for multi-view video images according to claim 1, characterized in that, according to the relation between the depth change and the pixel offset vector, the determination of the depth search range and search step is converted into the determination of a pixel search range and a pixel search precision.
3. The depth search method for multi-view video images according to claim 2, characterized in that the pixel search precision equals the length of the pixel offset vector of each step of the search, the pixel search precision being sub-pixel precision or integer-pixel precision.
4. The depth search method for multi-view video images according to claim 2, characterized in that the search step equals the depth change corresponding to the pixel offset vector of each step of the search.
5. The depth search method for multi-view video images according to claim 2, characterized in that the search step is determined by the current depth value, the pixel offset vector, and the internal and external camera parameters.
6. The depth search method for multi-view video images according to claim 5, characterized in that the search step is obtained by the following formula:

Δz = ( (z·b₃ᵀP + c₃ᵀΔt_r)² · ‖ΔP_r‖² ) / ( ΔP_rᵀ·[ (c₃ᵀΔt_r)·B_r·P - (b₃ᵀP)·C_r·Δt_r ] - (z·b₃ᵀP + c₃ᵀΔt_r)·(b₃ᵀP)·‖ΔP_r‖² )

where P is the pixel point in the target view whose depth is being estimated, z is the current depth value of pixel P, Δz is the depth change of pixel P, i.e. the search step, and ΔP_r is the pixel offset vector in the reference view r corresponding to the depth change Δz of pixel P in the target view, with ‖ΔP_r‖² = ΔP_rᵀ·ΔP_r; B_r = A_r·R_r⁻¹·R·A⁻¹ and C_r = A_r·R_r⁻¹ are 3x3 matrices, and Δt_r = t - t_r is a three-dimensional vector; R is the three-dimensional rotation matrix of the camera coordinate system of the target view relative to the world coordinate system; t is the translation vector of the camera coordinate system of the target view relative to the world coordinate system; A is the internal camera parameter matrix of the target view; R_r is the three-dimensional rotation matrix of the camera coordinate system of the reference view relative to the world coordinate system; t_r is the translation vector of the camera coordinate system of the reference view relative to the world coordinate system; A_r is the internal camera parameter matrix of the reference view; and b₃ᵀ and c₃ᵀ are the third row vectors of the matrices B_r and C_r, respectively.
7. The depth search method for multi-view video images according to claim 6, characterized in that the pixel offset vector in the reference view satisfies the epipolar constraint equation of the target view and the reference view:

ΔP_rᵀ·(C_r·Δt_r × B_r)·P = 0

where P is the pixel point in the target view and ΔP_r is the pixel offset vector in the reference view.
8. The depth search method for multi-view video images according to claim 7, characterized in that there exist two pixel offset vectors of mutually opposite directions satisfying said epipolar constraint equation, the two pixel offset vectors corresponding to the depth-increasing direction and the depth-decreasing direction respectively; the depth change corresponding to the offset vector of the depth-increasing direction is larger in magnitude than the depth change corresponding to the offset vector of the depth-decreasing direction.
9. The depth search method for multi-view video images according to claim 6, characterized in that, in a parallel camera system, the depth change is proportional to the square of the current depth value.
10. A depth estimation method for multi-view video images, characterized in that, in depth-based view synthesis and block-matching-based depth search, the depth search range and search step of the target view are determined by the pixel search range and pixel search precision of the reference view; within the depth search range, the search step of every step is dynamically adjusted according to the current depth value: the smaller the current depth value, the smaller the search step used; the larger the current depth value, the larger the search step used, so that the search step of every step corresponds to the same pixel search precision.
11. The depth estimation method for multi-view video images according to claim 10, characterized in that the pixel search precision equals the length of the pixel offset vector of each step of the search, and the search step equals the depth change corresponding to the pixel offset vector of each step of the search.
12. The depth estimation method for multi-view video images according to claim 10, characterized in that the depth-based view synthesis and block-matching-based depth search specifically comprise: performing view synthesis with the current depth value and computing the error between the pixel block of the synthesized view and the pixel block of the reference view; the depth value corresponding to the minimum error is taken as the depth estimate of the target view.
13. The depth estimation method for multi-view video images according to claim 12, characterized by comprising the following steps: Step 1: estimate the initial depth value z_k = z_0 of the depth search in the target view; Step 2: determine the pixel search range and pixel search precision in the reference view corresponding to the depth search, and obtain the pixel offset vector ΔP_r in the reference view from the pixel search precision; Step 3: from the current depth value z_k and the pixel offset vector ΔP_r, obtain the corresponding depth change Δz_k, the depth change Δz_k being the next search step; Step 4: perform view synthesis with the current depth value z_k, compute the error e_k between the pixel block of the synthesized view and the pixel block of the reference view, and update the current depth value z_{k+1} = z_k + Δz_k, k = k + 1; Step 5: judge whether the given pixel search range is exceeded; if so, go to Step 6; if not, go to Step 3; Step 6: the depth value corresponding to the minimum of the errors e_k (k = 0, ..., N-1, N being the total number of search steps) is the estimate.
14. The depth estimation method for multi-view video images according to claim 13, characterized in that the error e_k is the absolute difference or the squared difference between the pixel block of the synthesized view and the pixel block of the reference view.
15. The depth estimation method for multi-view video images according to claim 13, characterized in that, for a convergent camera system, in Step 1 the depth of the convergence point of the convergent camera system is taken as the initial depth value z_0 of the depth search in the target view.
16. The depth estimation method for multi-view video images according to claim 15, characterized in that the convergence point of the convergent camera system is obtained by solving the following system of linear equations:

R·[0, 0, z₀]ᵀ + t = R₁·[0, 0, z_r1]ᵀ + t₁
R·[0, 0, z₀]ᵀ + t = R₂·[0, 0, z_r2]ᵀ + t₂
...
R·[0, 0, z₀]ᵀ + t = R_m·[0, 0, z_rm]ᵀ + t_m

where z₀ is the depth of the convergence point in the camera coordinate system of the target view, z_ri (i = 1, ..., m) is the depth of the convergence point in the camera coordinate system of reference view i, and m is the number of reference views.
17. The depth estimation method for multi-view video images according to claim 13, characterized in that, for a parallel camera system, in Step 1 the initial depth value z₀ of the depth search is obtained from the inverse proportionality between the global disparity and the depth:

z₀ = f·B / d

where z₀ is the initial depth value, d is the global disparity, f is the focal length of the camera, and B is the baseline length of the camera.
18. The depth estimation method for multi-view video images according to claim 17, characterized in that the global disparity is the pixel offset vector that minimizes the absolute difference between the translated reference view and the target view.
19. The depth estimation method for multi-view video images according to claim 13, characterized in that the depth change Δz is obtained by the following formula:

Δz = ( (z·b₃ᵀP + c₃ᵀΔt_r)² · ‖ΔP_r‖² ) / ( ΔP_rᵀ·[ (c₃ᵀΔt_r)·B_r·P - (b₃ᵀP)·C_r·Δt_r ] - (z·b₃ᵀP + c₃ᵀΔt_r)·(b₃ᵀP)·‖ΔP_r‖² )

where P is the pixel point in the target view whose depth is being estimated, z is the current depth value of pixel P, Δz is the depth change of pixel P, i.e. the search step, and ΔP_r is the pixel offset vector in the reference view r corresponding to the depth change Δz of pixel P in the target view, with ‖ΔP_r‖² = ΔP_rᵀ·ΔP_r; B_r = A_r·R_r⁻¹·R·A⁻¹ and C_r = A_r·R_r⁻¹ are 3x3 matrices, and Δt_r = t - t_r is a three-dimensional vector; R is the three-dimensional rotation matrix of the camera coordinate system of the target view relative to the world coordinate system; t is the translation vector of the camera coordinate system of the target view relative to the world coordinate system; A is the internal camera parameter matrix of the target view; R_r is the three-dimensional rotation matrix of the camera coordinate system of the reference view relative to the world coordinate system; t_r is the translation vector of the camera coordinate system of the reference view relative to the world coordinate system; A_r is the internal camera parameter matrix of the reference view; and b₃ᵀ and c₃ᵀ are the third row vectors of the matrices B_r and C_r, respectively.
20. The depth estimation method for multi-view video images according to claim 19, characterized in that the pixel offset vector ΔP_r in the reference view satisfies the epipolar constraint equation of the target view and the reference view:

ΔP_rᵀ·(C_r·Δt_r × B_r)·P = 0

where P is the pixel point in the target view and ΔP_r is the pixel offset vector in the reference view.
21. The depth estimation method for multi-view video images according to claim 20, characterized in that there exist two pixel offset vectors of mutually opposite directions satisfying said epipolar constraint equation, the two pixel offset vectors corresponding to the depth-increasing direction and the depth-decreasing direction respectively; the depth change corresponding to the offset vector of the depth-increasing direction is larger in magnitude than the depth change corresponding to the offset vector of the depth-decreasing direction.
PCT/CN2008/072141 2008-02-03 2008-08-26 Depth searching method and depth estimating method for multi-viewing angle video image WO2009097714A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810300330A CN100592338C (en) 2008-02-03 2008-02-03 Multi-visual angle video image depth detecting method and depth estimating method
CN200810300330.7 2008-02-03

Publications (1)

Publication Number Publication Date
WO2009097714A1 true WO2009097714A1 (en) 2009-08-13

Family

ID=39898199

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/072141 WO2009097714A1 (en) 2008-02-03 2008-08-26 Depth searching method and depth estimating method for multi-viewing angle video image

Country Status (2)

Country Link
CN (1) CN100592338C (en)
WO (1) WO2009097714A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100592338C (en) * 2008-02-03 2010-02-24 四川虹微技术有限公司 Multi-visual angle video image depth detecting method and depth estimating method
JP5713624B2 (en) 2009-11-12 2015-05-07 キヤノン株式会社 3D measurement method
CN101710423B (en) * 2009-12-07 2012-01-04 青岛海信网络科技股份有限公司 Matching search method for stereo image
KR101640404B1 (en) 2010-09-20 2016-07-18 엘지전자 주식회사 Mobile terminal and operation control method thereof
US9524556B2 (en) 2014-05-20 2016-12-20 Nokia Technologies Oy Method, apparatus and computer program product for depth estimation
TWI528783B (en) * 2014-07-21 2016-04-01 由田新技股份有限公司 Methods and systems for generating depth images and related computer products
CN109325992B (en) * 2018-10-19 2023-07-04 珠海金山数字网络科技有限公司 Image drawing method and device, computing device and storage medium
CN111476835B (en) * 2020-05-21 2021-08-10 中国科学院自动化研究所 Unsupervised depth prediction method, system and device for consistency of multi-view images
CN113112551B (en) * 2021-04-21 2023-12-19 阿波罗智联(北京)科技有限公司 Camera parameter determining method and device, road side equipment and cloud control platform
CN113486928B (en) * 2021-06-16 2022-04-12 武汉大学 Multi-view image alignment method based on rational polynomial model differentiable tensor expression

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6046763A (en) * 1997-04-11 2000-04-04 Nec Research Institute, Inc. Maximum flow method for stereo correspondence
JP2004534336A (en) * 2001-07-06 2004-11-11 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ MOTION OR DEPTH ESTIMATION METHOD AND ESTIMATION UNIT, AND IMAGE PROCESSING APPARATUS HAVING SUCH MOTION ESTIMATION UNIT
CN1851752A (en) * 2006-03-30 2006-10-25 东南大学 Dual video camera calibrating method for three-dimensional reconfiguration system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1153362A (en) * 1995-03-29 1997-07-02 三洋电机株式会社 Methods for creating image for three-dimensional display, for calculating depth information, and for image processing using depth information
US20050163366A1 (en) * 2000-05-04 2005-07-28 Microsoft Corporation System and method for progressive stereo matching of digital images
CN101231754A (en) * 2008-02-03 2008-07-30 四川虹微技术有限公司 Multi-visual angle video image depth detecting method and depth estimating method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747524B2 (en) 2014-02-28 2017-08-29 Ricoh Company, Ltd. Disparity value deriving device, equipment control system, movable apparatus, and robot
CN110162098A (en) * 2019-07-03 2019-08-23 安徽理工大学 A kind of mining unmanned plane
CN111179327A (en) * 2019-12-30 2020-05-19 青岛联合创智科技有限公司 Depth map calculation method
CN111179327B (en) * 2019-12-30 2023-04-25 青岛联合创智科技有限公司 Depth map calculation method
CN113643414A (en) * 2020-05-11 2021-11-12 北京达佳互联信息技术有限公司 Three-dimensional image generation method and device, electronic equipment and storage medium
CN113643414B (en) * 2020-05-11 2024-02-06 北京达佳互联信息技术有限公司 Three-dimensional image generation method and device, electronic equipment and storage medium
CN113240785A (en) * 2021-04-13 2021-08-10 西安电子科技大学 Multi-camera combined rapid ray tracing method, system and application
CN113240785B (en) * 2021-04-13 2024-03-29 西安电子科技大学 Multi-camera combined rapid ray tracing method, system and application
WO2022267275A1 (en) * 2021-06-25 2022-12-29 北京市商汤科技开发有限公司 Depth detection method, apparatus and device, storage medium, computer program and product
CN113538318A (en) * 2021-08-24 2021-10-22 北京奇艺世纪科技有限公司 Image processing method, image processing device, terminal device and readable storage medium
CN113538318B (en) * 2021-08-24 2023-12-15 北京奇艺世纪科技有限公司 Image processing method, device, terminal equipment and readable storage medium

Also Published As

Publication number Publication date
CN101231754A (en) 2008-07-30
CN100592338C (en) 2010-02-24

Similar Documents

Publication Publication Date Title
WO2009097714A1 (en) Depth searching method and depth estimating method for multi-viewing angle video image
Li et al. Hole filling with multiple reference views in DIBR view synthesis
Daribo et al. A novel inpainting-based layered depth video for 3DTV
Pollefeys et al. A simple and efficient rectification method for general motion
Daribo et al. Depth-aided image inpainting for novel view synthesis
US8116557B2 (en) 3D image processing apparatus and method
US7944444B2 (en) 3D image processing apparatus and method
KR100755450B1 (en) 3d reconstruction apparatus and method using the planar homography
US20020106120A1 (en) Method of analyzing in real time the correspondence of image characteristics in corresponding video images
US20110050853A1 (en) Method and system for converting 2d image data to stereoscopic image data
JP2018519697A (en) A method for synthesizing a light field in which omnidirectional parallax is compressed using depth information
WO2018188277A1 (en) Sight correction method and device, intelligent conference terminal and storage medium
JP6173218B2 (en) Multi-view rendering apparatus and method using background pixel expansion and background-first patch matching
Zhang et al. Stereoscopic video synthesis from a monocular video
JP3561446B2 (en) Image generation method and apparatus
Knorr et al. An image-based rendering (ibr) approach for realistic stereo view synthesis of tv broadcast based on structure from motion
Knorr et al. Stereoscopic 3D from 2D video with super-resolution capability
Jantet et al. Joint projection filling method for occlusion handling in depth-image-based rendering
JP4605716B2 (en) Multi-view image compression encoding method, apparatus, and program
Wang et al. Block-based depth maps interpolation for efficient multiview content generation
Knorr et al. From 2D-to stereo-to multi-view video
KR100655465B1 (en) Method for real-time intermediate scene interpolation
US20130229408A1 (en) Apparatus and method for efficient viewer-centric depth adjustment based on virtual fronto-parallel planar projection in stereoscopic images
Gurrieri et al. Efficient panoramic sampling of real-world environments for image-based stereoscopic telepresence
Robert et al. Disparity-compensated view synthesis for s3D content correction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08784131

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08784131

Country of ref document: EP

Kind code of ref document: A1