CN113034568B - Machine vision depth estimation method, device and system


Info

Publication number
CN113034568B
Authority
CN
China
Prior art keywords
pixel
binocular
image
depth
images
Prior art date
Legal status
Active
Application number
CN201911354323.XA
Other languages
Chinese (zh)
Other versions
CN113034568A (en)
Inventor
龙学雄
李建禹
Current Assignee
Hangzhou Hikrobot Co Ltd
Original Assignee
Hangzhou Hikrobot Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikrobot Co Ltd
Priority to CN201911354323.XA
Publication of CN113034568A
Application granted
Publication of CN113034568B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • G06T7/593: Depth or shape recovery from multiple images from stereo images
    • G06T7/596: Depth or shape recovery from multiple images from stereo images from three or more stereo images
    • G06T7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Abstract

The application discloses a machine vision depth estimation method, which comprises: acquiring at least three view images, the view images being respectively acquired by at least two groups of binocular image acquisition devices that share the same image acquisition device; performing image stereo correction on the view images; based on the binocular images from the at least two groups of binocular image acquisition devices among the stereo-corrected view images, performing stereo matching on each group of binocular images respectively to obtain the depth value of each pixel in each group of binocular images; and fusing the depth values of a pixel I across the groups of binocular images to obtain the depth estimate of the pixel I, where I is a natural number not greater than the total number of pixels in the binocular images. With this method, more information about the target scene is acquired from multiple angles and directions for depth estimation, compensating for defects such as object occlusion and unfavorable texture directions.

Description

Machine vision depth estimation method, device and system
Technical Field
The invention relates to the field of machine vision, in particular to a machine vision depth estimation method.
Background
Structured light works by projecting a light-information pattern onto the surface of an object with an optical projector, so that a light-information image modulated by the surface shape of the measured object is formed on that surface. A camera collects this light-information image, and the position, depth and other information of the object are calculated from the change of the optical signal caused by the object, thereby recovering the entire three-dimensional space.
The current depth estimation technology based on stereoscopic vision mainly uses a binocular camera to collect laser speckle, which is one of the approaches based on projecting active structured light. Binocular camera vision systems based on laser speckle still have several problems:
firstly, the number of speckles is limited, so effective depth cannot be recovered at every point, leaving many holes;
secondly, occlusion is unavoidable because of the viewing-angle difference between the left and right cameras, which causes large depth holes;
thirdly, at the image edges, the camera's maximum search parallax is limited and the actual common-view area at the boundary is insufficient, so large holes appear at the image boundary and the horizontal field of view of the depth camera is seriously reduced;
fourthly, pixels whose gradient direction is perpendicular to the baseline direction (generally the horizontal direction), that is, whose texture direction is parallel to the baseline, have very low depth reliability or become holes;
fifthly, binocular depth maps are not accurate enough, especially at slightly larger distances.
Existing methods use temporal multi-frame fusion to address the above problems, but this solves only part of them and cannot resolve the depth-map holes caused by the depth map boundary, object occlusion and/or texture gradients.
Disclosure of Invention
The invention provides a machine vision depth estimation method, which aims to solve the problem of holes in the depth map in binocular vision based on structured light.
The invention provides a machine vision depth estimation method, which comprises the following steps of,
acquiring at least three view images, the view images being acquired by at least two sets of binocular image acquisition devices sharing the same image acquisition device, respectively;
carrying out image stereo correction on the view image;
based on binocular images from the at least two sets of binocular image acquisition devices in the view images after the stereo correction, respectively performing stereo matching on the sets of binocular images to obtain depth values of each pixel in the sets of binocular images;
for any pixel I in each group of binocular images, fusing the depth values of the pixel I in all the groups of binocular images to obtain the depth estimation value of the pixel I,
Wherein I is a natural number not greater than the total number of pixels in the binocular images, and the pixels I in each group of binocular images have the same coordinates in a pixel coordinate system or a world coordinate system.
Preferably, the stereo matching of each group of binocular images further comprises, in the stereo matching process, calculating the parallax confidence of each pixel in each group of binocular images;
the depth values of the pixels I in all the binocular images of each group are fused for any pixel I in each binocular image of each group to obtain the depth estimation value of the pixel I, which comprises,
respectively weighting the depth value of any pixel I in each group of binocular images by using the parallax confidence of the pixel I to obtain the weighted depth value of the pixel I in each group of binocular images;
and fusing the depth values weighted by the pixels I in the binocular images to obtain the depth estimation value.
Preferably, for any pixel I in each group of binocular images, the depth value of the pixel I is weighted by using the parallax confidence of the pixel I, so as to obtain the weighted depth value of the pixel I in each group of binocular images, which includes,
Fusing the parallax confidence coefficient of any pixel I in each group of binocular images to obtain the parallax confidence coefficient of the fused pixel I;
the ratio of the parallax confidence coefficient of the pixel I in each group of binocular images after fusion is used as a first factor;
the depth value of the pixel I in each set of binocular images is weighted with the first factor, respectively.
Preferably, for any pixel I in each group of binocular images, the depth value of the pixel I is weighted by using the parallax confidence of the pixel I, so as to obtain the weighted depth value of the pixel I in each group of binocular images, further comprising,
if the depth value of any pixel I in any binocular image in each set of binocular images does not exist, using the depth values of pixels I in other sets of binocular images;
if the depth value of each pixel in each group of binocular images exists, taking the difference value between the binocular base line lengths of each group as a weight, and respectively weighting the parallax confidence so that:
when the current depth fusion value is smaller than the depth threshold value, giving a first weight to any pixel depth value of the binocular image from the first baseline binocular image acquisition device, and giving a second weight to the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
when the current depth fusion value is greater than or equal to the depth threshold, the pixel depth value of the binocular image from the first-baseline binocular image acquisition device is given a third weight that is smaller than or equal to a fourth weight given to the pixel depth value of the binocular image from the second-baseline binocular image acquisition device;
wherein the first baseline length is less than or equal to the second baseline length;
the final result weighted by the weight is taken as the depth value of the pixel.
Preferably, the third weight is the same as the first weight, and the fourth weight is the same as the second weight;
the parallax confidence comprises the ratio of the minimum aggregate matching cost to the next-smallest aggregate matching cost in the image stereo matching process;
the acquisition of the current depth fusion value includes,
fusing the parallax confidence coefficient of any pixel I in each group of binocular images to obtain the parallax confidence coefficient of the fused pixel I;
the ratio of the parallax confidence coefficient of the pixel I in each group of binocular images after fusion is used as a first factor;
the depth values of the pixels I in the respective sets of binocular images are weighted with the first factor,
and fusing the depth values of the pixels I in the weighted binocular images to obtain the current depth fusion value.
Preferably, the at least two sets of binocular image acquisition devices comprise two sets of binocular image acquisition devices with two baselines perpendicular to each other;
the view image comprises a first binocular image from a first image acquisition device and a second binocular image from a second image acquisition device and a third image acquisition device, wherein a first baseline of the first image acquisition device and the second image acquisition device is perpendicular to a second baseline of the second image acquisition device and the third image acquisition device;
the difference values of the binocular baseline lengths of the groups comprise a first ratio of the first baseline length to the second baseline length and a second ratio of the second baseline length to the first baseline length;
the image stereo correction of the view image comprises the steps of respectively correcting a first image from a first image acquisition device and a third image from a third image acquisition device to the same plane where a second image from a second image acquisition device is located, taking the corrected first image and second image as a first binocular image, and taking the corrected second image and third image as a second binocular image, so as to obtain two groups of binocular images.
Preferably, the weighting the parallax confidence by weighting the difference between the sets of binocular baseline lengths, including,
when the current depth fusion value is smaller than the depth threshold value, weighting the parallax confidence of any pixel I in the first binocular image by the second ratio,
when the current depth fusion value is greater than or equal to a depth threshold value, weighting parallax confidence of any pixel I in the first binocular image by using the first ratio;
the parallax confidence coefficient of any pixel I in each group of binocular images is fused, and the parallax confidence coefficient obtained after the fusion of the pixel I comprises the steps of accumulating the weighted parallax confidence coefficient of the pixel I in the first binocular image and the weighted parallax confidence coefficient of the pixel in the second binocular image to obtain the parallax confidence coefficient obtained after the fusion of the two groups of binocular images of the pixel;
the proportion of the parallax confidence of the pixel I in each group of binocular images after fusion is taken as a first factor, the depth value of the pixel I in each group of binocular images is weighted by the first factor respectively and comprises,
taking the ratio of the parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
Taking the ratio of the weighted parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
the step of fusing the weighted depth values of the pixels I in each group of binocular images to obtain depth estimation values of the pixels I comprises,
and accumulating the weighted depth value of the pixel I of the second binocular image and the weighted depth value of the pixel I of the first binocular image to obtain a depth estimated value of the pixel I.
Preferably, the weighting the parallax confidence by weighting the difference value of each group of binocular baseline lengths, including,
when the current depth fusion value is smaller than the depth threshold value, weighting the parallax confidence of the pixel I in the second binocular image by using the first ratio;
when the current depth fusion value is greater than or equal to a depth threshold value, weighting the parallax confidence of the pixel I in the second binocular image by using the second ratio;
the parallax confidence coefficient of any pixel I in each group of binocular images is fused, and the parallax confidence coefficient obtained after the fusion of the pixel I comprises the steps of accumulating the weighted parallax confidence coefficient of the pixel I in the second binocular image and the parallax confidence coefficient of the pixel in the first binocular image to obtain the parallax confidence coefficient obtained after the fusion of the two groups of binocular images of the pixel;
The proportion of the parallax confidence of the pixel I in each group of binocular images after fusion is taken as a first factor, the depth value of the pixel I in each group of binocular images is weighted by the first factor respectively and comprises,
taking the ratio of the parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
taking the ratio of the weighted parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
the step of fusing the weighted depth values of the pixels I in each group of binocular images to obtain depth estimation values of the pixels I comprises,
and accumulating the weighted depth value of the pixel I of the second binocular image and the weighted depth value of the pixel I of the first binocular image to obtain a depth estimated value of the pixel I.
Preferably, the first baseline is along a vertical direction, and the second baseline is along a horizontal direction;
The correcting the first image from the first image acquisition device and the third image from the third image acquisition device to the same plane in which the second image from the second image acquisition device lies respectively includes,
correcting the first image in the first binocular image to the plane where the second image is located,
correcting the third image in the second binocular image to the plane of the second image,
and causing: the first image and the second image lines are along the image pixel vertical direction, and the second image and the third image lines are along the image pixel horizontal direction.
Preferably, the obtaining of the current depth fusion value comprises,
accumulating the parallax confidence coefficient of the pixel I in the first binocular image and the parallax confidence coefficient of the pixel I in the second binocular image to obtain fused parallax confidence coefficient;
taking the ratio of the parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
taking the ratio of the parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
And accumulating the depth value of the pixel I of the weighted second binocular image and the depth value of the pixel I of the weighted first binocular image to obtain the current depth fusion value.
Preferably, the stereo matching of the binocular images of the at least two sets of binocular image acquisition devices is performed based on the binocular images of the stereo corrected view images, respectively, including,
based on each group of binocular images after three-dimensional correction, performing pixel-by-pixel traversal of the binocular images in a set parallax search range D, and calculating the matching cost of each pixel under different parallaxes to obtain the matching cost of each pixel in each group of binocular images under different parallaxes;
calculating the aggregation cost of each pixel under different parallaxes according to the set window size based on the matching cost of each pixel under different parallaxes in each group of binocular images, and obtaining the aggregation cost of each pixel under different parallaxes in each group of binocular images;
based on the aggregation cost of each pixel in each binocular image under different parallaxes, determining the corresponding parallaxes according to the minimum aggregation matching cost, and calculating the parallax confidence of the pixel to obtain the parallaxes of each pixel in each binocular image;
Performing parallax optimization based on the parallax of each pixel in each group of binocular images, and obtaining the parallax of each sub-pixel level in each group of binocular images through unitary quadratic curve fitting;
the depth value of each pixel in each set of binocular images is calculated based on the parallax at each sub-pixel level in each binocular set of views, respectively.
The invention also provides a machine vision system, which comprises an image acquisition device, wherein the image acquisition device comprises at least two groups of binocular image acquisition devices sharing the same image acquisition device; the system may further comprise a processor configured to,
a control module for controlling the image acquisition device to acquire images,
a computing module for performing image stereo correction on at least three view images from the image acquisition device;
based on binocular images from the at least two sets of binocular image acquisition devices in the view images after the stereo correction, respectively performing stereo matching on the sets of binocular images to obtain depth values of each pixel in the sets of binocular images;
for any pixel I in each group of binocular images, fusing the depth value of the pixel I in each group of binocular images to obtain the depth estimation value of the pixel I,
wherein I is a natural number not greater than the total number of pixels in the binocular image.
Preferably, the computing module further comprises,
in the stereo matching process, calculating parallax confidence of each pixel in each group of binocular images;
respectively weighting the depth value of any pixel I in each group of binocular images by using the parallax confidence of the pixel I to obtain the weighted depth value of the pixel I in each group of binocular images;
and fusing the depth values weighted by the pixels I in the binocular images to obtain the depth estimation value.
Preferably, the computing module further comprises,
if the depth value of any pixel I in any binocular image in each set of binocular images does not exist, using the depth values of pixels I in other sets of binocular images;
if the depth value of each pixel in each group of binocular images exists, taking the difference value between the binocular base line lengths of each group as a weight, and respectively weighting the parallax confidence so that:
when the current depth fusion value is smaller than the depth threshold value, giving a first weight to any pixel depth value of the binocular image from the first baseline binocular image acquisition device, and giving a second weight to the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
when the current depth fusion value is greater than or equal to the depth threshold, the pixel depth value of the binocular image from the first-baseline binocular image acquisition device is given a first weight that is smaller than a second weight given to the pixel depth value of the binocular image from the second-baseline binocular image acquisition device;
wherein the first baseline length is less than the second baseline length.
Preferably, the calculation module further includes fusing parallax confidence of any pixel I in each group of binocular images to obtain the fused parallax confidence of the pixel I;
the ratio of the parallax confidence coefficient of the pixel I in each group of binocular images after fusion is used as a first factor;
the depth value of the pixel I in each set of binocular images is weighted with the first factor, respectively.
Preferably, the system further comprises,
the transmission module is used for transmitting any pixel depth estimated value data from the calculation module; and/or
Projection means for projecting structured light onto a target to form an optical signal image, and/or
And a three-primary-color image acquisition device to acquire three-primary-color images.
Preferably, the at least two sets of binocular image capturing devices include two sets of binocular image capturing devices with two baselines perpendicular to each other, wherein the baselines of the first image capturing device and the second image capturing device are along a vertical direction, and the baselines of the second image capturing device and the third image capturing device are along a horizontal direction; the first baseline length of the first image acquisition device and the second image acquisition device is less than the second baseline length of the second image device and the third image device;
The projection device is positioned at the center of the optical center connecting line of the first image acquisition device and the third image acquisition device;
the three-primary-color image acquisition device is positioned so that the lines connecting its optical center with the optical centers of the first image acquisition device and of the third image acquisition device form a right angle;
the control module synchronously outputs trigger signals to each image acquisition device and/or projection device.
Preferably, the image acquisition devices of the at least two sets of binocular image acquisition devices are equally spaced or non-equally spaced according to a checkerboard.
The invention also provides a machine vision depth estimation device, which comprises a memory and a processor,
the memory is used for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement any one of the machine vision depth estimation methods described above.
The invention improves the layout of the image acquisition devices used for image collection and performs image stereo matching on at least two groups of binocular views, thereby acquiring more information about the target scene from multiple angles for depth estimation and compensating for defects such as the depth map boundary, object occlusion and/or unfavorable texture directions; the depth values are optimized based on at least two groups of binocular images; further, on the basis of binocular stereo matching, the pixel depth values of the multiple views are fused with weighting, and near points and far points are distinguished and treated differently, so that a more accurate depth value estimation result is obtained.
Drawings
FIG. 1 is a schematic diagram of a camera layout in a triple vision camera system based on projecting active structured light in accordance with the present invention.
Fig. 2 is a flow chart of a depth estimation method of binocular vision according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a process of correcting an image of a first camera and an image of a third camera to a plane in which an image of a second camera is located, respectively.
Fig. 4 is a schematic diagram of corrected first and second camera image pixels (first binocular image).
Fig. 5 is a schematic diagram of the aggregate cost value searched with the parallax search range D as granularity for any one pixel I.
FIG. 6 is a schematic diagram of a secondary curve fitting sub-pixel position calculation.
FIG. 7 is a schematic diagram of electrical connection of a three-view depth camera system according to the present invention.
Fig. 8 is a schematic diagram of a multi-group binocular camera layout.
Detailed Description
In order to make the objects, technical means and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings.
The invention provides a method for acquiring optical information images through a plurality of groups of binocular cameras, fusing a plurality of frames in space and carrying out depth recovery, so that a depth map is more complete, accurate and dense, and the actual field of view range is enlarged.
Referring to fig. 1, fig. 1 is a schematic diagram of a camera layout in a triple vision camera system based on projecting active structured light according to an embodiment of the present invention.
In this embodiment, to reduce the number of cameras and the complexity of image processing as much as possible, the vision camera system includes at least three cameras, where the first camera and the second camera form a first binocular group for visual stereo matching, and the second camera and the third camera form another binocular group for visual stereo matching, so the system contains two binocular groups that have one common camera (the second camera in fig. 1); at the same time, the three cameras jointly perform visual stereo matching. To avoid view-angle occlusion, the first baseline (the line between the optical centers of the two cameras) formed by the first binocular group and the second baseline formed by the second binocular group are perpendicular to each other, so that the lines connecting the three cameras form a right triangle; equivalently, the three cameras lie on any three vertices of a rectangle. The lengths of the first baseline and the second baseline can be adjusted according to actual requirements; for example, the baseline between the first camera and the second camera can be made shorter, so that the depth map is optimized for closer distances and the detail of the close-range depth map is increased. The three cameras may be grayscale cameras or infrared cameras.
The projector that projects the active structured light is located on the hypotenuse of the right triangle; preferably, so that all three cameras have a relatively large speckle observation range, the projector is located midway between the two cameras on the hypotenuse, that is, at the midpoint of the line connecting the first camera and the third camera.
The camera system may further include a three-primary-color camera (RGB camera), whose position is arranged according to actual needs; preferably, to facilitate camera calibration, the RGB camera is arranged on the remaining vertex of the rectangle so as to be symmetric with one of the three cameras, that is, the three-primary-color camera is located where the lines connecting its optical center with the optical centers of the first camera and of the third camera form a right angle.
When the four cameras collect the optical-signal images of the structured light, they collect image data at the same moment.
For ease of understanding the following examples, the key basic concepts involved in the present application are described below.
Parallax (disparity): in a binocular image pair, the distance (in pixels) between the pixel coordinates of two matching pixels.
Image stereo correction: also called stereo rectification; two images that are not actually in coplanar row alignment are corrected into coplanar row alignment, so that the search for matching pixels changes from a two-dimensional search to a one-dimensional search and matching becomes more efficient. Coplanar row alignment means that the image planes of the two cameras lie in the same plane and the same spatial point projects onto the same row of the two pixel coordinate systems; the alignment is performed mathematically rather than physically, and it is typically performed before stereo matching of real scene images.
Image stereo matching: binocular stereo matching typically includes matching cost computation, aggregation cost, disparity computation, and disparity optimization. Wherein,
the purpose of matching cost calculation is to measure the correlation between a pixel to be matched and a candidate pixel. Whether or not two pixels are corresponding points, their matching cost can be calculated with the matching cost function; the smaller the cost, the greater the correlation and the greater the probability that they are corresponding points.
The aggregation cost is to build a link between adjacent pixels, and optimize the cost matrix according to a certain criterion, such as that adjacent pixels should have continuous disparity values, so as to enable the cost value to accurately reflect the correlation between pixels. The optimization is often global, and the new cost value of each pixel under a certain parallax is recalculated according to the cost values of the adjacent pixels under the same parallax value or near-parallax value.
The parallax calculation is to determine the optimal parallax value of each pixel through a cost matrix after the cost is aggregated, namely, the parallax corresponding to the minimum cost value is selected as the optimal parallax from the cost values of all parallaxes of a certain pixel.
The purpose of parallax optimization is to further optimize the parallax map obtained in the previous step and improve its quality, including steps such as eliminating erroneous parallaxes, appropriate smoothing and sub-pixel refinement. A Left-Right consistency Check (Left-Right Check) is generally adopted to eliminate erroneous parallaxes caused by occlusion and noise; a small-connected-region elimination algorithm is adopted to remove isolated outliers; smoothing algorithms such as Median filtering (Median Filter) and Bilateral filtering (Bilateral Filter) are adopted to smooth the parallax map; in addition, methods such as robust plane fitting (Robust Plane Fitting), luminance consistency constraint (Intensity Consistent) and local consistency constraint (Locally Consistent) are also commonly used to effectively improve the quality of the parallax map.
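Two of the optimization steps named above (the left-right consistency check and median smoothing) can be illustrated with a small Python sketch. This is not taken from the patent; the function names and the tolerance value are illustrative, and the consistency check assumes a horizontal baseline.

```python
import numpy as np
import cv2

def left_right_check(disp_left, disp_right, tol=1.0):
    """Invalidate pixels whose left- and right-view disparities disagree by more than tol."""
    h, w = disp_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Location of each left-image pixel in the right image (horizontal-baseline assumption).
    xr = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    diff = np.abs(disp_left - disp_right[ys, xr])
    out = disp_left.astype(np.float32)
    out[diff > tol] = np.nan      # occlusions and mismatches become invalid
    return out

def smooth_disparity(disp, ksize=5):
    """Median-filter the disparity map; invalid pixels are treated as 0 during filtering."""
    filled = np.nan_to_num(disp, nan=0.0).astype(np.float32)
    return cv2.medianBlur(filled, ksize)
```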
Referring to fig. 2, fig. 2 is a schematic flow chart of a depth estimation method of binocular vision according to an embodiment of the present invention.
Step 201, after calibrating the internal parameters and external parameters of the three cameras respectively (forming parallel optical axes and parallel polar lines of the two dual-purpose cameras), acquiring three-view images;
stereo correction is performed on the collected three-view images so as to correct the images of the first camera and the third camera, respectively, to the same plane in which the image of the second camera lies, obtaining a stereo-corrected first binocular image and second binocular image. In one embodiment, in the first binocular image, the first camera image is corrected to the plane of the second camera image; in the second binocular image, the third camera image is corrected to the plane of the second camera image;
referring to fig. 3, fig. 3 is a schematic diagram illustrating a process of correcting an image of a first camera and an image of a third camera to a plane in which an image of a second camera is located, respectively. Wherein o1, o2, o3 are respectively three camera optical centers, the polar plane p-o1-o2, e1 is the imaging of the optical center o2, e2 is the imaging of the optical center o1, the polar plane p-o3-o2, e3 is the imaging of the optical center o2, e2 is the imaging of the optical center o3, p1, p2, p3 are respectively the imaging of the space point p in three views, p, o1, o2 forms the polar plane of the imaging of the first binocular field, p, o2, o3 forms the polar plane of the imaging of the second binocular field, and the intersecting lines p1e1, p2e2, p3e3 of the imaging planes and the polar plane are respectively the polar lines of each imaging image. The pixel coordinate origin of each imaging image is positioned at the upper left corner of the imaging image, the x direction is the horizontal direction of the image, and the y direction is the vertical direction of the image, such as the x and y coordinate systems marked in the figure.
In the first binocular image, the first camera image 301 is corrected to the plane in which the second camera image lies by a first rotation transformation of the coordinates of the first camera image 301, obtaining a first imaging plane 303; the y coordinates of the pixels are unchanged before and after correction while the x coordinates change, and for distinction the corrected pixel abscissa is denoted by v. Similarly, in the second binocular image, the third camera image 304 is corrected to the plane of the second camera image by a second rotation transformation of the coordinates of the third camera image 304, obtaining a second imaging plane 305; the y coordinates of the pixels are unchanged before and after correction while the x coordinates change, and the corrected pixel abscissa is denoted by u. The second image together with the corrected first image is used as the first binocular image, and the second image together with the corrected third image is used as the second binocular image, so that two groups of binocular images are obtained.
After correction, the image lines of the first camera and the second camera are along the vertical direction of the image pixels, and the lines of the second camera and the third camera are along the horizontal direction of the image pixels.
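As a rough illustration of this correction step, the following Python sketch uses OpenCV's standard pairwise rectification for the two camera pairs that share the second camera. Note that cv2.stereoRectify rotates both images of a pair, whereas the embodiment keeps the second image fixed and rotates only the first and third images, so this is only an approximation; the calibration inputs (K1..K3, D1..D3, R12, t12, R23, t23) are assumed to be available from the calibration mentioned in step 201, and all names are illustrative.

```python
import cv2

def rectify_pair(K_a, D_a, K_b, D_b, R_ab, t_ab, image_size):
    """Compute rectification maps for one binocular pair (camera b is the shared second camera)."""
    R_a, R_b, P_a, P_b, Q, _, _ = cv2.stereoRectify(
        K_a, D_a, K_b, D_b, image_size, R_ab, t_ab, flags=cv2.CALIB_ZERO_DISPARITY)
    map_a = cv2.initUndistortRectifyMap(K_a, D_a, R_a, P_a, image_size, cv2.CV_32FC1)
    map_b = cv2.initUndistortRectifyMap(K_b, D_b, R_b, P_b, image_size, cv2.CV_32FC1)
    return map_a, map_b, Q

# First (vertical-baseline) pair: camera 1 with camera 2; second (horizontal-baseline) pair: camera 2 with camera 3.
# maps1, maps2a, Q12 = rectify_pair(K1, D1, K2, D2, R12, t12, (w, h))
# maps3, maps2b, Q23 = rectify_pair(K3, D3, K2, D2, R23, t23, (w, h))
# img1_rect = cv2.remap(img1, *maps1, interpolation=cv2.INTER_LINEAR)
```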
In step 202, because of occlusions in the views and the difference in the common-view areas of the first binocular pair and the second binocular pair, simply fusing the matching costs of the first binocular view and the second binocular view together would not produce a better result in the areas where holes are likely to appear. Therefore, in the embodiment of the present invention, stereo matching is performed with the two groups of binocular images separately, that is, stereo matching is performed for the first binocular image and for the second binocular image respectively, where the first binocular image and the second binocular image are the view images after stereo correction.
The stereo matching process of the binocular image is as follows:
step 2021, performing pixel-by-pixel traversal of the binocular image with the set parallax search range D based on the first binocular image and the second binocular image after stereo correction, respectively, and calculating a matching cost of each pixel under different parallaxes:
The view images are traversed pixel by pixel within the set parallax search range D, and the matching cost of each pixel at each parallax is calculated. The matching cost may be the absolute difference of gray values (AD), the sum of absolute differences over a neighborhood of the pixel to be matched (SAD, Sum of Absolute Differences), or another matching cost such as the Normalized Correlation Coefficient (NCC) or the Sum of Squared Differences (SSD). In this embodiment, SAD is taken as the example.
Referring to fig. 4, fig. 4 is a schematic diagram of the corrected first camera image and second camera image pixels (the first binocular image), where the black dot represents the pixel I to be matched and q denotes the pixels in its a×b neighborhood (the diagonally filled portion in the figure). For any pixel I(i, j), the SAD matching cost at each parallax d_12 is

C_1(i, j, d_12) = Σ_{q ∈ a×b neighborhood of (i, j)} | p_2(q) − p_1(q + d_12) |,

where p_2 is the pixel value of the second camera image, p_1 is the pixel value of the first camera image, i and j are the coordinates of the pixel I, d_12 is a candidate parallax within the search range, and q + d_12 denotes shifting q by d_12 along the rectified baseline direction of the first binocular pair. The SAD matching costs form a W×H×D three-dimensional matrix, where W is the image width, H is the image height and D is the parallax search range; mapping the matching cost values into the three-dimensional rectangular coordinate system (i, j, d_12) gives what is generally called the disparity space image (DSI).
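A minimal Python sketch of this SAD cost-volume computation follows; it is illustrative only, a square win×win neighborhood stands in for the a×b window, and the parallax offset is applied along the rectified baseline axis (vertical for the first pair, horizontal for the second).

```python
import numpy as np
import cv2

def sad_cost_volume(img_ref, img_other, max_disp, win=7, vertical=True):
    """H x W x D volume of SAD matching costs between the reference (second-camera) image
    and the other image of the pair, shifted along the baseline axis by each candidate disparity."""
    ref = img_ref.astype(np.float32)
    other = img_other.astype(np.float32)
    h, w = ref.shape
    cost = np.empty((h, w, max_disp), dtype=np.float32)
    for d in range(max_disp):
        shifted = np.full_like(other, 1e4)      # large penalty outside the common-view area
        if vertical:                            # first pair: vertical baseline
            shifted[:h - d, :] = other[d:, :]
        else:                                   # second pair: horizontal baseline
            shifted[:, :w - d] = other[:, d:]
        ad = np.abs(ref - shifted)
        # Box sum of absolute differences over the neighborhood window = SAD cost per pixel.
        cost[:, :, d] = cv2.boxFilter(ad, -1, (win, win), normalize=False)
    return cost
```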
Step 2022, based on the matching cost of each pixel at each parallax in the stereo-corrected first binocular image and second binocular image respectively, calculating the aggregated cost of each pixel at each parallax according to the set window size:
The SAD matching cost calculated in the previous step only considers local information and computes the cost value from pixel information within a neighborhood of a certain size, so it is easily affected by image noise; when the image lies in a weakly textured or repetitively textured region, the cost value is very likely not to reflect the correlation between pixels accurately, the direct manifestation being that the cost value of the true corresponding point is not the minimum.
The aggregated cost is to build a link between adjacent pixels, and optimize the cost matrix by a certain criterion, such as that adjacent pixels should have continuous parallax values, the optimization is often global, and the aggregated cost value of each pixel under a certain parallax value is recalculated according to the cost value of the adjacent pixels under the same parallax value or near-parallax value, so as to obtain the DSI of the aggregated cost.
In practice, the aggregation cost is similar to a parallax propagation step, the region with high signal-to-noise ratio is good in matching effect, the initial cost can well reflect the correlation, the optimal parallax value can be obtained more accurately, the aggregation cost is propagated to the region with low signal-to-noise ratio and poor matching effect, and finally, the cost values of all images can accurately reflect the true correlation.
In this embodiment, with the set window size win×win (the gray part in the figure), the aggregated cost of any pixel I, matched block-wise with the window, is

C_A1(i, j, d_12) = Σ_{(u, v) ∈ win×win window containing (i, j)} C_1(u, v, d_12),

where u and v are the coordinates of any pixel in the window containing the pixel I and C_1(u, v, d_12) is the matching cost of that pixel at parallax d_12; the aggregated costs at each parallax form the DSI of the aggregated cost.
Similarly, for any pixel I of the second camera image and the corrected third camera image (the second binocular image), the matching cost at each parallax d_23 is

C_2(i, j, d_23) = Σ_{q ∈ a×b neighborhood of (i, j)} | p_2(q) − p_3(q + d_23) |,

where d_23 is the parallax of a matching point between the second camera image and the corrected third camera image, p_3 is the pixel value of the third camera image, and q + d_23 denotes shifting q by d_23 along the rectified baseline direction of the second binocular pair.

The aggregated cost, matched block-wise with the window size win×win, is then

C_A2(i, j, d_23) = Σ_{(u, v) ∈ win×win window containing (i, j)} C_2(u, v, d_23).
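The win×win block aggregation can be sketched in the same style. This is a simple local box aggregation, one of several aggregation criteria the description allows; the code is illustrative and not the patent's implementation.

```python
import numpy as np
import cv2

def aggregate_cost_volume(cost, win=9):
    """Sum the per-pixel matching costs over a win x win window at every candidate disparity."""
    agg = np.empty_like(cost)
    for d in range(cost.shape[2]):
        slice_d = np.ascontiguousarray(cost[:, :, d])
        agg[:, :, d] = cv2.boxFilter(slice_d, -1, (win, win), normalize=False)
    return agg
```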
Through the aggregation-cost calculation, the parallax map of the first binocular pair and the parallax map of the second binocular pair are obtained respectively, where the first binocular parallax map is referenced to its right view (the second camera) and the second binocular parallax map is referenced to its left view (the second camera).
Step 2023, determining the parallax of each pixel and the confidence thereof based on the aggregate matching cost of each pixel in the first binocular image and the second binocular image after stereo correction under different parallaxes.
For any disparity map:
comparing each aggregation matching cost of each pixel, and taking the parallax corresponding to the minimum aggregation cost as the parallax value of the pixel; referring to fig. 5, fig. 5 is a schematic diagram of an aggregate cost value searched with the parallax search range D as granularity for any pixel I, where the parallax corresponding to the minimum aggregate cost value is taken as the parallax value of the pixel.
The parallax confidence of the pixel is calculated, i.e., the ratio of the next smallest aggregate cost to the smallest aggregate cost for the pixel is calculated.
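A small sketch of this winner-takes-all parallax selection and confidence measure, with the confidence taken as the ratio of the second-smallest to the smallest aggregated cost (illustrative code):

```python
import numpy as np

def disparity_and_confidence(agg):
    """Winner-takes-all disparity and its confidence from an H x W x D aggregated-cost volume."""
    disp = np.argmin(agg, axis=2)                 # disparity with the minimum aggregated cost
    part = np.partition(agg, 1, axis=2)           # two smallest aggregated costs per pixel
    c_min, c_second = part[:, :, 0], part[:, :, 1]
    conf = c_second / np.maximum(c_min, 1e-6)     # >= 1; larger means a more distinctive match
    return disp, conf
```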
In step 2024, parallax optimization is performed on the parallax of each pixel in the stereo-corrected first binocular image and second binocular image to obtain sub-pixel-level parallaxes.
Parallax optimization is performed for each pixel: specifically, the uniqueness test eliminates cases where multiple pixels match one pixel, and bidirectional matching between the two parallax maps of a pair eliminates inconsistent matches, which are generally caused by occlusion. After these outliers are removed, no further parallax filtering, hole filling or other operations are performed.
Because the parallax value obtained from the minimum aggregated cost has only the granularity of the parallax search range D (integer-pixel precision), the parallax value can be further refined to obtain higher precision. A common refinement method is one-dimensional quadratic curve fitting: a quadratic curve is fitted through the parallax determined by the minimum aggregated cost and the cost values at the parallaxes to its left and right, and the parallax at the minimum point of the quadratic curve is taken as the sub-pixel-level parallax. Referring to fig. 6, fig. 6 is a schematic diagram of the sub-pixel position calculation by quadratic curve fitting.
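The quadratic (parabola) refinement can be sketched as follows: the minimum of the parabola fitted through the costs at d-1, d and d+1 gives the sub-pixel parallax (illustrative code).

```python
import numpy as np

def subpixel_refine(agg, disp):
    """Refine integer disparities by fitting a parabola through the costs at d-1, d, d+1."""
    h, w, D = agg.shape
    ii, jj = np.mgrid[0:h, 0:w]
    d = np.clip(disp, 1, D - 2)                   # both neighbours must exist
    c0, c1, c2 = agg[ii, jj, d - 1], agg[ii, jj, d], agg[ii, jj, d + 1]
    denom = c0 - 2.0 * c1 + c2
    offset = np.where(np.abs(denom) > 1e-6, 0.5 * (c0 - c2) / denom, 0.0)
    return d + offset                             # disparity at sub-pixel precision
```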
Step 2025, calculating a depth value of each pixel based on the parallax at each sub-pixel level of the stereo corrected first binocular image and the second binocular image, respectively.
Because the baseline of the first binocular pair may not be the same as the baseline of the second binocular pair, the fusion cannot be performed directly on the parallax maps; the parallax values are therefore converted to depth values, and the fusion is performed on the depth values.
For each pixel I, the depth value is calculated from its parallax value by

z(i, j) = f · b / d(i, j),

where f is the focal length, b is the baseline length of the binocular camera pair, d(i, j) is the parallax value of pixel (i, j) and z(i, j) is the depth value of pixel (i, j). Mapping the depth values of all pixels into the rectangular coordinate system formed by i and j gives a depth map, so a binocular depth map is obtained for each pair, namely the first binocular depth map and the second binocular depth map.
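A short sketch of this parallax-to-depth conversion, with invalid (zero) parallaxes left as holes; the focal length in pixels and the baseline length are assumed inputs (illustrative code):

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline):
    """z = f * b / d for every pixel; zero or negative disparities become NaN holes."""
    disp = np.asarray(disp, dtype=np.float32)
    depth = np.full(disp.shape, np.nan, dtype=np.float32)
    valid = disp > 0
    depth[valid] = focal_px * baseline / disp[valid]
    return depth
```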
Through the above steps 2021 to 2025, the first binocular and the second binocular stereo matching are completed.
Step 203, fusion of the depth map is performed based on the stereo matching result of the first binocular image and the stereo matching result of the second binocular image.
Stereo matching produces the first binocular depth map with its confidences and the second binocular depth map with its confidences; for each pixel in the binocular images:
when one depth map has a depth value for the pixel and the other depth map has no depth value there, i.e. has a hole, the existing depth value and its confidence are used directly in the depth fusion;
when both depth maps have depth values for the pixel, depth fusion is performed: if the current depth fusion value is smaller than the depth threshold, the first weight given to the first pixel depth value from the short-baseline camera pair is larger than the second weight given to the second pixel depth value from the long-baseline camera pair, and the weighted depth fusion value is obtained as the final depth estimate, since the short-baseline pair observes near objects better; if the current depth fusion value is greater than or equal to the depth threshold, the fourth weight of the second pixel depth value from the long-baseline pair is larger than the third weight of the first pixel depth value from the short-baseline pair, and the weighted depth fusion value is obtained as the final depth estimate; preferably, the third weight is the same as the first weight and the fourth weight is the same as the second weight. The first pixel and the second pixel have the same coordinates in the pixel coordinate system or the world coordinate system.
Specifically, for any pixel:
weighting the pixel depth value by using the confidence coefficient of the pixel, for example, fusing the confidence coefficient of the pixel I in the first binocular image and the confidence coefficient of the pixel I in the second binocular image, taking the proportion of the confidence coefficient after fusing as a first factor, and weighting the depth value of the pixel corresponding to the confidence coefficient by using the first factor;
further, a distance factor is introduced in consideration of scale differences caused by different base line lengths, the distance factor is taken as a weight, and any parallax confidence in the two groups of binocular images is weighted by the weight, so that: when the current depth fusion value is less than the depth threshold, the first weight assigned to the pixel depth value of the short baseline camera is greater than the second weight assigned to the pixel depth value of the long baseline camera; when the current depth fusion value is greater than or equal to the depth threshold value, the second weight of the pixel depth value given to the long baseline camera is greater than the first weight of the pixel depth value given to the short baseline camera;
and fusing the weighted pixel depth values of the binocular images to obtain a weighted depth fusion value serving as a final depth estimation value of the pixel.
In one embodiment, the depth value estimate of the pixel I(i, j) is calculated as

z(i, j) = ( w · α_12 · z_12 + α_23 · z_23 ) / ( w · α_12 + α_23 ),

where w = b_23 / b_12 when the current depth fusion value z_cur(i, j) is smaller than the set depth threshold, and w = b_12 / b_23 otherwise.

Here z(i, j) is the final depth estimate of the pixel I(i, j) after depth fusion; z_23, α_23 are the depth value and confidence of the image pixel I(i, j) of the second camera and the third camera (the second binocular pair); z_12, α_12 are the depth value and confidence of the image pixel I(i, j) of the first camera and the second camera (the first binocular pair); b_23 and b_12 are the baseline lengths of the second and third cameras and of the first and second cameras respectively, with b_23 ≥ b_12 in the camera layout of fig. 1. The difference between the two baseline lengths serves as the weight; in this embodiment this difference is expressed as a ratio, while in practical applications it may also be the absolute value of the difference of the two baseline lengths or another value quantifying the difference. In this embodiment the current depth fusion value is computed as

z_cur(i, j) = ( α_12 · z_12 + α_23 · z_23 ) / ( α_12 + α_23 ),

which helps to further improve the accuracy of the depth estimate; in practice it may be calculated in other ways, for example as the average of the depth value from the first binocular image and the depth value from the second binocular image.

From the above calculation formula it can be seen that:

when the current depth fusion value z_cur(i, j) is smaller than the set depth threshold, the weight is w = b_23/b_12, so the first weight applied to the parallax confidence of the short-baseline binocular image pixel is greater than the second weight applied to the parallax confidence of the long-baseline binocular image pixel (the second weight is 1 in the above formula); thus, when the current depth fusion value is smaller than the set depth threshold, the depth value of the short-baseline camera image pixel is given the greater weight, making full use of the short-baseline camera's better observation of near objects. The second weight may also be set to another value smaller than 1, as long as the first weight is greater than the second weight.

When the current depth fusion value z_cur(i, j) is greater than or equal to the set depth threshold, the weight is w = b_12/b_23, so the first weight applied to the parallax confidence of the short-baseline binocular image pixel is less than or equal to the second weight of the long-baseline binocular image pixel (1 in the above formula); thus, when the current depth fusion value is greater than or equal to the set depth threshold, the depth value of the long-baseline camera image pixel is given the greater weight. The second weight may also be set to another constant value greater than 1, as long as the first weight is less than or equal to the second weight.
In another specific embodiment of the depth value estimation formula for the pixel I(i, j), when the current depth fusion value is smaller than the depth threshold the second weight may instead be reduced so that the first weight is larger than the second weight; the specific calculation formula may be

z(i, j) = ( α_12 · z_12 + w' · α_23 · z_23 ) / ( α_12 + w' · α_23 ),

where w' = b_12 / b_23 when the current depth fusion value z_cur(i, j) is smaller than the set depth threshold, and w' = b_23 / b_12 otherwise.

As can be seen from this formula:

when the current depth fusion value z_cur(i, j) is smaller than the set depth threshold, the weight is w' = b_12/b_23, so the first weight of the parallax confidence of the short-baseline binocular image pixel (1 in the above formula) is greater than the second weight b_12/b_23 of the parallax confidence of the long-baseline binocular image pixel; thus, when the current depth fusion value is smaller than the set depth threshold, the depth value of the short-baseline camera image pixel is given the greater weight, making full use of the short-baseline camera's better observation of near objects. The first weight may also be set to another constant value greater than 1, as long as the first weight is greater than or equal to the second weight.

When the current depth fusion value z_cur(i, j) is greater than or equal to the set depth threshold, the weight is w' = b_23/b_12, so the first weight of the parallax confidence of the short-baseline binocular image pixel (1 in the above formula) is less than or equal to the second weight b_23/b_12 of the long-baseline binocular image pixel; thus, when the current depth fusion value is greater than or equal to the set depth threshold, the depth value of the long-baseline camera image pixel is given the greater weight. The first weight may also be set to another constant value smaller than 1, as long as the first weight is less than or equal to the second weight.
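The fusion rule can be sketched as follows, following the first weighting variant reconstructed above (the short-baseline confidence is scaled by b_23/b_12 for near points and by b_12/b_23 for far points), with the hole-handling branch keeping the only available depth value. Illustrative code under the stated assumptions, not the patent's implementation.

```python
import numpy as np

def fuse_depths(z12, a12, z23, a23, b12, b23, depth_thresh):
    """Fuse the short-baseline (z12, a12) and long-baseline (z23, a23) depth maps pixel-wise."""
    z12 = np.nan_to_num(z12, nan=0.0)
    z23 = np.nan_to_num(z23, nan=0.0)
    only12 = (z12 > 0) & (z23 <= 0)        # hole in the long-baseline map: keep the other value
    only23 = (z23 > 0) & (z12 <= 0)
    both = (z12 > 0) & (z23 > 0)

    # Unweighted current fusion value, used only to decide between near and far weighting.
    z_cur = np.zeros_like(z12)
    z_cur[both] = (a12[both] * z12[both] + a23[both] * z23[both]) / (a12[both] + a23[both])

    # Scale the short-baseline confidence: favour it for near points, the long baseline for far points.
    w = np.where(z_cur < depth_thresh, b23 / b12, b12 / b23)
    fused = np.full(z12.shape, np.nan, dtype=np.float32)
    fused[only12] = z12[only12]
    fused[only23] = z23[only23]
    num = w * a12 * z12 + a23 * z23
    den = w * a12 + a23
    fused[both] = num[both] / den[both]
    return fused
```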
Through the weighted fusion of the depth values of the pixels, a weighted fused depth map is obtained. Holes in the weighted fused depth map are filled with data, for example by taking the minimum, maximum or median value within a neighborhood of each pixel; different filling modes are chosen for different applications, or hole filling is omitted according to actual requirements. When an RGB image is collected at the same time, the depth estimate data can be aligned directly to the RGB image pixel data through the extrinsic parameters calibrated in advance between the second camera and the RGB camera, giving an RGBD point cloud.
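One of the hole-filling options mentioned above (taking the median of the valid depths in a small neighborhood) could look like this sketch; the neighborhood radius is an assumed parameter and the code is illustrative only.

```python
import numpy as np

def fill_holes_median(depth, radius=2):
    """Fill NaN holes with the median of the valid depths in a (2*radius+1)^2 neighborhood."""
    out = depth.copy()
    for i, j in np.argwhere(np.isnan(depth)):
        patch = depth[max(0, i - radius):i + radius + 1, max(0, j - radius):j + radius + 1]
        vals = patch[~np.isnan(patch)]
        if vals.size:
            out[i, j] = np.median(vals)
    return out
```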
In this embodiment, based on the images acquired in the horizontal and the vertical direction by two groups of binocular cameras sharing a common camera, the depth values obtained by stereo matching in the horizontal direction and in the vertical direction are fused with differentiated weighting, so that the observations in the two directions compensate each other for the holes caused by occlusion, fill the holes present in conventional stereo matching, and increase the completeness, density and accuracy of the depth map, which then spans the entire image field of view.
Referring to fig. 7, fig. 7 is a schematic diagram of the electrical connections of the three-view depth camera system of the present invention. The system includes a processor, which comprises:
The control module is used for controlling the projection device to project the structured light and controlling each camera to synchronously trigger image acquisition so as to acquire structured light image data at the same moment;
the computing module, which processes the images acquired by each camera according to the trinocular vision depth estimation method of the above embodiment to obtain the depth estimate of each pixel of the image;
and the transmission module is used for packaging and transmitting the calculated pixel depth estimated value data to an external device using the data, such as a PC terminal.
The invention also provides a machine vision depth estimation device, which comprises a memory and a processor,
the memory is used for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the depth estimation method for trinocular vision according to the embodiment.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the invention also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program realizes the following steps when executed by a processor:
Acquiring at least three view images, the view images being acquired by at least two sets of binocular image acquisition devices sharing the same image acquisition device, respectively;
carrying out image stereo correction on the view image;
based on the binocular images from the at least two sets of binocular image acquisition devices in the stereo-rectified view images, respectively performing stereo matching on each group of binocular images to obtain the depth value of each pixel in each group of binocular images and the parallax confidence of each pixel;
weighting the depth value of any pixel in each group of binocular images by the parallax confidence of that pixel, to obtain the weighted depth value of the pixel in each group of binocular images;
and fusing the weighted depth values of the pixel across the groups of binocular images to obtain the depth estimate of the pixel (one possible form of this fusion is sketched below).
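As a non-authoritative sketch of the confidence-weighted fusion step listed above, the two-group case could look as follows in Python/NumPy. The array names, the zero-as-missing convention and the restriction to two groups are assumptions of this sketch; the method itself covers any number of binocular groups.

    import numpy as np

    def fuse_depths(depth_a, conf_a, depth_b, conf_b):
        """Fuse two per-pixel depth maps using parallax confidences as weights.

        A value of 0 is treated as 'no depth'. Where only one group is valid,
        its depth is used directly; where both are valid, a confidence-weighted
        average is returned.
        """
        a_ok = depth_a > 0
        b_ok = depth_b > 0
        both = a_ok & b_ok

        fused = np.zeros_like(depth_a, dtype=np.float64)
        fused[a_ok & ~both] = depth_a[a_ok & ~both]   # only group A observed the pixel
        fused[b_ok & ~both] = depth_b[b_ok & ~both]   # only group B observed the pixel

        w = conf_a[both] + conf_b[both] + 1e-12       # avoid division by zero
        fused[both] = (conf_a[both] * depth_a[both]
                       + conf_b[both] * depth_b[both]) / w
        return fused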
For the apparatus, network-side device and storage medium embodiments, since they are substantially similar to the method embodiment, the description is relatively brief; for the relevant details, reference may be made to the description of the method embodiment.
According to the embodiment of the invention, three cameras arranged at right angles are used together with laser speckle for depth estimation. This scheme overcomes many shortcomings of prior-art binocular-plus-speckle schemes: it can optimize against the various causes of holes in binocular laser-speckle depth maps, fills the holes in the depth map, and yields a depth map that is more complete, dense and accurate and has a larger field of view, so that the depth camera can satisfy more application scenarios. The depth camera can be applied to robot navigation and obstacle avoidance, parcel volume measurement, human-machine interaction and other occasions, and can improve performance wherever a binocular speckle machine vision system can currently be used.
It should be noted that the embodiments of the depth estimation method based on trinocular vision with projected active structured light provided by the present invention are not limited to the above embodiment. For example, referring to fig. 8, fig. 8 shows a schematic layout of multiple groups of binocular cameras. The cameras are distributed at equal or unequal intervals on a checkerboard pattern to form a camera array, so that the baselines of each group of binocular cameras are mutually perpendicular; the baseline lengths of the binocular groups may be equal or unequal, and the projector is located at the center of the camera array, thereby forming depth estimation from multiple groups of binocular views. Other methods that help improve parallax accuracy or processing efficiency may be adopted in the image stereo rectification and stereo matching steps, and the weighting strategy used in depth value fusion may be adjusted according to the actual situation.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of the preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall be covered by its scope.

Claims (19)

1. A machine vision depth estimation method, characterized in that the method comprises,
acquiring at least three view images, the view images being acquired by at least two sets of binocular image acquisition devices sharing the same image acquisition device, respectively;
carrying out image stereo correction on the view image;
based on binocular images from the at least two sets of binocular image acquisition devices in the view images after the stereo correction, respectively performing stereo matching on the sets of binocular images to obtain depth values of each pixel in the sets of binocular images;
for any pixel I in each group of binocular images, fusing the depth values of the pixel I in all the groups of binocular images to obtain the depth estimation value of the pixel I,
wherein,
i is a natural number not greater than the total number of pixels in the binocular images, and the pixels I in each group of binocular images have the same coordinates in a pixel coordinate system or a world coordinate system;
the depth value of the pixel I is determined as follows:
If the depth value of any pixel I in any binocular image in each set of binocular images does not exist, using the depth values of pixels I in other sets of binocular images;
if the depth value of each pixel in each group of binocular images exists, taking the difference value between the binocular base line lengths of each group as a weight, and respectively weighting the parallax confidence so that:
when the current depth fusion value is smaller than the depth threshold value, giving a first weight to any pixel depth value of the binocular image from the first baseline binocular image acquisition device, and giving a second weight to the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
when the current depth fusion value is greater than or equal to a depth threshold value, a third weight of the pixel depth value of the binocular image from the first baseline binocular image acquisition device is given and is smaller than or equal to a fourth weight of the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
wherein the first baseline length is less than or equal to the second baseline length;
the final result weighted by the weight is taken as the depth value of the pixel.
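Purely as an illustrative sketch of the weighting rule in claim 1, read together with the baseline ratios of claims 6 and 7: the concrete weight values below are assumptions, one admissible choice among many that satisfy the stated constraint that, at or beyond the depth threshold, the short-baseline weight does not exceed the long-baseline weight.

    def baseline_weights(current_fusion_value, depth_threshold, b1, b2):
        """Choose weights for the short-baseline (b1) and long-baseline (b2)
        binocular groups, with b1 <= b2, following the threshold rule.

        The ratio-based values mirror the first/second ratios of claims 6-7,
        but they are only one possible choice consistent with the claim.
        """
        if current_fusion_value < depth_threshold:
            # near range: emphasise the short-baseline observation
            w_short, w_long = b2 / b1, 1.0
        else:
            # far range: the short-baseline weight must not exceed the
            # long-baseline weight
            w_short, w_long = b1 / b2, 1.0
        return w_short, w_long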
2. The method of claim 1, wherein the separately performing stereo matching of the sets of binocular images further comprises calculating a parallax confidence for each pixel in the sets of binocular images during stereo matching;
The depth values of the pixels I in all the binocular images of each group are fused for any pixel I in each binocular image of each group to obtain the depth estimation value of the pixel I, which comprises,
respectively weighting the depth value of any pixel I in each group of binocular images by using the parallax confidence of the pixel I to obtain the weighted depth value of the pixel I in each group of binocular images;
and fusing the depth values weighted by the pixels I in the binocular images to obtain the depth estimation value.
3. The method of claim 2, wherein, for each pixel I in each group of binocular images, weighting the depth value of the pixel I by the parallax confidence of the pixel I to obtain the weighted depth value of the pixel I in each group of binocular images, respectively, including,
fusing the parallax confidence coefficient of any pixel I in each group of binocular images to obtain the parallax confidence coefficient of the fused pixel I;
the ratio of the parallax confidence of the pixel I in each group of binocular images to the fused parallax confidence is used as a first factor;
the depth value of the pixel I in each set of binocular images is weighted with the first factor, respectively.
4. The method of claim 1, wherein the third weight is the same as the first weight and the fourth weight is the same as the second weight;
the parallax confidence comprises the ratio of the minimum aggregate matching cost to the next-smallest aggregate matching cost in the image stereo matching process.
5. The method of claim 1, wherein the obtaining of the current depth fusion value comprises,
fusing the parallax confidence coefficient of any pixel I in each group of binocular images to obtain the parallax confidence coefficient of the fused pixel I;
the ratio of the parallax confidence of the pixel I in each group of binocular images to the fused parallax confidence is used as a first factor;
the depth values of the pixels I in the respective sets of binocular images are weighted with the first factor,
and fusing the depth values of the pixels I in the weighted binocular images to obtain the current depth fusion value.
6. The method of any one of claims 1 to 5, wherein the at least two sets of binocular image acquisition means comprise two sets of binocular image acquisition means having two baselines perpendicular to each other;
the view image comprises a first binocular image from a first image acquisition device and a second binocular image from a second image acquisition device and a third image acquisition device, wherein a first baseline of the first image acquisition device and the second image acquisition device is perpendicular to a second baseline of the second image acquisition device and the third image acquisition device;
The difference values of the binocular baseline lengths of the groups comprise a first ratio of the first baseline length to the second baseline length and a second ratio of the second baseline length to the first baseline length;
the image stereo correction of the view image comprises the steps of respectively correcting a first image from a first image acquisition device and a third image from a third image acquisition device to the same plane where a second image from a second image acquisition device is located, taking the corrected first image and second image as a first binocular image, and taking the corrected second image and third image as a second binocular image, so as to obtain two groups of binocular images.
7. The method of claim 6, wherein weighting the parallax confidence using the difference values of the binocular baseline lengths of the groups as weights comprises,
when the current depth fusion value is smaller than the depth threshold value, weighting the parallax confidence of any pixel I in the first binocular image by the second ratio,
when the current depth fusion value is greater than or equal to a depth threshold value, weighting parallax confidence of any pixel I in the first binocular image by using the first ratio;
the parallax confidence coefficient of any pixel I in each group of binocular images is fused, and the parallax confidence coefficient obtained after the fusion of the pixel I comprises the steps of accumulating the weighted parallax confidence coefficient of the pixel I in the first binocular image and the weighted parallax confidence coefficient of the pixel in the second binocular image to obtain the parallax confidence coefficient obtained after the fusion of the two groups of binocular images of the pixel;
The proportion of the parallax confidence of the pixel I in each group of binocular images after fusion is taken as a first factor, the depth value of the pixel I in each group of binocular images is weighted by the first factor respectively and comprises,
taking the ratio of the parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
taking the ratio of the weighted parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
the step of fusing the weighted depth values of the pixels I in each group of binocular images to obtain depth estimation values of the pixels I comprises,
and accumulating the weighted depth value of the pixel I of the second binocular image and the weighted depth value of the pixel I of the first binocular image to obtain a depth estimated value of the pixel I.
8. The method of claim 6, wherein weighting the parallax confidence using the difference values of the binocular baseline lengths of the groups as weights comprises,
When the current depth fusion value is smaller than the depth threshold value, weighting the parallax confidence of the pixel I in the second binocular image by using the first ratio;
when the current depth fusion value is greater than or equal to a depth threshold value, weighting the parallax confidence of the pixel I in the second binocular image by using the second ratio;
the parallax confidence coefficient of any pixel I in each group of binocular images is fused, and the parallax confidence coefficient obtained after the fusion of the pixel I comprises the steps of accumulating the weighted parallax confidence coefficient of the pixel I in the second binocular image and the parallax confidence coefficient of the pixel in the first binocular image to obtain the parallax confidence coefficient obtained after the fusion of the two groups of binocular images of the pixel;
the proportion of the parallax confidence of the pixel I in each group of binocular images after fusion is taken as a first factor, the depth value of the pixel I in each group of binocular images is weighted by the first factor respectively and comprises,
taking the ratio of the parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
taking the ratio of the weighted parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
The step of fusing the weighted depth values of the pixels I in each group of binocular images to obtain depth estimation values of the pixels I comprises,
and accumulating the weighted depth value of the pixel I of the second binocular image and the weighted depth value of the pixel I of the first binocular image to obtain a depth estimated value of the pixel I.
9. The method of claim 6, wherein the first baseline is in a vertical direction and the second baseline is in a horizontal direction;
the correcting the first image from the first image acquisition device and the third image from the third image acquisition device to the same plane in which the second image from the second image acquisition device lies respectively includes,
correcting the first image in the first binocular image to the plane where the second image is located,
correcting the third image in the second binocular image to the plane of the second image,
such that the corresponding lines of the first image and the second image lie along the vertical direction of the image pixels, and the corresponding lines of the second image and the third image lie along the horizontal direction of the image pixels.
10. The method of claim 6, wherein the obtaining of the current depth fusion value comprises,
accumulating the parallax confidence coefficient of the pixel I in the first binocular image and the parallax confidence coefficient of the pixel I in the second binocular image to obtain fused parallax confidence coefficient;
Taking the ratio of the parallax confidence of the pixel I in the second binocular image to the fused parallax confidence as a first factor of the pixel I in the second binocular image, and weighting the depth value of the pixel I in the second binocular image by using the first factor;
taking the ratio of the parallax confidence coefficient of the pixel I in the first binocular image to the fused parallax confidence coefficient as a first factor of the pixel I of the first binocular image, and weighting the depth value of the pixel I of the first binocular image by using the first factor;
and accumulating the depth value of the pixel I of the weighted second binocular image and the depth value of the pixel I of the weighted first binocular image to obtain the current depth fusion value.
11. The method of claim 6, wherein performing stereo matching on each group of binocular images, based on the binocular images from the at least two sets of binocular image acquisition devices in the stereo-rectified view images, comprises,
based on each group of binocular images after stereo rectification, traversing the binocular images pixel by pixel within a set parallax search range D and calculating the matching cost of each pixel under different parallaxes, to obtain the matching cost of each pixel in each group of binocular images under different parallaxes;
Calculating the aggregation cost of each pixel under different parallaxes according to the set window size based on the matching cost of each pixel under different parallaxes in each group of binocular images, and obtaining the aggregation cost of each pixel under different parallaxes in each group of binocular images;
based on the aggregation cost of each pixel in each binocular image under different parallaxes, determining the corresponding parallaxes according to the minimum aggregation matching cost, and calculating the parallax confidence of the pixel to obtain the parallaxes of each pixel in each binocular image;
performing parallax optimization based on the parallax of each pixel in each group of binocular images, and obtaining the sub-pixel-level parallax in each group of binocular images through univariate quadratic (parabolic) curve fitting;
calculating the depth value of each pixel in each group of binocular images based on the sub-pixel-level parallax in that group of binocular images, respectively.
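The following Python/NumPy sketch illustrates the matching pipeline of this claim under stated assumptions: an absolute-difference matching cost, mean-window aggregation, and a confidence defined as the next-smallest over the smallest aggregated cost (the reciprocal of the ratio stated in claim 4, so that larger values mean higher confidence). None of these concrete choices, nor the function names, is mandated by the claim.

    import numpy as np

    def box_filter(img, win):
        """Aggregate values over a win x win window (simple mean filter)."""
        pad = win // 2
        padded = np.pad(img, pad, mode="edge")
        out = np.zeros_like(img)
        for dy in range(win):
            for dx in range(win):
                out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        return out / (win * win)

    def stereo_depth(left, right, max_disp, win=7, focal=500.0, baseline=0.05):
        """Sketch: matching cost over a parallax search range (max_disp plays
        the role of D), window aggregation, winner-take-all parallax with a
        confidence, parabolic sub-pixel refinement, and conversion to depth
        via depth = focal * baseline / parallax."""
        h, w = left.shape
        left = left.astype(np.float32)
        right = right.astype(np.float32)
        big = 1.0e9  # sentinel cost for parallaxes that fall outside the image

        # 1) matching cost and aggregation cost for every candidate parallax
        agg = np.full((max_disp, h, w), big, dtype=np.float32)
        for d in range(max_disp):
            cost = np.full((h, w), big, dtype=np.float32)
            cost[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
            agg[d] = box_filter(cost, win)

        # 2) winner-take-all parallax and per-pixel confidence
        best = np.argmin(agg, axis=0)
        two_smallest = np.sort(agg, axis=0)[:2]
        c1, c2 = two_smallest[0], two_smallest[1]
        confidence = c2 / (c1 + 1e-6)      # larger = more distinctive match

        # 3) parabolic (univariate quadratic) sub-pixel refinement
        disp = best.astype(np.float32)
        ys, xs = np.mgrid[0:h, 0:w]
        inner = (best > 0) & (best < max_disp - 1)
        cm = agg[best[inner] - 1, ys[inner], xs[inner]]
        c0 = agg[best[inner], ys[inner], xs[inner]]
        cp = agg[best[inner] + 1, ys[inner], xs[inner]]
        denom = cm - 2.0 * c0 + cp
        safe = np.where(denom > 1e-6, denom, 1.0)
        disp[inner] += np.where(denom > 1e-6, 0.5 * (cm - cp) / safe, 0.0)

        # 4) parallax -> depth; zero parallax is left as a hole
        depth = np.where(disp > 0, focal * baseline / np.maximum(disp, 1e-6), 0.0)
        return depth, confidence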
12. A machine vision system, comprising an image acquisition device, wherein the image acquisition device comprises at least two sets of binocular image acquisition devices sharing the same image acquisition device; the system further comprises,
a control module for controlling the image acquisition device to acquire images,
a computing module for performing image stereo correction on at least three view images from the image acquisition device;
Based on binocular images from the at least two sets of binocular image acquisition devices in the view images after the stereo correction, respectively performing stereo matching on the sets of binocular images to obtain depth values of each pixel in the sets of binocular images;
for any pixel I in each group of binocular images, fusing the depth value of the pixel I in each group of binocular images to obtain the depth estimation value of the pixel I,
wherein,
i is a natural number not greater than the total number of pixels in the binocular image,
the depth value of the pixel I is determined as follows:
if the depth value of any pixel I in any binocular image in each set of binocular images does not exist, using the depth values of pixels I in other sets of binocular images;
if the depth value of each pixel in each group of binocular images exists, taking the difference value between the binocular base line lengths of each group as a weight, and respectively weighting the parallax confidence so that:
when the current depth fusion value is smaller than the depth threshold value, giving a first weight to any pixel depth value of the binocular image from the first baseline binocular image acquisition device, and giving a second weight to the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
When the current depth fusion value is greater than or equal to a depth threshold value, a third weight of the pixel depth value of the binocular image from the first baseline binocular image acquisition device is given and is smaller than or equal to a fourth weight of the pixel depth value of the binocular image from the second baseline binocular image acquisition device;
wherein the first baseline length is less than or equal to the second baseline length;
the final result weighted by the weight is taken as the depth value of the pixel.
13. The system of claim 12, wherein the computing module further comprises,
in the stereo matching process, calculating parallax confidence of each pixel in each group of binocular images;
respectively weighting the depth value of any pixel I in each group of binocular images by using the parallax confidence of the pixel I to obtain the weighted depth value of the pixel I in each group of binocular images;
and fusing the depth values weighted by the pixels I in the binocular images to obtain the depth estimation value.
14. The system of claim 13, wherein the computing module further comprises fusing the parallax confidence of any pixel I in each set of binocular images to obtain the fused parallax confidence of the pixel I;
the ratio of the parallax confidence of the pixel I in each group of binocular images to the fused parallax confidence is used as a first factor;
the depth value of the pixel I in each set of binocular images is weighted with the first factor, respectively.
15. The system of any one of claims 12 to 14, further comprising,
a transmission module, for transmitting the pixel depth estimate data from the computing module; and/or
a projection device, for projecting structured light onto a target to form a light-signal image; and/or
a three-primary-color image acquisition device, for acquiring three-primary-color images.
16. The system of claim 15, wherein the at least two sets of binocular image acquisition devices comprise two sets of binocular image acquisition devices whose two baselines are perpendicular to each other, wherein the baseline of the first image acquisition device and the second image acquisition device is in the vertical direction and the baseline of the second image acquisition device and the third image acquisition device is in the horizontal direction; the first baseline length of the first image acquisition device and the second image acquisition device is less than the second baseline length of the second image acquisition device and the third image acquisition device.
17. The system of claim 16, wherein the projection device is located at the center of the line connecting the optical centers of the first image acquisition device and the third image acquisition device;
the three-primary-color image acquisition device is positioned such that the line connecting its optical center with the optical center of the first image acquisition device forms a right angle with the line to the optical center of the third image acquisition device;
the control module synchronously outputs trigger signals to each image acquisition device and/or projection device.
18. The system of claim 15, wherein the image acquisition devices of the at least two sets of binocular image acquisition devices are equally spaced or non-equally spaced apart on a checkerboard.
19. A machine vision depth estimation device is characterized in that the device comprises a memory and a processor,
the memory is used for storing a computer program;
the processor is configured to execute a program stored in the memory, and implement the machine vision depth estimation method according to any one of claims 1 to 11.
CN201911354323.XA 2019-12-25 2019-12-25 Machine vision depth estimation method, device and system Active CN113034568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354323.XA CN113034568B (en) 2019-12-25 2019-12-25 Machine vision depth estimation method, device and system

Publications (2)

Publication Number Publication Date
CN113034568A CN113034568A (en) 2021-06-25
CN113034568B true CN113034568B (en) 2024-03-29

Family

ID=76458046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354323.XA Active CN113034568B (en) 2019-12-25 2019-12-25 Machine vision depth estimation method, device and system

Country Status (1)

Country Link
CN (1) CN113034568B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610908B (en) * 2021-07-29 2023-08-18 中山大学 Depth estimation method for multi-baseline fusion in monocular endoscopic surgery
CN113658242A (en) * 2021-08-23 2021-11-16 深圳市慧鲤科技有限公司 Depth estimation method, depth estimation device, computer equipment and storage medium
CN113743337A (en) * 2021-09-09 2021-12-03 江阴市浩华新型复合材料有限公司 Image signal capturing platform using depth of field analysis
CN115880212A (en) * 2021-09-28 2023-03-31 北京三快在线科技有限公司 Binocular camera evaluation method and device, computer equipment and storage medium
CN114526816B (en) * 2021-12-29 2024-03-15 光沦科技(深圳)有限公司 Miniature multispectral-3D multimode camera system and imaging method
CN114299129B (en) * 2021-12-31 2023-01-31 合肥的卢深视科技有限公司 Depth recovery method, electronic device, and computer-readable storage medium
CN114565570B (en) * 2022-02-18 2024-03-15 成都飞机工业(集团)有限责任公司 Weak-rigidity skin countersink hole depth measurement method, device, equipment and medium
CN115994937A (en) * 2023-03-22 2023-04-21 科大讯飞股份有限公司 Depth estimation method and device and robot

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102523464A (en) * 2011-12-12 2012-06-27 上海大学 Depth image estimating method of binocular stereo video
WO2018127007A1 (en) * 2017-01-03 2018-07-12 成都通甲优博科技有限责任公司 Depth image acquisition method and system
CN108335322A (en) * 2018-02-01 2018-07-27 深圳市商汤科技有限公司 Depth estimation method and device, electronic equipment, program and medium
CN109360235A (en) * 2018-09-29 2019-02-19 中国航空工业集团公司上海航空测控技术研究所 A kind of interacting depth estimation method based on light field data
CN109978928A (en) * 2019-03-04 2019-07-05 北京大学深圳研究生院 A kind of binocular vision solid matching method and its system based on Nearest Neighbor with Weighted Voting
CN110009672A (en) * 2019-03-29 2019-07-12 香港光云科技有限公司 Promote ToF depth image processing method, 3D rendering imaging method and electronic equipment
CN110148181A (en) * 2019-04-25 2019-08-20 青岛康特网络科技有限公司 A kind of general binocular solid matching process
CN110322572A (en) * 2019-06-11 2019-10-11 长江勘测规划设计研究有限责任公司 A kind of underwater culvert tunnel inner wall three dimensional signal space method based on binocular vision

Similar Documents

Publication Publication Date Title
CN113034568B (en) Machine vision depth estimation method, device and system
CN111260597B (en) Parallax image fusion method of multiband stereo camera
KR100776649B1 (en) A depth information-based Stereo/Multi-view Stereo Image Matching Apparatus and Method
EP3869797B1 (en) Method for depth detection in images captured using array cameras
KR100513055B1 (en) 3D scene model generation apparatus and method through the fusion of disparity map and depth map
US8718326B2 (en) System and method for extracting three-dimensional coordinates
Zhu et al. Reliability fusion of time-of-flight depth and stereo geometry for high quality depth maps
WO2019100933A1 (en) Method, device and system for three-dimensional measurement
US10477178B2 (en) High-speed and tunable scene reconstruction systems and methods using stereo imagery
US20090324059A1 (en) Method for determining a depth map from images, device for determining a depth map
CN109840922B (en) Depth acquisition method and system based on binocular light field camera
CN111210481A (en) Depth estimation acceleration method of multiband stereo camera
CN110443186B (en) Stereo matching method, image processing chip and mobile carrier
CN102831601A (en) Three-dimensional matching method based on union similarity measure and self-adaptive support weighting
CN111028281B (en) Depth information calculation method and device based on light field binocular system
CN116188558B (en) Stereo photogrammetry method based on binocular vision
CN102447917A (en) Three-dimensional image matching method and equipment thereof
Garro et al. A novel interpolation scheme for range data with side information
CN111382591B (en) Binocular camera ranging correction method and vehicle-mounted equipment
JP4394487B2 (en) Stereo image processing device
CN108876861B (en) Stereo matching method for extraterrestrial celestial body patrolling device
CN115719320B (en) Tilt correction dense matching method based on remote sensing image
Zhang et al. High quality depth maps from stereo matching and ToF camera
Li et al. Accurate depth estimation using structured light and passive stereo disparity estimation
Shen Depth-map merging for multi-view stereo with high resolution images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310051 room 304, B / F, building 2, 399 Danfeng Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Hikvision Robot Co.,Ltd.

Address before: 310051 room 304, B / F, building 2, 399 Danfeng Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: HANGZHOU HIKROBOT TECHNOLOGY Co.,Ltd.

GR01 Patent grant
GR01 Patent grant