CN112907559A - Monocular camera-based depth map generation device and method - Google Patents

Monocular camera-based depth map generation device and method

Info

Publication number
CN112907559A
CN112907559A (application CN202110281368.XA)
Authority
CN
China
Prior art keywords
camera
monocular camera
depth map
realsense
rgb image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110281368.XA
Other languages
Chinese (zh)
Other versions
CN112907559B (en)
Inventor
屠礼芬
宋伟
彭祺
李春生
余振宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Engineering University
Original Assignee
Hubei Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Engineering University filed Critical Hubei Engineering University
Priority to CN202110281368.XA priority Critical patent/CN112907559B/en
Publication of CN112907559A publication Critical patent/CN112907559A/en
Application granted granted Critical
Publication of CN112907559B publication Critical patent/CN112907559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The invention relates to a monocular camera-based depth map generation device comprising a monocular camera and a RealSense camera: the monocular camera is mounted on a first pan/tilt head, the RealSense camera on a second pan/tilt head, the two cameras fit closely together, and their optical axes are parallel. The invention also relates to a monocular camera-based depth map generation method comprising the following steps: collect one monocular camera RGB image; collect one RealSense camera RGB image and one RealSense camera depth map; down-sample to obtain the down-sampled monocular camera RGB image; perform superpixel segmentation to obtain the segmented monocular camera RGB image; perform feature point matching to obtain the matching depth map; perform region segmentation to obtain the partitioned depth map; compute the average depth of each region and fill the corresponding region to obtain the filled depth map; and up-sample to obtain the monocular camera depth map. The method fits a depth map while keeping the high precision and field of view of the monocular camera RGB image; the cost is low; and no hardware calibration, no learning or modeling, and no prior knowledge are required.

Description

Monocular camera-based depth map generation device and method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a monocular camera-based depth map generation device and method.
Background
With the continuous development of artificial intelligence applications, combining image depth with RGB information has become increasingly common. Compared with RGB information alone, depth information introduces the distance from a target to the camera, adding a spatial dimension; a scene can thus be better understood, and detection or recognition accuracy is markedly improved. An image containing depth information is called a depth map.
The prior art has several methods for generating a depth map as follows:
1. Conventional hardware acquisition methods:
This is the most direct depth map generation technique: hardware such as lidar, Kinect, or RealSense is simply used to acquire the depth information of the scene and thereby the depth map. The advantages and disadvantages of these devices are as follows:
Lidar:
Advantages: higher precision.
Disadvantages: it acquires three-dimensional point cloud information but lacks an RGB image, i.e., texture information is lost.
Kinect/RealSense:
Advantages: the RGB image and the depth map are obtained simultaneously; the devices are inexpensive and easy to popularize.
Disadvantages: the RGB image has low resolution and low contrast, and the field of view is limited.
2. Image-processing-based methods, such as the mainstream binocular stereo matching:
After the cameras are calibrated, a depth map is obtained through feature point matching together with global and local matching. The advantages and disadvantages of this approach are as follows:
Advantages: a depth map of higher precision can be generated, and good RGB image information is retained.
Disadvantages: the cameras require complex calibration, and once calibrated their relative positions must not move, so flexibility is poor; moreover, the hardware used in this solution must be customized, so the cost is not low.
3. Monocular depth estimation methods:
A depth map is obtained by traditional machine learning or by deep learning. The advantages and disadvantages of this class of solutions are as follows:
Advantages: the hardware cost is low.
Disadvantages: learning and modeling must be performed first, which requires large data sets and a complex training process, so the approach is not suitable for popularization.
Disclosure of Invention
The invention aims to solve the above problems by providing a monocular camera-based depth map generation device and method that fit a depth map while keeping the high precision and unchanged field of view of the monocular camera's RGB image; no hardware calibration, no scene learning or modeling, and no large amount of prior knowledge are required, which reduces the application cost.
To this end, the technical solution provided by the invention is as follows:
a depth map generating device based on a monocular camera comprises the monocular camera and a RealSense camera; wherein:
the monocular camera is arranged on a quick-mounting plate of the first cloud deck; the base of the first tripod head is fixedly arranged on the tripod head fixing plate;
the RealSense camera is arranged on a quick installation plate of the second cloud deck; the base of the second holder is fixedly arranged on the holder fixing plate;
the monocular camera is tightly matched with the RealSense camera; the optical axis of the monocular camera is parallel to the optical axis of the RealSense camera.
Preferably, the monocular camera fits closely with the RealSense camera in the horizontal direction.
Preferably, the monocular camera fits closely with the RealSense camera in the vertical direction.
Preferably, it is characterized in that: the monocular camera is arranged on a quick-mounting plate of the first holder through a conversion frame made of a tough material and used for buffering and resisting shock; the RealSense camera is installed on the fast-installation plate of the second holder through a conversion frame made of a tough material and used for buffering and resisting shock.
Preferably, the monocular camera is provided with cooling fins in four directions, namely, up, down, left and right.
A monocular camera-based depth map generation method using the depth map generation device comprises the following steps:
S100. Aim the optical axes of the monocular camera and the RealSense camera at an image acquisition target simultaneously.
S200. Collect one monocular camera RGB image of the target with the monocular camera; collect one RealSense camera RGB image and one RealSense camera depth map of the target with the RealSense camera.
The pixels of the RealSense camera RGB image correspond one-to-one with the pixels of the RealSense camera depth map.
S300. Down-sample the monocular camera RGB image so that its resolution is reduced to that of the RealSense camera RGB image, obtaining the down-sampled monocular camera RGB image.
S400. Perform a superpixel segmentation operation on the down-sampled monocular camera RGB image, obtaining the segmented monocular camera RGB image.
S500. Perform a feature point matching operation between the down-sampled monocular camera RGB image and the RealSense camera RGB image, obtaining the matching depth map.
S600. Perform region segmentation on the matching depth map according to the segmented monocular camera RGB image, obtaining the partitioned depth map; the partitioned depth map consists of a number of partitioned regions.
S700. Region by region, compute the average of the depth values of all pixels in each partitioned region, take that average as the region's depth value, and fill the region with it, obtaining the filled depth map.
S800. Up-sample the filled depth map so that its resolution is raised to that of the monocular camera RGB image, obtaining the monocular camera depth map; then output the monocular camera depth map as the result of the depth map generation method.
Preferably, the feature point matching operation in S500 specifically comprises the following operations:
S510. For each pixel of the down-sampled monocular camera RGB image, search the RealSense camera RGB image for a pixel that matches it.
S520. According to the search result:
if a pixel of the down-sampled monocular camera RGB image has a matching pixel in the RealSense camera RGB image, assign that pixel the depth value of the corresponding pixel in the RealSense camera depth map;
otherwise, set the gray value of that pixel of the down-sampled monocular camera RGB image to 0.
Compared with the prior art, the invention has the following advantages:
1. Because the monocular camera and the RealSense camera are fastened closely together with their optical axes nearly coincident, the RGB image acquired by the RealSense camera can be feature-matched against the high-precision RGB image acquired by the monocular camera; a depth map can thus be fitted while keeping the high precision and the field of view of the monocular camera RGB image unchanged, overcoming the loss of texture information in the lidar solution.
2. Because no customized equipment is used, the high cost of the image-processing-based solutions is overcome.
3. Because three-dimensional coordinates are not computed from multi-camera image coordinates, no hardware calibration is required; and because the depth is measured rather than learned, no scene learning or modeling and no large amount of prior knowledge are needed. This overcomes the drawbacks of the monocular depth estimation solutions and makes the method suitable for popularization and application.
Drawings
Fig. 1 is a schematic front view of a monocular camera-based depth map generating device according to an embodiment of the present invention;
FIG. 2 is a front view of an apparatus according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a monocular camera-based depth map generation method according to an embodiment of the present invention;
FIG. 4 is the image I_RGB in an embodiment of the present invention;
FIG. 5 is the image R_RGB in an embodiment of the present invention;
FIG. 6 is the image R_D in an embodiment of the present invention;
FIG. 7 is the feature point detection result on I_RGB in an embodiment of the present invention;
FIG. 8 is the feature point detection result on R_RGB in an embodiment of the present invention;
FIG. 9 is the feature point matching result in an embodiment of the present invention;
FIG. 10 is the image S_RGB after superpixel segmentation in an embodiment of the present invention;
FIG. 11 is the image I_D in an embodiment of the present invention;
FIG. 12 is a schematic diagram of the variation of each image according to the algorithm flow according to the embodiment of the present invention.
Wherein: 1. the camera comprises a monocular camera, a RealSense camera, a first tripod head, a second tripod head, a tripod head fixing plate, a conversion frame, a radiating fin and a tripod, wherein the monocular camera comprises 2. the RealSense camera, 3. the first tripod head, 4. the second tripod head, 5. the tripod head fixing plate, 6. the conversion frame, 7. the radiating fin and 8. the tripod.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and not to limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading this disclosure likewise fall within the scope of the appended claims.
As shown in fig. 1 (front view), a monocular camera-based depth map generating apparatus includes a monocular camera 1 and a RealSense camera 2.
In this embodiment, the monocular camera 1 is an industrial camera; specifically, a micro-vision RS-A14K-GC8 industrial camera is used.
In this embodiment, the RealSense camera 2 is an Intel RealSense D415 depth camera; alternatively, an Intel RealSense D435 depth camera may be used.
Wherein:
the monocular camera 1 is installed on a quick-mounting plate of the first cloud deck 3; the base of the first pan/tilt head 3 is fixedly mounted on the pan/tilt fixing plate 5.
In this embodiment, the monocular camera 1 is mounted on the quick mount plate of the first pan/tilt head 3 via the conversion frame 6 made of a flexible material for buffering and shock resistance.
The RealSense camera 2 is arranged on a quick-mounting plate of the second cloud deck 4; the base of the second pan/tilt head 4 is fixedly mounted on the pan/tilt head fixing plate 5.
In this embodiment, the RealSense camera 2 is mounted on the fast-mounting plate of the second pan/tilt head 4 through a conversion frame 6 made of a flexible material for buffering and shock resistance.
The optical axis of the monocular camera 1 is parallel to the optical axis of the RealSense camera 2. The monocular camera 1 fits closely with the RealSense camera 2 in the horizontal direction, or fits closely in the vertical direction.
The purpose of this arrangement is that the same scene captured by the two cameras has the same depth, so that the RealSense camera depth map can be used to fit, and finally obtain, the monocular camera depth map.
In this embodiment, the monocular camera 1 and the RealSense camera 2 fit closely in the horizontal direction.
In this embodiment, a tripod 8 is mounted below the pan/tilt fixing plate 5, and by adjusting the tripod 8 the optical axes of the monocular camera 1 and the RealSense camera 2 are both kept horizontal.
For the monocular camera 1, the absolute depth generated may contain a very small error, because the image planes of the two cameras do not completely coincide; the relative depths of different objects in the scene, however, are not affected.
In this embodiment, heat sinks 7 are mounted on the monocular camera 1 on its top, bottom, left, and right sides, because an industrial camera draws considerable power and heats up readily in use, making heat dissipation necessary.
Fig. 2 is a front view of the device according to this embodiment.
As shown in fig. 3, the monocular camera-based depth map generation method using the depth map generation device comprises the following steps:
S100. Aim the optical axes of the monocular camera 1 and the RealSense camera 2 at an image acquisition target simultaneously.
S200. Collect one monocular camera RGB image of the image acquisition target with the monocular camera 1; for convenience of description, this image is denoted I_RGB below. Collect one RealSense camera RGB image and one RealSense camera depth map of the same target with the RealSense camera 2; the RealSense camera RGB image is denoted R_RGB, and the RealSense camera depth map is denoted R_D.
In this example, I_RGB is shown in fig. 4, R_RGB in fig. 5, and R_D in fig. 6.
Comparing fig. 4 and fig. 5, it is clear that although the optical axes of the industrial camera and the RealSense camera 2 are not perfectly collinear, the imaged scenes are very close, since the cameras are mounted side by side.
Comparing fig. 5 and fig. 6, the points of R_RGB and R_D correspond one-to-one and coincide; that is, the pixels of R_RGB correspond one-to-one with the pixels of R_D.
S300. Down-sample I_RGB so that its resolution is reduced to that of R_RGB, obtaining the down-sampled monocular camera RGB image; for convenience of description, this image is denoted i_RGB below.
The reason for this step is as follows: the industrial camera in this embodiment has a resolution of 4384 × 3288, while the RealSense camera 2 offers a variety of resolutions; for algorithmic reasons, the RealSense resolution must be chosen with the same aspect ratio as the industrial camera.
In this embodiment the RealSense camera 2 is an Intel RealSense D415; changing it to an Intel RealSense D435 achieves the same effect. However, both models output depth maps at a maximum resolution of 1280 × 720, clearly far poorer than the industrial camera's resolution. As mentioned above, the aspect ratio must match the industrial camera's: 4384 : 3288 = 4 : 3, and of the available modes only 640 × 480 has this ratio, so the RealSense resolution can only be chosen as 640 × 480 in this embodiment; this is the image size of R_RGB.
On the other hand, industrial applications demand high image definition and contrast, so the picture quality of R_RGB is unusable and only I_RGB will serve; yet I_RGB lacks a corresponding depth map. Herein lies the contradiction, and this contradiction is the fundamental problem the invention solves. Briefly, the object of the invention is to generate a depth map I_D in point-to-point correspondence with I_RGB.
The practical meaning of step S300 is therefore to down-sample I_RGB into a new image of the same resolution as R_RGB, namely i_RGB; a minimal sketch follows.
S400. Perform a superpixel segmentation operation on i_RGB, obtaining the segmented monocular camera RGB image; for convenience of description, this image is denoted S_RGB below.
In this embodiment, the superpixel segmentation divides the image into regions according to the scene to be analyzed.
A superpixel is an irregular block of adjacent pixels with similar texture, color, and brightness that carries a certain visual significance; superpixel segmentation partitions the image into such blocks, as sketched below.
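As a sketch of S400, the superpixel labels can be produced with scikit-image's SLIC; the patent does not name a particular superpixel algorithm, so SLIC and the parameter values here are assumptions. i_RGB is the down-sampled image from the previous sketch.

```python
import cv2
from skimage.segmentation import slic

# S400: segment i_RGB into irregular blocks of similar color/texture.
# SLIC expects RGB channel order; OpenCV loads BGR, hence the conversion.
S_RGB = slic(
    cv2.cvtColor(i_RGB, cv2.COLOR_BGR2RGB),
    n_segments=400,    # assumed region count; tuned to the scene in practice
    compactness=10.0,  # trade-off between color similarity and compact shape
)
# S_RGB is a label image: S_RGB[y, x] is the superpixel index of pixel (x, y).
```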
S500. Perform a feature point matching operation between i_RGB and R_RGB, obtaining the matching depth map; for convenience of description, this map is denoted i_DP below. The operation specifically comprises:
S510. For each pixel of i_RGB, search R_RGB for a pixel that matches it.
S520. According to the search result:
if a pixel of i_RGB has a matching pixel in R_RGB, the match is successful, and that pixel of i_RGB is assigned the depth value of the pixel of R_D corresponding to the matched pixel of R_RGB; the two successfully paired pixels are called feature points;
otherwise, the match has failed, and the gray value of that pixel of i_RGB is set to 0.
In step S500, the feature point matching operation serves to eliminate points with large error and retain good matches. This is because the industrial camera and the RealSense camera 2 differ considerably in model and in field-of-view angle, so the two sides of the image from the camera with the larger field of view contain areas with no counterpart. However, since the two cameras are mounted side by side, the overlapping portion in the middle of the field of view is highly similar, which reduces the matching difficulty; most matched point pairs therefore arise in that region.
For example, consider a pair of matched points with image coordinates (m, n) in i_RGB and (m', n') in R_RGB. Because the two cameras differ in model, the two coordinates generally differ; but since the cameras are mounted close together, left-right or top-bottom, with their front positions aligned, the absolute depths of the same scene point are similar and the relative depths are the same. Since the pixels of R_RGB and R_D correspond one-to-one, the depth value of R_D at (m', n') is taken as the depth value of i_RGB at (m, n). Applying this correspondence to every matched pair generates the depth values at all matched positions, forming a new image: i_DP. A sketch follows.
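The following is a minimal sketch of S510-S520 under the same assumptions as the previous sketches. The patent does not fix a feature detector, so ORB with brute-force Hamming matching and cross-checking stands in here; the depth file name is hypothetical.

```python
import cv2
import numpy as np

# RealSense depth map, pixel-aligned with R_RGB (16-bit depth is typical).
R_D = cv2.imread("realsense_depth.png", cv2.IMREAD_UNCHANGED)

# S510: detect and match feature points between i_RGB and R_RGB.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(cv2.cvtColor(i_RGB, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = orb.detectAndCompute(cv2.cvtColor(R_RGB, cv2.COLOR_BGR2GRAY), None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

# S520: unmatched pixels stay 0; each matched pixel of i_RGB receives the
# depth of the corresponding pixel of R_D (R_RGB and R_D are pixel-aligned).
i_DP = np.zeros(R_D.shape[:2], dtype=R_D.dtype)
for match in matches:
    x, y = map(int, kp1[match.queryIdx].pt)     # (m, n) in i_RGB
    xr, yr = map(int, kp2[match.trainIdx].pt)   # (m', n') in R_RGB
    i_DP[y, x] = R_D[yr, xr]
```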
In this embodiment, fig. 7 shows the feature point detection result on I_RGB, fig. 8 the result on R_RGB, and fig. 9 the feature point matching result. Although the image quality of R_RGB suffers from its low resolution, the effect on the feature points is small: the detected feature points are essentially consistent with those of the high-resolution industrial camera, and the matching result is good.
In this embodiment, fig. 10 shows S_RGB obtained by superpixel segmentation.
S600. Perform region segmentation on i_DP according to S_RGB, obtaining the partitioned depth map; for convenience of description, this map is denoted i_DPA below. i_DPA consists of a number of partitioned regions.
This step prepares the irregular pixel blocks obtained by the superpixel segmentation of S400 for filling: each irregular block is treated as a unit and will be filled with a single depth value. The meaning of this step is thus to partition the sparsely scattered depth values at the feature points of i_DP according to the regions of S_RGB, generating i_DPA; i_DPA is the partitioning result.
S700. Region by region, compute the average of the depth values of all pixels in the region, take that average as the region's depth value, and fill the region with it, obtaining the filled depth map; for convenience of description, this map is denoted i_D below.
It should be noted that in practice, scenes with sparse feature points can contain regions with no feature points at all; such regions are filled with the background gray value 0 (see the sketch below).
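A minimal sketch of S600-S700, continuing the arrays of the previous sketches: the sparse depths in i_DP are grouped by the superpixel labels S_RGB, and each region is filled with the mean of the matched depths it contains; regions without feature points keep the background value 0, as noted above.

```python
import numpy as np

# S600 + S700: partition i_DP by superpixel label and fill each region
# with the average depth of the feature points inside it.
i_D = np.zeros_like(i_DP)
for label in np.unique(S_RGB):
    region = (S_RGB == label)        # one irregular pixel block of S_RGB
    depths = i_DP[region]
    valid = depths[depths > 0]       # depth values transferred in S520
    if valid.size > 0:
        i_D[region] = int(valid.mean())
```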
S800. Up-sample i_D so that its resolution is raised to that of I_RGB, obtaining the monocular camera depth map; for convenience of description, this map is denoted I_D below. I_D is then output as the result of the depth map generation method; a sketch follows.
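S800 is again a single resize, now upward; a sketch under the same assumptions. Nearest-neighbour interpolation is an assumed choice that keeps the per-region fills crisp instead of blending depth values across region borders.

```python
import cv2

# S800: restore the industrial camera's full resolution.
H, W = I_RGB.shape[:2]
I_D = cv2.resize(i_D, (W, H), interpolation=cv2.INTER_NEAREST)
```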
It should be noted in particular that, except for the areas where no feature points were detected and the areas that lie outside what the RealSense camera 2 can image, the depth values of I_D correspond point-to-point with those of I_RGB.
In this example, I_D is shown in fig. 11.
The black areas in fig. 6 and 11 are both depth missing areas.
Comparing fig. 6 and fig. 11, it can be seen that the R_D generated automatically by the RealSense camera 2 is relatively complete, with depth missing only in small areas at the two sides of the image, whereas the depth map generated by this algorithm has more missing areas; those areas, however, are mainly distributed around the border of the image.
The depth missing of this algorithm has two main causes:
1. Although the industrial camera and the RealSense camera 2 are mounted side by side, their optical axes are not perfectly aligned, so the captured pictures do not completely overlap, causing depth loss around the periphery.
2. Some areas are smooth and lack feature points; this can occur both in the middle and around the periphery of the image. As figs. 7 to 11 show, apart from these depth-missing regions, the regions that do have depth give good results.
It can therefore be concluded that the invention remedies the defects of the prior art well and can markedly improve the precision of industrial detection or recognition.
A final supplementary note on fig. 12, which schematically shows how each image changes along the algorithm flow: in the figure, the unmatched pixels of i_DP are shown in white to represent the image more clearly, whereas in the actual algorithm, as described in S520, their gray value is set to 0; likewise, the region dividing lines drawn in i_DPA are only there to express the meaning of the algorithm more clearly, whereas in the actual algorithm, as described in S600, there are no dividing lines.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A monocular camera-based depth map generation device, characterized in that it comprises a monocular camera (1) and a RealSense camera (2), wherein:
the monocular camera (1) is mounted on the quick-release plate of a first pan/tilt head (3); the base of the first pan/tilt head (3) is fixedly mounted on a pan/tilt fixing plate;
the RealSense camera (2) is mounted on the quick-release plate of a second pan/tilt head (4); the base of the second pan/tilt head (4) is fixedly mounted on the pan/tilt fixing plate;
the monocular camera (1) fits closely against the RealSense camera (2); the optical axis of the monocular camera (1) is parallel to the optical axis of the RealSense camera (2).
2. The monocular camera-based depth map generation device according to claim 1, characterized in that the monocular camera (1) and the RealSense camera (2) fit closely in the horizontal direction.
3. The monocular camera-based depth map generation device according to claim 1, characterized in that the monocular camera (1) and the RealSense camera (2) fit closely in the vertical direction.
4. The monocular camera-based depth map generation device according to claim 2 or 3, characterized in that the monocular camera (1) is mounted on the quick-release plate of the first pan/tilt head (3) through a conversion frame (6) made of a flexible material for cushioning and shock absorption, and the RealSense camera (2) is mounted on the quick-release plate of the second pan/tilt head (4) through a conversion frame (6) of the same material.
5. The monocular camera-based depth map generation device according to claim 4, characterized in that the monocular camera (1) is fitted with heat sinks (7) on its top, bottom, left, and right sides.
6. A monocular camera-based depth map generation method using the depth map generation device according to any one of claims 1 to 5, characterized by comprising the following steps:
S100. aiming the optical axes of the monocular camera (1) and the RealSense camera (2) at an image acquisition target simultaneously;
S200. collecting one monocular camera RGB image of the image acquisition target with the monocular camera (1), and collecting one RealSense camera RGB image and one RealSense camera depth map of the image acquisition target with the RealSense camera (2),
wherein the pixels of the RealSense camera RGB image correspond one-to-one with the pixels of the RealSense camera depth map;
S300. down-sampling the monocular camera RGB image so that its resolution is reduced to that of the RealSense camera RGB image, obtaining the down-sampled monocular camera RGB image;
S400. performing a superpixel segmentation operation on the down-sampled monocular camera RGB image, obtaining the segmented monocular camera RGB image;
S500. performing a feature point matching operation between the down-sampled monocular camera RGB image and the RealSense camera RGB image, obtaining the matching depth map;
S600. performing region segmentation on the matching depth map according to the segmented monocular camera RGB image, obtaining the partitioned depth map, the partitioned depth map consisting of a number of partitioned regions;
S700. region by region, computing the average of the depth values of all pixels in each partitioned region, taking that average as the region's depth value, and filling the region with it, obtaining the filled depth map;
S800. up-sampling the filled depth map so that its resolution is raised to that of the monocular camera RGB image, obtaining the monocular camera depth map, and then outputting the monocular camera depth map as the result of the depth map generation method.
7. The monocular camera-based depth map generation method according to claim 6, characterized in that the feature point matching operation in S500 specifically comprises the following operations:
S510. for each pixel of the down-sampled monocular camera RGB image, searching the RealSense camera RGB image for a pixel that matches it;
S520. according to the search result:
if a pixel of the down-sampled monocular camera RGB image has a matching pixel in the RealSense camera RGB image, assigning that pixel the depth value of the corresponding pixel in the RealSense camera depth map;
otherwise, setting the gray value of that pixel of the down-sampled monocular camera RGB image to 0.
CN202110281368.XA 2021-03-16 2021-03-16 Depth map generation device based on monocular camera Active CN112907559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281368.XA CN112907559B (en) 2021-03-16 2021-03-16 Depth map generation device based on monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281368.XA CN112907559B (en) 2021-03-16 2021-03-16 Depth map generation device based on monocular camera

Publications (2)

Publication Number Publication Date
CN112907559A true CN112907559A (en) 2021-06-04
CN112907559B CN112907559B (en) 2022-06-07

Family

ID=76105192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281368.XA Active CN112907559B (en) 2021-03-16 2021-03-16 Depth map generation device based on monocular camera

Country Status (1)

Country Link
CN (1) CN112907559B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184784A (en) * 2015-08-28 2015-12-23 西交利物浦大学 Motion information-based method for monocular camera to acquire depth information
US20180189565A1 (en) * 2015-08-28 2018-07-05 Imperial College Of Science, Technology And Medicine Mapping a space using a multi-directional camera
CN105115445A (en) * 2015-09-14 2015-12-02 杭州光珀智能科技有限公司 Three-dimensional imaging system and imaging method based on combination of depth camera and binocular vision
CN109166149A (en) * 2018-08-13 2019-01-08 武汉大学 A kind of positioning and three-dimensional wire-frame method for reconstructing and system of fusion binocular camera and IMU
CN110519502A (en) * 2019-09-24 2019-11-29 远形时空科技(北京)有限公司 A kind of sensor and implementation method having merged depth camera and general camera
CN111242080A (en) * 2020-01-21 2020-06-05 南京航空航天大学 Power transmission line identification and positioning method based on binocular camera and depth camera

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YI ZHANG et al.: "Depth Inpainting Algorithm of RGB-D Camera", IEEE, 31 December 2018 (2018-12-31) *
彭祺, 屠礼芬: "三维模型的空间匹配与拼接" [Spatial matching and stitching of three-dimensional models], 《计算机工程与科学》 [Computer Engineering & Science], 31 March 2017 (2017-03-31) *

Also Published As

Publication number Publication date
CN112907559B (en) 2022-06-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant