CN107369158B - Indoor scene layout estimation and target area extraction method based on RGB-D image

Indoor scene layout estimation and target area extraction method based on RGB-D image

Info

Publication number
CN107369158B
CN107369158B
Authority
CN
China
Prior art keywords
plane
segmentation
target
area
rgb
Prior art date
Legal status
Active
Application number
CN201710442910.9A
Other languages
Chinese (zh)
Other versions
CN107369158A (en)
Inventor
吴晓秋
霍智勇
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201710442910.9A
Publication of CN107369158A
Application granted
Publication of CN107369158B
Legal status: Active
Anticipated expiration



Classifications

    • G06T7/11 Image analysis; Segmentation; Region-based segmentation
    • G06T7/13 Image analysis; Segmentation; Edge detection
    • G06T7/50 Image analysis; Depth or shape recovery
    • G06V20/10 Scenes; Scene-specific elements; Terrestrial scenes
    • G06T2207/10024 Image acquisition modality; Color image
    • G06T2207/10028 Image acquisition modality; Range image; Depth image; 3D point clouds
    • G06T2207/20024 Special algorithmic details; Filtering details

Abstract

The invention discloses an indoor scene layout estimation and target area extraction method based on RGB-D images, which comprises the following steps: estimating the scene layout; over-segmenting the preprocessed depth map and RGB map with a graph-based segmentation algorithm and a constrained-parameter min-cut segmentation algorithm to obtain region sets of different sizes; hierarchically grouping the over-segmented regions, merging regions with four different similarity measures to obtain target regions at all scales; and matching target bounding boxes. The method achieves efficient, high-recall target region extraction for indoor scenes.

Description

Indoor scene layout estimation and target area extraction method based on RGB-D image
Technical Field
The invention belongs to the technical field of artificial intelligence computing, and in particular relates to an indoor scene layout estimation and target area extraction method based on RGB-D images, applied in indoor service robot technology.
Background
Indoor scene analysis is a research hotspot for scholars at home and abroad: it has important application value for indoor robot semantic localization and map generation, and is significant for solving several high-level computer vision problems. Target segmentation and extraction algorithms aim to obtain high-quality target localization and instance segmentation results, and constitute one of the key steps of scene analysis. The target extraction result is usually a set of target candidate regions or target bounding boxes. After years of development, target extraction algorithms currently fall into two classes: the first class is based on the sliding-window detection idea, and the second class is based on segmentation, including image over-segmentation and segmentation-merging strategies. A classical example of the first class is the DPM (Deformable Parts Model) detection algorithm, which uses improved HOG features and an SVM classifier and is robust to target deformation, but such algorithms are computationally expensive and cannot use complex feature representations.
A classical example of the second class is the GBS (graph-based segmentation) image segmentation algorithm, which is simple to implement and fast and can find visually consistent regions, but easily causes over-segmentation; the constrained-parameter min-cut target segmentation algorithm gives good segmentations containing only foreground regions. In recent years, with the popularization of depth sensors, a large number of RGB-D image datasets containing depth images have appeared, and researchers have begun to use them to improve results by adding geometric features or depth information. However, these are usually supervised algorithms that require a pre-trained contour model and have high computational complexity; although they partially improve target extraction accuracy, they cover few target categories, have low recall, and easily ignore objects in planar areas during detection. There are also unsupervised RGB-D target extraction and segmentation algorithms that are fast, but they are sensitive to image brightness changes, noise, and similar factors, and have low robustness.
Although target extraction algorithms continue to develop, the limitations of texture, color, and brightness features in RGB images mean that the following problems remain when they are applied to complex indoor scenes: 1) occlusion, where some large objects cannot be detected because they are occluded; 2) objects lying in planar areas and small objects are easily missed, leading to low recall; 3) high computational complexity and the need for pre-training, which are unsuitable for practical system applications; 4) poor tolerance of uncertain factors in the image, i.e., low robustness.
Disclosure of Invention
In order to overcome the defects of the prior art while balancing recall and speed, and to solve the problems of layout estimation and target extraction in complex indoor scenes, the invention provides an indoor scene layout estimation and target area extraction method based on RGB-D images, achieving efficient, high-recall target region extraction for indoor scenes.
An indoor scene layout estimation and target area extraction method based on RGB-D images comprises the following steps:
step 1, scene layout estimation: converting the depth map into a dense 3D point cloud, performing plane segmentation by calculating three-dimensional Euclidean distances between points to divide the scene into planar and non-planar areas, and classifying the obtained planar areas into boundary planes and non-boundary planes;
step 2, image over-segmentation: over-segmenting the preprocessed depth map and the preprocessed RGB map with a graph-based segmentation algorithm and a constrained-parameter min-cut segmentation algorithm to obtain region sets of different sizes;
step 3, over-segmentation hierarchical grouping: merging regions using four different similarity measures (color, texture, size, and coincidence) to complete hierarchical region grouping and obtain target regions at all scales;
step 4, target bounding box matching: using different strategies to match the minimum rectangular bounding box containing each target under four cases, namely boundary planes, stitched non-boundary planes, objects located in planar areas, and objects located in non-planar areas, to obtain the target area bounding boxes.
According to the method, plane segmentation and classification are performed on the input 3D point cloud, and the geometric continuity of the point cloud is used to reduce the influence of occlusion on layout estimation, improving the scene layout estimate. The preprocessed depth map and RGB map are over-segmented with a graph-based segmentation algorithm and a constrained-parameter min-cut segmentation algorithm, combining depth and RGB information to improve the segmentation. Regions are merged with four different similarity measures to obtain target regions of all sizes, covering a variety of image conditions and improving the robustness of the algorithm. Different bounding box matching strategies are used for target regions in planar and non-planar areas, so that objects lying in planar areas are retained and the over-segmentation of large objects caused by occlusion is mitigated. Redundant bounding boxes are eliminated using the bounding box overlap rate, leaving the best target area bounding boxes and effectively raising the target bounding box recall while generating fewer candidate boxes. The whole procedure requires no pre-training, has low computational complexity, is easy to implement, and is fast.
Drawings
FIG. 1 is a flow chart of an embodiment of an RGB-D image based indoor scene layout estimation and target region extraction method;
FIG. 2 is a schematic view of a depth map and a 3D point cloud in the embodiment of FIG. 1;
FIG. 3 is a diagram of the effect of plane segmentation and classification in different scenes;
FIG. 4 is a flow chart of the homomorphic filtering process in the embodiment of FIG. 1;
FIG. 5 is a diagram illustrating the target region bounding boxes extracted for an example scene in the embodiment of FIG. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a general flowchart of an indoor scene layout estimation and target area extraction method based on RGB-D images according to an embodiment of the present invention. The procedure of this embodiment is as follows.
Step (1) scene layout estimation: firstly, a depth map is converted into dense 3D point clouds, as shown in fig. 2, then, plane segmentation is carried out by calculating three-dimensional Euclidean distances among the point clouds to divide a plane area and a non-plane area, and the obtained plane area is classified into a boundary plane and a non-boundary plane.
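As an illustration of the depth-to-point-cloud conversion in step (1), the following sketch back-projects a depth map through a pinhole camera model; the function name and the intrinsic parameters shown are illustrative assumptions, not values from the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a dense 3D point cloud.

    depth: H x W array of depth values; fx, fy, cx, cy: pinhole intrinsics.
    Returns an (H*W) x 3 array of XYZ points (invalid depths become NaN rows).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float64)
    z[z <= 0] = np.nan                                # mark missing depth
    x = (u - cx) * z / fx                             # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Illustrative, roughly Kinect-like intrinsics; real values come from calibration.
# points = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```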
Step (1.1) plane segmentation: the depth map is sampled to obtain a set of point triples, and each triple is matched to a candidate plane using the RANSAC algorithm; in-plane points are then searched for in 3D space, where each in-plane point can be represented by a pixel in the depth map and its corresponding valid 3D point. When the three-dimensional Euclidean distance from a point to the plane is smaller than the inlier distance tolerance D_tol, the point is defined as an inlier of that plane; D_tol is calculated as shown in formula (1). Finally, tiny planes with few inliers are removed, and planes that are spatially close or lie nearly in the same plane are stitched together.
[Formula (1), shown as an image in the original publication, defines the inlier distance tolerance D_tol.]
In the formula, f is the focal length, b is the baseline length of the sensor, m is a linear normalization parameter, and Z is the depth value.
Step (1.2) plane classification: for each obtained principal plane area, assuming that the plane's normal vector faces the observer, the ratio of the number of point-cloud points lying on the far side of the plane to the total number of points in the scene is calculated; planes whose ratio is below a threshold are classified as boundary planes, and planes above the threshold as non-boundary planes. Ideally the threshold is 0; considering the influence of noise, it is set to 0.01. The final effect of plane segmentation and classification is shown in fig. 3.
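The classification rule of step (1.2) can be sketched as follows; the plane representation (unit normal plus offset, with the normal assumed to face the observer) and the function name are illustrative assumptions.

```python
import numpy as np

def classify_plane(points, normal, offset, ratio_threshold=0.01):
    """Label a fitted plane as 'boundary' or 'non-boundary'.

    points: N x 3 scene points (NaN rows are ignored). The plane is
    normal . p + offset = 0; points with negative signed distance lie on the
    far side of the plane. If almost no points lie behind it, the plane
    bounds the scene (threshold 0.01 as stated in the text).
    """
    pts = points[~np.isnan(points).any(axis=1)]
    signed_dist = pts @ np.asarray(normal, dtype=np.float64) + offset
    behind_ratio = np.mean(signed_dist < 0)
    return 'boundary' if behind_ratio < ratio_threshold else 'non-boundary'
```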
Step (2) image over-segmentation: the preprocessed depth map and the preprocessed RGB map are over-segmented with a graph-based segmentation algorithm and a constrained-parameter min-cut segmentation algorithm to obtain a set of regions R = {r_1, …, r_n} of different sizes.
Combining RGB and depth information, several pixel-level images from different signal channels are first over-segmented to different degrees to obtain region-level images; the regions are then grouped hierarchically, bottom-up, according to region features, until the whole image becomes a single region, yielding a group-level result containing target regions of all sizes.
Step (2.1) segmentation based on the RGI color space: the RGB three-channel image is converted into normalized R and G channels plus a brightness I channel, i.e., the RGI color space, and the RGI image is then over-segmented at three different levels of coarseness using a graph-based segmentation method.
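A minimal sketch of step (2.1), assuming the common normalization r = R/(R+G+B), g = G/(R+G+B), I = (R+G+B)/3 and using the Felzenszwalb segmentation from scikit-image as the graph-based segmenter; the three scale values are illustrative, not taken from the patent.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def rgb_to_rgi(rgb):
    """Convert an H x W x 3 RGB image (float in [0, 1]) to the RGI color space."""
    rgb = rgb.astype(np.float64)
    s = rgb.sum(axis=2, keepdims=True) + 1e-8          # avoid division by zero
    r = rgb[..., 0:1] / s                               # normalized R
    g = rgb[..., 1:2] / s                               # normalized G
    i = rgb.mean(axis=2, keepdims=True)                 # brightness I
    return np.concatenate([r, g, i], axis=2)

def multiscale_oversegmentation(image, scales=(50, 100, 200)):
    """Graph-based (Felzenszwalb) over-segmentation at three coarseness levels."""
    return [felzenszwalb(image, scale=k, sigma=0.8, min_size=50) for k in scales]

# label_maps = multiscale_oversegmentation(rgb_to_rgi(rgb_image))
```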
Step (2.2) segmentation based on the homomorphically filtered gray-level image: homomorphic filtering is first applied to the RGB image (the processing flow is shown in fig. 4), and the resulting gray-level image is then over-segmented at three different levels using a graph-based segmentation method.
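The homomorphic filtering of step (2.2) can be sketched as a log / FFT / high-frequency-emphasis / inverse-FFT / exp pipeline; the filter parameters below are illustrative defaults, not values from the patent.

```python
import numpy as np

def homomorphic_filter(gray, gamma_l=0.5, gamma_h=1.5, c=1.0, d0=30.0):
    """Homomorphic filtering of a grayscale image (float values in (0, 1]).

    Illumination (low frequencies) is attenuated toward gamma_l and
    reflectance (high frequencies) is boosted toward gamma_h; d0 is the
    Gaussian cutoff radius of the emphasis filter.
    """
    h, w = gray.shape
    log_img = np.log1p(gray.astype(np.float64))          # multiplicative -> additive
    spec = np.fft.fftshift(np.fft.fft2(log_img))
    u, v = np.meshgrid(np.arange(w) - w / 2, np.arange(h) - h / 2)
    d2 = u ** 2 + v ** 2
    filt = (gamma_h - gamma_l) * (1 - np.exp(-c * d2 / (d0 ** 2))) + gamma_l
    out = np.real(np.fft.ifft2(np.fft.ifftshift(spec * filt)))
    return np.expm1(out)                                 # back to intensity domain
```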
Step (2.3) segmentation of the hole-filled depth map: holes in the depth map are filled using a global-optimization colorization method, and the filled depth map is then over-segmented at three different levels using a graph-based segmentation method.
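Step (2.3) fills depth holes with a global-optimization colorization method; the sketch below substitutes OpenCV inpainting as a lightweight stand-in (an assumption, not the patent's method) purely to illustrate where hole filling sits in the pipeline.

```python
import cv2
import numpy as np

def fill_depth_holes(depth):
    """Fill missing (zero) depth values.

    Stand-in for the colorization-style global optimization: normalize the
    depth to 8 bits, inpaint the zero-valued pixels with the Navier-Stokes
    method, and map the result back to the original depth range.
    """
    mask = (depth <= 0).astype(np.uint8)
    d = depth.astype(np.float64)
    dmax = d.max() if d.max() > 0 else 1.0
    d8 = np.uint8(255 * d / dmax)
    filled8 = cv2.inpaint(d8, mask, 5, cv2.INPAINT_NS)   # radius 5 is illustrative
    return filled8.astype(np.float64) * dmax / 255.0
```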
Step (2.4) segmentation based on the RGB-D mixed channel: a constrained-parameter min-cut foreground segmentation is applied to the RGB-D mixed channel, based on the energy function shown in formula (2):
E_λ(X) = Σ_{μ∈V} C_λ(x_μ) + Σ_{(μ,ν)∈ε} V_μν(x_μ, x_ν)    (2)
In the formula, λ ∈ R, V is the set of all pixel points, and ε is the set of edges between adjacent pixels. C_λ is the cost function, generating a cost when each pixel is assigned a label. The binary potential function V_μν acts as a penalty function, generating a penalty value when similar neighboring pixels are assigned different labels.
Step (2.4.1) calculating the cost function C_λ:
[Formula (3), shown as an image in the original publication, defines the cost function C_λ in terms of the foreground seeds v_f, the background seeds v_b, the offset λ, and the function f.]
In the formula, v_f denotes the foreground seeds, v_b denotes the background seeds, and λ is the offset; the function f is defined in formula (4):
f(x_μ) = ln p_f(x_μ) − ln p_b(x_μ)    (4)
In the formula, p_f represents the probability that pixel μ belongs to the foreground region; after adding depth information, p_f is defined as shown in formula (5):
[Formula (5), shown as an image in the original publication, defines p_f using both the RGB image I and the depth map D.]
In the formula, D is the depth map and I is the RGB image; j indexes the representative pixels of the seed region, which are selected by the K-means algorithm (K = 5) as region centers; α and γ are scale factors.
Step (2.4.2) calculating the penalty function V_μν:
V_μν(x_μ, x_ν) = 0, if x_μ = x_ν;  g(μ, ν), if x_μ ≠ x_ν    (6)
The similarity g(μ, ν) between two neighboring pixels is calculated from the gPb values of the pixels μ and ν:
g(μ, ν) = exp(−max(gPb(μ), gPb(ν)) / σ²)    (7)
In the formula, σ² is the edge-sharpening parameter, which controls the smoothness of the binary term V_μν. gPb is computed on both the RGB image and the depth map, and the two responses are linearly combined as the final gPb value of each pixel:
gPb = α·gPb_r + (1 − α)·gPb_d    (8)
In the formula, gPb_r denotes the gPb value of a pixel in the RGB image and gPb_d the gPb value of a pixel in the depth map; α is set to 0.3.
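A sketch of the boundary-based pairwise term of step (2.4.2): the gPb maps are mixed as in formula (8) with α = 0.3 from the text, while the penalty form (exponential in the larger of the two pixels' gPb responses) and the σ value are assumptions following the usual constrained min-cut formulation.

```python
import numpy as np

def combined_gpb(gpb_rgb, gpb_depth, alpha=0.3):
    """Linearly combine the RGB and depth boundary-probability maps (formula (8))."""
    return alpha * gpb_rgb + (1 - alpha) * gpb_depth

def pairwise_penalty(gpb, mu, nu, sigma=0.1):
    """Penalty g(mu, nu) for labelling neighboring pixels mu and nu differently:
    large across weak boundaries, small across strong ones. mu and nu are
    (row, col) index tuples into the combined gPb map; sigma is illustrative."""
    return np.exp(-max(gpb[mu], gpb[nu]) / (sigma ** 2))

# gpb = combined_gpb(gpb_rgb, gpb_depth)
# w = pairwise_penalty(gpb, (10, 20), (10, 21))
```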
Step (3) over-segmentation hierarchical grouping: region merging is performed using four different similarity measures to complete hierarchical region grouping and obtain target regions at all scales.
Step (3.1) first, the similarity s(r_i, r_j) of every pair of adjacent regions is calculated and added to a similarity set S; the two regions r_i and r_j with the maximum similarity in S are found and merged into one region r_t, which is added to the region set R;
Step (3.2) the similarities of r_i and r_j with their adjacent regions are removed from the similarity set, i.e., S = S \ s(r_i, r_*) and S = S \ s(r_*, r_j); the similarities between the new region r_t and its adjacent regions are calculated and added to the similarity set S;
and (3.3) repeating the steps (3.1) to (3.2) until the whole image becomes a large area, completing the hierarchical grouping of the areas and acquiring the target areas with all dimensions.
Step (3.1.1) calculating the color similarity: a 25-bin color histogram is obtained for each of the three RGB channels and normalized with the L1 norm, so that each region yields a 75-dimensional vector
C_i = {c_i^1, …, c_i^n}, n = 75.
Then, the color similarity between the regions is calculated according to the vector as shown in formula (9):
s_c(r_i, r_j) = Σ_{k=1}^{n} min(c_i^k, c_j^k)    (9)
calculating the texture similarity by the steps of (3.1.2): calculating Gaussian differential with variance of 1 for 8 different directions of each color channel, and acquiring a 10bins histogram for each direction of each color channel, so that each region obtains a 240-dimensional vector
T_i = {t_i^1, …, t_i^n}, n = 240.
Then, the texture similarity between the regions is calculated as shown in equation (10):
s_t(r_i, r_j) = Σ_{k=1}^{n} min(t_i^k, t_j^k)    (10)
calculating the size similarity by the step (3.1.3): the size similarity between the regions is calculated according to the ratio of each region size in the image as shown in equation (11):
s_s(r_i, r_j) = 1 − (size(r_i) + size(r_j)) / size(I)    (11)
wherein I refers to the entire image.
Step (3.1.4) calculating the coincidence (fill) similarity: the minimum rectangular bounding box B_ij containing the two regions to be merged is computed, together with the difference between its size and the sizes of the two regions; the coincidence similarity between the regions is then computed from the proportion of the image occupied by this difference, as shown in formula (12):
s_f(r_i, r_j) = 1 − (size(B_ij) − size(r_i) − size(r_j)) / size(I)    (12)
Step (3.1.5) combining the four similarities: the final similarity s(r_i, r_j) is a linear combination of the above four similarities, as shown in formula (13):
s(r_i, r_j) = a_1·s_c(r_i, r_j) + a_2·s_t(r_i, r_j) + a_3·s_s(r_i, r_j) + a_4·s_f(r_i, r_j)    (13)
In the formula, a_i ∈ {0, 1} indicates whether the corresponding similarity is used.
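A sketch of the four similarity measures and their linear combination (formulas (9) to (13)); the region dictionary keys and the box representation (x0, y0, x1, y1) are illustrative assumptions.

```python
import numpy as np

def colour_similarity(hist_i, hist_j):
    """Histogram intersection of the L1-normalized 75-bin color histograms (formula (9))."""
    return np.minimum(hist_i, hist_j).sum()

def texture_similarity(tex_i, tex_j):
    """Histogram intersection of the 240-bin texture histograms (formula (10))."""
    return np.minimum(tex_i, tex_j).sum()

def size_similarity(size_i, size_j, image_size):
    """Formula (11): encourages small regions to merge first."""
    return 1.0 - (size_i + size_j) / image_size

def bbox_union_size(b_i, b_j):
    """Area of the smallest rectangle enclosing both boxes (x0, y0, x1, y1)."""
    x0, y0 = min(b_i[0], b_j[0]), min(b_i[1], b_j[1])
    x1, y1 = max(b_i[2], b_j[2]), max(b_i[3], b_j[3])
    return (x1 - x0) * (y1 - y0)

def fill_similarity(size_i, size_j, bbox_ij_size, image_size):
    """Formula (12): favors merges whose joint bounding box has few gaps."""
    return 1.0 - (bbox_ij_size - size_i - size_j) / image_size

def combined_similarity(r_i, r_j, weights=(1, 1, 1, 1)):
    """Formula (13): linear combination with weights a_k in {0, 1}."""
    a1, a2, a3, a4 = weights
    return (a1 * colour_similarity(r_i['colour'], r_j['colour'])
            + a2 * texture_similarity(r_i['texture'], r_j['texture'])
            + a3 * size_similarity(r_i['size'], r_j['size'], r_i['image_size'])
            + a4 * fill_similarity(r_i['size'], r_j['size'],
                                   bbox_union_size(r_i['bbox'], r_j['bbox']),
                                   r_i['image_size']))
```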
Step (4) target bounding box matching: for targets in planar and non-planar areas, different strategies are used to match the minimum rectangular bounding box containing each target under four cases, giving the target area bounding boxes.
Step (4.1) for planar areas: boundary plane areas are adopted directly; for each non-boundary plane, its boundary points are found, the minimum Euclidean distance to the other non-boundary planes is calculated, planes whose distance is below a threshold are stitched together, and the stitched non-boundary plane areas are adopted; for objects located within a planar area, only the target regions generated from the RGB image are retained.
Step (4.2) for non-planar areas: all target regions are adopted directly, except those whose overlap with the non-planar area is small.
Step (4.3) matching target bounding boxes: all adopted target regions are converted into masks, and each mask is matched with the minimum rectangular bounding box containing it, giving the bounding box set B of formula (14):
B = B_BP + B_MPR + B_NPR + B_PR    (14)
where BP denotes boundary plane regions, MPR denotes stitched (non-boundary) plane regions, NPR denotes objects located in non-planar areas, and PR denotes objects located in planar areas.
Tiny bounding boxes are then removed from the set B, the remaining boxes are sorted by area, and the overlap rate between boxes is computed iteratively from top to bottom, where the overlap rate O(b_i, b_j) is calculated as in formula (15); redundant bounding boxes whose overlap rate exceeds a threshold are filtered out to obtain the final optimal set of target area bounding boxes, with the effect shown in fig. 5.
[Formula (15), shown as an image in the original publication, defines the overlap rate O(b_i, b_j) between two bounding boxes.]
In the formula, b_i, b_j ∈ B, and a(b_i) denotes the area of bounding box b_i.
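A sketch of the bounding box matching and filtering of step (4.3); the overlap rate here is taken as intersection over union and the thresholds are illustrative, since formula (15) is given only as an image in the original.

```python
import numpy as np

def mask_to_bbox(mask):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) containing a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def overlap_rate(b_i, b_j):
    """Overlap of two boxes, computed as intersection area over union area."""
    ix0, iy0 = max(b_i[0], b_j[0]), max(b_i[1], b_j[1])
    ix1, iy1 = min(b_i[2], b_j[2]), min(b_i[3], b_j[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = box_area(b_i) + box_area(b_j) - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, min_area=100, overlap_threshold=0.9):
    """Drop tiny boxes, sort by area, and remove boxes that overlap an already
    kept (larger) box too strongly."""
    boxes = sorted((b for b in boxes if box_area(b) >= min_area),
                   key=box_area, reverse=True)
    kept = []
    for b in boxes:
        if all(overlap_rate(b, k) <= overlap_threshold for k in kept):
            kept.append(b)
    return kept

# final_boxes = filter_boxes([mask_to_bbox(m) for m in adopted_masks])
```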
The technical means disclosed by the invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features.

Claims (5)

1. An indoor scene layout estimation and target area extraction method based on RGB-D images, characterized by comprising the following steps:
step 1, scene layout estimation: converting the depth map into a dense 3D point cloud, performing plane segmentation by calculating three-dimensional Euclidean distances between points to divide the scene into planar and non-planar areas, and classifying the obtained planar areas into boundary planes and non-boundary planes;
step 2, image over-segmentation: over-segmenting the preprocessed depth map and the preprocessed RGB map with a graph-based segmentation algorithm and a constrained-parameter min-cut segmentation algorithm to obtain region sets of different sizes;
step 3, over-segmentation hierarchical grouping: merging regions using four different similarity measures (color, texture, size, and coincidence) to complete hierarchical region grouping and obtain target regions at all scales;
step 4, target bounding box matching: for targets in planar areas, boundary plane areas are adopted directly; for each non-boundary plane, its boundary points are found, the minimum Euclidean distance to the other non-boundary planes is calculated, planes whose distance is below a threshold are stitched together, and the stitched non-boundary plane areas are adopted; for targets within a planar area, only the target regions generated from the RGB image are retained; for targets in non-planar areas, all target regions are adopted directly except those whose overlap with the non-planar area is small; all adopted target regions are converted into masks, and each mask is matched with the minimum rectangular bounding box containing it to obtain the bounding boxes.
2. The RGB-D image based indoor scene layout estimation and target area extraction method as claimed in claim 1, wherein the specific process of step 1 is as follows:
step 1.1, plane segmentation: sampling the depth map to obtain a set of point triples, and matching each triple to a candidate plane using the RANSAC algorithm; then searching for in-plane points in the 3D point cloud space, where a point is defined as an inlier of a plane when its three-dimensional Euclidean distance to the plane is smaller than the inlier distance tolerance D_tol, calculated as in formula (1); finally removing tiny planes with few inliers, and stitching planes that are spatially close or lie nearly in the same plane;
[Formula (1), shown as an image in the original publication, defines the inlier distance tolerance D_tol.]
in the formula, f is the focal length, b is the baseline length of the sensor, m is a linear normalization parameter, and Z is the depth value;
step 1.2, plane classification: for each obtained principal plane area, assuming that the plane's normal vector faces the observer, calculating the ratio of the number of point-cloud points lying on the far side of the plane to the total number of points in the scene, classifying planes whose ratio is below the threshold as boundary planes, and planes above the threshold as non-boundary planes.
3. The RGB-D image based indoor scene layout estimation and target area extraction method as claimed in claim 1, wherein the specific process of step 2 is as follows:
step 2.1, segmentation based on RGI color space: converting the RGB three-channel image into a normalized RG channel and a brightness I channel, namely an RGI color space, and then performing over-segmentation on the RGI image by adopting a graph-based segmentation method;
2.2, segmentation based on the homomorphic filtered gray level image: homomorphic filtering is carried out on the RGB image, and the processed and output gray level image is subjected to over-segmentation by adopting a graph-based segmentation method;
step 2.3, segmentation of the hole-filled depth map: filling holes in the depth map using a global-optimization colorization method, and over-segmenting the filled depth map using a graph-based segmentation method;
step 2.4, segmentation based on the RGB-D mixed channel: the constrained-parameter min-cut foreground segmentation method is adopted, combining RGB image information and depth information to over-segment the RGB-D mixed-channel image, based on the energy function of formula (2):
E_λ(X) = Σ_{μ∈V} C_λ(x_μ) + Σ_{(μ,ν)∈ε} V_μν(x_μ, x_ν)    (2)
in the formula, λ ∈ R, V is the set of all pixel points, and ε is the set of edges between adjacent pixels; C_λ is the cost function, generating a cost when each pixel is assigned a label, and the binary potential function V_μν is a penalty function, generating a penalty value when similar neighboring pixels are assigned different labels.
4. The RGB-D image based indoor scene layout estimation and target area extraction method as claimed in claim 1, wherein the specific process of step 3 is as follows:
step 3.1, calculating the similarities of all pairs of adjacent regions to form a similarity set, finding the two regions r_i and r_j with the maximum similarity, merging them into a new region r_t, and adding it to the region set;
step 3.2, removing from the similarity set the similarities of r_i and r_j with their adjacent regions, calculating the similarities between the new region r_t and its adjacent regions, and adding them to the similarity set;
step 3.3, repeating steps 3.1-3.2 until the whole image becomes one large region, completing the hierarchical grouping of regions and obtaining target regions at all scales.
5. The RGB-D image based indoor scene layout estimation and target area extraction method as claimed in claim 1, wherein the specific process of step 4 is as follows:
step 4.1, for planar areas: boundary plane areas are adopted directly; for each non-boundary plane, its boundary points are found, the minimum Euclidean distance to the other non-boundary planes is calculated, planes whose distance is below a threshold are stitched together, and the stitched non-boundary plane areas are adopted; for targets located within a planar area, only the target regions generated from the RGB image are retained;
step 4.2, for non-planar areas: all target regions are adopted directly, except those whose overlap with the non-planar area is small;
step 4.3, matching target bounding boxes: converting all adopted target regions into masks, matching each mask with the minimum rectangular bounding box containing it, then removing tiny bounding boxes, sorting the bounding boxes by area, iteratively calculating the overlap rate between bounding boxes from top to bottom, and filtering out bounding boxes whose overlap rate exceeds a threshold to obtain the final target area bounding boxes.
CN201710442910.9A 2017-06-13 2017-06-13 Indoor scene layout estimation and target area extraction method based on RGB-D image Active CN107369158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710442910.9A CN107369158B (en) 2017-06-13 2017-06-13 Indoor scene layout estimation and target area extraction method based on RGB-D image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710442910.9A CN107369158B (en) 2017-06-13 2017-06-13 Indoor scene layout estimation and target area extraction method based on RGB-D image

Publications (2)

Publication Number Publication Date
CN107369158A CN107369158A (en) 2017-11-21
CN107369158B true CN107369158B (en) 2020-11-13

Family

ID=60306413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710442910.9A Active CN107369158B (en) 2017-06-13 2017-06-13 Indoor scene layout estimation and target area extraction method based on RGB-D image

Country Status (1)

Country Link
CN (1) CN107369158B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108742159B (en) * 2018-04-08 2021-01-15 徐育 Intelligent control device and method for water dispenser based on RGB-D camera
CN109636814A (en) * 2018-12-18 2019-04-16 联想(北京)有限公司 A kind of image processing method and electronic equipment
CN111815696B (en) * 2019-04-11 2023-08-22 曜科智能科技(上海)有限公司 Depth map optimization method, device, equipment and medium based on semantic instance segmentation
CN110263692A (en) * 2019-06-13 2019-09-20 北京数智源科技有限公司 Container switch gate state identification method under large scene
CN110298298B (en) * 2019-06-26 2022-03-08 北京市商汤科技开发有限公司 Target detection and target detection network training method, device and equipment
CN111611919B (en) * 2020-05-20 2022-08-16 西安交通大学苏州研究院 Road scene layout analysis method based on structured learning
CN113256776B (en) * 2021-06-21 2021-10-01 炫我信息技术(北京)有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN115272529B (en) * 2022-09-28 2022-12-27 中国海洋大学 Layout-first multi-scale decoupling ocean remote sensing image coloring method and system
CN116740809B (en) * 2023-06-05 2024-03-29 嘉兴米兰映像家具有限公司 Intelligent sofa control method based on user gesture

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436654A (en) * 2011-09-02 2012-05-02 清华大学 Adaptive segmentation method of building point cloud
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
CN105488809A (en) * 2016-01-14 2016-04-13 电子科技大学 Indoor scene meaning segmentation method based on RGBD descriptor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090297049A1 (en) * 2005-07-07 2009-12-03 Rafael Advanced Defense Systems Ltd. Detection of partially occluded targets in ladar images

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436654A (en) * 2011-09-02 2012-05-02 清华大学 Adaptive segmentation method of building point cloud
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
CN105488809A (en) * 2016-01-14 2016-04-13 电子科技大学 Indoor scene meaning segmentation method based on RGBD descriptor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images; Saurabh Gupta et al.; Computer Vision Foundation; 2013-12-31; pp. 564-571 *

Also Published As

Publication number Publication date
CN107369158A (en) 2017-11-21

Similar Documents

Publication Publication Date Title
CN107369158B (en) Indoor scene layout estimation and target area extraction method based on RGB-D image
Du et al. Car detection for autonomous vehicle: LIDAR and vision fusion approach through deep learning framework
Chen et al. Moving-object detection from consecutive stereo pairs using slanted plane smoothing
CN108537239B (en) Method for detecting image saliency target
CN105261017A (en) Method for extracting regions of interest of pedestrian by using image segmentation method on the basis of road restriction
CN110610505A (en) Image segmentation method fusing depth and color information
Gupta et al. Automatic trimap generation for image matting
Wang et al. An overview of 3d object detection
CN112036231B (en) Vehicle-mounted video-based lane line and pavement indication mark detection and identification method
WO2019071976A1 (en) Panoramic image saliency detection method based on regional growth and eye movement model
EP3973507B1 (en) Segmentation for holographic images
CN113506318A (en) Three-dimensional target perception method under vehicle-mounted edge scene
Mohammadi et al. 2D/3D information fusion for building extraction from high-resolution satellite stereo images using kernel graph cuts
Chen et al. Underwater sonar image segmentation combining pixel-level and region-level information
Khandare et al. A survey paper on image segmentation with thresholding
Li et al. 3DCentripetalNet: Building height retrieval from monocular remote sensing imagery
Kozonek et al. On the fusion of camera and lidar for 3D object detection and classification
CN112037230B (en) Forest image segmentation method based on superpixels and hyper-metric profile map
Huang et al. Road scene segmentation via fusing camera and lidar data
Huang et al. Integrating visual and range data for road detection
Pankaj et al. Theoretical concepts and technical aspects on image segmentation
CN110991313B (en) Moving small target detection method and system based on background classification
Mehltretter et al. Illumination invariant dense image matching based on sparse features
Somawirata et al. Road Detection Based on Statistical Analysis
Jin et al. A dense depth estimation method using superpixels

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 201, building 2, phase II, No.1 Kechuang Road, Yaohua street, Qixia District, Nanjing City, Jiangsu Province, 210003

Applicant after: Nanjing University of Posts and Telecommunications

Address before: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66

Applicant before: Nanjing University of Posts and Telecommunications

GR01 Patent grant