CN110599587A - 3D scene reconstruction technology based on single image

3D scene reconstruction technology based on single image

Info

Publication number
CN110599587A
CN110599587A
Authority
CN
China
Prior art keywords
layout
image
vanishing points
vanishing
geometric
Prior art date
Legal status
Pending
Application number
CN201910728079.2A
Other languages
Chinese (zh)
Inventor
李烜
孙华志
王建全
吴昊聪
郜鹏宇
Current Assignee
Nanjing Terminal Information Technology Co Ltd
Original Assignee
Nanjing Terminal Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Terminal Information Technology Co Ltd filed Critical Nanjing Terminal Information Technology Co Ltd
Priority to CN201910728079.2A priority Critical patent/CN110599587A/en
Publication of CN110599587A publication Critical patent/CN110599587A/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85 Stereo camera calibration


Abstract

The invention discloses a 3D scene reconstruction technology based on a single image, which performs three-dimensional reconstruction from one image to obtain a simulated room frame, recovering the indoor layout and calculating the indoor area. For the original input image, straight-line detection is first performed to mark long line segments; the segments are divided into three groups and outlier segments are removed. Vanishing points in the three orthogonal directions are then computed by voting. A series of possible candidate layout structures is generated from information and other features sampled along uniformly spaced rays emitted from the vanishing points, and the candidates are ranked by a structured learning method. The first-ranked layout provides a label segmentation map; the generated geometric label map is then used to re-estimate the frame layout, supplying a confidence map for the visible surfaces and distinguishing straight lines that fall on objects from those that fall on the ceiling, walls and floor, so that the final correction is more accurate. Finally, the camera is calibrated and coordinates are transformed to obtain three-dimensional coordinates, and the room area is estimated from the layout.

Description

3D scene reconstruction technology based on single image
Technical Field
The invention belongs to the field of indoor scene layout recovery and area calculation, and particularly relates to a 3D scene reconstruction technology based on a single image.
Background
With the development of computer vision, two-dimensional image processing techniques have advanced dramatically. Enabling a computer to perceive three-dimensional space as human eyes do, however, remains difficult, and three-dimensional reconstruction technology has developed gradually for this purpose. Three-dimensional reconstruction belongs to high-level vision in the field of machine vision: a complete three-dimensional image of an object is restored in an object-centered coordinate system, the three-dimensional object is identified, and its position and orientation are determined. Divided by visual modality, current vision-based three-dimensional reconstruction mainly comprises video-based three-dimensional reconstruction and picture-based three-dimensional reconstruction. Understanding indoor scenes is of great value for tasks such as measuring room area, and the emphasis here is on indoor layout recovery and on estimating the indoor area from the recovered result. In most rooms, the boundaries carrying layout information are occluded and many cluttered objects are present, making it difficult to detect edges directly. A reconstruction method is therefore needed that can accurately detect indoor edges and the structural layout.
The purpose of the invention is as follows: the method performs three-dimensional reconstruction from a single image. Unlike multiple images, a single image cannot provide geometric information from different angles, and its limited information contains details that are difficult to process, such as shading. The method provided by the invention models the scene jointly from the 3D frame layout and the per-pixel surface labels; the resulting frame roughly simulates the room space as if the room were empty. The surface labels give the positions of visible objects, walls, floor and ceiling in the picture; modeling them jointly yields a more complete and appropriate layout estimate, while a robust parametric layout is obtained without losing the details the surface labels provide. The invention uses structured learning to predict the estimation result globally, enhancing the robustness of the reconstruction. The resulting model suits most indoor scenes, handles stray edges effectively, and finally achieves indoor scene layout recovery and area calculation.
The technical scheme is as follows: in order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a single image based 3D scene reconstruction technique, the technique comprising the steps of:
(1) firstly, performing straight-line detection on the original input image to mark long line segments;
(2) clustering the long line segments by inclination angle, removing outlier segments, and selecting three vanishing points in orthogonal directions by voting;
(3) generating a series of possible candidate layout structures, i.e. a series of preliminary layouts, by sampling information and other features along uniformly spaced rays emitted from the vanishing points;
(4) scoring each candidate layout using edge-based image features and a learning model, and ranking the candidate layout structures with a structured learning approach;
(5) providing a geometric label segmentation map for the first-ranked (most likely) layout;
(6) re-estimating the frame layout from the generated geometric labels and correcting it into an accurate layout;
(7) performing camera calibration and coordinate conversion on the accurate layout to obtain its three-dimensional coordinates;
(8) estimating the room area from the three-dimensional coordinate layout of the accurate layout.
In step (1), in the three-dimensional perspective view, all parallel straight lines in the x, y, and z directions intersect at a respective vanishing point on one picture, so that three vanishing points are obtained. Judging the vanishing points in three directions is an important step in reconstructing the layout structure of the room. Many lines in the scene are parallel and the various edges are orthogonal. Most vanishing point detection relies on the detection of straight line segments in the image.
In step (1), to obtain straight line segments in the original image, image edge information is extracted with the Canny edge detection operator. In a complex indoor scene, however, there is a great deal of edge information, and only longer straight lines contribute much to calculating the vanishing points accurately; so, on top of Canny edge detection, the lines are filtered by length to remove useless short segments. The edge pixels are numbered one by one, the gradient direction is quantized into intervals 1, 2, 3, …, k, and the gradient direction of each edge is marked in the corresponding interval. To obtain effective straight lines of a certain length, it then suffices to find connected regions exceeding a preset threshold, reducing redundant lines and improving computational efficiency and accuracy. The invention uses straight lines longer than 30 pixels for the calculation.
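As an illustration of this stage, the following is a minimal sketch in Python using OpenCV. The patent groups edge pixels by quantized gradient direction over connected regions; the probabilistic Hough transform below is a common stand-in for extracting segments from a Canny edge map, and all thresholds other than the 30-pixel length cutoff from the text are assumed values.

```python
import cv2
import numpy as np

# Sketch of step (1): Canny edges, then line segments, keeping only
# segments longer than 30 pixels as stated in the text.
def detect_long_segments(image_path, min_len=30):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)          # Canny thresholds are illustrative
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=50, minLineLength=min_len, maxLineGap=5)
    segments = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            length = np.hypot(x2 - x1, y2 - y1)
            if length > min_len:               # discard short, noisy segments
                segments.append(((x1, y1), (x2, y2), length))
    return segments
```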
In step (2), clustering is first performed according to the inclination angle of each line segment; segments sharing one of the three vanishing points gather into one cluster, and voting is then performed within the resulting main clusters to select the targets. All extracted edge lines are divided into three groups, and the three groups of lines in different directions are marked with different colors: red represents the z-axis direction perpendicular to the ground, green the x direction, and blue the y direction. Yellow marks outliers, redundant straight lines not collinear with any of the calculated vanishing points.
In step (2), the clustered line segments cannot be used directly to mark vanishing points, because many redundant lines may be included. Redundant straight lines greatly affect the accuracy of vanishing point calculation on the one hand, and on the other hand the computational complexity increases exponentially with the number of candidate vanishing points. The yellow lines are removed before calculation. When selecting the vanishing points, voting is needed to remove redundancy while obtaining the target points and candidate points. First the distance between a vanishing point and a straight line is defined; any vanishing point detection method must, directly or indirectly, include such a distance function. Let m_x and m_y be the x and y values of the segment midpoint in the x-y coordinate plane, l the length of the current segment, and α_s the angle between the current segment and the coordinate plane. Storing each segment in the midpoint representation (m_x, m_y, l, α_s) makes the segment length explicit, so longer segments, which contain more accurate vanishing point information, can be found. A segment s and its projection s′ onto the straight line through the segment midpoint and the vanishing point share the same midpoint and correspond to the same vanishing point. On this basis, the distance function d(vp, s) is defined as the angle between the vanishing point and the segment, i.e. the angle between s′ and s:

d(vp, s) = ∠(s, s′)
for example, a candidate vanishing point corresponds to the accumulator a, and if the distance function of the straight line in the corresponding angular domain is smaller than the threshold T, a vote can be successfully cast. The threshold T is adjusted within a suitable range, not too small, otherwise lines containing useful information may be missed, so that redundant lines can be filtered out well. Meanwhile, voting also depends on the length of the line segment, and the weight w can be adjusted1And w2To balance the impact of both on the final result.
Finally, a statistical histogram of the total votes of all vanishing point accumulators is made; the larger a histogram peak, the more votes the corresponding vanishing point received. The three largest peaks are taken as the three vanishing points. In the experiments, with the minimum line length of 30 pixels set in step (1), there are about 100 to 200 straight lines per picture, and each detected line in the image can be assigned to one of the vanishing points.
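A sketch of this voting scheme follows. The definition of d(vp, s) matches the text; the exact way the distance and length terms combine is not given, so a weighted sum using w_1 and w_2 is assumed here, and the threshold T is illustrative.

```python
import numpy as np

# Sketch of step (2). Each segment s is stored as (m_x, m_y, l, alpha_s):
# midpoint, length, and orientation angle. d(vp, s) is the angle between
# the segment direction and the line joining its midpoint to the candidate.
def d_vp_s(vp, seg):
    mx, my, l, alpha = seg
    to_vp = np.arctan2(vp[1] - my, vp[0] - mx)   # direction midpoint -> vp
    diff = abs(alpha - to_vp) % np.pi            # undirected angle difference
    return min(diff, np.pi - diff)

def vote(candidates, segments, T=np.deg2rad(5), w1=1.0, w2=1.0):
    # Assumed combination: longer segments and smaller angular distances
    # contribute more; only near-collinear lines (d < T) may vote at all.
    acc = np.zeros(len(candidates))
    max_l = max(s[2] for s in segments)
    for i, vp in enumerate(candidates):
        for s in segments:
            dist = d_vp_s(vp, s)
            if dist < T:
                acc[i] += w1 * (s[2] / max_l) + w2 * (1 - dist / T)
    return acc   # take the three highest peaks as the three vanishing points
```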
In step (3), a preliminary 3D frame is generated first. The direction information of the 3D frame can be deduced from the vanishing points, and this information places a strong geometric constraint on the vertex projections of the frame model. The room layout involves at most five faces (left, middle and right walls, floor and ceiling), and the frame lines comprise three kinds of intersections: wall with wall, wall with floor, and wall with ceiling. The vanishing points corresponding to the three orthogonal directions in reality are vp_1, vp_2 and vp_3. Selecting the two vanishing points vp_1 and vp_2, pairs of rays are drawn from them; when fewer than five surfaces of the room are visible in the photo, the intersections of the two clusters of rays lie outside the picture, but the processing is the same. Once the vanishing points are known, the space is sampled. A layout is completely specified by two rays through each of the two vanishing points, providing four corners and four edges, while the remaining edges of the box are obtained by projecting rays from the third vanishing point vp_3 to these four corners. In the present invention, 10 evenly spaced rays are used per vanishing point to generate the different candidate layouts.
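A sketch of the candidate generation follows. The angular range swept by the rays is an assumption (in practice it is chosen so the rays cover the image), and the completion of the box by rays from vp_3 is only noted in a comment.

```python
import itertools
import numpy as np

# Sketch of step (3): sample 10 evenly spaced rays from vp1 and vp2 and
# enumerate candidate box layouts from pairs of rays at each vanishing point.
def ray_angles(n=10, lo=-np.pi / 3, hi=np.pi / 3):
    return np.linspace(lo, hi, n)                # illustrative angular sweep

def candidate_layouts(vp1, vp2, n=10):
    layouts = []
    a1, a2 = ray_angles(n), ray_angles(n)
    # Two rays through vp1 and two through vp2 fix the four corners and four
    # edges; rays from vp3 toward those corners complete the box (omitted).
    for r1, r2 in itertools.combinations(a1, 2):
        for r3, r4 in itertools.combinations(a2, 2):
            layouts.append((vp1, (r1, r2), vp2, (r3, r4)))
    return layouts   # C(10,2)^2 = 2025 candidates before any pruning
```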
In step (4), the candidate 3D layouts generated in the previous section are ranked according to how well they fit the ground truth. Given a training set of indoor images {x_1, x_2, …, x_n} ∈ X and their output frames {y_1, y_2, …, y_n} ∈ Y, a mapping f: X × Y → R must be learned to score the candidate frames generated automatically from an image, where R is the set of scores. Each layout consists of five parameterized polygonal surfaces, y_i = {F_1, F_2, F_3, F_4, F_5}. The better the input image and a layout match, the higher the score f(x_i, y); likewise, the more y deviates from y_i, the lower the score. Thus, for a new test image x, its correct layout y* is defined as

y* = argmax_y f(x, y)
The above is a structured regression problem, and the output is a layout with a complex structure. Structured regression is carried out by using a Structured-SVM learning framework, and a model is obtained by training the learning framework. It models the relationship between different outputs within the output space to better utilize the available training data.
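For concreteness, a minimal sketch of the scoring-and-selection step follows. The weight vector w and the joint feature map phi are placeholders standing in for the trained Structured-SVM model and its edge-based features.

```python
import numpy as np

# Sketch of step (4): score each candidate layout with a learned linear
# model f(x, y) = w . phi(x, y) and keep the argmax, as in structured SVMs.
# 'phi' is the joint feature map over image x and layout y; in the method it
# measures edge-based agreement between the layout and the image.
def best_layout(image, candidates, w, phi):
    scores = [float(np.dot(w, phi(image, y))) for y in candidates]
    return candidates[int(np.argmax(scores))]    # y* = argmax_y f(x, y)
```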
In step (5), an indoor scene label map is generated from geometric features; each kind of label information given by the geometric context then weights the line segments, which are re-ranked to obtain a more reasonable layout. The first step obtains a superpixel segmentation map from the raw pixels of the image, judging RGB distance against an adaptive threshold with a pixel-accumulation method to determine the degree of segmentation. The superpixel map is then segmented again: multiple segmentations are learned from the superpixels' color, texture, edge and similar features to form contiguous regions. Each segmentation provides a different view of the image; to obtain the final labeling, the probability that each segment belongs to each label must be computed. For an indoor structure the labels are left wall, right wall, ceiling, floor and indoor object, and if all superpixels contained in a region carry the same label, the region is consistent.
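The superpixel stage can be sketched as follows. The patent's own pixel-accumulation method with an adaptive RGB threshold is not spelled out, so Felzenszwalb's graph-based segmentation from scikit-image stands in for it here; the file name and parameters are illustrative.

```python
from skimage import io
from skimage.segmentation import felzenszwalb

# Sketch of the oversegmentation at the start of step (5). Felzenszwalb's
# method is a stand-in for the pixel-accumulation scheme in the text.
img = io.imread("room.jpg")                     # illustrative input image
superpixels = felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
# 'superpixels' assigns a region id to every pixel. Regions are then merged
# on color/texture/edge features, and each merged region is scored against
# the labels: left wall, right wall, ceiling, floor, indoor object.
```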
In step (6), the frame layout is re-marked using the generated indoor scene geometric labels to obtain a more accurate 3D frame layout. AdaBoost classification combined with boosted decision trees estimates feature regions with consistency. The training process is as follows: for each training image, compute a superpixel segmentation map, merge it into a region segmentation, label each region according to the ground-truth map, compute feature vectors, and then train a geometric-class label classifier and a region classifier. Accordingly, the left/center/right wall, ceiling and indoor object labels are finally obtained. During training, cross-validation is used to compute box layout cues for the training set and to compute the percentage and overlap area of a segmented region on its corresponding surface, which reduces confusion between object surfaces and room surfaces. Finally, the trained model is applied to the generated label picture to obtain the accurate frame layout.
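As a sketch of the classifier training described above, the following uses scikit-learn's AdaBoost over shallow decision trees (the `estimator=` keyword assumes scikit-learn 1.2 or later). The feature vectors and labels are dummy stand-ins for the per-region features and ground-truth labels of the method.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Sketch of step (6): boosted decision trees over per-region feature vectors.
LABELS = ["left wall", "center wall", "right wall", "ceiling", "floor", "object"]

X_train = np.random.rand(200, 32)                   # dummy per-region features
y_train = np.random.randint(len(LABELS), size=200)  # dummy ground-truth labels

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # the boosted weak learner
    n_estimators=100,
)
clf.fit(X_train, y_train)
region_probs = clf.predict_proba(X_train[:5])       # per-label confidence per region
```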
In step (7), the purpose of camera calibration is to obtain the internal and external parameters of the camera, i.e. the relationship between two-dimensional pixel coordinates and three-dimensional world coordinates. The image coordinate system is a pixel-based coordinate system with its origin at the upper left; the position of each pixel is expressed in pixel units, so this system is called the image pixel coordinate system (u, v), where u and v denote the column and row of the pixel in the digital image. Because pixel positions are not expressed in physical units, an image coordinate system in physical units (millimeters) must also be established, called the image physical coordinate system (x, y). Its origin is at the intersection of the optical axis and the image plane, which lies at the center of the image, and its two axes are parallel to those of the image pixel coordinate system.
Wherein, in the step (7), the coordinate conversion includes the steps of:
transformation from the world coordinate system to the camera coordinate system, described by a rotation matrix R and a translation vector t:

X_c = R·X_w + t

in homogeneous coordinates:

[X_c, Y_c, Z_c, 1]^T = [R t; 0 1]·[X_w, Y_w, Z_w, 1]^T

camera coordinate system to image physical coordinate system, where f is the focal length:

x = f·X_c/Z_c,  y = f·Y_c/Z_c

or, in homogeneous coordinates:

Z_c·[x, y, 1]^T = [f 0 0 0; 0 f 0 0; 0 0 1 0]·[X_c, Y_c, Z_c, 1]^T

composing these gives the transformation from the world coordinate system to the image coordinate system:

Z_c·[u, v, 1]^T = M_1·M_2·[X_w, Y_w, Z_w, 1]^T

where M_1 contains the intrinsic parameters of the camera and M_2 the extrinsic parameters. This establishes the relation between camera image pixel positions and scene point positions: to compute the length of a frame line in the picture, it suffices to take the endpoint coordinates (u, v), substitute them into the formula, and solve for the world coordinates.
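As an illustration of this transformation chain, the sketch below back-projects a pixel onto the floor plane Z_w = 0 to recover its world coordinates; K plays the role of M_1 and (R, t) of M_2, and all numeric values are illustrative rather than calibrated.

```python
import numpy as np

# Sketch of step (7): recover the world coordinates of a pixel (u, v) by
# intersecting its viewing ray with the floor plane Z_w = 0.
def pixel_to_floor(u, v, K, R, t):
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    ray_w = R.T @ ray_cam                               # ray in world frame
    cam_center = -R.T @ t                               # camera center in world
    s = -cam_center[2] / ray_w[2]                       # ray-plane intersection
    return cam_center + s * ray_w

# Example: one side of the floor rectangle from two floor-corner pixels.
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])  # illustrative M1
R, t = np.eye(3), np.array([0.0, 0.0, 2.8])                # illustrative M2
p1 = pixel_to_floor(100, 400, K, R, t)
p2 = pixel_to_floor(540, 400, K, R, t)
width = np.linalg.norm(p2 - p1)     # length of one floor edge in meters
```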
In step (8), the length and width values are obtained from the world coordinates thus solved, and the area is calculated. If the standard height of the room is preset to be 2.8 meters, the volume of the room can be calculated.
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
1. three-dimensional reconstruction from a single image requires little computation and few preprocessing steps, needs no subsequent multi-angle camera calibration, and avoids the feature correlation and matching required in multi-image processing;
2. the method provided by the invention models the scene jointly from the 3D frame layout and the per-pixel surface labels; the resulting frame roughly simulates the room space as if the room were empty. The surface labels provide the positions of visible objects, walls, floor and ceiling in the picture; joint modeling of these yields a more complete and appropriate layout estimate, while a robust parametric layout is obtained without losing the details the surface labels provide;
3. the invention works globally and achieves higher accuracy by adding the constraint of ray sampling on the layout. After the vanishing points are computed, the error is reduced by combining structured-learning output with the addition of geometric labels.
Drawings
FIG. 1 is a flow diagram of a single image based 3D scene reconstruction technique of the present invention;
FIG. 2 is the final practical effect of the present invention, obtaining an indoor 3D frame to complete scene reconstruction;
FIG. 3 is a diagram of the line segments detected after line detection is performed on the initial image;
FIG. 4 is an image in which the invention calculates the length of each segment after segment labeling and removes abnormal segments;
FIG. 5 shows the three vanishing points selected by the invention after vanishing point selection on FIG. 4;
FIG. 6 is a layout of the present invention resulting from projecting evenly spaced rays from a vanishing point;
FIG. 7 is a diagram of the precise layout obtained by the present invention by re-estimating the layout of the bounding box from the generated geometric labels. The finally obtained image completes image layout reconstruction;
fig. 8 is an image of the invention extracting a 3D scene border from a reconstructed image.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The invention discloses a 3D scene reconstruction technology based on a single image, the flow of which is shown in figure 1, the input image and the processing result are shown in figure 2:
(1) Performing straight-line detection and marking long line segments
In the three-dimensional perspective view, all parallel straight lines in the x, y and z directions intersect at a respective vanishing point on a picture, so that three vanishing points are obtained. Judging the vanishing points in three directions is an important step in reconstructing the layout structure of the room. Many lines in the scene are parallel and the various edges are orthogonal. Most vanishing point detection relies on the detection of straight line segments in the image.
Image edge information is extracted with the Canny edge detection operator; the detected line segments are shown in fig. 3. In such a relatively complex indoor scene, however, there is a great deal of edge information, and only long straight lines contribute much to calculating the vanishing points accurately, so a length requirement is imposed on top of Canny edge detection. The edge pixels are numbered one by one, the gradient direction is quantized into intervals 1, 2, 3, …, k, and the gradient direction of each edge is marked in the corresponding interval. To obtain effective straight lines of a certain length, it then suffices to find connected regions exceeding a preset threshold, reducing redundant lines and improving computational efficiency and accuracy. The invention uses straight lines longer than 30 pixels for the calculation.
(2) Clustering according to the inclination angle to obtain three vanishing points
Firstly, clustering is performed according to the inclination angle of each long segment; segments sharing one of the three vanishing points gather into one cluster, and voting is then performed within the resulting main clusters to select the targets. All extracted edge lines are divided into three groups, marked with different colors by direction: red represents the z-axis direction perpendicular to the ground, green the x direction, and blue the y direction. Yellow marks outliers, redundant straight lines not collinear with any calculated vanishing point. Too many redundant straight lines greatly affect the accuracy of vanishing point calculation on the one hand, and on the other hand the computational complexity increases exponentially with the number of candidate vanishing points. The yellow lines are removed before calculation.
The clustered long segments cannot be used directly to mark vanishing points, because redundant lines may be included. Voting is needed to remove redundancy while obtaining the target points and candidate points. First the distance between a vanishing point and a straight line is defined; any vanishing point detection method must, directly or indirectly, include such a distance function. Let m_x and m_y be the x and y values of the segment midpoint in the x-y coordinate plane, l the length of the current segment, and α_s the angle between the current segment and the coordinate plane. Storing each segment in the midpoint representation (m_x, m_y, l, α_s) makes the segment length explicit, so longer segments, which contain more accurate vanishing point information, can be found. A segment s and its projection s′ onto the straight line through the segment midpoint and the vanishing point share the same midpoint and correspond to the same vanishing point. On this basis, the distance function d(vp, s) is defined as the angle between the vanishing point and the segment, i.e. the angle between s′ and s.
Treating redundant straight lines as a source of noise-induced error to be handled, the distance can be expressed as:

d(vp, s) = ∠(s, s′)
for example, a candidate vanishing point corresponds to the accumulator a, and if the distance function of the straight line in the corresponding angular domain is smaller than the threshold T, a vote can be successfully cast. The threshold T is adjusted within a suitable range, not too small, otherwise lines containing useful information may be missed, so that redundant lines can be filtered out well. Meanwhile, voting also depends on the length of the line segment, and the weight w can be adjusted1And w2To balance the impact of both on the final result.
Finally, a statistical histogram of the total votes of all vanishing point accumulators is made; the larger a histogram peak, the more votes the corresponding vanishing point received. The three largest peaks are taken as the three vanishing points. In the experiments, with the minimum line length of 30 pixels set in step (1), there are about 100 to 200 straight lines per picture, and each detected line in the image can be assigned to one of the vanishing points.
Fig. 4 shows all the usable line segments finally detected. All segments are divided into three groups according to the three spatial directions, and the length of each segment is marked for screening and adjustment. Finally, the marked image is used for vanishing point detection, yielding fig. 5.
(3) Generating a series of candidate layouts
A preliminary 3D frame is generated before the final 3D frame. The direction information of the 3D frame can be deduced from the vanishing points, and this information places a strong geometric constraint on the vertex projections of the frame model. The room layout involves at most five faces (left, middle and right walls, floor and ceiling), and the frame lines comprise three kinds of intersections: wall with wall, wall with floor, and wall with ceiling. The vanishing points corresponding to the three orthogonal directions in reality are vp_1, vp_2 and vp_3. Selecting the two vanishing points vp_1 and vp_2, pairs of rays are drawn from them, as shown in fig. 6. When fewer than five surfaces of the room are visible in the photo, the intersections of the two clusters of rays lie outside the picture, but the processing is the same. Once the vanishing points are known, the space is sampled. A layout is completely specified by two rays through each of the two vanishing points, providing four corners and four edges, while the remaining edges of the box are obtained by projecting rays from the third vanishing point vp_3 to these four corners. In the present invention, 10 evenly spaced rays are used per vanishing point to generate the different candidate layouts.
(4) Ranking the candidate layouts and selecting the top-scoring layout
The candidate 3D layouts generated in the previous section are ranked according to how well they fit the ground truth. Given a training set of indoor images {x_1, x_2, …, x_n} ∈ X and their output frames {y_1, y_2, …, y_n} ∈ Y, a mapping f: X × Y → R must be learned to score the candidate frames generated automatically from an image, where R is the set of scores. Each layout consists of five parameterized polygonal surfaces, y_i = {F_1, F_2, F_3, F_4, F_5}. The better the input image and a layout match, the higher the score f(x_i, y); likewise, the more y deviates from y_i, the lower the score. Thus, for a new test image x, its correct layout y* is defined as

y* = argmax_y f(x, y)
The above is a structured regression problem, and the output is a layout with a complex structure. Structured regression is carried out by using a Structured-SVM learning framework, and a model is obtained by training the learning framework. It models the relationship between different outputs within the output space to better utilize the available training data.
(5) Providing a geometric label segmentation graph for a first ranked layout
The indoor scene label map is generated from geometric features; each kind of label information given by the geometric context then weights the line segments, which are re-ranked to obtain a more reasonable layout. The first step obtains a superpixel segmentation map from the raw pixels of the image, judging RGB distance against an adaptive threshold with a pixel-accumulation method to determine the degree of segmentation. The superpixel map is then segmented again: multiple segmentations are learned from the superpixels' color, texture, edge and similar features to form contiguous regions. Each segmentation provides a different view of the image; to obtain the final labeling, the probability that each segment belongs to each label must be computed. For an indoor structure the labels are left wall, right wall, ceiling, floor and indoor object, and if all superpixels contained in a region carry the same label, the region is consistent.
(6) Re-estimating bezel layout
The frame layout is re-marked using the generated indoor scene geometric labels to obtain a more accurate 3D frame layout. AdaBoost classification combined with boosted decision trees estimates feature regions with consistency. The training process is as follows: for each training image, compute a superpixel segmentation map, merge it into a region segmentation, label each region according to the ground-truth map, compute feature vectors, and then train a geometric-class label classifier and a region classifier. Accordingly, the left/center/right wall, ceiling and indoor object labels are finally obtained. During training, cross-validation is used to compute box layout cues for the training set and to compute the percentage and overlap area of a segmented region on its corresponding surface, which reduces confusion between object surfaces and room surfaces. Finally, the trained model is applied to the generated label picture to obtain the accurate frame layout.
The resulting final bezel layout is shown in fig. 7.
(7) Performing camera calibration and coordinate transformation
The purpose of camera calibration is to obtain the internal and external parameters of the camera, i.e. the relationship between two-dimensional pixel coordinates and three-dimensional world coordinates. The image coordinate system is a pixel-based coordinate system with its origin at the upper left; the position of each pixel is expressed in pixel units, so this system is called the image pixel coordinate system (u, v), where u and v denote the column and row of the pixel in the digital image. Because pixel positions are not expressed in physical units, an image coordinate system in physical units (millimeters) must also be established, called the image physical coordinate system (x, y). Its origin is at the intersection of the optical axis and the image plane, which lies at the center of the image, and its two axes are parallel to those of the image pixel coordinate system.
The coordinate transformation includes the following steps:
transformation from the world coordinate system to the camera coordinate system, described by a rotation matrix R and a translation vector t:

X_c = R·X_w + t

in homogeneous coordinates:

[X_c, Y_c, Z_c, 1]^T = [R t; 0 1]·[X_w, Y_w, Z_w, 1]^T

camera coordinate system to image physical coordinate system, where f is the focal length:

x = f·X_c/Z_c,  y = f·Y_c/Z_c

or, in homogeneous coordinates:

Z_c·[x, y, 1]^T = [f 0 0 0; 0 f 0 0; 0 0 1 0]·[X_c, Y_c, Z_c, 1]^T

composing these gives the transformation from the world coordinate system to the image coordinate system:

Z_c·[u, v, 1]^T = M_1·M_2·[X_w, Y_w, Z_w, 1]^T

where M_1 contains the intrinsic parameters of the camera and M_2 the extrinsic parameters. This establishes the relation between camera image pixel positions and scene point positions: to compute the length of a frame line in the picture, it suffices to take the endpoint coordinates (u, v), substitute them into the formula, and solve for the world coordinates.
The final extracted 3D bounding box is shown in fig. 8.
(8) Calculating room area and volume
And obtaining a length and width value according to the solved world coordinates, and calculating the area. If the standard height of the room is preset to be 2.8 meters, the volume of the room can be calculated.

Claims (7)

1. A single image-based 3D scene reconstruction technique is characterized in that the method comprises the following steps:
(1) firstly, performing straight-line detection on the original input image to mark long line segments;
(2) clustering the long line segments by inclination angle, removing outlier segments, and selecting three vanishing points in orthogonal directions by voting;
(3) generating a series of possible candidate layout structures, i.e. a series of preliminary layouts, by sampling information and other features along uniformly spaced rays emitted from the vanishing points;
(4) scoring each candidate layout using edge-based image features and a learning model, and ranking the candidate layout structures with a structured learning approach;
(5) providing a geometric label segmentation map for the first-ranked (most likely) layout;
(6) re-estimating the frame layout from the generated geometric labels and correcting it into an accurate layout;
(7) performing camera calibration and coordinate conversion on the accurate layout to obtain its three-dimensional coordinates;
(8) estimating the room area from the three-dimensional coordinate layout of the accurate layout.
2. The single-image-based 3D scene reconstruction technology according to claim 1, wherein in step (1), in the three-dimensional perspective view, all parallel straight lines in the x, y and z directions intersect at a respective vanishing point in a single image, so that three vanishing points are obtained by straight-line segment detection. Effective straight lines of a certain length are obtained by setting a threshold, reducing redundant lines.
3. The single-image-based 3D scene reconstruction technology according to claim 2, wherein in step (2), the lines are clustered according to the inclination angle of each segment and divided into three groups, the three groups of lines in different directions being marked with different colors. The vanishing points are selected by voting: a statistical histogram is made of the total votes of all vanishing point accumulators, where a larger histogram peak means more votes for the current vanishing point, and the three largest peaks are taken as the vanishing points.
4. The single-image-based 3D scene reconstruction technology according to claim 3, wherein in step (3), the direction information of the 3D frame is estimated from the vanishing points and geometric constraints are applied to the frame model, and a preliminary 3D frame is generated to form the different candidate layouts.
5. The single-image-based 3D scene reconstruction technology according to claim 4, wherein in step (4), the candidate layouts are scored against the ground truth using a Structured-SVM learning framework, the model being obtained by training that framework, and the highest-scoring candidate layout is selected for the next operation.
6. The single-image-based 3D scene reconstruction technology according to claim 5, wherein in the step (5) and the step (6), an indoor scene label map is generated according to the geometric features, the border layout is re-marked by using the generated indoor scene geometric labels, and the line segments are weighted by each label information given by the geometric context, so as to obtain a more accurate 3D border layout.
7. The single-image-based 3D scene reconstruction technique according to claim 6, wherein in step (8), the length and width values are obtained from the solved world coordinates, and the area is calculated. If the standard height of the room is preset to be 2.8 meters, the volume of the room can be calculated.
CN201910728079.2A 2019-08-08 2019-08-08 3D scene reconstruction technology based on single image Pending CN110599587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910728079.2A CN110599587A (en) 2019-08-08 2019-08-08 3D scene reconstruction technology based on single image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910728079.2A CN110599587A (en) 2019-08-08 2019-08-08 3D scene reconstruction technology based on single image

Publications (1)

Publication Number Publication Date
CN110599587A (en) 2019-12-20

Family

ID=68853705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910728079.2A Pending CN110599587A (en) 2019-08-08 2019-08-08 3D scene reconstruction technology based on single image

Country Status (1)

Country Link
CN (1) CN110599587A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106327532A (en) * 2016-08-31 2017-01-11 北京天睿空间科技股份有限公司 Three-dimensional registering method for single image
CN107292234A (en) * 2017-05-17 2017-10-24 南京邮电大学 It is a kind of that method of estimation is laid out based on information edge and the indoor scene of multi-modal feature

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798431A (en) * 2020-07-06 2020-10-20 苏州市职业大学 Real-time vanishing point detection method, device, equipment and storage medium
CN111798431B (en) * 2020-07-06 2023-09-15 苏州市职业大学 Real-time vanishing point detection method, device, equipment and storage medium


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination