WO2024088071A1 - Three-dimensional scene reconstruction method, apparatus, device and storage medium - Google Patents

Three-dimensional scene reconstruction method, apparatus, device and storage medium

Info

Publication number
WO2024088071A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic
terrain
semantic object
target images
pixel
Prior art date
Application number
PCT/CN2023/124212
Other languages
English (en)
French (fr)
Inventor
冯驰原
Original Assignee
深圳市其域创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市其域创新科技有限公司
Publication of WO2024088071A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/005: General purpose rendering architectures
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks

Definitions

  • the embodiments of the present invention relate to the field of computer vision technology, and specifically to a three-dimensional scene reconstruction method, device, equipment and storage medium.
  • embodiments of the present invention provide a three-dimensional scene reconstruction method, apparatus, device and storage medium to solve the problems existing in the prior art.
  • a three-dimensional scene reconstruction method comprising:
  • the camera parameters include camera intrinsic parameters and camera extrinsic parameters
  • a three-dimensional reconstruction is performed according to the vector outline, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  • determining the terrain layer according to the j target images further comprises:
  • a cloth filter calculation is performed according to the three-dimensional point cloud to determine the terrain layer.
  • performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
  • Pixel depth estimation is performed on all pixels of each of the identified objects in each of the target images to obtain an estimated object and a depth position corresponding to the estimated object, and the estimated object is used as the semantic object.
  • performing pixel depth estimation on all pixels of each identified object in each target image to obtain an estimated object and a depth position corresponding to the estimated object further comprises:
  • the average pixel depth position of all first pixels of the identified object in each of the target images is calculated according to the pixel depth position, and the average pixel depth position is used as the depth position.
  • in response to the semantic object meeting the first preset condition, taking the semantic object meeting the first preset condition as the first semantic object, and determining the vector contour of the first semantic object in the terrain layer according to the camera parameter further comprises:
  • the second semantic label that the terrain point obtains the largest number of times across the target images is used as the target semantic label
  • performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
  • a fusion calculation is performed on the same estimated object in the j target images to obtain a three-dimensional fused object and a fusion position corresponding to the fused object, the fused object is used as the semantic object, and the fusion position corresponding to the second semantic object is used as the modeling position.
  • the performing fusion calculation on the same estimated object in the j target images to obtain a three-dimensional fused object and a fused position corresponding to the fused object further includes:
  • each pixel of the same identified object in the j target images is fused and calculated to obtain the fused object, wherein the fused object includes a voxels, each of which is obtained by fusion of the corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
  • the fusion position of the fused object is calculated from the voxel depth positions, for example as the average depth position (d_1 + d_2 + ... + d_a)/a, where d_i represents the voxel depth position of each of the voxels in the fused object.
  • a three-dimensional scene reconstruction device comprising:
  • a first acquisition module is used to acquire j target images for an outdoor scene, where j is a positive integer greater than 1;
  • a second acquisition module used to acquire camera parameters of each of the target images, wherein the camera parameters include camera intrinsic parameters and camera extrinsic parameters;
  • a first determination module is used to determine a terrain layer according to j target images, wherein the terrain layer has k terrain points, and k is a positive integer greater than 1;
  • a first computing module configured to perform semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object
  • a second determination module is configured to take the semantic object as a first semantic object according to a first preset condition, and determine a vector contour of the first semantic object in the terrain layer according to the camera parameter;
  • a third determining module configured to take the semantic object as a second semantic object according to a second preset condition, determine a modeling position of the second semantic object in the terrain layer according to the camera parameter, and the second semantic object has fewer feature points than the first semantic object;
  • the second calculation module is used to perform three-dimensional reconstruction according to the vector outline, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  • a computing device comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other via the communication bus;
  • the memory is used to store at least one executable instruction, and the executable instruction enables the processor to perform the operation of any one of the three-dimensional scene reconstruction methods described above.
  • a computer-readable storage medium wherein at least one executable instruction is stored in the storage medium, and the executable instruction, when run, can execute the operation of any of the above-mentioned three-dimensional scene reconstruction methods.
  • since the size of the terrain part is large, the number of corresponding pixels in the target image is large. If the terrain layer adopted the same generation method as the first semantic object and the second semantic object, the amount of calculation of the processor would be increased, affecting the working efficiency of the processor.
  • the generation method of the terrain layer is different from the generation method of the first semantic object and the second semantic object, and the terrain part of the outdoor scene can be separated from other objects, so that the first semantic object and the second semantic object can be combined with the terrain layer later, thereby improving the accuracy and comprehensiveness of the three-dimensional scene reconstruction.
  • the terrain layer, the first semantic object and the second semantic object are all automatically generated by the processor, and the vector contour of the first semantic object in the terrain layer and the modeling position of the second semantic object in the terrain layer are also automatically extracted by the processor, without the need for manual extraction of the three-dimensional model of the object to be reconstructed, thereby improving the efficiency of the three-dimensional reconstruction of the outdoor scene.
  • the terrain layer, the first semantic object and the second semantic object are all generated by j target images, which can be better positioned and aligned, so that the vector contour of the first semantic object in the terrain layer and the modeling position of the second semantic object in the terrain layer can be calculated more accurately, which is conducive to the accuracy of the three-dimensional scene reconstruction.
  • the corresponding first semantic object and second semantic object are obtained, so that the second semantic object obtained for the object to be reconstructed with a smaller outline size, such as the telephone poles and trash cans in the outdoor scene in the target image, can retain more feature information of the object to be reconstructed, so that the second semantic object can be displayed more completely, and the first semantic object and the second semantic object have more complete modeling details, reducing data missing, which is conducive to the three-dimensional reconstruction of each object in the outdoor scene.
  • a first semantic object with a larger outline size can obtain a more complete vector outline, so as to more accurately locate the position of the first semantic object in the terrain layer, thereby obtaining a more accurate reconstruction model.
  • the first semantic object with a smaller outline size can obtain a more accurate position, so as to accurately locate the position of the second semantic object in the terrain layer.
  • FIG1 is a schematic diagram showing a flow chart of a three-dimensional scene reconstruction method provided by an embodiment of the present invention
  • FIG2 is a schematic diagram showing the structure of a three-dimensional scene reconstruction device provided by an embodiment of the present invention.
  • FIG3 shows a schematic diagram of the structure of a computing device provided by some embodiments of the present invention.
  • the processor builds models based on multiple images taken by the camera.
  • the existing 3D scene reconstruction process requires professional manpower. For example, in many cases, a scene requires professional artists to manually cut out or split the model to extract useful information. It takes half a day or even a day to extract terrain data, vegetation data, building data, etc.
  • the extraction process is cumbersome, the workload is large and the difficulty is high, which affects the reconstruction efficiency and is difficult to expand to large-scale modeling projects.
  • the inventor provides a three-dimensional scene reconstruction method, which determines the terrain layer according to j target images, and performs semantic recognition and pixel depth estimation on the target image to obtain a first semantic object and a second semantic object, and then determines the vector outline of the first semantic object in the terrain layer according to the camera parameters, determines the modeling position of the second semantic object in the terrain layer according to the camera parameters, and finally performs three-dimensional reconstruction according to the vector outline, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  • the first semantic object includes objects with larger outline sizes such as vegetation and buildings
  • the second semantic object includes objects with smaller outline sizes such as electric poles and trash cans.
  • the objects to be reconstructed are extracted in different corresponding ways, and the objects to be reconstructed are combined for three-dimensional reconstruction to improve the accuracy and comprehensiveness of the three-dimensional scene reconstruction, and the data of each object to be reconstructed can be automatically extracted without the need for manual extraction of the objects to be reconstructed, thereby improving the efficiency of the three-dimensional scene reconstruction.
  • FIG1 shows a flow chart of a three-dimensional scene reconstruction method provided by an embodiment of the present invention.
  • the method is executed by a computing device, which may include one or more processors.
  • the processor may be a central processing unit (CPU), or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement an embodiment of the present invention, which is not limited here.
  • the one or more processors included in the computing device may be processors of the same type, such as one or more CPUs; or may be processors of different types, such as one or more CPUs and one or more ASICs, which is not limited here.
  • the method comprises the following steps:
  • Step 110 Obtain j target images for outdoor scenes, where j is a positive integer greater than 1.
  • Step 120 Determine the camera parameters of each target image, where the camera parameters include camera intrinsic parameters and camera extrinsic parameters.
  • Step 130 Determine a terrain layer according to the j target images, where the terrain layer has k terrain points, where k is a positive integer greater than 1.
  • Step 140 Perform semantic recognition and pixel depth estimation on at least one target image to obtain a semantic object.
  • Step 150 In response to the semantic object meeting the first preset condition, the semantic object meeting the first preset condition is taken as a first semantic object, and a vector contour of the first semantic object in the terrain layer is determined according to camera parameters.
  • Step 160 In response to the semantic object meeting the second preset condition, the semantic object meeting the second preset condition is used as a second semantic object, and a modeling position of the second semantic object in the terrain layer is determined according to camera parameters, and the feature points of the second semantic object are less than the feature points of the first semantic object.
  • Step 170 Perform 3D reconstruction based on the vector outline, modeling position and terrain layer to obtain a 3D reconstruction model of the outdoor scene.
  • the processor needs to obtain j target images for the outdoor scene to perform three-dimensional reconstruction of the outdoor scene.
  • the target image is an image from which three-dimensional reconstruction can be achieved, and j is set according to the number of images required for three-dimensional reconstruction, which is usually at least 2 and preferably more than 8.
  • the target image can be captured by a handheld three-dimensional scanning device or an aerial scanning device, and the images captured by the camera of the scanning device for outdoor scenes at different viewing angles are used as the target images.
  • the camera intrinsic parameters are fixed after the camera leaves the factory, and can be pre-calculated by the existing camera calibration method, and then the processor obtains the camera intrinsic parameters.
  • the specific calculation method of the camera intrinsic parameters is not described in detail.
  • the camera extrinsic parameters are the position and posture of the camera, which are represented by the RT matrix, and the processor can calculate accordingly according to the corresponding target image.
  • step 130 based on the three-dimensional scene reconstruction process, it is necessary to extract data from different physical objects in the outdoor scene.
  • the terrain has a large coverage area and other objects are basically located on the ground. Therefore, it is necessary to separate the terrain part from other objects so that the three-dimensional models of other objects can be subsequently arranged on the model of the terrain part.
  • the terrain layer is determined according to the j target images, the pixel points of the j target images are projected into the world coordinate system, and the three-dimensional point cloud is obtained through point cloud registration and fusion, and then the three-dimensional point cloud is filtered and calculated by a terrain filtering algorithm such as a morphological filtering algorithm, an index filtering algorithm or a cloth filtering algorithm to obtain a corresponding terrain layer, which has multiple terrain points, wherein the number of terrain points is determined according to the filtering result, and k is the number of the terrain points.
  • the terrain points obtained by the terrain filtering algorithm can also be further optimized by error optimization to obtain a smaller number of terrain points, and the number of optimized terrain points is correspondingly used as the k value.
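  • As an illustration of the projection described above, the following minimal sketch (assuming a pinhole camera model, a metric depth map per image, and world-to-camera extrinsics; the function name and inputs are illustrative, not taken from the patent) back-projects the pixels of one target image into the world coordinate system:

```python
import numpy as np

def pixels_to_world(depth, K, R, t):
    """Back-project a depth map into world coordinates.

    depth: (H, W) metric depth per pixel; K: (3, 3) camera intrinsics;
    R, t: world-to-camera extrinsics, i.e. x_cam = R @ x_world + t.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays
    cam = rays * depth.reshape(-1, 1)        # 3D points in the camera frame
    return (cam - t) @ R                     # x_world = R^T (x_cam - t), shape (H*W, 3)
```

  The per-image clouds obtained this way would still need to be registered and fused (e.g. via a registration step such as ICP) before the terrain filtering described above.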
  • the processor can obtain semantic objects corresponding to objects in the outdoor scene by performing semantic recognition and pixel depth estimation on at least one target image.
  • each target image can be segmented into a corresponding recognition object according to semantic recognition, and the recognition object at this time is a two-dimensional plane shape.
  • the semantic recognition algorithm can be a Normalized Cut algorithm, a Grab cut algorithm, a threshold method, a segmentation method based on pixel clustering, or an algorithm for semantic recognition through deep learning or other semantic recognition algorithms.
  • a semantic segmentation model is established through deep learning, and objects of different categories in each target image are predicted and recognized according to the semantic segmentation model, and the corresponding pixel is assigned a label of the object category, thereby performing a semantic recognition operation on the target image.
  • the pixel position of each pixel in each recognized object is obtained by pixel depth estimation, so as to give a three-dimensional form to the recognized object in two-dimensional plane form and obtain the corresponding semantic object.
  • the pixel depth estimation can be monodepth2, Adabins, Sydneypth and other algorithms.
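  • Purely as an illustration of per-image depth estimation (the patent names monodepth2 and Adabins; MiDaS is substituted here only because it is easy to load via torch.hub, and note it returns relative rather than metric depth):

```python
import cv2
import torch

# MiDaS small model and its matching preprocessing transform
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("target_image.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transform(img))             # (1, H', W') relative depth
depth = pred.squeeze().cpu().numpy()         # per-pixel depth estimate
```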
  • one of the target images may be selected for semantic recognition and pixel depth estimation to obtain a corresponding semantic object.
  • the position of the same target object in different target images can be calculated by averaging, by a weighted method, or by other methods, which are not limited here and can be set as needed.
  • step 140 semantic recognition and pixel depth estimation are directly performed on each pixel of the target image to obtain the corresponding semantic object, so that the semantic object obtained for the object to be reconstructed with fewer feature points such as telephone poles and trash cans in the outdoor scene in the target image can retain more feature information of the object to be reconstructed, so that the second semantic object can be displayed more completely, and the first semantic object and the second semantic object have more complete modeling details, reducing data missing, which is conducive to the three-dimensional reconstruction of various objects in the outdoor scene.
  • the processor determines whether the semantic object meets the first preset condition. If the semantic object meets the first preset condition, the processor treats the semantic object that meets the first preset condition as the first semantic object.
  • the first preset condition can be determined based on the feature points of the semantic object. For example, if the feature points of the semantic object are greater than or equal to a preset value, the semantic object is treated as the first semantic object.
  • the first semantic object includes objects to be reconstructed that have more feature points, such as vegetation and buildings in the target image. In some embodiments, the first preset condition can also be determined based on a preset first semantic object category, so as to determine a semantic object that can form a relatively complete vector outline in the terrain layer as the first semantic object.
  • the first semantic object category includes buildings, grass, bushes, etc.
  • the processor determines a semantic object that meets at least one of the categories of buildings, grass, bushes, etc. as the first semantic object.
  • the first preset condition can also be determined by deep learning to determine a semantic object that can form a relatively complete vector outline in the terrain layer as the first semantic object.
  • the first semantic object and the terrain layer are both generated based on the target image and have the same corresponding camera parameters, the first semantic object and the terrain layer have a mapping relationship, so the vector contour of the first semantic object in the terrain layer can be determined based on the camera parameters.
  • the first semantic object since the first semantic object has many feature points, a more complete vector contour can be obtained, and the position of the first semantic object in the terrain layer can be more accurately located through the vector contour, thereby obtaining a more accurate reconstruction model.
  • each point of the first semantic object can be converted to the same coordinate system of the terrain layer according to the camera parameters, and then the semantic points of the first semantic object are matched with the terrain points through point cloud registration so that the terrain points have semantic labels, and then the semantic terrain points are clustered, and the vector contours corresponding to the clustered terrain points are calculated through the contour recognition algorithm.
  • each terrain point of the terrain is mapped to a target image that has undergone semantic recognition according to camera parameters, thereby assigning a semantic label of the corresponding pixel to the terrain point of the corresponding pixel, and then the terrain points corresponding to the first semantic object are clustered, so that similar terrain points will be clustered, and then the vector contours corresponding to the clustered terrain points are calculated through a contour recognition algorithm.
  • contour recognition algorithms such as the alphashape algorithm.
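  • A minimal sketch of the terrain-point-to-image mapping described above, assuming intrinsics K, world-to-camera extrinsics (R, t), and a per-pixel first-semantic-label map produced by the recognition step (function and variable names are assumptions):

```python
import numpy as np

def label_terrain_points(terrain_pts, K, R, t, label_map):
    """Assign each visible terrain point the semantic label of its pixel.

    terrain_pts: (N, 3) world coordinates; label_map: (H, W) integer labels.
    """
    cam = terrain_pts @ R.T + t                        # world -> camera frame
    uvw = cam @ K.T                                    # homogeneous pixel coords
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    H, W = label_map.shape
    ok = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(len(terrain_pts), -1, dtype=int)  # -1 = not visible
    labels[ok] = label_map[v[ok], u[ok]]               # second semantic label
    return labels
```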
  • the processor determines whether the semantic object meets the second preset condition. If it does, the processor, in response, takes the semantic object meeting the second preset condition as the second semantic object.
  • the second preset condition can be determined based on the feature points of the semantic object. For example, if the feature points of the semantic object are less than the preset value, the semantic object is used as the second semantic object.
  • the second semantic object includes objects to be reconstructed with fewer feature points such as electric poles and trash cans in the target image. In some embodiments, it can also be determined based on the area occupied by the semantic object in the terrain layer.
  • the second preset condition can also be determined based on a pre-set second semantic object category to determine a semantic object that cannot form a relatively complete vector outline in the terrain layer as the second semantic object.
  • the second semantic object category includes electric poles, trash cans, benches, etc.
  • the processor uses a semantic object that meets at least one of the categories of electric poles, trash cans, benches, etc. as the second semantic object.
  • the second preset condition can also be determined by deep learning, so as to use a semantic object that cannot form a relatively complete vector outline in the terrain layer as the second semantic object; other methods may also be used, which are not limited here and can be set as needed.
  • since the second semantic object has fewer feature points than the first semantic object, fewer of its feature points can be mapped to terrain points, and a complete outline mapped onto the terrain layer cannot be formed.
  • as a result, the second semantic object cannot be accurately located in the terrain layer by its outline, which affects the accuracy of outdoor scene modeling. In this case, by determining the modeling position of the second semantic object in the terrain layer, the position of the second semantic object can be located more accurately.
  • the modeling position of the second semantic object can be determined by calculating the position corresponding to each pixel in the second semantic object. For example, the position of each voxel in the second semantic object is calculated by depth estimation, and the modeling position of the second semantic object is calculated as the average value. Alternatively, in some embodiments, the center of gravity of the second semantic object can be determined, and the position of the center of gravity is used as the modeling position.
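  • A minimal sketch of the center-of-gravity variant just mentioned, assuming the second semantic object is available as an array of voxel positions:

```python
import numpy as np

def modeling_position(voxel_positions):
    """voxel_positions: (a, 3) voxel coordinates of the second semantic object;
    returns the (3,) centroid used as the modeling position."""
    return voxel_positions.mean(axis=0)
```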
  • step 170 after obtaining the vector outline of the first semantic object and the modeling position of the second semantic object, the processor of the embodiment of the present invention combines the first semantic object on the terrain layer according to the vector outline, and the processor combines the second semantic object on the terrain layer according to the modeling position to obtain a reconstructed model of the outdoor scene.
  • the processor may also combine a preset 3D model in a preset model library with the terrain layer according to the vector outline. For example, for a grass vector outline, the processor may fill the grass model from the preset model library into the area of the terrain layer corresponding to that vector outline, so as to form the grass model in the 3D reconstruction model of the outdoor scene.
  • the work of combining models can also be handed over to artists.
  • the artists combine the first semantic object or the corresponding preset 3D model with the terrain layer according to the vector outline.
  • the artists only need to place the corresponding 3D model, and do not need to extract the location information and outline information corresponding to the 3D model, which greatly reduces the workload of the artists and improves the production efficiency of the 3D reconstruction model of the outdoor scene.
  • the processor may combine the second semantic object or the corresponding preset 3D model with the terrain layer according to the modeling position; or the artist may combine the second semantic object or the corresponding preset 3D model with the terrain layer according to the modeling position.
  • step 170 is executed by a processor or manually, which is not limited here and can be set as needed.
  • the three-dimensional reconstruction model of the outdoor scene may be rendered and processed as needed to achieve a better reconstruction effect.
  • step 110 to step 170 since the size of the terrain part is large, the number of corresponding pixels in the target image is large. If the terrain layer adopts the same generation method as the first semantic object and the second semantic object, the amount of calculation of the processor will be increased, affecting the working efficiency of the processor.
  • the generation method of the terrain layer is different from the generation method of the first semantic object and the second semantic object, and the terrain part of the outdoor scene can be separated from other objects, so that the first semantic object and the second semantic object can be combined with the terrain layer later, thereby improving the accuracy and comprehensiveness of the three-dimensional scene reconstruction.
  • the terrain layer, the first semantic object and the second semantic object are all automatically generated by the processor, and the vector contour of the first semantic object in the terrain layer and the modeling position of the second semantic object in the terrain layer are also automatically extracted by the processor, without the need for manual extraction of the three-dimensional model of the object to be reconstructed, thereby improving the efficiency of the three-dimensional reconstruction of the outdoor scene.
  • the terrain layer, the first semantic object and the second semantic object are all generated by j target images, which can be better positioned and aligned, so that the vector contour of the first semantic object in the terrain layer and the modeling position of the second semantic object in the terrain layer can be calculated more accurately, which is conducive to the accuracy of the three-dimensional scene reconstruction.
  • the corresponding first semantic object and second semantic object are obtained, so that the second semantic object obtained for the object to be reconstructed with a smaller outline size, such as the telephone poles and trash cans in the outdoor scene in the target image, can retain more feature information of the object to be reconstructed, so that the second semantic object can be displayed more completely, and the first semantic object and the second semantic object have more complete modeling details, reducing data missing, which is conducive to the three-dimensional reconstruction of each object in the outdoor scene.
  • the first semantic object with a larger outline size can obtain a more complete vector outline, so as to more accurately locate the position of the first semantic object in the terrain layer, thereby obtaining a more accurate reconstruction model.
  • the first semantic object with a smaller outline size can obtain a more accurate position, so as to accurately locate the position of the second semantic object in the terrain layer.
  • step 130 further includes:
  • Step a01 Determine a 3D point cloud based on j target images
  • Step a02 Perform cloth filter calculation based on the 3D point cloud to determine the terrain layer.
  • step a01 and step a02 the pixel points of j target images are projected into the world coordinate system through matrix transformation according to the camera intrinsic parameters and camera extrinsic parameters, and a three-dimensional point cloud is obtained through point cloud registration and fusion, and then the three-dimensional point cloud is filtered and calculated by the cloth filtering algorithm to obtain the corresponding terrain layer, which has multiple terrain points, where the number of terrain points is determined according to the filtering result, and k is the number of terrain points.
  • the cloth filtering algorithm can be implemented by the CSF algorithm, in which the resolution of the cloth can be selected. If the length and width of the ground are m × n and the resolution is 1 meter, then the cloth has m × n points.
  • step a01 and step a02 using the cloth filter algorithm to calculate the terrain layer can effectively reduce the amount of processor calculations and improve the efficiency of the processor. In the subsequent process of combining the terrain layer with the first semantic object and the second semantic object, the amount of processor calculations can be further reduced.
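  • A hedged sketch of this step using the open-source cloth-simulation-filter package (imported as CSF), whose parameter names are the package's own rather than the patent's; the input file is hypothetical:

```python
import numpy as np
import CSF  # pip install cloth-simulation-filter

points = np.load("fused_point_cloud.npy")     # (N, 3) fused 3D point cloud

csf = CSF.CSF()
csf.params.cloth_resolution = 1.0             # 1 m grid -> m x n cloth points
csf.setPointCloud(points)

ground_idx, non_ground_idx = CSF.VecInt(), CSF.VecInt()
csf.do_filtering(ground_idx, non_ground_idx)  # separate terrain from other objects
terrain_layer = points[np.array(ground_idx)]  # the k terrain points
```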
  • step 140 further includes:
  • Step b01 performing semantic recognition on each target image to obtain a recognition object of each target image, and determining a first semantic label of each pixel in each recognition object;
  • Step b02 performing pixel depth estimation on all pixels of each recognized object in each target image, obtaining an estimated object and a depth position corresponding to the estimated object, and taking the estimated object as a semantic object.
  • semantic recognition segmentation is performed on each target image through a deep learning algorithm to obtain multiple recognition objects in each target image, and the multiple recognition objects include a first recognition object and a second recognition object.
  • the semantic segmentation model of deep learning can predict the preset category. For example, a semantic segmentation model of categories such as people, buildings, and grasslands is established in advance through a deep learning algorithm, and then the corresponding recognition object is determined on the target image according to the corresponding semantic segmentation model, and then the first semantic label is assigned to the corresponding pixel according to the corresponding recognition object.
  • the first recognition object includes a grassland object
  • the first semantic label representing the grassland is assigned to the pixel corresponding to the grassland object
  • the second recognition object includes a pole object
  • the first semantic label representing the pole is assigned to the pixel corresponding to the pole object.
  • the first semantic label can be expressed in Chinese, in numbers, or in English letters, which is not limited here and is set as needed. For example, if a grassland object is recognized, the pixel corresponding to the grassland object is represented by the number 1; if a pole object is recognized, the pixel corresponding to the pole object is represented by the number 2.
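  • As an illustration of assigning numeric first semantic labels with a deep-learning segmentation model (torchvision's DeepLabV3 stands in here for the patent's unspecified model; its class ids are the model's own, not the grassland=1 / pole=2 example above):

```python
import torch
from PIL import Image
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = Image.open("target_image.jpg")
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"]
label_map = out.argmax(1).squeeze(0)   # (H, W): one first semantic label per pixel
```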
  • step b02 since the pixels displayed in each target image are in a two-dimensional plane form, the corresponding recognition object is in a two-dimensional plane form, and the depth position of the corresponding pixel of the recognition object in each target image is obtained by depth estimation. According to the depth position, the three-dimensional coordinates of the corresponding pixel in space can be determined, and the corresponding recognition object combined with the depth position can obtain a three-dimensional estimated object.
  • each target image can thus yield an estimated object in three-dimensional form, and accordingly the estimated object can be used as the semantic object.
  • step b01 and step b02 a semantic object with relatively complete feature points can be obtained on a target image, so as to reduce information loss in the process of three-dimensional scene reconstruction and improve the accuracy and completeness of three-dimensional scene reconstruction.
  • step b02 further comprises:
  • Step b021 performing pixel depth estimation on each pixel of the identified object in each target image to obtain the pixel depth position
  • Step b022 Calculate the average pixel depth position of all first pixels of the recognized object in each target image according to the pixel depth position, and use the average pixel depth position as the depth position.
  • step 150 further comprises:
  • Step c01 mapping the terrain point to a first pixel corresponding to a first semantic object in the same target image according to camera parameters, and determining a second semantic label corresponding to the terrain point according to a first semantic label corresponding to the first pixel;
  • Step c02 taking the second semantic label that the terrain point obtains the largest number of times across the target images as the target semantic label;
  • Step c03 performing cluster calculation on all terrain points with the same target semantic label to obtain a target terrain point set
  • Step c04 performing contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
  • the first semantic object and the terrain layer are both generated according to the target image, and accordingly, the corresponding camera parameters are the same, and the pixel points of the first semantic object and the terrain points of the terrain layer can form a corresponding mapping relationship.
  • the terrain point is projected to the first pixel corresponding to the first semantic object in the target image through the projection matrix, and the first semantic label corresponding to the first pixel is assigned to the terrain point to form the second semantic label corresponding to the terrain point in the target image.
  • the first semantic objects corresponding to the same terrain point in different target images may be the same or different, so the second semantic labels obtained by the same terrain point in different target images may be the same or different.
  • step c02 since the terrain point may register multiple pixel points of the target image as the same point during the process of forming the three-dimensional point cloud, in this case, the types of second semantic labels obtained by projecting the terrain point to the pixel points of the same target image may be different; in addition, in multiple target images, the same terrain point may obtain different types of second semantic labels corresponding to different target images.
  • by taking the second semantic label that occurs the largest number of times for a terrain point across the target images as the target semantic label, a more accurate semantic result corresponding to the terrain point can be obtained.
  • if a terrain point obtains two or more types of second semantic labels, for example second semantic labels representing grass objects and second semantic labels representing road objects, and the number of second semantic labels representing grass objects is larger, the second semantic label representing grass objects will be used as the target semantic label of that terrain point.
  • the same terrain point obtains different types of second semantic labels in different target images, for example, assuming that there are two target images corresponding to the same location, in the first target image, the terrain point obtains second semantic labels representing grass objects and road objects, and in the second target image, the terrain point obtains second semantic labels representing grass objects and building objects. If the number of second semantic labels representing grass objects is the largest, the second semantic label representing grass objects will be used as the target semantic label for the same terrain point.
  • step c03 after steps c01 and c02, all terrain points corresponding to the first semantic object in the terrain layer are assigned target semantic labels, and a target terrain point set is obtained by clustering all terrain points with the same target semantic label.
  • the obtained target terrain point set can also be considered as a projection point set corresponding to the first semantic object projected onto the terrain layer, so as to provide calculation of the corresponding vector contour in the subsequent step c04.
  • for example, the terrain points carrying the target semantic label representing the road object are clustered with the DBSCAN algorithm: only the horizontal-plane coordinates of the terrain points are taken, which is equivalent to flattening the road, and clustering is then performed on the terrain points belonging to the road, so that nearby points are clustered into one road. If the road is disconnected, the disconnected parts will be clustered into two roads accordingly.
  • the contour of the target terrain point set is identified by a contour recognition algorithm, thereby determining the vector contour of the first semantic object corresponding to the terrain layer.
  • the vector contour corresponding to the target terrain point set can be calculated by an alphashape algorithm.
  • the alphashape algorithm is used to perform contour recognition calculation on a target terrain point set representing the semantics of a road object, and a vector contour representing the contours of the edges on both sides of the road is obtained.
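  • A sketch combining scikit-learn's DBSCAN with the alphashape package for steps c03 and c04; the input files, road label id, eps and alpha values are illustrative assumptions:

```python
import numpy as np
import alphashape                      # pip install alphashape
from sklearn.cluster import DBSCAN

terrain_pts = np.load("terrain_points.npy")       # (k, 3), hypothetical input
target_labels = np.load("target_labels.npy")      # (k,) target semantic labels
ROAD = 3                                          # illustrative road label id

road_xy = terrain_pts[target_labels == ROAD][:, :2]   # drop height: flatten the road
cluster_ids = DBSCAN(eps=2.0, min_samples=10).fit_predict(road_xy)

vector_contours = []
for cid in set(cluster_ids) - {-1}:               # -1 marks DBSCAN noise
    cluster = road_xy[cluster_ids == cid]         # one road segment's point set
    vector_contours.append(alphashape.alphashape(cluster, 0.5))  # contour polygon
```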
  • step c01 to step c04 the second semantic label corresponding to the terrain point is determined according to the first semantic label corresponding to the first pixel, so that the terrain point forms a mapping relationship with the multiple first pixels, and the number of second semantic label results obtained is larger and relatively more accurate; the same second semantic label with the largest number obtained by corresponding the terrain point to the target image is used as the target semantic label to obtain a more accurate semantic result corresponding to the terrain point; all terrain points with the same target semantic label are clustered to obtain a set of target terrain points, and the contour of the target terrain point set is identified by a contour recognition algorithm, so as to determine the vector contour corresponding to the first semantic object in the terrain layer, so that the positioning of the vector contour is more accurate.
  • step 140 further includes:
  • Step d01 perform fusion calculation on the same estimated object in j target images according to the depth position to obtain a three-dimensional fused object and a fused position corresponding to the fused object, use the fused object as a semantic object, and use the fused position corresponding to the second semantic object as a modeling position.
  • step d01 since there will be errors in calculating the position of the same estimated object based on only one target image, a better positioning result can be obtained by inferring several incompletely overlapping positions of the same estimated object based on multiple target images obtained from different shooting angles.
  • the voxel depth position of the estimated object converted from pixels into voxels must be determined first, and then each pixel in the estimated object is converted into a voxel to form a three-dimensional fused object.
  • the fused object is used as a semantic object, and if the processor determines that the semantic object meets the second preset condition, the corresponding fused position corresponding to the second semantic object is used as the modeling position.
  • each pixel corresponding to the estimated object in each target image has a pixel depth position, and during the fusion calculation process, the pixel depth positions of the corresponding pixels of the same estimated object can be fused to form a voxel depth position by calculating the minimum error.
  • the minimum error can be determined by the least squares method, or a preset distance error can be set to select a suitable voxel depth position.
  • the average pixel depth position of pixels corresponding to the same estimated object in multiple target images may also be calculated by averaging, and the average pixel depth position may be used as the voxel depth position.
  • the voxel depth position may also be calculated in a weighted manner, and the corresponding weight may be determined according to the display of the estimated object in the target image. For example, if the corresponding estimated object in the target image displays a complete outline and the degree of distortion is small, the corresponding weight is relatively large. For ease of understanding, it is assumed that there are 8 target images, and there are 4 target images showing the estimated object.
  • w 2 and w 3 can be set larger, for example, w 2 and w 3 are both set to 0.4, w 1 and w 4 are both set to 0.1, or w 2 and w 3 are both set to 0.3, w 1 and w 4 are both set to 0.2, as required, and no limitation is made here.
  • w 1 , w 2 , and w 3 may be set larger, for example, w 1 , w 2 , and w 3 are all set to 0.3, and w 4 is set to 0.1, or other numerical setting methods are set as needed, which are not limited here.
  • in each case, the weights are determined according to how the corresponding estimated object is displayed in the target image.
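  • A minimal numeric sketch of the weighted fusion above (4 of the 8 target images show the estimated object; the depth values are made up for illustration):

```python
import numpy as np

d = np.array([5.2, 5.0, 5.1, 5.6])   # pixel depth positions from 4 target images
w = np.array([0.1, 0.4, 0.4, 0.1])   # weights per the example above, summing to 1
voxel_depth = float(w @ d)           # weighted voxel depth position (5.12 here)
```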
  • the fusion position corresponding to the fused object can be calculated by minimum error, by averaging (using the average pixel depth position as the voxel depth position), by a weighted method, or by other methods, which are not limited here and can be set as needed.
  • step d01 further comprises:
  • Step d011 performing pixel depth estimation on each pixel of the identified object in each target image to obtain the pixel depth position
  • Step d012 performing fusion calculation on each pixel of the same identified object in the j target images according to the pixel depth position to obtain a fused object, the fused object including a voxels, each voxel being obtained by fusion of corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
  • Step d013 Calculate the voxel depth position of each voxel according to the pixel depth position;
  • Step d014 Calculate the fusion position of the fused object based on the voxel depth positions, for example as the average depth position (d_1 + d_2 + ... + d_a)/a, where d_i represents the voxel depth position of each voxel in the fused object.
  • step d011 and step d012 all pixels in the same identified object in j target images are fused and converted into voxels according to the pixel depth position, thereby obtaining a corresponding fused object, so that more feature information is retained in the three-dimensional scene reconstruction process.
  • a more accurate positioning effect can be obtained due to the large amount of feature information.
  • each voxel in the fused object is fused by the corresponding pixels of the corresponding estimated object.
  • the identified objects in multiple target images can be put into correspondence through a matching calculation, and the corresponding identified objects in multiple target images can be understood as the same identified object.
  • pixel distance matching can be used: if the matching value between the identified objects of multiple target images is greater than or equal to a preset threshold, the identified objects are considered to correspond. If the matching value is set as a ratio, the preset threshold is usually set above 0.6, preferably above 0.8. In some embodiments, matching can also be performed through the other identified objects around the identified object; for example, if there are multiple other identified objects around the identified object in one of the target images, and there are multiple other identified objects of corresponding types around the identified objects of the other target images, the identified objects of the multiple target images can be considered matched. Matching can also be performed through other methods, which are not limited here and can be set as needed.
  • for example, the grass objects corresponding to the 4 target images are fused accordingly to obtain the corresponding voxels and voxel depth positions; the corresponding grass objects in the multiple target images can be understood as the same grass object.
  • the grass objects in the 4 target images are matched by pixel distance.
  • if the distance between a pixel of the grass object in one target image and a pixel of the grass object in another target image meets the preset distance, those two pixels are considered matched.
  • the distance between each pixel of the grass object in each target image and each pixel of the grass object in the other target images is calculated, and the ratio of the number of matching pixels of the grass object in each target image to the total number of pixels is taken as the matching value. If the matching value of the grass object in each target image is greater than 0.8, it means that the grass objects in the 4 target images are corresponding.
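  • A sketch of the pixel-distance matching just described, using a KD-tree for the nearest-neighbour distance test; inputs and thresholds are illustrative (the 0.8 ratio follows the example above):

```python
import numpy as np
from scipy.spatial import cKDTree

def objects_match(pix_a, pix_b, max_dist=2.0, threshold=0.8):
    """pix_a, pix_b: (N, 2) pixel coordinates of identified objects of the
    same class in two target images; returns True when the ratio of matched
    pixels (nearest neighbour within max_dist) reaches the threshold."""
    dists, _ = cKDTree(pix_b).query(pix_a)
    match_value = np.mean(dists <= max_dist)   # matched pixels / total pixels
    return match_value >= threshold
```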
  • step d013 during the fusion calculation process, the pixel depth positions of corresponding pixels of the same identified object in j target images are fused and calculated to obtain the voxel depth position of the corresponding voxel.
  • the method of calculating the voxel depth position by the pixel depth position refers to the above step d01 and will not be repeated here.
  • in step d014, the fusion position is obtained by calculating the average depth position. This makes the calculation simpler and reduces the amount of processor computation; in addition, a more accurate fusion position can be obtained, which makes the reconstruction effect of the three-dimensional scene better.
  • in step d011 to step d014, all pixels in the same identified object in the j target images are fused and converted into voxels according to the pixel depth position, thereby obtaining the corresponding fused object, so that more feature information is retained in the three-dimensional scene reconstruction process.
  • more accurate positioning effect can be obtained due to more feature information.
  • by calculating the average depth position, the calculation is simpler and the amount of processor computation is reduced.
  • a more accurate fusion position can be obtained, resulting in a better reconstruction effect of the three-dimensional scene.
  • FIG2 shows a schematic diagram of the structure of a 3D scene reconstruction device provided by an embodiment of the present invention.
  • the device 200 includes:
  • a first acquisition module 210 is used to acquire j target images for an outdoor scene, where j is a positive integer greater than 1;
  • a second acquisition module 220 is used to acquire camera parameters of each target image, where the camera parameters include camera intrinsic parameters and camera extrinsic parameters;
  • a first determination module 230 is used to determine a terrain layer according to j target images, where the terrain layer has k terrain points, and k is a positive integer greater than 1;
  • a first calculation module 240 is used to perform semantic recognition and pixel depth estimation on at least one target image to obtain a semantic object
  • a second determination module 250 is used to take the semantic object as a first semantic object according to the first preset condition, and determine a vector contour of the first semantic object in the terrain layer according to the camera parameters;
  • a third determination module 260 is used to take the semantic object as the second semantic object according to the second preset condition, and determine the modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
  • the second calculation module 270 is used to perform three-dimensional reconstruction according to the vector outline, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  • the first determination module 230 further includes:
  • a first determination unit configured to determine a three-dimensional point cloud according to j target images
  • the second determination unit is used to perform cloth filter calculation according to the three-dimensional point cloud to determine the terrain layer.
  • the first calculation module 240 further includes:
  • a first recognition unit is used to perform semantic recognition on each target image, obtain a recognition object of each target image, and determine a first semantic label of each pixel in each recognition object;
  • the first obtaining unit is used to perform pixel depth estimation on all pixels of each recognized object in each target image, obtain the estimated object and the depth position corresponding to the estimated object, and use the estimated object as a semantic object.
  • the first obtaining unit further comprises:
  • a second recognition unit is used to perform pixel depth estimation on each pixel of the recognized object in each target image to obtain a pixel depth position
  • the first operation unit is used to calculate the average pixel depth position of all first pixels of the recognized object in each target image according to the pixel depth position, and use the average pixel depth position as the depth position.
  • the second determination module 250 further includes:
  • a first mapping unit configured to map the terrain point to a first pixel corresponding to a first semantic object in the same target image according to the camera parameters, and determine a second semantic label corresponding to the terrain point according to a first semantic label corresponding to the first pixel;
  • the second operation unit is used to take the second semantic label that the terrain point obtains the largest number of times across the target images as the target semantic label;
  • the first clustering unit is used to perform clustering calculation on all terrain points with the same target semantic label to obtain a target terrain point set;
  • the third recognition unit is used to perform contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
  • the first calculation module 240 further includes:
  • the first fusion unit is used to perform fusion calculation on the same estimated object in j target images to obtain a three-dimensional fusion object and a fusion position corresponding to the fusion object, use the fusion object as a semantic object, and use the fusion position corresponding to the second semantic object as a modeling position.
  • the first fusion unit further comprises:
  • a third operation unit is used to estimate the pixel depth of each pixel of the identified object in each target image to obtain the pixel depth position;
  • the second fusion unit is used to perform fusion calculation on each pixel of the same identified object in the j target images according to the pixel depth position to obtain a fused object, where the fused object includes a voxels, each obtained by fusion of the corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
  • a fourth operation unit configured to calculate a voxel depth position of each voxel according to the pixel depth position
  • a fifth computing unit configured to calculate the fusion position of the fused object according to the voxel depth positions, for example as the average depth position (d_1 + d_2 + ... + d_a)/a, where d_i represents the voxel depth position of each of the voxels in the fused object.
  • FIG3 shows a schematic diagram of the structure of a computing device provided in an embodiment of the present invention.
  • the specific embodiment of the present invention does not limit the specific implementation of the computing device.
  • the computing device may include: a processor (processor) 302, a communication interface (Communications Interface) 304, a memory (memory) 306, and a communication bus 308.
  • the processor 302, the communication interface 304, and the memory 306 communicate with each other via the communication bus 308.
  • the communication interface 304 is used to communicate with other devices such as a client or other server network elements.
  • the processor 302 is used to execute the program 310, which can specifically execute the relevant steps in the above-mentioned method embodiment for three-dimensional scene reconstruction.
  • the program 310 may include program code including computer executable instructions.
  • the processor 302 may be a central processing unit (CPU), or an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the one or more processors included in the computing device may be processors of the same type, such as one or more CPUs; or processors of different types, such as one or more CPUs and one or more ASICs.
  • the memory 306 is used to store the program 310.
  • the memory 306 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk storage.
  • the embodiment of the present invention also provides a computer-readable storage medium, wherein the storage medium stores at least one executable instruction;
  • the executable instruction, when run, performs the operations of any one of the above-mentioned three-dimensional scene reconstruction methods.
  • modules in the devices in the embodiments may be adaptively changed and arranged in one or more devices different from the embodiments.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, or divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention relate to the field of computer vision technology and provide a three-dimensional scene reconstruction method, apparatus, device and storage medium. The method includes: determining a terrain layer according to j target images; performing semantic recognition and pixel depth estimation on at least one target image to obtain semantic objects; taking a semantic object meeting a first preset condition as a first semantic object and determining a vector contour of the first semantic object in the terrain layer according to camera parameters; taking a semantic object meeting a second preset condition as a second semantic object and determining a modeling position of the second semantic object in the terrain layer according to the camera parameters; and performing three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene. By automatically generating the terrain layer, the first semantic object and the second semantic object, and automatically extracting the vector contour of the first semantic object in the terrain layer and the modeling position of the second semantic object in the terrain layer, the efficiency of three-dimensional reconstruction of outdoor scenes is improved.

Description

Three-dimensional scene reconstruction method, apparatus, device and storage medium. Technical Field
Embodiments of the present invention relate to the field of computer vision technology, and specifically to a three-dimensional scene reconstruction method, apparatus, device and storage medium.
Background Art
With the rise of digital twins and the metaverse, digitizing real scenes has become a pressing problem. The greatest demand for scene digitization comes mainly from government departments and the gaming sector. For government departments, a digital city can reflect the dynamic state of the city in real time and plays an important role in traffic management, emergency response and similar applications. In gaming, content production constrains the development of the whole ecosystem as content progresses from 2D to 3D. 3D content production has a high entry barrier; in particular, the creation of 3D scenes is difficult and time-consuming, limiting both the quantity and the quality of creators' output.
It is therefore necessary to provide a three-dimensional scene reconstruction method, apparatus, device and storage medium to overcome the above problems.
Summary of the Invention
In view of the above problems, embodiments of the present invention provide a three-dimensional scene reconstruction method, apparatus, device and storage medium for solving the problems existing in the prior art.
According to a first aspect of the embodiments of the present invention, a three-dimensional scene reconstruction method is provided, the method comprising:
acquiring j target images of an outdoor scene, where j is a positive integer greater than 1;
acquiring camera parameters of each of the target images, the camera parameters including camera intrinsic parameters and camera extrinsic parameters;
determining a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1;
performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object;
in response to the semantic object meeting a first preset condition, taking the semantic object meeting the first preset condition as a first semantic object, and determining a vector contour of the first semantic object in the terrain layer according to the camera parameters;
in response to the semantic object meeting a second preset condition, taking the semantic object meeting the second preset condition as a second semantic object, and determining a modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
performing three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
In some embodiments, determining the terrain layer according to the j target images further comprises:
determining a three-dimensional point cloud according to the j target images;
performing a cloth filtering calculation according to the three-dimensional point cloud to determine the terrain layer.
In some embodiments, performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
performing semantic recognition on each of the target images to obtain identified objects of each target image, and determining a first semantic label of each pixel in each identified object;
performing pixel depth estimation on all pixels of each identified object in each target image to obtain an estimated object and a depth position corresponding to the estimated object, and taking the estimated object as the semantic object.
In some embodiments, performing pixel depth estimation on all pixels of each identified object in each target image to obtain the estimated object and the depth position corresponding to the estimated object further comprises:
performing pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
calculating, according to the pixel depth positions, the average pixel depth position of all first pixels of the identified object in each target image, and taking the average pixel depth position as the depth position.
In some embodiments, in response to the semantic object meeting the first preset condition, taking the semantic object meeting the first preset condition as the first semantic object, and determining the vector contour of the first semantic object in the terrain layer according to the camera parameters further comprises:
mapping, according to the camera parameters, a terrain point to the first pixel corresponding to the first semantic object in the same target image, and determining a second semantic label corresponding to the terrain point according to the first semantic label of the corresponding first pixel;
taking, as a target semantic label, the identical second semantic label obtained most frequently for the terrain point across the target images;
performing a clustering calculation on all terrain points having the same target semantic label to obtain a target terrain point set;
performing a contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
In some embodiments, performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
performing a fusion calculation on the same estimated object in the j target images to obtain a three-dimensional fusion object and a fusion position corresponding to the fusion object, taking the fusion object as the semantic object, and taking the fusion position corresponding to the second semantic object as the modeling position.
In some embodiments, performing the fusion calculation on the same estimated object in the j target images to obtain the three-dimensional fusion object and the fusion position corresponding to the fusion object further comprises:
performing pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
performing, according to the pixel depth positions, a fusion calculation on each pixel of the same identified object in the j target images to obtain the fusion object, the fusion object including a voxels, each voxel obtained by fusing the corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
calculating a voxel depth position of each voxel according to the pixel depth positions;
calculating the fusion position of the fusion object according to the voxel depth positions as $\bar{d} = \frac{1}{a}\sum_{i=1}^{a} d_i$, where $d_i$ represents the voxel depth position of each voxel in the fusion object.
According to a second aspect of the embodiments of the present invention, a three-dimensional scene reconstruction apparatus is provided, comprising:
a first acquisition module, configured to acquire j target images of an outdoor scene, where j is a positive integer greater than 1;
a second acquisition module, configured to acquire camera parameters of each of the target images, the camera parameters including camera intrinsic parameters and camera extrinsic parameters;
a first determination module, configured to determine a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1;
a first calculation module, configured to perform semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object;
a second determination module, configured to take the semantic object as a first semantic object according to a first preset condition, and to determine a vector contour of the first semantic object in the terrain layer according to the camera parameters;
a third determination module, configured to take the semantic object as a second semantic object according to a second preset condition, and to determine a modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
a second calculation module, configured to perform three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
According to a third aspect of the embodiments of the present invention, a computing device is provided, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations of any one of the three-dimensional scene reconstruction methods described above.
According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, wherein at least one executable instruction is stored in the storage medium, and the executable instruction, when run, performs the operations of any one of the three-dimensional scene reconstruction methods described above.
In the embodiments of the present invention, because the terrain portion of a scene is large and corresponds to a large number of pixels in the target images, generating the terrain layer in the same way as the first and second semantic objects would increase the processor's computational load and reduce its efficiency. Generating the terrain layer differently from the first and second semantic objects separates the terrain portion of the outdoor scene from the other objects, so that the first and second semantic objects can subsequently be combined onto the terrain layer, improving the accuracy and completeness of the three-dimensional scene reconstruction. The terrain layer, the first semantic objects and the second semantic objects are all generated automatically by the processor, and the vector contours of the first semantic objects in the terrain layer and the modeling positions of the second semantic objects in the terrain layer are also extracted automatically by the processor, so no additional manual extraction of the three-dimensional models of the objects to be reconstructed is required, which improves the efficiency of three-dimensional reconstruction of outdoor scenes. Moreover, since the terrain layer, the first semantic objects and the second semantic objects are all generated from the j target images, they can be positioned and aligned better, so that the vector contours and modeling positions can be computed more accurately, benefiting the accuracy of the three-dimensional scene reconstruction.
By performing semantic recognition and pixel depth estimation on each pixel of the target images to obtain the corresponding first and second semantic objects, the second semantic objects obtained for objects to be reconstructed with small contours in the outdoor scene, such as utility poles and trash cans, retain more of the feature information of those objects, so that the second semantic objects are displayed relatively completely, the first and second semantic objects have relatively complete modeling details, data loss is reduced, and every object in the outdoor scene can be three-dimensionally reconstructed.
By determining the vector contour of the first semantic object in the terrain layer according to the camera parameters, a first semantic object with a large contour can obtain a relatively complete vector contour, so that the position of the first semantic object in the terrain layer can be located more accurately and a more accurate reconstruction model can be obtained.
By determining the modeling position of the second semantic object in the terrain layer according to the camera parameters, a second semantic object with a small contour can obtain a relatively accurate position, so that the position of the second semantic object in the terrain layer can be located accurately.
The above description is only an overview of the technical solutions of the embodiments of the present invention. In order that the technical means of the embodiments of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the embodiments of the present invention may be more readily apparent, specific embodiments of the present invention are set forth below.
Brief Description of the Drawings
The drawings are only for the purpose of illustrating embodiments and are not to be construed as limiting the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 shows a schematic flowchart of a three-dimensional scene reconstruction method provided in an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a three-dimensional scene reconstruction apparatus provided in an embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of a computing device provided in some embodiments of the present invention.
Detailed Description of the Embodiments
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be implemented in various forms and should not be limited by the embodiments set forth herein.
In the process of three-dimensional scene reconstruction, a processor builds models from multiple images captured by a camera. Existing three-dimensional scene reconstruction requires skilled labor: for a single scene, professional artists often have to manually cut out images or split models to extract the useful information, and extracting the terrain data, vegetation data, building data and so on can take half a day or even a whole day. The extraction process is tedious, labor-intensive and difficult, which reduces reconstruction efficiency and makes it hard to scale up to large modeling projects.
The inventor provides a three-dimensional scene reconstruction method in which a terrain layer is determined from j target images; semantic recognition and pixel depth estimation are performed on the target images to obtain first semantic objects and second semantic objects; the vector contour of each first semantic object in the terrain layer and the modeling position of each second semantic object in the terrain layer are then determined according to the camera parameters; and finally three-dimensional reconstruction is performed according to the vector contours, the modeling positions and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene. The first semantic objects include objects to be reconstructed with large contours, such as vegetation and buildings, while the second semantic objects include objects to be reconstructed with small contours, such as utility poles and trash cans. By distinguishing the terrain, the first semantic objects and the second semantic objects, extracting each kind of object to be reconstructed in a correspondingly different way, and combining the extracted objects for three-dimensional reconstruction, the accuracy and completeness of the three-dimensional scene reconstruction are improved; furthermore, the data of each object to be reconstructed can be extracted automatically, without additional manual extraction, which improves the efficiency of three-dimensional scene reconstruction.
FIG. 1 shows a flowchart of the three-dimensional scene reconstruction method provided in an embodiment of the present invention. The method is executed by a computing device, which may include one or more processors. A processor may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, which is not limited here. The one or more processors included in the computing device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs, which is not limited here.
As shown in FIG. 1, the method includes the following steps:
Step 110: acquire j target images of an outdoor scene, where j is a positive integer greater than 1.
Step 120: determine camera parameters of each target image, the camera parameters including camera intrinsic parameters and camera extrinsic parameters.
Step 130: determine a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1.
Step 140: perform semantic recognition and pixel depth estimation on at least one target image to obtain semantic objects.
Step 150: in response to a semantic object meeting a first preset condition, take the semantic object meeting the first preset condition as a first semantic object, and determine a vector contour of the first semantic object in the terrain layer according to the camera parameters.
Step 160: in response to a semantic object meeting a second preset condition, take the semantic object meeting the second preset condition as a second semantic object, and determine a modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object.
Step 170: perform three-dimensional reconstruction according to the vector contours, the modeling positions and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
In step 110, the processor acquires j target images of the outdoor scene in order to three-dimensionally reconstruct the outdoor scene. A target image is an image usable for three-dimensional reconstruction, and j is set according to the number of images required for reconstruction; usually at least 2 images are needed, and preferably 8 or more. The target images may be captured by a handheld three-dimensional scanning device or an aerial scanning device, with the images that the device's camera captures of the outdoor scene from different viewpoints serving as the target images.
In step 120, the camera intrinsic parameters are fixed when the camera leaves the factory and can be computed in advance by existing camera calibration methods, after which the processor acquires the intrinsic parameters; the specific calculation is not described here. The camera extrinsic parameters are the pose of the camera, represented by an RT matrix, and can be computed by the processor from the corresponding target image.
In step 130, during three-dimensional scene reconstruction, data must be extracted for the different physical objects of the outdoor scene. The terrain covers a large area and the other objects essentially stand on the ground, so the terrain portion needs to be separated from the other objects so that the three-dimensional models of the other objects can later be placed on the terrain model.
To determine the terrain layer from the j target images, the pixels of the j target images are projected into the world coordinate system and fused by point cloud registration into a three-dimensional point cloud; the point cloud is then filtered with a terrain filtering algorithm such as a morphological filter, an index filter or a cloth simulation filter to obtain the corresponding terrain layer. The terrain layer has multiple terrain points, the number of which is determined by the filtering result; k is then the number of terrain points. In some embodiments, the terrain points obtained by the terrain filtering algorithm may be further refined by error optimization to obtain a smaller number of terrain points, in which case k is the number of optimized terrain points.
In step 140, the processor performs semantic recognition and pixel depth estimation on at least one target image to obtain the semantic objects corresponding to the physical objects in the outdoor scene. Each target image can be segmented by semantic recognition into corresponding identified objects, which at this point are two-dimensional. The semantic recognition algorithm may be the Normalized Cut algorithm, the GrabCut algorithm, a thresholding method, a pixel-clustering-based segmentation method, a deep-learning-based semantic recognition algorithm or another semantic recognition algorithm. For example, a semantic segmentation model may be built by deep learning, objects of different categories in each target image may be predicted and recognized with the model, and the corresponding pixels may be assigned the label of the object category, thereby performing semantic recognition on the target image.
The pixel depth estimation yields a pixel position for each pixel of each identified object, giving the two-dimensional identified object a three-dimensional form and producing the corresponding semantic object. The pixel depth estimation may use algorithms such as Monodepth2, AdaBins or AdelaiDepth.
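A minimal sketch of the pixel depth estimation step; since Monodepth2, AdaBins and AdelaiDepth are not packaged for one-line loading, MiDaS is substituted here purely for convenience (an assumption, not the patent's choice), and the random image is a placeholder (the first run downloads weights over the network):

```python
import numpy as np
import torch

# Monocular depth estimation: one relative depth value per pixel.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

rgb = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)  # stand-in image
batch = transform(rgb)                                       # (1, 3, h, w)
with torch.no_grad():
    depth = midas(batch)   # relative inverse-depth map, one value per pixel
print(depth.shape)
```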
In some embodiments, one of the target images may be selected for semantic recognition and pixel depth estimation to obtain the corresponding semantic objects.
In some embodiments, because inferring the position of a target object from a single target image introduces error, inferring several not-fully-overlapping positions of the same target object from multiple target images captured from different viewpoints yields a better localization result. The positions of the same target object in different target images may be combined by averaging, by weighting, or in other ways, which are not limited here and may be set as needed.
Through step 140, semantic recognition and pixel depth estimation are performed directly on each pixel of the target images to obtain the corresponding semantic objects, so that for objects to be reconstructed with few feature points, such as utility poles and trash cans in the outdoor scene, the resulting semantic objects retain more feature information; the second semantic objects can thus be displayed relatively completely, the first and second semantic objects have relatively complete modeling details, data loss is reduced, and every object in the outdoor scene can be three-dimensionally reconstructed.
In step 150, the processor determines whether a semantic object meets the first preset condition; if so, in response, the processor takes the semantic object meeting the first preset condition as a first semantic object. The first preset condition may be determined from the feature points of the semantic object: for example, if the number of feature points of the semantic object is greater than or equal to a preset value, the semantic object is taken as a first semantic object. The first semantic objects include objects to be reconstructed with many feature points, such as vegetation and buildings in the target images. In some embodiments, the first preset condition may also be determined from preset first semantic object categories, so that semantic objects capable of forming a relatively complete vector contour in the terrain layer are determined as first semantic objects. For example, the first semantic object categories include buildings, grass, bushes and the like, and the processor takes semantic objects belonging to at least one of these categories as first semantic objects. In some embodiments, the first preset condition may also be determined by deep learning, so that semantic objects capable of forming a relatively complete vector contour in the terrain layer are taken as first semantic objects.
Because the first semantic objects and the terrain layer are both generated from the target images, the corresponding camera parameters are the same, so there is a mapping relationship between the first semantic objects and the terrain layer, and the vector contour of a first semantic object in the terrain layer can therefore be determined according to the camera parameters. Since a first semantic object has many feature points, a relatively complete vector contour can be obtained, and the vector contour allows the position of the first semantic object in the terrain layer to be located more accurately, yielding a more accurate reconstruction model.
In some embodiments, the points of a first semantic object may be transformed into the same coordinate system as the terrain layer according to the camera parameters; the semantically labeled points of the first semantic object are then matched with the terrain points by point cloud registration so that the terrain points carry semantic labels; the semantically labeled terrain points are then clustered, and the vector contour corresponding to the clustered terrain points is computed with a contour recognition algorithm.
In some embodiments, each terrain point of the terrain is mapped, according to the camera parameters, onto a semantically recognized target image, so that the semantic label of the corresponding pixel is assigned to the terrain point; the terrain points corresponding to a first semantic object are then clustered, so that nearby terrain points form clusters, after which the vector contour corresponding to the clustered terrain points is computed with a contour recognition algorithm.
There are various contour recognition algorithms, for example the alpha-shape algorithm.
In step 160, the processor determines whether a semantic object meets the second preset condition; if so, in response, the processor takes the semantic object meeting the second preset condition as a second semantic object. The second preset condition may be determined from the feature points of the semantic object: for example, if the number of feature points of the semantic object is less than a preset value, the semantic object is taken as a second semantic object. The second semantic objects include objects to be reconstructed with few feature points, such as utility poles and trash cans in the target images. In some embodiments, the condition may also be determined from the footprint of the semantic object on the terrain layer: for example, if the footprint of the semantic object is smaller than a preset value, the semantic object is taken as a second semantic object. In some embodiments, the second preset condition may also be determined from preset second semantic object categories, so that semantic objects that cannot form a relatively complete vector contour in the terrain layer are determined as second semantic objects. For example, the second semantic object categories include utility poles, trash cans, benches and the like, and the processor takes semantic objects belonging to at least one of these categories as second semantic objects. In some embodiments, the second preset condition may also be determined by deep learning, so that semantic objects that cannot form a relatively complete vector contour in the terrain layer are taken as second semantic objects. Other approaches are also possible, which are not limited here and may be set as needed.
Because a second semantic object has fewer feature points than a first semantic object, fewer of its feature points can be mapped to terrain points, so it cannot form a complete contour mapped onto the terrain layer; the resulting second semantic object cannot accurately locate its position in the terrain layer, which affects the accuracy of modeling the outdoor scene. In this case, determining the modeling position of the second semantic object in the terrain layer allows the position of the second semantic object in the terrain layer to be located relatively accurately.
In some embodiments, the modeling position of a second semantic object may be determined by computing the positions of its pixels: for example, the position of each voxel of the second semantic object is computed by depth estimation, and the modeling position of the second semantic object is computed as an average. Alternatively, in some embodiments, the centroid of the second semantic object may be determined, and the location of the centroid taken as the modeling position.
In step 170, after the vector contours of the first semantic objects and the modeling positions of the second semantic objects are obtained, the processor of the embodiment of the present invention combines the first semantic objects onto the terrain layer according to the vector contours, and combines the second semantic objects onto the terrain layer according to the modeling positions, obtaining the reconstruction model of the outdoor scene.
In some embodiments, the processor may also combine preset three-dimensional models from a preset model library onto the terrain layer according to the vector contours. For example, for grass, the processor may fill the region corresponding to the grass vector contour in the terrain layer with the grass model from the preset model library, forming the grass model in the three-dimensional reconstruction model of the outdoor scene.
In some embodiments, the model combination work may instead be handed over to artists; in this case, the artists combine the first semantic objects or the corresponding preset three-dimensional models onto the terrain layer according to the vector contours. The artists then only need to place the corresponding three-dimensional models, without extracting the position information, contour information and other data of the models, which greatly reduces their workload and improves the production efficiency of the three-dimensional reconstruction model of the outdoor scene.
Similarly, the processor may combine the second semantic objects or the corresponding preset three-dimensional models onto the terrain layer according to the modeling positions; or artists may combine the second semantic objects or the corresponding preset three-dimensional models onto the terrain layer according to the modeling positions.
Whether step 170 is executed by the processor or performed manually can be set as needed and is not limited here.
Of course, in some embodiments, the three-dimensional reconstruction model of the outdoor scene may also be rendered and otherwise processed as needed to achieve a better reconstruction result.
In steps 110 to 170, because the terrain portion is large and corresponds to many pixels in the target images, generating the terrain layer in the same way as the first and second semantic objects would increase the processor's computational load and reduce its efficiency. Generating the terrain layer differently from the first and second semantic objects separates the terrain portion of the outdoor scene from the other objects, so that the first and second semantic objects can subsequently be combined onto the terrain layer, improving the accuracy and completeness of the three-dimensional scene reconstruction. The terrain layer, the first semantic objects and the second semantic objects are all generated automatically by the processor, and the vector contours of the first semantic objects in the terrain layer and the modeling positions of the second semantic objects in the terrain layer are also extracted automatically by the processor, so no additional manual extraction of the three-dimensional models of the objects to be reconstructed is required, improving the efficiency of three-dimensional reconstruction of outdoor scenes. Moreover, since the terrain layer, the first semantic objects and the second semantic objects are all generated from the j target images, they can be positioned and aligned better, so that the vector contours and modeling positions can be computed more accurately, benefiting the accuracy of the three-dimensional scene reconstruction.
By performing semantic recognition and pixel depth estimation on each pixel of the target images to obtain the corresponding first and second semantic objects, the second semantic objects obtained for objects to be reconstructed with small contours in the outdoor scene, such as utility poles and trash cans, retain more of the feature information of those objects, so that the second semantic objects are displayed relatively completely, the first and second semantic objects have relatively complete modeling details, data loss is reduced, and every object in the outdoor scene can be three-dimensionally reconstructed.
By determining the vector contour of the first semantic object in the terrain layer according to the camera parameters, a first semantic object with a large contour can obtain a relatively complete vector contour, so that the position of the first semantic object in the terrain layer can be located more accurately and a more accurate reconstruction model can be obtained.
By determining the modeling position of the second semantic object in the terrain layer according to the camera parameters, a second semantic object with a small contour can obtain a relatively accurate position, so that the position of the second semantic object in the terrain layer can be located accurately.
In some embodiments, step 130 further includes:
Step a01: determine a three-dimensional point cloud from the j target images;
Step a02: perform a cloth filtering calculation on the three-dimensional point cloud to determine the terrain layer.
In steps a01 and a02, the pixels of the j target images are projected into the world coordinate system by matrix transformation using the camera intrinsic and extrinsic parameters, and fused by point cloud registration into a three-dimensional point cloud; the point cloud is then filtered by a cloth filtering algorithm to obtain the corresponding terrain layer, which has multiple terrain points whose number is determined by the filtering result; k is then the number of terrain points. The cloth filtering may be implemented with the CSF (cloth simulation filter) algorithm, in which the resolution of the cloth can be chosen: if the ground is m×n in length and width and the resolution is taken as 1 meter, the cloth has m×n points.
Through steps a01 and a02, computing the terrain layer with a cloth filtering algorithm effectively reduces the processor's computational load and improves its efficiency, and it further reduces the computational load in the subsequent combination of the terrain layer with the first and second semantic objects.
In some embodiments, step 140 further includes:
Step b01: perform semantic recognition on each target image to obtain the identified objects of each target image, and determine a first semantic label for each pixel of each identified object;
Step b02: perform pixel depth estimation on all pixels of each identified object in each target image to obtain estimated objects and the depth positions corresponding to the estimated objects, and take the estimated objects as the semantic objects.
In step b01, each target image is semantically recognized and segmented by a deep learning algorithm to obtain multiple identified objects in each target image, the identified objects including first identified objects and second identified objects. A deep-learning semantic segmentation model can predict preset categories: for example, a semantic segmentation model for categories such as people, buildings and grass is built in advance with a deep learning algorithm, the corresponding identified objects in the target image are then determined with the model, and the corresponding pixels are assigned first semantic labels according to the identified objects. For example, if a first identified object includes a grass object, the pixels of the grass object are assigned a first semantic label representing grass; if a second identified object includes a utility pole object, the pixels of the utility pole object are assigned a first semantic label representing a utility pole. A first semantic label may be expressed in Chinese, as a number, or as letters, which is not limited here and may be set as needed. For example, if a grass object is recognized, its pixels are labeled with the number 1; if a utility pole object is recognized, its pixels are labeled with the number 2.
In step b02, since the pixels displayed in each target image are two-dimensional, the identified objects are correspondingly two-dimensional; depth estimation yields the depth position of the pixels of the identified object in each target image, from which the three-dimensional spatial coordinates of each pixel can be determined, and combining the identified object with the depth positions yields a three-dimensional estimated object. In this case, a three-dimensional estimated object can be obtained from each target image, and the estimated object can accordingly be used as a semantic object for modeling.
Through steps b01 and b02, a semantic object with relatively complete feature points can be obtained from a single target image, reducing information loss during three-dimensional scene reconstruction and improving the accuracy and completeness of the reconstruction.
In some embodiments, step b02 further includes:
Step b021: perform pixel depth estimation on each pixel of the identified object in each target image to obtain pixel depth positions;
Step b022: calculate, from the pixel depth positions, the average pixel depth position of all first pixels of the identified object in each target image, and take the average pixel depth position as the depth position.
In some embodiments, step 150 further includes:
Step c01: map, according to the camera parameters, a terrain point to the first pixel corresponding to the first semantic object in the same target image, and determine the second semantic label corresponding to the terrain point according to the first semantic label of the corresponding first pixel;
Step c02: take, as the target semantic label, the identical second semantic label obtained most frequently for the terrain point across the target images;
Step c03: perform a clustering calculation on all terrain points having the same target semantic label to obtain a target terrain point set;
Step c04: perform a contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
In step c01, the first semantic objects and the terrain layer are both generated from the target images, so the corresponding camera parameters are the same, and a mapping relationship can be formed between the pixels of the first semantic objects and the terrain points of the terrain layer. Within the same target image, a terrain point is projected, via the projection matrix given by the camera parameters, onto the first pixel corresponding to the first semantic object in that target image, and the first semantic label of that first pixel is assigned to the terrain point, forming the second semantic label of the terrain point for that target image. Moreover, across different target images, the first semantic object onto which the same terrain point falls may be the same or different, so the second semantic labels obtained by the same terrain point from different target images may be the same or different.
In step c02, because several pixels of the target images may have been registered to the same point when the three-dimensional point cloud was formed, the second semantic labels obtained by projecting a terrain point onto the pixels of a single target image may differ in type; furthermore, across multiple target images, the same terrain point may obtain second semantic labels of different types from different target images. Taking the identical second semantic label obtained most frequently across the target images as the target semantic label yields a relatively accurate semantic result for the terrain point.
In some cases, within a single target image, if a terrain point obtains two or more types of second semantic labels, for example labels representing a grass object and labels representing a road object, and the labels representing the grass object are more numerous, the label representing the grass object is taken as the target semantic label of that terrain point.
In some cases, across multiple target images, the same terrain point may obtain second semantic labels of different types in different target images. For example, suppose there are two target images of the same location: in the first target image the terrain point obtains second semantic labels representing a grass object and a road object, and in the second target image it obtains second semantic labels representing a grass object and a building object; if the labels representing the grass object are the most numerous, the label representing the grass object is taken as the target semantic label of that terrain point.
In step c03, after steps c01 and c02, the terrain points of the terrain layer corresponding to a first semantic object have all been assigned target semantic labels. Performing a clustering calculation on all terrain points having the same target semantic label yields a target terrain point set, which can also be regarded as the set of projection points of the corresponding first semantic object onto the terrain layer, providing the basis for the vector contour calculation in the subsequent step c04.
In some embodiments, the density-based clustering algorithm DBSCAN may be used: terrain points whose neighbor distance is <= 1 m are clustered, and at least 100 clustered terrain points constitute a cluster.
For example, taking a road as an example, the terrain points whose target semantic label represents the road object are clustered with DBSCAN, using only the horizontal coordinates of the terrain points, which is equivalent to flattening the road; performing the clustering calculation on these road terrain points then groups nearby points into one road. If the road is broken, the broken road correspondingly clusters into two roads.
In step c04, the contour of the target terrain point set is recognized with a contour recognition algorithm, thereby determining the vector contour of the first semantic object in the terrain layer. In some embodiments, the vector contour corresponding to the target terrain point set may be computed with the alpha-shape algorithm.
Likewise, taking the road as an example, the target terrain point set whose semantics represent the road object is subjected to contour recognition with the alpha-shape algorithm, yielding the vector contour representing the edge contours on both sides of the road.
In steps c01 to c04, determining the second semantic label of a terrain point from the first semantic labels of the corresponding first pixels gives the terrain point a mapping relationship with multiple first pixels, so more second semantic label results are obtained and they are relatively more accurate; taking the identical second semantic label obtained most frequently across the target images as the target semantic label yields a more accurate semantic result for the terrain point; and clustering all terrain points with the same target semantic label into a target terrain point set and recognizing the contour of this set with a contour recognition algorithm determines the vector contour of the first semantic object in the terrain layer, making the vector contour more accurately located.
In some embodiments, step 140 further includes:
Step d01: perform, according to the depth positions, a fusion calculation on the same estimated object in the j target images to obtain a three-dimensional fusion object and the fusion position corresponding to the fusion object; take the fusion object as the semantic object, and take the fusion position corresponding to the second semantic object as the modeling position.
In step d01, because computing the position of an estimated object from a single target image introduces error, inferring several not-fully-overlapping positions of the same estimated object from multiple target images captured from different viewpoints yields a better localization result. In the fusion process, the voxel depth positions of the voxels formed by lifting the pixels of the estimated object into three dimensions are determined first; lifting each pixel of the estimated object into a voxel then forms the three-dimensional fusion object. The fusion object is taken as the semantic object, and if the processor determines that the semantic object meets the second preset condition, the fusion position corresponding to the second semantic object is taken as the modeling position.
In some embodiments, each pixel of the estimated object in each target image has a pixel depth position, and during the fusion calculation the voxel depth position may be computed by minimum-error calculation fusing the pixel depth positions of the corresponding pixels of the same estimated object. The minimum error may be determined by least squares, or a preset distance error may be set and a suitable voxel depth position selected.
In some embodiments, the average pixel depth position of the corresponding pixels of the same estimated object in the multiple target images may also be computed, and this average taken as the voxel depth position.
In some embodiments, the voxel depth position may also be computed by weighting, with the weights determined by how the estimated object appears in each target image: for example, if the estimated object appears in a target image with a complete contour and little distortion, that image's weight is relatively large. For ease of understanding, suppose there are 8 target images, 4 of which show the estimated object; the voxel depth position is then computed by weighting the positions obtained from each of the 4 target images: ds = w1*ds1 + w2*ds2 + w3*ds3 + w4*ds4, with w1 + w2 + w3 + w4 = 1, where w1, w2, w3 and w4 are the weights of the 4 target images, ds1 is the pixel depth position of one pixel of the estimated object in the first of the 4 target images, ds2 is the pixel depth position of the corresponding pixel of the same estimated object in the second target image, ds3 that in the third target image, and ds4 that in the fourth target image.
In some cases, if the target images corresponding to w2 and w3 show little distortion and are complete while those corresponding to w1 and w4 are more distorted and incomplete, w2 and w3 may be set larger, for example w2 = w3 = 0.4 and w1 = w4 = 0.1, or w2 = w3 = 0.3 and w1 = w4 = 0.2, as needed and without limitation here. In some cases, if the target images corresponding to w2 and w3 are judged to be free of distortion, w2 and w3 may likewise be set larger, for example w2 = w3 = 0.4 and w1 = w4 = 0.1, or w2 = w3 = 0.3 and w1 = w4 = 0.2, as needed and without limitation here.
In some cases, if the target images corresponding to w1, w2 and w3 show little distortion and are complete while the one corresponding to w4 is more distorted and incomplete, w1, w2 and w3 may be set larger, for example w1 = w2 = w3 = 0.3 and w4 = 0.1, or other values may be chosen, as needed and without limitation here. The weights are determined analogously from how the estimated object appears in each target image.
Similarly, the fusion position corresponding to the fusion object may be computed by minimum-error calculation; or by averaging, taking the average pixel depth position as the voxel depth position; or by weighting; or in other ways, which are not limited here and may be set as needed.
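A small numeric sketch of the weighted fusion described above; the weights reuse the example values from the text, while the per-view depths are invented placeholders:

```python
import numpy as np

# Fuse the pixel depth positions of one pixel of the same estimated object,
# seen in 4 views, into a voxel depth position with weights summing to 1.
ds_views = np.array([5.2, 5.0, 5.1, 5.6])   # ds1..ds4 (placeholder depths)
weights = np.array([0.1, 0.4, 0.4, 0.1])    # w1..w4, larger for undistorted views
assert np.isclose(weights.sum(), 1.0)

voxel_depth = float(weights @ ds_views)     # ds = w1*ds1 + ... + w4*ds4
print(voxel_depth)
```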
In some embodiments, step d01 further includes:
Step d011: perform pixel depth estimation on each pixel of the identified object in each target image to obtain pixel depth positions;
Step d012: perform, according to the pixel depth positions, a fusion calculation on each pixel of the same identified object in the j target images to obtain the fusion object, the fusion object including a voxels, each voxel obtained by fusing the corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
Step d013: calculate the voxel depth position of each voxel according to the pixel depth positions;
Step d014: calculate the fusion position of the fusion object according to the voxel depth positions as $\bar{d} = \frac{1}{a}\sum_{i=1}^{a} d_i$, where $d_i$ represents the voxel depth position of each voxel in the fusion object.
In steps d011 and d012, all pixels of the same identified object in the j target images are fused into voxels according to their pixel depth positions, yielding the corresponding fusion object, so that more feature information is retained during three-dimensional scene reconstruction; in the subsequent position and contour calculations, the richer feature information also gives more accurate localization. Each voxel of the fusion object is fused from the corresponding pixels of the corresponding estimated object. The identified objects in multiple target images can be brought into correspondence by matching calculation, in which case the corresponding identified objects in the multiple target images can be regarded as the same identified object. In some embodiments, matching may use pixel distances: if the matching value between the identified objects of multiple target images is greater than or equal to a preset threshold, the identified objects are considered to correspond. If the matching value is defined as a ratio, the preset threshold is usually set to 0.6 or above, preferably 0.8 or above. In some embodiments, matching may also use the other identified objects surrounding an identified object: as an example, if an identified object in one target image is surrounded by several other identified objects, and the identified objects in the other target images are each surrounded by several other identified objects of the corresponding types, the identified objects of the multiple target images can be considered to match. Matching may also be done in other ways, which are not limited here and may be set as needed.
For example, suppose there are 8 target images, 4 of which show the estimated object, and take a grass object as the estimated object; the grass objects corresponding to the 4 target images are then fused. Each pixel of the grass object in each target image is fused, according to the pixel depth positions, with the corresponding pixel of the corresponding grass object in the other target images, yielding the corresponding voxels and voxel depth positions; in this case, the corresponding grass objects in the multiple target images can be regarded as the same grass object. The grass objects of the 4 target images are matched by pixel distance: if the distance between a pixel of the grass object in one target image and a pixel of the grass object in another target image meets a preset distance, that pixel of the two grass objects is considered matched; the distance is computed between each pixel of the grass object in each target image and each pixel of the grass objects in the other target images, and the ratio of the number of matched pixels of the grass object in each target image to the total number of pixels is taken as the matching value. If the matching value of the grass object in every target image is greater than 0.8, the grass objects in the 4 target images correspond.
In step d013, during the fusion calculation, the pixel depth positions of the corresponding pixels of the same identified object in the j target images are fused to obtain the voxel depth position of the corresponding voxel; the ways of computing the voxel depth position from the pixel depth positions are as described for step d01 above and are not repeated here.
In step d014, computing the average depth position keeps the calculation simple, reducing the processor's computational load, and also yields a relatively accurate fusion position, giving a better three-dimensional scene reconstruction result.
In steps d011 to d014, fusing all pixels of the same identified object in the j target images into voxels according to the pixel depth positions yields the corresponding fusion object, so that more feature information is retained during three-dimensional scene reconstruction; in the subsequent position and contour calculations, the richer feature information also gives more accurate localization. Computing the average depth position keeps the calculation simple, reducing the processor's computational load, and also yields a relatively accurate fusion position, giving a better three-dimensional scene reconstruction result.
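A minimal sketch of steps d013 and d014: voxel depth positions are fused here by a simple per-view mean, and the fusion position is the average of the voxel depths, $\bar{d} = \frac{1}{a}\sum_i d_i$; all values are random stand-ins:

```python
import numpy as np

# One identified object seen in 4 views, with a corresponding pixels per view.
a, views = 250, 4
pixel_depths = np.random.rand(views, a) + 5.0   # per-view pixel depth positions

voxel_depths = pixel_depths.mean(axis=0)        # d_i for each of the a voxels
fusion_position = voxel_depths.mean()           # d_bar = (1/a) * sum_i d_i
print(round(float(fusion_position), 3))
```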
FIG. 2 shows a schematic structural diagram of the three-dimensional scene reconstruction apparatus provided in an embodiment of the present invention. The apparatus 200 includes:
a first acquisition module 210, configured to acquire j target images of an outdoor scene, where j is a positive integer greater than 1;
a second acquisition module 220, configured to acquire camera parameters of each target image, the camera parameters including camera intrinsic parameters and camera extrinsic parameters;
a first determination module 230, configured to determine a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1;
a first calculation module 240, configured to perform semantic recognition and pixel depth estimation on at least one target image to obtain a semantic object;
a second determination module 250, configured to take the semantic object as a first semantic object according to a first preset condition, and to determine the vector contour of the first semantic object in the terrain layer according to the camera parameters;
a third determination module 260, configured to take the semantic object as a second semantic object according to a second preset condition, and to determine the modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
a second calculation module 270, configured to perform three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain the three-dimensional reconstruction model of the outdoor scene.
In some embodiments, the first determination module 230 further includes:
a first determination unit, configured to determine a three-dimensional point cloud according to the j target images;
a second determination unit, configured to perform a cloth filtering calculation according to the three-dimensional point cloud to determine the terrain layer.
In some embodiments, the first calculation module 240 further includes:
a first recognition unit, configured to perform semantic recognition on each target image to obtain the identified objects of each target image, and to determine the first semantic label of each pixel in each identified object;
a first obtaining unit, configured to perform pixel depth estimation on all pixels of each identified object in each target image to obtain an estimated object and the depth position corresponding to the estimated object, and to take the estimated object as the semantic object.
In some embodiments, the first obtaining unit further includes:
a second recognition unit, configured to perform pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
a first operation unit, configured to calculate, according to the pixel depth positions, the average pixel depth position of all first pixels of the identified object in each target image, and to take the average pixel depth position as the depth position.
In some embodiments, the second determination module 250 further includes:
a first mapping unit, configured to map, according to the camera parameters, a terrain point to the first pixel corresponding to the first semantic object in the same target image, and to determine the second semantic label corresponding to the terrain point according to the first semantic label of the corresponding first pixel;
a second operation unit, configured to take, as the target semantic label, the identical second semantic label obtained most frequently for the terrain point across the target images;
a first clustering unit, configured to perform a clustering calculation on all terrain points having the same target semantic label to obtain a target terrain point set;
a third recognition unit, configured to perform a contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
In some embodiments, the first calculation module 240 further includes:
a first fusion unit, configured to perform a fusion calculation on the same estimated object in the j target images to obtain a three-dimensional fusion object and the fusion position corresponding to the fusion object, to take the fusion object as the semantic object, and to take the fusion position corresponding to the second semantic object as the modeling position.
In some embodiments, the first fusion unit further includes:
a third operation unit, configured to perform pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
a second fusion unit, configured to perform, according to the pixel depth positions, a fusion calculation on each pixel of the same identified object in the j target images to obtain the fusion object, the fusion object including a voxels each obtained by fusing the corresponding pixels, where a is a positive integer greater than 0;
a fourth operation unit, configured to calculate the voxel depth position of each voxel according to the pixel depth positions;
a fifth operation unit, configured to calculate the fusion position of the fusion object according to the voxel depth positions as $\bar{d} = \frac{1}{a}\sum_{i=1}^{a} d_i$, where $d_i$ represents the voxel depth position of each voxel in the fusion object.
FIG. 3 shows a schematic structural diagram of the computing device provided in an embodiment of the present invention; the specific embodiments of the present invention do not limit the specific implementation of the computing device.
As shown in FIG. 3, the computing device may include: a processor 302, a communication interface 304, a memory 306 and a communication bus 308.
The processor 302, the communication interface 304 and the memory 306 communicate with one another via the communication bus 308. The communication interface 304 is used to communicate with network elements of other devices, such as clients or other servers. The processor 302 is used to execute a program 310, and may specifically perform the relevant steps in the above method embodiments of the three-dimensional scene reconstruction method.
Specifically, the program 310 may include program code, and the program code includes computer-executable instructions.
The processor 302 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 306 is used to store the program 310. The memory 306 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk storage.
An embodiment of the present invention also provides a computer-readable storage medium in which at least one executable instruction is stored; when run, the executable instruction performs the operations of any one of the three-dimensional scene reconstruction methods described above.
The algorithms or displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the above description. Moreover, the embodiments of the present invention are not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented in various programming languages, and the above description of a specific language is intended to disclose the best mode of the present invention.
In the specification provided here, numerous specific details are set forth. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the present invention and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the present invention the features of the embodiments are sometimes grouped together into a single embodiment, figure or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.
Those skilled in the art will understand that the modules in the devices of the embodiments may be adaptively changed and arranged in one or more devices different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, or divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and the like does not indicate any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless otherwise specified.

Claims (10)

  1. A three-dimensional scene reconstruction method, characterized in that the method comprises:
    acquiring j target images of an outdoor scene, where j is a positive integer greater than 1;
    acquiring camera parameters of each of the target images, the camera parameters comprising camera intrinsic parameters and camera extrinsic parameters;
    determining a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1;
    performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object;
    in response to the semantic object meeting a first preset condition, taking the semantic object meeting the first preset condition as a first semantic object, and determining a vector contour of the first semantic object in the terrain layer according to the camera parameters;
    in response to the semantic object meeting a second preset condition, taking the semantic object meeting the second preset condition as a second semantic object, and determining a modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
    performing three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  2. The three-dimensional scene reconstruction method according to claim 1, characterized in that determining the terrain layer according to the j target images further comprises:
    determining a three-dimensional point cloud according to the j target images;
    performing a cloth filtering calculation according to the three-dimensional point cloud to determine the terrain layer.
  3. The three-dimensional scene reconstruction method according to claim 1, characterized in that performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
    performing semantic recognition on each of the target images to obtain identified objects of each target image, and determining a first semantic label of each pixel in each identified object;
    performing pixel depth estimation on all pixels of each identified object in each target image to obtain an estimated object and a depth position corresponding to the estimated object, and taking the estimated object as the semantic object.
  4. The three-dimensional scene reconstruction method according to claim 3, characterized in that performing pixel depth estimation on all pixels of each identified object in each target image to obtain the estimated object and the depth position corresponding to the estimated object further comprises:
    performing pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
    calculating, according to the pixel depth positions, the average pixel depth position of all first pixels of the identified object in each target image, and taking the average pixel depth position as the depth position.
  5. The three-dimensional scene reconstruction method according to claim 3, characterized in that, in response to the semantic object meeting the first preset condition, taking the semantic object meeting the first preset condition as the first semantic object, and determining the vector contour of the first semantic object in the terrain layer according to the camera parameters further comprises:
    mapping, according to the camera parameters, a terrain point to the first pixel corresponding to the first semantic object in the same target image, and determining a second semantic label corresponding to the terrain point according to the first semantic label of the corresponding first pixel;
    taking, as a target semantic label, the identical second semantic label obtained most frequently for the terrain point across the target images;
    performing a clustering calculation on all terrain points having the same target semantic label to obtain a target terrain point set;
    performing a contour recognition calculation on the target terrain point set to determine the vector contour of the first semantic object in the terrain layer.
  6. The three-dimensional scene reconstruction method according to claim 3, characterized in that performing semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object further comprises:
    performing a fusion calculation on the same estimated object in the j target images to obtain a three-dimensional fusion object and a fusion position corresponding to the fusion object, taking the fusion object as the semantic object, and taking the fusion position corresponding to the second semantic object as the modeling position.
  7. The three-dimensional scene reconstruction method according to claim 6, characterized in that performing the fusion calculation on the same estimated object in the j target images to obtain the three-dimensional fusion object and the fusion position corresponding to the fusion object further comprises:
    performing pixel depth estimation on each pixel of the identified object in each target image to obtain a pixel depth position;
    performing, according to the pixel depth positions, a fusion calculation on each pixel of the same identified object in the j target images to obtain the fusion object, the fusion object comprising a voxels, each voxel obtained by fusing the corresponding pixels of the same identified object in the j target images, where a is a positive integer greater than 0;
    calculating a voxel depth position of each voxel according to the pixel depth positions;
    calculating the fusion position of the fusion object according to the voxel depth positions as $\bar{d} = \frac{1}{a}\sum_{i=1}^{a} d_i$,
    where $d_i$ represents the voxel depth position of each voxel in the fusion object.
  8. A three-dimensional scene reconstruction apparatus, characterized by comprising:
    a first acquisition module, configured to acquire j target images of an outdoor scene, where j is a positive integer greater than 1;
    a second acquisition module, configured to acquire camera parameters of each of the target images, the camera parameters comprising camera intrinsic parameters and camera extrinsic parameters;
    a first determination module, configured to determine a terrain layer according to the j target images, the terrain layer having k terrain points, where k is a positive integer greater than 1;
    a first calculation module, configured to perform semantic recognition and pixel depth estimation on at least one of the target images to obtain a semantic object;
    a second determination module, configured to take the semantic object as a first semantic object according to a first preset condition, and to determine a vector contour of the first semantic object in the terrain layer according to the camera parameters;
    a third determination module, configured to take the semantic object as a second semantic object according to a second preset condition, and to determine a modeling position of the second semantic object in the terrain layer according to the camera parameters, the second semantic object having fewer feature points than the first semantic object;
    a second calculation module, configured to perform three-dimensional reconstruction according to the vector contour, the modeling position and the terrain layer to obtain a three-dimensional reconstruction model of the outdoor scene.
  9. A computing device, characterized by comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another via the communication bus;
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations of the three-dimensional scene reconstruction method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that at least one executable instruction is stored in the storage medium, and the executable instruction, when run, performs the operations of the three-dimensional scene reconstruction method according to any one of claims 1 to 7.
PCT/CN2023/124212 2022-10-26 2023-10-12 Three-dimensional scene reconstruction method, apparatus, device and storage medium WO2024088071A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211315512.8 2022-10-26
CN202211315512.8A CN115375857B (zh) 2022-10-26 2022-10-26 Three-dimensional scene reconstruction method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2024088071A1 true WO2024088071A1 (zh) 2024-05-02

Family

ID=84072743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/124212 WO2024088071A1 (zh) 2022-10-26 2023-10-12 三维场景重建方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN115375857B (zh)
WO (1) WO2024088071A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375857B (zh) * 2022-10-26 2023-01-03 深圳市其域创新科技有限公司 三维场景重建方法、装置、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002063596A (ja) * 2000-06-05 2002-02-28 Namco Ltd Game system, program and information storage medium
CN107945268A (zh) * 2017-12-15 2018-04-20 深圳大学 High-precision three-dimensional reconstruction method and system based on binary-plane structured light
CN111968129A (zh) * 2020-07-15 2020-11-20 上海交通大学 Semantic-aware simultaneous localization and mapping system and method
US20210327126A1 (en) * 2018-10-31 2021-10-21 Shenzhen University 3D Object Reconstruction Method, Computer Apparatus and Storage Medium
CN113673400A (zh) * 2021-08-12 2021-11-19 土豆数据科技集团有限公司 Real-scene three-dimensional semantic reconstruction method, apparatus and storage medium based on deep learning
CN114782530A (zh) * 2022-03-28 2022-07-22 杭州国辰机器人科技有限公司 Three-dimensional semantic map construction method, apparatus, device and medium for indoor scenes
CN115375857A (zh) * 2022-10-26 2022-11-22 深圳市其域创新科技有限公司 Three-dimensional scene reconstruction method, apparatus, device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283052A (zh) * 2021-12-30 2022-04-05 北京大甜绵白糖科技有限公司 Makeup transfer and makeup transfer network training method and apparatus


Also Published As

Publication number Publication date
CN115375857A (zh) 2022-11-22
CN115375857B (zh) 2023-01-03
