WO2021175050A1 - Three-dimensional reconstruction method and three-dimensional reconstruction device (三维重建方法和三维重建装置)

Info

Publication number: WO2021175050A1
Application number: PCT/CN2021/074094
Authority: WIPO (PCT)
Prior art keywords: image, nolf, dimensional, map, model
Other languages: English (en), French (fr)
Inventors: 王昊, 张淋淋
Original assignee: 华为技术有限公司 (application filed by 华为技术有限公司)
Priority to: US17/902,624 (published as US20220414911A1)

Classifications

    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06F16/583 Retrieval of still image data characterised by using metadata automatically derived from the content
    • G06F18/00 Pattern recognition
    • G06F18/24 Pattern recognition; classification techniques
    • G06N3/02 Neural networks
    • G06N3/04 Neural network architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T2200/08 Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30244 Camera pose
    • G06T2210/04 Architectural design, interior design
    • G06T2210/56 Particle system, point based geometry or rendering
    • G06T2219/2016 Rotation, translation, scaling

Definitions

  • This application relates to three-dimensional modeling technology, in particular to a three-dimensional reconstruction method and a three-dimensional reconstruction device.
  • Three-dimensional digital data can improve people's level of understanding of real space and bring people rich information far beyond what two-dimensional image data provides. A common solution is to use a high-precision laser scanner to obtain point-cloud information of the 3D environment for modeling; however, laser scanners are expensive. How to obtain real, usable 3D digital data efficiently, accurately, and cheaply is one of the key bottlenecks limiting the further development of 3D applications.
  • In the existing image technology based on a pre-built model database, image analysis is performed on a single scene image input by the user to obtain a partial image of the target, model retrieval is performed in the pre-built model database to determine the matching model, and three-dimensional reconstruction is thereby achieved.
  • In the preset model database, each three-dimensional model is projected and mapped from different positions and angles in a virtual three-dimensional space, so that the three-dimensional model itself is replaced with a set of multi-angle projection images; in this way, the two-dimensional-to-three-dimensional retrieval problem in model retrieval is reduced to image matching.
  • That is, a preset 3D model is projected from preset positions and angles, and a set of multi-angle projection images is used to represent the 3D model in the model database.
  • However, the actual viewing angle of the scene image is often arbitrary and random and matches the positions and angles of the projection images in the model database only poorly; therefore, the accuracy of model retrieval is low.
  • In view of this, the embodiments of the present application provide a three-dimensional reconstruction method for implementing three-dimensional reconstruction of an object from a two-dimensional image, which can improve the accuracy of three-dimensional model matching.
  • The first aspect of the embodiments of the present application provides a three-dimensional reconstruction method, including: acquiring an image of a first object and the camera pose of the image; determining, through a first deep learning network, a first normalized position field (NOLF) map of the first object in the image, where the first NOLF map indicates the normalized three-dimensional point cloud of the first object under the shooting angle of view of the image; determining, according to the first NOLF map, a first model corresponding to the first object from multiple three-dimensional models in a model database; determining the pose of the first object according to the first model and the camera pose of the image; and three-dimensionally reconstructing the first object in the image according to the first model and the pose of the first object.
  • In the embodiments of the present application, a two-dimensional image obtained by shooting a scene containing a first object is acquired and input into a deep learning network to obtain a NOLF map of the first object, which indicates the normalized three-dimensional point cloud of the first object under the shooting angle of view of the camera that captured the image. The first model corresponding to the first object is determined from the model database according to this NOLF map, and the pose of the first object is determined according to the first model and the camera pose of the image, thereby realizing the three-dimensional reconstruction of the first object in the image.
  • Because the NOLF map of the first object indicates the normalized three-dimensional point cloud of the first object under the shooting angle of view of the camera, that is, part of the three-dimensional information of the first object under that angle is recovered through the deep learning network, the accuracy of model matching can be improved and the success rate of three-dimensional reconstruction can be increased.
  • The method further includes: determining a first relative pose of the first object according to the first NOLF map, where the first relative pose is the relative pose between the pose of the first object and the camera pose of the image; and determining the NOLF maps of the plurality of three-dimensional models under the viewing angle corresponding to the first relative pose. The determining, according to the first NOLF map, of a first model corresponding to the first object from a plurality of three-dimensional models in a model database includes: determining, from the NOLF maps corresponding to each of the plurality of three-dimensional models, the first model corresponding to the NOLF map with the highest similarity to the first NOLF map.
  • In the embodiments of the present application, the first relative pose between the first object and the camera at the time of shooting can be determined according to the first NOLF map.
  • Candidate NOLF maps of the various three-dimensional models in the model database can then be obtained: according to the first relative pose and the posture of a three-dimensional model, the position and direction of an observation point can be determined, and the NOLF map of the three-dimensional model indicates the normalized three-dimensional point cloud visible when observing the model from that position and direction.
  • Observing the three-dimensional model from the position and direction of the observation point can be understood as a simulated camera photographing the first object; in this way, a candidate NOLF map of the three-dimensional model is obtained for comparison.
  • The three-dimensional model corresponding to the candidate NOLF map with the highest similarity to the first NOLF map is determined as the first model.
  • Compared with the prior art, the method of the embodiment of the present application obtains the candidate NOLF maps based on the calculated initial pose, thereby reducing the difficulty of retrieval.
  • In this way, the two-dimensional modeling object and the objects in the prefabricated three-dimensional model library are unified into the same data expression through the NOLF map, and this data expression is independent of the lighting conditions of the modeling target in the real image and of the texture details of the three-dimensional model.
  • The three-dimensional point cloud indicated by the NOLF map implicitly encodes the three-dimensional shape and geometric information of the first object, which facilitates similarity comparison between objects in the feature space; a minimal comparison sketch follows.
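  • The patent does not fix a concrete similarity measure for comparing NOLF maps; the sketch below assumes a masked per-pixel L2 comparison in the normalized coordinate space (the function names nolf_similarity and retrieve_model, and the dictionary-based candidate format, are illustrative only).

```python
import numpy as np

def nolf_similarity(query_nolf, candidate_nolf, mask):
    """Masked similarity between two NOLF maps of equal resolution.

    Each NOLF map is an HxWx3 array of normalized XYZ coordinates;
    `mask` marks the pixels where both maps cover the object.
    """
    if not mask.any():
        return 0.0
    diff = np.linalg.norm(query_nolf - candidate_nolf, axis=-1)  # per-pixel 3D distance
    # Higher score = smaller mean coordinate distance over the shared mask.
    return 1.0 / (1.0 + diff[mask].mean())

def retrieve_model(query_nolf, query_mask, candidates):
    """Return the index of the candidate whose NOLF map best matches the query."""
    scores = [nolf_similarity(query_nolf, c["nolf"], query_mask & c["mask"])
              for c in candidates]
    return int(np.argmax(scores))
```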
  • The determining of the first relative pose of the first object according to the first NOLF map includes: determining, through a second deep learning network, the pixel coordinates of at least four feature points of the first object in the image, where the object points indicated by four of the feature points are not coplanar in three-dimensional space; determining the three-dimensional coordinates of the at least four feature points from the first NOLF map; and determining the first relative pose according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  • The three-dimensional reconstruction method can determine at least four feature points from the first NOLF map. Since the first NOLF map is a two-dimensional image, the pixel coordinates of the feature points in the first NOLF map can be determined; in addition, because the NOLF map indicates a three-dimensional point cloud, each pixel corresponds to a three-dimensional coordinate. The correspondence between pixel coordinates and three-dimensional coordinates can thus be obtained for at least four feature points, and from these correspondences, the relative pose of the camera with respect to the first object at the time the image was taken can be calculated.
  • The determining of the first relative pose according to the pixel coordinates and three-dimensional coordinates of the at least four feature points includes: determining the first relative pose through a perspective-n-point (PnP) estimation algorithm according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  • The three-dimensional reconstruction method provided by the embodiment of the present application thus gives a specific implementation for calculating the relative pose, namely calculation by the PnP algorithm; a hedged sketch follows.
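  • As an illustration of this step, here is a minimal sketch using OpenCV's solvePnP; the function name relative_pose_from_nolf and the EPnP flag choice are assumptions, since the patent only specifies that a PnP algorithm is used.

```python
import cv2
import numpy as np

def relative_pose_from_nolf(pixel_coords, nolf_coords, camera_matrix):
    """Estimate the object-to-camera pose from >= 4 2D-3D correspondences.

    pixel_coords:  (N, 2) feature-point pixel coordinates in the image.
    nolf_coords:   (N, 3) normalized 3D coordinates read from the NOLF map.
    camera_matrix: 3x3 camera intrinsic matrix A.
    """
    ok, rvec, tvec = cv2.solvePnP(
        nolf_coords.astype(np.float64),
        pixel_coords.astype(np.float64),
        camera_matrix.astype(np.float64),
        distCoeffs=None,            # assume an undistorted image
        flags=cv2.SOLVEPNP_EPNP,    # any PnP variant accepting >= 4 points works
    )
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)      # rotation vector -> 3x3 rotation matrix
    return R, tvec                  # first relative pose (rotation, translation)
```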
  • the characteristic points of the first object include: eight corner points of a bounding box of the first object.
  • The feature points may specifically be the corner points of the bounding box of the first object. Since the corner points of a bounding box can be determined for any object, the method is universal and easy to implement. A deep learning network can also be trained to predict the corner points of the object's bounding box in order to determine the feature points of the first object.
  • The method further includes: inputting the image into the first deep learning network to determine a first original NOLF map; and determining the first NOLF map according to the first original NOLF map and the image mask of the first object.
  • The method further includes: determining an image mask of the first object in the image through a third deep learning network. Optionally, the third deep learning network and the first deep learning network may be different deep learning networks or the same deep learning network, which is not specifically limited here.
  • That is, the image mask of the first object may be determined through the third deep learning network, and the first NOLF map may be determined according to the mask.
  • The model database includes the categories of the multiple three-dimensional models, where the first object belongs to a first category, and the method further includes: determining, according to the first NOLF map, the first model from the three-dimensional models belonging to the first category.
  • The method further includes: inputting the image into a fourth deep learning network and determining that the first object belongs to the first category. Optionally, the fourth deep learning network and the first deep learning network may be different deep learning networks or the same deep learning network, which is not specifically limited here.
  • The three-dimensional reconstruction method provided by the embodiment of the present application can thus predict the category of the first object through the fourth deep learning network and, according to the category, select from the model database only the three-dimensional models belonging to that category for subsequent model matching, which can reduce the amount of calculation.
  • A second aspect of the embodiments of the present application provides a three-dimensional reconstruction device, including: an acquisition unit, configured to acquire an image of a first object and the camera pose of the image; a determining unit, configured to determine, through a first deep learning network, a first normalized position field (NOLF) map of the first object in the image, where the first NOLF map indicates a normalized three-dimensional point cloud of the first object under the shooting angle of view of the image; the determining unit is further configured to determine, according to the first NOLF map, a first model corresponding to the first object from multiple three-dimensional models in a model database, and to determine the pose of the first object according to the first model and the camera pose of the image; and a reconstruction unit, configured to three-dimensionally reconstruct the first object in the image according to the first model and the pose of the first object.
  • The determining unit is further configured to determine a first relative pose of the first object according to the first NOLF map, where the first relative pose is the relative pose between the pose of the first object and the camera pose of the image, and to determine the NOLF maps of the plurality of three-dimensional models under the viewing angle corresponding to the first relative pose; the determining unit is specifically configured to determine, from the NOLF maps corresponding to each of the plurality of three-dimensional models, the first model corresponding to the NOLF map with the highest similarity to the first NOLF map.
  • The determining unit is specifically configured to: determine, through a second deep learning network, the pixel coordinates of at least four feature points of the first object in the image, where the object points indicated by four of the feature points are not coplanar in three-dimensional space; determine the three-dimensional coordinates of the at least four feature points from the first NOLF map; and determine the first relative pose according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  • the determining unit is specifically configured to determine the first relative pose by using a perspective N-point estimation PnP algorithm according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  • the characteristic points of the first object include: eight corner points of the bounding box of the first object.
  • The determining unit is specifically configured to: input the image into the first deep learning network to determine a first original NOLF map; and determine the first NOLF map according to the first original NOLF map and the image mask of the first object.
  • The model database includes the categories of the multiple three-dimensional models, where the first object belongs to a first category; the determining unit is specifically configured to determine, according to the first NOLF map, the first model from the three-dimensional models belonging to the first category.
  • The third aspect of the embodiments of the present application provides a three-dimensional reconstruction device, including a processor and a memory that are connected to each other, where the memory is used to store a computer program including program instructions, and the processor is configured to call the program instructions to execute the method described in the first aspect or any one of its possible implementation manners.
  • The fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method in the first aspect or any one of its possible implementation manners.
  • The fifth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method in the first aspect or any one of its possible implementation manners.
  • The sixth aspect of the embodiments of the present application provides a chip system; the chip system includes a processor, and the processor is used to read and execute a computer program stored in a memory to perform the functions involved in any possible implementation manner of any of the foregoing aspects.
  • the chip system further includes a memory, and the memory is electrically connected to the processor.
  • the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information that needs to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface can be an input and output interface.
  • the chip system can be composed of chips, or include chips and other discrete devices.
  • In the embodiments of the present application, the image is input into the deep learning network, which outputs the first normalized position field (NOLF) map of the first object under the shooting angle of view of the image; the first NOLF map indicates the normalized three-dimensional information of the first object under that shooting angle. According to the first NOLF map, the first model is determined from the multiple preset three-dimensional models in the model database; the pose of the first target model is then determined from the first model and the camera pose of the image, so that three-dimensional reconstruction of the first target object can be realized.
  • Because this solution retrieves the 3D model in the model database using the NOLF map recovered from the captured image through the deep learning network, compared with the prior art there is no need to project the 3D models from different positions and perspectives in advance, and the retrieval and matching against the three-dimensional models achieves high accuracy.
  • The three-dimensional reconstruction method can also predict the relative pose of the first object based on the NOLF map and the pixel coordinates of the feature points. Based on this calculated initial pose, the NOLF map of each three-dimensional model in the model database under the initial pose is obtained and compared with the NOLF map of the modeling object. As a result, the object and the 3D models in the database are unified under the same data expression form, which reduces the difficulty of 3D model retrieval and effectively reduces the amount of calculation.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of this application.
  • Figure 2 is a schematic diagram of an application environment provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of an application scenario of the three-dimensional reconstruction method in an embodiment of the application.
  • FIG. 6 is a schematic diagram of an embodiment of a three-dimensional reconstruction method in an embodiment of the application.
  • FIG. 7 is a schematic diagram of an embodiment of the NOLF map of the target object in the embodiment of the application.
  • FIG. 8 is a schematic diagram of the deep learning network architecture of the three-dimensional modeling method in an embodiment of the application.
  • FIG. 9 is a schematic diagram of an embodiment of similarity detection in an embodiment of the application.
  • FIG. 10 is a schematic diagram of another embodiment of a three-dimensional reconstruction method in an embodiment of this application.
  • FIG. 11 is a schematic diagram of an embodiment of a three-dimensional reconstruction device in an embodiment of the application.
  • FIG. 12 is a schematic diagram of another embodiment of the three-dimensional reconstruction device in the embodiment of the application.
  • FIG. 13 is a diagram of a chip hardware structure provided by an embodiment of the application.
  • The embodiment of the present application provides a three-dimensional reconstruction method, which is used for three-dimensional reconstruction of an object and can improve the accuracy of model matching.
  • a three-dimensional model is a polygonal representation of an object, which is usually displayed by a computer or other video equipment.
  • the displayed objects can be real-world entities or fictional objects. Anything that exists in the physical world can be represented by a three-dimensional model.
  • the three-dimensional model of the object is used to indicate the three-dimensional structure and size information of the object.
  • Normalized three-dimensional model with directionality: the size of the object's three-dimensional model is normalized and the model is placed in a three-dimensional coordinate system according to a preset main viewing direction; it contains the three-dimensional structure information of the object.
  • The preset main viewing direction is usually the direction that conforms to user habits and best reflects the shape and characteristics of the object. For example, for a camera, the shooting button is set to face up and the lens direction is taken as the main viewing direction; the actual size of the camera is then normalized and zoomed to the preset size, yielding a normalized three-dimensional model with directionality.
  • The method for acquiring the directional normalized three-dimensional model in the embodiment of the present application is to preset the main viewing direction of the object and define the normalized object position field as a three-dimensional space whose length, width, and height are all 1.
  • The three-dimensional model of the object is normalized and scaled so that its center of mass is located at the center point of this three-dimensional space, producing a directional normalized three-dimensional model; a sketch follows.
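  • A minimal sketch of this normalization, assuming the preset main viewing direction is supplied as a rotation matrix and that the longest extent of the model is scaled to 1 inside the unit cube (both details go beyond what the text specifies):

```python
import numpy as np

def normalize_model(vertices, main_view_rotation=np.eye(3)):
    """Place a model into the normalized object position field (unit cube).

    vertices: (N, 3) model vertices. main_view_rotation: rotation aligning
    the preset main viewing direction with the canonical axes (assumed).
    """
    v = vertices @ main_view_rotation.T            # orient to the main view
    v = v - v.mean(axis=0)                         # center of mass to the origin
    v = v / (v.max(axis=0) - v.min(axis=0)).max()  # longest extent becomes 1
    return v + 0.5                                 # centered in the 1x1x1 cube
```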
  • NOLF map: a normalized three-dimensional point cloud indicating the visible part of an object or a three-dimensional model under a certain angle of view.
  • The NOLF map is a data expression in image form: each pixel coordinate of the image corresponds to the XYZ coordinate value of the three-dimensional model in the normalized three-dimensional space. That is, each pixel coordinate corresponds to a three-dimensional coordinate value, which establishes the correspondence between the model's pixel coordinates on the image and its three-dimensional coordinates in the normalized space.
  • The NOLF map under the shooting angle of view of an image, as used in the embodiments of this application, is the normalized 3D point cloud of the visible part of the object, based on the relative pose between the camera and the object, under the angle at which the camera shot the object; reading correspondences out of such a map might look like the sketch below.
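  • Assuming the NOLF map is stored as an 8-bit RGB image with one color channel per normalized axis (as described later in this document), extracting the pixel-to-3D correspondences might look like:

```python
import numpy as np

def nolf_to_correspondences(nolf_rgb, mask):
    """Extract 2D-3D correspondences from a NOLF image.

    nolf_rgb: HxWx3 uint8 image whose channels store the normalized XYZ
    coordinate of the object point visible at each pixel.
    Returns (pixels (N, 2), points (N, 3)) for all pixels inside the mask.
    """
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)
    points = nolf_rgb[ys, xs].astype(np.float64) / 255.0  # back to [0, 1]^3
    return pixels, points
```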
  • Perspective-n-point (PnP) estimation: also known as projective n-point estimation, it refers to calculating the projection relationship between N object points in the world and the corresponding N image points in the image to obtain the pose of the camera or of the object.
  • Bounding box: the smallest cuboid that just completely encloses the object is the three-dimensional bounding box of the object.
  • Specifically, the bounding box can refer to the smallest hexahedron that contains the object and has sides parallel to the coordinate axes. The corner points of the bounding box are the 8 vertices of this smallest hexahedron.
  • Key points refer to the corner points of the three-dimensional bounding box, that is, the vertices of the bounding-box cuboid; they can be computed as in the sketch below.
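  • A short sketch computing the 8 corner points of an axis-aligned bounding box from a point set, as a direct reading of the definition above:

```python
import numpy as np
from itertools import product

def bounding_box_corners(points):
    """8 corners of the axis-aligned bounding box of a (N, 3) point set."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Every (min|max) choice per axis yields one of the 8 vertices.
    return np.array([list(c) for c in product(*zip(lo, hi))])
```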
  • Camera pose: the position of the camera in space and the attitude of the camera, which can be regarded respectively as the translation transformation and the rotation transformation of the camera from an original reference position to the current position.
  • The pose of an object in this application likewise refers to the position of the object in space and the attitude of the object.
  • Camera extrinsic parameters: the external parameters of the camera, i.e., the conversion relationship between the world coordinate system and the camera coordinate system, including displacement parameters and rotation parameters. The camera pose can be determined according to the camera extrinsic parameters.
  • the object category can be, for example, a table, a chair, a cat, a dog, a car, and so on.
  • A model database is constructed in advance according to the needs of the application scenario and stores the three-dimensional information of the objects to be modeled.
  • Categories can be set for the 3D models. For example, for 3D reconstruction in an indoor home scene, the database needs to store in advance all possible 3D models of furniture, with categories set according to the type of furniture, such as "chair", "table", "coffee table", and "bed". Secondary categories can further be set for multiple types of chairs, such as "stool", "armchair", "sofa chair", and so on.
  • Image mask: a selected image, graphic, or object used to completely or partially occlude the image to be processed, so as to control the area or process of image processing.
  • In the embodiments of this application, the image mask is used to extract the region of interest, for example the partial image of the first target object. The region-of-interest image is obtained by multiplying the image to be processed by the region-of-interest mask: values inside the region remain unchanged, while values outside the region all become 0, as in the sketch below.
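  • A one-function sketch of this masking operation with NumPy, assuming an HxW boolean mask and an HxWx3 image:

```python
import numpy as np

def apply_mask(image, mask):
    """Keep pixel values inside the region of interest; set the rest to 0."""
    return image * mask[..., None].astype(image.dtype)
```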
  • the term "and/or" appearing in the embodiments of this application can be an association relationship describing associated objects, indicating that there can be three types of relationships.
  • a and/or B can indicate that A exists alone and A exists at the same time.
  • B the situation where B exists alone, where A and B can be singular or plural.
  • the character "/" in this application generally indicates that the associated objects before and after are in an "or” relationship.
  • "at least one” refers to one or more
  • “multiple” refers to two or more.
  • "The following at least one item (a)” or similar expressions refers to any combination of these items, including any combination of a single item (a) or a plurality of items (a).
  • at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple .
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • Intelligent Information Chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform. It communicates with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represent the data sources in the field of artificial intelligence. The data involve graphics, images, voice, and text, as well as Internet-of-Things data from traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • Based on the results of the data processing described above, some general capabilities can be formed, such as an algorithm or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, and so on.
  • The method described in this application of using a single two-dimensional image to achieve three-dimensional reconstruction of an object has broad application space.
  • For example, a digital site can be constructed through three-dimensional reconstruction, which can realize automatic site design, online guidance of equipment installation, wireless signal simulation, and so on. Augmented reality (AR) and virtual reality (VR) are further application directions.
  • The difficulty of creating 3D content lies in the difficulty of 3D modeling.
  • an embodiment of the present application provides a system architecture 200.
  • the data collection device 260 is used to collect images and store them in the database 230, and the training device 220 generates a target model/rule 201 based on the image data maintained in the database 230.
  • the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the image data.
  • the target model/rule 201 can be used in application scenarios such as image recognition, three-dimensional reconstruction, and virtual reality.
  • the target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.
  • The work of each layer in a deep neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. From the physical level, the work of each layer can be understood as a transformation from input space to output space (that is, from the row space to the column space of the matrix) through five operations: 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by $W \cdot \vec{x}$, operation 4 is completed by $+b$, and operation 5 is realized by $a(\cdot)$. The word "space" is used here because the object to be classified is not a single thing but a class of things, and space refers to the collection of all individuals of this class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • Because it is desired that the output of the deep neural network be as close as possible to the truly wanted predicted value, the predicted value of the current network can be compared with the truly wanted target value, and the weight vector of each layer of the network updated according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and adjustment continues until the neural network can predict the truly wanted target value.
  • It is therefore necessary to predefine "how to compare the difference between the predicted value and the target value": this is the loss function (or objective function), an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so training the deep neural network becomes the process of reducing this loss as much as possible; the sketch below illustrates one such update step.
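  • A minimal PyTorch sketch of this loss-driven weight update; the tiny network shape and the mean-squared-error loss are illustrative placeholders, since the text does not name a concrete loss function.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.MSELoss()                       # measures predicted-vs-target gap
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x, target):
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)         # higher loss = larger difference
    loss.backward()                          # gradients of loss w.r.t. the weights W
    optimizer.step()                         # adjust W to reduce the loss
    return loss.item()
```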
  • the target model/rule obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
  • the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the target model/rule 201 to process the input data. Taking three-dimensional modeling as an example, the calculation module 211 can analyze the input image or image sequence to restore the depth information of the target.
  • the correlation function module 213 can preprocess the image data in the calculation module 211.
  • the correlation function module 214 can preprocess the image data in the calculation module 211.
  • the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • The user can manually specify the input data in the execution device 210, for example by operating in the interface provided by the I/O interface 212.
  • In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain the result; if automatic data input by the client device 240 requires the user's authorization, the user can set the corresponding permission in the client device 240.
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 240 may also serve as a data collection terminal to store the collected training data in the database 230.
  • Fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
  • the training device 220, the execution device 210, and the client device 240 are separate devices.
  • In another case, the training device 220 and the execution device 210 may be the same physical device, which implements all functions of both the training device 220 and the execution device 210. Optionally, the execution device 210 and the client device 240 may also be the same physical device, which implements all functions of both the execution device 210 and the client device 240. Optionally, the training device 220, the execution device 210, and the client device 240 are all the same physical device, which implements all functions of the training device 220, the execution device 210, and the client device 240. The specific scenario architecture of the embodiment of the present application is not limited here.
  • The deep neural network used in the embodiments of this application to predict the NOLF map of the target from a two-dimensional image may be a convolutional neural network (CNN), which is a deep neural network with a convolutional structure and a deep learning architecture.
  • A deep learning architecture refers to multiple levels of learning at different levels of abstraction using machine learning algorithms.
  • A CNN is a feed-forward artificial neural network; taking image processing as an example, each neuron in the feed-forward network responds to overlapping areas in the input image.
  • The network can also be of other types, and this application does not limit the type of the deep neural network.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
  • The convolutional layer/pooling layer 120 may include layers 121-126. In one example, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a pooling layer; in another example, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a convolutional layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • The convolutional layer 121 can include many convolution operators. A convolution operator is also called a kernel; its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator can be a weight matrix, which is usually predefined. During convolution on the image, the weight matrix is usually moved over the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) in the horizontal direction to complete the extraction of specific features from the image.
  • The initial convolutional layers (such as 121) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by subsequent convolutional layers (for example, 126) become more complex, for example features with high-level semantics. Features with higher semantics are more suitable for the problem to be solved.
  • multiple convolutional layers can be referred to as a block.
  • A pooling layer often follows a convolutional layer: among the layers 121-126 illustrated by 120 in Figure 3, one convolutional layer can be followed by one pooling layer, or multiple convolutional layers can be followed by one or more pooling layers. The sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140.
  • The parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data of specific task types; for example, the task types can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130, the final layer of the entire convolutional neural network 100 is the output layer 140.
  • The convolutional neural network 100 shown in FIG. 3 is only an example of a convolutional neural network; a convolutional neural network may also take the form of other network models, such as the model shown in FIG. 4, in which multiple convolutional layers/pooling layers are parallel and the separately extracted features are all input to the full neural network layer 130 for processing. A toy sketch of the FIG. 3 structure follows.
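  • A toy PyTorch sketch of the serial structure of FIG. 3 (input layer, alternating convolution/pooling, hidden layers, and an output layer); all layer sizes and the 32x32 input are illustrative assumptions.

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_outputs=10):
        super().__init__()
        self.features = nn.Sequential(          # analogue of layers 121-126
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                    # pooling reduces spatial size
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(              # hidden layers 131..13n + output 140
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, num_outputs),
        )

    def forward(self, x):                       # x: (B, 3, 32, 32) assumed
        return self.head(self.features(x))
```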
  • FIG. 5 is a schematic diagram of an application scenario of the three-dimensional reconstruction method in an embodiment of this application.
  • A is a single two-dimensional image obtained by shooting the scene to be modeled
  • B is a pre-built three-dimensional model database
  • C is the recognition of multiple target objects in the scene from A
  • D is the model retrieval from B
  • E is to implement three-dimensional reconstruction of the target objects according to the matched three-dimensional models, and obtain three-dimensional data of the scene.
  • In the prior art, a preset 3D model is projected from preset positions and angles, and a set of multi-angle projection images is used to represent the 3D model in the model database.
  • However, the actual viewing angle of the scene image is arbitrary and random, and usually does not completely match the preset positions and angles at which the projection images were acquired; therefore, the accuracy of model retrieval is low.
  • Moreover, the three-dimensional model projection images in virtual space and the scene images in real space differ greatly in key elements such as lighting conditions, background, and texture color; because the prior art uses real images to retrieve virtual images, it is constrained by these factors, leaving considerable room for improvement in the robustness of retrieval results.
  • the embodiments of the present application provide a three-dimensional reconstruction method, which can perform three-dimensional reconstruction based on two-dimensional images and a pre-built model database, with high modeling accuracy.
  • FIG. 6 is a schematic diagram of an embodiment of the three-dimensional reconstruction method in the embodiment of the application.
  • A model database is built in advance; each object corresponds to one three-dimensional model, and the model database includes multiple three-dimensional models.
  • Each three-dimensional model contains the three-dimensional geometric information of the object, specifically its geometric structure and size information.
  • Optionally, the three-dimensional model also contains texture features of the object.
  • Optionally, each three-dimensional model in the model database carries an object-category label; for example, chairs of various styles and shapes are all labeled "chair".
  • Step 601: the three-dimensional reconstruction device acquires the image and the camera pose.
  • The three-dimensional reconstruction device obtains the image of the first target object and the camera pose corresponding to the image. The image refers to a two-dimensional image obtained by shooting a target scene including the first target object; in the embodiments of the present application, the first target object is also referred to as the modeling target.
  • the image is taken by an image acquisition device, which can be a normal camera or a panoramic camera.
  • the image acquisition device may be a component set in the three-dimensional reconstruction device, or an independent device that has a wired or wireless communication connection with the three-dimensional reconstruction device.
  • the specific type and device form of the image acquisition device are not limited.
  • the image may be a 360-degree panoramic image with a horizontal angle of view collected by a panoramic camera, or an ultra-wide-angle image, or a central projection image shot by a normal camera, which is not specifically limited here.
  • the three-dimensional reconstruction method provided by the embodiments of the present application can realize three-dimensional reconstruction based on a single image, but when a single central projection image cannot fully collect the target scene information, multiple images can be collected, which is not specifically limited here.
  • the camera pose corresponding to an image refers to the position and pose of the camera that took the image at the time of shooting.
  • the camera pose can be determined by translation and rotation relative to the reference coordinate system.
  • the acquisition method of the camera pose corresponding to the image is not limited here.
  • Step 602: the three-dimensional reconstruction device inputs the image into the deep learning network, and obtains the NOLF map of the first target object and the pixel coordinates of the feature points.
  • The three-dimensional reconstruction device inputs the image obtained in step 601 into the deep learning network, and obtains the NOLF map of the first target object under the shooting angle of view of the image and the pixel coordinates of the feature points of the first target object in the image.
  • The feature points of the first target object in the image are used to determine the pose of the first target object after the target 3D model has been determined in the 3D model database. To calculate the pose of the first target object, at least 4 feature points need to be determined, and the object points corresponding to 4 of the feature points (image points) must not be coplanar in three-dimensional space; the specific number of feature points is not limited here.
  • A point in the image captured by the camera corresponds to a point in three-dimensional space, which is called an object point.
  • The feature points of the first target object in the image can be located inside the image area of the first target object, with the corresponding object points located on the first target object; the feature points can also be located outside the image area of the first target object. For example, a feature point can be the projection in the image of a corner point of the target object's bounding box; such a corner point has no physical counterpart in the shooting space and only indicates a location in three-dimensional space.
  • For the NOLF map of the first target object output by the deep learning network, please refer to FIG. 7, which is a schematic diagram of an embodiment of the NOLF map of the target object in the embodiment of this application.
  • The output is the NOLF map of the first target object in the image under the shooting angle of view of the image, which may indicate the normalized three-dimensional information of the first target object under that shooting angle.
  • The NOLF map is a two-dimensional image; each pixel in it stores the XYZ three-dimensional information of the object point corresponding to that pixel. The NOLF map can be expressed as a two-dimensional RGB image, with each color channel corresponding to one dimension of the three-dimensional coordinate system.
  • The three-dimensional reconstruction device may output the NOLF map of the first target object and the pixel coordinates of its key points through one deep learning network, or obtain the NOLF map and the key-point pixel coordinates through different deep learning networks; this is not limited here.
  • Optionally, the three-dimensional reconstruction device may also obtain the category of the first target object through a deep learning network. The category is used to screen the model database during model retrieval so as to determine the models corresponding to that category. For example, if it is recognized that the first target object is a chair, and the model database contains multiple categories of models such as "table", "chair", "coffee table", and "bed", then the retrieval of the object's model is limited to the three-dimensional models of the "chair" category, and there is no need to search among the other categories of three-dimensional models.
  • Optionally, the three-dimensional reconstruction device may also obtain a mask of the first target object through a deep learning network; the mask can be used to optimize the NOLF map and improve its accuracy.
  • FIG. 8 is a schematic diagram of the deep learning network architecture of the three-dimensional modeling method in an embodiment of the application.
  • Figure 8 depicts a convolutional neural network that predicts NOLF maps, key points, and instance segmentation masks based on modeling targets in two-dimensional images.
  • the input is a two-dimensional RGB image containing the modeling target.
  • The convolutional neural network (CNN) predicts the mask, the NOLF map, and the pixel coordinate values of eight key points of the target modeling object.
  • The NOLF map contains the three-dimensional information of the modeling target under the shooting angle of view of the image.
  • The eight key points are the corner points of the bounding box of the modeling target; their pixel coordinates are the projections onto the input image, based on the accurate pose, of the eight corner points of the bounding box of the corresponding three-dimensional model in space.
  • The deep learning network predicts the target category and mask based on a general instance segmentation architecture (the Mask-RCNN framework).
  • The convolutional neural network can also predict the category of the target modeling object through a detection branch, and narrow the scope of model retrieval according to the preset category labels of the three-dimensional models in the three-dimensional model database; a hedged sketch of the multi-branch structure follows.
  • The three-dimensional reconstruction device inputs the image acquired in step 601 into the deep learning network, which can identify multiple target objects in the image and respectively generate the NOLF map of each target object.
  • the image area of the first target object in the image is usually a partial area of the image, and the three-dimensional reconstruction device recognizes all three-dimensional target objects to be reconstructed in the image by analyzing the image information, and obtains a partial image of the target object.
  • Each partial image corresponds to a target object, and the specific number of partial images obtained from the image is not limited here.
  • the three-dimensional reconstruction device may first determine the partial image of the target object from the image, and then input the partial image into the deep learning network, which is not specifically limited here.
  • the deep learning network is a pre-trained model.
  • For the training method please refer to the subsequent embodiments.
  • the three-dimensional reconstruction device determines the initial pose of the first target object according to the NOLF map of the first target object.
  • Based on the NOLF map acquired in step 602, a correspondence is established between the image pixel coordinates of the NOLF map and the three-dimensional coordinates indicated by the NOLF map.
  • Exemplarily, by randomly sampling on the NOLF map four points that are not coplanar in three-dimensional space, four pairs of two-dimensional pixel coordinate to three-dimensional coordinate correspondences are established; solving PnP then yields the first relative pose of the modeling target in the normalized object space, and this first relative pose is the initial pose.
  • Optionally, to improve the accuracy of the initial pose, multiple initial poses can be obtained through multiple rounds of sampling; clustering is performed on these initial poses, and the initial pose corresponding to the cluster center is taken as the initial pose of the target object's directional normalized 3D model under the shooting angle of view of the image.
  • Given the NOLF map of the region where the modeling target lies on the two-dimensional image, any pixel coordinate $p_i$ in the NOLF map corresponds to a 3D point $P_i$ in the normalized space, so a series of correspondences between 2D pixel positions and 3D spatial positions is established.
  • Relying on these correspondences, a series of relative poses of the corresponding three-dimensional model in the normalized space can be obtained by solving PnP. Specifically, let the NOLF pixel region be $R$; randomly sample four pixels from $R$ together with their corresponding three-dimensional coordinates, and solve PnP over these four 2D-3D (two-dimensional to three-dimensional) correspondences to obtain a pose $H_j$ in the normalized space. Each pose $H_j$ is then scored with the following evaluation, reconstructed here in consistent notation from the surrounding definitions:

    $$\mathrm{score}(H_j) = \sum_{p_j \in R} \mathbb{1}\big(\lVert p_j - A \cdot H_j \cdot P_j \rVert < \mathit{threshold}\big)$$

  • where $P_j$ is the normalized three-dimensional coordinate corresponding to the pixel point $p_j$, $A$ is the intrinsic parameter matrix of the camera, and $\mathit{threshold}$ is a defined threshold.
  • The score evaluation measures how many pixels' corresponding three-dimensional coordinate values, after reprojection under pose $H_j$, have a pixel error smaller than the threshold, thereby measuring the confidence of each pose.
  • Using the pre-emptive random sample consensus idea (pre-emptive RANSAC), this score is maximized: the sampling process is repeated, the hypothesis poses whose scores rank in the top ten percent are taken as the initial pose set and clustered, and the hypothesis pose corresponding to the cluster center is taken as the initial pose of the modeling target in the normalized space.
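For illustration only (not part of the original disclosure), the hypothesis sampling and scoring described above might look as follows in Python with OpenCV; the iteration count, the 5-pixel threshold, and the top-ten-percent cut are assumed parameters, and inputs are expected as float64 arrays:

```python
import numpy as np
import cv2

def score_pose(rvec, tvec, pixels, points3d, K, threshold=5.0):
    """Count correspondences whose reprojection error is below `threshold` px."""
    proj, _ = cv2.projectPoints(points3d, rvec, tvec, K, None)
    errors = np.linalg.norm(proj.reshape(-1, 2) - pixels, axis=1)
    return int((errors < threshold).sum())

def sample_pose_hypotheses(pixels, points3d, K, n_iters=200, seed=0):
    """Repeatedly solve PnP from 4 random 2D-3D pairs sampled from the NOLF
    region and keep the top ten percent of hypotheses by inlier score."""
    rng = np.random.default_rng(seed)
    hypotheses = []
    for _ in range(n_iters):
        idx = rng.choice(len(pixels), size=4, replace=False)
        ok, rvec, tvec = cv2.solvePnP(points3d[idx], pixels[idx], K, None,
                                      flags=cv2.SOLVEPNP_EPNP)
        if ok:
            score = score_pose(rvec, tvec, pixels, points3d, K)
            hypotheses.append((score, rvec, tvec))
    hypotheses.sort(key=lambda h: h[0], reverse=True)
    # Clustering these survivors and taking the cluster center would yield
    # the initial pose, as described in the text above.
    return hypotheses[: max(1, len(hypotheses) // 10)]
```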
  • The three-dimensional reconstruction device determines the NOLF maps of the three-dimensional models in the model database under the initial pose.
  • Using the initial pose obtained in step 603, NOLF-map projection can be performed for all 3D models in the prefabricated model database, rendering the NOLF maps of all candidate 3D models under the initial-pose viewing angle; similarity detection is then performed based on the NOLF maps under this viewing angle. These NOLF maps under the initial-pose viewing angle serve as the information representation of the 3D model database, thereby unifying the 2D image information of the first target object and the 3D model information in the model database under the same form of data expression.
  • Optionally, if the category of the first target object was obtained in step 602, the set of three-dimensional models belonging to that category can be filtered from the three-dimensional model database according to the category, and the initial pose determined in step 603 can be used to render the normalized position field maps, under the initial-pose viewing angle, of the three-dimensional models of the same type as the modeling object.
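As an illustrative sketch under our own assumptions (point-cloud models rather than meshes, a simple z-buffer instead of a full renderer, and an assumed image size), rendering a candidate model's NOLF map under the initial pose could be done as follows:

```python
import numpy as np

def render_nolf(points_norm, R, t, K, hw=(128, 128)):
    """Project a normalized model point cloud (N x 3) into an H x W x 3 NOLF
    map under pose (R, t): each visible pixel stores the normalized XYZ of the
    nearest point along its ray (simple z-buffer)."""
    H, W = hw
    cam = points_norm @ R.T + t                  # model -> camera coordinates
    in_front = cam[:, 2] > 1e-6
    cam, pts = cam[in_front], points_norm[in_front]
    uvw = cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]                # perspective division
    nolf = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf, dtype=np.float32)
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, z, p in zip(u[valid], v[valid], cam[valid, 2], pts[valid]):
        if z < zbuf[vi, ui]:                     # keep the closest point only
            zbuf[vi, ui], nolf[vi, ui] = z, p
    return nolf
```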
  • the three-dimensional reconstruction device performs model retrieval according to the NOLF map of the first target object and the NOLF map of the three-dimensional model in the initial pose, and determines the first target model;
  • Similarity detection is performed between the NOLF map of the first target object acquired in step 602 and the NOLF maps of the 3D model library rendered under the initial viewing angle in step 604; the 3D model corresponding to the NOLF map with the highest similarity is determined as the first target model.
  • FIG. 9 is a schematic diagram of an embodiment of similarity detection in an embodiment of this application.
  • A triple relationship is established based on the NOLF maps: the NOLF map predicted for the modeling target on the 2D image and the NOLF map of the 3D model corresponding to the modeling target form a positive correlation, while the NOLF map of a 3D model that does not correspond to the modeling target forms a negative correlation. That is, the closer the Euclidean distance in the feature space between the modeling target and its corresponding 3D model, the better, and the farther the Euclidean distance between the modeling target and a non-corresponding 3D model, the better.
  • The NOLF-map feature-descriptor similarity loss function (similarity triple loss) can therefore be expressed in the following standard triplet form, reconstructed here from the surrounding definitions:

    $$L_{\mathrm{triple}}(u, v^{+}, v^{-}) = \max\big(\lVert f(u) - f(v^{+}) \rVert - \lVert f(u) - f(v^{-}) \rVert + m,\ 0\big)$$

  • where $u$ is the modeling target, $v^{+}$ and $v^{-}$ are respectively the positive and negative three-dimensional model samples corresponding to $u$, $f$ is the CNN-based NOLF feature descriptor, $(u, v^{+}, v^{-})$ represents the triple relationship, $m$ is the minimum Euclidean distance margin (with $m > 0$, its specific value not being limited here), and $\lVert \cdot \rVert$ represents the Euclidean distance between two sample points in the feature space.
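For illustration only (not part of the original disclosure), the triplet loss reconstructed above can be sketched in PyTorch; the margin value, batch size, and descriptor dimension are assumptions, and `f` may be any CNN that maps a NOLF map to an embedding:

```python
import torch
import torch.nn.functional as F

def nolf_triplet_loss(f_u, f_pos, f_neg, margin=0.2):
    """Triplet loss over NOLF feature descriptors: pull the modeling target's
    descriptor f_u toward its matching model f_pos and push it away from a
    non-matching model f_neg by at least `margin` (the m > 0 in the text)."""
    d_pos = torch.norm(f_u - f_pos, dim=1)   # Euclidean distance to positive
    d_neg = torch.norm(f_u - f_neg, dim=1)   # Euclidean distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

# Example with random 128-D descriptors for a batch of 8 triples.
u, vp, vn = (torch.randn(8, 128) for _ in range(3))
print(nolf_triplet_loss(u, vp, vn).item())
```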
  • This embodiment of the application performs retrieval based on the NOLF map, with the following beneficial effects:
  • 1) There is no need to project the three-dimensional models in advance from different positions and viewing angles, which reduces the data scale in three-dimensional model retrieval.
  • 2) Based on the computed initial pose, the NOLF maps of the models in the database are rendered under that pose and compared with the NOLF map of the modeling object; performing similarity detection on NOLF maps determined from the initial pose reduces the difficulty of retrieval.
  • 3) The two-dimensional modeling object and the objects of the three-dimensional prefabricated model library are unified into the same data expression through the NOLF map, and this data expression is independent of the lighting conditions of the modeling target on the real image and of the texture details of the three-dimensional models; moreover, the NOLF map implies the 3D shape and geometric information of the object, which is conducive to similarity comparison between objects in the feature space.
  • By contrast, traditional methods usually measure similarity between the real image of the modeling target and virtual renderings of the 3D models; the real image and the virtual renderings differ considerably in illumination, texture and background, which creates cross-domain retrieval difficulty.
  • The three-dimensional reconstruction device determines the pose of the first target object according to the pixel coordinates of the feature points and the first target model.
  • The pose of the first target object is determined according to the feature-point pixel coordinates determined in step 602 and the first target model determined in step 605.
  • From the feature-point pixel coordinates and the three-dimensional coordinates of the corresponding feature points in the first target model, the pose of the first target object relative to the camera at shooting time can be determined through the PnP solution.
  • Optionally, if the feature points are the eight key points, then since the order of the eight predicted key points corresponds one-to-one with the corner points of the retrieved three-dimensional model's bounding box, the PnP solution can be used to determine the relative position relationship between the camera and the 3D model at the time the two-dimensional image was shot.
  • Since the extrinsic parameters of the camera that took the two-dimensional image are known, i.e., the pose of the camera in real three-dimensional space is known, combining them with the camera-to-model relative position relationship solved by PnP restores the absolute pose, in real three-dimensional space, of the three-dimensional model corresponding to the modeling target. With the first target model determined by model retrieval and the absolute pose of the modeling target, three-dimensional reconstruction of the target to be modeled is realized; repeating the above operations for all targets to be modeled in the 3D scene realizes the modeling of the key equipment in the entire site, which is not repeated here.
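As a minimal sketch (ours, not the patent's) of the pose recovery just described, the known camera extrinsics can be composed with the camera-to-object pose from PnP using 4 x 4 homogeneous transforms; the matrix convention and function names are assumptions:

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3 x 3 rotation and a translation vector into a 4 x 4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.ravel(t)
    return T

def absolute_object_pose(T_world_cam, R_cam_obj, t_cam_obj):
    """Object pose in world space: compose the known camera extrinsics with the
    camera-to-object pose from PnP (R_cam_obj can be obtained from the PnP
    rotation vector via cv2.Rodrigues)."""
    return T_world_cam @ to_homogeneous(R_cam_obj, t_cam_obj)
```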
  • FIG. 10 is a schematic diagram of another embodiment of the three-dimensional reconstruction method in the embodiment of the present application.
  • Two-dimensional image data is collected to capture the information of the three-dimensional scene, and through high-precision image pose computation the extrinsic parameter information of the image's camera is obtained, from which the camera pose of the image can be computed; the three-dimensional models of the key objects that may appear in the scene are assembled into a prefabricated three-dimensional model database, which serves as another input to the scene modeling. Step 1): Using the deep learning network, identify the 3D model targets to be built in the 3D scene captured by the image, i.e., the target objects in the image, hereinafter referred to as targets, modeling targets or modeling objects; optionally, obtain the mask of the local image region where a target is located (i.e., the aforementioned image mask); predict the key-point pixel coordinates of the target and the normalized position field information (i.e., the NOLF map); and optionally predict the category to which the target belongs.
  • Using the normalized position field map predicted for the modeling target, the correspondence between the modeling target's two-dimensional pixels on the image and the three-dimensional points in the normalized space is established. Randomly sampling four points on the NOLF map establishes a correspondence between four pixel coordinates and three-dimensional coordinates, and the relative pose of the modeling target in the normalized object space can then be obtained by solving with the PnP algorithm.
  • Step 2): Optionally, a series of relative poses can be obtained by repeated sampling on the NOLF map; these relative poses are regarded as a series of hypotheses, and N relative poses satisfying the conditions are extracted through hypothesis testing. Clustering is performed on these N relative poses, and the relative pose corresponding to the cluster center is used as the initial pose of the modeling target in the normalized object space.
  • Step 3): Using the initial pose, render the normalized position field maps of all three-dimensional models in the database under that pose's viewing angle.
  • Optionally, according to the target category obtained in step 1), the three-dimensional models in the three-dimensional model database belonging to that category are determined as the candidate three-dimensional models.
  • These NOLF maps are used as the information representation of the three-dimensional model database, and combined with the NOLF map corresponding to the modeling target from step 1), the two-dimensional image information of the modeling target and the information of the three-dimensional model database are unified under the same data expression. Under the initial relative pose predicted in step 2), the NOLF map corresponding to the matching 3D model will also be closer to the NOLF map corresponding to the modeling target, thereby reducing the difficulty and complexity of retrieval.
  • Step 4): Based on another deep learning network, an image feature descriptor is used to extract the high-dimensional features of the modeling target's NOLF map and the high-dimensional features of the NOLF maps corresponding to the three-dimensional models in the prefabricated model database, thereby mapping the modeling target and the three-dimensional models in the prefabricated model library into a unified feature space.
  • Model retrieval is carried out based on feature similarity. Optionally, the relative positions in the feature space between the NOLF map of the modeling target and the NOLF maps of the 3D models in the prefabricated model library are used to retrieve from the database the 3D model corresponding to the modeling target, where a closer Euclidean distance in the feature space indicates a higher similarity between the modeling target and a given three-dimensional model.
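For illustration only (not part of the original disclosure), this nearest-neighbor retrieval in the shared feature space reduces to an argmin over Euclidean distances; the descriptor dimension and candidate count below are assumptions:

```python
import numpy as np

def retrieve_best_model(target_desc: np.ndarray, model_descs: np.ndarray) -> int:
    """Return the index of the candidate 3D model whose NOLF-map descriptor is
    closest (smallest Euclidean distance) to the modeling target's descriptor."""
    dists = np.linalg.norm(model_descs - target_desc[None, :], axis=1)
    return int(np.argmin(dists))

target = np.random.rand(128)          # descriptor of the target's NOLF map
candidates = np.random.rand(50, 128)  # descriptors of 50 candidate models
print(retrieve_best_model(target, candidates))
```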
  • Step 5): Using the correspondence between the bounding-box vertices of the retrieved three-dimensional model and the key points predicted for the modeling target on the image, the relative position relationship between the camera of the input image and the initial position of the 3D model is established by solving the PnP problem; combined with the camera pose information of the input image, the position and pose in real three-dimensional space of the 3D model corresponding to the modeling target are finally recovered.
  • Step 6): Traverse all modeling targets and repeat steps 2) through 5) to recover the three-dimensional model poses of all targets in the three-dimensional environment, thereby completing the three-dimensional modeling of the entire scene.
  • Specifically, the pixel coordinates of the key points of the target object and the NOLF map of the target object under the shooting angle of view of the image are obtained through a convolutional neural network (CNN), and initial pose estimation is performed according to the NOLF map.
  • For the multiple three-dimensional models in the preset model database, a series of NOLF maps is generated based on the initial pose; the NOLF map of the target object is compared with this series of NOLF maps, and similarity detection is performed according to the NOLF-map descriptors to obtain the matched three-dimensional model. According to the PnP algorithm and the camera pose of the original two-dimensional image, the pose of the target object can then be obtained, thereby realizing the three-dimensional reconstruction of the target object.
  • the input of the convolutional neural network for predicting categories, NOLF maps, key points, and instance segmentation masks based on modeling targets on a two-dimensional image is a two-dimensional RGB image containing multiple modeling targets.
  • the category, mask, NOLF map and eight key point pixel coordinate values of each modeled object can be predicted.
  • The network structure is based on the Mask-RCNN framework; in addition to the target category and mask that this framework can predict, the NOLF map and the key points are additionally output.
  • The key-point pixel coordinates corresponding to a target are the projections onto the input image, under the accurate pose, of the eight corner points of the spatial bounding box of the corresponding 3D model.
  • The following example illustrates one implementation of obtaining the NOLF map of a three-dimensional model.
  • According to the three-dimensional models in the model database, directional normalized three-dimensional models can be obtained. Optionally, the shape and size information of the modeling target is encoded into a pre-defined normalized object space, establishing a normalized coordinate space for the three-dimensional models belonging to the same category.
  • Based on the normalized three-dimensional model, the NOLF map under a certain viewing angle can be further obtained.
  • NOLF is an image-like data expression form that encodes the XYZ coordinates of a three-dimensional model into a normalized three-dimensional space: the RGB channels corresponding to each pixel coordinate of the image store the XYZ coordinate values of the three-dimensional model in the normalized three-dimensional space rather than color information, i.e., each pixel coordinate value corresponds to one three-dimensional coordinate value. This establishes the correspondence between the model's pixel coordinates on the image and its three-dimensional coordinates in the normalized space, so one NOLF map corresponds to the three-dimensional point cloud under the visible viewing angle.
  • It should be noted that, considering the similarity of shape and geometric features among three-dimensional models of the same type, the NOLF can be defined independently per model category.
  • Optionally, for a set of three-dimensional models of the same type, normalized size scaling is performed, i.e., the diagonal length of the bounding box of every three-dimensional model is 1 and their geometric center points are located at the center of the NOLF space.
  • To guarantee that each three-dimensional model's bounding box is the tightest bounding box, all 3D models are aligned, i.e., all models have the same orientation and the XYZ axes of the model coordinate system are parallel to the XYZ axes of the space coordinate system.
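As an illustrative sketch (ours, not the patent's) of the normalization just described, scaling an already-aligned model so that its bounding-box diagonal is 1 and its geometric center sits at the origin can be written as:

```python
import numpy as np

def normalize_model(points: np.ndarray) -> np.ndarray:
    """Scale an aligned N x 3 model so its bounding-box diagonal is 1 and its
    geometric center sits at the NOLF-space center (the origin here)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    diagonal = np.linalg.norm(hi - lo)
    return (points - center) / diagonal

cloud = np.random.rand(500, 3) * 10
norm = normalize_model(cloud)
lo, hi = norm.min(axis=0), norm.max(axis=0)
print(np.linalg.norm(hi - lo))  # ~1.0
```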
  • Optionally, the target category prediction branch and the instance segmentation mask prediction branch of this application are consistent with the instance segmentation algorithm (Mask RCNN), and the key point detection branch is similar to the two-dimensional bounding-box output branch of Mask RCNN; the difference is that the four vertices of the two-dimensional bounding box regressed by the Mask RCNN bounding-box output branch are replaced by the pixel coordinates of the eight three-dimensional bounding-box vertices regressed by the key point detection branch.
  • There are multiple ways to predict the NOLF map; optionally, the NOLF map is obtained through a deep learning network and an image mask.
  • Compared with the general Mask RCNN network, this deep learning network adds three new branch structures, which correspond to predicting the x, y and z coordinates of the modeling object in the normalized coordinate space.
  • For each modeling object in the input image, the corresponding region of interest (ROI) feature vector is converted into a fixed size, and this fixed-size feature vector serves as feature data shared by the instance segmentation mask branch and the NOLF-map prediction branches. Through fully convolutional networks of the same structure, the instance segmentation mask branch outputs a mask picture and each of the three NOLF-map prediction branches outputs a picture, where N is the number of categories to which the modeling object may belong and 32 is the depth division of NOLF pixels along each of the three XYZ directions in the normalized space; that is, the normalized space of the NOLF map is divided into 32 equal parts along each coordinate, thereby converting the NOLF-map prediction problem into a depth classification problem and improving training robustness.
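For illustration only (not part of the original disclosure), the 32-bin depth discretization can be sketched as follows; the [0, 1) coordinate range is an assumption:

```python
import numpy as np

NUM_BINS = 32  # depth divisions per axis, as described above

def coords_to_bins(nolf: np.ndarray) -> np.ndarray:
    """Quantize an H x W x 3 NOLF map with values in [0, 1) into integer class
    labels in [0, 31] per axis, turning regression into classification."""
    return np.clip((nolf * NUM_BINS).astype(np.int64), 0, NUM_BINS - 1)

def bins_to_coords(bins: np.ndarray) -> np.ndarray:
    """Map class labels back to bin-center coordinates for evaluation."""
    return (bins.astype(np.float32) + 0.5) / NUM_BINS
```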
  • The training method of the convolutional neural network in this embodiment is as follows: the input sample is a single two-dimensional image, which carries the category label, the mask, the corner coordinates of the bounding box, and the known NOLF map of the target to be modeled.
  • The convolutional neural network then outputs the category, the image mask, the pixel coordinates of the bounding-box corner points, and the NOLF map of the target to be modeled.
  • The loss function of the convolutional neural network is defined as:

    $$L = \alpha_{1} L_{cls} + \alpha_{2} L_{mask} + \alpha_{3} L_{bb8} + \alpha_{4} L_{nolf}$$

  • where $L_{cls}$ is the cross-entropy classification loss function and $L_{mask}$ represents the segmentation-mask loss function of the target.
  • $L_{bb8}$ is similar to the bounding-box regression loss function of Mask-RCNN, except that the regressed output is changed from the four key-point coordinates of the two-dimensional bounding box to the eight key-point coordinates of the three-dimensional bounding box.
  • $L_{nolf}$ is the NOLF-map prediction loss function; consistent with the definitions here, it can be reconstructed as the mean three-dimensional error over NOLF pixels:

    $$L_{nolf} = \frac{1}{N} \sum_{i=1}^{N} \lVert p_{i} - \hat{p}_{i} \rVert$$

  • where $p_{i}$ and $\hat{p}_{i}$ are respectively the ground-truth and predicted three-dimensional coordinate values of the NOLF map at the same pixel coordinates, and $N$ represents the number of pixels in the NOLF map.
  • $\alpha_{1}$, $\alpha_{2}$, $\alpha_{3}$ and $\alpha_{4}$ are the weights of the respective parts of the loss function.
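As a minimal sketch (ours, not the patent's) of this combined objective, with simple stand-ins for each branch loss (the exact per-branch loss definitions beyond what the text states are assumptions):

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_gt, mask_logits, mask_gt,
               kp_pred, kp_gt, nolf_pred, nolf_gt,
               a1=1.0, a2=1.0, a3=1.0, a4=1.0):
    """L = a1*L_cls + a2*L_mask + a3*L_bb8 + a4*L_nolf, with stand-ins:
    cross-entropy for the category, BCE for the mask, smooth L1 over the
    eight key-point pixel coordinates, mean 3D error over NOLF pixels."""
    l_cls = F.cross_entropy(cls_logits, cls_gt)
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    l_bb8 = F.smooth_l1_loss(kp_pred, kp_gt)              # (B, 8, 2) tensors
    l_nolf = torch.norm(nolf_pred - nolf_gt, dim=-1).mean()
    return a1 * l_cls + a2 * l_mask + a3 * l_bb8 + a4 * l_nolf
```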
  • FIG. 11 is a schematic diagram of an embodiment of the three-dimensional reconstruction device in the embodiment of the present application.
  • the acquiring unit 1101 is configured to acquire an image of the first object and the camera pose of the image;
  • The determining unit 1102 is configured to determine, through a first deep learning network, a first normalized position field NOLF map of the first object in the image, where the first NOLF map indicates the normalized 3D point cloud of the first object under the shooting angle of view of the image;
  • the determining unit 1102 is configured to determine a first model corresponding to the first object from a plurality of three-dimensional models in a model database according to the first NOLF map;
  • the determining unit 1102 is further configured to determine the pose of the first object according to the first model and the camera pose of the image;
  • the reconstruction unit 1103 is configured to three-dimensionally reconstruct the first object in the image according to the first model and the pose of the first object.
  • The determining unit 1102 is further configured to: determine a first relative pose of the first object according to the first NOLF map, where the first relative pose is the relative pose between the pose of the first object and the camera pose of the image; and determine the NOLF maps of the multiple three-dimensional models under the viewing angle corresponding to the first relative pose. The determining unit 1102 is specifically configured to: from the NOLF maps respectively corresponding to the multiple three-dimensional models, determine the first model corresponding to the NOLF map with the highest similarity to the first NOLF map.
  • The determining unit 1102 is specifically configured to: determine, through a second deep learning network, the pixel coordinates of at least four feature points of the first object in the image, where the four object points indicated by the four feature points are not coplanar in three-dimensional space; determine the three-dimensional coordinates of the at least four feature points from the first NOLF map; and determine the first relative pose according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  • the determining unit 1102 is specifically configured to determine the first relative pose according to the pixel coordinates and the three-dimensional coordinates of the at least four feature points by using a perspective N-point estimation PnP algorithm.
  • the characteristic points of the first object include: eight corner points of the bounding box of the first object.
  • The determining unit 1102 is specifically configured to: input the image into the first deep learning network to determine a first original NOLF map; and determine the first NOLF map according to the first original NOLF map and the image mask of the first object.
  • The model database includes the categories of the multiple three-dimensional models, wherein the first object belongs to a first category; the determining unit 1102 is specifically configured to: determine the first model from the three-dimensional models belonging to the first category according to the first NOLF map.
  • the foregoing unit may be used to execute the method introduced in any of the foregoing embodiments, and the specific implementation process and technical effects can be referred to the corresponding embodiments of FIG. 6 to FIG. 10, and details are not described herein again.
  • FIG. 12 is a schematic diagram of another embodiment of the three-dimensional reconstruction device in the embodiment of the application.
  • the three-dimensional reconstruction device provided in this embodiment may be an electronic device such as a server or a terminal, and the specific device form is not limited in the embodiment of the present application.
  • the three-dimensional reconstruction apparatus 1200 may have relatively large differences due to different configurations or performances, and may include one or more processors 1201 and a memory 1202, and the memory 1202 stores programs or data.
  • the memory 1202 may be volatile storage or non-volatile storage.
  • the processor 1201 is one or more central processing units (CPU, Central Processing Unit).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • The processor 1201 may communicate with the memory 1202 and execute, on the three-dimensional reconstruction device 1200, a series of instructions in the memory 1202.
  • the 3D reconstruction apparatus 1200 also includes one or more wired or wireless network interfaces 1203, such as an Ethernet interface.
  • Optionally, the three-dimensional reconstruction apparatus 1200 may also include one or more power supplies and one or more input and output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch-screen device, a sensing device, or the like.
  • The input and output interfaces are optional components, which may or may not be present, and are not limited here.
  • FIG. 13 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • the embodiment of the present application provides a chip system that can be used to implement the three-dimensional reconstruction method.
  • the algorithm based on the convolutional neural network shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 13.
  • the neural network processor NPU 50 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the arithmetic circuit 503 is controlled by the controller 504 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit takes the matrix A data and matrix B from the input memory 501 to perform matrix operations, and the partial or final result of the obtained matrix is stored in the accumulator 508.
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the storage unit access controller 505 (direct memory access controller, DMAC).
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 510, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 509.
  • the bus interface unit 510 (bus interface unit, BIU) is used for the instruction fetch memory 509 to obtain instructions from an external memory, and is also used for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 or to transfer the weight data to the weight memory 502 or to transfer the input data to the input memory 501.
  • The vector calculation unit 507 may include multiple arithmetic processing units which, if necessary, further process the output of the arithmetic circuit, performing for example vector multiplication, vector addition, exponential operations, logarithmic operations and magnitude comparison; it is mainly used for non-convolutional/fully-connected layer computation in the neural network, such as pooling, batch normalization and local response normalization.
  • the vector calculation unit 507 can store the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the unified memory 506, the input memory 501, the weight memory 502, and the fetch memory 509 are all On-Chip memories.
  • the external memory is private to the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 3 and FIG. 4 may be executed by the matrix calculation unit 212 or the vector calculation unit 507.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or may be integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.

Abstract

The embodiments of this application disclose a three-dimensional reconstruction method that can realize three-dimensional reconstruction of an object from a single two-dimensional image and can improve the accuracy of three-dimensional reconstruction. The method in the embodiments of this application includes: acquiring an image of a first object and the camera pose of the image; determining, through a first deep learning network, a first normalized position field (NOLF) map of the first object in the image, the first NOLF map indicating the normalized three-dimensional point cloud of the first object under the shooting angle of view of the image; determining, according to the first NOLF map, a first model corresponding to the first object from multiple three-dimensional models in a model database; determining the pose of the first object according to the first model and the camera pose of the image; and three-dimensionally reconstructing the first object in the image according to the first model and the pose of the first object.


Claims (17)

  1. 一种三维重建方法,其特征在于,包括:
    获取第一物体的图像和所述图像的相机位姿;
    通过第一深度学习网络,确定所述图像中所述第一物体的第一归一化位置场NOLF图,所述第一NOLF图指示所述第一物体在所述图像的拍摄视角下的归一化的三维点云;
    根据所述第一NOLF图,从模型数据库的多个三维模型中确定与所述第一物体对应的第一模型;
    根据所述第一模型和所述图像的相机位姿确定所述第一物体的位姿;
    根据所述第一模型和所述第一物体的位姿三维重建所述图像中的所述第一物体。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    根据所述第一NOLF图,确定所述第一物体的第一相对位姿,所述第一相对位姿为所述第一物体的位姿与所述图像的相机位姿之间的相对位姿;
    确定多个所述三维模型在所述第一相对位姿对应的视角下的NOLF图;
    所述根据所述第一NOLF图,从模型数据库的多个三维模型中确定与所述第一物体对应的第一模型包括:
    从多个所述三维模型各自对应的NOLF图中确定与所述第一NOLF图相似度最高的NOLF图对应的所述第一模型。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述第一NOLF图,确定所述第一物体的第一相对位姿包括:
    通过第二深度学习网络,确定所述图像中所述第一物体的至少四个特征点的像素坐标,所述四个特征点指示的四个物点在三维空间中不共面;
    从所述第一NOLF图中确定所述至少四个特征点的三维坐标;
    根据所述至少四个特征点的所述像素坐标和所述三维坐标确定所述第一相对位姿。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述至少四个特征点的所述像素坐标和所述三维坐标确定所述第一相对位姿包括:
    根据所述至少四个特征点的所述像素坐标和所述三维坐标,通过透视N点估计PnP算法确定所述第一相对位姿。
  5. 根据权利要求3或4所述的方法,其特征在于,
    所述第一物体的特征点包括:所述第一物体的包围盒的八个角点。
  6. The method according to any one of claims 1 to 5, wherein the method further comprises:
    inputting the image into the first deep learning network to determine a first original NOLF map; and
    determining the first NOLF map based on the first original NOLF map and an image mask of the first object.
  7. The method according to any one of claims 1 to 6, wherein the model database comprises categories of the plurality of three-dimensional models, the first object belongs to a first category, and the method further comprises:
    determining the first model, based on the first NOLF map, from the three-dimensional models belonging to the first category.
  8. A three-dimensional reconstruction apparatus, comprising:
    an obtaining unit, configured to obtain an image of a first object and a camera pose of the image;
    a determining unit, configured to determine, through a first deep learning network, a first normalized location field (NOLF) map of the first object in the image, wherein the first NOLF map indicates a normalized three-dimensional point cloud of the first object from the shooting perspective of the image;
    the determining unit being configured to determine, based on the first NOLF map, a first model corresponding to the first object from a plurality of three-dimensional models in a model database;
    the determining unit being further configured to determine a pose of the first object based on the first model and the camera pose of the image; and
    a reconstruction unit, configured to three-dimensionally reconstruct the first object in the image based on the first model and the pose of the first object.
  9. The apparatus according to claim 8, wherein the determining unit is further configured to:
    determine, based on the first NOLF map, a first relative pose of the first object, wherein the first relative pose is the relative pose between the pose of the first object and the camera pose of the image; and
    determine NOLF maps of the plurality of three-dimensional models from the perspective corresponding to the first relative pose; and
    the determining unit is specifically configured to:
    determine, from the NOLF maps respectively corresponding to the plurality of three-dimensional models, the first model corresponding to the NOLF map having the highest similarity to the first NOLF map.
  10. The apparatus according to claim 9, wherein the determining unit is specifically configured to:
    determine, through a second deep learning network, pixel coordinates of at least four feature points of the first object in the image, wherein the four object points indicated by the four feature points are not coplanar in three-dimensional space;
    determine three-dimensional coordinates of the at least four feature points from the first NOLF map; and
    determine the first relative pose based on the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  11. The apparatus according to claim 10, wherein the determining unit is specifically configured to:
    determine the first relative pose through a perspective-n-point (PnP) estimation algorithm based on the pixel coordinates and the three-dimensional coordinates of the at least four feature points.
  12. The apparatus according to claim 10 or 11, wherein the feature points of the first object comprise the eight corner points of a bounding box of the first object.
  13. The apparatus according to any one of claims 8 to 12, wherein
    the determining unit is specifically configured to:
    input the image into the first deep learning network to determine a first original NOLF map; and
    determine the first NOLF map based on the first original NOLF map and an image mask of the first object.
  14. The apparatus according to any one of claims 8 to 13, wherein the model database comprises categories of the plurality of three-dimensional models, and the first object belongs to a first category; and
    the determining unit is specifically configured to:
    determine the first model, based on the first NOLF map, from the three-dimensional models belonging to the first category.
  15. A three-dimensional reconstruction apparatus, comprising a processor and a memory that are connected to each other, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 7.
  16. A computer program product comprising instructions, wherein when the computer program product runs on a computer, the computer is enabled to perform the method according to any one of claims 1 to 7.
  17. A computer-readable storage medium comprising instructions, wherein when the instructions run on a computer, the computer is enabled to perform the method according to any one of claims 1 to 7.
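
For illustration only (the claims, not this sketch, define the claimed scope), the pose-estimation and model-retrieval steps of claims 1 to 5 can be pictured in Python as follows. OpenCV's solvePnP stands in for the claimed PnP estimation, the eight bounding-box corner points of claim 5 serve as the feature points, and the mean-absolute-difference similarity is an assumption, since the claims do not fix a particular measure:

```python
# Non-limiting sketch of the claimed pipeline; all inputs (feature points,
# candidate NOLF maps, camera intrinsics) are assumed to come from the deep
# learning networks and model database that the claims describe.
import numpy as np
import cv2

def estimate_relative_pose(pixel_pts, nolf_pts, K):
    """PnP from at least four non-coplanar correspondences: pixel coordinates
    (e.g. bounding-box corners from the second deep learning network) paired
    with 3D coordinates read from the first NOLF map."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(nolf_pts, dtype=np.float64),
        np.asarray(pixel_pts, dtype=np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP estimation failed")
    return rvec, tvec  # rotation (Rodrigues vector) and translation

def pick_first_model(first_nolf, candidate_nolfs):
    """Select the model whose NOLF map, rendered under the first relative
    pose, is most similar to the first NOLF map; mean absolute difference
    stands in for the unspecified similarity measure."""
    errors = [float(np.abs(first_nolf - c).mean()) for c in candidate_nolfs]
    return int(np.argmin(errors))

# toy usage with synthetic, assumed values: a pinhole camera and the eight
# corner points of a unit bounding box projected from a known pose
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
corners = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.],
                    [1., 1., 0.], [1., 0., 1.], [0., 1., 1.], [1., 1., 1.]])
img_pts, _ = cv2.projectPoints(corners, np.zeros(3), np.array([0., 0., 5.]), K, None)
rvec, tvec = estimate_relative_pose(img_pts.reshape(-1, 2), corners, K)
```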
PCT/CN2021/074094 2020-03-04 2021-01-28 Three-dimensional reconstruction method and three-dimensional reconstruction apparatus WO2021175050A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/902,624 US20220414911A1 (en) 2020-03-04 2022-09-02 Three-dimensional reconstruction method and three-dimensional reconstruction apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010143002.1A CN113362382A (zh) 2020-03-04 2020-03-04 三维重建方法和三维重建装置
CN202010143002.1 2020-03-04

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/902,624 Continuation US20220414911A1 (en) 2020-03-04 2022-09-02 Three-dimensional reconstruction method and three-dimensional reconstruction apparatus

Publications (1)

Publication Number Publication Date
WO2021175050A1

Family

ID=77523353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074094 WO2021175050A1 (zh) Three-dimensional reconstruction method and three-dimensional reconstruction apparatus

Country Status (3)

Country Link
US (1) US20220414911A1 (zh)
CN (1) CN113362382A (zh)
WO (1) WO2021175050A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914901A (zh) * 2020-07-06 2020-11-10 周爱丽 Seat arrangement orderliness measurement system and method
CN113989445B (zh) * 2021-12-29 2022-03-01 国网瑞嘉(天津)智能机器人有限公司 Three-dimensional scene reconstruction method, apparatus, and system, and computer-readable storage medium
CN114529604B (zh) * 2022-01-25 2022-12-13 广州极点三维信息科技有限公司 Spatial object directed collision detection method, system apparatus, and medium
CN114758076A (zh) * 2022-04-22 2022-07-15 北京百度网讯科技有限公司 Training method and apparatus for a deep learning model for building three-dimensional models
CN114596363B (zh) * 2022-05-10 2022-07-22 北京鉴智科技有限公司 Three-dimensional point cloud annotation method, apparatus, and terminal
CN115222896B (zh) * 2022-09-20 2023-05-23 荣耀终端有限公司 Three-dimensional reconstruction method and apparatus, electronic device, and computer-readable storage medium
CN116258817B (zh) * 2023-02-16 2024-01-30 浙江大学 Method and system for constructing an autonomous-driving digital twin scene based on multi-view three-dimensional reconstruction
CN116758198A (zh) * 2023-06-15 2023-09-15 北京京东乾石科技有限公司 Image reconstruction method, apparatus, device, and storage medium
CN116580163B (zh) * 2023-07-14 2023-12-22 深圳元戎启行科技有限公司 Three-dimensional scene reconstruction method, electronic device, and storage medium
CN117876610A (zh) * 2024-03-12 2024-04-12 之江实验室 Model training method, apparatus, and storage medium for three-dimensional construction models

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101631257A (zh) * 2009-08-06 2010-01-20 中兴通讯股份有限公司 Method and apparatus for implementing stereoscopic playback of a two-dimensional video bitstream
CN109003325B (zh) * 2018-06-01 2023-08-04 杭州易现先进科技有限公司 Three-dimensional reconstruction method, medium, apparatus, and computing device
CN109658449B (zh) * 2018-12-03 2020-07-10 华中科技大学 Indoor scene three-dimensional reconstruction method based on RGB-D images

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129708A (zh) * 2010-12-10 2011-07-20 北京邮电大学 Fast multi-level virtual-real occlusion handling method in an augmented reality environment
US20150015582A1 (en) * 2013-07-15 2015-01-15 Markus Kaiser Method and system for 2d-3d image registration
CN107833270A (zh) * 2017-09-28 2018-03-23 浙江大学 Real-time object three-dimensional reconstruction method based on a depth camera

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223028A (zh) * 2022-06-02 2022-10-21 支付宝(杭州)信息技术有限公司 Scene reconstruction and model training method, apparatus, device, medium, and program product
CN115223028B (zh) * 2022-06-02 2024-03-29 支付宝(杭州)信息技术有限公司 Scene reconstruction and model training method, apparatus, device, medium, and program product
CN115422613A (zh) * 2022-09-20 2022-12-02 国能(惠州)热电有限责任公司 Three-dimensional digital intelligent design method and system
CN115578515A (zh) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Training method for a three-dimensional reconstruction model, and three-dimensional scene rendering method and apparatus
CN115578515B (zh) * 2022-09-30 2023-08-11 北京百度网讯科技有限公司 Training method for a three-dimensional reconstruction model, and three-dimensional scene rendering method and apparatus
CN115840507A (zh) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen device interaction method based on 3D image control
CN116524123A (zh) * 2023-04-20 2023-08-01 深圳市元甪科技有限公司 Three-dimensional electrical impedance tomography image reconstruction method and related device
CN116524123B (zh) * 2023-04-20 2024-02-13 深圳市元甪科技有限公司 Three-dimensional electrical impedance tomography image reconstruction method and related device
CN116258835A (zh) * 2023-05-04 2023-06-13 武汉大学 Method and system for three-dimensional reconstruction of point cloud data based on deep learning

Also Published As

Publication number Publication date
CN113362382A (zh) 2021-09-07
US20220414911A1 (en) 2022-12-29

Similar Documents

Publication Publication Date Title
WO2021175050A1 (zh) 三维重建方法和三维重建装置
US10679046B1 (en) Machine learning systems and methods of estimating body shape from images
CN110458939B (zh) 基于视角生成的室内场景建模方法
US10297070B1 (en) 3D scene synthesis techniques using neural network architectures
US11232286B2 (en) Method and apparatus for generating face rotation image
CN111598998B (zh) 三维虚拟模型重建方法、装置、计算机设备和存储介质
WO2021143101A1 (zh) 人脸识别方法和人脸识别装置
JP2023540917A (ja) 3次元再構成及び関連インタラクション、測定方法及び関連装置、機器
CN109684969B (zh) 凝视位置估计方法、计算机设备及存储介质
CN108895981A (zh) 一种三维测量方法、装置、服务器和存储介质
WO2021218238A1 (zh) 图像处理方法和图像处理装置
CN110599395A (zh) 目标图像生成方法、装置、服务器及存储介质
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
CN112085835B (zh) 三维卡通人脸生成方法、装置、电子设备及存储介质
WO2021249114A1 (zh) 目标跟踪方法和目标跟踪装置
US11423615B1 (en) Techniques for producing three-dimensional models from one or more two-dimensional images
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN115222896B (zh) 三维重建方法、装置、电子设备及计算机可读存储介质
CN115578393B (zh) 关键点检测方法、训练方法、装置、设备、介质及产品
CN114219855A (zh) 点云法向量的估计方法、装置、计算机设备和存储介质
Sengan et al. Cost-effective and efficient 3D human model creation and re-identification application for human digital twins
CN114565916A (zh) 目标检测模型训练方法、目标检测方法以及电子设备
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Hu et al. Computer vision for sight: Computer vision techniques to assist visually impaired people to navigate in an indoor environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21763967

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21763967

Country of ref document: EP

Kind code of ref document: A1