CN117649485A - Three-dimensional reconstruction method based on common sense information, electronic equipment and storage medium - Google Patents

Three-dimensional reconstruction method based on common sense information, electronic equipment and storage medium

Info

Publication number
CN117649485A
CN117649485A
Authority
CN
China
Prior art keywords
common sense
data
dimensional
features
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311433531.5A
Other languages
Chinese (zh)
Inventor
贺潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepRoute AI Ltd
Original Assignee
DeepRoute AI Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepRoute AI Ltd filed Critical DeepRoute AI Ltd
Priority to CN202311433531.5A
Publication of CN117649485A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a three-dimensional reconstruction method based on common sense information. The method comprises: obtaining at least one data item; extracting common sense features from the at least one data item using a multi-modal learning model, the common sense features being used to characterize a three-dimensional scene represented by the at least one data item; and performing three-dimensional reconstruction based on the common sense features to obtain a three-dimensional model of the three-dimensional scene, thereby realizing reconstruction of the three-dimensional scene. The application also discloses an electronic device and a storage medium. The method and the device improve the accuracy and completeness of three-dimensional reconstruction.

Description

Three-dimensional reconstruction method based on common sense information, electronic equipment and storage medium
Technical Field
The disclosed embodiments of the present application relate to the field of computer vision technology, and more particularly, to a three-dimensional reconstruction method based on common sense information, an electronic device, and a storage medium.
Background
Three-dimensional reconstruction technology has broad application prospects in fields such as virtual reality, game development, architectural design and intelligent manufacturing, and can promote innovation and development in related industries. However, traditional three-dimensional reconstruction mainly relies on the geometric and texture data captured by sensors. Such data generally provide only the surface shape and appearance of an object or scene and lack a deeper understanding of its internal structure, so the accuracy and completeness of results obtained with traditional three-dimensional reconstruction still need to be improved.
Disclosure of Invention
According to embodiments of the present application, a three-dimensional reconstruction method based on common sense information, an electronic device and a storage medium are proposed to address the above problems.
The first aspect of the application discloses a three-dimensional reconstruction method based on common sense information, which comprises: obtaining at least one data item; extracting common sense features from the at least one data item using a multi-modal learning model, the common sense features being used to characterize a three-dimensional scene represented by the at least one data item; and performing three-dimensional reconstruction based on the common sense features to obtain a three-dimensional model of the three-dimensional scene, thereby realizing reconstruction of the three-dimensional scene.
In some embodiments, extracting common sense features from the at least one data item using a multi-modal learning model comprises: encoding the at least one data item into feature data using the multi-modal learning model; and decoding the feature data into the common sense features using the multi-modal learning model.
In some embodiments, the at least one data item includes at least one of image data, text data, sensor data, and structured data; and encoding the at least one data item into feature data using the multi-modal learning model includes: inputting the image data into an image encoder to obtain image feature data; and/or inputting the text data into a text encoder to obtain text feature data; and/or inputting the sensor data into a point cloud encoder to obtain point cloud feature data; and/or inputting the structured data into a structured information encoder to obtain structured feature data.
In some embodiments, performing three-dimensional reconstruction based on the common sense features comprises: performing scene understanding based on the common sense features to obtain scene features of the three-dimensional scene; performing object recognition based on the common sense features to obtain object features in the three-dimensional scene; and matching the scene features with the object features to obtain the spatial relationship between the scene features and the object features in the three-dimensional scene.
In some embodiments, performing three-dimensional reconstruction based on the common sense features comprises: determining computing resources for the three-dimensional reconstruction based on the common sense features; and performing the three-dimensional reconstruction based on the common sense features using the determined computing resources.
In some embodiments, determining computing resources for the three-dimensional reconstruction based on the common sense features comprises: evaluating regions of the three-dimensional scene based on the common sense features to obtain reconstruction geometric information of the regions of the three-dimensional scene; and determining computing resources for each region of the three-dimensional scene based on its reconstruction geometric information.
In some embodiments, performing the three-dimensional reconstruction based on the common sense features using the computing resources comprises: reconstructing each region of the three-dimensional scene based on the common sense features using the computing resources determined for that region.
In some embodiments, the method further comprises: smoothing the three-dimensional model based on the common sense features and the related reconstruction features of the three-dimensional model.
A second aspect of the present application discloses an electronic device, comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory, to implement the three-dimensional reconstruction method based on common sense information described in the first aspect.
A third aspect of the present application discloses a non-transitory computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the three-dimensional reconstruction method based on common sense information described in the first aspect.
The beneficial effects of the application are as follows: at least one data item is obtained, and common sense features, which characterize the three-dimensional scene represented by the data item, are extracted from it using a multi-modal learning model; three-dimensional reconstruction is then performed based on the common sense features to obtain a three-dimensional model of the three-dimensional scene, thereby realizing reconstruction of the three-dimensional scene and improving the accuracy and completeness of the three-dimensional reconstruction.
Drawings
The application will be further described with reference to the accompanying drawings and embodiments, in which:
fig. 1 is a flow chart of a three-dimensional reconstruction method based on common sense information according to an embodiment of the present application;
FIG. 2 is a flow diagram of multi-modal pre-training in accordance with one embodiment of the present application;
FIG. 3 is a flow diagram of three-dimensional reconstruction according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 is a schematic structural view of a nonvolatile computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The term "and/or" in this application is merely an association relation describing an associated object, and indicates that three relations may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. Furthermore, the terms "first," "second," and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
In order to enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions of the present application are described in further detail below with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, fig. 1 is a flow chart of a three-dimensional reconstruction method based on common sense information according to an embodiment of the present application. The execution subject of the method can be an electronic device with computing capability, such as a microcomputer, a server, or a mobile device such as a notebook computer or a tablet computer.
It should be noted that, if there are substantially the same results, the method of the present application is not limited to the flow sequence shown in fig. 1.
In some possible implementations, the method may be implemented by a processor invoking computer readable instructions stored in a memory, as shown in fig. 1, and may include the steps of:
s11: at least one data is acquired.
The at least one data item includes various forms of information related to vehicle driving, such as image information, text information, sensor information (for example from a lidar), vehicle trajectory information, and map information. The sensors can provide geometric and distance information about objects, which supports a more comprehensive understanding of objects and scenes.
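For illustration, the multi-modal input described above could be bundled as in the following minimal sketch; the class name, field names and array shapes are assumptions made for this example only, not terms defined in the patent.

```python
# Hypothetical container for one multi-modal input sample (illustrative only;
# all names and shapes are assumptions, not terms from the patent).
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MultiModalSample:
    image: Optional[np.ndarray] = None        # H x W x 3 camera frame
    text: Optional[str] = None                # e.g. scene description or sign text
    point_cloud: Optional[np.ndarray] = None  # N x 3 lidar points (x, y, z)
    trajectory: Optional[np.ndarray] = None   # T x 2 ego-vehicle track (x, y)
    map_layer: Optional[dict] = None          # structured map information

# Example: a sample holding a camera frame, a lidar sweep and a short description.
sample = MultiModalSample(
    image=np.zeros((720, 1280, 3), dtype=np.uint8),
    point_cloud=np.random.rand(4096, 3).astype(np.float32),
    text="four-way intersection with a stop sign",
)
```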
S12: and extracting common sense features from the at least one data by using the multi-mode learning model, wherein the common sense features are used for representing a three-dimensional scene represented by the at least one data.
Common sense features are extracted from the at least one data item using a multi-modal learning model; that is, the at least one data item undergoes multi-modal pre-training, such as preprocessing and feature extraction applied to the different types of data, from which the common sense features are then extracted. For example, by multi-modal training on image information, text information, sensor information such as lidar data, vehicle trajectory information and map information, features corresponding to the geometric information, semantic information and structural information of the relevant data can be obtained. The common sense features are used to characterize the three-dimensional scene represented by the at least one data item.
S13: and carrying out three-dimensional reconstruction based on common sense features to obtain a three-dimensional model of the three-dimensional scene, and realizing the reconstruction of the three-dimensional scene.
Three-dimensional reconstruction is performed based on the common sense features: the scene is first understood through the common sense features, the object features present in the scene are identified, and the spatial relationship between the scene features and the object features is inferred, so that a three-dimensional model of the three-dimensional scene is obtained and reconstruction of the three-dimensional scene is realized.
In this embodiment, at least one data item is obtained, and common sense features are extracted from it using a multi-modal learning model, the common sense features characterizing the three-dimensional scene represented by the data item. Three-dimensional reconstruction is then performed based on the common sense features to obtain a three-dimensional model of the three-dimensional scene, thereby realizing reconstruction of the three-dimensional scene and improving the accuracy and completeness of the three-dimensional reconstruction.
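As a rough orientation, steps S11 to S13 can be read as the pipeline sketched below. The function names and return values are placeholders standing in for the components described in the rest of this description; they are assumptions made for illustration, not an API defined by the patent.

```python
# Placeholder pipeline for steps S11-S13 (illustrative stubs only).
def acquire_data(raw_inputs):
    # S11: in practice this would gather image, text, lidar, trajectory and map data.
    return raw_inputs

def extract_common_sense(data):
    # S12: stand-in for the multi-modal learning model described below.
    return {"scene_type": "urban street", "objects": ["signboard", "building"]}

def rebuild_from_common_sense(common_sense):
    # S13: stand-in for the reconstruction system described below.
    return {"mesh": None, "semantics": common_sense}

def reconstruct_scene(raw_inputs):
    data = acquire_data(raw_inputs)
    common_sense = extract_common_sense(data)
    return rebuild_from_common_sense(common_sense)

print(reconstruct_scene({"image": "frame_000.png"}))
```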
In some embodiments, extracting common sense features from the at least one data item using the multi-modal learning model comprises: encoding the at least one data item into feature data using the multi-modal learning model; and decoding the feature data into common sense features using the multi-modal learning model.
The at least one data item is encoded into feature data using the multi-modal learning model. For example, point cloud data and image data are encoded to form geometric feature data of the scene and objects; text data is encoded, for example processed with natural language processing (Natural Language Processing, NLP) techniques, to obtain semantic features; and structured data such as the vehicle trajectory and the map is encoded to obtain structured feature data.
Further, the feature data is decoded into common sense features using the multi-modal learning model; for example, the geometric, semantic and structured features of the scene and objects are decoded by the multi-modal learning model and converted into common sense features that characterize the three-dimensional scene represented by the at least one data item.
For example, using the multi-modal learning model, the image data and point cloud data of a signboard are encoded into signboard feature data, which is further decoded into a common sense feature A. The common sense feature A may be used to characterize geometric information of the signboard in the three-dimensional scene, such as its specific size and internal structure.
In some embodiments, the at least one data item includes at least one of image data, text data, sensor data, and structured data. Encoding the at least one data item into feature data using the multi-modal learning model comprises: inputting the image data into an image encoder to obtain image feature data; and/or inputting the text data into a text encoder to obtain text feature data; and/or inputting the sensor data into a point cloud encoder to obtain point cloud feature data; and/or inputting the structured data into a structured information encoder to obtain structured feature data.
Specifically, the at least one data item includes at least one of image data, text data, sensor data, and structured data, wherein the sensor data may be data acquired by sensors such as a lidar or a depth camera, and the structured data may be trajectory data of the vehicle itself and external map data.
The image data is input into an image encoder, i.e. encoded by the image encoder, to obtain image feature data; the text data is input into a text encoder, i.e. encoded by the text encoder, to obtain text feature data; the sensor data, such as lidar data, is input into a point cloud encoder, i.e. encoded by the point cloud encoder, to obtain point cloud feature data; and the structured data is input into a structured information encoder, i.e. the vehicle trajectory data and map data are encoded by the structured information encoder, to obtain structured feature data.
Further, at least one of the image feature data, the text feature data, the point cloud feature data and the structured feature data can be input into a common sense information decoder for decoding and converted into the corresponding common sense features.
For ease of understanding, the multi-modal pre-training is illustrated in fig. 2, which is a schematic flow chart of multi-modal pre-training according to an embodiment of the present application. A multi-modal learning model X includes an image encoder, a text encoder, a point cloud encoder, a structured information encoder and a common sense information decoder, and is used to extract common sense features from the at least one data item. Specifically, the image encoder encodes the image data to obtain image feature data; the text encoder encodes the text data to obtain text feature data; the point cloud encoder encodes the sensor data to obtain point cloud feature data; and the structured information encoder encodes the structured data to obtain structured feature data. The common sense information decoder then decodes and converts the obtained features into the corresponding common sense features.
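A minimal sketch of what model X could look like is given below, assuming a PyTorch implementation with one lightweight encoder per modality and a concatenation-based common sense information decoder. The layer sizes, the fusion strategy and all class and argument names are assumptions made for illustration; the patent does not specify an architecture.

```python
# Illustrative multi-modal model: four modality encoders feeding one
# common sense information decoder (all dimensions are arbitrary).
import torch
import torch.nn as nn

class MultiModalCommonSenseModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Lightweight stand-ins for real backbones (CNN, point network, ...).
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Embedding(10000, dim), nn.Flatten(1), nn.LazyLinear(dim))
        self.point_cloud_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        self.structured_encoder = nn.Sequential(nn.LazyLinear(dim), nn.ReLU())
        # Common sense information decoder: fuses the per-modality features.
        self.common_sense_decoder = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, image, text_ids, points, structured):
        feats = torch.cat([
            self.image_encoder(image),
            self.text_encoder(text_ids),
            self.point_cloud_encoder(points),
            self.structured_encoder(structured),
        ], dim=-1)
        return self.common_sense_decoder(feats)

# Example with dummy inputs (batch of 2).
model = MultiModalCommonSenseModel()
common_sense = model(
    image=torch.randn(2, 3, 32, 32),
    text_ids=torch.randint(0, 10000, (2, 16)),
    points=torch.randn(2, 1024, 3),
    structured=torch.randn(2, 8),
)
print(common_sense.shape)  # torch.Size([2, 256])
```

In a full system the lightweight encoders would be replaced by proper backbones, but the data flow matches the description above: four modality-specific encoders feeding one common sense decoder.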
In some embodiments, three-dimensional reconstruction based on the common sense features comprises: performing scene understanding based on the common sense features to obtain scene features of the three-dimensional scene; performing object recognition based on the common sense features to obtain object features in the three-dimensional scene; and matching the scene features with the object features to obtain the spatial relationship between the scene features and the object features in the three-dimensional scene.
Scene understanding is performed based on the common sense features to obtain scene features of the three-dimensional scene. For example, the common sense features are converted by a scene decoder into existing structures such as streets, buildings and signboards, and the corresponding features are the scene features; in other words, the common sense features can be used to understand the objects, structures and environment in the scene. It can be understood that the common sense features serve as the system's initial understanding of the scene and provide a basis for subsequent reconstruction.
Object recognition is performed based on the common sense features to obtain object features in the three-dimensional scene. For example, in combination with the common sense features, an object decoder is used to recognize and classify objects, yielding attributes such as the shape, size and color of each object in the scene; the corresponding features are the object features. It can be understood that the common sense features can be used to recognize and classify objects, so that a corresponding model is built for each object.
The obtained scene features and object features are matched to obtain their specific spatial relationship in the three-dimensional scene. For example, the scene features and object features are input into a feature matching system for feature matching, which establishes the relationship between them; the spatial relationship between the scene and the objects is thus restored through this matching, and the whole scene is reconstructed based on that spatial relationship.
For ease of understanding, an example of the three-dimensional reconstruction process based on common sense features is illustrated in fig. 3, a schematic flow chart of three-dimensional reconstruction according to an embodiment of the present application. Here, three-dimensional reconstruction based on the common sense features is implemented by a three-dimensional reconstruction system Y, which includes a scene decoder, an object decoder and a feature matching system. Specifically, the common sense features are converted by the scene decoder into existing structures, such as streets, buildings and signboards, to obtain the scene features; the object decoder recognizes and classifies objects to obtain the object features in the three-dimensional scene, such as the shape, size and color of the objects; and the feature matching system matches the obtained scene features with the object features to obtain their spatial relationship in the three-dimensional scene, so that the three-dimensional scene is reconstructed and the three-dimensional model is obtained.
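The following sketch illustrates the three stages of system Y in simplified form: a scene decoder producing scene-level structures, an object decoder producing object features, and a feature matching step that recovers a coarse spatial relationship between them. The dictionary layout and the nearest-structure matching rule are assumptions made only for this example.

```python
# Illustrative sketch of scene understanding, object recognition and feature
# matching; not the patent's actual implementation.
import numpy as np

def decode_scene(common_sense):
    # Scene decoder: turn common sense features into scene-level structures.
    return [{"name": n, "position": np.array(p)}
            for n, p in common_sense.get("structures", [])]

def decode_objects(common_sense):
    # Object decoder: identify and classify objects in the scene.
    return [{"name": n, "position": np.array(p)}
            for n, p in common_sense.get("objects", [])]

def match_features(scene_feats, object_feats, max_dist=10.0):
    # Feature matching: attach each object to the nearest scene structure,
    # recovering a coarse spatial relationship between scene and objects.
    relations = []
    for obj in object_feats:
        dists = [np.linalg.norm(obj["position"] - s["position"]) for s in scene_feats]
        if dists and min(dists) <= max_dist:
            nearest = scene_feats[int(np.argmin(dists))]["name"]
            relations.append((obj["name"], nearest, float(min(dists))))
    return relations

common_sense = {
    "structures": [("street", (0.0, 0.0, 0.0)), ("building", (12.0, 5.0, 0.0))],
    "objects": [("signboard", (11.0, 4.0, 3.0))],
}
print(match_features(decode_scene(common_sense), decode_objects(common_sense)))
# e.g. [('signboard', 'building', 3.31...)]
```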
In some embodiments, three-dimensional reconstruction based on the common sense features comprises: determining computing resources for the three-dimensional reconstruction based on the common sense features; and performing the three-dimensional reconstruction based on the common sense features using the determined computing resources.
Computing resources for the three-dimensional reconstruction are determined based on the common sense features; for example, the three-dimensional reconstruction system evaluates the importance of different regions of the scene according to the common sense features and dynamically allocates computing resources accordingly. The three-dimensional reconstruction is then performed with the allocated computing resources in combination with the common sense features. Specifically, scene understanding is performed based on the common sense features to obtain scene features of the three-dimensional scene; object recognition is performed based on the common sense features to obtain object features in the three-dimensional scene; and the scene features are matched with the object features to obtain their spatial relationship in the three-dimensional scene, so that a three-dimensional model of the three-dimensional scene is obtained and reconstruction of the scene is realized.
In some embodiments, determining computing resources for the three-dimensional reconstruction based on the common sense features comprises: evaluating regions of the three-dimensional scene based on the common sense features to obtain reconstruction geometric information of the regions; and determining computing resources for each region of the three-dimensional scene based on its reconstruction geometric information.
Regions of the three-dimensional scene are evaluated based on the common sense features to obtain their reconstruction geometric information, and computing resources for each region are determined from that information. For example, the three-dimensional reconstruction system evaluates the importance of different regions of the scene grid according to the common sense features, which yields the reconstruction geometric information of each region, and then allocates computing resources according to how strongly the reconstructed geometry varies: a region with large geometric variation is evaluated as important and receives more computing resources, while a region with little variation does not need excessive resources. In this way, dynamic allocation of computing resources is realized.
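One possible, purely illustrative way to realise this allocation is sketched below, assuming that the variation of the reconstructed geometry within a region can be summarised by the variance of its surface heights; regions with stronger variation then receive a larger share of a fixed compute budget. The budgeting rule and the numbers are assumptions, not values from the patent.

```python
# Illustrative dynamic allocation of a compute budget across scene regions.
import numpy as np

def allocate_compute(region_geometry, total_budget=100.0, floor=1.0):
    # Score each region by how strongly its geometry varies; regions with
    # large variation are treated as important and get more compute.
    scores = {name: max(float(np.var(heights)), floor)
              for name, heights in region_geometry.items()}
    total = sum(scores.values())
    return {name: total_budget * s / total for name, s in scores.items()}

regions = {
    "intersection": np.array([0.0, 2.5, 0.3, 4.1, 0.2]),    # strongly varying geometry
    "flat_road":    np.array([0.0, 0.05, 0.02, 0.03, 0.0]), # nearly flat
}
print(allocate_compute(regions))  # the intersection receives most of the budget
```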
In some embodiments, performing the three-dimensional reconstruction based on the common sense features using the computing resources comprises: reconstructing each region of the three-dimensional scene based on the common sense features using the computing resources determined for that region.
Each region of the three-dimensional scene is reconstructed using the computing resources determined for that region, in combination with the common sense features. Specifically, scene understanding is performed based on the common sense features to obtain the scene features of the region; object recognition is performed based on the common sense features to obtain the object features within the region; and the scene features are matched with the object features to obtain their spatial relationship within the region, so that a three-dimensional model of the region is obtained and reconstruction of the three-dimensional scene is realized.
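Continuing the sketch above, the per-region budget could then drive the reconstruction resolution of each region, for example as follows; again, the resolution rule is an assumption made only for illustration.

```python
# Illustrative per-region reconstruction driven by the allocated budget.
def reconstruct_regions(budgets, base_resolution=8):
    meshes = {}
    for name, budget in budgets.items():
        # More budget -> denser reconstruction grid for that region.
        resolution = max(base_resolution, int(budget))
        meshes[name] = {"region": name, "grid_resolution": resolution}
    return meshes

print(reconstruct_regions({"intersection": 92.0, "flat_road": 8.0}))
```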
Further, in some embodiments, the three-dimensional model is smoothed based on common sense features and associated reconstructed features of the three-dimensional model.
The three-dimensional model is smoothed, i.e. refined, based on the common sense features and the related reconstruction features of the three-dimensional model. The related reconstruction features may be the scene features and object features constructed from the previously extracted common sense features, so the whole three-dimensional reconstruction system is fine-tuned at a macroscopic level and the reconstruction result is smoothed, making it more natural and realistic.
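As one possible realisation of this step, the sketch below applies simple Laplacian smoothing to mesh vertices, with a per-vertex weight that in a full system could be derived from the common sense and reconstruction features (here it is simply passed in). The mechanism is an assumption for illustration; the patent only states that the model is smoothed based on those features.

```python
# Illustrative feature-weighted Laplacian smoothing of mesh vertices.
import numpy as np

def smooth_vertices(vertices, neighbors, weights, iterations=3):
    v = vertices.astype(float).copy()
    for _ in range(iterations):
        updated = v.copy()
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue
            centroid = v[nbrs].mean(axis=0)
            # weight in [0, 1]: 0 keeps the vertex fixed, 1 moves it to the centroid
            updated[i] = (1.0 - weights[i]) * v[i] + weights[i] * centroid
        v = updated
    return v

vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5], [2.0, 0.0, 0.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
weights = np.array([0.0, 0.5, 0.0])   # only the noisy middle vertex is relaxed
print(smooth_vertices(vertices, neighbors, weights))
```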
It will be appreciated by those skilled in the art that, in the methods of the above specific embodiments, the written order of the steps does not imply a strict order of execution; the actual order of execution should be determined by the functions of the steps and their possible internal logic.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 40 comprises a memory 41 and a processor 42 coupled to each other, the processor 42 being configured to execute program instructions stored in the memory 41 to implement the steps of the above embodiments of the three-dimensional reconstruction method based on common sense information. In one particular implementation scenario, the electronic device 40 may include, but is not limited to, a microcomputer or a server; no limitation is imposed here.
In particular, the processor 42 is configured to control itself and the memory 41 to implement the steps of the above embodiments of the three-dimensional reconstruction method based on common sense information. The processor 42 may also be referred to as a CPU (Central Processing Unit), and may be an integrated circuit chip with signal processing capability. The processor 42 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 42 may also be implemented jointly by integrated circuit chips.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a non-volatile computer readable storage medium according to an embodiment of the present application. The non-transitory computer readable storage medium 50 is used to store a computer program 501 which, when executed by a processor, for example the processor 42 in the embodiment of fig. 4 above, carries out the steps of the above embodiments of the three-dimensional reconstruction method based on common sense information.
The foregoing description of the various embodiments focuses on the differences between them; for parts that are the same or similar, the embodiments may be referred to one another, and details are not repeated here for brevity.
In the several embodiments provided in this application, it should be understood that the disclosed methods and related devices may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only a logical functional division, and other divisions are possible in actual implementation; for example, units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated unit, if implemented as a software functional unit and sold or used as an independent product, may be stored in a computer readable storage medium. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those skilled in the art will readily appreciate that many modifications and variations are possible in the device and method while maintaining the teachings of the present application. Accordingly, the above disclosure should be viewed as limited only by the scope of the appended claims.

Claims (10)

1. A three-dimensional reconstruction method based on common sense information, comprising:
acquiring at least one data item;
extracting common sense features from the at least one data item using a multi-modal learning model, the common sense features being used to characterize a three-dimensional scene represented by the at least one data item;
and performing three-dimensional reconstruction based on the common sense features to obtain a three-dimensional model of the three-dimensional scene, thereby realizing reconstruction of the three-dimensional scene.
2. The method of claim 1, wherein extracting common sense features from the at least one data item using a multi-modal learning model comprises:
encoding the at least one data item into feature data using the multi-modal learning model;
and decoding the feature data into the common sense features using the multi-modal learning model.
3. The method of claim 2, wherein the at least one data item comprises at least one of image data, text data, sensor data, and structured data;
encoding the at least one data item into feature data using the multi-modal learning model includes:
inputting the image data into an image encoder to obtain image feature data; and/or
inputting the text data into a text encoder to obtain text feature data; and/or
inputting the sensor data into a point cloud encoder to obtain point cloud feature data; and/or
inputting the structured data into a structured information encoder to obtain structured feature data.
4. The method according to claim 1, wherein said performing a three-dimensional reconstruction based on said common sense features comprises:
performing scene understanding based on the common sense features to obtain scene features of the three-dimensional scene;
performing object recognition based on the common sense features to obtain object features in the three-dimensional scene;
and matching the scene features with the object features to obtain the spatial relationship between the scene features and the object features in the three-dimensional scene.
5. The method according to claim 1, wherein said performing a three-dimensional reconstruction based on said common sense features comprises:
determining computing resources for the three-dimensional reconstruction based on the common sense features;
and performing the three-dimensional reconstruction based on the common sense features using the computing resources determined for the three-dimensional reconstruction.
6. The method of claim 5, wherein the determining computing resources for the three-dimensional reconstruction based on the common sense features comprises:
evaluating a region of the three-dimensional scene based on the common sense features to obtain reconstruction geometric information of the region of the three-dimensional scene;
and determining computing resources for the region of the three-dimensional scene based on the reconstruction geometric information of the region.
7. The method of claim 5, wherein the performing the three-dimensional reconstruction based on the common sense features using the computational resources of the three-dimensional reconstruction comprises:
and reconstructing the region of the three-dimensional scene based on the common sense features by using the computing resources of the region of the three-dimensional scene.
8. The method as recited in claim 1, further comprising:
smoothing the three-dimensional model based on the common sense features and the related reconstruction features of the three-dimensional model.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the three-dimensional reconstruction method based on common sense information of any one of claims 1 to 8.
10. A non-transitory computer readable storage medium having stored thereon program instructions, which when executed by a processor, implement the three-dimensional reconstruction method based on common sense information of any one of claims 1 to 8.
CN202311433531.5A 2023-10-30 2023-10-30 Three-dimensional reconstruction method based on common sense information, electronic equipment and storage medium Pending CN117649485A (en)

Priority Applications (1)

Application Number: CN202311433531.5A; Priority date: 2023-10-30; Filing date: 2023-10-30; Title: Three-dimensional reconstruction method based on common sense information, electronic equipment and storage medium; Published as CN117649485A (en)

Publications (1)

Publication Number: CN117649485A (en); Publication Date: 2024-03-05

Family

ID=90046846

Country Status (1)

Country Link
CN (1) CN117649485A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination