CN115578515A - Training method of three-dimensional reconstruction model, and three-dimensional scene rendering method and device


Info

Publication number
CN115578515A
Authority
CN
China
Prior art keywords
pose
image
rays
ray
dimensional reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211216645.XA
Other languages
Chinese (zh)
Other versions
CN115578515B (en)
Inventor
孟庆月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211216645.XA priority Critical patent/CN115578515B/en
Publication of CN115578515A publication Critical patent/CN115578515A/en
Application granted granted Critical
Publication of CN115578515B publication Critical patent/CN115578515B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a training method for a three-dimensional reconstruction model, a three-dimensional scene rendering method, and a three-dimensional scene rendering device. It relates to the field of artificial intelligence, in particular to technical fields such as augmented reality, virtual reality, computer vision, and deep learning, and can be applied to scenes such as the metaverse and smart cities. The implementation scheme is as follows: acquiring a sample image of a target scene and a first pose of the image acquisition device when acquiring the sample image; determining a plurality of first rays based on the first pose; inputting information of the plurality of first rays into the model to obtain a first spatial field and a rendered image; determining a color loss based on the difference between the rendered image and the sample image; generating a second pose of the image capture device based on the first pose; determining a plurality of second rays based on the second pose; inputting information of the plurality of second rays into the model to obtain a second spatial field; determining a geometric loss based on at least the second spatial field; and adjusting parameters of the model based on the color loss and the geometric loss.

Description

Training method of three-dimensional reconstruction model, and three-dimensional scene rendering method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to technical fields such as augmented reality, virtual reality, computer vision, and deep learning, and can be applied to scenes such as the metaverse and smart cities. The present disclosure relates to a training method and apparatus for a three-dimensional reconstruction model, a three-dimensional scene rendering method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Three-dimensional Reconstruction (3D Reconstruction) refers to the establishment of a mathematical model suitable for computer representation and processing of a three-dimensional scene, which is the basis for processing, operating and analyzing the properties of the three-dimensional scene in a computer environment, and is also a key technology for establishing virtual reality expressing an objective world in a computer.
In computer vision, three-dimensional reconstruction refers to the process of reconstructing three-dimensional information of a scene from single-view or multi-view images of the scene.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The disclosure provides a training method and device for a three-dimensional reconstruction model, a three-dimensional scene rendering method and device, an electronic device, a computer-readable storage medium and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a three-dimensional reconstruction model, including: acquiring a sample image of a target scene and a first pose of image acquisition equipment when acquiring the sample image; determining a plurality of first rays of the sample image based on the first pose, wherein the plurality of first rays correspond to a plurality of pixels of the sample image respectively; inputting information of the first rays into a three-dimensional reconstruction model to obtain a first spatial field and a rendered image of the target scene output by the three-dimensional reconstruction model, wherein the first spatial field comprises respective voxel densities of a plurality of sampling points on each first ray; determining a color loss based on a difference of the rendered image and the sample image; generating a second pose of the image capture device based on the first pose; determining a plurality of second rays of a virtual image based on the second pose, wherein the virtual image is an image plane of the image acquisition device in the second pose, and the plurality of second rays correspond to a plurality of pixels of the virtual image respectively; inputting information of the plurality of second rays into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model, wherein the second spatial field comprises respective voxel densities of a plurality of sampling points on each second ray; determining a geometric loss based on at least the second spatial field; and adjusting parameters of the three-dimensional reconstruction model based on the color loss and the geometric loss.
According to an aspect of the present disclosure, there is provided a three-dimensional scene rendering method, including: acquiring a three-dimensional reconstruction model aiming at a target scene and an observation pose of the target scene, wherein the three-dimensional reconstruction model is obtained by training based on a training method of the three-dimensional reconstruction model; and generating a rendering image of the target scene under the observation pose based on the three-dimensional reconstruction model and the observation pose.
According to an aspect of the present disclosure, there is provided a training apparatus for a three-dimensional reconstruction model, including: an acquisition module configured to acquire a sample image of a target scene and a first pose of an image acquisition device when acquiring the sample image; a first determination module configured to determine a plurality of first rays of the sample image based on the first pose, wherein the plurality of first rays correspond to a plurality of pixels of the sample image, respectively; a first output module configured to input information of the plurality of first rays into a three-dimensional reconstruction model to obtain a first spatial field of the target scene and a rendered image output by the three-dimensional reconstruction model, wherein the first spatial field comprises respective voxel densities of a plurality of sampling points on each first ray; a first loss module configured to determine a color loss based on a difference of the rendered image and the sample image; a generation module configured to generate a second pose of the image capture device based on the first pose; a second determining module configured to determine, based on the second pose, a plurality of second rays of a virtual image, wherein the virtual image is an image plane of the image capturing device in the second pose, and the plurality of second rays correspond to a plurality of pixels of the virtual image respectively; a second output module configured to input information of the plurality of second rays into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model, wherein the second spatial field includes respective voxel densities of the plurality of sampling points on each second ray; a second loss module configured to determine a geometric loss based at least on the second spatial field; and an adjustment module configured to adjust parameters of the three-dimensional reconstruction model based on the color loss and the geometric loss.
According to an aspect of the present disclosure, there is provided a three-dimensional scene rendering apparatus including: the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is configured to acquire a three-dimensional reconstruction model for a target scene and an observation pose of the target scene, and the three-dimensional reconstruction model is obtained by training based on a training device of the three-dimensional reconstruction model; and a generation module configured to generate a rendered image of the target scene at the observation pose based on the three-dimensional reconstruction model and the observation pose.
According to an aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any of the above aspects.
According to an aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the above aspects.
According to an aspect of the disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above aspects.
According to one or more embodiments of the present disclosure, the accuracy of three-dimensional reconstruction can be improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of example only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Fig. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with embodiments of the present disclosure;
FIG. 2 shows a flow diagram of a method of training a three-dimensional reconstruction model according to an embodiment of the present disclosure;
fig. 3 shows a flow diagram of a method of rendering a three-dimensional scene according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a training apparatus for three-dimensional reconstruction models, according to an embodiment of the present disclosure;
fig. 5 shows a block diagram of a three-dimensional scene rendering apparatus according to an embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", and the like to describe various elements is not intended to limit the positional relationship, the temporal relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases they may refer to different instances based on the context of the description.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing the particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
In the related art, images of a scene from different perspectives can be used to train a neural network model so that the neural network learns the three-dimensional information of the scene, thereby realizing three-dimensional reconstruction of the scene. The trained neural network model is then used to generate an image of the scene at a new viewing angle. When only a few images of the scene from different viewing angles are available, the neural network model cannot fully learn the three-dimensional information of the scene, so the three-dimensional reconstruction effect of the neural network model is poor and the spatial geometry of the scene in the generated new-view image is severely distorted.
In order to solve the above problem, the embodiments of the present disclosure provide a training method for a three-dimensional reconstruction model and a three-dimensional scene rendering method. The training method of the three-dimensional reconstruction model of the embodiment of the disclosure can enable the three-dimensional reconstruction model to accurately learn the three-dimensional information of the target scene, and improve the accuracy of the three-dimensional reconstruction result of the target scene. A high quality three-dimensional reconstruction can be achieved even with a small number of sample images, i.e. images of different perspectives of the target scene. By adopting the three-dimensional reconstruction model of the embodiment of the disclosure to generate the rendering image of the target scene under the new visual angle, the accuracy of the rendering image can be improved, the rendering image can accurately express the space geometry of the target scene under the new visual angle, and distortion is avoided.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the training method of the three-dimensional reconstruction model and/or the three-dimensional scene rendering method to be performed.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may specify an observation pose (i.e., perspective) of the target scene using the client device 101, 102, 103, 104, 105, and/or 106, and send a rendering request (i.e., three-dimensional scene rendering request) of the target scene at the observation pose to the server 120. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 can include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. The cloud server is a host product in a cloud computing service system, intended to overcome the drawbacks of high management difficulty and weak service scalability in conventional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
In an embodiment of the present disclosure, a user may send a three-dimensional scene rendering request to the server 120 through the client device 101, 102, 103, 104, 105, or 106, the three-dimensional scene rendering request including a target scene to be rendered and an observation pose of the target scene. The server 120 responds to the three-dimensional scene rendering request of the user, executes the three-dimensional scene rendering method of the embodiment of the disclosure, and generates a rendered image of the target scene in the specified observation pose based on the trained three-dimensional reconstruction model.
According to some embodiments, the three-dimensional reconstruction model may be trained by the server 120, or may be trained by another server (not shown in fig. 1). In other words, the training method for the three-dimensional reconstruction model according to the embodiment of the present disclosure may be performed by the server 120, or may be performed by another server.
The server for executing the three-dimensional scene rendering method according to the embodiment of the present disclosure and the server for executing the training method of the three-dimensional reconstruction model according to the embodiment of the present disclosure may be the same server (e.g., the server 120) or different servers (e.g., the three-dimensional scene rendering method is executed by the server 120, and the training method of the three-dimensional reconstruction model is executed by a server different from the server 120).
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
FIG. 2 shows a flow diagram of a method 200 for training a three-dimensional reconstruction model according to an embodiment of the disclosure. As described above, the execution subject of the method 200 may be the server 120 in fig. 1, or may be another server different from the server 120. As shown in FIG. 2, method 200 includes steps S210-S290.
In step S210, a sample image of a target scene and a first pose of an image capture device at the time of capturing the sample image are acquired.
In step S220, a plurality of first rays of the sample image are determined based on the first pose. The first rays correspond to pixels of the sample image, respectively.
In step S230, information of the first rays is input into the three-dimensional reconstruction model to obtain a first spatial field and a rendered image of the target scene output by the three-dimensional reconstruction model. The first spatial field includes respective voxel densities of a plurality of sampling points on each of the first rays.
In step S240, a color loss is determined based on a difference of the rendered image and the sample image.
In step S250, a second pose of the image capture device is generated based on the first pose.
In step S260, a plurality of second rays of the virtual image are determined based on the second pose. The virtual image is an image plane of the image acquisition equipment in the second pose, and the plurality of second rays correspond to the plurality of pixels of the virtual image respectively.
In step S270, the information of the plurality of second rays is input into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model. The second spatial field includes respective voxel densities of the plurality of sampling points on each of the second rays.
In step S280, a geometric loss is determined based on at least the second spatial field.
In step S290, parameters of the three-dimensional reconstruction model are adjusted based on the color loss and the geometric loss.
According to an embodiment of the present disclosure, a three-dimensional reconstruction model is trained with color loss and geometric loss. The color loss is used to ensure that the generated rendered image is consistent with the color of the real image. The geometric loss can add geometric constraint of the target scene under a new view angle (namely the second pose) to the three-dimensional reconstruction model, so that the three-dimensional reconstruction model accurately learns the spatial geometric information of the target scene, and the accuracy of the three-dimensional reconstruction result of the target scene is improved. High-quality three-dimensional reconstruction can be achieved even in the case of a small number of sample images.
In embodiments of the present disclosure, the target scene may be any three-dimensional scene to be reconstructed, such as a city street view, a field view, a meta-universe avatar, and the like.
According to some embodiments, the image capture device may be any device with image capture capabilities including, but not limited to, a camera, a camcorder, a cell phone, a tablet computer, and the like.
In an embodiment of the present disclosure, the first pose is a pose (i.e., an existing perspective) of the image capture device at the time of capturing the sample image, and the second pose is a new pose (i.e., a new perspective) generated based on the first pose.
According to some embodiments, a sample image of the target scene and the corresponding first pose of the image capture device may be derived by an SFM (Structure From Motion) algorithm. For example, sample images of the target scene from a plurality of different viewing angles may be acquired, and then the SFM algorithm is used to calculate the first pose of the image capturing device corresponding to each sample image.
The poses of the image capturing device (including the first pose and the second pose) are used to indicate the position and orientation of the image capturing device. The position of the image acquisition device may be represented, for example, by three-dimensional coordinates in the form (x, y, z). The orientation of the image acquisition device may be represented, for example, by attitude angles, which include the pitch angle (pitch), yaw angle (yaw), and roll angle (roll).
Based on the first pose of the image acquisition device, a plurality of first rays of the sample image may be determined. The plurality of first rays correspond to a plurality of pixels in the sample image respectively, and each first ray comprises a plurality of sampling points.
According to some embodiments, each first ray of the plurality of first rays is a ray directed by the image acquisition device having the first pose to a respective pixel in the sample image. Specifically, based on the first pose, the intrinsic parameters of the image capture device (including the focal length, the physical size of a pixel, the offset in pixels of the image center from the image origin, etc.), and the two-dimensional coordinates of each pixel of the sample image within the sample image, the position of each pixel of the sample image in space, that is, its three-dimensional coordinates, can be determined. Further, by connecting the position of the image capturing device (i.e. the three-dimensional coordinates of the image capturing device in the first pose) with the position of the pixel (i.e. the three-dimensional coordinates of the pixel), the first ray corresponding to the pixel can be obtained.
Sampling along the first ray yields a plurality of sampling points. For example, starting from the origin of the first ray (i.e., the position of the image acquisition device in the first pose), points are sampled at regular intervals, resulting in a plurality of sampling points. The number of sampling points can be set as desired, for example, 64, 238, 256, etc. It will be appreciated that the more sampling points are set, the more accurate the three-dimensional reconstruction generally is, but the lower the computational efficiency.
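As a concrete illustration of the ray construction and sampling described above, the following is a minimal sketch, assuming a pinhole camera model with a 3x3 intrinsics matrix K and a 4x4 camera-to-world pose matrix c2w; the function names, the near/far bounds, and the default sample count are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np

def get_rays(height, width, K, c2w):
    """Build one ray per pixel: K is the 3x3 intrinsics, c2w the 4x4 camera-to-world pose."""
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Pixel directions in camera coordinates under a pinhole model.
    dirs = np.stack([(i - K[0, 2]) / K[0, 0],
                     (j - K[1, 2]) / K[1, 1],
                     np.ones_like(i, dtype=np.float64)], axis=-1)
    # Rotate directions into world coordinates; the ray origin is the camera position.
    rays_d = dirs @ c2w[:3, :3].T
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d

def sample_points(rays_o, rays_d, near=0.0, far=6.0, n_samples=64):
    """Sample points at regular intervals along each ray, starting from the ray origin."""
    t_vals = np.linspace(near, far, n_samples)                         # (n_samples,)
    pts = rays_o[..., None, :] + rays_d[..., None, :] * t_vals[:, None]
    return pts, t_vals                                                 # (H, W, n_samples, 3), (n_samples,)
```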
And inputting the information of the plurality of first rays into the three-dimensional reconstruction model to obtain a first spatial field and a rendered image of the target scene in the first pose output by the three-dimensional reconstruction model.
The spatial fields of the object scene (comprising the first spatial field and the second spatial field) are a geometric representation of the object scene. The first spatial field of the object scene includes respective voxel densities of a plurality of sample points on each first ray. The voxel density of a sample point represents the probability that the corresponding first ray is terminated when passing through the sample point, with a higher probability representing a lower transparency of the sample point. For example, in city street view, the voxel density of a sample point located on the surface of a building is greater than the voxel density of a sample point located in the air in front of the building surface.
The rendered image includes predicted color values (e.g., RGB values) for a plurality of pixels.
According to some embodiments, the three-dimensional reconstruction model may be implemented as a MultiLayer Perceptron (MLP). In other embodiments, the three-dimensional reconstruction model may also be implemented as a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or the like.
According to some embodiments, the three-dimensional reconstruction model may comprise a geometric reconstruction module and a color reconstruction module. The geometric reconstruction module and the color reconstruction module may be implemented as MLP, CNN, DNN, etc., respectively.
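The following is a hedged sketch of such a model with a geometric reconstruction module and a color reconstruction module implemented as small MLPs; the layer widths, activations, and input/output interface are assumptions made for illustration, not the disclosure's architecture.

```python
import torch
import torch.nn as nn

class ReconstructionModel(nn.Module):
    """Geometric reconstruction module followed by a color reconstruction module (illustrative)."""
    def __init__(self, hidden=256):
        super().__init__()
        # Geometric reconstruction module: 3D sample point -> features + voxel density.
        self.geometry = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden + 1))
        # Color reconstruction module: features + ray direction -> RGB color value.
        self.color = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, points, directions):
        h = self.geometry(points)                 # (..., hidden + 1)
        sigma = torch.relu(h[..., 0])             # voxel density (spatial field)
        rgb = self.color(torch.cat([h[..., 1:], directions], dim=-1))
        return sigma, rgb
```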
According to some embodiments, the geometric reconstruction module and the color reconstruction module may be in series. The information of the plurality of first rays is input into the geometric reconstruction module, which outputs a first spatial field of the target scene in the first pose, i.e., the respective voxel densities of the plurality of sampling points on each first ray. Then, the information of the plurality of first rays and the first spatial field are input to the color reconstruction module, which outputs a color field, i.e., the color values (for example, RGB values) of the plurality of sampling points on each first ray. As described above, each first ray corresponds to a pixel. By integrating the color values of the plurality of sampling points on the corresponding first ray, the predicted color value of each pixel can be obtained, and thus the rendered image is obtained.
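Given the per-sample voxel densities and colors, the predicted pixel color can be obtained by compositing along each first ray. A minimal discretized sketch follows, assuming sample depths t_vals as produced above; the padding of the last interval and the small epsilon are numerical conventions, not values from the disclosure.

```python
import torch

def composite_color(sigmas, rgbs, t_vals):
    """sigmas: (..., N) voxel densities; rgbs: (..., N, 3) colors; t_vals: (N,) sample depths."""
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:1], 1e10)])   # pad the last interval
    alpha = 1.0 - torch.exp(-sigmas * deltas)                         # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]                                             # accumulated transmittance
    weights = alpha * trans                                           # (..., N)
    return (weights[..., None] * rgbs).sum(dim=-2)                    # predicted RGB per pixel
```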
Based on the differences of the rendered image and the sample image, a color loss of the three-dimensional reconstructed model may be determined.
According to some embodiments, the color loss may be the Mean Absolute Error (MAE), also referred to as the L1 loss, of the pixel values at corresponding positions of the rendered image and the sample image. According to other embodiments, the color loss may also be the Mean Squared Error (MSE), also known as the L2 loss, of the pixel values at corresponding positions of the rendered image and the sample image. It should be understood that loss functions other than the above-described L1 loss and L2 loss may be employed to calculate the color loss. The present disclosure does not limit the loss function used for the color loss.
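A minimal sketch of the color loss under either choice follows, assuming the rendered image and the sample image are floating-point tensors of the same shape; the function name is illustrative.

```python
import torch.nn.functional as F

def color_loss(rendered_rgb, sample_rgb, kind="l1"):
    """Mean absolute error (L1 loss) or mean squared error (L2 loss) between images."""
    if kind == "l1":
        return F.l1_loss(rendered_rgb, sample_rgb)
    return F.mse_loss(rendered_rgb, sample_rgb)
```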
In addition to color loss, embodiments of the present disclosure train a three-dimensional reconstruction model based on geometric loss. The geometric loss can add geometric constraint of the target scene under a new view angle (namely, the second pose) to the three-dimensional reconstruction model, so that the model accurately learns the spatial geometric information of the target scene, and the three-dimensional reconstruction effect is improved.
In an embodiment of the present disclosure, the second pose is generated based on the first pose. Specifically, there are various ways of generating the second pose.
According to some embodiments, a plurality of sample images of a target scene and a plurality of first poses of the image acquisition device corresponding to the plurality of sample images, respectively, may be acquired. A pose range of the image acquisition device is determined based on the plurality of first poses, and any pose within the pose range that is different from the plurality of first poses is determined as a second pose. According to this embodiment, random interpolation may be performed over the range of the first poses to generate the second pose. Because the second pose lies within the range of the first poses, the relevance between the first poses and the second pose is ensured, which improves the joint training effect based on the color loss and the geometric loss.
According to some embodiments, a center pose within the range of poses may be determined as the second pose. Therefore, the first pose and the second pose can be uniformly distributed, and the three-dimensional reconstruction model can be favorably used for fully learning the space geometric information of the target scene.
For example, suppose two sample images of the target scene are acquired, with corresponding first poses p_1 and p_2. Each first pose p_i (i = 1, 2) may include three-dimensional coordinates (x_i, y_i, z_i) and three attitude angles pitch_i, yaw_i, roll_i, i.e., p_i = (x_i, y_i, z_i, pitch_i, yaw_i, roll_i). Accordingly, the second pose p′ may be determined as p′ = ((x_1+x_2)/2, (y_1+y_2)/2, (z_1+z_2)/2, (pitch_1+pitch_2)/2, (yaw_1+yaw_2)/2, (roll_1+roll_2)/2).
For another example, suppose four sample images of the target scene are acquired, with corresponding first poses p_1, p_2, p_3, p_4. Each first pose p_i (i = 1, 2, 3, 4) may include three-dimensional coordinates (x_i, y_i, z_i) and three attitude angles pitch_i, yaw_i, roll_i, i.e., p_i = (x_i, y_i, z_i, pitch_i, yaw_i, roll_i). Based on the four first poses, the pose range is determined as x_1~x_2, y_4~y_2, z_3~z_1, pitch_2~pitch_1, yaw_4~yaw_3, roll_1~roll_2. Accordingly, the second pose may be determined as p′ = ((x_1+x_2)/2, (y_4+y_2)/2, (z_3+z_1)/2, (pitch_2+pitch_1)/2, (yaw_4+yaw_3)/2, (roll_1+roll_2)/2).
According to some embodiments, a plurality of second poses may be generated, the plurality of second poses being evenly distributed over a range of poses of the plurality of first poses. For example, two second poses may be set, at 1/3, 2/3 of the range of poses, respectively. For another example, three second poses may be set, which are located at 1/4, 2/4, and 3/4 of the range of poses, respectively.
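A hedged sketch of generating second poses inside the range of the first poses follows (matching the examples above), assuming each pose is stored as a 6-dimensional vector (x, y, z, pitch, yaw, roll); the helper name and the fraction values are illustrative.

```python
import numpy as np

def second_poses(first_poses, fractions=(0.5,)):
    """first_poses: (M, 6) array of (x, y, z, pitch, yaw, roll) rows; returns interpolated poses."""
    lo, hi = first_poses.min(axis=0), first_poses.max(axis=0)   # per-component pose range
    return np.stack([lo + f * (hi - lo) for f in fractions])

# Center pose:            second_poses(poses, fractions=(0.5,))
# Poses at 1/3 and 2/3:   second_poses(poses, fractions=(1/3, 2/3))
```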
It should be noted that the number of second poses should not be too large. A small number of second poses enables the three-dimensional reconstruction model to learn the spatial geometric information of the target scene more fully, and avoids biasing the learning by introducing overly strong geometric constraints, which could cause the target scene to be learned as a plane.
According to further embodiments, a perturbation may also be added to the first pose, thereby generating a second pose of the image capture device.
For example, gaussian noise may be added to the first pose to generate the second pose. The corresponding calculation formula is shown in the following formula (1).
p′ = p + Gauss(mean, std)    (1)
In formula (1), p and p′ denote the first pose and the second pose, respectively, and Gauss(mean, std) denotes Gaussian noise with mean mean and standard deviation std.
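A minimal sketch of formula (1) follows, assuming the pose is stored as a numeric array; the default mean and standard deviation are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def perturb_pose(first_pose, mean=0.0, std=0.05, rng=None):
    """Add Gaussian noise to the first pose to obtain a second pose (formula (1))."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=mean, scale=std, size=np.shape(first_pose))
    return np.asarray(first_pose) + noise
```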
After generating the second pose, a plurality of second rays of the virtual image may be generated based on the second pose. The virtual image is an image plane of the image capturing device in the second position, and the virtual image may be determined based on the second position and an internal parameter of the image capturing device. The plurality of second rays correspond to the plurality of pixels of the virtual image respectively, and each second ray comprises a plurality of sampling points.
According to some embodiments, each of the plurality of second rays is a ray directed by the image acquisition device having the second pose to a respective pixel in the virtual image. Specifically, based on the second pose, the intrinsic parameters of the image capturing device (including the focal length, the physical size of a pixel, the offset in pixels of the image center from the image origin, and the like), and the two-dimensional coordinates of each pixel of the virtual image within the virtual image, the position of each pixel of the virtual image in space, i.e., its three-dimensional coordinates, can be determined. Further, by connecting the position of the image capturing device (i.e. the three-dimensional coordinates of the image capturing device in the second pose) and the position of the pixel (i.e. the three-dimensional coordinates of the pixel), the second ray corresponding to the pixel can be obtained.
Sampling along the second ray yields a plurality of sampling points. For example, starting from the origin of the second ray (i.e., the position of the image acquisition device in the second pose), points are sampled at regular intervals, resulting in a plurality of sampling points. The number of sampling points can be set as desired, for example, 64, 238, 256, etc. It will be appreciated that the more sampling points are set, the more accurate the three-dimensional reconstruction generally is, but the lower the computational efficiency.
The information of the plurality of second rays is input into the three-dimensional reconstruction model to obtain a second spatial field of the target scene in the second pose output by the three-dimensional reconstruction model. It will be appreciated that the three-dimensional reconstruction model may also output a rendered image of the target scene in the second pose. However, since this rendered image does not have a corresponding real image, it cannot be used to calculate the color loss.
The second spatial field of the object scene includes respective voxel densities of the plurality of sample points on each of the second rays. The voxel density of a sample point represents the probability that the corresponding second ray will be terminated when passing through the sample point, with a higher probability representing a lower transparency of the sample point.
In an embodiment of the present disclosure, a geometric loss of the three-dimensional reconstructed model is determined based at least on the second spatial field. The geometric loss can add geometric constraint of the target scene under a new view angle (namely, a second pose) to the three-dimensional reconstruction model, so that the model accurately learns the spatial geometric information of the target scene, and the three-dimensional reconstruction effect is improved.
In embodiments of the present disclosure, geometric constraints mean that depth values (i.e., distances of three-dimensional points to a camera) in a three-dimensional scene are generally smooth, with depth values of neighboring three-dimensional points being approximately the same. Accordingly, the depth values of neighboring pixels are also substantially the same in the two-dimensional image corresponding to the three-dimensional scene.
Based on the above geometric constraints, according to some embodiments, the geometric loss can be calculated based on the second spatial field only, thereby improving computational efficiency. Specifically, for any second ray of the plurality of second rays: a depth value corresponding to the second ray is determined based on the voxel density of each of the plurality of sampling points on the second ray; and the geometric loss is determined based on the difference in the depth values of the second rays within different neighborhoods of the second ray.
According to some embodiments, the depth value corresponding to the second ray may be obtained by integrating the voxel density of each of the plurality of sampling points on the second ray. The depth value of the second ray may be calculated by, for example, the following equation (2):
D(t) = ∫_{t_n}^{t_f} exp( −∫_{t_n}^{t} σ(r(s)) ds ) · σ(r(t)) · t dt    (2)
In equation (2), D(t) represents the depth value corresponding to the second ray, exp() represents the exponential function with the natural constant e as its base, r(t) represents the t-th sampling point, σ(r(t)) represents the voxel density of the t-th sampling point, and t_n and t_f represent the start point and the end point of the integration interval on the second ray, respectively. t_n may be taken, for example, as the origin of the second ray (i.e., the position of the image acquisition device, which is typically also the 0-th sampling point), and t_f may be taken, for example, as positive infinity (∞) or the last sampling point.
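A hedged, discretized sketch of equation (2) follows, approximating the integrals by sums over the sampling points; the helper name and the padding of the last interval are assumptions.

```python
import torch

def ray_depth(sigmas, t_vals):
    """sigmas: (..., N) voxel densities along a ray; t_vals: (N,) sample depths."""
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:1], 1e10)])
    alpha = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = alpha * trans
    return (weights * t_vals).sum(dim=-1)          # expected termination depth of the ray
```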
According to some embodiments, for any one second ray, a first neighborhood and a second neighborhood of the second ray may be determined, the first neighborhood and the second neighborhood being the same size (e.g. 4 x 4 each), i.e. both comprising the same number of second rays, and the second rays in the first neighborhood corresponding to the second rays in the second neighborhood, respectively. In particular, since each second ray corresponds to a pixel, a first pixel neighborhood and a second pixel neighborhood of the pixel may be determined, the first pixel neighborhood and the second pixel neighborhood being the same size, e.g., 4 x 4 each. The second ray corresponding to each pixel in the first pixel neighborhood is the second ray included in the first neighborhood, and the second ray corresponding to each pixel in the second pixel neighborhood is the second ray included in the second neighborhood.
According to some embodiments, the absolute value of the difference of the depth values of the second rays at the respective positions of the first neighbourhood and the second neighbourhood may be calculated, and then the sum or average of the absolute values corresponding to the respective positions may be taken as the geometric loss. For example, the absolute value of the difference in depth values of the second rays at the respective positions of the first neighbourhood and the second neighbourhood may be calculated according to the following formula (3):
|ΔP_ij| = |P_ij1 − P_ij2|    (3)
In formula (3), P_ij1 represents the matrix of depth values of the second rays in the first neighborhood of the second ray corresponding to pixel ij, P_ij2 represents the matrix of depth values of the second rays in the second neighborhood of the second ray corresponding to pixel ij, and |ΔP_ij| represents the matrix of the absolute values of the differences of the depth values of the second rays at corresponding positions of the first neighborhood and the second neighborhood.
According to further embodiments, the squares of the differences of the depth values of the second rays at the respective positions of the first neighbourhood and the second neighbourhood may also be calculated, and then the sum or average of the squares corresponding to the respective positions may be taken as the geometric loss.
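A minimal sketch of this geometric loss follows, assuming the depth values of the second rays in the two neighborhoods have been arranged into equally sized matrices (e.g., 4 x 4) as described above; the helper name and the choice of averaging are assumptions.

```python
import torch

def geometric_loss(depth_patch_1, depth_patch_2, squared=False):
    """depth_patch_1, depth_patch_2: equally sized matrices of depth values (P_ij1, P_ij2)."""
    diff = depth_patch_1 - depth_patch_2
    return (diff ** 2).mean() if squared else diff.abs().mean()
```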
According to further embodiments, a geometric loss of the three-dimensional reconstructed model may also be determined based on the first spatial field and the second spatial field. Therefore, the three-dimensional reconstruction model can more fully learn the space geometric information of the target scene, and the three-dimensional reconstruction effect is improved.
Specifically, for any ray of the plurality of first rays and the plurality of second rays: a depth value corresponding to the ray is determined based on the voxel density of each of the plurality of sampling points on the ray; and the geometric loss is determined based on the difference in depth values of the rays within different neighborhoods of the ray.
The depth value corresponding to the ray (including the first ray and the second ray) can be obtained by integrating the voxel density of each of the plurality of sampling points on the ray. The specific calculation method of the depth value may refer to the above formula (2), and the specific calculation method of the geometric loss may refer to the above formula (3), which is not described herein again.
Based on the color loss and the geometric loss, the overall loss of the three-dimensional reconstructed model can be determined. According to some embodiments, the overall loss may be a weighted sum of the color loss and the geometric loss. Based on the overall loss, an algorithm such as back propagation may be employed to adjust the parameters of the three-dimensional reconstructed model.
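A hedged sketch of one parameter-update step follows, assuming the color loss and the geometric loss have already been computed as differentiable tensors; the weight lambda_geo is an assumed hyperparameter rather than a value from the disclosure.

```python
def train_step(optimizer, color_loss_val, geometric_loss_val, lambda_geo=0.1):
    """Weighted sum of the two losses followed by backpropagation."""
    loss = color_loss_val + lambda_geo * geometric_loss_val
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```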
Based on the trained three-dimensional reconstruction model, the embodiment of the disclosure also provides a three-dimensional scene rendering method. Fig. 3 shows a flow diagram of a three-dimensional scene rendering method 300 according to an embodiment of the present disclosure. The method 300 may be performed, for example, by the server 120 shown in fig. 1. As shown in FIG. 3, the method 300 includes steps S310-S320.
In step S310, a three-dimensional reconstruction model for a target scene and an observation pose of the target scene are obtained, where the three-dimensional reconstruction model is obtained by training based on a training method of the three-dimensional reconstruction model according to the embodiment of the present disclosure.
In step S320, a rendered image of the target scene at the observation pose is generated based on the three-dimensional reconstruction model and the observation pose.
According to the embodiment of the disclosure, the trained three-dimensional reconstruction model is adopted to generate the rendering image of the target scene under the specified visual angle (namely the observation pose), so that the accuracy of the rendering image can be improved, the rendering image can accurately express the space geometry of the target scene under the new visual angle, and the distortion is avoided.
According to some embodiments, a three-dimensional scene rendering request of a user may be received, the three-dimensional scene rendering request including a target scene to be rendered and an observation pose of the target scene specified by the user.
The observation pose includes, for example, the position (expressed in three-dimensional coordinates) and attitude angle (including pitch angle, yaw angle, and roll angle) of the observation target scene. And inputting the observation pose into the trained three-dimensional reconstruction model to obtain a rendered image of the target scene output by the three-dimensional reconstruction model under the observation pose.
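A hedged end-to-end sketch of rendering the target scene at a user-specified observation pose follows, reusing the illustrative helpers sketched earlier (get_rays, sample_points, composite_color) and the assumed model interface; it is not the disclosure's exact procedure.

```python
import numpy as np
import torch

def render_view(model, K, observe_c2w, height, width):
    """Render the target scene at the observation pose given by the 4x4 matrix observe_c2w."""
    rays_o, rays_d = get_rays(height, width, K, observe_c2w)
    pts, t_vals = sample_points(rays_o, rays_d)
    pts_t = torch.as_tensor(pts, dtype=torch.float32)
    dirs_t = torch.as_tensor(np.broadcast_to(rays_d[..., None, :], pts.shape).copy(),
                             dtype=torch.float32)
    with torch.no_grad():
        sigmas, rgbs = model(pts_t, dirs_t)          # assumed model interface
        image = composite_color(sigmas, rgbs,
                                torch.as_tensor(t_vals, dtype=torch.float32))
    return image.numpy()                             # (H, W, 3) rendered image
```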
According to the embodiment of the disclosure, a training device of the three-dimensional reconstruction model is also provided.
Fig. 4 shows a block diagram of a training apparatus 400 for three-dimensional reconstruction model according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes an obtaining module 410, a first determining module 420, a first outputting module 430, a first loss module 440, a generating module 450, a second determining module 460, a second outputting module 470, a second loss module 480, and an adjusting module 490.
The acquisition module 410 is configured to acquire a sample image of a target scene and a first pose of an image acquisition device when acquiring the sample image.
The first determination module 420 is configured to determine a plurality of first rays of the sample image based on the first pose, wherein the plurality of first rays correspond to a plurality of pixels of the sample image, respectively.
The first output module 430 is configured to input information of the first rays into a three-dimensional reconstruction model to obtain a first spatial field of the target scene and a rendered image output by the three-dimensional reconstruction model, where the first spatial field includes respective voxel densities of a plurality of sampling points on each first ray.
The first loss module 440 is configured to determine a color loss based on a difference of the rendered image and the sample image.
The generation module 450 is configured to generate a second pose of the image acquisition device based on the first pose.
The second determining module 460 is configured to determine a plurality of second rays of a virtual image based on the second pose, wherein the virtual image is an image plane of the image capturing device in the second pose, and the plurality of second rays respectively correspond to a plurality of pixels of the virtual image.
The second output module 470 is configured to input the information of the plurality of second rays into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model, wherein the second spatial field includes respective voxel densities of the plurality of sampling points on each second ray.
The second loss module 480 is configured to determine a geometric loss based at least on the second spatial field.
The adjusting module 490 is configured to adjust parameters of the three-dimensional reconstruction model based on the color loss and the geometric loss.
According to an embodiment of the present disclosure, the three-dimensional reconstruction model is trained with both a color loss and a geometric loss. The color loss ensures that the generated rendered image is consistent with the colors of the real image. The geometric loss adds a geometric constraint on the target scene at a new viewing angle (i.e., the second pose), so that the three-dimensional reconstruction model accurately learns the spatial geometric information of the target scene, improving the accuracy of the three-dimensional reconstruction result. High-quality three-dimensional reconstruction can thus be achieved even with a small number of sample images.
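As a minimal sketch of how such a joint objective might be optimized (the model interface, the loss weight lambda_geo, and the function names are assumptions of this illustration rather than the claimed implementation), a single training step in Python/PyTorch could look as follows:

import torch

def training_step(model, optimizer, first_rays, sample_pixels,
                  second_rays, geometric_loss_fn, lambda_geo=0.1):
    # One illustrative parameter update combining the color loss on the
    # sample view with the geometric loss on the virtual (second-pose) view.
    optimizer.zero_grad()
    rendered_pixels, first_field = model(first_rays)       # rendered colors + voxel densities
    color_loss = torch.mean((rendered_pixels - sample_pixels) ** 2)
    second_field = model.density(second_rays)               # densities only, for the virtual view
    geometric_loss = geometric_loss_fn(second_field)
    total_loss = color_loss + lambda_geo * geometric_loss   # weighting is an assumption
    total_loss.backward()
    optimizer.step()
    return total_loss.item()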
According to some embodiments, each first ray of the plurality of first rays is a ray directed by the image acquisition device having the first pose to a respective pixel in the sample image; and each second ray of the plurality of second rays is a ray directed by the image acquisition device having the second pose to a corresponding pixel in the virtual image.
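For illustration only, rays of this kind can be generated with a standard pinhole-camera construction: each pixel defines a direction through the camera center given the pose. The focal-length parameter, the axis convention, and the function name below are assumptions of this sketch:

import numpy as np

def rays_for_pose(pose_c2w, height, width, focal):
    # One ray (origin, direction) per pixel for a pinhole camera whose
    # camera-to-world transform is the 4x4 matrix pose_c2w.
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    # Per-pixel directions in camera coordinates (camera looks along -z here).
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    # Rotate the directions into world coordinates; every ray starts at the
    # camera center encoded in the pose translation.
    ray_directions = dirs @ pose_c2w[:3, :3].T
    ray_origins = np.broadcast_to(pose_c2w[:3, 3], ray_directions.shape)
    return ray_origins, ray_directions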
According to some embodiments, the acquisition module 410 is further configured to acquire a plurality of sample images of the target scene and a plurality of first poses of the image acquisition device respectively corresponding to the plurality of sample images; and the generation module 450 includes: a first determination unit configured to determine a pose range of the image acquisition device based on the plurality of first poses; and a second determination unit configured to determine any pose within the pose range that is different from the plurality of first poses as the second pose.
According to some embodiments, the second determination unit is further configured to determine a center pose within the pose range as the second pose.
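One possible reading of these embodiments, sketched below purely for illustration, represents each pose as a 6-vector [x, y, z, pitch, yaw, roll], takes the per-component range spanned by the first poses, and returns its midpoint as the second pose; the vector representation and the midpoint rule are assumptions, and in practice the result would be checked to differ from every first pose:

import numpy as np

def center_pose(first_poses):
    # first_poses: array-like of shape (N, 6), one [x, y, z, pitch, yaw, roll]
    # row per sample image. Returns the center of the spanned pose range.
    poses = np.asarray(first_poses, dtype=np.float64)
    low = poses.min(axis=0)
    high = poses.max(axis=0)
    return (low + high) / 2.0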
According to some embodiments, the second loss module 480 comprises: a third determining unit, configured to determine, for any second ray in the plurality of second rays, a depth value corresponding to the second ray based on a voxel density of each of a plurality of sampling points on the second ray; and a fourth determination unit configured to determine, for any second ray of the plurality of second rays, the geometric loss based on a difference in depth values of the second rays within different neighborhoods of the second ray.
According to some embodiments, the second loss module 480 is further configured to determine a geometric loss of the three-dimensional reconstruction model based on the first spatial field and the second spatial field.
According to some embodiments, the second loss module 480 comprises: a fifth determining unit configured to determine, for any one of the plurality of first rays and the plurality of second rays, a depth value corresponding to the ray based on a voxel density of each of a plurality of sampling points on the ray; and a sixth determining unit configured to determine, for any one of the plurality of first rays and the plurality of second rays, the geometric loss based on a difference in depth values of rays within different neighborhoods of the ray.
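These two quantities can be made concrete with a short sketch: the depth of a ray may be taken as the expected termination distance under volume-rendering weights derived from the voxel densities, and the geometric loss may penalize depth differences between a ray and the rays in its pixel neighborhood. The interval handling, the neighborhood choice (right and lower neighbors), and the squared-difference penalty below are assumptions of this illustration, written in Python/PyTorch:

import torch

def ray_depths(densities, t_values):
    # densities, t_values: tensors of shape (num_rays, num_samples).
    # Converts per-sample voxel densities along each ray into one depth value
    # using volume-rendering (alpha-compositing) weights.
    deltas = t_values[:, 1:] - t_values[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-densities * deltas)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * transmittance
    return (weights * t_values).sum(dim=-1)

def neighborhood_geometric_loss(depths, height, width):
    # Penalizes differences between the depth of each ray and the depths of
    # its neighboring rays (right and lower pixel neighbors).
    depth_map = depths.reshape(height, width)
    dx = depth_map[:, 1:] - depth_map[:, :-1]
    dy = depth_map[1:, :] - depth_map[:-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()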
According to embodiments of the present disclosure, a three-dimensional scene rendering apparatus is also provided.
Fig. 5 shows a block diagram of a three-dimensional scene rendering apparatus 500 according to an embodiment of the present disclosure. As depicted in fig. 5, the apparatus 500 includes an obtaining module 510 and a generating module 520.
The obtaining module 510 is configured to obtain a three-dimensional reconstruction model for a target scene and an observation pose of the target scene, wherein the three-dimensional reconstruction model is trained based on a training apparatus of the three-dimensional reconstruction model of the embodiment of the present disclosure; and
the generating module 520 is configured to generate a rendered image of the target scene at the observation pose based on the three-dimensional reconstruction model and the observation pose.
According to embodiments of the present disclosure, the trained three-dimensional reconstruction model is used to generate a rendered image of the target scene at the specified viewing angle (i.e., the observation pose). This improves the accuracy of the rendered image, allowing it to accurately express the spatial geometry of the target scene at the new viewing angle and avoiding distortion.
It should be understood that the various modules or units of the apparatus 400 shown in fig. 4 may correspond to the various steps in the method 200 described with reference to fig. 2, and the various modules or units of the apparatus 500 shown in fig. 5 may correspond to the various steps in the method 300 described with reference to fig. 3. Thus, the operations, features and advantages described above with respect to method 200 are equally applicable to apparatus 400 and the modules and units included therein, and the operations, features and advantages described above with respect to method 300 are equally applicable to apparatus 500 and the modules and units included therein. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein can be separated into multiple modules and/or at least some of the functionality of multiple modules can be combined into a single module.
It should also be appreciated that various techniques may be described herein in the general context of software and hardware elements or program modules. The various modules described above with respect to figs. 4 and 5 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed by one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the modules 410-520 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an embodiment of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the training method of the three-dimensional reconstruction model and/or the three-dimensional scene rendering method according to embodiments of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a training method of a three-dimensional reconstruction model and/or a three-dimensional scene rendering method of an embodiment of the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, a computer program product, including a computer program, which when executed by a processor, implements a method of training a three-dimensional reconstruction model and/or a method of rendering a three-dimensional scene of an embodiment of the present disclosure.
Referring to fig. 6, a block diagram of an electronic device 600, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computing devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store various programs and data necessary for the operation of the electronic device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a Wi-Fi device, a WiMAX device, a cellular communication device, and/or the like.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the method 200 and/or the method 300. For example, in some embodiments, the method 200 and/or the method 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method 200 and/or the method 300 described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method 200 and/or the method 300 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), the Internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
While embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely illustrative embodiments or examples, and that the scope of the disclosure is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (20)

1. A training method of a three-dimensional reconstruction model comprises the following steps:
acquiring a sample image of a target scene and a first pose of an image acquisition device when acquiring the sample image;
determining a plurality of first rays of the sample image based on the first pose, wherein the plurality of first rays correspond to a plurality of pixels of the sample image respectively;
inputting information of the plurality of first rays into a three-dimensional reconstruction model to obtain a first spatial field and a rendered image of the target scene output by the three-dimensional reconstruction model, wherein the first spatial field comprises respective voxel densities of a plurality of sampling points on each first ray;
determining a color loss based on a difference of the rendered image and the sample image;
generating a second pose of the image acquisition device based on the first pose;
determining a plurality of second rays of a virtual image based on the second pose, wherein the virtual image is an image plane of the image acquisition device in the second pose, and the plurality of second rays correspond to a plurality of pixels of the virtual image respectively;
inputting information of the plurality of second rays into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model, wherein the second spatial field comprises respective voxel densities of a plurality of sampling points on each second ray;
determining a geometric loss based on at least the second spatial field; and
adjusting parameters of the three-dimensional reconstruction model based on the color loss and the geometric loss.
2. The method of claim 1, wherein each first ray of the plurality of first rays is a ray directed by an image acquisition device having the first pose to a respective pixel in the sample image;
each of the plurality of second rays is a ray directed by the image acquisition device having the second pose to a corresponding pixel in the virtual image.
3. The method of claim 1 or 2, wherein the acquiring a sample image of a target scene and a first pose of an image acquisition device when acquiring the sample image comprises:
acquiring a plurality of sample images of the target scene and a plurality of first poses of the image acquisition device respectively corresponding to the plurality of sample images;
and wherein the generating a second pose of the image acquisition device based on the first pose comprises:
determining a pose range of the image acquisition device based on the plurality of first poses; and
determining any pose within the range of poses other than the plurality of first poses as the second pose.
4. The method of claim 3, wherein the determining any pose within the range of poses other than the plurality of first poses as the second pose comprises:
determining a center pose within the range of poses as the second pose.
5. The method of claim 1 or 2, wherein the generating a second pose of the image acquisition device based on the first pose comprises:
adding a perturbation to the first pose to generate the second pose of the image acquisition device.
6. The method according to any one of claims 1-5, wherein the determining a geometric loss of the three-dimensional reconstruction model based on at least the second spatial field comprises:
for any second ray of the plurality of second rays:
determining a depth value corresponding to the second ray based on the voxel density of each of the plurality of sampling points on the second ray; and
determining the geometric loss based on a difference in depth values of second rays within different neighborhoods of the second ray.
7. The method according to any one of claims 1-5, wherein the determining a geometric loss of the three-dimensional reconstruction model based on at least the second spatial field comprises:
determining a geometric loss of the three-dimensional reconstruction model based on the first spatial field and the second spatial field.
8. The method of claim 7, wherein the determining a geometric loss of the three-dimensional reconstruction model based on the first spatial field and the second spatial field comprises:
for any ray of the plurality of first rays and the plurality of second rays:
determining a depth value corresponding to the ray based on the voxel density of each of the plurality of sampling points on the ray; and
determining the geometric loss based on a difference in depth values of rays within different neighborhoods of the ray.
9. A method of three-dimensional scene rendering, comprising:
acquiring a three-dimensional reconstruction model for a target scene and an observation pose of the target scene, wherein the three-dimensional reconstruction model is obtained by training based on the method of any one of claims 1-8; and
generating a rendered image of the target scene at the observation pose based on the three-dimensional reconstruction model and the observation pose.
10. A training apparatus for a three-dimensional reconstruction model, comprising:
an acquisition module configured to acquire a sample image of a target scene and a first pose of an image acquisition device when acquiring the sample image;
a first determination module configured to determine a plurality of first rays of the sample image based on the first pose, wherein the plurality of first rays correspond to a plurality of pixels of the sample image, respectively;
a first output module configured to input information of the plurality of first rays into a three-dimensional reconstruction model to obtain a first spatial field and a rendered image of the target scene output by the three-dimensional reconstruction model, wherein the first spatial field comprises respective voxel densities of a plurality of sampling points on each first ray;
a first loss module configured to determine a color loss based on a difference of the rendered image and the sample image;
a generation module configured to generate a second pose of the image acquisition device based on the first pose;
a second determining module configured to determine a plurality of second rays of a virtual image based on the second pose, wherein the virtual image is an image plane of the image acquisition device in the second pose, and the plurality of second rays correspond to a plurality of pixels of the virtual image respectively;
a second output module configured to input information of the plurality of second rays into the three-dimensional reconstruction model to obtain a second spatial field of the target scene output by the three-dimensional reconstruction model, wherein the second spatial field includes respective voxel densities of the plurality of sampling points on each second ray;
a second loss module configured to determine a geometric loss based at least on the second spatial field; and
an adjustment module configured to adjust parameters of the three-dimensional reconstruction model based on the color loss and the geometric loss.
11. The apparatus of claim 10, wherein each first ray of the plurality of first rays is a ray directed by an image acquisition device having the first pose to a respective pixel in the sample image;
each of the plurality of second rays is a ray directed by the image acquisition device having the second pose to a corresponding pixel in the virtual image.
12. The apparatus of claim 10 or 11, wherein the acquisition module is further configured to acquire a plurality of sample images of the target scene and a plurality of first poses of the image acquisition device respectively corresponding to the plurality of sample images;
and wherein the generating module comprises:
a first determination unit configured to determine a pose range of the image acquisition device based on the plurality of first poses; and
a second determination unit configured to determine any one of the poses within the pose range that is different from the plurality of first poses as the second pose.
13. The apparatus according to claim 12, wherein the second determination unit is further configured to determine a center pose within the pose range as the second pose.
14. The apparatus of any of claims 10-13, wherein the second loss module comprises:
a third determining unit, configured to determine, for any second ray in the plurality of second rays, a depth value corresponding to the second ray based on a voxel density of each of a plurality of sampling points on the second ray; and
a fourth determination unit configured to determine, for any second ray of the plurality of second rays, the geometric loss based on a difference in depth values of the second rays within different neighborhoods of the second ray.
15. The apparatus of any one of claims 10-13, wherein the second loss module is further configured to determine a geometric loss of the three-dimensional reconstruction model based on the first and second spatial fields.
16. The apparatus of claim 15, wherein the second loss module comprises:
a fifth determining unit configured to determine, for any one of the plurality of first rays and the plurality of second rays, a depth value corresponding to the ray based on a voxel density of each of a plurality of sampling points on the ray; and
a sixth determining unit configured to determine, for any one of the plurality of first rays and the plurality of second rays, the geometric loss based on a difference in depth values of rays within different neighborhoods of the ray.
17. A three-dimensional scene rendering apparatus comprising:
an obtaining module configured to obtain a three-dimensional reconstruction model for a target scene and an observation pose of the target scene, wherein the three-dimensional reconstruction model is trained based on the apparatus of any one of claims 10-16; and
a generating module configured to generate a rendered image of the target scene at the observation pose based on the three-dimensional reconstruction model and the observation pose.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
20. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-9 when executed by a processor.
CN202211216645.XA 2022-09-30 2022-09-30 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device Active CN115578515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211216645.XA CN115578515B (en) 2022-09-30 2022-09-30 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211216645.XA CN115578515B (en) 2022-09-30 2022-09-30 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Publications (2)

Publication Number Publication Date
CN115578515A true CN115578515A (en) 2023-01-06
CN115578515B CN115578515B (en) 2023-08-11

Family

ID=84582142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211216645.XA Active CN115578515B (en) 2022-09-30 2022-09-30 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Country Status (1)

Country Link
CN (1) CN115578515B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116246026A (en) * 2023-05-05 2023-06-09 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489683B1 (en) * 2018-12-17 2019-11-26 Bodygram, Inc. Methods and systems for automatic generation of massive training data sets from 3D models for training deep learning networks
CN112613609A (en) * 2020-12-18 2021-04-06 中山大学 Nerve radiation field enhancement method based on joint pose optimization
CN112639846A (en) * 2021-02-07 2021-04-09 华为技术有限公司 Method and device for training deep learning model
WO2021175050A1 (en) * 2020-03-04 2021-09-10 华为技术有限公司 Three-dimensional reconstruction method and three-dimensional reconstruction device
CN114004941A (en) * 2022-01-04 2022-02-01 苏州浪潮智能科技有限公司 Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
CN114119838A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Voxel model and image generation method, equipment and storage medium
CN114119849A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device and storage medium
US20220122289A1 (en) * 2020-10-16 2022-04-21 Verizon Patent And Licensing Inc. Methods and Systems for Volumetric Modeling Independent of Depth Data
CN114494574A (en) * 2021-12-21 2022-05-13 武汉中海庭数据技术有限公司 Deep learning monocular three-dimensional reconstruction method and system based on multi-loss function constraint
WO2022198684A1 (en) * 2021-03-26 2022-09-29 Shanghaitech University Methods and systems for training quantized neural radiance field

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489683B1 (en) * 2018-12-17 2019-11-26 Bodygram, Inc. Methods and systems for automatic generation of massive training data sets from 3D models for training deep learning networks
WO2021175050A1 (en) * 2020-03-04 2021-09-10 华为技术有限公司 Three-dimensional reconstruction method and three-dimensional reconstruction device
US20220122289A1 (en) * 2020-10-16 2022-04-21 Verizon Patent And Licensing Inc. Methods and Systems for Volumetric Modeling Independent of Depth Data
CN112613609A (en) * 2020-12-18 2021-04-06 中山大学 Nerve radiation field enhancement method based on joint pose optimization
CN112639846A (en) * 2021-02-07 2021-04-09 华为技术有限公司 Method and device for training deep learning model
WO2022198684A1 (en) * 2021-03-26 2022-09-29 Shanghaitech University Methods and systems for training quantized neural radiance field
CN114494574A (en) * 2021-12-21 2022-05-13 武汉中海庭数据技术有限公司 Deep learning monocular three-dimensional reconstruction method and system based on multi-loss function constraint
CN114004941A (en) * 2022-01-04 2022-02-01 苏州浪潮智能科技有限公司 Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
CN114119838A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Voxel model and image generation method, equipment and storage medium
CN114119849A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KANGLE DENG ET AL: "Depth-supervised NeRF: Fewer Views and Faster Training for Free", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pages 3 *
THIEMO ALLDIECK ET AL: "Photorealistic Monocular 3D Reconstruction of Humans Wearing Clothing", ARXIV, pages 1 - 14 *
王钊: "Research on Three-Dimensional Reconstruction Based on a Single View", China Master's Theses Full-text Database, Information Science and Technology, no. 7, pages 138-1239 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116246026A (en) * 2023-05-05 2023-06-09 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN116246026B (en) * 2023-05-05 2023-08-08 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Also Published As

Publication number Publication date
CN115578515B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN115147558B (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method and device
CN115631418B (en) Image processing method and device and training method of nerve radiation field
CN112733820B (en) Obstacle information generation method and device, electronic equipment and computer readable medium
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN116051729B (en) Three-dimensional content generation method and device and electronic equipment
CN115578433B (en) Image processing method, device, electronic equipment and storage medium
CN115690382B (en) Training method of deep learning model, and method and device for generating panorama
CN115482325B (en) Picture rendering method, device, system, equipment and medium
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN115239888B (en) Method, device, electronic equipment and medium for reconstructing three-dimensional face image
CN117274491A (en) Training method, device, equipment and medium for three-dimensional reconstruction model
CN113129352A (en) Sparse light field reconstruction method and device
CN115578515B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN114998433A (en) Pose calculation method and device, storage medium and electronic equipment
CN116245998B (en) Rendering map generation method and device, and model training method and device
CN115393514A (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method, device and equipment
CN116246026B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN114049472A (en) Three-dimensional model adjustment method, device, electronic apparatus, and medium
CN115797455B (en) Target detection method, device, electronic equipment and storage medium
CN115578432B (en) Image processing method, device, electronic equipment and storage medium
CN115511779B (en) Image detection method, device, electronic equipment and storage medium
CN114820908B (en) Virtual image generation method and device, electronic equipment and storage medium
CN116580212B (en) Image generation method, training method, device and equipment of image generation model
CN115331077B (en) Training method of feature extraction model, target classification method, device and equipment
CN116385643B (en) Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant