CN116993924B - Three-dimensional scene modeling method and device, storage medium and computer equipment - Google Patents


Info

Publication number
CN116993924B
CN116993924B, CN116993924A, CN202311234506.4A
Authority
CN
China
Prior art keywords
foreground
scene
model
dimensional
background
Prior art date
Legal status
Active
Application number
CN202311234506.4A
Other languages
Chinese (zh)
Other versions
CN116993924A (en)
Inventor
方顺
孙思远
冯星
崔铭
杨峰峰
韦建伟
胡梓楠
乔磊
张造时
汪成峰
穆子杰
刘锦
王月
熊宏康
房超
李荣华
单仝
张志恒
Current Assignee
Beijing Xuanguang Technology Co ltd
Original Assignee
Beijing Xuanguang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xuanguang Technology Co ltd filed Critical Beijing Xuanguang Technology Co ltd
Priority to CN202311234506.4A
Publication of CN116993924A
Application granted
Publication of CN116993924B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping

Abstract

The application discloses a three-dimensional scene modeling method and device, a storage medium and computer equipment. The method comprises the following steps: performing foreground clipping on a scene picture to obtain at least one foreground clipping image; determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping images and the scene picture; predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector; predicting the local pose of the foreground object corresponding to each foreground clipping image, and determining the global pose of each foreground object based on the position of its foreground clipping image in the scene picture and its local pose; and performing pose transformation on the foreground object model data according to the foreground object global poses, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the pose-transformed foreground object model data.

Description

Three-dimensional scene modeling method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of three-dimensional modeling technologies, and in particular, to a three-dimensional scene modeling method and apparatus, a storage medium, and a computer device.
Background
Three-dimensional scene reconstruction faces the problems that the objects in a scene are numerous and occlude one another, and that different objects differ in importance; the objects therefore not only need to be reconstructed in three dimensions, their poses in the scene also need to be restored. Automatically modeling a complex indoor scene from a single view is difficult, but it can greatly improve the efficiency of 3D content creation in many application fields, such as games, film and television animation, and the metaverse, digital humans and digital twins, and is therefore of great significance.
In the field of computer vision, layout recognition and three-dimensional reconstruction of complex indoor scenes have long been an important but challenging problem.
Disclosure of Invention
In view of the above, the application provides a three-dimensional scene modeling method and device, a storage medium and computer equipment. Three-dimensional scene reconstruction is divided into two parts, foreground modeling and background modeling. For the foreground part, which has high precision requirements, local pose estimation is performed using the foreground clipping images, and the global pose of each foreground object is then determined by combining global information, so that the pose of the foreground object in the three-dimensional scene is restored. Accurate modeling of the three-dimensional scene can thus be achieved from a single scene picture, which is simple, convenient and highly practical.
According to one aspect of the present application, there is provided a three-dimensional scene modeling method, the method comprising:
acquiring a scene picture to be modeled;
the three-dimensional scene model data corresponding to the scene picture are obtained by executing the following steps through a pre-trained three-dimensional scene modeling model:
performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector;
predicting the local gesture of the foreground object corresponding to each foreground clipping image respectively, and determining the global gesture of the foreground object based on the position of the foreground clipping image in the scene image and the local gesture of the foreground object;
and performing pose transformation on the foreground object model data according to the foreground object global pose, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the foreground object model data subjected to the pose transformation.
Optionally, the determining, based on the foreground clipping map and the scene picture, a foreground feature vector and a background feature vector corresponding to the scene picture includes:
respectively extracting foreground feature vectors corresponding to each foreground clipping image through a pre-trained foreground feature extraction network;
and extracting panoramic feature vectors corresponding to the scene pictures through a pre-trained panoramic feature extraction network, subtracting foreground feature parts from the panoramic feature vectors based on foreground object bounding box coordinates corresponding to the foreground clipping pictures in the scene pictures, and obtaining background feature vectors corresponding to the scene pictures.
Optionally, the determining the global foreground object pose based on the position of the foreground clipping map in the scene picture and the local foreground object pose includes:
based on the position of the foreground clipping image in the scene image, determining translation data of a clipping camera relative to an origin of the scene image, and determining weak perspective camera projection parameters corresponding to the translation data;
and transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global posture of the foreground object based on the perspective camera projection parameters and the local posture of the foreground object.
Optionally, the transforming the weak perspective projection parameter into a perspective camera projection parameter, and determining the foreground object global pose based on the perspective camera projection parameter and the foreground object local pose, includes:
transforming the weak perspective projection parameters into perspective camera projection parameters P_persp according to a transformation formula from the weak perspective camera to the perspective camera, wherein the transformation formula from the weak perspective camera to the perspective camera is:
P_persp = [ t_x, t_y, 2f / (s·b) ]
wherein t_x and t_y respectively represent the translation data of the cropping camera along the X axis and the Y axis with the center point of the scene picture as the origin, b represents the side length of the foreground clipping image, s represents the scaling parameter, and f represents the cropping camera focal length;
calculating the perspective camera projection parameters and the foreground object local pose according to a global pose estimation formula to obtain the foreground object global pose R_global, wherein the global pose estimation formula is:
R_global = R(θ) · R_local
wherein θ is the transformation angle of the cropping camera relative to the original camera, R_local is the foreground object local pose, and (c_x, c_y) represents the center point coordinates of the foreground clipping image with the center point of the scene picture as the origin, from which θ is determined together with the focal length.
Optionally, before the obtaining the scene picture to be modeled, the method further includes:
Establishing a three-dimensional scene modeling model, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module;
constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a gesture prediction loss function corresponding to the local gesture prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function;
and training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on the two-dimensional scene picture.
Optionally, the foreground model prediction network is configured to predict whether a three-dimensional model foreground sampling point corresponding to each foreground pixel point in the input foreground feature vector belongs to a scene to be modeled, and the prediction value corresponding to each foreground pixel point is used to represent a probability that the corresponding three-dimensional model foreground sampling point belongs to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model foreground sampling point;
The background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to all background pixel points in the input background feature vector belong to a scene to be modeled or not, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model background sampling point;
the gesture prediction loss function is used for calculating a loss value between orthogonal projection data corresponding to the global gesture of the foreground object output by the global gesture prediction module and real orthogonal projection data corresponding to the foreground object sample in the three-dimensional model data sample of the scene.
Optionally, before the training of the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample, the method further includes:
acquiring at least one scene three-dimensional model data sample, wherein the scene three-dimensional model data sample is any one of a point cloud type, a voxel type and a mesh type;
combining a plurality of preset angles and a plurality of preset bounding box depths to obtain a plurality of image shooting parameters, and shooting a sample three-dimensional model corresponding to the scene three-dimensional model data sample by each image shooting parameter to obtain a plurality of scene picture samples corresponding to the scene three-dimensional model data sample;
Sampling each sample three-dimensional model to obtain a three-dimensional model foreground sampling point and a three-dimensional model background sampling point, and respectively calculating orthogonal projection data corresponding to each foreground object sample in each sample three-dimensional model as real orthogonal projection data;
correspondingly, the training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample comprises the following steps:
and training the three-dimensional scene modeling model by utilizing each scene three-dimensional model data sample, the corresponding three-dimensional model foreground sampling point, the three-dimensional model background sampling point, the real orthogonal projection data and a plurality of scene picture samples.
Optionally, the process of three-dimensional scene modeling of the two-dimensional scene picture using the three-dimensional scene modeling model includes:
performing foreground clipping on the scene picture through the foreground clipping network to obtain at least one foreground clipping picture, and respectively extracting foreground feature vectors corresponding to each foreground clipping picture through the foreground feature extraction network;
extracting panoramic feature vectors corresponding to the scene pictures through the panoramic feature extraction network, and inputting the panoramic feature vectors and the position information of each foreground clipping picture in the scene pictures into the background feature extraction module so that the background feature extraction module subtracts foreground feature parts from the panoramic feature vectors to obtain background feature vectors corresponding to the scene pictures;
Predicting background object model data corresponding to the background feature vectors through the background model prediction network, and respectively predicting foreground object model data corresponding to the foreground feature vectors of each foreground clipping image through the foreground model prediction network;
for each foreground clipping image, predicting a local foreground object gesture corresponding to the foreground clipping image through the local gesture prediction network, and predicting a global foreground object gesture according to the local foreground object gesture through the global gesture prediction module;
and through the three-dimensional modeling module, performing pose transformation on the foreground object model data according to the global pose of the foreground object corresponding to each foreground clipping image, and combining the foreground object model data subjected to the pose transformation with the background object model data to obtain three-dimensional scene model data corresponding to the scene image.
According to another aspect of the present application, there is provided a three-dimensional scene modeling apparatus, the apparatus including:
the image acquisition unit is used for acquiring a scene image to be modeled;
the three-dimensional modeling unit is used for obtaining three-dimensional scene model data corresponding to the scene picture by executing the following steps through a pre-trained three-dimensional scene modeling model:
Performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector;
predicting the local gesture of the foreground object corresponding to each foreground clipping image respectively, and determining the global gesture of the foreground object based on the position of the foreground clipping image in the scene image and the local gesture of the foreground object;
and performing pose transformation on the foreground object model data according to the foreground object global pose, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the foreground object model data subjected to the pose transformation.
Optionally, the three-dimensional modeling unit is further configured to:
respectively extracting foreground feature vectors corresponding to each foreground clipping image through a pre-trained foreground feature extraction network;
and extracting panoramic feature vectors corresponding to the scene pictures through a pre-trained panoramic feature extraction network, subtracting foreground feature parts from the panoramic feature vectors based on foreground object bounding box coordinates corresponding to the foreground clipping pictures in the scene pictures, and obtaining background feature vectors corresponding to the scene pictures.
Optionally, the three-dimensional modeling unit is further configured to:
based on the position of the foreground clipping image in the scene image, determining translation data of a clipping camera relative to an origin of the scene image, and determining weak perspective camera projection parameters corresponding to the translation data;
and transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global posture of the foreground object based on the perspective camera projection parameters and the local posture of the foreground object.
Optionally, the three-dimensional modeling unit is further configured to:
transforming the weak perspective projection parameters into perspective camera projection parameters P_persp according to a transformation formula from the weak perspective camera to the perspective camera, wherein the transformation formula from the weak perspective camera to the perspective camera is:
P_persp = [ t_x, t_y, 2f / (s·b) ]
wherein t_x and t_y respectively represent the translation data of the cropping camera along the X axis and the Y axis with the center point of the scene picture as the origin, b represents the side length of the foreground clipping image, s represents the scaling parameter, and f represents the cropping camera focal length;
calculating the perspective camera projection parameters and the foreground object local pose according to a global pose estimation formula to obtain the foreground object global pose R_global, wherein the global pose estimation formula is:
R_global = R(θ) · R_local
wherein θ is the transformation angle of the cropping camera relative to the original camera, R_local is the foreground object local pose, and (c_x, c_y) represents the center point coordinates of the foreground clipping image with the center point of the scene picture as the origin, from which θ is determined together with the focal length.
Optionally, the apparatus further comprises: model training unit for:
establishing a three-dimensional scene modeling model, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module;
constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a gesture prediction loss function corresponding to the local gesture prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function;
and training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on the two-dimensional scene picture.
Optionally, the foreground model prediction network is configured to predict whether a three-dimensional model foreground sampling point corresponding to each foreground pixel point in the input foreground feature vector belongs to a scene to be modeled, and the prediction value corresponding to each foreground pixel point is used to represent a probability that the corresponding three-dimensional model foreground sampling point belongs to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model foreground sampling point;
the background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to all background pixel points in the input background feature vector belong to a scene to be modeled or not, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model background sampling point;
the gesture prediction loss function is used for calculating a loss value between orthogonal projection data corresponding to the global gesture of the foreground object output by the global gesture prediction module and real orthogonal projection data corresponding to the foreground object sample in the three-dimensional model data sample of the scene.
Optionally, the apparatus further comprises: a sample determination unit for:
acquiring at least one scene three-dimensional model data sample, wherein the scene three-dimensional model data sample is any one of a point cloud type, a voxel type and a mesh type;
combining a plurality of preset angles and a plurality of preset bounding box depths to obtain a plurality of image shooting parameters, and shooting a sample three-dimensional model corresponding to the scene three-dimensional model data sample by each image shooting parameter to obtain a plurality of scene picture samples corresponding to the scene three-dimensional model data sample;
sampling each sample three-dimensional model to obtain a three-dimensional model foreground sampling point and a three-dimensional model background sampling point, and respectively calculating orthogonal projection data corresponding to each foreground object sample in each sample three-dimensional model as real orthogonal projection data;
correspondingly, the training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample comprises the following steps:
and training the three-dimensional scene modeling model by utilizing each scene three-dimensional model data sample, the corresponding three-dimensional model foreground sampling point, the three-dimensional model background sampling point, the real orthogonal projection data and a plurality of scene picture samples.
Optionally, the three-dimensional modeling unit is specifically configured to:
performing foreground clipping on the scene picture through the foreground clipping network to obtain at least one foreground clipping picture, and respectively extracting foreground feature vectors corresponding to each foreground clipping picture through the foreground feature extraction network;
extracting panoramic feature vectors corresponding to the scene pictures through the panoramic feature extraction network, and inputting the panoramic feature vectors and the position information of each foreground clipping picture in the scene pictures into the background feature extraction module so that the background feature extraction module subtracts foreground feature parts from the panoramic feature vectors to obtain background feature vectors corresponding to the scene pictures;
predicting background object model data corresponding to the background feature vectors through the background model prediction network, and respectively predicting foreground object model data corresponding to the foreground feature vectors of each foreground clipping image through the foreground model prediction network;
for each foreground clipping image, predicting a local foreground object gesture corresponding to the foreground clipping image through the local gesture prediction network, and predicting a global foreground object gesture according to the local foreground object gesture through the global gesture prediction module;
And through the three-dimensional modeling module, performing pose transformation on the foreground object model data according to the global pose of the foreground object corresponding to each foreground clipping image, and combining the foreground object model data subjected to the pose transformation with the background object model data to obtain three-dimensional scene model data corresponding to the scene image.
According to still another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the three-dimensional scene modeling method described above.
According to still another aspect of the present application, there is provided a computer apparatus including a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the three-dimensional scene modeling method described above when executing the program.
By means of the above technical solution, in the three-dimensional scene modeling method and device, storage medium and computer equipment provided by the application, foreground clipping is first performed on the scene picture to be modeled to obtain at least one foreground clipping image; feature extraction is then performed using the foreground clipping images and the scene picture to obtain foreground features and background features, from which the background object model data and preliminary foreground object model data are predicted; local pose estimation is then performed on each foreground object using its foreground clipping image, and the local pose of the foreground object is combined with the position information of the foreground clipping image in the scene picture to further estimate the global pose of the foreground object; finally, the global pose is used to pose-transform the foreground object model, so that complete three-dimensional scene model data containing the background object model data and the pose-transformed foreground object model data are obtained. In this method, three-dimensional scene reconstruction is divided into two parts, foreground modeling and background modeling; for the foreground part, which has high precision requirements, local pose estimation is performed using the foreground clipping image and the global pose of the foreground object is then determined by combining global information, so that the pose of the foreground object in the three-dimensional scene is restored. Accurate modeling of the three-dimensional scene can thus be achieved from a single scene picture, which is simple, convenient and highly practical.
The foregoing description is only an overview of the technical solution of the present application; it is provided so that the technical means of the application can be more clearly understood and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the present application become more readily apparent.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 shows a flow diagram of a three-dimensional scene modeling method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of another three-dimensional scene modeling method according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a three-dimensional scene modeling model according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of a three-dimensional scene modeling apparatus according to an embodiment of the present application;
fig. 5 shows a schematic device structure of a computer device according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
In this embodiment, a three-dimensional scene modeling method is provided, as shown in fig. 1, and the method includes:
step 101, obtaining a scene picture to be modeled, and executing steps 102 to 104 through a pre-trained three-dimensional scene modeling model to obtain three-dimensional scene model data corresponding to the scene picture.
In the embodiment of the application, the three-dimensional modeling of the scene picture is realized through the pre-trained three-dimensional scene modeling model, specifically, the scene picture to be modeled is input into the pre-trained three-dimensional scene modeling model, the processes from step 102 to step 104 are executed through the model, and the three-dimensional scene model data output by the model are obtained, so that the corresponding three-dimensional scene model is generated by utilizing the three-dimensional scene model data in a computer, and the two-dimensional scene picture is converted into the three-dimensional scene model.
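As a minimal sketch of this top-level flow, the call sequence might look as follows; all function and method names here are hypothetical and only illustrate the order of steps 102 to 104, not an interface defined by the application:

```python
def reconstruct_scene(scene_picture, scene_model):
    """Hypothetical entry point: one 2D scene picture in, 3D scene model data out."""
    # Step 102: foreground clipping, feature extraction and per-part model prediction.
    crops, boxes = scene_model.crop_foreground(scene_picture)
    fg_feats, bg_feat = scene_model.extract_features(scene_picture, crops, boxes)
    fg_models = [scene_model.predict_foreground(f) for f in fg_feats]
    bg_model = scene_model.predict_background(bg_feat)
    # Step 103: local pose per crop, then global pose using the crop position.
    global_poses = [scene_model.to_global_pose(scene_model.predict_local_pose(c), b)
                    for c, b in zip(crops, boxes)]
    # Step 104: pose-transform the foreground models and merge with the background.
    return scene_model.assemble(bg_model, fg_models, global_poses)
```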
Step 102, performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector.
Firstly, after a scene picture is obtained, the foreground and the background need to be extracted from it, and model prediction is then carried out on each of them. Specifically, foreground clipping is performed on the scene picture, and each foreground object in the scene picture is clipped into a foreground clipping image. The foreground feature vector corresponding to each foreground object and the background feature vector of the background objects other than the foreground are then extracted; foreground and background feature extraction can be performed in either of the following two ways. In mode 1, feature extraction is performed on each foreground clipping image and on the complete scene picture respectively, giving a foreground feature vector for each foreground clipping image and a panoramic feature vector for the complete scene picture; each foreground feature part is then removed from the panoramic feature vector according to the position of the corresponding foreground clipping image in the scene picture, and the remaining part is the background feature vector. In mode 2, feature extraction is performed directly on the complete scene picture to obtain the panoramic feature vector, the foreground feature vectors are determined within the panoramic feature vector according to the position of each foreground clipping image in the scene picture, and the remainder is taken as the background feature vector. Finally, the foreground feature vector of each foreground clipping image is used to predict the corresponding foreground object model data, and the background feature vector is used to predict the background object model data.
In the embodiment of the present application, optionally, in step 102, determining, based on the foreground clipping map and the scene picture, a foreground feature vector and a background feature vector corresponding to the scene picture includes: respectively extracting foreground feature vectors corresponding to each foreground clipping image through a pre-trained foreground feature extraction network; and extracting panoramic feature vectors corresponding to the scene pictures through a pre-trained panoramic feature extraction network, subtracting foreground feature parts from the panoramic feature vectors based on foreground object bounding box coordinates corresponding to the foreground clipping pictures in the scene pictures, and obtaining background feature vectors corresponding to the scene pictures.
In this embodiment, foreground feature extraction and panoramic feature extraction are performed by different feature extraction networks, so that feature extraction with different precision can be achieved. For example, the foreground feature extraction network has higher precision than the panoramic feature extraction network, so the foreground features are extracted more finely; the extracted panoramic features are mainly used for the background, and although the precision requirement on background objects in three-dimensional modeling is not high, the area they occupy in the scene picture is relatively large. Therefore, a network structure that runs faster than the foreground feature extraction network can be selected as the panoramic feature extraction network; the panoramic feature extraction network performs feature extraction on the complete scene picture, and the foreground feature extraction network performs feature extraction on the foreground clipping images. Specifically, the foreground feature vector corresponding to each foreground clipping image is extracted through the pre-trained foreground feature extraction network, and the panoramic feature vector corresponding to the complete scene picture is extracted through the pre-trained panoramic feature extraction network. Then, the foreground feature parts in the panoramic feature vector are removed according to the coordinate positions of the foreground object bounding boxes of the foreground clipping images in the complete scene picture, giving the background feature vector. For example, the foreground feature extraction network may be a Transformer network and the panoramic feature extraction network may be a stacked hourglass network; those skilled in the art may also select other models and algorithms to implement foreground and panoramic feature extraction, which is not limited herein.
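A minimal sketch of the background-feature step just described, assuming a spatial panoramic feature map and foreground bounding boxes expressed in feature-map coordinates; the average pooling into a single background feature vector is an illustrative choice, not something fixed by the application:

```python
import numpy as np

def background_feature(panoramic_feat, fg_boxes):
    """
    Remove the foreground portions of a panoramic feature map using the foreground
    object bounding boxes, and keep the remainder as the background feature.

    panoramic_feat: (C, H, W) feature map from the panoramic feature extraction network
    fg_boxes: list of (x0, y0, x1, y1) bounding boxes in feature-map coordinates
    """
    mask = np.ones(panoramic_feat.shape[1:], dtype=bool)   # True = background location
    for x0, y0, x1, y1 in fg_boxes:
        mask[y0:y1, x0:x1] = False                          # drop foreground regions
    # Average-pool the remaining (background) locations into a single vector.
    return panoramic_feat[:, mask].mean(axis=1)

# Toy usage: a 64-channel 32x32 feature map with one foreground box.
feat = np.random.rand(64, 32, 32)
print(background_feature(feat, [(4, 4, 12, 12)]).shape)     # (64,)
```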
Step 103, predicting the local gesture of the foreground object corresponding to each foreground clipping image, and determining the global gesture of the foreground object based on the position of the foreground clipping image in the scene image and the local gesture of the foreground object.
Secondly, in the three-dimensional scene modeling problem, besides restoring each object, pose estimation must also be carried out on the objects in the scene in order to ensure the accuracy of the three-dimensional scene model, so that the poses of the objects are restored together with the objects themselves. Specifically, the local pose of the foreground object in each foreground clipping image is predicted from that clipping image, and the global pose of each foreground object is then predicted by combining the position of the foreground clipping image in the complete scene picture, which improves pose recovery accuracy.
In the embodiment of the present application, optionally, the determining the global foreground object pose based on the position of the foreground clipping map in the scene picture and the local foreground object pose in step 103 includes: based on the position of the foreground clipping image in the scene image, determining translation data of a clipping camera relative to an origin of the scene image, and determining weak perspective camera projection parameters corresponding to the translation data; and transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global posture of the foreground object based on the perspective camera projection parameters and the local posture of the foreground object.
In this embodiment, the weak perspective camera projection parameters, i.e. the translation of the cropping camera relative to the origin of the scene picture, are first determined as
P_weak = [ s, t_x, t_y ]
where P denotes projection, the subscript weak denotes weak perspective, s denotes the scaling parameter (scale), t denotes translation, and t_x and t_y denote the translation of the cropping camera along the X axis and the Y axis with the origin of the scene picture, which may be the center point of the picture, as the starting point. The weak perspective camera projection parameters are then converted into perspective camera projection parameters, so that the predicted foreground object local pose and the converted perspective camera projection parameters can be combined to compute the foreground object global pose.
In an embodiment of the present application, optionally, the transforming the weak perspective projection parameter into a perspective camera projection parameter, and determining the global pose of the foreground object based on the perspective camera projection parameter and the local pose of the foreground object includes:
transforming the weak perspective projection parameters into perspective camera projection parameters P_persp according to a transformation formula from the weak perspective camera to the perspective camera, wherein the transformation formula from the weak perspective camera to the perspective camera is:
P_persp = [ t_x, t_y, 2f / (s·b) ]
wherein t_x and t_y respectively represent the translation data of the cropping camera along the X axis and the Y axis with the center point of the scene picture as the origin, b represents the side length of the foreground clipping image, s represents the scaling parameter, and f represents the cropping camera focal length;
calculating the perspective camera projection parameters and the foreground object local pose according to a global pose estimation formula to obtain the foreground object global pose R_global, wherein the global pose estimation formula is:
R_global = R(θ) · R_local
wherein θ is the transformation angle of the cropping camera relative to the original camera, R_local is the foreground object local pose, and (c_x, c_y) represents the center point coordinates of the foreground clipping image with the center point of the scene picture as the origin, from which θ is determined together with the focal length.
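The following numpy sketch mirrors these two steps. Since the formula images are not reproduced in the text above, both the depth term 2f/(s·b) and the axis-angle construction of the correction rotation R(θ) are assumed reconstructions consistent with the symbols defined there, not a quotation of the patent's exact formulas; the local pose is assumed to be a 3x3 rotation matrix.

```python
import numpy as np

def crop_to_full_camera(s, tx, ty, b, focal):
    """Weak-perspective -> perspective conversion (assumed form): depth = 2f / (s*b)."""
    tz = 2.0 * focal / (s * b + 1e-9)
    return np.array([tx, ty, tz])

def local_to_global_pose(R_local, cx, cy, focal):
    """
    Rotate the local foreground-object pose by the crop-camera-to-original-camera
    rotation. The angle/axis below, derived from the crop centre offset (cx, cy)
    and the focal length, is an assumed construction for illustration only.
    """
    r = np.hypot(cx, cy)
    if r < 1e-9:
        return R_local
    angle = np.arctan2(r, focal)                      # transformation angle theta
    axis = np.array([-cy, cx, 0.0]) / r               # axis perpendicular to the offset
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R_corr = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues
    return R_corr @ R_local

# Toy usage: crop of side 224 px, scale 0.8, centred 100 px right of the picture centre.
print(crop_to_full_camera(0.8, 100.0, 0.0, 224.0, 500.0))
print(local_to_global_pose(np.eye(3), 100.0, 0.0, 500.0))
```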
Step 104, performing pose transformation on the foreground object model data according to the global pose of the foreground object, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the pose-transformed foreground object model data.
Finally, the foreground object model data, the background object model data and the foreground object global poses are combined to restore the three-dimensional scene model corresponding to the scene picture: the foreground object model data are first pose-transformed according to the foreground object global poses, and the background object model data and the pose-transformed foreground object model data are then combined into the complete three-dimensional scene model data.
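As an illustration of this assembly step, the short sketch below assumes, purely for illustration, that the background object model data and each foreground object model are point clouds and that a global pose is given as a rotation matrix R and a translation t:

```python
import numpy as np

def assemble_scene(bg_points, fg_objects):
    """
    bg_points: (Nb, 3) background model points
    fg_objects: list of dicts {"points": (Ni, 3), "R": (3, 3), "t": (3,)}
    Returns the combined scene model data as one point set.
    """
    parts = [bg_points]
    for obj in fg_objects:
        transformed = obj["points"] @ obj["R"].T + obj["t"]   # pose transformation
        parts.append(transformed)
    return np.vstack(parts)                                   # complete scene model data

# Toy usage: one background cloud and one foreground object shifted by (1, 0, 0).
bg = np.random.rand(100, 3)
fg = [{"points": np.random.rand(20, 3), "R": np.eye(3), "t": np.array([1.0, 0.0, 0.0])}]
print(assemble_scene(bg, fg).shape)                           # (120, 3)
```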
By applying the technical solution of this embodiment, foreground clipping is first performed on the scene picture to be modeled to obtain at least one foreground clipping image; feature extraction is then performed using the foreground clipping images and the scene picture to obtain foreground features and background features, from which the background object model data and preliminary foreground object model data are predicted; local pose estimation is then performed on each foreground object using its foreground clipping image, and the local pose is combined with the position of the foreground clipping image in the scene picture to further estimate the global pose of the foreground object; finally, the global pose is used to pose-transform the foreground object model, giving complete three-dimensional scene model data that contain the background object model data and the pose-transformed foreground object model data. In this embodiment of the application, three-dimensional scene reconstruction is divided into two parts, foreground modeling and background modeling; for the foreground part, which has high precision requirements, local pose estimation is performed using the foreground clipping image and the global pose of the foreground object is then determined by combining global information, so that the pose of the foreground object in the three-dimensional scene is restored. Accurate modeling of the three-dimensional scene can thus be achieved from a single scene picture, which is simple, convenient and highly practical.
Further, the embodiment of the application also provides a method for generating the three-dimensional scene modeling model, as shown in fig. 2, which comprises the following steps:
step 201, a three-dimensional scene modeling model is established, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module.
Step 202, constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a pose prediction loss function corresponding to the local pose prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the pose prediction loss function.
Step 203, training the three-dimensional scene modeling model by using the scene picture samples and the scene three-dimensional model data samples, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on two-dimensional scene pictures.
In this embodiment, the three-dimensional scene modeling model includes a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local pose prediction network, a global pose prediction module and a three-dimensional modeling module. The foreground clipping network may be an R-CNN network, and both the foreground model prediction network and the background model prediction network may be multi-layer perceptron models; other network structures that those skilled in the art can conceive may of course also be used, and this is not limited herein. The loss function of the model mainly comprises a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a pose prediction loss function corresponding to the local pose prediction network. In addition, loss functions can also be designed for the foreground clipping network, the foreground feature extraction network and the panoramic feature extraction network using local information, which is not exemplified here.
The following detailed description describes a process of three-dimensional scene modeling of a two-dimensional scene picture using a three-dimensional scene modeling model, the process comprising: performing foreground clipping on the scene picture through the foreground clipping network to obtain at least one foreground clipping picture, and respectively extracting foreground feature vectors corresponding to each foreground clipping picture through the foreground feature extraction network; extracting panoramic feature vectors corresponding to the scene pictures through the panoramic feature extraction network, and inputting the panoramic feature vectors and the position information of each foreground clipping picture in the scene pictures into the background feature extraction module so that the background feature extraction module subtracts foreground feature parts from the panoramic feature vectors to obtain background feature vectors corresponding to the scene pictures; predicting background object model data corresponding to the background feature vectors through the background model prediction network, and respectively predicting foreground object model data corresponding to the foreground feature vectors of each foreground clipping image through the foreground model prediction network; for each foreground clipping image, predicting a local foreground object gesture corresponding to the foreground clipping image through the local gesture prediction network, and predicting a global foreground object gesture according to the local foreground object gesture through the global gesture prediction module; and through the three-dimensional modeling module, performing pose transformation on the foreground object model data according to the global pose of the foreground object corresponding to each foreground clipping image, and combining the foreground object model data subjected to the pose transformation with the background object model data to obtain three-dimensional scene model data corresponding to the scene image.
In the above embodiment, as shown in fig. 3, the three-dimensional scene modeling model includes a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local pose prediction network, a global pose prediction module and a three-dimensional modeling module. After the scene picture is input into the three-dimensional scene modeling model, it is fed into the panoramic feature extraction network for panoramic feature extraction and, at the same time, into the foreground clipping network for foreground clipping; the scene picture and the foreground clipping images output by the foreground clipping network are also input into the global pose prediction module for later use. From the foreground clipping images and the scene picture output by the foreground clipping network, the coordinate position of the bounding box of each foreground clipping image in the scene picture can be determined; these coordinate positions and the panoramic feature vector output by the panoramic feature extraction network are input into the background feature extraction module, which removes the foreground feature parts from the panoramic feature vector and keeps the remaining background feature vector, and the background feature vector is input into the background model prediction network for background object model data prediction. Each foreground clipping image output by the foreground clipping network is input into the foreground model prediction network to predict foreground object model data, and is also input into the local pose prediction network to predict the foreground object local pose. The local pose prediction network outputs the foreground object local pose corresponding to each foreground clipping image to the global pose prediction module, and the global pose prediction module combines the scene picture, each foreground clipping image and the corresponding foreground object local pose to perform, for each foreground clipping image, weak perspective projection parameter calculation, perspective projection parameter calculation and foreground object global pose prediction. Finally, the foreground object model data output by the foreground model prediction network, the background object model data output by the background model prediction network and the foreground object global poses output by the global pose prediction module are all input into the three-dimensional modeling module; the three-dimensional modeling module first performs the pose transformation of the foreground objects, then combines the pose-transformed foreground object model data and the background object model data into complete three-dimensional scene model data, and the three-dimensional scene model can finally be generated from these data.
In the embodiment of the present application, regarding the training portion of the three-dimensional scene modeling model, optionally, the foreground model prediction network is configured to predict whether the three-dimensional model foreground sampling point corresponding to each foreground pixel point in the input foreground feature vector belongs to the scene to be modeled, and the prediction value corresponding to each foreground pixel point is used to characterize the probability that the corresponding three-dimensional model foreground sampling point belongs to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between the predicted value and the true value of each three-dimensional model foreground sampling point. The background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to the background pixel points in the input background feature vector belong to the scene to be modeled, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between the predicted value and the true value of each three-dimensional model background sampling point. The pose prediction loss function is used for calculating a loss value between the orthogonal projection data corresponding to the foreground object global pose output by the global pose prediction module and the real orthogonal projection data corresponding to the foreground object sample in the scene three-dimensional model data sample.
In the above embodiment, the foreground model prediction network is configured to predict, for each foreground pixel point in the input foreground feature vector, the corresponding three-dimensional model foreground sampling point and whether that sampling point is to be used for modeling, that is, whether it belongs to the scene to be modeled. The foreground model prediction network of the three-dimensional scene modeling model introduces the concept of an implicit function, in which a three-dimensional model is expressed in functional form, for example an equation such as x^2 + y^2 + z^2 = 1 representing a sphere. The implicit-function form of the foreground object is
f(F(x), z(X)) = s
where X denotes a sampling point in the three-dimensional scene, x denotes a pixel point of the scene picture (x is the projection of X onto the two-dimensional image plane), F(x) is the feature vector of the pixel, and z(X) denotes the distance, i.e. depth, from the sampling point X to the pixel x, which can be obtained from the camera parameters. The implicit function f is unknown, but it has some functional relationship with F(x) and z(X). The inputs of the foreground model prediction network are the foreground feature vector F(x) of each foreground clipping image and the depth information z(X) of the scene picture, and its output is a predicted value for the three-dimensional model sampling point corresponding to each pixel point of the foreground clipping image, indicating whether that sampling point belongs to the scene to be modeled; all sampling points of one foreground clipping image that belong to the scene to be modeled together form the three-dimensional model of the foreground object in that foreground clipping image. When the foreground model prediction network is trained, a foreground prediction loss function needs to be constructed to measure the quality of the network; the foreground prediction loss function L1 can be
L1 = (1/n) Σ_i ( f(F(x_i), z(X_i)) - f*(X_i) )^2
where n is the number of sampling points, f(F(x_i), z(X_i)) is the prediction of the foreground model prediction network, i.e. the value of the implicit function, which indicates whether the point corresponding to the pixel x_i in the two-dimensional picture lies inside or outside the three-dimensional model, and f*(X_i) is the true value of the sampling point X_i; the foreground prediction loss function L1 is thus the mean square error between the predicted values and the true values. Regarding f*(X): the picture and explicit three-dimensional model data are input in pairs, where the explicit three-dimensional model data, such as a point cloud, voxels or a mesh, stands in contrast to the three-dimensional model represented by the implicit function. The sampling point X is sampled in the explicit three-dimensional model space to obtain the true label of whether the point lies inside or outside the 3D model surface, and the result is represented by 0 or 1, for example:
f*(X) = 1 if X lies inside the 3D model surface, and f*(X) = 0 otherwise.
in addition, the background model prediction network provided by the embodiment of the application can adopt the same network structure as the foreground model prediction network, and adopts the same loss function construction mode to construct a background prediction loss function, and the background model prediction network and the background prediction loss function L2 are not described herein.
Further, regarding the training portion of the local pose prediction network, the quality of the local pose prediction network can be evaluated by constructing a pose prediction loss function. Because the output of the local pose prediction network is the local pose of the foreground object, which is then fused with other information to obtain the global pose of the foreground object, and because this fusion is a fixed calculation that introduces no additional loss of its own, the embodiment of the present application uses the output data of the global pose prediction module to calculate the loss of the local pose prediction network. Specifically, the global pose of the foreground object output by the global pose prediction module is used to perform pose transformation on the foreground object model output by the foreground model prediction network; the pose-transformed foreground object model is then orthogonally projected onto a two-dimensional plane to obtain the orthogonal projection data of the foreground object model predicted value; the real orthogonal projection data of the foreground object model is obtained; and the loss is calculated from the predicted and the real orthogonal projection data. The orthogonal projection data of the foreground object model predicted value can be expressed as P_pred = Π(T_G(M_fg)), where Π denotes orthographic projection, M_fg represents the foreground object model data, T_G represents the pose transformation of the foreground object model using the foreground object global pose, and Π(T_G(M_fg)) is therefore the two-dimensional orthographic projection of the pose-transformed three-dimensional foreground object model. The pose prediction loss function can be expressed as L3 = L_proj(P_pred, P_gt), where L_proj is a loss function of the orthographic projection of the foreground object onto the whole two-dimensional picture, P_pred is the key position information obtained by projecting the pose-transformed foreground object model onto the two-dimensional picture (i.e. the orthogonal projection data of the foreground object model predicted value), and P_gt is the true value of the key positions relative to the whole picture (i.e. the real orthogonal projection data of the foreground object model).
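A minimal sketch of such an orthographic-projection pose loss is given below, assuming the global pose is given as a 4×4 homogeneous transform and L_proj is a mean squared error; these choices and all identifiers are assumptions of the sketch, not the application's exact formulation.

```python
import torch

def orthographic_project(points_3d):
    """Orthographic projection onto the picture plane: drop the depth coordinate."""
    return points_3d[..., :2]

def pose_prediction_loss(fg_points, global_pose, gt_projection):
    """fg_points     : (n, 3) predicted foreground object model points
    global_pose   : (4, 4) homogeneous transform output by the global pose prediction module
    gt_projection : (n, 2) real orthogonal projection data of the foreground object sample
    """
    ones = torch.ones(fg_points.shape[0], 1)
    # pose transformation T_G(M_fg) of the foreground object model
    transformed = (torch.cat([fg_points, ones], dim=1) @ global_pose.T)[:, :3]
    # P_pred = orthographic projection of the pose-transformed model
    pred_projection = orthographic_project(transformed)
    # L3 = L_proj(P_pred, P_gt), here taken as a mean squared error
    return torch.mean((pred_projection - gt_projection) ** 2)
```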
It should be noted that the overall model loss function of the three-dimensional scene modeling model may be, for example, the sum of the three losses, L = L1 + L2 + L3. Other parts of the model, such as the panoramic feature extraction network, the foreground feature extraction network and the foreground clipping network, can be trained independently, or loss functions can be constructed for these networks separately and combined with the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function to obtain the model loss function.
In an embodiment of the present application, with respect to the training sample obtaining portion of the three-dimensional scene modeling model, optionally, before the training of the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample, the method further includes: acquiring at least one scene three-dimensional model data sample, wherein the scene three-dimensional model data sample is any one of a point cloud type, a voxel type and a mesh type; combining a plurality of preset angles and a plurality of preset bounding box depths to obtain a plurality of groups of image shooting parameters, and shooting the sample three-dimensional model corresponding to the scene three-dimensional model data sample with each group of image shooting parameters to obtain a plurality of scene picture samples corresponding to the scene three-dimensional model data sample; sampling each sample three-dimensional model to obtain three-dimensional model foreground sampling points and three-dimensional model background sampling points, and respectively calculating orthogonal projection data corresponding to each foreground object sample in each sample three-dimensional model as real orthogonal projection data;
Correspondingly, training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample comprises the following steps: training the three-dimensional scene modeling model by utilizing each scene three-dimensional model data sample, the corresponding three-dimensional model foreground sampling points, the three-dimensional model background sampling points, the real orthogonal projection data and the plurality of scene picture samples.
In the above embodiment, a plurality of training samples may be obtained for each three-dimensional scene model that has completed three-dimensional modeling: a plurality of two-dimensional scene pictures of the three-dimensional scene model can be obtained by photographing the three-dimensional scene model at different angles and different depths, and each two-dimensional scene picture can be combined with the three-dimensional scene model to form one training sample. Specifically, M preset angles and N preset bounding box depths can be set, and every preset angle is combined with every preset bounding box depth to obtain M×N groups of image shooting parameters; the three-dimensional scene model is shot according to each group of image shooting parameters to obtain M×N scene picture samples. Further, in order to calculate the loss function in subsequent training, the three-dimensional scene model also needs to be sampled to obtain the true value of each sampling point, and orthogonal projection calculation is performed on every foreground object in the three-dimensional scene model to obtain the real orthogonal projection data of every foreground object. In one specific application scenario, explicitly represented 3D scenes are prepared, for example 10,000 each of point clouds, voxels and meshes; each 3D scene is rendered and baked from 4 sides at 5 distances ranging from 1 to 5 times the bounding box depth, producing pictures at 1024 × 1024 resolution, i.e. 4 × 5 = 20 pictures per model, and with 10,000 models this gives 200,000 pictures in total (the three 3D model representation methods can share these pictures); the 3D model is sampled to obtain the true value of each point; and each foreground object in the 3D model is orthogonally projected to obtain the real orthogonal projection data of each foreground object.
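The sample-generation procedure described above could be organized as in the following sketch; the renderer render_scene and the attribute bbox_depth are hypothetical placeholders, and only the M×N combination of preset angles and bounding-box depths mirrors the description.

```python
import itertools

def build_shot_parameters(preset_angles, preset_depth_factors):
    """Combine every preset angle with every preset bounding-box depth -> M x N shot parameters."""
    return list(itertools.product(preset_angles, preset_depth_factors))

# Values mirroring the application scenario: 4 sides x 5 distances = 20 shots per model.
angles = [0, 90, 180, 270]        # one shooting direction per side of the scene
depth_factors = [1, 2, 3, 4, 5]   # 1x to 5x of the bounding-box depth

def make_picture_samples(scene_model, render_scene):
    """render_scene(model, angle, distance) is a hypothetical renderer returning a 1024x1024 picture."""
    shots = build_shot_parameters(angles, depth_factors)
    return [render_scene(scene_model, a, d * scene_model.bbox_depth) for a, d in shots]
```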
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a three-dimensional scene modeling apparatus, as shown in fig. 4, including:
the image acquisition unit is used for acquiring a scene image to be modeled;
the three-dimensional modeling unit is used for obtaining three-dimensional scene model data corresponding to the scene picture by executing the following steps through a pre-trained three-dimensional scene modeling model:
performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector;
predicting the local gesture of the foreground object corresponding to each foreground clipping image respectively, and determining the global gesture of the foreground object based on the position of the foreground clipping image in the scene image and the local gesture of the foreground object;
and performing pose transformation on the foreground object model data according to the foreground object global pose, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the foreground object model data subjected to the pose transformation.
Optionally, the three-dimensional modeling unit is further configured to:
respectively extracting foreground feature vectors corresponding to each foreground clipping image through a pre-trained foreground feature extraction network;
and extracting panoramic feature vectors corresponding to the scene pictures through a pre-trained panoramic feature extraction network, subtracting foreground feature parts from the panoramic feature vectors based on foreground object bounding box coordinates corresponding to the foreground clipping pictures in the scene pictures, and obtaining background feature vectors corresponding to the scene pictures.
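As an illustration, one plausible realization of "subtracting the foreground feature parts" is to zero the panoramic feature map inside every foreground bounding box, as in the sketch below; the tensor layout and the zeroing operation are assumptions of this sketch rather than the application's exact operation.

```python
import torch

def background_feature_vector(panoramic_feat, fg_boxes):
    """panoramic_feat : (c, h, w) panoramic feature map of the scene picture
    fg_boxes       : list of (x1, y1, x2, y2) integer foreground bounding boxes in feature-map coordinates
    """
    bg_feat = panoramic_feat.clone()
    for x1, y1, x2, y2 in fg_boxes:
        # remove the foreground feature part inside each bounding box
        bg_feat[:, y1:y2, x1:x2] = 0.0
    return bg_feat
```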
Optionally, the three-dimensional modeling unit is further configured to:
based on the position of the foreground clipping image in the scene image, determining translation data of a clipping camera relative to an origin of the scene image, and determining weak perspective camera projection parameters corresponding to the translation data;
and transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global posture of the foreground object based on the perspective camera projection parameters and the local posture of the foreground object.
Optionally, the three-dimensional modeling unit is further configured to:
transforming the weak perspective camera projection parameters into perspective camera projection parameters according to a transformation formula from the weak perspective camera to the perspective camera, the formula being determined by the parameters t_x, t_y, b, s and f;
wherein t_x and t_y respectively represent the translation data of the cropping camera along the X axis and the Y axis with the center point of the scene picture as the origin, b represents the side length of the foreground clipping image, s represents the scaling parameter, and f represents the clipping camera focal length;
calculating the global pose of the foreground object from the perspective camera projection parameters and the local pose of the foreground object according to a global pose estimation formula;
wherein γ is the transformation angle of the cropping camera relative to the original camera, and (c_x, c_y) represent the center point coordinates of the foreground clipping image with the center point of the scene picture as the origin.
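For reference, a conversion of this kind is often written as in the sketch below, where the perspective depth is taken as 2f/(s·b) and the global rotation is obtained by rotating the local pose by the angle between the crop-center ray and the optical axis; these concrete formulas follow common conventions in the literature and are assumptions of the sketch, not a verbatim restatement of the application's formulas.

```python
import math
import numpy as np

def weak_to_perspective_translation(s, tx, ty, b, focal):
    """Convert weak-perspective parameters (s, tx, ty) of a crop with side length b into a
    perspective-camera translation; the depth term 2*focal/(s*b) is the usual convention."""
    return np.array([tx, ty, 2.0 * focal / (s * b)])

def global_pose_from_local(local_rot, cx, cy, focal):
    """Rotate the local pose (3x3 rotation) by the angle between the crop-center ray and the optical axis.
    (cx, cy) are the crop-center coordinates with the picture center as origin."""
    gamma = math.atan2(math.hypot(cx, cy), focal)           # transformation angle of the cropping camera
    axis = np.array([-cy, cx, 0.0])
    axis = axis / (np.linalg.norm(axis) + 1e-8)             # rotation axis perpendicular to the crop ray
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R_gamma = np.eye(3) + math.sin(gamma) * K + (1.0 - math.cos(gamma)) * (K @ K)  # Rodrigues' formula
    return R_gamma @ local_rot                               # global pose of the foreground object
```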
Optionally, the apparatus further comprises: model training unit for:
establishing a three-dimensional scene modeling model, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module;
constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a gesture prediction loss function corresponding to the local gesture prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function;
And training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on the two-dimensional scene picture.
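A compact sketch of one training step, under the assumption that the model loss function is simply the sum of the three constructed losses and that the model exposes a hypothetical compute_losses helper:

```python
def training_step(model, batch, optimizer):
    """One optimization step; compute_losses is a hypothetical helper returning L1, L2 and L3
    for a batch containing a scene picture sample and its sampled ground truth."""
    loss_fg, loss_bg, loss_pose = model.compute_losses(batch)   # foreground, background, pose losses
    loss = loss_fg + loss_bg + loss_pose                         # model loss, assumed to be the plain sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```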
Optionally, the foreground model prediction network is configured to predict whether a three-dimensional model foreground sampling point corresponding to each foreground pixel point in the input foreground feature vector belongs to a scene to be modeled, and the prediction value corresponding to each foreground pixel point is used to represent a probability that the corresponding three-dimensional model foreground sampling point belongs to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model foreground sampling point;
the background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to all background pixel points in the input background feature vector belong to a scene to be modeled or not, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model background sampling point;
the gesture prediction loss function is used for calculating a loss value between orthogonal projection data corresponding to the global gesture of the foreground object output by the global gesture prediction module and real orthogonal projection data corresponding to the foreground object sample in the three-dimensional model data sample of the scene.
Optionally, the apparatus further comprises: a sample determination unit for:
acquiring at least one scene three-dimensional model data sample, wherein the scene three-dimensional model data sample is any one of a point cloud type, a voxel type and a mesh type;
combining a plurality of preset angles and a plurality of preset bounding box depths to obtain a plurality of image shooting parameters, and shooting a sample three-dimensional model corresponding to the scene three-dimensional model data sample by each image shooting parameter to obtain a plurality of scene picture samples corresponding to the scene three-dimensional model data sample;
sampling each sample three-dimensional model to obtain a three-dimensional model foreground sampling point and a three-dimensional model background sampling point, and respectively calculating orthogonal projection data corresponding to each foreground object sample in each sample three-dimensional model as real orthogonal projection data;
correspondingly, training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample comprises the following steps:
and training the three-dimensional scene modeling model by utilizing each scene three-dimensional model data sample, the corresponding three-dimensional model foreground sampling point, the three-dimensional model background sampling point, the real orthogonal projection data and a plurality of scene picture samples.
Optionally, the three-dimensional modeling unit is specifically configured to:
performing foreground clipping on the scene picture through the foreground clipping network to obtain at least one foreground clipping picture, and respectively extracting foreground feature vectors corresponding to each foreground clipping picture through the foreground feature extraction network;
extracting panoramic feature vectors corresponding to the scene pictures through the panoramic feature extraction network, and inputting the panoramic feature vectors and the position information of each foreground clipping picture in the scene pictures into the background feature extraction module so that the background feature extraction module subtracts foreground feature parts from the panoramic feature vectors to obtain background feature vectors corresponding to the scene pictures;
predicting background object model data corresponding to the background feature vectors through the background model prediction network, and respectively predicting foreground object model data corresponding to the foreground feature vectors of each foreground clipping image through the foreground model prediction network;
for each foreground clipping image, predicting a local foreground object gesture corresponding to the foreground clipping image through the local gesture prediction network, and predicting a global foreground object gesture according to the local foreground object gesture through the global gesture prediction module;
And through the three-dimensional modeling module, performing pose transformation on the foreground object model data according to the global pose of the foreground object corresponding to each foreground clipping image, and combining the foreground object model data subjected to the pose transformation with the background object model data to obtain three-dimensional scene model data corresponding to the scene image.
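Putting the modules together, the inference flow of the three-dimensional modeling unit could be organized as in the following sketch; every attribute of the nets object stands in for the corresponding network or module described above, and all call signatures are invented for illustration.

```python
def model_scene(picture, nets):
    """End-to-end sketch of the three-dimensional scene modeling flow."""
    crops, boxes = nets.foreground_crop_net(picture)              # foreground clipping
    fg_feats = [nets.fg_feat_net(c) for c in crops]               # per-crop foreground feature vectors
    pano_feat = nets.pano_feat_net(picture)                       # panoramic feature vector
    bg_feat = nets.bg_feat_module(pano_feat, boxes)               # subtract foreground feature parts
    bg_model = nets.bg_model_net(bg_feat)                         # background object model data

    posed_fg_models = []
    for feat, crop, box in zip(fg_feats, crops, boxes):
        fg_model = nets.fg_model_net(feat)                        # foreground object model data
        local_pose = nets.local_pose_net(crop)                    # local pose of the foreground object
        global_pose = nets.global_pose_module(local_pose, box)    # global pose of the foreground object
        posed_fg_models.append(nets.apply_pose(fg_model, global_pose))

    # combine pose-transformed foreground models with the background model
    return nets.assemble(bg_model, posed_fg_models)
```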
It should be noted that, other corresponding descriptions of each functional unit related to the three-dimensional scene modeling apparatus provided by the embodiment of the present application may refer to corresponding descriptions in the methods of fig. 1 to fig. 2, and are not described herein again.
The embodiment of the application also provides a computer device, which can be a personal computer, a server, a network device and the like, and as shown in fig. 5, the computer device comprises a bus, a processor, a memory and a communication interface, and can also comprise an input/output interface and a display device. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing location information. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement the steps in the method embodiments.
It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, which may be non-volatile or volatile, and on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in relative detail, but they are not thereby to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (7)

1. A method of modeling a three-dimensional scene, the method comprising:
acquiring a scene picture to be modeled;
obtaining three-dimensional scene model data corresponding to the scene picture by executing the following steps through a pre-trained three-dimensional scene modeling model:
performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector;
predicting the local gesture of the foreground object corresponding to each foreground clipping image respectively, determining translation data of a clipping camera relative to an origin of the scene picture based on the position of the foreground clipping image in the scene picture, and determining weak perspective camera projection parameters corresponding to the translation data; transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global pose of the foreground object based on the perspective camera projection parameters and the local gesture of the foreground object;
performing pose transformation on the foreground object model data according to the foreground object global pose, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the foreground object model data subjected to the pose transformation;
before the obtaining the scene picture to be modeled, the method further includes:
establishing a three-dimensional scene modeling model, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module;
Constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a gesture prediction loss function corresponding to the local gesture prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function;
training the three-dimensional scene modeling model by using a scene picture sample and a scene three-dimensional model data sample, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on a two-dimensional scene picture;
the foreground model prediction network is used for predicting whether the three-dimensional model foreground sampling points corresponding to all foreground pixel points in the input foreground feature vector belong to a scene to be modeled, and the prediction value corresponding to each foreground pixel point is used for representing the probability that the corresponding three-dimensional model foreground sampling points belong to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model foreground sampling point;
the background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to all background pixel points in the input background feature vector belong to a scene to be modeled or not, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model background sampling point;
The gesture prediction loss function is used for calculating a loss value between orthogonal projection data corresponding to the global gesture of the foreground object output by the global gesture prediction module and real orthogonal projection data corresponding to the foreground object sample in the three-dimensional model data sample of the scene.
2. The method of claim 1, wherein the determining, based on the foreground clipping map and the scene picture, a foreground feature vector and a background feature vector corresponding to the scene picture comprises:
respectively extracting foreground feature vectors corresponding to each foreground clipping image through a pre-trained foreground feature extraction network;
and extracting panoramic feature vectors corresponding to the scene pictures through a pre-trained panoramic feature extraction network, subtracting foreground feature parts from the panoramic feature vectors based on foreground object bounding box coordinates corresponding to the foreground clipping pictures in the scene pictures, and obtaining background feature vectors corresponding to the scene pictures.
3. The method of claim 1, wherein prior to training the three-dimensional scene modeling model using the scene picture sample and the scene three-dimensional model data sample, the method further comprises:
acquiring at least one scene three-dimensional model data sample, wherein the scene three-dimensional model data sample is any one of a point cloud type, a voxel type and a mesh type;
combining a plurality of preset angles and a plurality of preset bounding box depths to obtain a plurality of image shooting parameters, and shooting a sample three-dimensional model corresponding to the scene three-dimensional model data sample by each image shooting parameter to obtain a plurality of scene picture samples corresponding to the scene three-dimensional model data sample;
sampling each sample three-dimensional model to obtain a three-dimensional model foreground sampling point and a three-dimensional model background sampling point, and respectively calculating orthogonal projection data corresponding to each foreground object sample in each sample three-dimensional model as real orthogonal projection data;
correspondingly, training the three-dimensional scene modeling model by using the scene picture sample and the scene three-dimensional model data sample comprises the following steps:
and training the three-dimensional scene modeling model by utilizing each scene three-dimensional model data sample, the corresponding three-dimensional model foreground sampling point, the three-dimensional model background sampling point, the real orthogonal projection data and a plurality of scene picture samples.
4. The method of claim 1, wherein the process of three-dimensional scene modeling a two-dimensional scene picture using a three-dimensional scene modeling model comprises:
performing foreground clipping on the scene picture through the foreground clipping network to obtain at least one foreground clipping picture, and respectively extracting foreground feature vectors corresponding to each foreground clipping picture through the foreground feature extraction network;
extracting panoramic feature vectors corresponding to the scene pictures through the panoramic feature extraction network, and inputting the panoramic feature vectors and the position information of each foreground clipping picture in the scene pictures into the background feature extraction module so that the background feature extraction module subtracts foreground feature parts from the panoramic feature vectors to obtain background feature vectors corresponding to the scene pictures;
predicting background object model data corresponding to the background feature vectors through the background model prediction network, and respectively predicting foreground object model data corresponding to the foreground feature vectors of each foreground clipping image through the foreground model prediction network;
for each foreground clipping image, predicting a local foreground object gesture corresponding to the foreground clipping image through the local gesture prediction network, and predicting a global foreground object gesture according to the local foreground object gesture through the global gesture prediction module;
And through the three-dimensional modeling module, performing pose transformation on the foreground object model data according to the global pose of the foreground object corresponding to each foreground clipping image, and combining the foreground object model data subjected to the pose transformation with the background object model data to obtain three-dimensional scene model data corresponding to the scene image.
5. A three-dimensional scene modeling apparatus, the apparatus comprising:
the image acquisition unit is used for acquiring a scene image to be modeled;
the three-dimensional modeling unit is used for obtaining three-dimensional scene model data corresponding to the scene picture by executing the following steps through a pre-trained three-dimensional scene modeling model:
performing foreground clipping on the scene picture to obtain at least one foreground clipping picture, determining a foreground feature vector and a background feature vector corresponding to the scene picture based on the foreground clipping picture and the scene picture, predicting background object model data corresponding to the background feature vector, and predicting foreground object model data corresponding to each foreground feature vector;
predicting the local gesture of the foreground object corresponding to each foreground clipping image respectively, determining translation data of a clipping camera relative to an origin of the scene picture based on the position of the foreground clipping image in the scene picture, and determining weak perspective camera projection parameters corresponding to the translation data; transforming the weak perspective camera projection parameters into perspective camera projection parameters, and determining the global pose of the foreground object based on the perspective camera projection parameters and the local gesture of the foreground object;
Performing pose transformation on the foreground object model data according to the foreground object global pose, and determining three-dimensional scene model data corresponding to the scene picture based on the background object model data and the foreground object model data subjected to the pose transformation;
model training unit for:
establishing a three-dimensional scene modeling model, wherein the three-dimensional scene modeling model comprises a foreground clipping network, a foreground feature extraction network, a panoramic feature extraction network, a background feature extraction module, a foreground model prediction network, a background model prediction network, a local gesture prediction network, a global gesture prediction module and a three-dimensional modeling module;
constructing a foreground prediction loss function corresponding to the foreground model prediction network, a background prediction loss function corresponding to the background model prediction network and a gesture prediction loss function corresponding to the local gesture prediction network, and determining a model loss function of the three-dimensional scene modeling model based on the foreground prediction loss function, the background prediction loss function and the gesture prediction loss function;
training the three-dimensional scene modeling model by using a scene picture sample and a scene three-dimensional model data sample, wherein the trained three-dimensional scene modeling model is used for carrying out three-dimensional scene modeling on a two-dimensional scene picture;
The foreground model prediction network is used for predicting whether the three-dimensional model foreground sampling points corresponding to all foreground pixel points in the input foreground feature vector belong to a scene to be modeled, and the prediction value corresponding to each foreground pixel point is used for representing the probability that the corresponding three-dimensional model foreground sampling points belong to the scene to be modeled; the foreground prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model foreground sampling point;
the background model prediction network is used for predicting whether the three-dimensional model background sampling points corresponding to all background pixel points in the input background feature vector belong to a scene to be modeled or not, and the predicted value of each background pixel point is used for representing the probability that the corresponding three-dimensional model background sampling point belongs to the scene to be modeled; the background prediction loss function is used for calculating a loss value between a predicted value and a true value of each three-dimensional model background sampling point;
the gesture prediction loss function is used for calculating a loss value between orthogonal projection data corresponding to the global gesture of the foreground object output by the global gesture prediction module and real orthogonal projection data corresponding to the foreground object sample in the three-dimensional model data sample of the scene.
6. A storage medium having stored thereon a computer program, which when executed by a processor, implements the method of any of claims 1 to 4.
7. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 4 when executing the computer program.
CN202311234506.4A 2023-09-25 2023-09-25 Three-dimensional scene modeling method and device, storage medium and computer equipment Active CN116993924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311234506.4A CN116993924B (en) 2023-09-25 2023-09-25 Three-dimensional scene modeling method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN116993924A CN116993924A (en) 2023-11-03
CN116993924B true CN116993924B (en) 2023-12-15

Family

ID=88525072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311234506.4A Active CN116993924B (en) 2023-09-25 2023-09-25 Three-dimensional scene modeling method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN116993924B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927353A (en) * 2021-02-25 2021-06-08 电子科技大学 Three-dimensional scene reconstruction method based on two-dimensional target detection and model alignment, storage medium and terminal
CN113808253A (en) * 2021-08-31 2021-12-17 武汉理工大学 Dynamic object processing method, system, device and medium for scene three-dimensional reconstruction
CN115619928A (en) * 2022-09-27 2023-01-17 北京易航远智科技有限公司 Training method for three-dimensional scene reconstruction device of multi-camera system
CN115953535A (en) * 2023-01-03 2023-04-11 深圳华为云计算技术有限公司 Three-dimensional reconstruction method and device, computing equipment and storage medium
WO2023138471A1 (en) * 2022-01-24 2023-07-27 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device, and storage medium
CN116645471A (en) * 2023-05-30 2023-08-25 重庆长安汽车股份有限公司 Modeling method, system, equipment and storage medium for extracting foreground object

Also Published As

Publication number Publication date
CN116993924A (en) 2023-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant